Distributed Systems SLO Calculator
Premise
If your customers expect 99.99% uptime across shared, interdependent services, you need to aim much higher than 99.99% on some of those components. Here is the math:
For REDUNDANT components that back one another up, you add them like this:
X + (1-X)*Y = Total uptime
For example: Adding a second 95% uptime webserver to a pool gives a total uptime of 99.75% (95%+(1-95%)*95%)
This increase happens because the system works when any of the components work.
Adding a third server iterates on this:
[Combined Uptime]+(1-[Combined Uptime])*[Uptime of Additional Instance]
So, adding a third web server at 95% would increase the uptime to 99.9875% (99.75%+(1-99.75%)*95%).
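The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not code from the repo: the function name `redundant_uptime` is hypothetical, uptimes are passed as fractions (0.95 means 95%), and the math assumes the components fail independently of one another.

```python
def redundant_uptime(uptimes):
    """Combined uptime of redundant components (hypothetical helper).

    Each uptime is a fraction, e.g. 0.95 for 95%.
    Assumes component failures are independent.
    """
    combined = 0.0
    for uptime in uptimes:
        # Same rule as above: combined + (1 - combined) * additional instance
        combined = combined + (1 - combined) * uptime
    return combined


print(f"{redundant_uptime([0.95, 0.95]):.4%}")        # 99.7500%
print(f"{redundant_uptime([0.95, 0.95, 0.95]):.4%}")  # 99.9875%
```

Under the same independence assumption, this is equivalent to one minus the chance that every component is down at once (1 - 5% * 5% * 5% for three 95% servers).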
However, for DEPENDENT components, you multiply them together.
If X depends on Y, their combined uptime is X * Y
For example: if web server X is designed for 95% uptime and relies on a 95% infrastructure component, their combined uptime ceiling is 90.25% (95% * 95%). This decrease happens because the system only works when all the components work.
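Here is a rough Python sketch of the dependent case, plus the reverse question from the premise: how high each component has to aim to hit an overall target. The names `dependent_uptime` and `required_component_uptime` are hypothetical, uptimes are fractions, failures are assumed independent, and the reverse calculation assumes every component in the chain gets the same uptime.

```python
import math


def dependent_uptime(uptimes):
    """Combined uptime ceiling of a chain of dependent components.

    Hypothetical helper: uptimes are fractions (0.95 == 95%),
    and failures are assumed to be independent.
    """
    return math.prod(uptimes)  # math.prod requires Python 3.8+


def required_component_uptime(target, count):
    """Uptime each of `count` equal, dependent components needs to reach `target`."""
    return target ** (1 / count)


print(f"{dependent_uptime([0.95, 0.95]):.2%}")           # 90.25%
print(f"{required_component_uptime(0.9999, 4):.4%}")     # 99.9975%
```

In other words, a 99.99% promise that rests on four dependent components needs each of them at roughly 99.9975%, which is the "aim higher" point from the premise.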
See a bug? Want to contribute? Here is the GitHub repo: https://github.com/bnelford/distributedsystemsSLOcalc