Distributed Systems SLO Calculator

Premise

If your customers expect 99.99% uptime across shared, interdependent services, some of those components need to aim WAY higher than 99.99% on their own. Here is the math breakdown:

For REDUNDANT components that back one another up, you add them like this:

X + (1-X)*Y = Total uptime

For example, adding a second 95% uptime web server to a pool gives a total uptime of 99.75% (95% + (1-95%)*95%).

This increase happens because the system works when any one of the components works.

Adding a third server iterates on this:

[Combined Uptime]+(1-[Combined Uptime])*(Uptime of additional Instance)

So, adding a third web server at 95% would increase the uptime to 99.9875% (99.75% + (1-99.75%)*95%).
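
The iteration can be sketched in a few lines of Python. This is a minimal illustration, not code from the repo: the function name `redundant_uptime` is hypothetical, uptimes are passed as fractions (0.95 for 95%), and it assumes the redundant components fail independently of one another.

```python
def redundant_uptime(uptimes):
    """Combined uptime of redundant components (hypothetical helper).

    Assumes independent failures: the pool is up when any one member is up.
    """
    combined = 0.0
    for uptime in uptimes:
        # [Combined Uptime] + (1 - [Combined Uptime]) * (Uptime of additional instance)
        combined = combined + (1.0 - combined) * uptime
    return combined


print(f"{redundant_uptime([0.95, 0.95]):.4%}")        # 99.7500%
print(f"{redundant_uptime([0.95, 0.95, 0.95]):.4%}")  # 99.9875%
```

Under the same independence assumption, the result equals 1 minus the product of each component's downtime, so three 95% servers give 1 - 0.05^3 = 99.9875%.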


However, for DEPENDENT components, you multiply them together.

If X depends on Y, their combined uptime is X * Y

For example, if web server X is a 95% uptime design and it relies on a 95% uptime infrastructure component, their combined uptime ceiling is 90.25% (95% * 95%). This decrease happens because the system only works when all the components work.
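
The dependent case can be sketched the same way. Again this is only an illustration: `dependent_uptime` is a hypothetical name, uptimes are fractions, and the components are assumed to fail independently. The second call uses an assumed chain of four components at 99.99% each to show why individual targets have to sit above the end-to-end goal.

```python
import math


def dependent_uptime(uptimes):
    """Uptime ceiling for a chain of dependent components (hypothetical helper).

    Assumes independent failures: the chain is up only when every member is up.
    """
    return math.prod(uptimes)


print(f"{dependent_uptime([0.95, 0.95]):.2%}")    # 90.25%
print(f"{dependent_uptime([0.9999] * 4):.4%}")    # 99.9600%
```

Four dependent components at 99.99% each land at roughly 99.96% combined, which is already below a 99.99% target.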


See a bug? Want to contribute? Here is the GitHub repo: https://github.com/bnelford/distributedsystemsSLOcalc