SLI & SLO

Error budgets, burn rates, and alerts that actually mean something

An SLI (Service-Level Indicator) is a measurement: fraction of HTTP requests served in under 300ms with status < 500. An SLO (Service-Level Objective) is a target on that measurement: 99.9% over the last 28 days. An SLA (Service-Level Agreement) is a contract that says what happens if the SLO is breached — usually money.

The Google SRE approach reframes reliability work around the error budget: the difference between the target and 100%. At 99.9% over 28 days, you have about 40 minutes of allowable downtime per window. While budget remains, ship features fast. When it is exhausted, freeze deploys and fix reliability. This single number aligns engineering and product on when to push and when to stabilize.

The Anatomy of an SLO

Every SLO needs three components: an SLI to measure, a target to hit, and a window to measure over. Get any of these wrong and the SLO is meaningless.

Example: 99.9% of HTTP requests succeed in <300ms over 28 days
SLI: good_events / total_events (requests served in <300ms)
SLO: a target on the SLI over a window (99.9% over 28 days)
SLA: a contract backed by money or credits (99.5% or refund)
Error budget = (1 - SLO) x window = 0.1% x 28 days = 40min 19s of allowable bad-event time per window
Burn rate = how fast you're using up that budget right now

Key Numbers

90%
~3 days/month of allowed downtime — "one nine"
99%
~7h/month — "two nines"
99.9%
~43min/month — "three nines"
99.99%
~4.3min/month — "four nines"
99.999%
~26s/month — "five nines"; usually unachievable
28 days
standard SLO window (rolling, not calendar month)
14.4x
burn rate that spends 2% of a 30-day budget in 1 hour; sustained, it exhausts a 28-day budget in just under 2 days

Defining a Good SLI

A good SLI is a ratio of good events over total events. Both numerator and denominator must be measurable from production. The user experience — not internal health — is what defines "good".

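{`# The "name = expr" form below is shorthand: PromQL has no assignment, so treat
# each name as a recording-rule name for the expression on its right.
# Availability SLI: fraction of requests that succeed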
sli_availability = sum(rate(http_requests_total{status!~"5.."}[5m]))
                 / sum(rate(http_requests_total[5m]))

# Latency SLI: fraction of requests served quickly
# (must use histogram, not average; averages lie)
# Requires a bucket boundary at exactly 0.3s in the histogram configuration.
sli_latency = sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
            / sum(rate(http_request_duration_seconds_count[5m]))

# Joint SLI: succeeds AND is fast
# Assumes the histogram carries a status label; many instrumentations omit it.
sli_joint = sum(rate(http_request_duration_seconds_bucket{
              le="0.3", status!~"5.."}[5m]))
          / sum(rate(http_request_duration_seconds_count[5m]))

# Freshness SLI for a data pipeline: rows updated within deadline
sli_freshness = count(time() - max_over_time(last_updated[1h]) < 3600)
              / count(last_updated)`}

The denominator matters. "Average latency" is not an SLI — it has no good/total ratio and one outlier ruins it. Histogram-based fractions are the standard form because they roll up cleanly and survive label aggregation.
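To make the sum/sum roll-up concrete, here is a minimal Python sketch. The `LatencyHistogram` type, the `merge` helper, and the pod counts are all hypothetical and only mimic Prometheus-style cumulative buckets; they are not part of any client library.

```python
from dataclasses import dataclass


@dataclass
class LatencyHistogram:
    """Cumulative bucket counts: buckets[le] = requests that took <= le seconds."""
    buckets: dict[float, int]
    total: int


def latency_sli(hist: LatencyHistogram, threshold_s: float) -> float:
    """Fraction of requests served within threshold_s (must be a bucket boundary)."""
    if threshold_s not in hist.buckets:
        raise ValueError(f"no bucket boundary at {threshold_s}s")
    if hist.total == 0:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    return hist.buckets[threshold_s] / hist.total


def merge(a: LatencyHistogram, b: LatencyHistogram) -> LatencyHistogram:
    """Roll up two histograms by adding counts bucket by bucket (sum/sum)."""
    shared = a.buckets.keys() & b.buckets.keys()
    return LatencyHistogram(
        buckets={le: a.buckets[le] + b.buckets[le] for le in shared},
        total=a.total + b.total,
    )


# Made-up counts for two pods with very different traffic volumes.
pod_a = LatencyHistogram({0.1: 900, 0.3: 990, 1.0: 1000}, total=1000)
pod_b = LatencyHistogram({0.1: 50, 0.3: 80, 1.0: 100}, total=100)

fleet = merge(pod_a, pod_b)
fleet_sli = latency_sli(fleet, 0.3)  # 1070 good out of 1100 total
print(f"pod_a={latency_sli(pod_a, 0.3):.4f} "
      f"pod_b={latency_sli(pod_b, 0.3):.4f} fleet={fleet_sli:.4f}")
```

Because the fleet value is built from summed counts, the busy pod is weighted by its traffic automatically; averaging the two per-pod ratios would overweight the small pod.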

Error Budget Math

The error budget is what you're allowed to "spend" on failures. It is denominated in the same units as your SLI: bad-event-fraction * window.

{`SLO target  = 99.9%
window      = 28 days = 28 * 86400 s = 2,419,200 s
budget      = (1 - 0.999) * 2,419,200 = 2,419.2 s = ~40 minutes

# If your service does 100M requests in 28 days:
budget_in_requests = 100_000_000 * 0.001 = 100,000 bad requests allowed

# Burn rate = observed_error_ratio / (1 - SLO)
# At burn rate 1.0 you consume 1/28 of the budget per day = 3.57%/day and
# end the window with exactly zero budget left.
# A burn rate of 14.4 empties a 28-day budget in 28 / 14.4 = ~1.94 days.

# Why 14.4? It comes from the SRE Workbook's 30-day example:
14.4 = 0.02 * (30 days * 24 h) / 1 h   # 2% of a 30d budget spent in 1h
6.0  = 0.05 * (30 days * 24 h) / 6 h   # 5% of a 30d budget spent in 6h
2.0  = 28 days / (14 days)             # exhausts a 28d budget in half the window
1.0  = sustainable for a window`}
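The same arithmetic can be sketched as a few small Python functions. The function names are illustrative, and the 1.44% error ratio in the example is an arbitrary input chosen to land on a 14.4x burn rate.

```python
SECONDS_PER_DAY = 86_400


def error_budget_seconds(slo: float, window_days: float) -> float:
    """Allowed bad-event time in the window, e.g. slo=0.999 for 99.9%."""
    return (1.0 - slo) * window_days * SECONDS_PER_DAY


def error_budget_requests(slo: float, total_requests: int) -> float:
    """Allowed bad requests given the traffic expected in the window."""
    return (1.0 - slo) * total_requests


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    return window_days / rate


budget_s = error_budget_seconds(0.999, 28)
rate = burn_rate(0.0144, 0.999)  # 1.44% of requests failing right now
print(f"budget: {budget_s:.1f} s ({budget_s / 60:.1f} min)")
print(f"bad requests allowed: {error_budget_requests(0.999, 100_000_000):,.0f}")
print(f"burn rate: {rate:.1f}x, budget gone in {days_to_exhaustion(rate, 28):.2f} days")
```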

Multi-Window Multi-Burn-Rate Alerts

Page on burn rate, not raw error rate. The Google SRE workbook recommends two complementary alerts: a fast one for "we're burning the whole monthly budget in 2 days" and a slow one for "we've been slowly bleeding for the last 6 hours."

{`groups:
  - name: api_slo_alerts
    rules:
      # Fast burn: 14.4x burn rate, requires confirmation in two windows
      - alert: ApiBurnRateFast
        expr: |
          (
            (1 - sli_availability_5m) > 14.4 * (1 - 0.999)
          and
            (1 - sli_availability_1h) > 14.4 * (1 - 0.999)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API burning 14.4x normal rate - 100% of budget gone in 2 days"

      # Slow burn: 6x burn rate over a longer window
      - alert: ApiBurnRateSlow
        expr: |
          (
            (1 - sli_availability_30m) > 6 * (1 - 0.999)
          and
            (1 - sli_availability_6h) > 6 * (1 - 0.999)
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "API burning 6x normal rate - 100% of budget gone in ~5 days"`}

The two-window AND prevents single-spike false positives. The fast alert catches true incidents quickly (5min + 1h). The slow alert catches gradual degradations (30min + 6h). Both fire long before the budget is exhausted, leaving time to fix the underlying issue.
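The two-window logic can be checked offline with a small Python sketch. It assumes you already have the error ratio for each lookback window; the `BurnRateRule` type and the sample ratios are made up for illustration and are not a Prometheus API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    factor: float       # burn-rate multiple that triggers the alert
    short_window: str   # confirms the problem is still happening
    long_window: str    # confirms the problem is significant


FAST = BurnRateRule("ApiBurnRateFast", 14.4, "5m", "1h")
SLOW = BurnRateRule("ApiBurnRateSlow", 6.0, "30m", "6h")


def should_fire(rule: BurnRateRule, error_ratio: dict[str, float],
                slo: float = 0.999) -> bool:
    """True only if BOTH windows exceed factor * (1 - slo)."""
    threshold = rule.factor * (1.0 - slo)
    return (error_ratio[rule.short_window] > threshold
            and error_ratio[rule.long_window] > threshold)


# Hypothetical error ratios per lookback window for three situations.
spike = {"5m": 0.05, "1h": 0.002, "30m": 0.004, "6h": 0.0005}
incident = {"5m": 0.05, "1h": 0.03, "30m": 0.04, "6h": 0.008}
slow_bleed = {"5m": 0.007, "1h": 0.007, "30m": 0.007, "6h": 0.0065}

for label, ratios in [("spike", spike), ("incident", incident),
                      ("slow_bleed", slow_bleed)]:
    print(label, "fast:", should_fire(FAST, ratios),
          "slow:", should_fire(SLOW, ratios))
```

The short spike trips the 5m window but not the 1h window, so nothing fires; the slow bleed never reaches 14.4x but does hold above 6x in both of its windows.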

Translating SLO to "9s"

The relationship between SLO percentage and allowed downtime is non-linear. Each additional nine cuts the allowed downtime by a factor of 10, so moving from 99.9% to 99.99% is a 10x tighter target, not a 0.09% tighter one.

SLO     | Allowed bad/month | Allowed bad/year | Practical meaning
90%     | 3 days            | 36.5 days        | "Hobby project"
99%     | 7.2h              | 3.65 days        | "Internal tools"
99.5%   | 3.6h              | 1.83 days        | "Single-region SaaS"
99.9%   | 43.2 min          | 8.76h            | "Mainstream SaaS"
99.95%  | 21.6 min          | 4.38h            | "Critical SaaS"
99.99%  | 4.32 min          | 52.6 min         | "Hyperscaler tier"
99.999% | 26s               | 5.26 min         | "Telco / payments core"
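The table values can be reproduced with a short Python sketch, assuming a 30-day month and a 365-day year. The helper names are illustrative.

```python
def allowed_bad_seconds(slo_percent: float, period_days: float) -> float:
    """Seconds of bad time permitted in the period at the given SLO."""
    return (100.0 - slo_percent) / 100.0 * period_days * 86_400.0


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the number >= 1."""
    if seconds >= 86_400:
        return f"{seconds / 86_400:.2f} days"
    if seconds >= 3_600:
        return f"{seconds / 3_600:.2f} h"
    if seconds >= 60:
        return f"{seconds / 60:.2f} min"
    return f"{seconds:.0f} s"


for slo in (90.0, 99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    per_month = humanize(allowed_bad_seconds(slo, 30))   # 30-day month assumed
    per_year = humanize(allowed_bad_seconds(slo, 365))   # 365-day year assumed
    print(f"{slo:>7}%  month: {per_month:<12} year: {per_year}")
```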

SLA vs SLO

Internally, you target an SLO. Externally, you commit to an SLA. Always set the SLO stricter than the SLA — the SLO is your warning track before legal liability hits.

SLO (internal)

What engineering targets. Drives prioritization. Crossing it triggers internal alerts and a deploy freeze. No external party knows or cares about your SLO.

SLA (external)

Contractual. Breach triggers credits, refunds, or termination rights. Lawyers write SLAs; engineers enforce SLOs strict enough to never breach SLAs.

The gap

If your SLA is 99.5%, set SLO at 99.9%. The buffer absorbs the noise of measurement and the gap between detection and remediation. Hitting your SLO consistently means you never breach the SLA.

The contract trap

SLAs measured in calendar months reset on the 1st. SLOs measured in rolling 28-day windows give a smoother signal. Don't let the SLA window dictate the SLO window.

The SRE Workbook Approach

Google's Site Reliability Engineering book and the SRE Workbook codified the playbook most modern SLO practice descends from.

  • SLOs are user-facing. They measure what users experience, not internal CPU or memory metrics.
  • One SLO per user journey. Don't define 50 SLOs. Pick the 3-5 critical user paths and SLO each.
  • Error budget > 100% uptime obsession. Burning down a budget intentionally (chaos testing, risky launches) is healthy. Refusing to spend any budget means you're under-shipping.
  • Burn-rate alerts, not threshold alerts. "Latency > 300ms" pages on every spike. "We're using budget 14x normal" pages only when it actually matters.
  • The error budget policy. Pre-agreed actions when the budget is exhausted: deploy freeze, on-call rotation tightens, postmortem-driven reliability sprint. Written down in advance, not negotiated mid-incident.

Tradeoffs

Strict SLO = brittle culture

99.999% sounds impressive but punishes every deploy. Teams game the metric instead of improving the service. Loosen until the SLO leaves room to ship.

Loose SLO = no signal

99% is so easy to hit that the alert never fires. Tighten until the budget actually constrains behavior — the discomfort is the feature.

Window length

Short windows (1 day) react fast but flap on weekly traffic patterns. Long windows (90 days) are stable but hide acute failures inside aggregates. 28 days is the standard compromise.
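A rough way to see the window tradeoff is to compute the same SLI over different rolling windows. The sketch below uses made-up daily counts (27 clean days followed by one bad day) and a hypothetical `rolling_sli` helper.

```python
def rolling_sli(daily: list[tuple[int, int]], window_days: int) -> list[float]:
    """SLI per day over the trailing window; daily holds (good, total) counts."""
    out = []
    for i in range(len(daily)):
        chunk = daily[max(0, i - window_days + 1): i + 1]
        good = sum(g for g, _ in chunk)
        total = sum(t for _, t in chunk)
        out.append(good / total if total else 1.0)  # no traffic treated as good
    return out


# 27 clean days, then one day where 10% of requests fail.
history = [(1000, 1000)] * 27 + [(900, 1000)]

one_day = rolling_sli(history, 1)
four_weeks = rolling_sli(history, 28)
print(f"1-day window on the bad day:  {one_day[-1]:.4f}")
print(f"28-day window on the bad day: {four_weeks[-1]:.4f}")
```

The 1-day window drops sharply on the bad day, while the 28-day window moves only slightly because the failure is diluted across four weeks of traffic.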

Composite SLOs

Service A's 99.9% multiplied by Service B's 99.9% in series = 99.8% effective. Be honest about composition; "all green dashboard" can hide a multiplicative red.
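The multiplication is easy to sketch in Python. It assumes the services fail independently and that a request needs every service in the chain, which real systems only approximate.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a request path that needs every listed service."""
    return math.prod(availabilities)


two_hops = serial_availability([0.999, 0.999])
ten_hops = serial_availability([0.999] * 10)
print(f"2 services at 99.9%:  {two_hops:.4%}")
print(f"10 services at 99.9%: {ten_hops:.4%}")
```

Ten dependencies that each meet three nines leave the path at roughly two nines, which is why composite targets are worth computing rather than assuming.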

FAQ

How do I pick the right SLO target?

Look at historical performance. If you're at 99.7% naturally, set the SLO at 99.5% (slightly loose) and tighten over time. If users start complaining at 99.9% but tolerate 99%, the right target is somewhere between — SLO is a product decision, not just engineering.

Why measure latency as a fraction, not a percentile?

Percentiles don't aggregate. You cannot derive the P99 of two services' combined traffic from their individual P99s, whether by averaging, summing, or taking the maximum. Fractions ("requests faster than 300ms / total requests") aggregate cleanly via sum/sum and survive any grouping you want.
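A small Python sketch with made-up latencies shows the difference. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile; p is an integer percent such as 99."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100), integer math
    return ordered[rank - 1]


def fraction_within(values: list[float], threshold_s: float) -> float:
    """Fraction of requests at or under the latency threshold."""
    return sum(1 for v in values if v <= threshold_s) / len(values)


# Hypothetical latencies in seconds: a busy service and a small, slower one.
service_a = [0.1] * 990 + [2.0] * 10
service_b = [0.2] * 90 + [5.0] * 10
combined = service_a + service_b

p99_a = percentile(service_a, 99)
p99_b = percentile(service_b, 99)
p99_true = percentile(combined, 99)
p99_averaged = (p99_a + p99_b) / 2  # a tempting but wrong way to combine

good = (sum(1 for v in service_a if v <= 0.3)
        + sum(1 for v in service_b if v <= 0.3))
sli_sum_over_sum = good / (len(service_a) + len(service_b))

print(f"P99 a={p99_a} b={p99_b} averaged={p99_averaged} true={p99_true}")
print(f"fraction <= 0.3s via sum/sum: {sli_sum_over_sum:.4f}")
print(f"fraction <= 0.3s computed directly: {fraction_within(combined, 0.3):.4f}")
```

Neither the average nor the maximum of the two P99s matches the true combined P99, while the fraction built from summed counts equals the fraction computed on the combined data.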

What about SLOs for batch jobs?

Use freshness or correctness SLIs. "Fraction of daily reports that finished within deadline." "Fraction of pipeline outputs that match validation." Same SLO math, the SLI just measures something other than per-request success.
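As an illustration, a freshness SLI for a daily report could be computed like this. The run records, the one-hour deadline, and the `freshness_sli` helper are all hypothetical; a run that never finished is counted as bad.

```python
from datetime import datetime, timedelta
from typing import Optional

# (scheduled_at, finished_at); finished_at is None if the run never completed.
Run = tuple[datetime, Optional[datetime]]


def freshness_sli(runs: list[Run], deadline: timedelta) -> float:
    """Fraction of runs that finished within `deadline` of their schedule."""
    if not runs:
        return 1.0  # assumption: no scheduled runs counts as meeting the SLI
    good = sum(
        1 for scheduled, finished in runs
        if finished is not None and finished - scheduled <= deadline
    )
    return good / len(runs)


runs: list[Run] = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 40)),  # on time
    (datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 2, 3, 30)),  # late
    (datetime(2024, 1, 3, 2, 0), None),                         # never finished
    (datetime(2024, 1, 4, 2, 0), datetime(2024, 1, 4, 2, 59)),  # on time
]

sli = freshness_sli(runs, deadline=timedelta(hours=1))
print(f"freshness SLI: {sli:.2f}")
```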

How do I handle planned maintenance?

Don't exempt it. If users see downtime, it counts. Either schedule maintenance during organic low-traffic windows or budget for it. Exempting maintenance from the SLO is how teams convince themselves they're hitting the target while users disagree.

Can SLOs replace traditional alerts?

For user-facing services, mostly yes — burn-rate alerts are sufficient and more actionable than threshold alerts. Infrastructure alerts (disk full, certificate expiring) still need traditional thresholds because no SLI captures them.

What's the deal with budget burn down vs burn rate?

Burn down = total budget consumed cumulatively (going from 100% to 0% over the window). Burn rate = current speed of consumption. Alerts fire on burn rate; dashboards visualize burn down.
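The two views can be computed from the same daily counts. In this sketch the listed days are treated as the whole window for sizing the budget, which is a simplifying assumption, and the numbers are made up.

```python
def burn_report(daily_bad: list[int], daily_total: list[int],
                slo: float) -> list[tuple[float, float]]:
    """Per day: (burn rate that day, fraction of budget remaining so far)."""
    budget = (1.0 - slo) * sum(daily_total)  # bad requests allowed in the window
    consumed = 0
    report = []
    for bad, total in zip(daily_bad, daily_total):
        consumed += bad
        rate = (bad / total) / (1.0 - slo) if total else 0.0
        remaining = 1.0 - consumed / budget
        report.append((rate, remaining))
    return report


report = burn_report(
    daily_bad=[1000, 0, 2000, 500],
    daily_total=[1_000_000] * 4,
    slo=0.999,
)
for day, (rate, remaining) in enumerate(report, start=1):
    print(f"day {day}: burn rate {rate:.2f}x, budget remaining {remaining:.1%}")
```

Burn rate jumps around day to day, while the remaining-budget series only ever goes down within a window; that is why the former suits alerts and the latter suits dashboards.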