SLI & SLO
Error budgets, burn rates, and alerts that actually mean something
An SLI (Service-Level Indicator) is a measurement: fraction of HTTP requests served in under 300ms with status < 500. An SLO (Service-Level Objective) is a target on that measurement: 99.9% over the last 28 days. An SLA (Service-Level Agreement) is a contract that says what happens if the SLO is breached — usually money.
The Google SRE approach reframes reliability work around the error budget: the gap between the target and 100%. At 99.9% over 28 days, that is about 40 minutes of allowable downtime per window. While budget remains, ship features fast. When it is exhausted, freeze deploys and fix reliability. This single number aligns engineering and product on when to push and when to stabilize.
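As a rough illustration of that arithmetic, the sketch below computes the budget for a window and the share still unspent. The function names and the request counts are hypothetical and chosen only for the example.

```python
def error_budget_seconds(slo: float, window_days: int) -> float:
    """Allowed 'bad' time in the window, in seconds."""
    return (1.0 - slo) * window_days * 86_400


def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_bad = (1.0 - slo) * total
    if allowed_bad == 0:
        raise ValueError("no traffic or a 100% target leaves no budget to measure")
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


budget_s = error_budget_seconds(0.999, 28)
print(f"{budget_s / 60:.1f} minutes")  # 40.3 minutes

# Hypothetical window: 60,000 bad requests out of 100M against a 99.9% target.
remaining = budget_remaining(good=99_940_000, total=100_000_000, slo=0.999)
print("ship" if remaining > 0 else "freeze")  # ship
```

Here 60% of the budget is spent and 40% remains, so the policy described above would still allow feature work.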
The Anatomy of an SLO
Every SLO needs three components: an SLI to measure, a target to hit, and a window to measure over. Get any of these wrong and the SLO is meaningless.
Defining a Good SLI
A good SLI is a ratio of good events over total events. Both numerator and denominator must be measurable from production. The user experience — not internal health — is what defines "good".
{`# Names on the left are labels for the expressions (e.g. recording-rule names), not PromQL syntax.

# Availability SLI: fraction of requests that succeed
sli_availability = sum(rate(http_requests_total{status!~"5.."}[5m]))
                 / sum(rate(http_requests_total[5m]))

# Latency SLI: fraction of requests served quickly
# (must use a histogram, not an average; averages lie)
sli_latency = sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
            / sum(rate(http_request_duration_seconds_count[5m]))

# Joint SLI: succeeds AND is fast
# (assumes the histogram carries a status label)
sli_joint = sum(rate(http_request_duration_seconds_bucket{le="0.3", status!~"5.."}[5m]))
          / sum(rate(http_request_duration_seconds_count[5m]))

# Freshness SLI for a data pipeline: rows updated within deadline
sli_freshness = count(time() - max_over_time(last_updated[1h]) < 3600)
              / count(last_updated)`}
The denominator matters. "Average latency" is not an SLI: it has no good/total ratio, and a single outlier can distort it. Histogram-based fractions are the standard form because they roll up cleanly and survive label aggregation.
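The point about averages can be shown with a small Python sketch. The latency samples and the `good_total` helper are made up for illustration; the 0.3 s threshold matches the bucket used above.

```python
from statistics import mean

THRESHOLD_S = 0.3

# Hypothetical latencies in seconds for two instances of one service.
instance_a = [0.05] * 99 + [30.0]  # 99 fast requests and one 30 s outlier
instance_b = [0.25] * 100          # every request under the threshold


def good_total(latencies, threshold=THRESHOLD_S):
    """Return (good, total): requests served within the threshold, and all requests."""
    return sum(1 for x in latencies if x <= threshold), len(latencies)


good_a, total_a = good_total(instance_a)
good_b, total_b = good_total(instance_b)

# Fractions roll up by summing numerators and denominators separately.
sli = (good_a + good_b) / (total_a + total_b)
print(f"latency SLI: {sli:.3f}")  # latency SLI: 0.995

# The mean for instance A is pulled above the threshold by a single request.
print(mean(instance_a) > THRESHOLD_S)  # True
```

Instance A served 99% of its requests in 50 ms, yet its mean latency exceeds 300 ms. The good/total fraction reports what users actually saw.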
Error Budget Math
The error budget is what you're allowed to "spend" on failures. It is denominated in the same units as your SLI: bad-event-fraction * window.
{`SLO target = 99.9%
window = 28 days = 28 * 86400 s = 2,419,200 s
budget = (1 - 0.999) * 2,419,200 = 2,419.2 s = ~40 minutes
# If your service does 100M requests in 28 days:
budget_in_requests = 100_000_000 * 0.001 = 100,000 bad requests allowed
# Burn rate = observed bad-event fraction / (1 - SLO target)
# At burn rate 1.0 you consume 1 / 28 * 100% of budget per day = 3.57%/day
# Time to exhaustion = window / burn rate.
# A burn rate of 1.0 = nominal pace. A burn rate of 14.4 = 100% of the
# budget gone in about 2 days.
# Why 14.4? The SRE Workbook defines it on a 30-day window as
# "2% of the budget spent in 1 hour":
14.4 = 0.02 * 720 h / 1 h # on a 28-day window: 28 / 14.4 = ~1.9 days to exhaustion
2.0 = 28 days / (14 days) # exhausts in half the window
1.0 = sustainable for a window`}
Multi-Window Multi-Burn-Rate Alerts
Page on burn rate, not raw error rate. The Google SRE workbook recommends two complementary alerts: a fast one for "we're burning the whole monthly budget in 2 days" and a slow one for "we've been slowly bleeding for the last 6 hours."
{`# sli_availability_<window> are assumed to be recording rules for the
# availability SLI computed over that window.
groups:
  - name: api_slo_alerts
    rules:
      # Fast burn: 14.4x burn rate, requires confirmation in two windows
      - alert: ApiBurnRateFast
        expr: |
          (
            (1 - sli_availability_5m) > 14.4 * (1 - 0.999)
            and
            (1 - sli_availability_1h) > 14.4 * (1 - 0.999)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API burning 14.4x normal rate - 100% of budget gone in ~2 days"
      # Slow burn: 6x burn rate over a longer window
      - alert: ApiBurnRateSlow
        expr: |
          (
            (1 - sli_availability_30m) > 6 * (1 - 0.999)
            and
            (1 - sli_availability_6h) > 6 * (1 - 0.999)
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "API burning 6x normal rate - 100% of budget gone in ~5 days"`}
The two-window AND prevents single-spike false positives. The fast alert catches true incidents quickly (5min + 1h). The slow alert catches gradual degradations (30min + 6h). Both fire long before the budget is exhausted, leaving time to fix the underlying issue.
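The two-window logic can be sketched outside Prometheus as well. This is a minimal illustration, assuming the error ratios for each window are already computed. `burn_rate` and `should_alert` are hypothetical helper names, not part of any library.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than the sustainable pace the budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_alert(short_err: float, long_err: float, slo: float, threshold: float) -> bool:
    """Fire only if BOTH the short and the long window exceed the threshold."""
    return (burn_rate(short_err, slo) > threshold
            and burn_rate(long_err, slo) > threshold)


SLO = 0.999

# A brief spike: the 5m window is hot, but the 1h window is still calm.
print(should_alert(short_err=0.05, long_err=0.002, slo=SLO, threshold=14.4))  # False

# A sustained incident: both windows are above 14.4x.
print(should_alert(short_err=0.05, long_err=0.02, slo=SLO, threshold=14.4))  # True

# Days until a 28-day budget is gone at a constant burn rate.
print(round(28 / 14.4, 2), round(28 / 6, 2))  # 1.94 4.67
```

The long window confirms the problem is sustained. The short window makes the alert clear quickly once the error rate recovers.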
Translating SLO to "9s"
Allowed downtime shrinks by a factor of 10 with every added nine. Going from 99.9% to 99.99% looks like a 0.09-point change, but it leaves one tenth of the budget, so each additional 9 makes the goal roughly 10x harder, not 1.1x harder.
| SLO | Allowed downtime per 30-day month | Allowed downtime per year | Practical meaning |
|---|---|---|---|
| 90% | 3 days | 36.5 days | "Hobby project" |
| 99% | 7.2h | 3.65 days | "Internal tools" |
| 99.5% | 3.6h | 1.83 days | "Single-region SaaS" |
| 99.9% | 43.2 min | 8.76h | "Mainstream SaaS" |
| 99.95% | 21.6 min | 4.38h | "Critical SaaS" |
| 99.99% | 4.32 min | 52.6 min | "Hyperscaler tier" |
| 99.999% | 26s | 5.26 min | "Telco / payments core" |
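The table values follow from one formula. The sketch below reproduces them, assuming a 30-day month and a 365-day year as the table does. The function name is illustrative.

```python
def allowed_downtime_seconds(slo_percent: float, days: float) -> float:
    """Seconds of total unavailability permitted over `days` at this SLO."""
    return (1.0 - slo_percent / 100.0) * days * 86_400


for slo in (99.0, 99.9, 99.99, 99.999):
    month_min = allowed_downtime_seconds(slo, 30) / 60
    year_min = allowed_downtime_seconds(slo, 365) / 60
    print(f"{slo}%: {month_min:.2f} min/month, {year_min:.1f} min/year")
```

Each step down the loop divides both figures by 10, which is the factor the section describes.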
SLA vs SLO
Internally, you target an SLO. Externally, you commit to an SLA. Always set the SLO stricter than the SLA — the SLO is your warning track before legal liability hits.
SLO (internal)
What engineering targets. Drives prioritization. Crossing it triggers internal alerts and a deploy freeze. No external party knows or cares about your SLO.
SLA (external)
Contractual. Breach triggers credits, refunds, or termination rights. Lawyers write SLAs; engineers enforce SLOs strict enough to never breach SLAs.
The gap
If your SLA is 99.5%, set the SLO at 99.9%. The buffer absorbs measurement noise and the lag between detection and remediation. If you consistently hit the SLO, you stay well clear of the SLA.
The contract trap
SLAs measured in calendar months reset on the 1st. SLOs measured in rolling 28-day windows give a smoother signal. Don't let the SLA window dictate the SLO window.
The SRE Workbook Approach
Google's Site Reliability Engineering book and the SRE Workbook codified the playbook most modern SLO practice descends from.
- SLOs are user-facing. They measure what users experience, not internal CPU or memory metrics.
- One SLO per user journey. Don't define 50 SLOs. Pick the 3-5 critical user paths and define an SLO for each.
- Error budget > 100% uptime obsession. Burning down a budget intentionally (chaos testing, risky launches) is healthy. Refusing to spend any budget means you're under-shipping.
- Burn-rate alerts, not threshold alerts. "Latency > 300ms" pages on every spike. "We're using budget 14x normal" pages only when it actually matters.
- The error budget policy. Pre-agreed actions when the budget is exhausted: deploy freeze, on-call rotation tightens, postmortem-driven reliability sprint. Written down in advance, not negotiated mid-incident.
Tradeoffs
Strict SLO = brittle culture
99.999% sounds impressive but punishes every deploy. Teams game the metric instead of improving the service. Loosen until the SLO leaves room to ship.
Loose SLO = no signal
99% is so easy to hit that the alert never fires. Tighten until the budget actually constrains behavior — the discomfort is the feature.
Window length
Short windows (1 day) react fast but flap on weekly traffic patterns. Long windows (90 days) are stable but hide acute failures inside aggregates. 28 days is the standard compromise.
Composite SLOs
Service A's 99.9% multiplied by Service B's 99.9% in series = 99.8% effective. Be honest about composition; "all green dashboard" can hide a multiplicative red.
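A quick way to check the composition arithmetic, assuming failures in each service are independent and every request needs all of them (a simplification that real systems only approximate):

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a request path that needs every dependency to succeed."""
    return prod(availabilities)


print(f"{serial_availability(0.999, 0.999):.4%}")     # 99.8001%
print(f"{serial_availability(*[0.999] * 5):.4%}")     # 99.5010%
```

Five 99.9% services in series land near 99.5%, which is why a chain of individually healthy dependencies can still miss a 99.9% journey-level target.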
FAQ
How do I pick the right SLO target?
Look at historical performance. If you're at 99.7% naturally, set the SLO at 99.5% (slightly loose) and tighten over time. If users start complaining at 99.9% but tolerate 99%, the right target is somewhere between — SLO is a product decision, not just engineering.
Why measure latency as a fraction, not a percentile?
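Percentiles don't aggregate. You cannot add, average, or otherwise combine P99(serviceA) and P99(serviceB) to get the P99 of their combined traffic; that requires the underlying distributions. Fractions ("requests faster than 300ms / total requests") aggregate cleanly via sum/sum and survive any grouping you want.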
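A small sketch makes the difference concrete. The samples are invented, and `p99` uses the nearest-rank definition, which is only one of several common percentile definitions.

```python
def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def fast_counts(samples, threshold=0.3):
    """Return (good, total) for requests at or under the threshold."""
    return sum(1 for x in samples if x <= threshold), len(samples)


# Hypothetical latencies (seconds): A is busy and fast, B is quiet and slow.
service_a = [0.1] * 990 + [0.2] * 10
service_b = [0.1] * 5 + [2.0] * 5
combined = service_a + service_b

# The combined P99 is neither input P99, nor their sum or mean.
print(p99(service_a), p99(service_b), p99(combined))  # 0.1 2.0 0.2

# Good/total counts, in contrast, add up exactly.
good_a, total_a = fast_counts(service_a)
good_b, total_b = fast_counts(service_b)
print((good_a + good_b, total_a + total_b) == fast_counts(combined))  # True
```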
What about SLOs for batch jobs?
Use freshness or correctness SLIs. "Fraction of daily reports that finished within deadline." "Fraction of pipeline outputs that match validation." Same SLO math, the SLI just measures something other than per-request success.
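As one possible shape for such an SLI, the sketch below scores daily report runs against a deadline. The run records, the one-hour deadline, and the `on_time_fraction` helper are all hypothetical.

```python
from datetime import datetime, timedelta


def on_time_fraction(runs, deadline: timedelta) -> float:
    """Fraction of runs whose output landed within `deadline` of the schedule."""
    good = sum(
        1 for scheduled, finished in runs
        if finished is not None and finished - scheduled <= deadline
    )
    return good / len(runs)


# Each run is (scheduled_at, finished_at); finished_at is None if it never finished.
day = datetime(2024, 1, 1, 6, 0)
runs = [
    (day, day + timedelta(minutes=40)),
    (day + timedelta(days=1), day + timedelta(days=1, minutes=95)),  # late
    (day + timedelta(days=2), None),                                 # failed
    (day + timedelta(days=3), day + timedelta(days=3, minutes=55)),
]

print(on_time_fraction(runs, deadline=timedelta(hours=1)))  # 0.5
```

Failed runs count as bad events in the denominator, so a job that silently stops running drags the SLI down instead of disappearing from it.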
How do I handle planned maintenance?
Don't exempt it. If users see downtime, it counts. Either schedule maintenance during organic low-traffic windows or budget for it. Exempting maintenance from the SLO is how teams convince themselves they're hitting the target while users disagree.
Can SLOs replace traditional alerts?
For user-facing services, mostly yes — burn-rate alerts are sufficient and more actionable than threshold alerts. Infrastructure alerts (disk full, certificate expiring) still need traditional thresholds because no SLI captures them.
What's the deal with budget burn down vs burn rate?
Burn down = total budget consumed cumulatively (going from 100% to 0% over the window). Burn rate = current speed of consumption. Alerts fire on burn rate; dashboards visualize burn down.
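The two quantities can be computed side by side. This is a sketch with invented daily counts and an assumed traffic volume for the window.

```python
SLO = 0.999

# Hypothetical daily (bad, total) request counts, oldest first.
daily = [(500, 1_000_000), (800, 1_000_000), (6_000, 1_000_000)]
expected_window_total = 28_000_000  # assumed traffic over the 28-day window

budget = (1 - SLO) * expected_window_total  # bad requests allowed in the window

spent = 0
for day, (bad, total) in enumerate(daily, start=1):
    spent += bad
    burn_down = 1 - spent / budget         # cumulative: share of budget remaining
    burn_rate = (bad / total) / (1 - SLO)  # instantaneous: that day's spending speed
    print(f"day {day}: remaining {burn_down:.1%}, burn rate {burn_rate:.1f}x")
```

On the third day the burn rate jumps to roughly 6x while about three quarters of the budget is still left. That is the kind of situation a burn-rate alert catches before the burn-down chart looks alarming.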