SLO Design for Products That Do Not Page Engineers
- An SLO target above 99.9% on a small service usually pages on noise, not pain.
- 30-day rolling windows beat calendar-month windows for nearly every metric.
- Multi-window burn-rate alerts (the Google SRE pattern) cut alert noise by roughly 80% versus single-threshold alerts.
- An error budget that nobody enforces is a number on a dashboard, not an operational discipline.
A team we worked with had a customer-facing service with a 99.99% availability SLO, single-threshold alerts at 99.95%, and a five-engineer on-call rotation. The on-call paged on average four times per week, two of those at 3am. Almost every page was a brief blip below the threshold that recovered before the engineer finished logging in. Within a quarter, two engineers had asked to leave the rotation.
The SLO was not the cause. The SLO design was. The target was tighter than the product needed, the window was too short, and the alert was a single threshold against a noisy SLI. Each of those is a fixable design choice. Together they produce the difference between an on-call rotation that protects the team and one that burns it.
This piece is the operational version of SLO design: the five choices that decide whether the SLO does its job, and how to set each one without overthinking.
Choice 1: the target
The most common SLO mistake is setting the target by instinct. Engineers reach for 99.99% because it sounds rigorous. Product asks for 99.999% because it sounds professional. Neither number has been derived from anything; both will set the alerting threshold above the level the product actually needs.
The honest derivation is from customer expectation, not from instinct. Two questions:
- What level of unavailability do the customers actually notice and complain about?
- What level of unavailability does the product genuinely need to be useful?
For most customer-facing SaaS products, the answer to both is somewhere between 99.5% and 99.9%. That gives a monthly downtime budget between roughly 43 minutes and 3 hours 39 minutes, which is realistic to operate. Above 99.9% the engineering cost rises sharply: each additional nine costs roughly an order of magnitude more in infrastructure, redundancy and operational discipline.
We default to 99.9% at the start of an engagement, with an explicit conversation if the product needs tighter or can tolerate looser. A team running 99.99% on a service that nobody would notice at 99.9% is paying for nines that produce no customer value, and the cost shows up in on-call burnout.
| Target | Monthly downtime allowed | Roughly fits |
|---|---|---|
| 99.0% | 7h 18m | Internal tools, beta features |
| 99.5% | 3h 39m | Non-critical product paths, content delivery |
| 99.9% | 43m 49s | Most customer-facing SaaS |
| 99.95% | 21m 54s | Payment, auth, anything where downtime equals immediate revenue loss |
| 99.99% | 4m 22s | Genuinely critical infra (high-frequency trading, life-safety) |
| 99.999% | 26s | Almost no commercial product needs this |
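The table rows follow from one line of arithmetic. A minimal Python sketch, assuming an average calendar month of 30.44 days (pass 30 for a 30-day rolling window), reproduces the figures to within a second:

```python
def allowed_downtime(slo_percent: float, window_days: float = 30.44) -> str:
    """Downtime budget implied by an availability target over one window."""
    budget_fraction = 1 - slo_percent / 100
    seconds = int(round(window_days * 24 * 3600 * budget_fraction))
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

for target in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime(target)}")
```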
Choice 2: the window
The window is the time over which the SLO is computed. The choice is between calendar-aligned windows (last calendar month) and rolling windows (last 30 days).
Calendar windows reset at midnight on the first. This is intuitive and matches how billing works. It is also operationally bad: an outage near the end of the month counts against the budget for only the few days that remain, then the service gets a fresh budget on the 1st even though the team has not actually fixed the underlying issue.
Rolling windows compute the SLO over the most recent N days. The budget tracks reality continuously. An outage on the 28th still counts on the 5th of the next month. Budget exhaustion is a real signal of accumulated risk, not an artefact of the calendar.
We default to a 30-day rolling window for almost every SLO. The exceptions are services that genuinely operate on a monthly cycle (billing, monthly reports), where calendar windows match the operational rhythm.
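The difference is easy to see in code. A minimal sketch of a rolling-window tracker (the `RollingSLO` class and per-day granularity are illustrative assumptions, not a production design):

```python
from collections import deque

class RollingSLO:
    """Availability over the most recent N days; old days age out one at a time."""

    def __init__(self, window_days: int = 30):
        # One (good, total) request-count pair per day; deque evicts the oldest.
        self.days = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def availability(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0
```

An outage recorded on the 28th keeps depressing the number until it ages out 30 days later; there is no midnight-on-the-first reset.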
Choice 3: the SLI shape
The Service Level Indicator is the actual measurement. Common shapes:
- Availability: success_count / total_count over the window
- Latency: percentage of requests under a latency threshold (e.g., 99% of requests under 500ms)
- Quality: percentage of requests with a “correct” outcome (for AI integrations, percentage of generations that pass the eval gate)
The trap most teams fall into is measuring the wrong thing. Counting all HTTP 200s as “success” includes responses with empty bodies, partial data, or wrong content. Counting only “complete request lifecycle including downstream calls” gives a much truer picture of the customer experience.
Define each SLI in terms of what the customer would call success, not what the load balancer would call success. For an API endpoint, that usually means: response code 2xx, response body non-empty and validated against schema, response within latency budget. For a UI page, it means time-to-interactive within budget. For a background job, it means completion within deadline.
Document the SLI definition in the same place as the SLO target. The number on the dashboard means nothing without the definition behind it.
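As a concrete illustration, a success predicate for an API endpoint might look like the sketch below. The `Response` shape and the schema check are assumptions for the example; the point is that every clause is something the customer would recognise as success:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    status_code: int
    body: Optional[dict]
    latency_ms: float

def validates_schema(body: dict) -> bool:
    # Stand-in for real schema validation (e.g. a jsonschema check).
    return "id" in body and "data" in body

def is_success(r: Response, latency_budget_ms: float = 500) -> bool:
    """Customer-centric SLI: success as the customer sees it,
    not as the load balancer sees it."""
    return (
        200 <= r.status_code < 300             # HTTP layer succeeded
        and r.body is not None                 # body is non-empty
        and validates_schema(r.body)           # and matches the contract
        and r.latency_ms <= latency_budget_ms  # inside the latency budget
    )
```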
Choice 4: the error budget policy
The error budget is the complement of the SLO. A 99.9% SLO means a 0.1% error budget over the window: roughly 43 minutes per 30 days for availability.
The budget is operationally meaningless without a policy that says what happens when it is consumed. Without a policy, the budget is a number on a dashboard that nobody acts on, and the SLO is decorative.
A working error budget policy:
- Budget remaining > 25%: normal operations, ship features at full speed.
- Budget remaining 10% to 25%: feature work continues but with extra deploy review and reduced risk tolerance.
- Budget remaining < 10%: feature deploys to this service freeze. The team focuses on reliability work until the budget recovers above 50%.
- Budget exhausted (negative): all non-emergency deploys to this service stop. Senior leadership is informed. The team produces a recovery plan within 48 hours.
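A policy this mechanical can be encoded directly, which removes room for mid-incident negotiation. A minimal sketch, with thresholds and wording taken from the bands above:

```python
def deploy_posture(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to a deploy posture.

    budget_remaining is 1.0 when untouched, 0.0 when fully consumed,
    negative when the budget is exhausted.
    """
    if budget_remaining < 0:
        return "stop: non-emergency deploys halt, recovery plan within 48h"
    if budget_remaining < 0.10:
        return "freeze: reliability work only until budget recovers above 50%"
    if budget_remaining < 0.25:
        return "caution: extra deploy review, reduced risk tolerance"
    return "normal: ship features at full speed"
```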
The policy must be agreed in writing by engineering and product before the first time it is invoked. Teams that try to negotiate the policy mid-incident always negotiate themselves out of it, and the SLO loses operational meaning in the same conversation.
Choice 5: multi-window burn-rate alerts
The classical SLO alert is a single threshold: “page when availability over the last hour drops below 99%.” This produces two failure modes. First, a brief dip below the threshold pages an engineer for noise that recovers in minutes. Second, a slow, gradual degradation never crosses the threshold and never pages, yet consumes the budget anyway.
The Google SRE multi-window burn-rate alert solves both. The pattern measures error budget consumption rate over multiple time windows simultaneously, and pages only when both a short window and a long window agree that the burn rate is too high.
Practical setup for a 99.9% SLO over 30 days:
| Window pair | Burn rate threshold | Page severity |
|---|---|---|
| 5 minutes AND 1 hour | 14.4x | Critical (page on-call) |
| 30 minutes AND 6 hours | 6x | Critical (page on-call) |
| 6 hours AND 3 days | 3x | Warning (ticket, no page) |
The 14.4x rate burns 2% of the monthly budget in one hour; that is a real incident. The 6x rate burns 5% over six hours; that is a sustained problem worth waking someone for. The 3x rate is a gradual erosion that warrants attention but not a 3am page.
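In code, the two-window AND condition is a few lines. A minimal sketch for the 99.9% target (window selection and error counting are left abstract; each argument is an `(errors, total)` pair for one window):

```python
SLO = 0.999
BUDGET_FRACTION = 1 - SLO  # 0.1% of requests may fail over the full window

def burn_rate(errors: int, total: int) -> float:
    """How fast the budget is burning: 1x means it lasts exactly the window."""
    return (errors / total) / BUDGET_FRACTION if total else 0.0

def should_page(short: tuple, long: tuple, threshold: float) -> bool:
    """Page only when the short AND long windows both exceed the threshold."""
    return burn_rate(*short) >= threshold and burn_rate(*long) >= threshold
```

A 2% error rate is a roughly 20x burn at this target, so it trips the 14.4x pair; a blip confined to the short window alone does not page.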
This pattern, used consistently, cuts page volume by roughly eighty percent in the engagements where we have measured before-and-after. The reduction is almost entirely in the “blip recovered before I logged in” category.
What changes operationally when SLOs are designed this way
The team stops getting woken up by noise. The on-call rotation becomes uneventful by default, with real incidents handled inside a runbook and post-mortemed afterwards. Feature work continues at normal speed when the budget is healthy.
When the budget runs low, the conversation shifts naturally toward reliability work, because the policy is in place and everyone knows what happens at 10% remaining. The conversation that would otherwise be “engineering wants to do reliability work, product wants features” becomes “the budget says we do reliability work this week.”
When an incident happens, the team can answer “how bad” in measurable terms. “We spent 12% of the monthly budget” is more actionable than “it was bad for an hour.” The post-mortem can include “we still have 60% budget remaining, here is the prevention work to keep it that way.”
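The translation from incident duration to budget consumed is one line of arithmetic. A sketch under the 99.9%/30-day numbers used throughout (`availability_during` is an assumed parameter for accounting for partial outages):

```python
def budget_spent(outage_minutes: float, availability_during: float = 0.0,
                 slo: float = 0.999, window_days: int = 30) -> float:
    """Fraction of the error budget an incident consumed."""
    budget_minutes = window_days * 24 * 60 * (1 - slo)  # 43.2 min at 99.9%/30d
    bad_minutes = outage_minutes * (1 - availability_during)
    return bad_minutes / budget_minutes
```

At these numbers, “we spent 12% of the budget” corresponds to roughly five minutes of full outage.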
The five choices above are not heroic. They are deliberate decisions, written down, applied consistently. The teams that make them get on-call rotations the engineers do not dread. The teams that skip them produce SLO documents nobody trusts and pages nobody answers.
Questions teams ask
What SLO target should we set?
Lower than instinct says. Most teams reach for 99.99%; very few products genuinely need it. Start at 99.9% for customer-facing services, 99.5% for non-critical paths, and only tighten when the product is demonstrably constrained by the current target.
Should every service have an SLO?
No. SLOs are for services where degradation maps to customer pain. Internal tools, batch jobs, and cron tasks usually need a simpler “is it running” health check, not a percentile-driven SLO. Adding SLOs to everything dilutes the signal.
Who owns the error budget when it is exhausted?
Engineering and product jointly. The discipline that makes SLOs work is that an exhausted error budget freezes new feature work on that service until the budget recovers. If product can override that policy, the SLO is a metric, not a budget.