Rollback Budget: The Overlooked Deployment Metric
- Rollback budget is the maximum acceptable time-to-recover from a bad deploy, set per service.
- If the budget is more than 30 minutes for a customer-facing service, the deploy pipeline has a bug worth fixing now.
- Database migrations are the most common source of un-rollbackable deploys. Two-phase migrations cost a day and save weekends.
- Practising rollbacks during business hours is the cheapest insurance any release-engineering team can buy.
The deploy pipeline gets attention. CI build times are measured to the second, deploy frequency is plotted on quarterly review slides, and “we ship N times per day” becomes a cultural identity. Rollbacks rarely get the same attention. They are an afterthought, a manual runbook in a wiki page, a kubectl rollout undo command that someone hopes works the first time it gets typed in production.
This piece is about giving rollbacks the same explicit treatment as deploys: the rollback budget, the failure modes that make rollbacks impossible, and the engineering work that brings recovery time back inside the budget. The teams that own this metric recover from bad deploys in minutes. The teams that do not own it recover in hours and write postmortems titled “we did not realise the rollback would take that long.”
What a rollback budget is
A rollback budget is the maximum acceptable time from “we know this deploy is bad” to “production is back on the previous version.” It is set per service, written down, and measured against on every actual rollback.
A reasonable starting set:
| Service shape | Rollback budget |
|---|---|
| Customer-facing web app | 5 to 15 minutes |
| Internal admin tool | 15 to 60 minutes |
| Batch / scheduled job | 1 to 4 hours |
| Data pipeline (ETL) | Depends; often not rollbackable, so design for forward fix instead |
| Mobile app store release | Hours to days; not a budget question, an architecture one |
The exact number per service is less important than the act of writing it down. A service without a stated budget tends to drift into “however long it takes,” which is usually three to five times longer than the team would have signed off on if asked.
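“Written down and measured against” can be as simple as budgets kept as data next to the pipeline. A minimal sketch (the service names and numbers below are illustrative, not from any real system):

```python
# Hypothetical rollback-budget registry: budgets are written down as data,
# and every real rollback is measured against them.
ROLLBACK_BUDGETS_MIN = {
    "checkout-web": 15,      # customer-facing web app
    "admin-console": 60,     # internal admin tool
    "nightly-billing": 240,  # batch / scheduled job
}

def within_budget(service: str, measured_minutes: float) -> bool:
    """True if a measured rollback time fits the service's stated budget."""
    return measured_minutes <= ROLLBACK_BUDGETS_MIN[service]
```

Keeping the table in the repo rather than a wiki means a budget change goes through review, the same as any other change to the pipeline.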
Why the budget matters more than deploy frequency
Deploy frequency rewards moving fast. Rollback budget rewards moving safely. The two together describe whether a team is shipping responsibly.
A team deploying twenty times a day with a fifteen-minute rollback budget is shipping fast and safely. A team deploying twenty times a day with no rollback budget and a two-hour rollback path is one bad deploy away from a six-hour incident. A team deploying once a week with a five-minute rollback budget is shipping slowly but safely; the rollback discipline is sound, the cadence is the limit.
The DORA research gave us four metrics: deploy frequency, lead time, change failure rate, mean time to restore. Rollback budget is the operational floor that the last one rests on. If MTTR (mean time to restore) is consistently above the budget, the team has a release-engineering bug worth treating as a P1.
Why most rollbacks are slower than the team expects
We measure actual rollback time on every engagement. The first measurement is almost always two to four times the team’s stated estimate. The reasons are predictable:
Database migrations are not reversible. A migration that added a NOT NULL column with a default is safe to roll back: the previous code simply ignores the new column. A migration that dropped a column is one-way. The team assumed it could roll back the application code, did not realise the schema change was permanent, and stayed on the bad version while a hotfix was prepared.
Caches and queues outlive the deploy. A bad deploy enqueued five thousand messages with the wrong payload shape. Rolling back the consumer code does not roll back the queued messages. The team has to drain the queue manually, which is not a rollback, it is an incident with a longer name.
Configuration drift. The bad deploy also pushed an environment variable change. Rolling back the code does not roll back the env var. The previous code reads the new env var and breaks differently.
No tested runbook. The team has never actually rolled back this service. The runbook is a wiki page from two years ago. The on-call engineer is reading it for the first time at 3am.
Each of these is fixable. None of them are obvious until the first incident makes them obvious.
The three engineering practices that bring rollbacks inside budget
Two-phase database migrations. Every schema change is split into at least two deploys. Phase one adds the new shape (a new column, a new table) without removing the old. Phase two switches the code to use the new shape. Phase three (often weeks later, after operational confidence) drops the old shape. Any phase is independently rollbackable; a bad phase-two deploy reverts to the application code that read the old shape, which still exists.
The cost is real. A schema change becomes three deploys instead of one, and the team has to keep the old shape around longer than feels necessary. The benefit is also real: a database-driven incident becomes a code rollback, not a forward fix.
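As a sketch, the phases for renaming a column might look like this (the table, column, and phase layout here are hypothetical, chosen to illustrate the split):

```python
# Hypothetical phased migration renaming users.fullname to users.display_name.
# Each phase ships as its own deploy; the old shape survives until the final
# phase, so every earlier phase can be reverted by redeploying the code that
# still reads the old column.
PHASES = {
    1: "ALTER TABLE users ADD COLUMN display_name TEXT;",        # add new shape
    2: "-- code deploy: write both columns, read display_name",  # switch the code
    3: "ALTER TABLE users DROP COLUMN fullname;",                # drop old shape later
}

def rollback_safe(phase: int) -> bool:
    """A phase is rollback-safe while the previous shape still exists."""
    return phase < 3  # once phase 3 drops the old column, only a forward fix remains
```

The point the sketch makes is structural: the dangerous statement (the DROP) is isolated in its own deploy, weeks after the code stopped depending on the old column.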
Versioned message contracts. Every queue, event stream and cross-service call uses a versioned message format. The consumer can read at least the previous version of the format. A bad producer deploy enqueues messages in version N+1; rolling back the producer enqueues version N again; the consumer keeps draining both. No queue draining, no manual cleanup.
Configuration as code, deployed atomically with the code. Environment variables, feature flags, and runtime config travel with the deploy artifact, not separately. Rolling back the deploy rolls back the config. This costs a small amount of pipeline work and removes the most common surprise from rollback day.
Practising rollbacks
The cheapest insurance any release-engineering team can buy is a scheduled rollback rehearsal. Pick a low-traffic window, deploy the current production version “again” (effectively a no-op), then trigger the rollback path as if a real incident had happened. Measure the time. Document what broke. Fix what broke before the next real incident.
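The measurement itself can be trivial; what matters is that it runs on a schedule. A sketch of the timing wrapper, where `rollback_fn` stands in for whatever mechanism the team actually uses (the budget value below is a hypothetical example):

```python
import time

def timed_rollback_drill(rollback_fn) -> float:
    """Run the rollback path as a drill and return the minutes it took.
    rollback_fn wraps the team's real rollback mechanism, e.g. a call
    that redeploys the previous artifact."""
    start = time.monotonic()
    rollback_fn()
    return (time.monotonic() - start) / 60

BUDGET_MIN = 15  # hypothetical budget for this service

minutes = timed_rollback_drill(lambda: time.sleep(0.01))  # stand-in rollback
print(f"drill took {minutes:.4f} min, inside budget: {minutes <= BUDGET_MIN}")
```

Logging each drill's number over time is what turns “we think rollback takes five minutes” into a measured trend the team can defend.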
A team that has never rolled back a particular service in the last quarter does not actually know the rollback budget for that service; they have an estimate. The first real measurement always invalidates the estimate.
We run this exercise quarterly on engagements where the rollback story matters. The first cycle usually surfaces three to five issues per service: a config that did not roll back, a cache that needed manual flush, a runbook step that referenced a tool that was renamed eighteen months ago. Each subsequent cycle finds fewer. By cycle three, the rollback time is reliably inside the budget and the team stops thinking about it as a separate skill.
Automatic rollbacks: when and how
A mature deploy pipeline can automate the rollback decision. The trigger is a key SLI (success rate, P99 latency) staying past a threshold for a defined window after a deploy. The action is a deployment-system call to redeploy the previous artifact.
Automatic rollback works well when:
- The previous artifact is genuinely safe (no migration delta, no incompatible config).
- The SLI is reliable (low noise, fast to compute).
- The time window is long enough to avoid flapping but short enough to recover before users notice.
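The flapping concern in the last point comes down to requiring the breach to be sustained across the whole window, not just present in one sample. A minimal sketch of that trigger logic (threshold and window values are illustrative):

```python
from collections import deque

class AutoRollbackTrigger:
    """Hypothetical trigger: fire only when the success-rate SLI stays below
    the threshold for a full observation window after a deploy."""

    def __init__(self, threshold: float = 0.99, window: int = 5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # sliding window of recent samples

    def observe(self, success_rate: float) -> bool:
        """Record one SLI sample; return True when rollback should fire."""
        self.samples.append(success_rate)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s < self.threshold for s in self.samples)
```

A single healthy sample resets the decision, which is what keeps one noisy data point from triggering a spurious rollback.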
Automatic rollback works badly when the team has not done the two-phase migration work above. An automatic rollback that breaks because the schema moved forward is worse than no automatic rollback, because it adds a second incident on top of the first.
We typically introduce automatic rollback after a team has manually executed at least ten rollbacks cleanly inside budget. The automation copies a discipline that already exists; it does not invent it.
What this changes about how the team plans deploys
The discipline of “every deploy must be rollbackable inside the budget” reshapes how teams think about migrations, schema changes, and risky releases. The team starts asking, before the deploy goes out, “what is the rollback path here, and how long will it take.” That single question, asked consistently, kills the most expensive class of incident: the one where the team realises mid-incident that the bad change is one-way.
It also changes the conversation with stakeholders. “We can ship this faster if we accept a longer rollback window for one week” becomes a real trade-off the team can make, instead of an implicit risk that nobody owns. The number on the wall is “rollback budget: 10 minutes,” and any change that violates it is visible at planning time, not at incident time.
The teams that own the rollback budget recover from bad deploys without drama. The teams that do not own it write postmortems with action items that look suspiciously like “make rollbacks faster.” Save the postmortem and own the budget upstream.
Questions teams ask
How fast does a rollback need to be?
Set it as a budget, not a target. For a customer-facing web service the rollback budget is typically 5 to 15 minutes. For an internal service or a batch job it can be longer. The number matters less than the discipline of having one and measuring against it.
What if a rollback is not possible due to a database migration?
Then the migration was the wrong shape. The fix is to make every migration two-phase: deploy a backwards-compatible schema first, deploy the code that uses it second, and only then drop the old schema in a third deploy. Each phase is independently rollbackable.
Should rollbacks be automatic on SLO breach?
For mature setups, yes. The trigger is a drop in success rate or a spike in error rate above a threshold sustained for a fixed window. The action is a deployment-system call to redeploy the previous artifact. Automatic rollback removes a 3am page and replaces it with a 9am email.