On-Call Rotations That Do Not Burn Engineers


An on-call rotation that pages every shift loses engineers. Here are six structural choices that produce rotations engineers will stay on for years.

  • By Orzed Team
Key takeaways
  • Page volume is the leading indicator of burnout. Track it weekly per engineer, not just per system.
  • Compensate on-call time explicitly. Implicit compensation produces resentment.
  • Rotation length should match team size: longer with more engineers, shorter with fewer.
  • Post-shift recovery time is non-optional after heavy weeks.

A team we worked with had an on-call rotation that paged its members an average of 11 times per week, with 3 to 5 of those at night. Two engineers had quit citing on-call as the breaking point. The remaining team rotated more often to cover the gap, which made the burden worse and cost the team a third engineer.

The rotation itself was fine on paper: weekly shifts, 6 engineers in rotation, paid extra for nights. The structural problems were elsewhere. The alerting was untuned (every blip paged). The runbooks were outdated. Post-incident work was not being done. Each shift was paying for accumulating reliability debt that the team was not given time to address.

We did three things over a quarter: rebuilt the alerting (cut page volume by 70 percent), made post-shift recovery time a matter of policy, and reserved one engineer-week per quarter explicitly for reliability work. Page volume dropped to under 3 per week. The rotation stabilised. No more engineers quit citing on-call.

This piece is about the six structural choices that distinguish a rotation engineers will sustain from one that breaks them.

Choice 1: page volume targets

The single most predictive metric for on-call sustainability is page volume per shift. The bands below show how engineers experience each level:

| Page volume per shift | Engineer experience |
| --- | --- |
| 0 to 2 per week | Sustainable, even pleasant |
| 3 to 5 per week | Tolerable, becomes tiring after a quarter |
| 6 to 10 per week | Burning out, engineers start asking to leave rotation |
| More than 10 per week | The rotation is broken, fix it now |

Track this weekly per engineer, not just per system. A system with 5 pages per week split across 5 engineers is fine; the same volume hitting one engineer is destructive.
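A minimal sketch of that per-engineer tracking, assuming the page log can be exported as (engineer, timestamp) pairs. The function names, the sample data, and the band labels are illustrative, not part of any paging tool's API.

```python
from collections import Counter
from datetime import datetime


def band(pages_per_week: int) -> str:
    """Map one engineer's weekly page count to the bands in the table above.

    Assumption: exactly 10 pages is treated as part of the 6 to 10 band.
    """
    if pages_per_week <= 2:
        return "sustainable"
    if pages_per_week <= 5:
        return "tolerable"
    if pages_per_week <= 10:
        return "burning out"
    return "broken"


def weekly_pages_per_engineer(pages):
    """Count pages per (engineer, ISO year, ISO week).

    `pages` is an iterable of (engineer, timestamp) pairs.
    """
    counts = Counter()
    for engineer, paged_at in pages:
        iso_year, iso_week, _ = paged_at.isocalendar()
        counts[(engineer, iso_year, iso_week)] += 1
    return counts


# Hypothetical page log for one week: six pages system-wide,
# five of which landed on the same engineer.
pages = [("asha", datetime(2024, 3, day, 2, 30)) for day in range(4, 9)]
pages.append(("ben", datetime(2024, 3, 6, 14, 0)))

for (engineer, year, week), count in sorted(weekly_pages_per_engineer(pages).items()):
    print(f"{year}-W{week:02d} {engineer}: {count} page(s) ({band(count)})")
```

Grouping by engineer first is the point: a system-level total would hide the fact that one person carried almost all of the load.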

When page volume is too high, the answer is upstream: better SLOs, multi-window burn-rate alerting, root-cause fixes for repeat-offender systems. The on-call rotation cannot absorb a fundamental reliability problem indefinitely.
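To make the difference concrete, here is a rough sketch of a single-threshold rule next to a two-window burn-rate rule. The 99.9 percent SLO and the burn-rate threshold of 14.4 over a long and a short window are assumptions borrowed from commonly cited SRE guidance, not values from the team described above; the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def single_threshold_pages(error_ratio: float, threshold: float = 0.01) -> bool:
    """The naive rule: page whenever the error ratio exceeds 1%."""
    return error_ratio > threshold


def multi_window_pages(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float = 0.999,    # assumed SLO
    burn_threshold: float = 14.4,  # assumed threshold for a 1h / 5m window pair
) -> bool:
    """Page only when both the long and the short window burn budget fast.

    The long window filters out brief blips; the short window confirms the
    problem is still happening right now.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= burn_threshold
        and burn_rate(short_window_ratio, slo_target) >= burn_threshold
    )


# A brief blip: 5% errors in the last 5 minutes, 0.5% over the last hour.
blip_short, blip_long = 0.05, 0.005
print(single_threshold_pages(blip_short))         # naive rule on the short window
print(multi_window_pages(blip_long, blip_short))  # two-window rule

# A sustained problem: 2% errors in both windows.
print(multi_window_pages(0.02, 0.02))
```

The blip trips the naive rule but not the two-window rule, while the sustained problem trips both. That is the mechanism by which burn-rate alerting removes pages without hiding real incidents.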

Choice 2: rotation length matched to team size

Rotation length is a trade-off. Shorter rotations spread the burden more evenly but reduce continuity (each on-call has less context). Longer rotations build continuity but concentrate burden.

| Team size | Recommended rotation length |
| --- | --- |
| 3-4 engineers | 24 hours, follow-the-sun preferred |
| 5-7 engineers | 1 week |
| 8-12 engineers | 1 week, with a separate secondary rotation |
| 13+ engineers | 1 week primary + 1 week secondary, with explicit handoffs |

The wrong answer is having 4 engineers on a 1-week rotation. That is on-call every fourth week, which is too frequent. Either grow the team or shorten the rotation.
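The arithmetic behind that trade-off can be sketched in a few lines. This assumes a single primary rotation with equal shifts and no secondary; the `Rotation` class is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rotation:
    engineers: int   # people in the primary rotation
    shift_days: int  # length of one shift in days

    @property
    def cycle_days(self) -> int:
        """Days until the same engineer is primary again."""
        return self.engineers * self.shift_days

    @property
    def days_off_between_shifts(self) -> int:
        """Days off call between the end of one shift and the start of the next."""
        return self.cycle_days - self.shift_days

    @property
    def share_on_call(self) -> float:
        """Fraction of calendar time each engineer spends as primary."""
        return 1 / self.engineers


for rotation in (Rotation(4, 7), Rotation(4, 1), Rotation(6, 7)):
    print(
        rotation,
        f"cycle={rotation.cycle_days}d",
        f"off={rotation.days_off_between_shifts}d",
        f"share={rotation.share_on_call:.0%}",
    )
```

Under these assumptions, shortening the shift does not change the share of time each engineer is on call; it limits how long any single stretch lasts. Only growing the team lowers the share.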

Choice 3: explicit on-call compensation

Engineers on call are working, even when not actively paged. The implicit cost (carrying a phone, declining evening plans, sleeping lightly) needs explicit recognition.

Three valid approaches:

Cash bonus per shift. A defined amount per week of primary rotation, plus a per-page fee for off-hours pages. Common in larger organisations.

Time off in lieu. Comp days earned per shift, used at the engineer’s discretion. Common in smaller engineering-led teams.

Reduced expectations. The engineer on call is exempt from non-urgent meetings, project deadlines are adjusted, they have explicit time for runbook work. Only valid when the manager genuinely respects this and it is observable.

The wrong answer is “we acknowledge it’s a sacrifice and appreciate it”. That is unpaid labour. Engineers notice.

Choice 4: post-shift recovery

After a heavy shift (multiple off-hours pages, an extended incident), the engineer needs explicit recovery time. Not “you can take it if you want”; reserved time the team protects.

A working policy:

  • After any night-time page (00:00 to 07:00 local), the engineer starts the next day at noon.
  • After two or more night-time pages in a shift, the engineer takes a day off the week after.
  • After a major incident (P1 or above), the engineer who carried it gets a half-day off the following week.
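The three rules above are simple enough to encode, which makes the policy checkable rather than discretionary. This is a sketch under stated assumptions: the `Page` and `Recovery` types are hypothetical, 07:00 is treated as outside the night window, and whether the day off and the half-day stack is left to the team.

```python
from dataclasses import dataclass
from datetime import datetime, time

NIGHT_START = time(0, 0)  # 00:00 local
NIGHT_END = time(7, 0)    # 07:00 local, treated as exclusive (assumption)


@dataclass(frozen=True)
class Page:
    at: datetime                  # local time the page fired
    major_incident: bool = False  # True for P1 or above


@dataclass(frozen=True)
class Recovery:
    nights_with_pages: tuple      # dates with a night-time page; each earns a noon start
    day_off_next_week: bool       # two or more night-time pages in the shift
    half_day_off_next_week: bool  # the engineer carried a major incident


def is_night_page(page: Page) -> bool:
    return NIGHT_START <= page.at.time() < NIGHT_END


def recovery_for_shift(pages) -> Recovery:
    """Apply the three recovery rules to all pages from one shift."""
    night_pages = [p for p in pages if is_night_page(p)]
    return Recovery(
        nights_with_pages=tuple(sorted({p.at.date() for p in night_pages})),
        day_off_next_week=len(night_pages) >= 2,
        half_day_off_next_week=any(p.major_incident for p in pages),
    )


# Hypothetical shift: two pages in one night plus a daytime major incident.
shift = [
    Page(datetime(2024, 3, 5, 2, 10)),
    Page(datetime(2024, 3, 5, 3, 40)),
    Page(datetime(2024, 3, 7, 15, 0), major_incident=True),
]
print(recovery_for_shift(shift))
```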

These are not handouts. They are the organisation paying for the operational labour up front instead of paying through quiet attrition six months later.

Choice 5: alerting discipline

Most on-call burnout is alerting burnout. The rotation pages too often because the alerting is too sensitive. Two specific failures:

Single-threshold alerts. Page when error rate > 1% for 5 minutes. Fires on every blip. Multi-window burn-rate alerting (covered in our SLO piece) cuts this by roughly 80 percent.

Alerts the team has muted in spirit. When the third “warning” page in a row turns out to be noise, engineers stop responding seriously. The alert that actually matters then gets the same diminished attention.

The discipline: every alert that fires more than twice without a real underlying issue gets retuned or deleted. There is no value in an alert nobody trusts.
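One way to apply that rule mechanically, assuming each firing is reviewed and labelled as a real issue or not. The function name and the sample alert names are hypothetical.

```python
from collections import Counter


def alerts_to_retune(firings, max_noise_firings: int = 2):
    """Return alerts that fired more than `max_noise_firings` times as noise.

    `firings` is an iterable of (alert_name, was_real_issue) pairs.
    """
    noise = Counter(name for name, was_real_issue in firings if not was_real_issue)
    return sorted(name for name, count in noise.items() if count > max_noise_firings)


# Hypothetical review of one week's firings.
firings = [
    ("disk-usage-warning", False),
    ("disk-usage-warning", False),
    ("disk-usage-warning", False),
    ("checkout-error-rate", True),
    ("checkout-error-rate", False),
    ("queue-depth", False),
    ("queue-depth", False),
]
print(alerts_to_retune(firings))
```

The review step is the expensive part; the list this produces is only as good as the labels. It still turns "nobody trusts that alert" into a concrete backlog.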

Choice 6: reliability work as part of the rotation

The on-call engineer’s primary job during shift is responding to incidents. Their secondary job is reducing future incidents. Most rotations are designed for the first and silently expect the second to happen in spare time.

A working policy: 20 to 40 percent of the on-call engineer’s shift is explicitly reserved for reliability work. Updating runbooks, addressing root causes from recent incidents, tuning alerts, building tooling that reduces future page volume.

This is not a courtesy. It is the only way the rotation does not accumulate reliability debt forever. A team that pages 5 times per week and never gets time to fix the underlying causes will page 5 times per week forever.

What about AI / LLM on-call?

AI features in production change the on-call shape:

  • Provider-side outages (OpenAI down, Anthropic 5xx) are external; the on-call engineer cannot fix them but must escalate and communicate.
  • Model behaviour drift produces gradual quality degradation, which is hard to notice in real time and easy to miss in monitoring.
  • Cost spikes (a runaway prompt loop, a cache miss surge) can be operational issues separate from quality.

The rotation needs to know the AI-specific runbooks: how to disable the AI feature, how to fall back to non-AI behaviour, how to page the provider, how to interpret model-quality metrics.

This usually requires explicit runbook investment when the first AI feature ships, and re-investment on every model upgrade.
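The "disable" and "fall back" runbook steps are much easier to follow at 3 a.m. if the code already has a switch for them. A minimal sketch of that shape, with every name hypothetical: `ProviderError` stands in for whatever the model client raises, and `call_model` for the real provider call.

```python
from typing import Callable


class ProviderError(Exception):
    """Hypothetical error raised by the model client on provider-side failures."""


def fallback_reply(question: str) -> str:
    """Non-AI behaviour: a canned response (placeholder for real fallback logic)."""
    return "Automated answers are unavailable right now. A person will follow up."


def answer(question: str, call_model: Callable[[str], str], ai_enabled: bool) -> str:
    """Answer with the model when allowed and healthy, otherwise fall back.

    `ai_enabled` is the kill switch the on-call engineer flips; in a real
    system it would be read from a feature-flag store.
    """
    if not ai_enabled:
        return fallback_reply(question)
    try:
        return call_model(question)
    except ProviderError:
        return fallback_reply(question)


def broken_provider(question: str) -> str:
    raise ProviderError("simulated provider outage")


print(answer("Where is my order?", lambda q: f"model answer to: {q}", ai_enabled=True))
print(answer("Where is my order?", broken_provider, ai_enabled=True))
print(answer("Where is my order?", lambda q: "unused", ai_enabled=False))
```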

What we install on engagements

Standard on-call review:

  1. Measure current page volume per engineer per week. Surface the noisiest systems and the most-paged engineers.
  2. Tune the worst-offender alerts. Move to multi-window burn-rate alerting and clean up single-threshold alerts.
  3. Define the recovery policy in writing.
  4. Reserve explicit reliability work time per shift.
  5. Compensate explicitly with a method the team agrees on.
  6. Track the metrics weekly: page volume, incident count, MTTR.

Total: typically three to six engineer-weeks for the alerting cleanup, plus the policy work which is mostly conversation.
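The weekly tracking in step 6 can start as a very small script. This sketch takes MTTR to mean "mean time to resolve", which is an assumption, and uses a hypothetical `Incident` record; page volume is passed in separately because noisy pages never become incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Incident:
    opened: datetime
    resolved: datetime


def mean_time_to_resolve(incidents) -> Optional[timedelta]:
    """Average of (resolved - opened), or None when there were no incidents."""
    if not incidents:
        return None
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total / len(incidents)


def weekly_summary(page_count: int, incidents) -> dict:
    """The three metrics from step 6 for one week."""
    return {
        "page_volume": page_count,
        "incident_count": len(incidents),
        "mttr": mean_time_to_resolve(incidents),
    }


# Hypothetical week: four pages, two of which became incidents.
incidents = [
    Incident(datetime(2024, 3, 5, 2, 0), datetime(2024, 3, 5, 2, 30)),
    Incident(datetime(2024, 3, 7, 15, 0), datetime(2024, 3, 7, 16, 30)),
]
print(weekly_summary(page_count=4, incidents=incidents))
```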

The teams that get this right have rotations engineers stay on for years. The teams that ignore it pay the cost in attrition. The cost of replacing a senior engineer who quit because of on-call is roughly six months of salary plus the loss of operational knowledge that took two years to build. The math is not subtle.

Frequently asked questions

Should the engineering manager be on the rotation?

Yes, in most teams. The manager being on rotation is the strongest signal that on-call work is valued. It also gives them firsthand exposure to the operational pain that drives reliability prioritisation.

How do we handle on-call across time zones?

Use a follow-the-sun rotation if the team has 3+ engineers in each zone. Otherwise, have an honest conversation about whether the off-hours coverage is worth what it costs the team. Some products genuinely do not need 24/7 coverage; honesty here saves engineers.

What's a healthy page volume per shift?

Under 3 pages per week of shift, with no more than 1 outside business hours. Above that, the rotation is paying for missing reliability work; the fix is upstream, not in the rotation.