Incident Postmortems That Produce Actual Learning

Product Operations · incident-response, postmortem

Most postmortems end as filed documents. The few that produce real change share a structure: blameless framing, contributing factors, and tracked follow-ups.

  • By Orzed Team
  • 7 min read
Key takeaways
  • A blameless framing is not soft; it is operationally necessary. Engineers censor information when blame is on the table.
  • Single root cause is almost always wrong. Real incidents have three to five contributing factors.
  • Action items that live in a postmortem doc never get done. Track them in the same backlog as feature work.
  • Quarterly cross-incident review surfaces patterns that any one postmortem misses.

The pattern repeats across every engineering team that runs incidents at any cadence. An incident happens. The on-call recovers production. A postmortem document gets written, mostly by the on-call engineer in the days after. The document gets shared in a channel. A few people read it. Action items get added to a backlog. Some get done; most do not. Three months later a similar incident happens, and the team realises the prevention work from the previous postmortem was never actually shipped.

The document is not the bug. The document is the symptom of a postmortem practice that treats writing as the deliverable instead of behaviour change. This piece is about what separates the two.

What a useful postmortem actually does

A postmortem is operationally valuable when it changes one of three things:

  • A piece of code or config that made the incident possible
  • A piece of process that delayed detection or recovery
  • A piece of shared understanding about how a part of the system actually behaves

If the postmortem produces none of those, the incident has cost the team time and produced nothing in return. The exercise was performative.

The structural changes below all bias toward producing one of those three outcomes. None of them are about formatting the document better.

Blameless framing as a working practice, not a slogan

The blameless principle is widely cited and widely misunderstood. The practical version: in the postmortem, describe what people did and why it made sense to them at the time, not whether they should have done something different.

The reason this matters is operational, not cultural. When blame is on the table, engineers censor what they tell the postmortem. Memory becomes selective; mistakes get described in passive voice; the actual chain of decisions becomes invisible. The postmortem ends up describing a sanitised version of events that misses the contributing factors, which makes the prevention work less effective.

The test for whether a postmortem framing is genuinely blameless is not whether the document uses softer language. It is whether the engineer who made the decision the team would now consider wrong feels safe explaining their actual reasoning at the time. If the answer is no, the postmortem is going to miss the most important contributing factor: why a competent engineer made a decision that contributed to the incident.

Contributing factors instead of a single root cause

The phrase “root cause” implies a single failure that, if absent, would have prevented the incident. Real incidents almost never work that way. They are sequences of small contributing factors that lined up in an unfortunate order, each individually small enough to seem unimportant.

The accident-investigation field abandoned single-root-cause thinking decades ago. Engineering postmortems still cling to it, partly because it produces tidy action items.

A more honest framing lists three to five contributing factors per incident, in the order they manifested. For an incident we triaged in 2025, the contributing factors were:

  1. A schema migration ran in production on a Friday afternoon (process)
  2. The migration locked the table for longer than the 30-second timeout (technical)
  3. The on-call paging schedule had a gap between handovers that intersected the incident (process)
  4. The runbook for this service was three months out of date (documentation)
  5. The rollback path required a manual database step that was never tested (engineering)

If the team had described the incident as “root cause: long-running migration,” the action item would have been “improve migration tooling.” Useful, but it would have left four other factors untouched, and the next similar incident would have hit at least two of them.

The five-factor framing produced five smaller action items, each owned by a different person, each shipped over the next quarter. The next migration incident did not happen.

Action items in the team backlog, not in the postmortem document

Action items that live inside a postmortem document have a near-zero completion rate. The document gets filed; nobody opens it again unless they are writing the next postmortem.

The discipline that makes prevention work happen is to add each action item to the same backlog the team uses for feature work. Same tracker, same prioritisation, same definition-of-done. The action item should be small enough to fit within a normal work cycle (typically under one engineer-week) and have a single named owner.

Action items larger than that should be broken down before they leave the postmortem meeting. A vague “improve migration tooling” never ships. A concrete “add a 30-second lock-timeout warning to the migration CI step” ships in two days.
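As a sketch of what that concrete CI step might look like: a heuristic text check over migration SQL, assuming Postgres-style migrations. The regexes, the function name, and the 30-second limit are all illustrative, not a real tool's API.

```python
import re

# Heuristic: a migration that sets lock_timeout fails fast instead of
# blocking traffic while waiting for a table lock.
LOCK_TIMEOUT_RE = re.compile(r"SET\s+(LOCAL\s+)?lock_timeout", re.IGNORECASE)

def check_migration(sql: str, limit_seconds: int = 30) -> list[str]:
    """Return warnings for a migration script that could hold long locks.

    Flags scripts that ALTER TABLE or CREATE INDEX (non-concurrently)
    without first setting a lock_timeout below the limit.
    """
    warnings = []
    risky = re.search(
        r"\b(ALTER\s+TABLE|CREATE\s+INDEX(?!\s+CONCURRENTLY))\b",
        sql, re.IGNORECASE,
    )
    if risky and not LOCK_TIMEOUT_RE.search(sql):
        warnings.append(
            f"'{risky.group(0)}' runs without a lock_timeout under "
            f"{limit_seconds}s and may block reads and writes"
        )
    return warnings
```

A check like this is deliberately crude; its job is to make the risky case loud in CI, not to statically analyse SQL.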

Track completion. A team we worked with started reviewing action item completion rate quarterly and discovered the rate had been thirty percent for two years. Naming the number changed the behaviour without changing anything else; within two quarters the rate was above seventy percent.
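Computing that number needs nothing beyond an export from the backlog tracker. A minimal sketch, where `ActionItem` is a hypothetical record shape, not any particular tracker's schema:

```python
from dataclasses import dataclass

@dataclass
class ActionItem:
    incident_id: str
    owner: str
    done: bool

def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of postmortem action items completed: the single number
    a quarterly review would put in front of the team."""
    if not items:
        return 0.0
    return sum(i.done for i in items) / len(items)
```

The point of the exercise is the named number, so the calculation should stay this boring: one rate, reported every quarter, from the same tracker the team already uses.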

The five-day cadence

The postmortem timing matters more than teams expect. Too soon and the on-call is still running on adrenaline; the analysis is shallow and reads like a war story. Too late and the details have faded; engineers reconstruct from memory rather than from logs.

Five working days from incident resolution to postmortem review is the rhythm we install on engagements. Within that window:

  • Day 1: incident resolved, on-call writes a brief timeline (not a full postmortem)
  • Day 2: 30-minute review meeting with the responders to align on the timeline
  • Days 3-4: postmortem document written, reviewed by the involved engineers
  • Day 5: postmortem review meeting with broader team, action items finalised and assigned

Faster than this and the team rushes the analysis. Slower than this and the work loses momentum.

The quarterly cross-incident review

A practice that almost no team adopts, and that produces the highest-leverage prevention work we have seen: a quarterly review of all incidents from the last quarter, looking for patterns across them.

The review surfaces what no single postmortem can: that three of the last five incidents involved deploys after 4pm local time, that two involved the same configuration system, that four involved the same downstream dependency. Each of those patterns suggests a structural fix that no individual incident’s action items would have produced.

The review is short: ninety minutes, attended by senior engineering and the on-call leads. Output is a small number of cross-cutting initiatives for the next quarter, owned by named individuals, tracked in the same backlog as everything else.

We have seen this exercise reliably produce one or two large reliability wins per quarter that no individual postmortem would have surfaced. The cost is six hours of senior engineering time every three months. The return on that time is the highest of any reliability work we install.

What stops a postmortem culture from working

Three patterns kill the practice faster than any other:

Leadership treating postmortems as performance review material. The moment an engineer’s name in a postmortem becomes a factor in their performance review, the postmortems stop being honest. The chain of causation gets sanitised. The team protects each other instead of the system.

Action items that compete with feature work without prioritisation. If reliability work is always “the next sprint” and never the current one, the action items pile up indefinitely. The error budget policy described in the SLO piece is the operational discipline that makes the trade-off explicit.

Postmortems for tiny incidents. Not every blip needs the full postmortem treatment. A five-minute degradation that recovered without intervention does not warrant a four-page document. The team will burn out on the ceremony and start skipping it for incidents that genuinely need it. Set a threshold (we usually use customer-visible impact sustained over five minutes, or error-budget consumption above a set percentage) and only run the full process on incidents above the threshold. Small incidents get a short note in the on-call log.
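Encoding the threshold as a small triage function keeps the decision consistent across on-call rotations. A sketch: the five-minute figure follows the text, while the default error-budget percentage is an illustrative assumption each team sets for itself.

```python
def needs_full_postmortem(customer_visible: bool,
                          duration_minutes: float,
                          budget_consumed_pct: float,
                          budget_threshold_pct: float = 5.0) -> bool:
    """Triage rule: full postmortem for customer-visible impact sustained
    over five minutes, or error-budget consumption above the threshold.
    Everything below gets a short note in the on-call log instead."""
    return (customer_visible and duration_minutes > 5) \
        or budget_consumed_pct > budget_threshold_pct
```

Wiring this into the incident tooling means the on-call never has to argue about whether an incident "deserves" a postmortem; the rule decides.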

What the team gets back

A postmortem practice that actually works produces a culture where engineers run incidents calmly, talk honestly about contributing factors, and trust that the prevention work will get done. Over a year of operation, the team’s mean time to recover drops, the rate of repeat incidents drops, and on-call volume drops as the underlying issues get addressed instead of patched.

None of this is mystical. It is structural. The team that sets up blameless framing, multi-factor analysis, action items in the main backlog, the five-day cadence and the quarterly review will end up with a reliability practice that compounds. The team that skips those structures and focuses on writing better postmortem documents will keep filing documents nobody reads.

The deliverable is not the postmortem. The deliverable is the change that the postmortem produces.

Frequently asked questions

Should everyone on the team see every postmortem?

Yes for engineering, conditionally for product and leadership. Wide visibility builds shared understanding of how the system actually fails. Restrict only when an incident touches sensitive customer data or legal matters; in those cases write a public summary alongside the detailed internal version.

How long should a postmortem document be?

Two to four pages. Longer documents do not get read. Use the format: timeline, contributing factors, customer impact, action items, prevention notes. If the incident is genuinely complex, attach appendices for detail; keep the main document short.

When should the postmortem happen?

Within five working days of incident resolution. Faster than that and the team is still running on adrenaline; slower and details fade. Schedule the timeline review on day 2, write the document on days 3 to 4, and hold the full review on day 5.