When the Automation Itself Becomes the Incident


Automations fail silently more often than loudly. The observability and recovery patterns that catch a broken workflow before a customer reports it.

  • By Orzed Team
  • 5 min read
Key takeaways
  • An automation without alerting is an automation that depends on a customer noticing the failure.
  • Heartbeat checks catch the 'workflow stopped running entirely' failure mode.
  • Output validation catches the 'workflow ran but produced wrong output' failure mode.
  • Idempotent recovery catches the 'workflow ran twice and duplicated the action' failure mode.

A team we worked with had an automated weekly report that summarised customer health for the account management team. The automation had been running every Monday morning for two years. One Monday, it did not run. Nobody noticed for three weeks. The account managers had been making decisions during those three weeks based on the previous report, which was increasingly stale.

The automation had failed because the source SaaS tool deprecated an authentication scheme. The workflow had not been touched in months; the team had half-forgotten it existed, so nobody was positioned to notice when it stopped firing. There was no alerting. The runbook (such as it was) was a comment in the workflow description that said “edit this if it breaks”. When we audited, we found 23 similar automations; 6 of them had not run in the last month, and nobody had noticed.

This piece is about the patterns that catch silent automation failures before customers do. None of them are exotic; all of them are skipped on most automations.

The three failure modes

Mode 1: the workflow stopped running entirely. Platform issue, deprecated auth, schedule disabled, host down. The workflow that should run daily has not run for a week. No errors, because no executions.

Mode 2: the workflow ran but produced wrong output. Schema changed upstream, the workflow read the wrong field, the output is technically valid but useless. No errors, because the workflow did not crash.

Mode 3: the workflow ran more times than it should have. Retry logic fired wrong, the platform double-executed, two replicas of the workflow are running. No errors, because each individual execution succeeded.

Each mode requires a different observability hook. A team that has none of them is hoping; a team that has all three has a robust automation surface.

Mode 1: the heartbeat check

For any scheduled or recurring automation, register an external watchdog that knows the expected cadence and alerts when too much time passes without a heartbeat from the workflow.

The pattern:

  • The workflow’s last step pings a watchdog endpoint with an “I ran successfully at {timestamp}” payload.
  • The watchdog has a configured SLA per workflow (“I expect to hear from this every 24 hours”).
  • If the SLA is breached, the watchdog pages the owner.
# Example: simple watchdog ping using a service like Healthchecks.io,
# Cronitor, or a custom endpoint (the URL scheme here is illustrative)
import requests

def ping_watchdog(workflow_id, status, error=None):
    url = f"https://watchdog.example.com/ping/{workflow_id}/{status}"
    requests.post(url, data=error or "", timeout=10)

def workflow_main():
    try:
        run_workflow()
        ping_watchdog(workflow_id="weekly-customer-health-report", status="ok")
    except Exception as e:
        ping_watchdog(workflow_id="weekly-customer-health-report", status="failed", error=str(e))
        raise

The watchdog is external because the workflow itself cannot detect “I have stopped running”. A separate system has to expect the heartbeat and notice its absence.
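
A minimal sketch of that external check, assuming the watchdog records each workflow's last successful ping timestamp; the workflow IDs and SLA values are illustrative, not from a real configuration:

```python
from datetime import datetime, timedelta

# Expected cadence per workflow, plus grace time (hypothetical values).
SLAS = {
    "weekly-customer-health-report": timedelta(days=7, hours=4),
    "daily-sync": timedelta(hours=28),
}

def check_heartbeats(last_ping, now=None):
    """Return workflow IDs whose last successful ping is older than
    their SLA, or that have never pinged at all."""
    now = now or datetime.utcnow()
    return [
        workflow_id
        for workflow_id, sla in SLAS.items()
        if last_ping.get(workflow_id) is None or now - last_ping[workflow_id] > sla
    ]
```

The watchdog runs this check on its own schedule and pages the owner for every ID it returns.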

This single pattern would have caught the weekly-report failure on day one instead of week three.

Mode 2: the output validation hook

A workflow that completes is not the same as a workflow that worked. A health-report automation that produces an empty report is not a successful run, even if the workflow exited cleanly.

Output validation runs immediately after the workflow’s main work and asserts that the output meets minimal expectations:

  • For a generated document: file exists, is non-empty, has the expected sections.
  • For a database update: the expected number of rows changed, with values in expected ranges.
  • For a notification dispatch: the recipient list is non-empty, the message has the expected fields.

The validator is a separate, deterministic step. If it fails, the workflow alerts even though the main work appeared to complete:

def workflow_main():
    report = generate_health_report()
    validation_errors = validate_report(report)
    if validation_errors:
        alert(f"Report generated but failed validation: {validation_errors}")
        return
    deliver_report(report)
    ping_watchdog(workflow_id="weekly-customer-health-report", status="ok")

The validator catches the “ran but wrong” failure mode that heartbeats miss.
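
For the health-report example, a validate_report could start as the sketch below; the section names are assumptions, not a real schema:

```python
# Deterministic checks on the generated report. Returns a list of
# human-readable errors; an empty list means the report passed.
REQUIRED_SECTIONS = ("summary", "accounts", "risk_flags")

def validate_report(report):
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in report:
            errors.append(f"missing section: {section}")
        elif not report[section]:
            errors.append(f"empty section: {section}")
    return errors
```

The checks are deliberately cheap: presence and non-emptiness catch the most common “ran but wrong” failures, like the empty report in the example above.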

Mode 3: idempotent recovery + duplicate detection

The trickiest failure mode. The workflow ran fine. Then it ran again, because the platform’s retry logic mistakenly thought the first run failed. Now the customer gets two emails, the database has two rows, the inventory was decremented twice.

Two patterns prevent this:

Idempotency keys. Each workflow execution generates a unique key (typically derived from inputs or timestamp). External actions tagged with the key check whether the same key has already been processed; if so, they return the previous result instead of re-executing.

def send_notification(user_id, message_id, content):
    # Check the idempotency table first
    existing = db.query("SELECT result FROM notifications WHERE message_id=?", (message_id,))
    if existing:
        return existing.result  # Already sent, do not resend
    result = notification_service.send(user_id, content)
    db.insert("notifications", message_id=message_id, result=result)
    return result
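
The message_id above can itself be a derived idempotency key. One minimal way to derive it, hashing the workflow ID together with a canonical form of the inputs (a sketch, not a prescribed scheme):

```python
import hashlib
import json

def idempotency_key(workflow_id, inputs):
    # Same workflow + same inputs -> same key, so a platform retry
    # with identical inputs maps to the same idempotency record.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{workflow_id}:{canonical}".encode()).hexdigest()
```

Deriving the key from inputs (rather than a random value per execution) is what makes a blind platform retry collapse onto the original record.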

Single-execution guards in the platform. Some platforms provide native deduplication: Temporal, for example, guarantees that a given workflow ID executes once, though side-effecting activity calls are still at-least-once and need idempotency of their own. Others (Zapier, n8n) do not; the team has to implement idempotency at the action layer.

For workflows where duplicate execution would cause real damage (financial transactions, customer-facing emails), idempotency is non-negotiable.

A recovery runbook is the third leg

When alerts fire, the on-call engineer needs to know:

  • What the workflow does at a high level
  • Who owns it
  • How to check whether the most recent run actually completed
  • How to disable the workflow if necessary
  • How to manually re-run it (with what inputs)
  • How to roll back any partial state changes
  • Where the credentials live

This is a 1-page document per automation. Without it, every incident is a fresh investigation. With it, recovery takes minutes instead of hours.
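
A skeleton for that one-page runbook, with placeholder values to replace per automation:

```markdown
# Runbook: weekly-customer-health-report

- **What it does:** Summarises customer health for account management, weekly.
- **Owner:** <name> (backup: <name>)
- **Check last run:** <link to execution history / watchdog dashboard>
- **Disable:** <where the schedule toggle lives>
- **Manual re-run:** <command or UI action, and the inputs it needs>
- **Roll back partial state:** <what the workflow writes, and how to undo it>
- **Credentials:** <secret manager path>
```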

Most teams write the runbook after the first incident makes them wish they had. The discipline is to write it before the workflow goes to production. We make this part of the standard automation checklist on engagements.

What about AI-in-the-loop?

LLM steps in automations introduce two extra failure modes:

Silent quality degradation. The model produces output that looks fine but is subtly wrong. The validator catches structural issues but not semantic ones. The fix: add a sample-based human review on a small percentage of outputs (1 to 5 percent), enough to catch quality regressions before they propagate.
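
The sample-based review hook can be as small as the sketch below; the rate and the downstream review queue are assumptions:

```python
import random

def sample_for_review(outputs, rate=0.02, rng=None):
    """Select roughly `rate` of outputs for human review.
    Pass a seeded random.Random for reproducible sampling in tests."""
    rng = rng or random.Random()
    return [o for o in outputs if rng.random() < rate]
```

Whatever the sampled outputs feed into (a review queue, a spreadsheet), the point is that a human sees a steady trickle of real outputs, not just the ones customers complain about.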

Provider-side model drift. The LLM provider silently updates the model under your version pin. The model now responds slightly differently. Daily evaluation against a regression suite catches this; without it, the team learns about the change from a customer complaint.

For any AI-in-the-loop automation, plan for both: a sample human review and a daily eval suite. Neither is optional.
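
The daily eval can start as a plain pass-rate check over labelled cases; model_step, the case format, and the threshold below are assumptions for illustration:

```python
def eval_pass_rate(cases, model_step):
    """Fraction of labelled cases where the model step's output matches
    the expected answer exactly. Real suites often use fuzzier scoring."""
    passed = sum(1 for c in cases if model_step(c["input"]) == c["expected"])
    return passed / len(cases)

def daily_eval(cases, model_step, threshold=0.95, alert=print):
    rate = eval_pass_rate(cases, model_step)
    if rate < threshold:
        alert(f"Eval regression: pass rate {rate:.0%} is below {threshold:.0%}")
    return rate
```

Scheduled daily, this turns provider-side model drift from a customer complaint into an alert in the morning queue.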

What we install on engagements

The automation observability checklist:

  • Heartbeat: external watchdog with expected-cadence SLA
  • Output validation: deterministic step asserting expected output shape
  • Idempotency: keys on every external write
  • Alerting: on heartbeat miss, on validation failure, on platform error
  • Runbook: 1-page markdown checked into source control
  • Eval suite (AI-only): daily regression against labelled cases
  • Sample review (AI-only): human review of 1 to 5 percent of outputs
  • Owner: a named human responsible for the automation, with a backup

For a single new automation, this is half an engineer-day of additional work. For a portfolio of 20 to 50 automations, the upfront retrofit is one to two engineer-weeks. The first incident the observability prevents pays back the entire investment.

The teams that ship automation observability discover that 90 percent of incidents become non-events: the system catches the failure, alerts the owner, and the fix happens in the normal work day instead of in a customer-facing outage. The automation portfolio becomes an asset instead of a liability.

Frequently asked questions

What's the minimum observability for an automation?

Three things: a heartbeat that fires when the workflow runs, an output assertion that fires if the output is wrong-shaped, and an alert that pages the owner when either of the first two fail. Anything less is hoping.

How do I detect a workflow that quietly stopped running?

An expected-execution heartbeat. If the workflow runs daily, alert when 28 hours pass without a successful run. The alert detects both 'platform broken' and 'workflow accidentally disabled' failures.

What about automations that run continuously?

Same principle, different time scale. Alert on 'no successful run in 5 minutes' or whatever the SLA demands. The heartbeat is a watchdog, not a deep diagnostic.