Why AI Integration Fails at the Governance Layer
- Production AI rarely fails at the model. It fails at the governance layer beneath it.
- A prompt registry, an evaluation gate and a human approval path carry most of the operational weight.
- Swapping to a stronger model defers the failure for weeks, then returns it on a worse day.
- A six-question governance contract written before the first PR catches most of the gaps.
- Governance is the deliverable that lets the team change the model later without a regression panic.
If the client conversation opens with “the model keeps making things up” and closes with “so we need to upgrade to the latest one,” the model is not the failure. The governance layer underneath it is, and swapping the model will not save the integration the next time it breaks.
We have seen the same pattern repeat across AI integration engagements: a capable team ships a working prototype in two weeks, the stakeholder demo lands well, and six weeks later the whole thing is on fire. The output regressed, nobody can reproduce last week’s demo, and the client support team is manually reviewing every generation because there is no other control surface. The team then argues about whether to change the model, the prompt or the retrieval step, and the answer is usually “none of the above.” The failure is a governance failure.
The symptom looks like drift; the cause is an empty control room
Model behaviour shifts. Prompts age. Retrieval corpora grow stale. Any LLM system deployed in production will degrade without explicit controls; that part is expected and well-documented. What surprises teams is how little of that expected work is actually built into the integration. The prompt sits inlined in application code. The eval suite, if it exists, runs on someone’s laptop. Approval for a prompt change happens in Slack. No artifact of the decision survives the week.
The model will change twice before your next release. The governance layer is what makes that acceptable instead of terrifying.
When an AI system fails in production, the post-mortem almost never ends at the model. It ends at a missing artifact: a prompt version that was not recorded, an eval that was never run, a change that never went through approval. The governance layer is the set of systems that make those artifacts exist by default, not by discipline, not by memory, not by Slack threads.
The three systems that are almost always missing
In a well-governed AI integration, three systems carry most of the operational weight. When an integration is in trouble, one or more of them is usually absent or unmaintained.
| System | What it does | Failure if missing |
|---|---|---|
| Prompt registry | Single source of truth for production prompts. Versioned, reviewable, rollback-able independent of application releases. | Prompt edits become full deploys; rollback requires a code revert. |
| Evaluation gate | Automated test suite on every prompt, model or retrieval change with a recorded baseline. Blocks merge on regression. | Regressions surface in production, not in CI. |
| Approval path | Named human owner per change class with a traceable sign-off artifact. Not a committee, not a rota. | Accountability diffuses; nobody owns the next outage. |
A prompt registry
A prompt registry is the source of truth for every prompt running in production. It is versioned, it is reviewable, and it is the only place the application code pulls prompts from at runtime. The registry does not need to be a product. It can be a database table, a YAML file in a dedicated repository, or a small internal service. The minimum useful shape is something like this:
```yaml
# prompts/customer-summary.yaml
id: customer-summary
version: 7
owner: alex.mercer
model: claude-sonnet-4-6
temperature: 0.2
text: |
  You are summarising a customer support thread for the
  account manager. Output one paragraph, max 80 words,
  no opinions, no recommendations.
  Thread:
  {{ thread }}
eval:
  suite: prompts/customer-summary.eval.yaml
  passing_threshold: 0.92
```
What matters is that prompt changes are diffable, reviewable and rollback-able independent of application releases. The moment you hard-code a prompt in a Python service, you have promoted a text string into production code without any of the review ceremony you apply to code.
If rolling back a prompt change requires a code deploy, the prompt is not in a registry, it is hiding in your application.
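What "the application pulls prompts from the registry at runtime" looks like in practice can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the registry here is an in-memory dict shaped like the YAML entry above, and `render_prompt` is a hypothetical helper name. The useful property is that every call site gets back the version and model alongside the text, so logs can record exactly which prompt produced each generation.

```python
import re

# Hypothetical in-memory registry, shaped like the YAML entry above.
# In practice this would be parsed from the registry file, table or service.
REGISTRY = {
    ("customer-summary", 7): {
        "model": "claude-sonnet-4-6",
        "temperature": 0.2,
        "text": (
            "You are summarising a customer support thread for the\n"
            "account manager. Output one paragraph, max 80 words,\n"
            "no opinions, no recommendations.\n"
            "Thread:\n"
            "{{ thread }}"
        ),
    },
}

def render_prompt(prompt_id: str, version: int, **variables) -> dict:
    """Fetch a prompt by id and version, substitute {{ placeholders }},
    and return the rendered text with its metadata so the call site can
    log which prompt version produced each generation."""
    entry = REGISTRY[(prompt_id, version)]

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    text = re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, entry["text"])
    return {"id": prompt_id, "version": version,
            "model": entry["model"], "text": text}

rendered = render_prompt("customer-summary", 7,
                         thread="User: refund?\nAgent: issued same day.")
```

Failing loudly on a missing template variable is deliberate: a silently empty `{{ thread }}` is exactly the kind of defect that only surfaces in production.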
An evaluation gate
The second missing system is an evaluation gate: a suite of test cases that run every time a prompt, model or retrieval step changes, and either pass or fail the change against a recorded baseline. The eval does not need to be sophisticated. A few dozen well-chosen inputs with expected output characteristics will catch most regressions. What it needs to be is automated, blocking, and owned.
Most teams can write the eval suite in a week. Fewer can make it part of the merge contract. Fewer still can resist the pressure to bypass it when a business stakeholder wants a specific output by end of day. The eval gate only works when skipping it produces the same organisational friction as skipping a security review.
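The core of an eval gate is small enough to sketch. The shape below is illustrative, assuming cases that pair an input with cheap output checks (length caps, banned phrases) rather than exact-match strings; `run_eval_gate` and `fake_generate` are hypothetical names, and the real `generate` would call the model.

```python
def run_eval_gate(cases, generate, passing_threshold):
    """Run every eval case through `generate` and block the change if the
    pass rate falls below the recorded baseline threshold."""
    results = []
    for case in cases:
        output = generate(case["input"])
        results.append(all(check(output) for check in case["checks"]))
    pass_rate = sum(results) / len(results)
    return {"pass_rate": pass_rate, "passed": pass_rate >= passing_threshold}

# Illustrative cases: characteristics of a good output, not exact strings.
cases = [
    {"input": "long thread...", "checks": [lambda o: len(o.split()) <= 80]},
    {"input": "angry thread...", "checks": [lambda o: "I recommend" not in o]},
]

def fake_generate(thread: str) -> str:
    # Stand-in for the real model call.
    return "Customer asked for a refund; the agent issued it the same day."

report = run_eval_gate(cases, fake_generate, passing_threshold=0.92)
```

In CI, a falsy `report["passed"]` fails the job and therefore blocks the merge; that wiring, not the scoring logic, is where the organisational work lives.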
A human approval path
The third system is a named human approval path for every class of change. Who signs off on a new prompt version shipping to production? Who signs off on a model swap? Who signs off on a retrieval corpus update? If the answer depends on which engineer is online that day, you do not have a governance layer, you have a volunteer rota.
The approval path does not have to be heavyweight. On a well-run AI engagement, the approval artifact is often a single paragraph in a pull request, signed off by the named prompt owner. The critical property is that it exists, it is traceable, and the person signing off is accountable, not just present.
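The "named owner, not just any approver" property is mechanically checkable. A minimal sketch, assuming a change record extracted from a pull request and an owners map keyed by prompt id (both hypothetical shapes):

```python
# Hypothetical owners map, mirroring the `owner` field in the registry.
OWNERS = {"customer-summary": "alex.mercer"}

def approval_is_valid(change: dict) -> bool:
    """A change is approved only when the sign-off names the registered
    owner of the prompt it touches. Presence of *an* approval is not
    enough; it must come from the accountable person."""
    owner = OWNERS.get(change["prompt_id"])
    return owner is not None and change.get("signed_off_by") == owner

ok = approval_is_valid({"prompt_id": "customer-summary",
                        "signed_off_by": "alex.mercer"})
not_ok = approval_is_valid({"prompt_id": "customer-summary",
                            "signed_off_by": "some.reviewer"})
```

The same effect can often be achieved with repository tooling (for example, a CODEOWNERS-style rule on the registry path), but the check itself is the point: a review from the wrong person should fail, not merely warn.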
The minimum governance contract
When we scope a new AI integration at Orzed, we do not start with the model choice. We write down the governance contract first, because every other decision is downstream of it. A minimum governance contract answers six questions in writing, before any code:
- Who owns this prompt? One named engineer per prompt, with right of veto on changes.
- What is the eval suite for this prompt? A named file, a named set of cases, an expected pass rate.
- What triggers a human review? Prompt change, model change, retrieval corpus change, provider change, each with a named reviewer.
- What happens when the eval fails? A rollback path that does not require a code deploy, with an expected time to rollback.
- Where does the failure signal come from? User report, automated eval, production monitor, or all three. If it is only the first, the integration is flying blind.
- Who is on the phone when it breaks? An on-call rota with two named engineers. If it is critical enough to exist, it is critical enough to be paged on.
The point of writing it down is not to produce an artifact for the artifact’s sake, it is to force the team to notice which questions they cannot yet answer, before those questions show up in a post-mortem.
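The contract can even be linted. A minimal sketch, with hypothetical key names standing in for the six questions, that returns the questions a draft cannot yet answer:

```python
# One key per question in the governance contract (names are illustrative).
GOVERNANCE_QUESTIONS = [
    "prompt_owner",
    "eval_suite",
    "review_triggers",
    "eval_failure_rollback",
    "failure_signal_sources",
    "on_call",
]

def unanswered(contract: dict) -> list:
    """Return the governance questions the team cannot yet answer.
    An empty list means the contract is complete; anything else is a
    post-mortem waiting to happen."""
    return [q for q in GOVERNANCE_QUESTIONS if not contract.get(q)]

draft = {
    "prompt_owner": "alex.mercer",
    "eval_suite": "prompts/customer-summary.eval.yaml",
    "review_triggers": ["prompt", "model", "corpus", "provider"],
    # Rollback path, failure signals and on-call not yet decided.
}
gaps = unanswered(draft)
```

Running this against the draft above reports exactly the three gaps the team has been deferring, which is the whole value of writing the contract before the first PR.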
Inlined prompt vs registered prompt, side by side
What changes once the registry exists is the cost of every routine operation. The same change request lands very differently in a governed and an ungoverned system:
| Operation | Inlined prompt | Registered prompt |
|---|---|---|
| Prompt edit | Code change, PR, deploy, rollback risk | Config change, PR, eval, instant rollback |
| Model swap | Touched in N places, missed in one | One field, eval re-runs automatically |
| A/B trial | Two code branches in service code | Two registry rows, traffic split routed |
| Audit on incident | Read git history of service code | Read prompt history table |
Why the model swap is not the fix
When an AI integration is in trouble and the instinct is “swap to a stronger model,” the root cause has almost always already been masked. A stronger model will paper over a governance gap for a few weeks, sometimes a few months. Then the gap returns, usually on a day the team can least afford it.
We have seen this specifically in retrieval-heavy integrations. A frontier model compensates for a weak retrieval step by generating plausible-looking content regardless of evidence. The team declares victory. Six weeks later, an edge case breaks the plausibility budget and the output goes out under a client’s brand. At that point the retrieval gap, the missing eval and the undocumented prompt are all visible at once.
The model swap is real leverage in the right context: when the model is genuinely the bottleneck, when the eval proves it, and when the governance layer is in place to catch the next drift. Without those, it is a deferral.
Governance is the deliverable, not the overhead
The hardest part of selling the governance layer to a business stakeholder is that it looks like overhead on the Gantt chart. A prompt registry has no customer-facing UI. An eval suite has no marketing surface. An approval path ships no new features. So teams underinvest, stakeholders do not ask, and the work only becomes visible when it is absent, usually at 2am after a production incident.
Reframe it for the stakeholder: the governance layer is the part of the system that lets the team change the model, change the prompt, change the retrieval corpus, change anything, without a full regression run and a risk of public failure. That is not overhead. That is the ability to keep improving the product after it has shipped, which is the thing the stakeholder actually wanted when they asked for AI in the first place.
An AI integration without governance does not get faster to improve over time. It gets slower, because every change is a risk.
The short version
If an AI integration is breaking in production, the questions that actually diagnose it are not “which model should we upgrade to” or “should we rewrite the prompt.” The questions are: does a prompt registry exist, does an eval gate exist, does a human approval path exist, and is each one being used? When the answer to all four is yes, most integration failures are recoverable in hours. When the answer to any is no, every failure is a rebuild.
Most AI integrations do not need a better model. They need a governance layer that was supposed to be built alongside the first version, and was not.
Questions teams ask
Is a prompt registry overkill for a small AI integration?
No. The registry can be a YAML file in a dedicated repository or a single database table. The point is that a prompt change is diffable, reviewable and rollback-able independent of an application deploy. Even a one-prompt service benefits, because the second prompt always arrives sooner than the team expects.
How big should the evaluation suite be?
Start with twenty to forty cases that exercise production-shaped inputs. Bias toward edge cases the team is already nervous about. Grow the suite as you discover regressions, and add a regression case for every incident. The eval needs to be automated and blocking on the merge contract, not exhaustive.
Who should own the human approval path?
A named engineer per prompt, with right of veto. A committee diffuses accountability and slows turnaround. The owner does not have to be senior, but must be empowered to block a release if the eval or review surfaces an issue.
What if leadership wants the model upgrade today and the governance later?
Ship the smallest governance scaffold first, then the model upgrade. A prompt file under version control plus a five-case eval is enough to make the upgrade reversible. The work is one engineer-day. Skipping it converts every future change into a high-stakes deploy.
Does this apply to retrieval-augmented generation (RAG) the same way?
Yes, with one extra surface: the retrieval corpus itself. Treat corpus changes the same way you treat prompt changes: version, review, eval, approval. RAG fails the loudest when a corpus update silently shifts the answer distribution and no eval catches it.
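One lightweight way to make corpus changes versionable is a fingerprinted manifest. The sketch below is an assumption about shape, not a prescribed tool: the corpus is hashed into a short fingerprint recorded alongside a version and an eval suite, so an update is a visible diff in review rather than a silent shift in the answer distribution.

```python
import hashlib

def corpus_fingerprint(documents: list) -> str:
    """Hash the corpus contents so an update shows up as a reviewable
    diff in the manifest, not a silent change in behaviour."""
    digest = hashlib.sha256()
    for doc in sorted(documents):
        digest.update(doc.encode("utf-8"))
    return digest.hexdigest()[:12]

# Hypothetical manifest, versioned and reviewed like a prompt.
manifest = {
    "corpus_version": 14,
    "fingerprint": corpus_fingerprint(["doc a...", "doc b..."]),
    "eval_suite": "corpus/answer-distribution.eval.yaml",
}

def corpus_matches(manifest: dict, live_documents: list) -> bool:
    """Deploy-time check: refuse to serve if the live corpus differs
    from what was reviewed and evaluated."""
    return manifest["fingerprint"] == corpus_fingerprint(live_documents)
```

Sorting before hashing makes the fingerprint order-independent, so re-ingesting the same documents in a different order does not trigger a false mismatch.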