Prompt Registry: YAML File, Database Table, or Service?


Three working shapes for a production prompt registry, with the trade-offs that decide which one fits a team of three, thirty, or three hundred.

  • By Orzed Team
  • 7 min read
Key takeaways
  • A YAML-in-repo registry is honest and slow. Right for small teams that already use code review for everything.
  • A database table is the cheapest registry that supports A/B trials, traffic splits and runtime overrides.
  • A registry service is overkill until you have either many writers or hard audit requirements.
  • The wrong registry is not the one with fewer features. It is the one that nobody updates.

The first time we put a prompt under version control on a client engagement, the team had thirty-eight production prompts spread across four services. Twelve of them were duplicated, six were stale, and three referenced a model that had been deprecated by the provider eight months earlier. The team did not have a prompt registry; they had a prompt accident.

Choosing what shape the registry should take is the second decision an AI engineering team makes after they decide they need one. The three answers in the wild are a YAML file in a repository, a database table, or a small internal service. Each one has a clean argument for it. Each one has a failure mode that looks fine on day one and painful on day ninety.

The three shapes, fast

| Shape | Where it lives | Best for | Worst case |
|---|---|---|---|
| YAML file in repo | prompts/*.yaml in a dedicated repo or alongside service code | Teams under ten engineers, fewer than twenty prompts, code review already in flow | Cannot do runtime traffic splits without redeploy |
| Database table | A row per prompt version in Postgres or equivalent | Teams that need A/B trials, runtime selection, or many writers | Needs a small admin UI before non-engineers can read or edit |
| Internal service | A small HTTP service wrapping the table with auth, audit and approval flow | Teams over fifty engineers, regulated environments, multi-tenant | Three months of build time before it pays off |

Almost every team starts with the first, outgrows it before they admit they have, and is later forced into the second under deadline pressure. The third is a deliberate decision when audit or scale demands it.

YAML in a repository

A YAML registry is the most honest choice for a small team. The prompt is a text file. The version history is git log. The review is a pull request. The rollback is git revert followed by a deploy. Everything the team already knows how to do for code applies to prompts unchanged.

# prompts/customer-summary.yaml
id: customer-summary
version: 12
owner: alex.mercer
model: claude-sonnet-4-6
temperature: 0.2
text: |
  You are summarising a customer support thread for the
  account manager. Output one paragraph, max 80 words,
  no opinions, no recommendations.

  Thread:
  {{ thread }}
eval:
  suite: prompts/customer-summary.eval.yaml
  passing_threshold: 0.92

The application loads prompts at boot, holds them in memory, and re-loads on a SIGHUP or a deploy. This works well until two things happen.
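The boot-time loading pattern is a few dozen lines in most languages. A minimal sketch in Python, assuming PyYAML and the prompts/*.yaml layout shown above; every name here is illustrative, not a prescribed API:

```python
import glob
import os
import signal

import yaml  # PyYAML; assumed available

_PROMPTS = {}

def load_prompts(directory="prompts"):
    """Read every *.yaml file in `directory` into a dict keyed by prompt id."""
    registry = {}
    for path in glob.glob(os.path.join(directory, "*.yaml")):
        with open(path) as f:
            doc = yaml.safe_load(f)
        registry[doc["id"]] = doc
    return registry

def _reload(*_args):
    """Swap the whole cache in one assignment so readers never see a half-loaded set."""
    global _PROMPTS
    _PROMPTS = load_prompts()

def get_prompt(prompt_id):
    return _PROMPTS[prompt_id]

# Initial load at boot; refresh on SIGHUP (Unix only) without a redeploy.
_reload()
if hasattr(signal, "SIGHUP"):
    signal.signal(signal.SIGHUP, _reload)
```

The single-assignment swap in `_reload` matters: a request handler mid-flight keeps the old dict, so a reload never hands out a partially parsed registry.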

The first is A/B testing. To trial a new wording against fifteen percent of traffic with a YAML file you need either two prompt files plus a feature flag, or a runtime selector wrapping loadPrompt(). Both work. Both add code paths that someone has to maintain after the trial ends. After three or four trials in flight at the same time, the integration’s prompt-loading code gets unreadable and the team starts skipping trials to avoid the friction.
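The runtime selector itself can be as small as a hash bucket. A sketch, assuming the caller passes a stable unit such as a user id; the function and parameter names are illustrative:

```python
import hashlib

def select_prompt(stable_id, trial_id, split, unit):
    """Send `split` of traffic units to the trial prompt, the rest to the
    stable one. Hashing the unit (e.g. a user id) keeps the assignment
    deterministic, so one user sees one variant for the whole trial."""
    digest = hashlib.sha256(f"{stable_id}:{unit}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return trial_id if bucket < split * 10_000 else stable_id
```

The code is trivial; the maintenance burden the paragraph above describes comes from the call sites that accumulate around it, one per trial, and never get deleted.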

The second is non-engineer editing. The first time a product manager asks to “tweak the wording on the summary prompt,” every team that runs a YAML registry has to invent an answer. The honest answer is “open a PR and tag me.” That answer works once. By the fifth time, somebody on the team is acting as a permanent prompt-editor, which is a job nobody applied for.

For a team of three to ten engineers with fewer than twenty prompts, none of this matters yet, and the YAML registry buys an enormous amount of clarity for almost no engineering cost. We use this shape on roughly half of the engagements where the prompt count starts low.

A database table

The database registry is one row per prompt version. The schema is small:

CREATE TABLE prompts (
  id            text NOT NULL,
  version       integer NOT NULL,
  text          text NOT NULL,
  model         text NOT NULL,
  temperature   numeric(3,2) NOT NULL DEFAULT 0.20,
  owner         text NOT NULL,
  status        text NOT NULL DEFAULT 'draft',
  traffic_split numeric(3,2) NOT NULL DEFAULT 1.00,
  created_at    timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (id, version)
);

The application queries the table at request time (with a small in-process cache) or subscribes to changes via the database’s replication stream. A new prompt version is an INSERT. A rollback is a status = 'archived' UPDATE. A traffic split is a traffic_split = 0.15 UPDATE. None of those require a deploy.
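Runtime selection against that table can be sketched in a few lines. Here sqlite stands in for Postgres and the schema is trimmed to the columns the selector needs; the weighting logic is the point, not the exact query:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prompts (
      id text, version integer, text text, status text,
      traffic_split real, PRIMARY KEY (id, version))""")
conn.executemany(
    "INSERT INTO prompts VALUES (?, ?, ?, ?, ?)",
    [("customer-summary", 12, "stable wording", "active", 0.85),
     ("customer-summary", 13, "trial wording", "active", 0.15)])

def active_prompt(conn, prompt_id, roll=None):
    """Pick among active versions, weighted by traffic_split."""
    rows = conn.execute(
        "SELECT version, text, traffic_split FROM prompts "
        "WHERE id = ? AND status = 'active' ORDER BY version",
        (prompt_id,)).fetchall()
    roll = random.random() if roll is None else roll
    cumulative = 0.0
    for version, text, split in rows:
        cumulative += split
        if roll < cumulative:
            return version, text
    return rows[-1][:2]  # guard against splits not summing to exactly 1.0
```

Changing the split is the UPDATE described above; the selector picks it up on the next cache refresh with no deploy in the loop.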

The cost is two pieces of operational scaffolding. The first is migrations: prompts under version control by file are diffable; prompts in a table need a migration discipline so a DROP TABLE prompts does not happen accidentally. The second is access control: any engineer who can write to production Postgres can now change a prompt, which means production Postgres write access becomes a higher-trust role than it was.

We move teams to this shape when one of three triggers fires: the team starts running A/B tests on prompts more often than weekly, the prompt count crosses thirty, or someone outside engineering starts asking to edit prompts.

An internal service

A prompt registry as a service is a small HTTP API in front of the table, with auth, audit logging and a state-machine approval flow. The application calls GET /prompts/customer-summary?env=production and gets the active version with caching headers. Writes go through POST /prompts and pass through approval, eval and rollout.

The argument for the service shape is not technical; it is organisational. Once the team is large enough that several people are writing prompts in parallel, the database-row approach struggles with three things: review (no PR-style diff), audit (who-changed-what is a query, not an artifact) and approval (a row with status = pending does not enforce that a named reviewer signed off).

A service shape solves all three by making the workflow explicit. The cost is real engineering: an MVP service is two to four engineer-weeks, plus ongoing maintenance. Hosted products in this category exist (Langfuse, Helicone, Pezzo and a handful of in-house systems at frontier labs); they cost a SaaS subscription instead of build time. Either path is the right one above a certain scale and the wrong one before it.
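The approval flow the service enforces is a small state machine. A sketch, with state names mirroring the table's status column; the transition map is illustrative, not a product spec:

```python
# Legal transitions for a prompt version's lifecycle.
TRANSITIONS = {
    "draft":    {"pending"},
    "pending":  {"approved", "draft"},   # reviewer signs off or sends back
    "approved": {"active"},              # rollout promotes the version
    "active":   {"archived"},            # rollback or retirement
    "archived": set(),
}

def advance(status, new_status, approver=None):
    """Refuse illegal transitions, and refuse approval without a named reviewer."""
    if new_status not in TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition: {status} -> {new_status}")
    if new_status == "approved" and not approver:
        raise ValueError("approval requires a named reviewer")
    return new_status
```

The named-reviewer check is the part a bare database row cannot enforce; wrapping the table in a service is largely about putting checks like this one on the write path.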

We bring this shape into engagements roughly when the team passes fifty engineers, when the integration is in a regulated context (healthcare, financial services, anything under the EU AI Act high-risk category) or when the prompt count crosses one hundred.

How to pick

The honest test for a team that does not yet know which shape they need:

  1. Count the prompts that are in production today. Under twenty, start with YAML.
  2. Count the writers. If only engineers write prompts, the YAML or table shape both work. If non-engineers write, you need at least the table with a small admin UI, or the service.
  3. Count the trials. If you run more than one A/B trial at a time on prompts, you need the table or the service. The YAML shape will pretend to support trials and quietly become unmaintainable.
  4. Check the audit posture. If the integration is high-risk under regulation, the audit log requirement pushes you to the service shape from day one.

Most teams pick by feel. The feel is usually wrong, because the team optimises for “what is the simplest thing today” without asking “what does this look like in six months.” The above test answers the six-month question with three queries against the codebase and one conversation with the compliance team.
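The four-step test collapses to a few lines of code. The thresholds below are the ones from the list above, wired in as defaults to argue with, not laws:

```python
def pick_registry_shape(prompt_count, non_engineer_writers, concurrent_trials, regulated):
    """Encode the four-step test: audit posture first, then writers and
    trials, then prompt count. Returns the cheapest shape that fits."""
    if regulated:
        return "service"   # hard audit requirements push to the service from day one
    if non_engineer_writers or concurrent_trials > 1 or prompt_count >= 20:
        return "table"     # runtime selection and non-engineer edits need rows, not files
    return "yaml"
```

Running this mentally against your own numbers is the three-queries-plus-one-conversation exercise the paragraph above describes.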

What none of the three shapes solve

A registry is a storage and review layer. It does not, by itself, do any of the following:

  • Run evals. That belongs to a separate eval gate that the registry calls out to. A registry that pretends to do its own eval logic ends up half-baked at both jobs.
  • Make a bad prompt good. The registry will happily store a terrible prompt and version it forever. The eval gate is the thing that protects the integration from a bad prompt shipping. The registry only makes the bad prompt easier to roll back when the eval misses it.
  • Replace approval discipline. A registry with an approval state machine still needs a human to act on it. We have seen registries where every prompt is in pending for weeks because the named approver moved teams and nobody noticed.

The registry is one of three systems in the governance layer (the others are the eval gate and the human approval path). It carries about a third of the operational weight. The other two thirds belong to the systems around it.

What we install on engagements

For a team of five to fifteen, we ship a YAML registry in a dedicated repository, with a small loader in each consuming service and a CI step that runs the eval suite on every PR that touches prompts/. Total install time is roughly half an engineer-week, including the first three prompt migrations.

For a team that already does A/B testing or has more than one product manager touching prompts, we ship a table in the existing application database, an admin page rendered server-side (no SPA), and the same CI eval gate sitting on every change. Total install time is roughly two engineer-weeks.

For everything above that, we recommend choosing a hosted product unless the audit or data-residency posture rules it out. The cost of building and maintaining a registry service is genuinely high, and the marginal value over a hosted product becomes thin around the time the team is willing to pay for one.

The registry shape will change at least once over the lifetime of the integration. Plan for that. The teams that get this right keep the prompt schema (id, version, text, model, owner) consistent across shapes, so a migration from YAML to table to service is a transport change, not a data model change. The teams that get this wrong invent a new schema each time and pay for the migration twice.
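Keeping the schema invariant across shapes can be as simple as one record type that every transport serialises to and from. A minimal sketch, assuming Python dataclasses; the helper names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptRecord:
    """The invariant core: id, version, text, model, owner. YAML files,
    table rows and service payloads all map to this one shape."""
    id: str
    version: int
    text: str
    model: str
    owner: str

def to_row(p):
    """Column order matches the table's (id, version, text, model, owner)."""
    return (p.id, p.version, p.text, p.model, p.owner)

def from_yaml_doc(doc):
    """Accepts a parsed YAML mapping like the one shown earlier; extra keys
    (temperature, eval) ride alongside rather than in the core record."""
    return PromptRecord(doc["id"], doc["version"], doc["text"], doc["model"], doc["owner"])
```

With this in place, moving from YAML to table is writing `to_row` into an INSERT, and moving to a service is serialising the same record as JSON: transport changes, data model does not.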

Frequently asked

Questions teams ask

Can a YAML registry handle A/B testing?

Yes, but the overhead is high. You need either two PRs and a feature flag, or a runtime selector that reads multiple files. Both work; both are slower than a database row with a traffic_split field. For more than one or two A/B tests at a time, switch to a table.

Does a registry slow down prompt iteration?

It speeds it up after the first three prompts. The slowness on prompt one feels like overhead. By prompt ten, the team is editing rows or files in seconds and rolling back without a deploy, which the prior pattern could not do.

Is there a hosted prompt registry product worth using?

Several exist (Langfuse, Helicone, Pezzo, internal tools at OpenAI and Anthropic). Use one if it integrates with your eval suite and your access control. Avoid one if it forces a SaaS contract before you have ten prompts. The shape matters more than the brand.