Evaluating LLMs Without a Research Team
- An eval suite of thirty cases catches more than a suite of three hundred that nobody runs.
- Three assertion types cover most needs: exact match, structural match, and rubric-graded match.
- The eval gate must block merge. An eval that produces a Slack notification is a Slack notification, not a gate.
- Add a regression case for every incident. The suite becomes a living artifact of every prior failure.
Two of the most useful systems in production AI are also the two that engineering teams skip the longest. The prompt registry is one. The evaluation gate is the other. Both get skipped because the visible cost is real and the cost of skipping is invisible until an incident makes it visible all at once.
This piece is about the eval gate specifically: the version a small engineering team can ship in a week without hiring a research engineer or buying a SaaS subscription, and that catches the regressions that actually reach customers.
What an eval gate is, in one paragraph
An eval gate is a test suite that runs every time anything changes about how the LLM produces output. It compares the new output against a recorded baseline using a small set of assertions. If the assertions fail above a tolerance, the gate blocks the change from merging. The shape is identical to a unit test suite. The hard part is the assertions, because LLM outputs do not have a single correct answer the way add(2, 2) does.
The three assertion types that cover most cases
The first assertion is exact or near-exact match. The output is either a fixed string, a fixed list, or a value within a tolerance. This works for any prompt where the model’s job is to extract something specific from input: a date, a category, a yes/no decision, a list of names. About thirty percent of production prompts in our experience can be evaluated entirely with exact-match assertions, and those are the cheapest and most reliable cases to gate on.
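As a sketch of what these assertions look like in code (the function names are illustrative, not from a particular library):

```python
# Exact and near-exact assertions: the cheapest, most reliable gate checks.
def assert_exact(expected: str, output: str) -> bool:
    # Normalise surrounding whitespace and case so trivial formatting
    # differences do not fail the case.
    return expected.strip().lower() == output.strip().lower()

def assert_near(expected: float, output: str, tolerance: float = 0.01) -> bool:
    # Near-exact: parse the output as a number and compare within tolerance.
    try:
        return abs(float(output.strip()) - expected) <= tolerance
    except ValueError:
        return False
```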
The second is structural match. The output must conform to a schema (JSON, XML, a fixed line format) and the schema’s fields must satisfy bounds (length, type, allowed values). Most JSON-output prompts can be eval-gated on schema alone, plus one or two field-level assertions. Roughly forty percent of production prompts fall here.
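A minimal structural assertion, sketched here with hand-rolled field checks rather than a schema library (the field-spec format is illustrative):

```python
import json

# Structural assertion: valid JSON, required fields present, and simple
# per-field bounds (type, allowed values, maximum length).
def assert_structure(output: str, required: dict) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    for field, spec in required.items():
        if field not in data:
            return False
        value = data[field]
        if not isinstance(value, spec["type"]):
            return False
        if "allowed" in spec and value not in spec["allowed"]:
            return False
        if "max_len" in spec and len(value) > spec["max_len"]:
            return False
    return True
```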
The third is rubric-graded match. The output is graded by another model (or a deterministic scorer) against a written rubric. This is the expensive case: each test costs a model call, results have noise, and the rubric itself needs to be eval-tested. Use it only where the cheaper assertions cannot reach. The remaining thirty percent of prompts (open-ended summaries, creative responses, multi-turn outputs) need rubric grading.
| Assertion type | Cost per case | Reliability | Use when |
|---|---|---|---|
| Exact / near-exact | Microseconds | Very high | Output is a specific value or list |
| Structural | Milliseconds | High | Output is structured (JSON, table, tagged text) |
| Rubric-graded | Cents per case | Medium (noisy) | Output is open-ended and quality is multi-dimensional |
The rough mix to aim for in a healthy eval suite is sixty percent exact, thirty percent structural and ten percent rubric. Suites that lean entirely on rubric grading run slowly, cost real money, and produce results that look like opinion. Suites that lean entirely on exact-match are brittle to harmless rewordings and produce false positives that erode the team’s trust in the gate.
A working gate in one engineer-week
The minimum viable eval gate is four files and a CI step.
The first file is the test cases. A YAML or JSON file with around thirty cases per prompt, each with input, expected output (or expected properties), and the assertion type. The cases should be drawn from real production traffic, not from imagination. Pull a random sample of one hundred recent inputs, hand-label thirty, throw the rest away. The thirty hand-labelled cases will catch more regressions than three hundred synthetic cases.
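A case file might look like this (the prompt name and field values are hypothetical; the type, expected and required_fields keys match what the runner consumes):

```yaml
# eval/cases/ticket_triage.yaml (illustrative)
- id: triage-001
  type: exact
  input: {ticket: "I was charged twice for March"}
  expected: billing
- id: triage-002
  type: structural
  input: {ticket: "App crashes on login since the update"}
  required_fields: [category, summary, urgency]
```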
The second file is the runner. A small Python or TypeScript script that loads the prompt, loads the test cases, calls the model for each case, runs the assertions and produces a pass/fail report. A first version is under a hundred lines.
    # eval/run.py (sketch)
    import json
    import sys

    import yaml
    from anthropic import Anthropic

    client = Anthropic()

    def run_prompt(prompt_text, input_data):
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=400,
            messages=[{"role": "user", "content": prompt_text.format(**input_data)}],
        )
        return msg.content[0].text

    def evaluate(case, output):
        if case["type"] == "exact":
            return output.strip() == case["expected"].strip()
        if case["type"] == "structural":
            try:
                data = json.loads(output)
            except json.JSONDecodeError:
                return False
            return all(k in data for k in case["required_fields"])
        return None  # rubric grading routed elsewhere

    def main(prompt_path, cases_path):
        with open(prompt_path) as f:
            prompt = yaml.safe_load(f)
        with open(cases_path) as f:
            cases = yaml.safe_load(f)
        results = []
        for case in cases:
            output = run_prompt(prompt["text"], case["input"])
            results.append({"case": case["id"], "passed": evaluate(case, output)})
        # Rubric cases return None above; exclude them so the pass rate
        # reflects only the assertions this runner actually graded.
        graded = [r for r in results if r["passed"] is not None]
        pass_rate = sum(r["passed"] for r in graded) / len(graded)
        print(f"pass_rate={pass_rate:.3f} threshold={prompt['eval']['passing_threshold']}")
        sys.exit(0 if pass_rate >= prompt["eval"]["passing_threshold"] else 1)

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])
The third file is the prompt’s expected pass rate. A field in the prompt registry that says “this prompt is allowed to ship if the eval pass rate is at least 0.92.” This number is calibrated on the day the prompt is first written and revisited only when the prompt is intentionally rewritten.
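In the registry file, that might look like the following sketch (the field names match what the runner reads; the prompt text is illustrative):

```yaml
# prompts/ticket_triage.yaml (illustrative)
text: |
  Classify the following support ticket into exactly one category.
  Ticket: {ticket}
eval:
  passing_threshold: 0.92
```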
The fourth piece is the CI step. A workflow that runs python eval/run.py prompts/{name}.yaml eval/cases/{name}.yaml on every PR that touches prompts/, and fails the build if the pass rate drops below threshold.
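As a sketch, assuming GitHub Actions (any CI that fails the build on a non-zero exit code works the same way; file and prompt names are illustrative):

```yaml
# .github/workflows/eval-gate.yml (illustrative)
name: eval-gate
on:
  pull_request:
    paths: ["prompts/**", "eval/**"]
  schedule:
    - cron: "0 3 * * *"   # nightly run against main catches silent model drift
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pyyaml anthropic
      - run: python eval/run.py prompts/ticket_triage.yaml eval/cases/ticket_triage.yaml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```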
That is the entire gate. The first one we ship on an engagement takes two to three days. The second takes a day. The pattern stops being interesting around prompt five.
What goes wrong when the gate is missing
We tracked the postmortem causes across twelve AI integration incidents we triaged in 2025 and 2026. The breakdown:
| Root cause | Share | Would the eval gate have caught it? |
|---|---|---|
| Prompt edit without test | 42% | Yes |
| Model upgrade behaviour shift | 25% | Yes (with nightly run) |
| Retrieval corpus drift | 17% | Partially (needs RAG-specific eval) |
| Provider-side silent model change | 8% | Yes (with nightly run) |
| Genuine model bug | 8% | No (escalation to provider) |
Roughly seventy-five percent of the incidents would have been caught at PR time or by the nightly run. The remaining twenty-five percent split between RAG-specific issues (which need a richer eval setup we cover in a separate piece) and genuine provider bugs that no in-team eval can catch.
The asymmetry is worth noting. The eval gate’s failures are visible: a CI red, an annoying re-run, a request to add a missing case. The eval gate’s successes are invisible: an incident that did not happen, a customer who did not see a regression. Teams that judge the gate by its visible cost will conclude it is not worth the work. Teams that track the seventy-five percent figure will conclude it is the cheapest insurance policy in the integration.
Where teams over-engineer
Three patterns we see often in teams trying to do this right and getting tangled in the attempt:
Trying to eval everything at LLM-judge level. This is the most common over-engineering pattern. Every case gets graded by another model on a one-to-five rubric. The pass rate becomes a noisy floating point number. The team spends weeks calibrating the rubric and the meta-model never quite settles. Reserve LLM grading for the small fraction of cases where it is actually needed.
Building a custom dashboard before the gate works. A weekly stakeholder report with green and red blocks looks impressive. It also requires three weeks of frontend work that adds zero new safety. Ship the CI red as the entire dashboard for the first two months. Add visualisation when the team can no longer remember which prompt failed which test in which run.
Drowning the gate in cases. A suite of one thousand cases sounds rigorous. It runs in twenty minutes, costs ten dollars per CI run, and the team starts skipping it on small PRs. The result is worse than a suite of fifty cases that runs in ninety seconds and runs on every PR. Aim for a suite that finishes in under three minutes.
The single rule we install on every engagement
Add a regression case to the eval suite for every incident. The case should reproduce the failure that reached production. The pass rate threshold should not be lowered to accommodate it; the prompt should change until the new case passes alongside the old ones.
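In the case file, a regression case is just another entry, with the incident reference kept in the id or a comment (everything here is hypothetical):

```yaml
# Added after a production incident; the id links back to the postmortem.
- id: incident-2025-11-refund-misroute
  type: exact
  input: {ticket: "Cancel my subscription and refund the last charge"}
  expected: billing
```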
The eval suite then becomes a living artifact of every prior failure. After a year of operation, the suite is the team’s most valuable piece of institutional memory: every shape of mistake the integration has ever made, encoded as a test that runs in CI on every change. We have seen this artifact survive three model migrations and two engineering team rotations. The prompt code in the application changed; the eval suite did not.
That is the long-term argument for the eval gate. It is not the regression you catch this week. It is the regression you would have rediscovered two years from now if the case were not in the suite.
Questions teams ask
Do we need a hosted eval product?
No. A YAML test suite plus a small Python or TypeScript runner is enough for the first hundred cases. Hosted products (Braintrust, LangSmith, Promptfoo) become valuable when you need shared dashboards, historical trends and scoring across teams. They are not a prerequisite for shipping the gate.
How do we score open-ended outputs?
Three options stack: keyword presence (cheap, brittle), structural validation (JSON schema, length bounds), and rubric grading by another LLM (expensive, less brittle). Use the cheapest assertion that catches the regression class you care about. Reserve LLM grading for the cases where the other two cannot.
How often should the eval run?
Every PR that touches a prompt, every model upgrade, and on a nightly schedule against the main branch. Nightly catches silent provider-side model drift inside a pinned version, which is a real failure mode.