Orzed QA Agent: evidence based validation
- Composite architecture: Orzed Pulse for semantic review, paired with deterministic test runner, security classifier and static analyzer.
- Evidence pack is structured: deterministic test results, static and security findings, semantic review, weighted recommendation.
- Pulse fine tuned on roughly 14,000 reviewed code cases, 3,200 security audits and 6,500 structural QA records.
- Pilot measured roughly 3.4x throughput against the human only baseline, with about 18% fewer false positive flags.
- Critical regression catch rate in the 92 to 95 percent band, comparable with senior human reviewers; subtle race conditions and domain rule violations still escalate to human.
The QA Agent is the most visible automated layer in the Orzed delivery pipeline today. Every Execution lane output (code, content, architecture decision, documentation) flows through it before it reaches the senior human reviewer. The agent does not replace the reviewer; it produces a pre filtered queue with an evidence pack already attached, so the human can confirm or overrule a verdict that has its supporting evidence assembled.
This is the technical write up. The user facing changelog announcement is in Console Changelog (“Automated QA Agents are now live”); this piece is the deeper view, focused on the architecture, training, evidence format and limits.
Architecture
The QA Agent is a composite. Three signals run independently and are combined by a small weighted classifier into a single verdict.
Signal one: deterministic test runner. For code deliverables, the agent triggers the project’s existing test suite (the same machinery the engagement’s CI uses): a fast subset on the first pass, and the full suite on a second pass when the first pass soft fails. The result is structured per test (pass, fail, error, skipped) and aggregated to suite level. For non code deliverables this signal is absent and the weighting shifts.
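The suite level rollup can be sketched as follows. The function name and status vocabulary here are illustrative, not the production interface:

```python
from collections import Counter

# Per-test statuses as the write-up describes them.
STATUSES = {"pass", "fail", "error", "skipped"}

def aggregate_suite(results: dict) -> dict:
    """Roll per-test results up to a suite-level status.

    `results` maps test id -> one of STATUSES. Any fail or error
    fails the whole suite, matching the gating behaviour above.
    """
    counts = Counter(results.values())
    unknown = set(counts) - STATUSES
    if unknown:
        raise ValueError(f"unrecognised statuses: {unknown}")
    suite_status = "fail" if counts["fail"] or counts["error"] else "pass"
    return {
        "suite_status": suite_status,
        "passed": counts["pass"],
        "failed": counts["fail"],
        "errored": counts["error"],
        "skipped": counts["skipped"],
    }
```

The aggregated shape mirrors the `deterministic_tests` section of the evidence pack, so the rollup can be stored directly.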
Signal two: composite security and static analysis. Two open source model families do the heavy lifting. A SecureBERT class vulnerability classifier reads diffs and flags known vulnerability patterns: auth bypass shapes, injection sinks (SQL, command, template), unsafe deserialisation, weak crypto primitives, hardcoded secrets, missing input validation on a typed boundary. A CodeT5+ class static analyzer produces structural findings: complexity hotspots, dead code paths, broken control flow, type contract drift, missing error handling on operations that can fail. Both run on the changed surface only, not the whole repo, which keeps inference predictable and the latency budget reasonable.
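A minimal sketch of isolating the changed surface from a unified diff. The real pipeline presumably works at hunk granularity; `changed_files` is a hypothetical helper, shown only to make the "changed surface only" idea concrete:

```python
import re

def changed_files(diff_text: str) -> list:
    """Extract the changed file paths from a unified diff.

    Unified diff headers look like: +++ b/path/to/file.py
    Illustrative only: production analyzers would consume richer
    hunk-level structure, but the principle, analyse only what
    changed, is the same.
    """
    paths = re.findall(r"^\+\+\+ b/(\S+)", diff_text, flags=re.MULTILINE)
    return sorted(set(paths))
```

Running the classifiers over only these paths is what keeps inference predictable regardless of repository size.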
Signal three: semantic review on Orzed Pulse. A fine tuned Pulse instance reads the deliverable in context: the brief excerpt that motivated the work, the Planning Recommendation block this deliverable belongs to, the diff or content itself, and the relevant surrounding code or document tree. It produces a structured semantic review covering intent match (does this deliver what the block asked for), edge case coverage (are obvious edge cases handled), documentation completeness (is the deliverable explained well enough for the next reader), and naming and clarity (do the names match the codebase’s existing patterns).
The three signals are combined by a small weighted classifier trained on the historical decisions of senior reviewers. The weights are not symmetric. Security flags from signal two carry roughly twice the weight of a semantic concern from signal three because the cost asymmetry in production is well understood; a missed security issue costs more than a missed naming concern. Test failures from signal one are gating; if deterministic tests fail, the verdict is hard fail regardless of the other signals.
The verdict is one of three: pass (clear), soft fail with concern (queued to human review with the concern flagged), or hard fail (sent back to Execution with the failing evidence cited).
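A toy version of the gating and weighting logic. The weights, thresholds and function name are invented for illustration; the production classifier is trained on reviewer decisions, not hand set:

```python
def combine_verdict(tests, security_findings, semantic_concerns,
                    w_security=0.30, w_semantic=0.15,
                    pass_threshold=0.8, soft_threshold=0.5):
    """Combine the three signals into (verdict, score).

    Deterministic test failures gate to hard_fail regardless of the
    other signals. Security findings carry roughly twice the weight
    of semantic concerns. `tests` is None for non code deliverables.
    """
    if tests is not None and tests["suite_status"] != "pass":
        return "hard_fail", 0.0
    score = 1.0
    score -= w_security * len(security_findings)
    score -= w_semantic * len(semantic_concerns)
    score = max(score, 0.0)
    if score >= pass_threshold:
        return "pass", score
    if score >= soft_threshold:
        return "soft_fail", score
    return "hard_fail", score
```

For example, a passing suite with one security finding and one semantic concern lands at 0.55 under these toy weights, a soft fail queued to human review.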
Training
Orzed Pulse, the model behind signal three, was fine tuned on three datasets we have been accumulating since the Console went live.
| Dataset | Size | Source |
|---|---|---|
| Reviewed code cases | ~14,000 | Senior reviewer accept and reject decisions on Execution outputs across 18 months of engagements |
| Security audit records | ~3,200 | External and internal audit reports with the original artifact, the finding, the verdict and the resolution |
| Structural QA records | ~6,500 | Documentation, content and architecture review verdicts with the artifact, the concern and the disposition |
The pipeline was alignment trained with a reinforcement loop on senior reviewer decisions. Each reviewer disagreement with the agent during shadow mode contributed a labelled correction. The loop is not a one time pass; we rerun it on a rolling cadence as new engagement data accumulates and as the platform’s review patterns evolve.
The composite weighting classifier was trained separately. The training input was the historical agent signal triple (test result, static and security findings, semantic review) for past deliverables paired with the senior reviewer’s verdict. The model is small (a few hundred parameters), trained on a few thousand triples, and is straightforward to retrain when the signal mix changes.
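As a sketch of what such a small trained combiner looks like, here is a toy logistic model in plain Python. The feature triple encoding, hyperparameters and function names are assumptions; the real classifier is trained on richer signal encodings:

```python
import math
import random

def train_logistic(triples, labels, lr=0.1, epochs=1000, seed=0):
    """Fit a tiny logistic model mapping a signal triple
    (tests_passed, n_security_findings, n_semantic_concerns) to the
    probability the senior reviewer accepts, via SGD on log loss.
    Toy stand-in for the few-hundred-parameter classifier."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(3)]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(triples, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log loss w.r.t. z
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def predict(w, b, x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

A model this small is cheap to retrain whenever the signal mix changes, which is the point of keeping the combiner separate from the signals themselves.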
We held out four engagements as a clean evaluation set during pilot. None of their data appears in training; the numbers below were measured against that set and against shadow mode runs on production traffic.
Evidence pack format
Every QA verdict carries an evidence pack. The pack is structured and stored alongside the deliverable in the Console.
```json
{
  "deliverable_id": "...",
  "verdict": "pass | soft_fail | hard_fail",
  "weighted_score": 0.87,
  "signals": {
    "deterministic_tests": {
      "suite_status": "pass",
      "passed": 128,
      "failed": 0,
      "errored": 0,
      "skipped": 4,
      "tail_log": "..."
    },
    "security_static": {
      "vulnerability_findings": [],
      "structural_findings": [
        {"category": "complexity", "severity": "low", "location": "...", "note": "..."}
      ]
    },
    "semantic_review": {
      "intent_match": "high",
      "edge_case_coverage": "medium",
      "documentation": "high",
      "naming_clarity": "high",
      "concerns": ["..."]
    }
  },
  "human_review_required": false,
  "explanation": "..."
}
```
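A pack in this shape is easy to machine check. The sketch below validates the top level fields and the test failure gating described earlier; field names follow the example above, and anything deeper in the schema is an assumption:

```python
import json

REQUIRED_TOP = {"deliverable_id", "verdict", "weighted_score",
                "signals", "human_review_required", "explanation"}
VERDICTS = {"pass", "soft_fail", "hard_fail"}

def validate_pack(raw: str) -> list:
    """Return a list of structural problems with an evidence pack.
    An empty list means the pack looks well formed."""
    try:
        pack = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_TOP - set(pack))]
    if pack.get("verdict") not in VERDICTS:
        problems.append(f"bad verdict: {pack.get('verdict')!r}")
    tests = pack.get("signals", {}).get("deterministic_tests")
    # Gating consistency: failing deterministic tests must never
    # pair with a pass verdict.
    if tests and tests.get("suite_status") != "pass" and pack.get("verdict") == "pass":
        problems.append("deterministic tests failed but verdict is pass")
    return problems
```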
The pack is the source of truth for any later disagreement. We deliberately surface the evidence rather than only the verdict, because evidence is debatable and a verdict is not. A customer disputing a deliverable disputes specific evidence; a senior reviewer overruling the agent overrules with reference to specific evidence. Conversation moves onto firmer ground.
Performance bands
Numbers are drawn from the pilot (47 engagements, 1,200 plus deliverables, 6 weeks: 3 weeks shadow, 3 weeks active).
Throughput. Time from deliverable submitted to deliverable accepted by a senior reviewer dropped by a factor in the range of 3.2x to 3.6x, with the median around 3.4x. Smaller deliverables saw the largest improvement; very large multi file refactors saw less because the human review time itself dominates there.
False positive rate. Concerns the agent surfaced that the senior reviewer dismissed without action came down by roughly 18 percent compared with our previous purely heuristic linter chain. The remaining false positives cluster in stylistic concerns where reasonable engineers disagree and concerns where the agent did not have enough surrounding context.
Critical regression catch rate. On a labelled set of 200 historical deliverables containing known regressions, the agent flagged between 92 and 95 percent of the regressions. Senior human reviewers on the same set caught a comparable share. The agent does not catch fewer real regressions than a human on this set; it catches them faster and at queue scale.
Latency. End to end QA Agent processing on a typical code deliverable runs in 1 to 3 seconds. The breakdown: deterministic test runner is a few hundred milliseconds for fast subsets and longer for full suites; security and static analysis runs in a few hundred milliseconds combined; the Pulse semantic review runs under one second.
Limits
Three classes of work the agent does not handle and is not intended to handle.
Domain specific compliance. Financial regulation review, sector specific privacy controls and any artifact requiring a domain compliance specialist still routes to that specialist. The agent surfaces structural concerns; it does not opine on compliance.
Subjective UX and product judgement. Visual design, content tone fit, copy voice. The agent will note deviation from a documented style guide; it will not have an opinion on whether the style guide is right.
First of a kind architecture decisions. Anything where the engagement has no precedent in our memory or the customer’s prior work. The senior reviewer owns these calls fully.
We will not move these limits without measuring the agent’s performance in the candidate area first.
What is next
Q2 brings specialised QA variants built on Orzed Meridian. Three lines are planned.
Security focused variant. Deeper static analysis, threat modelling integration, longer context window for cross file vulnerability patterns. Built for engagements where security is a primary constraint; opt in at the engagement level.
Performance focused variant. Benchmark deltas on every code deliverable, regression detection against a baseline, hot path identification. Useful for engagements where performance is a stated requirement.
Accessibility focused variant. Front end specific, runs accessibility audits on rendered output, flags WCAG violations and keyboard trap patterns. Built for the engagements where accessibility is a stated deliverable.
Each variant will have its own evidence drawer in the Console, with the same structured pack format the base QA Agent uses today.
Specifications
| Attribute | Value |
|---|---|
| Underlying model | Orzed Pulse (semantic review), composite with SecureBERT class and CodeT5+ class models |
| Training data | ~14,000 reviewed code cases, ~3,200 security audits, ~6,500 structural QA records |
| Median latency | 1 to 3 seconds end to end |
| Output | Structured evidence pack with weighted verdict (pass, soft fail, hard fail) |
| Console surface | QA Evidence drawer on every deliverable, QA Verdict badge on every Approval Checkpoint |
| Binding status | Exploratory; senior human review owns binding pass or fail on architecture, security, release |
| Roadmap | Q2 specialised variants on Orzed Meridian (security, performance, accessibility) |
This concludes the current Orzed Models and Agents index. New entries land here as new agents and model variants ship; the Console Changelog covers the rollouts themselves with the operational detail.
Questions teams ask
Why composite, not a single model?
Different signals catch different failure classes. Deterministic tests catch what is deterministically wrong. Static analysis catches structural issues a model can miss. Security classifiers catch known vulnerability shapes. The semantic LLM pass catches the cases where the code is technically correct but does not solve the intended problem. No single signal carries all four; the composite weighting is what makes the agent useful.
Where does the agent escalate?
Three classes always escalate to senior human review: subtle concurrency bugs in code without sufficient test coverage, domain specific business rule violations the agent has not seen examples of, and content factual errors that require external source cross referencing. The QA Agent's job is to gate the queue, not to replace the senior reviewer's call.
Can I see the evidence pack as a customer?
Yes. Every deliverable in Live Delivery Tracking has a QA Evidence drawer. It contains the deterministic test summary, the static analysis findings, the semantic review and the weighted recommendation. If you disagree with a verdict, the dispute starts from the evidence.