Automated QA Agents are now live

Console Changelog · console, qa

The first automated layer of the Orzed Review pipeline is live. Built on Orzed Pulse and paired with public security and static analysis models, it attaches evidence-backed verdicts to every deliverable.

  • By Orzed Team
  • 7 min read
Key takeaways
  • QA Agent runs on Orzed Pulse, fine-tuned on roughly 14,000 reviewed code cases, 3,200 security audits and 6,500 structural QA records.
  • Composite architecture pairs the LLM pass with a SecureBERT-class vulnerability scanner and a CodeT5+-class static analyzer.
  • Evidence is structured: deterministic test results, static analysis output and the model's semantic review are attached to every deliverable.
  • The pilot measured roughly 3.4x throughput against the human-only baseline, with about 18% fewer false-positive flags.
  • Senior human review still owns the binding pass-or-fail on architecture, security and release decisions. The agent automates the front of the line, not the call.

The Review and Validation Layer of the Orzed delivery pipeline has been the heaviest human bottleneck in the system since we built it. Every execution output (code, content, architecture, documentation) sits in a senior review queue until somebody with the right context signs off. That throughput limit has been the single largest reason a project’s velocity dips midway through a build.

Today the first automated layer of that review pipeline is live across all production engagements. We are calling it the QA Agent and it runs on Orzed Pulse. This note explains what it is, how it was trained, what it caught and missed during the pilot, and where the human review function continues to own the call.

Where the QA Agent sits in the pipeline

A typical engagement moves through the Console in five visible layers and one administrative one. The visible layers are Intake, Technical Review, Planning, Execution and Release. The Review and Validation Layer is not a separate column; it sits between Execution and Release, and it gates every artifact that comes out of an Execution lane.

Until now, that gate was entirely human. A senior engineer would pull a deliverable from the queue, run it through the project’s local checks, eyeball the diff, read the surrounding context, write a verdict. The work was good but the queue was always the bottleneck. We have measured cases where the queue depth was responsible for two thirds of a project’s elapsed time variance.

The QA Agent does not replace that reviewer. It runs first, attaches an evidence pack to the deliverable, and routes the artifact into one of three queues: clear, needs human review with a flagged concern, or hard fail with the failing evidence already cited. The human reviewer’s job becomes confirming or overruling a verdict that already has its supporting evidence assembled.
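That routing step can be sketched as follows; the queue names, field names and evidence-pack shape here are assumptions for illustration, not the Console's actual internals:

```python
from enum import Enum

class Verdict(Enum):
    CLEAR = "clear"
    NEEDS_HUMAN = "needs_human_review"
    HARD_FAIL = "hard_fail"

def route(deliverable: dict, evidence_pack: dict) -> Verdict:
    """Attach the evidence pack to the deliverable, then route it
    into one of the three queues described above. Hypothetical sketch."""
    deliverable["evidence"] = evidence_pack          # evidence travels with the artifact
    if evidence_pack["hard_fail_findings"]:
        return Verdict.HARD_FAIL                     # failing evidence already cited
    if evidence_pack["flagged_concerns"]:
        return Verdict.NEEDS_HUMAN                   # reviewer confirms or overrules
    return Verdict.CLEAR
```

The human reviewer only ever sees the second and third outcomes with the evidence already assembled.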

Technical architecture

The QA Agent is a composite. Three signals are produced independently and combined into a single weighted recommendation.

Signal one: deterministic test runner. For code deliverables, the agent triggers the project’s existing test suite (or a fast subset on first pass). The result is binary at the suite level and structured at the test level. This is the same machinery the engagement’s CI uses, just invoked earlier in the lifecycle.

Signal two: composite security and static analysis. Two public model families do the heavy lifting here. A SecureBERT-class vulnerability classifier reads diffs and flags known vulnerability patterns (auth bypass shapes, injection sinks, unsafe deserialisation, weak crypto primitives). A CodeT5+-class static analyzer produces structural findings (complexity hotspots, dead code, broken control flow, type contract drift). Both run on the changed surface only, not the whole repo, which keeps inference cost predictable.
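A minimal sketch of the changed-surface restriction, assuming unified diffs and treating both scanners as opaque callables (how Orzed actually slices the diff is not public):

```python
import re

def changed_files(unified_diff: str) -> list[str]:
    """Extract changed file paths from a unified diff so the scanners
    run on the changed surface only, not the whole repo."""
    return re.findall(r"^\+\+\+ b/(\S+)", unified_diff, flags=re.MULTILINE)

def scan_changed_surface(unified_diff: str, vuln_classifier, static_analyzer) -> list:
    """Run the SecureBERT-class and CodeT5+-class passes per changed file."""
    findings = []
    for path in changed_files(unified_diff):
        findings.extend(vuln_classifier(path))   # known vulnerability patterns
        findings.extend(static_analyzer(path))   # structural findings
    return findings
```

Restricting inference to the diffed files is what keeps per-deliverable cost roughly proportional to change size rather than repo size.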

Signal three: semantic review on Orzed Pulse. A fine-tuned Orzed Pulse instance reads the deliverable in context (the brief excerpt, the planning recommendation, the diff or content, the surrounding code or document tree) and produces a structured semantic review: intent match, edge case coverage, documentation completeness, naming and clarity. This is the signal that catches the failure modes deterministic tools miss: the "this works but does not solve the problem" class.
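The four dimensions above suggest a structured review object. A minimal sketch of what that output might look like; field names and the threshold are assumptions, not the production schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class SemanticReview:
    """Structured output of the semantic pass. Illustrative shape only."""
    intent_match: float                 # does the change solve the briefed problem?
    edge_case_coverage: float
    documentation_completeness: float
    naming_and_clarity: float
    concerns: list[str] = field(default_factory=list)

    def flags_intent_gap(self, threshold: float = 0.5) -> bool:
        # The "works but does not solve the problem" class shows up
        # as a low intent_match score even when every test passes.
        return self.intent_match < threshold
```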

The three signals are combined by a small weighted classifier that was itself trained on the historical decisions of our senior reviewers. The weights are not symmetric: a security flag from signal two carries roughly twice the weight of a semantic concern from signal three, because the cost asymmetry in production is well understood. The verdict is pass, soft fail with concern (queued to human), or hard fail.
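A toy version of that combination step; the weights and thresholds below are invented purely to show the roughly two-to-one security-versus-semantic asymmetry, whereas the real classifier's weights are learned from reviewer decisions:

```python
def combine_signals(suite_passed: bool, security_flags: int, semantic_concerns: int) -> str:
    """Weighted verdict sketch: pass, soft fail with concern, or hard fail.
    All numbers here are assumptions for illustration."""
    if not suite_passed:
        return "hard_fail"                       # deterministic failure is binding
    score = 2.0 * security_flags + 1.0 * semantic_concerns
    if score >= 4.0:
        return "hard_fail"
    if score > 0.0:
        return "soft_fail_with_concern"          # queued to a human reviewer
    return "pass"
```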

Training data and approach

The Orzed Pulse instance behind signal three was fine-tuned on three datasets we have been accumulating since the Console went live.

  • Reviewed code cases (~14,000): senior reviewer accept and reject decisions on Execution outputs across 18 months of engagements.
  • Security audit records (~3,200): external and internal audit reports with the original artifact, the finding, the verdict and the resolution.
  • Structural QA records (~6,500): documentation, content and architecture review verdicts, with the artifact, the concern and the disposition.

The pipeline was alignment-trained with a reinforcement loop on senior reviewer decisions. Each reviewer disagreement with the agent during shadow mode contributed a labelled correction. After distillation into Orzed Pulse, the agent reaches a verdict in well under one second on the median deliverable, which is what makes it usable as a queue front end rather than a second review pass.

We held out four engagements as a clean evaluation set during pilot. None of their data appears in training; the pilot numbers below are measured against that set and against shadow mode runs on production traffic.

What the pilot measured

The pilot ran for 6 weeks across 47 engagements, covering more than 1,200 deliverables. The agent ran in shadow mode for the first 3 weeks (verdicts produced but not enforced) and in active mode (verdicts gating the queue) for the second 3 weeks.

Throughput. Time from deliverable submitted to deliverable accepted by a senior reviewer dropped by a factor in the range of 3.2x to 3.6x, with the median closer to 3.4x. The variance is real; smaller deliverables saw the largest improvement, very large multi file refactors saw less because the human review time itself dominates there.

Flag quality. False positives, defined as agent-flagged concerns the senior reviewer dismissed without action, came down by roughly 18% compared with our previous purely heuristic linter chain. Most of the remaining false positives are clustered in two areas: stylistic concerns where reasonable engineers disagree, and concerns where the agent did not have enough surrounding context (a problem we are addressing with longer context windows on the next Pulse refresh).

Critical regression catch rate. This is the metric we cared about most. On a labelled set of 200 historical deliverables that contained known regressions, the agent flagged between 92% and 95% of them. The senior human review baseline on the same set was in the same band. The agent does not catch fewer real regressions than a human; it catches them faster and at scale.

Edge cases the agent missed. Three patterns. Subtle race conditions in concurrent code without test coverage. Domain specific business rule violations the agent had no examples of. Content factual errors that required cross referencing external sources. All three remained in the human review path.

What it changes for the customer

If you are running an engagement on the Console, you will see three changes.

Live Delivery Tracking now shows a QA Evidence drawer on every deliverable. Open it and you see the deterministic test summary, the static analysis findings, the semantic review, and the weighted recommendation. The drawer is the source of truth for any later disagreement. We deliberately surface the evidence rather than just the verdict, because evidence is debatable and a verdict is not.
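The drawer's four elements map naturally onto a structured payload. A hypothetical sketch of one serialised evidence pack, with every field name and value invented for illustration:

```python
import json

evidence_pack = {
    "deterministic_tests": {"suite": "pass", "failed_cases": []},
    "static_analysis": [
        {"rule": "injection-sink", "severity": "high", "scope": "changed surface"},
    ],
    "semantic_review": {"intent_match": 0.91, "concerns": ["edge case: empty input"]},
    "weighted_recommendation": "soft_fail_with_concern",
}

# Serialising the whole pack, not just the verdict, is what keeps any
# later dispute anchored in the same debatable evidence record.
serialised = json.dumps(evidence_pack, indent=2)
```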

Approval Checkpoints carry a QA Verdict badge. When you arrive at a checkpoint, you see whether the underlying deliverables passed the agent’s review, were soft flagged for human review, or failed and were sent back to Execution. The senior human review status is shown alongside; the agent’s verdict never overrides a human verdict, only supplements it.

Disputes have a defined surface. If a customer disagrees with a senior reviewer’s accept of a deliverable, the dispute is now anchored in the evidence pack, not in opinion. This is the change that should make project disagreements faster to resolve. We have spent more cumulative engagement time arguing about whether a thing was done well than about whether it was done at all; an evidence anchor moves the second conversation onto firmer ground.

Known limits

The agent does not handle, and is not intended to handle, the following classes of deliverable.

  • Domain specific compliance. Financial regulation review, sector specific privacy controls and any artifact that requires a human compliance specialist still routes to that specialist. The agent surfaces structural concerns but does not opine on compliance.
  • Subjective UX and product judgement. Visual design, content tone fit, copy voice. The agent will note deviation from a documented style guide; it will not have an opinion on whether the style guide is right.
  • First of a kind architecture decisions. Anything where the engagement has no precedent in our memory or the customer’s prior work. The senior reviewer owns these calls fully.

These are the right limits for this rollout. We will not move them without measuring the agent’s performance in the candidate area first.

What is next

Q2 brings specialised QA variants built on Orzed Meridian. Three lines are planned: a security focused variant with deeper static analysis and threat modelling integration, a performance focused variant that runs benchmark deltas on every code deliverable, and an accessibility focused variant for the front end work that needs it. Each will be opt in at the engagement level, with its own evidence drawer in the Console.

For engagements already in flight, the QA Agent is on by default starting today. For new engagements, it is part of the Console scaffolding from intake. Operational Credit usage for the agent is absorbed into the Project Credit baseline; customers do not see a separate line item for QA Agent inference, by design. The agent is part of the platform, not an add on.

If you want a deeper read on the Orzed Pulse model behind the agent, see the model write up linked from the Orzed Models and Agents index.

Frequently asked questions

Does the QA Agent replace the human reviewer?

No. The agent automates the first pass on every deliverable so the senior reviewer arrives at a pre-filtered queue with an evidence pack already attached. Architectural, security and release sign-off remain human decisions, recorded against an Approved Baseline.

What runs underneath the QA Agent?

Orzed Pulse, fine-tuned on internal QA telemetry, paired with a SecureBERT-family vulnerability classifier and a CodeT5+-family static analysis model. The three signals are weighted into a single pass-or-fail recommendation with the underlying evidence preserved.

How do I see the QA verdict in the Console?

Every deliverable card in the Live Delivery Tracking view now shows a QA Evidence drawer. It contains the deterministic test summary, the static analysis findings, the semantic review and the weighted recommendation. Disagreements are routed to the human reviewer queue.