Orzed Planning Agent: from brief to pipeline
- Runs on Orzed Horizon for the depth of reasoning the planning workload demands.
- Trained on the Orzed delivery memory: completed engagements reverse engineered into pipeline patterns.
- Produces a structured Planning Recommendation with blocks, dependencies, role assignments, and cost and throughput bands.
- Recommendation is labelled exploratory; the Approved Baseline is set by senior human reviewers and only that is binding.
- Customer can view the Recommendation in the Console, request revisions, or override specific assignments.
The Planning Agent is the heaviest agent in the Orzed stack. It runs on Horizon, takes an approved brief plus the Intake Report and the Technical Review Team’s notes, and produces a Planning Recommendation. The Recommendation is the document that turns a customer’s narrative ask into a structured pipeline of work blocks, dependencies, role assignments, and cost and throughput estimates.
This is the agent card. It explains the agent’s job, its training, the Recommendation it produces, the binding versus exploratory split, and the limits we have measured.
What the agent does
The Planning Agent reads three inputs: the approved brief, the Intake Report, and the Technical Review Team’s notes from intake. It produces a Planning Recommendation with five sections.
Decomposition. The work broken into blocks. Each block has a name, a deliverable, an acceptance criterion and an estimated effort band. The decomposition is the structural backbone; everything else hangs off it.
Dependency graph. The order in which blocks must run, including parallelisation where dependencies allow. The graph is rendered in the Console as a topology, not as a Gantt chart, because the latency between blocks is rarely the binding constraint; the dependency itself is.
Role assignment. Per block, an indication of which roles do the work. Roles are typed: senior engineer, specialist (security, performance, accessibility), AI agent (Meridian execution, Pulse validation, etc.), human reviewer. The platform’s value comes partly from this assignment getting the AI versus human split right; the agent leans on the historical pattern library to do so.
Throughput estimate. A range estimate for the engagement’s total elapsed time, anchored to the platform’s measured weekly throughput limits. The estimate is a band, not a point; we do not pretend to be more precise than the underlying variability allows.
Cost estimate. Project Credit estimate for the engagement, broken down by block. The estimate is a band and is reconciled to the customer’s quoted price during review; it is not the final price.
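The five sections above imply a concrete shape for the Recommendation. A minimal sketch of that shape follows; the class and field names (`Band`, `WorkBlock`, `PlanningRecommendation`) are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Band:
    """A range estimate; the agent emits bands, never point values."""
    low: float
    high: float

@dataclass
class WorkBlock:
    """One unit of the decomposition (hypothetical field names)."""
    name: str
    deliverable: str
    acceptance_criterion: str
    effort: Band                                     # estimated effort band
    roles: list[str] = field(default_factory=list)   # e.g. "senior engineer", "AI agent"

@dataclass
class PlanningRecommendation:
    """Illustrative container for the five sections of a Recommendation."""
    blocks: list[WorkBlock]
    dependencies: dict[str, list[str]]   # block name -> names it depends on
    throughput_weeks: Band               # elapsed-time band for the engagement
    cost_credits: Band                   # Project Credit band, reconciled at review
    status: str = "exploratory"          # only the Approved Baseline is binding
```

The `status` default mirrors the binding versus exploratory split: nothing the agent emits is binding until the Technical Review Team converts it.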
The Recommendation is the artifact the Technical Review Team converts into the Approved Baseline.
Architecture
The Planning Agent is an orchestration over Horizon plus a small toolset.
The toolset is narrow on purpose. The agent can query the Orzed pattern library (a structured store of completed engagement decompositions), the platform’s current capacity model (which roles are available with what weekly throughput), the Operational Credit cost model (what each role and each model tier costs per typical block), and the customer’s prior engagement memory if they have one. These are read only tools; the agent does not change platform state during planning.
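The read only contract can be made explicit in code. The sketch below models the four tools as a query-only interface; the method names and signatures are assumptions for illustration, not the real tool contract.

```python
from typing import Protocol

class PlanningTools(Protocol):
    """Read only queries available during planning (hypothetical names).
    Note there are no mutating methods: the agent cannot change
    platform state while it plans."""

    def query_pattern_library(self, domain: str) -> list[dict]:
        """Completed engagement decompositions matching a domain."""
        ...

    def current_capacity(self, role: str) -> float:
        """Weekly throughput currently available for a role."""
        ...

    def credit_cost(self, role: str, model_tier: str) -> float:
        """Operational Credit cost per typical block."""
        ...

    def customer_memory(self, customer_id: str) -> list[dict]:
        """Prior engagement memory for this customer, if any."""
        ...
```

Keeping the interface query-only is what makes the guarantee in the text structural rather than behavioural: there is simply no write path to misuse.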
The reasoning loop is multi pass. Horizon produces a draft decomposition, then critiques it against a checklist of common planning failures (over decomposition into too many small blocks, under decomposition where one block hides three deliverables, missed dependencies, role assignment that ignores capacity, throughput estimates that ignore the dependency graph). The revised draft is the one that lands in the Console.
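The draft-then-critique loop can be sketched as a small control structure. Here the checklist entries mirror the failures named above, while `draft_fn`, `critique_fn` and `revise_fn` stand in for Horizon calls; the two-pass cap is an assumption, not a documented parameter.

```python
CHECKLIST = [
    "over decomposition into too many small blocks",
    "under decomposition where one block hides several deliverables",
    "missed dependencies",
    "role assignment that ignores capacity",
    "throughput estimate that ignores the dependency graph",
]

def multi_pass_plan(draft_fn, critique_fn, revise_fn, brief, max_passes=2):
    """Produce a draft, critique it against the checklist, and revise.
    The callables stand in for Horizon calls (illustrative only)."""
    draft = draft_fn(brief)
    for _ in range(max_passes):
        findings = critique_fn(draft, CHECKLIST)
        if not findings:
            break                       # clean pass: nothing left to fix
        draft = revise_fn(draft, findings)
    return draft                        # the revised draft lands in the Console
```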
End to end, the agent takes 30 to 90 seconds of wall clock time on a typical engagement, and longer for complex ones. This is async territory; the customer is not blocked.
Training
The Planning Agent inherits Horizon’s training and adds a planning specific fine tune.
The fine tune dataset is what we call delivery memory: roughly 1,400 completed engagements, each captured as a triple (original brief, planning artifact, realised delivery). The realised delivery includes what was actually built, in what order, by whom, in what time, at what cost. The fine tune is supervised against the planning artifact (predict the plan that the human team produced) with a secondary loss term against the realised delivery (penalise plans that were close to the original artifact but that diverged sharply from what was actually delivered).
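The two-term objective described above is, in essence, a weighted sum. A minimal sketch, assuming a generic plan-distance function `dist` and weight `lam` (neither is a disclosed training detail):

```python
def planning_loss(pred, human_plan, realised, dist, lam=0.3):
    """Illustrative fine tune objective: a supervised term against the
    human planning artifact, plus a secondary term penalising plans
    that diverge from the realised delivery. `dist` is any distance
    over plan representations; `lam` weights the secondary term
    (both values are assumptions, not Orzed's actual training setup)."""
    return dist(pred, human_plan) + lam * dist(pred, realised)
```

The secondary term is what penalises a prediction that closely matches the original artifact but diverges sharply from what was actually delivered.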
The reverse engineering step that produced delivery memory is itself substantial. Past engagements were not always documented in the structured form delivery memory needs; we spent roughly six months reconstructing pipeline patterns from project artifacts (commits, retrospectives, original briefs) before the dataset was usable for fine tuning. We continue to add to it as new engagements close.
The reinforcement loop runs on senior reviewer judgements. After every Planning Recommendation enters the human review queue, the reviewer’s edits and the eventual Approved Baseline are captured as a structured comparison. The agent is aligned against this signal on a rolling basis.
Performance bands
On the held out engagement set (60 engagements the agent has not seen, with planning artifacts produced by senior planners), the Planning Agent’s Recommendation matches the human planner’s structure (decomposition shape, role split, dependency graph) within a tolerance band on roughly 70 to 80 percent of dimensions. The cases where it diverges are usually defensible (the agent and the human chose different but equally valid structures) and a small minority are misses.
Throughput estimates land in the right band (within plus or minus 20 percent of realised delivery time) on roughly 70 to 80 percent of engagements. The misses cluster on engagements that experienced material mid flight scope change (which no plan predicts) and engagements where the original brief contained a substantive misassumption that survived intake (which we now catch earlier with the strengthened Intake Agent).
Cost estimates land in the right band on roughly 75 to 82 percent of engagements. The cost estimate is somewhat more reliable than the throughput estimate because the platform’s per role and per model cost model is well measured, while throughput is more sensitive to engagement specific external factors.
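The "lands in the right band" criterion used for both estimates above can be made precise as a relative-error check. A sketch of that check under the stated plus or minus 20 percent tolerance:

```python
def in_band(estimate: float, realised: float, tolerance: float = 0.20) -> bool:
    """True when the estimate falls within +/- tolerance of the realised value."""
    return abs(estimate - realised) <= tolerance * realised

def band_hit_rate(pairs, tolerance=0.20):
    """Fraction of (estimate, realised) pairs that land in band,
    the statistic the performance bands above report."""
    hits = sum(in_band(est, real, tolerance) for est, real in pairs)
    return hits / len(pairs)
```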
The Recommendation is generated in the 30 to 90 second range for typical engagements, with the long tail extending to 4 to 5 minutes for the most complex ones. The Console shows progress as the Recommendation streams.
Binding versus exploratory
The Planning Recommendation is labelled exploratory. The Approved Baseline is what becomes binding.
The flow is structural. The Planning Agent produces the Recommendation. The Technical Review Team reviews it, edits it, validates the role assignments against current capacity, sanity checks the throughput and cost bands, and produces the Approved Baseline. The customer reviews the Approved Baseline (not the Recommendation), signs off, and the engagement enters Execution against that signed plan.
This is the same pattern the Intake Agent and QA Agent use: the model proposes, the human disposes, only the human’s disposition binds the engagement. The pattern exists because we will not let an AI estimate be the basis of a customer commitment. AI estimates inform the commitment; the commitment is made by people who can stand behind it.
Customer overrides
The Console exposes the Recommendation to the customer in read mode and the Approved Baseline in editable mode (with revision tracking). Customers can request revisions on specific blocks, override AI versus human assignments per block (e.g., “I want a senior engineer on this block, not an AI agent”), and tune throughput preferences (faster delivery at higher cost, or slower delivery at lower cost). All overrides flow back through the human review loop before they enter the Approved Baseline.
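The rule that every override passes through human review before it binds can be sketched as a small state machine. The state names and transition function are hypothetical; only the ordering (request, then review, then baseline) comes from the text.

```python
from enum import Enum

class OverrideState(Enum):
    REQUESTED = "requested"              # customer files the override in the Console
    IN_HUMAN_REVIEW = "in_human_review"  # Technical Review Team validates it
    IN_BASELINE = "in_baseline"          # only now does the override bind

def advance(state, review_approved=False):
    """Hypothetical transition: an override can only enter the Approved
    Baseline after passing human review; a rejected override stays put."""
    if state is OverrideState.REQUESTED:
        return OverrideState.IN_HUMAN_REVIEW
    if state is OverrideState.IN_HUMAN_REVIEW and review_approved:
        return OverrideState.IN_BASELINE
    return state
```

There is deliberately no transition from REQUESTED straight to IN_BASELINE, which is the structural point of the review loop.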
This matters because the platform should not impose its routing preferences on customers who have a strong opinion. Most customers are happy to defer to the Approved Baseline as proposed; some prefer specific control points; both are first class flows.
Limits
Three structural limits.
Brief quality dependency. The Planning Agent’s output is bounded by the brief’s quality. A vague brief produces a Recommendation with wide bands and many flagged risks; a sharp brief produces a tighter Recommendation. The Intake Agent surfaces the brief quality issues; the Planning Agent reflects them.
External constraints. The agent does not know about external customer constraints unless they are in the brief or in the customer memory. Calendar constraints (a regulatory deadline, a partner launch, a hiring freeze) come into the plan only through the human review step.
Novel domains. Engagements in domains where Orzed has limited delivery memory produce Recommendations with wider uncertainty bands. The Technical Review Team typically spends more time on these and the Approved Baseline diverges more from the Recommendation.
Specifications
| Attribute | Value |
|---|---|
| Underlying model | Orzed Horizon, Planning fine tune |
| Training data | Delivery memory (~1,400 engagements with brief, plan and realised delivery), plus reviewer alignment loop |
| Median latency | 30 to 90 seconds (typical engagement) |
| Output | Structured Planning Recommendation (decomposition, dependency graph, role assignment, throughput band, cost band) |
| Console surface | Planning Recommendation drawer, Approved Baseline editor |
| Binding status | Exploratory; binding decisions sit with the Technical Review Team in the Approved Baseline |
The QA Agent write up covers the next major agent in the lifecycle: the agent that runs the first pass on every Execution lane deliverable produced under the Approved Baseline.
Questions teams ask
Why is the Recommendation exploratory and not binding?
Because no model, however capable, should be the binding source of an engagement plan. The Recommendation is the agent's best read of how the work decomposes; the Approved Baseline is the version that human senior reviewers have validated against the customer's actual constraints. Binding decisions on cost, scope and timeline sit with people who can be accountable for them.
How accurate are the Planning Agent's estimates?
Throughput and cost estimates land in the right band on roughly 70 to 80 percent of engagements measured against realised delivery. The misses cluster around engagements where scope materially shifted mid flight (which no plan predicts) and engagements where the brief's assumptions were substantively wrong (which the Intake Agent now catches earlier).
Can I change parts of the Recommendation?
Yes. The Console lets you request revisions on specific blocks, override AI versus human assignments per block, and tune throughput preferences. Revisions go back through the agent and the human review loop; the Approved Baseline reflects the final state.