Introducing Orzed Horizon, the flagship planning model


Orzed Horizon is the flagship model in the Orzed stack. Long context, deep reasoning, sized for planning, architecture and senior review work. Async by design.

  • By Orzed Team
  • 5 min read

Key takeaways
  • Mixture of experts architecture in the 70B to 90B active parameter range, distilled from open source bases plus 18 months of Orzed delivery telemetry.
  • Context window in the 256K token range, sized for whole engagement context (briefs, plans, prior decisions, codebase excerpts) in one pass.
  • Powers the Planning Agent and the Senior Review Layer, the two places where reasoning depth matters more than latency.
  • Median latency in the 4 to 7 second range. Used for async planning and review, not for interactive flows.
  • Customers see Horizon outputs as Recommendations or Senior Review Notes, never as binding decisions until a human signs off.

Orzed Horizon is the flagship model in the Orzed stack. We built it for the work in a delivery pipeline that asks “should we do this, and if so how” rather than “do this now”. Planning, architecture trade off analysis, multi stakeholder synthesis, senior review of executed work. The kind of work where being slow and right beats being fast and wrong by an order of magnitude.

This is the model card. It explains where Horizon came from, what it is sized for, where it sits in the Orzed pipeline, and the limits we have measured.

Architecture

Horizon is a mixture of experts (MoE) model with active parameter count in the 70 billion to 90 billion range. Total parameter count is higher; only a routed subset is active per token, which keeps inference cost in a band we can run sustainably for async workloads. The base architecture borrows from open source MoE work in the Mixtral and DeepSeek lineage; the routing is trained against an internal objective that penalises expert collapse and rewards utilisation balance.
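The utilisation-balance objective can be sketched with a Switch-Transformer-style auxiliary loss. Everything below (top-1 routing, the exact loss form, the use of NumPy) is an illustrative assumption, not Horizon's training code; only the idea of penalising expert collapse mirrors the text.

```python
import numpy as np

def load_balancing_loss(router_logits: np.ndarray, num_experts: int) -> float:
    """Auxiliary loss that pushes the token-to-expert assignment toward
    a uniform distribution. Equals 1.0 under perfectly balanced routing
    and grows as routing collapses onto a few experts.

    router_logits: (num_tokens, num_experts) raw router scores.
    """
    # Softmax over experts for each token.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    # f_i: fraction of tokens whose top-1 expert is i.
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=num_experts) / len(top1)

    # p_i: mean router probability mass assigned to expert i.
    p = probs.mean(axis=0)

    # Minimised when both distributions are uniform.
    return float(num_experts * np.dot(f, p))
```

A training loop would add this term, scaled by a small coefficient, to the language-modelling loss, so the router is rewarded for spreading load rather than collapsing onto one expert.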

Context window is in the 256K token range. For the planning workload this matters because a single Planning Recommendation can need to ingest the original brief, the Intake Agent’s report, the customer’s prior delivery memory, the relevant slice of our internal pattern library and any codebase excerpts the planner needs to reason against, all in one pass. Splitting that into a chain of smaller calls degrades the plan; long context gives us a single coherent reasoning surface.
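A rough sketch of the single-pass assembly, assuming hypothetical section names and a crude characters-per-token estimate; the production assembler is not shown anywhere in this card:

```python
# Illustrative only: a 4-characters-per-token heuristic and invented
# section names stand in for the real tokenizer and artifact schema.
CONTEXT_BUDGET_TOKENS = 256_000

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return len(text) // 4 + 1

def assemble_context(sections: dict[str, str]) -> str:
    """Concatenate all planning inputs into one prompt, refusing to
    silently truncate: a plan reasoned over partial context is worse
    than an explicit failure."""
    total = sum(estimate_tokens(body) for body in sections.values())
    if total > CONTEXT_BUDGET_TOKENS:
        raise ValueError(
            f"context needs ~{total} tokens, budget is {CONTEXT_BUDGET_TOKENS}"
        )
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections.items())
```

The design point is the hard failure on overflow: chaining smaller calls would hide the problem, and the text above argues the coherent single surface is the whole value of the long window.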

The reasoning loop is multi pass. Horizon produces a draft plan, critiques its own draft against a checklist of failure modes drawn from our retrospective archive, revises, and emits the final artifact with the critique trail attached. The critique trail is part of what the Technical Review Team reads when they validate the Recommendation; it makes the model’s reasoning auditable rather than opaque.
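The loop shape might look like the sketch below, where `generate` stands in for a model call, the checklist entries are invented, and the "no issues" stopping condition is an assumption; only the draft-critique-revise structure and the attached critique trail come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class PlanArtifact:
    plan: str
    critique_trail: list[str] = field(default_factory=list)

# Hypothetical checklist entries; the real list is drawn from the
# retrospective archive and is not published in this card.
FAILURE_MODE_CHECKLIST = [
    "unstated dependency",
    "optimistic throughput",
]

def plan_with_critique(generate, brief: str, max_passes: int = 3) -> PlanArtifact:
    """Draft, self-critique against the checklist, revise, and emit the
    final plan with the full critique trail attached for review."""
    artifact = PlanArtifact(plan=generate(f"Draft a plan for: {brief}"))
    for _ in range(max_passes):
        critique = generate(
            f"Critique this plan against {FAILURE_MODE_CHECKLIST}:\n{artifact.plan}"
        )
        artifact.critique_trail.append(critique)  # trail stays auditable
        if "no issues" in critique.lower():
            break
        artifact.plan = generate(f"Revise the plan to fix:\n{critique}\n{artifact.plan}")
    return artifact
```

Because the trail is part of the returned artifact, a reviewer reads the model's objections to its own drafts, not just the final answer.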

Training approach

The base is open source. The fine tuning lifts the model from a generalist into a planner.

The Orzed delivery telemetry dataset comprises 18 months of completed engagements: original briefs, planning artifacts, the work that was actually done, the decisions that changed mid flight, the retrospectives that named what worked and what failed. Roughly 1,400 engagements contribute to it, with the bulk clustered around the engagement types we run most often (AI integration, SaaS product builds, automation, data platform) and a long tail of rarer engagement types.

The fine tune is supervised first (planning artifact pairs: brief to plan, plan to revised plan, plan to retrospective verdict) and then reinforcement aligned with senior reviewer judgements on plan quality. The reward signal is structured, not scalar; reviewers rate plans on six dimensions (decomposition quality, dependency awareness, risk identification, scope discipline, throughput realism, cost realism) and the model is aligned against the multi dimensional signal. A scalar reward collapses too much information.
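A minimal sketch of keeping the reward structured rather than scalar. The dimension names come from the text; the 1-to-5 scale and the example plans are assumptions added to show what a scalar collapse loses.

```python
DIMENSIONS = (
    "decomposition_quality", "dependency_awareness", "risk_identification",
    "scope_discipline", "throughput_realism", "cost_realism",
)

def structured_reward(ratings: dict[str, int]) -> tuple[float, ...]:
    """Validate and return the per-dimension reward vector, refusing
    partial ratings rather than imputing them."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"unrated dimensions: {missing}")
    return tuple(float(ratings[d]) for d in DIMENSIONS)

# Two hypothetical plans with the same scalar mean but very different
# failure profiles; averaging to a single number cannot tell them apart.
even_plan = dict.fromkeys(DIMENSIONS, 3)
spiky_plan = {**dict.fromkeys(DIMENSIONS, 4),
              "throughput_realism": 1, "cost_realism": 1}
```

The second plan scores well on decomposition and risk but is badly wrong on throughput and cost; the vector preserves exactly the distinction a scalar reward would erase.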

We hold out two clean evaluation sets. The first is a labelled set of 80 historical engagements where we know what the actual delivery looked like; we measure the model’s plan against the realised delivery. The second is a synthetic set generated to stress test specific failure modes (ambiguous scope, conflicting stakeholder asks, technically impossible asks dressed up as feasible). The numbers in the next section are measured against both.

Where Horizon sits

Two places. The Planning Agent uses Horizon to produce the Planning Recommendation that comes out of the front end of every engagement. The Senior Review Layer uses Horizon to produce the Review Notes that accompany every Approved Baseline change request, every architecture decision and every release readiness call.

In both cases the Horizon output is exploratory, never binding. The Console marks Horizon outputs explicitly as Recommendation or Note; the Approved Baseline (the binding plan) is created only when the Technical Review Team has signed off. This separation is structural and intentional. Horizon is not a decision maker; it is the most capable reasoner we have, and reasoners advise.

Performance bands

Latency on a representative planning workload (full brief plus context, multi pass reasoning, structured output) sits between 4 and 7 seconds at the median, with the long tail extending to 15 seconds for the largest engagements. This is async territory; the Planning Agent does not block any interactive surface, and the Console renders the Recommendation as it streams.

On the held out engagement set, plans produced by Horizon match the realised delivery within a tolerance band on each of the six dimensions roughly 70 to 80 percent of the time. The cases where it misses are concentrated in two classes: engagements where the customer changed direction materially mid flight (which no plan can predict), and engagements where the brief itself was substantively misleading (which the Intake Agent now catches earlier).
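The per-dimension match could be sketched as a tolerance check like the one below, where the numeric dimension values and the 20 percent relative band are invented examples; the card does not publish the actual band.

```python
def within_tolerance(planned: dict[str, float],
                     realised: dict[str, float],
                     tol: float = 0.20) -> bool:
    """True when every dimension of the plan lands within a relative
    tolerance band of what the delivery actually looked like.

    tol=0.20 is an illustrative band, not the measured one."""
    return all(
        abs(planned[d] - realised[d]) <= tol * max(abs(realised[d]), 1e-9)
        for d in planned
    )
```

A plan counts as a match only if all dimensions pass; a single badly missed dimension (say, cost) fails the whole plan, which is what concentrates the misses in the two classes described above.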

Against the synthetic stress set, Horizon flags ambiguous scope in the right place 88 to 92 percent of the time. It refuses technically infeasible asks dressed up as feasible more than 90 percent of the time; the residual failure mode is a plan that proposes a constrained version of the original ask instead of refusing outright. The Technical Review Team catches these before they reach a customer.

Limits

Three limits we have measured and are not currently working around.

Latency. Horizon is not for interactive flows. If a task requires sub second response, route to Pulse or Meridian. We provide Horizon as an explicit escalation, not a default.

Cost. Even amortised against the value of a good plan, Horizon inference is the most expensive call in the stack. Operational Credit accounting tracks every Horizon invocation and the platform routes away from it whenever a smaller model can carry the task. The Routing Layer write up explains the heuristics.
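A toy version of cost-aware routing, with the credit costs and task fields invented for illustration; the real heuristics live in the Routing Layer write up, and only the model names come from this card.

```python
# Hypothetical per-invocation costs in Operational Credits.
CREDIT_COST = {"pulse": 1, "meridian": 5, "horizon": 40}

def route(task: dict) -> str:
    """Pick the cheapest model that can carry the task; Horizon is an
    explicit escalation, never the default."""
    if task.get("needs_planning_depth"):   # e.g. a Planning Recommendation
        return "horizon"
    if task.get("interactive") and task.get("latency_budget_ms", 0) < 1000:
        return "pulse"
    return "meridian"
```

The ordering encodes the policy in the text: depth requirements escalate to Horizon, sub second interactive work goes to Pulse, and everything else defaults to the cheaper production tier.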

Generalist breadth. Horizon is shaped for the planning and review workload. It is not a general code generation model; for everyday coding work it underperforms Meridian, which has a code corpus and tool use loop Horizon does not need.

Specifications

Attribute            Value
Architecture         Mixture of experts, active parameters 70B to 90B
Base                 Open source MoE lineage, Orzed fine tuned
Context window       256K tokens (range)
Median latency       4 to 7 seconds (planning workload)
Throughput target    Async; 50 to 100 plans per hour per inference cluster
Target use cases     Planning Recommendations, Senior Review Notes, architecture trade off analysis
Console surface      Recommendation, Senior Review Note (always exploratory)

The next two model cards in this index cover Orzed Meridian (the day to day production model) and Orzed Pulse (the high frequency, low latency tier). The Routing Layer write up explains how a task arrives at the right model in the first place.

Frequently asked questions

Why mixture of experts instead of a dense model at the same size?

Routing tokens through specialised expert subnetworks lets us reach reasoning depth comparable to a much larger dense model while keeping inference cost predictable. The trade off is a more complex training loop and a harder distillation path; we accept both because the planning workload rewards depth more than throughput.

Can I use Horizon directly through the Console?

Not in interactive mode. Horizon outputs reach the Console as Planning Recommendations and Senior Review Notes, both async artifacts. Interactive work routes to Meridian or Pulse depending on the task shape, with Horizon available as an explicit escalation when the Routing Layer detects the task needs it.

Is Horizon trained on my engagement data?

No. Customer engagement data is partitioned and never enters the Horizon training set. The 18 months of delivery telemetry referenced above is internal Orzed data (our own decisions, our own retrospectives) plus the open source bases. Customer data participates only in the engagement specific memory layer, which is scoped to that engagement.