Where a model earns trust,
or loses it

Deployment is the easy half. Monitoring and drift detection, inference optimisation, and the governance work of bias audit, privacy engineering and explainability are the layer that turns a working demo into an accountable production system. We operate all four pillars under one contract.

[Dashboard illustration: deployment pipeline with canary rollout, monitoring panels and SLO signals. Canary v2.4.1 at 5% traffic, p95 latency 172 ms against a 180 ms SLO, per-slice quality 93.7% (Δ −0.3), PSI drift 0.18 against a 0.2 threshold over a 9-day window in 24 h buckets, 30-day SLO budget at 82% against a 99.9% target, burn rate 1.2×, automatic rollback in under 60 s.]

FOUR PILLARS

What the operations layer is actually made of

Deployment is one of four disciplines, not the discipline. An engagement scoped to a single pillar is welcome; most end up covering three once the first production incident makes the gaps visible.

01

Deployment & runtime

Getting weights out of a training notebook and behind a service contract the rest of the company can call.

  • REST and gRPC services (FastAPI, BentoML, Ray Serve)
  • Autoscaling GPU inference (vLLM, TGI, Triton, TensorRT-LLM)
  • Batch, streaming and near-real-time prediction paths
  • Blue-green, shadow and canary rollouts with automatic rollback
  • Feature-flagged model versioning under tenant isolation
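The canary-with-automatic-rollback pattern above reduces to a small decision rule evaluated over the canary window. A minimal sketch in Python; the threshold values and field names are illustrative assumptions, not contract numbers:

```python
from dataclasses import dataclass

@dataclass
class CanaryStats:
    p95_latency_ms: float   # observed p95 over the canary window
    error_rate: float       # fraction of failed (5xx) responses
    quality_delta: float    # eval-suite delta vs the incumbent model

def canary_decision(stats: CanaryStats,
                    slo_p95_ms: float = 180.0,
                    max_error_rate: float = 0.001,
                    max_quality_drop: float = 0.005) -> str:
    """Return 'promote' or 'rollback' for a small canary traffic slice.

    Illustrative thresholds only; a real gate would also check sample
    size and statistical significance before deciding.
    """
    if stats.p95_latency_ms > slo_p95_ms:
        return "rollback"   # latency SLO breached
    if stats.error_rate > max_error_rate:
        return "rollback"   # availability budget at risk
    if stats.quality_delta < -max_quality_drop:
        return "rollback"   # model regressed vs incumbent
    return "promote"

print(canary_decision(CanaryStats(172.0, 0.0004, -0.003)))  # promote
```

The same predicate runs continuously after promotion; a breach at 100% traffic triggers the rollback path rather than a promotion veto.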

02

Monitoring & drift

What a model did yesterday says nothing about what it will do tomorrow. We write the signals that catch regression early.

  • Latency and throughput SLOs with alert routing
  • Input-distribution drift (PSI, KS, JS divergence)
  • Concept drift detectors on labelled online traffic
  • Shadow traffic for candidate model comparison
  • Retraining triggers tied to quality and drift thresholds
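The PSI check named above is simple enough to sketch in full. A minimal pure-Python version, assuming quantile-based buckets from the reference sample and the conventional 0.2 alert threshold:

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between a reference sample and live traffic.

    Both inputs are lists of floats. Bucket edges come from the reference
    sample's quantiles; a small floor avoids log(0) on empty buckets.
    """
    srt = sorted(expected)
    # quantile-based bucket edges from the reference sample
    edges = [srt[int(len(srt) * i / buckets)] for i in range(1, buckets)]

    def shares(sample):
        counts = [0] * buckets
        for x in sample:
            idx = sum(x > e for e in edges)  # which bucket x falls into
            counts[idx] += 1
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 watch, > 0.2 alert
reference = [i / 100 for i in range(1000)]
print(psi(reference, reference))  # 0.0 — identical distributions
```

A retraining trigger is then one comparison per feature per window: alert and enqueue a retrain when any monitored feature's PSI crosses the threshold.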

03

Optimisation & edge

Inference cost is where most AI budgets quietly die. Quantisation, compilation and edge deployment where they save money or latency.

  • INT8, 4-bit GPTQ and AWQ quantisation
  • Pruning, distillation and speculative decoding
  • CPU, ARM and mobile compilation (ONNX, CoreML, TFLite, ExecuTorch)
  • On-device inference with sync and privacy contracts
  • KV-cache quantisation and paging for long-context LLMs
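The INT8 path above rests on one idea: store weights as 8-bit integers plus a floating-point scale. A minimal per-tensor symmetric sketch; production stacks use per-channel scales and calibration data instead:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantisation: w ≈ scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [scale * qi for qi in q]

weights = [0.12, -0.98, 0.45, 0.0, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-trip error is bounded by half a quantisation step (scale / 2)
err = max(abs(w - r) for w, r in zip(weights, restored))
print(err <= scale / 2 + 1e-12)  # True
```

The payoff is 4× smaller weights than FP32 and integer arithmetic on hardware that supports it; GPTQ and AWQ refine the same idea by choosing scales that minimise the error on real activations.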

04

Governance & trust

A production model is a regulated artefact. Bias audit, privacy, explainability and the audit trail are design requirements, not add-ons.

  • Bias measurement (demographic parity, equalised odds, calibration)
  • Privacy engineering: anonymisation, differential privacy, data minimisation
  • Explainability (SHAP, integrated gradients, attention probes, counterfactuals)
  • Model cards, training logs, data-source provenance
  • EU AI Act, HIPAA, SOC 2 and sector-specific mapping
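Demographic parity, the first metric listed, is a one-line computation once decisions are sliced by group. An illustrative sketch with made-up data; real audits slice by intersectional groups and add confidence intervals before alerting:

```python
def demographic_parity_gap(preds, groups):
    """Max difference in positive-decision rate across protected groups.

    `preds` are 0/1 decisions, `groups` the protected attribute per row.
    """
    rates = {}
    for p, g in zip(preds, groups):
        n_pos, n = rates.get(g, (0, 0))
        rates[g] = (n_pos + p, n + 1)
    shares = [pos / n for pos, n in rates.values()]
    return max(shares) - min(shares)

preds  = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_gap(preds, groups))  # 0.75 - 0.25 = 0.5
```

Equalised odds is the same computation conditioned on the true label; both belong on the continuous-eval dashboard so a drifting input distribution shows up as a widening gap, not a surprise.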

Pre-op data layer

MLOps inherits whatever the data layer hands it. Drift detection, lineage and privacy controls start in the pipeline, not at the serving endpoint.

Open data engineering ↗

MATURITY LADDER

Five rungs; most organisations sit on the first two

We use a five-stage MLOps maturity ladder, adapted from Google's and Microsoft's published maturity models, as a diagnostic. Most organisations we meet are between stage 01 and stage 02. The move that matters is 02 → 03, where the pipeline stops being a person and starts being code.

01

Manual

One engineer, one notebook, one model. Works for proof-of-concept; breaks the moment a second model or a second engineer arrives.

02

Automated training

Training pipelines are reproducible, data and code versioned, metrics tracked. First experiments can be replayed; retraining is still manual.

03

Continuous training

Scheduled and triggered retraining on monitored drift. Candidate models go through staged evaluation before promotion. Human approval on the release gate.

04

Continuous delivery

Shadow deploy, canary, autoscale, automatic rollback on SLO breach. Release gates documented, executed by the pipeline, countersigned by a human.

05

Autonomous operation

The system reshapes itself within policy: chooses variants, rebalances cost, schedules retraining, opens tickets on its own drift. Humans approve the policy, not the steps.

FOUR PITFALLS

The quiet failures we have seen the most

None of these are cutting-edge problems. They are basic operational hygiene that is almost always deferred because the first demo worked. When they break, they break the business case.

Common mistake

Treating deployment as the finish line

A model that passes eval on the training data ships, then quietly decays in production. Without drift detection and a retraining path, the clock starts the moment the release lands.

Common mistake

Optimising for training cost, not inference

Cheap to train, expensive to serve. We set an inference-cost ceiling in stage 01 of the model build so quantisation, distillation and architecture choices are made before the training budget is spent.
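An inference-cost ceiling reduces to simple arithmetic: GPU cost per hour divided by sustained token throughput. A sketch with illustrative numbers, not quotes:

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second: float) -> float:
    """Serving cost per 1M output tokens for one GPU at full utilisation."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

# A $2.50/hr GPU pushing 2,500 tok/s serves 9M tokens/hour;
# a quantised model that doubles throughput halves the unit cost:
print(round(cost_per_million_tokens(2.50, 2500), 4))  # 0.2778
print(round(cost_per_million_tokens(2.50, 5000), 4))  # 0.1389
```

Setting this ceiling before training starts is what forces quantisation, distillation and architecture choices into the build plan rather than into a post-launch scramble.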

Common mistake

Treating bias audit as a one-off

Fairness metrics pass at launch, regress six months later as the input distribution shifts. Audits belong on the continuous-eval dashboard, not in a launch memo.

Common mistake

Explainability as a post-hoc attachment

If the system cannot explain a single decision when the customer asks, the product is regulated debt. Explainability is a runtime feature, not a research artefact.

SLO CONTRACT

Three tiers we actually sign against

Numbers on a slide are not a contract. The table below is the shape every MLOps engagement ends with: latency, availability, rollback and on-call SLOs in writing, tied to a tier that matches the product's risk profile.

Signal               Standard            Enhanced             Mission-critical
Latency P50          < 400 ms            < 180 ms             < 80 ms
Latency P95          < 1.2 s             < 450 ms             < 200 ms
Availability         99.5%               99.9%                99.95%
Quality regression   Daily eval suite    Per-deploy + drift   Continuous + canary
Rollback             Manual, < 30 min    Automatic, < 5 min   Automatic, < 60 s
Incident response    Next business day   < 4 hours            24/7 on-call
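Each availability row translates into a concrete error budget, and the burn rate on the monitoring dashboard is just spend against that budget. A sketch of both, using the three tier targets above:

```python
def error_budget_minutes(availability_target: float,
                         window_days: int = 30) -> float:
    """Allowed downtime per window for a given availability target."""
    return (1 - availability_target) * window_days * 24 * 60

def burn_rate(observed_error_rate: float,
              availability_target: float) -> float:
    """How fast the budget is being spent: 1.0 = exactly on budget."""
    return observed_error_rate / (1 - availability_target)

for target in (0.995, 0.999, 0.9995):
    print(target, round(error_budget_minutes(target), 1))
# 0.995  -> 216.0 min/month
# 0.999  ->  43.2 min/month
# 0.9995 ->  21.6 min/month

# A 99.9% service observing a 0.12% error rate burns budget at:
print(round(burn_rate(0.0012, 0.999), 2))  # 1.2
```

A sustained burn rate above 1.0 means the budget runs out before the window does, which is the signal that pages a human rather than the raw error count.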

TOOLKIT

The operational stack we default to

Stack picks are driven by latency, cost, compliance and team fluency, never by preferred-vendor contracts. These are the tools we run most often; substitutions happen per engagement.

Serving

  • vLLM · TGI · Triton
  • BentoML · Ray Serve
  • TorchServe · FastAPI
  • Modal · Replicate · Runpod

Observability

  • Evidently · WhyLabs · Fiddler
  • Arize · LangSmith · Braintrust
  • Prometheus · Grafana · OpenTelemetry
  • Sentry · Datadog APM

Compression

  • bitsandbytes · AWQ · GPTQ · llm-awq
  • ONNX Runtime · TensorRT · OpenVINO
  • CoreML · TFLite · ExecuTorch
  • Speculative + Medusa decoding

Governance

  • SHAP · Captum · integrated gradients
  • Fairlearn · AIF360
  • Presidio · differential privacy (Opacus)
  • Model cards · datasheets for datasets

Adjacent disciplines

Every production AI surface leans on its neighbours. The following disciplines run alongside on most engagements.

Deploy · observe · govern

Have the model, need the system around it

Bring the model, the traffic shape and the compliance envelope. We come back with an SLO contract, deployment plan, monitoring spec and governance map inside ten working days. Numbers you can sign, not aspirations.