Model Monitoring Beyond Accuracy Drift
- Accuracy drift fires last. By then the customer has already noticed.
- Input distribution drift fires earliest, sometimes weeks ahead of accuracy degradation.
- Prediction distribution drift catches the case where inputs look fine but the model is flipping.
- Feature freshness is invisible until a stale feature wrecks a serving window.
A team had been monitoring a fraud-detection model in production for eighteen months. Their dashboard showed accuracy holding steady at 94 percent. They were confident the model was healthy. Then, over a single quarter, the false-positive rate doubled. Customer complaints spiked. The team traced the issue back: the input distribution had been slowly shifting for three months. Specifically, a new payment processor had come online and routed a different mix of transactions through the model. The model had not seen this distribution during training. Accuracy had quietly dropped, but the labelled feedback loop took six to ten weeks (because true labels arrived after a customer dispute period), so the metric on the dashboard always reflected performance from before the shift took hold.
This piece is about that pattern. Accuracy is the most visible monitoring signal and the one that fires last. The signals that fire earlier require some setup, but they are what give the team a chance to fix a regression before it becomes a customer incident.
The five signals worth installing
Signal 1: Input distribution drift. The shape of the data flowing into the model changed. Compare the distribution of recent inputs to the training distribution (or to a rolling baseline). Population Stability Index (PSI) is the standard metric:
PSI = sum( (recent_pct - baseline_pct) * ln(recent_pct / baseline_pct) )
PSI > 0.1 indicates a meaningful shift. PSI > 0.2 indicates a significant shift worth investigating. Track per-feature, weekly.
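The formula above can be sketched in a few lines of numpy. This is a minimal illustration, not a production library: bin edges come from baseline quantiles (ten bins is a common default), recent values outside the baseline range are clipped into the outer bins, and a small epsilon guards the logarithm against empty bins.

```python
import numpy as np

def psi(baseline, recent, bins=10):
    """Population Stability Index of `recent` against `baseline`.

    Bins are baseline quantiles; out-of-range recent values are clipped
    into the outer bins; epsilon avoids log(0) on empty bins.
    """
    eps = 1e-6
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps
    rec_pct = np.histogram(
        np.clip(recent, edges[0], edges[-1]), bins=edges
    )[0] / len(recent) + eps
    return float(np.sum((rec_pct - base_pct) * np.log(rec_pct / base_pct)))
```

Run weekly per feature: two samples from the same distribution land well under 0.1, while a one-standard-deviation mean shift lands well over 0.2.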
Signal 2: Prediction distribution drift. The shape of the model’s outputs changed. Even when input distributions look similar, the model can start emitting more predictions in one class. Compare the rolling distribution of predictions to baseline. Useful when input drift is hard to measure (high-dimensional inputs, embeddings) but the prediction is a clean discrete or continuous value.
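One way to sketch this check, assuming scipy is available: a two-sample Kolmogorov-Smirnov test between a baseline window of prediction scores and the most recent window (the same PSI computation also works here). The beta-distributed scores below are synthetic stand-ins for a model's score output.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
baseline_scores = rng.beta(2, 8, size=5000)  # historical score distribution
recent_scores = rng.beta(2, 5, size=5000)    # recent week: scores skewing higher

# Two-sample KS test: has the prediction distribution moved?
stat, p_value = ks_2samp(baseline_scores, recent_scores)
if p_value < 0.01:
    print(f"prediction distribution shifted (KS statistic {stat:.3f})")
```

With large windows the p-value alone over-fires on tiny shifts, so in practice teams also threshold the KS statistic itself.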
Signal 3: Feature freshness. A feature was supposed to update every hour but has not updated in 18 hours. The model is reading stale data. Track the timestamp of the most recent value per feature; alert when staleness exceeds a per-feature threshold.
This is the cheapest monitor to install and the one that catches the dumbest, most embarrassing failure mode: a feature pipeline silently broke and nobody noticed because the model kept serving (with stale inputs).
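A freshness check really is this small. The sketch below assumes a dict of per-feature last-update timestamps pulled from the feature store; the feature names and SLA values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-feature staleness thresholds
FRESHNESS_SLA = {
    "txn_velocity_1h": timedelta(hours=2),
    "account_age_days": timedelta(days=2),
}

def stale_features(last_updated, now=None):
    """Return the features whose latest value is older than its SLA."""
    now = now or datetime.now(timezone.utc)
    return [
        name for name, ts in last_updated.items()
        if now - ts > FRESHNESS_SLA.get(name, timedelta(hours=24))
    ]
```

Wire the output into whatever alerting path handles the hard-threshold breaches.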
Signal 4: Downstream business metric correlation. The model exists because it drives a business outcome (fewer fraudulent transactions, higher conversion, better recommendations). Track the metric the model is supposed to influence. Correlate model predictions to the metric over rolling windows. If the correlation breaks, the model is no longer doing its job, regardless of what its accuracy says.
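A rolling correlation over daily aggregates is usually enough. The sketch below uses pandas with synthetic data; the column names, the 28-day window, and the 0.3 breakdown threshold are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily aggregates: mean fraud score vs. chargeback rate
rng = np.random.default_rng(1)
days = 90
score = rng.uniform(0.1, 0.9, days)
outcome = 0.5 * score + rng.normal(0, 0.02, days)  # outcome tracks the score

df = pd.DataFrame({"avg_fraud_score": score, "chargeback_rate": outcome})
df["rolling_corr"] = df["avg_fraud_score"].rolling(28).corr(df["chargeback_rate"])

# Alert when the model stops moving the metric it exists to move
if abs(df["rolling_corr"].iloc[-1]) < 0.3:  # hypothetical threshold
    print("correlation breakdown: investigate")
```

The hard part is not the computation but the join: getting model logs and the business metric into the same table, which is why this monitor lands later in the install order below.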
Signal 5: Label arrival latency. How long after a prediction is made do you actually know whether it was correct? If labels arrive 6 weeks late, your accuracy dashboard is showing data from 6 weeks ago. Track the label-arrival distribution; surface it alongside the accuracy chart so the team knows how stale the metric is.
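Computing the label-arrival distribution is a one-liner once prediction and label timestamps sit in the same table. The timestamps below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical prediction/label timestamp pairs from the serving logs
events = pd.DataFrame({
    "predicted_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "labelled_at": pd.to_datetime(["2024-02-15", "2024-02-20", "2024-02-10"]),
})
latency_days = (events["labelled_at"] - events["predicted_at"]).dt.days
p50, p90 = np.percentile(latency_days, [50, 90])
print(f"labels arrive p50={p50:.0f}d, p90={p90:.0f}d after prediction")
```

Print the p50/p90 next to the accuracy chart: "this line is roughly six weeks old" changes how the team reads it.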
| Signal | Detects | Time to detection (vs accuracy) |
|---|---|---|
| Input distribution drift | Inputs shifting under model | Weeks earlier |
| Prediction distribution drift | Model behaviour shift even on same-looking inputs | Days to weeks earlier |
| Feature freshness | Stale or broken feature pipelines | Hours, before serving impact |
| Business metric correlation | Model no longer influencing the outcome | Days to weeks earlier |
| Label arrival latency | Tells you how stale your accuracy chart is | Always known |
| Accuracy (raw) | Final outcome regression | Latest |
What to instrument first
For a team with no monitoring beyond accuracy, the install order:
- Feature freshness. Trivial to implement, catches the dumbest failures.
- Input distribution drift on top 10 features. PSI computed weekly per feature, surfaced on a single dashboard.
- Prediction distribution drift. Rolling histogram of predictions, weekly comparison to baseline.
- Business metric correlation. Connect model predictions to the business outcome. Track the rolling correlation.
- Label arrival distribution. Surface alongside accuracy so the team knows how recent the data is.
The first three take roughly one engineer-week each. The last two are more bespoke; budget two to four weeks total depending on the data infrastructure.
Thresholds and alerting discipline
Not every drift signal should page on-call. The right hierarchy:
| Severity | Signal | Action |
|---|---|---|
| Critical | Accuracy below SLA, business metric breakdown, feature freshness past hard threshold | Page on-call |
| Warning | PSI > 0.2 on key feature, prediction distribution shifted significantly | Open ticket, no page |
| Info | PSI 0.1-0.2, label arrival latency increasing | Dashboard only |
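The table above can be encoded as a small routing function so the severity mapping lives in code review rather than in individual alert configs. The signal names and the returned actions are illustrative assumptions.

```python
# Hypothetical severity router mirroring the table above
def route_alert(signal, value=0.0):
    """Map a monitoring signal to an action: 'page', 'ticket', or 'dashboard'."""
    if signal in ("accuracy_below_sla", "business_metric_breakdown",
                  "freshness_hard_breach"):
        return "page"           # critical: wake someone up
    if signal == "psi":
        if value > 0.2:
            return "ticket"     # warning: investigate this week, no page
        if value > 0.1:
            return "dashboard"  # info: visible, not yet actionable
        return "none"
    return "dashboard"
```

Keeping the mapping in one place makes the "configured deliberately" part auditable.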
The team that pages on every PSI bump trains itself to ignore alerts. The team that does not page on accuracy SLA breaches finds out from customer support. The middle path is per-signal severity, configured deliberately.
Drift response patterns
When drift is detected, three response patterns:
Investigate first. PSI fired on a feature; before changing the model, find out why the distribution shifted. Often it is a known business event (new market launched, new product category, regulatory change). The model may not need to change; the distribution shift may be expected.
Retrain. If the shift is genuine and the model’s quality has degraded, retrain on a more recent window. Most production models should be on a scheduled retrain cadence (weekly to monthly) anyway; drift accelerates the schedule.
Roll back. If the current model has degraded sharply, roll back to the previous version while a new model is being trained. Versioned models in the registry make this a minutes-long operation; without versioning it takes hours to days.
The pattern to avoid: changing the model without diagnosing the cause. A drift response that retrains on contaminated data produces a worse model, not a better one.
Observability stack
For a typical production ML team, the monitoring stack:
- Feature freshness: a small custom dashboard or Evidently/WhyLabs/Fiddler.
- Distribution drift: Evidently, WhyLabs or a custom PSI computation in the data warehouse.
- Prediction distribution: same as above.
- Business correlation: usually a custom dashboard joining model logs to business metrics in the warehouse.
- Accuracy: native to your ML platform or a custom dashboard.
No single tool does all of this well. Most teams end up with two or three tools and a glue dashboard. That is fine; do not let perfect be the enemy of “we have monitoring”.
What we install on engagements
The standard ML monitoring discipline:
- Per-model SLA defined (accuracy threshold, latency, throughput).
- Per-model criticality classified (silent-fail OK vs business-impacting vs revenue-touching).
- Alerting tier mapped to severity (page, ticket, dashboard).
- The five signals above instrumented per model, with thresholds calibrated to traffic.
- A weekly review where the team looks at the dashboards together (not just on incidents).
The weekly review is the highest-leverage piece. Drift is gradual; daily glances miss it, and dashboards that are rarely visited miss it. A 30-minute weekly review where the ML team looks at every production model’s monitoring catches things that automated alerts cannot calibrate for.
The teams that get this right know the moment a model starts shifting and respond before the regression reaches customers. The teams that monitor only accuracy find out from customer support, weeks late, and write postmortems titled “we should have seen this earlier”.
Questions teams ask
What's the simplest drift metric to start with?
Population Stability Index (PSI) on the top 10 most-important features. It is computationally cheap, intuitive (PSI > 0.1 is a meaningful shift, PSI > 0.2 a significant one), and catches most distribution shifts that matter.
How do I monitor a model with delayed labels?
Use proxy metrics that arrive faster than the true label. Click-through rate is a proxy for purchase, an early-payment signal proxies for default. Track the proxy daily and the true label weekly or monthly.
Should monitoring fire alerts or just dashboards?
Both. Distributional shifts fire warnings (ticket, no page). Accuracy regressions and SLA breaches fire alerts (page on-call). The team that pages on every PSI bump burns out fast.