Shipping Classical ML in an LLM World

MLOps · classical-ml

LLMs did not retire classical ML. A look at the categories where gradient boosting and logistic regression still beat LLMs on cost, latency and reliability.

  • By Orzed Team
  • 5 min read
Key takeaways
  • Tabular ML problems are still solved by gradient-boosted trees in 2026, not by LLMs.
  • An LLM call costs roughly 1000x what a logistic regression inference costs at production volumes.
  • Regulated decisions need explainability that LLMs cannot easily provide.
  • Pre-LLM ML never went away. Most production recommendation engines, fraud signals and ranking systems still run on it.

A team approached us asking for help building an LLM-powered fraud-detection system. They were a few weeks into the design and budgeting for a substantial ongoing inference cost. We asked what kind of inputs the model would see. They described a fixed schema: 47 features per transaction, all structured (amounts, IDs, timestamps, categorical flags). We asked what the output would be. A binary classification: fraud or not.

The right answer was a gradient-boosted tree. We built a baseline in three days using XGBoost on their existing labelled data. It hit 96 percent precision and 92 percent recall, with sub-millisecond inference latency, at roughly 1/2000th the cost the LLM solution would have run at. The team shipped it the following week.

The LLM had been the default answer because “AI” had become synonymous with “LLM” in the team’s vocabulary. The actual problem was a textbook classical ML problem and had been for fifteen years.

This piece is about that confusion. The categories where classical ML still wins, the categories where LLMs are the right answer, and the operational shape of running both in the same organisation.

Where classical ML still wins decisively

Tabular prediction at scale. Fraud, credit risk, churn, recommendations, propensity scoring. Inputs are structured rows; outputs are scores or classifications; labelled data exists. Gradient-boosted trees (XGBoost, LightGBM, CatBoost) and logistic regression remain the right answer in 2026. Inference is microseconds; cost is negligible at production volume; explainability via SHAP or feature importance is mature.

Latency-critical inference. Real-time bidding, sub-100ms recommendation, high-frequency trading. An LLM cannot match these latencies even with aggressive optimisation; classical models serve inferences in single-digit milliseconds at most.

Regulated decisions. Credit, insurance, hiring (where allowed), housing. The regulator wants to know “why did the model decide this”, and the answer needs to be a feature attribution, not “the LLM thought so”. Classical models with SHAP or counterfactual explanations meet this bar; LLM explanations meet it poorly.
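A minimal sketch of what "feature attribution" means in practice. SHAP gives per-decision attributions; the simpler global variant shown here is scikit-learn's permutation importance on a fitted classifier, over synthetic stand-in data.

```python
# Global feature attribution via permutation importance:
# how much does shuffling each feature degrade the model?
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=2_000, n_features=6,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
for i, imp in enumerate(result.importances_mean):
    # The per-feature numbers a reviewer or regulator can audit.
    print(f"feature_{i}: {imp:.3f}")
```

An answer of this form ("the amount and the merchant category drove the score") is what the regulator's question demands, and it falls out of classical tooling for free.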

Small data problems with strong priors. A team with 5,000 labelled examples and a clear target variable does not need to fine-tune a 70B LLM. A regularised logistic regression or a small XGBoost trained on the labels often produces better results faster.
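For concreteness, the whole small-data workflow fits in a few lines. Data here is synthetic; the point is that a regularised linear model plus cross-validation is the entire pipeline.

```python
# Regularised logistic regression on a ~5,000-example labelled set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# C is the inverse regularisation strength; smaller C = stronger prior,
# which is what small data calls for.
clf = LogisticRegression(C=0.1, max_iter=1_000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```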

Anomaly detection on structured data. Isolation forests, autoencoders, statistical methods. LLMs are bad at “is this transaction unusual” compared to dedicated anomaly detection.
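As a sketch of the dedicated approach, here is an isolation forest flagging unusual transaction amounts. The data is synthetic and the contamination value is an assumed prior on the anomaly rate, not a tuned figure.

```python
# Isolation forest on structured data: fit on the traffic,
# flag the points that are cheap to isolate.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=100, scale=10, size=(1_000, 2))   # typical amounts
outliers = rng.normal(loc=500, scale=5, size=(10, 2))     # unusual amounts
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)        # -1 = anomaly, 1 = normal
print("flagged:", int((labels == -1).sum()))
```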

The cost comparison is striking. At 100,000 inferences per second sustained:

Model class                     Inference cost     Latency        Notes
Logistic regression             ~$0 (CPU-bound)    < 1 ms         Fits in cache
XGBoost / LightGBM              ~$0 (CPU-bound)    1-3 ms         Tree traversal
Small neural net (10M params)   Low                5-20 ms        GPU inference
Fine-tuned LLM (8B)             Substantial        50-500 ms      GPU, batching
Frontier LLM (Sonnet 4.6)       High               200-2000 ms    Provider API, paid per token

For tabular problems where the classical model matches the LLM on accuracy (which is most of them), the cost ratio is 100x to 10,000x in favour of classical.
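The back-of-envelope arithmetic behind a ratio like that is worth seeing once. The per-inference figures below are illustrative assumptions, not measured prices; the structure of the calculation is the point.

```python
# Illustrative daily-cost comparison at sustained production volume.
CLASSICAL_COST = 0.000_000_05   # assumed $/inference for a CPU-bound tree
LLM_COST = 0.000_5              # assumed $/inference for an API LLM call
QPS = 100_000                   # sustained inferences per second
SECONDS_PER_DAY = 86_400

daily_classical = CLASSICAL_COST * QPS * SECONDS_PER_DAY
daily_llm = LLM_COST * QPS * SECONDS_PER_DAY
print(f"classical/day: ${daily_classical:,.0f}")
print(f"llm/day:       ${daily_llm:,.0f}")
print(f"ratio:         {daily_llm / daily_classical:,.0f}x")
```

Under these assumed prices the ratio lands at the top of the stated range; even generous adjustments to the per-call figures leave it in the hundreds.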

Where LLMs genuinely win

Open-ended generation. Drafting responses, summarising documents, translating, formatting. LLMs are the best tool ever built for this category.

Reasoning over unstructured input. When the input is text, an image, or a mixed bag of formats, LLMs handle it. Classical NLP could do parts of this; LLMs do all of it cleanly.

Tool use and orchestration. When the answer requires calling external systems in a context-dependent order, the LLM’s flexibility is the value.

Conversational interfaces. Chat is an LLM problem. Classical NLP could handle simple intents but not multi-turn coherence.

Cold start problems. Tasks where the team has no labelled data and the LLM’s pre-training carries enough world knowledge to get started. Used carefully (with eval discipline as the data accumulates), this is a powerful pattern.

Multi-modal inputs. Images, audio, video and text together. Foundation models handle this; classical ML largely does not without a dedicated model per modality.

The categories above are real and growing. They are also not the entire ML landscape; they are a meaningful slice of it.

Where to use both together (hybrid)

Many production systems benefit from both classes:

Classical model + LLM explanation. A fraud detector flags a transaction with a numeric score. An LLM explains the decision in natural language for the customer support agent. Each does what it is best at.

LLM-driven feature extraction + classical scoring. An LLM extracts structured features from unstructured text (customer email sentiment, intent, key entities). A classical model takes those structured features along with other signals and produces the final score. The LLM is the front-end; the classical model is the decision-maker.
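The pattern above can be sketched in a dozen lines. `llm_extract` here is a hypothetical stand-in for a real provider call (a real system would prompt a model to emit this fixed schema as JSON); it is stubbed with keyword rules so the example runs offline, and the scorer's weights are invented for illustration.

```python
import math

def llm_extract(email_text: str) -> dict:
    """Hypothetical stand-in for an LLM extraction call.
    A real implementation would return the same fixed schema."""
    text = email_text.lower()
    return {
        "sentiment_negative": int(any(w in text for w in ("angry", "refund", "cancel"))),
        "mentions_fraud": int("unauthorised" in text or "fraud" in text),
        "word_count": len(text.split()),
    }

def classical_score(features: dict) -> float:
    """Hypothetical logistic-style scorer over the extracted features."""
    weights = {"sentiment_negative": 0.8, "mentions_fraud": 1.5, "word_count": 0.001}
    z = sum(weights[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))

feats = llm_extract("There is an unauthorised charge, I want a refund.")
print(f"score: {classical_score(feats):.2f}")  # one number for the downstream system
```

The boundary is the fixed schema: the LLM can be swapped or re-prompted without retraining the scorer, and the scorer stays auditable.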

Classical pre-filtering + LLM analysis. The classical model handles the 95 percent easy cases at low cost. Hard cases are routed to an LLM for the deeper analysis. Cost stays low because the LLM only sees the residual.
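The routing logic for this pattern is usually no more than a confidence band. The thresholds below are illustrative assumptions, not recommended values.

```python
# Route on the classical model's score: confident cases are handled
# cheaply, only the ambiguous middle band pays for LLM analysis.
LOW, HIGH = 0.05, 0.90   # assumed confidence band for the classical model

def route(classical_score: float) -> str:
    if classical_score < LOW:
        return "auto-approve"      # clearly not fraud
    if classical_score > HIGH:
        return "auto-block"        # clearly fraud
    return "escalate-to-llm"       # the expensive residual

scores = [0.01, 0.5, 0.97, 0.03, 0.88]
print([route(s) for s in scores])
```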

These patterns combine the strengths cleanly. The teams that build them avoid the trap of “everything is an LLM” while not regressing to “no AI at all”.

The operational shape of running both

A team that runs classical ML and LLMs in production has two parallel operational stacks:

Concern              Classical ML                          LLMs
Model registry       MLflow, Weights & Biases, SageMaker   Provider API plus internal version pin
Inference serving    Custom service or BentoML/Seldon      Provider API or self-hosted (vLLM, TGI)
Monitoring           Drift, accuracy, latency              Evals, drift, cost, output quality
Retraining trigger   Drift threshold, calendar             Eval regression, model deprecation
Cost control         Infrastructure budget                 Per-token budget
The two stacks have distinct disciplines. A team that has only LLM-ops skills cannot maintain a classical ML system; a team with only classical ML-ops cannot maintain LLMs in production. Most production AI teams need both skill sets.
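To make the classical-ML drift trigger concrete, here is a minimal population stability index (PSI) check between a training-time score distribution and live traffic. The 0.2 alert threshold is a commonly used rule of thumb, assumed rather than derived.

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population stability index between two score distributions."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual)) + 1e-9   # make the top edge inclusive
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(values, a, b):
        n = sum(1 for v in values if a <= v < b)
        return max(n, 1) / len(values)            # floor at 1 to avoid log(0)

    return sum(
        (frac(actual, a, b) - frac(expected, a, b))
        * math.log(frac(actual, a, b) / frac(expected, a, b))
        for a, b in zip(edges, edges[1:])
    )

baseline = [i / 100 for i in range(100)]                  # training-time scores
shifted = [min(0.99, i / 100 + 0.3) for i in range(100)]  # drifted live scores
print(f"PSI: {psi(baseline, shifted):.2f}")  # above the assumed 0.2 alert threshold
```

The LLM column's equivalent is an eval suite re-run; the classical column's is a statistic like this wired to an alert.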

What we recommend on engagements

For each ML use case in scope:

  1. Classify the input shape (structured tabular vs unstructured vs mixed).
  2. Classify the output shape (score/classification vs generation vs decision).
  3. Classify the latency budget (microseconds vs milliseconds vs seconds).
  4. Classify the cost budget per inference.
  5. Pick the model class that fits all four.

Roughly:

  • Structured input + structured output + tight latency + tight cost = classical ML.
  • Unstructured input + generation = LLM.
  • Mixed = hybrid pipeline.
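The heuristic above can be written down as an executable decision function. The category names and thresholds here are illustrative encodings of the rules, not a formal taxonomy.

```python
def choose_model_class(input_shape: str, output_shape: str,
                       latency_budget_ms: float,
                       cost_budget_usd: float) -> str:
    """Rough triage of an ML use case into classical / llm / hybrid."""
    structured_in = input_shape == "structured"
    structured_out = output_shape in ("score", "classification")
    tight = latency_budget_ms < 10 or cost_budget_usd < 0.0001

    if structured_in and structured_out and tight:
        return "classical"
    if input_shape == "unstructured" and output_shape == "generation":
        return "llm"
    return "hybrid"

print(choose_model_class("structured", "score", 5, 0.00001))        # classical
print(choose_model_class("unstructured", "generation", 2000, 0.01)) # llm
print(choose_model_class("mixed", "score", 100, 0.001))             # hybrid
```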

The wrong move in 2026 is defaulting to LLMs for problems they do not naturally fit. The cost premium is real, the latency cost is real, the explainability gap is real, and customers (regulators, finance, product) start asking why six months later.

Classical ML did not retire. The teams that ship production AI honestly use whatever tool fits the workload, regardless of which one is in the headlines this quarter.

Frequently asked questions

Doesn't a fine-tuned LLM beat classical ML on tabular tasks?

Almost never on the cost-adjusted bottom line. A gradient-boosted tree trained on the same data typically matches or beats a fine-tuned LLM on accuracy, with 100x to 1000x lower inference cost and millisecond latency. The exceptions are tasks where the input has unstructured components (mixed text + numbers); even then, hybrid is often better than pure LLM.

Where do LLMs genuinely beat classical ML?

Open-ended generation, summarisation, multi-step reasoning, tool use, conversational interfaces, multi-modal inputs, problems where the rules cannot be enumerated, and small-data problems where the LLM's pre-training is the data. These are all real and large categories. But they are not the entire ML landscape.

How do I know which to choose?

If the input is structured (rows in a database, fixed schema), the output is a known prediction (class, score, ranking), and you have labelled data, default to classical ML. If any of those is false, consider an LLM.