Language systems
that hold up under real use

Most NLP wins collapse on messy, multilingual, domain-specific input. We tune encoders on in-domain data, pair LLMs with retrieval instead of trusting open-book inference, and ship speech with diarisation and noise guards. Evaluation is continuous, not a launch checklist.

Token stream, attention paths, embedding space and retrieval arc nlp.token.stream ctx 8192 · bf16 tokens · embed · attend · retrieve p50 180 ms she opened the contract template slowly . query match § 4.2 stt · stream tokens used 5,842 / 8,192

CAPABILITY MAP

Six language surfaces, one operating discipline

Each capability stands on its own, and every production build ends up combining at least three of them. The map below is how we split the engagement into named deliverables.

01

Classification & sentiment

Intent detection, review mining, moderation queues, ticket routing. At production scale, fine-tuned encoders still outperform zero-shot LLM prompts.

  • BERT · DeBERTa · ModernBERT
  • XLM-R multilingual
  • Active learning loops
  • Confidence-calibrated output
02

Embeddings & vector search

Text, sentence and domain-tuned embeddings. The retrieval layer under modern assistants, semantic search and de-duplication systems.

  • Sentence-Transformers
  • E5 · BGE · GTE
  • Matryoshka embeddings
  • Hybrid dense + BM25
03

Retrieval-grounded QA

LLMs over your corpus, with citations. Chunking strategy, reranking, freshness control, access propagation to tenants. Accuracy is a retrieval problem first, a model problem second.

  • pgvector · Qdrant · Weaviate
  • ColBERT · Cohere Rerank
  • Hybrid search + filters
  • Citation attribution
04

Conversational assistants

Multi-turn dialogue with tool use, memory and guardrails. Named capabilities, explicit failure modes, audit trail per action. Designed around the handoff to a human.

  • Function calling · tool use
  • MCP · agent loops
  • Memory + context window
  • Escalation handoff
05

Translation & summarisation

Domain-tuned machine translation, long-form abstractive summarisation, style-preserving rewriting. Evaluation on semantic similarity and human pairwise, not BLEU alone.

  • NLLB · MADLAD · Tower
  • Long-context transformers
  • Constrained decoding
  • Style guides as system prompts
06

Speech I/O (STT · TTS)

Streaming transcription with diarisation, neural voice synthesis with licensing and consent guardrails, accessibility-grade captions with punctuation and formatting.

  • Whisper · Parakeet · Distil-Whisper
  • XTTS · StyleTTS2
  • Speaker diarisation
  • Noise + VAD pre-processing

PIPELINE

From corpus to live serving

The real work is before and after training. Corpus curation sets the ceiling; eval suites decide whether the release ships. The stages below are the reviewable sequence we run on every language engagement.

  1. 01, Corpus

    Corpus

    Domain data collection, licensing review, deduplication, PII filter, contamination check against public eval sets. Held-out slices by source, cohort and language.

  2. 02, Tokenise

    Tokenise

    Tokeniser choice (BPE, WordPiece, Unigram) matched to domain. Custom vocabulary extensions for code, chemistry, clinical or legal corpora where generic tokenisers bleed budget.

  3. 03, Model

    Model

    Encoder, decoder or encoder-decoder selection. Parameter-efficient fine-tuning (LoRA, DoRA) as default; full fine-tune when the task shifts domain heavily; continued pretraining for language expansion.

  4. 04, Eval

    Eval

    Offline suites (MMLU, domain-held-out), RAG-aware evals (Ragas), jailbreak and prompt-injection suites, human pairwise review. Fairness across language and demographic slices.

  5. 05, Serve

    Serve

    vLLM or TGI for LLMs, ONNX Runtime for compact encoders. Streaming for UX, batching for throughput, cost-per-turn dashboards on every production surface.

Retrieval beats recall

Closed-book LLM answers drift. Retrieval-grounded QA with citations survives the internal audit. The difference is a vector store and a reranker, both budgeted from day one.

See capability 03 ↘

WHERE WE SHIP

Six surfaces where language systems earn their place

Production NLP is not a single product. Support, legal, clinical, developer tools, commerce and media each impose their own latency, privacy and review constraints. Below, the sectors we have scars in.

Support

Customer support AI

Ticket triage, response drafting, knowledge-base answering with citation. Human approval on anything that touches an account, refund or compliance boundary.

Legal · regulated

Legal and compliance

Contract review, clause extraction, disclosure parsing, reg-change monitoring. Retrieval with provenance is the baseline; anything less fails the internal audit.

Clinical

Clinical documentation

Transcription, SOAP-note drafting, code suggestion (ICD, CPT) with clinician review. PHI handling under HIPAA-ready architecture, deployment to in-region infra.

Developer

Developer and code tools

Code completion, review augmentation, documentation generation, bug triage. Tuned to the internal stack, wired into existing IDEs and review surfaces.

Commerce

Commerce and search

Semantic product search, review summarisation, description generation, multilingual catalogs. Evaluated on click-through and conversion, not offline rank metrics alone.

Media · voice

Media and voice surfaces

Streaming captions, dubbing, voice assistants, podcast editing. Consent and licensing are the design, not the afterthought.

WHAT IT LOOKS LIKE

Five sample interactions from real engagements

Every capability on the map has a measurable shape. The cards below are simplified but real: actual input forms on the left, actual output shape on the right, with the metrics we track in the meta line.

Classification Input

"Order arrived but the charger was missing. Third time this month. Considering returning the whole thing."

intent

complaint · missing_item

priority: high · sentiment: −0.78
Embedding similarity Input

query: "waterproof hiking boots for wide feet"

top match

Salomon X Ultra 4 Wide GTX

cosine: 0.871 · hybrid rerank: +0.06
RAG · grounded QA Input

"Can a client cancel inside the 14-day window under the new contract template?"

answer

Yes, Section 4.2 of Template v3.1 grants a 14-day withdrawal window for all non-milestoned engagements.

citations: contract-template-v3.1 §4.2 · confidence: 0.92
Summarisation Input

A 1,800-word quarterly review covering six workstreams and their release notes.

executive summary

Three workstreams shipped to production, one slipped a sprint on compliance, two remain in staging with dependency risk.

length: 46 words · kept: figures + owner names
STT · transcription Input

11-minute kick-off recording, two speakers, light background noise.

transcript

Speaker A: let's start with scope. Speaker B: the main unknown is data residency…

WER: 3.4% · diarisation: 2 spk · punctuated

SAFETY + QUALITY

The four surfaces a language system has to defend

Every production LLM surface is under quiet adversarial load. The four areas below cover what we build before launch and keep running afterwards.

Retrieval integrity

  • Citation attribution per claim
  • Source freshness scoring
  • Access-control propagation
  • Knowledge cutoff declared

Adversarial hardening

  • Prompt-injection detectors
  • Jailbreak red-team suite
  • Output content filters
  • Rate + rule-based abuse guards

Privacy & compliance

  • PII redaction on ingest + egress
  • SOC 2 · HIPAA · EU AI Act maps
  • Data residency respected
  • Delete-on-request workflows

Quality monitoring

  • Sampled human review queue
  • Regression eval on every deploy
  • Drift detectors for input + output
  • Per-cohort quality dashboards

Adjacent disciplines

Every production NLP surface leans on its neighbours. These are the disciplines we co-run on most engagements.

Text · voice · retrieval

Bring the corpus, bring the use case

Share the domain, the language mix, the volume and the compliance envelope. We come back with a retrieval architecture, model shortlist, eval suite and rollout plan inside ten working days.