"Order arrived but the charger was missing. Third time this month. Considering returning the whole thing."
complaint · missing_item
Most NLP wins collapse on messy, multilingual, domain-specific input. We tune encoders on in-domain data, pair LLMs with retrieval instead of trusting open-book inference, and ship speech with diarisation and noise guards. Evaluation is continuous, not a launch checklist.
CAPABILITY MAP
Each capability stands on its own, and every production build ends up combining at least three of them. The map below is how we split the engagement into named deliverables.
Intent detection, review mining, moderation queues, ticket routing. At production scale, fine-tuned encoders still outperform zero-shot LLM prompts.
Text, sentence and domain-tuned embeddings. The retrieval layer under modern assistants, semantic search and de-duplication systems.
LLMs over your corpus, with citations. Chunking strategy, reranking, freshness control, access propagation to tenants. Accuracy is a retrieval problem first, a model problem second.
Multi-turn dialogue with tool use, memory and guardrails. Named capabilities, explicit failure modes, audit trail per action. Designed around the handoff to a human.
Domain-tuned machine translation, long-form abstractive summarisation, style-preserving rewriting. Evaluation on semantic similarity and human pairwise, not BLEU alone.
Streaming transcription with diarisation, neural voice synthesis with licensing and consent guardrails, accessibility-grade captions with punctuation and formatting.
PIPELINE
The real work is before and after training. Corpus curation sets the ceiling; eval suites decide whether the release ships. The stages below are the reviewable sequence we run on every language engagement.
Domain data collection, licensing review, deduplication, PII filter, contamination check against public eval sets. Held-out slices by source, cohort and language.
Tokeniser choice (BPE, WordPiece, Unigram) matched to domain. Custom vocabulary extensions for code, chemistry, clinical or legal corpora where generic tokenisers bleed budget.
Encoder, decoder or encoder-decoder selection. Parameter-efficient fine-tuning (LoRA, DoRA) as default; full fine-tune when the task shifts domain heavily; continued pretraining for language expansion.
Offline suites (MMLU, domain-held-out), RAG-aware evals (Ragas), jailbreak and prompt-injection suites, human pairwise review. Fairness across language and demographic slices.
vLLM or TGI for LLMs, ONNX Runtime for compact encoders. Streaming for UX, batching for throughput, cost-per-turn dashboards on every production surface.
Closed-book LLM answers drift. Retrieval-grounded QA with citations survives the internal audit. The difference is a vector store and a reranker, both budgeted from day one.
WHERE WE SHIP
Production NLP is not a single product. Support, legal, clinical, developer tools, commerce and media each impose their own latency, privacy and review constraints. Below, the sectors we have scars in.
Ticket triage, response drafting, knowledge-base answering with citation. Human approval on anything that touches an account, refund or compliance boundary.
Contract review, clause extraction, disclosure parsing, reg-change monitoring. Retrieval with provenance is the baseline; anything less fails the internal audit.
Transcription, SOAP-note drafting, code suggestion (ICD, CPT) with clinician review. PHI handling under HIPAA-ready architecture, deployment to in-region infra.
Code completion, review augmentation, documentation generation, bug triage. Tuned to the internal stack, wired into existing IDEs and review surfaces.
Semantic product search, review summarisation, description generation, multilingual catalogs. Evaluated on click-through and conversion, not offline rank metrics alone.
Streaming captions, dubbing, voice assistants, podcast editing. Consent and licensing are the design, not the afterthought.
WHAT IT LOOKS LIKE
Every capability on the map has a measurable shape. The cards below are simplified but real: actual input forms on the left, actual output shape on the right, with the metrics we track in the meta line.
"Order arrived but the charger was missing. Third time this month. Considering returning the whole thing."
complaint · missing_item
query: "waterproof hiking boots for wide feet"
Salomon X Ultra 4 Wide GTX
"Can a client cancel inside the 14-day window under the new contract template?"
Yes, Section 4.2 of Template v3.1 grants a 14-day withdrawal window for all non-milestoned engagements.
A 1,800-word quarterly review covering six workstreams and their release notes.
Three workstreams shipped to production, one slipped a sprint on compliance, two remain in staging with dependency risk.
11-minute kick-off recording, two speakers, light background noise.
Speaker A: let's start with scope. Speaker B: the main unknown is data residency…
SAFETY + QUALITY
Every production LLM surface is under quiet adversarial load. The four areas below cover what we build before launch and keep running afterwards.
Adjacent disciplines
Every production NLP surface leans on its neighbours. These are the disciplines we co-run on most engagements.
Transformer training, fine-tuning, alignment. The foundation modern NLP stands on.
FoundationCorpora, licensing, PII handling, residency. The layer that decides whether the model is compliant.
ClassicalClassification, ranking and time-aware baselines. Often the leaner answer than an LLM.
UmbrellaThe full discipline map. NLP is one of six pillars we operate across.
Share the domain, the language mix, the volume and the compliance envelope. We come back with a retrieval architecture, model shortlist, eval suite and rollout plan inside ten working days.