Deep learning that ships
Neural networks are a means, not a brand. We pick architecture by task shape, build the training pipeline to be reproducible, and treat inference cost and latency as design constraints from the first experiment, not a post-hoc scramble.
ARCHITECTURE STACK
Six families, each with its own reason
Feed-forward still lands. Convolution still owns images at scale. Recurrent state still beats attention on some streaming signals. Transformers dominate the rest. Fine-tuning is the default before full training.
Feed-forward networks
Dense layers, back-propagation, the standard activation functions. Still the right baseline when the signal is unstructured and the dataset is small enough that anything deeper overfits.
- Dense · MLP
- ReLU · GELU · Swish
- Batch norm · layer norm
- Adam · AdamW · SGD momentum
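The baseline stack above fits in a few lines. A toy NumPy forward pass, assuming a dense → ReLU → layer-norm → dense head (the layer sizes and random weights are illustrative, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def layer_norm(x, eps=1e-5):
    # Normalise each row to zero mean / unit variance, as layer norm does
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp_forward(x, w1, b1, w2, b2):
    # Dense -> ReLU -> layer norm -> dense: the baseline stack named above
    h = layer_norm(relu(x @ w1 + b1))
    return h @ w2 + b2

x = rng.standard_normal((4, 16))            # batch of 4, 16 input features
w1, b1 = rng.standard_normal((16, 32)), np.zeros(32)
w2, b2 = rng.standard_normal((32, 3)), np.zeros(3)
logits = mlp_forward(x, w1, b1, w2, b2)
print(logits.shape)  # (4, 3)
```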
Convolutional networks
Translation-invariant feature detectors for images, spectrograms and structured 2D signals. ResNet and EfficientNet families are still production workhorses outside vision-LM territory.
- ResNet · ResNeXt · RegNet
- EfficientNet · ConvNeXt
- U-Net · DeepLab
- YOLO · DETR
Sequential networks
Recurrent state for text, audio and telemetry. Less fashionable since Transformers, but superior for some streaming / low-latency signals where attention's quadratic cost is a hard stop.
- LSTM · GRU · bi-directional
- Seq2seq · attention
- Neural ODE · state-space
- Mamba · S4 · S6
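The streaming advantage is visible in the recurrence itself: state per step is constant, regardless of how long the signal runs. A minimal linear state-space scan in NumPy (the A, B, C matrices here are random illustrative stand-ins, not a trained S4/Mamba parameterisation):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    # Discrete linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    # Constant memory per incoming sample, versus attention over the whole history.
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                      # one step per incoming sample
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                    # stable (contracting) state transition
B = rng.standard_normal(4)
C = rng.standard_normal(4)
y = ssm_scan(rng.standard_normal(32), A, B, C)
print(y.shape)  # (32,)
```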
Attention architectures
Self-attention over tokens. The substrate for modern language, vision (ViT, DINO) and multimodal models (CLIP, Flamingo, GPT-4V). Scaling laws and mixture-of-experts still unfold here.
- Encoder · decoder · enc-dec
- ViT · Swin · DeiT
- Mixture of experts
- RoPE · ALiBi · MLA
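Self-attention over tokens reduces to a few matrix products. A single-head NumPy sketch (widths and weights are illustrative; no causal mask, multi-head split or positional encoding):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Scaled dot-product attention over a (seq_len, d_model) token sequence
    q, k, v = x @ wq, x @ wk, x @ wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # (seq_len, seq_len) token-to-token weights
    return softmax(scores) @ v

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))          # 5 tokens, model width 8
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(tokens, wq, wk, wv)
print(out.shape)  # (5, 8)
```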
GAN · Diffusion · VAE
Distributions learned well enough to sample from. Diffusion now dominates image and audio; autoregressive transformers dominate text and code; hybrids (consistency, flow-matching) are rising.
- Stable Diffusion · SDXL · Flux
- VAE · VQ-VAE · VQGAN
- Consistency models
- Flow-matching
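The forward (noising) process that diffusion models learn to invert has a closed form. A toy NumPy sketch with a linear beta schedule (T and the beta range follow the standard DDPM setup; the 8×8 "latent" is a stand-in for an image):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)     # cumulative product a_bar_t

def add_noise(x0, t, rng):
    # Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))             # stand-in for an image latent
x_noisy = add_noise(x0, t=999, rng=rng)      # near-pure noise at the final step
print(x_noisy.shape)  # (8, 8)
```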
Fine-tune · adapter · LoRA
Pretrained weights plus targeted training. LoRA / QLoRA for parameter-efficient fine-tuning, full fine-tune when the task shifts domain, distillation to compress teacher into student.
- LoRA · QLoRA · DoRA
- Full fine-tune · DPO · PPO
- Distillation · knowledge transfer
- Adapter layers · prefix tuning
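LoRA's economics come from the rank: the frozen weight gets a trained low-rank update W + (α/r)·B·A, so only A and B carry gradients. A NumPy sketch (sizes illustrative; B is zero-initialised so the adapter starts as a no-op, as in the LoRA paper):

```python
import numpy as np

def lora_delta(a, b, alpha, rank):
    # Low-rank update added to a frozen weight: (alpha / r) * B @ A
    return (alpha / rank) * (b @ a)

d_in, d_out, rank, alpha = 1024, 1024, 8, 16
rng = np.random.default_rng(0)
w = rng.standard_normal((d_out, d_in))          # frozen pretrained weight
a = rng.standard_normal((rank, d_in)) * 0.01    # trained down-projection
b = np.zeros((d_out, rank))                     # trained up-projection, zero-init

w_adapted = w + lora_delta(a, b, alpha, rank)   # identical to w before training

full_params = d_in * d_out                      # 1,048,576
lora_params = rank * (d_in + d_out)             # 16,384: ~1.6% at rank 8
print(full_params, lora_params)
```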
No deep model outperforms its data. The data engineering discipline carries the collection, cleaning and labelling stack that deep learning feeds on.
TRAINING LIFECYCLE
Five stages from dataset to deployed weights
Every model we ship passes through the same reviewable sequence. Each stage ends with versioned artefacts, not just a commit message.
- 01 · Dataset
Curation, deduplication, contamination check against public eval sets. For custom data: licensing review, PII filter, quality scoring, held-out slices by source and cohort.
- 02 · Architecture
Pick the model family from task shape, not vintage. Set the hyperparameter search space: width, depth, LR schedule, batch size, warmup, regularisation. Sweep before commit.
- 03 · Train
Distributed training with FSDP or DeepSpeed ZeRO. Mixed precision (bf16), gradient checkpointing, activation offload. Checkpoint every N steps; rewind on instability.
- 04 · Evaluate
Offline suites, leakage-controlled holdout, slice analysis across cohorts. Human pairwise review for generative output. Drift baseline established on staging traffic.
- 05 · Ship
Quantisation (INT8, 4-bit GPTQ/AWQ), KV-cache tuning, batching strategy. vLLM or TensorRT-LLM for serving, SLO gates on latency and quality, canary rollout.
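The quantisation in the Ship stage trades one scale factor per tensor for 4× smaller weights. A minimal symmetric INT8 round-trip in NumPy (per-tensor scaling only; production GPTQ/AWQ work per-group and compensate quantisation error):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor INT8: one scale, integer values in [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()   # worst case is half a scale step
print(q.dtype, q.nbytes)                       # int8, 4x smaller than float32
```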
GENERATIVE SURFACES
Four domains where generative models reach production
Generation is not a demo. It is licensing, safety, latency and evaluation working together. These are the four surfaces where we take generative models all the way to accountable output.
Latent diffusion pipelines
Stable Diffusion, SDXL, Flux for text-to-image and image-to-image. ControlNet, LoRA adapters, IP-Adapter for conditioning. Evaluation on FID, human pairwise and downstream use-case metrics.
LLM training and alignment
Base-model pretraining at <1B parameters when the moat is the data, continued pretraining on domain corpora, SFT then DPO or PPO for alignment. Evals on closed-book and RAG settings.
Speech and music synthesis
Text-to-speech (XTTS, StyleTTS2), streaming synthesis for assistants, music continuation (MusicGen, Stable Audio). Multi-speaker cloning under licensing and consent guardrails.
Vision-language and video
CLIP-family encoders for retrieval, vision-language models for captioning and VQA, diffusion video (SVD, CogVideoX) for short-form generation with temporal consistency.
SCALE + SAFETY
Training at scale, serving at cost, governed on release
The hard part is not the first epoch. It is running a 70B-parameter training job across a spot-instance cluster without losing a week to restarts, then serving the result at a latency the product actually requires.
Distributed training
- FSDP · DeepSpeed ZeRO-3
- Tensor + pipeline parallelism
- Ray Train orchestration
- Spot-instance checkpointing
Inference optimisation
- vLLM · TensorRT-LLM · TGI
- INT8 · GPTQ · AWQ · 4-bit
- Speculative decoding
- KV-cache quantisation
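The KV cache behind several of these optimisations works by projecting only the newest token at each decode step and appending its key/value rows, so history is never recomputed. A single-head NumPy sketch (dimensions illustrative):

```python
import numpy as np

def decode_step(x_t, wk, wv, cache_k, cache_v):
    # Project only the newest token; prior keys/values stay cached
    k_t, v_t = x_t @ wk, x_t @ wv
    return (np.concatenate([cache_k, k_t[None]]),
            np.concatenate([cache_v, v_t[None]]))

rng = np.random.default_rng(0)
d = 8
wk, wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
cache_k = np.empty((0, d))                 # empty cache before the first token
cache_v = np.empty((0, d))

for _ in range(3):                         # three autoregressive decode steps
    token = rng.standard_normal(d)
    cache_k, cache_v = decode_step(token, wk, wv, cache_k, cache_v)

print(cache_k.shape)  # (3, 8): one cached row per generated token
```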
Evaluation & QA
- DeepEval · Ragas · LM-Eval
- LangSmith · Braintrust
- Red-team + jailbreak suites
- Pairwise human review
Safety & governance
- LlamaGuard · ShieldGemma
- PII + prompt-injection filters
- Model cards · training logs
- EU AI Act · SOC 2 mapping
PAPERS LEDGER
The eight papers we keep open on the desk
Not a literature review. The ideas below changed how we run deep learning in production. Every time someone new joins the team, they read these before touching an experiment.
AlexNet
Krizhevsky · Sutskever · Hinton
CNNs + GPU training broke the ImageNet benchmark and started the deep-learning wave.
GAN
Goodfellow et al.
Adversarial training opened generative modelling for images, audio and molecules.
ResNet
He · Zhang · Ren · Sun
Skip connections let networks train much deeper without vanishing gradients.
Transformer
Vaswani et al. ("Attention Is All You Need")
Self-attention replaced recurrence and became the substrate for modern language, vision and multimodal models.
GPT-3
Brown et al.
Scaling laws: bigger models + more data + few-shot prompting shifted what "learning" meant in practice.
DDPM
Ho · Jain · Abbeel
Denoising diffusion reframed generative modelling; Stable Diffusion and successors grew from this line.
LoRA
Hu et al.
Low-rank adapters made parameter-efficient fine-tuning the pragmatic default for domain adaptation.
Mamba
Gu · Dao
State-space models challenged the attention monopoly on long sequences with linear-time alternatives.
Adjacent disciplines
Deep learning never ships alone
Every production neural network leans on its neighbours. These are the disciplines we co-run on most engagements.
Machine Learning
Classical ML, tabular gradient boosting, feature engineering. Often the right baseline before a deep model is warranted.
Foundation · Data Engineering
Pipelines, lakehouses and the training-data layer that decides whether the model is worth training at all.
Applied · Computer Vision
Deep learning's oldest production surface. Detection, segmentation, OCR, medical imaging pipelines.
Umbrella · Artificial Intelligence
The wider discipline map. Deep learning is one of six pillars we operate across.
Have the data, need the weights?
Share the task, the data shape and the inference budget. We come back with an architecture shortlist, compute estimate and training plan inside ten working days. No benchmark leaderboards; only numbers tied to your use case.