Deep learning that ships

Neural networks are a means, not a brand. We pick architecture by task shape, build the training pipeline to be reproducible, and treat inference cost and latency as design constraints from the first experiment, not a post-hoc scramble.

[Figure: layered neural network with skip connections and a training-loop gradient sweep — layers 256 → 1024 → 1024 → 10, back-prop with AdamW, bf16, lr 3e-4; loss 3.1 → 0.42 over 60 epochs]

ARCHITECTURE STACK

Six families, each with its own reason

Feed-forward still lands. Convolution still owns images at scale. Recurrent state still beats attention on some streaming signals. Transformers dominate the rest. Fine-tuning is the default before full training.

1960s+ ANN

Feed-forward networks

Dense layers, back-prop, early activation functions. Still the right baseline when the signal has no spatial or sequential structure and the dataset is small enough that anything deeper overfits.

  • Dense · MLP
  • ReLU · GELU · Swish
  • Batch norm · layer norm
  • Adam · AdamW · SGD momentum
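The baseline really is this small: a dense stack is matrix multiplies with a non-linearity between them. A NumPy sketch with toy, made-up layer sizes (forward pass only, no training loop):

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass through a dense feed-forward stack with ReLU
    between layers; no activation on the output layer."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)   # ReLU
    return h

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]               # input, two hidden layers, output
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
out = mlp_forward(rng.standard_normal((4, 8)), weights, biases)
print(out.shape)  # (4, 3): batch of 4, three outputs each
```

Everything deeper in this stack is a variation on that loop: different connectivity, different activations, cleverer optimisers.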

1998+ CNN

Convolutional networks

Translation-invariant feature detectors for images, spectrograms and structured 2D signals. ResNet and EfficientNet families are still production workhorses outside vision-LM territory.

  • ResNet · ResNeXt · RegNet
  • EfficientNet · ConvNeXt
  • U-Net · DeepLab
  • YOLO · DETR
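The translation invariance is mechanical: the same small kernel slides over every position. A minimal valid cross-correlation (what deep-learning frameworks call convolution), with a hand-built vertical-edge detector as a hypothetical filter:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D cross-correlation: dot-product the kernel
    against every patch of the image."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)   # vertical-edge detector
img = np.zeros((6, 6)); img[:, 3:] = 1.0  # step edge at column 3
resp = conv2d(img, edge)
print(resp.shape)        # (4, 4)
print(abs(resp).max())   # 3.0 — strongest response sits on the edge
```

A trained CNN learns thousands of such kernels instead of hand-writing one; the sliding-window arithmetic is identical.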

1997+ RNN · LSTM · GRU

Sequential networks

Recurrent state for text, audio and telemetry. Less fashionable since Transformers, but superior for some streaming / low-latency signals where attention's quadratic cost is a hard stop.

  • LSTM · GRU · bi-directional
  • Seq2seq · attention
  • Neural ODE · state-space
  • Mamba · S4 · S6
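The streaming advantage is visible in the update rule: one recurrent step touches only the current input and a fixed-size hidden state, so memory stays constant however long the stream runs. A toy GRU cell with random weights and illustrative dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h, x, params):
    """One GRU step: update gate z, reset gate r, candidate state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # how much to update
    r = sigmoid(x @ Wr + h @ Ur)              # how much past state to expose
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(6)
d_in, d_h = 4, 8
params = [rng.standard_normal(s) * 0.3
          for s in [(d_in, d_h), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for x in rng.standard_normal((10, d_in)):  # consume a 10-step stream
    h = gru_step(h, x, params)
print(h.shape)  # (8,) — same size after 10 steps or 10 million
```

Compare attention, which must keep (and score against) every past token: the recurrent cell's per-step cost never grows.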

2017+ Transformer

Attention architectures

Self-attention over tokens. The substrate for modern language, vision (ViT, DINO) and multimodal models (CLIP, Flamingo, GPT-4V). Scaling laws and mixture-of-experts still unfold here.

  • Encoder · decoder · enc-dec
  • ViT · Swin · DeiT
  • Mixture of experts
  • RoPE · ALiBi · MLA
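The core operation is small enough to write out: every token's query is scored against every token's key, the scores are softmaxed, and the values are mixed by those weights — which is also where the quadratic (seq × seq) cost comes from. A single-head sketch with made-up dimensions, no mask, no multi-head split:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over x of shape (seq, d_model)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (seq, seq): the quadratic part
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # each row is a distribution
    return w @ v, w

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))                   # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) * 0.2 for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
print(out.shape)   # (5, 8)
print(attn.shape)  # (5, 5) — one weight per token pair
```

Production variants add causal masking, multiple heads, RoPE-style position encoding and a KV cache; the scoring-and-mixing core stays this.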

2014+ Generative

GAN · Diffusion · VAE

Distributions learned well enough to sample from. Diffusion now dominates image and audio; autoregressive transformers dominate text and code; hybrids (consistency, flow-matching) are rising.

  • Stable Diffusion · SDXL · Flux
  • VAE · VQ-VAE · VQGAN
  • Consistency models
  • Flow-matching
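The diffusion training signal comes from a closed-form forward process: the noisy sample at any timestep is a known mix of clean data and Gaussian noise, so the model can be trained to predict the noise directly. A 1-D sketch with the standard linear beta schedule (real pipelines noise image latents, not small vectors):

```python
import numpy as np

def noisy_sample(x0, t, abar, rng):
    """q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I):
    sample x_t in one shot and return the noise the model must predict."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
    return xt, eps

betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule from the DDPM paper
abar = np.cumprod(1.0 - betas)         # cumulative signal fraction per step
rng = np.random.default_rng(5)
x0 = rng.standard_normal(16)           # stand-in for a clean latent
xt, eps = noisy_sample(x0, 999, abar, rng)
# abar[0] is ~0.9999 (barely noised); abar[999] is tiny (almost pure noise)
```

Sampling runs this in reverse: start from noise and iteratively subtract the predicted noise; consistency and flow-matching methods collapse those many reverse steps into few.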

2019+ Transfer

Fine-tune · adapter · LoRA

Pretrained weights plus targeted training. LoRA / QLoRA for parameter-efficient fine-tuning, full fine-tune when the task shifts domain, distillation to compress teacher into student.

  • LoRA · QLoRA · DoRA
  • Full fine-tune · DPO · PPO
  • Distillation · knowledge transfer
  • Adapter layers · prefix tuning
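The LoRA trick in one expression: freeze the pretrained weight W and learn a low-rank update AB scaled by alpha/r, cutting trainable parameters from d_in·d_out down to r·(d_in + d_out). A NumPy sketch with illustrative dimensions and alpha:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Frozen base weight W plus low-rank update (alpha/r) * A @ B.
    Only A (d_in, r) and B (r, d_out) would be trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

rng = np.random.default_rng(2)
d_in, d_out, r = 64, 64, 4
W = rng.standard_normal((d_in, d_out)) * 0.02  # pretrained, stays frozen
A = rng.standard_normal((d_in, r)) * 0.01
B = np.zeros((r, d_out))                       # B starts at zero
x = rng.standard_normal((3, d_in))

# At init the adapter is a no-op: the model starts exactly at the base weights.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

That zero-init of B is the standard trick: training begins from the pretrained behaviour and the adapter only moves it as gradients accumulate. QLoRA adds a quantised base model; DoRA reparameterises the update.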

Foundation first

No deep model outperforms its data. The data engineering discipline carries the collection, cleaning and labelling stack that deep learning feeds on.

Open data engineering ↗

TRAINING LIFECYCLE

Five stages from dataset to deployed weights

Every model we ship passes through the same reviewable sequence. Each stage ends with versioned artefacts, not just a commit message.

  1. Dataset

    Curation, deduplication, contamination check against public eval sets. For custom data: licensing review, PII filter, quality scoring, held-out slices by source and cohort.

  2. Architecture

    Pick the model family from task shape, not vintage. Set hyperparameter search space: width, depth, LR schedule, batch size, warmup, regularisation. Sweep before commit.

  3. Train

    Distributed training with FSDP or DeepSpeed ZeRO. Mixed precision (bf16), gradient checkpointing, activation offload. Checkpoint every N steps; rewind on instability.

  4. Evaluate

    Offline suites, leakage-controlled holdout, slice analysis across cohorts. Human pairwise review for generative output. Drift baseline established on staging traffic.

  5. Ship

    Quantisation (INT8, 4-bit GPTQ/AWQ), KV-cache tuning, batching strategy. vLLM or TensorRT-LLM for serving, SLO gates on latency and quality, canary rollout.
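The dataset stage's deduplication can start as simply as exact content hashing; production pipelines add near-duplicate detection (e.g. MinHash) and n-gram overlap checks against the eval sets. A minimal sketch of the exact-dedup pass:

```python
import hashlib

def dedup(records):
    """Keep the first occurrence of each record, comparing by a
    hash of lightly normalised text (strip + lowercase)."""
    seen, out = set(), []
    for text in records:
        h = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(text)
    return out

docs = ["The cat sat.", "the cat sat.", "A new doc."]
print(dedup(docs))  # ['The cat sat.', 'A new doc.']
```

Hashing normalised text rather than raw bytes is a judgment call: it catches trivial case/whitespace duplicates while staying cheap enough to run over billions of records.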

GENERATIVE SURFACES

Four domains where generative models reach production

Generation is not a demo. It is licensing, safety, latency and evaluation working together. These are the four surfaces where we take generative models to accountable output.

Image

Latent diffusion pipelines

Stable Diffusion, SDXL, Flux for text-to-image and image-to-image. ControlNet, LoRA adapters, IP-Adapter for conditioning. Evaluation on FID, human pairwise and downstream use-case metrics.

Language

LLM training and alignment

Base-model pretraining at <1B parameters when the moat is the data, continued pretraining on domain corpora, SFT then DPO or PPO for alignment. Evals on closed-book and RAG settings.

Audio

Speech and music synthesis

Text-to-speech (XTTS, StyleTTS2), streaming synthesis for assistants, music continuation (MusicGen, Stable Audio). Multi-speaker cloning under licensing and consent guardrails.

Multimodal

Vision-language and video

CLIP-family encoders for retrieval, vision-language models for captioning and VQA, diffusion video (SVD, CogVideoX) for short-form generation with temporal consistency.

SCALE + SAFETY

Training at scale, serving at cost, governed on release

The hard part is not the first epoch. It is running a 70B-parameter training job across a spot-instance cluster without losing a week to restarts, then serving the result at a latency the product actually requires.

Distributed training

  • FSDP · DeepSpeed ZeRO-3
  • Tensor + pipeline parallelism
  • Ray Train orchestration
  • Spot-instance checkpointing

Inference optimisation

  • vLLM · TensorRT-LLM · TGI
  • INT8 · GPTQ · AWQ · 4-bit
  • Speculative decoding
  • KV-cache quantisation
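Behind the INT8 entry: symmetric per-tensor quantisation maps the largest weight magnitude to 127 and rounds everything else onto that grid, which bounds the reconstruction error by half a quantisation step. A sketch of that simplest scheme (GPTQ and AWQ do considerably more, e.g. per-channel scales and error compensation):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantisation: scale so the max
    magnitude maps to 127, then round to the nearest integer level."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(4)
w = (rng.standard_normal((256, 256)) * 0.05).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale   # dequantise to measure error
err = np.abs(w - w_hat).max()
# worst-case error is half a step: err <= scale / 2
```

The weight matrix now costs one byte per entry plus a single float scale, a 4x cut versus float32 before any further tricks.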

Evaluation & QA

  • DeepEval · Ragas · LM-Eval
  • LangSmith · Braintrust
  • Red-team + jailbreak suites
  • Pairwise human review

Safety & governance

  • LlamaGuard · ShieldGemma
  • PII + prompt-injection filters
  • Model cards · training logs
  • EU AI Act · SOC 2 mapping

PAPERS LEDGER

The eight papers we keep open on the desk

Not a literature review. The ideas below changed how we run deep learning in production. Every time someone new joins the team, they read these before touching an experiment.

2012

AlexNet

Krizhevsky · Sutskever · Hinton

CNNs + GPU training broke the ImageNet benchmark and started the deep-learning wave.

2014

GAN

Goodfellow et al.

Adversarial training opened generative modelling for images, audio and molecules.

2015

ResNet

He · Zhang · Ren · Sun

Skip connections let networks train much deeper without vanishing gradients.

2017

Transformer

Vaswani et al. ("Attention Is All You Need")

Self-attention replaced recurrence and became the substrate for modern language, vision and multimodal models.

2020

GPT-3

Brown et al.

Scaling laws: bigger models + more data + few-shot prompting shifted what "learning" meant in practice.

2020

DDPM

Ho · Jain · Abbeel

Denoising diffusion reframed generative modelling; Stable Diffusion and successors grew from this line.

2021

LoRA

Hu et al.

Low-rank adapters made parameter-efficient fine-tuning the pragmatic default for domain adaptation.

2023

Mamba

Gu · Dao

State-space models challenged the attention monopoly on long sequences with linear-time alternatives.

Adjacent disciplines

Every production neural network leans on its neighbours. These are the disciplines we co-run on most engagements.

Train · fine-tune · serve

Have the data, need the weights

Share the task, the data shape and the inference budget. We come back with an architecture shortlist, compute estimate and training plan inside ten working days. No benchmark leaderboards; only numbers tied to your use case.