Agent Memory Design Beyond the Chat History
- Chat history is not memory. It is a transcript that grows until the context window forces truncation.
- Working memory belongs in the prompt; episodic in a vector store; semantic in a structured database.
- Memory writes are riskier than memory reads. An agent that writes to memory needs validation before persistence.
- Forgetting is a feature, not a bug. An agent that remembers everything is an agent that confuses old facts with current ones.
A common pattern when teams first build agentic systems: they treat the chat history as memory. Each turn is appended to a transcript. When the context window fills, the oldest turns are truncated. The agent “remembers” until it doesn’t, and the failure is visible only when a user mentions something from a session three weeks ago and the agent has no idea what they’re talking about.
Memory is not chat history. Chat history is a transcript that gets truncated. Memory is a deliberate architecture that decides what to remember, where to store it, when to retrieve it, and when to forget. This piece is about the three memory shapes that solve different problems in production agentic systems, and how to know which one your agent actually needs.
Working memory
Working memory is the active state of the current task. It includes:
- The user’s request as currently understood
- The plan the agent is executing
- Intermediate results from tool calls so far
- Constraints, deadlines, and other context for the current operation
Working memory belongs inside the agent’s prompt as structured fields, not appended chat turns. A typed scratchpad keeps the agent oriented without polluting the context with conversational filler.
```python
# Working memory as a structured field, not chat turns
working_memory = {
    "current_task": "Issue refund for order #4421",
    "plan": [
        {"step": "verify_order_eligibility", "status": "done", "result": "eligible"},
        {"step": "compute_refund_amount", "status": "in_progress", "result": None},
        {"step": "execute_refund", "status": "pending", "result": None},
    ],
    "constraints": ["Customer is in tier-1 SLA", "Refund must clear within 24h"],
    "scratch": "Eligible because original purchase was within 30-day window.",
}
```
This shape works because:
- The agent always sees its current state explicitly, not buried in conversation
- The state is greppable, loggable, debuggable
- A truncation event (context overflow) cannot silently eat it, because working memory is bounded in size rather than growing with every turn
Working memory is reset between distinct tasks. The agent does not need to remember “I once issued a refund three weeks ago” while issuing today’s refund. Carrying old working memory forward is the most common source of agent confusion.
Episodic memory
Episodic memory holds prior conversations, sessions, or interactions. It exists when the agent needs to handle questions like “what did we discuss last week” or requests framed as “based on our previous interactions”.
The right storage shape is a vector store keyed by conversation or user. On every new turn, the agent retrieves the top-K most relevant prior episodes and includes summaries (not full transcripts) in its context.
```python
# Retrieve relevant prior conversations, not the full history
relevant_episodes = vector_store.search(
    query=user_current_message,
    filter={"user_id": current_user_id},
    top_k=3,
)

# Each episode has a precomputed summary, not the raw transcript
context_addition = "\n".join(ep.summary for ep in relevant_episodes)
```
Two design choices matter:
Summaries, not transcripts. The vector store holds precomputed summaries (1 to 3 sentences per episode), not raw chat. Storing transcripts means retrieval brings back unbounded text that crowds the context window. Summaries fit a fixed token budget.
Recency-weighted ranking. Pure vector similarity will resurface a five-year-old conversation that is no longer relevant. Hybrid ranking that combines similarity with a recency decay (score = similarity * exp(-age_days / 30)) keeps the memory current.
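A minimal sketch of that hybrid ranking, assuming each retrieved episode carries its vector-store similarity score and a creation timestamp (the `Episode` shape here is illustrative):

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Episode:            # hypothetical shape of a retrieved episode
    summary: str
    similarity: float     # cosine similarity returned by the vector store
    timestamp: datetime

def hybrid_score(ep: Episode, half_life_days: float = 30.0) -> float:
    """similarity * exp(-age_days / 30): stale episodes decay toward zero."""
    age_days = (datetime.now(timezone.utc) - ep.timestamp).days
    return ep.similarity * math.exp(-age_days / half_life_days)

now = datetime.now(timezone.utc)
episodes = [
    Episode("asked about a seasonal promotion", 0.92, now - timedelta(days=180)),
    Episode("asked about refund status", 0.85, now - timedelta(days=2)),
]

# Re-rank after retrieval: the recent episode outranks the more similar stale one
ranked = sorted(episodes, key=hybrid_score, reverse=True)
```

The decay constant (30 days here) is a tuning knob: shorter for fast-moving domains like support, longer for slow-moving ones like contract negotiation.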
Episodic memory has a retention policy. Most production systems set a TTL (90 days, 6 months, 1 year) and delete older episodes unless the user has opted in to longer retention. The retention policy is also a privacy property; document it.
Semantic memory
Semantic memory holds learned facts the agent should treat as ground truth. Customer name, contract terms, preferences, persistent state. This is structured data, not text.
The right storage is a regular database (Postgres, a key-value store, a document store). The agent retrieves specific fields by key, not by similarity search.
```python
# Semantic memory: structured, retrieved by key
user_facts = {
    "name": "Alex Mercer",
    "tier": "enterprise",
    "preferred_language": "en-GB",
    "communication_channel": "email",
    "active_subscription": {"plan": "prime", "renews_at": "2026-09-01"},
}
```
Why structured: free-form text “memory” gets summarised badly, retrieved imprecisely, and produces drift over time. Structured semantic memory is an authoritative record. The agent does not invent the user’s name; it looks it up.
The hardest part of semantic memory is the WRITE path. When does the agent learn a new fact? Who validates it? The default answer in production: an agent does not write to semantic memory directly. A deterministic step (a form submission, a confirmed action, a manual review) commits the fact. The agent reads, but it does not write.
When the agent must write (preference learning, dynamic personalisation), the write should land in a quarantine: a separate “proposed facts” table that requires either a verifier model’s approval or an idempotent confirmation before promotion to the main semantic store.
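The quarantine pattern can be sketched with two tables. The schema and function names below are illustrative, and SQLite is used only to keep the example self-contained; the same shape applies to Postgres or any relational store:

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE semantic_facts (user_id TEXT, key TEXT, value TEXT,
                                 PRIMARY KEY (user_id, key));
    CREATE TABLE proposed_facts (user_id TEXT, key TEXT, value TEXT,
                                 proposed_at TEXT, source TEXT);
""")

def propose_fact(user_id: str, key: str, value: str, source: str) -> None:
    """The agent's only write path: land in quarantine, never the main store."""
    db.execute(
        "INSERT INTO proposed_facts VALUES (?, ?, ?, ?, ?)",
        (user_id, key, value, datetime.now(timezone.utc).isoformat(), source),
    )

def promote_fact(user_id: str, key: str) -> None:
    """Called only after a verifier or explicit confirmation approves the proposal."""
    row = db.execute(
        "SELECT value FROM proposed_facts WHERE user_id = ? AND key = ?",
        (user_id, key),
    ).fetchone()
    if row is not None:
        db.execute("INSERT OR REPLACE INTO semantic_facts VALUES (?, ?, ?)",
                   (user_id, key, row[0]))
        db.execute("DELETE FROM proposed_facts WHERE user_id = ? AND key = ?",
                   (user_id, key))
```

The separation means a hallucinated fact can at worst sit in quarantine; it never becomes ground truth without a promotion step the team controls.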
Picking the right shape
The three shapes are not competing. A production agent often uses all three, for different concerns.
| Shape | Storage | Retrieval | Lifecycle |
|---|---|---|---|
| Working | In-prompt structured field | Always present during task | Reset between tasks |
| Episodic | Vector store with summaries | Top-K by hybrid similarity | TTL-bounded retention |
| Semantic | Structured database (KV or relational) | Lookup by key | Permanent, write-controlled |
The mistake teams make is using one shape for all three jobs. Stuffing semantic facts into the chat history (the agent forgets them next session). Using vector search for working state (slow, lossy). Using structured KV for episodic memory (you cannot semantic-search structured KV).
Memory failures we have seen
The amnesiac agent. No memory architecture beyond chat history. Forgets everything between sessions. Users perceive the agent as starting from zero each time.
The confused agent. Episodic memory is too aggressive; old conversations resurface as if they are current. The agent answers based on a deprecated promotion the user enquired about months ago.
The hoarding agent. Memory is written but never pruned. Vector store grows unbounded; retrieval quality degrades; cost climbs. After a year, the agent is slower and worse than it was at launch.
The contaminated agent. The agent writes its own hallucinations to semantic memory. Subsequent sessions retrieve the hallucination as fact and compound it. The fix requires retroactive cleanup of the memory store.
The leaking agent. Memory leaks across users. A retrieval query for user A returns facts about user B because the access control on the memory layer was not designed before the data went in.
Each of these is preventable with explicit design at the start. None of them are easy to fix retroactively.
What to install on day one
Every agent system that goes to production should have:
- A working-memory schema, documented and validated
- An episodic-memory store with a retention policy and per-user access control
- A semantic-memory schema with explicit write controls
- Observability on every memory read and every memory write
- A memory audit endpoint a human can call to see exactly what an agent remembers about a user
The audit endpoint is the most undervalued of these. Without it, the team cannot answer the user’s question “what do you know about me”, the privacy team cannot fulfil a GDPR request, and the engineering team cannot debug a memory-related incident. With it, all three become tractable.
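A minimal audit endpoint aggregates all three stores into a single response. The in-memory dicts below are stand-ins for the real semantic database and episode store; the accessor shape is an assumption for illustration:

```python
# In-memory stand-ins for the real stores (hypothetical interfaces)
semantic_db = {"u-42": {"name": "Alex Mercer", "tier": "enterprise"}}
episode_store = {"u-42": [{"summary": "asked about a refund", "created_at": "2026-01-10"}]}

def memory_audit(user_id: str) -> dict:
    """One endpoint answering 'what do you know about me'."""
    return {
        "user_id": user_id,
        "semantic_facts": semantic_db.get(user_id, {}),
        "episodes": episode_store.get(user_id, []),
        "retention_policy": "episodes deleted after 90 days",
    }
```

The same function serves three callers: the user-facing “what do you know about me” feature, the privacy team's deletion workflow, and the engineer debugging a memory incident.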
The teams that get agent memory right design it once at the start and revisit it on every model upgrade. The teams that get it wrong let it accrete from chat-history-plus-hacks and pay for the cleanup quarters later.
Questions teams ask
Can I just stuff more chat history into the context?
Up to a point. Context-window growth costs money, dilutes attention, and quietly degrades reasoning quality past roughly 30k to 50k tokens of context. Deliberate memory architecture beats raw history past that point.
Where should I store agent memory?
Working memory: in the prompt, as structured fields. Episodic: in a vector store keyed by conversation. Semantic: in a structured database (Postgres, KV) with named keys. Don't merge them; the access patterns and lifecycles differ.
How do I prevent the agent from contaminating its own memory?
Validate before persistence. An agent should not write directly to long-term memory; a deterministic validator should approve the write, or the write should land in a quarantine that requires explicit promotion.