Agent Memory Systems in Production: Persistent Context Across Sessions

The bottom line: A stateless LLM call can return good output on its own. An agent that needs to learn from past interactions, survive restarts, or compound knowledge across sessions needs a memory system. In 2026 the field has converged on a two-tier architecture: thread-scoped checkpointing for conversation continuity and semantic memory for cross-session knowledge [1]. This guide walks through implementing both tiers with code you can adapt for production.
The Two-Tier Memory Model
Agent memory in 2026 breaks into two distinct persistence layers that serve different purposes:
| Layer | Scope | Backend | Use Case |
|---|---|---|---|
| Checkpoint store | Single thread/session | SQLite, Postgres, Redis | Conversation continuity, resumability, time travel |
| Semantic memory | Cross-session, long-term | Mem0, Zep, Chroma, pgvector | User preferences, learned facts, domain knowledge |
Checkpoint stores are write-heavy and low-latency — every agent step writes a checkpoint. Semantic memory is query-heavy — you write once, retrieve many times. Treating them as the same store is the #1 production mistake [2].
Why Not Just RAG?
RAG retrieves from a static corpus. Agent memory retrieves from the agent’s own interaction history — past decisions, user corrections, task outcomes. The retrieval pattern is the same (embed + vector search), but the data source and update cadence are fundamentally different [3]. A true memory system writes new entries from every agent interaction; RAG systems write once and index.
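The difference in update cadence can be sketched with a toy in-memory store. This is only an illustration, not how Mem0 or any vector database works: the `ToyMemory` class, its bag-of-words `_embed` function, and the sample strings are hypothetical stand-ins for a real embedding model and vector index.

```python
import math
from collections import Counter


def _embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """Hypothetical store that, unlike a static RAG corpus, grows on every interaction."""

    def __init__(self) -> None:
        self.entries = []  # list of (text, vector) pairs

    def write(self, text: str) -> None:
        self.entries.append((text, _embed(text)))

    def search(self, query: str, limit: int = 3) -> list:
        q = _embed(query)
        scored = sorted(self.entries, key=lambda e: _cosine(q, e[1]), reverse=True)
        # Drop entries that share no tokens with the query
        return [text for text, vec in scored[:limit] if _cosine(q, vec) > 0]


memory = ToyMemory()
# Each agent interaction appends a new entry; nothing is indexed up front
memory.write("user prefers python for data pipelines")
memory.write("user corrected the sql join pattern")
top = memory.search("which language for the data pipelines", limit=1)
```

The retrieval path (embed, score, take the top results) is what a RAG system would also do. The part that makes this memory is that `write` runs during normal agent operation.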
Tier 1: Thread-Scoped Checkpointing with LangGraph
LangGraph’s persistence layer is the most mature open-source implementation of thread-scoped agent memory [4]. Here’s the production setup.
SQLite Checkpointer (Single-Node, Dev to Low-Traffic Prod)
import operator
import sqlite3
from typing import Annotated, Sequence, TypedDict

from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import StateGraph


class AgentState(TypedDict):
    # operator.add appends each node's returned messages to the stored history
    messages: Annotated[Sequence[dict], operator.add]
    next_step: str


# In recent langgraph-checkpoint-sqlite releases, from_conn_string() is a context
# manager, so this sketch passes a plain sqlite3 connection to the constructor
conn = sqlite3.connect("checkpoints.db", check_same_thread=False)
saver = SqliteSaver(conn)

graph_builder = StateGraph(AgentState)
# ... add nodes and edges ...
graph = graph_builder.compile(checkpointer=saver)

# Thread ID ties checkpoints to a conversation session;
# each conversation gets a unique thread_id
config = {"configurable": {"thread_id": "user-session-abc-123"}}
result = graph.invoke({"messages": [{"role": "user", "content": "Analyze this dataset"}]}, config)

# Agent survives crash: passing None as input resumes the thread
# from its last checkpoint instead of starting a new run
result = graph.invoke(None, config)
SQLite handles up to ~10 concurrent agents comfortably. Beyond that, switch to Postgres [5].
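To make the thread_id mechanics concrete, here is a rough sketch of what a thread-scoped checkpoint store does, written against the standard-library sqlite3 module. It is not LangGraph's schema or implementation: the `toy_checkpoints` table and the `save_checkpoint` and `load_latest` helpers are hypothetical names chosen for illustration.

```python
import json
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_checkpoints ("
    " thread_id TEXT, step INTEGER, state TEXT,"
    " PRIMARY KEY (thread_id, step))"
)


def save_checkpoint(thread_id: str, state: dict) -> int:
    """Write the next step for this thread and return its step number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(step), 0) FROM toy_checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    step = row[0] + 1
    conn.execute(
        "INSERT INTO toy_checkpoints VALUES (?, ?, ?)",
        (thread_id, step, json.dumps(state)),
    )
    conn.commit()
    return step


def load_latest(thread_id: str) -> Optional[dict]:
    """Return the most recent state for a thread, or None if it has no checkpoints."""
    row = conn.execute(
        "SELECT state FROM toy_checkpoints WHERE thread_id = ? ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None


# Every agent step writes a checkpoint under the same thread_id
save_checkpoint("user-session-abc-123", {"messages": ["Analyze this dataset"]})
save_checkpoint("user-session-abc-123", {"messages": ["Analyze this dataset", "Loaded 3 columns"]})

# After a restart, the thread is rebuilt from its latest row
restored = load_latest("user-session-abc-123")
```

Because every step is an insert, the write path dominates. That is the reason the checkpoint store is described above as write-heavy.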
Postgres Checkpointer (Multi-Node, Production)
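import psycopg
from psycopg.rows import dict_row

from langgraph.checkpoint.postgres import PostgresSaver

# When you create the connection yourself, PostgresSaver expects
# autocommit to be enabled and rows to be returned as dicts
conn = psycopg.connect(
    "postgresql://user:pass@host:5432/agent_state",
    application_name="agent-checkpointer",
    autocommit=True,
    row_factory=dict_row,
)
saver = PostgresSaver(conn)

# Run migration once
saver.setup()

graph = graph_builder.compile(checkpointer=saver)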
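Postgres checkpoints scale to hundreds of concurrent threads. Each checkpoint stores the full graph state, about 2-8 KB per step depending on message history [6]. Set up a TTL cleanup job. The query below assumes a thread_stats table that your application maintains to record when each thread was last updated; treat it as something you create yourself rather than part of the checkpointer's schema:
-- 7-day checkpoint retention
-- thread_stats is assumed to be maintained by your application
DELETE FROM checkpoints
WHERE thread_id IN (
    SELECT thread_id FROM thread_stats
    WHERE updated_at < NOW() - INTERVAL '7 days'
);
-- Repeat for any companion tables your checkpointer version creates
-- (for example checkpoint_writes and checkpoint_blobs) so rows are not orphaned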
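The retention logic can be tried locally before pointing it at Postgres. The sketch below uses an in-memory SQLite database as a stand-in, with hypothetical `toy_checkpoints` and `thread_stats` tables and a fixed clock so the result is repeatable. None of these names come from LangGraph.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE thread_stats (thread_id TEXT PRIMARY KEY, updated_at TEXT);
    CREATE TABLE toy_checkpoints (thread_id TEXT, step INTEGER, state TEXT);
    """
)

# Fixed "now" so the example does not depend on the wall clock
now = datetime(2026, 6, 24, tzinfo=timezone.utc)


def touch(thread_id: str, when: datetime) -> None:
    """Record activity for a thread and write one checkpoint row for it."""
    conn.execute(
        "INSERT OR REPLACE INTO thread_stats VALUES (?, ?)",
        (thread_id, when.isoformat()),
    )
    conn.execute("INSERT INTO toy_checkpoints VALUES (?, 1, '{}')", (thread_id,))


def purge_expired(retention_days: int, as_of: datetime) -> int:
    """Delete checkpoints of threads idle longer than the retention window."""
    # ISO-8601 strings with the same UTC offset sort chronologically
    cutoff = (as_of - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM toy_checkpoints WHERE thread_id IN "
        "(SELECT thread_id FROM thread_stats WHERE updated_at < ?)",
        (cutoff,),
    )
    deleted = cur.rowcount
    # Remove the bookkeeping rows only after the checkpoints are gone
    conn.execute("DELETE FROM thread_stats WHERE updated_at < ?", (cutoff,))
    conn.commit()
    return deleted


touch("stale-thread", now - timedelta(days=10))
touch("active-thread", now - timedelta(days=1))
deleted = purge_expired(retention_days=7, as_of=now)
remaining = [r[0] for r in conn.execute("SELECT thread_id FROM toy_checkpoints")]
```

Deleting by thread rather than by individual checkpoint keeps an active thread's full history intact, which matters if you rely on replaying earlier steps.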
What You Get From Checkpointing
- Resumability — crash recovery, no lost state
- Human-in-the-loop — pause at approval gates, resume when human responds [7]
- Time travel — replay from any checkpoint for debugging
- History pruning — trim old messages while keeping thread continuity
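Resumability and time travel both fall out of keeping every intermediate state. The sketch below shows the idea with a plain Python list as the checkpoint log; `history`, `step`, and `rewind` are hypothetical names for illustration and are not LangGraph functions.

```python
from copy import deepcopy

history = []  # hypothetical in-memory checkpoint log for a single thread


def step(state: dict, message: str) -> dict:
    """Apply one agent step and record a checkpoint of the resulting state."""
    new_state = {"messages": state["messages"] + [message]}
    history.append(deepcopy(new_state))
    return new_state


def rewind(to_step: int) -> dict:
    """Return a copy of the state as it was after step `to_step` (1-indexed)."""
    return deepcopy(history[to_step - 1])


state = {"messages": []}
for msg in ["load data", "clean data", "train model"]:
    state = step(state, msg)

# Time travel: branch from the state after step 2 and try a different third step
forked = step(rewind(2), "plot data")
```

The original run is untouched by the fork, which is what makes replaying from an earlier checkpoint safe for debugging.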
Tier 2: Semantic Memory with Mem0
Mem0 describes itself as the most widely deployed semantic memory layer for AI agents in 2026, citing 21 framework integrations and production benchmarks across 10 memory strategies [1].
Basic Setup
pip install mem0ai
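from mem0 import Memory

# Uses OpenAI embeddings by default;
# configure for your provider
m = Memory()

# Add memory from agent interaction. The text is passed positionally because
# the keyword name has changed across releases (data in early versions, messages later)
result = m.add(
    "User prefers detailed technical explanations with code examples",
    user_id="user-abc-123",
    metadata={"source": "onboarding", "confidence": 0.95},
)
# The return shape varies by Mem0 version; recent releases return a dict
# with a "results" list describing the memories that were added or updated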
Retrieval-Augmented Agent Loop
This is the core pattern — every agent turn queries semantic memory and injects relevant context:
from openai import OpenAI
from mem0 import Memory

client = OpenAI()
memory = Memory()


def agent_with_memory(user_input: str, user_id: str) -> str:
    # 1. Retrieve relevant memories
    found = memory.search(
        query=user_input,
        user_id=user_id,
        limit=5
    )
    # Recent Mem0 releases wrap hits in {"results": [...]}; older ones return a list
    hits = found["results"] if isinstance(found, dict) else found

    # 2. Build system prompt with memory context
    # (the text of each hit is under "memory" in current releases; adjust if yours differs)
    memory_context = "\n".join([
        f"- {hit['memory']}" for hit in hits
    ])
    system_prompt = f"""You are a helpful assistant with memory of past interactions.
Previous context about this user:
{memory_context}
Use this context when responding. If the user contradicts
previous information, trust the current input."""

    # 3. Generate response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]
    )
    answer = response.choices[0].message.content

    # 4. Extract and store new memories (text passed positionally)
    memory.add(
        f"User asked: {user_input}\nAssistant answered: {answer}",
        user_id=user_id
    )
    return answer
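The four steps of this loop can be exercised without API keys by injecting the memory backend and the model as plain callables. The sketch below is one way to structure that for testing; `run_turn`, `fake_search`, and `fake_generate` are hypothetical names, and the fakes only mimic the shape of a memory search and an LLM call.

```python
from typing import Callable, List


def run_turn(
    user_input: str,
    search: Callable[[str], List[str]],
    generate: Callable[[str, str], str],
    store: Callable[[str], None],
) -> str:
    # 1. Retrieve relevant memories
    memories = search(user_input)
    # 2. Build the system prompt with memory context
    context = "\n".join(f"- {m}" for m in memories) or "- (none yet)"
    system_prompt = f"Previous context about this user:\n{context}"
    # 3. Generate a response
    answer = generate(system_prompt, user_input)
    # 4. Store the exchange as a new memory
    store(f"User asked: {user_input}\nAssistant answered: {answer}")
    return answer


# Fakes standing in for a semantic memory backend and an LLM
stored = ["User prefers Python over TypeScript"]
seen_prompts = []


def fake_search(query: str) -> List[str]:
    return list(stored)


def fake_generate(system_prompt: str, user_input: str) -> str:
    seen_prompts.append(system_prompt)
    return f"echo: {user_input}"


answer = run_turn("Which language should I use?", fake_search, fake_generate, stored.append)
```

Separating the loop from its backends also makes it easier to swap providers later, since only the three callables change.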
Memory Types Mem0 Handles
| Type | Example | Retention |
|---|---|---|
| Factual preference | “User prefers Python over TypeScript” | Permanent until changed |
| Procedural | “User already has AWS credentials configured” | Session + stored |
| Interaction summary | “Completed data pipeline setup — moved to deployment” | Ephemeral, summarized |
| Feedback | “User corrected the SQL join pattern” | High priority, short retention |
Mem0 handles implicit memory extraction — it decides what to store and when to update, freeing you from manual recall logic [8].
The Summarization Loop: Keeping Context Under Budget
Even with memory, context windows fill up. The production pattern is a summarization loop that compresses old turns:
from mem0 import Memory

memory = Memory()


def summarize_and_prune(state: AgentState) -> dict:
    messages = state["messages"]
    # Below the threshold there is nothing to compress, so return no update
    if len(messages) <= 20:
        return {}

    old_messages = messages[:-10]  # everything except the last 10 messages
    recent = messages[-10:]
    # In practice, call an LLM to summarize (helper left for you to implement)
    summary = _call_llm_to_summarize(old_messages)
    # Store summary in semantic memory; user_id must be a key in your state schema
    memory.add(
        f"Conversation summary: {summary}",
        user_id=state.get("user_id", "default"),
        metadata={"type": "summary", "turn_count": len(old_messages)},
    )
    # Return pruned history with the summary as a system message.
    # Caveat: with the operator.add reducer on messages, this list is appended to
    # the stored history rather than replacing it. Pruning only takes effect with a
    # reducer that supports replacement or removal (for example LangGraph's
    # add_messages together with RemoveMessage).
    return {
        "messages": [
            {"role": "system", "content": f"Previous context: {summary}"},
            *recent,
        ]
    }
Run this node before the LLM call node in your graph. Benchmarks show it keeps 95%+ task accuracy while halving token costs [1].
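One detail worth checking before relying on this node: whether its return value replaces the message history or is appended to it depends on the reducer declared on the state. The sketch below mimics that merge step in plain Python. `Replace`, `replace_or_append`, and `apply_update` are hypothetical names and not LangGraph APIs; they only illustrate why an append-only reducer cannot shrink history.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class Replace:
    """Hypothetical marker meaning 'overwrite the history with these messages'."""
    messages: list


def replace_or_append(current: list, update) -> list:
    """Reducer that appends by default but honors an explicit Replace."""
    if isinstance(update, Replace):
        return list(update.messages)
    return current + list(update)


def apply_update(current: list, update, reducer: Callable) -> list:
    """Mimics a framework merging a node's returned value into stored state."""
    return reducer(current, update)


history = [f"turn {i}" for i in range(1, 25)]       # 24 stored messages
pruned = ["summary of turns 1-14", *history[-10:]]  # what the prune node wants to keep

# Append-only reducer: the "pruned" list is added on top of the full history
appended = apply_update(history, pruned, operator.add)

# Replace-aware reducer: the history actually shrinks
replaced = apply_update(history, Replace(pruned), replace_or_append)
```

With the append-only reducer the history grows to 35 entries; with the replace-aware one it drops to 11. Whatever framework you use, confirm which behavior your state schema gives you.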
Combining Both Tiers: Reference Architecture
Here’s how the two tiers wire together in production:
┌──────────────────┐
│   User Request   │
└────────┬─────────┘
         │
┌────────▼─────────┐
│   Checkpointer   │  Tier 1: Load thread state
│    (Postgres)    │  from last checkpoint
└────────┬─────────┘
         │
┌────────▼─────────┐
│ Semantic Memory  │  Tier 2: Query cross-session
│   (Mem0 / Zep)   │  user facts and history
└────────┬─────────┘
         │
┌────────▼─────────┐
│  Summarization   │  Keep context under budget
│       Loop       │
└────────┬─────────┘
         │
┌────────▼─────────┐
│  LLM Agent Step  │  Generate with full context
└────────┬─────────┘
         │
┌────────▼─────────┐
│   Memory Write   │  Store new facts + checkpoint
│    Both Tiers    │
└──────────────────┘
Putting It Together
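import psycopg
from psycopg.rows import dict_row
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import END, StateGraph
from mem0 import Memory
from openai import OpenAI


class ProductionAgent:
    def __init__(self, user_id: str, thread_id: str):
        self.user_id = user_id
        self.thread_id = thread_id
        self.memory = Memory()
        self.client = OpenAI()
        self.checkpointer = PostgresSaver(
            psycopg.connect(
                "postgresql://.../agent_state",
                autocommit=True,
                row_factory=dict_row,
            )
        )
        self.graph = self._build_graph()

    def _build_graph(self):
        builder = StateGraph(AgentState)
        builder.add_node("summarize", summarize_and_prune)
        builder.add_node("agent", self._agent_step)
        builder.add_edge("summarize", "agent")
        builder.add_edge("agent", END)
        builder.set_entry_point("summarize")
        return builder.compile(checkpointer=self.checkpointer)

    def _agent_step(self, state: AgentState) -> dict:
        # Query semantic memory
        found = self.memory.search(
            query=state["messages"][-1]["content"],
            user_id=self.user_id,
            limit=5
        )
        # Recent Mem0 releases wrap hits in {"results": [...]}; older ones return a list
        hits = found["results"] if isinstance(found, dict) else found
        context_text = "\n".join([h["memory"] for h in hits])
        # Build prompt with memory context and generate the reply
        # (model call added here so the step produces an assistant message)
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"User context:\n{context_text}"},
                *state["messages"],
            ],
        )
        reply = response.choices[0].message.content
        # Return only the new message; the operator.add reducer appends it
        return {"messages": [{"role": "assistant", "content": reply}]}

    def run(self, message: str) -> str:
        config = {"configurable": {"thread_id": self.thread_id}}
        result = self.graph.invoke(
            {"messages": [{"role": "user", "content": message}]},
            config
        )
        # Extract and store new memory (text passed positionally)
        self.memory.add(
            f"Interaction: {message}",
            user_id=self.user_id,
            metadata={"thread_id": self.thread_id}
        )
        return result["messages"][-1]["content"]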
Choosing a Semantic Memory Provider
Beyond Mem0, there are three other production-tested options:
| Provider | Storage | Strengths | Best For |
|---|---|---|---|
| Mem0 | Managed cloud or self-hosted | Implicit extraction, 21 framework integrations, automatic conflict resolution [1] | Most production use cases |
| Zep | Postgres + embeddings | Open-source, durable, team memory, graph-based entity extraction [9] | Enterprise, compliance-heavy |
| Letta (MemGPT) | Local + cloud | Virtual context management, infinite context through tiered recall [10] | Single-agent deep conversations |
| LangGraph Store | Postgres | Same infra as checkpointer, built into LangGraph ecosystem [4] | Already on LangGraph, minimal deps |
For most teams, Mem0 is the right starting point — it handles extraction, dedup, and retrieval out of the box. Zep is better if you need audit trails and RBAC. Letta is overkill unless you’re doing hour-long agent sessions.
Production Checklist
Before shipping agent memory to production, validate each point:
- TTL policy: Checkpoints expire after N days. Semantic memory entries get demoted or summarized after M retrievals without update.
- Conflict resolution: What happens when a user says “actually, I prefer minimal explanations” after saying “give me all the details”? Mem0 handles this via recency-weighted scoring; custom systems need explicit conflict logic [1].
- Embedding cost: Each memory write calls an embedding model. At 100K writes/day with text-embedding-3-small that’s ~$0.13/day. With text-embedding-3-large it’s ~$0.80/day [11].
- PII scrubbing: Strip session IDs, API keys, and personal identifiers from stored memory. Run a regex pass or use an LLM classifier before memory.add().
- User-level isolation: Memory per user_id is table stakes. Production needs organization_id and role-based scoping on top.
- Observability: Trace memory hits and misses. If memory retrieval returns empty >20% of the time, your embedding strategy is off [12].
- Read-your-writes consistency: A memory written in step 3 must be retrievable in step 4. Vector stores often have eventual consistency, so use synchronous writes or retry with backoff.
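Two of the checklist items, the regex scrubbing pass and retry with backoff, are small enough to sketch directly. The patterns below are illustrative only and far from complete, and `scrub`, `search_with_retry`, and the `sk-` key format are assumptions made for the example rather than part of any memory library.

```python
import re
import time
from typing import Callable

# Illustrative patterns only; real deployments need a broader, audited set
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\bsk-[A-Za-z0-9]{8,}\b"), "[API_KEY]"),
]


def scrub(text: str) -> str:
    """Replace matches of each pattern with a placeholder before storing."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def search_with_retry(search: Callable[[], list], attempts: int = 4, base_delay: float = 0.05) -> list:
    """Retry an empty search with exponential backoff to ride out index lag."""
    for attempt in range(attempts):
        results = search()
        if results:
            return results
        if attempt < attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    return []


cleaned = scrub("Contact jane@example.com, key sk-abcdef123456")

# Simulated eventually consistent index: empty twice, then the write is visible
calls = {"n": 0}


def flaky_search() -> list:
    calls["n"] += 1
    return ["hit"] if calls["n"] >= 3 else []


found = search_with_retry(flaky_search, base_delay=0.001)
```

Retrying on an empty result cannot tell "not indexed yet" apart from "no match exists", so keep the attempt count low and only use it right after a write you expect to read back.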
When Memory Makes Sense (and When It Doesn’t)
Agent memory is not free. Each retrieval adds 200-500ms latency and costs embedding tokens. Use it when:
- Users return to the same agent across days or weeks
- The agent makes decisions based on past outcomes
- You’re adjusting agent behavior based on user corrections
- Domain knowledge compounds over sessions (code assistants, research agents)
Skip it when:
- Sessions are single-shot (one question, one answer)
- The context window covers the full interaction
- All needed knowledge lives in a static knowledge base (use RAG instead)
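To weigh these trade-offs for your own traffic, a back-of-the-envelope estimate is usually enough. The sketch below reproduces the checklist's embedding cost figures under two assumptions that are not stated in the text: list prices of $0.02 and $0.13 per million tokens for the two embedding models (verify against the pricing page [11]) and roughly 65 tokens per memory write. The function names are hypothetical.

```python
# Assumed list prices in USD per 1M tokens; verify against the current pricing page
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def daily_embedding_cost(writes_per_day: int, tokens_per_write: int, model: str) -> float:
    """Embedding spend per day in USD for memory writes alone."""
    tokens = writes_per_day * tokens_per_write
    return tokens / 1_000_000 * PRICE_PER_MILLION[model]


def added_latency_seconds(turns: int, low_ms: int = 200, high_ms: int = 500) -> tuple:
    """Range of extra wall-clock time across a session, one retrieval per turn."""
    return turns * low_ms / 1000, turns * high_ms / 1000


# 100K writes/day at an assumed 65 tokens per write
small = daily_embedding_cost(100_000, 65, "text-embedding-3-small")
large = daily_embedding_cost(100_000, 65, "text-embedding-3-large")

# A 20-turn session using the 200-500ms per-retrieval range quoted above
latency = added_latency_seconds(20)
```

Under these assumptions the daily cost comes out to about $0.13 for the small model and about $0.85 for the large one, and a 20-turn session picks up 4 to 10 seconds of retrieval time in total. Retrieval queries are embedded too, so a read-heavy workload will cost more than this write-only estimate.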
Wrapping Up
Agent memory in 2026 is a solved engineering problem at the architecture level — two tiers, clear boundaries, well-understood trade-offs [1][2][4]. The execution details (which backend, which provider, which summarization strategy) depend on your scale and latency budget, but the pattern is universal: checkpoint for thread continuity, semantic memory for cross-session knowledge, and a summarization loop to keep the whole thing under budget.
Start with SQLite + Mem0, benchmark your retrieval latency and memory hit rates, then scale to Postgres + Mem0 cloud when you hit the ceiling. The code here gives you a working starting point to adapt to your own stack.
Published June 24, 2026
Sources
[1] Mem0, “State of AI Agent Memory 2026: Benchmarks, Architectures & Production Gaps” — https://mem0.ai/blog/state-of-ai-agent-memory-2026
[2] Fountain City, “How to Build and Operate AI Agent Memory in 2026” — https://fountaincity.tech/resources/blog/how-to-build-and-operate-ai-agent-memory-in-2026/
[3] freeCodeCamp, “How AI Agents Remember Things: The Role of Vector Stores in LLM Memory” — https://www.freecodecamp.org/news/how-ai-agents-remember-things-vector-stores-in-llm-memory/
[4] LangChain, “LangGraph Persistence Documentation” — https://docs.langchain.com/oss/python/langgraph/persistence
[5] AppScale Blog, “Build an AI Agent from Scratch: LangGraph Tutorial (2026)” — https://appscale.blog/en/blog/build-ai-agent-langgraph-tools-memory-step-by-step-tutorial-2026
[6] LangGraph Checkpointer internals — Checkpoint size estimate from LangGraph SQLite/Postgres store schema per message count
[7] LangGraph, “Human-in-the-Loop with Persistence” — https://docs.langchain.com/oss/python/langgraph/persistence#human-in-the-loop
[8] Mem0, “How to Add Memory to Autonomous AI Agents” — https://mem0.ai/blog/how-to-add-memory-to-autonomous-ai-agents
[9] Zep Documentation — https://help.getzep.com/
[10] Letta (MemGPT), “Virtual Context Management” — https://www.letta.com/
[11] OpenAI, “Embeddings API Pricing” — https://openai.com/api/pricing/
[12] Braintrust, “Agent Observability: The Complete Guide for 2026” — https://www.braintrust.dev/articles/agent-observability-complete-guide-2026

