Adding Memory to Production AI Agents: Mem0, Zep, and LangMem Compared

Your agent works perfectly in a demo. The user comes back the next day, types "continue where we left off," and the agent has no idea who they are. They re-explain their company, their preferences, what broke last time — again. Every session starts from zero.

That's not a model capability problem. It's a memory architecture problem.

After running into this wall on RevAgent (where sales reps were re-explaining their pipeline context on every session) and BandiFinder (where users had to re-specify their procurement criteria every time), I spent the last few months building production memory into both. This post is what I learned — the architecture, the tool comparison, the real benchmark numbers, and the GDPR compliance angle that most memory guides skip entirely.

Why Context Windows Aren't the Answer

The obvious first instinct is: bigger context window. Just pass more history.

The math kills that idea quickly. At $3/MTok for Claude Sonnet 4.6 input tokens, a 1M token context inference costs $3 per call. Run that across 100 users doing 5 sessions per day and you're spending $1,500/day on input tokens alone — before any output. And even with 200K–400K token windows, full conversation history is impractical for agents that run continuously for weeks. Production traces from real agents show 80,000 to 120,000 token contexts within two to three weeks of operation, just from memory file bloat.
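The arithmetic behind that figure, as a quick sketch. It assumes one full-context call per session, which is what the $1,500 number implies; the function name is illustrative.

```python
def daily_input_cost(price_per_mtok: float, context_tokens: int,
                     users: int, calls_per_user_per_day: int) -> float:
    """Daily spend on input tokens, in dollars."""
    cost_per_call = price_per_mtok * context_tokens / 1_000_000
    return cost_per_call * users * calls_per_user_per_day


# $3/MTok, 1M-token context, 100 users, 5 sessions/day (one call per session)
full_context = daily_input_cost(3.0, 1_000_000, 100, 5)
print(f"${full_context:,.0f}/day")  # → $1,500/day
```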

Context windows also don't survive session restarts. Your agent gets killed by a deployment, a crash, or a timeout — everything in context is gone.

External memory is the only production-viable path.

The Four Memory Types

Before comparing tools, you need to understand what kind of memory your agent actually needs. There are four distinct types, and they serve different purposes:

Type	What It Stores	Example
Episodic	Past interaction history	"Last week the user asked about tender #4421 and rejected it because of the budget cap"
Semantic	Persistent facts and preferences	"User works at a logistics firm, prefers Python, deployment target is AWS"
Procedural	Learned workflows and behavior rules	"When asked to debug Python, always check import errors first"
Working	Current context window	What's happening right now in this session

Most agents only need episodic + semantic. Procedural memory is rarer — it's when you want the agent's behavior to improve over time, not just its recall. Working memory is already handled by LangGraph's checkpointer (within a session).

The question is which external store handles episodic and semantic best for your use case.

The Production Architecture: Hot Path + Cold Path

Before picking a tool, understand the architecture pattern that's emerged as the standard for production memory in 2026.

User Message
     │
     ▼
┌─────────────────────────────────────┐
│           Memory Node                │
│                                     │
│  ┌─────────────────────────────┐    │
│  │  HOT PATH                   │    │
│  │  Last N messages +          │    │
│  │  compressed summary         │    │
│  └─────────────────────────────┘    │
│                                     │
│  ┌─────────────────────────────┐    │
│  │  COLD PATH                  │    │
│  │  Semantic search from        │    │
│  │  external store (Mem0/Zep)  │    │
│  │  Target: <100ms latency     │    │
│  └─────────────────────────────┘    │
└─────────────────────────────────────┘
     │
     ▼
Agent receives: Hot context + retrieved memories
     │
     ▼
Agent responds → Memory Node saves new facts

The Hot Path handles immediate operational context — the last few messages plus a compressed summary of the current session. The Cold Path retrieves from your external store using semantic similarity. A Memory Node runs after each agent turn to extract and save new facts.
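A minimal sketch of how the two paths come together for one turn. The `MemoryNode` class, its fields, and the `retrieve` callback are hypothetical names for illustration; in a real deployment `retrieve` would wrap a Mem0 or Zep search call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryNode:
    """Assembles hot-path and cold-path context for one agent turn."""
    retrieve: Callable[[str, str], list[str]]  # (user_id, query) -> memories
    last_n: int = 4
    summary: str = ""
    history: list[dict] = field(default_factory=list)

    def build_context(self, user_id: str, query: str) -> dict:
        # Hot path: compressed summary plus the last N messages
        hot = {"summary": self.summary, "recent": self.history[-self.last_n:]}
        # Cold path: semantic retrieval from the external store
        cold = self.retrieve(user_id, query)
        return {"hot": hot, "cold": cold}

    def record_turn(self, user_msg: str, agent_msg: str) -> None:
        self.history.append({"role": "user", "content": user_msg})
        self.history.append({"role": "assistant", "content": agent_msg})


# Stand-in for an external store: a fixed lookup keyed by user.
fake_store = {"u1": ["Prefers Python", "Deploys to AWS"]}
node = MemoryNode(retrieve=lambda uid, q: fake_store.get(uid, []), last_n=2)
node.record_turn("hi", "hello")
node.record_turn("what's my stack?", "Python on AWS")
ctx = node.build_context("u1", "deployment target")
```

The agent prompt is then built from `ctx["hot"]` and `ctx["cold"]`, so the external store only ever contributes a handful of retrieved items rather than the full history.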

Sub-100ms Cold Path latency is the key production metric. If your memory retrieval takes longer than that, users feel the lag on every turn.
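One way to check that budget is to time the retrieval call directly and look at the 95th percentile rather than the average. This is a rough sketch: `fake_search` stands in for your real search function, and the 100 ms threshold is the target from this post.

```python
import statistics
import time


def p95_latency_ms(fn, queries, warmup: int = 3) -> float:
    """Time fn(query) for each query and return the 95th percentile in ms."""
    for q in queries[:warmup]:
        fn(q)  # warm caches and connections before measuring
    samples = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        samples.append((time.perf_counter() - start) * 1000)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return statistics.quantiles(samples, n=20)[-1]


def fake_search(query: str) -> list[str]:
    time.sleep(0.002)  # placeholder for a network round trip
    return []


latency = p95_latency_ms(fake_search, [f"query {i}" for i in range(40)])
within_budget = latency < 100
```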

The Three Tools: What They Are and When to Use Each

Mem0: Best for Personalization at Scale

Mem0 is a managed memory layer that sits on top of your existing vector store (Qdrant, Chroma, Pinecone, or Weaviate). It handles extraction, deduplication, and retrieval automatically.

The April 2026 algorithm is significantly better than earlier versions. Instead of pure vector similarity, it now uses multi-signal hybrid search combining semantic similarity, BM25 keyword matching, and entity matching. When you call add(), Mem0 extracts entities and stores them in a parallel entity collection. At search time, entities from the query boost relevant memories inside the final combined score — so "tell me about my AWS setup" retrieves memories mentioning "AWS," "infrastructure," and "cloud" even when those words aren't semantically close in embedding space.
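To make the idea of a combined score concrete, here is a toy version. This is not Mem0's implementation: the weights are made up, bag-of-words cosine stands in for embedding similarity, and plain term overlap stands in for BM25. It only shows how an entity match can lift a memory that shares little else with the query.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hybrid_score(query: str, memory: str, entities: set[str],
                 w_sem: float = 0.5, w_kw: float = 0.3, w_ent: float = 0.2) -> float:
    q, mem = tokens(query), tokens(memory)
    semantic = cosine(Counter(q), Counter(mem))                   # stand-in for embeddings
    keyword = len(set(q) & set(mem)) / len(set(q)) if q else 0.0  # stand-in for BM25
    entity = 1.0 if entities & set(q) & set(mem) else 0.0         # known entity in both
    return w_sem * semantic + w_kw * keyword + w_ent * entity


memories = [
    "Production infrastructure runs on AWS in eu-west-1",
    "User prefers concise answers",
]
known_entities = {"aws"}  # hypothetical entity collection
query = "tell me about my AWS setup"
ranked = sorted(memories, key=lambda mem: hybrid_score(query, mem, known_entities),
                reverse=True)
```

In practice Mem0 does this scoring for you; wiring up the real library looks like this: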

from mem0 import Memory
 
# Initialize with your vector store
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "revagent-memories",
            "host": "localhost",
            "port": 6333,
        }
    },
    "llm": {
        "provider": "anthropic",
        "config": {
            "model": "claude-haiku-4-5",  # Use cheap model for extraction
            "temperature": 0,
        }
    }
}
 
m = Memory.from_config(config)
 
# After each agent turn, save what matters
def save_turn_memories(user_id: str, conversation: list[dict]):
    m.add(
        messages=conversation,
        user_id=user_id,       # Namespace per user — critical for GDPR
        metadata={"agent": "revagent-risk", "session_date": "2026-05-17"}
    )
 
# Before each agent turn, retrieve relevant context
def get_memory_context(user_id: str, query: str) -> str:
    memories = m.search(query=query, user_id=user_id, limit=5)
    if not memories["results"]:
        return ""
    return "\n".join(item["memory"] for item in memories["results"])

Then inject into your LangGraph agent:

from langgraph.graph import StateGraph, MessagesState, START, END

# Assumed to be defined elsewhere: `llm` (a chat model) and `checkpointer`
# (for example a PostgresSaver instance).

class AgentState(MessagesState):
    # MessagesState only defines "messages", so user_id needs its own key
    user_id: str

def to_mem0_messages(messages) -> list[dict]:
    """Convert LangChain message objects to the role/content dicts Mem0 expects."""
    role_by_type = {"human": "user", "ai": "assistant"}
    return [
        {"role": role_by_type[msg.type], "content": msg.content}
        for msg in messages
        if msg.type in role_by_type
    ]

def agent_node(state: AgentState):
    user_id = state["user_id"]
    last_message = state["messages"][-1].content

    # Retrieve memories for this query
    memory_context = get_memory_context(user_id, last_message)

    # Inject into system prompt
    system = f"""You are a deal risk analyst.

User context from past sessions:
{memory_context}

Use this context to personalize your analysis. If context is empty, proceed normally."""

    response = llm.invoke([
        {"role": "system", "content": system},
        *state["messages"]
    ])
    return {"messages": [response]}

def memory_save_node(state: AgentState):
    """Runs after the agent responds and saves new facts."""
    save_turn_memories(
        user_id=state["user_id"],
        conversation=to_mem0_messages(state["messages"][-4:])  # Last 2 turns
    )
    return state

workflow = StateGraph(AgentState)
workflow.add_node("agent", agent_node)
workflow.add_node("save_memory", memory_save_node)
workflow.add_edge(START, "agent")
workflow.add_edge("agent", "save_memory")
workflow.add_edge("save_memory", END)

graph = workflow.compile(checkpointer=checkpointer)

Benchmark numbers:

LOCOMO accuracy: 67.13%
p95 search latency: 0.200s
Token reduction vs full-context: ~72–90%
vs OpenAI native memory: +26% accuracy

Where Mem0 falls short: On LongMemEval — which tests multi-hop temporal queries — Mem0 scores 49.0%. When your agent needs to answer "what was the client's budget constraint last March, and how does it compare to their current ask?" that's a temporal multi-hop query, and Mem0's flat vector store struggles with it.

Zep / Graphiti: Best for Temporal Reasoning

Zep takes a fundamentally different architectural bet. Rather than storing memories as flat vectors, it builds a temporal knowledge graph using its open-source Graphiti engine.

Every fact stored in Zep has a validity window: when it became true, and when (if ever) it was superseded. "Client's procurement budget is €500K (as of February 2026)" is not just a stored string — it's a fact with a temporal bound. When new information contradicts old, Graphiti invalidates the old fact without discarding the historical record. You can query what was true at any point in time.
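A toy model of that validity-window idea, to show what "invalidate without discarding" means in practice. This is not Graphiti's data model; the class and field names are hypothetical, the second budget figure is invented for the example, and facts are assumed to arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    invalid_at: Optional[date] = None  # None means still current


class TemporalFacts:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, attribute: str, value: str, as_of: date) -> None:
        # Close the window of the fact being superseded, but keep the record.
        for f in self.facts:
            if f.subject == subject and f.attribute == attribute and f.invalid_at is None:
                f.invalid_at = as_of
        self.facts.append(Fact(subject, attribute, value, as_of))

    def value_at(self, subject: str, attribute: str, when: date) -> Optional[str]:
        for f in self.facts:
            if (f.subject == subject and f.attribute == attribute
                    and f.valid_from <= when
                    and (f.invalid_at is None or when < f.invalid_at)):
                return f.value
        return None


kg = TemporalFacts()
kg.assert_fact("client", "procurement_budget", "€500K", date(2026, 2, 1))
kg.assert_fact("client", "procurement_budget", "€350K", date(2026, 5, 1))  # made-up update

march = kg.value_at("client", "procurement_budget", date(2026, 3, 15))  # "€500K"
june = kg.value_at("client", "procurement_budget", date(2026, 6, 1))    # "€350K"
```

With Zep itself, the graph is built for you from the messages you send: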

from zep_cloud.client import AsyncZep
from zep_cloud.types import Message
 
zep = AsyncZep(api_key=ZEP_API_KEY)
 
# Add a session (Zep builds the knowledge graph automatically)
async def add_to_zep(session_id: str, user_id: str, messages: list[dict]):
    await zep.memory.add(
        session_id=session_id,
        messages=[
            Message(
                role=msg["role"],
                role_type="user" if msg["role"] == "user" else "assistant",
                content=msg["content"],
            )
            for msg in messages
        ]
    )
 
# Retrieve memory — Zep returns a context string with graph-informed summaries
async def get_zep_context(session_id: str) -> str:
    memory = await zep.memory.get(session_id=session_id)
    return memory.context  # Pre-formatted context string ready to inject
 
# Temporal search — find facts with time awareness
async def search_zep_temporal(user_id: str, query: str):
    results = await zep.memory.search_sessions(
        text=query,
        user_id=user_id,
        search_scope="facts",  # Search the knowledge graph, not raw messages
        limit=5,
    )
    return results

For RevAgent specifically, Zep's temporal graph pays off when a sales rep asks: "Has anything changed with Acme Corp since last quarter?" Zep can traverse the graph and surface that the contact changed, the deal stage moved, and sentiment shifted — with timestamps. Mem0 would retrieve semantically similar facts but can't reason about the sequence or what superseded what.

Benchmark numbers:

LongMemEval accuracy (GPT-4o): 63.8% vs Mem0's 49.0%, a gap of about 15 points
DMR benchmark: 94.8% vs MemGPT's 93.4%
Paper: arXiv:2501.13956 (cited at ICLR 2026 MemAgents Workshop)

Where Zep falls short: If self-hosting is a hard requirement, the operational overhead is significant — you need to manage Neo4j (the graph database backend) alongside the Zep service. Mem0 is cleaner to self-host. Zep Cloud abstracts all of this, but that's another vendor dependency.

Use Zep when: Your agent deals with entities that change over time — CRM data, client relationships, compliance states, anything where "what was true then" matters as much as "what is true now."

LangMem: Best for Procedural Memory in LangGraph Stacks

LangMem is the LangChain team's memory SDK and it solves a different problem than Mem0 or Zep. It handles procedural memory — the agent's behavior improving over time, not just its recall.

Where Mem0 and Zep answer "what does this agent know?", LangMem answers "how should this agent behave?"

from langgraph.prebuilt import create_react_agent
from langgraph.store.memory import InMemoryStore
from langmem import create_manage_memory_tool, create_search_memory_tool
 
# For production: replace InMemoryStore with PostgresStore
store = InMemoryStore(
    index={
        "dims": 1536,
        "embed": "openai:text-embedding-3-small",
    }
)
 
# Namespace memories per user to prevent cross-contamination (GDPR)
agent = create_react_agent(
    "anthropic:claude-sonnet-4-6",
    tools=[
        create_manage_memory_tool(namespace=("users", "{user_id}", "facts")),
        create_search_memory_tool(namespace=("users", "{user_id}", "facts")),
    ],
    store=store,
    prompt="""You are a procurement assistant. You have tools to save and search 
memories about users. Save important preferences, constraints, and context.
Search memories before responding to personalize your answer."""
)

LangMem also does something neither Mem0 nor Zep does: prompt optimization as memory. The agent learns from feedback and updates its own behavior instructions.

from langmem import create_memory_manager
 
# Background memory manager — runs after sessions to extract and consolidate
manager = create_memory_manager(
    "anthropic:claude-haiku-4-5",  # Use cheap model for background extraction
    instructions="""Extract:
- User preferences and constraints (budget caps, preferred vendors, regions)
- Feedback on agent responses (what worked, what didn't)
- Recurring patterns in what users ask for""",
    enable_inserts=True,
    enable_updates=True,
    enable_deletes=True,  # Remove outdated facts
)
 
# Run periodically (not in the hot path)
async def consolidate_memories(user_id: str, recent_sessions: list):
    existing = store.search(("users", user_id, "facts"))
    await manager.ainvoke({
        "messages": recent_sessions,
        "existing": existing,
    })

Critical production caveat: LangMem's p95 search latency on the LOCOMO benchmark is 59.82 seconds. This is not a typo. Do not put LangMem in your hot path for interactive agents. Use it for background consolidation only — run it after sessions end, not during them.

Use LangMem when: You're on a LangGraph stack and want the agent to learn from feedback over time. Run it in the background, not inline. For real-time memory retrieval during the session, layer Mem0 or Zep on top.
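A sketch of what "background, not inline" can look like with plain asyncio. The `consolidate` coroutine is a stand-in for a slow job such as the consolidation call above, and `close_session` is a hypothetical handler; the point is only that the user-facing call returns before the slow work finishes.

```python
import asyncio


async def consolidate(user_id: str, log: list[str]) -> None:
    """Stand-in for a slow background job such as memory consolidation."""
    await asyncio.sleep(0.05)
    log.append(f"consolidated:{user_id}")


async def close_session(user_id: str, log: list[str], background: set) -> str:
    task = asyncio.create_task(consolidate(user_id, log))
    background.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return "session closed"  # returns before consolidation finishes


async def main() -> tuple[str, bool, list[str]]:
    log: list[str] = []
    background: set = set()
    ack = await close_session("u1", log, background)
    finished_before_ack = bool(log)
    await asyncio.gather(*background)  # drain background work before shutdown
    return ack, finished_before_ack, log


ack, finished_before_ack, log = asyncio.run(main())
```

For anything beyond a single process, a job queue or scheduled worker is the more robust choice, since in-process tasks are lost if the service restarts.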

Benchmark Comparison

	Mem0	Zep / Graphiti	LangMem
Architecture	Hybrid vector + entity	Temporal knowledge graph	Prompt + vector store
LongMemEval accuracy	49.0%	63.8%	N/A
LOCOMO accuracy	67.13%	N/A	N/A
p95 search latency	0.200s	Fast	⚠️ 59.82s
Temporal reasoning	Limited	Excellent	No
Procedural memory	No	No	Yes
Self-hosting	Easy	Complex (Neo4j)	Easy
LangGraph native	Adapter	ZepCloud	Native
GDPR namespace scoping	Yes	Yes	Yes
Open source	Yes (core)	Yes (Graphiti)	Yes

The Token Cost Case

Before I added memory to RevAgent, every session injected whatever history fit in context. After 3 weeks of continuous operation, calls were hitting 80K+ tokens just from accumulated conversation history — and most of it was irrelevant to the current query.

Production systems using naive full-context approaches typically run 3 to 5 times higher token costs than necessary, with recall that degrades measurably over weeks. A retrieval-based memory layer sends only the 5 most relevant memories — not the full history — and the answer quality is equivalent. You're not trading recall for savings; you're removing irrelevant context that was hurting the model anyway.

For RevAgent's risk agent:

Before memory layer:
- Average context: 12,400 tokens per call
- Daily calls: 800
- Daily input tokens: ~9.9M
- Monthly cost (GPT-4.1-mini): ~$158

After Mem0 memory layer:
- Average context: 3,200 tokens per call (system prompt + 5 memories + current query)
- Daily calls: 800
- Daily input tokens: ~2.6M
- Monthly cost (GPT-4.1-mini): ~$42

Saving: $116/month (~73% reduction)

At scale (20K calls/day, 25 times the volume above), the same per-call saving works out to roughly $2,900/month, and the agent gives better answers because the context is relevant rather than exhaustive.
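The token side of that comparison can be reproduced in a few lines. The 30-day month and the `price_per_mtok` parameter are assumptions; substitute your model's actual input rate to get a dollar figure.

```python
def monthly_input_tokens(tokens_per_call: int, calls_per_day: int, days: int = 30) -> int:
    return tokens_per_call * calls_per_day * days


def monthly_cost(tokens: int, price_per_mtok: float) -> float:
    """Dollar cost for a token count at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_mtok


before = monthly_input_tokens(12_400, 800)  # 297,600,000 tokens
after = monthly_input_tokens(3_200, 800)    # 76,800,000 tokens
reduction = 1 - after / before              # about 0.74

print(f"{before:,} -> {after:,} tokens/month ({reduction:.0%} fewer)")
```

The token reduction comes out at about 74%, in line with the roughly 73% cost reduction quoted above.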

GDPR + Memory: The Part Everyone Gets Wrong

Most memory guides skip this entirely. For EU deployments, it's the most important section.

Adding a memory layer means you're now persisting personal data across sessions. Under GDPR, users have the right to access what the agent remembers about them (Article 15), correct inaccurate memories (Article 16), and demand deletion (Article 17). The EU AI Act, whose main obligations apply from August 2026, adds a 10-year documentation retention requirement for high-risk AI systems, which can conflict with GDPR erasure rights for the same data.

Four non-negotiable design decisions:

1. Namespace memories by user from day one.

Never store memories in a monolithic vector index. Every memory tool supports per-user namespacing — use it from the start. Retrofitting deletion support onto a flat index is painful.

# Mem0: always include user_id
m.add(messages=conversation, user_id="user-123")
m.search(query=query, user_id="user-123")
 
# LangMem: namespace hierarchy
namespace = ("org-abc", "user-123", "preferences")
 
# For GDPR deletion — deletes everything for this user
m.delete_all(user_id="user-123")

2. Maintain a source record → embedding ID mapping.

Embeddings in vector stores don't have a "delete by user ID" button unless you built one. Keep a mapping table that lets you identify every embedding belonging to a user and cascade deletion through it.

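# When saving a memory, log every vector ID it produced
async def save_memory_with_audit(user_id: str, content: str) -> list[str]:
    # add() returns {"results": [...]}: one input can yield zero, one, or several memories
    result = m.add(content, user_id=user_id)
    memory_ids = [item["id"] for item in result["results"]]

    # Audit log: maps the user to each of their memory IDs
    for memory_id in memory_ids:
        await db.execute("""
            INSERT INTO memory_audit_log (user_id, memory_id, created_at, content_summary)
            VALUES ($1, $2, NOW(), $3)
        """, user_id, memory_id, content[:100])

    return memory_ids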
 
# Article 17 erasure request
async def erase_user_memories(user_id: str):
    # Delete from memory store
    m.delete_all(user_id=user_id)
    
    # Delete from audit log
    await db.execute(
        "DELETE FROM memory_audit_log WHERE user_id = $1", user_id
    )
    
    # Log the erasure itself (you need proof you complied)
    await db.execute("""
        INSERT INTO erasure_log (user_id, erased_at, erased_by)
        VALUES ($1, NOW(), 'system')
    """, user_id)

3. Treat memory tiers differently.

Short-lived working memory (a few turns of context) has a different risk profile than long-lived semantic memory (user preferences stored for months). Apply different retention policies:

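from datetime import datetime, timedelta, timezone

async def apply_memory_retention(user_id: str):
    retention_config = await get_user_retention_policy(user_id)

    # Episodic memory: expire after the configured window (e.g. 90 days)
    cutoff_episodic = datetime.now(timezone.utc) - timedelta(days=retention_config["episodic_days"])

    # Memory.delete() takes a single memory_id, so list the user's memories and
    # filter client-side. Assumes memories were saved with metadata={"type": ...}
    # and that created_at is an ISO 8601 string with a timezone offset.
    for mem in m.get_all(user_id=user_id)["results"]:
        is_episodic = (mem.get("metadata") or {}).get("type") == "episodic"
        if is_episodic and datetime.fromisoformat(mem["created_at"]) < cutoff_episodic:
            m.delete(memory_id=mem["id"])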
    
    # Semantic memory: keep until user requests deletion
    # (preferences are valuable — don't auto-expire)

4. PII never enters the memory store.

Apply PII filtering before saving memories. This is the same PIIMiddleware pattern from my GDPR-Compliant AI post — the only difference is you're applying it to memory writes, not just LLM inputs.

import re

def redact_before_save(content: str) -> str:
    """Strip PII before writing to memory store."""
    # Email
    content = re.sub(r'[\w\.-]+@[\w\.-]+\.\w+', '[REDACTED_EMAIL]', content)
    # Italian fiscal code
    content = re.sub(r'[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]', '[REDACTED_CF]', content)
    return content
 
# Wrap every memory write
def save_memory_gdpr_safe(user_id: str, content: str):
    clean_content = redact_before_save(content)
    m.add(clean_content, user_id=user_id)

Which Tool to Pick

Pick Mem0 if: You're building personalization memory for a B2B SaaS or customer-facing agent. Easy to self-host, fast retrieval, good accuracy for preference-based queries. The right default for most use cases.

Pick Zep if: Your agent works with entities that change over time — CRM deals, client relationships, compliance states. The temporal knowledge graph pays dividends when facts evolve and you need to reason about when something was true, not just what is true now. The accuracy gap on temporal queries (63.8% vs 49.0%) is meaningful in production.

Pick LangMem if: You're on LangGraph and want the agent to learn from feedback — behavioral improvement, not just recall. Use it in the background, never in the hot path. Combine it with Mem0 or Zep for real-time retrieval during sessions.

The combination I'd ship today: Mem0 for real-time retrieval (Cold Path) + LangMem background consolidation for procedural learning + LangGraph's PostgresSaver checkpointer for within-session state. Swap Mem0 for Zep if your use case involves temporal reasoning across entity relationships.

Production Checklist

Before shipping memory in production:

Namespace by user from day one — no monolithic indexes; targeted deletion must be possible
PII redaction before writes — never let raw user input reach your memory store
Source record → embedding ID mapping — deletion must cascade through both
Retention policies per tier — episodic has a TTL; semantic doesn't (unless the user asks)
Erasure log — you need proof of compliance, not just compliance
Memory Node runs after agent responds — not before, to avoid latency in the hot path
Cold Path target: <100ms — anything slower and users will feel it
LangMem latency warning — 59.82s p95; background only, never inline
Monitor memory quality with LangSmith — trace what memories are being retrieved and whether they're relevant; use the feedback loop to cull bad memories from the store
Test deletion — actually run your erasure pipeline in staging; don't discover it's broken during a real DSAR
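
For the last item, a deletion test can start as small as this. It is a sketch against an in-memory fake, not a real Mem0 or Postgres setup: `FakeMemoryStore` and `erase_user` are hypothetical stand-ins that mirror the shape of the erasure function earlier in this post. The staging version should run the same assertions against the real store.

```python
class FakeMemoryStore:
    """In-memory stand-in for a per-user memory store."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[str]] = {}

    def add(self, content: str, user_id: str) -> None:
        self._by_user.setdefault(user_id, []).append(content)

    def get_all(self, user_id: str) -> list[str]:
        return list(self._by_user.get(user_id, []))

    def delete_all(self, user_id: str) -> None:
        self._by_user.pop(user_id, None)


def erase_user(store, audit_log: list[dict], erasure_log: list[dict], user_id: str) -> None:
    store.delete_all(user_id=user_id)
    audit_log[:] = [row for row in audit_log if row["user_id"] != user_id]
    erasure_log.append({"user_id": user_id, "erased_by": "system"})


def test_erasure_removes_only_target_user() -> None:
    store, audit_log, erasure_log = FakeMemoryStore(), [], []
    for uid, text in [("user-123", "budget cap 500K"), ("user-456", "prefers AWS")]:
        store.add(text, user_id=uid)
        audit_log.append({"user_id": uid, "summary": text})

    erase_user(store, audit_log, erasure_log, "user-123")

    assert store.get_all("user-123") == []                          # memories gone
    assert store.get_all("user-456") == ["prefers AWS"]             # other users untouched
    assert all(row["user_id"] != "user-123" for row in audit_log)   # audit rows gone
    assert erasure_log == [{"user_id": "user-123", "erased_by": "system"}]  # proof kept


test_erasure_removes_only_target_user()
```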

Building AI Agents with LangGraph: From Prototype to Production — the checkpointer and @task patterns that handle within-session state
GDPR-Compliant AI: Building Guardrails for EU AI Act Readiness — PII middleware, audit logging, and right-to-erasure design
LangSmith in Production: Observability, Evaluation, and Debugging AI Agents — monitoring memory retrieval quality and building feedback loops
