Most AI agents today are stateless. They receive a prompt, produce a response, and forget everything. The moment a conversation ends or a workflow completes, all accumulated context evaporates. This is fine for single-turn tasks, but it falls apart the moment you need an agent to learn from past interactions, coordinate with other agents, or maintain continuity across sessions.
A universal memory layer solves this. It is the infrastructure that lets agents persist context, retrieve relevant past experiences, share knowledge with other agents, and improve their behavior over time. Think of it as the difference between a colleague with amnesia and one who remembers every project you have worked on together.
This post teaches you how to design and build that layer from scratch.
What You Will Learn
- The three fundamental memory types for AI agents: episodic, semantic, and procedural
- How to store and index each memory type for fast, relevant retrieval
- Hybrid retrieval strategies that combine vector search, keyword matching, and graph traversal
- State management patterns for multi-agent workflows
- Runnable Python code for a working memory layer prototype
- Trade-offs and failure modes you will encounter in production
Why Agents Need Memory: First Principles
To understand why memory matters, consider what happens without it. An agent tasked with debugging a codebase will re-read the same files, ask the same clarifying questions, and re-discover the same patterns every single time it is invoked. It cannot build on prior findings.
Human cognition relies on multiple memory systems working together. Cognitive science broadly categorizes these as episodic memory (specific experiences), semantic memory (general knowledge and facts), and procedural memory (learned skills and routines). AI agent architectures benefit from a similar decomposition.
This is not a new observation. The paper Architectures for Building Agentic AI by Nowaczyk (2025) explicitly lists memory as a core component alongside goal managers, planners, and tool routers. The argument is that reliability in agentic systems is "chiefly an architectural property" — and memory architecture is a large part of that.
The Three Memory Types
Each memory type serves a different function, has different storage characteristics, and requires different retrieval strategies.
Episodic Memory
Episodic memory stores specific events and interactions. Each record captures what happened, when, in what context, and what the outcome was. This is the agent's autobiography.
Examples: "User asked me to refactor the auth module on Tuesday. I suggested extracting a middleware layer. They accepted the suggestion and the PR was merged." Each entry is timestamped, attributed, and often includes metadata like the agent's confidence level or the user's satisfaction signal.
Semantic Memory
Semantic memory stores facts, concepts, and relationships independent of when they were learned. It is the agent's knowledge base.
Examples: "The production database uses PostgreSQL 16. The team prefers functional React components. The API rate limit is 1000 requests per minute." These facts have no temporal anchor — they are true until updated.
Procedural Memory
Procedural memory stores learned patterns, workflows, and decision-making heuristics. It is the agent's skill set.
Examples: "When deploying to staging, always run the integration test suite first. When the user says 'make it faster,' check for N+1 queries before suggesting caching." These are not facts or events — they are compiled strategies derived from repeated experience.
| Dimension | Episodic | Semantic | Procedural |
|---|---|---|---|
| Content | Specific events, interactions | Facts, concepts, relationships | Workflows, heuristics, strategies |
| Structure | Timestamped records with context | Key-value or graph-structured | Condition-action rules or templates |
| Retrieval | Temporal + similarity search | Exact match + semantic search | Pattern matching on context |
| Update pattern | Append-only (immutable events) | Upsert (facts change) | Refine over time (versioned) |
| Storage size | Grows linearly with usage | Bounded by domain scope | Relatively compact |
Architecture: How the Memory Layer Works
The memory layer sits between the agent's reasoning loop and persistent storage. Every agent action can write to memory. Every agent decision can read from memory. The retrieval system must return the most relevant memories given the current context, not just the most recent ones.
graph TD
A["Agent Action / Observation"] --> B["Memory Writer"]
B --> C["Episodic Store"]
B --> D["Semantic Store"]
B --> E["Procedural Store"]
F["Agent Query (current context)"] --> G["Hybrid Retrieval Engine"]
C --> G
D --> G
E --> G
G --> H["Vector Search (embeddings)"]
G --> I["BM25 Keyword Search"]
G --> J["Graph Traversal"]
H --> K["Rank Fusion (RRF)"]
I --> K
J --> K
K --> L["Ranked Memory Recall"]
L --> M["Agent Reasoning Loop"]
The write path classifies incoming information into the appropriate store. The read path runs multiple retrieval strategies in parallel and fuses their results into a single ranked list.
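To make the write path concrete, here is a minimal sketch of a classifier. The `classify_memory` helper and its keyword rules are hypothetical illustrations; a production system would typically make this routing decision with an LLM call rather than string heuristics.

```python
# Hypothetical write-path classifier (illustrative keyword rules only;
# real systems usually delegate this routing decision to an LLM)
def classify_memory(observation: str) -> str:
    """Decide which store a raw observation belongs in."""
    lowered = observation.lower()
    # Condition-action phrasing suggests a procedure
    if lowered.startswith(("when ", "if ", "always ", "never ")):
        return "procedural"
    # Timeless statements about entities suggest facts
    if " is " in lowered or " uses " in lowered:
        return "semantic"
    # Default: a specific event that just happened
    return "episodic"

print(classify_memory("When deploying to staging, run integration tests first"))
# -> procedural
```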
Practical Implementation
Let us build a working memory layer in Python. We will use SQLite for structured storage, a vector index for embedding-based search, and a simple rank fusion algorithm to combine results.
Step 1: Define the Memory Schema
Each memory type gets its own schema. We use dataclasses for clarity.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import uuid
import json


@dataclass
class EpisodicMemory:
    content: str
    context: dict  # e.g., {"task": "refactor", "user": "alice"}
    outcome: Optional[str] = None
    timestamp: datetime = field(default_factory=datetime.utcnow)
    memory_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    memory_type: str = "episodic"


@dataclass
class SemanticMemory:
    subject: str  # e.g., "production_database"
    predicate: str  # e.g., "uses"
    obj: str  # e.g., "PostgreSQL 16"
    confidence: float = 1.0
    source: str = ""  # where this fact was learned
    memory_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    memory_type: str = "semantic"


@dataclass
class ProceduralMemory:
    condition: str  # when to apply this procedure
    action: str  # what to do
    success_count: int = 0
    failure_count: int = 0
    memory_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    memory_type: str = "procedural"
The context field on episodic memories is critical. It captures the situation in which the event occurred, enabling retrieval by context similarity rather than just keyword matching.
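One way to honor that is to fold the context into the text that gets embedded, so situationally similar episodes land near each other in vector space. A small sketch; the `episodic_embedding_text` helper is illustrative and not part of the store built below:

```python
def episodic_embedding_text(mem: EpisodicMemory) -> str:
    """Concatenate content and context so both shape the embedding."""
    context_str = " ".join(f"{k}={v}" for k, v in sorted(mem.context.items()))
    return f"{mem.content} [context: {context_str}]"

# Usage: store.write_episodic(mem, embed(episodic_embedding_text(mem)))
```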
Step 2: Build the Storage Backend
We use SQLite for structured data and numpy for a minimal in-memory vector index. In production, you would swap the vector index for something like Qdrant, Weaviate, or pgvector.
import sqlite3
import numpy as np
from typing import List, Tuple


class MemoryStore:
    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self._init_tables()
        # In-memory vector index: list of (memory_id, embedding)
        self.vector_index: List[Tuple[str, np.ndarray]] = []

    def _init_tables(self):
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS episodic (
                memory_id TEXT PRIMARY KEY,
                content TEXT,
                context TEXT,
                outcome TEXT,
                timestamp TEXT
            );
            CREATE TABLE IF NOT EXISTS semantic (
                memory_id TEXT PRIMARY KEY,
                subject TEXT,
                predicate TEXT,
                obj TEXT,
                confidence REAL,
                source TEXT,
                UNIQUE(subject, predicate)
            );
            CREATE TABLE IF NOT EXISTS procedural (
                memory_id TEXT PRIMARY KEY,
                condition_text TEXT,
                action_text TEXT,
                success_count INTEGER,
                failure_count INTEGER
            );
            CREATE VIRTUAL TABLE IF NOT EXISTS memory_fts
            USING fts5(memory_id UNINDEXED, content, tokenize='porter');
        """)

    def write_episodic(self, mem: "EpisodicMemory", embedding: np.ndarray):
        self.conn.execute(
            "INSERT INTO episodic VALUES (?, ?, ?, ?, ?)",
            (mem.memory_id, mem.content, json.dumps(mem.context),
             mem.outcome, mem.timestamp.isoformat())
        )
        # Index for full-text search
        self.conn.execute(
            "INSERT INTO memory_fts VALUES (?, ?)",
            (mem.memory_id, mem.content)
        )
        self.conn.commit()
        # Add to vector index
        self.vector_index.append((mem.memory_id, embedding))

    def write_semantic(self, mem: "SemanticMemory", embedding: np.ndarray):
        # Upsert keyed on (subject, predicate): facts change, so a new
        # observation replaces the old object and confidence
        self.conn.execute(
            """INSERT INTO semantic VALUES (?, ?, ?, ?, ?, ?)
               ON CONFLICT(subject, predicate) DO UPDATE SET
                   obj=excluded.obj, confidence=excluded.confidence,
                   source=excluded.source""",
            (mem.memory_id, mem.subject, mem.predicate,
             mem.obj, mem.confidence, mem.source)
        )
        text = f"{mem.subject} {mem.predicate} {mem.obj}"
        # Prototype simplification: on an update, stale FTS and vector
        # entries for the old fact are left in place
        self.conn.execute(
            "INSERT INTO memory_fts VALUES (?, ?)",
            (mem.memory_id, text)
        )
        self.conn.commit()
        self.vector_index.append((mem.memory_id, embedding))
Note that we write every memory record into both the structured table and the full-text search index. This enables hybrid retrieval.
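The procedural store follows the same double-write pattern. A sketch of a `write_procedural` method you could add to `MemoryStore` to round out the class:

```python
    def write_procedural(self, mem: "ProceduralMemory", embedding: np.ndarray):
        self.conn.execute(
            "INSERT INTO procedural VALUES (?, ?, ?, ?, ?)",
            (mem.memory_id, mem.condition, mem.action,
             mem.success_count, mem.failure_count)
        )
        # Index the condition-action pair for keyword and vector search
        text = f"{mem.condition} {mem.action}"
        self.conn.execute(
            "INSERT INTO memory_fts VALUES (?, ?)",
            (mem.memory_id, text)
        )
        self.conn.commit()
        self.vector_index.append((mem.memory_id, embedding))
```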
Step 3: Implement Hybrid Retrieval
The retrieval engine runs two search strategies — vector similarity and BM25 keyword matching — then combines them using Reciprocal Rank Fusion (RRF). RRF is a simple, effective algorithm: for each document, sum 1 / (k + rank) across all retrieval methods, where k is a constant (typically 60).
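For example, a memory ranked first by vector search and third by BM25 scores 1/(60+1) + 1/(60+3) ≈ 0.0323, while a memory surfaced only by BM25 at rank one scores 1/(60+1) ≈ 0.0164.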
import re


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return 0.0
    return float(dot / norm)


class HybridRetriever:
    def __init__(self, store: MemoryStore, k: int = 60):
        self.store = store
        self.k = k  # RRF constant

    def vector_search(self, query_embedding: np.ndarray,
                      top_n: int = 10) -> List[Tuple[str, float]]:
        """Return (memory_id, score) pairs ranked by cosine similarity."""
        scores = []
        for memory_id, emb in self.store.vector_index:
            sim = cosine_similarity(query_embedding, emb)
            scores.append((memory_id, sim))
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:top_n]

    def bm25_search(self, query_text: str,
                    top_n: int = 10) -> List[Tuple[str, float]]:
        """Use SQLite FTS5's built-in BM25 ranking for keyword search."""
        # Quote each word and join with OR: raw punctuation (like the '?'
        # in a question) is a syntax error in FTS5 query strings
        terms = re.findall(r"\w+", query_text)
        if not terms:
            return []
        match_expr = " OR ".join(f'"{t}"' for t in terms)
        cursor = self.store.conn.execute(
            """SELECT memory_id, rank FROM memory_fts
               WHERE memory_fts MATCH ?
               ORDER BY rank LIMIT ?""",
            (match_expr, top_n)
        )
        # FTS5 rank is negative (lower = better), so we negate for consistency
        return [(row[0], -row[1]) for row in cursor.fetchall()]

    def retrieve(self, query_text: str, query_embedding: np.ndarray,
                 top_n: int = 5) -> List[Tuple[str, float]]:
        """Hybrid retrieval using Reciprocal Rank Fusion."""
        vec_results = self.vector_search(query_embedding, top_n=20)
        bm25_results = self.bm25_search(query_text, top_n=20)
        # RRF: each method contributes 1 / (k + rank), ranks are 1-indexed
        rrf_scores: dict = {}
        for rank, (memory_id, _) in enumerate(vec_results, start=1):
            rrf_scores[memory_id] = rrf_scores.get(memory_id, 0) + \
                1.0 / (self.k + rank)
        for rank, (memory_id, _) in enumerate(bm25_results, start=1):
            rrf_scores[memory_id] = rrf_scores.get(memory_id, 0) + \
                1.0 / (self.k + rank)
        # Sort by fused score, descending
        ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
        return ranked[:top_n]
The RRF approach is attractive because it does not require normalizing scores across different retrieval methods. Each method produces a ranked list, and RRF combines the ranks. It was first described by Cormack, Clarke, and Büttcher (2009) and remains widely used in production search systems.
Step 4: Wire It Together
Here is the unified interface an agent would use:
import hashlib


# Simulated embedding function (replace with a real model in production)
def embed(text: str) -> np.ndarray:
    """Produce a deterministic pseudo-embedding for demonstration."""
    # Python's built-in hash() is salted per process, so seed from a
    # stable digest to keep embeddings reproducible across runs
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    return rng.standard_normal(384).astype(np.float32)
# Initialize
store = MemoryStore()
retriever = HybridRetriever(store)

# Agent learns a fact (semantic memory)
fact = SemanticMemory(
    subject="production_db",
    predicate="runs_on",
    obj="PostgreSQL 16"
)
store.write_semantic(fact, embed(f"{fact.subject} {fact.predicate} {fact.obj}"))

# Agent records an interaction (episodic memory)
episode = EpisodicMemory(
    content="User requested migration from MySQL to PostgreSQL. "
            "I generated migration scripts and validated schema compatibility.",
    context={"task": "database_migration", "user": "alice"},
    outcome="success"
)
store.write_episodic(episode, embed(episode.content))

# Later: agent needs to recall relevant context
query = "What do I know about the production database?"
results = retriever.retrieve(query, embed(query), top_n=3)
for memory_id, score in results:
    print(f"Memory {memory_id[:8]}... | RRF Score: {score:.4f}")
Expected output (memory IDs and exact scores will vary):
Memory 3a7f2c1e... | RRF Score: 0.0328
Memory b9d41e8a... | RRF Score: 0.0164
Both memories are recalled: the semantic fact about PostgreSQL and the episodic memory about the migration, each relevant to the query. Note that with the toy pseudo-embeddings the vector ranking is essentially arbitrary; a real embedding model is what makes the ordering meaningful.
Embedding Quality Matters More Than You Think
The pseudo-embedding function above is for demonstration only. In production, use a proper embedding model (e.g., `text-embedding-3-small` from OpenAI, or an open-source model like `bge-base-en-v1.5`). The quality of your embeddings directly determines the quality of vector search results. Poor embeddings will surface irrelevant memories no matter how sophisticated your retrieval pipeline is.
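As a drop-in replacement for the toy `embed` function, something like the following works. This is a sketch assuming the open-source `sentence-transformers` package (`pip install sentence-transformers`); the model choice is yours:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# bge-base-en-v1.5 produces 768-dimensional embeddings
_model = SentenceTransformer("BAAI/bge-base-en-v1.5")

def embed(text: str) -> np.ndarray:
    # Normalizing means cosine similarity reduces to a dot product
    return _model.encode(text, normalize_embeddings=True)
```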
Multi-Agent State Sharing
When multiple agents collaborate — say, a research agent and a coding agent working on the same task — they need shared access to the memory layer. This introduces two challenges: consistency and trust.
Consistency
If Agent A writes a memory and Agent B reads it milliseconds later, is the memory visible? For most agentic workloads, eventual consistency is acceptable. Agents are not running real-time trading systems; a few hundred milliseconds of propagation delay is fine. Use a shared database (PostgreSQL with pgvector, for instance) and rely on its default isolation level.
Trust Scoring
Not all memories are equally reliable. An agent that has been wrong frequently should have its memories weighted lower. A simple trust model assigns each agent a trust score and multiplies it into the retrieval ranking:
@dataclass
class AgentTrustProfile:
    agent_id: str
    correct_predictions: int = 0
    total_predictions: int = 0

    @property
    def trust_score(self) -> float:
        if self.total_predictions == 0:
            return 0.5  # neutral prior
        return self.correct_predictions / self.total_predictions


def trust_weighted_retrieve(
    retriever: HybridRetriever,
    query_text: str,
    query_embedding: np.ndarray,
    memory_author_map: dict,  # memory_id -> agent_id
    trust_profiles: dict,  # agent_id -> AgentTrustProfile
    top_n: int = 5
) -> List[Tuple[str, float]]:
    """Re-rank retrieval results by trust-weighted scores."""
    raw_results = retriever.retrieve(query_text, query_embedding, top_n=20)
    weighted = []
    for memory_id, rrf_score in raw_results:
        agent_id = memory_author_map.get(memory_id, "unknown")
        trust = trust_profiles.get(agent_id, AgentTrustProfile(agent_id)).trust_score
        weighted.append((memory_id, rrf_score * trust))
    weighted.sort(key=lambda x: x[1], reverse=True)
    return weighted[:top_n]
This gets tricky because trust is domain-dependent. An agent might be highly reliable for code reviews but unreliable for cost estimation. A more sophisticated approach partitions trust scores by task category, but the simple version above is a reasonable starting point.
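A sketch of that partitioned variant, reusing the dataclass imports from Step 1; the category keys are whatever task labels your system uses, and `defaultdict` keeps unseen categories at the neutral prior:

```python
from collections import defaultdict

@dataclass
class CategorizedTrustProfile:
    agent_id: str
    # category -> [correct, total], e.g. "code_review" or "cost_estimation"
    tallies: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def record(self, category: str, correct: bool):
        tally = self.tallies[category]
        tally[1] += 1
        if correct:
            tally[0] += 1

    def trust_score(self, category: str) -> float:
        correct, total = self.tallies[category]
        return correct / total if total else 0.5  # neutral prior
```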
Unbounded Memory Growth Will Degrade Performance
Every write to the memory layer increases storage and slows down retrieval. Without a garbage collection strategy, your vector index will grow until search latency becomes unacceptable. Implement a decay mechanism: older episodic memories that have not been accessed in a long time should be archived or summarized. Semantic memories should be periodically validated. Procedural memories with high failure counts should be retired.
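A sketch of what a decay pass over the episodic store might look like; the half-life, threshold, and `archive` handler are all hypothetical knobs, not part of the implementation above:

```python
import math
from datetime import datetime

def decay_score(timestamp: datetime, half_life_days: float = 30.0) -> float:
    """Exponential recency decay: 1.0 when new, 0.5 after one half-life."""
    age_days = (datetime.utcnow() - timestamp).total_seconds() / 86400
    return math.exp(-math.log(2) * age_days / half_life_days)

def sweep_episodic(store: MemoryStore, threshold: float = 0.1):
    """Archive episodic memories whose decay score falls below threshold."""
    rows = store.conn.execute(
        "SELECT memory_id, timestamp FROM episodic").fetchall()
    for memory_id, ts in rows:
        if decay_score(datetime.fromisoformat(ts)) < threshold:
            archive(memory_id)  # hypothetical: summarize or move to cold storage
```

A fuller implementation would decay on last access rather than write time, which requires tracking an `accessed_at` column on each store.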
Seeing This in Practice
The architecture described above — episodic, semantic, and procedural stores with hybrid retrieval and multi-agent trust scoring — is implemented in SuperLocalMemory, a local-first memory layer for AI agents. It stores all data on your machine (no cloud dependency), supports shared memory across different AI tools like OpenAI, Claude, and Gemini, and implements the trust-weighted retrieval pattern shown above.
You can inspect how it handles cross-agent memory sharing by looking at the memory write path:
# Clone the repository and explore the memory layer
git clone https://github.com/superlocalai/superlocalmemory.git
cd superlocalmemory
# The core memory store implementation lives in:
# src/memory/store.ts — episodic, semantic, procedural writes
# src/retrieval/hybrid.ts — vector + BM25 rank fusion
# src/trust/scoring.ts — per-agent trust profiles
The codebase follows the same architectural pattern described in this post: classify on write, search across all stores on read, fuse results with RRF, and apply trust weighting based on the authoring agent's track record. It is a useful reference implementation if you want to see how these ideas translate to production TypeScript.
Real-World Considerations
When NOT to Build a Memory Layer
If your agent handles stateless, one-shot tasks (like "translate this sentence" or "format this JSON"), a memory layer adds complexity with no benefit. Memory layers pay off when agents operate across sessions, collaborate with other agents, or need to improve over time.
Storage Backend Choice
| Backend | Vector Search | Keyword Search | Graph Queries | Operational Complexity |
|---|---|---|---|---|
| PostgreSQL + pgvector | Good (HNSW index) | Built-in FTS | Requires CTEs or extensions | Low (one database) |
| Qdrant + SQLite | Excellent | Separate system | Not built-in | Medium (two systems) |
| Neo4j + vector index | Moderate | Lucene-based | Excellent | High (specialized) |
| In-memory (dev/prototyping) | Fast, no persistence | Simple | Manual | Minimal |
For most teams, PostgreSQL with pgvector is the pragmatic choice. It handles vector search, keyword search, and structured queries in a single system. You lose some vector search performance compared to a dedicated engine, but you gain operational simplicity.
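If you go that route, the initial setup is small. A sketch using the `psycopg` driver; the connection string, table, and index names are illustrative:

```python
import psycopg  # pip install "psycopg[binary]"; requires the pgvector extension

with psycopg.connect("dbname=agent_memory") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS memories (
            memory_id   TEXT PRIMARY KEY,
            memory_type TEXT NOT NULL,  -- 'episodic' | 'semantic' | 'procedural'
            content     TEXT NOT NULL,
            embedding   vector(384)     -- must match your embedding dimension
        );
    """)
    # HNSW index for approximate nearest-neighbor search on cosine distance
    conn.execute(
        "CREATE INDEX IF NOT EXISTS memories_embedding_idx "
        "ON memories USING hnsw (embedding vector_cosine_ops);"
    )
```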
Embedding Dimensionality vs. Retrieval Speed
Higher-dimensional embeddings capture more nuance but slow down similarity search. With HNSW indexes, search is approximately O(log n) regardless of dimension, but the constant factor grows linearly with dimension size. For most agent memory use cases, 384 or 768 dimensions are sufficient. Going to 1536 or 3072 dimensions rarely justifies the increased storage and compute costs.
Memory Conflicts
Two agents might write contradictory semantic memories: Agent A says "the API uses OAuth 2.0" while Agent B says "the API uses API keys." Your memory layer needs a conflict resolution strategy. Options include: latest-write-wins, trust-weighted resolution (prefer the higher-trust agent), or flagging the conflict for human review.
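A minimal sketch of the trust-weighted option, reusing the earlier `SemanticMemory` type; the tie margin is an arbitrary knob:

```python
from typing import Optional

def resolve_conflict(
    a: SemanticMemory, b: SemanticMemory,
    trust_a: float, trust_b: float,
    margin: float = 0.15,
) -> Optional[SemanticMemory]:
    """Prefer the higher-trust author; flag near-ties for human review."""
    if abs(trust_a - trust_b) < margin:
        return None  # too close to call: escalate to a human
    return a if trust_a > trust_b else b
```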
Do Not Treat All Memory Types Identically
A common mistake is dumping everything into a single vector store and hoping retrieval will sort it out. Episodic, semantic, and procedural memories have fundamentally different update patterns and query profiles. Episodic memories are append-only and queried by temporal context. Semantic memories are upserted and queried by entity. Procedural memories are versioned and queried by situational match. Mixing them into one index degrades retrieval quality for all three.
Further Reading and Sources
- Architectures for Building Agentic AI by Nowaczyk (2025) — a thorough treatment of agentic system components including memory, with emphasis on reliability as an architectural property
- Foundations of GenIR by Ai, Zhan, Liu (2025) — covers how generative AI models change information access paradigms, relevant to understanding retrieval in agent contexts
- Safe, Untrusted, "Proof-Carrying" AI Agents by Tagliabue and Greco (2025) — discusses trust and governance in agentic workflows, directly relevant to multi-agent memory sharing
- Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods by Cormack, Clarke, and Büttcher (2009) — the original RRF paper, still the go-to algorithm for combining ranked lists
- Semantica — an open-source semantic layer and GraphRAG framework relevant to graph-based memory retrieval
- pgvector documentation — if you choose PostgreSQL as your storage backend, this is essential reading for vector index configuration
Key Takeaways
- Three memory types, three stores. Episodic (events), semantic (facts), and procedural (skills) memories have different storage, update, and retrieval characteristics. Design for each independently.
- Hybrid retrieval beats any single method. Combining vector search with BM25 keyword matching using Reciprocal Rank Fusion consistently outperforms either approach alone.
- Trust scoring is essential for multi-agent systems. When multiple agents write to shared memory, weight retrieval results by the authoring agent's track record to surface reliable information first.
- Start with PostgreSQL + pgvector. One database that handles structured storage, full-text search, and vector similarity is operationally simpler than stitching together three specialized systems.
- Plan for memory decay. Unbounded memory growth degrades retrieval quality and performance. Implement archival, summarization, and retirement strategies from the start.