Agentic Memory
Executive Summary
This document describes a Proof of Concept for Agentic Memory in the Semantic Router. Agentic Memory enables AI agents to remember information across sessions, providing continuity and personalization.
⚠️ POC Scope: This is a proof of concept, not a production design. The goal is to validate the core memory flow (retrieve → inject → extract → store) with acceptable accuracy. Production hardening (error handling, scaling, monitoring) is out of scope.
Core Capabilities
| Capability | Description |
|---|---|
| Memory Retrieval | Embedding-based search with simple pre-filtering |
| Memory Saving | LLM-based extraction of facts and procedures |
| Cross-Session Persistence | Memories stored in Milvus (survives restarts; production backup/HA not tested) |
| User Isolation | Memories scoped per user_id (see note below) |
⚠️ User Isolation - Milvus Performance Note:
| Approach | POC | Production (10K+ users) |
|---|---|---|
| Simple filter | ✅ Filter by user_id after search | ❌ Degrades: searches all users, then filters |
| Partition Key | ❌ Overkill | ✅ Physical separation, O(log N) per user |
| Scalar Index | ❌ Overkill | ✅ Index on user_id for fast filtering |
POC: Uses simple metadata filtering (sufficient for testing).
Production: Configure user_id as a Partition Key or Scalar Indexed Field in the Milvus schema.
Key Design Principles
- Simple pre-filter decides if query should search memory
- Context window from history for query disambiguation
- LLM extracts facts and classifies type when saving
- Threshold-based filtering on search results
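The last principle, threshold-based filtering, can be sketched as a small helper. `SearchResult` and its field names are illustrative assumptions, not the router's actual Milvus client types:

```go
package main

import "fmt"

// SearchResult is a hypothetical result type; field names are assumptions,
// not the actual types used by the router.
type SearchResult struct {
	Content    string
	Similarity float32
}

// filterByThreshold keeps only results above the configured similarity
// threshold (0.6 is the POC starting value; tune via retrieval-quality logs).
func filterByThreshold(results []SearchResult, threshold float32) []SearchResult {
	kept := make([]SearchResult, 0, len(results))
	for _, r := range results {
		if r.Similarity > threshold {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	results := []SearchResult{
		{Content: "budget for Hawaii is $10K", Similarity: 0.82},
		{Content: "user prefers window seats", Similarity: 0.41},
	}
	// Only the first result clears the 0.6 threshold
	for _, r := range filterByThreshold(results, 0.6) {
		fmt.Println(r.Content)
	}
}
```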
Explicit Assumptions (POC)
| Assumption | Implication | Risk if Wrong |
|---|---|---|
| LLM extraction is reasonably accurate | Some incorrect facts may be stored | Memory contamination (fixable via Forget API) |
| 0.6 similarity threshold is a starting point | May need tuning (miss relevant or include irrelevant) | Adjustable based on retrieval quality logs |
| Milvus is available and configured | Feature disabled if down | Graceful degradation (no crash) |
| Embedding model produces 384-dim vectors | Must match Milvus schema | Startup failure (detectable) |
| History available via Response API chain | Required for context | Skip memory if unavailable |
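The embedding-dimension assumption ("startup failure, detectable") can be enforced with a fail-fast check at startup. `validateEmbeddingDim` and the probe approach below are hypothetical, shown only to illustrate the idea:

```go
package main

import "fmt"

const expectedDim = 384 // must match the Milvus collection schema

// validateEmbeddingDim is a hypothetical startup check: embed a probe string
// once and fail fast if the model's output dimension doesn't match the schema.
func validateEmbeddingDim(embed func(string) []float32) error {
	v := embed("dimension probe")
	if len(v) != expectedDim {
		return fmt.Errorf("embedding dim %d != schema dim %d", len(v), expectedDim)
	}
	return nil
}

func main() {
	// Stand-in for the real embedding model
	fakeEmbed := func(s string) []float32 { return make([]float32, 384) }
	if err := validateEmbeddingDim(fakeEmbed); err != nil {
		panic(err)
	}
	fmt.Println("embedding dimension OK")
}
```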
Table of Contents
- Problem Statement
- Architecture Overview
- Memory Types
- Pipeline Integration
- Memory Retrieval
- Memory Saving
- Memory Operations
- Data Structures
- API Extension
- Configuration
- Failure Modes and Fallbacks
- Success Criteria
- Implementation Plan
- Future Enhancements
1. Problem Statement
Current State
The Response API provides conversation chaining via previous_response_id, but knowledge is lost across sessions:
Session A (March 15):
User: "My budget for the Hawaii trip is $10,000"
→ Saved in session chain
Session B (March 20) - NEW SESSION:
User: "What's my budget for the trip?"
→ No previous_response_id → Knowledge LOST ❌
Desired State
With Agentic Memory:
Session A (March 15):
User: "My budget for the Hawaii trip is $10,000"
→ Extracted and saved to Milvus
Session B (March 20) - NEW SESSION:
User: "What's my budget for the trip?"
→ Pre-filter: memory-relevant ✓
→ Search Milvus → Found: "budget for Hawaii is $10K"
→ Inject into LLM context
→ Assistant: "Your budget for the Hawaii trip is $10,000!" ✅
2. Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENTIC MEMORY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ExtProc Pipeline │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Request → Fact? → Tool? → Security → Cache → MEMORY → LLM │ │
│ │ │ │ ↑↓ │ │
│ │ └───────┴──── signals used ────────┘ │ │
│ │ │ │
│ │ Response ← [extract & store] ←─────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┴─────────────────────┐ │
│ │ │ │
│ ┌─────────▼─────────┐ ┌───────── ───▼───┐ │
│ │ Memory Retrieval │ │ Memory Saving │ │
│ │ (request phase) │ │(response phase)│ │
│ ├───────────────────┤ ├────────────────┤ │
│ │ 1. Check signals │ │ 1. LLM extract │ │
│ │ (Fact? Tool?) │ │ 2. Classify │ │
│ │ 2. Build context │ │ 3. Deduplicate │ │
│ │ 3. Milvus search │ │ 4. Store │ │
│ │ 4. Inject to LLM │ │ │ │
│ └─────────┬─────────┘ └────────┬───────┘ │
│ │ │ │
│ │ ┌──────────────┐ │ │
│ └────────►│ Milvus │◄─────────────┘ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Component Responsibilities
| Component | Responsibility | Location |
|---|---|---|
| Memory Filter | Decision + search + inject | pkg/extproc/req_filter_memory.go |
| Memory Extractor | LLM-based fact extraction | pkg/memory/extractor.go (new) |
| Memory Store | Storage interface | pkg/memory/store.go |
| Milvus Store | Vector database backend | pkg/memory/milvus_store.go |
| Existing Classifiers | Fact/Tool signals (reused) | pkg/extproc/processor_req_body.go |
Storage Architecture
Issue #808 suggests a multi-layer storage architecture. We implement this incrementally:
┌─────────────────────────────────────────────────────────────────────────┐
│ STORAGE ARCHITECTURE (Phased) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 1 (MVP) │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Milvus (Vector Index) │ │ │
│ │ │ • Semantic search over memories │ │ │
│ │ │ • Embedding storage │ │ │
│ │ │ • Content + metadata │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 2 (Performance) │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Redis (Hot Cache) │ │ │
│ │ │ • Fast metadata lookup │ │ │
│ │ │ • Recently accessed memories │ │ │
│ │ │ • TTL/expiration support │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────── ─────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 3+ (If Needed) │ │
│ │ ┌───────────────────────┐ ┌───────────────────────┐ │ │
│ │ │ Graph Store (Neo4j) │ │ Time-Series Index │ │ │
│ │ │ • Memory links │ │ • Temporal queries │ │ │
│ │ │ • Relationships │ │ • Decay scoring │ │ │
│ │ └───────────────────────┘ └───────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
| Layer | Purpose | When Needed | Status |
|---|---|---|---|
| Milvus | Semantic vector search | Core functionality | ✅ MVP |
| Redis | Hot cache, fast access, TTL | Performance optimization | 🔶 Phase 2 |
| Graph (Neo4j) | Memory relationships | Multi-hop reasoning queries | ⚪ If needed |
| Time-Series | Temporal queries, decay | Importance scoring by time | ⚪ If needed |
Design Decision: We start with Milvus only. Additional layers are added based on demonstrated need, not speculation. The Store interface abstracts storage, allowing backends to be added without changing retrieval/saving logic.
3. Memory Types
| Type | Purpose | Example | Status |
|---|---|---|---|
| Semantic | Facts, preferences, knowledge | "User's budget for Hawaii is $10,000" | ✅ MVP |
| Procedural | How-to, steps, processes | "To deploy payment-service: run npm build, then docker push" | ✅ MVP |
| Episodic | Session summaries, past events | "On Dec 29 2024, user planned Hawaii vacation with $10K budget" | ⚠️ MVP (limited) |
| Reflective | Self-analysis, lessons learned | "Previous budget response was incomplete - user prefers detailed breakdowns" | 🔮 Future |
⚠️ Episodic Memory (MVP Limitation): Session-end detection is not implemented. Episodic memories are only created when the LLM extraction explicitly produces a summary-style output. Reliable session-end triggers are deferred to Phase 2.
🔮 Reflective Memory: Self-analysis and lessons learned. Not in scope for this POC. See Appendix A.
Memory Vector Space
Memories cluster by content/topic, not by type. Type is metadata:
┌────────────────────────────────────────────────────────────────────────┐
│ MEMORY VECTOR SPACE │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ BUDGET/MONEY │ │ DEPLOYMENT │ │
│ │ CLUSTER │ │ CLUSTER │ │
│ │ │ │ │ │
│ │ ● budget=$10K │ │ ● npm build │ │
│ │ (semantic) │ │ (procedural) │ │
│ │ ● cost=$5K │ │ ● docker push │ │
│ │ (semantic) │ │ (procedural) │ │
│ └─────────────────┘ └──────────── ─────┘ │
│ │
│ ● = memory with type as metadata │
│ Query matches content → type comes from matched memory │
│ │
└────────────────────────────────────────────────────────────────────────┘
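The "type as metadata" idea can be made concrete with a small sketch; `Record` and its field layout are assumptions, not the actual schema from the Data Structures section:

```go
package main

import "fmt"

// MemoryType is metadata attached to each record; it does not affect where
// the embedding lands in vector space. Names are illustrative.
type MemoryType string

const (
	Semantic   MemoryType = "semantic"
	Procedural MemoryType = "procedural"
	Episodic   MemoryType = "episodic"
)

// Record sketches what a stored memory might carry: the embedding drives
// clustering and search; the type rides along as a scalar field.
type Record struct {
	Content   string
	Type      MemoryType
	Embedding []float32
}

func main() {
	r := Record{Content: "To deploy: npm build, then docker push", Type: Procedural}
	// A query matching this content returns the record; the type is read
	// from the match, not used to pick a cluster.
	fmt.Println(r.Type)
}
```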
Response API vs. Agentic Memory: When Does Memory Add Value?
Critical Distinction: Response API already sends full conversation history to the LLM when previous_response_id is present. Agentic Memory's value is for cross-session context.
┌─────────────────────────────────────────────────────────────────────────┐
│ RESPONSE API vs. AGENTIC MEMORY: CONTEXT SOURCES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SAME SESSION (has previous_response_id): │
│ ───────────────────────────────────────── │
│ Response API provides: │
│ └── Full conversation chain (all turns) → sent to LLM │
│ │
│ Agentic Memory: │
│ └── STILL VALUABLE - current session may not have the answer │
│ └── Example: 100 turns planning vacation, but budget never said │
│ └── Days ago: "I have 10K spare, is that enough for a week in │
│ Thailand?" → LLM extracts: "User has $10K budget for trip" │
│ └── Now: "What's my budget?" → answer in memory, not this chain │
│ │
│ NEW SESSION (no previous_response_id): │
│ ────────────────────────────────────── │
│ Response API provides: │
│ └── Nothing (no chain to follow) │
│ │
│ Agentic Memory: │
│ └── ADDS VALUE - retrieves cross-session context │
│ └── "What was my Hawaii budget?" → finds fact from March session │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Design Decision: Memory retrieval adds value in both scenarios — new sessions (no chain) and existing sessions (query may reference other sessions). We always search when pre-filter passes.
Known Redundancy: When the answer IS in the current chain, we still search memory (~10-30ms wasted). We can't cheaply detect "is the answer already in history?" without understanding the query semantically. For POC, we accept this overhead.
Phase 2 Solution: Context Compression solves this properly — instead of Response API sending full history, we send compressed summaries + recent turns + relevant memories. Facts are extracted during summarization, eliminating redundancy entirely.
4. Pipeline Integration
Current Pipeline (main branch)
1. Response API Translation
2. Parse Request
3. Fact-Check Classification
4. Tool Detection
5. Decision & Model Selection
6. Security Checks
7. PII Detection
8. Semantic Cache Check
9. Model Routing → LLM
Enhanced Pipeline with Agentic Memory
REQUEST PHASE:
─────────────
1. Response API Translation
2. Parse Request
3. Fact-Check Classification ──┐
4. Tool Detection ├── Existing signals
5. Decision & Model Selection ──┘
6. Security Checks
7. PII Detection
8. Semantic Cache Check ───► if HIT → return cached
9. 🆕 Memory Decision:
└── if (NOT Fact) AND (NOT Tool) AND (NOT Greeting) → continue
└── else → skip to step 12
10. 🆕 Build context + rewrite query [~1-5ms]
11. 🆕 Search Milvus, inject memories [~10-30ms]
12. Model Routing → LLM
RESPONSE PHASE:
──────────────
13. Parse LLM Response
14. Cache Update
15. 🆕 Memory Extraction (async goroutine, if auto_store enabled)
└── Runs in background, does NOT add latency to response
16. Response API Translation
17. Return to Client
Step 10 details: Query rewriting strategies (context prepend, LLM rewrite, HyDE) are explained in Appendix C.
5. Memory Retrieval
Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ MEMORY RETRIEVAL FLOW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MEMORY DECISION (reuse existing pipeline signals) │
│ ──────────────────────────── ────────────────────── │
│ │
│ Pipeline already classified: │
│ ├── ctx.IsFact (Fact-Check classifier) │
│ ├── ctx.RequiresTool (Tool Detection) │
│ └── isGreeting(query) (simple pattern) │
│ │
│ Decision: │
│ ├── Fact query? → SKIP (general knowledge) │
│ ├── Tool query? → SKIP (tool provides answer) │
│ ├── Greeting? → SKIP (no context needed) │
│ └── Otherwise → SEARCH MEMORY │
│ │
│ 2. BUILD CONTEXT + REWRITE QUERY │
│ ───────────────────────────── │
│ History: ["Planning vacation", "Hawaii sounds nice"] │
│ Query: "How much?" │
│ │
│ Option A (MVP): Context prepend │
│ → "How much? Hawaii vacation planning" │
│ │
│ Option B (v1): LLM rewrite │
│ → "What is the budget for the Hawaii vacation?" │
│ │
│ 3. MILVUS SEARCH │
│ ───────────── │
│ Embed context → Search with user_id filter → Top-k results │
│ │
│ 4. THRESHOLD FILTER │
│ ──────────────── │
│ Keep only results with similarity > 0.6 │
│ ⚠️ Threshold is configurable; 0.6 is starting value, tune via logs │
│ │
│ 5. INJECT INTO LLM CONTEXT │
│ ──────────────────────── │
│ Add as system message: "User's relevant context: ..." │
│ │
└─────────────────────────────────────────────────────────────────────────┘
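Step 5 (inject into LLM context) might look roughly like this; the `Message` shape and helper name are assumptions:

```go
package main

import "fmt"

// Message mirrors the chat-completions message shape.
type Message struct {
	Role    string
	Content string
}

// injectMemories sketches step 5: retrieved memories are prepended as a
// single system message so the LLM sees them as context, not as user input.
func injectMemories(messages []Message, memories []string) []Message {
	if len(memories) == 0 {
		return messages // nothing retrieved, request passes through unchanged
	}
	sys := Message{Role: "system", Content: "User's relevant context:"}
	for _, m := range memories {
		sys.Content += "\n- " + m
	}
	return append([]Message{sys}, messages...)
}

func main() {
	msgs := []Message{{Role: "user", Content: "What's my budget for the trip?"}}
	out := injectMemories(msgs, []string{"budget for Hawaii is $10K"})
	fmt.Println(out[0].Role, "|", out[0].Content)
}
```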
Implementation
MemoryFilter Struct
// pkg/extproc/req_filter_memory.go
type MemoryFilter struct {
    store memory.Store // Interface - can be MilvusStore or InMemoryStore
}

func NewMemoryFilter(store memory.Store) *MemoryFilter {
    return &MemoryFilter{store: store}
}
Note: store is the Store interface (Section 8), not a specific implementation. At runtime, this is typically MilvusStore for production or InMemoryStore for testing.
Memory Decision (Reuses Existing Pipeline)
⚠️ Known Limitation: The IsFact classifier was designed for general-knowledge fact-checking (e.g., "What is the capital of France?"). It may incorrectly classify personal-fact questions ("What is my budget?") as fact queries, causing memory to be skipped.
POC Mitigation: We add a personal-indicator check. If the query contains personal pronouns ("my", "I", "me"), we override IsFact and search memory anyway.
Future: Retrain or augment the fact-check classifier to distinguish general from personal facts.
// pkg/extproc/req_filter_memory.go

// shouldSearchMemory decides if a query should trigger memory search.
// Reuses existing pipeline classification signals with a personal-fact override.
func shouldSearchMemory(ctx *RequestContext, query string) bool {
    // Check for personal indicators (overrides IsFact for personal questions)
    hasPersonalIndicator := containsPersonalPronoun(query)

    // 1. Fact query → skip UNLESS it contains personal pronouns
    if ctx.IsFact && !hasPersonalIndicator {
        logging.Debug("Memory: Skipping - general fact query")
        return false
    }

    // 2. Tool required → skip (tool provides answer)
    if ctx.RequiresTool {
        logging.Debug("Memory: Skipping - tool query")
        return false
    }

    // 3. Greeting/social → skip (no context needed)
    if isGreeting(query) {
        logging.Debug("Memory: Skipping - greeting")
        return false
    }

    // 4. Default: search memory (conservative - don't miss context)
    return true
}

// Precompiled personal-context indicators
var personalPatterns = regexp.MustCompile(`(?i)\b(my|i|me|mine|i'm|i've|i'll)\b`)

func containsPersonalPronoun(query string) bool {
    return personalPatterns.MatchString(query)
}
// Precompiled: match messages that are ONLY greetings, not "Hi, what's my budget?"
var greetingPatterns = []*regexp.Regexp{
    regexp.MustCompile(`^(hi|hello|hey|howdy)[\s!.,]*$`),
    regexp.MustCompile(`^(hi|hello|hey)[\s,]*(there)?[\s!.,]*$`),
    regexp.MustCompile(`^(thanks|thank you|thx)[\s!.,]*$`),
    regexp.MustCompile(`^(bye|goodbye|see you)[\s!.,]*$`),
    regexp.MustCompile(`^(ok|okay|sure|yes|no)[\s!.,]*$`),
}

func isGreeting(query string) bool {
    lower := strings.ToLower(strings.TrimSpace(query))
    // Short greetings only (at most 20 chars and matches a pattern)
    if len(lower) > 20 {
        return false
    }
    for _, p := range greetingPatterns {
        if p.MatchString(lower) {
            return true
        }
    }
    return false
}
Context Building
// buildSearchQuery builds an effective search query from history + current query.
// MVP: context prepend; v1: LLM rewrite for vague queries.
func buildSearchQuery(history []Message, query string) string {
    // If query is self-contained, use as-is
    if isSelfContained(query) {
        return query
    }

    // MVP: Simple context prepend
    context := summarizeHistory(history)
    return query + " " + context

    // v1 (future): LLM rewrite for vague queries
    // if isVague(query) {
    //     return rewriteWithLLM(history, query)
    // }
}
// Precompiled vague-query patterns
var vaguePatterns = []*regexp.Regexp{
    regexp.MustCompile(`(?i)^how much\??$`),
    regexp.MustCompile(`(?i)^what about`),
    regexp.MustCompile(`(?i)^and that`),
    regexp.MustCompile(`(?i)^this one`),
}

func isSelfContained(query string) bool {
    // Self-contained: "What's my budget for the Hawaii trip?"
    // NOT self-contained: "How much?", "And that one?", "What about it?"
    for _, p := range vaguePatterns {
        if p.MatchString(query) {
            return false
        }
    }
    return len(query) > 20 // Short queries are often vague
}
func summarizeHistory(history []Message) string {
    // Extract key terms from the last 3 user messages
    var terms []string
    count := 0
    for i := len(history) - 1; i >= 0 && count < 3; i-- {
        if history[i].Role == "user" {
            terms = append(terms, extractKeyTerms(history[i].Content))
            count++
        }
    }
    return strings.Join(terms, " ")
}
// v1: LLM-based query rewriting (future enhancement)
// Illustrative pseudocode: request construction and error handling elided.
// Example: "how much?" → "What is the budget for the Hawaii vacation?"
func rewriteWithLLM(history []Message, query string) string {
    prompt := fmt.Sprintf(`Conversation context: %s
Rewrite this vague query to be self-contained: "%s"
Return ONLY the rewritten query.`, summarizeHistory(history), query)

    // Call LLM endpoint with prompt (args elided in this sketch)
    resp, _ := http.Post(llmEndpoint+"/v1/chat/completions", ...)
    return parseResponse(resp)
}