Review date: 2026-06-22 Review author: Zhongzhu Zhou Paper reviewed: Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents Paper authors: Shuo Ji, Yibo Li, Bryan Hooi arXiv: 2606.06036v1, 2026-06-04 Venue/status: ICML 2026 (Proceedings of the 43rd International Conference on Machine Learning, Seoul)
Short Answer
Every time you ask your phone’s voice assistant “What did I say I wanted for dinner on Tuesday?” and it searches through some flat conversation log, it is using passive retrieval: pick the top-k memory chunks that look similar to your query and hand them to the model. MRAgent argues this is the wrong mental model for memory — and cognitive neuroscience agrees. Human memory recall is not a search engine lookup; it is an active reconstruction process where intermediate cues chain together into a coherent recall episode.
MRAgent (Memory Reasoning Architecture for LLM Agents) operationalizes this idea. It represents an agent’s interaction history as a heterogeneous Cue–Tag–Content graph, where fine-grained cues (entities, keywords) are linked to episodic memory units through intermediate associative tags. At query time, instead of doing one-shot similarity retrieval, MRAgent runs an iterative loop: the LLM reasons over currently available cues and tags, selects which graph branches to explore next, traverses those branches, collects evidence, and updates its reasoning trajectory — until it has accumulated enough context to answer confidently.
The paper proves, via an approximation-theoretic argument (Theorem 4.1), that active reconstruction is strictly more expressive than any passive retrieval policy given the same retrieval budget. Empirically, on the LoCoMo benchmark (50 long conversations, up to 35 sessions), MRAgent achieves a 23.3% relative improvement in LLM-Judge score over the best baseline under the Gemini backbone, and 12.4% under Claude. On LongMemEval (500 questions, ~115K-token histories), MRAgent achieves 72.95 J-score vs the next-best 54.92 — a 32% relative gain. Most strikingly, it does this while consuming only 118k tokens per sample, versus 632k for A-Mem and 3.27M for LangMem — a 5x to 28x reduction in memory access cost.
This review unpacks every piece of machinery behind these results: the graph construction pipeline, the traversal operators, the active reconstruction loop, the formal proof of superiority, and what the authors glossed over in their design choices.
1. Prerequisites
1.1 The Memory Problem in LLM Agents
A language model agent operates over an extended horizon: multiple conversation sessions, tool calls, past observations. The context window — even at 1M tokens — cannot hold an unbounded history. So agents need external memory systems: some storage that holds past interactions, and some retrieval mechanism that pulls relevant pieces into the current context at query time.
The key requirements for a good agent memory system are:
- Faithfulness: Factual content should not be distorted during storage or retrieval.
- Efficiency: Retrieval should not bloat the context window with irrelevant content.
- Compositionality: The system must support reasoning over multiple pieces of evidence, especially when no single memory chunk contains the full answer.
- Adaptivity: The retrieval process should adapt based on intermediate findings — if the first retrieved chunk reveals a new entity, the system should be able to follow that lead.
Current systems fail on the last two requirements. Let us examine why.
1.2 Classical Retrieval-Augmented Generation (RAG) and Its Limits
Standard RAG (Lewis et al., 2020) stores documents in a vector store and, given a query , computes a cosine similarity against all stored embeddings, returning the top- most similar chunks:
This is the simplest form of passive retrieval: the set of retrieved items is determined entirely by the query , with no ability to incorporate intermediate evidence discovered during retrieval.
The failure mode is illustrated in the paper’s running example:
Query: “What did Caroline do when Nate won his second video game tournament?”
- Similarity search for “video game tournament” returns many Nate-related events about tournaments — but none directly about Caroline.
- The correct answer requires inferring from the text “Nate just won another regional video game tournament in July” that the event occurred in July, then retrieving what Caroline was doing in July.
- A passive retriever, given the query at the start, has no way to make this inferential leap.
Passive Retrieval (RAG):
┌──────────────────────────────────────────┐
│ Query: "What did Caroline do when │
│ Nate won his 2nd tournament?" │
└─────────────────────┬────────────────────┘
│ encode query
▼
┌──────────────────────────────────────────┐
│ Memory: [m1: Nate won 1st tournament] │
│ [m2: Nate won 2nd tournament] │
│ [m3: Caroline traveled in July] │
│ ... │
└─────────────────────┬────────────────────┘
│ TopK similarity to query
▼
Returns m1, m2 (Nate events)
❌ misses Caroline's July activity
❌ no way to infer temporal cue "July"
Graph-based memory systems like A-Mem address some compositionality by expanding neighborhoods in a predefined N-hop manner:
But this still fails: expanding N-hop neighbors from Nate’s tournament events does not reach Caroline’s July activity unless there is an explicit graph edge connecting them — which there typically is not.
The fundamental issue: passive retrieval can only select what is directly similar to the query or pre-defined neighbors. It cannot infer new retrieval cues from intermediate evidence.
1.3 Knowledge Graphs and Heterogeneous Graphs
A knowledge graph (KG) is a directed graph where:
- is a set of nodes (entities or concepts),
- is a set of typed edges (triples),
- is a set of relation types.
A heterogeneous KG allows nodes of different types (e.g., persons, events, facts) connected by different relation types. MRAgent’s memory graph is a heterogeneous KG where nodes have types cue, tag, episode, semantic, topic and edges are the relations.
A key property leveraged by MRAgent: once you retrieve a node, you can follow its outgoing edges to discover neighboring nodes that may not be similar to the original query text. This is what enables evidence chaining across multiple reasoning steps.
1.4 Cognitive Neuroscience: How Human Memory Actually Works
The paper’s design is deeply inspired by cognitive neuroscience. The key insight, from Rugg & Renoult (2025) and Rashid et al. (2016), is that memory recall is a reconstruction process:
- You begin with a contextual cue (e.g., “dinner on Tuesday”).
- This cue triggers the reactivation of engrams — compact neural memory traces that encode patterns from past experiences.
- Engram activation propagates through associative links to related engrams, progressively reconstructing a coherent memory episode.
- The process is iterative and evidence-dependent: intermediate reactivated patterns bias subsequent recall, allowing the brain to follow a chain of associations until a full episode is recovered.
This is exactly unlike text search. The brain does not rank every stored memory by similarity to the cue and pick the top-k. It follows a chain of semantic associations, updating its “search state” at each step.
MRAgent implements this computationally:
- Cues correspond to contextual keywords / entities (analogous to contextual cues in the brain).
- Tags correspond to engrams — compact associative patterns that link cues to episodic content.
- Episodes correspond to the actual memory content (specific past events).
- Semantic units correspond to stable long-term knowledge (personality attributes, preferences).
- Active reconstruction replaces the passive engram-independent retrieval of RAG.
Human Memory MRAgent Analogue
──────────────────────────────────────────────────
Contextual cue → Query cue (extracted entity/keyword)
Engram → Tag (associative label in graph)
Episodic memory → Episode node (concrete event)
Semantic memory → Semantic node (stable fact)
Reconstruction process → Iterative LLM-guided traversal
1.5 Sequential Decision Processes and the Value of Adaptive Policies
In reinforcement learning terms, memory retrieval can be modeled as a sequential decision process: at each step , the agent is in state (accumulated evidence so far) and takes an action (which memory node/branch to explore next), observing a new evidence item. A passive retrieval policy is stateless: — all retrieval steps are determined by the query alone, before any evidence is observed. An active retrieval policy is stateful:
The -th retrieval step depends on everything seen so far. This is the key advantage: an adaptive policy can follow newly discovered evidence, while a passive policy commits to all retrieval targets before seeing any evidence.
2. Problem Formulation: Active Memory Access
2.1 Memory System Definition
The paper formalizes the memory system as follows. Let be an external memory consisting of memory units . Given a query , memory access selects memory units over steps. Let denote the accumulated evidence after step .
Passive retrieval policy : selects all memory units solely based on the query:
Active reconstruction policy : selects the next unit conditioned on the evolving evidence:
This distinction is fundamental. A passive retriever must “guess” all retrieval targets upfront based on the query alone. An active reconstructor can follow evidence leads: if step retrieves a node that mentions “July”, step can specifically target events in July that were not retrievable from the original query alone.
2.2 Why Passive Retrieval Fails on Complex Queries
The paper identifies three concrete weaknesses of passive retrieval:
- Inability to update retrieval strategy mid-course. A passive policy cannot incorporate intermediate evidence (e.g., “July” being the temporal anchor).
- Noise accumulation from fixed aggregation. Pre-constructed N-hop neighbor expansion from a seed set often includes irrelevant nodes — no mechanism to prune off-topic branches.
- Heavy reliance on pre-constructed structures. Graph-based passive retrievers rely on predefined edge types; if the right connection was not encoded during construction, retrieval fails.
Active reconstruction addresses all three: it adapts in real-time (1), prunes irrelevant branches based on accumulated context (2), and can infer new retrieval cues dynamically (3).
3. Associative Memory System: Cue–Tag–Content Graph
MRAgent organizes agent memory as a heterogeneous graph — a Cue–Tag–Content structure — that encodes semantic and structural associations between memory elements.
3.1 Graph Nodes and Node Types
Figure 1: Cue–Tag–Content Memory Graph Architecture
graph TD
subgraph CueLayer["Cue Layer"]
C1["Nate"]
C2["video game tournament"]
C3["Caroline"]
end
subgraph TagLayer["Tag Layer"]
T1["Tournament Victory"]
T2["Travel-Location"]
T3["Personality"]
end
subgraph EpisodeLayer["Episode Layer"]
E1["e1: Nate won 1st tournament May"]
E2["e2: Nate won 2nd tournament July"]
E3["e3: Caroline traveling in July"]
end
subgraph SemanticLayer["Semantic Layer"]
S1["s1: Nate enjoys movies and games"]
S2["s2: Caroline is a dancer"]
end
subgraph TopicLayer["Topic Layer"]
Topic1["tau1: Nate gaming activities"]
end
C1 -- "Tournament Victory" --> E1
C2 -- "Tournament Victory" --> E1
C1 -- "Tournament Victory" --> E2
C2 -- "Tournament Victory" --> E2
C3 -- "Travel-Location" --> E3
C1 -- "Personality" --> S1
C3 -- "Profession" --> S2
Topic1 --> E1
Topic1 --> E2
The three core node types are:
-
Cues : Fine-grained keywords such as entity names, actions, or contextual descriptors. Examples: “Nate”, “video game tournament”, “Caroline”, “July”.
-
Contents : Specific memory units, further divided into:
- Episode : Concrete event with timestamp (e.g., “Nate won his first video game tournament last week”).
- Semantic : Stable abstract knowledge (e.g., “Nate is bright and bold”).
- Topic : Higher-level abstraction summarizing recurring episode patterns (e.g., “Nate’s gaming activities”).
-
Associations : Each triple connects a cue to a content node through a tag — an associative label encoding the semantic relation between cue and content.
3.2 The Role of Tags as Semantic Bridges
Tags are the key innovation distinguishing MRAgent’s graph from a standard knowledge graph. In a standard KG, edges are predefined relation types (is-a, located-in, etc.). Tags in MRAgent are LLM-extracted associative labels that capture the relational pattern of an episode: “Tournament Victory”, “Financial Decision”, “Creative Work Submission”.
Why this matters: two cues might both connect to an episode via the same tag, even though neither cue directly mentions the other. For example, the query mentions “Nate” and “video game tournament”, and the tag “Tournament Victory” connects them to episode . But ‘s timestamp is July, and Caroline’s travel event has timestamp July — now the LLM can construct a new cue “July” and follow it to . This cross-episode reasoning is not possible with similarity-based retrieval alone.
3.3 Two Induced Traversal Operators
The paper defines two fundamental mapping operators over :
Operator takes a cue and returns all tags that connect it to some content — i.e., “given I know Nate won a tournament, what associative categories exist?” Operator then retrieves the content nodes conditioned on both a cue and a selected tag — i.e., “given Nate and Tournament Victory, what specific episode?”
This two-stage decoupling is crucial: by exposing tags as an intermediate layer, the LLM can reason over a small set of semantic categories (tags) before committing to retrieving expensive episodic content. This avoids the combinatorial explosion of naive N-hop expansion.
3.4 Multi-Granular Memory Layers
The full memory graph has three layers at different granularities:
Episodic Layer (Cue–Tag–Episode): Stores event-specific memory units . Each episode corresponds to a concrete occurrence at a specific time. Episodes are organized along a unified timeline, enabling temporal reasoning (e.g., “what happened in July?”).
Semantic Layer (Cue–Tag–Semantic): Captures relatively stable knowledge: personal attributes, preferences, general facts extracted from dialogue. Unlike episodes, semantic units persist across individual contexts and are anchored to entity-level cues.
Abstraction Layer (Topics): Topic nodes summarize recurring patterns shared across episodes. Topics connect to their constituent episodes via , allowing top-down access: localize relevant topic → descend to associated episodes.
4. Memory Population Pipeline
The memory graph is constructed offline from raw dialogue text via an automated LLM distillation pipeline.
4.1 Episodic Memory Construction
Step 1: Segment into episodic units. The dialogue stream is first rewritten to resolve coreferences and normalize temporal expressions, then segmented into coherent episodic units:
where performs pronoun resolution, temporal normalization, and episodic segmentation.
Step 2: Extract tags and cues. For each episodic unit , the LLM extracts:
where produces a short associative tag (e.g., “Tournament Victory”) summarizing the relational pattern, and extracts a set of fine-grained cues (entities, attributes, salient descriptors).
Step 3: Build graph triplets. Each cue is linked to the episode through the tag , forming Cue–Tag–Episode triples .
4.2 Semantic Memory Construction
Semantic units are extracted by a separate LLM-based extractor :
where is the stable fact content (e.g., “Nate is bright and bold”), is an entity-level cue (e.g., “Nate”), and is an aspect-level tag (e.g., “Personality”).
4.3 Topic Abstraction Construction
Topic nodes are generated by summarizing recurring themes across episodic segments:
Each summarizes a coherent set of episodes sharing a recurring semantic pattern, connected to those episodes via topic–episode links.
Figure 2: Memory Population Data Flow
flowchart LR
A["Raw Dialogue T"] --> B["Rewrite and Segment"]
B --> C["Episodic Units e1..en"]
C --> D1["Tag Extraction: e_i to g_i"]
C --> D2["Cue Extraction: e_i to C_i"]
C --> D3["Semantic Extraction: T to cue-tag-sem"]
C --> D4["Topic Abstraction: episodes to tau_j"]
D1 --> E["Cue-Tag-Episode Triplets"]
D2 --> E
D3 --> F["Cue-Tag-Semantic Triplets"]
D4 --> G["Topic Nodes linked to episodes"]
E --> H["Heterogeneous Memory Graph G"]
F --> H
G --> H
5. MRAgent: The Active Reconstruction Loop
MRAgent formulates memory access as an active sequential decision process. We define the reconstruction state, the available traversal actions, and the full reconstruction algorithm.
5.1 Reconstruction State
At step , MRAgent maintains a reconstruction state:
where:
- : the active set — the current set of memory elements (cues, tags, content) being considered for the next traversal step.
- : the reconstructed context — all evidence accumulated in steps , which conditions subsequent traversal decisions.
Initial state: where contains the initial active set derived by matching query cues against the stored cue set.
5.2 Traversal Action Space
The agent can apply a finite set of traversal actions . Each action is a typed mapping operator over .
Forward traversal actions expand the active set along Cue→Tag→Content relations:
Action expands from cues to their associated tags — a cheap operation since tags are lightweight. Action then retrieves full episodic or semantic content based on both cues and selected tags.
Reverse traversal actions allow retrieved content to activate new cues and tags:
This is the key “reactivation” mechanism — analogous to an engram activating related cues. After retrieving episode (“Nate won his 2nd tournament in July”), the reverse traversal can surface the cue “July” and tag “Travel/Location”, enabling subsequent retrieval of Caroline’s July activity.
The concrete toolkit contains seven specialized operations (Table 4 in the paper):
| Tool | Operator | Purpose |
|---|---|---|
query_tag_events | Retrieve episodic events for a cue-tag pair | |
query_conversation_time | Return timestamp of an episode | |
query_event_keywords | Extract cues/tags from an episode | |
query_event_context | Retrieve context surrounding the episode | |
query_personal_information | Return semantic aspects for a person entity | |
query_personal_aspect | Retrieve semantic content for (person, aspect) | |
query_topic_events | Retrieve episodes associated with a topic |
5.3 The Active Reconstruction Algorithm
Figure 3: MRAgent Active Reconstruction Loop
flowchart TD
Q["Query x"] --> CI["Cue Initialization: ExtractCues"]
CI --> IS["Initial State: Z0, empty H"]
IS --> LOOP["Loop: t = 0 to T-1"]
LOOP --> LLM1["Step 1: LLM Reasoning\nSelect action subset A_t"]
LLM1 --> TRAV["Step 2: Controlled Traversal\nApply each action on active set Z_t"]
TRAV --> LLM2["Step 3: LLM Routing\nPrune irrelevant branches\nUpdate Z_next"]
LLM2 --> UPD["State Update: H_next = H_prev + Z_next"]
UPD --> STOP{"Sufficient Evidence?"}
STOP -- "Yes" --> ANS["AnswerLLM generates final answer"]
STOP -- "No" --> LOOP
ANS --> Output["Final Answer y-hat"]
Algorithm 1 (from the paper):
Input: Query x, memory graph G, actions A = {Π₁,...,Πₘ}, max steps T
Output: Answer ŷ
1: C ← ExtractCues(x)
2: Z⁽⁰⁾ ← ActiveSetInit(C, G) // match query cues to stored cues
3: H⁽⁰⁾ ← ∅
4: S⁽⁰⁾ ← (Z⁽⁰⁾, H⁽⁰⁾)
5: for t = 0 to T - 1 do
6: // Action selection
7: A⁽ᵗ⁾ ← f_select(x, H⁽ᵗ⁾, Z⁽ᵗ⁾) // LLM picks subset of A
8: // Controlled traversal
9: Z̃⁽ᵗ⁺¹⁾ ← ∅
10: for all a ∈ A⁽ᵗ⁾ do
11: Z̃⁽ᵗ⁺¹⁾ ← Z̃⁽ᵗ⁺¹⁾ ∪ Πₐ(Z⁽ᵗ⁾) // expand candidates
12: end for
13: // Routing and state update
14: Z⁽ᵗ⁺¹⁾ ← f_route(x, H⁽ᵗ⁾, Z̃⁽ᵗ⁺¹⁾) // LLM prunes irrelevant
15: H⁽ᵗ⁺¹⁾ ← H⁽ᵗ⁾ ∪ Z⁽ᵗ⁺¹⁾
16: if Stop(x, H⁽ᵗ⁺¹⁾) then break // sufficient evidence?
17: end for
18: ŷ ← AnswerLLM(x, H⁽ᵗ⁺¹⁾)
19: return ŷ
Step-by-step walkthrough for the running example:
Query: “What did Caroline do when Nate won his second video game tournament?”
Turn 0:
- ExtractCues(x) → {
Caroline,video game tournament,second} - Initial active set: cues matched in graph →
Caroline,video game tournament - LLM selects action:
Π_{c→g}(expand to tags) +query_tag_eventsforvideo game tournament - Traversal returns candidate tags: {
Tournament Victory,Tournament Participation} - LLM routes: selects
Tournament Victoryas most relevant - Evidence: tags associated with
video game tournament+Tournament Victory
Turn 1:
- LLM action selection:
Π_{(c,g)→v}(video game tournament, Tournament Victory)→ retrieve episodes - Traversal returns: {: “Nate won 1st tournament May”, : “Nate won 2nd tournament July”}
- LLM routes: selects (second tournament), also calls
query_conversation_time(e₂)→ timestamp: July - Evidence: content + temporal anchor “July”
Turn 2:
- Reverse traversal:
Π_{v→(c,g)}(e₂)surfaces new cue “July” + tag “Travel/Location” - LLM selects:
query_tag_events(Caroline, Travel/Location)for July time window - Traversal returns: {: “Caroline was traveling in July”}
- LLM routes: is directly relevant
- Evidence: content
Turn 3:
- STOP condition: sufficient evidence to answer
- AnswerLLM(”…Caroline was traveling when Nate won his second tournament”)
This multi-step process recovered the answer that was completely unreachable by any single-shot passive retrieval.
5.4 Two Operating Modes
MRAgent operates in two modes:
- Navigate mode (): The agent invokes the toolkit to progressively explore the memory graph and accumulate evidence. The LLM reasons over current state and selects traversal actions.
- Answer mode (): Once the stopping condition is met, the agent transitions to generating the final response conditioned on .
6. Theoretical Analysis: Why Active > Passive
6.1 Hypothesis Class Formulation
The paper proves that active reconstruction is strictly more expressive than passive retrieval for any retrieval budget .
Setup: Consider tasks with queries , answers , and heterogeneous KG memories . An LM with parameters induces:
- : functions that decide which node to retrieve at step
- : an answer head that outputs given the retrieval history
An active policy uses — conditioned on full history. A passive policy uses — conditioned only on the query.
The active hypothesis class at budget :
The passive hypothesis class:
Theorem 4.1 (Main): For any retrieval budget , the passive hypothesis class is strictly contained in the active hypothesis class:
6.2 Proof Sketch: Active Retrieval Achieves Zero Error
The proof constructs a separating distribution — the “Binary-Tree Needle-in-a-Haystack” task — where:
- Memory is a complete binary tree of depth with nodes.
- Each internal node encodes the next bit to follow: = “go left (0) or right (1)”.
- The answer label is encoded only in the target leaf , where is sampled uniformly.
Active strategy achieves zero error: Starting from the root, at each step :
- Retrieve current node → read the next bit .
- Move to child . After steps, reach target leaf ; retrieve it to read . Total retrievals: . Error = 0.
Passive strategy has irreducible error: A passive policy must commit to all retrieval targets as a function of alone, before seeing any node payloads. The target leaf is uniform over leaves. Any fixed retrieval targets can include at most leaves. The probability of hitting is . Therefore:
where is the minimum error when only the prior is known.
Figure 4: Active vs Passive Retrieval — Hypothesis Class Geometry
graph TD
subgraph "All Possible Retrieval Predictors"
subgraph "H_active(T)"
subgraph "H_passive(T)"
P["Passive policies\n(query-only)"]
end
A["Active-only policies\n(history-conditioned)\ne.g., binary-tree follower"]
end
end
The containment is strict: there exist distributions (like ) where the best passive policy has bounded-away-from-zero error, but some active policy achieves zero error. This is not just a theoretical curiosity — the binary tree is a stylized model of any multi-hop query where each hop depends on information revealed by the previous hop.
7. Experiments
7.1 Setup
Benchmarks:
- LoCoMo (Maharana et al., 2024): 50 long-term conversational dialogues, each spanning up to 35 sessions with ~300 turns average, accompanied by ~200 QA pairs per conversation. Question categories: single-hop, multi-hop, temporal, open-domain. The paper excludes adversarial questions.
- LongMemEval (Wu et al., 2025): 500 questions paired with ~115K-token chat histories (the LongMemEval-S setting). Categories: multi-session, single-user, temporal-reasoning, single-preference.
Baselines:
- RAG (Lewis et al., 2020): Standard similarity-based retrieval.
- A-Mem (Xu et al., 2025): Graph-structured memory notes with LLM-assisted relation extraction; retrieves via similarity seed + neighborhood expansion.
- MemoryOS (Kang et al., 2025): Three-tier hierarchy (short-term, mid-term, long-term persona) with hierarchical retrieval.
- LangMem (LangChain, 2025): Compresses dialogue into memory summaries stored as embeddings; summarizes retrieved memories for context.
- Mem0 (Chhikara et al., 2025): Extracts natural-language facts incrementally; LLM-driven ADD/UPDATE/DELETE/NOOP operations; similarity-based retrieval at inference.
LLM Backbones: Gemini-2.5-Flash and Claude-Sonnet-4.5.
Metrics:
- LLM-Judge (J): Binary semantic correctness judged by GPT-4o-mini ().
- Score: Harmonic mean of precision and recall of correct answers.
- Evidence Recall: Fraction of ground-truth supporting evidence retrieved.
7.2 Main Results
Figure 5: LoCoMo Results — MRAgent vs. Baselines
| Model | Method | Multi-hop F₁ | Multi-hop J | Temporal F₁ | Temporal J | Open-Domain F₁ | Open-Domain J | Single-hop F₁ | Single-hop J | Overall J |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini | RAG | 34.89 | 58.16 | 43.52 | 49.22 | 25.68 | 41.67 | 53.69 | 69.20 | 61.30 |
| A-Mem | 33.81 | 53.54 | 40.22 | 49.53 | 12.49 | 33.43 | 46.39 | 61.83 | 55.97 | |
| MemoryOS | 41.42 | 63.82 | 35.91 | 47.04 | 23.43 | 41.66 | 54.82 | 71.90 | 63.35 | |
| LangMem | 40.67 | 61.82 | 44.70 | 53.58 | 20.49 | 43.65 | 48.20 | 69.68 | 62.86 | |
| Mem0 | 45.17 | 68.79 | 58.19 | 61.68 | 26.24 | 41.66 | 54.37 | 73.72 | 68.31 | |
| MRAgent | 43.69 | 75.17 | 67.66 | 80.37 | 32.51 | 68.75 | 64.08 | 90.48 | 84.21 | |
| Claude | RAG | 34.53 | 57.45 | 43.39 | 48.29 | 26.56 | 43.75 | 53.66 | 69.20 | 61.10 |
| LangMem | 44.37 | 70.92 | 56.64 | 80.68 | 22.66 | 54.71 | 54.36 | 83.12 | 78.61 | |
| MRAgent | 56.72 | 90.19 | 69.82 | 85.34 | 34.67 | 71.57 | 68.62 | 91.10 | 88.32 |
Key observations:
- Temporal questions show the sharpest improvement (Gemini: +18.7 J over Mem0; Claude: +4.7 J over LangMem on temporal). This directly validates the core design: temporal queries require inferring anchor timestamps as intermediate cues.
- Multi-hop questions benefit substantially (+6.4 J over Mem0 on Gemini; +19.3 J over LangMem on Claude), confirming that chained evidence is needed for compositional retrieval.
- Single-hop questions improve as well (+16.8 J over Mem0 on Gemini), suggesting that even for simpler queries, the structured tag-based access is more precise than similarity-based retrieval.
7.3 Efficiency Analysis
Figure 6: Token Consumption and Runtime Comparison (LongMemEval, Gemini)
| Method | Token Consumption | Runtime (s) |
|---|---|---|
| A-Mem | 632k | 1,122.23 |
| MemoryOS | 273k | 3,135.54 |
| LangMem | 3,268k | 1,209.57 |
| Mem0 | 245k | 533.29 |
| MRAgent | 118k | 586.11 |
MRAgent uses only 118k tokens per sample — a 5x reduction versus A-Mem and a 27x reduction versus LangMem. The reason: by exposing tags as intermediate semantic shortcuts, MRAgent can decide before loading full episodic content whether a memory branch is relevant. Irrelevant branches are pruned at the tag level without ever accessing the (expensive) episodic text.
The runtime of 586s is slightly higher than Mem0 (533s) but significantly lower than MemoryOS (3,136s). The multi-step reconstruction loop adds some latency versus single-shot retrieval, but this is offset by not needing to load thousands of tokens of irrelevant context.
7.4 Ablation Study
The ablation isolates three structural variants:
- CE (Cue→Episode): Direct episodic indexing without tags — retrieve episodes by cue match.
- CTE (Cue→Tag→Episode): Tags mediate cue-to-episode retrieval, but no content-level content nodes.
- CTC (Cue→Tag→Content): Full architecture including semantic nodes.
Each variant is tested with and without active reasoning.
Figure 7: Ablation Results (LoCoMo, Multi-hop, Claude Backbone)
| Variant | Reasoning | Multi-hop J (approx.) |
|---|---|---|
| CE (Cue→Episode) | No | 42 |
| CTE (Cue→Tag→Episode) | No | 55 |
| CTC (Cue→Tag→Content) | No | 60 |
| CE | Yes | 65 |
| CTE | Yes | 78 |
| CTC = MRAgent | Yes | 90 |
Ablation shows both structure richness (CE→CTE→CTC) and active reasoning each contribute, but reasoning is the dominant factor (note the gap between No-Reasoning and Yes-Reasoning columns within each variant).
Key findings:
- Active reasoning is the dominant factor. Across all structural variants, adding reasoning (blue bars) consistently outperforms the no-reasoning variant. Multi-step exploration is more important than any structural choice alone.
- Tags provide effective semantic guidance. Within the no-reasoning setting, performance improves monotonically from CE to CTE to CTC — demonstrating that richer associative structures enable more reliable retrieval.
- Episodic and semantic memory are complementary. Removing semantic nodes (going from CTC to CTE) degrades performance on multi-hop questions that require stable facts (e.g., “what does Nate enjoy?”) alongside episodic events.
7.5 Multi-Turn Reasoning Analysis
Figure 8: Evidence Recall Accumulation Over Reasoning Turns
The paper tracks cumulative evidence recall across reasoning turns on LoCoMo. Average turns required:
| Question Type | Average Turns | Max Valid Turns |
|---|---|---|
| Multi-hop | 3.16 | 2.65 |
| Temporal | 2.42 | 2.40 |
| Open Domain | 2.60 | 1.09 |
| Single-hop | 2.07 | 1.28 |
For single-hop questions, recall plateaus after ~2 turns — the simple queries converge quickly. For multi-hop questions, recall improves by over 30% across successive steps (red line in Figure 6(a)), confirming that iterative exploration is essential for compositional reasoning.
The agent’s self-termination is also effective: Max Valid Turns closely matches Average Turns, indicating that the LLM correctly identifies when additional exploration is unlikely to yield new useful evidence. Increasing parallel retrieval per turn (, the per-turn budget) yields only limited gains, while increasing depth (, number of reasoning turns) improves performance steadily — reconstruction depth cannot be substituted by increased parallel exploration.
8. Limitations and Boundary Conditions
The authors acknowledge several limitations in the conclusion:
1. Latency grows with reconstruction depth. Multi-step exploration adds wall-clock latency versus single-shot retrieval. Queries requiring deep chains (temporal + multi-hop) may require – turns, each involving LLM inference. In low-latency production settings this is a concern.
2. Static memory construction — no forgetting or updating. The memory graph grows monotonically as interactions accumulate. There is no mechanism to:
- Update existing nodes when facts change (e.g., Caroline moves to a different city).
- Forget old or contradictory information.
- Consolidate near-duplicate events.
This is acknowledged as a direction for future work. In long-lived deployment scenarios (e.g., a personal assistant used for years), the graph could become large and noisy.
3. Memory construction quality depends on LLM distillation accuracy. Tags and cues are extracted by an LLM. If the extraction is noisy — tags that are too generic, or cues that miss salient entities — the graph structure will be degraded, and reconstruction will fail. The paper does not analyze robustness to extraction errors.
4. Theoretical guarantee requires . Theorem 4.1 gives the strict containment for any budget . For (single-shot retrieval), active and passive policies are equivalent (both reduce to a single query-based selection).
9. Critical Assessment: Weaknesses and Improvements
9.1 Weaknesses and Flaws
Evaluation scope is narrow relative to the claims. The paper tests only two LLM backbones (Gemini-2.5-Flash and Claude-Sonnet-4.5) and two benchmarks. Both benchmarks are human–LLM-generated conversation datasets. The generalization claims (“MRAgent outperforms all baselines across question types”) are strong, but the evidence base is limited: no open-domain task-completion benchmarks, no benchmarks involving structured tool use, no stress-testing on very large memory graphs ( episodes).
No latency breakdown. The paper reports aggregate runtime (586s for MRAgent) but does not break down how much time is spent in (a) LLM inference, (b) graph traversal, (c) LLM routing, and (d) stopping condition evaluation. Without this, it is impossible to predict whether the system is latency-bottlenecked by the graph traversal or by the LLM calls — and therefore which component to optimize.
Reconstruction cost scales with depth, but no analysis for hard cases. Section 5.5 shows that increasing (depth) improves multi-hop performance steadily (Figure 9). But the paper sets without analyzing the distribution of required turns across query complexity. If 5% of queries require turns, is that acceptable? Is the LLM’s stopping condition accurate for hard multi-hop queries that require more than 8 turns?
Cross-backbone MRAgent gets 86.76 J vs MRAgent 72.95 J on LongMemEval* (Table 2 in paper). This 13.8-point gap suggests that Gemini-for-memory-construction + Claude-for-retrieval is substantially better than either alone. The paper does not fully explain why — are Gemini’s tag extractions higher quality? Is Claude’s routing better? This “oracle combination” result muddles the interpretation of the ablation study.
Variance is not analyzed rigorously for baselines. The paper reports standard deviations for MRAgent results (e.g., ) but the tables show that some baselines also have values. However, the significance of the 23% improvement claim is not tested via statistical tests on the benchmark; it relies on point estimates.
9.2 Limitations the Authors Understate or Omit
Memory construction is the weakest link, but receives the least attention. The paper describes the construction pipeline in Section 3.3 and Appendix B but provides no ablation on construction quality (e.g., what happens if tags are noisy, or cues are over-coarsened?). The 118k token efficiency claim assumes the graph is well-structured; a poorly constructed graph could cause more traversal steps and negate the efficiency gain.
The “no forgetting” limitation is more serious than acknowledged. In the paper’s experimental setup, memory graphs are constructed from fixed finite conversation histories. In a real deployment, conversations accumulate indefinitely. The paper does not address:
- What happens when two episodic units contradict each other (e.g., “Nate lives in Singapore” vs later “Nate moved to London”)?
- How does retrieval performance degrade as graph size grows from to episodes?
- Is the LLM’s tag-based pruning still accurate for very large active sets?
The evaluation does not include any adversarial or out-of-distribution queries. LoCoMo explicitly excludes adversarial questions. In real agent deployments, users ask vague, ambiguous, or contradictory questions. A reconstruction approach that commits to following tag-based paths could get stuck in an irrelevant branch if the initial cue extraction is off.
Tool-call overhead is not modeled. The seven-tool toolkit (Table 4) requires the LLM to decide which tool to invoke at each step — this is a function-calling overhead. For small models (< 7B parameters), structured tool-call generation is less reliable. The paper uses Gemini-2.5-Flash and Claude-Sonnet-4.5, both large and reliable for function calling, but it is unclear how MRAgent performs with smaller LLMs.
9.3 Concrete Improvement Suggestions
Adaptive memory maintenance. The most pressing limitation is the static graph. A natural fix is to add a maintenance pass after each conversation session: a lightweight LLM run that identifies contradictory or redundant triplets and performs UPDATE/DELETE operations analogous to Mem0’s approach. This would need to be efficient enough to not bottleneck the overall system.
Tag quality ablation and robustness. The authors should ablate construction quality by: (a) replacing LLM-extracted tags with random tags, (b) using a weaker extraction model, (c) intentionally injecting noisy cues. This would reveal how sensitive the approach is to tag quality and where the irreducible floor is.
Scale evaluation. Testing on memory graphs with – episodes (e.g., a year-long conversation history) would validate whether graph traversal remains efficient at scale. Empirically, does average reconstruction depth grow logarithmically or linearly with graph size?
Multi-agent memory sharing. The paper frames MRAgent as a single-agent memory system. An interesting extension is shared memory pools across multiple agents, where episodic evidence from one agent’s interactions can be queried by another. The Cue-Tag-Content graph structure is naturally composable (just merge the graphs), but conflict resolution across agents would require new mechanisms.
Smaller LLM backbone testing. Validating MRAgent with a 7B or 13B parameter model would determine whether the approach is practical for edge/on-device deployment, or whether it requires a frontier LLM’s function-calling reliability.
10. Conclusion
MRAgent makes a clean conceptual contribution: memory retrieval should be an active, iterative reconstruction process, not a passive similarity lookup. The Cue–Tag–Content graph provides the data structure that makes this tractable — tags serve as lightweight semantic checkpoints that allow the LLM to reason about which memory branches are relevant without loading expensive episodic content first. The theoretical result (Theorem 4.1) gives a rigorous justification for the design philosophy.
The empirical results are strong: 23% improvement on LoCoMo, 32% on LongMemEval, 5–28x reduction in token consumption, with competitive runtime. The ablation cleanly isolates the contribution of each component, and the multi-turn analysis reveals the mechanism — progressive evidence accumulation over 2–4 reasoning turns.
The critical open problems are the lack of a forgetting/update mechanism and the narrow evaluation scope. Future work should address memory lifecycle management for long-lived deployments and validate the approach on more diverse agent workloads. Despite these limitations, MRAgent represents a clear step forward in how we think about memory for long-horizon LLM agents, and the cognitive neuroscience framing provides a principled foundation for further development.
Code: https://github.com/Ji-shuo/MRAgent
11. Reproducibility and Implementation Notes
11.1 Memory Construction Details
The memory construction pipeline runs as a preprocessing step over the full conversation history before any queries are served. Key implementation choices:
Episodic segmentation heuristic. The paper uses for segmentation without specifying the exact prompt or segment length constraints. In practice, the authors’ public code uses overlapping context windows to ensure coreferences are resolved across turn boundaries — a detail that affects tag quality significantly.
Tag vocabulary. Tags are LLM-generated phrases; the vocabulary is not pre-defined. This means semantically similar tags (e.g., “Tournament Win”, “Won a Tournament”, “Gaming Victory”) may coexist in the graph without being merged. The code implements a post-processing step that clusters similar tags using embedding similarity, but the clustering threshold is a hyperparameter not discussed in the main paper.
Graph scale in experiments. On LoCoMo (50 conversations × ~300 turns × ~1 episode/2 turns), each conversation graph has roughly 150–300 episode nodes, 500–1000 cue nodes, and 200–400 tag nodes. This is small enough that full graph traversal is feasible; at scale, additional indexing would be needed.
11.2 Reconstruction Loop Implementation
The Navigate/Answer mode transition is implemented as a ReAct-style loop where the LLM generates structured JSON tool calls:
{
"mode": "Navigate",
"action": "query_tag_events",
"parameters": {
"cue": "video game tournament",
"tag": "Tournament Victory"
},
"reasoning": "Need to find which specific tournament was the second one."
}
The LLM must output valid JSON with the correct tool name and parameters — this requires a capable instruction-following model. The paper uses Gemini-2.5-Flash and Claude-Sonnet-4.5, both of which handle structured generation reliably. For smaller models, structured decoding (constrained generation with a JSON schema) would be required.
The stopping condition Stop(x, H^{(t+1)}) is also LLM-evaluated: the model emits "mode": "Answer" when it judges the accumulated context sufficient. This is an implicit reward signal — the model must internally estimate whether additional traversal will yield marginal value.
11.3 Evaluation Metric Details
The LLM-Judge metric uses GPT-4o-mini as a semantic equivalence judge with temperature 0. Each method is evaluated three independent times, and mean ± std is reported. The judge prompt asks: “Is the generated answer semantically equivalent to the reference answer, accounting for paraphrasing and different surface forms?” — outputting binary 0/1. This is more robust than exact-match metrics for open-ended memory questions, but it introduces judge-model-specific bias (GPT-4o-mini may have calibration differences from other models).
Evidence Recall in Equation (24):
where is the agent-retrieved evidence set and is the annotated ground-truth evidence set. This measures the retrieval completeness independent of whether the answer generation step uses the evidence correctly.
11.4 Comparison to GraphRAG
GraphRAG (Han et al., 2025) is a related system that organizes retrieval corpora into graph structures and retrieves via community summaries and neighborhood expansion. The key architectural difference from MRAgent:
- GraphRAG uses a top-down community summary approach: the graph is summarized at multiple coarse-grained levels, and retrieval starts from high-level community descriptions.
- MRAgent uses a bottom-up tag-guided traversal: starting from fine-grained cues, the system navigates up through tags and down to content.
GraphRAG is designed for document corpora (e.g., news articles, Wikipedia); MRAgent is designed for agent interaction histories where temporal and entity-specific relationships matter. The two approaches are complementary and could potentially be combined.
12. Broader Implications
12.1 A New Paradigm for Agent Memory
MRAgent’s most important contribution is not the specific architecture but the design principle: separate the retrieval structure from the retrieval policy. Existing systems conflate these — the graph structure (edges, neighborhoods) implicitly encodes the retrieval policy (follow these edges). MRAgent decouples them: the Cue-Tag-Content graph is the structure, and the LLM-driven traversal loop is the policy. This separation allows the policy to adapt at inference time without retraining the memory representation.
This principle generalizes beyond agent memory. Any system where: (a) the relevant answer depends on multi-hop reasoning across heterogeneous information, (b) intermediate evidence reveals new retrieval directions, (c) the retrieval budget is limited,
…would benefit from active reconstruction over passive retrieval. Candidate applications: scientific literature review (following citation chains based on intermediate findings), medical record analysis (following symptom → diagnosis → treatment chains), and code repository navigation (following function-call chains to trace bug origins).
12.2 Relationship to Search-R1 and Agentic RAG
Search-o1 (Li et al., 2025) and Search-R1 (Jin et al., 2025) both move retrieval inside the reasoning loop, issuing search queries on demand when a knowledge gap is detected. MRAgent differs in a crucial way: Search-o1 and Search-R1 retrieve from external corpora (web search, knowledge bases) to fill factual gaps within a single reasoning task. MRAgent reconstructs from the agent’s own interaction history to recall past experiences.
The two paradigms are complementary. A complete long-horizon agent could use:
- MRAgent-style active reconstruction for accessing its own memory (what happened in past sessions),
- Search-o1-style agentic RAG for accessing external world knowledge (current facts, documentation).
The combination raises interesting architectural questions: should the two memory sources be unified into a single graph, or kept separate with a meta-router deciding which to consult?