Skip to content

ClaudioDrews/ground-truth-gap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 

Repository files navigation

Why All Agentic Memories Fail — and How to Fix It

Every memory system built for AI agents suffers from the same hidden flaw. Here's what it is, why nobody talks about it, and how to actually fix it.


Ground Truth Gap — pixel art illustration: on the left, a serene landscape with a "GROUND TRUTH" sign; on the right, a confused robot at a cluttered desk drowning in post-it notes of forgotten tasks. The dashed line with a question mark bridging the chasm represents the gap between having memory and using it.


1. The Problem Nobody Names

If you've built an AI agent with memory, you've seen this pattern:

  • You spend hours configuring memory — vector databases, session archives, structured facts, cross-agent fabric.
  • The pipeline works. Context is injected into every prompt. You can see retrieved blocks right there in the system preamble.
  • Your agent acts like they don't exist. It runs search tools, calls APIs, and rediscovers from scratch what was already in its prompt.

This is the Ground Truth Gap.

The infrastructure of memory works perfectly. The pipeline captures, stores, retrieves, and injects. But the agent doesn't use the injected context — because nothing in its identity tells it to.

Injection ≠ usage.

This gap is systemic. It affects every memory architecture I've tested across two major agent frameworks over six months of production use. And it's not a pipeline bug — it's an identity bug.


2. How the Gap Manifests Across Agents

Hermes Agent (stock/vanilla)

Hermes has a sophisticated memory system out of the box: workspace files (MEMORY.md, USER.md), a session database with FTS5, a structured fact store with entity resolution and trust scoring, and hooks for pre-LLM context injection.

Yet the vanilla Ground Truth hierarchy in its identity documents (~/.hermes/SOUL.md, rulebook.md) only listed three levels:

1. Terminal output → Ground Truth
2. Official documentation → Authoritative
3. Training knowledge → Reference only

The injected memory layers — [qdrant], [fabric], [sessions], [facts] — were not listed at all. Without an explicit rank in the hierarchy, the agent treated injected context as optional text, competing equally with the model's pre-training.

The result: Perfect injection pipeline. Zero behavioral change. The agent kept calling session_search to find what [sessions] already provided, kept running curl against Qdrant to verify what [qdrant] already stated, kept probing fact_store to confirm facts that were already linked in the prompt.

OpenClaw (stock/vanilla)

OpenClaw supports two native memory backends: builtin (SQLite FTS5 + local embeddings) and QMD (Query Markdown Daemon — an external process for vector indexing). It also provides the active-memory plugin, which registers a before_prompt_build hook that proactively retrieves context and injects it before every LLM call.

This is technically sophisticated — the injection is automatic, the retrieval is handled by a subagent, and the context lands in the prompt before the model even sees the user's message.

But OpenClaw's identity documents — the workspace SOUL.md, AGENTS.md, and configuration — define the agent's character, not what information the agent should treat as authoritative. The injected context from active-memory arrives as untrusted text in prependContext. The model can treat it as suggestion, ignore it, or rediscover the same information by calling memory_search itself.

The result: The same gap. The plugin injects perfectly. The agent ignores what was injected because its identity never told it these blocks are authoritative — do not rediscover them.


3. How the Gap Affects Every Memory Architecture

The Ground Truth Gap isn't framework-specific. I've tested or studied seven distinct memory architectures. All of them share the same vulnerability.

Standard workspace memory (MEMORY.md / USER.md)

Flat files injected into the system prompt every turn. The simplest and most common approach. When the agent's identity doesn't rank these files as authoritative, they're just more text — competing with the model's training data, user messages, and tool outputs.

The agent may literally have the answer in USER.md and still reach for search tools to confirm it.

Holographic memory (fact_store with trust scoring)

Hermes Agent's fact_store stores durable facts with entity resolution, holographic reduced representations, and a trust scoring system. Facts are injected into the prompt via the pre_llm_call hook.

But without an identity-layer rule requiring fact_feedback after retrieval, the trust scoring system is ornamental — facts accumulate with zero feedback, trust scores stagnate at default, and the agent probes fact_store to verify facts that [facts] already delivered.

The fix we applied: Added a mandatory rule to the identity document: every fact retrieval must be paired with fact_feedback in the same turn. This closes the feedback loop and makes the system self-correcting.

QMD — Query Markdown Daemon (OpenClaw)

QMD is one of OpenClaw's two native memory backends. It runs as an external daemon that indexes markdown files for vector search. When configured, the active-memory plugin creates a subagent that calls memory_search via QMD and injects results as prependContext.

But QMD results arrive tagged as "untrusted context (metadata)" in OpenClaw's prompt injection system. The model receives retrieved memory wrapped in a caveat that it's untrusted. Without identity-level correction, the agent defaults to treating its own training knowledge as more reliable than its own retrieved memory.

Hindsight (Vectorize.io)

Hindsight is a purpose-built agent memory system that organizes memory into four logical networks: world facts, agent experiences, synthesized entity summaries, and evolving beliefs. It achieves state-of-the-art results on long-horizon conversational benchmarks, lifting accuracy from 39% to 83.6% over full-context baselines.

Hindsight has an official OpenClaw integration. It injects memory context directly into prompts — the model doesn't need to call a tool to retrieve it.

But even Hindsight's authors acknowledge the behavioral gap. From the integration documentation:

"The memory system works perfectly — but it's invisible to the agent because the model forgot to call the tool."

Their solution is auto-injection: better to spend tokens on injected context than have a model that ignores stored facts. This is correct as far as it goes. But auto-injection only solves the retrieval problem. It doesn't solve the authority problem.

If the agent's identity doesn't rank injected context as Ground Truth, even Hindsight's auto-injected summaries compete for attention with the model's training knowledge. The gap persists.

Honcho (Plastic Labs)

Honcho was an external memory-as-a-service platform that used dialectic reasoning to build and maintain models of users. After each conversation turn, it analyzed the exchange and derived insights about preferences, habits, and goals. It provided user representation and session-scoped context, automatically injected into the prompt.

Honcho had official integrations with both Hermes Agent and OpenClaw. It was a technically sophisticated solution — not just retrieval, but reasoning over conversations to infer what matters.

Yet Honcho suffered from the same fundamental gap. The platform delivered context, but the agent's identity documents didn't treat that context as authoritative. The dialectic reasoning could derive the perfect insight, inject it into the prompt — and the agent would still rediscover the same information by searching files or calling APIs.

The context was delivered. The model didn't use it. Not a pipeline bug. An identity bug.

In our production environment, Honcho was eventually deprecated (Fact #87): the cron job that wrote metrics to Honcho had zero consumers, and the local memory stack (Qdrant + Fabric + fact_store) covered the same ground more reliably without external dependency.

Icarus Plugin (cross-agent fabric)

Icarus is a plugin for Hermes Agent that extracts structured session summaries and makes them available across sessions via the Fabric system. It provides tools like fabric_recall, fabric_write, fabric_brief, and fabric_search, and automatically injects relevant context from Qdrant, sessions, facts, and fabric into every prompt.

But Icarus delivers context into the prompt without any instruction about its authority. The agent can still call fabric_recall to re-find what [fabric] already provided, call session_search to look up what [sessions] already states, or probe fact_store for facts [facts] already linked.

Every rediscovery burns tokens and time. The infrastructure is perfect. The behavioral instruction is missing.

Custom memory architectures (Qdrant, sidecar APIs, wiki pipelines)

Any custom memory system built on top of vector databases, external APIs, or knowledge pipelines faces the same problem. The pipeline can retrieve, rank, summarize, and inject. If the agent's identity doesn't rank injected context as authoritative over its own generation, the work is wasted.

I've seen this pattern repeat across every architecture I've built or studied. The cause is always the same: the agent's identity documents define what the agent trusts, and injected memory was never added to that list.


4. The Solution — Our Implementation

Principle: Identity must rank injected memory as Ground Truth

The fix is not in the pipeline. The fix is in the agent's identity documents — the files that define who the agent is and how it should operate. We applied this fix to two distinct codebases with different identity architectures.

Applied in Memory OS (Hermes Agent)

Memory OS expanded the Ground Truth hierarchy in ~/.hermes/SOUL.md and ~/.hermes/rulebook.md from 3 levels to 4:

1. Terminal output → Ground Truth for system state (runtime)
2. Injected memory [qdrant, fabric, sessions, facts] → Ground Truth for
   documented knowledge and prior decisions
3. Official documentation → Authoritative for APIs, configs, version-specifics
4. Training knowledge → Reference only; always verify against 1-3

With explicit conflict resolution rules:

Sources conflict Winner
Terminal vs Injected memory Terminal wins for system state. Injected memory wins for documented knowledge.
Injected memory vs Assumptions Injected memory wins. Never treat a question as novel when the answer is already in your prompt.
Injected memory vs Official docs Official docs win for version-sensitive specifics. Injected memory wins for project context.
Training knowledge vs anything Training knowledge always loses. Verify against 1-3.

And a critical behavioral instruction:

"When injected memory contradicts your assumptions, injected memory wins. Never treat a question as novel when the answer is already in your prompt."

Additionally, we added a fact feedback rule that closes the trust-scoring loop:

"When you retrieve a fact from fact_store (via probe, search, or reason) and reference it in your response, you MUST call fact_feedback in the same turn — action='helpful' if the fact was accurate, action='unhelpful' if it was wrong. This is not optional. The trust scoring system depends on it."

The result: the agent now reads injected [qdrant], [fabric], [sessions], [facts] blocks before running any discovery tools. It cites injected context directly. It does not rediscover knowledge already in the prompt.

Applied in Project Samantha (OpenClaw)

Project Samantha is a persona agent built on OpenClaw. Instead of modifying the framework's core identity documents (which can get overwritten on updates), we encoded the mandate in its workspace AGENTS.md:

"Memory is not an optional resource. It is the primary source of how [the persona] expressed herself. Without consulting it, the response is not hers."

This makes memory consultation a non-optional identity rule — not a "consider using" suggestion but a "you must do this" command. The agent checks memory before every response because its identity requires it, not because the pipeline prompts it.

The AGENTS.md also defines:

  • Mandatory pre-response queries — two memory searches (core + bio) before every response, hardcoded as curl commands to the local sidecar API.
  • Score threshold guidance — specific cutoffs (0.75+ for direct use, 0.50-0.74 for general shape, <0.50 for noise) that tell the agent how much to trust each retrieved memory.
  • Ingestion protocol — every significant interaction is extracted and stored via a dedicated API endpoint, with anti-hallucination validation (no memory is stored without a verbatim quote from the conversation).

The identity documents define that memory is not a tool to call — it is the primary source of persona authenticity.


5. How to Apply the Fix

The same principle applies across frameworks. Here are concrete steps for vanilla setups and guidance for third-party memory systems.

In vanilla Hermes Agent

  1. Edit ~/.hermes/SOUL.md — Add injected memory ([qdrant], [fabric], [sessions], [facts]) as an explicit Ground Truth level, with the rule: "When injected memory contradicts your assumptions, injected memory wins. Never treat a question as novel when the answer is already in your prompt."

  2. Edit ~/.hermes/rulebook.md — Add a "Source of Truth" table that ranks injected memory above assumptions and training knowledge, with a mandatory verification behavior: "Before answering a question about documented knowledge or prior decisions, check the injected memory in your prompt. If the answer is already there, do not rediscover it."

  3. Add the fact feedback rule — Every fact retrieval must be paired with fact_feedback in the same turn. This closes the trust-scoring loop and prevents fact stagnation.

  4. Restart the gateway — Identity document changes only take effect in new sessions after a gateway restart: systemctl --user restart hermes-gateway.

In vanilla OpenClaw

  1. Edit workspace AGENTS.md — Add a Ground Truth section that explicitly ranks memory sources: injected context > memory API results > model training knowledge. Define the conflict rule: "When injected memory contradicts your assumptions, injected memory wins. Do not rediscover what is already in your context."

  2. Make memory consultation mandatory — Instead of "you may use memory_search", write "you MUST check memory before every response." The difference between a suggestion and an identity requirement is everything.

  3. If using active-memory plugin — Ensure your identity documents reference and rank its injected context as authoritative. The prependContext it delivers is not "untrusted" — it is the agent's own recorded experience.

  4. If using QMD backend — Add a note to the identity document that QMD results are the agent's own stored knowledge, not third-party data, and should be treated as self-authored memory.

  5. Configure memory as mandatory identity behavior, not optional tool use — The agent should not need to remember to check memory. Its identity should require it.

How it applies to Hindsight

Hindsight's auto-injection already solves the retrieval problem — memory is injected without the agent needing to call a tool. But it still needs the identity-layer fix.

To apply the Ground Truth hierarchy to a Hindsight-powered agent:

  • Add a rule to the agent's identity document: "The memory blocks auto-injected by Hindsight are Ground Truth for the agent's experiences, facts, and evolving beliefs. When they contradict your assumptions, the injected memory wins."
  • Hindsight's four logical networks (world facts, agent experiences, entity summaries, evolving beliefs) benefit from explicit ranking — so the agent knows that its own experiences outrank its pre-training knowledge of similar topics.

Without this identity change, Hindsight's auto-injected summaries are just more text in the prompt. With it, they become the agent's authoritative memory of what happened, what it learned, and what it believes.

How it applies to Honcho

Honcho's dialectic reasoning derived user insights automatically. But those insights were injected as context without identity-level authority — the same gap.

To fix a Honcho-integrated agent (or any external memory service):

  • Add to the identity document: "Context injected by Honcho — including user representations, session summaries, and derived insights — is Ground Truth about the user. Do not rediscover or override with assumptions."
  • This applies regardless of whether Honcho is cloud-hosted or self-hosted. The authority problem is architectural, not deployment-specific.

How it applies to any memory architecture

The principle is framework- and provider-agnostic:

Whatever your memory system is — a flat file, a vector database, a reasoning engine, or a cloud API — the pipeline can deliver, but only identity can authorize.

For any memory architecture, the fix has three parts:

  1. Name the injected sources — List them explicitly in the agent's identity document (SOUL.md, rulebook.md, AGENTS.md, or equivalent).
  2. Rank them in the Ground Truth hierarchy — Above training knowledge, above model assumptions. Define when they win and when they lose.
  3. Make compliance mandatory — Not "consider reviewing" but "you MUST use what is already in your prompt before reaching for tools."

6. The Universal Principle

After fixing this gap across two agent frameworks, seven memory architectures, and six months of production iteration, one principle emerges clearly:

Injected memory is only used when the agent's identity documents rank it as Ground Truth.

The ranking cannot live in the injection pipeline. It cannot live in the memory backend. It cannot be solved by better vector search, better embedding models, more retrieval strategies, or more sophisticated reasoning.

The ranking must live in the document that answers the question: "Who is this agent and what does it trust?"

  • For Hermes Agent: SOUL.md and rulebook.md
  • For OpenClaw: AGENTS.md and SOUL.md (workspace level)
  • For any agent framework: the identity document that defines behavior

Without this hierarchy, injected context is just more text in a sea of tokens — competing equally with everything else the model knows. The agent doesn't know it should trust its own memory more than its training data, because nobody told it.

With the hierarchy, injected memory becomes authoritative. The agent stops rediscovering. It stops verifying. It uses what it has.

This is not a pipeline fix. It's an identity fix.

And it's the single most important thing you can do to make your agent actually remember.


Further reading


Built from real production work — two agent frameworks, seven memory architectures, six months of iteration. Written by someone who ran headfirst into every single one of these failures so you don't have to.

MIT License

About

Why all agentic memories fail — and how to fix it. The Ground Truth Gap: the hidden reason AI agents ignore their own memory.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors