feat(agents): paged, multi-fidelity context memory for McpClient#1915

Open
iamMihirT wants to merge 4 commits into dimensionalOS:dev from iamMihirT:feat/agent-context-memory

Conversation

@iamMihirT

TL;DR

Replaces the unbounded list[BaseMessage] history in McpClient with a
paged, multi-fidelity memory layer that keeps the assembled prompt inside
the active model's context window. The worker thread also recovers from
turn-level exceptions instead of dying silently, which was the original
user-visible symptom (messages piling up in the queue with agent_idle
stuck at False).

Closes #1899.

Before / after

Before: every BaseMessage the agent saw was appended to self._history
and replayed in full on every turn. As the conversation grew — especially
with image artefacts — the request eventually exceeded the model's context
window. OpenAI returned context_length_exceeded, the exception bubbled
out of _process_message, the worker thread died, and every subsequent
enqueued message sat unprocessed forever.

After: McpClient owns a MemoryEngine that converts each inbound
message into one or more typed Page objects, each storing four fidelity
representations (POINTER, STRUCTURED, COMPRESSED, FULL). A
two-phase greedy selector chooses per-page fidelities to fit the active
model's token budget. Recent image artefacts are pinned at FULL; older
pages gracefully degrade. The worker thread wraps each turn in
try/except, emits a PHYSICAL_INSUFFICIENCY fault, and keeps draining
the queue.
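The recovery contract can be sketched independently of McpClient with stdlib primitives (a minimal sketch; `process` and `emit_fault` are illustrative stand-ins, not the real API):

```python
import queue
import threading

def run_worker(q, process, emit_fault, results):
    """Drain the queue; a failing turn emits a fault instead of killing the loop."""
    while True:
        msg = q.get()
        if msg is None:  # sentinel: shut down cleanly
            return
        try:
            results.append(process(msg))
        except Exception as exc:  # turn-level containment, as in this PR
            emit_fault(exc)       # e.g. a PHYSICAL_INSUFFICIENCY fault event

faults, results = [], []
q = queue.Queue()
for m in ["a", "boom", "c"]:
    q.put(m)
q.put(None)

def process(msg):
    if msg == "boom":
        raise RuntimeError("context_length_exceeded")
    return msg.upper()

t = threading.Thread(target=run_worker, args=(q, process, faults.append, results))
t.start()
t.join()
print(results, len(faults))  # prints: ['A', 'C'] 1
```

The second turn raises, but the first and third are still processed and the worker exits cleanly on the sentinel.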

Usage

Existing callers get the new behaviour automatically with sensible defaults
(full model window, 3 recent images pinned at FULL, 4096-token output
reserve). Three new optional knobs on McpClientConfig:

| field | default | purpose |
| --- | --- | --- |
| token_budget | None | Override the per-model context window. |
| pin_recent_evidence | 3 | Keep the N most recent image artefacts at FULL. |
| output_reserve_tokens | 4096 | Tokens carved out of the window for the response. |

Plus a new faults: Out[FaultEvent] stream on the module for
observability (eviction, fidelity-drop, rehydrate, and
physical-insufficiency events flow through it alongside structlog).
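The arithmetic behind these knobs can be sketched as follows (the helper and the overhead numbers are illustrative; the real accounting lives in budget.py):

```python
def effective_budget(context_window: int, output_reserve: int,
                     system_overhead: int, per_message_overhead: int,
                     n_messages: int) -> int:
    """Tokens left for page content after all reserves are carved out."""
    return (context_window
            - output_reserve
            - system_overhead
            - n_messages * per_message_overhead)

# e.g. a 128k window with the default 4096-token output reserve,
# hypothetical 512-token system overhead and 4 tokens/message framing
print(effective_budget(128_000, 4_096, 512, 4, n_messages=200))  # prints: 122592
```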

Component walkthrough

Everything new lives under dimos/agents/memory/. Each file is a single
responsibility; the public surface is deliberately small.

| file | role |
| --- | --- |
| pages.py | Page dataclass, fidelity enum, page-type enum. Invariants enforced in __post_init__. |
| page_table.py | Thread-safe page store. Explicit + auto pins, recency tracking, pin-rebalance diffs. |
| tokens.py | TokenCounter protocol + TiktokenCounter (OpenAI) and HeuristicCounter (fallback). |
| budget.py | ModelBudget: per-model context window, output reserve, per-message overhead. |
| faults.py | Structured FaultEvent + FaultObserver (routes to structlog and an optional Out stream). |
| ingestion.py | ingest_message: turns a BaseMessage into list[Page], with multimodal split rules. |
| selector.py | Two-phase greedy assemble_prompt respecting pins and the effective token budget. |
| engine.py | MemoryEngine facade: ingestion + assembly + pin rebalancing + fault emission + rehydration. |
| artefact_tool.py | build_get_artefact_tool: LLM-callable rehydration of a degraded EVIDENCE page back to FULL. |
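A minimal sketch of the two-phase greedy idea in selector.py (simplified: one shared cost table, no tool-call-pair atomicity, no eviction; all names illustrative):

```python
from dataclasses import dataclass

@dataclass
class Page:
    turn_seq: int
    costs: dict            # fidelity -> token cost, e.g. {"POINTER": 1, "FULL": 10}
    pinned_full: bool = False

FIDELITIES = ["POINTER", "STRUCTURED", "COMPRESSED", "FULL"]

def assemble(pages: list, budget: int) -> dict:
    """Phase 1: everything at cheapest fidelity (pins held at FULL).
    Phase 2: upgrade most recent pages first while the budget allows."""
    choice = {p.turn_seq: ("FULL" if p.pinned_full else "POINTER") for p in pages}
    spent = sum(p.costs[choice[p.turn_seq]] for p in pages)
    for p in sorted(pages, key=lambda p: -p.turn_seq):   # recent first
        while choice[p.turn_seq] != "FULL":
            nxt = FIDELITIES[FIDELITIES.index(choice[p.turn_seq]) + 1]
            delta = p.costs[nxt] - p.costs[choice[p.turn_seq]]
            if spent + delta > budget:
                break
            choice[p.turn_seq], spent = nxt, spent + delta
    return choice

pages = [Page(i, {"POINTER": 1, "STRUCTURED": 3, "COMPRESSED": 6, "FULL": 10})
         for i in range(5)]
choice = assemble(pages, budget=30)  # recent pages end at FULL, older ones degrade
```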

Integration is a single-file change in dimos/agents/mcp/mcp_client.py:
replace self._history with self._engine, feed state_graph.stream from
engine.assemble(), add the get_artefact tool to the agent's tool list,
and wrap _process_message in try/except inside _thread_loop.

Testing

170 tests ship with this PR (all green):

  • 168 unit tests in dimos/agents/memory/test_*.py — cover fidelity
    invariants, pin policy, token accounting contracts, budget resolution,
    fault emission, multimodal split rules, selector edge cases, and the
    rehydration path.
  • 2 integration tests in
    dimos/agents/mcp/test_mcp_client_history_compaction.py:
    • test_history_compaction_keeps_recent_images_at_full_under_pressure
      drives the real MemoryEngine from a bypass-constructed McpClient
      under an 8k-token budget and verifies the effective-cap,
      recent-image-pinning, and fault-emission contracts.
    • test_thread_loop_recovers_from_process_message_exception is the
      regression guard for the motivating bug: makes the second of three
      turns raise and verifies all three are processed, a
      PHYSICAL_INSUFFICIENCY fault is emitted, and the worker exits
      cleanly.

Run them with:

```shell
uv run pytest dimos/agents/memory/ \
              dimos/agents/mcp/test_mcp_client_history_compaction.py
```

Not included in this PR

  • Agent class: the older dimos.agents.agent.Agent was removed
    upstream before this work landed. Only McpClient is wired up.
  • End-to-end LLM tests: no real API calls in the test suite. The
    integration tests bypass-construct McpClient and drive the engine
    directly, which is fast and deterministic but does not exercise the
    full request round-trip.
  • Dynamic model window discovery: the ModelBudget table is hardcoded
    for current OpenAI/Anthropic/Gemini/Ollama model names (with
    longest-prefix matching). Add new entries in budget.py as models ship.
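The longest-prefix lookup can be sketched as follows (table entries and the normalization rule are illustrative simplifications of budget.py):

```python
WINDOWS = {                      # illustrative entries; the real table is in budget.py
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "claude-3-5-sonnet": 200_000,
    "llama3.2": 128_000,
}

def normalize(model_name: str) -> str:
    """Reduce 'provider:model:tag' to the bare model name; leave others alone."""
    parts = model_name.split(":")
    return parts[1] if len(parts) == 3 else model_name

def resolve_window(model_name: str, default: int = 8_192) -> int:
    """Match the longest known prefix, falling back to a conservative default."""
    name = normalize(model_name)
    best = max((k for k in WINDOWS if name.startswith(k)), key=len, default=None)
    return WINDOWS[best] if best is not None else default
```

Longest-prefix matching is what lets date-stamped names like `"gpt-4o-2024-08-06"` resolve without a dedicated table entry.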

Reviewer test plan

  • Read dimos/agents/memory/engine.py and
    dimos/agents/memory/selector.py — these are the two most
    architecturally load-bearing files.
  • Skim dimos/agents/memory/ingestion.py multimodal split rules
    (the ingest_message docstring lists the full matrix).
  • Verify the new McpClientConfig knobs are optional and
    default-equivalent to the old full-history behaviour for small
    conversations.
  • Run the two integration tests locally to confirm thread recovery.
  • Smoke-test your preferred blueprint (e.g. unitree-go2-agentic)
    against a long conversation with image artefacts to watch the
    fault stream in action.

Files changed

  • dimos/agents/memory/ — 10 production files + 9 unit test files (new)
  • dimos/agents/mcp/mcp_client.py — modified
  • dimos/agents/mcp/test_mcp_client_history_compaction.py — new
  • AGENTS.md — new Context Memory subsection
  • pyproject.toml — tiktoken>=0.8.0 added to the agents extra
  • uv.lock — lockfile update for tiktoken

Total: 24 files, ~6.3k insertions / ~18 deletions across 4 commits.

Contributor License Agreement

  • I have read and approved the CLA.

Made with Cursor

Mihir Trivedi added 4 commits April 24, 2026 15:08
Introduces a new ``dimos.agents.memory`` package that manages agent
conversation state as a set of typed, multi-fidelity ``Page`` objects
rather than an unbounded list of LangChain messages. Each page stores
four fidelity rungs eagerly (POINTER, STRUCTURED, COMPRESSED, FULL),
and a two-phase greedy selector picks per-page rungs to assemble a
prompt that fits the active model's token budget.

Core components:

- ``pages.py``: ``Page`` dataclass + fidelity / page-type enums, with
  invariants enforced in ``__post_init__``.
- ``page_table.py``: thread-safe store with explicit and auto pins;
  ``pin_recent_evidence`` keeps the N most recent EVIDENCE pages at
  FULL fidelity.
- ``tokens.py``: ``TokenCounter`` protocol, tiktoken-backed counter for
  OpenAI models and a heuristic fallback for others. Counters return
  content-only tokens; per-message ChatML overhead lives on the budget.
- ``budget.py``: ``ModelBudget`` with per-model context windows,
  output reserve, and per-message overhead; longest-prefix model name
  resolution.
- ``faults.py``: structured fault events (PAGE_EVICTED, PAGE_DEGRADED,
  REFETCH_FAULT, PHYSICAL_INSUFFICIENCY, PIN_REBALANCE) emitted to
  structlog plus an optional ``Out[FaultEvent]`` stream for live UI.
- ``ingestion.py``: turns a LangChain ``BaseMessage`` into one or more
  pages. Multimodal HumanMessages split into a CONVERSATION page plus
  one EVIDENCE page per image; AIMessage ``tool_calls`` payloads are
  cost-counted uniformly across all four fidelity rungs.
- ``selector.py``: two-phase greedy ``assemble_prompt`` that respects
  pin invariants and the effective (per-message-overhead-aware)
  budget.
- ``engine.py``: ``MemoryEngine`` façade coordinating ingestion,
  assembly, pin rebalancing, fault emission, and artefact rehydration.
- ``artefact_tool.py``: ``build_get_artefact_tool`` returns a
  ``StructuredTool`` an LLM can call to rehydrate a degraded EVIDENCE
  page back to FULL.
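The invariants-in-``__post_init__`` pattern described above can be sketched like this (field names simplified; not the actual ``Page`` definition):

```python
from dataclasses import dataclass
from enum import IntEnum

class Fidelity(IntEnum):
    POINTER = 0
    STRUCTURED = 1
    COMPRESSED = 2
    FULL = 3

@dataclass
class Page:
    page_type: str                        # e.g. "CONVERSATION" or "EVIDENCE"
    fidelity: Fidelity
    min_fidelity: Fidelity = Fidelity.POINTER
    pinned_at_full: bool = False

    def __post_init__(self) -> None:
        # fidelity may never sit below the floor the table has set
        if self.fidelity < self.min_fidelity:
            raise ValueError("fidelity below min_fidelity")
        # a page pinned at FULL must actually be at FULL
        if self.pinned_at_full and self.fidelity is not Fidelity.FULL:
            raise ValueError("pinned_at_full requires FULL fidelity")
```

Constructing an invalid page fails eagerly, so downstream selector and table code can assume the invariants hold.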

Adds ``tiktoken>=0.8.0`` to the ``agents`` extra in ``pyproject.toml``.

168 unit tests ship alongside the package covering fidelity invariants,
pin policy, token accounting, budget resolution, fault emission,
multimodal split rules, selector edge cases, and the rehydration path.

Replaces the unbounded ``self._history: list[BaseMessage]`` with a
``MemoryEngine`` that ingests every inbound message, hands the LLM an
assembled prompt that fits the active model's context window, and
emits fault events whenever pages are degraded or evicted to make room.

Changes:

- New ``McpClientConfig`` fields (all optional, default-equivalent to
  the model's full budget):
  - ``token_budget``: override the per-model context window.
  - ``pin_recent_evidence`` (default 3): keep the N most recent image
    artefacts at FULL fidelity.
  - ``output_reserve_tokens`` (default 4096): carved out of the
    context window for the model's response.
- New ``faults: Out[FaultEvent]`` stream on the module so operators
  can observe memory-layer events alongside the existing ``agent``
  and ``agent_idle`` streams.
- ``on_system_modules`` now appends the engine's ``get_artefact``
  StructuredTool to the agent's tool list so the LLM can rehydrate a
  degraded EVIDENCE page back to FULL on demand.
- ``_process_message`` ingests the inbound message through the engine
  and streams the assembled ``list[BaseMessage]`` into the state graph
  instead of the raw history.
- ``_thread_loop`` wraps ``_process_message`` in ``try/except`` so a
  turn-level failure no longer kills the worker thread silently; the
  exception is forwarded to the fault stream via
  ``emit_physical_insufficiency`` and the loop continues draining
  queued messages.

Fixes the original motivating bug: once the worker thread died on a
``context_length_exceeded`` or any other turn-level error, subsequent
messages piled up in the queue unprocessed and ``agent_idle`` stayed
at ``False`` forever.

…hread recovery

Adds two integration tests that drive a real ``MemoryEngine`` from a
bypass-constructed ``McpClient`` (no MCP server needed).

- ``test_history_compaction_keeps_recent_images_at_full_under_pressure``:
  ingests a system bootstrap + three camera-snapshot images
  interleaved with 100 chatty text turns under a tight 8k token budget.
  Asserts that (1) the assembled prompt respects
  ``budget.effective_budget_for_messages(n_surviving)``, (2) the three
  most recent EVIDENCE pages stay at FULL fidelity, (3) older
  CONVERSATION pages get degraded and a PAGE_DEGRADED fault is emitted,
  and (4) ``physical_insufficient`` stays False.

- ``test_thread_loop_recovers_from_process_message_exception``:
  regression guard for the motivating bug. Enqueues three messages and
  monkeypatches ``_process_message`` to raise on the second. Asserts
  that all three messages are processed (the thread survived), a
  PHYSICAL_INSUFFICIENCY fault carrying ``details["exception"]`` was
  emitted, ``agent_idle`` returns to True, and the worker exits
  cleanly on shutdown.

Adds a new ``Context Memory`` subsection under the Agent System block
covering: the original unbounded-history problem, the paged
multi-fidelity memory layer, the four fidelity rungs, the pin policy,
the ``McpClient`` configuration knobs, the fault event taxonomy, the
``get_artefact`` tool, and the ``_thread_loop`` recovery contract.

Also adds a Further Reading bullet pointing at the ``dimos.agents.memory``
package.
@greptile-apps

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR replaces the unbounded list[BaseMessage] history in McpClient with a paged, multi-fidelity MemoryEngine that keeps assembled prompts inside the model's context window, and wraps _process_message in a try/except so the worker thread survives turn-level failures. The architecture is well-designed — single-responsibility modules, strong Page invariants, correct tool-call pair atomicity in the selector, and a tolerant fault observer.

  • P1 — incomplete agent_idle fix: the try/finally that publishes agent_idle=True only wraps state_graph.stream(...). If engine.ingest() raises (e.g. TypeError from _role_of on an unsupported BaseMessage subclass), the exception escapes before that block, _thread_loop catches it, but agent_idle remains False forever — the exact stuck-idle symptom the PR targets. Moving the finally to wrap the entire post-publish(False) body of _process_message fixes this.
  • P2 — mutable Page objects in snapshots: PageTable.ordered() returns a shallow copy; callers hold references to the same Page instances that rebalance_evidence_pins mutates under its lock, making stale snapshots susceptible to torn reads.

Confidence Score: 3/5

Not safe to merge as-is: the agent_idle fix is incomplete and the stuck-idle symptom can still be triggered by an unexpected message type during ingestion.

One P1 defect present: the try/finally scope in _process_message does not cover engine.ingest() or engine.assemble(), so a TypeError from _role_of (reachable with any non-standard LangChain message subclass) leaves agent_idle permanently False — the exact symptom this PR sets out to fix. The ceiling for a P1 is 4; the defect sits directly on the motivating bug path, pulling the score to 3.

dimos/agents/mcp/mcp_client.py (_process_message try/finally scope) and dimos/agents/memory/ingestion.py (_role_of error handling) need attention before merge.

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/agents/mcp/mcp_client.py | Core integration: replaces _history list with MemoryEngine; adds fault stream. P1 bug: try/finally for agent_idle.publish(True) does not cover ingest/assemble calls, leaving agent_idle stuck at False on those failure paths. |
| dimos/agents/memory/engine.py | Façade wiring page table, counter, budget, and fault observer; well-structured single-responsibility design with proper locking and fault emission. |
| dimos/agents/memory/selector.py | Two-phase greedy selector with correct tool-call pair atomicity and pin handling; Phase-2 upgrade loop is O(N×U) which could degrade at scale but is documented. |
| dimos/agents/memory/page_table.py | Thread-safe page store with auto/manual pin rebalancing; ordered() returns shallow copies exposing mutable Page objects that can be seen mid-mutation by stale snapshots. |
| dimos/agents/memory/pages.py | Core data types with strong __post_init__ invariants covering role, fidelity monotonicity, pin consistency, and tool-call pairing requirements. |
| dimos/agents/memory/ingestion.py | Converts BaseMessage into Page objects with four fidelity levels; _role_of raises TypeError for unrecognised subclasses, which can trigger the agent_idle stuck bug in mcp_client.py. |
| dimos/agents/memory/budget.py | Immutable ModelBudget with longest-prefix model lookup; _normalize_model comment overstates what suffixes are stripped but the resolve logic is correct. |
| dimos/agents/memory/faults.py | Fault event types and observer; correctly tolerant (never raises from emit), routes to both structlog and optional Out stream. |
| dimos/agents/memory/artefact_tool.py | LLM-callable rehydration tool; cleanly separates the get_artefact side-effect (marking a page for FULL on the next turn) from prompt assembly. |
| dimos/agents/memory/tokens.py | Token counter protocol with tiktoken (preferred) and character-heuristic fallback; image cost constants are fixed estimates, expected for budget approximations. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Q as MessageQueue
    participant TL as _thread_loop
    participant PM as _process_message
    participant ME as MemoryEngine
    participant PT as PageTable
    participant SEL as assemble_prompt
    participant LG as LangGraph

    Q->>TL: message (dequeued)
    TL->>PM: _process_message(state_graph, message)
    PM->>ME: ingest(message)
    ME->>PT: add(page(s))
    ME->>PT: rebalance_evidence_pins()
    PT-->>ME: PinRebalance diff
    ME-->>PM: list[Page]
    PM->>ME: assemble()
    ME->>PT: ordered()
    ME->>SEL: assemble_prompt(pages, budget)
    SEL-->>ME: AssembledPrompt
    ME-->>PM: AssembledPrompt
    PM->>LG: state_graph.stream(messages)
    LG-->>PM: streamed node outputs
    PM->>ME: ingest(response_msg)
    PM->>PM: finally → agent_idle.publish(True)
    TL->>TL: except → emit_physical_insufficiency()
```

Comments Outside Diff (2)

  1. dimos/agents/memory/page_table.py, line 604-607

    P2 Mutable Page objects returned by ordered() can be seen mid-mutation

    PageTable.ordered() returns a shallow copy of _pages — so callers receive a list of the same Page instances held in the table. rebalance_evidence_pins and mark_pinned_full mutate those shared objects (setting pinned, pinned_at_full, min_fidelity) while holding _lock. Any code that retains a stale snapshot from a previous ordered() call can observe partially-mutated pages without the lock.

    The MemoryEngine drives everything from one worker thread so the practical risk is low today, but the invariant is fragile. Consider freezing Page (@dataclass(frozen=True)) and replacing pins with a side-car mapping inside PageTable, or at minimum document that pages must never be retained across a lock boundary.

  2. dimos/agents/memory/selector.py, line 1343-1413

    P2 Phase-2 upgrade loop is O(N²) over groups

    _upgrade_until_full rescans every non-evicted group on every iteration to find the best upgrade candidate. With U upgrades applied and N groups, the loop is O(N×U). The docstring acknowledges this, but a long conversation with hundreds of tool-result pages will exhibit quadratic scan cost on every assemble() call. A priority-queue keyed on (delta, -turn_seq) would keep this O(N log N) if scale becomes a concern.
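A sketch of the priority-queue variant the comment suggests (the `groups` shape is hypothetical; real code would also re-push entries whose upgrade delta changes after an upgrade):

```python
import heapq

def upgrade_order(groups):
    """Yield groups cheapest-upgrade-first, most-recent-first on ties.
    One heapify + N pops replaces rescanning every group per upgrade."""
    heap = [(g["delta"], -g["turn_seq"], i) for i, g in enumerate(groups)]
    heapq.heapify(heap)
    while heap:
        _, _, i = heapq.heappop(heap)
        yield groups[i]

groups = [
    {"turn_seq": 1, "delta": 7},
    {"turn_seq": 2, "delta": 3},
    {"turn_seq": 3, "delta": 3},
]
order = [g["turn_seq"] for g in upgrade_order(groups)]  # [3, 2, 1]
```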

Reviews (1): Last reviewed commit: "docs(AGENTS.md): document the context me..."

Comment on lines +303 to 345
```diff
     self,
     state_graph: CompiledStateGraph[Any, Any, Any, Any],
     message: BaseMessage,
 ) -> None:
     self.agent_idle.publish(False)
-    self._history.append(message)
+    # Ingest the inbound message into the memory engine; this becomes
+    # a Page (or several, for multimodal) and drives the next
+    # assemble. Edge case: under a physically-insufficient budget
+    # the fresh input itself may be degraded. The engine emits a
+    # PHYSICAL_INSUFFICIENCY fault in that case.
+    self._engine.ingest(message)
     pretty_print_langchain_message(message)
     self.agent.publish(message)

-    for update in state_graph.stream({"messages": self._history}, stream_mode="updates"):
-        for node_output in update.values():
-            for msg in node_output.get("messages", []):
-                self._history.append(msg)
-                pretty_print_langchain_message(msg)
-                self.agent.publish(msg)
-
-    if self._message_queue.empty():
-        self.agent_idle.publish(True)
-
-
-def _append_image_to_history(
+    assembled = self._engine.assemble()
+    try:
+        for update in state_graph.stream(
+            {"messages": assembled.messages}, stream_mode="updates"
+        ):
+            for node_output in update.values():
+                for msg in node_output.get("messages", []):
+                    self._engine.ingest(msg)
+                    pretty_print_langchain_message(msg)
+                    self.agent.publish(msg)
+    finally:
+        if self._message_queue.empty():
+            self.agent_idle.publish(True)
+
+
+def _queue_image_artefact_message(
+    mcp_client: McpClient, func_name: str, uuid_: str, result: Any
+) -> None:
+    """Queue a HumanMessage containing an image artefact returned by a
+    non-text tool result.
+
+    Ingestion splits this multimodal message into a CONVERSATION page
+    (with the preamble text) plus one EVIDENCE page per image artefact
+    automatically — see ``dimos.agents.memory.ingestion``. The caller
+    doesn't need to set a page type hint.
+    """
+    mcp_client.add_message(
+        HumanMessage(
+            content=[
```

P1 agent_idle stuck at False when ingest raises

The try/finally that resets agent_idle only wraps the state_graph.stream(...) call. If self._engine.ingest(message) or self._engine.assemble() raises — for example, ingestion._role_of raises TypeError for any BaseMessage subclass beyond the four explicit cases (FunctionMessage, etc.) — the exception propagates out of _process_message before the try block is ever entered. _thread_loop catches it (preventing thread death), but agent_idle remains False forever. This is the same stuck-idle symptom the PR intends to cure, re-introduced for a narrower but still real failure mode.

Move the finally to cover the entire post-publish(False) body:

```python
def _process_message(
    self,
    state_graph: CompiledStateGraph[Any, Any, Any, Any],
    message: BaseMessage,
) -> None:
    self.agent_idle.publish(False)
    try:
        self._engine.ingest(message)
        pretty_print_langchain_message(message)
        self.agent.publish(message)

        assembled = self._engine.assemble()
        for update in state_graph.stream(
            {"messages": assembled.messages}, stream_mode="updates"
        ):
            for node_output in update.values():
                for msg in node_output.get("messages", []):
                    self._engine.ingest(msg)
                    pretty_print_langchain_message(msg)
                    self.agent.publish(msg)
    finally:
        if self._message_queue.empty():
            self.agent_idle.publish(True)
```

Comment on lines +96 to +104
```python
def _default_provenance(msg: BaseMessage) -> str:
    if isinstance(msg, SystemMessage):
        return "system"
    if isinstance(msg, HumanMessage):
        return "user_input"
    if isinstance(msg, AIMessage):
        return "llm_response"
    if isinstance(msg, ToolMessage):
        return "tool_response"
```

P2 _role_of raises TypeError for unrecognised message types

_role_of has an explicit raise TypeError(...) fall-through for any BaseMessage subclass beyond System/Human/AI/Tool. LangChain occasionally surfaces FunctionMessage or vendor-specific subclasses. When that happens, engine.ingest() raises, escaping _process_message before the try/finally that resets agent_idle. Combined with the companion comment on _process_message, this path is the concrete trigger for the stuck-idle bug.

Consider adding a fallback that maps unknown subclasses to "human" (or emits a structured warning and skips ingestion) rather than raising, so a single unexpected message type doesn't stall the agent.
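One possible shape for that fallback, using stand-in classes rather than the real LangChain types (a sketch, not the PR's code):

```python
import logging

log = logging.getLogger("ingestion")

# Stand-ins for the LangChain message hierarchy; FunctionMessage plays
# the role of an unexpected subclass.
class SystemMessage: ...
class HumanMessage: ...
class AIMessage: ...
class ToolMessage: ...
class FunctionMessage: ...

_ROLES = [(SystemMessage, "system"), (HumanMessage, "human"),
          (AIMessage, "ai"), (ToolMessage, "tool")]

def role_of(msg) -> str:
    for cls, role in _ROLES:
        if isinstance(msg, cls):
            return role
    # Unknown subclass: warn and degrade to "human" instead of raising,
    # so one odd message type cannot stall the worker's idle signal.
    log.warning("unrecognised message type %s; treating as human",
                type(msg).__name__)
    return "human"
```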

Comment on lines +207 to +224
```python
    system_overhead: int = DEFAULT_SYSTEM_OVERHEAD,
    per_message_overhead: int = DEFAULT_PER_MESSAGE_OVERHEAD,
) -> ModelBudget:
    """Resolve a :class:`ModelBudget` for *model_name*.

    All arguments after ``model_name`` are keyword-only; callers must pass
    ``override=...`` etc. explicitly.

    Args:
        model_name: Provider-qualified model string (e.g. ``"gpt-4o"`` or
            ``"ollama:llama3.2:latest"``).
        override: If set, use this as the ``context_window`` instead of the
            table lookup. Useful when operators want to keep prompts
            artificially small for cost / latency control.
        output_reserve: Tokens held back for the response.
        system_overhead: Formatting / tool-schema slack.
        per_message_overhead: Per-message role/framing cost. See
            :class:`ModelBudget`.
```

P2 _normalize_model comment overstates what it strips

The docstring says it drops "OpenAI -YYYY-MM-DD date suffixes", but the implementation only strips the :tag suffix. Date-stamped strings like "gpt-4o-2024-08-06" pass through unchanged and are resolved by the longest-prefix fallback in resolve_budget. The fallback works correctly, but the mismatch makes the intended contract ambiguous for future editors. Either update the comment or add the date-suffix strip here.
