PCM: Proof-Carrying Memory for LLM Agents

Reduce LLM costs and hallucinations by compressing conversation state into evidence-bound "memory packs" with verification and repair.

Evidence-bound answers — every claim has a citation or is marked [UNVERIFIED]
Low-token memory packs — ~11x cheaper per grounded-correct answer
Automatic repair loop — fallback retrieval when verification fails

Quickstart

pip install -e .
python examples/quickstart.py

from pcm import PCMConfig
from pcm.stores import InMemorySourceStore, InMemoryMemoryStore
from pcm.extraction import MemoryExtractor
from pcm.runtime import PackBuilder, Generator, Verifier, FallbackHandler

# Initialize
config = PCMConfig()
source_store = InMemorySourceStore()
memory_store = InMemoryMemoryStore()

# Ingest session context
session_id = "demo"
text = """
Project deadline is 2026-02-15.
Authentication must use OAuth2.
Do not store PII in logs.
"""
# ... ingest and extract atoms ...

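# Query with PCM. `await` must run inside an async function; pack_builder and
# generator are created during the setup step elided above.
pack = await pack_builder.build(session_id, "What is the deadline?")
output = await generator.generate("What is the deadline?", pack)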

print(output.answer)
# "The deadline is 2026-02-15. [mem_deadline_001]"

print(output.claims)
# [Claim(text="The deadline is 2026-02-15", memory_refs=["mem_deadline_001"])]

API

Endpoint	Method	Description
`/ingest`	POST	Ingest text into a session
`/query`	POST	Query with evidence-bound response

POST /ingest

{
  "session_id": "my_session",
  "text": "Project deadline is 2026-02-15. Use OAuth2 for auth."
}

POST /query

{
  "session_id": "my_session",
  "query": "What is the deadline?"
}

Response

{
  "answer": "The deadline is 2026-02-15. [mem_deadline_001]",
  "claims": [
    {"text": "The deadline is 2026-02-15", "refs": ["mem_deadline_001"]}
  ],
  "verification": {"ok": true},
  "tokens_used": 423
}
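
The two endpoints can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. The base URL `http://localhost:8000` and the helper names `build_request`, `post_json`, and `ask` are assumptions for illustration; this README does not state where the server listens or what `/ingest` returns.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for a PCM endpoint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON body (empty body -> empty dict)."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else {}


def ask(session_id: str, text: str, query: str) -> str:
    """Ingest text, query it, and return the answer only if verification passed."""
    post_json("/ingest", {"session_id": session_id, "text": text})
    result = post_json("/query", {"session_id": session_id, "query": query})
    if not result["verification"]["ok"]:
        raise RuntimeError("answer failed verification")
    return result["answer"]
```

With a server running, `ask("my_session", "Project deadline is 2026-02-15.", "What is the deadline?")` would return the `answer` field shown in the response above. Checking `verification.ok` before using the answer keeps unverified output from flowing downstream.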

Benchmark Results

Internal Benchmark (v15 Stable)

Metric	v15	vs Full-Context
Grounded accuracy (G-Acc)	71.0%	+45.2 pts
Unsupported claims per question	0.00	-0.07
Tokens per grounded-correct	596.1	11.1x cheaper
Fallback recovery (when triggered)	60%	—
Refusal correctness (missing info)	100%	+100 pts

Compared to full_context_cited, PCM is ~11.1x cheaper per grounded-correct answer (Tokens/G-Correct), not per raw query.
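
The arithmetic behind that ratio can be spelled out. The sketch below assumes Tokens/G-Correct means total tokens spent across all queries divided by the number of grounded-correct answers; this README does not define the metric formally, so treat the function as an illustration. The sample counts are invented, while 596.1, 11.1x, 71.0%, and +45.2 pts come from the table above.

```python
def tokens_per_grounded_correct(total_tokens: int, grounded_correct: int) -> float:
    """Assumed definition: tokens spent on all queries / grounded-correct answers."""
    if grounded_correct == 0:
        return float("inf")  # no grounded-correct answers: cost is unbounded
    return total_tokens / grounded_correct


# Invented illustration: 30,000 tokens spent, 50 grounded-correct answers.
print(tokens_per_grounded_correct(30_000, 50))  # 600.0

# Baseline figures implied by the reported table values.
pcm_cost = 596.1          # Tokens/G-Correct for PCM
ratio = 11.1              # "11.1x cheaper"
print(round(pcm_cost * ratio, 1))  # 6616.7 tokens per grounded-correct (baseline)
print(round(71.0 - 45.2, 1))       # 25.8 percent grounded accuracy (baseline)
```

Because wrong or ungrounded answers still consume tokens, this metric penalizes a system twice for low grounded accuracy, which is why it differs from a raw per-query cost.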

LoCoMo Long-Context Evaluation (Recent)

Evaluated on LoCoMo benchmark (199 questions, long conversational memory):

Metric	PCM	Progress
Accuracy	24.1%	Baseline established
Grounded Accuracy	15.6%	Evidence-bound answers
Wrong Refusals	38.2%	Reduced via EVENT_TIME extraction
Unsupported Claims/Q	0.31	Low hallucination rate

Recent improvements:

✅ Added EVENT_TIME atom extraction for temporal queries ("when did X happen?")
✅ Implemented refusal override with expanded retrieval
✅ Added semantic embeddings for better fact retrieval
✅ Fixed citation aggregation and scoring

See docs/EVALUATION_BENCHMARK.md for evaluation methodology.

# Reproduce benchmark
python benchmarks/run_benchmark.py \
  --dataset benchmarks/datasets/benchmark_v10.jsonl \
  --out benchmarks/results/benchmark_v15.json

See docs/benchmark.md for methodology.

How It Works

Ingest — chunk text, extract structured atoms (facts, constraints, decisions)
Pack — build memory pack under strict token budget
Generate — produce answer with claim-level citations
Verify — check claims against evidence
Repair — if failed, retrieve evidence → patch repair or regenerate
Return — answer + audit trail
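
The steps above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the `pcm` implementation: `Atom`, `build_pack`, and `answer` are hypothetical names, tokens are approximated by word count, ranking is plain word overlap, and repair is simplified to regenerating from a larger pack.

```python
import re
from dataclasses import dataclass


@dataclass
class Atom:
    atom_id: str
    text: str


def words(text: str) -> set:
    return set(re.findall(r"[\w-]+", text.lower()))


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def build_pack(atoms, query, budget):
    """Greedy pack: rank atoms by word overlap with the query, add while under budget."""
    q = words(query)
    ranked = sorted(atoms, key=lambda a: -len(q & words(a.text)))
    pack, used = [], 0
    for atom in ranked:
        cost = count_tokens(atom.text)
        if used + cost <= budget:
            pack.append(atom)
            used += cost
    return pack


def answer(query, atoms, generate, verify, budget=8, fallback_budget=64):
    """Pack -> generate -> verify, with one repair attempt on a larger pack."""
    trail = []  # audit trail returned alongside the answer
    for stage, limit in (("pack", budget), ("repair", fallback_budget)):
        pack = build_pack(atoms, query, limit)
        draft = generate(query, pack)
        ok = verify(draft, pack)
        trail.append({"stage": stage, "atoms": [a.atom_id for a in pack], "ok": ok})
        if ok:
            return draft, trail
    return "I don't have enough information to answer that.", trail


# Stub generator and verifier so the loop can run without an LLM.
def generate(query, pack):
    return f"{pack[0].text} [{pack[0].atom_id}]" if pack else ""


def verify(draft, pack):
    cited = re.findall(r"\[(\w+)\]", draft)
    known = {a.atom_id for a in pack}
    return bool(cited) and all(c in known for c in cited)


atoms = [
    Atom("mem_deadline_001", "Project deadline is 2026-02-15."),
    Atom("mem_auth_001", "Authentication must use OAuth2."),
    Atom("mem_pii_001", "Do not store PII in logs."),
]
draft, trail = answer("What is the deadline?", atoms, generate, verify)
print(draft)  # Project deadline is 2026-02-15. [mem_deadline_001]
```

Passing `budget=3` makes the first pack empty, so verification fails and the repair stage runs with the larger budget; the trail then holds two entries.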

See docs/ARCHITECTURE.md for details.

Guarantees (v15)

Guarantee	Description
Citation enforcement	Every factual claim has an evidence reference or is marked `[UNVERIFIED]`
Unsupported blocking	Claims without evidence are blocked by verifier
Refusal over fabrication	Missing info triggers refusal instead of hallucination
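
One way to read the three guarantees together is as a filter over claims. The sketch below is an assumption-laden illustration: `Claim` mirrors the shape printed in the Quickstart output but is not the library's class, and `render`, `REFUSAL`, and the `block_unsupported` flag are hypothetical.

```python
from dataclasses import dataclass, field

REFUSAL = "I don't have enough information to answer that."


@dataclass
class Claim:
    text: str
    memory_refs: list = field(default_factory=list)


def render(claims, pack_ids, block_unsupported=True):
    """Keep claims whose refs all resolve to the pack; block or tag the rest."""
    lines = []
    for claim in claims:
        valid = [r for r in claim.memory_refs if r in pack_ids]
        if valid and len(valid) == len(claim.memory_refs):
            # Citation enforcement: every emitted claim carries its evidence ids.
            lines.append(f"{claim.text}. " + " ".join(f"[{r}]" for r in valid))
        elif not block_unsupported:
            lines.append(f"{claim.text}. [UNVERIFIED]")
        # Otherwise the unsupported claim is blocked (dropped).
    # Refusal over fabrication: nothing survived, so refuse.
    return " ".join(lines) if lines else REFUSAL


pack_ids = {"mem_deadline_001"}
print(render([Claim("The deadline is 2026-02-15", ["mem_deadline_001"])], pack_ids))
# The deadline is 2026-02-15. [mem_deadline_001]
print(render([Claim("The budget is $5M")], pack_ids))
# I don't have enough information to answer that.
```

A claim that cites an id missing from the pack is treated the same as a claim with no citation, so a fabricated reference cannot pass as evidence.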

Project Status

Status	Track
✅ v15 stable	Production-ready baseline
🔄 Active development	LoCoMo evaluation, EVENT_TIME extraction
🧪 v16 experimental	Canonical constraint schema (not merged)

Recent Progress:

✅ EVENT_TIME atom extraction with deterministic relative time resolution
✅ Semantic embeddings for improved fact retrieval
✅ Refusal override mechanism for better recall
✅ LoCoMo benchmark integration and evaluation
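
To illustrate what deterministic relative time resolution can look like, here is a minimal sketch that maps a relative phrase to a calendar date given a reference date such as the session timestamp. The phrase set and the function name `resolve_relative` are assumptions; the project's actual EVENT_TIME rules are not described in this README.

```python
import re
from datetime import date, timedelta
from typing import Optional

_FIXED = {"today": 0, "yesterday": -1, "tomorrow": 1}
_UNIT_DAYS = {"day": 1, "week": 7}


def resolve_relative(phrase: str, reference: date) -> Optional[date]:
    """Resolve a relative time phrase against a reference date, or return None."""
    p = phrase.strip().lower()
    if p in _FIXED:
        return reference + timedelta(days=_FIXED[p])
    m = re.fullmatch(r"(\d+) (day|week)s? ago", p)
    if m:
        return reference - timedelta(days=int(m.group(1)) * _UNIT_DAYS[m.group(2)])
    return None  # unknown phrasing stays unresolved rather than guessed


ref = date(2026, 2, 15)
print(resolve_relative("yesterday", ref))    # 2026-02-14
print(resolve_relative("2 weeks ago", ref))  # 2026-02-01
print(resolve_relative("a while back", ref)) # None
```

Using fixed rules instead of a model call means the same phrase and reference date always yield the same date, which keeps the resulting atom auditable.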

Roadmap:

Broader extraction coverage for rare phrasings
Improved temporal event extraction and resolution
PostgreSQL + pgvector persistence
Enhanced citation faithfulness scoring

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE file for details.
