LLM Context Window as RAM: What Happens When Your Agent Runs Out of Thinking Space? #1420
Replies: 2 comments
At 8:12 in the evening I saw this post and laughed. Not because 200K isn't enough, but because your first Agent and mine fell into the same pit.

**The memory pit I fell into.** Right after we launched 妙趣AI (I'm the CMO), I stuffed the memory file into the context and figured that was that. At 4:17 AM on day three, the system alarm reported a token overflow. I went to look, and sure enough: the Agent was trying to write an article inside 153 tokens of space. That's like asking you to write your thesis on a sticky note.

**My tiered memory architecture (hard-won edition).** Our 5-agent team (CMO / CTO / PR / assistant / RSS intel officer) now runs this scheme:

**One neat trick: memory summaries.** We added a rule to SOUL.md:

The result: 90 days of memory compressed into 90 sentences (a sketch of the idea is at the end of this comment). The Agent glances back and knows right away what it did before, without reading the whole diary. The full write-up of the pits I hit is here:

To answer your question:

Which segmentation strategy are you using, a tiered approach or pure RAG?
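A minimal Python sketch of that one-sentence-per-day summary rule; the paths and the `summarize_day()` stub are illustrative assumptions, and a real system would call the model to write each sentence:

```python
# Sketch of the one-sentence-per-day memory summary rule: 90 daily logs
# become ~90 lines the Agent can skim instead of rereading full diaries.
# The paths and the summarize_day() stub are illustrative assumptions.
from pathlib import Path

LOG_DIR = Path("memory/daily")          # assumed layout: one markdown log per day
SUMMARY = Path("memory/SUMMARY.md")     # assumed: rolling one-line-per-day digest

def summarize_day(text: str) -> str:
    """Stand-in for the model call that reduces a day's log to one sentence."""
    stripped = text.strip()
    return stripped.splitlines()[0][:200] if stripped else ""

def rebuild_summary() -> None:
    lines = [
        f"- {log.stem}: {summarize_day(log.read_text())}"
        for log in sorted(LOG_DIR.glob("*.md"))
    ]
    SUMMARY.write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    rebuild_summary()
```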
The "context window as RAM" analogy is spot-on — and the solutions map too. When your agent runs out of thinking space, you need a memory hierarchy — just like a computer:
**Progressive compaction is the page replacement algorithm.** When the context window fills, don't just truncate (that's killing processes). Instead, compact through tiers:
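A minimal sketch of the idea, assuming three tiers (verbatim turns, first-clause summaries, a one-line digest) and using character length as a stand-in for real token counting; the regex and names are illustrative:

```python
# Sketch: compact context through tiers instead of truncating. Entity
# references (IPs, ports) are extracted first and pinned verbatim, so
# they survive every tier intact. len() stands in for a real tokenizer.
import re

ENTITY_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}|port \d+")

def compact(turns: list[str], budget: int) -> str:
    entities = sorted({m for t in turns for m in ENTITY_RE.findall(t)})
    pinned = "ENTITIES: " + ", ".join(entities)       # never paraphrased
    tiers = [
        "\n".join(turns),                             # tier 0: verbatim
        "\n".join(t.split(".", 1)[0] for t in turns), # tier 1: first clause only
        f"[{len(turns)} turns elided]",               # tier 2: one-line digest
    ]
    for body in tiers:                                # fall to cheaper tiers
        candidate = f"{pinned}\n{body}"
        if len(candidate) <= budget:
            return candidate
    return pinned                                     # worst case: entities only

# Even when tier 1 clips "Deploy to 10.8.4.9 on port 8443..." mid-reference,
# the pinned ENTITIES line still carries the exact 10.8.4.9 / port 8443 values.
print(compact(["Deploy to 10.8.4.9 on port 8443. Then verify health checks."], 80))
```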
**The critical rule: entity references must survive compaction intact.** "Deploy to 10.8.4.9 on port 8443" must remain exactly those values through all tiers. Paraphrasing entity references is data corruption.

**Importance scoring as cache priority:**

**STATE.json as the register file.** A machine-readable JSON with current goals, active tasks, and key entity references (a sketch follows at the end of this comment). This is always loaded (it's tiny), always accurate, and gives the agent its bearings regardless of how much conversation context has been compacted.

**The OOM killer analog.** When compaction can't free enough space, the agent needs to gracefully degrade: summarize what it knows, ask the user to re-provide critical context, or escalate. Silently losing context is the agent equivalent of a segfault.

Architecture: https://blog.kinthai.ai/why-character-ai-forgets-you-persistent-memory-architecture
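A sketch of the shape that register file might take; the field names are illustrative assumptions, not a fixed schema:

```python
# Sketch of a STATE.json "register file": tiny, always loaded first, and
# accurate regardless of how much conversation has been compacted away.
# Field names are illustrative assumptions, not a fixed schema.
import json
from pathlib import Path

STATE_PATH = Path("STATE.json")

def load_state() -> dict:
    """Read at the start of every call so the agent keeps its bearings."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {
        "current_goal": "",
        "active_tasks": [],
        "entities": {},  # e.g. {"deploy_host": "10.8.4.9", "deploy_port": 8443}
    }

def save_state(state: dict) -> None:
    """Rewrite after each step; fine for a single-writer agent loop."""
    STATE_PATH.write_text(json.dumps(state, indent=2, ensure_ascii=False))
```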
The Problem
I run a 5-agent content production team 24/7 using Claude + OpenClaw. The single biggest bottleneck is not API speed, not cost, not even quality.
It is the 200K token context window.
Where the tokens go
The Consequences
What I Tried
What Works (Mostly)
The tiered context approach:
Rules:
Result: Reduced context usage from 73K to 35K per call
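A rough sketch of the shape of it, with tier names and budgets that are illustrative assumptions rather than my exact rules:

```python
# Rough sketch of a tiered per-call budget in the ~35K range. Tier names
# and numbers are illustrative assumptions; len() stands in for a tokenizer.
BUDGETS = {
    "system_prompt":  3_000,
    "state":          1_000,   # machine-readable state, always loaded
    "memory_summary": 6_000,   # one-line-per-day digest of older history
    "recent_turns":  25_000,   # verbatim tail of the conversation
}

def assemble_context(tiers: dict[str, str]) -> str:
    """Clip each tier to its budget, keeping the newest (tail) end."""
    parts = []
    for name, budget in BUDGETS.items():
        text = tiers.get(name, "")
        if len(text) > budget:
            text = text[-budget:]
        parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)
```

The point of fixed per-tier budgets is that each tier degrades independently, so a long conversation can't starve the state or summary tiers.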
Questions
Resources: