TL;DR
The cold-checkpoint store is gated by prompt_len <= cold_max_tokens (default 30,000). A Claude Code opening prompt with a moderate MCP toolset is already past that: 59 tools rendered to 30,246 tokens in our deployment. The gate fails, the cold anchor is never written, and nothing is logged — the exact mechanism designed for "independent agent sessions" reuse is disabled invisibly for the workload it was built for.
How it fails today
Every fresh Claude Code session (/clear + first message) repays the full stable prefix. Two consecutive fresh sessions, 100 seconds apart, byte-identical prompts:
00:05:01 ds4-server: chat ctx=0..30246:30246 TOOLS prompt start
00:05:55 ds4-server: kv cache stored tokens=20480 trimmed=0 reason=continued ...
00:06:26 ds4-server: chat ctx=0..30246:30246 TOOLS prompt done 84.679s
(no reason=cold store: 30246 > 30000, silently skipped)
00:06:42 ds4-server: live kv cache miss live=30276 prompt=30246 common=30246 reason=token-mismatch
00:08:07 ds4-server: chat ctx=0..30246:30246 TOOLS prompt done 84.710s ← full prefill again
After raising the limit (--kv-cache-cold-max-tokens 60000), the designed behavior kicks in immediately:
01:09:20 ds4-server: kv cache stored tokens=27005 trimmed=3213 reason=cold ...
01:09:40 ds4-server: kv cache hit text tokens=27005 ... load=72.4 ms
01:09:54 ds4-server: chat ctx=27005..30218:3213 TOOLS prompt done 13.786s ← 84.7s → 13.8s
Backend: Apple Silicon / Metal (M2 Ultra 192 GiB), DeepSeek V4 Flash q2-q4-imatrix.
We only found this by reading the source; the flag's help text gives no hint that agent first prompts have outgrown the default.
Proposed fix
- Log a hint when a fully cold prompt skips the cold store because it exceeds
cold_max_tokens (one line, only when cached == 0, so no noise).
- Consider raising the default: Claude Code first prompts crossed 30K once a handful of MCP servers are configured, and they only grow. 50,000–60,000 keeps the original "don't burn disk on huge contexts" intent while covering current agent CLIs.
PR for (1) incoming together with a related cold-anchor fix.
TL;DR
The cold-checkpoint store is gated by
prompt_len <= cold_max_tokens(default 30,000). A Claude Code opening prompt with a moderate MCP toolset is already past that: 59 tools rendered to 30,246 tokens in our deployment. The gate fails, the cold anchor is never written, and nothing is logged — the exact mechanism designed for "independent agent sessions" reuse is disabled invisibly for the workload it was built for.How it fails today
Every fresh Claude Code session (
/clear+ first message) repays the full stable prefix. Two consecutive fresh sessions, 100 seconds apart, byte-identical prompts:After raising the limit (
--kv-cache-cold-max-tokens 60000), the designed behavior kicks in immediately:Backend: Apple Silicon / Metal (M2 Ultra 192 GiB), DeepSeek V4 Flash q2-q4-imatrix.
We only found this by reading the source; the flag's help text gives no hint that agent first prompts have outgrown the default.
Proposed fix
cold_max_tokens(one line, only whencached == 0, so no noise).PR for (1) incoming together with a related cold-anchor fix.