
Cache misses on second back-to-back client.messages.create() (~40% rate) #1451

@ymyke

Two back-to-back client.messages.create() calls with the same cached system prompt produce intermittent cache misses on the second call

Summary

When two requests are sent back-to-back with an identical cache_control system prompt, the second request misses the cache that the first request just wrote ~40% of the time and re-writes the same prefix. Sleeping 2 seconds between the two requests reliably eliminates the miss.

Both observations come from a ~30-line standalone reproducer (no tools, no multi-turn, no beta endpoints, stable Sonnet model).


Reproducer

#!/usr/bin/env python3
import os
import time
import anthropic

SYSTEM_FILLER = "You are a helpful assistant. " * 200  # ~1300 tokens


def run_once(attempt: int, client: anthropic.Anthropic) -> dict:
    # Unique marker per attempt → fresh cache key.
    unique = f"<!-- min-{attempt}-{int(time.time() * 1000)} -->\n"
    system_blocks = [{
        "type": "text",
        "text": unique + SYSTEM_FILLER,
        "cache_control": {"type": "ephemeral"},
    }]

    # ---- R1 ----
    r1 = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=system_blocks,
        messages=[{"role": "user", "content": "hi"}],
    )

    # ---- R2 — same system, fired immediately ----
    r2 = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=system_blocks,
        messages=[{"role": "user", "content": "hello"}],
    )

    t1_cc = r1.usage.cache_creation_input_tokens or 0
    t2_cc = r2.usage.cache_creation_input_tokens or 0
    t2_cr = r2.usage.cache_read_input_tokens or 0

    # Flag a miss: R2 re-wrote (nearly) the same large prefix R1 just created.
    bug = (t2_cc >= t1_cc * 0.9 and t2_cc > 800)
    return {"t1_cc": t1_cc, "t2_cc": t2_cc, "t2_cr": t2_cr, "bug": bug}


def main():
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    results = [run_once(i, client) for i in range(1, 21)]
    for i, r in enumerate(results, 1):
        v = "BUG" if r["bug"] else "OK "
        print(f"[{i:2d}/20]  R1cc={r['t1_cc']}  R2cc={r['t2_cc']} R2cr={r['t2_cr']}  → {v}")
    bug = sum(1 for r in results if r["bug"])
    print(f"\nBug reproduced: {bug}/20")


if __name__ == "__main__":
    main()

Output (one 20-trial run)

[ 1/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[ 2/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 3/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[ 4/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 5/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 6/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 7/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 8/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[ 9/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[10/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[11/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[12/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[13/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[14/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[15/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[16/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[17/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[18/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[19/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[20/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG

Bug reproduced: 8/20

Each "BUG" trial wrote cache_creation_input_tokens=1215 on both requests — the same exact prefix billed twice, despite both requests sharing an identical cached system block.


Mitigation: sleep 2 s before R2

# Same script as above, with the following inserted between R1 and R2
# ("time" is already imported at the top of the script):
time.sleep(2)

Output (one 20-trial run with the sleep)

[ 1/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 2/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 3/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 4/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 5/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 6/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 7/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 8/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 9/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[10/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[11/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[12/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[13/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[14/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[15/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[16/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[17/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[18/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[19/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[20/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK

Bug reproduced: 0/20

Context

The official prompt-caching docs (docs.claude.com/en/docs/build-with-claude/prompt-caching) acknowledge a related issue for concurrent requests:

"For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests."

This guidance does not cover the sequential form documented above, where the second request fires after the first response has fully returned and still misses ~40% of the time.
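For comparison, the quoted guidance for concurrent requests maps to a warm-then-fan-out pattern like the sketch below. This is illustrative only, using the async client; the helper name and structure are my reading of the docs, not code from them.

import asyncio
import anthropic

async def warm_then_fan_out(client: anthropic.AsyncAnthropic, system_blocks, prompts):
    # Documented pattern for *concurrent* requests: send one request first so
    # the cache entry gets written, wait for its response, then fan the rest
    # out in parallel. The sequential misses reported above occur even though
    # this "wait for the first response" condition is already satisfied.
    first, *rest = prompts
    await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=system_blocks,
        messages=[{"role": "user", "content": first}],
    )
    return await asyncio.gather(*(
        client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=50,
            system=system_blocks,
            messages=[{"role": "user", "content": p}],
        )
        for p in rest
    ))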

anthropics/claude-code#38356 (closed, unresolved) reported matching symptoms — three requests 762–800 ms apart, each independently writing identical cache_creation_input_tokens=10,553. The reporter framed them as "parallel API calls during a user turn," but with 750+ ms inter-call gaps these were almost certainly sequential requests hitting the same cache-visibility race we describe here. The issue was auto-closed by the duplicate-detection bot after 30 days of inactivity; the root cause was never addressed.

Cost impact

Each redundant write is billed at the full cache_creation_input_tokens rate. With the ~40% miss rate observed above, roughly two in five back-to-back request pairs pay for the same cache write twice; at typical agent prompt sizes (10–100K tokens), that overhead compounds across any multi-turn workload that doesn't apply the workaround.
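Until the underlying race is addressed, the sleep workaround can be applied generically by pausing only after responses that report a fresh cache write. The helper below is a hypothetical sketch built on the mitigation above; the 2 s default mirrors the sleep in the reproducer and is not a documented guarantee.

import time
import anthropic

def create_with_cache_settle(client: anthropic.Anthropic, settle_s: float = 2.0, **kwargs):
    # Call messages.create() and, if the response reports a new cache write,
    # pause briefly so the entry has time to become visible to the next
    # request in the same workflow.
    response = client.messages.create(**kwargs)
    if (response.usage.cache_creation_input_tokens or 0) > 0:
        time.sleep(settle_s)
    return response

Swapping this in for the two direct client.messages.create(...) calls in the reproducer applies the same delay shown to eliminate the misses, at the cost of up to settle_s of added latency after each cache-writing request.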

Environment

  • anthropic Python SDK 0.97.0 (also reproduced on 0.96.0 — the bug is API-side, SDK-agnostic)
  • Model: claude-sonnet-4-5
  • Single account, single region
