
Cache misses on second back-to-back client.messages.create() (~40% rate) #1451

@ymyke

Two back-to-back client.messages.create() calls with the same cached system prompt produce intermittent cache misses on the second call

Summary

When two requests are sent back-to-back with an identical cache_control system prompt, the second request misses the cache that the first request just wrote ~40% of the time and re-writes the same prefix. Sleeping 2 seconds between the two requests reliably eliminates the miss.

Both observations come from a ~30-line standalone reproducer (no tools, no multi-turn, no beta endpoints, stable Sonnet model).


Reproducer

#!/usr/bin/env python3
import os
import time
import anthropic

SYSTEM_FILLER = "You are a helpful assistant. " * 200  # ~1300 tokens


def run_once(attempt: int, client: anthropic.Anthropic) -> dict:
    # Unique marker per attempt → fresh cache key.
    unique = f"<!-- min-{attempt}-{int(time.time() * 1000)} -->\n"
    system_blocks = [{
        "type": "text",
        "text": unique + SYSTEM_FILLER,
        "cache_control": {"type": "ephemeral"},
    }]

    # ---- R1 ----
    r1 = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=system_blocks,
        messages=[{"role": "user", "content": "hi"}],
    )

    # ---- R2 — same system, fired immediately ----
    r2 = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=system_blocks,
        messages=[{"role": "user", "content": "hello"}],
    )

    t1_cc = r1.usage.cache_creation_input_tokens or 0
    t2_cc = r2.usage.cache_creation_input_tokens or 0
    t2_cr = r2.usage.cache_read_input_tokens or 0

    # Flag a miss: R2 re-wrote (nearly) the same large prefix R1 just created.
    bug = (t2_cc >= t1_cc * 0.9 and t2_cc > 800)
    return {"t1_cc": t1_cc, "t2_cc": t2_cc, "t2_cr": t2_cr, "bug": bug}


def main():
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    results = [run_once(i, client) for i in range(1, 21)]
    for i, r in enumerate(results, 1):
        v = "BUG" if r["bug"] else "OK "
        print(f"[{i:2d}/20]  R1cc={r['t1_cc']}  R2cc={r['t2_cc']} R2cr={r['t2_cr']}  → {v}")
    bug = sum(1 for r in results if r["bug"])
    print(f"\nBug reproduced: {bug}/20")


if __name__ == "__main__":
    main()

Output (one 20-trial run)

[ 1/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[ 2/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 3/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[ 4/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 5/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 6/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 7/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[ 8/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[ 9/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[10/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[11/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[12/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[13/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[14/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[15/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[16/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[17/20]  R1cc=1215  R2cc=0    R2cr=1215  → OK 
[18/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[19/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG
[20/20]  R1cc=1215  R2cc=1215 R2cr=0     → BUG

Bug reproduced: 8/20

Each "BUG" trial wrote cache_creation_input_tokens=1215 on both requests — the same exact prefix billed twice, despite both requests sharing an identical cached system block.


Mitigation: sleep 2 s before R2

# Same script as above, with the following inserted between R1 and R2
# ("time" is already imported at the top of the script):
time.sleep(2)

Output (one 20-trial run with the sleep)

[ 1/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 2/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 3/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 4/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 5/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 6/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 7/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 8/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[ 9/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[10/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[11/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[12/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[13/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[14/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[15/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[16/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[17/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[18/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[19/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK
[20/20]  R1cc=1217  R2cc=0 R2cr=1217  → OK

Bug reproduced: 0/20

Context

The official prompt-caching docs (docs.claude.com/en/docs/build-with-claude/prompt-caching) acknowledge a related issue for concurrent requests:

"For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests."

This guidance does not cover the sequential form documented above, where the second request fires after the first response has fully returned and still misses ~40% of the time.
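For comparison, the quoted guidance for concurrent requests maps to a warm-then-fan-out pattern like the sketch below. This is illustrative only, using the async client; the helper name and structure are my reading of the docs, not code from them.

import asyncio
import anthropic

async def warm_then_fan_out(client: anthropic.AsyncAnthropic, system_blocks, prompts):
    # Documented pattern for *concurrent* requests: send one request first so
    # the cache entry gets written, wait for its response, then fan the rest
    # out in parallel. The sequential misses reported above occur even though
    # this "wait for the first response" condition is already satisfied.
    first, *rest = prompts
    await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=system_blocks,
        messages=[{"role": "user", "content": first}],
    )
    return await asyncio.gather(*(
        client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=50,
            system=system_blocks,
            messages=[{"role": "user", "content": p}],
        )
        for p in rest
    ))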

anthropics/claude-code#38356 (closed, unresolved) reported matching symptoms — three requests 762–800 ms apart, each independently writing identical cache_creation_input_tokens=10,553. The reporter framed them as "parallel API calls during a user turn," but with 750+ ms inter-call gaps these were almost certainly sequential requests hitting the same cache-visibility race we describe here. The issue was auto-closed by the duplicate-detection bot after 30 days of inactivity; the root cause was never addressed.

Cost impact

Each redundant write is billed at the full cache_creation_input_tokens rate. With the ~40% miss rate observed above, roughly two in five back-to-back request pairs pay for the same cache write twice; at typical agent prompt sizes (10–100K tokens), that overhead compounds across any multi-turn workload that doesn't apply the workaround.
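Until the underlying race is addressed, the sleep workaround can be applied generically by pausing only after responses that report a fresh cache write. The helper below is a hypothetical sketch built on the mitigation above; the 2 s default mirrors the sleep in the reproducer and is not a documented guarantee.

import time
import anthropic

def create_with_cache_settle(client: anthropic.Anthropic, settle_s: float = 2.0, **kwargs):
    # Call messages.create() and, if the response reports a new cache write,
    # pause briefly so the entry has time to become visible to the next
    # request in the same workflow.
    response = client.messages.create(**kwargs)
    if (response.usage.cache_creation_input_tokens or 0) > 0:
        time.sleep(settle_s)
    return response

Swapping this in for the two direct client.messages.create(...) calls in the reproducer applies the same delay shown to eliminate the misses, at the cost of up to settle_s of added latency after each cache-writing request.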

Environment

  • anthropic Python SDK 0.97.0 (also reproduced on 0.96.0 — the bug is API-side, SDK-agnostic)
  • Model: claude-sonnet-4-5
  • Single account, single region
