[Bug] Dream tasks outlive the stale work-unit claim timeout, spawning duplicate concurrent dream cycles for the same peer

# **🐞 Bug Report**

## **Describe the bug**

Dream tasks run far longer (11–26+ min observed) than the deriver's stale-claim timeout (`DERIVER.STALE_SESSION_TIMEOUT_MINUTES`, default **5**). The claim's `last_updated` is only refreshed *after* a queue item finishes processing (`mark_queue_items_as_processed`), and a dream is a single queue item processed in one long `process_item()` call — so nothing touches `last_updated` while the dream runs.

Meanwhile every poll loop runs `cleanup_stale_work_units()` ([queue_manager.py](https://github.com/plastic-labs/honcho/blob/main/src/deriver/queue_manager.py)), which deletes any `ActiveQueueSession` older than the timeout. At +5 minutes the still-running dream's claim is deleted, the queue item (not yet marked processed) becomes claimable again, and an idle worker starts a **duplicate concurrent dream cycle for the same peer**. With `DERIVER_WORKERS=3` we observed the same peer dreaming 3 times concurrently.

This bypasses `DREAM_MIN_HOURS_BETWEEN_DREAMS` entirely, because the re-delivery happens at the queue level, below the scheduler. It multiplies the LLM token cost per dream by up to the worker count, and the concurrent same-peer dreams race each other while mutating the same observation set: we see them deleting each other's observations (`Failed to delete observation ...: Document ... not found`).

The worker's "lost ownership" check only runs between queue items, so it can't catch this — the dream has already fully completed (and burned its tokens) by the time the check fires.
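
A toy model of the mechanism, to make the timeline concrete. This is not Honcho code: the `Claim` class, the `simulate` function, the 1-second poll interval, and the simplification that the item counts as processed once the first dream finishes are all assumptions for illustration. Only the 5-minute timeout and the fact that the claim is never refreshed mid-task come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

STALE_TIMEOUT_S = 5 * 60  # stale-claim timeout (the 5 min default from this report)
POLL_INTERVAL_S = 1       # assumed poll interval, for illustration only


@dataclass
class Claim:
    owner: str
    last_updated: int  # seconds since the start of the simulation


def simulate(dream_duration_s: int, workers: int) -> list[tuple[int, str]]:
    """Return (start_time_s, worker) for every dream cycle started for ONE queue item."""
    starts: list[tuple[int, str]] = []
    busy_until: dict[str, int] = {}  # worker -> time at which its dream finishes
    claim: Optional[Claim] = None
    t = 0
    while True:
        # Cleanup pass: delete the claim once it is older than the timeout.
        # Nothing refreshes last_updated while the dream runs, so this always fires.
        if claim is not None and t - claim.last_updated > STALE_TIMEOUT_S:
            claim = None
        # The item is not marked processed yet, so an idle worker can claim it again.
        if claim is None:
            idle = [f"w{i}" for i in range(workers) if busy_until.get(f"w{i}", 0) <= t]
            if idle:
                claim = Claim(owner=idle[0], last_updated=t)
                busy_until[idle[0]] = t + dream_duration_s
                starts.append((t, idle[0]))
        t += POLL_INTERVAL_S
        # Simplification: the item is marked processed when the first dream finishes.
        if t >= starts[0][0] + dream_duration_s:
            return starts
```

Under these assumptions, `simulate(1092, 3)` (an 18-minute dream, three workers) yields starts at 0 s, 301 s and 602 s on three different workers, while a dream shorter than the timeout, or a single worker, yields exactly one start.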

---

### **To Reproduce**

1. Configure `DERIVER_WORKERS` > 1 and leave `DERIVER_STALE_SESSION_TIMEOUT_MINUTES` at the default 5
2. Use a dream model/setup where a dream cycle takes > 5 minutes (easy with local models; ours run 11–26 min)
3. Trigger a dream (scheduled or manual)
4. Watch a duplicate `Starting dream cycle` for the same peer appear almost exactly `STALE_SESSION_TIMEOUT_MINUTES` later, while the first is still running

---

### **Expected behaviour**

One enqueued dream task results in exactly one dream cycle. A work-unit claim should stay alive while its task is actively processing.

---

### **Evidence**

Deriver logs (UTC, `WORKERS=3`, `STALE_SESSION_TIMEOUT_MINUTES=5`). The scheduler enqueued exactly **2** dream tasks; **5** dream cycles ran:

```
03:01:18 Enqueued dream task for ws/hermes/hermes (type: omni)
03:01:19 Enqueued dream task for ws/8598510674/hermes (type: omni)
03:01:20 [f0d193ff] Starting dream cycle for ws/8598510674/hermes
03:01:20 [30c96d50] Starting dream cycle for ws/hermes/hermes
03:06:21 [ee5148bb] Starting dream cycle for ws/8598510674/hermes   <- +5m01s after f0d193ff, same peer, f0d193ff still running
03:18:06 Dream completed: run_id=ee5148bb (704s)
03:18:06 [b467eb3c] Starting dream cycle for ws/hermes/hermes       <- started as soon as ee5148bb's worker freed up; 30c96d50 still running
03:19:32 Dream completed: run_id=f0d193ff (1092s)
03:23:07 [41387775] Starting dream cycle for ws/hermes/hermes       <- +5m01s after b467eb3c started
03:27:35 Dream completed: run_id=30c96d50 (1575s)
03:34:41 Dream completed: run_id=b467eb3c (995s)
03:37:40 Dream completed: run_id=41387775 (872s)
```

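Duplicates appear ~5m01s after a still-running dream for the same peer started, which matches the stale timeout plus roughly one poll interval. The one exception, b467eb3c, appears to have waited for a free worker: it started the moment ee5148bb completed. A manually triggered dream reproduced it too: started 21:10:40, duplicate at 21:15:40 (+5m00.2s).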

During the overlap window, the concurrent same-peer dreams interfered with each other: 16 `Failed to delete observation` warnings and a `Tool execution loop reached max iterations` as specialists worked against observations the twin run had already deleted/changed.
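
For anyone who wants to check their own logs for the same pattern, here is a rough sketch that flags dream cycles starting while another cycle for the same peer is still running. The regexes assume the abbreviated line format of the excerpt above (real deriver log lines may carry extra prefixes), the function name `find_overlaps` is ours, and timestamps are assumed not to cross midnight.

```python
import re
from collections import defaultdict

LOG = """\
03:01:20 [f0d193ff] Starting dream cycle for ws/8598510674/hermes
03:01:20 [30c96d50] Starting dream cycle for ws/hermes/hermes
03:06:21 [ee5148bb] Starting dream cycle for ws/8598510674/hermes
03:18:06 Dream completed: run_id=ee5148bb (704s)
03:18:06 [b467eb3c] Starting dream cycle for ws/hermes/hermes
03:19:32 Dream completed: run_id=f0d193ff (1092s)
03:23:07 [41387775] Starting dream cycle for ws/hermes/hermes
03:27:35 Dream completed: run_id=30c96d50 (1575s)
03:34:41 Dream completed: run_id=b467eb3c (995s)
03:37:40 Dream completed: run_id=41387775 (872s)
"""

START = re.compile(r"^(\d\d):(\d\d):(\d\d) \[(\w+)\] Starting dream cycle for (\S+)")
DONE = re.compile(r"^(\d\d):(\d\d):(\d\d) Dream completed: run_id=(\w+)")


def to_seconds(h: str, m: str, s: str) -> int:
    return int(h) * 3600 + int(m) * 60 + int(s)


def find_overlaps(log: str) -> list[tuple[str, str, int]]:
    """Return (run_id, peer, gap_s) for each dream that starts while another dream
    for the same peer is still running; gap_s is measured from the most recent
    still-running start for that peer."""
    running: dict[str, dict[str, int]] = defaultdict(dict)  # peer -> {run_id: start_s}
    peer_of: dict[str, str] = {}
    overlaps: list[tuple[str, str, int]] = []
    for line in log.splitlines():
        if m := START.match(line):
            h, mi, s, run_id, peer = m.groups()
            t = to_seconds(h, mi, s)
            if running[peer]:
                overlaps.append((run_id, peer, t - max(running[peer].values())))
            running[peer][run_id] = t
            peer_of[run_id] = peer
        elif m := DONE.match(line):
            run_id = m.group(4)
            peer = peer_of.get(run_id)
            if peer is not None:
                running[peer].pop(run_id, None)
    return overlaps


for run_id, peer, gap in find_overlaps(LOG):
    print(f"{run_id} overlapped an earlier dream for {peer} (+{gap}s)")
```

On the excerpt above this reports three overlapping starts: ee5148bb (+301s), b467eb3c (+1006s) and 41387775 (+301s).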

---

### **Your environment**

* OS: Ubuntu (kernel 6.8), Docker Compose deployment
* Honcho Server Version: v3.0.6 (e659b6b3); bug confirmed still present on current `main` (`cleanup_stale_work_units` unchanged, no heartbeat during `process_item`)
* Dream models: local via Ollama (qwen3-14b) — but any dream exceeding 5 min triggers this, including slow API calls/retries

---

### **Additional context**

Possible fix directions, in rough order of preference:

1. **Heartbeat**: refresh `ActiveQueueSession.last_updated` while a task is actively processing — e.g. touch it per tool-loop iteration inside the dream, or run a small periodic touch task alongside `process_item()`. Keeps crash recovery fast (5 min) while making long tasks safe.
2. **Per-task-type timeout**: dreams are expected to run tens of minutes; representation batches aren't. A separate (longer) stale timeout for `dream` work units would be a smaller change.
3. At minimum, document the interaction so operators size `STALE_SESSION_TIMEOUT_MINUTES` above their worst-case dream duration.
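
To illustrate option 1, a minimal sketch of the "periodic touch task alongside `process_item()`" variant using plain `asyncio`. Everything here is hypothetical: `run_with_heartbeat`, `work`, and `touch` are stand-in names, and in the deriver `touch` would be whatever updates `ActiveQueueSession.last_updated` for the claimed work unit. It only shows the shape of the idea, not how it should be wired into `queue_manager.py`.

```python
import asyncio
import contextlib
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def run_with_heartbeat(
    work: Callable[[], Awaitable[None]],
    touch: Callable[[], Awaitable[None]],
    interval_s: float,
) -> None:
    """Run work() while calling touch() every interval_s seconds until it ends."""

    async def beat() -> None:
        while True:
            await asyncio.sleep(interval_s)
            try:
                await touch()
            except Exception:
                # A single failed touch should not kill the heartbeat; retry next tick.
                logger.exception("heartbeat touch failed; will retry")

    heartbeat = asyncio.create_task(beat())
    try:
        await work()
    finally:
        # Stop the heartbeat whether the work succeeded, failed, or was cancelled.
        heartbeat.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await heartbeat


async def demo() -> int:
    """Stand-in for a long dream: counts how often the claim would be refreshed."""
    touches = 0

    async def touch() -> None:  # hypothetical: would refresh last_updated in the DB
        nonlocal touches
        touches += 1

    async def fake_dream() -> None:  # hypothetical stand-in for process_item()
        await asyncio.sleep(0.5)

    await run_with_heartbeat(fake_dream, touch, interval_s=0.05)
    return touches


if __name__ == "__main__":
    print("touches during task:", asyncio.run(demo()))
```

The interval would need to sit well below the stale timeout so that a missed beat or two does not expire the claim. One trade-off of this variant compared with touching per tool-loop iteration: a task that hangs without crashing keeps its claim alive indefinitely, since the heartbeat runs independently of actual progress.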

**Workaround we're running now**: `DERIVER_STALE_SESSION_TIMEOUT_MINUTES=45` (above our longest observed dream). Trade-off: a genuinely crashed worker now blocks its peer's queue for 45 min instead of 5.

Happy to submit a PR for whichever direction maintainers prefer.

