Bug Report
Describe the bug
Dream tasks run far longer (11–26+ min observed) than the deriver's stale-claim timeout (DERIVER.STALE_SESSION_TIMEOUT_MINUTES, default 5). The claim's last_updated is only refreshed after a queue item finishes processing (mark_queue_items_as_processed), and a dream is a single queue item processed in one long process_item() call, so nothing touches last_updated while the dream runs.
Meanwhile every poll loop runs cleanup_stale_work_units() (queue_manager.py), which deletes any ActiveQueueSession older than the timeout. At +5 minutes the still-running dream's claim is deleted, the queue item (not yet marked processed) becomes claimable again, and an idle worker starts a duplicate concurrent dream cycle for the same peer. With DERIVER_WORKERS=3 we observed the same peer dreaming 3 times concurrently.
This bypasses DREAM_MIN_HOURS_BETWEEN_DREAMS entirely (the re-delivery happens at queue level, below the scheduler), multiplies LLM token cost per dream by the worker count, and the concurrent same-peer dreams race each other mutating the same observation set (we see them deleting each other's observations: Failed to delete observation ...: Document ... not found).
The worker's "lost ownership" check only runs between queue items, so it can't catch this: the dream has already fully completed (and burned its tokens) by the time the check fires.
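The failure mode described above can be reproduced in a toy model. The sketch below is not Honcho code: `ToyQueue`, `Claim`, and `simulate` are hypothetical names, and it assumes a 1-second poll interval and a strict "older than" comparison. It only illustrates that a claim whose last_updated is never refreshed gets reclaimed once per timeout until every worker is busy.

```python
from dataclasses import dataclass, field

STALE_TIMEOUT_S = 5 * 60  # mirrors the default of 5 minutes


@dataclass
class Claim:
    worker: str
    last_updated: int  # simulated seconds; never refreshed while a dream runs


@dataclass
class ToyQueue:
    """Hypothetical stand-in for the claim table; not Honcho's implementation."""
    claims: dict = field(default_factory=dict)   # work unit -> Claim
    started: list = field(default_factory=list)  # (time, worker, work unit)

    def cleanup_stale(self, now):
        # Drop every claim older than the timeout, whether or not its
        # owner is still processing the item.
        stale = [u for u, c in self.claims.items()
                 if now - c.last_updated > STALE_TIMEOUT_S]
        for unit in stale:
            del self.claims[unit]

    def try_claim(self, unit, worker, now):
        if unit in self.claims:
            return False
        self.claims[unit] = Claim(worker, now)
        self.started.append((now, worker, unit))
        return True


def simulate(dream_seconds, workers, poll_s=1):
    """Poll until the first dream would finish; return every dream start."""
    queue = ToyQueue()
    unit = "dream:ws/peer"  # one unprocessed queue item for one peer
    busy_until = {}
    for now in range(0, dream_seconds + 1, poll_s):
        queue.cleanup_stale(now)
        for worker in workers:
            if busy_until.get(worker, -1) >= now:
                continue  # still inside its long process_item() call
            if queue.try_claim(unit, worker, now):
                busy_until[worker] = now + dream_seconds
    return queue.started


if __name__ == "__main__":
    for start, worker, unit in simulate(1092, ["w1", "w2", "w3"]):
        print(start, worker, unit)
```

With a 1092-second dream and three workers, the model starts the same work unit at 0, 301, and 602 seconds, one duplicate per idle worker.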
To Reproduce
- Configure DERIVER_WORKERS > 1 and leave DERIVER_STALE_SESSION_TIMEOUT_MINUTES at the default 5
- Use a dream model/setup where a dream cycle takes > 5 minutes (easy with local models; ours run 11–26 min)
- Trigger a dream (scheduled or manual)
- Watch a duplicate "Starting dream cycle" log line for the same peer appear almost exactly STALE_SESSION_TIMEOUT_MINUTES later, while the first is still running
Expected behaviour
One enqueued dream task results in exactly one dream cycle. A work-unit claim should stay alive while its task is actively processing.
Evidence
Deriver logs (UTC, WORKERS=3, STALE_SESSION_TIMEOUT_MINUTES=5). The scheduler enqueued exactly 2 dream tasks; 5 dream cycles ran:
03:01:18 Enqueued dream task for ws/hermes/hermes (type: omni)
03:01:19 Enqueued dream task for ws/8598510674/hermes (type: omni)
03:01:20 [f0d193ff] Starting dream cycle for ws/8598510674/hermes
03:01:20 [30c96d50] Starting dream cycle for ws/hermes/hermes
03:06:21 [ee5148bb] Starting dream cycle for ws/8598510674/hermes <- +5m01s after f0d193ff, same peer, f0d193ff still running
03:18:06 Dream completed: run_id=ee5148bb (704s)
03:18:06 [b467eb3c] Starting dream cycle for ws/hermes/hermes <- started the second ee5148bb's worker freed; 30c96d50 still running
03:19:32 Dream completed: run_id=f0d193ff (1092s)
03:23:07 [41387775] Starting dream cycle for ws/hermes/hermes <- +5m01s after b467eb3c started
03:27:35 Dream completed: run_id=30c96d50 (1575s)
03:34:41 Dream completed: run_id=b467eb3c (995s)
03:37:40 Dream completed: run_id=41387775 (872s)
Every duplicate appears ~5m01s after the start of a still-running dream for the same peer, which is the stale timeout plus one poll interval. A manually triggered dream reproduced it too: started 21:10:40, duplicate at 21:15:40 (+5m00.2s).
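The gaps can be checked directly from the timestamps in the log excerpt. This is only an arithmetic sketch; `delta_seconds` is a hypothetical helper, and the start times are copied from the log lines above.

```python
from datetime import datetime

STALE_TIMEOUT_S = 5 * 60  # STALE_SESSION_TIMEOUT_MINUTES=5


def delta_seconds(start, end):
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(elapsed.total_seconds())


# "Starting dream cycle" times from the deriver log, keyed by run id.
starts = {
    "f0d193ff": "03:01:20",
    "ee5148bb": "03:06:21",
    "b467eb3c": "03:18:06",
    "41387775": "03:23:07",
}

# Each pair is (run whose claim went stale, duplicate that replaced it).
gaps = {
    "f0d193ff -> ee5148bb": delta_seconds(starts["f0d193ff"], starts["ee5148bb"]),
    "b467eb3c -> 41387775": delta_seconds(starts["b467eb3c"], starts["41387775"]),
}

# Seconds past the stale timeout at which each duplicate started.
overshoot = {pair: gap - STALE_TIMEOUT_S for pair, gap in gaps.items()}
```

Both gaps come out at 301 seconds, i.e. one second past the 300-second timeout.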
During the overlap window, the concurrent same-peer dreams interfered with each other: 16 "Failed to delete observation" warnings and one "Tool execution loop reached max iterations" message, as specialists worked against observations the twin run had already deleted/changed.
Your environment
- OS: Ubuntu (kernel 6.8), Docker Compose deployment
- Honcho Server Version: v3.0.6 (e659b6b); bug confirmed still present on current main (cleanup_stale_work_units unchanged, no heartbeat during process_item)
- Dream models: local via Ollama (qwen3-14b), but any dream exceeding 5 min triggers this, including slow API calls/retries
Additional context
Possible fix directions, in rough order of preference:
- Heartbeat: refresh ActiveQueueSession.last_updated while a task is actively processing, e.g. touch it per tool-loop iteration inside the dream, or run a small periodic touch task alongside process_item(). Keeps crash recovery fast (5 min) while making long tasks safe.
- Per-task-type timeout: dreams are expected to run tens of minutes; representation batches aren't. A separate (longer) stale timeout for dream work units would be a smaller change.
- At minimum, document the interaction so operators size STALE_SESSION_TIMEOUT_MINUTES above their worst-case dream duration.
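A rough sketch of the heartbeat direction, assuming the deriver's processing path is asyncio-based. `run_with_heartbeat`, `work`, and `touch` are hypothetical names, not existing Honcho functions; in a real patch `touch` would be whatever coroutine updates ActiveQueueSession.last_updated for the claimed work unit.

```python
import asyncio
import contextlib
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def run_with_heartbeat(
    work: Callable[[], Awaitable[None]],
    touch: Callable[[], Awaitable[None]],
    interval_s: float,
) -> None:
    """Run `work` while calling `touch` every `interval_s` seconds.

    Hypothetical helper: `work` stands in for the long process_item() call,
    `touch` for the refresh of the claim's last_updated column.
    """

    async def beat() -> None:
        while True:
            await asyncio.sleep(interval_s)
            try:
                await touch()
            except Exception:
                # One failed refresh should not abort the running task.
                logger.exception("heartbeat touch failed")

    heartbeat = asyncio.create_task(beat())
    try:
        await work()
    finally:
        # Stop refreshing as soon as the work finishes or raises, so a
        # crashed task still goes stale after the normal timeout.
        heartbeat.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await heartbeat
```

The interval would need to sit well below the stale timeout (for example a fraction of it) so that a single missed beat does not expire the claim. Because the heartbeat lives in the same process as the work, a crashed worker stops refreshing and is still recovered after the usual 5 minutes.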
Workaround we're running now: DERIVER_STALE_SESSION_TIMEOUT_MINUTES=45 (above our longest observed dream). Trade-off: a genuinely crashed worker now blocks its peer's queue for 45 min instead of 5.
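The trade-off in this workaround is what a per-task-type timeout would avoid. A minimal sketch of that check, under assumptions: `is_stale`, `STALE_BY_TASK_TYPE`, and the task-type strings are hypothetical, and the 45-minute value is just the workaround figure reused for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical settings: a short default plus a longer allowance for dreams.
DEFAULT_STALE = timedelta(minutes=5)
STALE_BY_TASK_TYPE = {"dream": timedelta(minutes=45)}


def is_stale(task_type: str, last_updated: datetime, now: datetime) -> bool:
    """Return True if a claim of this task type has outlived its timeout."""
    timeout = STALE_BY_TASK_TYPE.get(task_type, DEFAULT_STALE)
    return now - last_updated > timeout


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    ten_minutes_ago = now - timedelta(minutes=10)
    print(is_stale("dream", ten_minutes_ago, now))
    print(is_stale("representation", ten_minutes_ago, now))
```

Under this scheme only dream work units pay the slower crash recovery; every other task type keeps the 5-minute default.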
Happy to submit a PR for whichever direction maintainers prefer.