[Bug] Dream tasks outlive the stale work-unit claim timeout, spawning duplicate concurrent dream cycles for the same peer #794

Description

@TcDrozd

🐞 Bug Report

Describe the bug

Dream tasks run far longer (11–26+ min observed) than the deriver's stale-claim timeout (DERIVER.STALE_SESSION_TIMEOUT_MINUTES, default 5). The claim's last_updated is only refreshed after a queue item finishes processing (mark_queue_items_as_processed). A dream is a single queue item handled in one long process_item() call, so nothing touches last_updated while the dream runs.

Meanwhile every poll loop runs cleanup_stale_work_units() (queue_manager.py), which deletes any ActiveQueueSession older than the timeout. At +5 minutes the still-running dream's claim is deleted, the queue item (not yet marked processed) becomes claimable again, and an idle worker starts a duplicate concurrent dream cycle for the same peer. With DERIVER_WORKERS=3 we observed the same peer dreaming 3 times concurrently.

This bypasses DREAM_MIN_HOURS_BETWEEN_DREAMS entirely, because the re-delivery happens at queue level, below the scheduler. It also multiplies LLM token cost per dream by up to the worker count, and the concurrent same-peer dreams race each other while mutating the same observation set: we see them deleting each other's observations (Failed to delete observation ...: Document ... not found).

The worker's "lost ownership" check only runs between queue items, so it can't catch this: by the time the check fires, the dream has already fully completed (and burned its tokens).
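The mechanism described above can be reproduced with a toy model. This is only a sketch under stated assumptions, not Honcho's code: the function name simulate, the one-second poll interval, and the simplification that every dream takes the same time and the item is marked processed when the first cycle finishes are all hypothetical.

```python
def simulate(dream_seconds: int, workers: int,
             timeout_seconds: int = 300, poll_seconds: int = 1) -> list[int]:
    """Return the start times (seconds) of every dream cycle run for ONE queue item.

    Toy model: the claim's last_updated is set only when a worker claims the
    item and is never refreshed while the dream runs.
    """
    starts: list[int] = []       # start time of each dream cycle
    busy_until: list[int] = []   # finish time of each worker that took the item
    claim_last_updated = None    # None means no claim exists
    t = 0
    while True:
        # The first cycle finishing marks the item processed; re-delivery stops.
        if starts and t >= starts[0] + dream_seconds:
            return starts
        # Stand-in for the stale cleanup: drop a claim older than the timeout.
        if claim_last_updated is not None and t - claim_last_updated > timeout_seconds:
            claim_last_updated = None
        busy = sum(1 for end in busy_until if end > t)
        # An unclaimed, unprocessed item is picked up by any idle worker.
        if claim_last_updated is None and busy < workers:
            claim_last_updated = t
            starts.append(t)
            busy_until.append(t + dream_seconds)
        t += poll_seconds


print(simulate(dream_seconds=1575, workers=3))  # three overlapping cycles
print(simulate(dream_seconds=1575, workers=1))  # a single worker cannot duplicate
print(simulate(dream_seconds=200, workers=3))   # a short dream finishes before the timeout
```

In this model the number of cycles is capped by the worker count, which matches the three concurrent dreams seen with DERIVER_WORKERS=3.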


To Reproduce

  1. Configure DERIVER_WORKERS > 1 and leave DERIVER_STALE_SESSION_TIMEOUT_MINUTES at the default 5
  2. Use a dream model/setup where a dream cycle takes > 5 minutes (easy with local models; ours run 11–26 min)
  3. Trigger a dream (scheduled or manual)
  4. Watch a duplicate "Starting dream cycle" log line for the same peer appear almost exactly STALE_SESSION_TIMEOUT_MINUTES later, while the first is still running

Expected behaviour

One enqueued dream task results in exactly one dream cycle. A work-unit claim should stay alive while its task is actively processing.


Evidence

Deriver logs (UTC, WORKERS=3, STALE_SESSION_TIMEOUT_MINUTES=5). The scheduler enqueued exactly 2 dream tasks; 5 dream cycles ran:

03:01:18 Enqueued dream task for ws/hermes/hermes (type: omni)
03:01:19 Enqueued dream task for ws/8598510674/hermes (type: omni)
03:01:20 [f0d193ff] Starting dream cycle for ws/8598510674/hermes
03:01:20 [30c96d50] Starting dream cycle for ws/hermes/hermes
03:06:21 [ee5148bb] Starting dream cycle for ws/8598510674/hermes   <- +5m01s after f0d193ff, same peer, f0d193ff still running
03:18:06 Dream completed: run_id=ee5148bb (704s)
03:18:06 [b467eb3c] Starting dream cycle for ws/hermes/hermes       <- started the moment ee5148bb's worker freed up; 30c96d50 still running
03:19:32 Dream completed: run_id=f0d193ff (1092s)
03:23:07 [41387775] Starting dream cycle for ws/hermes/hermes       <- +5m01s after b467eb3c started
03:27:35 Dream completed: run_id=30c96d50 (1575s)
03:34:41 Dream completed: run_id=b467eb3c (995s)
03:37:40 Dream completed: run_id=41387775 (872s)

Duplicates appear ~5m01s after the start of a still-running dream for the same peer, which is exactly the stale timeout plus one poll interval. b467eb3c is the exception: it started the second ee5148bb's worker freed up. A manually triggered dream reproduced it too: started 21:10:40, duplicate at 21:15:40 (+5m00.2s).
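The offsets quoted above can be checked with a few lines of Python. The timestamps are copied from the log excerpt at whole-second resolution; the helper name offset_seconds is only illustrative.

```python
from datetime import datetime


def offset_seconds(start: str, later: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(later, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# (start of the still-running dream, start of its duplicate)
pairs = {
    "ee5148bb after f0d193ff": ("03:01:20", "03:06:21"),
    "41387775 after b467eb3c": ("03:18:06", "03:23:07"),
    "manual trigger": ("21:10:40", "21:15:40"),
}

offsets = {label: offset_seconds(a, b) for label, (a, b) in pairs.items()}
for label, seconds in offsets.items():
    print(f"{label}: +{seconds}s")
```

All three land within about a second of the 300 s timeout, which is what a timeout-driven re-delivery would look like.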

During the overlap window, the concurrent same-peer dreams interfered with each other: 16 Failed to delete observation warnings and a Tool execution loop reached max iterations as specialists worked against observations the twin run had already deleted/changed.


Your environment

  • OS: Ubuntu (kernel 6.8), Docker Compose deployment
  • Honcho Server Version: v3.0.6 (e659b6b); bug confirmed still present on current main (cleanup_stale_work_units unchanged, no heartbeat during process_item)
  • Dream models: local via Ollama (qwen3-14b), but any dream exceeding 5 min triggers this, including slow API calls/retries

Additional context

Possible fix directions, in rough order of preference:

  1. Heartbeat: refresh ActiveQueueSession.last_updated while a task is actively processing, e.g. touch it per tool-loop iteration inside the dream, or run a small periodic touch task alongside process_item(). Keeps crash recovery fast (5 min) while making long tasks safe.
  2. Per-task-type timeout: dreams are expected to run tens of minutes; representation batches aren't. A separate (longer) stale timeout for dream work units would be a smaller change.
  3. At minimum, document the interaction so operators size STALE_SESSION_TIMEOUT_MINUTES above their worst-case dream duration.
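To make direction 1 concrete, here is a rough asyncio sketch of a periodic touch task running alongside a long task. It is not taken from the Honcho codebase: run_with_heartbeat, touch, and the demo coroutines are hypothetical names, and in a real fix touch would be whatever updates ActiveQueueSession.last_updated for the claimed work unit.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_with_heartbeat(
    task: Callable[[], Awaitable[str]],
    touch: Callable[[], Awaitable[None]],
    interval_seconds: float,
) -> str:
    """Run task() while calling touch() every interval_seconds until it ends."""

    async def beat() -> None:
        while True:
            await asyncio.sleep(interval_seconds)
            await touch()

    heartbeat = asyncio.create_task(beat())
    try:
        return await task()
    finally:
        # Stop the heartbeat whether the task succeeded, failed, or was cancelled.
        heartbeat.cancel()
        # return_exceptions=True absorbs the heartbeat's CancelledError (and any
        # error touch() raised); a real fix would probably want to log the latter.
        await asyncio.gather(heartbeat, return_exceptions=True)


async def demo() -> tuple[str, int]:
    touches: list[float] = []

    async def touch() -> None:
        # Stand-in for refreshing the claim's last_updated (assumption).
        touches.append(asyncio.get_running_loop().time())

    async def long_task() -> str:
        # Stand-in for a long process_item() call (assumption).
        await asyncio.sleep(0.25)
        return "done"

    result = await run_with_heartbeat(long_task, touch, interval_seconds=0.05)
    return result, len(touches)


result, touch_count = asyncio.run(demo())
print(result, touch_count)
```

The interval would need to sit comfortably below the stale timeout (for example one minute against the 5 minute default) so a single slow or missed touch does not expire the claim. A worker that crashes stops touching, so its claim would still go stale within the timeout and crash recovery stays fast.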

Workaround we're running now: DERIVER_STALE_SESSION_TIMEOUT_MINUTES=45 (above our longest observed dream). Trade-off: a genuinely crashed worker now blocks its peer's queue for 45 min instead of 5.

Happy to submit a PR for whichever direction maintainers prefer.
