
[Bug]: everOS /health hangs after 2-5 successful KB uploads — single asyncio event loop deadlocks between cascade + extract_atomic_facts #316

Description

@02781477

everos server start hangs after N successful KB uploads — single asyncio event loop appears to deadlock during cascade + extract_atomic_facts + LanceDB write

Summary

When ingesting multiple KB documents in sequence (via /api/v1/knowledge/documents), everOS reliably hangs after a small number of successful uploads (2–5 in our runs). /health then times out or returns 502, even though the preceding uploads returned HTTP 201. The process never recovers and must be killed.

This appears to be a deadlock in the single asyncio event loop between:

  1. The HTTP request handler waiting on extract_knowledge to finish
  2. The cascade_worker calling extract_atomic_facts on a freshly-written doc
  3. Both eventually needing the same LanceDB write slot
  4. cascade_worker holding a processing row in md_change_state, blocking its own retry chain

After a few successful chunks, the OME pipeline stops making forward progress even though llama-server reports requests_processing=0 and all upstream LLM calls have returned.

Why this matters

The user-visible failure mode is severe:

  • /health stops responding → monitoring / MCP tool calls all hang
  • kb-ingest.progress stops updating (the watchdog has no signal)
  • A new client POST /api/v1/knowledge/documents hangs forever, never receiving a response
  • Process must be hard-killed; on restart, cascade rebuild (2 s) returns but the queue still has stale processing rows from the prior run

Reproduction is trivial: upload more than about five knowledge-base (KB) documents back-to-back. The lockup happens on a single-host, single-process deployment, which is exactly the in-scope use case for v1.x.

Observed evidence

everOS log near the hang

2026-06-28T12:33:41.329Z [info] document created  doc_id=d_6e1ee6c085c1  topic_count=15
2026-06-28T12:33:41.334Z [info] POST /api/v1/knowledge/documents 201
2026-06-28T12:34:07.205Z [info] document created  doc_id=d_845129d6c5d9  topic_count=1
2026-06-28T12:34:07.212Z [info] POST /api/v1/knowledge/documents 201
2026-06-28T12:37:20.458Z [info] GET /health 200              <-- last successful /health
2026-06-28T12:38:09.[...] [error] LLMError "Request timed out."

llama-server metrics during the hang

llamacpp:requests_processing    0    <-- LLM is idle
llamacpp:requests_deferred      0
llamacpp:n_busy_slots_per_decode 1.05
llamacpp:n_decode_total         22367
llamacpp:n_tokens_max           52811

The model has finished work. everOS is stuck after the LLM returns.

Python state of the hung everOS process

  • Open TCP connections: 1 ESTABLISHED from ingest client + ~5 internal
  • Threads: 89 (vs ~50 at idle)
  • Handles: 613
  • Memory: ~450 MB RSS, slowly climbing
  • CPU: ~3-5 % (only event-loop housekeeping)

A traceback captured at hang time:

File "...\starlette\middleware\errors.py", line 165, in __call__
  await self.app(scope, receive, _send)
LLMError: 'Request timed out.'

httpx.ReadTimeout     timeout=NOT_GIVEN
AsyncHTTP11Connection ['http://127.0.0.1:8585', CLOSED]

The connection to llama-server has been closed by the server side after some unknown timeout, but everOS's task is still awaiting it.

Reproduction shape

Single-machine reproduction on Windows + Python 3.12 + EverOS 1.1.0:

  1. everos init, then edit <root>/everos.toml with any OpenAI-compatible LLM endpoint (we tested with Qwen3.5-9B-UD-Q4_K_XL via llama-server and minimax-m3 via https://api.minimaxi.com/v1 — both reproduce)
  2. everos server start
  3. From another shell, upload 10–20 knowledge-base documents sequentially, at ~10 s per file, via POST /api/v1/knowledge/documents with multipart/form-data
  4. Watch kb-ingest.progress (or GET /api/v1/knowledge/documents)
  5. After 2–5 uploads return 201, the next upload hangs
  6. curl http://127.0.0.1:8000/health → timeout
  7. Killing everos and restarting recovers, but the next round repeats

The Python client we used (urllib with timeout=300):

import urllib.request

def upload(url, body, boundary):  # wrapper name is illustrative
    req = urllib.request.Request(url, data=body, headers={"Content-Type": f"multipart/form-data; boundary={boundary}"}, method="POST")
    with urllib.request.urlopen(req, timeout=300) as r:
        return r.status, r.read()
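For anyone reproducing this, the `body` and `boundary` used by the client can be built with the standard library alone. This is a sketch, not the exact script we ran: the helper name `encode_multipart`, the form field name `"file"`, and the `text/markdown` content type are assumptions and may need adjusting to what the endpoint expects.

```python
import uuid


def encode_multipart(field: str, filename: str, payload: bytes,
                     content_type: str = "text/markdown") -> tuple[bytes, str]:
    """Encode a single file as multipart/form-data; returns (body, boundary)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    # The closing delimiter is the boundary followed by two extra dashes.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, boundary


if __name__ == "__main__":
    # "file" is an assumed field name, not confirmed by the API docs.
    body, boundary = encode_multipart("file", "doc.md", b"# hello")
    print(len(body), boundary)
```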

Expected behavior

Each POST /api/v1/knowledge/documents should:

  1. Accept the request
  2. Run extract_knowledge (LLM call) and write the markdown + cascade entry
  3. Return 201 within bounded time

/health should keep returning 200 throughout, even while extract_atomic_facts / cascade are processing in the background. The cascade worker should not block user-facing API responses.
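The expected shape can be illustrated with a small asyncio simulation: the handler does its foreground work, hands the cascade job to a bounded queue without waiting on it, and returns. This is only a sketch under stated assumptions. `handle_upload`, `cascade_worker`, `health`, the queue size, and the 503 fallback are all hypothetical, and the sleeps stand in for extract_knowledge and extract_atomic_facts.

```python
import asyncio


async def cascade_worker(queue: asyncio.Queue, done: list) -> None:
    """Drain cascade jobs one at a time, independent of request handlers."""
    while True:
        doc_id = await queue.get()
        try:
            await asyncio.sleep(0.05)  # stand-in for extract_atomic_facts
            done.append(doc_id)
        finally:
            queue.task_done()


async def handle_upload(doc_id: str, queue: asyncio.Queue) -> int:
    """Do the foreground work, enqueue the cascade job, return a status."""
    await asyncio.sleep(0.01)  # stand-in for extract_knowledge
    try:
        queue.put_nowait(doc_id)  # never wait on the cascade worker
    except asyncio.QueueFull:
        return 503  # assumed policy: shed load instead of hanging the client
    return 201


async def health() -> int:
    return 200


async def demo() -> tuple[list, int, list]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    done: list = []
    worker = asyncio.create_task(cascade_worker(queue, done))
    statuses = [await handle_upload(f"d_{i}", queue) for i in range(3)]
    # Health must answer while cascade jobs are still pending.
    health_status = await asyncio.wait_for(health(), timeout=0.1)
    await queue.join()  # wait for the background work only in this demo
    worker.cancel()
    return statuses, health_status, done


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the bounded queue is that back-pressure becomes an explicit, fast response rather than an open-ended wait inside the request handler.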

Suspected root cause

Based on observable behavior, the deadlock likely involves these pieces competing in the single event loop:

  1. Runner.run() in infra/ome/_dispatch/runner.py:128 holds the OME engine semaphore for the entire retry chain (max_retries × timeout). When the inner LLM call exceeds the httpx client timeout, the retry keeps re-entering and the semaphore never releases.

  2. cascade_worker _run_loop() calls extract_atomic_facts for each newly-written doc. While processing one row, it holds an internal claim (status='processing' in md_change_state) and is awaiting an LLM call. New requests that try to ingest a doc want the same LanceDB connection pool.

  3. LanceDB writes (knowledge_topic.lance) serialize through a single writer. cascade_worker trying to upsert + the request handler trying to upsert + extract_atomic_facts all touching the same table → cross-task dependency on the same async resource.

  4. The semaphore has no per-attempt timeout: if a downstream task (cascade) holds the slot and is itself awaiting the LLM, all upstream tasks wait indefinitely.

The combination of:

  • single asyncio event loop
  • one shared engine semaphore
  • one cascade worker serializing through that semaphore
  • one LanceDB writer per table

creates a deadlock window whenever an LLM call outlasts the request handler timeout while the cascade worker happens to be processing the same table.
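The suspected pattern can be reproduced in isolation. The sketch below is not everOS code; `runner_holding_slot` and `ingest_request` are hypothetical stand-ins. One task holds a single-slot semaphore across its whole retry chain while a second task waits for the same slot. Here the second task gives up after a fixed patience so the demo terminates; without that bound, and with a retry chain that never ends, it would wait forever.

```python
import asyncio


async def slow_llm_call() -> None:
    await asyncio.sleep(0.2)  # stand-in for an LLM call that exceeds its timeout


async def runner_holding_slot(sem: asyncio.Semaphore, retries: int) -> None:
    """Hold the slot across the whole retry chain (the suspected pattern)."""
    async with sem:
        for _ in range(retries):
            try:
                await asyncio.wait_for(slow_llm_call(), timeout=0.05)
            except asyncio.TimeoutError:
                continue  # retry without releasing the slot


async def ingest_request(sem: asyncio.Semaphore, patience: float) -> str:
    """A second task that needs the same slot and gives up after `patience`."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout=patience)
    except asyncio.TimeoutError:
        return "starved"
    sem.release()
    return "served"


async def demo() -> str:
    sem = asyncio.Semaphore(1)
    holder = asyncio.create_task(runner_holding_slot(sem, retries=4))
    await asyncio.sleep(0)  # let the holder take the slot first
    result = await ingest_request(sem, patience=0.1)
    await holder
    return result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The holder keeps the slot for roughly retries × timeout (4 × 0.05 s), which is longer than the waiter's 0.1 s patience, so the waiter never gets in.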

Workarounds we tried (all partial)

| Workaround | Effect |
| --- | --- |
| Increase LLMConfig.timeout 60→300 s | Fewer timeouts, but doesn't prevent the hang; it only affects the LLM call, and the hang still happens downstream |
| Disable cascade scanner via EVEROS_CASCADE_SCANNER_DISABLED=1 | Reduces noise, but cascade_worker still runs |
| Reduce -c to give more llama-server headroom | Doesn't help; the bottleneck is not the LLM |
| Add more llama-server slots | Doesn't help; only 1–2 requests are in flight at hang time |
| Drop --reasoning on llama-server | Helps indirectly but does not eliminate the hang |

The most reliable workaround we found is to bulk-import the markdown files directly into <root>/default_app/default_project/knowledge/Technology/... and then let cascade drain at its own pace with no concurrent ingest. But this defeats the purpose of the HTTP API and doesn't work for users who want to upload programmatically.

Environment

EverOS: 1.1.0 (PyPI) — also reproduced on 1.0.1
Python: 3.12
Runtime: bare-metal Windows 11
LLM: Qwen3.5-9B-UD-Q4_K_XL (local llama-server b9469, --reasoning off, --cache-type-k q8_0)
     and minimax/minimax-m3 (https://api.minimaxi.com/v1)
Embedding: 9B via llama-server --embedding --pooling last, everOS truncates to 1024-d
KB size at hang: ~840 documents, ~5 GB LanceDB knowledge_topic
Concurrent uploads: 1 (sequential ingest script, ~10 s per file, ~80 KB chunks)

Possible fixes (suggestion only — maintainers decide)

This is not a small change, but a few directions look promising:

  1. Make Runner.run (and the OME semaphore) honor a per-attempt deadline so a stuck task can't hold the semaphore indefinitely.
  2. Decouple cascade_worker from the OME engine semaphore — let it run in its own bounded queue with a smaller concurrency budget, so ingest requests never wait on cascade.
  3. Add a watchdog task that, every N seconds, force-releases stuck processing rows and resets their retry_count.
  4. Make /health truly independent — currently it goes through the same middleware chain that can be blocked.
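As an illustration of direction 1, here is a minimal sketch of a retry loop that takes the semaphore per attempt, bounds each attempt with a deadline, and releases the slot before retrying. It does not reflect the actual Runner.run signature; `run_with_per_attempt_slot` and its parameters are hypothetical.

```python
import asyncio


async def run_with_per_attempt_slot(sem, call, *, retries: int, attempt_timeout: float):
    """Acquire the slot per attempt and release it before the next retry."""
    last_error = None
    for _ in range(retries):
        async with sem:  # released at the end of every attempt
            try:
                return await asyncio.wait_for(call(), timeout=attempt_timeout)
            except asyncio.TimeoutError as exc:
                last_error = exc
        await asyncio.sleep(0)  # yield so waiting tasks can take the slot
    raise TimeoutError("all attempts timed out") from last_error


async def demo() -> list:
    sem = asyncio.Semaphore(1)
    events: list = []

    async def stuck_call() -> None:
        await asyncio.sleep(1.0)  # always exceeds the attempt deadline

    async def holder() -> None:
        try:
            await run_with_per_attempt_slot(
                sem, stuck_call, retries=3, attempt_timeout=0.05
            )
        except TimeoutError:
            events.append("holder gave up")

    async def waiter() -> None:
        async with sem:
            events.append("waiter served")

    holder_task = asyncio.create_task(holder())
    await asyncio.sleep(0)  # let the holder take the slot first
    await asyncio.gather(holder_task, asyncio.create_task(waiter()))
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With this shape a stuck call can occupy the slot for at most one attempt_timeout at a time, so other tasks get a turn between attempts instead of waiting for the whole retry chain.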

Happy to test any patches or provide more traces if helpful.
