
fix(build): hard timeout on hax_client.create_request to prevent silent hangs#76

Merged
AbirAbbas merged 2 commits into main from fix/hax-create-request-timeout on May 11, 2026
Conversation

@AbirAbbas
Collaborator

Symptom

run_1778512783034_f4985c96 — github-buddy.implement_from_issue tripped its 7200s active-time watchdog after 8011s of wallclock even though the agentfield#568/#569 pause-cascade fix was correctly subtracting paused time:

  • 15:32:06 — first HITL cycle started (cascade wired, parent_pause_clock_id=139880527325184)
  • 16:03:23 → 16:16:53 — start_pause → end_pause fired cleanly, 810.21s correctly subtracted
  • 16:16:53 → 16:29:45 — revision iteration (run_architect → run_tech_lead → run_sprint_planner) all succeeded
  • 16:29:45 → 17:45:36 — 76 min of nothing. No new reasoners. No second app.pause(). swe-planner.build stayed listed as running and never flipped to waiting/waiting_for_approval.
  • 17:45:36 — watchdog tripped. Math: wall=8011s − paused=810s = active=7201s > budget=7200s. Math is exact. The fix from #568/#569 is doing its job correctly — the upstream hang shouldn't have been allowed to last 76 min in the first place.
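The watchdog arithmetic above can be checked directly (illustrative Python; the constant name is made up here, not agentfield's):

```python
# Pause-aware watchdog math from the run above; names are illustrative only.
WATCHDOG_BUDGET_S = 7200
wall_s, paused_s = 8011, 810

active_s = wall_s - paused_s          # 7201: legitimate pause time subtracted
tripped = active_s > WATCHDOG_BUDGET_S  # 7201 > 7200, so the watchdog fires
```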

Cause

Between run_sprint_planner completing and the next intended app.pause(), the SWE-AF code path runs:

  1. _format_plan_for_approval(plan_result) — pure Python, very unlikely to hang for 76 min
  2. hax_client.create_request(**hax_create_kwargs) — synchronous HTTP call to hax-sdk, no client-side timeout
  3. open(approval_state_path, "w") … _json.dump(…) — file IO, basically can't hang
  4. await app.pause(...) — known-good (first cycle exercised it end-to-end)

(2) is the only external blocking call and the highest-likelihood culprit. A wedged hax-sdk would cause create_request to block indefinitely.

What this PR does NOT prove

I have not reproduced this locally (no stack trace from the stuck swe-planner.build process, no fake hax-sdk repro). The fix is targeted at the highest-likelihood candidate based on the timeline + code path. If the actual hang is elsewhere (e.g., a 76-min _format_plan_for_approval, which I don't believe but can't rule out from outside), this PR doesn't address it.

The cost of being narrowly scoped is low: the timeout wrap only fires if hax_client.create_request actually takes > 120s, and it would make the diagnostic signal clear (a fast, loud RuntimeError instead of an opaque 2-hour watchdog timeout up the call chain).

What this PR does

Wrap the synchronous hax_client.create_request(**hax_create_kwargs) in:

asyncio.wait_for(asyncio.to_thread(hax_client.create_request, **hax_create_kwargs), timeout=120.0)

Plus app.note() markers at entry, success (with request_id), and on each error branch (TimeoutError distinguished from any other exception) so the run timeline shows exactly when the call started, succeeded, or got stuck.

120s gives hax-sdk room for cold-start; anything beyond that is almost certainly wedged.

Test plan

  • python -m compileall -q swe_af/ — clean
  • pytest tests/ — 917 passed, 1 skipped, 1 deselected (the pre-existing local-env tests.conftest._is_real_host collision with hax-sdk's editable install — same failure on main, unrelated to this PR)
  • Deploy → trigger another implement_from_issue run that reaches Phase 1.5 revision
  • Confirm second create_request either succeeds or surfaces a clear RuntimeError within 120s instead of hanging

AbirAbbas and others added 2 commits May 11, 2026 16:05
…nt hangs

Without this, a wedged hax-sdk causes app.pause() to never be reached:
the surrounding reasoner sits in the Phase 1.5 revision loop with no
new sub-reasoners spawned and no app.pause() to mark the execution
WAITING. The parent reasoner (e.g. github-buddy.implement_from_issue)
keeps accumulating active-time on its pause-aware watchdog and
eventually trips the default 7200s budget — even though the cascade
fix from agentfield#568/#569 is doing exactly what it's supposed to.

Run that motivated this:
run_1778512783034_f4985c96 — first hax cycle worked end-to-end
(start_pause→end_pause→revision iter). After the revision phases
completed at 16:29:45, swe-planner.build was supposed to format a new
hax payload and call hax_client.create_request → app.pause for the
second cycle. Instead build sat doing nothing visible for 76 minutes
and github-buddy.implement_from_issue tripped its watchdog at 17:45:36
with active_elapsed=7201s (810s of legitimate pause time correctly
subtracted by the cascade — math was right; the upstream hang
shouldn't have been allowed to run that long in the first place).

Fix: wrap the synchronous hax_client.create_request in
asyncio.wait_for(asyncio.to_thread(...), timeout=120) so a wedged
hax-sdk surfaces as a fast, loud RuntimeError instead of an opaque
multi-hour watchdog timeout up the call chain. Adds app.note() before
the call (entry), on success (with request_id), and on each error
branch (TimeoutError specifically vs anything else) so the run timeline
shows exactly when the call started, succeeded, or got stuck.

120s gives hax-sdk reasonable headroom for cold-start; anything
beyond that is almost certainly wedged. If hax-sdk's create endpoint
ever legitimately needs more, we can bump the constant — but the
current behaviour of "wait forever" is strictly worse than any
finite value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/error

Adds a focused test file pinning the wrapper from the prior commit
(_create_hax_request_with_timeout). 50 parametrized iterations + 2
sanity tests catch flakes in async scheduler / thread-pool behaviour
that a single-run check would miss.

Refactor: extracts the inline timeout block from build() into a
module-level helper of the same name so tests can drive it directly
without standing up a full reasoner invocation. Same control flow,
same timeout, same notes — just hoisted for reuse and testability.
The module-level HAX_CREATE_REQUEST_TIMEOUT_SECONDS=120.0 is the
production default and a sanity test pins it against accidental drift.

Test coverage:
  * 20× hung create_request → must raise RuntimeError within
    [timeout, 4*timeout]. Pins both lower bound (no spurious early
    failure) and upper bound (the actual production bug — unbounded
    wait — would blow past 4*timeout).
  * 20× responsive create_request → must return its response,
    complete well under timeout, fire the 'submitted' app.note.
  * 10× ConnectionError → must propagate unchanged (operators need
    to tell hang from network failure apart).
  * configurable timeout_seconds kwarg is honoured (not silently
    overridden by the module default).
  * module default is 120s (drift guard).

Test characteristics:
  * Full file runs in ~32s locally (PYENV_VERSION=3.12.5).
  * hang_seconds=1.5 in the fake client — long enough to outlive the
    0.3s test timeout with margin, short enough that orphaned threads
    (asyncio.to_thread doesn't kill the underlying thread on cancel)
    don't pile up and starve the default executor pool.
  * Fixture-scoped patch of app.note prevents real HTTP calls during
    tests.

What this PR does NOT prove: the actual production hang IS in
hax_client.create_request specifically. I still haven't reproduced the
production failure end-to-end (would require a stack dump from the
stuck swe-planner.build container OR a fake hax-sdk + full
control-plane setup). The fix is targeted at the highest-likelihood
candidate from timeline analysis. If the actual hang is elsewhere
(e.g. _format_plan_for_approval), this PR doesn't address it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas merged commit 97e8112 into main May 11, 2026
2 checks passed
@AbirAbbas AbirAbbas deleted the fix/hax-create-request-timeout branch May 11, 2026 21:10