fix(build): hard timeout on hax_client.create_request to prevent silent hangs #76
Merged
…nt hangs

Without this, a wedged hax-sdk causes app.pause() to never be reached: the surrounding reasoner sits in the Phase 1.5 revision loop with no new sub-reasoners spawned and no app.pause() to mark the execution WAITING. The parent reasoner (e.g. github-buddy.implement_from_issue) keeps accumulating active-time on its pause-aware watchdog and eventually trips the default 7200s budget — even though the cascade fix from agentfield#568/#569 is doing exactly what it's supposed to.

Run that motivated this: run_1778512783034_f4985c96 — the first hax cycle worked end-to-end (start_pause→end_pause→revision iter). After the revision phases completed at 16:29:45, swe-planner.build was supposed to format a new hax payload and call hax_client.create_request → app.pause for the second cycle. Instead build sat doing nothing visible for 76 minutes and github-buddy.implement_from_issue tripped its watchdog at 17:45:36 with active_elapsed=7201s (810s of legitimate pause time correctly subtracted by the cascade — the math was right; the upstream hang shouldn't have been allowed to run that long in the first place).

Fix: wrap the synchronous hax_client.create_request in asyncio.wait_for(asyncio.to_thread(...), timeout=120) so a wedged hax-sdk surfaces as a fast, loud RuntimeError instead of an opaque multi-hour watchdog timeout up the call chain. Adds app.note() before the call (entry), on success (with request_id), and on each error branch (TimeoutError specifically vs anything else) so the run timeline shows exactly when the call started, succeeded, or got stuck.

120s gives hax-sdk reasonable headroom for cold-start; anything beyond that is almost certainly wedged. If hax-sdk's create endpoint ever legitimately needs more, we can bump the constant — but the current behaviour of "wait forever" is strictly worse than any finite value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/error
Adds a focused test file pinning the wrapper from the prior commit
(_create_hax_request_with_timeout). 50 parametrized iterations + 2
sanity tests catch flakes in async scheduler / thread-pool behaviour
that a single-run check would miss.
Refactor: extracts the inline timeout block from build() into a
module-level helper of the same name so tests can drive it directly
without standing up a full reasoner invocation. Same control flow,
same timeout, same notes — just hoisted for reuse and testability.
The module-level HAX_CREATE_REQUEST_TIMEOUT_SECONDS=120.0 is the
production default and a sanity test pins it against accidental drift.
Test coverage:
* 20× hung create_request → must raise RuntimeError within
[timeout, 4*timeout]. Pins both lower bound (no spurious early
failure) and upper bound (the actual production bug — unbounded
wait — would blow past 4*timeout).
* 20× responsive create_request → must return its response,
complete well under timeout, fire the 'submitted' app.note.
* 10× ConnectionError → must propagate unchanged (operators need
  to tell a hang apart from a network failure).
* configurable timeout_seconds kwarg is honoured (not silently
overridden by the module default).
* module default is 120s (drift guard).
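The hung-client bound check can be sketched as a self-contained snippet. Here a local `_bounded` stand-in plays the role of the production wrapper (the real tests drive `_create_hax_request_with_timeout` itself); `HANG`/`TIMEOUT` mirror the 1.5s/0.3s values used in the test file.

```python
import asyncio
import time

TIMEOUT = 0.3
HANG = 1.5  # outlives the timeout; the orphaned thread exits on its own

async def _bounded(fn, timeout_seconds):
    # Stand-in for the production wrapper: bounded off-loop call that
    # converts a timeout into a loud RuntimeError.
    try:
        return await asyncio.wait_for(asyncio.to_thread(fn), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise RuntimeError("create_request wedged") from None

async def _elapsed_until_error():
    start = time.monotonic()
    try:
        await _bounded(lambda: time.sleep(HANG), TIMEOUT)
        raise AssertionError("expected RuntimeError")
    except RuntimeError:
        return time.monotonic() - start

# Measured inside the coroutine, before loop shutdown joins the worker thread.
elapsed = asyncio.run(_elapsed_until_error())
# Lower bound: no spurious early failure. Upper bound: an unbounded wait
# (the production bug) would blow past 4x the timeout.
assert TIMEOUT <= elapsed <= 4 * TIMEOUT
```

Measuring elapsed time inside the coroutine matters: `asyncio.run()` joins the default executor's threads on shutdown, so timing around the `asyncio.run()` call itself would include the full hang.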
Test characteristics:
* Full file runs in ~32s locally (PYENV_VERSION=3.12.5).
* hang_seconds=1.5 in the fake client — long enough to outlive the
0.3s test timeout with margin, short enough that orphaned threads
(asyncio.to_thread doesn't kill the underlying thread on cancel)
don't pile up and starve the default executor pool.
* Fixture-scoped patch of app.note prevents real HTTP calls during
tests.
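The orphaned-thread caveat is easy to demonstrate: cancelling the awaitable does not stop the worker thread, which only gets joined when the loop's default executor shuts down. A minimal illustration (all names local to this snippet):

```python
import asyncio
import time

flag = {"finished": False}

def blocking():
    time.sleep(0.5)          # stands in for a wedged create_request
    flag["finished"] = True

async def main():
    try:
        await asyncio.wait_for(asyncio.to_thread(blocking), timeout=0.1)
    except asyncio.TimeoutError:
        pass
    # The await was abandoned, but the worker thread is still sleeping.
    return flag["finished"]

still_unfinished = asyncio.run(main())
assert still_unfinished is False
# asyncio.run() joined the default executor on shutdown, so by now the
# orphaned thread has run to completion anyway.
assert flag["finished"] is True
```

This is why a short hang_seconds matters in the fake client: every timed-out test iteration leaves a thread running until its sleep expires, and long sleeps would pile up in the shared executor pool.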
What this PR does NOT prove: that the actual production hang IS in
hax_client.create_request specifically. I still haven't reproduced the
production failure end-to-end (would require a stack dump from the
stuck swe-planner.build container OR a fake hax-sdk + full
control-plane setup). The fix is targeted at the highest-likelihood
candidate from timeline analysis. If the actual hang is elsewhere
(e.g. _format_plan_for_approval), this PR doesn't address it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom
run_1778512783034_f4985c96 — github-buddy.implement_from_issue tripped its 7200s active-time watchdog after 8011s of wallclock even though the agentfield#568/#569 pause-cascade fix was correctly subtracting paused time:

* First hax cycle (parent_pause_clock_id=139880527325184): start_pause→end_pause fired cleanly, 810.21s correctly subtracted
* Revision phases (run_architect→run_tech_lead→run_sprint_planner) all succeeded
* Then no app.pause(). swe-planner.build stayed listed as running and never flipped to waiting/waiting_for_approval.
* wall=8011s − paused=810s = active=7201s > budget=7200s. Math is exact. The fix from #568/#569 is doing its job correctly — the upstream hang shouldn't have been allowed to last 76 min in the first place.

Cause
Between run_sprint_planner completing and the next intended app.pause(), the SWE-AF code path runs:

1. _format_plan_for_approval(plan_result) — pure Python, very unlikely to hang 76 min
2. hax_client.create_request(**hax_create_kwargs) — synchronous HTTP call to hax-sdk, no client-side timeout
3. open(approval_state_path, "w") … _json.dump(…) — file IO, basically can't hang
4. await app.pause(...) — known-good (first cycle exercised it end-to-end)

(2) is the only external blocking call and the highest-likelihood culprit. A wedged hax-sdk would cause create_request to block indefinitely.

What this PR does NOT prove
I have not reproduced this locally (no stack trace from the stuck swe-planner.build process, no fake hax-sdk repro). The fix is targeted at the highest-likelihood candidate based on the timeline + code path. If the actual hang is elsewhere (e.g., a 76-min _format_plan_for_approval, which I don't believe but can't rule out from outside), this PR doesn't address it.

The cost of being narrowly scoped is low: the timeout wrap only fires if hax_client.create_request actually takes > 120s, and it would make the diagnostic signal clear (a fast, loud RuntimeError instead of an opaque 2-hour watchdog timeout up the call chain).

What this PR does
Wrap the synchronous hax_client.create_request(**hax_create_kwargs) in asyncio.wait_for(asyncio.to_thread(...), timeout=120), plus app.note() markers at entry, on success (with request_id), and on each error branch (TimeoutError distinguished from any other exception) so the run timeline shows exactly when the call started, succeeded, or got stuck.

120s gives hax-sdk room for cold-start; anything beyond that is almost certainly wedged.
Test plan

* python -m compileall -q swe_af/ — clean
* pytest tests/ — 917 passed, 1 skipped, 1 deselected (the pre-existing local-env tests.conftest._is_real_host collision with hax-sdk's editable install — same failure on main, unrelated to this PR)
* implement_from_issue run that reaches Phase 1.5 revision
* create_request either succeeds or surfaces a clear RuntimeError within 120s instead of hanging