fix(build): hard timeout on hax_client.create_request to prevent silent hangs #76
Merged
…nt hangs

Without this, a wedged hax-sdk causes app.pause() to never be reached: the surrounding reasoner sits in the Phase 1.5 revision loop with no new sub-reasoners spawned and no app.pause() to mark the execution WAITING. The parent reasoner (e.g. github-buddy.implement_from_issue) keeps accumulating active-time on its pause-aware watchdog and eventually trips the default 7200s budget — even though the cascade fix from agentfield#568/#569 is doing exactly what it's supposed to.

Run that motivated this: run_1778512783034_f4985c96 — the first hax cycle worked end-to-end (start_pause→end_pause→revision iter). After the revision phases completed at 16:29:45, swe-planner.build was supposed to format a new hax payload and call hax_client.create_request → app.pause for the second cycle. Instead build sat doing nothing visible for 76 minutes and github-buddy.implement_from_issue tripped its watchdog at 17:45:36 with active_elapsed=7201s (810s of legitimate pause time correctly subtracted by the cascade — the math was right; the upstream hang shouldn't have been allowed to run that long in the first place).

Fix: wrap the synchronous hax_client.create_request in asyncio.wait_for(asyncio.to_thread(...), timeout=120) so a wedged hax-sdk surfaces as a fast, loud RuntimeError instead of an opaque multi-hour watchdog timeout up the call chain. Adds app.note() before the call (entry), on success (with request_id), and on each error branch (TimeoutError specifically vs anything else) so the run timeline shows exactly when the call started, succeeded, or got stuck.

120s gives hax-sdk reasonable headroom for cold-start; anything beyond that is almost certainly wedged. If hax-sdk's create endpoint ever legitimately needs more, we can bump the constant — but the current behaviour of "wait forever" is strictly worse than any finite value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/error
Adds a focused test file pinning the wrapper from the prior commit
(_create_hax_request_with_timeout). 50 parametrized iterations + 2
sanity tests catch flakes in async scheduler / thread-pool behaviour
that a single-run check would miss.
Refactor: extracts the inline timeout block from build() into a
module-level helper of the same name so tests can drive it directly
without standing up a full reasoner invocation. Same control flow,
same timeout, same notes — just hoisted for reuse and testability.
The module-level HAX_CREATE_REQUEST_TIMEOUT_SECONDS=120.0 is the
production default and a sanity test pins it against accidental drift.
Test coverage:
* 20× hung create_request → must raise RuntimeError within
[timeout, 4*timeout]. Pins both lower bound (no spurious early
failure) and upper bound (the actual production bug — unbounded
wait — would blow past 4*timeout).
* 20× responsive create_request → must return its response,
complete well under timeout, fire the 'submitted' app.note.
* 10× ConnectionError → must propagate unchanged (operators need
  to tell a hang apart from a network failure).
* configurable timeout_seconds kwarg is honoured (not silently
overridden by the module default).
* module default is 120s (drift guard).
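The hung-client bound check can be sketched as a self-contained snippet. Here a local `_bounded` stand-in plays the role of the production wrapper (the real tests drive `_create_hax_request_with_timeout` itself); `HANG`/`TIMEOUT` mirror the 1.5s/0.3s values used in the test file.

```python
import asyncio
import time

TIMEOUT = 0.3
HANG = 1.5  # outlives the timeout; the orphaned thread exits on its own

async def _bounded(fn, timeout_seconds):
    # Stand-in for the production wrapper: bounded off-loop call that
    # converts a timeout into a loud RuntimeError.
    try:
        return await asyncio.wait_for(asyncio.to_thread(fn), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise RuntimeError("create_request wedged") from None

async def _elapsed_until_error():
    start = time.monotonic()
    try:
        await _bounded(lambda: time.sleep(HANG), TIMEOUT)
        raise AssertionError("expected RuntimeError")
    except RuntimeError:
        return time.monotonic() - start

# Measured inside the coroutine, before loop shutdown joins the worker thread.
elapsed = asyncio.run(_elapsed_until_error())
# Lower bound: no spurious early failure. Upper bound: an unbounded wait
# (the production bug) would blow past 4x the timeout.
assert TIMEOUT <= elapsed <= 4 * TIMEOUT
```

Measuring elapsed time inside the coroutine matters: `asyncio.run()` joins the default executor's threads on shutdown, so timing around the `asyncio.run()` call itself would include the full hang.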
Test characteristics:
* Full file runs in ~32s locally (PYENV_VERSION=3.12.5).
* hang_seconds=1.5 in the fake client — long enough to outlive the
0.3s test timeout with margin, short enough that orphaned threads
(asyncio.to_thread doesn't kill the underlying thread on cancel)
don't pile up and starve the default executor pool.
* Fixture-scoped patch of app.note prevents real HTTP calls during
tests.
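The orphaned-thread caveat is easy to demonstrate: cancelling the awaitable does not stop the worker thread, which only gets joined when the loop's default executor shuts down. A minimal illustration (all names local to this snippet):

```python
import asyncio
import time

flag = {"finished": False}

def blocking():
    time.sleep(0.5)          # stands in for a wedged create_request
    flag["finished"] = True

async def main():
    try:
        await asyncio.wait_for(asyncio.to_thread(blocking), timeout=0.1)
    except asyncio.TimeoutError:
        pass
    # The await was abandoned, but the worker thread is still sleeping.
    return flag["finished"]

still_unfinished = asyncio.run(main())
assert still_unfinished is False
# asyncio.run() joined the default executor on shutdown, so by now the
# orphaned thread has run to completion anyway.
assert flag["finished"] is True
```

This is why a short hang_seconds matters in the fake client: every timed-out test iteration leaves a thread running until its sleep expires, and long sleeps would pile up in the shared executor pool.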
What this PR does NOT prove: that the actual production hang IS in
hax_client.create_request specifically. I still haven't reproduced the
production failure end-to-end (would require a stack dump from the
stuck swe-planner.build container OR a fake hax-sdk + full
control-plane setup). The fix is targeted at the highest-likelihood
candidate from timeline analysis. If the actual hang is elsewhere
(e.g. _format_plan_for_approval), this PR doesn't address it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom
run_1778512783034_f4985c96 — github-buddy.implement_from_issue tripped its 7200s active-time watchdog after 8011s of wallclock even though the agentfield#568/#569 pause-cascade fix was correctly subtracting paused time:

* First hax cycle (parent_pause_clock_id=139880527325184): start_pause→end_pause fired cleanly, 810.21s correctly subtracted
* Revision phases (run_architect→run_tech_lead→run_sprint_planner) all succeeded
* Then no app.pause(). swe-planner.build stayed listed as running and never flipped to waiting/waiting_for_approval.
* wall=8011s − paused=810s = active=7201s > budget=7200s. Math is exact. The fix from #568/#569 is doing its job correctly — the upstream hang shouldn't have been allowed to last 76 min in the first place.

Cause
Between run_sprint_planner completing and the next intended app.pause(), the SWE-AF code path runs:

1. _format_plan_for_approval(plan_result) — pure Python, very unlikely to hang 76 min
2. hax_client.create_request(**hax_create_kwargs) — synchronous HTTP call to hax-sdk, no client-side timeout
3. open(approval_state_path, "w") … _json.dump(…) — file IO, basically can't hang
4. await app.pause(...) — known-good (first cycle exercised it end-to-end)

(2) is the only external blocking call and the highest-likelihood culprit. A wedged hax-sdk would cause create_request to block indefinitely.

What this PR does NOT prove
I have not reproduced this locally (no stack trace from the stuck swe-planner.build process, no fake hax-sdk repro). The fix is targeted at the highest-likelihood candidate based on the timeline + code path. If the actual hang is elsewhere (e.g., a 76-min _format_plan_for_approval, which I don't believe but can't rule out from outside), this PR doesn't address it.

The cost of being narrowly scoped is low: the timeout wrap only fires if hax_client.create_request actually takes > 120s, and it would make the diagnostic signal clear (a fast, loud RuntimeError instead of an opaque 2-hour watchdog timeout up the call chain).

What this PR does
Wrap the synchronous hax_client.create_request(**hax_create_kwargs) in asyncio.wait_for(asyncio.to_thread(...), timeout=120), plus app.note() markers at entry, on success (with request_id), and on each error branch (TimeoutError distinguished from any other exception) so the run timeline shows exactly when the call started, succeeded, or got stuck.

120s gives hax-sdk room for cold-start; anything beyond that is almost certainly wedged.
Test plan

* python -m compileall -q swe_af/ — clean
* pytest tests/ — 917 passed, 1 skipped, 1 deselected (the pre-existing local-env tests.conftest._is_real_host collision with hax-sdk's editable install — same failure on main, unrelated to this PR)
* implement_from_issue run that reaches Phase 1.5 revision
* create_request either succeeds or surfaces a clear RuntimeError within 120s instead of hanging