fix(httpclient): set a default HTTP timeout on HTTP_MGR#831
Conversation
The internal httpx client was created without a timeout, so a stalled upstream or slow TLS handshake could block a worker indefinitely on a socket read syscall — uninterruptible by Python's signal handlers. Downstream users observed processes running 50+ minutes past their job budget because of a single read that never returned. Set a non-None default Timeout(30.0, connect=10.0) when HTTP_MGR is constructed in get_http_mgr(). Add EDGAR_HTTP_TIMEOUT env var (seconds) for static configuration; the existing configure_http(timeout=...) still works for runtime changes (and its docstring's documented 'default: 30.0' is now actually true). Callers that want unbounded waits can opt out by passing timeout=None via configure_http() or supplying an explicit timeout to the request.
dgunning
left a comment
There was a problem hiding this comment.
Strong fix Kevin — the 50-minute-blocked-read evidence is the kind of production data that makes a clear case. The diagnosis and env-var pattern look right, CI's green, and the test asserting every phase is non-None directly targets the bug condition.
One substantive issue before merging: your PR description says callers can opt out via configure_http(timeout=None), but configure_http at line 280 skips the assignment when timeout is None:
if timeout is not None:
HTTP_MGR.httpx_params["timeout"] = Timeout(timeout, connect=10.0)So passing None is a no-op — it preserves whatever's already set (after this PR: the new 30s default). After this lands, there's actually no documented way to opt out without poking HTTP_MGR.httpx_params directly.
Could you add an explicit escape hatch? My preference is a dedicated function rather than overloading timeout=None:
def disable_http_timeout() -> None:
"""Remove the default HTTP timeout — requests will wait indefinitely.
Use only when blocking reads on stalled upstreams is acceptable
(e.g., long-running batch jobs with external supervision).
"""
HTTP_MGR.httpx_params.pop("timeout", None)And for the env-var path, let an empty string or none mean unlimited:
def get_edgar_http_timeout() -> Optional[float]:
raw = os.environ.get("EDGAR_HTTP_TIMEOUT", "30.0")
if raw == "" or raw.lower() in ("none", "unlimited"):
return None
return float(raw)Then in get_http_mgr():
t = get_edgar_http_timeout()
if t is not None:
http_mgr.httpx_params["timeout"] = httpx.Timeout(t, connect=10.0)Three reasons I prefer a dedicated function over configure_http(timeout=None):
- Consistency —
configure_http's other three params (verify_ssl,use_system_certs,proxy) all useNoneto mean "don't change." Breaking that pattern for justtimeoutwould be a code-review footgun. - Intent-revealing —
disable_http_timeout()is impossible to misread.configure_http(timeout=None)always raises the "is that 'no timeout' or 'no change'?" question. - Safety asymmetry — disabling timeouts is exactly the dangerous thing your PR exists to prevent. Making it require an ugly explicit call rather than a natural-looking parameter value is the right friction.
Worth adding two tests: disable_http_timeout() actually removes the key, and EDGAR_HTTP_TIMEOUT= (or =none) round-trips to no timeout being set.
One other minor: EDGAR_HTTP_TIMEOUT=0 currently produces httpx.Timeout(0.0, connect=10.0) which httpx treats as "immediate timeout" — probably worth guarding against <= 0 or having it route through the same "unlimited" path so users don't shoot themselves in the foot.
…UT=0 Address review feedback on the default-timeout PR. - get_edgar_http_timeout() now returns Optional[float]. Empty, 'none', 'unlimited', and any value <= 0 route through the 'no timeout' path. Critically, EDGAR_HTTP_TIMEOUT=0 used to produce httpx.Timeout(0.0) which httpx treats as immediate-timeout on every I/O phase. - get_http_mgr() only sets httpx_params['timeout'] when the resolved value is not None, so the unlimited path leaves the key absent. - configure_http(timeout=None) remains a no-op (matches the 'leave unchanged' semantics shared by verify_ssl/use_system_certs/proxy); disable_http_timeout() is the dedicated escape hatch. - Docstring on configure_http points users at disable_http_timeout() instead of leaving the opt-out path undocumented.
|
Thanks for the detailed review — all three concerns addressed in 1. Dedicated 2. Env-var opt-out. 3. Tests added:
All 13 timeout-related tests pass locally ( |
dgunning
left a comment
There was a problem hiding this comment.
Thanks Kevin — all three concerns cleanly addressed in 0b50574. The dedicated disable_http_timeout() keeps configure_http's "None = leave unchanged" semantics consistent across all four params, the env-var opt-out covers the obvious aliases, and the <= 0 guard closes the EDGAR_HTTP_TIMEOUT=0 footgun. Parametrized test coverage on the opt-out path is a nice touch. CI green across the board.
Added: - xbrl.calculation_linkbase() — per-filing calculation linkbase as a pandas DataFrame, one row per parent->child arc (GH #766 Phase 1) - Statement.extension_arcs() — surfaces filer-authored concepts that participate in a statement's calc linkbase but are absent from its presentation tree (GH #766 Phase 2) - Section.markdown() — structure-preserving per-section markdown for per-item chunkers / RAG pipelines (PR #833, @HonzaCuhel) Fixed: - StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs on filings crossing the 10MB streaming threshold (PR #830, @kevinchiu) - HTTP_MGR had no default timeout — stalled requests could pin workers indefinitely (PR #831, @kevinchiu) - 13F-HR holdings merged Put/Call positions into the underlying equity row, losing the PutCall column (GH #824) - import edgar emitted DeprecationWarning on every startup, breaking downstream test suites running under -W error (PR #832, @kevinchiu) - Filing.search() / Filing.grep() returned nothing on pre-2002 plain-text filings (GH #819) - TOC analyzer fabricated phantom Items on 10-Q filings via three 10-K-shaped heuristics that fired regardless of form (PR #827, @HonzaCuhel) - SearchResults panel labels conflated BM25 rank with section index (GH #765) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom
The internal
httpx.Clientis constructed without a defaulttimeout,so a stalled upstream or slow TLS handshake can leave a worker blocked
forever on a socket read. The read syscall is not interruptible by
Python's signal handlers, so the process sits wedged past any external
job budget — observed in production with jobs running 50+ minutes past
their 180s budget because of a single
get_current_filings()read thatnever returned.
The existing
configure_http(timeout=…)docstring already claimsdefault: 30.0, but no default is actually applied at construction.Root cause
edgar/httpclient.py::get_http_mgrbuildsHTTP_MGR.httpx_paramswithSSL/limits/etc. populated, but never sets
timeout. httpx then defaultsto
httpx._config.DEFAULT_TIMEOUT_CONFIGwhich sets a 5s timeout oneach phase only when no timeout argument is passed at all, and most
internal call sites pass
timeout=Noneexplicitly, which httpxinterprets as "no timeout, ever."
Fix
Add
get_edgar_http_timeout()readingEDGAR_HTTP_TIMEOUT(seconds,default 30.0), and apply
httpx.Timeout(get_edgar_http_timeout(), connect=10.0)inget_http_mgr(). Callers that want unbounded waitscan still opt out via
configure_http(timeout=None)or by passing anexplicit timeout to a request.
Tests
tests/test_httprequests.py::test_default_http_timeout_is_set—asserts that
HTTP_MGR.httpx_params["timeout"]is set, is anhttpx.Timeout, and that every phase (connect/read/write/pool) isnon-
None.tests/test_httprequests.py::test_default_http_timeout_env_override—asserts that
EDGAR_HTTP_TIMEOUT=5overrides the read timeout.Both fail on
main; both pass with this change. No regressions vsmainon the rest oftest_httprequests.py(the same 11 pre-existingidentity/network-dependent failures appear on both sides).