Skip to content

fix(httpclient): set a default HTTP timeout on HTTP_MGR#831

Merged
dgunning merged 2 commits into
dgunning:mainfrom
kevinchiu:fix/http-default-timeout
May 27, 2026
Merged

fix(httpclient): set a default HTTP timeout on HTTP_MGR#831
dgunning merged 2 commits into
dgunning:mainfrom
kevinchiu:fix/http-default-timeout

Conversation

@kevinchiu
Copy link
Copy Markdown
Contributor

Symptom

The internal httpx.Client is constructed without a default timeout,
so a stalled upstream or slow TLS handshake can leave a worker blocked
forever on a socket read. The read syscall is not interruptible by
Python's signal handlers, so the process sits wedged past any external
job budget — observed in production with jobs running 50+ minutes past
their 180s budget because of a single get_current_filings() read that
never returned.

The existing configure_http(timeout=…) docstring already claims
default: 30.0, but no default is actually applied at construction.

Root cause

edgar/httpclient.py::get_http_mgr builds HTTP_MGR.httpx_params with
SSL/limits/etc. populated, but never sets timeout. httpx then defaults
to httpx._config.DEFAULT_TIMEOUT_CONFIG which sets a 5s timeout on
each phase only when no timeout argument is passed at all, and most
internal call sites pass timeout=None explicitly, which httpx
interprets as "no timeout, ever."

Fix

Add get_edgar_http_timeout() reading EDGAR_HTTP_TIMEOUT (seconds,
default 30.0), and apply httpx.Timeout(get_edgar_http_timeout(), connect=10.0) in get_http_mgr(). Callers that want unbounded waits
can still opt out via configure_http(timeout=None) or by passing an
explicit timeout to a request.

Tests

tests/test_httprequests.py::test_default_http_timeout_is_set
asserts that HTTP_MGR.httpx_params["timeout"] is set, is an
httpx.Timeout, and that every phase (connect/read/write/pool) is
non-None.

tests/test_httprequests.py::test_default_http_timeout_env_override
asserts that EDGAR_HTTP_TIMEOUT=5 overrides the read timeout.

Both fail on main; both pass with this change. No regressions vs
main on the rest of test_httprequests.py (the same 11 pre-existing
identity/network-dependent failures appear on both sides).

The internal httpx client was created without a timeout, so a stalled
upstream or slow TLS handshake could block a worker indefinitely on a
socket read syscall — uninterruptible by Python's signal handlers.
Downstream users observed processes running 50+ minutes past their
job budget because of a single read that never returned.

Set a non-None default Timeout(30.0, connect=10.0) when HTTP_MGR is
constructed in get_http_mgr(). Add EDGAR_HTTP_TIMEOUT env var (seconds)
for static configuration; the existing configure_http(timeout=...)
still works for runtime changes (and its docstring's documented
'default: 30.0' is now actually true).

Callers that want unbounded waits can opt out by passing timeout=None
via configure_http() or supplying an explicit timeout to the request.
Copy link
Copy Markdown
Owner

@dgunning dgunning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong fix Kevin — the 50-minute-blocked-read evidence is the kind of production data that makes a clear case. The diagnosis and env-var pattern look right, CI's green, and the test asserting every phase is non-None directly targets the bug condition.

One substantive issue before merging: your PR description says callers can opt out via configure_http(timeout=None), but configure_http at line 280 skips the assignment when timeout is None:

if timeout is not None:
    HTTP_MGR.httpx_params["timeout"] = Timeout(timeout, connect=10.0)

So passing None is a no-op — it preserves whatever's already set (after this PR: the new 30s default). After this lands, there's actually no documented way to opt out without poking HTTP_MGR.httpx_params directly.

Could you add an explicit escape hatch? My preference is a dedicated function rather than overloading timeout=None:

def disable_http_timeout() -> None:
    """Remove the default HTTP timeout — requests will wait indefinitely.

    Use only when blocking reads on stalled upstreams is acceptable
    (e.g., long-running batch jobs with external supervision).
    """
    HTTP_MGR.httpx_params.pop("timeout", None)

And for the env-var path, let an empty string or none mean unlimited:

def get_edgar_http_timeout() -> Optional[float]:
    raw = os.environ.get("EDGAR_HTTP_TIMEOUT", "30.0")
    if raw == "" or raw.lower() in ("none", "unlimited"):
        return None
    return float(raw)

Then in get_http_mgr():

t = get_edgar_http_timeout()
if t is not None:
    http_mgr.httpx_params["timeout"] = httpx.Timeout(t, connect=10.0)

Three reasons I prefer a dedicated function over configure_http(timeout=None):

  1. Consistencyconfigure_http's other three params (verify_ssl, use_system_certs, proxy) all use None to mean "don't change." Breaking that pattern for just timeout would be a code-review footgun.
  2. Intent-revealingdisable_http_timeout() is impossible to misread. configure_http(timeout=None) always raises the "is that 'no timeout' or 'no change'?" question.
  3. Safety asymmetry — disabling timeouts is exactly the dangerous thing your PR exists to prevent. Making it require an ugly explicit call rather than a natural-looking parameter value is the right friction.

Worth adding two tests: disable_http_timeout() actually removes the key, and EDGAR_HTTP_TIMEOUT= (or =none) round-trips to no timeout being set.

One other minor: EDGAR_HTTP_TIMEOUT=0 currently produces httpx.Timeout(0.0, connect=10.0) which httpx treats as "immediate timeout" — probably worth guarding against <= 0 or having it route through the same "unlimited" path so users don't shoot themselves in the foot.

…UT=0

Address review feedback on the default-timeout PR.

- get_edgar_http_timeout() now returns Optional[float]. Empty, 'none',
  'unlimited', and any value <= 0 route through the 'no timeout' path.
  Critically, EDGAR_HTTP_TIMEOUT=0 used to produce httpx.Timeout(0.0)
  which httpx treats as immediate-timeout on every I/O phase.

- get_http_mgr() only sets httpx_params['timeout'] when the resolved
  value is not None, so the unlimited path leaves the key absent.

- configure_http(timeout=None) remains a no-op (matches the 'leave
  unchanged' semantics shared by verify_ssl/use_system_certs/proxy);
  disable_http_timeout() is the dedicated escape hatch.

- Docstring on configure_http points users at disable_http_timeout()
  instead of leaving the opt-out path undocumented.
@kevinchiu
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review — all three concerns addressed in 0b505745:

1. Dedicated disable_http_timeout() escape hatch. configure_http(timeout=None) stays a no-op (consistency with verify_ssl / use_system_certs / proxyNone always means "leave unchanged"). The docstring now points users at disable_http_timeout(), which pops timeout from httpx_params and recycles the client.

2. Env-var opt-out. get_edgar_http_timeout() now returns Optional[float]. Empty string, "none", "unlimited" (case-insensitive) all return None. get_http_mgr() skips the assignment when the resolved value is None, so httpx_params never has a timeout key in the unlimited path.

3. EDGAR_HTTP_TIMEOUT=0 footgun closed. Anything <= 0 routes through the same unlimited path. A user setting it to 0 (or -1) gets "no timeout" rather than httpx.Timeout(0.0) — which, as you noted, httpx treats as immediate-failure on every I/O phase.

Tests added:

  • test_default_http_timeout_env_opt_out — parametrized over ["", "none", "None", "NONE", "unlimited", "UNLIMITED", "0", "0.0", "-1"]; asserts get_edgar_http_timeout() is None and that the resulting HTTP_MGR has no timeout key.
  • test_disable_http_timeout_removes_timeout — set a timeout, call disable_http_timeout(), verify the key is gone, then verify configure_http(timeout=None) does not reinstate it.

All 13 timeout-related tests pass locally (pytest -k "timeout or disable_http").

Copy link
Copy Markdown
Owner

@dgunning dgunning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kevin — all three concerns cleanly addressed in 0b50574. The dedicated disable_http_timeout() keeps configure_http's "None = leave unchanged" semantics consistent across all four params, the env-var opt-out covers the obvious aliases, and the <= 0 guard closes the EDGAR_HTTP_TIMEOUT=0 footgun. Parametrized test coverage on the opt-out path is a nice touch. CI green across the board.

@dgunning dgunning merged commit 567b083 into dgunning:main May 27, 2026
6 checks passed
dgunning added a commit that referenced this pull request May 28, 2026
Added:
- xbrl.calculation_linkbase() — per-filing calculation linkbase as a
  pandas DataFrame, one row per parent->child arc (GH #766 Phase 1)
- Statement.extension_arcs() — surfaces filer-authored concepts that
  participate in a statement's calc linkbase but are absent from its
  presentation tree (GH #766 Phase 2)
- Section.markdown() — structure-preserving per-section markdown for
  per-item chunkers / RAG pipelines (PR #833, @HonzaCuhel)

Fixed:
- StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs
  on filings crossing the 10MB streaming threshold (PR #830, @kevinchiu)
- HTTP_MGR had no default timeout — stalled requests could pin
  workers indefinitely (PR #831, @kevinchiu)
- 13F-HR holdings merged Put/Call positions into the underlying equity
  row, losing the PutCall column (GH #824)
- import edgar emitted DeprecationWarning on every startup, breaking
  downstream test suites running under -W error (PR #832, @kevinchiu)
- Filing.search() / Filing.grep() returned nothing on pre-2002
  plain-text filings (GH #819)
- TOC analyzer fabricated phantom Items on 10-Q filings via three
  10-K-shaped heuristics that fired regardless of form (PR #827,
  @HonzaCuhel)
- SearchResults panel labels conflated BM25 rank with section index
  (GH #765)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants