fix(documents): preserve span-wrapped text in StreamingParser #830
The streaming HTML parser silently dropped text from <span>-wrapped paragraphs on filings that crossed streaming_threshold (default 10MB). For SEC filings in the ~30MB–110MB band, which routinely nest every word inside style-bearing <span> tags, filing.text() returned output 20%+ shorter than the non-streaming path with no exception or warning.

Two compounding bugs in the iterparse loop:

1. elem.clear() ran on every event (both start and end). At start events, lxml's HTML-mode lookahead has populated child elements and their text; clearing at start destroyed that data before any handler could read it.
2. elem.clear() ran on every element regardless of whether an enclosing structural element (<p>, <h1>-<h6>, <section>) had finished reading its children. iterparse fires end events depth-first, so a child <span>'s end event cleared its .text and .tail before the parent <p>'s end event called _get_text_content(p).

The pre-existing _table_depth gate already protected <table> from the same defect.

Fix: clear only on end events, and gate clearing on a new _content_depth counter that tracks open p/h1-h6/section elements (mirroring _table_depth).

Regression test exercises the SEC pattern of span-wrapped paragraph text under forced streaming mode.
dgunning
left a comment
Strong fix, Kevin — diagnosis is precise (separating the start-event clearing from the missing content-depth gate as two distinct bugs is sharp), the _content_depth pattern mirrors _table_depth cleanly, and the cover-page recovery on Stepstone is convincing evidence the span bug was real.
One thing I want to understand before merging: your production table shows streaming-mode output now exceeds non-streaming by 25–67%.
| Filing | Pre-fix streaming | Non-streaming | Post-fix streaming | Δ vs non-streaming |
|---|---|---|---|---|
| Stepstone 10-K | 1,140,129 | 1,429,000 | 1,816,872 | +27% |
| Stepstone 20-F | — | 1,610,414 | 2,007,281 | +25% |
| 20-F 0001104659… | — | 2,001,895 | 3,347,578 | +67% |
| 20-F 0001193125… | — | 1,779,974 | 2,350,296 | +32% |
The span bug explains why pre-fix streaming was below non-streaming. It doesn't explain why post-fix streaming is above it. Three possibilities I can think of:
- Pre-existing divergence between paths (different whitespace/tail handling) that was masked while streaming was losing content
- Non-streaming has its own separate content-loss bug — possibly span-related at a different scale
- Streaming is now over-including: sibling .tail accumulating twice, or buffer flush interacting with the deferred clear
Have you compared the actual content (not just length) between the two paths on one of these filings? The +67% on 0001104659… is large enough that I'd want to know whether streaming is now correct and non-streaming is buggy, or vice versa, or both paths have different (defensible) semantics.
The regression test asserts content presence in both paths but doesn't compare lengths or do a content diff — adding a length-comparison assertion (or a diff on a known fixture) would lock in whichever interpretation is correct.
Not blocking the fix to the span bug — that's clearly the right call regardless. Just want to understand the overshoot before declaring streaming "fixed."
Once paragraph-text recovery from 794529d was in place, paragraphs nested inside table cells emitted twice in the streaming path: once as a free-standing ParagraphNode (from _handle_start_tag/_end_tag calling _start_paragraph/_end_paragraph unconditionally), and once as TableNode cell text (from _end_table running processor.process over the full subtree). Same applied to <h*> and <section> inside <td>.

Pre-794529dd this was masked because <p> handlers produced empty nodes anyway. Post-fix it surfaced as 10-36% content overshoot vs non-streaming on table-heavy filings, visible as the same financial-statement labels appearing dozens of times in streaming output.

Cross-checked on the four PR-cited filings (Stepstone 10-K and 20-F, AVAL 20-F, IFS 20-F): without this gate, S/NS normalized content ratio runs 0.95-1.36; with the gate, 0.86-1.05.

Gate _handle_start_tag and _handle_end_tag on _table_depth == 0 for <p>/<h1-h6>/<section>, symmetrical to the existing _table_depth gate on elem.clear(). The table processor remains the single source of cell text.

Regression test covers a 2x2 table with <p> in every cell and asserts each cell appears exactly once in both streaming and non-streaming output.
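The double-emit and the `_table_depth == 0` gate described above can be illustrated with a minimal sketch. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than the lxml-based StreamingParser, and the function name `emit_text` and the `gate_on_table` flag are hypothetical, so it shows the idea rather than the project's actual handlers.

```python
import io
import xml.etree.ElementTree as ET

# The 2x2 reproducer: every cell wraps its text in a <p>.
HTML = b"""<html><body>
<table>
<tr><td><p>Cell paragraph one</p></td><td><p>Cell paragraph two</p></td></tr>
<tr><td><p>Row two A</p></td><td><p>Row two B</p></td></tr>
</table>
</body></html>"""


def emit_text(data: bytes, gate_on_table: bool) -> list[str]:
    """Collect emitted text; with the gate, <p> inside a table stays silent."""
    out = []
    table_depth = 0
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if event == "start":
            if elem.tag == "table":
                table_depth += 1
            continue
        if elem.tag == "p":
            # Paragraph handler: emits unless gated off inside a table.
            if table_depth == 0 or not gate_on_table:
                out.append("".join(elem.itertext()).strip())
        elif elem.tag == "table":
            table_depth -= 1
            if table_depth == 0:
                # Table handler: emits every cell of the finished subtree.
                for cell in elem.iter("td"):
                    out.append("".join(cell.itertext()).strip())
    return out


ungated = emit_text(HTML, gate_on_table=False)
gated = emit_text(HTML, gate_on_table=True)
```

In this sketch the ungated run emits each cell once from the paragraph handler and once from the table handler, while the gated run leaves the table handler as the only emitter of cell text.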
Thanks for pushing back on this; the overshoot concern is well-founded. I re-examined the PR against current

The fix targets a real and severe bug

Pre-fix streaming on current

Pre-fix streaming was missing 25-66% of the content non-streaming captures. The regression test reproduces the failure mode on the span fixture; on production filings the gap is much larger. The diagnosis (deferred

The +27% overshoot was real, and you correctly suspected over-inclusion

With just the span fix:

3 of 4 filings overshoot NS by 10-36% in content (not just whitespace), exactly your concern.

Root cause: table cell paragraphs double-emit

Minimal reproducer:

html = """<html><body>
<table>
<tr><td><p>Cell paragraph one</p></td><td><p>Cell paragraph two</p></td></tr>
<tr><td><p>Row two A</p></td><td><p>Row two B</p></td></tr>
</table>
</body></html>"""

Streaming, span-fix only:

Non-streaming:

Each cell's text appears twice in streaming, once as a free-standing

Fix pushed as
| Filing | NS content | Post + gate | S/NS |
|---|---|---|---|
| Stepstone 10-K | 1,096,791 | 1,025,627 | 0.94 |
| Stepstone 20-F | 1,418,465 | 1,316,023 | 0.93 |
| AVAL 20-F | 1,629,903 | 1,707,399 | 1.05 |
| IFS 20-F | 1,505,955 | 1,291,519 | 0.86 |
All four within 0.86-1.05× of non-streaming — overshoot eliminated, no more table cell duplication. 162 tests pass across tests/test_html_parser*.py, tests/test_documents.py, etc.
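A length comparison like the S/NS column above could be locked in with a small helper. This is only a sketch: the thread does not say how the "normalized content" was computed, so collapsing whitespace is an assumption, and the names `normalize`, `content_ratio`, and `within_band`, along with the default band, are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse every whitespace run to one space (assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip()


def content_ratio(streaming_text: str, non_streaming_text: str) -> float:
    """Return normalized streaming length divided by non-streaming length."""
    baseline = normalize(non_streaming_text)
    if not baseline:
        raise ValueError("non-streaming baseline is empty")
    return len(normalize(streaming_text)) / len(baseline)


def within_band(streaming_text: str, non_streaming_text: str,
                low: float = 0.85, high: float = 1.10) -> bool:
    """Check the ratio against a tolerance band; the defaults are illustrative."""
    return low <= content_ratio(streaming_text, non_streaming_text) <= high
```

A regression test could call `within_band` on a known fixture so that both under-coverage and overshoot fail loudly instead of passing a presence-only check.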
Residual 5-14% under-coverage is pre-existing
The lines now missing on IFS are mostly body-text paragraphs nested inside multiple <div> wrappers (e.g., "You should inquire for yourself whether you are entitled..." which sits inside two nested <div> containers under an <h5> Table-of-Contents link). These aren't inside tables, and they weren't recovered pre-span-fix either — a separate streaming-path limitation around heavily-nested body content, out of scope here.
One PR description nit
The original description says the span fix recovers "UNITED STATES SECURITIES AND EXCHANGE COMMISSION" from the cover page. On current main that phrase is missing from streaming both pre- and post-fix — something in the 9 commits between the PR base and main moved on cover-page handling, and the claim no longer reproduces. Worth dropping that line from the description before merge.
Take a look at e28051fa and let me know if you'd like changes.
dgunning
left a comment
Verified Kevin's follow-up fix end-to-end:
- Stepstone 10-K reproduces exactly: NS=1,096,791 / S=1,023,748 / ratio 0.93 (Kevin reported 0.94).
- Duplication probes on real cells all near 1.00 (Total 62→59, Interest income 5→5, Risk Factors 7→7, Cash and cash equivalents 1→1) — the +27% overshoot is gone.
- Cover-page text is recovered in streaming output (UNITED STATES / SECURITIES AND EXCHANGE COMMISSION / WASHINGTON, DC 20549 / FORM 10-K).
- The new test_streaming_does_not_double_emit_table_cell_paragraphs uses text.count(cell) == 1: an exact-count assertion across both streaming and non-streaming baselines, which is the lock-in I asked for in my earlier review.
- Code review: the _table_depth gate in _handle_start_tag/_handle_end_tag is symmetrical to the existing clearing gate; the table processor's _extract_text already pulls cell content via itertext(), so the structural handlers inside tables were genuinely redundant emitters.
- 95 HTML-parser tests pass; 78 pass across the broader test_html_parser* + test_documents sweep.
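The duplication probes mentioned in the list above amount to counting how often a label occurs in each path's output. A minimal sketch, assuming plain substring counts and a (non-streaming, streaming) ordering of the pair; the helper names `probe_counts` and `over_emitted` are hypothetical:

```python
def probe_counts(non_streaming_text: str, streaming_text: str,
                 labels: list[str]) -> dict[str, tuple[int, int]]:
    """Map each label to (non-streaming count, streaming count)."""
    return {
        label: (non_streaming_text.count(label), streaming_text.count(label))
        for label in labels
    }


def over_emitted(counts: dict[str, tuple[int, int]]) -> list[str]:
    """Labels that occur more often in streaming than in non-streaming output."""
    return [label for label, (ns, s) in counts.items() if s > ns]


# Toy inputs standing in for two filing.text() results.
ns_text = "Total assets\nInterest income\nTotal liabilities"
dup_text = ns_text + "\nTotal assets"
counts = probe_counts(ns_text, dup_text, ["Total", "Interest income"])
```

Because `str.count` matches substrings, a short label such as "Total" also counts inside longer labels, so the per-label ratio between paths is more informative than the absolute numbers.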
Nice diagnosis and minimal fix. Merging.
Added:
- xbrl.calculation_linkbase(): per-filing calculation linkbase as a pandas DataFrame, one row per parent->child arc (GH #766 Phase 1)
- Statement.extension_arcs(): surfaces filer-authored concepts that participate in a statement's calc linkbase but are absent from its presentation tree (GH #766 Phase 2)
- Section.markdown(): structure-preserving per-section markdown for per-item chunkers / RAG pipelines (PR #833, @HonzaCuhel)

Fixed:
- StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs on filings crossing the 10MB streaming threshold (PR #830, @kevinchiu)
- HTTP_MGR had no default timeout, so stalled requests could pin workers indefinitely (PR #831, @kevinchiu)
- 13F-HR holdings merged Put/Call positions into the underlying equity row, losing the PutCall column (GH #824)
- import edgar emitted DeprecationWarning on every startup, breaking downstream test suites running under -W error (PR #832, @kevinchiu)
- Filing.search() / Filing.grep() returned nothing on pre-2002 plain-text filings (GH #819)
- TOC analyzer fabricated phantom Items on 10-Q filings via three 10-K-shaped heuristics that fired regardless of form (PR #827, @HonzaCuhel)
- SearchResults panel labels conflated BM25 rank with section index (GH #765)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom
filing.text() silently returns less text than expected for filings that cross ParserConfig.streaming_threshold (default 10MB), with no exception and no warning.
Concrete measurement, Stepstone 10-K (0001193125-26-128890, 42.7 MB raw HTML):

The streaming path was dropping the entire cover-page block, including "UNITED STATES SECURITIES AND EXCHANGE COMMISSION / FORM 10-K / For the fiscal year ended…", because every line of that block is nested inside style-bearing <span> tags.

A minimal SEC-style snippet (each word wrapped in <span style="…">) reproduces the same failure mode without network: streaming drops every <p> entirely and keeps only <h*> text.

Root cause
Two compounding bugs in the iterparse loop in edgar/documents/utils/streaming.py::StreamingParser.parse:

1. elem.clear() ran on every event (both start and end). At start events, lxml's HTML-mode lookahead has already populated child elements and their .text/.tail; structural handlers such as _start_heading read those at start time. Clearing on start destroyed that data before any handler could read it.
2. No content-depth gate around child clearing. iterparse fires end events depth-first, so a child <span>'s end event ran elem.clear() (which wipes .text and .tail in lxml) before the enclosing <p>'s end event called _get_text_content(p). Since SEC filings nest essentially every word inside <span style="…">, _end_paragraph saw only empty children and produced empty paragraph text. The pre-existing _table_depth gate already protected <table> from the identical defect; this just extends the same idea to the other structural containers.
Fix
Clear only on end events, and gate clearing on a new _content_depth counter that tracks open <p>/<h1>–<h6>/<section> elements (mirroring _table_depth). Defers child cleanup until the enclosing structural element has read its subtree.
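The effect of deferring the clear can be seen in a small standalone sketch. It substitutes the standard library's `xml.etree.ElementTree.iterparse` for lxml's HTML-mode iterparse, and the names `extract_text`, `gate_clearing`, and `CONTENT_TAGS` are hypothetical, so it illustrates the depth-gate idea rather than the code in streaming.py.

```python
import io
import xml.etree.ElementTree as ET

# Structural containers whose text is read on their own "end" event.
CONTENT_TAGS = {"p", "section", "h1", "h2", "h3", "h4", "h5", "h6"}

# SEC-style markup: every word sits inside its own <span>.
SAMPLE = (
    b"<html><body>"
    b"<h1><span>Annual</span> <span>Report</span></h1>"
    b"<p><span>Hello</span> <span>world</span></p>"
    b"</body></html>"
)


def extract_text(data: bytes, gate_clearing: bool) -> list[str]:
    """Collect text of structural elements, clearing eagerly or behind a gate."""
    texts = []
    content_depth = 0
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if event == "start":
            if elem.tag in CONTENT_TAGS:
                content_depth += 1
            continue  # never clear on "start"
        if elem.tag in CONTENT_TAGS:
            # The container reads its whole subtree before anything is freed.
            texts.append("".join(elem.itertext()).strip())
            content_depth -= 1
        # Eager mode clears every element; gated mode waits for depth zero.
        if not gate_clearing or content_depth == 0:
            elem.clear()
    return texts


eager = extract_text(SAMPLE, gate_clearing=False)
gated = extract_text(SAMPLE, gate_clearing=True)
```

With eager clearing, each `<span>` is wiped on its own end event, so the enclosing `<h1>` and `<p>` read empty text; with the gate, the spans survive until their container has been read and memory is still released once the container closes.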
Regression test
tests/test_html_parser_regressions.py::TestStreamingParserRegressions::test_streaming_preserves_span_wrapped_paragraph_text uses a forced-streaming ParserConfig(streaming_threshold=1) against SEC-style span-wrapped HTML, asserts that all paragraph and heading content survives, and cross-checks against the non-streaming baseline.

Fails on main; passes with this change.

Verification
uv run pytest tests/test_html_parser*.py: 68 passed, 3 skipped.

End-to-end check on each of the four problem filings reported in production. Streaming-path filing.text() length after the fix, with the non-streaming baseline alongside for reference:

- 0001193125-26-128890
- 0001193125-26-177617
- 0001104659-26-044493
- 0001193125-26-183398

All four return non-empty text on the streaming path, and the streaming output begins with the expected SEC cover-page text on the Stepstone 10-K (previously truncated to body-only content).