fix(documents): preserve span-wrapped text in StreamingParser by kevinchiu · Pull Request #830 · dgunning/edgartools

kevinchiu · 2026-05-23T02:50:27Z

Symptom

filing.text() silently returns less text than expected for filings
that cross ParserConfig.streaming_threshold (default 10MB), with no
exception and no warning.

Concrete measurement, Stepstone 10-K (0001193125-26-128890, 42.7 MB raw HTML):

| Path | Text chars |
| --- | --- |
| streaming (before this fix) | 1,140,129 |
| non-streaming baseline | 1,429,000 |
| streaming (after this fix) | 1,816,872 |

The streaming path was dropping the entire cover-page block — including
"UNITED STATES SECURITIES AND EXCHANGE COMMISSION / FORM 10-K /
For the fiscal year ended…" — because every line of that block is
nested inside style-bearing <span> tags.

A minimal SEC-style snippet (each word wrapped in <span style="…">)
reproduces the same failure mode without network: streaming drops every
<p> entirely and keeps only <h*> text.

Root cause

Two compounding bugs in the iterparse loop in
edgar/documents/utils/streaming.py::StreamingParser.parse:

1. elem.clear() ran on every event (both start and end). At
   start events, lxml's HTML-mode lookahead has already populated child
   elements and their .text/.tail; structural handlers such as
   _start_heading read those at start time. Clearing on start
   destroyed that data before any handler could read it.
2. No content-depth gate around child clearing. iterparse fires
   end events depth-first, so a child <span>'s end event ran
   elem.clear() (which wipes .text and .tail in lxml) before the
   enclosing <p>'s end event called _get_text_content(p). Since
   SEC filings nest essentially every word inside <span style="…">,
   _end_paragraph saw only empty children and produced empty paragraph
   text. The pre-existing _table_depth gate already protected
   <table> from the identical defect; this just extends the same
   idea to the other structural containers.
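
The ordering problem can be reproduced in isolation. The sketch below is illustrative only: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption; both fire end events children-first and both reset .text and .tail in clear()), and end_order_and_text is a hypothetical helper, not code from StreamingParser.

```python
import io
import xml.etree.ElementTree as ET

# Well-formed stand-in for an SEC-style paragraph: every word sits in a <span>.
DOC = b"<p><span>FORM</span> <span>10-K</span></p>"


def end_order_and_text(clear_children: bool):
    """Return (order of end events, text visible when </p> is reached)."""
    order = []
    paragraph_text = None
    for _event, elem in ET.iterparse(io.BytesIO(DOC), events=("end",)):
        order.append(elem.tag)
        if elem.tag == "p":
            # The parent reads its subtree only now, after both spans ended.
            paragraph_text = "".join(elem.itertext())
        elif clear_children:
            # Mimics the buggy loop: the child is wiped before the parent reads it.
            elem.clear()
    return order, paragraph_text


print(end_order_and_text(clear_children=True))   # (['span', 'span', 'p'], '')
print(end_order_and_text(clear_children=False))  # (['span', 'span', 'p'], 'FORM 10-K')
```

Both runs see the same event order; only the eager clear() turns the paragraph text into an empty string.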

Fix

Clear only on end events, and gate clearing on a new _content_depth
counter that tracks open <p> / <h1>–<h6> / <section> elements
(mirroring _table_depth). Defers child cleanup until the enclosing
structural element has read its subtree.
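
A minimal sketch of the gating idea, under stated assumptions: it again uses the standard library's ElementTree rather than lxml, stream_text and CONTENT_TAGS are hypothetical names, and emitting text only for the outermost structural element is a simplification of this sketch, not a claim about how StreamingParser builds nodes.

```python
import io
import xml.etree.ElementTree as ET

# Structural containers whose subtree must survive until their own end event.
CONTENT_TAGS = {"p", "section"} | {f"h{i}" for i in range(1, 7)}


def stream_text(data: bytes) -> list[str]:
    """Collect text blocks while clearing elements only when it is safe."""
    content_depth = 0
    blocks = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if event == "start":
            if elem.tag in CONTENT_TAGS:
                content_depth += 1
            continue  # never clear on start events
        if elem.tag in CONTENT_TAGS:
            content_depth -= 1
            if content_depth == 0:
                # Outermost container closed: its children are still intact.
                blocks.append("".join(elem.itertext()).strip())
        if content_depth == 0:
            # No open container depends on this element any more.
            elem.clear()
    return blocks


doc = (
    b"<body><h1><span>FORM</span> <span>10-K</span></h1>"
    b"<p><span>For the fiscal</span> <span>year ended</span></p></body>"
)
print(stream_text(doc))  # ['FORM 10-K', 'For the fiscal year ended']
```

Memory is still released, just later: each container is cleared right after it has been read, which also drops its children.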

Regression test

tests/test_html_parser_regressions.py::TestStreamingParserRegressions::test_streaming_preserves_span_wrapped_paragraph_text —
uses a forced-streaming ParserConfig(streaming_threshold=1) against
SEC-style span-wrapped HTML, asserts that all paragraph and heading
content survives, and cross-checks against the non-streaming baseline.
Fails on main; passes with this change.

Verification

uv run pytest tests/test_html_parser*.py — 68 passed, 3 skipped.

End-to-end check on each of the four problem filings reported in
production. Streaming-path filing.text() length after the fix, with
the non-streaming baseline alongside for reference:

| Filing | Raw HTML | Streaming after fix | Non-streaming |
| --- | --- | --- | --- |
| Stepstone 10-K `0001193125-26-128890` | 42.7 MB | 1,816,872 | 1,429,000 |
| Stepstone 20-F `0001193125-26-177617` | 35.8 MB | 2,007,281 | 1,610,414 |
| 20-F `0001104659-26-044493` | 39.5 MB | 3,347,578 | 2,001,895 |
| 20-F `0001193125-26-183398` | 31.2 MB | 2,350,296 | 1,779,974 |

All four return non-empty text on the streaming path, and the streaming
output begins with the expected SEC cover-page text on the Stepstone
10-K (previously truncated to body-only content).

The streaming HTML parser silently dropped text from <span>-wrapped paragraphs on filings that crossed streaming_threshold (default 10MB). For SEC filings in the ~30MB–110MB band, which routinely nest every word inside style-bearing <span> tags, filing.text() returned output 20%+ shorter than the non-streaming path with no exception or warning.

Two compounding bugs in the iterparse loop:

1. elem.clear() ran on every event (both start and end). At start events, lxml's HTML-mode lookahead has populated child elements and their text; clearing at start destroyed that data before any handler could read it.
2. elem.clear() ran on every element regardless of whether an enclosing structural element (<p>, <h1>-<h6>, <section>) had finished reading its children. iterparse fires end events depth-first, so a child <span>'s end event cleared its .text and .tail before the parent <p>'s end event called _get_text_content(p). The pre-existing _table_depth gate already protected <table> from the same defect.

Fix: clear only on end events, and gate clearing on a new _content_depth counter that tracks open p/h1-h6/section elements (mirroring _table_depth).

Regression test exercises the SEC pattern of span-wrapped paragraph text under forced streaming mode.

dgunning

Strong fix, Kevin — diagnosis is precise (separating the start-event clearing from the missing content-depth gate as two distinct bugs is sharp), the _content_depth pattern mirrors _table_depth cleanly, and the cover-page recovery on Stepstone is convincing evidence the span bug was real.

One thing I want to understand before merging: your production table shows streaming-mode output now exceeds non-streaming by 25–67%.

| Filing | Pre-fix streaming | Non-streaming | Post-fix streaming | Δ vs non-streaming |
| --- | --- | --- | --- | --- |
| Stepstone 10-K | 1,140,129 | 1,429,000 | 1,816,872 | +27% |
| Stepstone 20-F | n/a | 1,610,414 | 2,007,281 | +25% |
| 20-F `0001104659…` | n/a | 2,001,895 | 3,347,578 | +67% |
| 20-F `0001193125…` | n/a | 1,779,974 | 2,350,296 | +32% |

The span bug explains why pre-fix streaming was below non-streaming. It doesn't explain why post-fix streaming is above it. Three possibilities I can think of:

1. Pre-existing divergence between paths (different whitespace/tail handling) that was masked while streaming was losing content
2. Non-streaming has its own separate content-loss bug, possibly span-related at a different scale
3. Streaming is now over-including: sibling .tail accumulating twice, or buffer flush interacting with the deferred clear

Have you compared the actual content (not just length) between the two paths on one of these filings? The +67% on 0001104659… is large enough that I'd want to know whether streaming is now correct and non-streaming is buggy, or vice versa, or both paths have different (defensible) semantics.

The regression test asserts content presence in both paths but doesn't compare lengths or do a content diff — adding a length-comparison assertion (or a diff on a known fixture) would lock in whichever interpretation is correct.

Not blocking the fix to the span bug — that's clearly the right call regardless. Just want to understand the overshoot before declaring streaming "fixed."

Once paragraph-text recovery from 794529d was in place, paragraphs nested inside table cells emitted twice in the streaming path: once as a free-standing ParagraphNode (from _handle_start_tag/_end_tag calling _start_paragraph/_end_paragraph unconditionally), and once as TableNode cell text (from _end_table running processor.process over the full subtree). Same applied to <h*> and <section> inside <td>.

Pre-794529dd this was masked because <p> handlers produced empty nodes anyway. Post-fix it surfaced as 10-36% content overshoot vs non-streaming on table-heavy filings, visible as the same financial-statement labels appearing dozens of times in streaming output.

Cross-checked on the four PR-cited filings (Stepstone 10-K and 20-F, AVAL 20-F, IFS 20-F): without this gate, S/NS normalized content ratio runs 0.95-1.36; with the gate, 0.86-1.05.

Gate _handle_start_tag and _handle_end_tag on _table_depth == 0 for <p>/<h1-h6>/<section>, symmetrical to the existing _table_depth gate on elem.clear(). The table processor remains the single source of cell text.

Regression test covers a 2x2 table with <p> in every cell and asserts each cell appears exactly once in both streaming and non-streaming output.

kevinchiu · 2026-05-27T22:40:45Z

Thanks for pushing back on this — the overshoot concern is well-founded. I re-examined the PR against current main (PR branch was 9 commits behind) across all four cited filings, found the cause, and pushed a follow-up commit (e28051fa) that resolves it. Here's what I found.

The fix targets a real and severe bug

Pre-fix streaming on current main, whitespace-normalized content vs non-streaming:

| Filing | NS content | Pre-fix S | S/NS |
| --- | --- | --- | --- |
| Stepstone 10-K | 1,096,791 | 723,668 | 0.66 |
| Stepstone 20-F | 1,418,465 | 486,817 | 0.34 |
| AVAL 20-F | 1,629,903 | 1,228,648 | 0.75 |
| IFS 20-F | 1,505,955 | 838,833 | 0.56 |

Pre-fix streaming was missing 25-66% of the content non-streaming captures. The regression test reproduces the failure mode on the span fixture; on production filings the gap is much larger. The diagnosis (deferred elem.clear() + _content_depth gate) is correct and the fix is a major recovery.
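
For readers who want to reproduce the S/NS column, here is a rough sketch of one way to compute a whitespace-normalized content ratio. The exact normalization used for the numbers above is not stated, so collapsing every whitespace run to a single space is an assumption, and normalized_length / content_ratio are hypothetical helper names.

```python
import re


def normalized_length(text: str) -> int:
    """Length of text after collapsing whitespace runs (assumed normalization)."""
    return len(re.sub(r"\s+", " ", text).strip())


def content_ratio(streaming_text: str, non_streaming_text: str) -> float:
    """S/NS: streaming content length relative to the non-streaming baseline."""
    baseline = normalized_length(non_streaming_text)
    if baseline == 0:
        raise ValueError("non-streaming baseline is empty")
    return normalized_length(streaming_text) / baseline


# Layout-only differences do not move the ratio.
print(content_ratio("Total   assets\n\n100", "Total assets 100"))  # 1.0
```

A ratio below 1.0 points at dropped content, a ratio above 1.0 at duplicated or extra content; neither says which path is correct without a content diff.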

The +27% overshoot was real, and you correctly suspected over-inclusion

With just the span fix:

| Filing | NS content | Post-fix S | S/NS |
| --- | --- | --- | --- |
| Stepstone 10-K | 1,096,791 | 1,398,414 | 1.28 |
| Stepstone 20-F | 1,418,465 | 1,556,249 | 1.10 |
| AVAL 20-F | 1,629,903 | 2,212,057 | 1.36 |
| IFS 20-F | 1,505,955 | 1,428,891 | 0.95 |

3 of 4 filings overshoot NS by 10-36% in content (not just whitespace) — exactly your concern.

Root cause: table cell paragraphs double-emit

Minimal reproducer:

html = """<html><body>
<table>
  <tr><td><p>Cell paragraph one</p></td><td><p>Cell paragraph two</p></td></tr>
  <tr><td><p>Row two A</p></td><td><p>Row two B</p></td></tr>
</table>
</body></html>"""

Streaming, span-fix only:

Cell paragraph one
Cell paragraph two
Row two A
Row two B
  Cell paragraph one       Cell paragraph two
  Row two A                Row two B

Non-streaming:

  Cell paragraph one       Cell paragraph two
  Row two A                Row two B

Each cell's text appears twice in streaming — once as a free-standing ParagraphNode (because _handle_start_tag calls _start_paragraph regardless of _table_depth), and once as TableNode cell text (because _end_table runs processor.process(elem) over the full subtree). Pre-span-fix, the paragraph half was empty, so this was masked. Post-fix, both fire fully — hence the overshoot, and the line-frequency anomalies ('Total' repeated 36× in streaming vs 1× in non-streaming, '12-month expected credit losses' 13× vs 0×, etc.).
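
The line-frequency comparison mentioned above can be approximated with a small helper. This is a sketch, not the script used for the numbers in this comment: over_emitted_lines is a hypothetical name, and comparing whitespace-stripped lines is an assumption about what counts as "the same line".

```python
from collections import Counter


def over_emitted_lines(streaming_text: str, non_streaming_text: str) -> dict:
    """Map each line that streaming emits more often to (streaming, non-streaming) counts."""
    s_counts = Counter(
        line.strip() for line in streaming_text.splitlines() if line.strip()
    )
    ns_counts = Counter(
        line.strip() for line in non_streaming_text.splitlines() if line.strip()
    )
    return {
        line: (count, ns_counts[line])  # Counter returns 0 for unseen lines
        for line, count in s_counts.items()
        if count > ns_counts[line]
    }


streaming = "Total\nRevenue\nTotal\nTotal\n"
non_streaming = "Total\nRevenue\n"
print(over_emitted_lines(streaming, non_streaming))  # {'Total': (3, 1)}
```

Lines that appear in streaming but never in non-streaming show up with a second count of 0, which matches the "13× vs 0×" shape of anomaly.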

Fix pushed as `e28051fa`

Gate _handle_start_tag and _handle_end_tag to skip <p> / <h1-6> / <section> handlers when _table_depth > 0. Symmetrical to the existing _table_depth gate on elem.clear(). ~20 lines including the comment. Plus a regression test that uses a 2x2 table with <p> in every cell and asserts each cell appears exactly once.
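
A self-contained sketch of the gate on the reproducer above, with the same caveats as before: standard-library ElementTree stands in for lxml, stream_blocks is a hypothetical function, and joining cells with " | " is an invented rendering, not the TableNode output format.

```python
import io
import xml.etree.ElementTree as ET

HTML = b"""<html><body>
<table>
  <tr><td><p>Cell paragraph one</p></td><td><p>Cell paragraph two</p></td></tr>
  <tr><td><p>Row two A</p></td><td><p>Row two B</p></td></tr>
</table>
</body></html>"""


def stream_blocks(data: bytes, gate_in_tables: bool = True) -> list[str]:
    """Emit paragraph and table-row text; optionally skip <p> handling inside tables."""
    table_depth = 0
    blocks = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if event == "start":
            if elem.tag == "table":
                table_depth += 1
            continue
        if elem.tag == "table":
            table_depth -= 1
            if table_depth == 0:
                # The table handler is the single source of cell text.
                for row in elem.iter("tr"):
                    cells = ["".join(td.itertext()).strip() for td in row.iter("td")]
                    blocks.append(" | ".join(cells))
        elif elem.tag == "p" and not (gate_in_tables and table_depth > 0):
            blocks.append("".join(elem.itertext()).strip())
        if table_depth == 0:
            elem.clear()  # existing-style gate: keep table subtrees intact
    return blocks


for gated in (False, True):
    text = "\n".join(stream_blocks(HTML, gate_in_tables=gated))
    print(gated, text.count("Cell paragraph one"))
# False 2
# True 1
```

The final count mirrors the exact-count style of assertion: each cell string should occur once, not merely be present.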

Results across the same 4 filings, span fix + the new gate:

| Filing | NS content | Post + gate | S/NS |
| --- | --- | --- | --- |
| Stepstone 10-K | 1,096,791 | 1,025,627 | 0.94 |
| Stepstone 20-F | 1,418,465 | 1,316,023 | 0.93 |
| AVAL 20-F | 1,629,903 | 1,707,399 | 1.05 |
| IFS 20-F | 1,505,955 | 1,291,519 | 0.86 |

All four within 0.86-1.05× of non-streaming — overshoot eliminated, no more table cell duplication. 162 tests pass across tests/test_html_parser*.py, tests/test_documents.py, etc.

Residual 5-14% under-coverage is pre-existing

The lines now missing on IFS are mostly body-text paragraphs nested inside multiple <div> wrappers (e.g., "You should inquire for yourself whether you are entitled..." which sits inside two nested <div> containers under an <h5> Table-of-Contents link). These aren't inside tables, and they weren't recovered pre-span-fix either — a separate streaming-path limitation around heavily-nested body content, out of scope here.

One PR description nit

The original description says the span fix recovers "UNITED STATES SECURITIES AND EXCHANGE COMMISSION" from the cover page. On current main that phrase is missing from streaming both pre- and post-fix — something in the 9 commits between the PR base and main moved on cover-page handling, and the claim no longer reproduces. Worth dropping that line from the description before merge.

Take a look at e28051fa and let me know if you'd like changes.

dgunning

Verified Kevin's follow-up fix end-to-end:

- Stepstone 10-K reproduces exactly: NS=1,096,791 / S=1,023,748 / ratio 0.93 (Kevin reported 0.94).
- Duplication probes on real cells all near 1.00 (Total 62→59, Interest income 5→5, Risk Factors 7→7, Cash and cash equivalents 1→1); the +27% overshoot is gone.
- Cover-page text is recovered in streaming output (UNITED STATES / SECURITIES AND EXCHANGE COMMISSION / WASHINGTON, DC 20549 / FORM 10-K).
- The new test_streaming_does_not_double_emit_table_cell_paragraphs uses text.count(cell) == 1, an exact-count assertion across both streaming and non-streaming baselines, which is the lock-in I asked for in my earlier review.
- Code review: the _table_depth gate in _handle_start_tag / _handle_end_tag is symmetrical to the existing clearing gate; the table processor's _extract_text already pulls cell content via itertext(), so the structural handlers inside tables were genuinely redundant emitters.
- 95 HTML-parser tests pass; 78 pass across the broader test_html_parser* + test_documents sweep.

Nice diagnosis and minimal fix. Merging.

@HonzaCuhel

Added:
- xbrl.calculation_linkbase(): per-filing calculation linkbase as a pandas DataFrame, one row per parent->child arc (GH #766 Phase 1)
- Statement.extension_arcs(): surfaces filer-authored concepts that participate in a statement's calc linkbase but are absent from its presentation tree (GH #766 Phase 2)
- Section.markdown(): structure-preserving per-section markdown for per-item chunkers / RAG pipelines (PR #833, @HonzaCuhel)

Fixed:
- StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs on filings crossing the 10MB streaming threshold (PR #830, @kevinchiu)
- HTTP_MGR had no default timeout; stalled requests could pin workers indefinitely (PR #831, @kevinchiu)
- 13F-HR holdings merged Put/Call positions into the underlying equity row, losing the PutCall column (GH #824)
- import edgar emitted DeprecationWarning on every startup, breaking downstream test suites running under -W error (PR #832, @kevinchiu)
- Filing.search() / Filing.grep() returned nothing on pre-2002 plain-text filings (GH #819)
- TOC analyzer fabricated phantom Items on 10-Q filings via three 10-K-shaped heuristics that fired regardless of form (PR #827, @HonzaCuhel)
- SearchResults panel labels conflated BM25 rank with section index (GH #765)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dgunning reviewed May 26, 2026


dgunning approved these changes May 28, 2026


dgunning merged commit 65495ab into dgunning:main May 28, 2026
6 checks passed

dgunning merged 2 commits into dgunning:main from kevinchiu:fix/streaming-parser-empty-content
