fix(documents): preserve span-wrapped text in StreamingParser #830

Merged: dgunning merged 2 commits into dgunning:main from kevinchiu:fix/streaming-parser-empty-content on May 28, 2026

Conversation

@kevinchiu
Contributor

Symptom

filing.text() silently returns less text than expected for filings
that cross ParserConfig.streaming_threshold (default 10MB), with no
exception and no warning.

Concrete measurement, Stepstone 10-K (0001193125-26-128890, 42.7 MB raw HTML):

| Path | Text chars |
|---|---|
| streaming (before this fix) | 1,140,129 |
| non-streaming baseline | 1,429,000 |
| streaming (after this fix) | 1,816,872 |

The streaming path was dropping the entire cover-page block — including
"UNITED STATES SECURITIES AND EXCHANGE COMMISSION / FORM 10-K /
For the fiscal year ended…" — because every line of that block is
nested inside style-bearing <span> tags.

A minimal SEC-style snippet (each word wrapped in <span style="…">)
reproduces the same failure mode without network: streaming drops every
<p> entirely and keeps only <h*> text.

Root cause

Two compounding bugs in the iterparse loop in
edgar/documents/utils/streaming.py::StreamingParser.parse:

  1. elem.clear() ran on every event (both start and end). At
    start events, lxml's HTML-mode lookahead has already populated child
    elements and their .text/.tail; structural handlers such as
    _start_heading read those at start time. Clearing on start
    destroyed that data before any handler could read it.

  2. No content-depth gate around child clearing. iterparse fires
    end events depth-first, so a child <span>'s end event ran
    elem.clear() (which wipes .text and .tail in lxml) before the
    enclosing <p>'s end event called _get_text_content(p). Since
    SEC filings nest essentially every word inside <span style="…">,
    _end_paragraph saw only empty children and produced empty paragraph
    text. The pre-existing _table_depth gate already protected
    <table> from the identical defect — this just extends the same
    idea to the other structural containers.
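
The depth-first ordering described in bug 2 can be reproduced outside the parser. The following is a minimal sketch, not the StreamingParser code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption; both expose iterparse, and in both Element.clear() resets .text and .tail), and the function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in markup: every word wrapped in its own <span>, as in SEC filings.
SNIPPET = b"<body><p><span>Hello</span> <span>world</span></p></body>"


def paragraph_text_with_eager_clear(data: bytes) -> str:
    """Clear every element on its own end event, then read <p> at its end."""
    collected = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "p":
            # Both child <span> end events fired earlier and wiped .text/.tail,
            # so there is nothing left to collect here.
            collected.append("".join(elem.itertext()))
        elem.clear()
    return "".join(collected)


def paragraph_text_without_clear(data: bytes) -> str:
    """Baseline: the same loop with no clearing at all."""
    collected = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "p":
            collected.append("".join(elem.itertext()))
    return "".join(collected)
```

On this snippet the eager-clear variant yields an empty paragraph while the baseline yields the full text, which matches the symptom of empty paragraph nodes on span-heavy filings.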

Fix

Clear only on end events, and gate clearing on a new _content_depth
counter that tracks open <p> / <h1>-<h6> / <section> elements
(mirroring _table_depth). This defers child cleanup until the enclosing
structural element has read its subtree.
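
A rough illustration of the gating idea, again using the standard library's xml.etree.ElementTree as a stand-in for lxml. This is a sketch under assumptions, not the code in this PR: `extract_blocks`, `CONTENT_TAGS`, and the local `content_depth` variable are hypothetical names that only mirror the role of _content_depth.

```python
import io
import xml.etree.ElementTree as ET

HEADINGS = {f"h{i}" for i in range(1, 7)}
CONTENT_TAGS = HEADINGS | {"p", "section"}


def extract_blocks(data: bytes) -> list[str]:
    """Collect <p>/<h*> text while deferring clear() until no content element is open."""
    content_depth = 0
    blocks = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        is_content = elem.tag in CONTENT_TAGS
        if event == "start":
            if is_content:
                content_depth += 1
            continue  # never clear on start events
        if is_content:
            if elem.tag != "section":
                # Children are still intact because clearing was deferred.
                blocks.append("".join(elem.itertext()).strip())
            content_depth -= 1
        if content_depth == 0:
            # Safe to release memory: no enclosing content element still needs this subtree.
            elem.clear()
    return blocks


SAMPLE = (
    b"<body><h1><span>FORM</span> <span>10-K</span></h1>"
    b"<section><p><span>Hello</span> <span>world</span></p></section></body>"
)
```

The trade-off is memory: a subtree stays resident until its outermost content element closes, so a very large <section> is held in full before it is released.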

Regression test

tests/test_html_parser_regressions.py::TestStreamingParserRegressions::test_streaming_preserves_span_wrapped_paragraph_text
uses a forced-streaming ParserConfig(streaming_threshold=1) against
SEC-style span-wrapped HTML, asserts that all paragraph and heading
content survives, and cross-checks against the non-streaming baseline.
Fails on main; passes with this change.

Verification

uv run pytest tests/test_html_parser*.py — 68 passed, 3 skipped.

End-to-end check on each of the four problem filings reported in
production. Streaming-path filing.text() length after the fix, with
the non-streaming baseline alongside for reference:

| Filing | Raw HTML | Streaming after fix | Non-streaming |
|---|---|---|---|
| Stepstone 10-K 0001193125-26-128890 | 42.7 MB | 1,816,872 | 1,429,000 |
| Stepstone 20-F 0001193125-26-177617 | 35.8 MB | 2,007,281 | 1,610,414 |
| 20-F 0001104659-26-044493 | 39.5 MB | 3,347,578 | 2,001,895 |
| 20-F 0001193125-26-183398 | 31.2 MB | 2,350,296 | 1,779,974 |

All four return non-empty text on the streaming path, and the streaming
output begins with the expected SEC cover-page text on the Stepstone
10-K (previously truncated to body-only content).

The streaming HTML parser silently dropped text from <span>-wrapped
paragraphs on filings that crossed streaming_threshold (default 10MB).
For SEC filings in the ~30MB–110MB band — which routinely nest every
word inside style-bearing <span> tags — filing.text() returned output
20%+ shorter than the non-streaming path with no exception or warning.

Two compounding bugs in the iterparse loop:

1. elem.clear() ran on every event (both start and end). At start
   events, lxml's HTML-mode lookahead has populated child elements
   and their text; clearing at start destroyed that data before any
   handler could read it.

2. elem.clear() ran on every element regardless of whether an
   enclosing structural element (<p>, <h1>-<h6>, <section>) had
   finished reading its children. iterparse fires end events
   depth-first, so a child <span>'s end event cleared its .text and
   .tail before the parent <p>'s end event called
   _get_text_content(p). The pre-existing _table_depth gate already
   protected <table> from the same defect.

Fix: clear only on end events, and gate clearing on a new
_content_depth counter that tracks open p/h1-h6/section elements
(mirroring _table_depth). Regression test exercises the SEC pattern
of span-wrapped paragraph text under forced streaming mode.
Owner

@dgunning dgunning left a comment


Strong fix, Kevin — diagnosis is precise (separating the start-event clearing from the missing content-depth gate as two distinct bugs is sharp), the _content_depth pattern mirrors _table_depth cleanly, and the cover-page recovery on Stepstone is convincing evidence the span bug was real.

One thing I want to understand before merging: your production table shows streaming-mode output now exceeds non-streaming by 25–67%.

| Filing | Pre-fix streaming | Non-streaming | Post-fix streaming | Δ vs non-streaming |
|---|---|---|---|---|
| Stepstone 10-K | 1,140,129 | 1,429,000 | 1,816,872 | +27% |
| Stepstone 20-F | | 1,610,414 | 2,007,281 | +25% |
| 20-F 0001104659… | | 2,001,895 | 3,347,578 | +67% |
| 20-F 0001193125… | | 1,779,974 | 2,350,296 | +32% |

The span bug explains why pre-fix streaming was below non-streaming. It doesn't explain why post-fix streaming is above it. Three possibilities I can think of:

  1. Pre-existing divergence between paths (different whitespace/tail handling) that was masked while streaming was losing content
  2. Non-streaming has its own separate content-loss bug — possibly span-related at a different scale
  3. Streaming is now over-including — sibling .tail accumulating twice, or buffer flush interacting with the deferred clear

Have you compared the actual content (not just length) between the two paths on one of these filings? The +67% on 0001104659… is large enough that I'd want to know whether streaming is now correct and non-streaming is buggy, or vice versa, or both paths have different (defensible) semantics.

The regression test asserts content presence in both paths but doesn't compare lengths or do a content diff — adding a length-comparison assertion (or a diff on a known fixture) would lock in whichever interpretation is correct.

Not blocking the fix to the span bug — that's clearly the right call regardless. Just want to understand the overshoot before declaring streaming "fixed."

Once paragraph-text recovery from 794529d was in place, paragraphs
nested inside table cells emitted twice in the streaming path: once
as a free-standing ParagraphNode (from _handle_start_tag/_end_tag
calling _start_paragraph/_end_paragraph unconditionally), and once
as TableNode cell text (from _end_table running processor.process
over the full subtree). Same applied to <h*> and <section> inside
<td>.

Pre-794529dd this was masked because <p> handlers produced empty
nodes anyway. Post-fix it surfaced as 10-36% content overshoot vs
non-streaming on table-heavy filings, visible as the same
financial-statement labels appearing dozens of times in streaming
output. Cross-checked on the four PR-cited filings (Stepstone 10-K
and 20-F, AVAL 20-F, IFS 20-F): without this gate, S/NS normalized
content ratio runs 0.95-1.36; with the gate, 0.86-1.05.

Gate _handle_start_tag and _handle_end_tag on _table_depth == 0
for <p>/<h1-h6>/<section>, symmetrical to the existing _table_depth
gate on elem.clear(). The table processor remains the single source
of cell text.

Regression test covers a 2x2 table with <p> in every cell and
asserts each cell appears exactly once in both streaming and
non-streaming output.
@kevinchiu
Contributor Author

Thanks for pushing back on this — the overshoot concern is well-founded. I re-examined the PR against current main (PR branch was 9 commits behind) across all four cited filings, found the cause, and pushed a follow-up commit (e28051fa) that resolves it. Here's what I found.

The fix targets a real and severe bug

Pre-fix streaming on current main, whitespace-normalized content vs non-streaming:

| Filing | NS content | Pre-fix S | S/NS |
|---|---|---|---|
| Stepstone 10-K | 1,096,791 | 723,668 | 0.66 |
| Stepstone 20-F | 1,418,465 | 486,817 | 0.34 |
| AVAL 20-F | 1,629,903 | 1,228,648 | 0.75 |
| IFS 20-F | 1,505,955 | 838,833 | 0.56 |

Pre-fix streaming was missing 25-66% of the content non-streaming captures. The regression test reproduces the failure mode on the span fixture; on production filings the gap is much larger. The diagnosis (deferred elem.clear() + _content_depth gate) is correct and the fix is a major recovery.
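
For anyone reproducing these S/NS figures, one way to compute a whitespace-normalized ratio is sketched below. The exact normalization used for the tables in this thread is not stated, so collapsing every whitespace run to a single space is an assumption, and both function names are hypothetical.

```python
import re


def normalized_length(text: str) -> int:
    """Length of the text after collapsing every whitespace run to one space."""
    return len(re.sub(r"\s+", " ", text).strip())


def coverage_ratio(streaming_text: str, baseline_text: str) -> float:
    """Streaming / non-streaming content ratio on whitespace-normalized text."""
    baseline = normalized_length(baseline_text)
    if baseline == 0:
        raise ValueError("baseline text is empty after normalization")
    return normalized_length(streaming_text) / baseline


# The Stepstone 10-K row above, recomputed from its two character counts.
stepstone_10k_ratio = round(723_668 / 1_096_791, 2)
```

A ratio near 1.0 only shows that the two paths emit similar volumes of text; it cannot distinguish missing content from duplicated content that happens to cancel out, so it is best paired with a per-line comparison.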

The +27% overshoot was real, and you correctly suspected over-inclusion

With just the span fix:

| Filing | NS content | Post-fix S | S/NS |
|---|---|---|---|
| Stepstone 10-K | 1,096,791 | 1,398,414 | 1.28 |
| Stepstone 20-F | 1,418,465 | 1,556,249 | 1.10 |
| AVAL 20-F | 1,629,903 | 2,212,057 | 1.36 |
| IFS 20-F | 1,505,955 | 1,428,891 | 0.95 |

3 of 4 filings overshoot NS by 10-36% in content (not just whitespace) — exactly your concern.

Root cause: table cell paragraphs double-emit

Minimal reproducer:

html = """<html><body>
<table>
  <tr><td><p>Cell paragraph one</p></td><td><p>Cell paragraph two</p></td></tr>
  <tr><td><p>Row two A</p></td><td><p>Row two B</p></td></tr>
</table>
</body></html>"""

Streaming, span-fix only:

Cell paragraph one
Cell paragraph two
Row two A
Row two B
  Cell paragraph one       Cell paragraph two
  Row two A                Row two B

Non-streaming:

Cell paragraph one       Cell paragraph two
  Row two A                Row two B

Each cell's text appears twice in streaming — once as a free-standing ParagraphNode (because _handle_start_tag calls _start_paragraph regardless of _table_depth), and once as TableNode cell text (because _end_table runs processor.process(elem) over the full subtree). Pre-span-fix, the paragraph half was empty, so this was masked. Post-fix, both fire fully — hence the overshoot, and the line-frequency anomalies ('Total' repeated 36× in streaming vs 1× in non-streaming, '12-month expected credit losses' 13× vs 0×, etc.).
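
Line-frequency anomalies of this kind can be surfaced with a small counter-based diff. This is a sketch of one possible approach, not the script used for the numbers above; `line_counts` and `frequency_anomalies` are hypothetical helpers, and per-line whitespace normalization is an assumption.

```python
from collections import Counter


def line_counts(text: str) -> Counter:
    """Count non-empty lines after collapsing internal whitespace."""
    return Counter(" ".join(line.split()) for line in text.splitlines() if line.strip())


def frequency_anomalies(streaming_text: str, baseline_text: str, min_delta: int = 1) -> dict:
    """Map each line to (streaming_count, baseline_count) where the counts differ."""
    streaming = line_counts(streaming_text)
    baseline = line_counts(baseline_text)
    anomalies = {}
    for line in streaming.keys() | baseline.keys():
        # Counter returns 0 for lines that are absent from one side.
        if abs(streaming[line] - baseline[line]) >= min_delta:
            anomalies[line] = (streaming[line], baseline[line])
    return anomalies
```

Raising `min_delta` filters out small differences and leaves the heavily repeated labels, which is where duplication shows up first on table-heavy filings.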

Fix pushed as e28051fa

Gate _handle_start_tag and _handle_end_tag to skip the <p> / <h1>-<h6> / <section> handlers when _table_depth > 0, symmetrical to the existing _table_depth gate on elem.clear(). The change is ~20 lines including the comment, plus a regression test that uses a 2x2 table with <p> in every cell and asserts each cell appears exactly once.
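
The effect of the gate can be shown on the reproducer above with a toy extractor. This is a simplified sketch, not the code in e28051fa: it uses the standard library's xml.etree.ElementTree instead of lxml, handles only <p> and <h*> (no <section>), and `extract_text` and its `gate_on_table_depth` flag are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

BLOCK_TAGS = {"p"} | {f"h{i}" for i in range(1, 7)}

REPRO = b"""<html><body>
<table>
  <tr><td><p>Cell paragraph one</p></td><td><p>Cell paragraph two</p></td></tr>
  <tr><td><p>Row two A</p></td><td><p>Row two B</p></td></tr>
</table>
</body></html>"""


def extract_text(data: bytes, gate_on_table_depth: bool) -> str:
    """Emit block text and table rows; optionally skip block handlers inside tables."""
    table_depth = 0
    out = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if event == "start":
            if elem.tag == "table":
                table_depth += 1
            continue
        if elem.tag == "table":
            table_depth -= 1
            if table_depth == 0:
                # Stand-in for the table processor: one output line per row.
                for row in elem.iter("tr"):
                    cells = ["".join(cell.itertext()).strip() for cell in row.iter("td")]
                    out.append("  ".join(cells))
        elif elem.tag in BLOCK_TAGS:
            if gate_on_table_depth and table_depth > 0:
                continue  # the table processor is the single source of cell text
            out.append("".join(elem.itertext()).strip())
    return "\n".join(out)
```

With the flag off, each cell string appears twice (once as a paragraph, once in its table row); with it on, each appears once, which is the same exact-count property the regression test asserts.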

Results across the same 4 filings, span fix + the new gate:

| Filing | NS content | Post + gate | S/NS |
|---|---|---|---|
| Stepstone 10-K | 1,096,791 | 1,025,627 | 0.94 |
| Stepstone 20-F | 1,418,465 | 1,316,023 | 0.93 |
| AVAL 20-F | 1,629,903 | 1,707,399 | 1.05 |
| IFS 20-F | 1,505,955 | 1,291,519 | 0.86 |

All four within 0.86-1.05× of non-streaming — overshoot eliminated, no more table cell duplication. 162 tests pass across tests/test_html_parser*.py, tests/test_documents.py, etc.

Residual 6-14% under-coverage is pre-existing

The lines now missing on IFS are mostly body-text paragraphs nested inside multiple <div> wrappers (e.g., "You should inquire for yourself whether you are entitled..." which sits inside two nested <div> containers under an <h5> Table-of-Contents link). These aren't inside tables, and they weren't recovered pre-span-fix either — a separate streaming-path limitation around heavily-nested body content, out of scope here.

One PR description nit

The original description says the span fix recovers "UNITED STATES SECURITIES AND EXCHANGE COMMISSION" from the cover page. On current main that phrase is missing from streaming both pre- and post-fix — something in the 9 commits between the PR base and main moved on cover-page handling, and the claim no longer reproduces. Worth dropping that line from the description before merge.

Take a look at e28051fa and let me know if you'd like changes.

Owner

@dgunning dgunning left a comment


Verified Kevin's follow-up fix end-to-end:

  • Stepstone 10-K reproduces closely: NS=1,096,791 / S=1,023,748 / ratio 0.93 (Kevin reported S=1,025,627, ratio 0.94).
  • Duplication probes on real cells all near 1.00 (Total 62→59, Interest income 5→5, Risk Factors 7→7, Cash and cash equivalents 1→1) — the +27% overshoot is gone.
  • Cover-page text is recovered in streaming output (UNITED STATES / SECURITIES AND EXCHANGE COMMISSION / WASHINGTON, DC 20549 / FORM 10-K).
  • The new test_streaming_does_not_double_emit_table_cell_paragraphs uses text.count(cell) == 1 — exact-count assertion across both streaming and non-streaming baselines, which is the lock-in I asked for in my earlier review.
  • Code review: the _table_depth gate in _handle_start_tag / _handle_end_tag is symmetrical to the existing clearing gate; the table processor's _extract_text already pulls cell content via itertext(), so the structural handlers inside tables were genuinely redundant emitters.
  • 95 HTML-parser tests pass; 78 pass across the broader test_html_parser* + test_documents sweep.

Nice diagnosis and minimal fix. Merging.

@dgunning dgunning merged commit 65495ab into dgunning:main May 28, 2026
6 checks passed
dgunning added a commit that referenced this pull request May 28, 2026
Added:
- xbrl.calculation_linkbase() — per-filing calculation linkbase as a
  pandas DataFrame, one row per parent->child arc (GH #766 Phase 1)
- Statement.extension_arcs() — surfaces filer-authored concepts that
  participate in a statement's calc linkbase but are absent from its
  presentation tree (GH #766 Phase 2)
- Section.markdown() — structure-preserving per-section markdown for
  per-item chunkers / RAG pipelines (PR #833, @HonzaCuhel)

Fixed:
- StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs
  on filings crossing the 10MB streaming threshold (PR #830, @kevinchiu)
- HTTP_MGR had no default timeout — stalled requests could pin
  workers indefinitely (PR #831, @kevinchiu)
- 13F-HR holdings merged Put/Call positions into the underlying equity
  row, losing the PutCall column (GH #824)
- import edgar emitted DeprecationWarning on every startup, breaking
  downstream test suites running under -W error (PR #832, @kevinchiu)
- Filing.search() / Filing.grep() returned nothing on pre-2002
  plain-text filings (GH #819)
- TOC analyzer fabricated phantom Items on 10-Q filings via three
  10-K-shaped heuristics that fired regardless of form (PR #827,
  @HonzaCuhel)
- SearchResults panel labels conflated BM25 rank with section index
  (GH #765)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>