Skip to content

feat(documents): add Section.markdown() accessor#833

Merged
dgunning merged 1 commit into
dgunning:mainfrom
HonzaCuhel:feat/section-markdown-accessor
May 28, 2026
Merged

feat(documents): add Section.markdown() accessor#833
dgunning merged 1 commit into
dgunning:mainfrom
HonzaCuhel:feat/section-markdown-accessor

Conversation

@HonzaCuhel
Copy link
Copy Markdown
Contributor

Summary

Section.text() flattens tables and bullet lists; Filing.markdown() preserves structure but is whole-document. This adds Section.markdown() — the missing per-section markdown accessor — so per-item chunkers / RAG pipelines can get item-aware and structure-preserving markdown in one call.

Why

Real-world example: tenq[\"Part II, Item 2\"] (Issuer Purchases of Equity Securities table) returns vertically-stacked cell text. filing.markdown() would render the same content as pipe-delimited markdown — but you lose the per-item slice. Today downstream consumers have to choose between item-awareness and structure preservation.

section.markdown() removes the trade-off: one call returns markdown scoped to one section.

Implementation

Heading/pattern-based sections — render the cached node tree via MarkdownRenderer().render_node(self.node). Tables emit pipe syntax, bullet lists keep markers. Same boundary-artifact cleanup that text() applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (# Item 5, **Item 5\\.**, # PART II\\n\\n# Item 5).

TOC-based sections (where node.children is empty and content is fetched lazily from _html_source) — fall back to Section.text(). Iterative review surfaced multiple correctness landmines in TOC-section HTML extraction: next-section heading leaks via nested anchors, shared-wrapper LCA computation, same-anchor start/end boundaries, inline anchor wrappers, last-section-in-wrapper leaks, and table-row-bounded anchors losing <table>/<tbody> wrappers. Rather than ship a partial fix for any one, the TOC path is conservative: no regression vs text(), no new landmines. Tracked as a follow-up.

Verification

  • 13 new unit tests in tests/test_section_markdown.py cover: table-pipe preservation, bullet-list preservation, TOC fall-back contract, boundary-artifact cleanup across plain/escaped/heading-decorated/bold-decorated/combined-PART+Item variants, idempotency, and a negative control asserting text() does NOT preserve pipes (so the contrast is meaningful).
  • 100/100 broader section-detection regression tests still pass with EDGAR_IDENTITY set (test_documents, test_10q_section_detection, test_sections_membership, test_section_detection_edge_cases, test_section_tables_toc_fix).
  • 11 rounds of codex review — each round caught a real edge case, all addressed except the documented TOC fall-back (which would re-open the whole TOC extraction class of issues).

Test plan

  • CI runs the full repo test suite
  • Maintainer confirms the API name + scope
  • If maintainer wants TOC structural rendering shipped together, can split into a follow-up PR with the LCA-walk approach (incomplete vs. table-row anchors but covers the more common cases)

🤖 Generated with Claude Code

`Section.text()` walks the HTML subtree and emits newline-joined cell
content — tables and bullet lists are flattened to space/newline soup,
losing column structure and list markers. The whole-document
`Filing.markdown()` (via `Document.to_markdown()`) preserves table
pipe syntax and list markers but is whole-document only — there's no
per-item slice.

This means per-item chunkers / RAG pipelines have to choose between:

- the cheap per-item path (`typed[item_name]` → `Section.text()`) which
  is item-aware but flattens; or
- the faithful whole-document path (`filing.markdown()`) which preserves
  structure but loses item boundaries.

`Section.markdown()` closes that gap for heading/pattern-detected
sections: same scope as `text()` (one section) but the same renderer
as `Document.to_markdown` so tables and lists keep their syntax.

Implementation
--------------
- Heading/pattern-based sections: render the cached node tree directly
  via `MarkdownRenderer().render_node(self.node)`. Tables emit pipe
  syntax, bullet lists keep their markers. Same boundary-artifact
  cleanup that `text()` applies runs on the rendered output, extended
  to handle markdown-decorated bleed-in artifacts (e.g. `# Item 5`,
  `**Item 5\.**`, `# PART II\n\n# Item 5`).
- TOC-based sections (where `node.children` is empty and content is
  fetched lazily from the original HTML): fall back to `Section.text()`.
  Iterative codex review surfaced multiple edge cases for TOC-section
  HTML extraction (next-section heading leaks via nested anchors;
  shared-wrapper LCA computation; same-anchor start/end boundaries;
  inline anchor wrappers; last-section-in-wrapper leaks; table-row-
  bounded anchors losing `<table>`/`<tbody>` wrappers). Rather than
  ship a partial fix for any one of these, the TOC path is conservative
  and returns the same text the existing API already produces — no
  regression, no new correctness landmines. Full TOC markdown support
  is tracked as a follow-up.

Verification
------------
- 13 new unit tests cover table-pipe preservation, bullet-list
  preservation, the TOC fall-back contract, boundary-artifact cleanup
  in plain text + markdown-escaped + markdown-decorated + combined
  PART+Item forms, idempotency, and a negative control asserting
  `text()` does NOT preserve pipes (so the contrast is meaningful).
- 100/100 broader section-detection regression tests pass with
  `EDGAR_IDENTITY` set.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Owner

@dgunning dgunning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong PR Honza — useful feature, honest about the TOC limitation, and the "negative control" assertion (text() does NOT preserve pipes) is exactly the right way to prove the contrast is meaningful. The boundary-artifact regex changes are thorough and the 11-rounds-of-codex-review discipline matches the rigor on #827.

Two requests before merging:

1. Add a real-filing regression test for the motivating use case. All 13 new tests use synthetic HTML strings. The PR claims "100/100 broader regression tests still pass" but that's non-regression, not evidence the new feature actually delivers value on production SEC HTML. Per the verification constitution ("Documentation is the specification — every documented example must be verifiable"), a network-marked test on the motivating PG&E Part II, Item 2 Issuer Purchases table would lock in the actual contract:

@pytest.mark.network
def test_section_markdown_preserves_table_on_real_10q():
    tenq = Company("PCG").get_filings(form="10-Q").latest(1)[0].obj()
    md = tenq["Part II, Item 2"].markdown()
    assert "|" in md  # pipe table preserved
    # ground-truth assertion on a known cell value
    assert "Total Number of Shares" in md or "Total Shares Purchased" in md

2. Track TOC structural rendering as a follow-up. The docstring is honest about the limitation, but a tracked follow-up issue lets us link it from issue replies and remember to close the gap when the TOC analyzer rewrite lands. A short note referencing the seven edge cases you enumerated in the PR description (next-section heading leaks, shared-wrapper LCA, same-anchor boundaries, inline anchor wrappers, last-section-in-wrapper leaks, table-row-bounded anchors, missing structural wrappers) is enough — happy to file it on our side so it slots into the broader TOC analyzer work planned for v5.32.0.

Broader pattern worth flagging, not blocking: the boundary-artifact regex is now handling plain, markdown-escaped, heading-decorated, bold-decorated, and combined page + decorated PART + decorated Item variants. 11 rounds of codex review each catching a real edge case is impressive thoroughness — and also a leading indicator that the regex approach is fighting the renderer. Long-term, the cleaner fix is MarkdownRenderer not emitting boundary artifacts in the first place (it already knows where the section ends), removing the cleanup step entirely. Not in scope for this PR, but tracked for a broader parser design sprint.

Once those two land (real-filing test + TOC follow-up), this is mergeable. Nice work as always.

Copy link
Copy Markdown
Owner

@dgunning dgunning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled both follow-up requests on our side, approving:

1. Real-filing regression test. Added test_section_markdown_preserves_exhibit_table_on_real_8k to tests/test_section_markdown.py — pinned to AAPL 8-K 0000320193-26-000011 Item 9.01, asserts | --- | separator present in markdown() and absent from text() (negative control), with content anchors on the exhibit text. Will land as a follow-up commit alongside the squash-merge.

The exact tenq["Part II, Item 2"].markdown() example I gave doesn't actually exercise the new code path on production filings — for every large-cap 10-K/10-Q I checked (AAPL, MSFT, GOOGL, JPM, BRK-B, PCG, WMT), HybridSectionDetector resolves TOC-first and every section ends up detection_method='toc', which means markdown() takes the conservative fallback to text(). The path that actually delivers structure preservation today is pattern-detected sections — and 8-K Item 9.01 is the reliable real-filing anchor for that. Good thing to know going in: this PR's structural benefit currently lands almost entirely on 8-K and other pattern-detected forms, with 10-K/10-Q getting it once the TOC-rendering follow-up ships. The PR description's framing is still correct (the feature is real and the contract is well-defined), but worth flagging for documentation when we surface this API publicly.

2. TOC structural rendering follow-up. Filed on our side, slotted into the same section-extraction work that's tracked alongside the bank-filing extraction bugs and the broader parser design sprint. Scoped to the seven edge cases you enumerated in the PR description.

The synthetic test coverage is solid, the implementation is honest about its limits, and the docstring on markdown() accurately describes the TOC fallback. Nice work, Honza. Squash-merging.

@dgunning dgunning merged commit b701f31 into dgunning:main May 28, 2026
6 checks passed
dgunning added a commit that referenced this pull request May 28, 2026
Added:
- xbrl.calculation_linkbase() — per-filing calculation linkbase as a
  pandas DataFrame, one row per parent->child arc (GH #766 Phase 1)
- Statement.extension_arcs() — surfaces filer-authored concepts that
  participate in a statement's calc linkbase but are absent from its
  presentation tree (GH #766 Phase 2)
- Section.markdown() — structure-preserving per-section markdown for
  per-item chunkers / RAG pipelines (PR #833, @HonzaCuhel)

Fixed:
- StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs
  on filings crossing the 10MB streaming threshold (PR #830, @kevinchiu)
- HTTP_MGR had no default timeout — stalled requests could pin
  workers indefinitely (PR #831, @kevinchiu)
- 13F-HR holdings merged Put/Call positions into the underlying equity
  row, losing the PutCall column (GH #824)
- import edgar emitted DeprecationWarning on every startup, breaking
  downstream test suites running under -W error (PR #832, @kevinchiu)
- Filing.search() / Filing.grep() returned nothing on pre-2002
  plain-text filings (GH #819)
- TOC analyzer fabricated phantom Items on 10-Q filings via three
  10-K-shaped heuristics that fired regardless of form (PR #827,
  @HonzaCuhel)
- SearchResults panel labels conflated BM25 rank with section index
  (GH #765)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants