feat(documents): add Section.markdown() accessor by HonzaCuhel · Pull Request #833 · dgunning/edgartools

HonzaCuhel · 2026-05-25T17:09:53Z

Summary

Section.text() flattens tables and bullet lists; Filing.markdown() preserves structure but is whole-document. This adds Section.markdown() — the missing per-section markdown accessor — so per-item chunkers / RAG pipelines can get item-aware and structure-preserving markdown in one call.

Why

Real-world example: tenq[\"Part II, Item 2\"] (Issuer Purchases of Equity Securities table) returns vertically-stacked cell text. filing.markdown() would render the same content as pipe-delimited markdown — but you lose the per-item slice. Today downstream consumers have to choose between item-awareness and structure preservation.

section.markdown() removes the trade-off: one call returns markdown scoped to one section.

Implementation

Heading/pattern-based sections — render the cached node tree via MarkdownRenderer().render_node(self.node). Tables emit pipe syntax, bullet lists keep markers. Same boundary-artifact cleanup that text() applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (# Item 5, **Item 5\\.**, # PART II\\n\\n# Item 5).

TOC-based sections (where node.children is empty and content is fetched lazily from _html_source) — fall back to Section.text(). Iterative review surfaced multiple correctness landmines in TOC-section HTML extraction: next-section heading leaks via nested anchors, shared-wrapper LCA computation, same-anchor start/end boundaries, inline anchor wrappers, last-section-in-wrapper leaks, and table-row-bounded anchors losing <table>/<tbody> wrappers. Rather than ship a partial fix for any one, the TOC path is conservative: no regression vs text(), no new landmines. Tracked as a follow-up.

Verification

13 new unit tests in tests/test_section_markdown.py cover: table-pipe preservation, bullet-list preservation, TOC fall-back contract, boundary-artifact cleanup across plain/escaped/heading-decorated/bold-decorated/combined-PART+Item variants, idempotency, and a negative control asserting text() does NOT preserve pipes (so the contrast is meaningful).
100/100 broader section-detection regression tests still pass with EDGAR_IDENTITY set (test_documents, test_10q_section_detection, test_sections_membership, test_section_detection_edge_cases, test_section_tables_toc_fix).
11 rounds of codex review — each round caught a real edge case, all addressed except the documented TOC fall-back (which would re-open the whole TOC extraction class of issues).

Test plan

CI runs the full repo test suite
Maintainer confirms the API name + scope
If maintainer wants TOC structural rendering shipped together, can split into a follow-up PR with the LCA-walk approach (incomplete vs. table-row anchors but covers the more common cases)

🤖 Generated with Claude Code

`Section.text()` walks the HTML subtree and emits newline-joined cell content — tables and bullet lists are flattened to space/newline soup, losing column structure and list markers. The whole-document `Filing.markdown()` (via `Document.to_markdown()`) preserves table pipe syntax and list markers but is whole-document only — there's no per-item slice. This means per-item chunkers / RAG pipelines have to choose between: - the cheap per-item path (`typed[item_name]` → `Section.text()`) which is item-aware but flattens; or - the faithful whole-document path (`filing.markdown()`) which preserves structure but loses item boundaries. `Section.markdown()` closes that gap for heading/pattern-detected sections: same scope as `text()` (one section) but the same renderer as `Document.to_markdown` so tables and lists keep their syntax. Implementation -------------- - Heading/pattern-based sections: render the cached node tree directly via `MarkdownRenderer().render_node(self.node)`. Tables emit pipe syntax, bullet lists keep their markers. Same boundary-artifact cleanup that `text()` applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (e.g. `# Item 5`, `**Item 5\.**`, `# PART II\n\n# Item 5`). - TOC-based sections (where `node.children` is empty and content is fetched lazily from the original HTML): fall back to `Section.text()`. Iterative codex review surfaced multiple edge cases for TOC-section HTML extraction (next-section heading leaks via nested anchors; shared-wrapper LCA computation; same-anchor start/end boundaries; inline anchor wrappers; last-section-in-wrapper leaks; table-row- bounded anchors losing `<table>`/`<tbody>` wrappers). Rather than ship a partial fix for any one of these, the TOC path is conservative and returns the same text the existing API already produces — no regression, no new correctness landmines. Full TOC markdown support is tracked as a follow-up. Verification ------------ - 13 new unit tests cover table-pipe preservation, bullet-list preservation, the TOC fall-back contract, boundary-artifact cleanup in plain text + markdown-escaped + markdown-decorated + combined PART+Item forms, idempotency, and a negative control asserting `text()` does NOT preserve pipes (so the contrast is meaningful). - 100/100 broader section-detection regression tests pass with `EDGAR_IDENTITY` set. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dgunning

Strong PR Honza — useful feature, honest about the TOC limitation, and the "negative control" assertion (text() does NOT preserve pipes) is exactly the right way to prove the contrast is meaningful. The boundary-artifact regex changes are thorough and the 11-rounds-of-codex-review discipline matches the rigor on #827.

Two requests before merging:

1. Add a real-filing regression test for the motivating use case. All 13 new tests use synthetic HTML strings. The PR claims "100/100 broader regression tests still pass" but that's non-regression, not evidence the new feature actually delivers value on production SEC HTML. Per the verification constitution ("Documentation is the specification — every documented example must be verifiable"), a network-marked test on the motivating PG&E Part II, Item 2 Issuer Purchases table would lock in the actual contract:

@pytest.mark.network
def test_section_markdown_preserves_table_on_real_10q():
    tenq = Company("PCG").get_filings(form="10-Q").latest(1)[0].obj()
    md = tenq["Part II, Item 2"].markdown()
    assert "|" in md  # pipe table preserved
    # ground-truth assertion on a known cell value
    assert "Total Number of Shares" in md or "Total Shares Purchased" in md

2. Track TOC structural rendering as a follow-up. The docstring is honest about the limitation, but a tracked follow-up issue lets us link it from issue replies and remember to close the gap when the TOC analyzer rewrite lands. A short note referencing the seven edge cases you enumerated in the PR description (next-section heading leaks, shared-wrapper LCA, same-anchor boundaries, inline anchor wrappers, last-section-in-wrapper leaks, table-row-bounded anchors, missing structural wrappers) is enough — happy to file it on our side so it slots into the broader TOC analyzer work planned for v5.32.0.

Broader pattern worth flagging, not blocking: the boundary-artifact regex is now handling plain, markdown-escaped, heading-decorated, bold-decorated, and combined page + decorated PART + decorated Item variants. 11 rounds of codex review each catching a real edge case is impressive thoroughness — and also a leading indicator that the regex approach is fighting the renderer. Long-term, the cleaner fix is MarkdownRenderer not emitting boundary artifacts in the first place (it already knows where the section ends), removing the cleanup step entirely. Not in scope for this PR, but tracked for a broader parser design sprint.

Once those two land (real-filing test + TOC follow-up), this is mergeable. Nice work as always.

dgunning

Handled both follow-up requests on our side, approving:

1. Real-filing regression test. Added test_section_markdown_preserves_exhibit_table_on_real_8k to tests/test_section_markdown.py — pinned to AAPL 8-K 0000320193-26-000011 Item 9.01, asserts | --- | separator present in markdown() and absent from text() (negative control), with content anchors on the exhibit text. Will land as a follow-up commit alongside the squash-merge.

The exact tenq["Part II, Item 2"].markdown() example I gave doesn't actually exercise the new code path on production filings — for every large-cap 10-K/10-Q I checked (AAPL, MSFT, GOOGL, JPM, BRK-B, PCG, WMT), HybridSectionDetector resolves TOC-first and every section ends up detection_method='toc', which means markdown() takes the conservative fallback to text(). The path that actually delivers structure preservation today is pattern-detected sections — and 8-K Item 9.01 is the reliable real-filing anchor for that. Good thing to know going in: this PR's structural benefit currently lands almost entirely on 8-K and other pattern-detected forms, with 10-K/10-Q getting it once the TOC-rendering follow-up ships. The PR description's framing is still correct (the feature is real and the contract is well-defined), but worth flagging for documentation when we surface this API publicly.

2. TOC structural rendering follow-up. Filed on our side, slotted into the same section-extraction work that's tracked alongside the bank-filing extraction bugs and the broader parser design sprint. Scoped to the seven edge cases you enumerated in the PR description.

The synthetic test coverage is solid, the implementation is honest about its limits, and the docstring on markdown() accurately describes the TOC fallback. Nice work, Honza. Squash-merging.

@HonzaCuhel

Added: - xbrl.calculation_linkbase() — per-filing calculation linkbase as a pandas DataFrame, one row per parent->child arc (GH #766 Phase 1) - Statement.extension_arcs() — surfaces filer-authored concepts that participate in a statement's calc linkbase but are absent from its presentation tree (GH #766 Phase 2) - Section.markdown() — structure-preserving per-section markdown for per-item chunkers / RAG pipelines (PR #833, @HonzaCuhel) Fixed: - StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs on filings crossing the 10MB streaming threshold (PR #830, @kevinchiu) - HTTP_MGR had no default timeout — stalled requests could pin workers indefinitely (PR #831, @kevinchiu) - 13F-HR holdings merged Put/Call positions into the underlying equity row, losing the PutCall column (GH #824) - import edgar emitted DeprecationWarning on every startup, breaking downstream test suites running under -W error (PR #832, @kevinchiu) - Filing.search() / Filing.grep() returned nothing on pre-2002 plain-text filings (GH #819) - TOC analyzer fabricated phantom Items on 10-Q filings via three 10-K-shaped heuristics that fired regardless of form (PR #827, @HonzaCuhel) - SearchResults panel labels conflated BM25 rank with section index (GH #765) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dgunning reviewed May 26, 2026

View reviewed changes

dgunning approved these changes May 28, 2026

View reviewed changes

dgunning merged commit b701f31 into dgunning:main May 28, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(documents): add Section.markdown() accessor#833

feat(documents): add Section.markdown() accessor#833
dgunning merged 1 commit into
dgunning:mainfrom
HonzaCuhel:feat/section-markdown-accessor

HonzaCuhel commented May 25, 2026

Uh oh!

dgunning left a comment •

edited

Loading

Uh oh!

dgunning left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

HonzaCuhel commented May 25, 2026

Summary

Why

Implementation

Verification

Test plan

Uh oh!

dgunning left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dgunning left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dgunning left a comment •

edited

Loading