
fix: support search/grep on plain-text filings (#819) #823

Merged
dgunning merged 1 commit into dgunning:main from 0ywfe:fix/issue-819-search-grep-on-text-filings on May 25, 2026



@0ywfe 0ywfe commented May 21, 2026

Summary

Fixes #819. Filing.search() raised AssertionError and Filing.grep() silently returned 0 matches on pre-2002 plain-text filings (e.g. PCG 1999 10-K, accession 0000929624-00-000321).

Root cause

Both methods relied on attachment iteration. SGML decomposition for these older text-only filings produces empty attachment shells — 20 placeholders with empty=True and text_len=0. The full filing text is only accessible via filing.text(), which the search/grep paths didn't touch.

sections() additionally asserted html() is not None, which fails outright on text-only filings.

Fix

  • sections() — drop the assert; when html() is None, chunk filing.text() on <PAGE> markers (legacy SEC format pre-2002 page boundaries) or blank-line breaks, with a 50-char min to filter page-header noise. Falls back to single-section if no chunks survive.
  • grep() — handle attachments lookup failure gracefully (return-on-error replaced with empty list); track whether any attachment yielded usable text; when none did and no non-primary document filter is set, fall back to grep'ing filing.text() as the "primary" document.
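
The sections() fallback described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code merged in this PR: the function name chunk_plain_text and the constant MIN_CHUNK_CHARS are hypothetical, and the sketch assumes blank-line splitting is used only when no <PAGE> marker is present.

```python
import re

# Assumed threshold: the PR description mentions a 50-char minimum.
MIN_CHUNK_CHARS = 50


def chunk_plain_text(text: str, min_chars: int = MIN_CHUNK_CHARS) -> list[str]:
    """Split plain filing text into sections (hypothetical helper).

    Splits on <PAGE> markers when present, otherwise on blank lines,
    and drops chunks shorter than min_chars (page-header noise).
    """
    if re.search(r"<PAGE>", text, flags=re.IGNORECASE):
        parts = re.split(r"<PAGE>", text, flags=re.IGNORECASE)
    else:
        parts = re.split(r"\n\s*\n", text)

    chunks = [p.strip() for p in parts if len(p.strip()) >= min_chars]

    # If every chunk was filtered out, fall back to a single section.
    return chunks or [text.strip()]
```

Whether the real implementation combines both split strategies or prefers one over the other is not stated in the description, so treat the either/or choice here as an assumption.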

Verification

Before fix (PCG 1999 10-K, plain text, 274,587 chars containing "employees"):

  • search("employees") → AssertionError
  • grep("employees") → 0 matches

After fix:

  • search("employees") → 2 BM25 hits
  • grep("employees") → 5 matches

HTML-filing path unchanged (verified on PCG 2025 10-K: 167 grep matches, 17 search hits, same as before).

Test plan

  • Regression test added at tests/issues/regression/test_issue_819_search_grep_on_text_filings.py (5 cases, all @pytest.mark.network)
  • Asserts specific ground-truth values from PCG 1999 10-K (per CLAUDE.md verification constitution)
  • Includes regression guard that HTML-filing search/grep still work
  • All 5 new tests pass
  • 51 existing fast tests in test_textsearch.py + test_filing.py still pass (no regressions)

@0ywfe 0ywfe force-pushed the fix/issue-819-search-grep-on-text-filings branch from 7811607 to 380a34d Compare May 22, 2026 22:03

0ywfe commented May 22, 2026

Rebased on main. Noticed your a7e36a10 quick-win shipped while this was open — your commit message scheduled the full sections()/grep() text-filing fix for 5.32.0, and that's exactly what this PR implements.

Conflict resolution:

  • sections(): dropped the ValueError path from your quick-win because sections() now actually works on text filings (chunks on <PAGE> markers / blank lines) — the error message it pointed users at is no longer needed.
  • Removed tests/issues/regression/test_issue_819_text_filing_search.py since its assertions (pytest.raises(ValueError)) became wrong under the new behavior. Our test_issue_819_search_grep_on_text_filings.py covers the same case at the integration level with ground-truth assertions against PCG 1999 10-K (network-marked).

Happy to split this into separate sections() and grep() commits, or restructure however you'd prefer. Just say the word.

Filing.search() raised AssertionError and Filing.grep() returned 0 matches
on pre-2002 plain-text filings. Both relied on attachment iteration that
finds nothing because SGML decomposition emits empty shells for text-only
filings.

sections() now falls back to chunking filing.text() on <PAGE> markers or
blank lines when html() is None. grep() falls back to filing.text() as
the primary document when no attachment yields usable text.

Regression test asserts both work on PCG's 1999 10-K (accession
0000929624-00-000321, 274,587 chars of plain text).
@0ywfe 0ywfe force-pushed the fix/issue-819-search-grep-on-text-filings branch from 380a34d to a2c8043 Compare May 23, 2026 12:30

0ywfe commented May 23, 2026

CodeFactor flagged Filing.grep cyclomatic complexity at D-rank (23) after this PR — the fallback for empty-attachments pushed it past the threshold.

Refactored without changing behaviour (force-pushed):

  • Extracted _attachment_matches(attachment, document) — the document-filter check
  • Extracted _attachment_location(attachment) — the location-label resolution
  • Extracted _grep_filing_text(pattern, regex) — the new text-filing fallback

Filing.grep now C(12) per radon, helpers are A(3-5). All 5 regression tests + 51 fast tests in test_textsearch / test_filing still pass.
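
For readers unfamiliar with the fallback, here is a minimal standalone sketch of the control flow: grep each attachment, track whether any of them yielded usable text, and only then fall back to the full filing text labelled as the "primary" document. All names below (GrepMatch, grep_text, grep_with_fallback) are hypothetical and do not reflect the library's actual signatures. Case-insensitive matching is an assumption, and the document-filter check is omitted for brevity.

```python
import re
from dataclasses import dataclass


@dataclass
class GrepMatch:
    location: str  # which document the match came from
    line_no: int   # 1-based line number within that document
    line: str      # the matching line, stripped


def grep_text(text, pattern, regex=False, location="primary"):
    """Return one GrepMatch per line of text that contains pattern."""
    # Assumption: literal patterns are escaped and matching ignores case.
    compiled = re.compile(pattern if regex else re.escape(pattern), re.IGNORECASE)
    return [
        GrepMatch(location, n, line.strip())
        for n, line in enumerate(text.splitlines(), start=1)
        if compiled.search(line)
    ]


def grep_with_fallback(attachment_texts, filing_text, pattern, regex=False):
    """Grep (location, text) pairs; fall back to the full filing text."""
    matches = []
    any_usable = False
    for location, text in attachment_texts:
        if not text:  # empty shell, as produced for old text-only filings
            continue
        any_usable = True
        matches.extend(grep_text(text, pattern, regex, location))

    # Only when no attachment had text do we search the whole filing.
    if not any_usable and filing_text:
        matches = grep_text(filing_text, pattern, regex, "primary")
    return matches
```

Keeping the fallback in its own small function mirrors the refactor described above: the main loop stays simple, and the text-only path can be tested in isolation.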

None,
)
assert target is not None, f"PCG 10-K {PCG_TEXT_10K_ACCESSION} missing from EDGAR results"
return target
Owner


This is what I do in tests:
filing = Filing(form='10-K', filing_date='2000-03-08', company='PG&E CORP', cik=1004980, accession_no='0000929624-00-000321')

'''
How to find a filing by accession number

f = find("0000929624-00-000321")
str(f)
"Filing(form='10-K', filing_date='2000-03-08', company='PG&E CORP', cik=1004980, accession_no='0000929624-00-000321')"
'''


@dgunning dgunning left a comment


This PR is well done and fixes a known problem.

Thanks for your effort

@dgunning dgunning merged commit 84f2015 into dgunning:main May 25, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

Filing search/grep don't work on text (non-HTML) filings
