fix: support search/grep on plain-text filings (#819) #823
Force-pushed from 7811607 to 380a34d.
Rebased on main. Noticed your Conflict resolution:
Happy to split this into separate
Filing.search() raised AssertionError and Filing.grep() returned 0 matches on pre-2002 plain-text filings. Both relied on attachment iteration that finds nothing because SGML decomposition emits empty shells for text-only filings. sections() now falls back to chunking filing.text() on <PAGE> markers or blank lines when html() is None. grep() falls back to filing.text() as the primary document when no attachment yields usable text. Regression test asserts both work on PCG's 1999 10-K (accession 0000929624-00-000321, 274,587 chars of plain text).
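The chunking fallback described above can be sketched in isolation. This is a minimal illustration, not the code merged in the PR: the function name `chunk_plain_text` is hypothetical, and it assumes `<PAGE>` markers take precedence over blank-line breaks whenever they are present. The 50-character minimum comes from the PR description.

```python
import re

# Minimum chunk length; the PR description cites 50 chars to drop page-header noise.
MIN_CHUNK_CHARS = 50


def chunk_plain_text(text: str, min_chars: int = MIN_CHUNK_CHARS) -> list[str]:
    """Split plain filing text into sections (hypothetical helper).

    Assumption: split on <PAGE> markers when any are present,
    otherwise split on blank-line breaks.
    """
    if "<PAGE>" in text:
        parts = text.split("<PAGE>")
    else:
        parts = re.split(r"\n\s*\n", text)

    # Keep only chunks long enough to be real content.
    chunks = [p.strip() for p in parts if len(p.strip()) >= min_chars]

    if not chunks:
        # Nothing survived the filter: treat the whole text as one section.
        stripped = text.strip()
        return [stripped] if stripped else []
    return chunks
```

Called on the result of `filing.text()`, a helper like this would return at least one section for any non-empty text-only filing, which is what lets the search path proceed without the `html()` assertion.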
Force-pushed from 380a34d to a2c8043.
CodeFactor flagged Filing.grep cyclomatic complexity at D-rank (23) after this PR — the fallback for empty-attachments pushed it past the threshold. Refactored without changing behaviour (force-pushed):
        None,
    )
    assert target is not None, f"PCG 10-K {PCG_TEXT_10K_ACCESSION} missing from EDGAR results"
    return target
This is what I do in tests:
from edgar import Filing

# Construct the filing directly from its known metadata instead of searching EDGAR results
filing = Filing(form='10-K', filing_date='2000-03-08', company='PG&E CORP', cik=1004980, accession_no='0000929624-00-000321')

# How to find a filing by accession number:
#   f = find("0000929624-00-000321")
#   str(f)
#   "Filing(form='10-K', filing_date='2000-03-08', company='PG&E CORP', cik=1004980, accession_no='0000929624-00-000321')"
dgunning
left a comment
This PR is well done and fixes a known problem.
Thanks for your effort
Summary
Fixes #819: Filing.search() raised AssertionError and Filing.grep() silently returned 0 matches on pre-2002 plain-text filings (e.g. PCG 1999 10-K, accession 0000929624-00-000321).
Root cause
Both methods relied on attachment iteration. SGML decomposition for these older text-only filings produces empty attachment shells: 20 placeholders with empty=True and text_len=0. The full filing text is only accessible via filing.text(), which the search/grep paths didn't touch. sections() additionally asserted html() is not None, which fails outright on text-only filings.
Fix
sections(): drop the assert; when html() is None, chunk filing.text() on <PAGE> markers (legacy SEC format pre-2002 page boundaries) or blank-line breaks, with a 50-char min to filter page-header noise. Falls back to single-section if no chunks survive.
grep(): handle attachments lookup failure gracefully (return-on-error replaced with empty list); track whether any attachment yielded usable text; when none did and no non-primary document filter is set, fall back to grep'ing filing.text() as the "primary" document.
Verification
Before fix (PCG 1999 10-K, plain text, 274,587 chars containing "employees"):
search("employees") → AssertionError
grep("employees") → 0 matches
After fix:
search("employees") → 2 BM25 hits
grep("employees") → 5 matches
HTML-filing path unchanged (verified on PCG 2025 10-K: 167 grep matches, 17 search hits, same as before).
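The grep() fallback described under Fix can be illustrated with plain data standing in for the library's objects. This is a rough sketch under stated assumptions, not the PR's implementation: `Match`, `grep_lines`, and `grep_with_fallback` are hypothetical names, attachments are modelled as a dict of name to text, and the document filter is modelled as an optional string where `"primary"` denotes the primary document.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    """Hypothetical match record: which document, which line."""
    document: str
    line_no: int
    line: str


def grep_lines(pattern: str, text: str, document: str) -> list[Match]:
    """Return one Match per line of `text` that matches `pattern`."""
    rx = re.compile(pattern)
    return [
        Match(document, i, line)
        for i, line in enumerate(text.splitlines(), start=1)
        if rx.search(line)
    ]


def grep_with_fallback(
    pattern: str,
    attachment_texts: dict[str, str],  # stand-in for attachment iteration
    full_text: str,                    # stand-in for filing.text()
    document: Optional[str] = None,    # stand-in for the document filter
) -> list[Match]:
    matches: list[Match] = []
    any_usable = False  # did any attachment yield usable text?

    for name, text in attachment_texts.items():
        if document is not None and name != document:
            continue
        if not text:
            continue  # empty shell, as produced for text-only filings
        any_usable = True
        matches.extend(grep_lines(pattern, text, name))

    # Fall back only when nothing was usable and the caller did not
    # restrict the search to a non-primary document.
    if not any_usable and document in (None, "primary"):
        matches.extend(grep_lines(pattern, full_text, "primary"))
    return matches
```

Keeping the fallback behind the `any_usable` flag means filings whose attachments do carry text never touch the full-text path, which is consistent with the unchanged HTML-filing results reported above.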
Test plan
tests/issues/regression/test_issue_819_search_grep_on_text_filings.py (5 cases, all @pytest.mark.network)
test_textsearch.py + test_filing.py still pass (no regressions)