fix: Clear word/char cells when force_full_page_ocr is used #2738
Conversation
When `force_full_page_ocr=True`, the OCR model correctly replaces `textline_cells` with OCR-extracted text. However, `word_cells` and `char_cells` were not cleared, causing downstream components like `TableStructureModel` to use unreliable PDF-extracted text containing GLYPH artifacts (e.g., `GLYPH<c=1,font=/AAAAAH+font000000002ed64673>`). This fix clears `word_cells` and `char_cells` when `force_full_page_ocr` is enabled, ensuring `TableStructureModel` falls back to the OCR-extracted textline cells via its existing fallback logic.

Fixes an issue where PDFs with problematic fonts (Type3, missing ToUnicode CMap) produced GLYPH artifacts in table content despite `force_full_page_ocr` being triggered.
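A minimal sketch of the clearing behavior described above, using stand-in types rather than docling's actual `SegmentedPdfPage`/`BaseOcrModel` classes (the field names follow the PR description; treat this as an illustration, not the exact diff):

```python
from dataclasses import dataclass, field


@dataclass
class ParsedPage:
    # Stand-in for the parsed-page fields named in this PR.
    textline_cells: list = field(default_factory=list)
    word_cells: list = field(default_factory=list)
    char_cells: list = field(default_factory=list)
    has_words: bool = False
    has_chars: bool = False


def post_process_cells(parsed_page: ParsedPage, force_full_page_ocr: bool) -> None:
    """Sketch of the fix: drop PDF-derived word/char cells when full-page OCR is forced."""
    if force_full_page_ocr:
        # textline_cells were already replaced with OCR text upstream;
        # clearing these makes TableStructureModel fall back to them.
        parsed_page.word_cells = []
        parsed_page.char_cells = []
        parsed_page.has_words = False
        parsed_page.has_chars = False


page = ParsedPage(word_cells=["GLYPH<c=1,font=/F0>"], has_words=True)
post_process_cells(page, force_full_page_ocr=True)
assert page.word_cells == [] and page.has_words is False
```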
✅ DCO Check Passed: Thanks @tripflex, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
dolfim-ibm
left a comment
lgtm
Codecov Report: ✅ All modified and coverable lines are covered by tests.
cau-git
left a comment
@tripflex this solution could be potentially destructive, since OCR engines may also provide word- and char-level cells.
I would propose to instead scan these lists for TextCell instances where TextCell.from_ocr == False and throw these out.
…r is used When force_full_page_ocr=True, the OCR model correctly replaces textline_cells with OCR-extracted text. However, word_cells and char_cells from the PDF backend were not handled, causing downstream components like TableStructureModel to use unreliable PDF-extracted text containing GLYPH artifacts. Instead of clearing all word/char cells (which would be destructive for backends like mets_gbs that provide OCR-generated word cells), this fix filters out only cells where from_ocr=False, preserving any OCR-generated cells. This ensures TableStructureModel falls back to the OCR-extracted textline cells via its existing fallback logic when word_cells is empty or only contains OCR cells. Fixes issue where PDFs with problematic fonts (Type3, missing ToUnicode CMap) produced GLYPH artifacts in table content despite force_full_page_ocr being triggered.
b63baa7 to a4f4e3f (force-pushed)
Great point, thank you for the feedback! I traced through the code more thoroughly and you're absolutely right.

Key Findings
Updated Fix

I've updated the PR to filter by `from_ocr`:

```python
if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = [
        c for c in page.parsed_page.word_cells if c.from_ocr
    ]
    page.parsed_page.char_cells = [
        c for c in page.parsed_page.char_cells if c.from_ocr
    ]
    page.parsed_page.has_words = len(page.parsed_page.word_cells) > 0
    page.parsed_page.has_chars = len(page.parsed_page.char_cells) > 0
```

This approach preserves any OCR-generated cells while dropping only the PDF-extracted ones, so it is no longer destructive for backends like mets_gbs that provide OCR-generated word cells.
Let me know if this looks good or if you have any other suggestions!
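A quick standalone check of that filtering behavior, using a minimal stand-in for docling's `TextCell` (only the two fields used here; not the real class):

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Minimal stand-in for a TextCell: just the fields the filter touches.
    text: str
    from_ocr: bool


word_cells = [
    Cell("GLYPH<c=1,font=/F0>", from_ocr=False),  # PDF-extracted garbage
    Cell("Revenue", from_ocr=True),               # OCR-provided word cell
]

# Same filter as in the updated fix: keep only OCR-generated cells.
word_cells = [c for c in word_cells if c.from_ocr]
has_words = len(word_cells) > 0

assert [c.text for c in word_cells] == ["Revenue"]
assert has_words is True
```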
I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: 4197a4e I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: a4f4e3f Signed-off-by: Myles McNamara <[email protected]>
Fixes #2737
Summary
When `force_full_page_ocr=True`, the OCR model correctly replaces `textline_cells` with OCR-extracted text. However, `word_cells` and `char_cells` from the PDF backend are not cleared, causing downstream components (specifically `TableStructureModel`) to use unreliable PDF-extracted text that contains GLYPH artifacts.

Problem
PDFs with problematic fonts (Type3 fonts, fonts with a missing ToUnicode CMap) produce unreadable text when extracted programmatically. The text appears as `GLYPH<c=1,font=/AAAAAH+font000000002ed64673>` instead of readable characters.
The existing `force_full_page_ocr` option correctly triggers full-page OCR to handle these cases, and the OCR model properly replaces `page.parsed_page.textline_cells` with clean OCR-extracted text.

However, the `TableStructureModel` in `docling/models/table_structure_model.py` (lines 224-236) prefers word-level cells for better table cell matching accuracy. Since `word_cells` is populated but contains GLYPH garbage, the fallback to `table_cluster.cells` (which contains clean OCR text) never triggers.

Root Cause
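The preference/fallback pattern described above boils down to something like this simplified sketch (illustrative only, not the actual code in `table_structure_model.py`):

```python
def cells_for_matching(word_cells: list, cluster_cells: list) -> list:
    # Simplified sketch of the selection logic: prefer word-level cells
    # for finer table-cell matching, and fall back to the cluster's
    # OCR-derived cells only when no word cells exist at all. When
    # word_cells is populated with GLYPH garbage, this fallback never
    # triggers -- the garbage wins simply by being non-empty.
    if word_cells:
        return word_cells
    return cluster_cells


# Clean fallback case: no word cells, OCR text is used.
assert cells_for_matching([], ["clean OCR line"]) == ["clean OCR line"]

# Broken case from this issue: GLYPH garbage blocks the fallback.
assert cells_for_matching(["GLYPH<c=1,font=/F0>"], ["clean OCR line"]) == [
    "GLYPH<c=1,font=/F0>"
]
```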
1. `PagePreprocessingModel` stores `page.parsed_page = page._backend.get_segmented_page()` containing:
   - `textline_cells`: populated from PDF (may contain GLYPHs)
   - `word_cells`: populated from PDF (may contain GLYPHs)
   - `char_cells`: populated from PDF (may contain GLYPHs)
2. OCR model with `force_full_page_ocr=True` replaces only `textline_cells`
3. `TableStructureModel` requests `word_cells`, which still contains GLYPH patterns
4. The fallback to OCR cells only triggers when `word_cells` is empty, not when it contains garbage

Solution
When `force_full_page_ocr=True`, also clear `word_cells` and `char_cells` in the `post_process_cells()` method. This ensures:

- `TableStructureModel`'s existing fallback logic triggers correctly
- Unreliable PDF-extracted text no longer reaches `TableStructureModel` or other components

Changes

- `docling/models/base_ocr_model.py`: `BaseOcrModel.post_process_cells()`

Backwards Compatibility
- Only affects behavior when `force_full_page_ocr=True`
- No behavior change for `TableStructureModel` in other configurations

Testing
Tested with PDFs containing fonts with missing ToUnicode CMap that previously produced GLYPH artifacts.
Related Code Paths
docling/models/page_preprocessing_model.py- Creates initialparsed_pagedocling/models/table_structure_model.py- Usesword_cellsfor table extractiondocling/backend/docling_parse_v4_backend.py- PDF backend that providesword_cellsDiscussion: Alternative Approach
This fix takes a conservative approach by clearing all word/char cells when `force_full_page_ocr=True`. However, I recognize that word-level cells provide better accuracy for table extraction when they contain valid text; that's why docling prefers them over textline cells.

An alternative, more surgical approach would be to detect GLYPH patterns in the cells themselves and only fall back to OCR when garbage is detected. This could be implemented in `TableStructureModel`.

Trade-offs: the GLYPH detection approach is more granular but requires changes in `TableStructureModel` rather than the OCR model. I went with the simpler fix since `force_full_page_ocr` already signals that PDF text extraction is unreliable for the entire document.

Open to feedback - if you guys prefer the GLYPH detection approach or have other suggestions, I'm happy to revise.
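For illustration, the GLYPH-detection alternative could look something like this sketch (the helper name and threshold are hypothetical, not part of docling):

```python
import re

# Matches the GLYPH<...> artifact pattern produced by fonts without a
# usable ToUnicode CMap, e.g. GLYPH<c=1,font=/AAAAAH+font000000002ed64673>.
GLYPH_RE = re.compile(r"GLYPH<[^>]*>")


def looks_like_glyph_garbage(texts: list[str], threshold: float = 0.5) -> bool:
    """Hypothetical helper: True if at least `threshold` of the cell texts
    contain GLYPH artifacts, signalling that PDF extraction is unreliable
    and the OCR fallback should be used instead."""
    if not texts:
        return False
    hits = sum(1 for t in texts if GLYPH_RE.search(t))
    return hits / len(texts) >= threshold


assert looks_like_glyph_garbage(["GLYPH<c=1,font=/F0>", "GLYPH<c=2,font=/F0>"])
assert not looks_like_glyph_garbage(["Revenue", "2023"])
```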