
Conversation

Contributor

@tripflex tripflex commented Dec 5, 2025

Fixes #2737

Summary

When force_full_page_ocr=True, the OCR model correctly replaces textline_cells with OCR-extracted text. However, word_cells and char_cells from the PDF backend are not cleared, causing downstream components (specifically TableStructureModel) to use unreliable PDF-extracted text that contains GLYPH artifacts.

Problem

PDFs with problematic fonts (Type3 fonts, fonts with missing ToUnicode CMap) produce unreadable text when extracted programmatically. The text appears as:

GLYPH<c=1,font=/AAAAAH+font000000002ed64673> GLYPH<c=1,font=/AAAAAH+font000000002ed64673>

The existing force_full_page_ocr option correctly triggers full-page OCR to handle these cases, and the OCR model properly replaces page.parsed_page.textline_cells with clean OCR-extracted text.
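
For reference, here is a minimal sketch of how this option is typically enabled through the pipeline options (class and option names reflect my understanding of the docling API; the input path is a placeholder):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Run OCR over the entire page rather than only bitmap regions, so that
# unreadable PDF-extracted text is replaced by OCR output.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("document_with_broken_fonts.pdf")  # placeholder path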

However, the TableStructureModel in docling/models/table_structure_model.py (lines 224-236) prefers word-level cells for better table cell matching accuracy:

# Check if word-level cells are available from backend:
sp = page._backend.get_segmented_page()
if sp is not None:
    tcells = sp.get_cells_in_bbox(
        cell_unit=TextCellUnit.WORD,
        bbox=table_cluster.bbox,
    )
    if len(tcells) == 0:
        # In case word-level cells yield empty
        tcells = table_cluster.cells  # Falls back to textline cells

Since word_cells is populated but contains GLYPH garbage, the fallback to table_cluster.cells (which contains clean OCR text) never triggers.

Root Cause

  1. PagePreprocessingModel stores page.parsed_page = page._backend.get_segmented_page() containing:

    • textline_cells: populated from PDF (may contain GLYPHs)
    • word_cells: populated from PDF (may contain GLYPHs)
    • char_cells: populated from PDF (may contain GLYPHs)
  2. OCR model with force_full_page_ocr=True replaces only textline_cells

  3. TableStructureModel requests word_cells which still contains GLYPH patterns

  4. The fallback to OCR cells only triggers when word_cells is empty, not when it contains garbage
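
To make the failure mode concrete, inspecting the parsed page after OCR would show something like this (attribute names are taken from the steps above; the printed values are illustrative only):

sp = page.parsed_page
print(sp.textline_cells[0].text)  # e.g. "Total revenue" (clean OCR replacement)
print(sp.word_cells[0].text)      # e.g. "GLYPH<c=1,font=/AAAAAH+font000000002ed64673>" (stale PDF text)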

Solution

When force_full_page_ocr=True, also clear word_cells and char_cells in the post_process_cells() method. This ensures:

  1. Semantic consistency: if PDF text extraction is unreliable enough to require full-page OCR, ALL extraction levels are unreliable
  2. TableStructureModel's existing fallback logic triggers correctly
  3. No changes required in TableStructureModel or other components

Changes

docling/models/base_ocr_model.py - BaseOcrModel.post_process_cells():

# When force_full_page_ocr is used, word/char-level cells from PDF
# are also unreliable. Clear them so downstream components (e.g., table
# structure model) fall back to OCR-extracted textline cells.
if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = []
    page.parsed_page.char_cells = []
    page.parsed_page.has_words = False
    page.parsed_page.has_chars = False

Backwards Compatibility

  • Only affects behavior when force_full_page_ocr=True
  • Existing behavior for normal OCR (bitmap-coverage-based) is unchanged
  • Uses existing fallback logic in TableStructureModel

Testing

Tested with PDFs containing fonts with missing ToUnicode CMap that previously produced GLYPH artifacts:

  • Before fix: 1359 document chunks with GLYPH patterns in table content
  • After fix: ~50 clean document chunks with proper OCR-extracted text
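
The counts above came from an ad-hoc check along these lines (the chunks variable and its .text attribute are placeholders for whatever chunking output is being inspected; this script is not part of the PR):

import re

glyph_pattern = re.compile(r"GLYPH<[^>]+>")
# `chunks` is a placeholder for the chunk list produced downstream
# (e.g. by a chunker run over the converted document).
dirty = [c for c in chunks if glyph_pattern.search(c.text)]
print(f"{len(dirty)} of {len(chunks)} chunks contain GLYPH artifacts")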

Related Code Paths

  • docling/models/page_preprocessing_model.py - Creates initial parsed_page
  • docling/models/table_structure_model.py - Uses word_cells for table extraction
  • docling/backend/docling_parse_v4_backend.py - PDF backend that provides word_cells

Discussion: Alternative Approach

This fix takes a conservative approach by clearing all word/char cells when force_full_page_ocr=True. However, I recognize that word-level cells provide better accuracy for table extraction when they contain valid text - that's why docling prefers them over textline cells.

An alternative, more surgical approach would be to detect GLYPH patterns in the cells themselves and only fall back to OCR when garbage is detected. This could be implemented in TableStructureModel:

import re
from typing import List  # TextCell is docling-core's text cell type

# Compile the pattern once instead of on every call.
GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def _contains_glyph_artifacts(self, cells: List[TextCell]) -> bool:
    """Check whether any cell text contains GLYPH extraction artifacts."""
    return any(GLYPH_PATTERN.search(c.text) for c in cells)

# In predict_tables():
tcells = sp.get_cells_in_bbox(cell_unit=TextCellUnit.WORD, bbox=table_cluster.bbox)
if len(tcells) == 0 or self._contains_glyph_artifacts(tcells):
    tcells = table_cluster.cells  # Fall back to OCR cells

Trade-offs:

| Approach | Pros | Cons |
| --- | --- | --- |
| This PR (clear cells in OCR model) | Simple, minimal code change, semantic consistency | Loses word-level accuracy even for valid portions of the page |
| GLYPH detection (in TableStructureModel) | Preserves word-level accuracy where possible | More code, requires pattern matching, may miss edge cases |

The GLYPH detection approach is more granular but requires changes in TableStructureModel rather than the OCR model. I went with the simpler fix since force_full_page_ocr already signals that PDF text extraction is unreliable for the entire document.

Open to feedback - if you guys prefer the GLYPH detection approach or have other suggestions, I'm happy to revise.

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells were not cleared, causing downstream components like
TableStructureModel to use unreliable PDF-extracted text containing
GLYPH artifacts (e.g., GLYPH<c=1,font=/AAAAAH+font000000002ed64673>).

This fix clears word_cells and char_cells when force_full_page_ocr
is enabled, ensuring TableStructureModel falls back to the OCR-
extracted textline cells via its existing fallback logic.

Fixes issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.
Contributor

github-actions bot commented Dec 5, 2025

DCO Check Passed

Thanks @tripflex, all your commits are properly signed off. 🎉


dosubot bot commented Dec 5, 2025

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.



mergify bot commented Dec 5, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@dolfim-ibm dolfim-ibm requested a review from cau-git December 8, 2025 07:17
dolfim-ibm previously approved these changes Dec 8, 2025
Contributor

@dolfim-ibm dolfim-ibm left a comment


lgtm


codecov bot commented Dec 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.


@dolfim-ibm dolfim-ibm dismissed their stale review December 8, 2025 08:11

needs some improvement

Contributor

@cau-git cau-git left a comment


@tripflex this solution could be potentially destructive, since OCR engines may also provide word- and char-level cells.

I would propose to instead scan these lists for TextCell instances where TextCell.from_ocr == False and throw these out.

…r is used

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells from the PDF backend were not handled, causing downstream
components like TableStructureModel to use unreliable PDF-extracted
text containing GLYPH artifacts.

Instead of clearing all word/char cells (which would be destructive
for backends like mets_gbs that provide OCR-generated word cells),
this fix filters out only cells where from_ocr=False, preserving any
OCR-generated cells.

This ensures TableStructureModel falls back to the OCR-extracted
textline cells via its existing fallback logic when word_cells is
empty or only contains OCR cells.

Fixes issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.
Contributor Author

tripflex commented Dec 8, 2025

@tripflex this solution could be potentially destructive, since OCR engines may also provide word- and char-level cells.

I would propose to instead scan these lists for TextCell instances where TextCell.from_ocr == False and throw these out.

Great point, thank you for the feedback! I traced through the code more thoroughly and you're absolutely right.

Key Findings

  1. PdfTextCell.from_ocr is always False - All cells from docling-parse are PdfTextCell instances with from_ocr: Literal[False] = False

  2. Current OCR engines only populate textline_cells - Looking at easyocr, tesseract, rapidocr, etc., they all create cells with from_ocr=True but only add them to textline_cells. None currently populate word_cells or char_cells.

  3. mets_gbs_backend.py (GBS Google Books schema) creates OCR word_cells - This backend creates word_cells with from_ocr=True for historical documents.

Updated Fix

I've updated the PR to filter by from_ocr instead of clearing entirely:

if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = [
        c for c in page.parsed_page.word_cells if c.from_ocr
    ]
    page.parsed_page.char_cells = [
        c for c in page.parsed_page.char_cells if c.from_ocr
    ]
    page.parsed_page.has_words = len(page.parsed_page.word_cells) > 0
    page.parsed_page.has_chars = len(page.parsed_page.char_cells) > 0

This approach:

  • Removes only unreliable PDF-extracted cells (from_ocr=False)
  • Preserves any OCR-generated word/char cells (from_ocr=True)
  • Future-proof if OCR engines start providing word/char level cells

Let me know if this looks good or if you have any other suggestions!

I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: 4197a4e
I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: a4f4e3f

Signed-off-by: Myles McNamara <[email protected]>
@tripflex tripflex requested a review from cau-git December 8, 2025 19:39
@dolfim-ibm dolfim-ibm merged commit 1df0560 into docling-project:main Dec 9, 2025
26 checks passed


Successfully merging this pull request may close these issues.

force_full_page_ocr does not prevent GLYPH artifacts in table content
