
Conversation

Contributor

@tripflex tripflex commented Dec 5, 2025

Fixes #2737

Summary

When force_full_page_ocr=True, the OCR model correctly replaces textline_cells with OCR-extracted text. However, word_cells and char_cells from the PDF backend are not cleared, causing downstream components (specifically TableStructureModel) to use unreliable PDF-extracted text that contains GLYPH artifacts.

Problem

PDFs with problematic fonts (Type3 fonts, fonts with missing ToUnicode CMap) produce unreadable text when extracted programmatically. The text appears as:

GLYPH<c=1,font=/AAAAAH+font000000002ed64673> GLYPH<c=1,font=/AAAAAH+font000000002ed64673>

The existing force_full_page_ocr option correctly triggers full-page OCR to handle these cases, and the OCR model properly replaces page.parsed_page.textline_cells with clean OCR-extracted text.
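
For reference, here is a minimal sketch of how this option is typically enabled through the pipeline options (class and option names reflect my understanding of the docling API; the input path is a placeholder):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Run OCR over the entire page rather than only bitmap regions, so that
# unreadable PDF-extracted text is replaced by OCR output.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("document_with_broken_fonts.pdf")  # placeholder path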

However, the TableStructureModel in docling/models/table_structure_model.py (lines 224-236) prefers word-level cells for better table cell matching accuracy:

# Check if word-level cells are available from backend:
sp = page._backend.get_segmented_page()
if sp is not None:
    tcells = sp.get_cells_in_bbox(
        cell_unit=TextCellUnit.WORD,
        bbox=table_cluster.bbox,
    )
    if len(tcells) == 0:
        # In case word-level cells yield empty
        tcells = table_cluster.cells  # Falls back to textline cells

Since word_cells is populated but contains GLYPH garbage, the fallback to table_cluster.cells (which contains clean OCR text) never triggers.

Root Cause

  1. PagePreprocessingModel stores page.parsed_page = page._backend.get_segmented_page() containing:

    • textline_cells: populated from PDF (may contain GLYPHs)
    • word_cells: populated from PDF (may contain GLYPHs)
    • char_cells: populated from PDF (may contain GLYPHs)
  2. OCR model with force_full_page_ocr=True replaces only textline_cells

  3. TableStructureModel requests word_cells which still contains GLYPH patterns

  4. The fallback to OCR cells only triggers when word_cells is empty, not when it contains garbage
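
To make the failure mode concrete, inspecting the parsed page after OCR would show something like this (attribute names are taken from the steps above; the printed values are illustrative only):

sp = page.parsed_page
print(sp.textline_cells[0].text)  # e.g. "Total revenue" (clean OCR replacement)
print(sp.word_cells[0].text)      # e.g. "GLYPH<c=1,font=/AAAAAH+font000000002ed64673>" (stale PDF text)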

Solution

When force_full_page_ocr=True, also clear word_cells and char_cells in the post_process_cells() method. This ensures:

  1. Semantic consistency: if PDF text extraction is unreliable enough to require full-page OCR, ALL extraction levels are unreliable
  2. TableStructureModel's existing fallback logic triggers correctly
  3. No changes required in TableStructureModel or other components

Changes

docling/models/base_ocr_model.py - BaseOcrModel.post_process_cells():

# When force_full_page_ocr is used, word/char-level cells from PDF
# are also unreliable. Clear them so downstream components (e.g., table
# structure model) fall back to OCR-extracted textline cells.
if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = []
    page.parsed_page.char_cells = []
    page.parsed_page.has_words = False
    page.parsed_page.has_chars = False

Backwards Compatibility

  • Only affects behavior when force_full_page_ocr=True
  • Existing behavior for normal OCR (bitmap-coverage-based) is unchanged
  • Uses existing fallback logic in TableStructureModel

Testing

Tested with PDFs containing fonts with missing ToUnicode CMap that previously produced GLYPH artifacts:

  • Before fix: 1359 document chunks with GLYPH patterns in table content
  • After fix: ~50 clean document chunks with proper OCR-extracted text
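
The counts above came from an ad-hoc check along these lines (the chunks variable and its .text attribute are placeholders for whatever chunking output is being inspected; this script is not part of the PR):

import re

glyph_pattern = re.compile(r"GLYPH<[^>]+>")
# `chunks` is a placeholder for the chunk list produced downstream
# (e.g. by a chunker run over the converted document).
dirty = [c for c in chunks if glyph_pattern.search(c.text)]
print(f"{len(dirty)} of {len(chunks)} chunks contain GLYPH artifacts")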

Related Code Paths

  • docling/models/page_preprocessing_model.py - Creates initial parsed_page
  • docling/models/table_structure_model.py - Uses word_cells for table extraction
  • docling/backend/docling_parse_v4_backend.py - PDF backend that provides word_cells

Discussion: Alternative Approach

This fix takes a conservative approach by clearing all word/char cells when force_full_page_ocr=True. However, I recognize that word-level cells provide better accuracy for table extraction when they contain valid text - that's why docling prefers them over textline cells.

An alternative, more surgical approach would be to detect GLYPH patterns in the cells themselves and only fall back to OCR when garbage is detected. This could be implemented in TableStructureModel:

import re
from typing import List  # TextCell is docling-core's text cell type

# Compile the pattern once instead of on every call.
GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def _contains_glyph_artifacts(self, cells: List[TextCell]) -> bool:
    """Check whether any cell text contains GLYPH extraction artifacts."""
    return any(GLYPH_PATTERN.search(c.text) for c in cells)

# In predict_tables():
tcells = sp.get_cells_in_bbox(cell_unit=TextCellUnit.WORD, bbox=table_cluster.bbox)
if len(tcells) == 0 or self._contains_glyph_artifacts(tcells):
    tcells = table_cluster.cells  # Fall back to OCR cells

Trade-offs:

| Approach | Pros | Cons |
| --- | --- | --- |
| This PR (clear cells in OCR model) | Simple, minimal code change, semantic consistency | Loses word-level accuracy even for valid portions of the page |
| GLYPH detection (in TableStructureModel) | Preserves word-level accuracy where possible | More code, requires pattern matching, may miss edge cases |

The GLYPH detection approach is more granular but requires changes in TableStructureModel rather than the OCR model. I went with the simpler fix since force_full_page_ocr already signals that PDF text extraction is unreliable for the entire document.

Open to feedback - if you guys prefer the GLYPH detection approach or have other suggestions, I'm happy to revise.

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells were not cleared, causing downstream components like
TableStructureModel to use unreliable PDF-extracted text containing
GLYPH artifacts (e.g., GLYPH<c=1,font=/AAAAAH+font000000002ed64673>).

This fix clears word_cells and char_cells when force_full_page_ocr
is enabled, ensuring TableStructureModel falls back to the OCR-
extracted textline cells via its existing fallback logic.

Fixes issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.
Contributor

github-actions bot commented Dec 5, 2025

DCO Check Passed

Thanks @tripflex, all your commits are properly signed off. 🎉


dosubot bot commented Dec 5, 2025

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.



mergify bot commented Dec 5, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@dolfim-ibm dolfim-ibm requested a review from cau-git December 8, 2025 07:17
dolfim-ibm previously approved these changes Dec 8, 2025
Contributor

@dolfim-ibm dolfim-ibm left a comment


lgtm


codecov bot commented Dec 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.


@dolfim-ibm dolfim-ibm dismissed their stale review December 8, 2025 08:11

needs some improvement

Contributor

@cau-git cau-git left a comment


@tripflex this solution could be potentially destructive, since OCR engines may also provide word- and char-level cells.

I would propose to instead scan these lists for TextCell instances where TextCell.from_ocr == False and throw these out.

…r is used

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells from the PDF backend were not handled, causing downstream
components like TableStructureModel to use unreliable PDF-extracted
text containing GLYPH artifacts.

Instead of clearing all word/char cells (which would be destructive
for backends like mets_gbs that provide OCR-generated word cells),
this fix filters out only cells where from_ocr=False, preserving any
OCR-generated cells.

This ensures TableStructureModel falls back to the OCR-extracted
textline cells via its existing fallback logic when word_cells is
empty or only contains OCR cells.

Fixes issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.
Contributor Author

tripflex commented Dec 8, 2025

@tripflex this solution could be potentially destructive, since OCR engines may also provide word- and char-level cells.

I would propose to instead scan these lists for TextCell instances where TextCell.from_ocr == False and throw these out.

Great point, thank you for the feedback! I traced through the code more thoroughly and you're absolutely right.

Key Findings

  1. PdfTextCell.from_ocr is always False - All cells from docling-parse are PdfTextCell instances with from_ocr: Literal[False] = False

  2. Current OCR engines only populate textline_cells - Looking at easyocr, tesseract, rapidocr, etc., they all create cells with from_ocr=True but only add them to textline_cells. None currently populate word_cells or char_cells.

  3. mets_gbs_backend.py (GBS Google Books schema) creates OCR word_cells - This backend creates word_cells with from_ocr=True for historical documents.

Updated Fix

I've updated the PR to filter by from_ocr instead of clearing entirely:

if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = [
        c for c in page.parsed_page.word_cells if c.from_ocr
    ]
    page.parsed_page.char_cells = [
        c for c in page.parsed_page.char_cells if c.from_ocr
    ]
    page.parsed_page.has_words = len(page.parsed_page.word_cells) > 0
    page.parsed_page.has_chars = len(page.parsed_page.char_cells) > 0

This approach:

  • Removes only unreliable PDF-extracted cells (from_ocr=False)
  • Preserves any OCR-generated word/char cells (from_ocr=True)
  • Future-proof if OCR engines start providing word/char level cells

Let me know if this looks good or if you have any other suggestions!

I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: 4197a4e
I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: a4f4e3f

Signed-off-by: Myles McNamara <[email protected]>
@tripflex tripflex requested a review from cau-git December 8, 2025 19:39
@dolfim-ibm dolfim-ibm merged commit 1df0560 into docling-project:main Dec 9, 2025
26 checks passed


Successfully merging this pull request may close these issues.

force_full_page_ocr does not prevent GLYPH artifacts in table content
