force_full_page_ocr does not prevent GLYPH artifacts in table content #2737

@tripflex

Description

When using force_full_page_ocr=True to handle PDFs with problematic fonts, GLYPH artifacts still appear in table content. The OCR step correctly replaces textline_cells, but TableStructureModel reads word_cells from the PDF backend, which still contain the corrupted text.

Steps to Reproduce

  1. Process a PDF with problematic fonts (Type3 fonts or fonts with missing ToUnicode CMap)
  2. Enable force_full_page_ocr=True in OCR options
  3. Extract tables from the document
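For reference, a minimal reproduction sketch of the setup described above. The file name problematic_fonts.pdf is a placeholder, and EasyOcrOptions stands in for whichever OCR engine is configured; force_full_page_ocr comes from the shared OCR options.

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Run OCR on the full page and enable table structure extraction.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Placeholder input: a PDF with Type3 fonts or a missing ToUnicode CMap.
result = converter.convert("problematic_fonts.pdf")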

Expected Behavior

All extracted text, including table content, should use OCR-extracted text when force_full_page_ocr=True.

Actual Behavior

Table content contains GLYPH artifacts like:

GLYPH<c=1,font=/AAAAAH+font000000002ed64673> GLYPH<c=1,font=/AAAAAH+font000000002ed64673>
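A quick way to surface the artifacts is to scan the extracted table cells for the GLYPH marker. This is only a sketch: it reuses result from the reproduction snippet above and assumes the TableItem.data.table_cells layout from docling-core.

import re

GLYPH_RE = re.compile(r"GLYPH<[^>]+>")

# Flag any table cell whose text still carries the backend's GLYPH placeholders.
for table in result.document.tables:
    for cell in table.data.table_cells:
        if cell.text and GLYPH_RE.search(cell.text):
            print(f"GLYPH artifact in {table.self_ref}: {cell.text!r}")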

Root Cause Analysis

I traced through the code and identified the issue:

  1. PagePreprocessingModel populates page.parsed_page with cells from the PDF backend:

    • textline_cells - line-level text (may contain GLYPHs)
    • word_cells - word-level text (may contain GLYPHs)
    • char_cells - character-level text (may contain GLYPHs)
  2. OCR Model (base_ocr_model.py) with force_full_page_ocr=True:

    • Correctly replaces page.parsed_page.textline_cells with OCR text ✅
    • Does NOT clear word_cells or char_cells
  3. TableStructureModel (table_structure_model.py, lines 224-236):

    sp = page._backend.get_segmented_page()
    if sp is not None:
        tcells = sp.get_cells_in_bbox(
            cell_unit=TextCellUnit.WORD,  # Requests word-level cells
            bbox=table_cluster.bbox,
        )
        if len(tcells) == 0:
            tcells = table_cluster.cells  # Only falls back if EMPTY

    The fallback to OCR cells only triggers when word_cells is empty, not when it contains garbage.
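A diagnostic sketch of that divergence, assuming page.parsed_page is still populated on the pages returned by the converter (this may depend on pipeline internals): textline_cells come back clean while word_cells keep the backend text.

import re

GLYPH_RE = re.compile(r"GLYPH<[^>]+>")

def count_glyph_cells(cells):
    # Count cells whose text still contains a GLYPH placeholder.
    return sum(bool(GLYPH_RE.search(c.text)) for c in cells)

for page in result.pages:
    pp = page.parsed_page
    if pp is None:
        continue
    # With force_full_page_ocr=True, textline GLYPHs drop to 0 (OCR replaced them),
    # while word-level GLYPHs remain, which is what the table model ends up reading.
    print(
        f"page {page.page_no}: "
        f"textline GLYPHs={count_glyph_cells(pp.textline_cells)}, "
        f"word GLYPHs={count_glyph_cells(pp.word_cells)}"
    )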

Environment

  • docling version: 2.64.0
  • Python version: 3.12
  • OS: Linux (Docker container)

Possible Solutions

Option A: Clear word/char cells in OCR model (minimal change)

When force_full_page_ocr=True, also clear word_cells and char_cells in post_process_cells():

if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = []
    page.parsed_page.char_cells = []
    page.parsed_page.has_words = False
    page.parsed_page.has_chars = False

Pros: Simple, semantic consistency (if PDF text is unreliable, all levels are unreliable)
Cons: Loses word-level accuracy even for valid portions

Option B: GLYPH detection in TableStructureModel (surgical)

Check for GLYPH patterns before using word cells:

import re

GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def _contains_glyph_artifacts(self, cells):
    return any(GLYPH_PATTERN.search(c.text) for c in cells)

# In predict_tables():
if len(tcells) == 0 or self._contains_glyph_artifacts(tcells):
    tcells = table_cluster.cells

Pros: Preserves word-level accuracy where valid
Cons: More code, pattern matching overhead
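
As a sanity check, the pattern matches the artifact string reported above and leaves ordinary cell text alone; the per-table overhead is a single regex scan over the candidate cells.

import re

glyph_pattern = re.compile(r"GLYPH<[^>]+>")

# Matches the artifact reported above, ignores normal table text.
assert glyph_pattern.search("GLYPH<c=1,font=/AAAAAH+font000000002ed64673>")
assert glyph_pattern.search("Quarterly revenue 123.45") is None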


I have a working fix for Option A that I've tested successfully. Happy to submit a PR or discuss alternative approaches.
