Description
When using `force_full_page_ocr=True` to handle PDFs with problematic fonts, GLYPH artifacts still appear in table content. The OCR model correctly replaces `textline_cells`, but `TableStructureModel` uses `word_cells` from the PDF backend, which still contain the corrupted text.
Steps to Reproduce
- Process a PDF with problematic fonts (Type3 fonts or fonts with a missing ToUnicode CMap)
- Enable `force_full_page_ocr=True` in the OCR options
- Extract tables from the document (see the sketch below)
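A minimal reproduction sketch, assuming the standard docling converter API; `problematic_fonts.pdf` stands in for any affected input:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.ocr_options.force_full_page_ocr = True  # replace PDF text with OCR text

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("problematic_fonts.pdf")  # placeholder for an affected PDF

# Table cells still show GLYPH<...> artifacts despite full-page OCR
print(result.document.export_to_markdown())
```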
Expected Behavior
All extracted text, including table content, should use OCR-extracted text when force_full_page_ocr=True.
Actual Behavior
Table content contains GLYPH artifacts like:

```
GLYPH<c=1,font=/AAAAAH+font000000002ed64673> GLYPH<c=1,font=/AAAAAH+font000000002ed64673>
```
Root Cause Analysis
I traced through the code and identified the issue:
- `PagePreprocessingModel` populates `page.parsed_page` with cells from the PDF backend:
  - `textline_cells` - line-level text (may contain GLYPHs)
  - `word_cells` - word-level text (may contain GLYPHs)
  - `char_cells` - character-level text (may contain GLYPHs)
- The OCR model (`base_ocr_model.py`) with `force_full_page_ocr=True`:
  - Correctly replaces `page.parsed_page.textline_cells` with OCR text ✅
  - Does NOT clear `word_cells` or `char_cells` ❌
- `TableStructureModel` (`table_structure_model.py`, lines 224-236):

  ```python
  sp = page._backend.get_segmented_page()
  if sp is not None:
      tcells = sp.get_cells_in_bbox(
          cell_unit=TextCellUnit.WORD,  # Requests word-level cells
          bbox=table_cluster.bbox,
      )
      if len(tcells) == 0:
          tcells = table_cluster.cells  # Only falls back if EMPTY
  ```

The fallback to OCR cells only triggers when `word_cells` is empty, not when it contains garbage; a toy illustration of this follows.
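To make the failure mode concrete, here is a self-contained toy version of that guard, using stand-in objects rather than docling's real cell classes:

```python
from types import SimpleNamespace

# Stand-ins: corrupted word-level cells from the PDF backend vs. clean OCR
# cells from the table cluster (not docling's actual cell types)
word_cells = [SimpleNamespace(text="GLYPH<c=1,font=/AAAAAH+font000000002ed64673>")]
ocr_cells = [SimpleNamespace(text="Revenue 2023")]

tcells = word_cells   # get_cells_in_bbox() returned non-empty (but garbage) cells
if len(tcells) == 0:  # the empty-only guard from table_structure_model.py
    tcells = ocr_cells  # never reached: the cells are garbage, not missing

print([c.text for c in tcells])  # GLYPH garbage flows into table structure prediction
```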
Environment
- docling version: 2.64.0
- Python version: 3.12
- OS: Linux (Docker container)
Possible Solutions
Option A: Clear word/char cells in OCR model (minimal change)
When `force_full_page_ocr=True`, also clear `word_cells` and `char_cells` in `post_process_cells()`:

```python
if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = []
    page.parsed_page.char_cells = []
    page.parsed_page.has_words = False
    page.parsed_page.has_chars = False
```

Pros: Simple, semantic consistency (if the PDF text is unreliable, all levels are unreliable)
Cons: Loses word-level accuracy even for valid portions
Option B: GLYPH detection in TableStructureModel (surgical)
Check for GLYPH patterns before using word cells:
```python
import re

GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def _contains_glyph_artifacts(self, cells):
    return any(GLYPH_PATTERN.search(c.text) for c in cells)

# In predict_tables():
if len(tcells) == 0 or self._contains_glyph_artifacts(tcells):
    tcells = table_cluster.cells
```

Pros: Preserves word-level accuracy where valid
Cons: More code, pattern-matching overhead
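For completeness, a pytest-style sanity check for the detector, using a simple stand-in cell with a `.text` attribute rather than docling's actual cell class:

```python
import re
from types import SimpleNamespace

GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def contains_glyph_artifacts(cells):
    """Return True if any cell's text contains a GLYPH<...> artifact."""
    return any(GLYPH_PATTERN.search(c.text) for c in cells)

def test_detects_glyph_artifacts():
    dirty = [SimpleNamespace(text="GLYPH<c=1,font=/AAAAAH+font000000002ed64673>")]
    clean = [SimpleNamespace(text="Revenue 2023")]
    assert contains_glyph_artifacts(dirty)
    assert not contains_glyph_artifacts(clean)
```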
I have a working fix for Option A that I've tested successfully. Happy to submit a PR or discuss alternative approaches.