Description
When using `force_full_page_ocr=True` to handle PDFs with problematic fonts, GLYPH artifacts still appear in table content. The OCR model correctly replaces `textline_cells`, but `TableStructureModel` uses `word_cells` from the PDF backend, which still contain the corrupted text.
Steps to Reproduce
- Process a PDF with problematic fonts (Type3 fonts or fonts with a missing ToUnicode CMap)
- Enable `force_full_page_ocr=True` in the OCR options
- Extract tables from the document (see the sketch below)
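A minimal reproduction sketch, assuming the standard docling converter API; `problematic_fonts.pdf` stands in for any affected input:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.ocr_options.force_full_page_ocr = True  # replace PDF text with OCR text

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("problematic_fonts.pdf")  # placeholder for an affected PDF

# Table cells still show GLYPH<...> artifacts despite full-page OCR
print(result.document.export_to_markdown())
```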
Expected Behavior
All extracted text, including table content, should use OCR-extracted text when force_full_page_ocr=True.
Actual Behavior
Table content contains GLYPH artifacts like:

```
GLYPH<c=1,font=/AAAAAH+font000000002ed64673> GLYPH<c=1,font=/AAAAAH+font000000002ed64673>
```
Root Cause Analysis
I traced through the code and identified the issue:
- `PagePreprocessingModel` populates `page.parsed_page` with cells from the PDF backend:
  - `textline_cells` - line-level text (may contain GLYPHs)
  - `word_cells` - word-level text (may contain GLYPHs)
  - `char_cells` - character-level text (may contain GLYPHs)
- The OCR model (`base_ocr_model.py`) with `force_full_page_ocr=True`:
  - Correctly replaces `page.parsed_page.textline_cells` with OCR text ✅
  - Does NOT clear `word_cells` or `char_cells` ❌
- `TableStructureModel` (`table_structure_model.py`, lines 224-236):

  ```python
  sp = page._backend.get_segmented_page()
  if sp is not None:
      tcells = sp.get_cells_in_bbox(
          cell_unit=TextCellUnit.WORD,  # Requests word-level cells
          bbox=table_cluster.bbox,
      )
      if len(tcells) == 0:
          tcells = table_cluster.cells  # Only falls back if EMPTY
  ```

The fallback to OCR cells only triggers when `word_cells` is empty, not when it contains garbage; a toy illustration of this follows.
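To make the failure mode concrete, here is a self-contained toy version of that guard, using stand-in objects rather than docling's real cell classes:

```python
from types import SimpleNamespace

# Stand-ins: corrupted word-level cells from the PDF backend vs. clean OCR
# cells from the table cluster (not docling's actual cell types)
word_cells = [SimpleNamespace(text="GLYPH<c=1,font=/AAAAAH+font000000002ed64673>")]
ocr_cells = [SimpleNamespace(text="Revenue 2023")]

tcells = word_cells   # get_cells_in_bbox() returned non-empty (but garbage) cells
if len(tcells) == 0:  # the empty-only guard from table_structure_model.py
    tcells = ocr_cells  # never reached: the cells are garbage, not missing

print([c.text for c in tcells])  # GLYPH garbage flows into table structure prediction
```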
Environment
- docling version: 2.64.0
- Python version: 3.12
- OS: Linux (Docker container)
Possible Solutions
Option A: Clear word/char cells in OCR model (minimal change)
When `force_full_page_ocr=True`, also clear `word_cells` and `char_cells` in `post_process_cells()`:

```python
if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = []
    page.parsed_page.char_cells = []
    page.parsed_page.has_words = False
    page.parsed_page.has_chars = False
```

Pros: Simple, semantic consistency (if the PDF text is unreliable, all levels are unreliable)
Cons: Loses word-level accuracy even for valid portions
Option B: GLYPH detection in TableStructureModel (surgical)
Check for GLYPH patterns before using word cells:
```python
import re

GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def _contains_glyph_artifacts(self, cells):
    return any(GLYPH_PATTERN.search(c.text) for c in cells)

# In predict_tables():
if len(tcells) == 0 or self._contains_glyph_artifacts(tcells):
    tcells = table_cluster.cells
```

Pros: Preserves word-level accuracy where valid
Cons: More code, pattern-matching overhead
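For completeness, a pytest-style sanity check for the detector, using a simple stand-in cell with a `.text` attribute rather than docling's actual cell class:

```python
import re
from types import SimpleNamespace

GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def contains_glyph_artifacts(cells):
    """Return True if any cell's text contains a GLYPH<...> artifact."""
    return any(GLYPH_PATTERN.search(c.text) for c in cells)

def test_detects_glyph_artifacts():
    dirty = [SimpleNamespace(text="GLYPH<c=1,font=/AAAAAH+font000000002ed64673>")]
    clean = [SimpleNamespace(text="Revenue 2023")]
    assert contains_glyph_artifacts(dirty)
    assert not contains_glyph_artifacts(clean)
```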
I have a working fix for Option A that I've tested successfully. Happy to submit a PR or discuss alternative approaches.