This repository contains scripts for training and using YOLO11 models for text line segmentation in historical documents.
`convert_page_to_yolo.py` converts PAGE-XML annotations to YOLO format for segmentation training.
python convert_page_to_yolo.py input_dir output_dir --target-height 640 --element-type textline
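A YOLO segmentation label file has one line per polygon: a class id followed by normalized x y pairs. Below is a minimal sketch of the PAGE-XML side of the conversion, assuming a standard PAGE layout (`Page` with `imageWidth`/`imageHeight`, `TextLine` elements with a `Coords` child); the real script additionally handles resizing to `--target-height` and other `--element-type` values.

```python
# Minimal sketch of PAGE-XML -> YOLO segmentation labels (single class 0 = textline).
# The actual convert_page_to_yolo.py also handles resizing and other element types.
import xml.etree.ElementTree as ET
from pathlib import Path

def _local(tag: str) -> str:
    """Strip the versioned PAGE namespace, e.g. '{...}TextLine' -> 'TextLine'."""
    return tag.split("}")[-1]

def page_to_yolo(page_xml: Path, label_out: Path, class_id: int = 0) -> None:
    root = ET.parse(page_xml).getroot()
    page = next(e for e in root.iter() if _local(e.tag) == "Page")
    w, h = int(page.get("imageWidth")), int(page.get("imageHeight"))

    lines = []
    for textline in (e for e in root.iter() if _local(e.tag) == "TextLine"):
        coords = next(c for c in textline if _local(c.tag) == "Coords")
        points = [tuple(map(float, p.split(","))) for p in coords.get("points").split()]
        # YOLO segmentation format: "class x1 y1 x2 y2 ..." with coordinates in [0, 1]
        flat = " ".join(f"{x / w:.6f} {y / h:.6f}" for x, y in points)
        lines.append(f"{class_id} {flat}")

    label_out.write_text("\n".join(lines) + "\n")
```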
`convert_alto_to_yolo.py` converts ALTO-XML annotations to YOLO format for segmentation training.

python convert_alto_to_yolo.py input_dir output_dir --target-height 640 --element-type textline
`visualize_masks.py` visualizes YOLO segmentation masks on images.

python visualize_masks.py --dataset /path/to/dataset --output-dir /path/to/output
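For reference, this is roughly what such an overlay amounts to for a single image; the actual `visualize_masks.py` walks the whole dataset and has its own options.

```python
# Rough sketch of overlaying one image's YOLO segmentation labels
# (the real visualize_masks.py iterates a whole dataset and has its own options).
import cv2
import numpy as np

def overlay_labels(image_path: str, label_path: str, out_path: str, alpha: float = 0.4) -> None:
    image = cv2.imread(image_path)
    h, w = image.shape[:2]
    overlay = image.copy()

    with open(label_path) as f:
        for line in f:
            parts = line.split()
            coords = np.array(parts[1:], dtype=float).reshape(-1, 2)  # normalized x, y pairs
            polygon = (coords * [w, h]).astype(np.int32)              # back to pixel coordinates
            cv2.fillPoly(overlay, [polygon], color=(0, 255, 0))       # green mask

    cv2.imwrite(out_path, cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0))
```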
`train.py` is the basic training script for YOLO11 segmentation models.

python train.py \
--dataset /path/to/dataset \
--model-size m \
--batch-size 8 \
--epochs 100 \
--pretrained \
--val \
--plots
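Both training scripts presumably wrap the Ultralytics Python API; a minimal call roughly equivalent to the command above (the exact arguments set by train.py may differ):

```python
# Minimal Ultralytics call roughly equivalent to the train.py command above;
# the actual script parses CLI flags and may set more arguments.
from ultralytics import YOLO

model = YOLO("yolo11m-seg.pt")              # pretrained medium segmentation checkpoint
model.train(
    data="/path/to/dataset/dataset.yaml",
    epochs=100,
    batch=8,
    imgsz=640,
    val=True,                               # validate during training
    plots=True,                             # save training curves and example batches
)
```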
`train_improved.py` is an enhanced training script with improved augmentation and training parameters.

python train_improved.py \
--dataset /path/to/dataset \
--model-size m \
--batch-size 12 \
--epochs 100 \
--pretrained \
--val \
--plots

Key improvements in train_improved.py (a minimal sketch of these settings follows the list):
- Enhanced augmentation (mosaic, mixup, copy-paste)
- Better learning rate scheduling
- Improved regularization
- Optimized for segmentation performance
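As an illustration of how such settings map onto Ultralytics `train()` arguments (the concrete values used by train_improved.py may differ):

```python
# Illustrative mapping of the improvements above onto Ultralytics train() arguments;
# the concrete values used by train_improved.py may differ.
from ultralytics import YOLO

model = YOLO("yolo11m-seg.pt")
model.train(
    data="/path/to/dataset/dataset.yaml",
    epochs=100,
    batch=12,
    imgsz=640,
    mosaic=1.0,           # mosaic augmentation
    mixup=0.1,            # mixup augmentation
    copy_paste=0.1,       # copy-paste augmentation (segmentation tasks)
    cos_lr=True,          # cosine learning rate schedule
    weight_decay=0.0005,  # regularization
)
```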
`app.py` provides an interactive Gradio web interface for model inference (a stripped-down sketch follows the feature list).
python app.py

Features:
- Lists all available model checkpoints from runs/train/
- Upload images for prediction
- Toggle between mask and bounding box visualization
- Adjust confidence threshold
- Real-time visualization
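A stripped-down version of such an interface could look like the following; the checkpoint path is a placeholder, and the real app.py discovers checkpoints under runs/train/ and adds the mask/box toggle.

```python
# Stripped-down Gradio interface for YOLO11 segmentation inference;
# the checkpoint path is a placeholder, and the real app.py adds checkpoint
# selection, a mask/box toggle, and more.
import gradio as gr
from ultralytics import YOLO

model = YOLO("runs/train/exp/weights/best.pt")  # placeholder checkpoint path

def predict(image, conf):
    result = model.predict(image, conf=conf)[0]
    return result.plot()[..., ::-1]             # BGR -> RGB for display

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="numpy"),
            gr.Slider(0.05, 0.95, value=0.25, label="Confidence threshold")],
    outputs=gr.Image(label="Prediction"),
    title="YOLO11 text line segmentation",
)

if __name__ == "__main__":
    demo.launch()
```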
`plot_yolo_metrics.py` creates a plot from the metrics in a YOLO training `results.csv` file (see the output in sections 4-5).
python plot_yolo_metrics.py runs/segment/sam_yolo11-seg/results.csv plot.png
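If you need a custom plot, results.csv can also be read directly. A minimal sketch, assuming the standard Ultralytics column names (the real plot_yolo_metrics.py may plot other metrics):

```python
# Minimal sketch of plotting metrics from an Ultralytics results.csv;
# column names assume standard Ultralytics segmentation output.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("runs/segment/sam_yolo11-seg/results.csv")
df.columns = df.columns.str.strip()   # some Ultralytics versions pad column names with spaces

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(df["epoch"], df["train/box_loss"], label="box loss")
ax.plot(df["epoch"], df["train/seg_loss"], label="seg loss")
ax.plot(df["epoch"], df["metrics/mAP50(M)"], label="mAP50 (mask)")
ax.set_xlabel("epoch")
ax.legend()
fig.savefig("plot.png", dpi=150)
```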
The dataset should be organized as follows:

dataset/
├── images/
│ ├── train/
│ └── val/
├── labels/
│ ├── train/
│ └── val/
└── dataset.yaml
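As a convenience (not part of the repository), a quick check that this layout is in place before training:

```python
# Convenience sketch (not part of the repository): verify the expected layout.
from pathlib import Path

def check_dataset(root: str) -> None:
    base = Path(root)
    for rel in ("images/train", "images/val", "labels/train", "labels/val", "dataset.yaml"):
        path = base / rel
        print(f"{'ok' if path.exists() else 'MISSING':8s} {path}")

check_dataset("/path/to/dataset")
```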
The dataset.yaml file should contain:
path: /path/to/dataset
train: images/train
val: images/val
names:
  0: textline

Training reports the following metrics:

- Box Loss: Detection accuracy
- Mask Loss: Segmentation quality
- Precision: Accuracy of detections
- Recall: Coverage of text lines
- mAP50: Mean Average Precision at IoU 0.50
- mAP50-95: Mean Average Precision averaged over IoU thresholds from 0.50 to 0.95
Visualization features (a standalone sketch of the mask/box overlay follows this list):

- Green masks for text lines
- Red bounding boxes (optional)
- Confidence scores
- Interactive web interface
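The same overlay can be reproduced outside the web interface from the raw prediction results; a sketch with an illustrative checkpoint path and threshold:

```python
# Sketch: reproduce the green-mask / red-box overlay from raw predictions.
# Checkpoint path and confidence threshold are illustrative.
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("runs/train/exp/weights/best.pt")      # placeholder checkpoint path
result = model.predict("page.jpg", conf=0.25)[0]

image = result.orig_img.copy()
overlay = image.copy()
if result.masks is not None:
    for polygon in result.masks.xy:                 # one (N, 2) pixel-space polygon per line
        cv2.fillPoly(overlay, [polygon.astype(np.int32)], color=(0, 255, 0))   # green mask
image = cv2.addWeighted(overlay, 0.4, image, 0.6, 0)

for box, conf in zip(result.boxes.xyxy, result.boxes.conf):
    x1, y1, x2, y2 = map(int, box)
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 255), 2)                   # red box
    cv2.putText(image, f"{float(conf):.2f}", (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)                 # confidence score
cv2.imwrite("overlay.jpg", image)
```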
- The model is trained for single-class text line segmentation
- Supports various YOLO11 model sizes (n, s, m, l, x)
- Automatic mixed precision training is enabled
- Cosine learning rate scheduling is used
- Data augmentation is optimized for document images
- NVIDIA GPU with at least 12GB VRAM recommended
- Batch size should be adjusted based on available GPU memory
- For an RTX 3060 (12 GB), the recommended batch size is 8-12 for YOLO11m
The model achieves high accuracy in text line segmentation with:
- High precision and recall
- Accurate mask boundaries
- Good handling of various text line orientations
- Robust performance on different document styles
- The conversion script preserves original polygon shapes without padding
- Training uses single-class segmentation for text lines
- The model supports various sizes (nano to xlarge) for different performance requirements
We compared different training configurations to find the optimal setup for text line segmentation. Here are the results:
Experiment 1:

- Configuration:
- Model: YOLO11m
- Batch size: 8
- Optimizer: AdamW
- Final metrics:
- Box Loss: 0.713
- Seg Loss: 2.057
- mAP50(B): 0.992
- mAP50(M): 0.912
- Training time: ~3751 seconds
Experiment 2:

- Configuration:
- Model: YOLO11m
- Batch size: 12
- Optimizer: AdamW
- Enhanced augmentation
- Final metrics:
- Box Loss: 0.314
- Seg Loss: 0.915
- mAP50(B): 0.989
- mAP50(M): 0.911
- Training time: ~14468 seconds
Experiment 3 (exp_improved3):

- Configuration:
- Model: YOLO11s
- Batch size: 12
- Optimizer: AdamW
- Enhanced augmentation
- Final metrics:
- Box Loss: 0.291
- Seg Loss: 0.891
- mAP50(B): 0.991
- mAP50(M): 0.913
- Training time: ~12000 seconds
Experiment 4 (exp_improved4):

- Configuration:
- Model: YOLO11s
- Batch size: 12
- Optimizer: SGD
- Enhanced augmentation
- Final metrics:
- Box Loss: 0.285
- Seg Loss: 0.887
- mAP50(B): 0.992
- mAP50(M): 0.914
- Training time: ~11000 seconds
Key findings:

- Model Size: YOLO11s performed better than YOLO11m for this small dataset, suggesting that smaller models can be more effective for limited data.
- Optimizer: SGD provided slightly better results than AdamW for the segmentation task, with:
- 2% better box loss
- 0.4% better segmentation loss
- 0.1% better mAP50 scores
- Training Efficiency: SGD training was faster and more stable than AdamW.
- Best Configuration: YOLO11s with SGD optimizer and batch size 12 achieved the best overall performance.
Recommendations:

- For small datasets (<1000 images): Use YOLO11s
- For segmentation tasks: Prefer SGD over AdamW (see the sketch after this list)
- Use batch size 12 for optimal performance
- Apply enhanced augmentation techniques for better generalization
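For reference, the optimizer is a single argument in the Ultralytics `train()` call; this mirrors the recommended YOLO11s + SGD + batch 12 setup (other arguments of the actual runs may have differed):

```python
# The optimizer is a single Ultralytics train() argument; this mirrors the
# recommended YOLO11s + SGD + batch 12 setup (other arguments of the actual
# runs may have differed).
from ultralytics import YOLO

model = YOLO("yolo11s-seg.pt")
model.train(
    data="/path/to/dataset/dataset.yaml",
    epochs=100,
    batch=12,
    imgsz=640,
    optimizer="SGD",      # instead of "AdamW"
    cos_lr=True,
)
```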
| Metric | AdamW (exp_improved3) | SGD (exp_improved4) | Improvement |
|---|---|---|---|
| Box Loss | 0.291 | 0.285 | +2.1% |
| Seg Loss | 0.891 | 0.887 | +0.4% |
| mAP50(B) | 0.991 | 0.992 | +0.1% |
| mAP50(M) | 0.913 | 0.914 | +0.1% |
| Training Time | ~12000s | ~11000s | -8.3% |
Summary: SGD optimizer provided marginal but consistent improvements across all metrics while being faster to train. The differences, though small, suggest SGD is better suited for segmentation tasks.

