Fork of alignment-handbook
- Follow the installation instructions below to set up your environment.
1- dalla-data-processing
A comprehensive Arabic data processing pipeline with deduplication, stemming, quality checking, readability scoring, and dataset packing, used for the DALLA models.
2- dalla-sentencepiece-tokenizer-manipulation
Extend existing tokenizers with new vocabulary from custom training data. Train a new tokenizer on your domain-specific data and merge it with any SentencePiece tokenizer (e.g., Gemma) by replacing unused language tokens.
3- r-bpe
Extend existing tokenizers with new vocabulary from custom training data. Train a new tokenizer on your domain-specific data and merge it with any BPE tokenizer (e.g., Llama) by replacing unused language tokens. This repo is also needed to use the R-BPE tokenizers.
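The core idea shared by both tokenizer repos, replacing unused tokens in a base vocabulary with new domain-specific tokens so the vocabulary size (and therefore the embedding shape) stays fixed, can be sketched in plain Python. Names here are illustrative only, not the repos' actual API; the real tools also handle merges, scores, and tokenizer serialization:

```python
def replace_unused_tokens(base_vocab, unused_ids, new_tokens):
    """Overwrite unused token slots with new domain-specific tokens,
    keeping the vocabulary size (and embedding shape) unchanged."""
    assert len(new_tokens) <= len(unused_ids), "not enough unused slots"
    vocab = list(base_vocab)
    for token_id, token in zip(unused_ids, new_tokens):
        vocab[token_id] = token
    return vocab

# Hypothetical base vocabulary with two unused language-token slots.
base = ["<pad>", "hello", "<unused_0>", "<unused_1>", "world"]
merged = replace_unused_tokens(base, unused_ids=[2, 3], new_tokens=["سلام", "دنيا"])
print(merged)  # ['<pad>', 'hello', 'سلام', 'دنيا', 'world']
```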
1- Start by using dalla-data-processing for data cleaning (deduplication, quality checking, readability scoring, and stemming).
2- Edit the original tokenizer of the model you are starting from: if it is a SentencePiece tokenizer, use dalla-sentencepiece-tokenizer-manipulation; if it is a BPE tokenizer, use r-bpe.
3- Pack your training data using dataset-packing, which packs the training dataset to fit the maximum sequence length you choose.
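Step 3's packing can be sketched as greedy concatenation of tokenized documents into fixed-length blocks. This is a minimal illustration of the idea, not the dataset-packing repo's actual implementation, which may differ (for example in how it separates documents or handles the final partial block):

```python
def pack_sequences(token_seqs, max_seq_len):
    """Greedily concatenate tokenized documents into blocks of max_seq_len tokens."""
    packed, buf = [], []
    for seq in token_seqs:
        buf.extend(seq)
        # Emit full blocks as soon as the buffer is long enough.
        while len(buf) >= max_seq_len:
            packed.append(buf[:max_seq_len])
            buf = buf[max_seq_len:]
    if buf:
        packed.append(buf)  # trailing partial block (could be padded or dropped)
    return packed

blocks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```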
Sample datasets
To run the code in this project, first create a Python virtual environment, e.g. using uv:
```shell
uv venv handbook --python 3.11 && source handbook/bin/activate && uv pip install --upgrade pip
```

Tip: To install uv, follow the UV Installation Guide.
Next, install PyTorch v2.4.1:

```shell
uv pip install torch==2.4.1 --index-url https://download.pytorch.org/whl/cu121
```

You can then install the remaining package dependencies as follows:
```shell
uv pip install .
```

You will also need Flash Attention 2 installed, which can be done by running:
```shell
uv pip install "flash-attn==2.7.4.post1" --no-build-isolation
```

Next, log into your Hugging Face account as follows:
```shell
huggingface-cli login
```

Finally, install Git LFS so that you can push models to the Hugging Face Hub:
```shell
sudo apt-get update; sudo apt-get install git-lfs
```

To train a model using CPT (Continued Pre-Training) or SFT (Supervised Fine-Tuning), use the following command:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file configs/accelerate/zero3.yaml train.py --config configs/train_recipe/example_recipe.yaml
```

Edit configs/accelerate/zero3.yaml to configure your hardware setup. For example, to set the number of GPUs:
You need a machine with 8 H100 GPUs to run the example.

```yaml
num_processes: 8 # Set to the number of GPUs you have
```

Edit configs/train_recipe/example_recipe.yaml to configure your training parameters:
Models and Datasets:
- Use local models or models from Hugging Face Hub
- Use local datasets or datasets from Hugging Face Hub
Layer Freezing (Optional):
By default, all model parameters are trained. If you want to freeze the entire model and train only specific layers, add:
```yaml
freeze_parameters: true
freeze_except:
  - embed_tokens
  - layers.0
```

If freeze_parameters is not specified, the entire model will be trained.
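The freezing behavior can be illustrated with a small framework-agnostic sketch; in practice this logic would run over the model's named parameters. Note that substring matching on parameter names is an assumption here about how freeze_except is interpreted:

```python
class Param:
    """Stand-in for a framework parameter tensor."""
    def __init__(self):
        self.requires_grad = True

def apply_freezing(named_parameters, freeze_except):
    """Freeze every parameter whose name contains none of the
    freeze_except substrings (what freeze_parameters: true enables)."""
    for name, param in named_parameters:
        param.requires_grad = any(key in name for key in freeze_except)

params = {
    "model.embed_tokens.weight": Param(),
    "model.layers.0.mlp.weight": Param(),
    "model.layers.1.mlp.weight": Param(),
}
apply_freezing(params.items(), freeze_except=["embed_tokens", "layers.0"])
trainable = [n for n, p in params.items() if p.requires_grad]
print(trainable)  # ['model.embed_tokens.weight', 'model.layers.0.mlp.weight']
```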
Custom Tokenizer (Optional):
By default, the model's original tokenizer is used. To use a different tokenizer:
```yaml
tokenizer_name_or_path: "your-tokenizer-name"
```

R-BPE Tokenizer (Optional):
If your custom tokenizer is an R-BPE tokenizer:
```yaml
is_rbpe_tokenizer: true
```

If the custom tokenizer has a different vocabulary size than the original model, enable token embedding resizing:

```yaml
resize_token_embeddings: true
```

Project structure:

├── configs/
│ ├── accelerate/
│ │ └── zero3.yaml <- Accelerate configuration (GPU setup, distributed training)
│ └── train_recipe/
│ └── example_recipe.yaml <- Training recipe configuration (model, dataset, hyperparameters)
├── src/
│ └── alignment/ <- Source code for training
│ ├── configs.py
│ ├── data.py
│ ├── model_utils.py
│ └── release.py
├── tests/ <- Unit tests
├── train.py <- Main training script for CPT or SFT
├── setup.py <- Makes project pip installable
├── setup.cfg <- Installation config
└── README.md <- The top-level README for developers using this project
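Putting the optional settings above together, a recipe in configs/train_recipe/ might look like the following sketch. Only the freezing, tokenizer, and resizing keys are the ones documented above; the model and dataset field names and values are assumptions for illustration:

```yaml
# Hypothetical recipe sketch; check example_recipe.yaml for the real field names.
model_name_or_path: your-base-model        # local path or Hugging Face Hub ID (assumed field name)
dataset_name_or_path: your-packed-dataset  # local path or Hugging Face Hub ID (assumed field name)

# Optional: freeze the whole model except the listed parameters.
freeze_parameters: true
freeze_except:
  - embed_tokens
  - layers.0

# Optional: use a custom tokenizer instead of the model's original one.
tokenizer_name_or_path: "your-tokenizer-name"
is_rbpe_tokenizer: true       # only if the custom tokenizer is R-BPE
resize_token_embeddings: true # only if its vocab size differs from the model's
```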