An experimental repo to play around with automatic speech recognition (ASR) papers and ideas.
Training is driven by Hydra.
Reusable building blocks and system configs live under src/asr/conf/. A training run requires selecting a system config (system=ctc, system=rnnt or system=tdt, each of which bundles the model components and loss), the data configs, and the remaining per-run settings on the command line.
Any config file setting can be overridden from the command line and, if necessary, an entire training run can be configured there (although it's verbose!).
uv run train \
dataset=ls960 augment=speed_specaug dataloader=bucket \
eval_dataset=dev_clean eval_dataloader=dev \
system=ctc \
tokenizer=char 'tokenizer.specials=["<blank>"]' \
total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
device="cuda:1" logger=tqdm
uv run train \
dataset=ls960 augment=speed_specaug dataloader=bucket \
eval_dataset=dev_clean eval_dataloader=dev \
system=rnnt \
tokenizer=char 'tokenizer.specials=["<blank>", "<sos>"]' \
total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
device="cuda:1" logger=tqdm
uv run train \
dataset=ls960 augment=speed_specaug dataloader=bucket \
eval_dataset=dev_clean eval_dataloader=dev \
system=tdt \
tokenizer=char 'tokenizer.specials=["<blank>", "<sos>"]' \
total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
device="cuda:1" logger=tqdm
# add system.loss.sigma=0.05 to bias toward longer durations
The usage examples above use character-level tokenization. Switching to byte pair encoding (BPE) requires a trained tokenizer model. The command below trains one with 1022 tokens, so that once the tokenizer instance adds the <blank> and <sos> symbols needed by the RNN-T and TDT losses, the vocabulary totals 1024 (a power of 2). For CTC, use 1023 and remove <sos> from the example training command below:
uv run train-bpe trainer.vocab_size=1022 output_dir=models/tokenizers/ls960_bpe1022/
Then train a speech recognition model with any of the commands above, replacing the tokenizer line with:
...
tokenizer=bpe tokenizer.model_dir=models/tokenizers/ls960_bpe1022 'tokenizer.specials=["<blank>", "<sos>"]' \
...
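The vocabulary arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: `bpe_vocab_size` is a hypothetical helper, not part of this repo, and it assumes the special symbols are appended on top of the trained BPE vocabulary as described above.

```python
def bpe_vocab_size(target_total, specials):
    """Number of BPE tokens to train so that BPE tokens + specials == target_total.

    Hypothetical helper for illustration; assumes every special symbol
    occupies one extra slot on top of the trained BPE vocabulary.
    """
    return target_total - len(specials)


# RNN-T and TDT need both <blank> and <sos>.
print(bpe_vocab_size(1024, ["<blank>", "<sos>"]))  # 1022
# CTC only needs <blank>.
print(bpe_vocab_size(1024, ["<blank>"]))  # 1023
```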
- Connectionist Temporal Classification (CTC). Paper. Implementation in CUDA with PyTorch stable ABI bindings.
- Transducer / RNN-T. Paper. Implementation in CUDA with PyTorch stable ABI bindings.
- Token-and-Duration Transducer (TDT) with optional logit under-normalization to bias towards longer durations. Paper. Implementation in CUDA with PyTorch stable ABI bindings.
- Transformer (RoPE, SwiGLU, Prenorm). Implementation.
- LSTM stack. Wraps PyTorch. Implementation.
- Convolutional subsampling frontend. Implementation.
- Character. Implementation.
- Byte pair encoding (BPE), via SentencePiece. Paper.
- LibriSpeech. Website.
- "Global" normalization. Per-feature normalization using pre-computed per-feature mean and std.
- Per-sample, per-feature normalization. Mean and std are computed over the time dimension for each sample.
- Dynamic length bucketing with a per-batch frame budget, inspired by Lhotse.
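As a rough illustration of the two normalization schemes listed above, here is a dependency-free sketch. It assumes a sample is a list of frames, each a list of per-feature floats; the function names and the `eps` stabilizer are assumptions for illustration, and the repo's own implementation (which operates on tensors) may differ.

```python
from statistics import fmean, pstdev


def global_normalize(frames, mean, std, eps=1e-5):
    """Normalize each feature with pre-computed, dataset-wide mean and std."""
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
        for frame in frames
    ]


def per_sample_normalize(frames, eps=1e-5):
    """Normalize each feature with stats computed over this sample's time axis."""
    columns = list(zip(*frames))  # one tuple per feature, spanning all frames
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return global_normalize(frames, means, stds, eps)


# Two frames, two features.
sample = [[1.0, 10.0], [3.0, 30.0]]
print(per_sample_normalize(sample))
print(global_normalize(sample, mean=[2.0, 20.0], std=[1.0, 10.0]))
```

The per-sample variant needs no statistics file, while the global variant applies the same shift and scale to every utterance.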
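The idea behind length bucketing with a per-batch frame budget can be sketched as follows. This is a simplified, static version for illustration only: it sorts all utterances up front rather than bucketing dynamically from a stream, and `bucket_by_frames` and `max_frames` are hypothetical names, not this repo's or Lhotse's API.

```python
def bucket_by_frames(lengths, max_frames):
    """Group utterance indices into batches whose padded size fits a frame budget.

    lengths: number of frames per utterance.
    max_frames: budget for (longest utterance in batch) * (batch size).
    """
    # Sort by length so each batch holds similar durations, which limits padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for idx in order:
        # Ascending order means the incoming utterance is the longest so far,
        # so the padded cost is its length times the new batch size.
        padded = lengths[idx] * (len(current) + 1)
        if current and padded > max_frames:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [100, 900, 120, 880, 110, 400]
print(bucket_by_frames(lengths, max_frames=2000))  # [[0, 4, 2, 5], [3, 1]]
```

Short utterances end up in large batches and long ones in small batches, so every batch costs roughly the same amount of memory. In this sketch an utterance longer than the budget still gets a batch of its own instead of being dropped.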
Small scale training, <2 days on an NVIDIA 3090: