ASR

An experimental repo to play around with automatic speech recognition (ASR) papers and ideas.

Usage
- CTC
- RNN-T
- TDT
- BPE
Features
Example Training
- TDT

Usage

Training is driven by Hydra.

Reusable building blocks and system configs live under src/asr/conf/. A training run requires selecting a system config (system=ctc, system=rnnt, or system=tdt, each of which bundles the model components and loss), the data configs, and the remaining per-run settings on the command line.

Any config file setting can be overridden on the command line and, if necessary, an entire training run can be configured there (although it's verbose!).
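To make the override mechanism concrete, here is a minimal sketch of how dotted key=value overrides such as optim.lr=0.0003 map onto a nested config. It is an illustration only, not Hydra's parser: the function name apply_overrides and the starting config values are hypothetical, and config-group selections such as dataset=ls960 (which pick a config file rather than set a value) are not modelled.

```python
import ast


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply dotted key=value overrides to a nested dict, in place.

    Hypothetical helper for illustration; Hydra's real override grammar
    is richer than this.
    """
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            # Parse numbers, lists, quoted strings, etc.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Leave bare words such as tqdm or cuda:1 as plain strings.
            value = raw
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# Assumed starting values, standing in for what a config file might hold.
config = {"optim": {"lr": 0.001}, "total_steps": 1000}
apply_overrides(
    config,
    ["optim.lr=0.0003", "total_steps=10_000", "logger=tqdm", "device=cuda:1"],
)
print(config)
```

Later overrides win over earlier values, which is why a full run can be assembled from the command line alone at the cost of a long command.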

CTC

uv run train \
    dataset=ls960 augment=speed_specaug dataloader=bucket \
    eval_dataset=dev_clean eval_dataloader=dev \
    system=ctc \
    tokenizer=char 'tokenizer.specials=["<blank>"]' \
    total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
    device="cuda:1" logger=tqdm

RNN-T

uv run train \
    dataset=ls960 augment=speed_specaug dataloader=bucket \
    eval_dataset=dev_clean eval_dataloader=dev \
    system=rnnt \
    tokenizer=char 'tokenizer.specials=["<blank>", "<sos>"]' \
    total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
    device="cuda:1" logger=tqdm

TDT

uv run train \
    dataset=ls960 augment=speed_specaug dataloader=bucket \
    eval_dataset=dev_clean eval_dataloader=dev \
    system=tdt \
    tokenizer=char 'tokenizer.specials=["<blank>", "<sos>"]' \
    total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
    device="cuda:1" logger=tqdm
    # add system.loss.sigma=0.05 to bias toward longer durations

BPE

The usage examples above use character-level tokenization. Switching to byte pair encoding (BPE) requires a trained tokenizer model. The command below trains one with 1022 tokens, so that once the tokenizer instance adds the <blank> and <sos> symbols needed by the RNN-T and TDT losses, the vocabulary totals 1024 (a power of 2). For CTC, use 1023 and remove <sos> from the example training command below:

uv run train-bpe trainer.vocab_size=1022 output_dir=models/tokenizers/ls960_bpe1022/

Then train a speech recognition model with any of the commands above, replacing the tokenizer line with:

...
    tokenizer=bpe tokenizer.model_dir=models/tokenizers/ls960_bpe1022 'tokenizer.specials=["<blank>", "<sos>"]' \
...
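The vocabulary arithmetic in this section can be checked with a few lines of Python. The helper name bpe_trainer_vocab_size is hypothetical and is not part of the repo; it only encodes the rule that the BPE model size plus the number of special symbols should equal the target vocabulary size.

```python
def bpe_trainer_vocab_size(target_vocab_size: int, specials: list[str]) -> int:
    """Size to request from the BPE trainer so that adding `specials`
    afterwards lands exactly on `target_vocab_size`."""
    return target_vocab_size - len(specials)


# RNN-T and TDT add both <blank> and <sos>.
print(bpe_trainer_vocab_size(1024, ["<blank>", "<sos>"]))  # 1022

# CTC only adds <blank>.
print(bpe_trainer_vocab_size(1024, ["<blank>"]))  # 1023
```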

Features

Loss

Connectionist Temporal Classification (CTC). Paper. Implementation in CUDA with PyTorch stable ABI bindings.
Transducer / RNN-T. Paper. Implementation in CUDA with PyTorch stable ABI bindings.
Token-and-Duration Transducer (TDT) with optional logit under-normalization to bias towards longer durations. Paper. Implementation in CUDA with PyTorch stable ABI bindings.
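For readers new to these losses, the following is a plain-Python reference sketch of the CTC forward (alpha) recursion for a single utterance. It is not the repo's CUDA kernel and makes no claim about how that kernel is organised; the function name and the list-of-lists input layout are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs: float) -> float:
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    log_probs: T rows of per-symbol log-probabilities.
    target: label indices without blanks.
    """
    # Extended label sequence: blank, y1, blank, y2, ..., blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]  # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])  # advance by one
            # Skipping a blank is only allowed between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


# Two frames, two symbols (blank=0, label=1), uniform probabilities.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
loss = ctc_neg_log_likelihood(uniform, target=[1])
print(loss)
```

In the usage example, three of the four equally likely frame-level paths collapse to the target, so the loss equals -log(0.75).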

Architecture

Transformer (RoPE, SwiGLU, Prenorm). Implementation.
LSTM stack. Wraps PyTorch, implementation.
Conv subsampling frontend, implementation.

Tokenizer

Character. Implementation.
Byte pair encoding (BPE), via SentencePiece. Paper.

Dataset

LibriSpeech. Website.

Data Normalization

"Global" normalization. Per-feature normalization using pre-computed per-feature mean and std.
Per sample per feature normalization. Mean and std stats computed over the time dimension for each sample.
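The difference between the two normalization modes can be sketched in plain Python. The function names, the eps term, and the use of the population standard deviation are assumptions for illustration; the repo's own implementation may differ on each of these points.

```python
import statistics


def global_normalize(features, means, stds, eps=1e-5):
    """Normalize each feature with pre-computed, dataset-wide statistics.

    features: T frames, each a list of F values.
    means, stds: one pre-computed value per feature (length F).
    """
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]


def per_sample_normalize(features, eps=1e-5):
    """Normalize each feature with statistics taken over this sample's time axis."""
    columns = list(zip(*features))  # F columns, each of length T
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return global_normalize(features, means, stds, eps)


# Two frames, two features.
features = [[1.0, 10.0], [3.0, 30.0]]
per_sample = per_sample_normalize(features)
global_out = global_normalize(features, means=[0.0, 0.0], stds=[1.0, 10.0])
print(per_sample)
print(global_out)
```

Per-sample statistics need no pass over the dataset beforehand, whereas the global variant keeps relative level differences between utterances.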

Data Augmentation

SpecAugment time and frequency masking. Paper.
Speed perturbation. Paper.
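As an illustration of the masking half of the augmentation, here is a minimal plain-Python sketch of SpecAugment-style time and frequency masking. The function name, the parameter names, and the fill value of 0.0 are assumptions, not the repo's API, and speed perturbation is not covered because it requires resampling the waveform.

```python
import random


def mask_spans(features, axis, num_masks, max_width, rng, fill=0.0):
    """Return a copy of `features` with random spans overwritten by `fill`.

    features: T frames, each a list of F values.
    axis: 0 masks spans of time frames, 1 masks spans of frequency bins.
    """
    T, F = len(features), len(features[0])
    size = T if axis == 0 else F
    out = [row[:] for row in features]
    for _ in range(num_masks):
        # Width and start are drawn uniformly; randint is inclusive at both ends.
        width = rng.randint(0, min(max_width, size))
        start = rng.randint(0, size - width)
        for t in range(T):
            for f in range(F):
                idx = t if axis == 0 else f
                if start <= idx < start + width:
                    out[t][f] = fill
    return out


rng = random.Random(0)
features = [[1.0] * 4 for _ in range(20)]  # 20 frames, 4 frequency bins
time_masked = mask_spans(features, axis=0, num_masks=2, max_width=3, rng=rng)
both_masked = mask_spans(time_masked, axis=1, num_masks=1, max_width=2, rng=rng)
```

Masks may overlap, so the number of masked frames or bins is at most num_masks * max_width.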

Data Batching

Dynamic length bucketing with a per-batch frame budget, inspired by Lhotse.
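The idea of length bucketing under a frame budget can be sketched as follows. This is a simplified illustration and not the repo's or Lhotse's sampler: the function name, the num_buckets parameter, and the choice to measure cost as the longest utterance times the batch size (that is, including padding) are all assumptions.

```python
def bucket_batches(lengths, max_frames, num_buckets=4):
    """Group utterance indices into batches that respect a padded frame budget.

    lengths: number of frames per utterance.
    max_frames: budget on (longest utterance in batch) * (batch size).
    """
    # Sort by length so each bucket holds utterances of similar duration.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bucket_size = max(1, -(-len(order) // num_buckets))  # ceiling division

    batches = []
    for b in range(0, len(order), bucket_size):
        bucket = order[b:b + bucket_size]
        batch, longest = [], 0
        for i in bucket:
            new_longest = max(longest, lengths[i])
            # Close the batch if adding this utterance would exceed the budget.
            if batch and new_longest * (len(batch) + 1) > max_frames:
                batches.append(batch)
                batch, new_longest = [], lengths[i]
            batch.append(i)
            longest = new_longest
        if batch:
            batches.append(batch)
    return batches


lengths = [100, 900, 120, 450, 500, 130, 880, 110]
batches = bucket_batches(lengths, max_frames=1000, num_buckets=2)
print(batches)
```

Short utterances end up in large batches and long ones in small batches, which keeps padding low. In this sketch an utterance longer than the budget still forms a batch of its own rather than being dropped.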

Logging

tqdm. Website.
Aim. Website.

Example Training

TDT

Small-scale training, <2 days on an NVIDIA RTX 3090:
