
samgd/asr

ASR

An experimental repo to play around with automatic speech recognition (ASR) papers and ideas.

Usage

Training is driven by Hydra.

Reusable building blocks and system configs live under src/asr/conf/. A training run requires selecting a system config (system=ctc, system=rnnt, or system=tdt, each of which bundles the model components and loss) and the data configs, then supplying the remaining per-run settings on the command line.

Any config file setting can be overridden on the command line and, if necessary, a full training run can be configured that way (although it's verbose!).

CTC

uv run train \
    dataset=ls960 augment=speed_specaug dataloader=bucket \
    eval_dataset=dev_clean eval_dataloader=dev \
    system=ctc \
    tokenizer=char 'tokenizer.specials=["<blank>"]' \
    total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
    device="cuda:1" logger=tqdm

RNN-T

uv run train \
    dataset=ls960 augment=speed_specaug dataloader=bucket \
    eval_dataset=dev_clean eval_dataloader=dev \
    system=rnnt \
    tokenizer=char 'tokenizer.specials=["<blank>", "<sos>"]' \
    total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
    device="cuda:1" logger=tqdm

TDT

uv run train \
    dataset=ls960 augment=speed_specaug dataloader=bucket \
    eval_dataset=dev_clean eval_dataloader=dev \
    system=tdt \
    tokenizer=char 'tokenizer.specials=["<blank>", "<sos>"]' \
    total_steps=10_000 eval_every=500 optim.lr=0.0003 max_grad_norm=2.0 \
    device="cuda:1" logger=tqdm
    # add system.loss.sigma=0.05 to bias toward longer durations

BPE

The usage examples above use character-level tokenization. Switching to byte pair encoding (BPE) requires a trained BPE model first. The command below trains one with 1022 tokens; the tokenizer instance then adds the <blank> and <sos> symbols needed by the RNN-T and TDT losses, bringing the vocabulary to 1024 (a power of 2). For CTC, use 1023 and remove <sos> from the example training command below:

uv run train-bpe trainer.vocab_size=1022 output_dir=models/tokenizers/ls960_bpe1022/

Then train a speech recognition model with any of the commands above, replacing the tokenizer line with:

...
    tokenizer=bpe tokenizer.model_dir=models/tokenizers/ls960_bpe1022 'tokenizer.specials=["<blank>", "<sos>"]' \
...
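
The vocabulary arithmetic described above can be checked with a few lines of Python. This is only an illustrative sketch: `bpe_vocab_size` is a hypothetical helper that does not exist in this repo, and the target of 1024 and the lists of special symbols are taken from the text above.

```python
def bpe_vocab_size(target_size: int, specials: list[str]) -> int:
    """Number of BPE tokens to train so that BPE tokens + specials == target_size.

    Hypothetical helper for illustration; not part of the repo.
    """
    if len(specials) >= target_size:
        raise ValueError("target_size must exceed the number of special symbols")
    return target_size - len(specials)


# RNN-T / TDT: the tokenizer instance adds <blank> and <sos>.
print(bpe_vocab_size(1024, ["<blank>", "<sos>"]))  # 1022
# CTC: only <blank> is added.
print(bpe_vocab_size(1024, ["<blank>"]))  # 1023
```

The result is the value to pass as trainer.vocab_size.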

Features

Loss

  • Connectionist Temporal Classification (CTC). Paper. Implementation in CUDA with PyTorch stable ABI bindings.
  • Transducer / RNN-T. Paper. Implementation in CUDA with PyTorch stable ABI bindings.
  • Token-and-Duration Transducer (TDT) with optional logit under-normalization to bias towards longer durations. Paper. Implementation in CUDA with PyTorch stable ABI bindings.
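
As a reference point for the CTC loss listed above, the standard forward (alpha) recursion can be written in plain Python. This is a minimal sketch and not the repo's CUDA implementation: the function names, the list-of-lists input layout, and the blank index of 0 are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs: float) -> float:
    """Numerically stable log(sum(exp(x))) for a handful of values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward recursion.

    log_probs: T x V nested list of per-frame log-probabilities (T >= 1).
    target:    list of label ids (no blanks).
    blank:     index of the blank symbol (assumed to be 0 here).
    """
    # Extended target: blank, y1, blank, y2, ..., blank.
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), len(log_probs)

    # Paths may start in the leading blank or in the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]  # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])  # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the final blank or the final label.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total
```

With two frames, a uniform distribution over {blank, a}, and target "a", three of the four possible alignments are valid, so the function returns -log(0.75).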

Architecture

Tokenizer

Dataset

Data Normalization

  • "Global" normalization. Per-feature normalization using pre-computed per-feature mean and std.
  • Per-sample, per-feature normalization. Mean and std are computed over the time dimension for each sample.
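
The two normalization modes above differ only in where the statistics come from. The sketch below illustrates this in plain Python; the function names, the T x F nested-list layout, the use of population std, and the `eps` term are assumptions, not the repo's code.

```python
import statistics


def normalize_global(feats, mean, std, eps=1e-5):
    """Normalize a T x F feature matrix with pre-computed per-feature stats.

    mean and std are length-F sequences, e.g. computed once over the training set.
    """
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
        for frame in feats
    ]


def normalize_per_sample(feats, eps=1e-5):
    """Normalize each feature using stats computed over this sample's time axis."""
    # Transpose T x F to F x T so each entry holds one feature over time.
    columns = list(zip(*feats))
    mean = [statistics.fmean(c) for c in columns]
    std = [statistics.pstdev(c) for c in columns]  # population std (assumption)
    return normalize_global(feats, mean, std, eps)
```

Per-sample normalization needs no pre-computation pass over the dataset, whereas global normalization applies the same transform to every utterance.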

Data Augmentation

  • SpecAugment time and frequency masking. Paper.
  • Speed perturbation. Paper.
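
The masking half of the augmentations above can be sketched as follows. This is an illustrative plain-Python version rather than the repo's implementation: the parameter names, default values, and the choice to fill masked regions with 0.0 are assumptions, and speed perturbation (which requires resampling the waveform) is not covered.

```python
import random


def spec_augment(feats, num_time_masks=2, max_time_width=10,
                 num_freq_masks=2, max_freq_width=5, rng=None):
    """Apply SpecAugment-style time and frequency masks to a T x F matrix.

    feats must be a non-empty nested list; a masked copy is returned and
    the input is left unchanged.
    """
    rng = rng or random.Random()
    T, F = len(feats), len(feats[0])
    out = [list(frame) for frame in feats]

    # Time masks: zero out `width` consecutive frames.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, T))
        start = rng.randint(0, T - width)
        for t in range(start, start + width):
            out[t] = [0.0] * F

    # Frequency masks: zero out `width` consecutive feature bins in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, F))
        start = rng.randint(0, F - width)
        for t in range(T):
            for f in range(start, start + width):
                out[t][f] = 0.0
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the masks reproducible.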

Data Batching

  • Dynamic length bucketing with a per-batch frame budget, inspired by Lhotse.
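
One way to realize length bucketing with a frame budget is sketched below. It is not the repo's or Lhotse's implementation: `bucket_batches` and its parameters are hypothetical, and the budget is assumed to count padded frames (batch size times the longest utterance in the batch).

```python
import random


def bucket_batches(lengths, max_frames, num_buckets=4, seed=0):
    """Group utterances of similar length into batches under a frame budget.

    lengths:    number of frames per utterance.
    max_frames: budget on padded frames per batch (len(batch) * longest).
    Returns a list of batches, each a list of utterance indices.
    """
    rng = random.Random(seed)
    # Sort by length, then cut the sorted order into contiguous buckets.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bucket_size = max(1, -(-len(order) // num_buckets))  # ceil division

    batches = []
    for b in range(0, len(order), bucket_size):
        bucket = order[b:b + bucket_size]
        rng.shuffle(bucket)  # randomize within a bucket of similar lengths
        batch, longest = [], 0
        for i in bucket:
            if lengths[i] > max_frames:
                raise ValueError(f"utterance {i} exceeds the frame budget")
            new_longest = max(longest, lengths[i])
            # Flush the batch if adding this utterance would exceed the budget.
            if batch and new_longest * (len(batch) + 1) > max_frames:
                batches.append(batch)
                batch, new_longest = [], lengths[i]
            batch.append(i)
            longest = new_longest
        if batch:
            batches.append(batch)

    rng.shuffle(batches)  # avoid feeding batches in order of length
    return batches
```

Because the batch size adapts to utterance length, short utterances are packed into large batches and long ones into small batches, which keeps padding and memory use roughly constant.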

Logging

Example Training

TDT

Small scale training, <2 days on an NVIDIA 3090:

[Figure: TDT Training]
