Si-Tones (四 Tones, or 四声)

ASR (Automatic Speech Recognition) project for Mandarin Chinese (普通话), focused on speech recognition and pronunciation assessment.

Part of a phoneme-level pronunciation assessment pipeline using the GOP (Goodness of Pronunciation) metric - trained purely on L1 data, requiring no L2 corpus. This is especially practical for Mandarin, where L2 corpora are scarce and Polish-speaker data is virtually nonexistent.
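As a rough illustration of what a GOP score measures, the sketch below uses one common simplified formulation: the mean log posterior of the expected (canonical) phone over the frames aligned to it. This is an assumption for illustration only; the exact GOP variant, the alignment method, and the names `gop` and `phone_id` are hypothetical and are not taken from this repository.

```python
import math


def gop(frame_posteriors, phone_id, eps=1e-10):
    """Mean log posterior of the canonical phone over its aligned frames.

    frame_posteriors: list of per-frame probability lists (one entry per phone).
    phone_id: index of the phone the speaker was expected to produce.
    Scores closer to 0 mean the model is more confident the phone was produced.
    """
    if not frame_posteriors:
        raise ValueError("need at least one aligned frame")
    # Clamp with eps so a zero posterior does not produce log(0).
    total = sum(math.log(max(p[phone_id], eps)) for p in frame_posteriors)
    return total / len(frame_posteriors)


# Toy posteriors over two phones (hypothetical values).
good = [[0.1, 0.9], [0.2, 0.8]]  # canonical phone 1 dominates
bad = [[0.9, 0.1], [0.8, 0.2]]   # canonical phone 1 is unlikely
print(gop(good, 1), gop(bad, 1))
```

Because the score only needs posteriors from a model trained on native speech, a formulation of this kind is compatible with the L1-only training described above.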


Architecture

  • Model: Conformer (CTC)
  • Input: 80-band Mel spectrogram (16 kHz, hop 160)
  • Training data: AISHELL-1 - 170h of Mandarin read speech
  • Output: Pinyin character sequence only (for now)
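A CTC model emits one distribution per frame, so turning its output into a Pinyin string requires a decoding step. The sketch below shows greedy CTC decoding (argmax per frame, merge repeats, drop blanks) in plain Python. The blank index `0`, the toy vocabulary, and the tone-digit notation are assumptions for illustration; the repository's actual tokenizer and decoder may differ.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank token


def ctc_greedy_decode(frame_probs, id_to_token, blank=BLANK):
    """Collapse per-frame argmax predictions into a token sequence."""
    # Pick the most likely token id at every frame.
    best_ids = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # Merge consecutive repeats first, then drop blanks.
    collapsed = [tok for tok, _ in groupby(best_ids)]
    return [id_to_token[i] for i in collapsed if i != blank]


# Toy vocabulary and posteriors (hypothetical, not the project's real vocab).
vocab = {1: "n", 2: "i", 3: "3"}
probs = [
    [0.10, 0.80, 0.05, 0.05],  # "n"
    [0.10, 0.70, 0.10, 0.10],  # "n" again (merged with the previous frame)
    [0.90, 0.03, 0.03, 0.04],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "i"
    [0.10, 0.10, 0.10, 0.70],  # "3"
]
decoded = "".join(ctc_greedy_decode(probs, vocab))
print(decoded)  # ni3
```

The order matters: repeats are merged before blanks are removed, which is what lets a blank separate two genuine occurrences of the same character.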

Project structure

si-tones/
├── data/               # dataset, collate
├── models/
│   └── confromer/      # Conformer encoder, attention, conv
├── utils/              # tokenizer, vocab, pinyin helpers
├── inference/          # inference scripts, notebooks
│   └── inference.py    # single-file inference
├── examples/           # sample audio files
├── checkpoints/        # saved model checkpoints + config.json
└── train/

Setup

python3.13 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Verify:

python --version
which pip
echo $VIRTUAL_ENV

Deactivate:

deactivate

Training

Download AISHELL-1:

wget http://www.openslr.org/33/data_aishell.tgz
tar -xzf data_aishell.tgz

Inference

Single file:

python inference/inference.py --audio examples/chinese_granny.mp3
python inference/inference.py --audio examples/chinese_granny.mp3 --checkpoint checkpoints/conformer_epoch_10.pt

Results are saved to results/<filename>_epoch<N>_<timestamp>/:

prediction.txt   # predicted pinyin + diagnostics
mel.png          # mel spectrogram
probs.png        # blank vs. max token probability over time
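To run the single-file command over several recordings, one option is a small wrapper that builds the same command line for each file. The sketch below uses only the `--audio` and `--checkpoint` flags shown above; the helper name `build_command` is hypothetical, and the script is assumed to be run from the repository root.

```python
import sys


def build_command(audio, checkpoint=None):
    """Build the argument list for one inference/inference.py run."""
    cmd = [sys.executable, "inference/inference.py", "--audio", str(audio)]
    if checkpoint is not None:
        cmd += ["--checkpoint", str(checkpoint)]
    return cmd


cmd = build_command(
    "examples/chinese_granny.mp3",
    checkpoint="checkpoints/conformer_epoch_10.pt",
)
print(" ".join(cmd[1:]))

# To execute, pass each command to subprocess, for example:
#   import subprocess
#   from pathlib import Path
#   for audio in sorted(Path("examples").iterdir()):
#       subprocess.run(build_command(audio), check=True)
```

Each run writes to its own timestamped results directory, so looping this way should not overwrite earlier outputs.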

Local deployment (web app)

You can run the complete web application (frontend + FastAPI backend) locally.

From the root directory of the project, run:

docker compose up --build

Once the Docker containers are up and running, open your browser and navigate to: http://localhost:8000

If you are developing the frontend locally without Docker and need to build the production assets manually:

cd frontend
npm install
npm run build

Notes on convergence

The model needs ~30-50 epochs before the character error rate (CER) drops meaningfully. Typical trajectory on AISHELL-1:

Epoch    CER
5–10     ~90%
20–30    ~50%
50+      ~20%
100+     <10%
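For reference, CER is conventionally computed as the edit distance between the reference and the prediction divided by the reference length. The sketch below is a minimal pure-Python version; it compares strings character by character, which is an assumption, since the unit the project actually scores (Pinyin characters, syllables, or Hanzi) is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one row kept in memory)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one wrong tone digit and one missing tone digit.
print(round(cer("ni3hao3", "ni2hao"), 3))  # 0.286
```

Because insertions count as errors, CER can exceed 100% when the prediction is much longer than the reference.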

A blank_frac above 90% in inference output means the model is still in early training.
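One plausible reading of `blank_frac` is the fraction of frames whose most likely token is the CTC blank. The sketch below computes it that way; both this definition and the blank index `0` are assumptions, so check `inference/inference.py` for the exact calculation.

```python
def blank_fraction(frame_probs, blank=0):
    """Fraction of frames whose argmax token is the CTC blank."""
    if not frame_probs:
        return 0.0
    blank_frames = sum(
        1
        for p in frame_probs
        if max(range(len(p)), key=p.__getitem__) == blank
    )
    return blank_frames / len(frame_probs)


# Toy posteriors over [blank, token_a, token_b] (hypothetical values).
probs = [
    [0.9, 0.05, 0.05],
    [0.8, 0.10, 0.10],
    [0.2, 0.70, 0.10],
    [0.7, 0.20, 0.10],
]
print(blank_fraction(probs))  # 0.75
```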


Acknowledgements
