ASR (Automatic Speech Recognition) project for Mandarin Chinese (普通话), focused on speech recognition and pronunciation assessment.
Part of a phoneme-level pronunciation assessment pipeline using the GOP (Goodness of Pronunciation) metric - trained purely on L1 data, requiring no L2 corpus. This is especially practical for Mandarin, where L2 corpora are scarce and Polish-speaker data is virtually nonexistent.
- Model: Conformer (CTC)
- Input: 80-band Mel spectrogram (16 kHz, hop 160)
- Training data: AISHELL-1 - 170h of Mandarin read speech
- Output: Pinyin character sequence only (for now)
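As a rough illustration of what the input spec above implies, the sketch below works out the frame rate and spectrogram shape for a clip. The sample rate, hop, and band count come from the list above; the `center` behaviour and the 400-sample window (`n_fft`) are assumptions, since the window size is not stated here.

```python
SAMPLE_RATE = 16_000  # Hz, from the input spec
HOP_LENGTH = 160      # samples between consecutive frames, from the input spec
N_MELS = 80           # mel bands per frame, from the input spec


def mel_shape(duration_s: float, center: bool = True, n_fft: int = 400) -> tuple[int, int]:
    """Return (n_mels, n_frames) for a clip of the given duration.

    n_fft=400 (25 ms) and center=True are assumed defaults, not values
    confirmed by this project.
    """
    n_samples = int(round(duration_s * SAMPLE_RATE))
    if center:
        # A centered STFT pads n_fft // 2 samples on each side,
        # which yields 1 + n_samples // hop frames.
        n_frames = 1 + n_samples // HOP_LENGTH
    elif n_samples < n_fft:
        # Not enough samples for even one full window.
        n_frames = 0
    else:
        n_frames = 1 + (n_samples - n_fft) // HOP_LENGTH
    return N_MELS, n_frames


# One frame every 10 ms, i.e. 100 frames per second of audio.
frame_shift_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
print(frame_shift_ms, mel_shape(5.0))
```

Under these assumptions a 5-second clip gives an 80 x 501 spectrogram.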
si-tone/
├── data/ # dataset, collate
├── models/
│ └── conformer/ # Conformer encoder, attention, conv
├── utils/ # tokenizer, vocab, pinyin helpers
├── inference/ # inference scripts, notebooks
│ └── inference.py # single-file inference
├── examples/ # sample audio files
├── checkpoints/ # saved model checkpoints + config.json
└── train/
python3.13 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Verify:
python --version
which pip
echo $VIRTUAL_ENV
Deactivate:
deactivate
Download AISHELL-1:
wget https://www.openslr.org/resources/33/data_aishell.tgz
tar -xzf data_aishell.tgz
Single file:
python inference/inference.py --audio examples/chinese_granny.mp3
python inference/inference.py --audio examples/chinese_granny.mp3 --checkpoint checkpoints/conformer_epoch_10.pt
Saves to results/<filename>_epoch<N>_<timestamp>/:
prediction.txt # predicted pinyin + diagnostics
mel.png # mel spectrogram
probs.png # blank vs. max token probability over time
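To run the single-file script over several recordings, one option is a small wrapper that builds one command per file. This is only a sketch: `build_commands` is a hypothetical helper, not part of the repository, and it relies solely on the `--audio` and `--checkpoint` flags shown above.

```python
import sys


def build_commands(audio_paths, checkpoint=None):
    """Build one inference command (argv list) per audio file.

    Hypothetical helper: it only assembles the command lines documented
    above and does not run anything itself.
    """
    commands = []
    for path in audio_paths:
        cmd = [sys.executable, "inference/inference.py", "--audio", str(path)]
        if checkpoint is not None:
            cmd += ["--checkpoint", str(checkpoint)]
        commands.append(cmd)
    return commands


# To execute from the project root, something like:
#   import subprocess
#   for cmd in build_commands(["examples/chinese_granny.mp3"]):
#       subprocess.run(cmd, check=True)
```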
You can run the complete web application (frontend + FastAPI backend) locally.
From the root directory of the project, run:
docker compose up --build
If you are developing the frontend locally without Docker and need to build the production assets manually:
cd frontend
npm install
npm run build
Once the Docker containers are up and running, open your browser and navigate to: http://localhost:8000
The model needs ~30-50 epochs before CER drops meaningfully. Typical trajectory on AISHELL-1:
| Epoch | CER |
|---|---|
| 5–10 | ~90% |
| 20–30 | ~50% |
| 50+ | ~20% |
| 100+ | <10% |
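For reference, CER is the edit distance between the reference and predicted sequences divided by the reference length. The sketch below is a minimal stand-alone version. Treating each Pinyin syllable as one unit is an assumption, and the project's own scoring code may tokenize differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative sequences only: one substitution plus one deletion over 3 units.
reference = ["ni3", "hao3", "ma5"]
hypothesis = ["ni3", "hao4"]
print(f"CER = {cer(reference, hypothesis):.2%}")
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.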
A blank_frac above 90% in inference output means the model is still in early training.
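One plausible reading of `blank_frac` is the share of frames whose most likely CTC token is the blank symbol. The sketch below computes that alongside a greedy CTC decode. The blank index of 0 and the exact definition are assumptions, and the inference script may compute the value differently.

```python
from itertools import groupby

BLANK_ID = 0  # assumed index of the CTC blank token


def greedy_ctc_decode(frame_ids, blank_id=BLANK_ID):
    """Collapse consecutive repeats, then drop blanks (standard greedy CTC decoding)."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


def blank_fraction(frame_ids, blank_id=BLANK_ID):
    """Fraction of frames whose argmax token is the blank."""
    if not frame_ids:
        return 0.0
    return sum(1 for token in frame_ids if token == blank_id) / len(frame_ids)


# Per-frame argmax token ids for a toy 10-frame utterance.
frames = [0, 0, 5, 5, 0, 7, 0, 0, 0, 0]
print(greedy_ctc_decode(frames), blank_fraction(frames))
```

If almost every frame is blank, the decoded sequence is nearly empty, which matches the very high CER of the early epochs in the table above.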
- Native speech (L1): AISHELL-1
- L2 learner speech: Mandarin Learners' Speech Bank
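To make the GOP idea concrete, here is a minimal sketch of one common posterior-based variant: the average log posterior of the expected phone over its frames, minus the best average log posterior of any phone over the same frames. It assumes the frame-to-phone alignment is already given. The phone labels are made up for illustration, and this is not necessarily the formulation used in this project.

```python
import math


def gop(posteriors, canonical, eps=1e-10):
    """Posterior-based GOP for one phone segment.

    posteriors: list of per-frame dicts {phone: probability} for the frames
                aligned to the segment (alignment assumed to be given).
    canonical:  the phone the speaker was expected to produce.
    Returns a value <= 0; 0 means the canonical phone is also the best match.
    """
    if not posteriors:
        raise ValueError("segment has no frames")

    def avg_log_post(phone):
        return sum(math.log(max(frame.get(phone, 0.0), eps))
                   for frame in posteriors) / len(posteriors)

    best = max(avg_log_post(q) for q in posteriors[0])
    return avg_log_post(canonical) - best


# Toy segment: the acoustic model favours "a1" over "a4" (hypothetical labels).
segment = [{"a1": 0.9, "a4": 0.1}, {"a1": 0.8, "a4": 0.2}]
print(gop(segment, "a1"), gop(segment, "a4"))
```

Since the score only needs posteriors from a model trained on native speech, a strongly negative value flags a likely mispronunciation without any L2 training data.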