You can get the source code for these examples on GitHub.
The examples folder contains examples for speech recognition:
- A DataIter over speech data.
- A configuration file for training on the AMI SDM1 dataset; you can use it as a template for writing other configuration files.
- io_func/feat_readers/reader_kaldi.py, which connects the examples to Kaldi for feature and label I/O.
- A decoding script called from Kaldi (select the simple method for decoding).
- A full recipe, run_ami.sh, which trains and decodes an acoustic model on AMI.
To create the speech acoustic modeling example, use the following steps.
Build Kaldi as shared libraries if you have not already done so.
```bash
cd kaldi/src
./configure --shared # and other options that you need
make depend
make
```
Copy or link the python_wrap folder to kaldi/src, then build it:

```bash
cd kaldi/src/python_wrap/
make
```
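If the build succeeded, the wrapper should be available as a shared library that the Python Kaldi reader (io_func/feat_readers/reader_kaldi.py) can load. A quick, optional sanity check (the exact library name can vary between Kaldi versions, so this just lists whatever was produced):

```bash
# Optional sanity check: list the shared objects produced by the wrapper build.
# Assumes the python_wrap Makefile emits a .so; the name may differ per version.
find kaldi/src/python_wrap -name '*.so' -print
```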
The acoustic models use Mel filter-bank or MFCC as input features. They also need Kaldi to perform force-alignment, which generates frame-level labels from the text transcriptions. For example, if you want to work on the SDM1 subset of the AMI data, you can run kaldi/egs/ami/s5/run_sdm.sh. Before you can run the examples, you need to configure some paths in kaldi/egs/ami/s5/cmd.sh and kaldi/egs/ami/s5/run_sdm.sh. Refer to Kaldi's documentation for details.

The default run_sdm.sh script generates the force-alignment labels in stage 7 and saves them in exp/sdm1/tri3a_ali. The default script generates MFCC features (13-dimensional). You can train with the MFCC features, or you can create Mel filter-bank features yourself. For example, you can use a script like the following to compute Mel filter-bank features using Kaldi:
```bash
#!/bin/bash -u

. ./cmd.sh
. ./path.sh

# SDM - Single Distant Microphone
micid=1 # which mic from array should be used?
mic=sdm$micid

# Set bash to 'debug' mode, it prints the commands (option '-x') and exits on :
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline',
set -euxo pipefail

# Path where AMI gets downloaded (or where locally available):
AMI_DIR=$PWD/wav_db # Default,

data_dir=$PWD/data/$mic

# make filter bank data
for dset in train dev eval; do
  steps/make_fbank.sh --nj 48 --cmd "$train_cmd" $data_dir/$dset \
    $data_dir/$dset/log $data_dir/$dset/data-fbank
  steps/compute_cmvn_stats.sh $data_dir/$dset \
    $data_dir/$dset/log $data_dir/$dset/data
  apply-cmvn --utt2spk=ark:$data_dir/$dset/utt2spk \
    scp:$data_dir/$dset/cmvn.scp scp:$data_dir/$dset/feats.scp \
    ark,scp:$data_dir/$dset/feats-cmvn.ark,$data_dir/$dset/feats-cmvn.scp
  mv $data_dir/$dset/feats-cmvn.scp $data_dir/$dset/feats.scp
done
```
Here, apply-cmvn performs mean-variance normalization; in the default setup it is applied per speaker. It is more common to perform mean-variance normalization over the whole corpus and then feed the result to the neural network:
```bash
compute-cmvn-stats scp:data/sdm1/train_fbank/feats.scp data/sdm1/train_fbank/cmvn_g.ark
apply-cmvn --norm-vars=true data/sdm1/train_fbank/cmvn_g.ark scp:data/sdm1/train_fbank/feats.scp \
  ark,scp:data/sdm1/train_fbank_gcmvn/feats.ark,data/sdm1/train_fbank_gcmvn/feats.scp
```
Note that Kaldi always tries to find features in feats.scp. Ensure that the normalized features are organized as Kaldi expects them during decoding.
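One way to keep that organization, sketched below under the assumption that you follow the directory layout used above (repeat for the dev and eval sets), is to copy the usual Kaldi metadata files next to the globally normalized feats.scp:

```bash
# Illustrative sketch only: build a Kaldi-style data directory around the
# globally normalized features so decoding finds them via feats.scp.
# Directory names follow the commands above; adjust them to your layout.
src=data/sdm1/train_fbank          # original feature directory
dst=data/sdm1/train_fbank_gcmvn    # directory holding the normalized feats.scp

mkdir -p $dst
for f in utt2spk spk2utt text wav.scp; do
  # Copy the auxiliary files Kaldi expects alongside feats.scp, if present.
  [ -f $src/$f ] && cp $src/$f $dst/
done
```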
Finally, put the features and labels together in a file so that MXNet can find them. More specifically, for each data set (train, dev, eval), you will need to create a file similar to train_mxnet.feats, with the following contents:

```
TRANSFORM scp:feat.scp scp:label.scp
```
TRANSFORM is the transformation you want to apply to the features; by default, we use NO_FEATURE_TRANSFORM. The scp: syntax is from Kaldi. feat.scp is typically the file from data/sdm1/train/feats.scp, and label.scp is converted from the force-aligned labels located in exp/sdm1/tri3a_ali. Because the force-alignments are generated only on the training data, we split the training set in two, using a 90/10 ratio, and use the 1/10 holdout as the dev set (validation set); a sketch of this split follows below. The run_ami.sh script automatically splits and formats the files for MXNet; before running it, set the paths in the script correctly. Note that run_ami.sh actually runs the full pipeline, including training the acoustic model and decoding, so if it ran successfully, you can skip the following sections.
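As a rough illustration of that split (run_ami.sh performs it for you), the following sketch divides the feature and label lists 90/10 and writes the descriptor files. The label.scp path and the dev file name are assumptions, so adjust them to your setup:

```bash
# Hypothetical sketch of the 90/10 train/dev split; run_ami.sh does this
# automatically. Assumes feats.scp and label.scp list the same utterances
# in the same (sorted) order.
feat=data/sdm1/train_fbank_gcmvn/feats.scp
label=exp/sdm1/tri3a_ali/label.scp     # assumed name for the converted labels

total=$(wc -l < $feat)
ntrain=$(( total * 9 / 10 ))

head -n $ntrain $feat              > feats_tr.scp
tail -n +$(( ntrain + 1 )) $feat   > feats_dev.scp
head -n $ntrain $label             > label_tr.scp
tail -n +$(( ntrain + 1 )) $label  > label_dev.scp

# Descriptor files in the "TRANSFORM scp:feat.scp scp:label.scp" format.
echo "NO_FEATURE_TRANSFORM scp:$PWD/feats_tr.scp scp:$PWD/label_tr.scp"   > train_mxnet.feats
echo "NO_FEATURE_TRANSFORM scp:$PWD/feats_dev.scp scp:$PWD/label_dev.scp" > dev_mxnet.feats
```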
Next, copy default.cfg and edit the necessary parameters, such as the path to the dataset you just prepared, then start training:

```bash
python train_lstm.py --configfile=your-config.cfg
```

For help, use python train_lstm.py --help. You can set all of the configuration parameters in default.cfg, in your customized config file, and on the command line (e.g., --train_batch_size=50); values set later in that chain override the earlier ones.
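For example, the following (with your-config.cfg standing in for your edited copy) trains with the customized config while overriding the batch size from the command line:

```bash
# your-config.cfg is a placeholder for your customized copy of default.cfg;
# the command-line flag overrides the value from the config files.
python train_lstm.py --configfile=your-config.cfg --train_batch_size=50
```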
Here is some example output from training on the TIMIT dataset:
```
Summary of dataset ==================
bucket of len 100 : 3 samples
bucket of len 200 : 346 samples
bucket of len 300 : 1496 samples
bucket of len 400 : 974 samples
bucket of len 500 : 420 samples
bucket of len 600 : 90 samples
bucket of len 700 : 11 samples
bucket of len 800 : 2 samples
Summary of dataset ==================
bucket of len 100 : 0 samples
bucket of len 200 : 28 samples
bucket of len 300 : 169 samples
bucket of len 400 : 107 samples
bucket of len 500 : 41 samples
bucket of len 600 : 6 samples
bucket of len 700 : 3 samples
bucket of len 800 : 0 samples
2016-04-21 20:02:40,904 Epoch[0] Train-Acc_exlude_padding=0.154763
2016-04-21 20:02:40,904 Epoch[0] Time cost=91.574
2016-04-21 20:02:44,419 Epoch[0] Validation-Acc_exlude_padding=0.353552
2016-04-21 20:04:17,290 Epoch[1] Train-Acc_exlude_padding=0.447318
2016-04-21 20:04:17,290 Epoch[1] Time cost=92.870
2016-04-21 20:04:20,738 Epoch[1] Validation-Acc_exlude_padding=0.506458
2016-04-21 20:05:53,127 Epoch[2] Train-Acc_exlude_padding=0.557543
2016-04-21 20:05:53,128 Epoch[2] Time cost=92.390
2016-04-21 20:05:56,568 Epoch[2] Validation-Acc_exlude_padding=0.548100
```
The final frame accuracy was approximately 62%.
To decode with the trained acoustic model, first generate the label counts:

```bash
python make_stats.py --configfile=your-config.cfg | copy-feats ark:- ark:label_mean.ark
```

(Edit the necessary items, such as the path to the training dataset.) This command generates the label counts in label_mean.ark.
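If you want to confirm the counts were written, a small optional check (assuming the Kaldi binaries are on your PATH) is to dump label_mean.ark as text:

```bash
# Optional: print the stored label counts in text form.
copy-feats ark:label_mean.ark ark,t:- | head
```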
Next, make sure the Kaldi recipe's local/ and utils/ directories are available in the working directory (for example, by linking them), and run:

```bash
./run_ami.sh --model prefix model --num_epoch num
```
Here are the results for the TIMIT and AMI test sets (using the default setup, a three-layer LSTM with projection layers):
| Corpus | WER |
|--------|-----|
| TIMIT  | 18.9 |
| AMI    | 51.7 (42.2) |
For AMI, the 42.2 was evaluated on non-overlapped speech. The Kaldi HMM baseline was 67.2%, and the DNN baseline was 57.5%.