You can get the source code for these examples on GitHub.
The examples folder contains examples for speech recognition:
- A DataIter over speech data.
- default.cfg: a configuration for the AMI SDM1 dataset. You can use it as a template for writing other configuration files.
- python_wrap/: C wrappers around Kaldi code, loaded and called from io_func/feat_readers/reader_kaldi.py.

Connect to Kaldi:

- A decoding script called by Kaldi to decode an acoustic model trained by MXNet (using the simple method for decoding).

A full recipe:

- run_ami.sh: a full recipe that trains and decodes an acoustic model on AMI.
To create the speech acoustic modeling example, use the following steps.
Build Kaldi as shared libraries if you have not already done so:
```bash
cd kaldi/src
./configure --shared # and other options that you need
make depend
make
```
Copy the python_wrap folder to kaldi/src and compile it:

```bash
cd kaldi/src/python_wrap/
make
```
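If the build succeeds, the wrapper should be compiled into a shared library. A quick, hedged sanity check (the exact library name is not specified here; check python_wrap/Makefile for the real one):

```bash
# List the shared library produced by the python_wrap build.
# The file name and location are assumptions based on a typical Kaldi Makefile.
ls kaldi/src/python_wrap/*.so
```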
The acoustic models use Mel filter-bank or MFCC features as input. They also rely on Kaldi to perform force-alignment, which produces frame-level labels from the text transcriptions. For example, to work on the AMI SDM1 data, you can run kaldi/egs/ami/s5/run_sdm.sh. Before you run the examples, configure the paths in kaldi/egs/ami/s5/cmd.sh and kaldi/egs/ami/s5/run_sdm.sh. Refer to Kaldi's documentation for details.
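For example, once those paths are configured, the recipe can be launched from the AMI s5 directory. This is only a sketch; the stages that actually run depend on your Kaldi version and configuration:

```bash
# Run the AMI SDM recipe to produce features and force-alignments,
# assuming cmd.sh and run_sdm.sh have already been edited as described above.
cd kaldi/egs/ami/s5
./run_sdm.sh
```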
The default run_sdm.sh script generates the force-alignment labels in stage 7 and saves them in exp/sdm1/tri3a_ali. By default, the script generates 13-dimensional MFCC features. You can train with the MFCC features, or you can compute Mel filter-bank features yourself. For example, you can use a script like the following to compute Mel filter-bank features with Kaldi:
```bash
#!/bin/bash -u

. ./cmd.sh
. ./path.sh

# SDM - Single Distant Microphone
micid=1 # which mic from the array should be used?
mic=sdm$micid

# Set bash to 'debug' mode: it prints the commands (option '-x') and exits on
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
set -euxo pipefail

# Path where AMI gets downloaded (or where locally available):
AMI_DIR=$PWD/wav_db # Default

data_dir=$PWD/data/$mic

# Make filter-bank data
for dset in train dev eval; do
  steps/make_fbank.sh --nj 48 --cmd "$train_cmd" $data_dir/$dset \
    $data_dir/$dset/log $data_dir/$dset/data-fbank

  steps/compute_cmvn_stats.sh $data_dir/$dset \
    $data_dir/$dset/log $data_dir/$dset/data

  apply-cmvn --utt2spk=ark:$data_dir/$dset/utt2spk \
    scp:$data_dir/$dset/cmvn.scp scp:$data_dir/$dset/feats.scp \
    ark,scp:$data_dir/$dset/feats-cmvn.ark,$data_dir/$dset/feats-cmvn.scp

  mv $data_dir/$dset/feats-cmvn.scp $data_dir/$dset/feats.scp
done
```
Here, apply-cmvn performs mean-variance normalization; in this default setup, it is applied per speaker. It is more common to perform mean-variance normalization over the whole corpus and then feed the result to the neural network:
```bash
compute-cmvn-stats scp:data/sdm1/train_fbank/feats.scp data/sdm1/train_fbank/cmvn_g.ark

apply-cmvn --norm-vars=true data/sdm1/train_fbank/cmvn_g.ark \
  scp:data/sdm1/train_fbank/feats.scp \
  ark,scp:data/sdm1/train_fbank_gcmvn/feats.ark,data/sdm1/train_fbank_gcmvn/feats.scp
```
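The same global statistics can then be reused for the other data sets, so that all splits are normalized consistently. A minimal sketch, assuming the dev and eval features live in data/sdm1/dev_fbank and data/sdm1/eval_fbank (adjust the paths to your layout):

```bash
# Apply the global CMVN stats computed on the training set to dev and eval.
# Directory names below are assumptions, not part of the original recipe.
for dset in dev eval; do
  mkdir -p data/sdm1/${dset}_fbank_gcmvn
  apply-cmvn --norm-vars=true data/sdm1/train_fbank/cmvn_g.ark \
    scp:data/sdm1/${dset}_fbank/feats.scp \
    ark,scp:data/sdm1/${dset}_fbank_gcmvn/feats.ark,data/sdm1/${dset}_fbank_gcmvn/feats.scp
done
```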
Note that Kaldi always tries to find features in feats.scp. Ensure that the normalized features are organized as Kaldi expects them during decoding.
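One way to satisfy this, sketched under the assumption that you use the directory layout above, is to turn the *_gcmvn directory into a regular Kaldi data directory by copying the remaining metadata files next to the normalized feats.scp:

```bash
# The normalized feats.scp already lives in data/sdm1/train_fbank_gcmvn/;
# copy the other Kaldi data files so the directory can be used at decode time.
# File names and paths here are assumptions, not part of the original recipe.
for f in utt2spk spk2utt wav.scp text segments; do
  [ -f data/sdm1/train_fbank/$f ] && cp data/sdm1/train_fbank/$f data/sdm1/train_fbank_gcmvn/
done
```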
Finally, put the features and labels together in a file so that MXNet can find them. More specifically, for each data set (train, dev, eval), you will need to create a file similar to train_mxnet.feats, with the following contents:
```
TRANSFORM scp:feat.scp scp:label.scp
```
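For instance, a train_mxnet.feats file for the training split might contain a single line like the one below; the feats.scp path matches the AMI layout used above, while the label.scp path is only a placeholder for wherever you write the converted alignment labels:

```
NO_FEATURE_TRANSFORM scp:data/sdm1/train/feats.scp scp:data/sdm1/train/labels.scp
```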
TRANSFORM is the transformation you want to apply to the features; by default, we use NO_FEATURE_TRANSFORM. The scp: syntax is from Kaldi. feat.scp is typically the file from data/sdm1/train/feats.scp, and label.scp is converted from the force-aligned labels located in exp/sdm1/tri3a_ali. Because the force-alignments are generated only on the training data, we split the training set 90/10 and use the 10% holdout as the dev (validation) set. The script run_ami.sh automatically splits the data and formats the files for MXNet; before running it, set the paths in the script correctly. Note that run_ami.sh actually runs the full pipeline, including training the acoustic model and decoding, so if that script ran successfully, you can skip the following sections.
To train the acoustic model:

1. Start from default.cfg and edit the necessary parameters, such as the path to the dataset you just prepared.
2. Run `python train_lstm.py --configfile=your-config.cfg`. For help, use `python train_lstm.py --help`.

You can set all of the configuration parameters in default.cfg, in your customized config file, and on the command line (e.g., `--train_batch_size=50`); values given later in that order overwrite earlier ones.

Here are some example outputs from training on the TIMIT dataset:
```
Summary of dataset ==================
bucket of len 100 : 3 samples
bucket of len 200 : 346 samples
bucket of len 300 : 1496 samples
bucket of len 400 : 974 samples
bucket of len 500 : 420 samples
bucket of len 600 : 90 samples
bucket of len 700 : 11 samples
bucket of len 800 : 2 samples
Summary of dataset ==================
bucket of len 100 : 0 samples
bucket of len 200 : 28 samples
bucket of len 300 : 169 samples
bucket of len 400 : 107 samples
bucket of len 500 : 41 samples
bucket of len 600 : 6 samples
bucket of len 700 : 3 samples
bucket of len 800 : 0 samples
2016-04-21 20:02:40,904 Epoch[0] Train-Acc_exlude_padding=0.154763
2016-04-21 20:02:40,904 Epoch[0] Time cost=91.574
2016-04-21 20:02:44,419 Epoch[0] Validation-Acc_exlude_padding=0.353552
2016-04-21 20:04:17,290 Epoch[1] Train-Acc_exlude_padding=0.447318
2016-04-21 20:04:17,290 Epoch[1] Time cost=92.870
2016-04-21 20:04:20,738 Epoch[1] Validation-Acc_exlude_padding=0.506458
2016-04-21 20:05:53,127 Epoch[2] Train-Acc_exlude_padding=0.557543
2016-04-21 20:05:53,128 Epoch[2] Time cost=92.390
2016-04-21 20:05:56,568 Epoch[2] Validation-Acc_exlude_padding=0.548100
```
The final frame accuracy was approximately 62%.
To decode, first generate the label counts:

1. Run `python make_stats.py --configfile=your-config.cfg | copy-feats ark:- ark:label_mean.ark` (editing the necessary items, such as the path to the training dataset). This command generates the label counts in label_mean.ark.
2. Link the necessary Kaldi directories, local/ and utils/, and run `./run_ami.sh --model <model prefix> --num_epoch <num>`.

Here are the results for the TIMIT and AMI test sets (using the default setup, a three-layer LSTM with projection layers):
| Corpus | WER |
|--------|-----|
| TIMIT  | 18.9 |
| AMI    | 51.7 (42.2) |
For AMI, the 42.2 figure was evaluated on non-overlapped speech. The Kaldi-HMM baseline was 67.2%, and the DNN baseline was 57.5%.