| # Speech LSTM |
| You can get the source code for these examples on [GitHub](https://github.com/dmlc/mxnet/tree/master/example/speech-demo). |
| |
| ## Speech Acoustic Modeling Example |
| |
The folder contains the following examples for speech recognition:
| |
| - [lstm_proj.py](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/lstm_proj.py): Functions for building an LSTM network with and without a projection layer. |
| - [io_util.py](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/io_util.py): Wrapper functions for `DataIter` over speech data. |
| - [train_lstm_proj.py](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/train_lstm_proj.py): A script for training an LSTM acoustic model. |
| - [decode_mxnet.py](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/decode_mxnet.py): A script for decoding an LSTMP acoustic model. |
| - [default.cfg](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/default.cfg): Configuration for training on the `AMI` SDM1 dataset. You can use it as a template for writing other configuration files. |
- [python_wrap](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/python_wrap): C wrappers for Kaldi C++ code, built into an .so file. The Python code that loads the .so file and calls the C wrapper functions is in `io_func/feat_readers/reader_kaldi.py`.
| |
To connect to Kaldi:
| |
| - [decode_mxnet.sh](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/decode_mxnet.sh): Called by Kaldi to decode an acoustic model trained by MXNet (select the `simple` method for decoding). |
| |
A full recipe:
| |
- [run_ami.sh](https://github.com/dmlc/mxnet/tree/master/example/speech-demo/run_ami.sh): A full recipe for training and decoding an acoustic model on AMI. It takes features and alignments from Kaldi, trains an acoustic model, and decodes with it.
| |
| To create the speech acoustic modeling example, use the following steps. |
| |
| ### Build Kaldi |
| |
| Build Kaldi as shared libraries if you have not already done so. |
| |
| ```bash |
| cd kaldi/src |
| ./configure --shared # and other options that you need |
| make depend |
| make |
| ``` |
| |
| ### Build the Python Wrapper |
| |
| 1. Copy or link the attached `python_wrap` folder to `kaldi/src`. |
2. Compile `python_wrap/`:

```bash
| cd kaldi/src/python_wrap/ |
| make |
| ``` |
| |
| ### Extract Features and Prepare Frame-level Labels |
| |
The acoustic models use Mel filter-bank or MFCC features as input. They also rely on Kaldi to perform forced alignment, which generates frame-level labels from the text transcriptions. For example, to work on the `AMI` `SDM1` data, you can run `kaldi/egs/ami/s5/run_sdm.sh`. Before running it, you need to configure some paths in `kaldi/egs/ami/s5/cmd.sh` and `kaldi/egs/ami/s5/run_sdm.sh`; refer to Kaldi's documentation for details.
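
A minimal sketch of that step, assuming a standard Kaldi checkout and that the paths in the two files above have already been set:

```bash
# Hedged sketch: run the Kaldi AMI SDM recipe to produce MFCC features and
# forced alignments. Edit cmd.sh (e.g., run.pl vs. queue.pl) and the data
# paths in run_sdm.sh before running.
cd kaldi/egs/ami/s5
./run_sdm.sh
```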
| |
The default `run_sdm.sh` script generates the forced-alignment labels in its stage 7 and saves them in `exp/sdm1/tri3a_ali`. It also generates 13-dimensional MFCC features. You can train on the MFCC features, or you can create Mel filter-bank features yourself. For example, you can use a script like the following to compute Mel filter-bank features with Kaldi:
| |
| ```bash |
| #!/bin/bash -u |
| |
| . ./cmd.sh |
| . ./path.sh |
| |
| # SDM - Single Distant Microphone |
micid=1 # which mic from the array should be used?
| mic=sdm$micid |
| |
# Set bash to 'debug' mode: it prints the commands (option '-x') and exits on
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'.
| set -euxo pipefail |
| |
| # Path where AMI gets downloaded (or where locally available): |
AMI_DIR=$PWD/wav_db # default
| data_dir=$PWD/data/$mic |
| |
| # make filter bank data |
| for dset in train dev eval; do |
| steps/make_fbank.sh --nj 48 --cmd "$train_cmd" $data_dir/$dset \ |
| $data_dir/$dset/log $data_dir/$dset/data-fbank |
| steps/compute_cmvn_stats.sh $data_dir/$dset \ |
| $data_dir/$dset/log $data_dir/$dset/data |
| |
| apply-cmvn --utt2spk=ark:$data_dir/$dset/utt2spk \ |
| scp:$data_dir/$dset/cmvn.scp scp:$data_dir/$dset/feats.scp \ |
| ark,scp:$data_dir/$dset/feats-cmvn.ark,$data_dir/$dset/feats-cmvn.scp |
| |
| mv $data_dir/$dset/feats-cmvn.scp $data_dir/$dset/feats.scp |
| done |
| ``` |
`apply-cmvn` performs mean-variance normalization. By default, it is applied per speaker. It is more common to perform mean-variance normalization over the whole corpus and then feed the results to the neural network:
| |
| ``` |
| compute-cmvn-stats scp:data/sdm1/train_fbank/feats.scp data/sdm1/train_fbank/cmvn_g.ark |
| apply-cmvn --norm-vars=true data/sdm1/train_fbank/cmvn_g.ark scp:data/sdm1/train_fbank/feats.scp ark,scp:data/sdm1/train_fbank_gcmvn/feats.ark,data/sdm1/train_fbank_gcmvn/feats.scp |
| ``` |
| Note that Kaldi always tries to find features in `feats.scp`. Ensure that the normalized features are organized as Kaldi expects them during decoding. |
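
For example, one hedged way to make the globally normalized features the ones Kaldi sees is to complete the `*_gcmvn` data directory created above (the directory names follow the commands above and may differ in your setup):

```bash
# Hedged sketch: apply-cmvn above already wrote the normalized feats.scp into
# data/sdm1/train_fbank_gcmvn; copy the remaining Kaldi data files alongside it
# so later stages find everything they expect in one directory.
src=data/sdm1/train_fbank
dst=data/sdm1/train_fbank_gcmvn
for f in utt2spk spk2utt text wav.scp segments; do
  [ -f $src/$f ] && cp $src/$f $dst/
done
```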
| |
| Finally, put the features and labels together in a file so that MXNet can find them. More specifically, for each data set (train, dev, eval), you will need to create a file similar to `train_mxnet.feats`, with the following contents: |
| |
| ``` |
| TRANSFORM scp:feat.scp |
| scp:label.scp |
| ``` |
| |
`TRANSFORM` is the transformation you want to apply to the features. By default, we use `NO_FEATURE_TRANSFORM`. The `scp:` syntax is from Kaldi. `feat.scp` is typically the file from `data/sdm1/train/feats.scp`, and `label.scp` is converted from the force-aligned labels located in `exp/sdm1/tri3a_ali`. Because the forced alignments are generated only on the training data, we split the training set 90/10 and use the 10% holdout as the dev (validation) set. The [run_ami.sh](https://github.com/dmlc/mxnet/blob/master/example/speech-demo/run_ami.sh) script automatically splits the data and formats the files for MXNet; set the paths in the script correctly before running it. It actually runs the full pipeline, including training the acoustic model and decoding, so if it runs successfully, you can skip the following sections.
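
If you want to prepare these files by hand instead of through `run_ami.sh`, the sketch below shows one plausible way to do the 90/10 split and the label conversion. The Kaldi tools used (`utils/subset_data_dir_tr_cv.sh`, `ali-to-pdf`, `copy-int-vector`) are standard, but the exact commands and output paths that `run_ami.sh` uses may differ.

```bash
# Hedged sketch: split the training set 90/10 and convert the forced
# alignments into per-frame pdf-id labels indexed by an scp file.
# run_ami.sh automates these steps; the paths below are illustrative.
ali_dir=exp/sdm1/tri3a_ali

# Hold out roughly 10% of the speakers as the dev (validation) set.
utils/subset_data_dir_tr_cv.sh --cv-spk-percent 10 \
  data/sdm1/train data/sdm1/train_tr90 data/sdm1/train_cv10

# Convert alignments to frame-level pdf ids, writing an ark plus an scp index.
ali-to-pdf $ali_dir/final.mdl "ark:gunzip -c $ali_dir/ali.*.gz |" ark:- | \
  copy-int-vector ark:- ark,scp:$PWD/label.ark,$PWD/label.scp
```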
| |
| ### Run MXNet Acoustic Model Training |
| |
| 1. Return to the speech demo directory in MXNet. Make a copy of `default.cfg`, and edit the necessary parameters, such as the path to the dataset you just prepared. |
2. Run `python train_lstm_proj.py --configfile=your-config.cfg`. For help, use `python train_lstm_proj.py --help`. You can set all of the configuration parameters in `default.cfg`, in a customized config file, and on the command line (e.g., `--train_batch_size=50`); later values override earlier ones, as shown in the example below.
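
For example (`my_ami.cfg` is a placeholder for your own copy of `default.cfg`):

```bash
# Command-line flags override the values in the config file;
# my_ami.cfg is a placeholder for your customized copy of default.cfg.
python train_lstm_proj.py --configfile=my_ami.cfg --train_batch_size=50
```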
| |
| Here are some example outputs from training on the TIMIT dataset: |
| |
| ``` |
| Example output for TIMIT: |
| Summary of dataset ================== |
| bucket of len 100 : 3 samples |
| bucket of len 200 : 346 samples |
| bucket of len 300 : 1496 samples |
| bucket of len 400 : 974 samples |
| bucket of len 500 : 420 samples |
| bucket of len 600 : 90 samples |
| bucket of len 700 : 11 samples |
| bucket of len 800 : 2 samples |
| Summary of dataset ================== |
| bucket of len 100 : 0 samples |
| bucket of len 200 : 28 samples |
| bucket of len 300 : 169 samples |
| bucket of len 400 : 107 samples |
| bucket of len 500 : 41 samples |
| bucket of len 600 : 6 samples |
| bucket of len 700 : 3 samples |
| bucket of len 800 : 0 samples |
| 2016-04-21 20:02:40,904 Epoch[0] Train-Acc_exlude_padding=0.154763 |
| 2016-04-21 20:02:40,904 Epoch[0] Time cost=91.574 |
| 2016-04-21 20:02:44,419 Epoch[0] Validation-Acc_exlude_padding=0.353552 |
| 2016-04-21 20:04:17,290 Epoch[1] Train-Acc_exlude_padding=0.447318 |
| 2016-04-21 20:04:17,290 Epoch[1] Time cost=92.870 |
| 2016-04-21 20:04:20,738 Epoch[1] Validation-Acc_exlude_padding=0.506458 |
| 2016-04-21 20:05:53,127 Epoch[2] Train-Acc_exlude_padding=0.557543 |
| 2016-04-21 20:05:53,128 Epoch[2] Time cost=92.390 |
| 2016-04-21 20:05:56,568 Epoch[2] Validation-Acc_exlude_padding=0.548100 |
| ``` |
| |
| The final frame accuracy was approximately 62%. |
| |
| ### Run Decode on the Trained Acoustic Model |
| |
1. Estimate the senone priors by running `python make_stats.py --configfile=your-config.cfg | copy-feats ark:- ark:label_mean.ark` (edit the necessary items, such as the path to the training dataset). This generates the label counts in `label_mean.ark`.
2. Link to the necessary Kaldi decode setup (e.g., `local/` and `utils/`), and run `./run_ami.sh --model <model-prefix> --num_epoch <num>`, as sketched below.
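
A hedged sketch of step 2; `KALDI_ROOT`, the model prefix, and the epoch number are placeholders for your own setup, and your decode setup may need additional links (e.g., `steps/`):

```bash
# Hedged sketch: link the Kaldi AMI decode setup into the speech-demo
# directory, then decode the model saved under the given prefix and epoch.
KALDI_ROOT=/path/to/kaldi            # placeholder
ln -sf $KALDI_ROOT/egs/ami/s5/local  local
ln -sf $KALDI_ROOT/egs/ami/s5/utils  utils
./run_ami.sh --model my_model_prefix --num_epoch 20   # placeholders
```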
| |
| Here are the results for the TIMIT and AMI test sets (using the default setup, three-layer LSTM with projection layers): |
| |
| Corpus | WER (%) |
|--------|---------|
| TIMIT  | 18.9 |
| AMI    | 51.7 (42.2) |
| |
For AMI, the 42.2 in parentheses is the WER evaluated on non-overlapped speech. The Kaldi-HMM baseline was 67.2%, and the DNN baseline was 57.5%.
| |
| ## Next Steps |
| * [MXNet tutorials index](http://mxnet.io/tutorials/index.html) |