**deepSpeech.mxnet: Rich Speech Example**
=========================================

This example, based on [DeepSpeech2 of Baidu](https://arxiv.org/abs/1512.02595), helps you build Speech-To-Text (STT) models at scale using
- CNNs, fully connected networks, (Bi-) RNNs, (Bi-) LSTMs, and (Bi-) GRUs for network layers,
- batch normalization and dropout for training efficiency,
- and Warp CTC for loss calculations.

Moreover, to build your own STT model, all you need to do is edit a configuration file, not the actual code.


* * *
## **Motivation**
This example is intended to guide people who want to build practical STT models with MXNet.
With the rich functionality and convenience described above, you can build your own speech recognition models more easily than with earlier examples.



* * *
## **Environments**
- MXNet version: 0.9.5+
- GPU memory size: 2.4GB+
- Install tensorboard for logging
<pre>
<code>pip install tensorboard</code>
</pre>

- [SoundFile](https://pypi.python.org/pypi/SoundFile/0.8.1) for audio preprocessing (if you encounter errors about libsndfile, follow [this tutorial](http://www.linuxfromscratch.org/blfs/view/svn/multimedia/libsndfile.html); a quick sanity check is sketched after this list)
<pre>
<code>pip install soundfile</code>
</pre>
- Warp CTC: Follow [this instruction](https://github.com/dmlc/mxnet/tree/master/example/warpctc) to install Baidu's Warp CTC.
- **We strongly recommend that you first test a model with small networks.**
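
To verify that SoundFile and libsndfile are installed correctly, a quick read of one of the sample wave files (the path is just an example) should succeed:
<pre><code>import soundfile as sf

# Read a wave file into a NumPy array along with its sample rate.
data, samplerate = sf.read('./Libri_sample/3830-12531-0030.wav')
print(data.shape, samplerate)</code></pre>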


* * *
## **How it works**
### **Preparing data**
Input data are described in a JSON file, **Libri_sample.json**, as follows.
<pre>
<code>{"duration": 2.9450625, "text": "and sharing her house which was near by", "key": "./Libri_sample/3830-12531-0030.wav"}
{"duration": 3.94, "text": "we were able to impart the information that we wanted", "key": "./Libri_sample/3830-12529-0005.wav"}</code>
</pre>
You can download the two wave files above from [this repository](https://github.com/samsungsds-rnd/deepspeech.mxnet/tree/master/Libri_sample). Put them under /path/to/yourproject/Libri_sample/.
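
If you want to build such a manifest for your own recordings, a minimal sketch using SoundFile (installed above) could look like this; the file list and transcript below are placeholders you would replace with your own data:
<pre><code>import json
import soundfile as sf

# Hypothetical (path, transcript) pairs; substitute your own files.
samples = [('./Libri_sample/3830-12531-0030.wav',
            'and sharing her house which was near by')]

with open('my_sample.json', 'w') as f:
    for path, text in samples:
        duration = sf.info(path).duration  # audio length in seconds
        f.write(json.dumps({'duration': duration, 'text': text, 'key': path}) + '\n')</code></pre>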


### **Setting the configuration file**
**[Notice]** The included configuration file "default.cfg" describes DeepSpeech2 with slight changes. You can test the original DeepSpeech2 ("deepspeech.cfg") by changing a few lines of the cfg file:
<pre><code>
[common]
...
learning_rate = 0.0003
# constant learning rate annealing by factor
learning_rate_annealing = 1.1
optimizer = sgd
...
is_bi_graphemes = True
...
[arch]
...
num_rnn_layer = 7
num_hidden_rnn_list = [1760, 1760, 1760, 1760, 1760, 1760, 1760]
num_hidden_proj = 0
num_rear_fc_layers = 1
num_hidden_rear_fc_list = [1760]
act_type_rear_fc_list = ["relu"]
...
[train]
...
learning_rate = 0.0003
# constant learning rate annealing by factor
learning_rate_annealing = 1.1
optimizer = sgd
...
</code></pre>
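
These cfg files are in the standard INI format, so you can inspect a configuration from Python before training. A minimal sketch with Python 3's configparser follows (main.py may parse the file differently; this is only for inspection):
<pre><code>import configparser

# Load the configuration and print a couple of values.
config = configparser.ConfigParser()
config.read('default.cfg')
print(dict(config['arch']))
print('optimizer =', config.get('train', 'optimizer'))</code></pre>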


* * *
## **Run the example**
### **Train**
<pre><code>cd /path/to/your/project/
mkdir checkpoints
mkdir log
python main.py --configfile default.cfg</code></pre>
Checkpoints of the model will be saved at every n-th epoch.
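
Saved checkpoints are ordinary MXNet checkpoints, so you can also load one outside main.py with MXNet's standard API; the prefix and epoch below are placeholders for whatever was actually written into the checkpoints directory:
<pre><code>import mxnet as mx

# Load the symbol and parameters of a saved checkpoint
# (prefix 'checkpoints/deepspeech' and epoch 50 are examples).
sym, arg_params, aux_params = mx.model.load_checkpoint('checkpoints/deepspeech', 50)
print(sym.list_arguments()[:5])</code></pre>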

### **Load**
You can (re-)train saved models by loading their checkpoints (epoch numbering starts from 0). To do so, you need to modify only two lines of the file "default.cfg".
<pre><code>...
[common]
# mode can be one of the following: train, predict, load
mode = load
...
model_file = 'file name of your saved model'
...</code></pre>


### **Predict**
You can predict (or test) audio files by specifying the mode, model, and test data in the file "default.cfg".
<pre><code>...
[common]
# mode can be one of the following: train, predict, load
mode = predict
...
model_file = 'file name of your model to be tested'
...
[data]
...
test_json = 'a json file describing the test audio files'
...</code></pre>
<br />
Run the following line after making the modifications explained above.
<pre><code>python main.py --configfile default.cfg</code></pre>


* * *
## **Train and test your own models**

Train and test your own models by preparing two files.
1) A new configuration file, e.g., custom.cfg, corresponding to the file 'default.cfg'.
The new file should specify the items below the '[arch]' section of the original file.
2) A new implementation file, e.g., arch_custom.py, corresponding to the file 'arch_deepspeech.py'.
The new file should implement two functions, prepare_data() and arch(), for building the networks described in the new configuration file; a skeleton is sketched below.
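
A minimal skeleton of such a file might look like the following; the exact signatures and return values are assumptions, so consult 'arch_deepspeech.py' for the real interface before filling it in:
<pre><code>import mxnet as mx

def prepare_data(args):
    # Prepare whatever the network needs beyond the raw batch,
    # e.g. initial RNN states (left empty in this placeholder).
    init_states = []
    return init_states

def arch(args):
    # Build and return the network symbol; a real model would end
    # in a CTC loss as in arch_deepspeech.py.
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, num_hidden=1024, name='fc1')
    net = mx.sym.Activation(data=net, act_type='relu', name='relu1')
    return net</code></pre>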

Run the following line after preparing the files.
<pre><code>python main.py --configfile custom.cfg --archfile arch_custom</code></pre>