This is a Gluon implementation of [LipNet: End-to-End Sentence-level Lipreading](https://arxiv.org/abs/1611.01599).

```
pip install -r requirements.txt
```
You can train the model yourself using the sections below, run inference with a pretrained model, or resume training from a model checkpoint. To work with the provided pretrained model, first download it, then run either the inference script (`infer.py`) or the training script (`main.py`) against it:

```
python infer.py model_path='checkpoint/epoches_81_loss_15.7157'
python main.py model_path='checkpoint/epoches_81_loss_15.7157'
```
You can prepare the data yourself, or you can download preprocessed data.
There are two download routes provided for the preprocessed data.
To download the tar-zipped files by link, download the following files and extract them into a folder called `data` in the root of this example folder. You should end up with the following structure:

```
/lipnet/data/align
/lipnet/data/datasets
```
To get the folders and files already unzipped with the AWS CLI, use the following command; it creates the folder structure for you. Run it from `/lipnet/`:

```
aws s3 sync s3://mxnet-public/lipnet/data .
```
To download the raw data yourself instead, run:

```
cd ./utils && python download_data.py --n_process=$(nproc)
```
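The `--n_process` flag fans the downloads out over multiple worker processes. A minimal sketch of that pattern, assuming a hypothetical `download(url)` worker (the real script's internals may differ):

```python
# Sketch of parallel downloading with multiprocessing, as download_data.py's
# --n_process flag suggests. The download() worker here is a placeholder.
from multiprocessing import Pool, cpu_count


def download(url):
    # Placeholder worker: a real implementation would fetch the file here
    # (e.g. with urllib) and write it under ./data/. It returns the file name.
    return url.rsplit("/", 1)[-1]


def download_all(urls, n_process=None):
    """Download all URLs using a pool of n_process workers (default: all cores)."""
    n_process = n_process or cpu_count()
    with Pool(n_process) as pool:
        # Pool.map preserves the input order of the results.
        return pool.map(download, urls)
```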
For example:

- video: `./data/mp4s/s2/bbbf7p.mpg`
- align (target): `./data/align/s2/bbbf7p.align`: `sil bin blue by f seven please sil`
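The target above is the word sequence from a GRID-corpus align file. A minimal parser sketch, assuming the standard GRID line format of `start end word` and dropping the silence tokens (`parse_align` is illustrative, not part of this repo):

```python
def parse_align(text):
    """Parse GRID-style align content ('start end word' per line) and
    return the spoken sentence with silence tokens ('sil', 'sp') removed."""
    words = []
    for line in text.strip().splitlines():
        _, _, word = line.split()  # start frame, end frame, word
        if word not in ("sil", "sp"):
            words.append(word)
    return " ".join(words)
```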
Each video is decomposed into 75 frames (Frame 0 through Frame 74), and the mouth region is then cropped from each frame.
You can run the preprocessing with just one process, but this takes a long time (more than 48 hours). To use all available processors, run:

```
cd ./utils && python preprocess_data.py --n_process=$(nproc)
```
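Preprocessing writes 75 cropped mouth images per video. A sketch of the resulting path layout (the `output_frame_paths` helper is illustrative, not part of the real script, which also runs the face and landmark detection that produces the crops):

```python
from pathlib import Path


def output_frame_paths(video_path, out_root="./data/datasets", n_frames=75):
    """Map a raw video like ./data/mp4s/s2/bbbf7p.mpg to the 75 cropped
    mouth-image paths the preprocessing produces, e.g.
    data/datasets/s2/bbbf7p/mouth_000.png ... mouth_074.png."""
    video = Path(video_path)
    speaker, clip = video.parent.name, video.stem
    out_dir = Path(out_root) / speaker / clip
    # Frame indices are zero-padded to three digits: mouth_000.png .. mouth_074.png
    return [out_dir / f"mouth_{i:03d}.png" for i in range(n_frames)]
```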
The training data folder should look like this:

```
<train_data_root>
|--datasets
   |--s1
      |--bbir7s
         |--mouth_000.png
         |--mouth_001.png
         ...
      |--bgaa8p
         |--mouth_000.png
         |--mouth_001.png
         ...
   |--s2
      ...
|--align
   |--bw1d8a.align
   |--bggzzs.align
   ...
```
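Given that layout, each clip's frame directory pairs with one align file. A minimal sketch of collecting those pairs (the `collect_samples` helper is illustrative; it assumes align files are keyed by clip name, as in the tree above):

```python
from pathlib import Path


def collect_samples(root):
    """Pair each clip's frame directory under datasets/<speaker>/<clip>/
    with its <clip>.align file under align/."""
    root = Path(root)
    samples = []
    for clip_dir in sorted(root.glob("datasets/*/*")):
        if clip_dir.is_dir():
            align = root / "align" / (clip_dir.name + ".align")
            samples.append((clip_dir, align))
    return samples
```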
After you have acquired the preprocessed data, you are ready to train the LipNet model.
Following LipNet: End-to-End Sentence-level Lipreading, four of the 34 subjects (S1, S2, S20, S22) are used for evaluation; the remaining subjects are used for training.
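That split can be sketched as follows (the `split_subjects` helper is illustrative, not part of this repo):

```python
def split_subjects(all_subjects, eval_subjects=("s1", "s2", "s20", "s22")):
    """Split GRID speakers into train/eval sets following the paper:
    s1, s2, s20, and s22 are held out for evaluation."""
    eval_set = [s for s in all_subjects if s in eval_subjects]
    train_set = [s for s in all_subjects if s not in eval_subjects]
    return train_set, eval_set
```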
When training on multiple GPUs, it is recommended to scale the batch size by the number of GPUs, i.e. make it $(num_gpus) times larger.
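The idea is to keep the per-device batch constant while the global batch grows with the device count. A one-line sketch of that rule (`scaled_batch_size` is illustrative):

```python
def scaled_batch_size(base_batch_size, num_gpus):
    """Scale the global batch size with the GPU count so that each GPU
    keeps the same per-device batch; fall back to the base size on CPU."""
    return base_batch_size * max(1, num_gpus)
```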
To train, run:

```
python main.py
```

Training setup used:

- 72 CPU cores
- 1 GPU (NVIDIA Tesla V100 SXM2, 32 GB)
- batch size 128
To run inference with a trained model:

```
python infer.py --model_path=$(model_path)
```
```
[Target]
['lay green with a zero again', 'bin blue with r nine please', 'set blue with e five again', 'bin green by t seven soon', 'lay red at d five now', 'bin green in x eight now', 'bin blue with e one now', 'lay red at j nine now']
[Pred]
['lay green with s zero again', 'bin blue with r nine please', 'set blue with e five again', 'bin green by t seven soon', 'lay red at c five now', 'bin green in x eight now', 'bin blue with m one now', 'lay red at j nine now']
```
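A common way to quantify how close such predictions are to the targets is the word error rate. A minimal sketch (the `edit_distance` and `word_error_rate` helpers are illustrative, not part of this repo):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # dp[j] is old row, dp[j-1] is new row, prev is old dp[j-1]
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]


def word_error_rate(targets, preds):
    """Total word errors across all sentence pairs, divided by the total
    number of reference words."""
    errors = sum(edit_distance(t.split(), p.split())
                 for t, p in zip(targets, preds))
    total = sum(len(t.split()) for t in targets)
    return errors / total
```

For the first target/prediction pair above, the only error is the substitution of `a` with `s`, so the pair contributes one word error out of six words.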