MXNet Model Zoo

MXNet features fast implementations of many state-of-the-art models reported in the academic literature. This Model Zoo is an ongoing project to collect complete models, with Python scripts, pre-trained weights, and instructions on how to build and fine-tune these models.

How to Contribute a Pre-Trained Model (and what to include)

The Model Zoo has good entries for CNNs but is seeking content in other areas.

Issue a Pull Request containing the following:

  • Gist Log
  • .json model definition
  • Model parameter file
  • Readme file (details below)

The readme file should contain:

  • Model location and access instructions (e.g., wget)
  • Confirmation that the trained model meets the accuracy published in the original paper
  • Step-by-step instructions on how to use the trained model
  • References to any other applicable docs or arXiv papers the model is based on

Convolutional Neural Networks (CNNs)

Convolutional neural networks are the state-of-the-art architecture for many image and video processing problems. Some available datasets include:

  • ImageNet: a large corpus of 1 million natural images, divided into 1000 categories.
  • CIFAR10: 60,000 natural images (32 x 32 pixels) from 10 categories.
  • PASCAL_VOC: A collection of natural images annotated with object bounding boxes across 20 categories.
  • UCF101: 13,320 videos from 101 action categories.
  • Mini-Places2: Subset of the Places2 dataset. Includes 100,000 images from 100 scene categories.
  • ImageNet 11k
  • Places2: Places365-Standard contains 1.6 million training images from 365 scene categories, used to train the Places365 CNNs, with 50 validation images and 900 test images per category. The Places365-Challenge training set adds 6.2 million extra images, for a total of about 8 million training images used in the Places365 Challenge 2016; its validation and test sets are the same as Places365-Standard's.
  • Multimedia Commons: YFCC100M (99.2 million images and 0.8 million videos from Flickr) and supplemental material (pre-extracted features, additional annotations).

For instructions on using these models, see the python tutorial on using pre-trained ImageNet models.
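Whichever model you pick, input images first need ImageNet-style preprocessing: per-channel mean subtraction and conversion to an NCHW batch. A minimal numpy sketch, assuming a 224 x 224 RGB input and commonly used approximate mean values (exact values vary by model):

```python
import numpy as np

def preprocess(img):
    """Turn an HxWx3 uint8 RGB image into the NCHW float batch that
    ImageNet-trained models typically expect (a sketch; exact resizing
    and mean values vary by model)."""
    img = img.astype(np.float32)
    # Subtract a per-channel ImageNet mean (common approximate values).
    mean = np.array([123.68, 116.78, 103.94], dtype=np.float32)
    img = img - mean
    # HWC -> CHW, then add a leading batch dimension.
    return img.transpose(2, 0, 1)[np.newaxis, ...]

batch = preprocess(np.zeros((224, 224, 3), dtype=np.uint8))
print(batch.shape)  # (1, 3, 224, 224)
```

The resulting batch can then be bound to the network's data input and run through a forward pass.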

Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
CaffeNet | ImageNet | Param File | Krizhevsky, 2012 | @jspisak
Network in Network (NiN) | ImageNet | Param File | Lin et al., 2014 | @jspisak
SqueezeNet v1.1 | ImageNet | Param File | Iandola et al., 2016 | @jspisak
VGG16 | ImageNet | Param File | Simonyan et al., 2015 | @jspisak
VGG19 | ImageNet | Param File | Simonyan et al., 2015 | @jspisak
Inception v3 w/ BatchNorm | ImageNet | Param File | Szegedy et al., 2015 | @jspisak
ResidualNet152 | ImageNet | Param File | He et al., 2015 | @jspisak
ResNext101-64x4d | ImageNet | Param File | Xie et al., 2016 | @Jerryzcn
Fast-RCNN | PASCAL VOC | [Param File] | Girshick, 2015 | 
Faster-RCNN | PASCAL VOC | [Param File] | Ren et al., 2016 | 
Single Shot Detection (SSD) | PASCAL VOC | [Param File] | Liu et al., 2016 | 
LocationNet | MultimediaCommons | Param File | Weyand et al., 2016 | @jychoi84 @kevinli7

Recurrent Neural Networks (RNNs) including LSTMs

MXNet supports many types of recurrent neural networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Some available datasets include:

  • Penn Treebank (PTB): Text corpus with ~1 million words. Vocabulary is limited to 10,000 words. The task is predicting downstream words/characters.
  • Shakespeare: Complete text from Shakespeare's works.
  • IMDB reviews: 25,000 movie reviews, labeled as positive or negative.
  • Facebook bAbI: A set of 20 question-and-answer tasks, each with 1,000 training examples.
  • Flickr8k, COCO: Images with associated captions. Flickr8k consists of 8,092 images, each captioned by Amazon Mechanical Turk workers (~40,000 captions in total). COCO has 328,000 images, each with 5 captions; COCO images also come with object segmentation labels.
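Under the hood, an LSTM consumes a sequence one step at a time, mixing each new input into a gated cell state. A plain-numpy sketch of a single LSTM step (names and sizes are illustrative, not MXNet's actual API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM time step. W: (input+hidden, 4*hidden), b: (4*hidden,).
    The four gate blocks are input, forget, output, and candidate."""
    z = np.concatenate([x, h]) @ W + b
    H = h.shape[0]
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c_new = f * c + i * g          # update the cell state
    h_new = o * np.tanh(c_new)     # expose a gated view of it
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.normal(scale=0.1, size=(n_in + n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for t in range(5):                 # unroll over a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape)  # (16,)
```

In practice MXNet provides fused, GPU-accelerated RNN operators; this sketch only shows the recurrence the models in the table below are built on.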
Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
LSTM - Image Captioning | Flickr8k, MS COCO | | [Vinyals et al., 2015](https://arxiv.org/pdf/1411.4555v2.pdf) | @...
LSTM - Q&A System | bAbI | | Weston et al., 2015 | 
LSTM - Sentiment Analysis | IMDB | | Li et al., 2015 | 

Generative Adversarial Networks (GANs)

Generative Adversarial Networks train a competing pair of neural networks: a generator network that transforms a latent vector into content such as an image, and a discriminator network that tries to distinguish generated content from supplied “real” training content. When properly trained, the two networks reach a Nash equilibrium.
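The competing objectives can be made concrete with a toy numpy sketch: a linear “generator” maps latent noise to scalars, a logistic “discriminator” scores real versus generated samples, and the two losses pull the discriminator's output in opposite directions (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy 1-D "networks": the generator shifts/scales a latent sample,
# the discriminator is a logistic classifier on scalars.
g_w, g_b = 1.0, 0.0            # generator parameters
d_w, d_b = 1.0, 0.0            # discriminator parameters

real = rng.normal(loc=4.0, scale=0.5, size=256)   # "real" data
z = rng.normal(size=256)                          # latent vectors
fake = g_w * z + g_b                              # generated samples

# The discriminator wants D(real) -> 1 and D(fake) -> 0 ...
d_loss = -np.mean(np.log(sigmoid(d_w * real + d_b))
                  + np.log(1.0 - sigmoid(d_w * fake + d_b)))
# ... while the generator wants D(fake) -> 1: the objectives compete.
g_loss = -np.mean(np.log(sigmoid(d_w * fake + d_b)))
print(d_loss > 0 and g_loss > 0)  # True
```

A real GAN alternates gradient steps on these two losses with deep networks in place of the linear maps; the models below do exactly that at scale.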

Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
DCGANs | ImageNet | | Radford et al., 2016 | @...
Text to Image Synthesis | MS COCO | | Reed et al., 2016 | 
Deep Jazz | | | Deepjazz.io | 

Other Models

MXNet supports a variety of model types beyond the canonical CNN and LSTM architectures, including deep reinforcement learning models, linear models, and more. Some available datasets and sources include:

  • Google News: word2vec vectors with a vocabulary of 3 million words, trained on the Google News corpus.
  • MovieLens 20M Dataset: 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
  • Atari Video Game Emulator: Stella is a multi-platform Atari 2600 VCS emulator released under the GNU General Public License (GPL).
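The Deep Q-Network and A3C entries below scale up classic reinforcement learning with deep networks; the tabular Q-learning update they generalize can be sketched on a toy chain environment (all names and hyperparameters here are illustrative):

```python
import numpy as np

# Toy chain environment: states 0..4, actions 0 (left) / 1 (right),
# reward 1 only on reaching the rightmost state, which ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(int(np.argmax(Q[0])))  # greedy action at the start state
```

A DQN replaces the table Q with a neural network over raw Atari frames, but the update target is the same.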
Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
Word2Vec | Google News | | Mikolov et al., 2013 | @...
Matrix Factorization | MovieLens 20M | | Huang et al., 2013 | 
Deep Q-Network | Atari video games | | Mnih et al., 2015 | 
Asynchronous Advantage Actor-Critic (A3C) | Atari video games | | Mnih et al., 2016 | 
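The matrix-factorization entry above can be illustrated with a tiny numpy SGD sketch that learns user and item factors from a synthetic low-rank ratings matrix (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic ratings matrix with rank-2 ground truth, standing in
# for a (much larger) MovieLens-style user x item ratings table.
n_users, n_items, rank = 20, 15, 2
U_true = rng.normal(size=(n_users, rank))
V_true = rng.normal(size=(n_items, rank))
R = U_true @ V_true.T

# Learn user factors U and item factors V by SGD on (user, item, rating)
# triples, minimizing squared error with light L2 regularization.
U = 0.1 * rng.normal(size=(n_users, rank))
V = 0.1 * rng.normal(size=(n_items, rank))
lr, reg = 0.05, 1e-4

for epoch in range(200):
    for u in range(n_users):
        for i in range(n_items):
            err = R[u, i] - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(rmse)  # root-mean-square reconstruction error
```

The predicted rating for any (user, item) pair is then just the dot product `U[u] @ V[i]`; at MovieLens scale the same idea runs over observed ratings only.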