MXNet Model Zoo

MXNet features fast implementations of many state-of-the-art models reported in the academic literature. This Model Zoo is an ongoing project to collect complete models, with Python scripts, pre-trained weights, and instructions on how to build and fine-tune these models.

How to Contribute a Pre-Trained Model (and what to include)

The Model Zoo has good entries for CNNs but is seeking content in other areas.

Issue a Pull Request containing the following:

  • Gist Log
  • .json model definition
  • Model parameter file
  • Readme file (details below)

The readme file should contain:

  • Model location and access instructions (e.g., wget)
  • Confirmation that the trained model meets the accuracy published in the original paper
  • Step-by-step instructions on how to use the trained model
  • References to any other applicable docs or arXiv papers the model is based on

Convolutional Neural Networks (CNNs)

Convolutional neural networks are the state-of-the-art architecture for many image and video processing problems. Some available datasets include:

  • ImageNet: a large corpus of 1 million natural images, divided into 1000 categories.
  • CIFAR10: 60,000 natural images (32 x 32 pixels) from 10 categories.
  • PASCAL_VOC: A collection of natural images annotated with object bounding boxes across 20 categories.
  • UCF101: 13,320 videos from 101 action categories.
  • Mini-Places2: Subset of the Places2 dataset. Includes 100,000 images from 100 scene categories.
  • ImageNet 11k
  • Places2: Places365-Standard contains 1.6 million training images from 365 scene categories, used to train the Places365 CNNs, with 50 validation images and 900 test images per category. The Places365-Challenge training set adds 6.2 million extra images, for a total of about 8 million training images used in the Places365 Challenge 2016; its validation and test sets are the same as Places365-Standard's.
  • Multimedia Commons: YFCC100M (99.2 million images and 0.8 million videos from Flickr) and supplemental material (pre-extracted features, additional annotations).

For instructions on using these models, see the python tutorial on using pre-trained ImageNet models.
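Whichever model you pick, input images first need ImageNet-style preprocessing: per-channel mean subtraction and conversion to an NCHW batch. A minimal numpy sketch, assuming a 224 x 224 RGB input and commonly used approximate mean values (exact values vary by model):

```python
import numpy as np

def preprocess(img):
    """Turn an HxWx3 uint8 RGB image into the NCHW float batch that
    ImageNet-trained models typically expect (a sketch; exact resizing
    and mean values vary by model)."""
    img = img.astype(np.float32)
    # Subtract a per-channel ImageNet mean (common approximate values).
    mean = np.array([123.68, 116.78, 103.94], dtype=np.float32)
    img = img - mean
    # HWC -> CHW, then add a leading batch dimension.
    return img.transpose(2, 0, 1)[np.newaxis, ...]

batch = preprocess(np.zeros((224, 224, 3), dtype=np.uint8))
print(batch.shape)  # (1, 3, 224, 224)
```

The resulting batch can then be bound to the network's data input and run through a forward pass.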

Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
CaffeNet | ImageNet | Param File | Krizhevsky, 2012 | @jspisak
Network in Network (NiN) | ImageNet | Param File | Lin et al., 2014 | @jspisak
SqueezeNet v1.1 | ImageNet | Param File | Iandola et al., 2016 | @jspisak
VGG16 | ImageNet | Param File | Simonyan et al., 2015 | @jspisak
VGG19 | ImageNet | Param File | Simonyan et al., 2015 | @jspisak
Inception v3 w/ BatchNorm | ImageNet | Param File | Szegedy et al., 2015 | @jspisak
ResidualNet152 | ImageNet | Param File | He et al., 2015 | @jspisak
ResNext101-64x4d | ImageNet | Param File | Xie et al., 2016 | @Jerryzcn
Fast-RCNN | PASCAL VOC | [Param File] | Girshick, 2015 | 
Faster-RCNN | PASCAL VOC | [Param File] | Ren et al., 2016 | 
Single Shot Detection (SSD) | PASCAL VOC | [Param File] | Liu et al., 2016 | 
LocationNet | MultimediaCommons | Param File | Weyand et al., 2016 | @jychoi84 @kevinli7

Recurrent Neural Networks (RNNs) including LSTMs

MXNet supports many types of recurrent neural networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Some available datasets include:

  • Penn Treebank (PTB): Text corpus with ~1 million words. Vocabulary is limited to 10,000 words. The task is predicting downstream words/characters.
  • Shakespeare: Complete text from Shakespeare's works.
  • IMDB reviews: 25,000 movie reviews, labeled as positive or negative.
  • Facebook bAbI: A set of 20 question-and-answer tasks, each with 1,000 training examples.
  • Flickr8k, COCO: Images with associated captions. Flickr8k consists of 8,092 images, each captioned by Amazon Mechanical Turk workers (~40,000 captions in total). COCO has 328,000 images, each with 5 captions; COCO images also come with object segmentation labels.
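Under the hood, an LSTM consumes a sequence one step at a time, mixing each new input into a gated cell state. A plain-numpy sketch of a single LSTM step (names and sizes are illustrative, not MXNet's actual API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM time step. W: (input+hidden, 4*hidden), b: (4*hidden,).
    The four gate blocks are input, forget, output, and candidate."""
    z = np.concatenate([x, h]) @ W + b
    H = h.shape[0]
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c_new = f * c + i * g          # update the cell state
    h_new = o * np.tanh(c_new)     # expose a gated view of it
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.normal(scale=0.1, size=(n_in + n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for t in range(5):                 # unroll over a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape)  # (16,)
```

In practice MXNet provides fused, GPU-accelerated RNN operators; this sketch only shows the recurrence the models in the table below are built on.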
Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
LSTM - Image Captioning | Flickr8k, MS COCO | | [Vinyals et al., 2015](https://arxiv.org/pdf/1411.4555v2.pdf) | @...
LSTM - Q&A System | bAbI | | Weston et al., 2015 | 
LSTM - Sentiment Analysis | IMDB | | Li et al., 2015 | 

Generative Adversarial Networks (GANs)

Generative Adversarial Networks train a competing pair of neural networks: a generator network that transforms a latent vector into content such as an image, and a discriminator network that tries to distinguish generated content from supplied “real” training content. When properly trained, the two networks reach a Nash equilibrium.
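The competing objectives can be made concrete with a toy numpy sketch: a linear “generator” maps latent noise to scalars, a logistic “discriminator” scores real versus generated samples, and the two losses pull the discriminator's output in opposite directions (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy 1-D "networks": the generator shifts/scales a latent sample,
# the discriminator is a logistic classifier on scalars.
g_w, g_b = 1.0, 0.0            # generator parameters
d_w, d_b = 1.0, 0.0            # discriminator parameters

real = rng.normal(loc=4.0, scale=0.5, size=256)   # "real" data
z = rng.normal(size=256)                          # latent vectors
fake = g_w * z + g_b                              # generated samples

# The discriminator wants D(real) -> 1 and D(fake) -> 0 ...
d_loss = -np.mean(np.log(sigmoid(d_w * real + d_b))
                  + np.log(1.0 - sigmoid(d_w * fake + d_b)))
# ... while the generator wants D(fake) -> 1: the objectives compete.
g_loss = -np.mean(np.log(sigmoid(d_w * fake + d_b)))
print(d_loss > 0 and g_loss > 0)  # True
```

A real GAN alternates gradient steps on these two losses with deep networks in place of the linear maps; the models below do exactly that at scale.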

Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
DCGANs | ImageNet | | Radford et al., 2016 | @...
Text to Image Synthesis | MS COCO | | Reed et al., 2016 | 
Deep Jazz | | | Deepjazz.io | 

Other Models

MXNet supports a variety of model types beyond the canonical CNN and LSTM architectures, including deep reinforcement learning models, linear models, and more. Some available datasets and sources include:

  • Google News: word2vec vectors with a vocabulary of 3 million words, trained on the Google News corpus.
  • MovieLens 20M Dataset: 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
  • Atari Video Game Emulator: Stella is a multi-platform Atari 2600 VCS emulator released under the GNU General Public License (GPL).
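The Deep Q-Network and A3C entries below scale up classic reinforcement learning with deep networks; the tabular Q-learning update they generalize can be sketched on a toy chain environment (all names and hyperparameters here are illustrative):

```python
import numpy as np

# Toy chain environment: states 0..4, actions 0 (left) / 1 (right),
# reward 1 only on reaching the rightmost state, which ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(int(np.argmax(Q[0])))  # greedy action at the start state
```

A DQN replaces the table Q with a neural network over raw Atari frames, but the update target is the same.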
Model Definition | Dataset | Model Weights | Research Basis | Contributors
--- | --- | --- | --- | ---
Word2Vec | Google News | | Mikolov et al., 2013 | @...
Matrix Factorization | MovieLens 20M | | Huang et al., 2013 | 
Deep Q-Network | Atari video games | | Mnih et al., 2015 | 
Asynchronous Advantage Actor-Critic (A3C) | Atari video games | | Mnih et al., 2016 | 
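The matrix-factorization entry above can be illustrated with a tiny numpy SGD sketch that learns user and item factors from a synthetic low-rank ratings matrix (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic ratings matrix with rank-2 ground truth, standing in
# for a (much larger) MovieLens-style user x item ratings table.
n_users, n_items, rank = 20, 15, 2
U_true = rng.normal(size=(n_users, rank))
V_true = rng.normal(size=(n_items, rank))
R = U_true @ V_true.T

# Learn user factors U and item factors V by SGD on (user, item, rating)
# triples, minimizing squared error with light L2 regularization.
U = 0.1 * rng.normal(size=(n_users, rank))
V = 0.1 * rng.normal(size=(n_items, rank))
lr, reg = 0.05, 1e-4

for epoch in range(200):
    for u in range(n_users):
        for i in range(n_items):
            err = R[u, i] - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(rmse)  # root-mean-square reconstruction error
```

The predicted rating for any (user, item) pair is then just the dot product `U[u] @ V[i]`; at MovieLens scale the same idea runs over observed ratings only.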