blob: 408925d1a16f7fcf56552d16498fd941aa309afd [file] [log] [blame]
{
"cells": [
{
"cell_type": "markdown",
"id": "cc9ac7fb",
"metadata": {},
"source": [
"<!--- Licensed to the Apache Software Foundation (ASF) under one -->\n",
"<!--- or more contributor license agreements. See the NOTICE file -->\n",
"<!--- distributed with this work for additional information -->\n",
"<!--- regarding copyright ownership. The ASF licenses this file -->\n",
"<!--- to you under the Apache License, Version 2.0 (the -->\n",
"<!--- \"License\"); you may not use this file except in compliance -->\n",
"<!--- with the License. You may obtain a copy of the License at -->\n",
"\n",
"<!--- http://www.apache.org/licenses/LICENSE-2.0 -->\n",
"\n",
"<!--- Unless required by applicable law or agreed to in writing, -->\n",
"<!--- software distributed under the License is distributed on an -->\n",
"<!--- \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->\n",
"<!--- KIND, either express or implied. See the License for the -->\n",
"<!--- specific language governing permissions and limitations -->\n",
"<!--- under the License. -->\n",
"\n",
"\n",
"# Step 5: `Dataset`s and `DataLoader`\n",
"\n",
"One of the most critical steps for model training and inference is loading the data: without data you can't do Machine Learning! In this tutorial you will use the Gluon API to define a Dataset and use a DataLoader to iterate through the dataset in mini-batches."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6cbaaab9",
"metadata": {},
"outputs": [],
"source": [
"import mxnet as mx\n",
"import os\n",
"import time\n",
"import tarfile"
]
},
{
"cell_type": "markdown",
"id": "bc6ac987",
"metadata": {},
"source": [
"## Introduction to `Dataset`s\n",
"\n",
"Dataset objects are used to represent collections of data, and include methods to load and parse the data (that is often stored on disk). Gluon has a number of different `Dataset` classes for working with image data straight out-of-the-box, but you'll use the ArrayDataset to introduce the idea of a `Dataset`.\n",
"\n",
"You will first start by generating random data `X` (with 3 variables) and corresponding random labels `y` to simulate a typical supervised learning task. You will generate 10 samples and pass them all to the `ArrayDataset`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f8a738b",
"metadata": {},
"outputs": [],
"source": [
"mx.random.seed(42) # Fix the seed for reproducibility\n",
"X = mx.random.uniform(shape=(10, 3))\n",
"y = mx.random.uniform(shape=(10, 1))\n",
"dataset = mx.gluon.data.dataset.ArrayDataset(X, y)"
]
},
{
"cell_type": "markdown",
"id": "565356d5",
"metadata": {},
"source": [
"A key feature of a `Dataset` is the __*ability to retrieve a single sample given an index*__. Our random data and labels were generated in memory, so this `ArrayDataset` doesn't have to load anything from disk, but the interface is the same for all `Dataset`'s."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa360bcf",
"metadata": {},
"outputs": [],
"source": [
"sample_idx = 4\n",
"sample = dataset[sample_idx]\n",
"\n",
"assert len(sample) == 2\n",
"assert sample[0].shape == (3, )\n",
"assert sample[1].shape == (1, )\n",
"print(sample)"
]
},
{
"cell_type": "markdown",
"id": "bd97ae5c",
"metadata": {},
"source": [
"You get a tuple of a data sample and its corresponding label, which makes sense because you passed the data `X` and the labels `y` in that order when you instantiated the `ArrayDataset`. You don't usually retrieve individual samples from `Dataset` objects though (unless you're quality checking the output samples). Instead you use a `DataLoader`.\n",
"\n",
"## Introduction to `DataLoader`\n",
"\n",
"A DataLoader is used to create mini-batches of samples from a Dataset, and provides a convenient iterator interface for looping these batches. It's typically much more efficient to pass a mini-batch of data through a neural network than a single sample at a time, because the computation can be performed in parallel. A required parameter of `DataLoader` is the size of the mini-batches you want to create, called `batch_size`.\n",
"\n",
"Another benefit of using `DataLoader` is the ability to easily load data in parallel using multiprocessing. You can set the `num_workers` parameter to the number of CPUs available on your machine for maximum performance, or limit it to a lower number to spare resources."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc1701fc",
"metadata": {},
"outputs": [],
"source": [
"from multiprocessing import cpu_count\n",
"CPU_COUNT = cpu_count()\n",
"\n",
"data_loader = mx.gluon.data.DataLoader(dataset, batch_size=5, num_workers=CPU_COUNT)\n",
"\n",
"for X_batch, y_batch in data_loader:\n",
" print(\"X_batch has shape {}, and y_batch has shape {}\".format(X_batch.shape, y_batch.shape))"
]
},
{
"cell_type": "markdown",
"id": "3cd683d7",
"metadata": {},
"source": [
"You can see 2 mini-batches of data (and labels), each with 5 samples, which makes sense given that you started with a dataset of 10 samples. When comparing the shape of the batches to the samples returned by the `Dataset`,you've gained an extra dimension at the start which is sometimes called the batch axis.\n",
"\n",
"Our `data_loader` loop will stop when every sample of `dataset` has been returned as part of a batch. Sometimes the dataset length isn't divisible by the mini-batch size, leaving a final batch with a smaller number of samples. `DataLoader`'s default behavior is to return this smaller mini-batch, but this can be changed by setting the `last_batch` parameter to `discard` (which ignores the last batch) or `rollover` (which starts the next epoch with the remaining samples).\n",
"\n",
"## Machine learning with `Dataset`s and `DataLoader`s\n",
"\n",
"You will often use a few different `Dataset` objects in your Machine Learning project. It's essential to separate your training dataset from testing dataset, and it's also good practice to have validation dataset (a.k.a. development dataset) that can be used for optimising hyperparameters.\n",
"\n",
"Using Gluon `Dataset` objects, you define the data to be included in each of these separate datasets. It's simple to create your own custom `Dataset` classes for other types of data. You can even use included `Dataset` objects for common datasets if you want to experiment quickly; they download and parse the data for you! In this example you use the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset from Zalando Research.\n",
"\n",
"Many of the image `Dataset`'s accept a function (via the optional `transform` parameter) which is applied to each sample returned by the `Dataset`. It's useful for performing data augmentation, but can also be used for more simple data type conversion and pixel value scaling as seen below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "113bdad2",
"metadata": {},
"outputs": [],
"source": [
"def transform(data, label):\n",
" data = data.astype('float32')/255\n",
" return data, label\n",
"\n",
"train_dataset = mx.gluon.data.vision.datasets.FashionMNIST(train=True, transform=transform)\n",
"valid_dataset = mx.gluon.data.vision.datasets.FashionMNIST(train=False, transform=transform)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ff3ba39",
"metadata": {},
"outputs": [],
"source": [
"from matplotlib.pylab import imshow\n",
"\n",
"sample_idx = 234\n",
"sample = train_dataset[sample_idx]\n",
"data = sample[0]\n",
"label = sample[1]\n",
"label_desc = {0:'T-shirt/top', 1:'Trouser', 2:'Pullover', 3:'Dress', 4:'Coat', 5:'Sandal', 6:'Shirt', 7:'Sneaker', 8:'Bag', 9:'Ankle boot'}\n",
"\n",
"print(\"Data type: {}\".format(data.dtype))\n",
"print(\"Label: {}\".format(label))\n",
"print(\"Label description: {}\".format(label_desc[label]))\n",
"imshow(data[:,:,0].asnumpy(), cmap='gray')"
]
},
{
"cell_type": "markdown",
"id": "b1bcda9d",
"metadata": {},
"source": [
"![datasets fashion mnist bag](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/gluon/datasets/fashion_mnist_bag.png)\n",
"\n",
"When training machine learning models it is important to shuffle the training samples every time you pass through the dataset (i.e. each epoch). Sometimes the order of your samples will have a spurious relationship with the target variable, and shuffling the samples helps remove this. With DataLoader it's as simple as adding `shuffle=True`. You don't need to shuffle the validation and testing data though."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09a42cba",
"metadata": {},
"outputs": [],
"source": [
"batch_size = 32\n",
"train_data_loader = mx.gluon.data.DataLoader(train_dataset, batch_size, shuffle=True, num_workers=CPU_COUNT)\n",
"valid_data_loader = mx.gluon.data.DataLoader(valid_dataset, batch_size, num_workers=CPU_COUNT)"
]
},
{
"cell_type": "markdown",
"id": "2830cd7b",
"metadata": {},
"source": [
"With both `DataLoader`s defined, you can now train a model to classify each image and evaluate the validation loss at each epoch. See the next tutorial for how this is done.\n",
"\n",
"# Using own data with included `Dataset`s\n",
"\n",
"Gluon has a number of different Dataset classes for working with your own image data straight out-of-the-box. You can get started quickly using the mxnet.gluon.data.vision.datasets.ImageFolderDataset which loads images directly from a user-defined folder, and infers the label (i.e. class) from the folders.\n",
"\n",
"Here you will run through an example for image classification, but a similar process applies for other vision tasks. If you already have your own collection of images to work with you should partition your data into training and test sets, and place all objects of the same class into seperate folders. Similar to:"
]
},
{
"cell_type": "markdown",
"id": "59e320e3",
"metadata": {},
"source": [
"```\n",
"./images/train/car/abc.jpg\n",
"./images/train/car/efg.jpg\n",
"./images/train/bus/hij.jpg\n",
"./images/train/bus/klm.jpg\n",
"./images/test/car/xyz.jpg\n",
"./images/test/bus/uvw.jpg\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "b38736b5",
"metadata": {},
"source": [
"You can download the Caltech 101 dataset if you don't already have images to work with for this example, but please note the download is 126MB."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e4093d4",
"metadata": {},
"outputs": [],
"source": [
"data_folder = \"data\"\n",
"dataset_name = \"101_ObjectCategories\"\n",
"archive_file = \"{}.tar.gz\".format(dataset_name)\n",
"archive_path = os.path.join(data_folder, archive_file)\n",
"data_url = \"https://s3.us-east-2.amazonaws.com/mxnet-public/\"\n",
"\n",
"if not os.path.isfile(archive_path):\n",
" mx.test_utils.download(\"{}{}\".format(data_url, archive_file), dirname = data_folder)\n",
" print('Extracting {} in {}...'.format(archive_file, data_folder))\n",
" tar = tarfile.open(archive_path, \"r:gz\")\n",
" tar.extractall(data_folder)\n",
" tar.close()\n",
" print('Data extracted.')"
]
},
{
"cell_type": "markdown",
"id": "c65be796",
"metadata": {},
"source": [
"After downloading and extracting the data archive, you have two folders: `data/101_ObjectCategories` and `data/101_ObjectCategories_test`. You can then load the data into separate training and testing ImageFolderDatasets.\n",
"\n",
"training_path = os.path.join(data_folder, dataset_name)\n",
"testing_path = os.path.join(data_folder, \"{}_test\".format(dataset_name))\n",
"\n",
"You instantiate the ImageFolderDatasets by providing the path to the data, and the folder structure will be traversed to determine which image classes are available and which images correspond to each class. You must take care to ensure the same classes are both the training and testing datasets, otherwise the label encodings can get muddled.\n",
"\n",
"Optionally, you can pass a `transform` parameter to these `Dataset`'s as you've seen before."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "39de5fb4",
"metadata": {},
"outputs": [],
"source": [
"training_path='/home/ec2-user/SageMaker/data/101_ObjectCategories'\n",
"testing_path='/home/ec2-user/SageMaker/data/101_ObjectCategories_test'\n",
"train_dataset = mx.gluon.data.vision.datasets.ImageFolderDataset(training_path)\n",
"test_dataset = mx.gluon.data.vision.datasets.ImageFolderDataset(testing_path)"
]
},
{
"cell_type": "markdown",
"id": "7dc42b2d",
"metadata": {},
"source": [
"Samples from these datasets are tuples of data and label. Images are loaded from disk, decoded and optionally transformed when the `__getitem__(i)` method is called (equivalent to `train_dataset[i]`).\n",
"\n",
"As with the Fashion MNIST dataset the labels will be integer encoded. You can use the `synsets` property of the ImageFolderDatasets to retrieve the original descriptions (e.g. `train_dataset.synsets[i]`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "713df323",
"metadata": {},
"outputs": [],
"source": [
"sample_idx = 539\n",
"sample = train_dataset[sample_idx]\n",
"data = sample[0]\n",
"label = sample[1]\n",
"\n",
"print(\"Data type: {}\".format(data.dtype))\n",
"print(\"Label: {}\".format(label))\n",
"print(\"Label description: {}\".format(train_dataset.synsets[label]))\n",
"assert label == 1\n",
"\n",
"imshow(data.asnumpy(), cmap='gray')"
]
},
{
"cell_type": "markdown",
"id": "b213966f",
"metadata": {},
"source": [
"![datasets caltech101 face](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/gluon/datasets/caltech101_face.png)<!--notebook-skip-line-->\n",
"\n",
"# Using your own data with custom `Dataset`s\n",
"\n",
"Sometimes you have data that doesn't quite fit the format expected by the included Datasets. You might be able to preprocess your data to fit the expected format, but it is easy to create your own dataset to do this.\n",
"\n",
"All you need to do is create a class that implements a `__getitem__` method, that returns a sample (i.e. a tuple of mx.nd.NDArrays).\n",
"\n",
"# New in MXNet 2.0: faster C++ backend dataloaders\n",
"\n",
"As part of an effort to speed up the current data loading pipeline using gluon dataset and dataloader, a new dataloader was created that uses only a C++ backend and avoids potentially slow calls to Python functions.\n",
"\n",
"See [original issue](https://github.com/apache/incubator-mxnet/issues/17269), [pull request](https://github.com/apache/incubator-mxnet/pull/17464) and [implementation](https://github.com/apache/incubator-mxnet/pull/17841).\n",
"\n",
"The current data loading pipeline is the major bottleneck for many training tasks. The flow can be summarized as:\n",
"\n",
"- `Dataset.__getitem__`\n",
"- `Transform.__call__()/forward()`\n",
"- `Batchify`\n",
"- (optional communicate through shared_mem)\n",
"- `split_and_load(ctxs)`\n",
"- training on GPUs\n",
"\n",
"Performance concerns include slow python dataset/transform functions, multithreading issues due to global interpreter lock, Python multiprocessing issues due to speed, and batchify issues due to poor memory management.\n",
"\n",
"This new dataloader provides: \n",
"- common C++ batchify functions that are split and context aware\n",
"- a C++ MultithreadingDataLoader which inherit the same arguments as gluon.data.DataLoader but use MXNet internal multithreading rather than python multiprocessing.\n",
"- fallback to python multiprocessing whenever the dataset is not fully supported by backend (e.g., there are custom python datasets) in the case that:\n",
" - the transform is not fully hybridizable\n",
" - batchify is not fully supported by backend\n",
"\n",
"Users can continue to with the traditional gluon.data.Dataloader and the C++ backend will be applied automatically. The 'try_nopython' default is 'Auto', which detects whether the C++ backend is available given the dataset and transforms. \n",
"\n",
"Here you will show a performance increase on a t3.2xl instance for the CIFAR10 dataset with the C++ backend.\n",
"\n",
"## Using the C++ backend:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0044032",
"metadata": {},
"outputs": [],
"source": [
"cpp_dl = mx.gluon.data.DataLoader(\n",
" mx.gluon.data.vision.CIFAR10(train=True, transform=None), batch_size=32, num_workers=2,try_nopython=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "259b7e49",
"metadata": {},
"outputs": [],
"source": [
"start = time.time()\n",
"for _ in range(3):\n",
" print(len(cpp_dl))\n",
" for _ in cpp_dl:\n",
" pass\n",
"print('Elapsed time for backend dataloader:', time.time() - start)"
]
},
{
"cell_type": "markdown",
"id": "bf153de6",
"metadata": {},
"source": [
"## Using the Python backend:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3faef118",
"metadata": {},
"outputs": [],
"source": [
"dl = mx.gluon.data.DataLoader(\n",
" mx.gluon.data.vision.CIFAR10(train=True, transform=None), batch_size=32, num_workers=2,try_nopython=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b61c62e",
"metadata": {},
"outputs": [],
"source": [
"start = time.time()\n",
"for _ in range(3):\n",
" print(len(dl))\n",
" for _ in dl:\n",
" pass\n",
"print('Elapsed time for python dataloader:', time.time() - start)"
]
},
{
"cell_type": "markdown",
"id": "24e044bd",
"metadata": {},
"source": [
"## Next Steps\n",
"\n",
"Now that you have some experience with MXNet's datasets and dataloaders, it's time to use them for [Step 6: Training a Neural Network](./6-train-nn.ipynb)."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}