{
"cells": [
{
"cell_type": "markdown",
"id": "e1047030",
"metadata": {},
"source": [
"<!--- Licensed to the Apache Software Foundation (ASF) under one -->\n",
"<!--- or more contributor license agreements. See the NOTICE file -->\n",
"<!--- distributed with this work for additional information -->\n",
"<!--- regarding copyright ownership. The ASF licenses this file -->\n",
"<!--- to you under the Apache License, Version 2.0 (the -->\n",
"<!--- \"License\"); you may not use this file except in compliance -->\n",
"<!--- with the License. You may obtain a copy of the License at -->\n",
"\n",
"<!--- http://www.apache.org/licenses/LICENSE-2.0 -->\n",
"\n",
"<!--- Unless required by applicable law or agreed to in writing, -->\n",
"<!--- software distributed under the License is distributed on an -->\n",
"<!--- \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->\n",
"<!--- KIND, either express or implied. See the License for the -->\n",
"<!--- specific language governing permissions and limitations -->\n",
"<!--- under the License. -->\n",
"\n",
"\n",
"# CSRNDArray - NDArray in Compressed Sparse Row Storage Format\n",
"\n",
"Many real-world datasets deal with high-dimensional sparse feature vectors. Take, for instance, a recommendation system where the number of categories and users is on the order of millions. The purchase data for each category by user would show that most users make only a few purchases, leading to a dataset with high sparsity (i.e. most of the elements are zeros).\n",
"\n",
"Storing and manipulating such large sparse matrices in the default dense structure results in wasted memory and processing on the zeros. To take advantage of the sparse structure of the matrix, the `CSRNDArray` in MXNet stores the matrix in [compressed sparse row (CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29) format and uses specialized algorithms in operators.\n",
"**The format is designed for 2D matrices with a large number of columns,\n",
"where each row is sparse (i.e. has only a few non-zeros).**\n",
"\n",
"## Advantages of Compressed Sparse Row NDArray (CSRNDArray)\n",
"For matrices of high sparsity (e.g. ~1% non-zeros = ~1% density), there are two primary advantages of `CSRNDArray` over the existing `NDArray`:\n",
"\n",
"- memory consumption is reduced significantly\n",
"- certain operations are much faster (e.g. matrix-vector multiplication)\n",
"\n",
"You may be familiar with the CSR storage format in [SciPy](https://www.scipy.org/) and will note the similarities in MXNet's implementation. However, `CSRNDArray` inherits some additional features from `NDArray`, such as non-blocking asynchronous evaluation and automatic parallelization, that are not available in SciPy's flavor of CSR. You can find further explanations of the evaluation and parallelization strategy in MXNet in the [NDArray tutorial](https://mxnet.apache.org/tutorials/basic/ndarray.html#lazy-evaluation-and-automatic-parallelization).\n",
"\n",
"The introduction of `CSRNDArray` also brings a new attribute, `stype` as a holder for storage type info, to `NDArray`. You can query **ndarray.stype** now in addition to the oft-queried attributes such as **ndarray.shape**, **ndarray.dtype**, and **ndarray.context**. For a typical dense NDArray, the value of `stype` is **\"default\"**. For a `CSRNDArray`, the value of stype is **\"csr\"**.\n",
"\n",
"## Prerequisites\n",
"\n",
"To complete this tutorial, you will need:\n",
"\n",
"- MXNet. See the instructions for your operating system in [Setup and Installation](https://mxnet.io/get_started)\n",
"- [Jupyter](http://jupyter.org/)\n",
" ```\n",
" pip install jupyter\n",
" ```\n",
"- Basic knowledge of NDArray in MXNet. See the detailed tutorial for NDArray in [NDArray - Imperative tensor operations on CPU/GPU](https://mxnet.apache.org/tutorials/basic/ndarray.html).\n",
"- SciPy - A section of this tutorial uses the SciPy package in Python. If you don't have SciPy, the example in that section will be skipped.\n",
"- GPUs - A section of this tutorial uses GPUs. If you don't have GPUs on your machine, simply set the variable `gpu_device` (set in the GPUs section of this tutorial) to `mx.cpu()`.\n",
"\n",
"## Compressed Sparse Row Matrix\n",
"\n",
"A CSRNDArray represents a 2D matrix as three separate 1D arrays: **data**, **indptr** and **indices**, where the column indices for row `i` are stored in `indices[indptr[i]:indptr[i+1]]` in ascending order, and their corresponding values are stored in `data[indptr[i]:indptr[i+1]]`.\n",
"\n",
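"- **data**: the non-zero values of the matrix, stored in row-major order\n",
"- **indices**: the column index of each value in `data`\n",
"- **indptr**: the offset into `data` and `indices` at which each row starts, followed by one final entry equal to the total number of stored values\n",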
"\n",
"### Example Matrix Compression\n",
"\n",
"For example, given the matrix:"
]
},
{
"cell_type": "markdown",
"id": "ab953c22",
"metadata": {},
"source": [
"```\n",
"[[7, 0, 8, 0]\n",
" [0, 0, 0, 0]\n",
" [0, 9, 0, 0]]\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "3f0cf752",
"metadata": {},
"source": [
"We can compress this matrix using CSR, and to do so we need to calculate `data`, `indices`, and `indptr`.\n",
"\n",
"The `data` array holds all the non-zero entries of the matrix in row-major order. Put another way, you create a data array that has all of the zeros removed from the matrix, row by row, storing the numbers in that order. Your result:"
]
},
{
"cell_type": "markdown",
"id": "21213ae8",
"metadata": {},
"source": [
"```\n",
"data = [7, 8, 9]\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "024277e8",
"metadata": {},
"source": [
"The `indices` array stores the column index for each non-zero element in `data`. As you cycle through the data array, starting with 7, you can see it is in column 0. Then looking at 8, you can see it is in column 2. Lastly 9 is in column 1. Your result:"
]
},
{
"cell_type": "markdown",
"id": "492761b8",
"metadata": {},
"source": [
"```\n",
"indices = [0, 2, 1]\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "4eeed937",
"metadata": {},
"source": [
"The `indptr` array is what will help identify the rows where the data appears. It stores the offset into `data` of the first non-zero element of each row of the matrix. This array always starts with 0 (reasons can be explored later), so `indptr[0]` is 0. Each subsequent value in the array is the cumulative number of non-zero elements up to and including that row. Looking at the first row of the matrix you can see two non-zero values, so `indptr[1]` is 2. The next row contains all zeros, so the count is still 2 and `indptr[2]` is 2. Finally, the last row contains one non-zero element, bringing the count to 3, so `indptr[3]` is 3. To reconstruct the dense matrix, you will use `data[0:2]` and `indices[0:2]` for the first row, `data[2:2]` and `indices[2:2]` for the second row (which contains all zeros), and `data[2:3]` and `indices[2:3]` for the third row. Your result:"
]
},
{
"cell_type": "markdown",
"id": "e51eff0e",
"metadata": {},
"source": [
"```text\n",
"indptr = [0, 2, 2, 3]\n",
"```\n"
]
},
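{
"cell_type": "markdown",
"id": "csr-sketch-01",
"metadata": {},
"source": [
"The walkthrough above can be sketched in plain Python. The helper names `dense_to_csr` and `csr_to_dense` are made up for this illustration and are not part of MXNet; the sketch only mirrors the bookkeeping described above and assumes the matrix is given as a list of lists.\n",
"\n",
"```python\n",
"def dense_to_csr(matrix):\n",
"    # Collect non-zero values row by row, recording column indices and row offsets.\n",
"    data, indices, indptr = [], [], [0]\n",
"    for row in matrix:\n",
"        for col, value in enumerate(row):\n",
"            if value != 0:\n",
"                data.append(value)\n",
"                indices.append(col)\n",
"        # The running count of non-zeros marks where the next row starts.\n",
"        indptr.append(len(data))\n",
"    return data, indices, indptr\n",
"\n",
"\n",
"def csr_to_dense(data, indices, indptr, num_cols):\n",
"    # Rebuild row i from the slice indptr[i]:indptr[i + 1].\n",
"    dense = []\n",
"    for i in range(len(indptr) - 1):\n",
"        row = [0] * num_cols\n",
"        for k in range(indptr[i], indptr[i + 1]):\n",
"            row[indices[k]] = data[k]\n",
"        dense.append(row)\n",
"    return dense\n",
"\n",
"\n",
"matrix = [[7, 0, 8, 0],\n",
"          [0, 0, 0, 0],\n",
"          [0, 9, 0, 0]]\n",
"data, indices, indptr = dense_to_csr(matrix)\n",
"print(data, indices, indptr)\n",
"restored = csr_to_dense(data, indices, indptr, num_cols=4)\n",
"print(restored == matrix)\n",
"```\n",
"\n",
"Because column indices are visited left to right within each row, the resulting `indices` are in ascending order per row, which matches the ordering MXNet expects."
]
},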
{
"cell_type": "markdown",
"id": "0cea71bf",
"metadata": {},
"source": [
"Note that in MXNet, the column indices for a given row are always sorted in ascending order,\n",
"and duplicated column indices for the same row are not allowed.\n",
"\n",
"## Array Creation\n",
"\n",
"There are a few different ways to create a `CSRNDArray`, but first let's recreate the matrix we just discussed using the `data`, `indices`, and `indptr` we calculated in the previous example.\n",
"\n",
"You can create a CSRNDArray with data, indices and indptr by using the `csr_matrix` function:"
]
},
{
"cell_type": "markdown",
"id": "99236ed6",
"metadata": {},
"source": [
"```python\n",
"import mxnet as mx\n",
"# Create a CSRNDArray with python lists\n",
"shape = (3, 4)\n",
"data_list = [7, 8, 9]\n",
"indices_list = [0, 2, 1]\n",
"indptr_list = [0, 2, 2, 3]\n",
"a = mx.nd.sparse.csr_matrix((data_list, indices_list, indptr_list), shape=shape)\n",
"# Inspect the matrix\n",
"a.asnumpy()\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "385f0b2a",
"metadata": {},
"source": [
"```\n",
"array([[ 7., 0., 8., 0.],\n",
" [ 0., 0., 0., 0.],\n",
" [ 0., 9., 0., 0.]], dtype=float32)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "1f9487c6",
"metadata": {},
"source": [
"```python\n",
"import numpy as np\n",
"# Create a CSRNDArray with numpy arrays\n",
"data_np = np.array([7, 8, 9])\n",
"indptr_np = np.array([0, 2, 2, 3])\n",
"indices_np = np.array([0, 2, 1])\n",
"b = mx.nd.sparse.csr_matrix((data_np, indices_np, indptr_np), shape=shape)\n",
"b.asnumpy()\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "be1a77c2",
"metadata": {},
"source": [
"```\n",
"array([[7, 0, 8, 0],\n",
" [0, 0, 0, 0],\n",
" [0, 9, 0, 0]])\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "4b39ddbc",
"metadata": {},
"source": [
"```python\n",
"# Compare the two. They hold the same values; only the dtype differs.\n",
"{'a':a.asnumpy(), 'b':b.asnumpy()}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "d7f8b62f",
"metadata": {},
"source": [
"```\n",
"{'a': array([[ 7., 0., 8., 0.],\n",
" [ 0., 0., 0., 0.],\n",
" [ 0., 9., 0., 0.]], dtype=float32), 'b': array([[7, 0, 8, 0],\n",
" [0, 0, 0, 0],\n",
" [0, 9, 0, 0]])}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "8a3c1d84",
"metadata": {},
"source": [
"You can create an MXNet CSRNDArray from a `scipy.sparse.csr_matrix` object by using the `array` function:"
]
},
{
"cell_type": "markdown",
"id": "d35c99c6",
"metadata": {},
"source": [
"```python\n",
"try:\n",
" import scipy.sparse as spsp\n",
" # generate a csr matrix in scipy\n",
" c = spsp.csr_matrix((data_np, indices_np, indptr_np), shape=shape)\n",
" # create a CSRNDArray from a scipy csr object\n",
" d = mx.nd.sparse.array(c)\n",
" print('d:{}'.format(d.asnumpy()))\n",
"except ImportError:\n",
" print(\"scipy package is required\")\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "a8a5fb4f",
"metadata": {},
"source": [
"```\n",
"d:[[7 0 8 0]\n",
" [0 0 0 0]\n",
" [0 9 0 0]]\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "9b966be9",
"metadata": {},
"source": [
"What if you have a big set of data and you haven't calculated indices or indptr yet? Let's create a simple CSRNDArray from an existing array of data and derive those values with some built-in functions. We can mock up a \"big\" dataset in which a random subset of the entries is non-zero, then compress it by using the `tostype` function, which is explained further in the [Storage Type Conversion](#storage-type-conversion) section:"
]
},
{
"cell_type": "markdown",
"id": "2a7b0613",
"metadata": {},
"source": [
"```python\n",
"big_array = mx.nd.round(mx.nd.random.uniform(low=0, high=1, shape=(1000, 100)))\n",
"print(big_array)\n",
"big_array_csr = big_array.tostype('csr')\n",
"# Access indices array\n",
"indices = big_array_csr.indices\n",
"# Access indptr array\n",
"indptr = big_array_csr.indptr\n",
"# Access data array\n",
"data = big_array_csr.data\n",
"# Note: about half of the entries in big_array are non-zero, so CSR does not save memory here; the savings appear at high sparsity (e.g. ~1% density)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "30955536",
"metadata": {},
"source": [
"```\n",
"[[ 1. 1. 0. ..., 0. 1. 1.]\n",
" [ 0. 0. 0. ..., 0. 0. 1.]\n",
" [ 1. 0. 0. ..., 1. 0. 0.]\n",
" ..., \n",
" [ 0. 1. 1. ..., 0. 0. 0.]\n",
" [ 1. 1. 0. ..., 1. 0. 1.]\n",
" [ 1. 0. 1. ..., 1. 0. 0.]]\n",
"<NDArray 1000x100 @cpu(0)>\n",
"```\n"
]
},
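{
"cell_type": "markdown",
"id": "csr-sketch-02",
"metadata": {},
"source": [
"To get a feel for when CSR storage pays off, here is a back-of-the-envelope estimate in plain Python. It assumes 4-byte values and 8-byte entries for `indices` and `indptr`; these element sizes are assumptions made for illustration, and the helper names are hypothetical rather than MXNet functions.\n",
"\n",
"```python\n",
"def dense_bytes(rows, cols, value_size=4):\n",
"    # A dense array stores every element, zero or not.\n",
"    return rows * cols * value_size\n",
"\n",
"\n",
"def csr_bytes(rows, nnz, value_size=4, index_size=8):\n",
"    # data and indices hold one entry per non-zero; indptr holds rows + 1 offsets.\n",
"    return nnz * value_size + nnz * index_size + (rows + 1) * index_size\n",
"\n",
"\n",
"rows, cols = 1000, 100\n",
"for density in (0.01, 0.5):\n",
"    nnz = round(rows * cols * density)\n",
"    print(density, dense_bytes(rows, cols), csr_bytes(rows, nnz))\n",
"```\n",
"\n",
"Under these assumed sizes, the CSR form is far smaller than the dense form at 1% density but larger at 50% density, which is why the format is aimed at highly sparse data."
]
},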
{
"cell_type": "markdown",
"id": "014bd1d1",
"metadata": {},
"source": [
"You can also create a CSRNDArray from another one by using the `array` function, specifying the element data type with the option `dtype`,\n",
"which accepts a numpy type. By default, `float32` is used."
]
},
{
"cell_type": "markdown",
"id": "a97c3139",
"metadata": {},
"source": [
"```python\n",
"# Float32 is used by default\n",
"e = mx.nd.sparse.array(a)\n",
"# Create a 16-bit float array\n",
"f = mx.nd.array(a, dtype=np.float16)\n",
"(e.dtype, f.dtype)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "76f26813",
"metadata": {},
"source": [
"```\n",
"(numpy.float32, numpy.float16)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "dbd1bdba",
"metadata": {},
"source": [
"## Inspecting Arrays\n",
"\n",
"A variety of methods are available for you to use for inspecting CSR arrays:\n",
"* **.asnumpy()**\n",
"* **.data**\n",
"* **.indices**\n",
"* **.indptr**\n",
"\n",
"As you have seen already, we can inspect the contents of a `CSRNDArray` by filling\n",
"its contents into a dense `numpy.ndarray` using the `asnumpy` function."
]
},
{
"cell_type": "markdown",
"id": "973b7023",
"metadata": {},
"source": [
"```python\n",
"a.asnumpy()\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "96293488",
"metadata": {},
"source": [
"```\n",
"array([[ 7., 0., 8., 0.],\n",
" [ 0., 0., 0., 0.],\n",
" [ 0., 9., 0., 0.]], dtype=float32)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "3d01d7e6",
"metadata": {},
"source": [
"You can also inspect the internal storage of a CSRNDArray by accessing attributes such as `indptr`, `indices` and `data`:"
]
},
{
"cell_type": "markdown",
"id": "2344c7a4",
"metadata": {},
"source": [
"```python\n",
"# Access data array\n",
"data = a.data\n",
"# Access indices array\n",
"indices = a.indices\n",
"# Access indptr array\n",
"indptr = a.indptr\n",
"{'a.stype': a.stype, 'data':data, 'indices':indices, 'indptr':indptr}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "eff6df7d",
"metadata": {},
"source": [
"```\n",
"{'a.stype': 'csr', 'data': \n",
" [ 7. 8. 9.]\n",
" <NDArray 3 @cpu(0)>, 'indices': \n",
" [0 2 1]\n",
" <NDArray 3 @cpu(0)>, 'indptr': \n",
" [0 2 2 3]\n",
" <NDArray 4 @cpu(0)>}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "f74dee9d",
"metadata": {},
"source": [
"## Storage Type Conversion\n",
"\n",
"You can also convert storage types with:\n",
"* **tostype**\n",
"* **cast_storage**\n",
"\n",
"You can convert an NDArray to a CSRNDArray and vice versa by using the `tostype` function:"
]
},
{
"cell_type": "markdown",
"id": "1ee2716a",
"metadata": {},
"source": [
"```python\n",
"# Create a dense NDArray\n",
"ones = mx.nd.ones((2,2))\n",
"# Cast the storage type from `default` to `csr`\n",
"csr = ones.tostype('csr')\n",
"# Cast the storage type from `csr` to `default`\n",
"dense = csr.tostype('default')\n",
"{'csr':csr, 'dense':dense}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "40241030",
"metadata": {},
"source": [
"```\n",
"{'csr': \n",
" <CSRNDArray 2x2 @cpu(0)>, 'dense': \n",
" [[ 1. 1.]\n",
" [ 1. 1.]]\n",
" <NDArray 2x2 @cpu(0)>}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "320f6b7c",
"metadata": {},
"source": [
"You can also convert the storage type by using the `cast_storage` operator:"
]
},
{
"cell_type": "markdown",
"id": "71edf52b",
"metadata": {},
"source": [
"```python\n",
"# Create a dense NDArray\n",
"ones = mx.nd.ones((2,2))\n",
"# Cast the storage type to `csr`\n",
"csr = mx.nd.sparse.cast_storage(ones, 'csr')\n",
"# Cast the storage type to `default`\n",
"dense = mx.nd.sparse.cast_storage(csr, 'default')\n",
"{'csr':csr, 'dense':dense}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "8a7eef95",
"metadata": {},
"source": [
"```\n",
"{'csr': \n",
" <CSRNDArray 2x2 @cpu(0)>, 'dense': \n",
" [[ 1. 1.]\n",
" [ 1. 1.]]\n",
" <NDArray 2x2 @cpu(0)>}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "ceb8080f",
"metadata": {},
"source": [
"## Copies\n",
"\n",
"You can use the `copy` method, which makes a deep copy of the array and its data and returns a new array.\n",
"You can also use the `copyto` method or the slice operator `[]` to deep copy to an existing array."
]
},
{
"cell_type": "markdown",
"id": "1d91c075",
"metadata": {},
"source": [
"```python\n",
"a = mx.nd.ones((2,2)).tostype('csr')\n",
"b = a.copy()\n",
"c = mx.nd.sparse.zeros('csr', (2,2))\n",
"c[:] = a\n",
"d = mx.nd.sparse.zeros('csr', (2,2))\n",
"a.copyto(d)\n",
"{'b is a': b is a, 'b.asnumpy()':b.asnumpy(), 'c.asnumpy()':c.asnumpy(), 'd.asnumpy()':d.asnumpy()}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "05e6a7c8",
"metadata": {},
"source": [
"```\n",
"{'b is a': False, 'b.asnumpy()': array([[ 1., 1.],\n",
" [ 1., 1.]], dtype=float32), 'c.asnumpy()': array([[ 1., 1.],\n",
" [ 1., 1.]], dtype=float32), 'd.asnumpy()': array([[ 1., 1.],\n",
" [ 1., 1.]], dtype=float32)}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "0e8d0ccf",
"metadata": {},
"source": [
"If the storage types of source array and destination array do not match,\n",
"the storage type of destination array will not change when copying with `copyto` or\n",
"the slice operator `[]`."
]
},
{
"cell_type": "markdown",
"id": "2ba64308",
"metadata": {},
"source": [
"```python\n",
"e = mx.nd.sparse.zeros('csr', (2,2))\n",
"f = mx.nd.sparse.zeros('csr', (2,2))\n",
"g = mx.nd.ones(e.shape)\n",
"e[:] = g\n",
"g.copyto(f)\n",
"{'e.stype':e.stype, 'f.stype':f.stype, 'g.stype':g.stype}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "70b57f3d",
"metadata": {},
"source": [
"```\n",
"{'e.stype': 'csr', 'f.stype': 'csr', 'g.stype': 'default'}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "a02ad7db",
"metadata": {},
"source": [
"## Indexing and Slicing\n",
"You can slice a CSRNDArray on axis 0 with operator `[]`, which copies the slices and returns a new CSRNDArray."
]
},
{
"cell_type": "markdown",
"id": "692a6324",
"metadata": {},
"source": [
"```python\n",
"a = mx.nd.array(np.arange(6).reshape(3,2)).tostype('csr')\n",
"b = a[1:2].asnumpy()\n",
"c = a[:].asnumpy()\n",
"{'a':a, 'b':b, 'c':c}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "3ceee3a7",
"metadata": {},
"source": [
"```\n",
"{'a': \n",
" <CSRNDArray 3x2 @cpu(0)>,\n",
" 'b': array([[ 2., 3.]], dtype=float32),\n",
" 'c': array([[ 0., 1.],\n",
" [ 2., 3.],\n",
" [ 4., 5.]], dtype=float32)}\n",
"```\n"
]
},
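{
"cell_type": "markdown",
"id": "csr-sketch-03",
"metadata": {},
"source": [
"Slicing on axis 0 is natural in CSR because the entries of each row are stored contiguously. A minimal sketch, assuming plain Python lists and a hypothetical helper `csr_slice_rows` (not an MXNet function), of how a row slice can be copied out of the three arrays:\n",
"\n",
"```python\n",
"def csr_slice_rows(data, indices, indptr, start, stop):\n",
"    # Copy the stored entries of rows start..stop-1 and rebase the offsets to zero.\n",
"    lo, hi = indptr[start], indptr[stop]\n",
"    new_indptr = [offset - lo for offset in indptr[start:stop + 1]]\n",
"    return data[lo:hi], indices[lo:hi], new_indptr\n",
"\n",
"\n",
"# CSR form of [[0, 1], [2, 3], [4, 5]], the matrix used in the example above.\n",
"data = [1, 2, 3, 4, 5]\n",
"indices = [1, 0, 1, 0, 1]\n",
"indptr = [0, 1, 3, 5]\n",
"sliced = csr_slice_rows(data, indices, indptr, 1, 2)\n",
"print(sliced)\n",
"```\n",
"\n",
"The slice for rows `1:2` holds the values 2 and 3 in columns 0 and 1, which agrees with the dense result `[[2., 3.]]` shown above."
]
},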
{
"cell_type": "markdown",
"id": "12f096aa",
"metadata": {},
"source": [
"Note that multi-dimensional indexing or slicing along a particular axis is currently not supported for a CSRNDArray.\n",
"\n",
"## Sparse Operators and Storage Type Inference\n",
"\n",
"Operators that have specialized implementation for sparse arrays can be accessed in `mx.nd.sparse`. You can read the [mxnet.ndarray.sparse API documentation](https://mxnet.apache.org/versions/master/api/python/ndarray/sparse.html) to find what sparse operators are available."
]
},
{
"cell_type": "markdown",
"id": "18f679ba",
"metadata": {},
"source": [
"```python\n",
"shape = (3, 4)\n",
"data = [7, 8, 9]\n",
"indptr = [0, 2, 2, 3]\n",
"indices = [0, 2, 1]\n",
"a = mx.nd.sparse.csr_matrix((data, indices, indptr), shape=shape) # a csr matrix as lhs\n",
"rhs = mx.nd.ones((4, 1)) # a dense vector as rhs\n",
"out = mx.nd.sparse.dot(a, rhs) # invoke sparse dot operator specialized for dot(csr, dense)\n",
"{'out':out}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "c41b0a68",
"metadata": {},
"source": [
"```\n",
"{'out': \n",
" [[ 15.]\n",
" [ 0.]\n",
" [ 9.]]\n",
" <NDArray 3x1 @cpu(0)>}\n",
"```\n"
]
},
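{
"cell_type": "markdown",
"id": "csr-sketch-04",
"metadata": {},
"source": [
"MXNet's actual kernel is not shown in this tutorial. As a rough illustration of why a product of a CSR matrix with a dense vector can skip the zeros entirely, here is a plain Python sketch; `csr_matvec` is a hypothetical helper, not an MXNet function.\n",
"\n",
"```python\n",
"def csr_matvec(data, indices, indptr, vector):\n",
"    # For each row, multiply only the stored non-zeros by the matching vector entries.\n",
"    result = []\n",
"    for i in range(len(indptr) - 1):\n",
"        total = 0.0\n",
"        for k in range(indptr[i], indptr[i + 1]):\n",
"            total += data[k] * vector[indices[k]]\n",
"        result.append(total)\n",
"    return result\n",
"\n",
"\n",
"out = csr_matvec([7, 8, 9], [0, 2, 1], [0, 2, 2, 3], [1.0, 1.0, 1.0, 1.0])\n",
"print(out)\n",
"```\n",
"\n",
"The inner loop runs once per stored value, so the work grows with the number of non-zeros rather than with the full size of the matrix."
]
},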
{
"cell_type": "markdown",
"id": "cce16e3f",
"metadata": {},
"source": [
"For any sparse operator, the storage type of output array is inferred based on inputs. You can either read the documentation or inspect the `stype` attribute of the output array to know what storage type is inferred:"
]
},
{
"cell_type": "markdown",
"id": "a0b1f3b7",
"metadata": {},
"source": [
"```python\n",
"b = a * 2 # b will be a CSRNDArray since zero multiplied by 2 is still zero\n",
"c = a + mx.nd.ones(shape=(3, 4)) # c will be a dense NDArray\n",
"{'b.stype':b.stype, 'c.stype':c.stype}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "0bf1ca66",
"metadata": {},
"source": [
"```\n",
"{'b.stype': 'csr', 'c.stype': 'default'}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "85fdd2a4",
"metadata": {},
"source": [
"For operators that don't specialize in sparse arrays, we can still use them with sparse inputs with some performance penalty. In MXNet, dense operators require all inputs and outputs to be in the dense format.\n",
"\n",
"If sparse inputs are provided, MXNet will convert sparse inputs into dense ones temporarily, so that the dense operator can be used.\n",
"\n",
"If sparse outputs are provided, MXNet will convert the dense outputs generated by the dense operator into the provided sparse format."
]
},
{
"cell_type": "markdown",
"id": "bb70e1c0",
"metadata": {},
"source": [
"```python\n",
"e = mx.nd.sparse.zeros('csr', a.shape)\n",
"d = mx.nd.log(a) # dense operator with a sparse input\n",
"e = mx.nd.log(a, out=e) # dense operator with a sparse output\n",
"{'a.stype':a.stype, 'd.stype':d.stype, 'e.stype':e.stype} # stypes of a and e will not be changed\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "989e8aa9",
"metadata": {},
"source": [
"```\n",
"{'a.stype': 'csr', 'd.stype': 'default', 'e.stype': 'csr'}\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "56f2f2d8",
"metadata": {},
"source": [
"Note that warning messages will be printed when such a storage fallback event happens. If you are using a Jupyter notebook, the warning message will be printed in your terminal console.\n",
"\n",
"## Data Loading\n",
"\n",
"You can load data in batches from a CSRNDArray using `mx.io.NDArrayIter`:"
]
},
{
"cell_type": "markdown",
"id": "d7377d10",
"metadata": {},
"source": [
"```python\n",
"# Create the source CSRNDArray\n",
"data = mx.nd.array(np.arange(36).reshape((9,4))).tostype('csr')\n",
"labels = np.ones([9, 1])\n",
"batch_size = 3\n",
"dataiter = mx.io.NDArrayIter(data, labels, batch_size, last_batch_handle='discard')\n",
"# Inspect the data batches\n",
"[batch.data[0] for batch in dataiter]\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "9a939875",
"metadata": {},
"source": [
"```\n",
"[\n",
" <CSRNDArray 3x4 @cpu(0)>, \n",
" <CSRNDArray 3x4 @cpu(0)>, \n",
" <CSRNDArray 3x4 @cpu(0)>]\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "5176e5fa",
"metadata": {},
"source": [
"You can also load data stored in the [libsvm file format](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) using `mx.io.LibSVMIter`, where the format is: ``<label> <col_idx1>:<value1> <col_idx2>:<value2> ... <col_idxN>:<valueN>``. Each line in the file records the label and the column indices and data for non-zero entries. For example, for a matrix with 6 columns, ``1 2:1.5 4:-3.5`` means the label is ``1`` and the data is ``[[0, 0, 1.5, 0, -3.5, 0]]``. More detailed examples of `mx.io.LibSVMIter` are available in the [API documentation](https://mxnet.apache.org/versions/master/api/python/io/io.html#mxnet.io.LibSVMIter)."
]
},
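{
"cell_type": "markdown",
"id": "csr-sketch-05",
"metadata": {},
"source": [
"To make the line format concrete, the following plain Python sketch expands one libsvm line into a label and a dense row. The helper `parse_libsvm_line` is hypothetical and only illustrates the format with zero-based column indices; it is not how `mx.io.LibSVMIter` is implemented.\n",
"\n",
"```python\n",
"def parse_libsvm_line(line, num_cols):\n",
"    # The first token is the label; the rest are zero-based col_idx:value pairs.\n",
"    tokens = line.split()\n",
"    label = float(tokens[0])\n",
"    row = [0.0] * num_cols\n",
"    for token in tokens[1:]:\n",
"        col, value = token.split(':')\n",
"        row[int(col)] = float(value)\n",
"    return label, row\n",
"\n",
"\n",
"label, row = parse_libsvm_line('1 2:1.5 4:-3.5', num_cols=6)\n",
"print(label, row)\n",
"```\n",
"\n",
"A line that contains only a label, such as `-2.0`, yields an all-zero row."
]
},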
{
"cell_type": "markdown",
"id": "ac8ef3c5",
"metadata": {},
"source": [
"```python\n",
"# Create a sample libsvm file in current working directory\n",
"import os\n",
"cwd = os.getcwd()\n",
"data_path = os.path.join(cwd, 'data.t')\n",
"with open(data_path, 'w') as fout:\n",
" fout.write('1.0 0:1 2:2\\n')\n",
" fout.write('1.0 0:3 5:4\\n')\n",
" fout.write('1.0 2:5 8:6 9:7\\n')\n",
" fout.write('1.0 3:8\\n')\n",
" fout.write('-1 0:0.5 9:1.5\\n')\n",
" fout.write('-2.0\\n')\n",
" fout.write('-3.0 0:-0.6 1:2.25 2:1.25\\n')\n",
" fout.write('-3.0 1:2 2:-1.25\\n')\n",
" fout.write('4 2:-1.2\\n')\n",
"\n",
"# Load CSRNDArrays from the file\n",
"data_train = mx.io.LibSVMIter(data_libsvm=data_path, data_shape=(10,), label_shape=(1,), batch_size=3)\n",
"for batch in data_train:\n",
" print(data_train.getdata())\n",
" print(data_train.getlabel())\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "472389b2",
"metadata": {},
"source": [
"```\n",
"<CSRNDArray 3x10 @cpu(0)>\n",
"\n",
"[ 1. 1. 1.]\n",
"<NDArray 3 @cpu(0)>\n",
"\n",
"<CSRNDArray 3x10 @cpu(0)>\n",
"\n",
"[ 1. -1. -2.]\n",
"<NDArray 3 @cpu(0)>\n",
"\n",
"<CSRNDArray 3x10 @cpu(0)>\n",
"\n",
"[-3. -3. 4.]\n",
"<NDArray 3 @cpu(0)>\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "ccd79854",
"metadata": {},
"source": [
"Note that in the file the column indices are expected to be sorted in ascending order per row, and be zero-based instead of one-based.\n",
"\n",
"## Advanced Topics\n",
"\n",
"### GPU Support\n",
"\n",
"By default, `CSRNDArray` operators are executed on CPU. To create a `CSRNDArray` on a GPU, we need to explicitly specify the context:\n",
"\n",
"**Note** If a GPU is not available, an error will be reported in the following section. To execute it on a CPU, set `gpu_device` to `mx.cpu()`."
]
},
{
"cell_type": "markdown",
"id": "0fc6afcc",
"metadata": {},
"source": [
"```python\n",
"import sys\n",
"gpu_device=mx.gpu() # Change this to mx.cpu() in absence of GPUs.\n",
"try:\n",
" a = mx.nd.sparse.zeros('csr', (100, 100), ctx=gpu_device)\n",
" print(a)\n",
"except mx.MXNetError as err:\n",
" sys.stderr.write(str(err))\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "0b521c18",
"metadata": {},
"source": [
"## Next \n",
"\n",
"[Train a Linear Regression Model with Sparse Symbols](/api/python/docs/tutorials/packages/ndarray/sparse/train.html)\n",
"\n",
"\n",
"<!-- INSERT SOURCE DOWNLOAD BUTTONS -->"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}