{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preprocessor for image data - distribution rules\n",
"\n",
"This notebook shows how to set distribution rules parameter for moving training and validation datasets to certain segments where training will be done. How you distribute data may depend, for example, on how GPUs are attached to your database cluster.\n",
"\n",
"The distribution rules parameter is part of the mini-batch preprocessor utility for image data:\n",
"* `training_preprocessor_dl()` for training datasets\n",
"* `validation_preprocessor_dl()` for validation datasets\n",
"\n",
"\n",
"## Table of contents\n",
"\n",
"<a href=\"#distr\">1. Setup distribution rules</a>\n",
"\n",
"<a href=\"#pp_train\">2. Run preprocessor for training image data</a>\n",
"<ul>\n",
"<a href=\"#pp_train2a\">2a. Distribute to all segments</a>\n",
" \n",
"<a href=\"#pp_train2b\">2b. Distribute to segments on hosts with GPUs attached</a>\n",
"\n",
"<a href=\"#pp_train2c\">2c. Distribute to segments on a subset of hosts</a>\n",
"\n",
"<a href=\"#pp_train2d\">2d. Distribute to 1 segment only</a>\n",
"\n",
"</ul>\n",
"\n",
"<a href=\"#pp_val\">3. Run preprocessor for validation image data</a>"
]
},
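{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distribution rules parameter accepts three forms: `'all_segments'` to use all segments in the database, `'gpu_segments'` to use all segments on hosts with GPUs attached, or the name of a table listing the specific segments to use. A minimal sketch of the call shape (the table names here are placeholders; the real calls appear in section 2 below):\n",
"\n",
"```sql\n",
"SELECT madlib.training_preprocessor_dl('source_tbl',   -- Source table\n",
"                                       'output_tbl',   -- Output table\n",
"                                       'y',            -- Dependent variable\n",
"                                       'x',            -- Independent variable\n",
"                                       NULL,           -- Buffer size\n",
"                                       255,            -- Normalizing constant\n",
"                                       NULL,           -- Number of classes\n",
"                                       'all_segments'  -- Distribution rules\n",
"                                       );\n",
"```"
]
},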
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The sql extension is already loaded. To reload it, use:\n",
" %reload_ext sql\n"
]
}
],
"source": [
"%load_ext sql"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'Connected: gpadmin@cifar_places'"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Greenplum Database 5.x on GCP - via tunnel\n",
"%sql postgresql://gpadmin@localhost:8000/cifar_places\n",
" \n",
"# PostgreSQL local\n",
"#%sql postgresql://fmcquillan@localhost:5432/madlib"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>version</th>\n",
" </tr>\n",
" <tr>\n",
" <td>MADlib version: 1.17-dev, git revision: rc/1.16-rc1-95-gc62dfe7, cmake configuration time: Tue Mar 17 16:53:55 UTC 2020, build type: RelWithDebInfo, build system: Linux-2.6.32-754.6.3.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(u'MADlib version: 1.17-dev, git revision: rc/1.16-rc1-95-gc62dfe7, cmake configuration time: Tue Mar 17 16:53:55 UTC 2020, build type: RelWithDebInfo, build system: Linux-2.6.32-754.6.3.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql select madlib.version();\n",
"#%sql select version();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"distr\"></a>\n",
"# 1. Setup distribution rules\n",
"\n",
"Here are different ways to set up distribution rules tables.\n",
"\n",
"First get the GPU configuration in the cluster using the MADlib helper function `gpu_configuration`:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"1 rows affected.\n",
"20 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>hostname</th>\n",
" <th>gpu_descr</th>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix0</td>\n",
" <td>device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix0</td>\n",
" <td>device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix0</td>\n",
" <td>device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix0</td>\n",
" <td>device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix1</td>\n",
" <td>device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix1</td>\n",
" <td>device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix1</td>\n",
" <td>device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix1</td>\n",
" <td>device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix2</td>\n",
" <td>device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix2</td>\n",
" <td>device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix2</td>\n",
" <td>device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix2</td>\n",
" <td>device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix3</td>\n",
" <td>device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix3</td>\n",
" <td>device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix3</td>\n",
" <td>device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix3</td>\n",
" <td>device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix4</td>\n",
" <td>device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix4</td>\n",
" <td>device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix4</td>\n",
" <td>device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>gpsix4</td>\n",
" <td>device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(u'gpsix0', u'device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0'),\n",
" (u'gpsix0', u'device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0'),\n",
" (u'gpsix0', u'device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0'),\n",
" (u'gpsix0', u'device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0'),\n",
" (u'gpsix1', u'device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0'),\n",
" (u'gpsix1', u'device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0'),\n",
" (u'gpsix1', u'device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0'),\n",
" (u'gpsix1', u'device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0'),\n",
" (u'gpsix2', u'device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0'),\n",
" (u'gpsix2', u'device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0'),\n",
" (u'gpsix2', u'device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0'),\n",
" (u'gpsix2', u'device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0'),\n",
" (u'gpsix3', u'device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0'),\n",
" (u'gpsix3', u'device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0'),\n",
" (u'gpsix3', u'device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0'),\n",
" (u'gpsix3', u'device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0'),\n",
" (u'gpsix4', u'device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0'),\n",
" (u'gpsix4', u'device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0'),\n",
" (u'gpsix4', u'device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0'),\n",
" (u'gpsix4', u'device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0')]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS host_gpu_mapping_tf;\n",
"SELECT * FROM madlib.gpu_configuration('host_gpu_mapping_tf');\n",
"SELECT * FROM host_gpu_mapping_tf ORDER BY hostname, gpu_descr;"
]
},
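{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (a simple illustrative query, not required by the preprocessor), aggregate the mapping table to count GPUs per host:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"-- Count GPUs per host from the mapping table generated above\n",
"SELECT hostname, count(*) AS num_gpus\n",
"FROM host_gpu_mapping_tf\n",
"GROUP BY hostname\n",
"ORDER BY hostname;"
]
},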
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Review the Greenplum segments in the `gp_segment_configuration` table:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"21 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>dbid</th>\n",
" <th>content</th>\n",
" <th>role</th>\n",
" <th>preferred_role</th>\n",
" <th>mode</th>\n",
" <th>status</th>\n",
" <th>port</th>\n",
" <th>hostname</th>\n",
" <th>address</th>\n",
" <th>datadir</th>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>5432</td>\n",
" <td>gpsix0</td>\n",
" <td>gpsix0</td>\n",
" <td>/data/master/gpseg-1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40000</td>\n",
" <td>gpsix0</td>\n",
" <td>gpsix0</td>\n",
" <td>/data/primary0/gpseg0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40001</td>\n",
" <td>gpsix0</td>\n",
" <td>gpsix0</td>\n",
" <td>/data/primary1/gpseg1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40002</td>\n",
" <td>gpsix0</td>\n",
" <td>gpsix0</td>\n",
" <td>/data/primary2/gpseg2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>3</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40003</td>\n",
" <td>gpsix0</td>\n",
" <td>gpsix0</td>\n",
" <td>/data/primary3/gpseg3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40000</td>\n",
" <td>gpsix1</td>\n",
" <td>gpsix1</td>\n",
" <td>/data/primary0/gpseg4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40001</td>\n",
" <td>gpsix1</td>\n",
" <td>gpsix1</td>\n",
" <td>/data/primary1/gpseg5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40002</td>\n",
" <td>gpsix1</td>\n",
" <td>gpsix1</td>\n",
" <td>/data/primary2/gpseg6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>9</td>\n",
" <td>7</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40003</td>\n",
" <td>gpsix1</td>\n",
" <td>gpsix1</td>\n",
" <td>/data/primary3/gpseg7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>10</td>\n",
" <td>8</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40000</td>\n",
" <td>gpsix2</td>\n",
" <td>gpsix2</td>\n",
" <td>/data/primary0/gpseg8</td>\n",
" </tr>\n",
" <tr>\n",
" <td>11</td>\n",
" <td>9</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40001</td>\n",
" <td>gpsix2</td>\n",
" <td>gpsix2</td>\n",
" <td>/data/primary1/gpseg9</td>\n",
" </tr>\n",
" <tr>\n",
" <td>12</td>\n",
" <td>10</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40002</td>\n",
" <td>gpsix2</td>\n",
" <td>gpsix2</td>\n",
" <td>/data/primary2/gpseg10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>13</td>\n",
" <td>11</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40003</td>\n",
" <td>gpsix2</td>\n",
" <td>gpsix2</td>\n",
" <td>/data/primary3/gpseg11</td>\n",
" </tr>\n",
" <tr>\n",
" <td>14</td>\n",
" <td>12</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40000</td>\n",
" <td>gpsix3</td>\n",
" <td>gpsix3</td>\n",
" <td>/data/primary0/gpseg12</td>\n",
" </tr>\n",
" <tr>\n",
" <td>15</td>\n",
" <td>13</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40001</td>\n",
" <td>gpsix3</td>\n",
" <td>gpsix3</td>\n",
" <td>/data/primary1/gpseg13</td>\n",
" </tr>\n",
" <tr>\n",
" <td>16</td>\n",
" <td>14</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40002</td>\n",
" <td>gpsix3</td>\n",
" <td>gpsix3</td>\n",
" <td>/data/primary2/gpseg14</td>\n",
" </tr>\n",
" <tr>\n",
" <td>17</td>\n",
" <td>15</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40003</td>\n",
" <td>gpsix3</td>\n",
" <td>gpsix3</td>\n",
" <td>/data/primary3/gpseg15</td>\n",
" </tr>\n",
" <tr>\n",
" <td>18</td>\n",
" <td>16</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40000</td>\n",
" <td>gpsix4</td>\n",
" <td>gpsix4</td>\n",
" <td>/data/primary0/gpseg16</td>\n",
" </tr>\n",
" <tr>\n",
" <td>19</td>\n",
" <td>17</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40001</td>\n",
" <td>gpsix4</td>\n",
" <td>gpsix4</td>\n",
" <td>/data/primary1/gpseg17</td>\n",
" </tr>\n",
" <tr>\n",
" <td>20</td>\n",
" <td>18</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40002</td>\n",
" <td>gpsix4</td>\n",
" <td>gpsix4</td>\n",
" <td>/data/primary2/gpseg18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>21</td>\n",
" <td>19</td>\n",
" <td>p</td>\n",
" <td>p</td>\n",
" <td>n</td>\n",
" <td>u</td>\n",
" <td>40003</td>\n",
" <td>gpsix4</td>\n",
" <td>gpsix4</td>\n",
" <td>/data/primary3/gpseg19</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(1, -1, u'p', u'p', u'n', u'u', 5432, u'gpsix0', u'gpsix0', u'/data/master/gpseg-1'),\n",
" (2, 0, u'p', u'p', u'n', u'u', 40000, u'gpsix0', u'gpsix0', u'/data/primary0/gpseg0'),\n",
" (3, 1, u'p', u'p', u'n', u'u', 40001, u'gpsix0', u'gpsix0', u'/data/primary1/gpseg1'),\n",
" (4, 2, u'p', u'p', u'n', u'u', 40002, u'gpsix0', u'gpsix0', u'/data/primary2/gpseg2'),\n",
" (5, 3, u'p', u'p', u'n', u'u', 40003, u'gpsix0', u'gpsix0', u'/data/primary3/gpseg3'),\n",
" (6, 4, u'p', u'p', u'n', u'u', 40000, u'gpsix1', u'gpsix1', u'/data/primary0/gpseg4'),\n",
" (7, 5, u'p', u'p', u'n', u'u', 40001, u'gpsix1', u'gpsix1', u'/data/primary1/gpseg5'),\n",
" (8, 6, u'p', u'p', u'n', u'u', 40002, u'gpsix1', u'gpsix1', u'/data/primary2/gpseg6'),\n",
" (9, 7, u'p', u'p', u'n', u'u', 40003, u'gpsix1', u'gpsix1', u'/data/primary3/gpseg7'),\n",
" (10, 8, u'p', u'p', u'n', u'u', 40000, u'gpsix2', u'gpsix2', u'/data/primary0/gpseg8'),\n",
" (11, 9, u'p', u'p', u'n', u'u', 40001, u'gpsix2', u'gpsix2', u'/data/primary1/gpseg9'),\n",
" (12, 10, u'p', u'p', u'n', u'u', 40002, u'gpsix2', u'gpsix2', u'/data/primary2/gpseg10'),\n",
" (13, 11, u'p', u'p', u'n', u'u', 40003, u'gpsix2', u'gpsix2', u'/data/primary3/gpseg11'),\n",
" (14, 12, u'p', u'p', u'n', u'u', 40000, u'gpsix3', u'gpsix3', u'/data/primary0/gpseg12'),\n",
" (15, 13, u'p', u'p', u'n', u'u', 40001, u'gpsix3', u'gpsix3', u'/data/primary1/gpseg13'),\n",
" (16, 14, u'p', u'p', u'n', u'u', 40002, u'gpsix3', u'gpsix3', u'/data/primary2/gpseg14'),\n",
" (17, 15, u'p', u'p', u'n', u'u', 40003, u'gpsix3', u'gpsix3', u'/data/primary3/gpseg15'),\n",
" (18, 16, u'p', u'p', u'n', u'u', 40000, u'gpsix4', u'gpsix4', u'/data/primary0/gpseg16'),\n",
" (19, 17, u'p', u'p', u'n', u'u', 40001, u'gpsix4', u'gpsix4', u'/data/primary1/gpseg17'),\n",
" (20, 18, u'p', u'p', u'n', u'u', 40002, u'gpsix4', u'gpsix4', u'/data/primary2/gpseg18'),\n",
" (21, 19, u'p', u'p', u'n', u'u', 40003, u'gpsix4', u'gpsix4', u'/data/primary3/gpseg19')]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT * FROM gp_segment_configuration ORDER BY dbid;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now JOIN the above 2 tables to build up various distribution rules, depending on your needs.\n",
"\n",
"Build distribution rules table for 4 VMs:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"16 rows affected.\n",
"16 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>dbid</th>\n",
" <th>hostname</th>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>6</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>7</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>8</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>9</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>10</td>\n",
" <td>gpsix2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>11</td>\n",
" <td>gpsix2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>12</td>\n",
" <td>gpsix2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>13</td>\n",
" <td>gpsix2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>14</td>\n",
" <td>gpsix3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>15</td>\n",
" <td>gpsix3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>16</td>\n",
" <td>gpsix3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>17</td>\n",
" <td>gpsix3</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(2, u'gpsix0'),\n",
" (3, u'gpsix0'),\n",
" (4, u'gpsix0'),\n",
" (5, u'gpsix0'),\n",
" (6, u'gpsix1'),\n",
" (7, u'gpsix1'),\n",
" (8, u'gpsix1'),\n",
" (9, u'gpsix1'),\n",
" (10, u'gpsix2'),\n",
" (11, u'gpsix2'),\n",
" (12, u'gpsix2'),\n",
" (13, u'gpsix2'),\n",
" (14, u'gpsix3'),\n",
" (15, u'gpsix3'),\n",
" (16, u'gpsix3'),\n",
" (17, u'gpsix3')]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS segments_to_use_4VMs;\n",
"CREATE TABLE segments_to_use_4VMs AS\n",
" SELECT DISTINCT dbid, hostname FROM gp_segment_configuration JOIN host_gpu_mapping_tf USING (hostname)\n",
" WHERE role='p' AND content>=0 AND hostname!='gpsix4';\n",
"SELECT * FROM segments_to_use_4VMs ORDER BY hostname, dbid;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build distribution rules table for 2 VMs:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"8 rows affected.\n",
"8 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>dbid</th>\n",
" <th>hostname</th>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>6</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>7</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>8</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>9</td>\n",
" <td>gpsix1</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(2, u'gpsix0'),\n",
" (3, u'gpsix0'),\n",
" (4, u'gpsix0'),\n",
" (5, u'gpsix0'),\n",
" (6, u'gpsix1'),\n",
" (7, u'gpsix1'),\n",
" (8, u'gpsix1'),\n",
" (9, u'gpsix1')]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS segments_to_use_2VMs;\n",
"CREATE TABLE segments_to_use_2VMs AS\n",
" SELECT DISTINCT dbid, hostname FROM gp_segment_configuration JOIN host_gpu_mapping_tf USING (hostname)\n",
" WHERE role='p' AND content>=0 AND (hostname='gpsix0' OR hostname='gpsix1');\n",
"SELECT * FROM segments_to_use_2VMs ORDER BY hostname, dbid;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build distribution rules table for 1 VM:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"4 rows affected.\n",
"4 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>dbid</th>\n",
" <th>hostname</th>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(2, u'gpsix0'), (3, u'gpsix0'), (4, u'gpsix0'), (5, u'gpsix0')]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS segments_to_use_1VM;\n",
"CREATE TABLE segments_to_use_1VM AS\n",
" SELECT DISTINCT dbid, hostname FROM gp_segment_configuration JOIN host_gpu_mapping_tf USING (hostname)\n",
" WHERE role='p' AND content>=0 AND hostname='gpsix0';\n",
"SELECT * FROM segments_to_use_1VM ORDER BY hostname, dbid;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build distribution rules table for 1 segment:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"1 rows affected.\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>dbid</th>\n",
" <th>hostname</th>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>gpsix0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(2, u'gpsix0')]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS segments_to_use_1seg;\n",
"CREATE TABLE segments_to_use_1seg AS\n",
" SELECT DISTINCT dbid, hostname FROM gp_segment_configuration JOIN host_gpu_mapping_tf USING (hostname)\n",
" WHERE dbid=2;\n",
"SELECT * FROM segments_to_use_1seg ORDER BY hostname, dbid;"
]
},
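{
"cell_type": "markdown",
"metadata": {},
"source": [
"A distribution rules table can also be written by hand rather than derived from system tables. Below is a minimal sketch, assuming only the `dbid` and `hostname` columns are needed, as in the tables built above (the dbids shown come from this cluster's `gp_segment_configuration`; adjust them to yours):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"-- Hand-built rules table: one row per segment to use (dbids are cluster specific)\n",
"DROP TABLE IF EXISTS segments_to_use_custom;\n",
"CREATE TABLE segments_to_use_custom (dbid smallint, hostname text);\n",
"INSERT INTO segments_to_use_custom VALUES (2, 'gpsix0'), (6, 'gpsix1');\n",
"SELECT * FROM segments_to_use_custom ORDER BY hostname, dbid;"
]
},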
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"pp_train\"></a>\n",
"# 2. Run preprocessor for training image data\n",
"\n",
"Run the preprocessor to generate the packed output table on the segments that you want to use for training and validation. The training data in our example is CIFAR-10 and is in table `image_data_train` and the validation data is in `image_data_val` .\n",
"\n",
"<a id=\"pp_train2a\"></a>\n",
"## 2a. All segments\n",
"\n",
"First distribute to all segments:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"1 rows affected.\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>source_table</th>\n",
" <th>output_table</th>\n",
" <th>dependent_varname</th>\n",
" <th>independent_varname</th>\n",
" <th>dependent_vartype</th>\n",
" <th>class_values</th>\n",
" <th>buffer_size</th>\n",
" <th>normalizing_const</th>\n",
" <th>num_classes</th>\n",
" <th>distribution_rules</th>\n",
" <th>__internal_gpu_config__</th>\n",
" </tr>\n",
" <tr>\n",
" <td>image_data_train</td>\n",
" <td>image_data_train_packed</td>\n",
" <td>y</td>\n",
" <td>x</td>\n",
" <td>smallint</td>\n",
" <td>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</td>\n",
" <td>2500</td>\n",
" <td>255.0</td>\n",
" <td>10</td>\n",
" <td>all_segments</td>\n",
" <td>all_segments</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(u'image_data_train', u'image_data_train_packed', u'y', u'x', u'smallint', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 2500, 255.0, 10, 'all_segments', 'all_segments')]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS image_data_train_packed, image_data_train_packed_summary;\n",
"\n",
"SELECT madlib.training_preprocessor_dl('image_data_train', -- Source table\n",
" 'image_data_train_packed', -- Output table\n",
" 'y', -- Dependent variable\n",
" 'x', -- Independent variable\n",
" NULL, -- Buffer size\n",
" 255, -- Normalizing constant\n",
" NULL, -- Number of classes\n",
" 'all_segments' -- Distribution rules\n",
" );\n",
"SELECT * FROM image_data_train_packed_summary;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check distribution:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"20 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>__dist_key__</th>\n",
" <th>independent_var_shape</th>\n",
" <th>dependent_var_shape</th>\n",
" <th>buffer_id</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <td>6</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <td>7</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <td>9</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <td>12</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <td>14</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>19</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <td>21</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>27</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>28</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>29</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <td>33</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <td>34</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>55</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>56</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>2</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(0, [2500, 32, 32, 3], [2500, 10], 1),\n",
" (1, [2500, 32, 32, 3], [2500, 10], 4),\n",
" (2, [2500, 32, 32, 3], [2500, 10], 9),\n",
" (3, [2500, 32, 32, 3], [2500, 10], 7),\n",
" (4, [2500, 32, 32, 3], [2500, 10], 14),\n",
" (5, [2500, 32, 32, 3], [2500, 10], 17),\n",
" (6, [2500, 32, 32, 3], [2500, 10], 16),\n",
" (7, [2500, 32, 32, 3], [2500, 10], 11),\n",
" (9, [2500, 32, 32, 3], [2500, 10], 13),\n",
" (12, [2500, 32, 32, 3], [2500, 10], 15),\n",
" (14, [2500, 32, 32, 3], [2500, 10], 6),\n",
" (19, [2500, 32, 32, 3], [2500, 10], 12),\n",
" (21, [2500, 32, 32, 3], [2500, 10], 18),\n",
" (27, [2500, 32, 32, 3], [2500, 10], 10),\n",
" (28, [2500, 32, 32, 3], [2500, 10], 5),\n",
" (29, [2500, 32, 32, 3], [2500, 10], 8),\n",
" (33, [2500, 32, 32, 3], [2500, 10], 19),\n",
" (34, [2500, 32, 32, 3], [2500, 10], 0),\n",
" (55, [2500, 32, 32, 3], [2500, 10], 3),\n",
" (56, [2500, 32, 32, 3], [2500, 10], 2)]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM image_data_train_packed ORDER BY __dist_key__;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"pp_train2b\"></a>\n",
"## 2b. All segments on hosts with GPUs\n",
"\n",
"Now distribute to all segments on hosts with GPUs attached:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"1 rows affected.\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>source_table</th>\n",
" <th>output_table</th>\n",
" <th>dependent_varname</th>\n",
" <th>independent_varname</th>\n",
" <th>dependent_vartype</th>\n",
" <th>class_values</th>\n",
" <th>buffer_size</th>\n",
" <th>normalizing_const</th>\n",
" <th>num_classes</th>\n",
" <th>distribution_rules</th>\n",
" <th>__internal_gpu_config__</th>\n",
" </tr>\n",
" <tr>\n",
" <td>image_data_train</td>\n",
" <td>image_data_train_packed</td>\n",
" <td>y</td>\n",
" <td>x</td>\n",
" <td>smallint</td>\n",
" <td>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</td>\n",
" <td>2500</td>\n",
" <td>255.0</td>\n",
" <td>10</td>\n",
" <td>[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]</td>\n",
" <td>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(u'image_data_train', u'image_data_train_packed', u'y', u'x', u'smallint', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 2500, 255.0, 10, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS image_data_train_packed, image_data_train_packed_summary;\n",
"\n",
"SELECT madlib.training_preprocessor_dl('image_data_train', -- Source table\n",
" 'image_data_train_packed', -- Output table\n",
" 'y', -- Dependent variable\n",
" 'x', -- Independent variable\n",
" NULL, -- Buffer size\n",
" 255, -- Normalizing constant\n",
" NULL, -- Number of classes\n",
" 'gpu_segments' -- Distribution rules\n",
" );\n",
"SELECT * FROM image_data_train_packed_summary;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check distribution:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"20 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>__dist_key__</th>\n",
" <th>independent_var_shape</th>\n",
" <th>dependent_var_shape</th>\n",
" <th>buffer_id</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <td>6</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <td>7</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <td>9</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <td>12</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <td>14</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>19</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <td>21</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>27</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>28</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>29</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <td>33</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <td>34</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>55</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>56</td>\n",
" <td>[2500, 32, 32, 3]</td>\n",
" <td>[2500, 10]</td>\n",
" <td>2</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(0, [2500, 32, 32, 3], [2500, 10], 1),\n",
" (1, [2500, 32, 32, 3], [2500, 10], 4),\n",
" (2, [2500, 32, 32, 3], [2500, 10], 9),\n",
" (3, [2500, 32, 32, 3], [2500, 10], 7),\n",
" (4, [2500, 32, 32, 3], [2500, 10], 14),\n",
" (5, [2500, 32, 32, 3], [2500, 10], 17),\n",
" (6, [2500, 32, 32, 3], [2500, 10], 16),\n",
" (7, [2500, 32, 32, 3], [2500, 10], 11),\n",
" (9, [2500, 32, 32, 3], [2500, 10], 13),\n",
" (12, [2500, 32, 32, 3], [2500, 10], 15),\n",
" (14, [2500, 32, 32, 3], [2500, 10], 6),\n",
" (19, [2500, 32, 32, 3], [2500, 10], 12),\n",
" (21, [2500, 32, 32, 3], [2500, 10], 18),\n",
" (27, [2500, 32, 32, 3], [2500, 10], 10),\n",
" (28, [2500, 32, 32, 3], [2500, 10], 5),\n",
" (29, [2500, 32, 32, 3], [2500, 10], 8),\n",
" (33, [2500, 32, 32, 3], [2500, 10], 19),\n",
" (34, [2500, 32, 32, 3], [2500, 10], 0),\n",
" (55, [2500, 32, 32, 3], [2500, 10], 3),\n",
" (56, [2500, 32, 32, 3], [2500, 10], 2)]"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM image_data_train_packed ORDER BY __dist_key__;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"pp_train2c\"></a>\n",
"## 2c. Segments on 2 hosts with GPUs\n",
"\n",
"Now distribute to segments on 2 hosts with GPUs attached (if for some reason I need to do this):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For small datasets like in this example, buffer size is mainly determined by the number of segments in the database. For a Greenplum database with 2 segments, there will be 2 rows with a buffer size of 26. For PostgresSQL, there would be only one row with a buffer size of 52 since it is a single node database. For larger data sets, other factors go into computing buffers size besides number of segments. \n",
"\n",
"Review the output summary table:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"1 rows affected.\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>source_table</th>\n",
" <th>output_table</th>\n",
" <th>dependent_varname</th>\n",
" <th>independent_varname</th>\n",
" <th>dependent_vartype</th>\n",
" <th>class_values</th>\n",
" <th>buffer_size</th>\n",
" <th>normalizing_const</th>\n",
" <th>num_classes</th>\n",
" <th>distribution_rules</th>\n",
" <th>__internal_gpu_config__</th>\n",
" </tr>\n",
" <tr>\n",
" <td>image_data_train</td>\n",
" <td>image_data_train_packed</td>\n",
" <td>y</td>\n",
" <td>x</td>\n",
" <td>smallint</td>\n",
" <td>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</td>\n",
" <td>6250</td>\n",
" <td>255.0</td>\n",
" <td>10</td>\n",
" <td>[2, 3, 4, 5, 6, 7, 8, 9]</td>\n",
" <td>[0, 1, 2, 3, 4, 5, 6, 7]</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(u'image_data_train', u'image_data_train_packed', u'y', u'x', u'smallint', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 6250, 255.0, 10, [2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7])]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS image_data_train_packed, image_data_train_packed_summary;\n",
"\n",
"SELECT madlib.training_preprocessor_dl('image_data_train', -- Source table\n",
" 'image_data_train_packed', -- Output table\n",
" 'y', -- Dependent variable\n",
" 'x', -- Independent variable\n",
" NULL, -- Buffer size\n",
" 255, -- Normalizing constant\n",
" NULL, -- Number of classes\n",
" 'segments_to_use_2VMs' -- Distribution rules\n",
" );\n",
"SELECT * FROM image_data_train_packed_summary;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check distribution:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"8 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>__dist_key__</th>\n",
" <th>independent_var_shape</th>\n",
" <th>dependent_var_shape</th>\n",
" <th>buffer_id</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>14</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>28</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>34</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>55</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>56</td>\n",
" <td>[6250, 32, 32, 3]</td>\n",
" <td>[6250, 10]</td>\n",
" <td>2</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(0, [6250, 32, 32, 3], [6250, 10], 1),\n",
" (1, [6250, 32, 32, 3], [6250, 10], 4),\n",
" (3, [6250, 32, 32, 3], [6250, 10], 7),\n",
" (14, [6250, 32, 32, 3], [6250, 10], 6),\n",
" (28, [6250, 32, 32, 3], [6250, 10], 5),\n",
" (34, [6250, 32, 32, 3], [6250, 10], 0),\n",
" (55, [6250, 32, 32, 3], [6250, 10], 3),\n",
" (56, [6250, 32, 32, 3], [6250, 10], 2)]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM image_data_train_packed ORDER BY __dist_key__;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"pp_train2d\"></a>\n",
"## 2d. Segments on 1 segment\n",
"\n",
"Now distribute 1 segment on a host with GPUs attached (if for some reason I need to do this):"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"1 rows affected.\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>source_table</th>\n",
" <th>output_table</th>\n",
" <th>dependent_varname</th>\n",
" <th>independent_varname</th>\n",
" <th>dependent_vartype</th>\n",
" <th>class_values</th>\n",
" <th>buffer_size</th>\n",
" <th>normalizing_const</th>\n",
" <th>num_classes</th>\n",
" <th>distribution_rules</th>\n",
" <th>__internal_gpu_config__</th>\n",
" </tr>\n",
" <tr>\n",
" <td>image_data_train</td>\n",
" <td>image_data_train_packed</td>\n",
" <td>y</td>\n",
" <td>x</td>\n",
" <td>smallint</td>\n",
" <td>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</td>\n",
" <td>16667</td>\n",
" <td>255.0</td>\n",
" <td>10</td>\n",
" <td>[2]</td>\n",
" <td>[0]</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(u'image_data_train', u'image_data_train_packed', u'y', u'x', u'smallint', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 16667, 255.0, 10, [2], [0])]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS image_data_train_packed, image_data_train_packed_summary;\n",
"\n",
"SELECT madlib.training_preprocessor_dl('image_data_train', -- Source table\n",
" 'image_data_train_packed', -- Output table\n",
" 'y', -- Dependent variable\n",
" 'x', -- Independent variable\n",
" NULL, -- Buffer size\n",
" 255, -- Normalizing constant\n",
" NULL, -- Number of classes\n",
" 'segments_to_use_1seg' -- Distribution rules\n",
" );\n",
"SELECT * FROM image_data_train_packed_summary;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check distribution:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"3 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>__dist_key__</th>\n",
" <th>independent_var_shape</th>\n",
" <th>dependent_var_shape</th>\n",
" <th>buffer_id</th>\n",
" </tr>\n",
" <tr>\n",
" <td>34</td>\n",
" <td>[16667, 32, 32, 3]</td>\n",
" <td>[16667, 10]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>34</td>\n",
" <td>[16666, 32, 32, 3]</td>\n",
" <td>[16666, 10]</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>34</td>\n",
" <td>[16667, 32, 32, 3]</td>\n",
" <td>[16667, 10]</td>\n",
" <td>0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(34, [16667, 32, 32, 3], [16667, 10], 1),\n",
" (34, [16666, 32, 32, 3], [16666, 10], 2),\n",
" (34, [16667, 32, 32, 3], [16667, 10], 0)]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM image_data_train_packed ORDER BY __dist_key__;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"pp_val\"></a>\n",
"# 3. Run preprocessor for validation image data\n",
"\n",
"The same idea applies to the validation data set for distribution rules. Continuing the example above with distribution to a single segment:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"Done.\n",
"1 rows affected.\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>source_table</th>\n",
" <th>output_table</th>\n",
" <th>dependent_varname</th>\n",
" <th>independent_varname</th>\n",
" <th>dependent_vartype</th>\n",
" <th>class_values</th>\n",
" <th>buffer_size</th>\n",
" <th>normalizing_const</th>\n",
" <th>num_classes</th>\n",
" <th>distribution_rules</th>\n",
" <th>__internal_gpu_config__</th>\n",
" </tr>\n",
" <tr>\n",
" <td>image_data_val</td>\n",
" <td>image_data_val_packed</td>\n",
" <td>y</td>\n",
" <td>x</td>\n",
" <td>smallint</td>\n",
" <td>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</td>\n",
" <td>10000</td>\n",
" <td>255.0</td>\n",
" <td>10</td>\n",
" <td>[2]</td>\n",
" <td>[0]</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(u'image_data_val', u'image_data_val_packed', u'y', u'x', u'smallint', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 10000, 255.0, 10, [2], [0])]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS image_data_val_packed, image_data_val_packed_summary;\n",
"\n",
"SELECT madlib.validation_preprocessor_dl('image_data_val', -- Source table\n",
" 'image_data_val_packed', -- Output table\n",
" 'y', -- Dependent variable\n",
" 'x', -- Independent variable\n",
" 'image_data_train_packed', -- Training preprocessor output table \n",
" NULL, -- Buffer size\n",
" 'segments_to_use_1seg' -- Distribution rules\n",
" );\n",
"SELECT * FROM image_data_val_packed_summary;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check distribution:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://gpadmin@localhost:8000/cifar_places\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>__dist_key__</th>\n",
" <th>independent_var_shape</th>\n",
" <th>dependent_var_shape</th>\n",
" <th>buffer_id</th>\n",
" </tr>\n",
" <tr>\n",
" <td>34</td>\n",
" <td>[10000, 32, 32, 3]</td>\n",
" <td>[10000, 10]</td>\n",
" <td>0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(34, [10000, 32, 32, 3], [10000, 10], 0)]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM image_data_val_packed ORDER BY __dist_key__;"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.16"
}
},
"nbformat": 4,
"nbformat_minor": 1
}