updated DL preprocessor docs for bytea (#445)
* updated DL preprocessor docs for bytea
* address review comments
diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
index a3f4281..8d70431 100644
--- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
+++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
@@ -18,7 +18,7 @@
* under the License.
*
* @file input_preprocessor_dl.sql_in
- * @brief TODO
+ * @brief Utilities to prepare input image data for use by deep learning modules.
* @date December 2018
*
*/
@@ -86,9 +86,10 @@
<dd>TEXT. Name of the output table from the training preprocessor which
will be used as input to algorithms that support mini-batching.
Note that the arrays packed into the output table are shuffled
- and normalized (by dividing each element in the independent variable array
- by the optional 'normalizing_const' parameter), so they will not match
- up in an obvious way with the rows in the source table.
+ and normalized, by dividing each element in the independent variable array
+ by the optional 'normalizing_const' parameter. For performance reasons,
+ packed arrays are converted to PostgreSQL bytea format, which is a
+ variable-length binary string.
In the case a validation data set is used (see
later on this page), this output table is also used
@@ -158,11 +159,15 @@
<dt>output_table</dt>
<dd>TEXT. Name of the output table from the validation
- preprocessor which will be used as input to algorithms that support mini-batching. The arrays packed into the output table are
+ preprocessor which will be used as input to algorithms that support mini-batching.
+ The arrays packed into the output table are
normalized using the same normalizing constant from the
training preprocessor as specified in
the 'training_preprocessor_table' parameter described below.
Validation data is not shuffled.
+ For performance reasons,
+ packed arrays are converted to PostgreSQL bytea format, which is a
+ variable-length binary string.
</dd>
<dt>dependent_varname</dt>
@@ -209,25 +214,43 @@
validation_preprocessor_dl() contain the following columns:
<table class="output">
<tr>
- <th>buffer_id</th>
- <td>INTEGER. Unique id for each row in the packed table.
+ <th>independent_var</th>
+ <td>BYTEA. Packed array of independent variables in PostgreSQL bytea format.
+ Arrays of independent variables packed into the output table are
+ normalized by dividing each element in the independent variable array by the
+ optional 'normalizing_const' parameter. Training data is shuffled, but
+ validation data is not.
</td>
</tr>
<tr>
<th>dependent_var</th>
- <td>ANYARRAY[]. Packed array of dependent variables.
+ <td>BYTEA. Packed array of dependent variables in PostgreSQL bytea format.
The dependent variable is always one-hot encoded as an
- INTEGER[] array. For now, we are assuming that
+ integer array. For now, we are assuming that
input_preprocessor_dl() will be used
only for classification problems using deep learning. So
the dependent variable is one-hot encoded, unless it's already a
numeric array in which case we assume it's already one-hot
- encoded and just cast it to an INTEGER[] array.
+ encoded and just cast it to an integer array.
</td>
</tr>
<tr>
- <th>independent_var</th>
- <td>REAL[]. Packed array of independent variables.
+ <th>independent_var_shape</th>
+ <td>INTEGER[]. Shape of the independent variable array after preprocessing.
+ The first element is the number of images packed per row, and the remaining
+ elements depend on the image layout (e.g., channels first or channels last).
+ </td>
+ </tr>
+ <tr>
+ <th>dependent_var_shape</th>
+ <td>INTEGER[]. Shape of the dependent variable array after preprocessing.
+ The first element is the number of images packed per row, and the second
+ element is the number of class values.
+ </td>
+ </tr>
+ <tr>
+ <th>buffer_id</th>
+ <td>INTEGER. Unique id for each row in the packed table.
</td>
</tr>
</table>
@@ -272,7 +295,7 @@
<th>num_classes</th>
<td>Number of dependent levels the one-hot encoding is created
for. NULLs are padded at the end if the number of distinct class
- levels found in the input data is lesser than 'num_classes' parameter
+ levels found in the input data is less than the 'num_classes' parameter
specified in training_preprocessor_dl().</td>
</tr>
</table>
@@ -374,35 +397,22 @@
255 -- Normalizing constant
);
</pre>
-For small datasets like in this example, buffer size is mainly
-determined by the number of segments in the database.
-This example is run on a Greenplum database with 3 segments,
-so there are 3 rows with a buffer size of 18 (in this case
-two segments will get 18 rows and one segment will get 16 rows).
-For PostgresSQL, there would be only one row with a buffer
-size of 52 since it is a single node database.
-For larger data sets, other factors go into
-computing buffers size besides number of segments.
-Note that dependent variable is a text type, and it is one-hot encoded
-after preprocessing.
-Here is a sample of the packed output table:
+For small datasets like the one in this example, buffer size is mainly determined
+by the number of segments in the database. For a Greenplum database with 2 segments,
+there will be 2 rows with a buffer size of 26. For PostgreSQL, there would
+be only one row with a buffer size of 52 since it is a single-node database.
+For larger datasets, other factors besides the number of segments go into
+computing the buffer size.
+Here is the packed output table of training data for our simple example:
<pre class="example">
-\\x on
-SELECT * FROM image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
</pre>
<pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.921569,0.207843,0.152941},{0.568627,0.654902,0.819608}},{{0.772549,0.576471,0.870588},{0.215686,0.854902,0.207843}}},...}
-dependent_var | {{0,0,1},{0,0,1},{1,0,0},{0,1,0},...}
-buffer_id | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.639216,0.886275,0.631373},{0.219608,0.713726,0.937255}},{{0.505882,0.603922,0.137255},{0.286275,0.454902,0.803922}}},...}
-dependent_var | {{1,0,0},{0,1,0},{1,0,0},{0,0,1},...}
-buffer_id | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.635294,0.745098,0.486275},{0.721569,0.258824,0.541176}},{{0.0392157,0.941177,0.313726},{0.631373,0.266667,0.568627}}},...}
-dependent_var | {{0,0,1},{0,0,1},{0,1,0},{1,0,0},...}
-buffer_id | 2
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,2,2,3} | {26,3} | 0
+ {26,2,2,3} | {26,3} | 1
+(2 rows)
</pre>
Review the output summary table:
<pre class="example">
@@ -417,8 +427,8 @@
independent_varname | rgb
dependent_vartype | text
class_values | {bird,cat,dog}
-buffer_size | 18
-normalizing_const | 255.0
+buffer_size | 26
+normalizing_const | 255
num_classes | 3
</pre>
@@ -434,32 +444,23 @@
'species', -- Dependent variable
'rgb', -- Independent variable
'image_data_packed', -- From training preprocessor step
- 2 -- Buffer size
+ NULL -- Buffer size
);
</pre>
We can choose to use a new buffer size compared to the
training_preprocessor_dl run. Other parameters such as num_classes and
normalizing_const that were passed to training_preprocessor_dl are
automatically inferred using the image_data_packed param that is passed.
-Here is a sample of the packed output table:
+Here is the packed output table of validation data for our simple example:
<pre class="example">
-\\x on
-SELECT * FROM val_image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;
</pre>
<pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.270588,0.0666667,0.435294},{0.4,0.133333,0.207843}},{{0.588235,0.933333,0.556863},...}
-dependent_var | {{1,0,0},{0,1,0}}
-buffer_id | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.301961,0.337255,0.427451},{0.317647,0.909804,0.835294}},{{0.933333,0.247059,0.886275},...}
-dependent_var | {{1,0,0},{1,0,0}}
-buffer_id | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.556863,0.956863,0.117647},{0.764706,0.929412,0.160784}},{{0.0235294,0.886275,0.0196078},...}
-dependent_var | {{1,0,0},{1,0,0}}
-buffer_id | 2
-...
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,2,2,3} | {26,3} | 0
+ {26,2,2,3} | {26,3} | 1
+(2 rows)
</pre>
Review the output summary table:
<pre class="example">
@@ -474,8 +475,8 @@
independent_varname | rgb
dependent_vartype | text
class_values | {bird,cat,dog}
-buffer_size | 2
-normalizing_const | 255.0
+buffer_size | 26
+normalizing_const | 255
num_classes | 3
</pre>
@@ -573,22 +574,14 @@
</pre>
Here is a sample of the packed output table:
<pre class="example">
-\\x on
-SELECT * FROM image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
</pre>
<pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.203922,0.564706,0.905882,0.0470588,0.298039,0.00392157,0.635294,0.0431373,0.447059,0.552941,0.270588,0.0117647},...}
-dependent_var | {{0,1,0},{1,0,0},{1,0,0},{1,0,0},{0,0,1},...}
-buffer_id | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.25098,0.984314,0.239216,0.6,0.0509804,0.392157,0.568627,0.709804,0.0313726,0.439216,0.462745,0.419608},...}
-dependent_var | {{0,0,1},{0,0,1},{0,1,0},{0,0,1},{1,0,0},...}
-buffer_id | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.796079,0.537255,0.403922,0.0666667,0.235294,0.984314,0.596078,0.25098,0.141176,0.317647,0.658824,0.937255},...}
-dependent_var | {{0,1,0},{0,1,0},{0,1,0},{0,0,1},{0,0,1},...}
-buffer_id | 2
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,12} | {26,3} | 0
+ {26,12} | {26,3} | 1
+(2 rows)
</pre>
-# Run the preprocessor for the validation dataset.
@@ -608,20 +601,14 @@
</pre>
Here is a sample of the packed output summary table:
<pre class="example">
-\\x on
-SELECT * FROM val_image_data_packed_summary;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;
</pre>
<pre class="result">
--[ RECORD 1 ]-------+----------------------
-source_table | image_data
-output_table | val_image_data_packed
-dependent_varname | species
-independent_varname | rgb
-dependent_vartype | text
-class_values | {bird,cat,dog}
-buffer_size | 18
-normalizing_const | 255.0
-num_classes | 3
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,12} | {26,3} | 0
+ {26,12} | {26,3} | 1
+(2 rows)
</pre>
-# Generally the default buffer size will work well,
@@ -629,18 +616,24 @@
<pre class="example">
DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
SELECT madlib.training_preprocessor_dl('image_data', -- Source table
- 'image_data_packed', -- Output table
- 'species', -- Dependent variable
- 'rgb', -- Independent variable
+ 'image_data_packed', -- Output table
+ 'species', -- Dependent variable
+ 'rgb', -- Independent variable
10, -- Buffer size
255 -- Normalizing constant
);
-SELECT COUNT(*) FROM image_data_packed;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
</pre>
<pre class="result">
- count
-+-------
- 6
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {8,12} | {8,3} | 0
+ {9,12} | {9,3} | 1
+ {9,12} | {9,3} | 2
+ {9,12} | {9,3} | 3
+ {9,12} | {9,3} | 4
+ {8,12} | {8,3} | 5
+(6 rows)
</pre>
Review the output summary table:
<pre class="example">
@@ -656,7 +649,7 @@
dependent_vartype | text
class_values | {bird,cat,dog}
buffer_size | 10
-normalizing_const | 255.0
+normalizing_const | 255
num_classes | 3
</pre>
@@ -674,22 +667,14 @@
</pre>
Here is a sample of the packed output table with the padded 1-hot vector:
<pre class="example">
-\\x on
-SELECT * FROM image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
</pre>
<pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.639216,0.517647,0.87451,0.0862745,0.784314,...},...}
-dependent_var | {{0,0,1,0,0},{1,0,0,0,0},{1,0,0,0,0},{1,0,0,0,0},...}
-buffer_id | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.866667,0.0666667,0.803922,0.239216,0.741176,...},...}
-dependent_var | {{0,0,1,0,0},{0,0,1,0,0},{0,1,0,0,0},{0,1,0,0,0},...}
-buffer_id | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.184314,0.87451,0.227451,0.466667,0.203922,...},...}
-dependent_var | {{1,0,0,0,0},{0,1,0,0,0},{1,0,0,0,0},{0,0,1,0,0},...}
-buffer_id | 2
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,12} | {26,5} | 0
+ {26,12} | {26,5} | 1
+(2 rows)
</pre>
Review the output summary table:
<pre class="example">
@@ -704,8 +689,8 @@
independent_varname | rgb
dependent_vartype | text
class_values | {bird,cat,dog,NULL,NULL}
-buffer_size | 18
-normalizing_const | 255.0
+buffer_size | 26
+normalizing_const | 255
num_classes | 5
</pre>
@@ -832,8 +817,9 @@
DROP AGGREGATE IF EXISTS MADLIB_SCHEMA.agg_array_concat(anyarray);
CREATE AGGREGATE MADLIB_SCHEMA.agg_array_concat(anyarray) (
SFUNC = array_cat,
- STYPE = anyarray,
- PREFUNC = array_cat
+ PREFUNC = array_cat,
+ STYPE = anyarray
+
);
CREATE FUNCTION MADLIB_SCHEMA.convert_array_to_bytea(var REAL[])