updated DL preprocessor docs for bytea (#445)

* updated DL preprocessor docs for bytea
* address review comments

commit: 63f40e70f8dbb6c9ed2b1b91c847fd3819b1a627
author: Frank McQuillan <fmcquillan@pivotal.io> Tue Oct 01 13:52:40 2019 -0700
committer: GitHub <noreply@github.com> Tue Oct 01 13:52:40 2019 -0700
tree: c9a8b2fc1a0da5ae4a236e6e093acba84eb1fcdf
parent: 9edd74582008413ca4405e52c4f06c64efc7664c
diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
index a3f4281..8d70431 100644
--- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
+++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in

@@ -18,7 +18,7 @@
  * under the License.
  *
  * @file input_preprocessor_dl.sql_in
- * @brief TODO
+ * @brief Utilities to prepare input image data for use by deep learning modules.
  * @date December 2018
  *
  */
@@ -86,9 +86,10 @@
   <dd>TEXT.  Name of the output table from the training preprocessor which
   will be used as input to algorithms that support mini-batching.
   Note that the arrays packed into the output table are shuffled
-  and normalized (by dividing each element in the independent variable array
-  by the optional 'normalizing_const' parameter), so they will not match
-  up in an obvious way with the rows in the source table.
+  and normalized, by dividing each element in the independent variable array
+  by the optional 'normalizing_const' parameter. For performance reasons,
+  packed arrays are converted to PostgreSQL bytea format, which is a
+  variable-length binary string.
 
   In the case a validation data set is used (see
   later on this page), this output table is also used
@@ -158,11 +159,15 @@
 
   <dt>output_table</dt>
   <dd>TEXT.  Name of the output table from the validation
-  preprocessor which will be used as input to algorithms that support mini-batching.  The arrays packed into the output table are
+  preprocessor which will be used as input to algorithms that support mini-batching.
+  The arrays packed into the output table are
   normalized using the same normalizing constant from the
   training preprocessor as specified in
   the 'training_preprocessor_table' parameter described below.
   Validation data is not shuffled.
+  For performance reasons,
+  packed arrays are converted to PostgreSQL bytea format, which is a
+  variable-length binary string.
   </dd>
 
   <dt>dependent_varname</dt>
@@ -209,25 +214,43 @@
     validation_preprocessor_dl() contain the following columns:
     <table class="output">
       <tr>
-        <th>buffer_id</th>
-        <td>INTEGER. Unique id for each row in the packed table.
+        <th>independent_var</th>
+        <td>BYTEA. Packed array of independent variables in PostgreSQL bytea format.
+        Arrays of independent variables packed into the output table are
+        normalized by dividing each element in the independent variable array by the
+        optional 'normalizing_const' parameter.  Training data is shuffled, but
+        validation data is not.
         </td>
       </tr>
       <tr>
         <th>dependent_var</th>
-        <td>ANYARRAY[]. Packed array of dependent variables.
+        <td>BYTEA. Packed array of dependent variables in PostgreSQL bytea format.
         The dependent variable is always one-hot encoded as an
-        INTEGER[] array. For now, we are assuming that
+        integer array. For now, we are assuming that
         input_preprocessor_dl() will be used
         only for classification problems using deep learning. So
         the dependent variable is one-hot encoded, unless it's already a
         numeric array in which case we assume it's already one-hot
-        encoded and just cast it to an INTEGER[] array.
+        encoded and just cast it to an integer array.
         </td>
       </tr>
       <tr>
-        <th>independent_var</th>
-        <td>REAL[]. Packed array of independent variables.
+        <th>independent_var_shape</th>
+        <td>INTEGER[]. Shape of the independent variable array after preprocessing.
+        The first element is the number of images packed per row, and subsequent
+        elements will depend on how the image is described (e.g., channels first or last).
+        </td>
+      </tr>
+      <tr>
+        <th>dependent_var_shape</th>
+        <td>INTEGER[]. Shape of the dependent variable array after preprocessing.
+        The first element is the number of images packed per row, and the second
+        element is the number of class values.
+        </td>
+      </tr>
+      <tr>
+        <th>buffer_id</th>
+        <td>INTEGER. Unique id for each row in the packed table.
         </td>
       </tr>
     </table>
@@ -272,7 +295,7 @@
         <th>num_classes</th>
         <td>Number of dependent levels the one-hot encoding is created
         for. NULLs are padded at the end if the number of distinct class
-        levels found in the input data is lesser than 'num_classes' parameter
+        levels found in the input data is less than the 'num_classes' parameter
         specified in training_preprocessor_dl().</td>
     </tr>
    </table>
@@ -374,35 +397,22 @@
                                         255                   -- Normalizing constant
                                         );
 </pre>
-For small datasets like in this example, buffer size is mainly
-determined by the number of segments in the database.
-This example is run on a Greenplum database with 3 segments,
-so there are 3 rows with a buffer size of 18 (in this case
-two segments will get 18 rows and one segment will get 16 rows).
-For PostgresSQL, there would be only one row with a buffer
-size of 52 since it is a single node database.
-For larger data sets, other factors go into
-computing buffers size besides number of segments.
-Note that dependent variable is a text type, and it is one-hot encoded
-after preprocessing.
-Here is a sample of the packed output table:
+For small datasets like in this example, buffer size is mainly determined
+by the number of segments in the database. For a Greenplum database with 2 segments,
+there will be 2 rows with a buffer size of 26. For PostgreSQL, there would
+be only one row with a buffer size of 52 since it is a single node database.
+For larger data sets, other factors go into computing buffer size besides
+the number of segments.
+Here is the packed output table of training data for our simple example:
 <pre class="example">
-\\x on
-SELECT * FROM image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
 </pre>
 <pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.921569,0.207843,0.152941},{0.568627,0.654902,0.819608}},{{0.772549,0.576471,0.870588},{0.215686,0.854902,0.207843}}},...}
-dependent_var   | {{0,0,1},{0,0,1},{1,0,0},{0,1,0},...}
-buffer_id       | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.639216,0.886275,0.631373},{0.219608,0.713726,0.937255}},{{0.505882,0.603922,0.137255},{0.286275,0.454902,0.803922}}},...}
-dependent_var   | {{1,0,0},{0,1,0},{1,0,0},{0,0,1},...}
-buffer_id       | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.635294,0.745098,0.486275},{0.721569,0.258824,0.541176}},{{0.0392157,0.941177,0.313726},{0.631373,0.266667,0.568627}}},...}
-dependent_var   | {{0,0,1},{0,0,1},{0,1,0},{1,0,0},...}
-buffer_id       | 2
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,2,2,3}            | {26,3}              |         0
+ {26,2,2,3}            | {26,3}              |         1
+(2 rows)
 </pre>
 Review the output summary table:
 <pre class="example">
@@ -417,8 +427,8 @@
 independent_varname | rgb
 dependent_vartype   | text
 class_values        | {bird,cat,dog}
-buffer_size         | 18
-normalizing_const   | 255.0
+buffer_size         | 26
+normalizing_const   | 255
 num_classes         | 3
 </pre>
 
@@ -434,32 +444,23 @@
       'species',                -- Dependent variable
       'rgb',                    -- Independent variable
       'image_data_packed',      -- From training preprocessor step
-      2                         -- Buffer size
+      NULL                      -- Buffer size
       );
 </pre>
 We can choose to use a new buffer size compared to the
 training_preprocessor_dl run. Other parameters such as num_classes and
 normalizing_const that were passed to training_preprocessor_dl are
 automatically inferred using the image_data_packed param that is passed.
-Here is a sample of the packed output table:
+Here is the packed output table of validation data for our simple example:
 <pre class="example">
-\\x on
-SELECT * FROM val_image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;
 </pre>
 <pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.270588,0.0666667,0.435294},{0.4,0.133333,0.207843}},{{0.588235,0.933333,0.556863},...}
-dependent_var   | {{1,0,0},{0,1,0}}
-buffer_id       | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.301961,0.337255,0.427451},{0.317647,0.909804,0.835294}},{{0.933333,0.247059,0.886275},...}
-dependent_var   | {{1,0,0},{1,0,0}}
-buffer_id       | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{{{0.556863,0.956863,0.117647},{0.764706,0.929412,0.160784}},{{0.0235294,0.886275,0.0196078},...}
-dependent_var   | {{1,0,0},{1,0,0}}
-buffer_id       | 2
-...
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,2,2,3}            | {26,3}              |         0
+ {26,2,2,3}            | {26,3}              |         1
+(2 rows)
 </pre>
 Review the output summary table:
 <pre class="example">
@@ -474,8 +475,8 @@
 independent_varname | rgb
 dependent_vartype   | text
 class_values        | {bird,cat,dog}
-buffer_size         | 2
-normalizing_const   | 255.0
+buffer_size         | 26
+normalizing_const   | 255
 num_classes         | 3
 </pre>
 
@@ -573,22 +574,14 @@
 </pre>
 Here is a sample of the packed output table:
 <pre class="example">
-\\x on
-SELECT * FROM image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
 </pre>
 <pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.203922,0.564706,0.905882,0.0470588,0.298039,0.00392157,0.635294,0.0431373,0.447059,0.552941,0.270588,0.0117647},...}
-dependent_var   | {{0,1,0},{1,0,0},{1,0,0},{1,0,0},{0,0,1},...}
-buffer_id       | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.25098,0.984314,0.239216,0.6,0.0509804,0.392157,0.568627,0.709804,0.0313726,0.439216,0.462745,0.419608},...}
-dependent_var   | {{0,0,1},{0,0,1},{0,1,0},{0,0,1},{1,0,0},...}
-buffer_id       | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.796079,0.537255,0.403922,0.0666667,0.235294,0.984314,0.596078,0.25098,0.141176,0.317647,0.658824,0.937255},...}
-dependent_var   | {{0,1,0},{0,1,0},{0,1,0},{0,0,1},{0,0,1},...}
-buffer_id       | 2
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,12}               | {26,3}              |         0
+ {26,12}               | {26,3}              |         1
+(2 rows)
 </pre>
 
 -#  Run the preprocessor for the validation dataset.
@@ -608,20 +601,14 @@
 </pre>
 Here is a sample of the packed output summary table:
 <pre class="example">
-\\x on
-SELECT * FROM val_image_data_packed_summary;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;
 </pre>
 <pre class="result">
--[ RECORD 1 ]-------+----------------------
-source_table        | image_data
-output_table        | val_image_data_packed
-dependent_varname   | species
-independent_varname | rgb
-dependent_vartype   | text
-class_values        | {bird,cat,dog}
-buffer_size         | 18
-normalizing_const   | 255.0
-num_classes         | 3
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,12}               | {26,3}              |         0
+ {26,12}               | {26,3}              |         1
+(2 rows)
 </pre>
 
 -# Generally the default buffer size will work well,
@@ -629,18 +616,24 @@
 <pre class="example">
 DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
 SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
-                                        'image_data_packed',  -- Output table
-                                        'species',            -- Dependent variable
-                                        'rgb',                -- Independent variable
+                                       'image_data_packed',  -- Output table
+                                       'species',            -- Dependent variable
+                                       'rgb',                -- Independent variable
                                         10,                   -- Buffer size
                                         255                   -- Normalizing constant
                                         );
-SELECT COUNT(*) FROM image_data_packed;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
 </pre>
 <pre class="result">
- count
-+-------
-     6
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {8,12}                | {8,3}               |         0
+ {9,12}                | {9,3}               |         1
+ {9,12}                | {9,3}               |         2
+ {9,12}                | {9,3}               |         3
+ {9,12}                | {9,3}               |         4
+ {8,12}                | {8,3}               |         5
+(6 rows)
 </pre>
 Review the output summary table:
 <pre class="example">
@@ -656,7 +649,7 @@
 dependent_vartype   | text
 class_values        | {bird,cat,dog}
 buffer_size         | 10
-normalizing_const   | 255.0
+normalizing_const   | 255
 num_classes         | 3
 </pre>
 
@@ -674,22 +667,14 @@
 </pre>
 Here is a sample of the packed output table with the padded 1-hot vector:
 <pre class="example">
-\\x on
-SELECT * FROM image_data_packed ORDER BY buffer_id;
+SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
 </pre>
 <pre class="result">
--[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.639216,0.517647,0.87451,0.0862745,0.784314,...},...}
-dependent_var   | {{0,0,1,0,0},{1,0,0,0,0},{1,0,0,0,0},{1,0,0,0,0},...}
-buffer_id       | 0
--[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.866667,0.0666667,0.803922,0.239216,0.741176,...},...}
-dependent_var   | {{0,0,1,0,0},{0,0,1,0,0},{0,1,0,0,0},{0,1,0,0,0},...}
-buffer_id       | 1
--[ RECORD 3 ]---+---------------------------------------------------------------------------------------------------------------------
-independent_var | {{0.184314,0.87451,0.227451,0.466667,0.203922,...},...}
-dependent_var   | {{1,0,0,0,0},{0,1,0,0,0},{1,0,0,0,0},{0,0,1,0,0},...}
-buffer_id       | 2
+ independent_var_shape | dependent_var_shape | buffer_id
+-----------------------+---------------------+-----------
+ {26,12}               | {26,5}              |         0
+ {26,12}               | {26,5}              |         1
+(2 rows)
 </pre>
 Review the output summary table:
 <pre class="example">
@@ -704,8 +689,8 @@
 independent_varname | rgb
 dependent_vartype   | text
 class_values        | {bird,cat,dog,NULL,NULL}
-buffer_size         | 18
-normalizing_const   | 255.0
+buffer_size         | 26
+normalizing_const   | 255
 num_classes         | 5
 </pre>
 
@@ -832,8 +817,9 @@
 DROP AGGREGATE IF EXISTS MADLIB_SCHEMA.agg_array_concat(anyarray);
 CREATE AGGREGATE MADLIB_SCHEMA.agg_array_concat(anyarray) (
    SFUNC = array_cat,
-   STYPE = anyarray,
-   PREFUNC = array_cat
+   PREFUNC = array_cat,
+   STYPE = anyarray
+
    );
 
 CREATE FUNCTION MADLIB_SCHEMA.convert_array_to_bytea(var REAL[])
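
Note: the documentation updated in this change says that packed arrays are stored as bytea and that independent_var_shape and dependent_var_shape record their shapes. A minimal sketch of how a client might turn one packed value back into a nested array follows. It assumes the bytes are 4-byte floats (matching the REAL[] argument of convert_array_to_bytea), in native byte order and row-major layout. Neither the byte order nor the layout is stated in this diff, and unpack_buffer is a hypothetical helper, not part of MADlib.

```python
import struct
from math import prod


def unpack_buffer(raw, shape, fmt="f"):
    """Rebuild a nested list from raw bytes and a shape such as (26, 2, 2, 3).

    Assumption (not confirmed by the diff): values are 4-byte floats in
    native byte order, stored in row-major order. All dimensions must be > 0.
    """
    count = prod(shape)
    itemsize = struct.calcsize("=" + fmt)
    if len(raw) != count * itemsize:
        raise ValueError("byte length does not match the given shape")
    flat = list(struct.unpack("=%d%s" % (count, fmt), raw))

    def build(values, dims):
        # Split the flat list along the leading dimension, then recurse.
        if len(dims) == 1:
            return values
        step = len(values) // dims[0]
        return [build(values[i * step:(i + 1) * step], dims[1:])
                for i in range(dims[0])]

    return build(flat, list(shape))


# Stand-in for one bytea value: 2 images of shape 2x2x3 (24 floats).
demo_bytes = struct.pack("=24f", *[i * 0.25 for i in range(24)])
demo = unpack_buffer(demo_bytes, (2, 2, 2, 3))
print(len(demo), len(demo[0]), len(demo[0][0]), len(demo[0][0][0]))
```

The length check is there because a shape that does not match the byte count is the most likely symptom of a wrong element type assumption.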
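
Note: the example results in this diff show 52 source rows packed into 2 buffers of 26, and, with a requested buffer size of 10, into 6 buffers of 8 or 9 rows each. The arithmetic can be sketched as below, assuming the number of buffers is the row count divided by the buffer size, rounded up, with rows then spread as evenly as possible. balanced_buffer_sizes is a hypothetical helper for illustration; it does not reproduce how MADlib assigns rows to segments or the order of sizes across buffer_id values.

```python
def balanced_buffer_sizes(num_rows, buffer_size):
    """Return per-buffer row counts for an even split (illustrative only)."""
    # Ceiling division without floating point.
    num_buffers = -(-num_rows // buffer_size)
    base, extra = divmod(num_rows, num_buffers)
    # 'extra' buffers take one additional row so the total is preserved.
    return [base + 1] * extra + [base] * (num_buffers - extra)


sizes = balanced_buffer_sizes(52, 10)
print(len(sizes), sorted(sizes), sum(sizes))
print(balanced_buffer_sizes(52, 26))
```

For 52 rows and a buffer size of 10 this gives six buffers, four of 9 rows and two of 8, which is the same set of sizes as the {8,12} and {9,12} shapes in the result table above.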