SQOOP-3418: Document decimal support in Hive external import into parquet files

(Fero Szabo via Szabolcs Vasas)
diff --git a/src/docs/user/hive.txt b/src/docs/user/hive.txt
index 75a389b..979e7af 100644
--- a/src/docs/user/hive.txt
+++ b/src/docs/user/hive.txt
@@ -112,10 +112,14 @@
 currently requires that all partitions of a table be compressed with the lzop
 codec.
 
-The user can specify the +\--external-table-dir+ option in the sqoop command to
-work with an external Hive table (instead of a managed table, i.e. the default behavior).
-To import data into an external table, one has to specify +\--hive-import+ in the command
-line arguments. Table creation is also supported with the use of +\--create-hive-table+.
+External table import
++++++++++++++++++++++
+
+You can specify the +\--external-table-dir+ option in the sqoop command to
+work with an external Hive table (instead of a managed table, i.e. the default
+behavior). To import data into an external table, you also have to specify the
++\--hive-import+ option in the command line arguments. Table creation is
+supported as well, with the +\--create-hive-table+ option.
 
 Importing into an external Hive table:
 ----
@@ -126,3 +130,38 @@
 ----
 $ sqoop import --hive-import --create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar
 ----
+
+Decimals in Hive import using parquet files
++++++++++++++++++++++++++++++++++++++++++++
+
+As mentioned above, a Hive import is a two-step process in Sqoop:
+first, the data is imported onto HDFS, then an HQL statement is generated and
+executed to create the Hive table.
+
+During the first step, an Avro schema is generated from the SQL data types.
+This schema is then used in a regular Parquet import. After the data has been
+imported onto HDFS successfully, Sqoop takes the Avro schema, maps the Avro
+types to Hive types and generates the HQL statement to create the table.
+
+Decimal SQL types are converted to Strings in a parquet import by default,
+so Decimal columns appear as String columns in Hive. You can change this
+behavior by enabling logical types for parquet, so that Decimals are
+properly mapped to the Hive type Decimal as well. This can be done with the
++sqoop.parquet.logical_types.decimal.enable+ property. As noted in the section
+discussing 'Enabling Logical Types in Avro and Parquet import for numbers',
+you should also specify the default precision and scale and enable padding.
+
+A limitation of Hive is that the maximum supported precision and scale is 38.
+When converting to the Hive Decimal type, precision and scale are reduced
+automatically if necessary to meet this limitation. The data itself, however,
+only has to adhere to the limitations of the Avro schema, so values with a
+precision or scale greater than 38 are allowed and will be present in storage,
+but they won't be readable by Hive (since Hive is a schema-on-read tool).
+
+Enabling padding and specifying a default precision and scale in a Hive import:
+----
+$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
+            -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
+            --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
+----
diff --git a/src/docs/user/import.txt b/src/docs/user/import.txt
index 79f7101..ae7c7ed 100644
--- a/src/docs/user/import.txt
+++ b/src/docs/user/import.txt
@@ -472,36 +472,48 @@
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 To enable the use of logical types in Sqoop's avro schema generation,
-i.e. used during both avro and parquet imports, one has to use the
-sqoop.avro.logical_types.decimal.enable flag. This is necessary if one
+i.e. during both avro and parquet imports, one has to use the
++sqoop.avro.logical_types.decimal.enable+ property. This is necessary if one
 wants to store values as decimals in the avro file format.
 
-Padding number types in avro import
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+In the case of a parquet import, one has to use the
++sqoop.parquet.logical_types.decimal.enable+ property.
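+
+For example, a minimal parquet import with this property set could look like
+this (a sketch; connection parameters are placeholders):
+----
+$ sqoop import -Dsqoop.parquet.logical_types.decimal.enable=true
+    --connect $MYCONN --username $MYUSER --password $MYPASS --table $TABLENAME
+    --as-parquetfile
+----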
+
+Padding number types in avro and parquet import
++++++++++++++++++++++++++++++++++++++++++++++++
 
 Certain databases, such as Oracle and Postgres store number and decimal
 values without padding. For example 1.5 in a column declared
-as NUMBER (20,5) is stored as is in Oracle, while the equivalent
+as NUMBER (20, 5) is stored as is in Oracle, while the equivalent
 DECIMAL (20, 5) is stored as 1.50000 in an SQL server instance.
-This leads to a scale mismatch during avro import.
+This leads to a scale mismatch during the import.
 
-To avoid this error, one can use the sqoop.avro.decimal_padding.enable flag
-to turn on padding with 0s. This flag has to be used together with the
-sqoop.avro.logical_types.decimal.enable flag set to true.
+To avoid this error, one can use the +sqoop.avro.decimal_padding.enable+
+property to turn on padding with 0s during import. This property has to be
+used together with logical types enabled, in either an avro or a parquet
+import.
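+
+For example, an avro import with both padding and logical types enabled could
+look like this (a sketch; connection parameters are placeholders):
+----
+$ sqoop import -Dsqoop.avro.decimal_padding.enable=true
+    -Dsqoop.avro.logical_types.decimal.enable=true
+    --connect $MYCONN --username $MYUSER --password $MYPASS --table $TABLENAME
+    --as-avrodatafile
+----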
 
-Default precision and scale in avro import
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Default precision and scale in avro and parquet import
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 All of the databases allow users to specify numeric columns without
 a precision or scale. While MS SQL and MySQL translate these into
-a valid precision and scale values, Oracle and Postgres don't.
+a valid precision and scale, Oracle and Postgres don't.
 
-Therefore, when a table contains NUMBER in a table in Oracle or
-NUMERIC/DECIMAL in Postgres, one can specify a default precision and scale
-to be used in the avro schema by using the +sqoop.avro.logical_types.decimal.default.precision+
-and +sqoop.avro.logical_types.decimal.default.scale+ flags.
+When a table contains a NUMBER column in Oracle or NUMERIC/DECIMAL in
+Postgres, one can specify a default precision and scale to be used in the
+avro schema by using the +sqoop.avro.logical_types.decimal.default.precision+
+and +sqoop.avro.logical_types.decimal.default.scale+ properties.
-Avro padding also has to be enabled, if the values are shorter than
-the specified default scale.
+Avro padding (together with logical types) also has to be enabled if the
+values are shorter than the specified default scale.
+
+Even though the names of these properties contain 'avro', the very same
+properties (+sqoop.avro.logical_types.decimal.default.precision+ and
++sqoop.avro.logical_types.decimal.default.scale+)
+can be used to specify defaults during a parquet import as well.
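+
+For example, a parquet import specifying these defaults could look like this
+(a sketch; connection parameters are placeholders):
+----
+$ sqoop import -Dsqoop.parquet.logical_types.decimal.enable=true
+    -Dsqoop.avro.decimal_padding.enable=true
+    -Dsqoop.avro.logical_types.decimal.default.precision=38
+    -Dsqoop.avro.logical_types.decimal.default.scale=10
+    --connect $MYCONN --username $MYUSER --password $MYPASS --table $TABLENAME
+    --as-parquetfile
+----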
+
+The implementation of this logic and of the padding is database independent.
+However, our tests cover only the Oracle, Postgres, MS SQL Server and MySQL
+databases, therefore these are the supported ones.
 
 Large Objects
 ^^^^^^^^^^^^^
@@ -838,20 +850,27 @@
 ----
 
 Enabling logical types in avro import and also turning on padding with 0s:
-
 ----
 $ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true
-    --connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS"
+    --connect $MYCONN --username $MYUSER --password $MYPASS --query "select * from table_name where \$CONDITIONS"
     --target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
 
 ----
 
 Enabling logical types in avro import and also turning on padding with 0s, while specifying default precision and scale as well:
-
 ----
 $ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true
     -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
-    --connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS"
+    --connect $MYCONN --username $MYUSER --password $MYPASS --query "select * from table_name where \$CONDITIONS"
     --target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
 
 ----
+
+Enabling logical types in parquet import and also turning on padding with 0s, while specifying default precision and scale as well:
+----
+$ sqoop import -Dsqoop.parquet.logical_types.decimal.enable=true -Dsqoop.avro.decimal_padding.enable=true
+    -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
+    --connect $MYCONN --username $MYUSER --password $MYPASS --query "select * from table_name where \$CONDITIONS"
+    --target-dir hdfs://nameservice1//etl/target_path --as-parquetfile --verbose -m 1
+
+----