////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Importing Data Into Hive
^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop's import tool's main function is to upload your data into files
in HDFS. If you have a Hive metastore associated with your HDFS
cluster, Sqoop can also import the data into Hive by generating and
executing a +CREATE TABLE+ statement to define the data's layout in
Hive. Importing data into Hive is as simple as adding the
*+\--hive-import+* option to your Sqoop command line.
If the Hive table already exists, you can specify the
*+\--hive-overwrite+* option to indicate that the existing table in Hive must
be replaced. After your data is imported into HDFS or this step is
omitted, Sqoop will generate a Hive script containing a +CREATE TABLE+
operation defining your columns using Hive's types, and a +LOAD DATA INPATH+
statement to move the data files into Hive's warehouse directory.
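For example, a basic Hive import that replaces any existing Hive table of the
same name might look like the following; the connection parameters are
placeholders, as in the other examples in this guide:
----
$ sqoop import --connect $CONN --table $TABLENAME --username $USER --password $PASS --hive-import --hive-overwrite
----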
The script can be executed in two ways:
- By default the script will be executed by calling
the installed copy of hive on the machine where Sqoop is run. If you have
multiple Hive installations, or +hive+ is not in your +$PATH+, use the
*+\--hive-home+* option to identify the Hive installation directory.
Sqoop will use +$HIVE_HOME/bin/hive+ from here.
- If the user specifies the *+\--hs2-url+* parameter, then the script will
be sent to HiveServer2 through a JDBC connection. Note that the data itself
will not be transferred via the JDBC connection; it is written directly to HDFS,
just as in the case of the default Hive import. As HiveServer2 provides proper
authorization and auditing features, it is recommended to use this instead of
the default. Currently only Kerberos authentication and the text file format are
supported with this option.
NOTE: This function is incompatible with +\--as-avrodatafile+ and
+\--as-sequencefile+.
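For example, routing the generated Hive script through HiveServer2 might look
like the following sketch; the HiveServer2 URL and keytab path are only
illustrative placeholders:
----
$ sqoop import --connect $CONN --table $TABLENAME --username $USER --password $PASS --hive-import --hs2-url jdbc:hive2://hs2.example.com:10000/default --hs2-user $USER --hs2-keytab /path/to/user.keytab
----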
Even though Hive supports escaping characters, it does not
handle escaping of the new-line character. Also, it does not support
the notion of enclosing characters that may include field delimiters
in the enclosed string. It is therefore recommended that you choose
unambiguous field and record-terminating delimiters without the help
of escaping and enclosing characters when working with Hive; this is
due to limitations of Hive's input parsing abilities. If you do use
+\--escaped-by+, +\--enclosed-by+, or +\--optionally-enclosed-by+ when
importing data into Hive, Sqoop will print a warning message.
Hive will have problems using Sqoop-imported data if your database's
rows contain string fields that have Hive's default row delimiters
(+\n+ and +\r+ characters) or column delimiters (+\01+ characters)
present in them. You can use the +\--hive-drop-import-delims+ option
to drop those characters on import to give Hive-compatible text data.
Alternatively, you can use the +\--hive-delims-replacement+ option
to replace those characters with a user-defined string on import to give
Hive-compatible text data. These options should only be used if you use
Hive's default delimiters and should not be used if different delimiters
are specified.
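For example, either of the following drops or replaces Hive's default
delimiter characters on import; the single-space replacement string is just an
illustration:
----
$ sqoop import ... --hive-import --hive-drop-import-delims
$ sqoop import ... --hive-import --hive-delims-replacement ' '
----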
Sqoop will pass the field and record delimiters through to Hive. If you do
not set any delimiters and do use +\--hive-import+, the field delimiter will
be set to +^A+ and the record delimiter will be set to +\n+ to be consistent
with Hive's defaults.
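If you do set delimiters explicitly, they are simply passed through to the
generated Hive table definition; a sketch of an import with explicit
delimiters might look like this:
----
$ sqoop import ... --hive-import --fields-terminated-by ',' --lines-terminated-by '\n'
----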
Sqoop will by default import +NULL+ values as the string +null+. Hive, however,
uses the string +\N+ to denote +NULL+ values, and therefore predicates dealing
with +NULL+ (like +IS NULL+) will not work correctly. You should append the
parameters +\--null-string+ and +\--null-non-string+ in the case of an import job, or
+\--input-null-string+ and +\--input-null-non-string+ in the case of an export job, if
you wish to properly preserve +NULL+ values. Because Sqoop uses these
parameters in generated code, you need to properly escape the value +\N+ to +\\N+:
----
$ sqoop import ... --null-string '\\N' --null-non-string '\\N'
----
The table name used in Hive is, by default, the same as that of the
source table. You can control the output table name with the +\--hive-table+
option.
Hive can put data into partitions for more efficient query
performance. You can tell a Sqoop job to import data for Hive into a
particular partition by specifying the +\--hive-partition-key+ and
+\--hive-partition-value+ arguments. The partition value must be a
string. Please see the Hive documentation for more details on
partitioning.
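For example, importing into a specific partition of a named Hive table; the
table name, partition key, and partition value below are only illustrative:
----
$ sqoop import ... --hive-import --hive-table sales --hive-partition-key day --hive-partition-value '2018-01-01'
----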
You can import compressed tables into Hive using the +\--compress+ and
+\--compression-codec+ options. One downside to compressing tables imported
into Hive is that many codecs cannot be split for processing by parallel map
tasks. The lzop codec, however, does support splitting. When importing tables
with this codec, Sqoop will automatically index the files for splitting and
configure a new Hive table with the correct InputFormat. This feature
currently requires that all partitions of a table be compressed with the lzop
codec.
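A sketch of an lzop-compressed Hive import; it assumes the hadoop-lzo codec
class +com.hadoop.compression.lzo.LzopCodec+ is available on the cluster:
----
$ sqoop import ... --hive-import --compress --compression-codec com.hadoop.compression.lzo.LzopCodec
----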
External table import
+++++++++++++++++++++
You can specify the +\--external-table-dir+ option in the Sqoop command to
work with an external Hive table (instead of a managed table, i.e. the default
behavior). To import data into an external table, one has to specify the
+\--hive-import+ option in the command line arguments. Table creation is
also supported with the use of the +\--create-hive-table+ option.
Importing into an external Hive table:
----
$ sqoop import --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/external_table_example
----
Creating an external Hive table:
----
$ sqoop import --hive-import --create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar
----
Decimals in Hive import using parquet file
++++++++++++++++++++++++++++++++++++++++++
As mentioned above, a Hive import is a two-step process in Sqoop:
first, the data is imported onto HDFS, then an HQL statement is generated and
executed to create the Hive table.
During the first step, an Avro schema is generated from the SQL data types.
This schema is then used in a regular Parquet import. After the data has been
imported onto HDFS successfully, Sqoop takes the Avro schema, maps the Avro
types to Hive types, and generates the HQL statement to create the table.
Decimal SQL types are converted to Strings in a Parquet import by default,
so Decimal columns appear as String columns in Hive by default. You can change
this behavior by enabling logical types for Parquet, so that Decimals will be
properly mapped to the Hive Decimal type as well. This can be done with the
+sqoop.parquet.logical_types.decimal.enable+ property. As noted in the section
discussing 'Enabling Logical Types in Avro and Parquet import for numbers',
you should also specify the default precision and scale and enable padding.
A limitation of Hive is that the maximum precision and scale is 38. When
converting to the Hive Decimal type, precision and scale will be reduced
automatically if necessary to meet this limitation. The data itself, however,
only has to adhere to the limitations of the Avro schema, so values
with a precision and scale bigger than 38 are allowed and will be present in
storage, but they won't be readable by Hive (since Hive is a
schema-on-read tool).
Enabling padding and specifying a default precision and scale in a Hive Import:
----
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
-Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
--hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
----