
////
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
////


Importing Data Into Hive
^^^^^^^^^^^^^^^^^^^^^^^^

The main function of Sqoop's import tool is to upload your data into files
in HDFS. If you have a Hive metastore associated with your HDFS
cluster, Sqoop can also import the data into Hive by generating and
executing a +CREATE TABLE+ statement to define the data's layout in
Hive. Importing data into Hive is as simple as adding the
*+\--hive-import+* option to your Sqoop command line.

If the Hive table already exists, you can specify the
*+\--hive-overwrite+* option to indicate that the existing table in Hive must
be replaced. After your data is imported into HDFS (or this step is
omitted), Sqoop will generate a Hive script containing a +CREATE TABLE+
operation defining your columns using Hive's types, and a +LOAD DATA INPATH+
statement to move the data files into Hive's warehouse directory.
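
A minimal sketch of such an import is shown below, using the same +$CONN+,
+$TABLENAME+, +$USER+ and +$PASS+ placeholders as the other examples in this
section; +\--hive-overwrite+ is only needed when an existing Hive table
should be replaced:

----
$ sqoop import --hive-import --hive-overwrite --connect $CONN --table $TABLENAME --username $USER --password $PASS
----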

The script can be executed in two ways:

- By default, the script will be executed by calling
the installed copy of Hive on the machine where Sqoop is run. If you have
multiple Hive installations, or +hive+ is not in your +$PATH+, use the
*+\--hive-home+* option to identify the Hive installation directory.
Sqoop will use +$HIVE_HOME/bin/hive+ from here.

- If the user specifies the *+\--hs2-url+* parameter, then the script will
be sent to HiveServer2 through a JDBC connection. Note that the data itself
is not transferred via the JDBC connection; it is written directly to HDFS,
just as in the case of the default Hive import. Because HiveServer2 provides
proper authorization and auditing features, it is recommended to use this
option instead of the default. Currently only Kerberos authentication and
the text file format are supported with this option. An example is shown
below.
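
A sketch of a HiveServer2-based import; the HiveServer2 host name and the
Kerberos principal in the JDBC URL are placeholders that must be adjusted to
your environment:

----
$ sqoop import --hive-import --hs2-url "jdbc:hive2://hs2.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM" --connect $CONN --table $TABLENAME --username $USER --password $PASS
----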

NOTE: This function is incompatible with +\--as-avrodatafile+ and
+\--as-sequencefile+.

Even though Hive supports escaping characters, it does not
handle escaping of the new-line character. Also, it does not support
the notion of enclosing characters that may include field delimiters
in the enclosed string. It is therefore recommended that you choose
unambiguous field and record-terminating delimiters without the help
of escaping and enclosing characters when working with Hive; this is
due to limitations of Hive's input parsing abilities. If you do use
+\--escaped-by+, +\--enclosed-by+, or +\--optionally-enclosed-by+ when
importing data into Hive, Sqoop will print a warning message.

Hive will have problems using Sqoop-imported data if your database's
rows contain string fields that have Hive's default row delimiters
(+\n+ and +\r+ characters) or column delimiters (+\01+ characters)
present in them. You can use the +\--hive-drop-import-delims+ option
to drop those characters on import to give Hive-compatible text data.
Alternatively, you can use the +\--hive-delims-replacement+ option
to replace those characters with a user-defined string on import to give
Hive-compatible text data. These options should only be used if you use
Hive's default delimiters and should not be used if different delimiters
are specified.
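
For example, a sketch of an import that drops Hive's default delimiter
characters from string fields (use +\--hive-delims-replacement+ instead if
you prefer to substitute a replacement string rather than drop the
characters):

----
$ sqoop import --hive-import --hive-drop-import-delims --connect $CONN --table $TABLENAME --username $USER --password $PASS
----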

Sqoop will pass the field and record delimiters through to Hive. If you do
not set any delimiters and do use +\--hive-import+, the field delimiter will
be set to +^A+ and the record delimiter will be set to +\n+ to be consistent
with Hive's defaults.
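
As an illustrative sketch, the delimiters below are set explicitly to the
same values Sqoop would choose by default for a Hive import (+\001+ is the
octal representation of the +^A+ character):

----
$ sqoop import --hive-import --fields-terminated-by '\001' --lines-terminated-by '\n' --connect $CONN --table $TABLENAME --username $USER --password $PASS
----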

Sqoop will by default import NULL values as the string +null+. Hive, however,
uses the string +\N+ to denote +NULL+ values and therefore predicates dealing
with +NULL+ (like +IS NULL+) will not work correctly. You should append the
parameters +\--null-string+ and +\--null-non-string+ in the case of an import
job, or +\--input-null-string+ and +\--input-null-non-string+ in the case of an
export job, if you wish to properly preserve +NULL+ values. Because Sqoop uses
these parameters in generated code, you need to properly escape the value +\N+
to +\\N+:

----
$ sqoop import ... --null-string '\\N' --null-non-string '\\N'
----

The table name used in Hive is, by default, the same as that of the
source table. You can control the output table name with the +\--hive-table+
option.
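
For instance, a sketch that imports the source table into a Hive table named
+foobar_hive+ (the Hive table name here is an arbitrary example):

----
$ sqoop import --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --hive-table foobar_hive
----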

Hive can put data into partitions for more efficient query
performance. You can tell a Sqoop job to import data for Hive into a
particular partition by specifying the +\--hive-partition-key+ and
+\--hive-partition-value+ arguments. The partition value must be a
string. Please see the Hive documentation for more details on
partitioning.
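
A sketch of a partitioned Hive import; the partition key +day+ and its value
are illustrative placeholders:

----
$ sqoop import --hive-import --hive-partition-key day --hive-partition-value "2018-01-01" --connect $CONN --table $TABLENAME --username $USER --password $PASS
----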

You can import compressed tables into Hive using the +\--compress+ and
+\--compression-codec+ options. One downside to compressing tables imported
into Hive is that many codecs cannot be split for processing by parallel map
tasks. The lzop codec, however, does support splitting. When importing tables
with this codec, Sqoop will automatically index the files for splitting and
configure a new Hive table with the correct InputFormat. This feature
currently requires that all partitions of a table be compressed with the lzop
codec.
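
A sketch of an lzop-compressed Hive import, assuming the hadoop-lzo codec
class +com.hadoop.compression.lzo.LzopCodec+ is installed on your cluster:

----
$ sqoop import --hive-import --compress --compression-codec com.hadoop.compression.lzo.LzopCodec --connect $CONN --table $TABLENAME --username $USER --password $PASS
----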

External table import
+++++++++++++++++++++

You can specify the +\--external-table-dir+ option in the Sqoop command to
work with an external Hive table (instead of a managed table, i.e. the default
behavior). To import data into an external table, you also have to specify the
+\--hive-import+ option in the command line arguments. Table creation is
also supported with the use of the +\--create-hive-table+ option.

Importing into an external Hive table:
----
$ sqoop import --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/external_table_example
----

Create an external Hive table:
----
$ sqoop import --hive-import --create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar
----

Decimals in Hive import using parquet file
++++++++++++++++++++++++++++++++++++++++++

As mentioned above, a Hive import is a two-step process in Sqoop:
first, the data is imported onto HDFS, then an HQL statement is generated and
executed to create the Hive table.

During the first step, an Avro schema is generated from the SQL data types.
This schema is then used in a regular Parquet import. After the data has been
imported onto HDFS successfully, Sqoop takes the Avro schema, maps the Avro
types to Hive types and generates the HQL statement to create the table.

Decimal SQL types are converted to Strings in a Parquet import by default,
so Decimal columns appear as String columns in Hive by default. You can change
this behavior by enabling logical types for Parquet, so that Decimals will be
properly mapped to the Hive Decimal type as well. This can be done with the
+sqoop.parquet.logical_types.decimal.enable+ property. As noted in the section
discussing 'Enabling Logical Types in Avro and Parquet import for numbers',
you should also specify the default precision and scale and enable padding.

A limitation of Hive is that the maximum precision and scale are both 38. When
converting to the Hive Decimal type, precision and scale will be reduced
automatically if necessary to meet this limitation. The data itself, however,
only has to adhere to the limitations of the Avro schema, so values with a
precision and scale bigger than 38 are allowed and will be present on
storage, but they won't be readable by Hive (since Hive is a
schema-on-read tool).

Enabling padding and specifying a default precision and scale in a Hive import:
----
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true \
    -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10 \
    --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
----