| |
| //// |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| |
| |
| Importing Data Into Hive |
| ^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
The main function of Sqoop's import tool is to upload your data into files
in HDFS. If you have a Hive metastore associated with your HDFS
cluster, Sqoop can also import the data into Hive by generating and
executing a +CREATE TABLE+ statement to define the data's layout in
Hive. Importing data into Hive is as simple as adding the
*+\--hive-import+* option to your Sqoop command line.
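
For example, a minimal Hive import might look like the following (the
JDBC connection string, username, and table name are placeholder values):

----
$ sqoop import \
    --connect jdbc:mysql://db.example.com/corp \
    --username someuser -P \
    --table employees \
    --hive-import
----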
| |
If the Hive table already exists, you can specify the
*+\--hive-overwrite+* option to indicate that the existing table in Hive must
be replaced. After your data is imported into HDFS (or this step is
omitted), Sqoop will generate a Hive script containing a +CREATE TABLE+
operation defining your columns using Hive's types, and a +LOAD DATA INPATH+
statement to move the data files into Hive's warehouse directory.
| |
| The script can be executed in two ways: |
| |
- By default, the script will be executed by calling
the installed copy of +hive+ on the machine where Sqoop is run. If you have
multiple Hive installations, or +hive+ is not in your +$PATH+, use the
*+\--hive-home+* option to identify the Hive installation directory.
Sqoop will use +$HIVE_HOME/bin/hive+ from here.
| |
- If the user specifies the *+\--hs2-url+* parameter, the script will
be sent to HiveServer2 through a JDBC connection. Note that the data itself
is not transferred via the JDBC connection; it is written directly to HDFS,
just as in the default Hive import. Because HiveServer2 provides proper
authorization and auditing features, using it is recommended over the
default. Currently only Kerberos authentication and the text file format
are supported with this option.
| |
| NOTE: This function is incompatible with +\--as-avrodatafile+ and |
| +\--as-sequencefile+. |
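
As a sketch, a HiveServer2-based import might be invoked like this (the
HiveServer2 URL, Kerberos principal, and keytab path are placeholder
values):

----
$ sqoop import \
    --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --hive-import \
    --hs2-url jdbc:hive2://hs2.example.com:10000/default \
    --hs2-user etluser \
    --hs2-keytab /path/to/etluser.keytab
----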
| |
| Even though Hive supports escaping characters, it does not |
| handle escaping of new-line character. Also, it does not support |
| the notion of enclosing characters that may include field delimiters |
| in the enclosed string. It is therefore recommended that you choose |
| unambiguous field and record-terminating delimiters without the help |
| of escaping and enclosing characters when working with Hive; this is |
| due to limitations of Hive's input parsing abilities. If you do use |
| +\--escaped-by+, +\--enclosed-by+, or +\--optionally-enclosed-by+ when |
| importing data into Hive, Sqoop will print a warning message. |
| |
| Hive will have problems using Sqoop-imported data if your database's |
| rows contain string fields that have Hive's default row delimiters |
| (+\n+ and +\r+ characters) or column delimiters (+\01+ characters) |
| present in them. You can use the +\--hive-drop-import-delims+ option |
| to drop those characters on import to give Hive-compatible text data. |
| Alternatively, you can use the +\--hive-delims-replacement+ option |
| to replace those characters with a user-defined string on import to give |
| Hive-compatible text data. These options should only be used if you use |
| Hive's default delimiters and should not be used if different delimiters |
| are specified. |
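
For instance, to replace any of Hive's default delimiter characters found
in string fields with a single space during import, one might write
(connection details are placeholder values):

----
$ sqoop import \
    --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --hive-import \
    --hive-delims-replacement ' '
----

Using +\--hive-drop-import-delims+ instead would remove the offending
characters entirely rather than replacing them.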
| |
| Sqoop will pass the field and record delimiters through to Hive. If you do |
| not set any delimiters and do use +\--hive-import+, the field delimiter will |
| be set to +^A+ and the record delimiter will be set to +\n+ to be consistent |
| with Hive's defaults. |
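
As an illustration, the following invocation sets these delimiters
explicitly to the same values Sqoop would choose by default for a Hive
import (connection details are placeholder values):

----
$ sqoop import \
    --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --hive-import \
    --fields-terminated-by '\001' \
    --lines-terminated-by '\n'
----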
| |
Sqoop will by default import +NULL+ values as the string +null+. Hive,
however, uses the string +\N+ to denote +NULL+ values, and therefore
predicates dealing with +NULL+ (such as +IS NULL+) will not work correctly.
You should append the parameters +\--null-string+ and +\--null-non-string+
in case of an import job, or +\--input-null-string+ and
+\--input-null-non-string+ in case of an export job, if you wish to properly
preserve +NULL+ values. Because Sqoop uses these parameters in generated
code, you need to properly escape the value +\N+ to +\\N+:
| |
| ---- |
| $ sqoop import ... --null-string '\\N' --null-non-string '\\N' |
| ---- |
| |
| The table name used in Hive is, by default, the same as that of the |
| source table. You can control the output table name with the +\--hive-table+ |
| option. |
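
For example, to load a source table named +employees+ into a Hive table
named +emps+ (both names here are placeholder values):

----
$ sqoop import \
    --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --hive-import \
    --hive-table emps
----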
| |
| Hive can put data into partitions for more efficient query |
| performance. You can tell a Sqoop job to import data for Hive into a |
| particular partition by specifying the +\--hive-partition-key+ and |
| +\--hive-partition-value+ arguments. The partition value must be a |
| string. Please see the Hive documentation for more details on |
| partitioning. |
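
As a sketch, the following imports all rows into a single Hive partition;
the partition key and value are placeholder values:

----
$ sqoop import \
    --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --hive-import \
    --hive-partition-key dept \
    --hive-partition-value 'engineering'
----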
| |
You can import compressed tables into Hive using the +\--compress+ and
+\--compression-codec+ options. One downside to compressing tables imported
into Hive is that many codecs cannot be split for processing by parallel map
tasks. The lzop codec, however, does support splitting. When importing tables
with this codec, Sqoop will automatically index the files for splitting and
configure a new Hive table with the correct InputFormat. This feature
currently requires that all partitions of a table be compressed with the lzop
codec.
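
As a sketch, an lzop-compressed Hive import might look like the following
(the codec class name assumes the hadoop-lzo library is installed on the
cluster; connection details are placeholder values):

----
$ sqoop import \
    --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --hive-import \
    --compress \
    --compression-codec com.hadoop.compression.lzo.LzopCodec
----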