////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Sqoop-HCatalog Integration
--------------------------
HCatalog Background
~~~~~~~~~~~~~~~~~~~
HCatalog is a table and storage management service for Hadoop that enables
users with different data processing tools (Pig, MapReduce, and Hive)
to more easily read and write data on the grid. HCatalog's table abstraction
presents users with a relational view of data in the Hadoop distributed
file system (HDFS) and ensures that users need not worry about where or
in what format their data is stored: RCFile format, text files, or
SequenceFiles.
HCatalog supports reading and writing files in any format for which a Hive
SerDe (serializer-deserializer) has been written. By default, HCatalog
supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom
format, you must provide the InputFormat and OutputFormat as well as the SerDe.
The ability of HCatalog to abstract various storage formats is used in
providing the RCFile (and future file types) support to Sqoop.
Exposing HCatalog Tables to Sqoop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
HCatalog integration with Sqoop is patterned on an existing feature set that
supports Avro and Hive tables. Nine new command line options are introduced,
and some command line options defined for Hive have been reused.
New Command Line Options
^^^^^^^^^^^^^^^^^^^^^^^^
+--hcatalog-database+::
Specifies the database name for the HCatalog table. If not specified,
the default database name +default+ is used. Providing the
+--hcatalog-database+ option without +--hcatalog-table+ is an error.
This is not a required option.
+--hcatalog-table+::
The argument value for this option is the HCatalog tablename.
The presence of the +--hcatalog-table+ option signifies that the import
or export job is done using HCatalog tables, and it is a required option for
HCatalog jobs.
+--hcatalog-external-table+::
Use this flag to create an external Hive table, for example to store
data in non-transactional tables, e.g. with:
--hcatalog-storage-stanza "stored as orc tblproperties (\"transactional\"=\"false\")"
This flag can only be used when +--create-hcatalog-table+ or
+--drop-and-create-hcatalog-table+ is used (see the sample command after
this option list).
+--hcatalog-home+::
The home directory for the HCatalog installation. The directory is
expected to have a +lib+ subdirectory and a +share/hcatalog+ subdirectory
with necessary HCatalog libraries. If not specified, the system property
+hcatalog.home+ will be checked and failing that, a system environment
variable +HCAT_HOME+ will be checked. If none of these are set, the
default value will be used and currently the default is set to
+/usr/lib/hcatalog+.
This is not a required option.
+--create-hcatalog-table+::
This option specifies whether an HCatalog table should be created
automatically when importing data. By default, HCatalog tables are assumed
to exist. The table name will be the same as the database table name
translated to lower case. Further described in +Automatic Table Creation+
below.
+--drop-and-create-hcatalog-table+::
Same as +--create-hcatalog-table+, but does a +drop if exists+ before creating
the table.
+--hcatalog-storage-stanza+::
This option specifies the storage stanza to be appended to the table.
Further described in +Automatic Table Creation+ below.
+--hcatalog-partition-keys+ and +--hcatalog-partition-values+::
These two options are used to specify multiple static partition key/value
pairs. In the prior releases, +--hive-partition-key+ and
+--hive-partition-value+ options were used to specify the static partition
key/value pair, but only one level of static partition keys could be provided.
The options +--hcatalog-partition-keys+ and +--hcatalog-partition-values+
allow multiple keys and values to be provided as static partitioning keys.
Multiple option values are to be separated by ',' (comma).
For example, if the Hive partition keys for the table to export/import from are
defined with partition key names year, month and day and a specific partition
with year=1999, month=12, day=31 is the desired partition, then the values
for the two options will be as follows:
* +--hcatalog-partition-keys+ year,month,day
* +--hcatalog-partition-values+ 1999,12,31
To provide backward compatibility, if the +--hcatalog-partition-keys+ or
+--hcatalog-partition-values+ options are not provided, then
+--hive-partition-key+ and +--hive-partition-value+ will be used if provided.
It is an error to specify only one of +--hcatalog-partition-keys+ or
+--hcatalog-partition-values+ options. Either both of the options should be
provided or neither of the options should be provided.
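To tie these options together, here is a hedged sample import that creates a
non-transactional external ORC table with multiple static partition keys. The
connection string, database table, and partition layout below are hypothetical
and only illustrate how the options combine:

----
$SQOOP_HOME/bin/sqoop import \
  --connect jdbc:postgresql://db.example.com/sales \
  --table TXN \
  --hcatalog-database default \
  --hcatalog-table txn \
  --create-hcatalog-table \
  --hcatalog-external-table \
  --hcatalog-storage-stanza "stored as orc tblproperties (\"transactional\"=\"false\")" \
  --hcatalog-partition-keys year,month,day \
  --hcatalog-partition-values 1999,12,31
----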
Supported Sqoop Hive Options
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following Sqoop options are also used along with the +--hcatalog-table+
option to provide additional input to the HCatalog jobs. Some of the existing
Hive import job options are reused with HCatalog jobs instead of creating
HCatalog-specific options for the same purpose.
+--map-column-hive+::
This option maps a database column to HCatalog with a specific HCatalog
type.
+--hive-home+::
The Hive home location.
+--hive-partition-key+::
Used for static partitioning filter. The partitioning key should be of
type STRING. There can be only one static partitioning key.
Please see the discussion about +--hcatalog-partition-keys+ and
+--hcatalog-partition-values+ options.
+--hive-partition-value+::
The value associated with the partition.
Please see the discussion about +--hcatalog-partition-keys+ and
+--hcatalog-partition-values+ options.
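As a minimal sketch (with a hypothetical connection string, table, and column
names), these reused Hive options can be combined with +--hcatalog-table+ to
load a single static partition and override one column type:

----
$SQOOP_HOME/bin/sqoop import \
  --connect jdbc:mysql://db.example.com/retail \
  --table CUSTOMERS \
  --hcatalog-table customers \
  --hive-partition-key country \
  --hive-partition-value US \
  --map-column-hive signup_date=timestamp
----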
Direct Mode support
^^^^^^^^^^^^^^^^^^^
HCatalog integration in Sqoop has been enhanced to support direct mode
connectors (which are high-performance connectors specific to a database).
The Netezza direct mode connector has been enhanced to take advantage of this
feature.
IMPORTANT: Only the Netezza direct mode connector is currently enabled to work
with HCatalog.
Unsupported Sqoop Options
^^^^^^^^^^^^^^^^^^^^^^^^^
Unsupported Sqoop Hive Import Options
+++++++++++++++++++++++++++++++++++++
The following Sqoop Hive import options are not supported with HCatalog jobs.
* +--hive-import+
* +--hive-overwrite+
Unsupported Sqoop Export and Import Options
+++++++++++++++++++++++++++++++++++++++++++
The following Sqoop export and import options are not supported with HCatalog jobs.
* +--export-dir+
* +--target-dir+
* +--warehouse-dir+
* +--append+
* +--as-sequencefile+
* +--as-avrodatafile+
* +--as-parquetfile+
Ignored Sqoop Options
^^^^^^^^^^^^^^^^^^^^^
The following options are ignored with HCatalog jobs.
* All input delimiter options are ignored.
* Output delimiters are generally ignored unless either
+--hive-drop-import-delims+ or +--hive-delims-replacement+ is used. When the
+--hive-drop-import-delims+ or +--hive-delims-replacement+ option is
specified, all +CHAR+ type database table columns will be post-processed
to either remove or replace the delimiters, respectively. See +Delimited Text
Formats and Field and Line Delimiter Characters+ below. This is only needed
if the HCatalog table uses text formats.
Automatic Table Creation
~~~~~~~~~~~~~~~~~~~~~~~~
One of the key features of Sqoop is to manage and create the table metadata
when importing into Hadoop. HCatalog import jobs also provide for this
feature with the option +--create-hcatalog-table+. Furthermore, one of the
important benefits of the HCatalog integration is to provide storage
agnosticism to Sqoop data movement jobs. To provide for that feature,
HCatalog import jobs provide an option that lets a user specify the
storage format for the created table.
The option +--create-hcatalog-table+ is used as an indicator that a table
has to be created as part of the HCatalog import job. If the option
+--create-hcatalog-table+ is specified and the table exists, then the
table creation will fail and the job will be aborted.
The option +--hcatalog-storage-stanza+ can be used to specify the storage
format of the newly created table. The default value for this option is
+stored as rcfile+. The value specified for this option is assumed to be a
valid Hive storage format expression. It will be appended to the +create table+
command generated by the HCatalog import job as part of automatic table
creation. Any error in the storage stanza will cause the table creation to
fail and the import job will be aborted.
Any additional resources needed to support the storage format referenced in
the option +--hcatalog-storage-stanza+ should be provided to the job either
by placing them in +$HIVE_HOME/lib+ or by providing them in +HADOOP_CLASSPATH+
and +LIBJAR+ files.
If the option +--hive-partition-key+ is specified, then the value of this
option is used as the partitioning key for the newly created table. Only
one partitioning key can be specified with this option.
Object names are mapped to their lowercase equivalents when mapped to an
HCatalog table. This includes the table name (which is the same as the
external store table name converted to lower case) and field names.
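For illustration only, suppose a database table TXN with columns TXN_DATE,
AMOUNT, and STORE_ID is imported with +--create-hcatalog-table+,
+--hive-partition-key cust_id+, and the default storage stanza. The +create
table+ statement generated by the import job would then have roughly this
shape (the exact DDL, column types, and quoting are produced by Sqoop and may
differ):

----
create table txn (
  txn_date string,
  amount float,
  store_id int)
partitioned by (cust_id string)
stored as rcfile
----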
Delimited Text Formats and Field and Line Delimiter Characters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
HCatalog supports delimited text format as one of the table storage formats.
However, when delimited text is used and the imported data has fields that contain
those delimiters, the data may be parsed into a different number of
fields and records by Hive, thereby losing data fidelity.
For this case, one of these existing Sqoop import options can be used:
* +--hive-delims-replacement+
* +--hive-drop-import-delims+
If either of these options is provided for import, then any column of type
STRING will be formatted with the Hive delimiter processing and then written
to the HCatalog table.
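As an illustrative sketch (the connection string and table names are
hypothetical), an import into a text-format HCatalog table that replaces any
embedded Hive delimiters with a space could be invoked as:

----
$SQOOP_HOME/bin/sqoop import \
  --connect jdbc:oracle:thin:@db.example.com:1521/ORCL \
  --table COMMENTS \
  --hcatalog-table comments \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as textfile" \
  --hive-delims-replacement " "
----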
HCatalog Table Requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The HCatalog table should be created before using it as part of a Sqoop job
if the default table creation options (with optional storage stanza) are not
sufficient. All storage formats supported by HCatalog can be used with the
creation of the HCatalog tables. This allows the feature to readily adopt new
storage formats that come into the Hive project, such as ORC files.
Support for Partitioning
~~~~~~~~~~~~~~~~~~~~~~~~
The Sqoop HCatalog feature supports the following table types:
* Unpartitioned tables
* Partitioned tables with a static partitioning key specified
* Partitioned tables with dynamic partition keys from the database
result set
* Partitioned tables with a combination of a static key and additional
dynamic partitioning keys
Schema Mapping
~~~~~~~~~~~~~~
Sqoop currently does not support column name mapping. However, the user
is allowed to override the type mapping. Type mapping loosely follows
the Hive type mapping already present in Sqoop except that SQL types
FLOAT and REAL are mapped to HCatalog type float. In the Sqoop type
mapping for Hive, these two are mapped to double. Type mapping is primarily
used for checking the column definition correctness only and can be overridden
with the +--map-column-hive+ option.
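For example (hypothetical connection string, table, and column names), a
database REAL column that would otherwise map to the HCatalog +float+ type can
be overridden to +double+ as follows:

----
$SQOOP_HOME/bin/sqoop import \
  --connect jdbc:mysql://db.example.com/retail \
  --table ORDERS \
  --hcatalog-table orders \
  --map-column-hive unit_price=double
----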
All types except binary are assignable to a String type.
Any field of number type (int, smallint, tinyint, bigint and bigdecimal,
float and double) is assignable to another field of any number type during
exports and imports. Depending on the precision and scale of the target type
of assignment, truncations can occur.
Furthermore, date/time/timestamp columns are mapped to the Hive date/timestamp
types (the full date/time/timestamp representation). Date/time/timestamp columns
can also be mapped to the Hive bigint type, in which case the value will be
the number of milliseconds since epoch.
BLOBs and CLOBs are only supported for imports. The BLOB/CLOB objects when
imported are stored in a Sqoop-specific format and knowledge of this format
is needed for processing these objects in a Pig/Hive job or another MapReduce
job.
Database column names are mapped to their lowercase equivalents when mapped
to the HCatalog fields. Currently, case-sensitive database object names are
not supported.
Projection of a set of columns from a table to an HCatalog table or loading
to a column projection is allowed, subject to table constraints. The dynamic
partitioning columns, if any, must be part of the projection when importing
data into HCatalog tables.
Dynamic partitioning fields should be mapped to database columns that are
defined with the NOT NULL attribute (although this is not enforced during
schema mapping). A null value during import for a dynamic partitioning
column will abort the Sqoop job.
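For example (hypothetical names), assuming the HCatalog table +stores+ is
dynamically partitioned on a +region+ field, the column projection must
include that field:

----
$SQOOP_HOME/bin/sqoop import \
  --connect jdbc:mysql://db.example.com/retail \
  --table STORES \
  --hcatalog-table stores \
  --columns "id,name,region"
----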
Support for HCatalog Data Types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All the primitive Hive types that are part of Hive 0.13 are supported.
Complex HCatalog types are currently not supported.
BLOB/CLOB database types are only supported for imports.
Providing Hive and HCatalog Libraries for the Sqoop Job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With the support for HCatalog added to Sqoop, any HCatalog job depends on a
set of jar files being available both on the Sqoop client host and where the
Map/Reduce tasks run. To run HCatalog jobs, the environment variable
+HADOOP_CLASSPATH+ must be set up as shown below before launching the Sqoop
HCatalog jobs.
+HADOOP_CLASSPATH=$(hcat -classpath)+
+export HADOOP_CLASSPATH+
The necessary HCatalog dependencies will be copied to the distributed cache
automatically by the Sqoop job.
Examples
~~~~~~~~
Create an HCatalog table, such as:
+hcat -e "create table txn(txn_date string, amount float,
store_id int) partitioned by (cust_id string) stored as rcfile;"+
Then Sqoop import and export of the "txn" HCatalog table can be invoked as
follows:
Import
^^^^^^
+$SQOOP_HOME/bin/sqoop import --connect <jdbc-url> --table <table-name> --hcatalog-table txn <other sqoop options>+
Export
^^^^^^
+$SQOOP_HOME/bin/sqoop export --connect <jdbc-url> --table <table-name> --hcatalog-table txn <other sqoop options>+
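Purely as an illustration, a more concrete import of the +txn+ table into the
static partition +cust_id=C001+, followed by an export of the same HCatalog
table, might look like the following (the JDBC URL, credentials, database
table names, and partition value are placeholders):

----
$SQOOP_HOME/bin/sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table TXN \
  --hcatalog-table txn \
  --hive-partition-key cust_id \
  --hive-partition-value C001

# Export the same HCatalog table back to a database table
$SQOOP_HOME/bin/sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table TXN_BACKUP \
  --hcatalog-table txn
----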