////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[connectors]]
Notes for specific connectors
-----------------------------
MySQL JDBC Connector
~~~~~~~~~~~~~~~~~~~~
This section contains information specific to the MySQL JDBC Connector.
Upsert functionality
^^^^^^^^^^^^^^^^^^^^
The MySQL JDBC Connector supports upsert functionality using the argument
+\--update-mode allowinsert+. To achieve this, Sqoop uses the MySQL clause INSERT INTO
... ON DUPLICATE KEY UPDATE. This clause does not allow the user to specify which
columns should be used to decide whether to update an existing row or add a new row.
Instead, the clause relies on the table's unique keys (the primary key belongs to this
set). MySQL will try to insert the new row and, if the insertion fails with a duplicate
unique key error, it will update the appropriate row instead. As a result, Sqoop
ignores the values specified in the +\--update-key+ parameter; however, the user needs
to specify at least one valid column to turn on update mode itself.
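For illustration, a minimal sketch of an upsert export follows; the connection string,
export directory and the +id+ column are placeholders:
----
$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --export-dir /user/someuser/employees \
    --update-key id --update-mode allowinsert
----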
MySQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~
The MySQL Direct Connector allows faster import and export to/from MySQL by using the
+mysqldump+ and +mysqlimport+ tools instead of SQL SELECT and INSERT statements.
To use the MySQL Direct Connector, specify the +\--direct+ argument for your import or export job.
Example:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--direct
----
Passing additional parameters to mysqldump:
----
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
----
Requirements
^^^^^^^^^^^^
The +mysqldump+ and +mysqlimport+ utilities should be present in the shell path of the
user running the Sqoop command on all nodes. To validate, SSH to each node as this user
and execute these commands. If you get an error, so will Sqoop.
Limitations
^^^^^^^^^^^^
* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported
* Use of a staging table when exporting data is not supported
* Import of views is not supported
Direct-mode Transactions
^^^^^^^^^^^^^^^^^^^^^^^^
For performance, each writer will commit the current transaction
approximately every 32 MB of exported data. You can control this
by specifying the following argument _before_ any tool-specific arguments: +-D
sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
bytes. Set _size_ to 0 to disable intermediate checkpoints,
but individual files being exported will continue to be committed
independently of one another.
Sometimes you need to export large data with Sqoop to a live MySQL cluster that
is under a high load serving random queries from the users of your application.
While data consistency issues during the export can be easily solved with a
staging table, there is still a problem with the performance impact caused by
the heavy export.
First off, the resources of MySQL dedicated to the import process can affect
the performance of the live product, both on the master and on the slaves.
Second, even if the servers can handle the import with no significant
performance impact (mysqlimport should be relatively "cheap"), importing big
tables can cause serious replication lag in the cluster risking data
inconsistency.
With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
milliseconds, you can let the server relax between checkpoints and the replicas
catch up by pausing the export process after transferring the number of bytes
specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
settings of these two parameters to achieve an export pace that doesn't
endanger the stability of your MySQL cluster.
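As an illustrative sketch, both settings can be combined on one export; the byte and
millisecond values, connection string and export directory below are placeholders to
experiment with:
----
$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=16777216 \
    -D sqoop.mysql.export.sleep.ms=500 \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --export-dir /user/someuser/employees --direct
----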
IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
parameter=value+ are Hadoop _generic arguments_ and must appear before
any tool-specific arguments (for example, +\--connect+, +\--table+, etc).
Don't forget that these parameters are only supported with the +\--direct+
flag set.
Microsoft SQL Connector
~~~~~~~~~~~~~~~~~~~~~~~
Extra arguments
^^^^^^^^^^^^^^^
The list of all extra arguments supported by the Microsoft SQL Connector is shown below:
.Supported Microsoft SQL Connector extra arguments:
[grid="all"]
`----------------------------------------`---------------------------------------
Argument Description
---------------------------------------------------------------------------------
+\--identity-insert+                      Set IDENTITY_INSERT to ON before \
export insert.
+\--resilient+ Attempt to recover failed \
export operations.
+\--schema <name>+                        Schema name that Sqoop should use. \
Default is "dbo".
+\--table-hints <hints>+ Table hints that Sqoop should use for \
data movement.
---------------------------------------------------------------------------------
Allow identity inserts
^^^^^^^^^^^^^^^^^^^^^^
You can allow inserts on columns that have identity. For example:
----
$ sqoop export ... --export-dir custom_dir --table custom_table -- --identity-insert
----
Resilient operations
^^^^^^^^^^^^^^^^^^^^
You can override the default and use resilient operations during export or import.
This will retry failed operations; for example, if the connection gets dropped by
SQL Server, the mapper will try to reconnect and continue from where it was before.
In case of export, the +\--resilient+ option will ensure that Sqoop will try to recover
from connection resets.
In case of import, however, one has to use both the +\--resilient+ option and specify
the +\--split-by+ column to trigger the retry mechanism. An important requirement is
that the data must be unique and ordered ascending by the split-by column, otherwise
records could be either lost or duplicated.
Example commands using resilient operations:
----
$ sqoop export ... --export-dir custom_dir --table custom_table -- --resilient
----
Importing from a table:
----
$ sqoop import ... --table custom_table --split-by id -- --resilient
----
Importing via a query:
----
$ sqoop import ... --query "SELECT ... WHERE $CONDITIONS" --split-by ordered_column -- --resilient
----
Schema support
^^^^^^^^^^^^^^
If you need to work with tables that are located in non-default schemas, you can
specify schema names via the +\--schema+ argument. Custom schemas are supported for
both import and export jobs. For example:
----
$ sqoop import ... --table custom_table -- --schema custom_schema
----
Table hints
^^^^^^^^^^^
Sqoop supports table hints in both import and export jobs. Table hints are used only
for queries that move data from/to Microsoft SQL Server, but they cannot be used for
metadata queries. You can specify a comma-separated list of table hints in the
+\--table-hints+ argument. For example:
----
$ sqoop import ... --table custom_table -- --table-hints NOLOCK
----
PostgreSQL Connector
~~~~~~~~~~~~~~~~~~~~~
Extra arguments
^^^^^^^^^^^^^^^
The list of all extra arguments supported by the PostgreSQL Connector is shown below:
.Supported PostgreSQL extra arguments:
[grid="all"]
`----------------------------------------`---------------------------------------
Argument Description
---------------------------------------------------------------------------------
+\--schema <name>+                        Schema name that Sqoop should use. \
Default is "public".
---------------------------------------------------------------------------------
Schema support
^^^^^^^^^^^^^^
If you need to work with a table that is located in a schema other than the default
one, you need to specify the extra argument +\--schema+. Custom schemas are supported
for both import and export jobs (the optional staging table, however, must be present
in the same schema as the target table). Example invocation:
----
$ sqoop import ... --table custom_table -- --schema custom_schema
----
PostgreSQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The PostgreSQL Direct Connector allows faster import and export to/from PostgreSQL by using the "COPY" command.
To use the PostgreSQL Direct Connector, specify the +\--direct+ argument for your import or export job.
When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the +\--direct-split-size+ argument.
The direct connector also offers additional extra arguments:
.Additional supported PostgreSQL extra arguments in direct mode:
[grid="all"]
`----------------------------------------`---------------------------------------
Argument Description
---------------------------------------------------------------------------------
+\--boolean-true-string <str>+ String that will be used to encode \
+true+ value of +boolean+ columns.
Default is "TRUE".
+\--boolean-false-string <str>+ String that will be used to encode \
+false+ value of +boolean+ columns.
Default is "FALSE".
---------------------------------------------------------------------------------
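As an illustrative sketch (the connection string, table, split size and the chosen
boolean encodings are placeholders), a direct-mode import using these options could
look like:
----
$ sqoop import --connect jdbc:postgresql://postgresql.example.com/testdb \
    --table some_table --direct --direct-split-size 536870912 \
    -- --boolean-true-string 1 --boolean-false-string 0
----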
Requirements
^^^^^^^^^^^^
The +psql+ utility should be present in the shell path of the user running the Sqoop
command on all nodes. To validate, SSH to each node as this user and execute the
command. If you get an error, so will Sqoop.
Limitations
^^^^^^^^^^^^
* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported
* Import of views is not supported
pg_bulkload connector
~~~~~~~~~~~~~~~~~~~~~
Purpose
^^^^^^^
The pg_bulkload connector is a direct connector for exporting data into PostgreSQL.
This connector uses
http://pgbulkload.projects.postgresql.org/index.html[pg_bulkload].
Users benefit from the functionality of pg_bulkload such as
fast exports that bypass shared buffers and WAL,
flexible error record handling,
and ETL features with filter functions.
Requirements
^^^^^^^^^^^^
The pg_bulkload connector requires the following conditions for export job execution:
* The link:http://pgbulkload.projects.postgresql.org/index.html[pg_bulkload]
must be installed on DB server and all slave nodes.
An RPM for RedHat or CentOS is available on the
link:http://pgfoundry.org/frs/?group_id=1000261[download page].
* The link:http://jdbc.postgresql.org/index.html[PostgreSQL JDBC]
driver is required on the client node.
* The superuser role of the PostgreSQL database is required for execution of pg_bulkload.
Syntax
^^^^^^
Use +--connection-manager+ option to specify connection manager classname.
----
$ sqoop export (generic-args) --connection-manager org.apache.sqoop.manager.PGBulkloadManager (export-args)
$ sqoop-export (generic-args) --connection-manager org.apache.sqoop.manager.PGBulkloadManager (export-args)
----
This connector supports the export arguments shown below.
.Supported export control arguments:
[grid="all"]
`----------------------------------------`---------------------------------------
Argument Description
---------------------------------------------------------------------------------
+\--export-dir <dir>+ HDFS source path for the export
+-m,\--num-mappers <n>+ Use 'n' map tasks to export in\
parallel
+\--table <table-name>+ Table to populate
+\--input-null-string <null-string>+ The string to be interpreted as\
null for string columns
---------------------------------------------------------------------------------
There is additional configuration for pg_bulkload execution,
specified via Hadoop Configuration properties,
which can be given with the +-D <property=value>+ option.
Because Hadoop Configuration properties are generic arguments of Sqoop,
they must precede any export control arguments.
.Supported export control properties:
[grid="all"]
`------------------------------`----------------------------------------------
Property Description
-----------------------------------------------------------------------------
mapred.reduce.tasks Number of reduce tasks for staging. \
                                The default value is 1. \
                                Each task does staging in a single transaction.
pgbulkload.bin                  Path of the pg_bulkload binary \
                                installed on each slave node.
pgbulkload.check.constraints Specify whether CHECK constraints are checked \
during the loading. \
The default value is YES.
pgbulkload.parse.errors         The maximum number of ignored records \
that cause errors during parsing, \
encoding, filtering, constraints checking, \
and data type conversion. \
Error records are recorded \
in the PARSE BADFILE. \
The default value is INFINITE.
pgbulkload.duplicate.errors     Number of ignored records \
that violate unique constraints. \
Duplicated records are recorded in the \
DUPLICATE BADFILE on DB server. \
The default value is INFINITE.
pgbulkload.filter Specify the filter function \
to convert each row in the input file. \
See the pg_bulkload documentation to know \
how to write FILTER functions.
pgbulkload.clear.staging.table Indicates that any data present in\
the staging table can be dropped.
------------------------------------------------------------------------------
Here is an example of a complete command line:
----
$ sqoop export \
    -Dmapred.reduce.tasks=2 \
-Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
-Dpgbulkload.input.field.delim=$'\t' \
-Dpgbulkload.check.constraints="YES" \
-Dpgbulkload.parse.errors="INFINITE" \
-Dpgbulkload.duplicate.errors="INFINITE" \
--connect jdbc:postgresql://pgsql.example.net:5432/sqooptest \
--connection-manager org.apache.sqoop.manager.PGBulkloadManager \
--table test --username sqooptest --export-dir=/test -m 2
----
Data Staging
^^^^^^^^^^^^
Each map task of the pg_bulkload connector's export job creates
its own staging table on the fly.
The name of the staging table is derived from the destination table
and the task attempt id.
For example, the name of the staging table for the "test" table is like
+test_attempt_1345021837431_0001_m_000000_0+ .
Staging tables are automatically dropped if tasks complete successfully
or if map tasks fail.
When a reduce task fails,
the staging table for that task is left behind for manual retry and
users must take care of it.
Netezza Connector
~~~~~~~~~~~~~~~~~
Extra arguments
^^^^^^^^^^^^^^^
The list of all extra arguments supported by the Netezza Connector is shown below:
.Supported Netezza extra arguments:
[grid="all"]
`-------------------------------------`----------------------------------------
Argument Description
-------------------------------------------------------------------------------
+--partitioned-access+ Whether each mapper acts on a subset\
                                      of data slices of a table or all. \
Default is "false" for standard mode\
and "true" for direct mode.
+--max-errors+ Applicable only for direct mode export.\
This option specifies the error threshold\
per mapper while transferring data. If\
                                      the number of errors encountered exceeds\
                                      this threshold then the job will fail. \
Default value is 1.
+--log-dir+ Applicable only for direct mode export.\
Specifies the directory where Netezza\
external table operation logs are stored\
on the hadoop filesystem. Logs are\
stored under this directory with one\
directory for the job and sub-directories\
for each task number and attempt.\
Default value is the user home directory.\
                                      The nzlog and nzbad files will be under \
(logdir)/job-id/job-attempt-id.
+--trunc-string+ Applicable only for direct mode export.\
Specifies whether the system \
truncates strings to the declared\
storage and loads the data. By default\
truncation of strings is reported as an\
error.
+--ctrl-chars+ Applicable only for direct mode export.\
Specifies whether control characters \
(ASCII chars 1 - 31) can be allowed \
to be part of char/nchar/varchar/nvarchar\
columns. Default is false.
+--crin-string+ Applicable only for direct mode export.\
Specifies whether carriage return \
(ASCII char 13) can be allowed \
to be part of char/nchar/varchar/nvarchar\
columns. Note that CR can no longer \
be a record delimiter with this option.\
Default is false.
+--ignore-zero+ Applicable only for direct mode export.\
Specifies whether NUL character \
(ASCII char 0) should be scanned \
and ignored as part of the data loaded\
into char/nchar/varchar/nvarchar \
columns.\
Default is false.
--------------------------------------------------------------------------------
Direct Mode
^^^^^^^^^^^
The Netezza connector supports an optimized data transfer facility using the
Netezza external tables feature. Each map task of the Netezza connector's import
job works on a subset of the Netezza partitions and transparently creates and
uses an external table to transport data. Similarly, export jobs use an
external table to push data quickly onto the NZ system. Direct mode does
not support staging tables, upsert options, etc.
Here is an example of complete command line for import using the Netezza
external table feature.
----
$ sqoop import \
--direct \
--connect jdbc:netezza://nzhost:5480/sqoop \
--table nztable \
--username nzuser \
--password nzpass \
--target-dir hdfsdir
----
Here is an example of complete command line for export with tab as the field
terminator character.
----
$ sqoop export \
--direct \
--connect jdbc:netezza://nzhost:5480/sqoop \
--table nztable \
--username nzuser \
--password nzpass \
--export-dir hdfsdir \
--input-fields-terminated-by "\t"
----
Null string handling
^^^^^^^^^^^^^^^^^^^^
Netezza direct connector supports the null-string features of Sqoop. The null
string values are converted to appropriate external table options during export
and import operations.
.Supported export control arguments:
[grid="all"]
`----------------------------------------`---------------------------------------
Argument Description
---------------------------------------------------------------------------------
+\--input-null-string <null-string>+ The string to be interpreted as\
null for string columns.
+\--input-null-non-string <null-string>+ The string to be interpreted as\
null for non string columns.
---------------------------------------------------------------------------------
In the case of the Netezza direct mode connector, both arguments must be
left at their default values or explicitly set to the same value. Furthermore,
the null string value is restricted to 0-4 UTF-8 characters.
On export, for non-string columns, if the chosen null value is a valid
representation in the column domain, then the column might not be loaded as
null. For example, if the null string value is specified as "1", then on
export, any occurrence of "1" in the input file will be loaded as value 1
instead of NULL for int columns.
It is suggested that the null value be specified as empty string for
performance and consistency.
.Supported import control arguments:
[grid="all"]
`----------------------------------------`---------------------------------------
Argument Description
---------------------------------------------------------------------------------
+\--null-string <null-string>+ The string to be interpreted as\
null for string columns.
+\--null-non-string <null-string>+ The string to be interpreted as\
null for non string columns.
---------------------------------------------------------------------------------
In the case of the Netezza direct mode connector, both arguments must be
left at their default values or explicitly set to the same value. Furthermore,
the null string value is restricted to 0-4 UTF-8 characters.
On import, for non-string columns, the current implementation ignores the
chosen null value representation for non-character columns. For example, if
the null string value is specified as "\N", then on import, any occurrence of
NULL for non-char columns in the table will be imported as an empty string
instead of '\N', the chosen null string representation.
It is suggested that the null value be specified as empty string for
performance and consistency.
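As a sketch reusing the placeholder connection details from the earlier Netezza
examples, an import that sets both null values to the recommended empty string could
be invoked as:
----
$ sqoop import \
    --direct \
    --connect jdbc:netezza://nzhost:5480/sqoop \
    --table nztable \
    --username nzuser \
    --password nzpass \
    --null-string '' \
    --null-non-string '' \
    --target-dir hdfsdir
----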
Data Connector for Oracle and Hadoop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
About
^^^^^
The Data Connector for Oracle and Hadoop is now included in Sqoop.
It can be enabled by specifying the +\--direct+ argument for your import or
export job.
Jobs
++++
The Data Connector for Oracle and Hadoop inspects each Sqoop job and assumes
responsibility for the ones it can perform better than the Oracle manager built
into Sqoop.
Data Connector for Oracle and Hadoop accepts responsibility for the following
Sqoop Job types:
- *Import* jobs that are *Non-Incremental*.
- *Export* jobs
- The Data Connector for Oracle and Hadoop does not accept responsibility for
other Sqoop job types. For example, it does not accept *eval* jobs.
Data Connector for Oracle and Hadoop accepts responsibility for those Sqoop Jobs
with the following attributes:
- Oracle-related
- Table-Based - Jobs where the table argument is used and the specified object
is a table.
+
NOTE: Data Connector for Oracle and Hadoop does not process index-organized
tables unless the table is partitioned and +oraoop.chunk.method+ is set
to +PARTITION+
- There are at least 2 mappers — Jobs where the Sqoop command-line does not
include: +--num-mappers 1+
How The Standard Oracle Manager Works for Imports
+++++++++++++++++++++++++++++++++++++++++++++++++
The Oracle manager built into Sqoop uses a range-based query for each mapper.
Each mapper executes a query of the form:
----
SELECT * FROM sometable WHERE id >= lo AND id < hi
----
The *lo* and *hi* values are based on the number of mappers and the minimum and
maximum values of the data in the column the table is being split by.
If no suitable index exists on the table then these queries result in full
table-scans within Oracle. Even with a suitable index, multiple mappers may
fetch data stored within the same Oracle blocks, resulting in redundant IO
calls.
How The Data Connector for Oracle and Hadoop Works for Imports
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
The Data Connector for Oracle and Hadoop generates queries for the mappers of
the form:
----
SELECT *
FROM sometable
WHERE rowid >= dbms_rowid.rowid_create(1, 893, 1, 279, 0) AND
rowid <= dbms_rowid.rowid_create(1, 893, 1, 286, 32767)
----
The Data Connector for Oracle and Hadoop queries ensure that:
- No two mappers read data from the same Oracle block. This minimizes
redundant IO.
- The table does not require indexes.
- The Sqoop command line does not need to specify a +--split-by+ column.
Data Connector for Oracle and Hadoop Exports
++++++++++++++++++++++++++++++++++++++++++++
Benefits of the Data Connector for Oracle and Hadoop:
- *Merge-Export facility* - Update Oracle tables by modifying changed rows AND
inserting rows from the HDFS file that did not previously exist in the Oracle
table. The Connector for Oracle and Hadoop's Merge-Export is unique - there is
no Sqoop equivalent.
- *Lower impact on the Oracle database* - Update the rows in the Oracle table
that have changed, not all rows in the Oracle table. This has performance
benefits and reduces the impact of the query on Oracle (for example, the Oracle
redo logs).
- *Improved performance* - With partitioned tables, mappers utilize temporary
Oracle tables which allow parallel inserts and direct path writes.
Requirements
^^^^^^^^^^^^
Ensure The Oracle Database JDBC Driver Is Setup Correctly
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
You may want to ensure the Oracle Database 11g Release 2 JDBC driver is set up
correctly on your system. This driver is required for Sqoop to work with Oracle.
The Oracle Database 11g Release 2 JDBC driver file is +ojdbc6.jar+ (3.2 MB).
If this file is not on your system then download it from:
http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html
This file should be put into the +$SQOOP_HOME/lib+ directory.
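For example, a minimal sketch of copying the driver into place (the source path of the
downloaded jar is a placeholder):
----
$ cp /path/to/ojdbc6.jar $SQOOP_HOME/lib/
----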
Oracle Roles and Privileges
+++++++++++++++++++++++++++
The Oracle user for The Data Connector for Oracle and Hadoop requires the
following roles and privileges:
- +create session+
In addition, the user must have the select any dictionary privilege or
select_catalog_role role or all of the following object privileges:
- +select on v_$instance+
- +select on dba_tables+
- +select on dba_tab_columns+
- +select on dba_objects+
- +select on dba_extents+
- +select on dba_segments+ — Required for Sqoop imports only
- +select on dba_constraints+ — Required for Sqoop imports only
- +select on v_$database+ — Required for Sqoop imports only
- +select on v_$parameter+ — Required for Sqoop imports only
NOTE: The user also requires the alter session privilege to make use of session
tracing functionality. See "oraoop.oracle.session.initialization.statements"
for more information.
Additional Oracle Roles And Privileges Required for Export
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
The Oracle user for Data Connector for Oracle and Hadoop requires:
- Quota on the tablespace in which the Oracle export tables are located.
+
An example Oracle command to achieve this is
+
----
alter user username quota unlimited on tablespace
----
- The following privileges:
+
[grid="all"]
`-----------------------------------------------------------------------------------------------`--------------------------------------------------------------------------
Type of Export Privileges Required
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All Export +create table+
+select on dba_tab_partitions+
+select on dba_tab_subpartitions+
+select on dba_indexes+
+select on dba_ind_columns+
Insert-Export with a template table into another schema +select any table+
+create any table+
+insert any table+
+alter any table+ (partitioning)
Insert-Export without a template table into another schema +select,insert on table+ (no partitioning)
+select,alter on table+ (partitioning)
Update-Export into another schema +select,update on table+ (no partitioning)
+select,delete,alter,insert on table+ (partitioning)
Merge-Export into another schema +select,insert,update on table+ (no partitioning)
+select,insert,delete,alter on table+ (partitioning)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Supported Data Types
++++++++++++++++++++
The following Oracle data types are supported by the Data Connector for
Oracle and Hadoop:
[grid="all"]
`----------------------------------------`-----------------------------------------------------------------
BINARY_DOUBLE NCLOB
BINARY_FLOAT NUMBER
BLOB NVARCHAR2
CHAR RAW
CLOB ROWID
DATE TIMESTAMP
FLOAT TIMESTAMP WITH TIME ZONE
INTERVAL DAY TO SECOND TIMESTAMP WITH LOCAL TIME ZONE
INTERVAL YEAR TO MONTH URITYPE
LONG VARCHAR2
NCHAR
-----------------------------------------------------------------------------------------------------------
All other Oracle column types are NOT supported. Example Oracle column types NOT
supported by Data Connector for Oracle and Hadoop include:
[grid="all"]
`-----------------------------------------------------`-------------------------------
All of the ANY types BFILE
All of the MEDIA types LONG RAW
All of the SPATIAL types MLSLABEL
Any type referred to as UNDEFINED UROWID
All custom (user-defined) URI types XMLTYPE
--------------------------------------------------------------------------------------
NOTE: Data types RAW, LONG and LOB (BLOB, CLOB and NCLOB) are supported for
Data Connector for Oracle and Hadoop imports. They are not supported for Data
Connector for Oracle and Hadoop exports.
Execute Sqoop With Data Connector for Oracle and Hadoop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Connect to Oracle / Oracle RAC
++++++++++++++++++++++++++++++
The Sqoop +--connect+ parameter defines the Oracle instance or Oracle RAC to
connect to. It is required with all Sqoop import and export commands.
Data Connector for Oracle and Hadoop expects the associated connection string
to be of a specific format dependent on whether the Oracle SID, Service
or TNS name is defined. The TNS name based URL scheme can be used to enable
authentication using Oracle wallets.
+--connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleSID+
+--connect jdbc:oracle:thin:@//OracleServer:OraclePort/OracleService+
+--connect jdbc:oracle:thin:@TNSName+
Connect to An Oracle Database Instance
++++++++++++++++++++++++++++++++++++++
[grid="all"]
`---------------------------------------------`----------------------------------------
Parameter / Component Description
---------------------------------------------------------------------------------------
+jdbc:oracle:thin+ The Data Connector for Oracle and\
Hadoop requires the connection string\
starts with jdbc:oracle.
The Data Connector for Oracle and\
Hadoop has been tested with the thin\
driver however it should work equally\
well with other drivers such as OCI.
+OracleServer+ The host name of the Oracle server.
+OraclePort+ The port to connect to the Oracle server.
+OracleSID+ The Oracle instance.
+OracleService+ The Oracle Service.
+TNSName+ The TNS name for the entry describing\
the connection to the Oracle server.
----------------------------------------------------------------------------------------
[NOTE]
================================================================================
The Hadoop mappers connect to the Oracle database using a dynamically
generated JDBC URL. This is designed to improve performance however it can be
disabled by specifying:
+-D oraoop.jdbc.url.verbatim=true+
================================================================================
Connect to An Oracle RAC
++++++++++++++++++++++++
Use the +--connect+ parameter as above. The connection string should point to
one instance of the Oracle RAC. The listener of the host of this Oracle
instance will locate the other instances of the Oracle RAC.
NOTE: To improve performance, The Data Connector for Oracle and Hadoop
identifies the active instances of the Oracle RAC and connects each Hadoop
mapper to them in a round-robin manner.
If services are defined for this Oracle RAC then use the following parameter
to specify the service name:
+-D oraoop.oracle.rac.service.name=ServiceName+
[grid="all"]
`--------------------------------------------------------------------`------------------------------------------------
Parameter / Component Description
----------------------------------------------------------------------------------------------------------------------
+OracleServer:OraclePort:OracleInstance+ Name one instance of the Oracle RAC.\
The Data Connector for Oracle and\
Hadoop assumes the same port number for\
all instances of the Oracle RAC.
The listener of the host of this Oracle\
instance is used to locate other\
instances of the Oracle RAC. For more\
information enter this command on the\
host command line:
+lsnrctl status+
+-D oraoop.oracle.rac.service.name=ServiceName+ The service to connect to in the Oracle RAC.
A connection is made to all instances\
of the Oracle RAC associated with the\
service given by +ServiceName+.
If omitted, a connection is made to all\
instances of the Oracle RAC.
The listener of the host of this Oracle\
instance needs to know the +ServiceName+\
and all instances of the Oracle RAC. For\
more information enter this command on\
the host command line:
+lsnrctl status+
----------------------------------------------------------------------------------------------------------------------
Login to The Oracle Instance
++++++++++++++++++++++++++++
Login to the Oracle instance on the Sqoop command line:
+--connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleInstance --username
UserName -P+
[grid="all"]
`-------------------------------------`----------------------------------------
Parameter / Component Description
-------------------------------------------------------------------------------
+--username UserName+ The username to login to the Oracle\
instance (SID).
+-P+ You will be prompted for the password\
to login to the Oracle instance.
--------------------------------------------------------------------------------
Kill Data Connector for Oracle and Hadoop Jobs
++++++++++++++++++++++++++++++++++++++++++++++
Use the Hadoop Job Tracker to kill the Sqoop job, just as you would kill any
other Map-Reduce job.
$ +hadoop job -kill jobid+
To allow an Oracle DBA to kill a Data Connector for Oracle and Hadoop
job (via killing the sessions in Oracle) you need to prevent Map-Reduce from
re-attempting failed jobs. This is done via the following Sqoop
command-line switch:
+-D mapred.map.max.attempts=1+
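For example, a sketch of an import run with re-attempts disabled (the connection
details and table name are placeholders):

$ +sqoop import -D mapred.map.max.attempts=1 --direct --connect ...
--table OracleTableName+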
This sends instructions similar to the following to the console:
----
14/07/07 15:24:51 INFO oracle.OraOopManagerFactory:
Note: This Data Connector for Oracle and Hadoop job can be killed via Oracle
by executing the following statement:
begin
for row in (select sid,serial# from v$session where
module='Data Connector for Oracle and Hadoop' and
action='import 20140707152451EST') loop
execute immediate 'alter system kill session ''' || row.sid ||
',' || row.serial# || '''';
end loop;
end;
----
Import Data from Oracle
^^^^^^^^^^^^^^^^^^^^^^^
Execute Sqoop. Following is an example command:
$ +sqoop import --direct --connect ... --table OracleTableName+
If The Data Connector for Oracle and Hadoop accepts the job then the following
text is output:
----
**************************************************
*** Using Data Connector for Oracle and Hadoop ***
**************************************************
----
[NOTE]
================================================================================
- More information is available on the +--connect+ parameter. See "Connect to
Oracle / Oracle RAC" for more information.
- If Java runs out of memory the workaround is to specify each mapper's
JVM memory allocation. Add the following parameter for example to allocate 4GB:
+
+-Dmapred.child.java.opts=-Xmx4000M+
- An Oracle optimizer hint is included in the SELECT statement by default.
See "oraoop.import.hint" for more information.
+
You can alter the hint on the command line as follows:
+
+-Doraoop.import.hint="NO_INDEX(t)"+
+
You can turn off the hint on the command line as follows (notice the space
between the double quotes):
+
+-Doraoop.import.hint=" "+
================================================================================
Match Hadoop Files to Oracle Table Partitions
+++++++++++++++++++++++++++++++++++++++++++++
+-Doraoop.chunk.method={ROWID|PARTITION}+
To import data from a partitioned table in such a way that the resulting HDFS
folder structure in Hadoop will match the table’s partitions, set the chunk
method to PARTITION. The alternative (default) chunk method is ROWID.
[NOTE]
================================================================================
- For the number of Hadoop files to match the number of Oracle partitions,
set the number of mappers to be greater than or equal to the number of
partitions.
- If the table is not partitioned then value PARTITION will lead to an error.
================================================================================
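For example, a sketch of a partition-matched import (the connection details, table
name and mapper count are placeholders):

$ +sqoop import -Doraoop.chunk.method=PARTITION --direct --connect ...
--table OracleTableName --num-mappers 8+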
Specify The Partitions To Import
++++++++++++++++++++++++++++++++
+-Doraoop.import.partitions=PartitionA,PartitionB --table OracleTableName+
Imports +PartitionA+ and +PartitionB+ of +OracleTableName+.
[NOTE]
================================================================================
- You can enclose an individual partition name in double quotes to retain the
letter case or if the name has special characters.
+
+-Doraoop.import.partitions=\'"PartitionA",PartitionB' --table OracleTableName+
+
If the partition name is not double quoted then its name will be automatically
converted to upper case, PARTITIONB for above.
+
When using double quotes the entire list of partition names must be enclosed in
single quotes.
+
If the last partition name in the list is double quoted then there must be a
comma at the end of the list.
+
+-Doraoop.import.partitions=\'"PartitionA","PartitionB",' --table
OracleTableName+
- Name each partition to be included. There is no facility to provide a range of
partition names.
- There is no facility to define sub partitions. The entire partition is
included/excluded as per the filter.
================================================================================
Consistent Read: All Mappers Read From The Same Point In Time
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+-Doraoop.import.consistent.read={true|false}+
When set to +false+ (by default) each mapper runs a select query. This will
return potentially inconsistent data if there are a lot of DML operations on
the table at the time of import.
Set to +true+ to ensure all mappers read from the same point in time. The
System Change Number (SCN) is passed down to all mappers, which use the Oracle
Flashback Query to query the table as at that SCN.
[NOTE]
================================================================================
- Values +true+ | +false+ are case sensitive.
- By default the SCN is taken from V$database. You can specify the SCN in the
following command
+
+-Doraoop.import.consistent.read.scn=12345+
================================================================================
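For example, a sketch of an import with consistent read enabled (the connection
details and table name are placeholders):

$ +sqoop import -Doraoop.import.consistent.read=true --direct --connect ...
--table OracleTableName+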
Export Data into Oracle
^^^^^^^^^^^^^^^^^^^^^^^
Execute Sqoop. Following is an example command:
$ +sqoop export --direct --connect ... --table OracleTableName --export-dir
/user/username/tablename+
The Data Connector for Oracle and Hadoop accepts all jobs that export data to
Oracle. You can verify The Data Connector for Oracle and Hadoop is in use by
checking the following text is output:
----
**************************************************
*** Using Data Connector for Oracle and Hadoop ***
**************************************************
----
[NOTE]
================================================================================
- +OracleTableName+ is the Oracle table the data will export into.
- +OracleTableName+ can be in a schema other than that for the connecting user.
Prefix the table name with the schema, for example +SchemaName.OracleTableName+.
- Hadoop tables are picked up from the +/user/username/tablename+ directory.
- The export will fail if the Hadoop file contains any fields of a data type
not supported by The Data Connector for Oracle and Hadoop. See
"Supported Data Types" for more information.
- The export will fail if the column definitions in the Hadoop table do not
exactly match the column definitions in the Oracle table.
- The Data Connector for Oracle and Hadoop indicates if it finds temporary
tables that it created more than a day ago that still exist. Usually these
tables can be dropped. The only circumstance when these tables should not be
dropped is when a Data Connector for Oracle and Hadoop job has been
running for more than 24 hours and is still running.
- More information is available on the +--connect+ parameter. See
"Connect to Oracle / Oracle RAC" for more information.
================================================================================
Insert-Export
+++++++++++++
Appends data to +OracleTableName+. It does not modify existing data in
+OracleTableName+.
Insert-Export is the default method, executed in the absence of the
+--update-key+ parameter. All rows in the HDFS file in
+/user/UserName/TableName+ are inserted into +OracleTableName+. No
change is made to pre-existing data in +OracleTableName+.
$ +sqoop export --direct --connect ... --table OracleTableName --export-dir
/user/username/tablename+
[NOTE]
================================================================================
- If +OracleTableName+ was previously created by The Data Connector for Oracle
and Hadoop with partitions then this export will create a new partition for the
data being inserted.
- When creating +OracleTableName+ specify a template. See
"Create Oracle Tables" for more information.
================================================================================
Update-Export
+++++++++++++
+--update-key OBJECT+
Updates existing rows in +OracleTableName+.
Rows in the HDFS file in +/user/UserName/TableName+ are matched to rows in
+OracleTableName+ by the +OBJECT+ column. Rows that match are copied from the
HDFS file to the Oracle table. No action is taken on rows that do not match.
$ +sqoop export --direct --connect ... --update-key OBJECT --table
OracleTableName --export-dir /user/username/tablename+
[NOTE]
================================================================================
- If +OracleTableName+ was previously created by The Data Connector for Oracle
and Hadoop with partitions then this export will create a new partition for the
data being inserted. Updated rows will be moved to the new partition that was
created for the export.
- For performance reasons it is strongly recommended that, where more than a few
rows are involved, column +OBJECT+ be an indexed column of +OracleTableName+.
- Ensure the column name defined with +--update-key OBJECT+ is specified in the
correct letter case. Sqoop will show an error if the letter case is incorrect.
- It is possible to match rows via multiple columns. See "Match Rows Via
Multiple Columns" for more information.
================================================================================
Merge-Export
++++++++++++
+--update-key OBJECT -Doraoop.export.merge=true+
Updates existing rows in +OracleTableName+. Copies across rows from the HDFS
file that do not exist within the Oracle table.
Rows in the HDFS file in +/user/UserName/TableName+ are matched to rows in
+OracleTableName+ by the +OBJECT+ column. Rows that match are copied from the
HDFS file to the Oracle table. Rows in the HDFS file that do not exist in
+OracleTableName+ are added to +OracleTableName+.
$ +sqoop export --direct --connect ... --update-key OBJECT
-Doraoop.export.merge=true --table OracleTableName --export-dir
/user/username/tablename+
[NOTE]
================================================================================
- Merge-Export is unique to The Data Connector for Oracle and Hadoop. It is
not a standard Sqoop feature.
- If +OracleTableName+ was previously created by The Data Connector for Oracle
and Hadoop with partitions, then this export will create a new partition for
the data being inserted. Updated rows will be moved to the new partition that
was created for the export.
- For performance reasons it is strongly recommended that, where more than a
few rows are involved, column +OBJECT+ be an indexed column of +OracleTableName+.
- Ensure the column name defined with +--update-key OBJECT+ is specified in the
correct letter case. Sqoop will show an error if the letter case is incorrect.
- It is possible to match rows via multiple columns. See "Match Rows Via
Multiple Columns" for more information.
================================================================================
Create Oracle Tables
++++++++++++++++++++
+-Doraoop.template.table=TemplateTableName+
Creates +OracleTableName+ by replicating the structure and data types of
+TemplateTableName+. +TemplateTableName+ is a table that exists in Oracle prior
to executing the Sqoop command.
[NOTE]
================================================================================
- The export will fail if the Hadoop file contains any fields of a data type
not supported by The Data Connector for Oracle and Hadoop. See "Supported
Data Types" for more information.
- The export will fail if the column definitions in the Hadoop table do not
exactly match the column definitions in the Oracle table.
- This parameter is specific to creating an Oracle table. The export will fail
if +OracleTableName+ already exists in Oracle.
================================================================================
Example command:
$ +sqoop export --direct --connect.. --table OracleTableName --export-dir
/user/username/tablename -Doraoop.template.table=TemplateTableName+
NOLOGGING
+++++++++
+-Doraoop.nologging=true+
Assigns the NOLOGGING option to +OracleTableName+.
NOLOGGING may enhance performance but you will be unable to back up the table.
Partitioning
++++++++++++
+-Doraoop.partitioned=true+
Partitions the table with the following benefits:
- The speed of the export is improved by allowing each mapper to insert data
into a separate Oracle table using direct path writes. (An alter table exchange
subpartition SQL statement is subsequently executed to swap the data into the
export table.)
- You can selectively query or delete the data inserted by each Sqoop export
job. For example, you can delete old data by dropping old partitions from
the table.
The partition value is the SYSDATE of when the Sqoop export job was performed.
The partitioned table created by The Data Connector for Oracle and Hadoop
includes the following columns that don't exist in the template table:
- +oraoop_export_sysdate+ - This is the Oracle SYSDATE when the Sqoop export
job was performed. The created table will be partitioned by this column.
- +oraoop_mapper_id+ - This is the id of the Hadoop mapper that was used to
process the rows from the HDFS file. Each partition is subpartitioned by this
column. This column exists merely to facilitate the exchange subpartition
mechanism that is performed by each mapper during the export process.
- +oraoop_mapper_row+ - A unique row id within the mapper / partition.
NOTE: If a unique row id is required for the table it can be formed by a
combination of oraoop_export_sysdate, oraoop_mapper_id and oraoop_mapper_row.
Match Rows Via Multiple Columns
+++++++++++++++++++++++++++++++
+-Doraoop.update.key.extra.columns="ColumnA,ColumnB"+
Used with Update-Export and Merge-Export to match on more than one column. The
first column to be matched on is +--update-key OBJECT+. To match on additional
columns, specify those columns on this parameter.
[NOTE]
================================================================================
- Letter case for the column names on this parameter is not important.
- All columns used for matching should be indexed. The first three items on the
index should be +ColumnA+, +ColumnB+ and the column specified on
+--update-key+ - but the order in which the columns are specified is not
important.
================================================================================
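For example, a sketch of an Update-Export that matches on +OBJECT+ plus two extra
columns (the connection details, table, directory and column names are placeholders):

$ +sqoop export -Doraoop.update.key.extra.columns="ColumnA,ColumnB" --direct
--connect ... --update-key OBJECT --table OracleTableName --export-dir
/user/username/tablename+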
Storage Clauses
+++++++++++++++
+-Doraoop.temporary.table.storage.clause="StorageClause"+
+-Doraoop.table.storage.clause="StorageClause"+
Use these to customize storage with Oracle clauses such as TABLESPACE or COMPRESS.
+-Doraoop.table.storage.clause+ applies to the export table that is created
from +-Doraoop.template.table+. See "Create Oracle Tables" for more
information. +-Doraoop.temporary.table.storage.clause+ applies to all other
working tables that are created during the export process and then dropped at
the end of the export job.
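For example, a sketch that places the table created from the template in a specific
tablespace (the tablespace name and other details are placeholders):

$ +sqoop export -Doraoop.template.table=TemplateTableName
-Doraoop.table.storage.clause="TABLESPACE users" --direct --connect ...
--table OracleTableName --export-dir /user/username/tablename+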
Manage Date And Timestamp Data Types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Import Date And Timestamp Data Types from Oracle
++++++++++++++++++++++++++++++++++++++++++++++++
This section lists known differences in the data obtained by performing a
Data Connector for Oracle and Hadoop import of an Oracle table versus a native
Sqoop import of the same table.
The Data Connector for Oracle and Hadoop Does Not Apply A Time Zone to DATE / TIMESTAMP Data Types
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Data stored in a DATE or TIMESTAMP column of an Oracle table is not associated
with a time zone. Sqoop without the Data Connector for Oracle and Hadoop
inappropriately applies time zone information to this data.
Take for example the following timestamp in an Oracle DATE or TIMESTAMP column:
+2am on 3rd October, 2010+.
Request that Sqoop without the Data Connector for Oracle and Hadoop import this
data using a system located in Melbourne, Australia. The data is adjusted to Melbourne
Daylight Saving Time. The data is imported into Hadoop as:
+3am on 3rd October, 2010.+
The Data Connector for Oracle and Hadoop does not apply time zone information to
these Oracle data-types. Even from a system located in Melbourne, Australia, The
Data Connector for Oracle and Hadoop ensures the Oracle and Hadoop timestamps
match. The Data Connector for Oracle and Hadoop correctly imports this
timestamp as:
+2am on 3rd October, 2010+.
NOTE: In order for The Data Connector for Oracle and Hadoop to ensure data
accuracy, Oracle DATE and TIMESTAMP values must be represented by a String,
even when +--as-sequencefile+ is used on the Sqoop command-line to produce a
binary file in Hadoop.
The Data Connector for Oracle and Hadoop Retains Time Zone Information in TIMEZONE Data Types
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Data stored in a TIMESTAMP WITH TIME ZONE column of an Oracle table is
associated with a time zone. This data consists of two distinct parts: when the
event occurred and where the event occurred.
When Sqoop without The Data Connector for Oracle and Hadoop is used to import
data it converts the timestamp to the time zone of the system running Sqoop and
omits the component of the data that specifies where the event occurred.
Take for example the following timestamps (with time zone) in an Oracle
TIMESTAMP WITH TIME ZONE column:
----
2:59:00 am on 4th April, 2010. Australia/Melbourne
2:59:00 am on 4th April, 2010. America/New York
----
Request that Sqoop without The Data Connector for Oracle and Hadoop import this
data using a system located in Melbourne, Australia. From the data imported into
Hadoop we know when the events occurred, assuming we know the Sqoop command was
run from a system located in the Australia/Melbourne time zone, but we have lost
the information regarding where the event occurred.
----
2010-04-04 02:59:00.0
2010-04-04 16:59:00.0
----
Sqoop with The Data Connector for Oracle and Hadoop imports the example
timestamps as follows. The Data Connector for Oracle and Hadoop retains the
time zone portion of the data.
----
2010-04-04 02:59:00.0 Australia/Melbourne
2010-04-04 02:59:00.0 America/New_York
----
Data Connector for Oracle and Hadoop Explicitly States Time Zone for LOCAL TIMEZONE Data Types
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Data stored in a TIMESTAMP WITH LOCAL TIME ZONE column of an Oracle table is
associated with a time zone. Multiple end-users in differing time zones
(locales) will each have that data expressed as a timestamp within their
respective locale.
When Sqoop without the Data Connector for Oracle and Hadoop is used to import
data it converts the timestamp to the time zone of the system running Sqoop and
omits the component of the data that specifies location.
Take for example the following two timestamps (with time zone) in an Oracle
TIMESTAMP WITH LOCAL TIME ZONE column:
----
2:59:00 am on 4th April, 2010. Australia/Melbourne
2:59:00 am on 4th April, 2010. America/New York
----
Request that Sqoop without the Data Connector for Oracle and Hadoop import this
data using a system located in Melbourne, Australia. The timestamps are imported
correctly but the local time zone has to be guessed. If multiple systems in
different locales were executing the Sqoop import it would be very difficult to
diagnose the cause of the data corruption.
----
2010-04-04 02:59:00.0
2010-04-04 16:59:00.0
----
Sqoop with the Data Connector for Oracle and Hadoop explicitly states the time
zone portion of the data imported into Hadoop. The local time zone is GMT by
default. You can set the local time zone with parameter:
+-Doracle.sessionTimeZone=Australia/Melbourne+
The Data Connector for Oracle and Hadoop would import these two timestamps as:
----
2010-04-04 02:59:00.0 Australia/Melbourne
2010-04-04 16:59:00.0 Australia/Melbourne
----
java.sql.Timestamp
++++++++++++++++++
To use Sqoop's handling of date and timestamp data types when importing data
from Oracle use the following parameter:
+-Doraoop.timestamp.string=false+
NOTE: Sqoop's handling of date and timestamp data types does not store the
timezone. However, some developers may prefer Sqoop's handling as the Data
Connector for Oracle and Hadoop converts date and timestamp data types to
string. This may not work for some developers as the string will require
parsing later in the workflow.
Export Date And Timestamp Data Types into Oracle
++++++++++++++++++++++++++++++++++++++++++++++++
Ensure the data in the HDFS file fits the required format exactly before using
Sqoop to export the data into Oracle.
[NOTE]
================================================================================
- The Sqoop export command will fail if the data is not in the required format.
- ff = Fractional second
- TZR = Time Zone Region
================================================================================
[grid="all"]
`-------------------------------------`-------------------------------------------------------
Oracle Data Type Required Format of The Data in the HDFS File
----------------------------------------------------------------------------------------------
DATE +yyyy-mm-dd hh24:mi:ss+
TIMESTAMP +yyyy-mm-dd hh24:mi:ss.ff+
TIMESTAMPTZ +yyyy-mm-dd hh24:mi:ss.ff TZR+
TIMESTAMPLTZ +yyyy-mm-dd hh24:mi:ss.ff TZR+
----------------------------------------------------------------------------------------------
Configure The Data Connector for Oracle and Hadoop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
oraoop-site-template.xml
++++++++++++++++++++++++
The oraoop-site-template.xml file is supplied with the Data Connector for
Oracle and Hadoop. It contains a number of ALTER SESSION statements that are
used to initialize the Oracle sessions created by the Data Connector for Oracle
and Hadoop.
If you need to customize these initializations to your environment then:
1. Find +oraoop-site-template.xml+ in the Sqoop configuration directory.
2. Copy +oraoop-site-template.xml+ to +oraoop-site.xml+.
3. Edit the +ALTER SESSION+ statements in +oraoop-site.xml+.
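For example, a minimal sketch of steps 1 and 2, assuming the Sqoop configuration
directory is +$SQOOP_HOME/conf+ (yours may differ):
----
$ cd $SQOOP_HOME/conf
$ cp oraoop-site-template.xml oraoop-site.xml
----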
oraoop.oracle.session.initialization.statements
+++++++++++++++++++++++++++++++++++++++++++++++
The value of this property is a semicolon-delimited list of Oracle SQL
statements. These statements are executed, in order, for each Oracle session
created by the Data Connector for Oracle and Hadoop.
The default statements include:
+alter session set time_zone = \'{oracle.sessionTimeZone|GMT}';+::
This statement initializes the timezone of the JDBC client. This ensures that
data from columns of type TIMESTAMP WITH LOCAL TIMEZONE are correctly adjusted
into the timezone of the client and not kept in the timezone of the Oracle
database.
+
[NOTE]
================================================================================
- There is an explanation of the text within the curly braces. See
"Expressions in oraoop-site.xml" for more information.
- A list of the time zones supported by your Oracle database is available by
executing the following query: +SELECT TZNAME FROM V$TIMEZONE_NAMES;+
================================================================================
+alter session disable parallel query;+::
This statement instructs Oracle to not parallelize SQL statements executed by
the Data Connector for Oracle and Hadoop sessions. This Oracle feature is
disabled because the Map/Reduce job launched by Sqoop is the mechanism used
for parallelization.
+
It is recommended that you not enable parallel query because it can have an
adverse effect on the load on the Oracle instance and on the balance between
the Data Connector for Oracle and Hadoop mappers.
+
Some export operations are performed in parallel where deemed appropriate by
the Data Connector for Oracle and Hadoop. See "Parallelization" for
more information.
+alter session set "_serial_direct_read"=true;+::
This statement instructs Oracle to bypass the buffer cache. This is used to
prevent Oracle from filling its buffers with the data being read by the Data
Connector for Oracle and Hadoop, therefore diminishing its capacity to cache
higher prioritized data. Hence, this statement is intended to minimize the
Data Connector for Oracle and Hadoop's impact on the immediate future
performance of the Oracle database.
+--alter session set events \'10046 trace name context forever, level 8';+::
This statement has been commented-out. To allow tracing, remove the comment
token "--" from the start of the line.
[NOTE]
================================================================================
- These statements are placed on separate lines for readability. They do not
need to be placed on separate lines.
- A statement can be commented-out via the standard Oracle double-hyphen
token: "--". The comment takes effect until the next semicolon.
================================================================================
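As an illustrative sketch only (the exact contents of the shipped template may
differ), the property could appear in +oraoop-site.xml+ with the default
statements listed above supplied as a single semicolon-delimited value:

----
<property>
  <name>oraoop.oracle.session.initialization.statements</name>
  <value>
    alter session set time_zone = '{oracle.sessionTimeZone|GMT}';
    alter session disable parallel query;
    alter session set "_serial_direct_read"=true;
    --alter session set events '10046 trace name context forever, level 8';
  </value>
</property>
----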
oraoop.table.import.where.clause.location
+++++++++++++++++++++++++++++++++++++++++
SUBSPLIT (default)::
When set to this value, the where clause is applied to each subquery used to
retrieve data from the Oracle table.
+
A Sqoop command like:
+
+sqoop import -D oraoop.table.import.where.clause.location=SUBSPLIT --table
JUNK --where "owner like \'G%'"+
+
Generates a SQL query of the form:
+
----
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767)))
AND (owner like 'G%')
UNION ALL
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223679, 32767)))
AND (owner like 'G%')
----
SPLIT::
When set to this value, the where clause is applied to the entire SQL
statement used by each split/mapper.
+
A Sqoop command like:
+
+sqoop import -D oraoop.table.import.where.clause.location=SPLIT --table
JUNK --where "rownum <= 10"+
+
Generates a SQL query of the form:
+
----
SELECT OWNER,OBJECT_NAME
FROM (
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767)))
UNION ALL
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223679,32767)))
)
WHERE rownum <= 10
----
+
[NOTE]
================================================================================
- In this example, there are up to 10 rows imported per mapper.
- The SPLIT clause may result in greater overhead than the SUBSPLIT
clause because the UNION statements need to be fully materialized
before the data can be streamed to the mappers. However, you may
wish to use SPLIT in the case where you want to limit the total
number of rows processed by each mapper.
================================================================================
oracle.row.fetch.size
+++++++++++++++++++++
The value of this property is an integer specifying the number of rows the
Oracle JDBC driver should fetch in each network round-trip to the database.
The default value is 5000.
If you alter this setting, confirmation of the
change is displayed in the logs of the mappers during the Map-Reduce job.
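For example, to request a larger fetch size, the property can be passed before
the tool-specific arguments. The fetch size, connection string, username and
table below are placeholders:

----
$ sqoop import -D oracle.row.fetch.size=10000 \
    --direct --connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
    --username SQOOP -P --table EMPLOYEES
----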
oraoop.import.hint
++++++++++++++++++
The Oracle optimizer hint is added to the SELECT statement for IMPORT jobs
as follows:
----
SELECT /*+ NO_INDEX(t) */ * FROM employees;
----
The default hint is +NO_INDEX(t)+.
[NOTE]
================================================================================
- The hint can be added to the command line. See "Import Data from Oracle" for
more information.
- See the Oracle Database Performance Tuning Guide (Using Optimizer Hints)
for more information on Oracle optimizer hints.
- To turn the hint off, insert a space between the <value> elements.
+
----
<property>
<name>oraoop.import.hint</name>
<value> </value>
</property>
----
================================================================================
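As a sketch, the hint can also be supplied with the generic +-D+ argument on
the command line; the connection details below are placeholders and the exact
syntax is covered in "Import Data from Oracle":

----
$ sqoop import -D oraoop.import.hint="NO_INDEX(t)" \
    --direct --connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
    --username SQOOP -P --table EMPLOYEES
----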
oraoop.oracle.append.values.hint.usage
++++++++++++++++++++++++++++++++++++++
The value of this property is one of: AUTO / ON / OFF.
AUTO::
AUTO is the default value.
+
Currently AUTO is equivalent to OFF.
ON::
During export the Data Connector for Oracle and Hadoop uses direct path
writes to populate the target Oracle table, bypassing the buffer cache.
Oracle only allows a single session to perform direct writes against a specific
table at any time, so this has the effect of serializing the writes to the
table. This may reduce throughput, especially if the number of mappers is high.
However, for databases where DBWR is very busy, or where the IO bandwidth to
the underlying table is narrow (for instance, the table resides on a single
disk spindle), setting +oraoop.oracle.append.values.hint.usage+ to ON may
reduce the load on the Oracle database and possibly increase throughput.
OFF::
During export the Data Connector for Oracle and Hadoop does not use the
+APPEND_VALUES+ Oracle hint.
NOTE: This parameter is only effective on Oracle 11g Release 2 and above.
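A sketch of turning the hint on for an export job; the connection details,
table and export directory are placeholders:

----
$ sqoop export -D oraoop.oracle.append.values.hint.usage=ON \
    --direct --connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
    --username SQOOP -P --table EMPLOYEES \
    --export-dir /user/sqoop/employees
----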
mapred.map.tasks.speculative.execution
++++++++++++++++++++++++++++++++++++++
By default speculative execution is disabled for the Data Connector for
Oracle and Hadoop. This avoids placing redundant load on the Oracle database.
If speculative execution is enabled, Hadoop may initiate multiple mappers
to read the same blocks of data, increasing the overall load on the database.
oraoop.block.allocation
+++++++++++++++++++++++
This setting determines how Oracle's data-blocks are assigned to Map-Reduce mappers.
NOTE: Applicable to import. Not applicable to export.
ROUNDROBIN (default)::
Each chunk of Oracle blocks is allocated to the mappers in a round-robin
manner. This helps prevent one of the mappers from being
allocated a large proportion of typically small-sized blocks from the
start of Oracle data-files. In doing so it also helps prevent one of the
other mappers from being allocated a large proportion of typically
larger-sized blocks from the end of the Oracle data-files.
+
Use this method to help ensure all the mappers are allocated a similar
amount of work.
RANDOM::
The list of Oracle blocks is randomized before being allocated to the
mappers via a round-robin approach. This has the benefit of increasing
the chance that, at any given instant in time, each mapper is reading
from a different Oracle data-file. If the Oracle data-files are located on
separate spindles, this should increase the overall IO throughput.
SEQUENTIAL::
Each chunk of Oracle blocks is allocated to the mappers sequentially.
This tends to make each mapper sequentially read a large, contiguous
section of an Oracle data-file. The performance of this method is unlikely
to exceed that of the round-robin method, and it is more likely to produce
a large imbalance in the work allocated to the mappers.
+
Use of this method is generally not recommended.
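For example, to select the RANDOM method described above for an import (the
connection details are placeholders):

----
$ sqoop import -D oraoop.block.allocation=RANDOM \
    --direct --connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
    --username SQOOP -P --table EMPLOYEES
----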
oraoop.import.omit.lobs.and.long
++++++++++++++++++++++++++++++++
This setting can be used to omit all LOB columns (BLOB, CLOB and NCLOB) and
LONG columns from an Oracle table being imported. This is advantageous in
troubleshooting, as it provides a convenient way to exclude all LOB-based data
from the import.
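A sketch of enabling this setting for an import, assuming the property accepts
a boolean value (the connection details are placeholders):

----
$ sqoop import -D oraoop.import.omit.lobs.and.long=true \
    --direct --connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
    --username SQOOP -P --table EMPLOYEES
----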
oraoop.locations
++++++++++++++++
NOTE: Applicable to import. Not applicable to export.
By default, four mappers are used for a Sqoop import job. The number of mappers
can be altered via the Sqoop +--num-mappers+ parameter.
If the data nodes in your Hadoop cluster have 4 task slots (that is, they are
4-CPU-core machines) it is likely that all four mappers will execute on the
same machine. Therefore, IO may be concentrated between the Oracle database
and a single machine.
This setting allows you to control which DataNodes in your Hadoop cluster each
mapper executes on. By assigning each mapper to a separate machine you may
improve the overall IO performance for the job. This also has the
side-effect of spreading the imported data more evenly across the machines in
the cluster. (HDFS replication spreads the data across the cluster anyway.)
Specify the machine names as a comma separated list. The locations are
allocated to each of the mappers in a round-robin manner.
If using EC2, specify the internal name of the machines. Here is an example
of using this parameter from the Sqoop command-line:
$ +sqoop import -D
oraoop.locations=ip-10-250-23-225.ec2.internal,ip-10-250-107-32.ec2.internal,ip-10-250-207-2.ec2.internal,ip-10-250-27-114.ec2.internal
--direct --connect...+
sqoop.connection.factories
++++++++++++++++++++++++++
This setting determines behavior if the Data Connector for Oracle and Hadoop
cannot accept the job. By default Sqoop accepts the jobs that the Data Connector
for Oracle and Hadoop rejects.
Set the value to +org.apache.sqoop.manager.oracle.OraOopManagerFactory+ when you
want the job to fail if the Data Connector for Oracle and Hadoop cannot
accept the job.
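As a sketch, assuming you want every job to fail rather than fall back to
another connection manager, the property could be placed in +sqoop-site.xml+:

----
<property>
  <name>sqoop.connection.factories</name>
  <value>org.apache.sqoop.manager.oracle.OraOopManagerFactory</value>
</property>
----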
Expressions in oraoop-site.xml
++++++++++++++++++++++++++++++
Text contained within curly braces { and } is an expression that is evaluated
before the SQL statement is executed. The expression contains the name of a
configuration property, optionally followed by a default value to use if the
property has not been set. A pipe | character delimits the property name and
the default value.
For example:
When this Sqoop command is executed::
$ +sqoop import -D oracle.sessionTimeZone=US/Hawaii --direct --connect+
The statement within oraoop-site.xml::
+alter session set time_zone = \'{oracle.sessionTimeZone|GMT}';+
Becomes::
+alter session set time_zone = \'US/Hawaii'+
If the oracle.sessionTimeZone property had not been set, then this statement would use the specified default value and would become::
+alter session set time_zone = \'GMT'+
NOTE: The +oracle.sessionTimeZone+ property can be specified within the
+sqoop-site.xml+ file if you want this setting to be used all the time.
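For example, a persistent setting could be added to +sqoop-site.xml+ along
these lines (the time zone value is only an illustration):

----
<property>
  <name>oracle.sessionTimeZone</name>
  <value>US/Hawaii</value>
</property>
----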
Troubleshooting The Data Connector for Oracle and Hadoop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Quote Oracle Owners And Tables
++++++++++++++++++++++++++++++
[frame="topbot",cols="2*v"]
|===============================================================================
|If the owner of the Oracle table needs to be
quoted, use:
|$ +sqoop import ... --table
"\"\"Scott\".customers\""+
This is the equivalent of:
"Scott".customers
|If the Oracle table needs to be quoted, use:
|$ +sqoop import ... --table
"\"scott.\"Customers\"\""+
This is the equivalent of:
scott."Customers"
|If both the owner of the Oracle table and the
table itself needs to be quoted, use:
|$ +sqoop import ... --table
"\"\"Scott\".\"Customers\"\""+
This is the equivalent of:
"Scott"."Customers"
|===============================================================================
[NOTE]
================================================================================
- The HDFS output directory is called something like:
/user/username/"Scott"."Customers"
- If a table name contains a $ character, it may need to be escaped within your
Unix shell. For example, the dr$object table in the ctxsys schema would be
referred to as: $ +sqoop import ... --table "ctxsys.dr\$object"+
================================================================================
Quote Oracle Columns
++++++++++++++++++++
If a column name of an Oracle table needs to be quoted, use::
$ +sqoop import ... --table customers --columns "\"\"first name\"\""+
+
This is the equivalent of: `select "first name" from customers`
If multiple columns need to be quoted, use::
$ +sqoop import ... --table customers --columns "\"\"first name\",\"last
name\",\"region name\"\""+
+
This is the equivalent of: `select "first name", "last name", "region name"
from customers`
Confirm The Data Connector for Oracle and Hadoop Can Initialize The Oracle Session
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
If the Sqoop output includes feedback such as the following, then the
configuration properties contained within +oraoop-site-template.xml+ and
+oraoop-site.xml+ have been loaded by Hadoop and can be accessed by the Data
Connector for Oracle and Hadoop.
+14/07/08 15:21:13 INFO oracle.OracleConnectionFactory:
Initializing Oracle session with SQL+
Check The Sqoop Debug Logs for Error Messages
+++++++++++++++++++++++++++++++++++++++++++++
For more information about any errors encountered during the Sqoop import,
refer to the log files generated by each of the (by default 4) mappers that
performed the import.
The logs can be obtained via your Map-Reduce Job Tracker's web page.
Include these log files with any requests you make for assistance on the Sqoop
User Group web site.
Export: Check Tables Are Compatible
+++++++++++++++++++++++++++++++++++
Check that the tables are compatible, particularly in the case of a parsing error.
- Ensure the fields contained within the HDFS file and the columns within the
Oracle table are identical. If they are not identical, the Java code
dynamically generated by Sqoop to parse the HDFS file will throw an error when
reading the file, causing the export to fail. When creating a table in Oracle,
ensure the definitions for the table template are identical to the definitions
for the HDFS file.
- Ensure the data types in the table are supported. See "Supported Data Types"
for more information.
- Are date and time zone based data types used? See "Export Date And Timestamp
Data Types into Oracle" for more information.
Export: Parallelization
+++++++++++++++++++++++
If you see a parallelization error, you may decide to disable parallelization
on Oracle queries by specifying the following argument:

+-D oraoop.export.oracle.parallelization.enabled=false+
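A sketch of an export with parallelization disabled; the connection details,
table and export directory are placeholders:

----
$ sqoop export -D oraoop.export.oracle.parallelization.enabled=false \
    --direct --connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
    --username SQOOP -P --table EMPLOYEES \
    --export-dir /user/sqoop/employees
----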
Export: Check oraoop.oracle.append.values.hint.usage
++++++++++++++++++++++++++++++++++++++++++++++++++++
The oraoop.oracle.append.values.hint.usage parameter should not be set to ON
if the Oracle table contains either a BINARY_DOUBLE or BINARY_FLOAT column and
the HDFS file being exported contains a NULL value in either of these column
types. Doing so will result in the error: +ORA-12838: cannot read/modify an
object after modifying it in parallel+.
Turn On Verbose
+++++++++++++++
Turn on verbose output on the Sqoop command line:
+--verbose+
Check Sqoop stdout (standard output) and the mapper logs for information as to
where the problem may be.
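A sketch of a verbose import invocation (the connection details are
placeholders):

----
$ sqoop import --verbose \
    --direct --connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
    --username SQOOP -P --table EMPLOYEES
----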