////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
+sqoop-import+
--------------
Purpose
~~~~~~~
include::import-purpose.txt[]
Syntax
~~~~~~
----
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
----
While the Hadoop generic arguments must precede any import arguments,
you can type the import arguments in any order with respect to one
another.
NOTE: In this document, arguments are grouped into collections
organized by function. Some collections are present in several tools
(for example, the "common" arguments). An extended description of their
functionality is given only on the first presentation in this
document.
include::common-args.txt[]
include::connecting.txt[]
include::validation-args.txt[]
.Import control arguments:
[grid="all"]
`---------------------------------`--------------------------------------
Argument Description
-------------------------------------------------------------------------
+\--append+ Append data to an existing dataset\
in HDFS
+\--as-avrodatafile+ Imports data to Avro Data Files
+\--as-sequencefile+ Imports data to SequenceFiles
+\--as-textfile+ Imports data as plain text (default)
+\--as-parquetfile+ Imports data to Parquet Files
+\--boundary-query <statement>+ Boundary query to use for creating splits
+\--columns <col,col,col...>+ Columns to import from table
+\--delete-target-dir+ Delete the import target directory\
if it exists
+\--direct+ Use direct connector if it exists for the database
+\--fetch-size <n>+ Number of entries to read from database\
at once.
+\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
+-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel
+-e,\--query <statement>+ Import the results of '+statement+'.
+\--split-by <column-name>+ Column of the table used to split work\
units. Cannot be used with\
+--autoreset-to-one-mapper+ option.
+\--split-limit <n>+ Upper Limit for each split size.\
This only applies to Integer and Date columns.\
For date or timestamp fields it is calculated in seconds.
+\--autoreset-to-one-mapper+ Import should use one mapper if a table\
has no primary key and no split-by column\
is provided. Cannot be used with\
+--split-by <col>+ option.
+\--table <table-name>+ Table to read
+\--target-dir <dir>+ HDFS destination dir
+\--temporary-rootdir <dir>+ HDFS directory for temporary files created during import (overrides default "_sqoop")
+\--warehouse-dir <dir>+ HDFS parent for table destination
+\--where <where clause>+ WHERE clause to use during import
+-z,\--compress+ Enable compression
+\--compression-codec <c>+ Use Hadoop codec (default gzip)
+--null-string <null-string>+ The string to be written for a null\
value for string columns
+--null-non-string <null-string>+ The string to be written for a null\
value for non-string columns
-------------------------------------------------------------------------
The +\--null-string+ and +\--null-non-string+ arguments are optional.
If not specified, then the string "null" will be used.
Selecting the Data to Import
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop typically imports data in a table-centric fashion. Use the
+\--table+ argument to select the table to import. For example, +\--table
employees+. This argument can also identify a +VIEW+ or other table-like
entity in a database.
By default, all columns within a table are selected for import.
Imported data is written to HDFS in its "natural order;" that is, a
table containing columns A, B, and C results in an import of data such
as:
----
A1,B1,C1
A2,B2,C2
...
----
You can select a subset of columns and control their ordering by using
the +\--columns+ argument. This should include a comma-delimited list
of columns to import. For example: +\--columns "name,employee_id,jobtitle"+.
You can control which rows are imported by adding a SQL +WHERE+ clause
to the import statement. By default, Sqoop generates statements of the
form +SELECT <column list> FROM <table name>+. You can append a
+WHERE+ clause to this with the +\--where+ argument. For example: +\--where
"id > 400"+. Only rows where the +id+ column has a value greater than
400 will be imported.
By default, Sqoop uses the query +select min(<split-by>), max(<split-by>) from
<table name>+ to find boundaries for creating splits. In some cases this query
is not optimal, so you can specify any arbitrary query returning two
numeric columns using the +\--boundary-query+ argument.
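For example, a custom boundary query for a hypothetical +employees+ table
split on its +id+ column might look like this (the connect string and table
are illustrative):
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table employees \
    --split-by id \
    --boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE id > 0"
----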
Free-form Query Imports
^^^^^^^^^^^^^^^^^^^^^^^
Sqoop can also import the result set of an arbitrary SQL query. Instead of
using the +\--table+, +\--columns+ and +\--where+ arguments, you can specify
a SQL statement with the +\--query+ argument.
When importing a free-form query, you must specify a destination directory
with +\--target-dir+.
If you want to import the results of a query in parallel, then each map task
will need to execute a copy of the query, with results partitioned by bounding
conditions inferred by Sqoop. Your query must include the token +$CONDITIONS+
which each Sqoop process will replace with a unique condition expression.
You must also select a splitting column with +\--split-by+.
For example:
----
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
----
Alternately, the query can be executed once and imported serially, by
specifying a single map task with +-m 1+:
----
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
----
NOTE: If you are issuing the query wrapped with double quotes ("),
you will have to use +\$CONDITIONS+ instead of just +$CONDITIONS+
to prevent your shell from treating it as a shell variable.
For example, a double quoted query may look like:
+"SELECT * FROM x WHERE a=\'foo' AND \$CONDITIONS"+
NOTE: The facility of using free-form queries in the current version of Sqoop
is limited to simple queries where there are no ambiguous projections and
no +OR+ conditions in the +WHERE+ clause. Use of complex queries, such as
queries that have sub-queries or joins leading to ambiguous projections, can
lead to unexpected results.
Controlling Parallelism
^^^^^^^^^^^^^^^^^^^^^^^
Sqoop imports data in parallel from most database sources. You can
specify the number
of map tasks (parallel processes) to use to perform the import by
using the +-m+ or +\--num-mappers+ argument. Each of these arguments
takes an integer value which corresponds to the degree of parallelism
to employ. By default, four tasks are used. Some databases may see
improved performance by increasing this value to 8 or 16. Do not
increase the degree of parallelism beyond that available within
your MapReduce cluster; tasks will run serially and will likely
increase the amount of time required to perform the import. Likewise,
do not increase the degree of parallelism beyond what your
database can reasonably support. Connecting 100 concurrent clients to
your database may increase the load on the database server to a point
where performance suffers as a result.
When performing parallel imports, Sqoop needs a criterion by which it
can split the workload. Sqoop uses a _splitting column_ to split the
workload. By default, Sqoop will identify the primary key column (if
present) in a table and use it as the splitting column. The low and
high values for the splitting column are retrieved from the database,
and the map tasks operate on evenly-sized components of the total
range. For example, if you had a table with a primary key column of
+id+ whose minimum value was 0 and maximum value was 1000, and Sqoop
was directed to use 4 tasks, Sqoop would run four processes which each
execute SQL statements of the form +SELECT * FROM sometable WHERE id
>= lo AND id < hi+, with +(lo, hi)+ set to (0, 250), (250, 500),
(500, 750), and (750, 1001) in the different tasks.
If the actual values for the primary key are not uniformly distributed
across its range, then this can result in unbalanced tasks. You should
explicitly choose a different column with the +\--split-by+ argument.
For example, +\--split-by employee_id+. Sqoop cannot currently split on
multi-column indices. If your table has no index column, or has a
multi-column key, then you must also manually choose a splitting
column.
You can override the split size implied by +\--num-mappers+ with the
+\--split-limit+ option, which places an upper limit on the size of each
split. If the split size calculated from the +\--num-mappers+ parameter is
larger than this limit, the splits are resized to fit within it and the
number of splits changes accordingly; as a result, the actual number of
mappers is increased. If the value specified for +\--split-limit+ is 0 or
negative, the parameter is ignored altogether and the split size is
calculated from the number of mappers.
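For example, the following sketch (reusing the hypothetical +EMPLOYEES+ table
from the examples later in this chapter) caps each split at 10000 +id+ values;
if the four requested mappers would produce larger splits, additional mappers
are used instead:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by id --num-mappers 4 --split-limit 10000
----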
If a table does not have a primary key defined and the +--split-by <col>+
option is not provided, then the import will fail unless the number
of mappers is explicitly set to one with the +--num-mappers 1+ option
or the +--autoreset-to-one-mapper+ option is used. The option
+--autoreset-to-one-mapper+ is typically used with the import-all-tables
tool to automatically handle tables without a primary key in a schema.
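For example, a sketch of an import of a hypothetical table +AUDIT_LOG+ that
has no primary key, falling back to a single mapper when no split column can
be determined:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table AUDIT_LOG \
    --autoreset-to-one-mapper
----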
include::distributed-cache.txt[]
Controlling the Import Process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, the import process will use JDBC which provides a
reasonable cross-vendor import channel. Some databases can perform
imports in a more high-performance fashion by using database-specific
data movement tools. For example, MySQL provides the +mysqldump+ tool
which can export data from MySQL to other systems very quickly. By
supplying the +\--direct+ argument, you are specifying that Sqoop
should attempt the direct import channel. This channel may be
higher performance than using JDBC.
Details about use of direct mode with each specific RDBMS, installation requirements, available
options and limitations can be found in <<connectors>>.
By default, Sqoop will import a table named +foo+ to a directory named
+foo+ inside your home directory in HDFS. For example, if your
username is +someuser+, then the import tool will write to
+/user/someuser/foo/(files)+. You can adjust the parent directory of
the import with the +\--warehouse-dir+ argument. For example:
----
$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
...
----
This command would write to a set of files in the +/shared/foo/+ directory.
You can also explicitly choose the target directory, like so:
----
$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
...
----
This will import the files into the +/dest+ directory. +\--target-dir+ is
incompatible with +\--warehouse-dir+.
When using direct mode, you can specify additional arguments which
should be passed to the underlying tool. If the argument
+\--+ is given on the command-line, then subsequent arguments are sent
directly to the underlying tool. For example, the following adjusts
the character set used by +mysqldump+:
----
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
----
By default, imports go to a new target location. If the destination directory
already exists in HDFS, Sqoop will refuse to import and overwrite that
directory's contents. If you use the +\--append+ argument, Sqoop will import
data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.
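For example, a sketch that appends the results of a second import into an
already-populated target directory (the table and path are illustrative):
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --target-dir /dest --append
----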
Controlling transaction isolation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, Sqoop uses the read committed transaction isolation level in the
mappers to import data. This may not be ideal in all ETL workflows, and it may
be desirable to reduce the isolation guarantees. The +\--relaxed-isolation+
option can be used to instruct Sqoop to use the read uncommitted isolation level.
The +read-uncommitted+ isolation level is not supported on all databases
(for example, Oracle), so specifying the option +\--relaxed-isolation+
may not be supported on all databases.
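For example, a sketch of an import that relaxes the isolation level, assuming
the database supports read uncommitted:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --relaxed-isolation
----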
Controlling type mapping
^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop is preconfigured to map most SQL types to appropriate Java or Hive
representations. However, the default mapping might not be suitable for
everyone and can be overridden with +--map-column-java+ (for changing the
mapping to Java) or +--map-column-hive+ (for changing the Hive mapping).
.Parameters for overriding mapping
[grid="all"]
`---------------------------------`--------------------------------------
Argument Description
-------------------------------------------------------------------------
+\--map-column-java <mapping>+ Override mapping from SQL to Java type\
for configured columns.
+\--map-column-hive <mapping>+ Override mapping from SQL to Hive type\
for configured columns.
-------------------------------------------------------------------------
Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:
----
$ sqoop import ... --map-column-java id=String,value=Integer
----
Note that to specify commas in the --map-column-hive option, you should use URL-encoded
keys and values; for example, use DECIMAL(1%2C%201) instead of DECIMAL(1, 1).
Sqoop will raise an exception if a configured mapping is not used.
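For example, a sketch of a Hive type override for a hypothetical +amount+
column, with the comma inside the type URL-encoded:
----
$ sqoop import ... --map-column-hive amount=DECIMAL(10%2C%202)
----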
Schema name handling
^^^^^^^^^^^^^^^^^^^^
When Sqoop imports data from an enterprise store, table and column names
may contain characters that are not valid Java identifier characters or
Avro/Parquet identifiers. To address this, Sqoop translates these characters
to _ as part of the schema creation. Any column name starting with an _
(underscore) character will be translated to have two underscore characters.
For example _AVRO will be converted to __AVRO.
In the case of HCatalog imports, column names are converted to lower case when
mapped to HCatalog columns. This may change in the future.
Incremental Imports
^^^^^^^^^^^^^^^^^^^
Sqoop provides an incremental import mode which can be used to retrieve
only rows newer than some previously-imported set of rows.
The following arguments control incremental imports:
.Incremental import arguments:
[grid="all"]
`-----------------------------`--------------------------------------
Argument Description
---------------------------------------------------------------------
+\--check-column (col)+ Specifies the column to be examined \
when determining which rows to import.\
(the column should not be of type \
CHAR/NCHAR/VARCHAR/VARNCHAR/\
LONGVARCHAR/LONGNVARCHAR)
+\--incremental (mode)+ Specifies how Sqoop determines which \
rows are new. Legal values for +mode+\
include +append+ and +lastmodified+.
+\--last-value (value)+ Specifies the maximum value of the \
check column from the previous import.
---------------------------------------------------------------------
Sqoop supports two types of incremental imports: +append+ and +lastmodified+.
You can use the +\--incremental+ argument to specify the type of incremental
import to perform.
You should specify +append+ mode when importing a table where new rows are
continually being added with increasing row id values. You specify the column
containing the row's id with +\--check-column+. Sqoop imports rows where the
check column has a value greater than the one specified with +\--last-value+.
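For example, a sketch of an append-mode incremental import that retrieves only
rows of the hypothetical +EMPLOYEES+ table whose +id+ exceeds the last value
recorded from a previous run:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --incremental append --check-column id --last-value 100000
----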
An alternate table update strategy supported by Sqoop is called +lastmodified+
mode. You should use this when rows of the source table may be updated, and
each such update will set the value of a last-modified column to the current
timestamp. Rows where the check column holds a timestamp more recent than the
timestamp specified with +\--last-value+ are imported.
At the end of an incremental import, the value which should be specified as
+\--last-value+ for a subsequent import is printed to the screen. When running
a subsequent import, you should specify +\--last-value+ in this way to ensure
you import only the new or updated data. This is handled automatically by
creating an incremental import as a saved job, which is the preferred
mechanism for performing a recurring incremental import. See the section on
saved jobs later in this document for more information.
File Formats
^^^^^^^^^^^^
You can import data in one of two file formats: delimited text or
SequenceFiles.
Delimited text is the default import format. You can also specify it
explicitly by using the +\--as-textfile+ argument. This argument will write
string-based representations of each record to the output files, with
delimiter characters between individual columns and rows. These
delimiters may be commas, tabs, or other characters. (The delimiters
can be selected; see "Output line formatting arguments.") The
following is the result of an example text-based import:
----
1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12
----
Delimited text is appropriate for most non-binary data types. It also
readily supports further manipulation by other tools, such as Hive.
SequenceFiles are a binary format that store individual records in
custom record-specific data types. These data types are manifested as
Java classes. Sqoop will automatically generate these data types for
you. This format supports exact storage of all data in binary
representations, and is appropriate for storing binary data
(for example, +VARBINARY+ columns), or data that will be principally
manipulated by custom MapReduce programs (reading from SequenceFiles
is higher-performance than reading from text files, as records do not
need to be parsed).
Avro data files are a compact, efficient binary format that provides
interoperability with applications written in other programming
languages. Avro also supports versioning, so that when, e.g., columns
are added or removed from a table, previously imported data files can
be processed along with new ones.
By default, data is not compressed. You can compress your data by
using the deflate (gzip) algorithm with the +-z+ or +\--compress+
argument, or specify any Hadoop compression codec using the
+\--compression-codec+ argument. This applies to SequenceFile, text,
and Avro files.
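For example, a sketch that compresses the imported files with Hadoop's bundled
Snappy codec (any installed codec class can be substituted):
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
----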
Enabling Logical Types in Avro and Parquet import for numbers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To enable the use of logical types in Sqoop's avro schema generation
(which is used during both avro and parquet imports), one has to set the
sqoop.avro.logical_types.decimal.enable flag to true. This is necessary if one
wants to store values as decimals in the avro file format.
Padding number types in avro import
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Certain databases, such as Oracle and Postgres, store number and decimal
values without padding. For example, 1.5 in a column declared
as NUMBER (20,5) is stored as is in Oracle, while the equivalent
DECIMAL (20, 5) is stored as 1.50000 in a SQL Server instance.
This leads to a scale mismatch during avro import.
To avoid this error, one can use the sqoop.avro.decimal_padding.enable flag
to turn on padding with 0s. This flag has to be used together with the
sqoop.avro.logical_types.decimal.enable flag set to true.
Default precision and scale in avro import
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All of the databases allow users to specify numeric columns without
a precision or scale. While MS SQL and MySQL translate these into
valid precision and scale values, Oracle and Postgres don't.
Therefore, when a table contains a NUMBER column in Oracle or
a NUMERIC/DECIMAL column in Postgres, one can specify a default precision and scale
to be used in the avro schema by using the +sqoop.avro.logical_types.decimal.default.precision+
and +sqoop.avro.logical_types.decimal.default.scale+ flags.
Avro padding also has to be enabled, if the values are shorter than
the specified default scale.
Large Objects
^^^^^^^^^^^^^
Sqoop handles large objects (+BLOB+ and +CLOB+ columns) in particular
ways. If this data is truly large, then these columns should not be
fully materialized in memory for manipulation, as most columns are.
Instead, their data is handled in a streaming fashion. Large objects
can be stored inline with the rest of the data, in which case they are
fully materialized in memory on every access, or they can be stored in
a secondary storage file linked to the primary data storage. By
default, large objects less than 16 MB in size are stored inline with
the rest of the data. At a larger size, they are stored in files in
the +_lobs+ subdirectory of the import target directory. These files
are stored in a separate format optimized for large record storage,
which can accommodate records of up to 2^63 bytes each. The size at
which lobs spill into separate files is controlled by the
+\--inline-lob-limit+ argument, which takes a parameter specifying the
largest lob size to keep inline, in bytes. If you set the inline LOB
limit to 0, all large objects will be placed in external
storage.
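For example, a sketch that forces all large objects into external storage under
the +_lobs+ subdirectory by setting the inline limit to 0:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --inline-lob-limit 0
----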
include::output-args.txt[]
When importing to delimited files, the choice of delimiter is
important. Delimiters which appear inside string-based fields may
cause ambiguous parsing of the imported data by subsequent analysis
passes. For example, the string +"Hello, pleased to meet you"+ should
not be imported with the end-of-field delimiter set to a comma.
Delimiters may be specified as:
- a character (+\--fields-terminated-by X+)
- an escape character (+\--fields-terminated-by \t+). Supported escape
characters are:
* +\b+ (backspace)
* +\n+ (newline)
* +\r+ (carriage return)
* +\t+ (tab)
* +\"+ (double-quote)
* +\\'+ (single-quote)
* +\\+ (backslash)
* +\0+ (NUL) - This will insert NUL characters between fields or lines,
or will disable enclosing/escaping if used for one of the +\--enclosed-by+,
+\--optionally-enclosed-by+, or +\--escaped-by+ arguments.
- The octal representation of a UTF-8 character's code point. This
should be of the form +\0ooo+, where _ooo_ is the octal value.
For example, +\--fields-terminated-by \001+ would yield the +^A+ character.
- The hexadecimal representation of a UTF-8 character's code point. This
should be of the form +\0xhhh+, where _hhh_ is the hex value.
For example, +\--fields-terminated-by \0x0d+ would yield the carriage
return character.
The default delimiters are a comma (+,+) for fields, a newline (+\n+) for records, no quote
character, and no escape character. Note that this can lead to
ambiguous/unparsable records if you import database records containing
commas or newlines in the field data. For unambiguous parsing, both enclosing
and escaping must be enabled; for example, via +\--mysql-delimiters+.
If unambiguous delimiters cannot be presented, then use _enclosing_ and
_escaping_ characters. The combination of (optional)
enclosing and escaping characters will allow unambiguous parsing of
lines. For example, suppose one column of a dataset contained the
following values:
----
Some string, with a comma.
Another "string with quotes"
----
The following arguments would provide delimiters which can be
unambiguously parsed:
----
$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...
----
(Note that to prevent the shell from mangling the enclosing character,
we have enclosed that argument itself in single-quotes.)
The result of the above arguments applied to the above dataset would
be:
----
"Some string, with a comma.","1","2","3"...
"Another \"string with quotes\"","4","5","6"...
----
Here the imported strings are shown in the context of additional
columns (+"1","2","3"+, etc.) to demonstrate the full effect of enclosing
and escaping. The enclosing character is only strictly necessary when
delimiter characters appear in the imported text. The enclosing
character can therefore be specified as optional:
----
$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...
----
Which would result in the following import:
----
"Some string, with a comma.",1,2,3...
"Another \"string with quotes\"",4,5,6...
----
NOTE: Even though Hive supports escaping characters, it does not
handle escaping of the new-line character. Also, it does not support
the notion of enclosing characters that may include field delimiters
in the enclosed string. It is therefore recommended that you choose
unambiguous field and record-terminating delimiters without the help
of escaping and enclosing characters when working with Hive; this is
due to limitations of Hive's input parsing abilities.
The +\--mysql-delimiters+ argument is a shorthand argument which uses
the default delimiters for the +mysqldump+ program.
If you use the +mysqldump+ delimiters in conjunction with a
direct-mode import (with +\--direct+), very fast imports can be
achieved.
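For example, a sketch of a direct-mode MySQL import that adopts the
+mysqldump+ delimiters:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct --mysql-delimiters
----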
While the choice of delimiters is most important for a text-mode
import, it is still relevant if you import to SequenceFiles with
+\--as-sequencefile+. The generated class' +toString()+ method
will use the delimiters you specify, so subsequent formatting of
the output data will rely on the delimiters you choose.
include::input-args.txt[]
When Sqoop imports data to HDFS, it generates a Java class which can
reinterpret the text files that it creates when doing a
delimited-format import. The delimiters are chosen with arguments such
as +\--fields-terminated-by+; this controls both how the data is
written to disk, and how the generated +parse()+ method reinterprets
this data. The delimiters used by the +parse()+ method can be chosen
independently of the output arguments, by using
+\--input-fields-terminated-by+, and so on. This is useful, for example, to
generate classes which can parse records created with one set of
delimiters, and emit the records to a different set of files using a
separate set of delimiters.
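For example, a sketch of an invocation whose generated class writes
tab-separated output while its +parse()+ method reads comma-separated input
(the delimiter choices are illustrative):
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' --input-fields-terminated-by ','
----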
include::hive-args.txt[]
include::hive.txt[]
include::hbase-args.txt[]
include::hbase.txt[]
include::accumulo-args.txt[]
include::accumulo.txt[]
include::codegen-args.txt[]
As mentioned earlier, a byproduct of importing a table to HDFS is a
class which can manipulate the imported data. If the data is stored in
SequenceFiles, this class will be used for the data's serialization
container. Therefore, you should use this class in your subsequent
MapReduce processing of the data.
The class is typically named after the table; a table named +foo+ will
generate a class named +foo+. You may want to override this class
name. For example, if your table is named +EMPLOYEES+, you may want to
specify +\--class-name Employee+ instead. Similarly, you can specify
just the package name with +\--package-name+. The following import
generates a class named +com.foocorp.SomeTable+:
----
$ sqoop import --connect <connect-str> --table SomeTable --package-name com.foocorp
----
The +.java+ source file for your class will be written to the current
working directory when you run +sqoop+. You can control the output
directory with +\--outdir+. For example, +\--outdir src/generated/+.
The import process compiles the source into +.class+ and +.jar+ files;
these are ordinarily stored under +/tmp+. You can select an alternate
target directory with +\--bindir+. For example, +\--bindir /scratch+.
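For example, a sketch that redirects both the generated source and the
compiled artifacts (the paths are illustrative):
----
$ sqoop import --connect <connect-str> --table SomeTable \
    --outdir src/generated/ --bindir /scratch
----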
If you already have a compiled class that can be used to perform the
import and want to suppress the code-generation aspect of the import
process, you can use an existing jar and class by
providing the +\--jar-file+ and +\--class-name+ options. For example:
----
$ sqoop import --table SomeTable --jar-file mydatatypes.jar \
--class-name SomeTableType
----
This command will load the +SomeTableType+ class out of +mydatatypes.jar+.
Additional Import Configuration Properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are some additional properties which can be configured by modifying
+conf/sqoop-site.xml+. Properties can be specified the same as in Hadoop
configuration files, for example:
----
<property>
<name>property.name</name>
<value>property.value</value>
</property>
----
They can also be specified on the command line in the generic arguments, for
example:
----
sqoop import -D property.name=property.value ...
----
.Additional import configuration properties:
[grid="all"]
`-------------------------------------`----------------------------------------
Argument Description
-------------------------------------------------------------------------------
+sqoop.bigdecimal.format.string+ Controls how BigDecimal columns will \
be formatted when stored as a String. A \
value of +true+ (default) will use \
toPlainString to store them without an \
exponent component (0.0000001); while \
a value of +false+ will use toString \
which may include an exponent (1E-7)
+sqoop.hbase.add.row.key+ When set to +false+ (default), Sqoop \
will not add the column used as a row \
key into the row data in HBase. When \
set to +true+, the column used as a \
row key will be added to the row data \
in HBase.
-------------------------------------------------------------------------------
Example Invocations
~~~~~~~~~~~~~~~~~~~
The following examples illustrate how to use the import tool in a variety
of situations.
A basic import of a table named +EMPLOYEES+ in the +corp+ database:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
----
A basic import requiring a login:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--username SomeUser -P
Enter password: (hidden)
----
Selecting specific columns from the +EMPLOYEES+ table:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--columns "employee_id,first_name,last_name,job_title"
----
Controlling the import parallelism (using 8 parallel tasks):
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-m 8
----
Storing data in SequenceFiles, and setting the generated class name to
+com.foocorp.Employee+:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--class-name com.foocorp.Employee --as-sequencefile
----
Specifying the delimiters to use in a text-mode import:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--fields-terminated-by '\t' --lines-terminated-by '\n' \
--optionally-enclosed-by '\"'
----
Importing the data to Hive:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--hive-import
----
Importing only new employees:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--where "start_date > '2010-01-01'"
----
Changing the splitting column from the default:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--split-by dept_id
----
Verifying that an import was successful:
----
$ hadoop fs -ls EMPLOYEES
Found 5 items
drwxr-xr-x - someuser somegrp 0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs
-rw-r--r-- 1 someuser somegrp 2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000
-rw-r--r-- 1 someuser somegrp 1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001
-rw-r--r-- 1 someuser somegrp 7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002
-rw-r--r-- 1 someuser somegrp 7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003
$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
0,joe,smith,engineering
1,jane,doe,marketing
...
----
Performing an incremental import of new data, after having already
imported the first 100,000 rows of a table:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
--where "id > 100000" --target-dir /incremental_dataset --append
----
An import of a table named +EMPLOYEES+ in the +corp+ database that uses
validation to validate the import using the table row count and number of
rows copied into HDFS:
<<validation,More Details>>
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate
----
Enabling logical types in avro import and also turning on padding with 0s:
----
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true \
--connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS" \
--target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
----
Enabling logical types in avro import and also turning on padding with 0s, while specifying default precision and scale as well:
----
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true \
-Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10 \
--connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS" \
--target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
----