| //// |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| |
| |
| +sqoop-import+ |
| -------------- |
| |
| Purpose |
| ~~~~~~~ |
| |
| include::import-purpose.txt[] |
| |
| Syntax |
| ~~~~~~ |
| |
| ---- |
| $ sqoop import (generic-args) (import-args) |
| $ sqoop-import (generic-args) (import-args) |
| ---- |
| |
| While the Hadoop generic arguments must precede any import arguments, |
| you can type the import arguments in any order with respect to one |
| another. |
| |
| NOTE: In this document, arguments are grouped into collections |
| organized by function. Some collections are present in several tools |
| (for example, the "common" arguments). An extended description of their |
| functionality is given only on the first presentation in this |
| document. |
| |
| include::common-args.txt[] |
| |
| include::connecting.txt[] |
| |
| include::validation-args.txt[] |
| |
| .Import control arguments: |
| [grid="all"] |
| `---------------------------------`-------------------------------------- |
| Argument Description |
| ------------------------------------------------------------------------- |
| +\--append+ Append data to an existing dataset\ |
| in HDFS |
| +\--as-avrodatafile+ Imports data to Avro Data Files |
| +\--as-sequencefile+ Imports data to SequenceFiles |
| +\--as-textfile+ Imports data as plain text (default) |
| +\--as-parquetfile+ Imports data to Parquet Files |
| +\--boundary-query <statement>+ Boundary query to use for creating splits |
| +\--columns <col,col,col...>+ Columns to import from table |
| +\--delete-target-dir+ Delete the import target directory\ |
| if it exists |
| +\--direct+ Use direct connector if exists for the database |
| +\--fetch-size <n>+ Number of entries to read from database\ |
| at once. |
| +\--inline-lob-limit <n>+ Set the maximum size for an inline LOB |
| +-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel |
| +-e,\--query <statement>+ Import the results of '+statement+'. |
| +\--split-by <column-name>+ Column of the table used to split work\ |
| units. Cannot be used with\ |
| +--autoreset-to-one-mapper+ option. |
| +\--split-limit <n>+ Upper Limit for each split size.\ |
| This only applies to Integer and Date columns.\ |
| For date or timestamp fields it is calculated in seconds. |
| +\--autoreset-to-one-mapper+ Import should use one mapper if a table\ |
| has no primary key and no split-by column\ |
| is provided. Cannot be used with\ |
| +--split-by <col>+ option. |
| +\--table <table-name>+ Table to read |
| +\--target-dir <dir>+ HDFS destination dir |
| +\--temporary-rootdir <dir>+ HDFS directory for temporary files created during import (overrides default "_sqoop") |
| +\--warehouse-dir <dir>+ HDFS parent for table destination |
| +\--where <where clause>+ WHERE clause to use during import |
| +-z,\--compress+ Enable compression |
| +\--compression-codec <c>+ Use Hadoop codec (default gzip) |
| +--null-string <null-string>+ The string to be written for a null\ |
| value for string columns |
| +--null-non-string <null-string>+ The string to be written for a null\ |
| value for non-string columns |
| ------------------------------------------------------------------------- |
| |
The +\--null-string+ and +\--null-non-string+ arguments are optional.
If not specified, then the string "null" will be used.
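
For example, a minimal sketch (the connect string and table name are taken
from the examples later in this document) that writes the Hive-style +\N+
marker for null values in both string and non-string columns:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --null-string '\\N' --null-non-string '\\N'
----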
| |
| Selecting the Data to Import |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Sqoop typically imports data in a table-centric fashion. Use the |
| +\--table+ argument to select the table to import. For example, +\--table |
| employees+. This argument can also identify a +VIEW+ or other table-like |
| entity in a database. |
| |
| By default, all columns within a table are selected for import. |
| Imported data is written to HDFS in its "natural order;" that is, a |
table containing columns A, B, and C results in an import of data such
| as: |
| |
| ---- |
| A1,B1,C1 |
| A2,B2,C2 |
| ... |
| ---- |
| |
| You can select a subset of columns and control their ordering by using |
| the +\--columns+ argument. This should include a comma-delimited list |
| of columns to import. For example: +\--columns "name,employee_id,jobtitle"+. |
| |
| You can control which rows are imported by adding a SQL +WHERE+ clause |
| to the import statement. By default, Sqoop generates statements of the |
| form +SELECT <column list> FROM <table name>+. You can append a |
| +WHERE+ clause to this with the +\--where+ argument. For example: +\--where |
| "id > 400"+. Only rows where the +id+ column has a value greater than |
| 400 will be imported. |
| |
By default, Sqoop uses the query +select min(<split-by>), max(<split-by>) from
<table name>+ to find the boundaries for creating splits. In some cases this query
is not optimal, so you can specify any arbitrary query that returns two
numeric columns with the +\--boundary-query+ argument.
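
For example, a minimal sketch (the table, column, and predicate are
illustrative) that replaces the default boundary lookup with an explicit
query:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by id \
    --boundary-query "SELECT MIN(id), MAX(id) FROM EMPLOYEES WHERE id >= 0"
----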
| |
| Free-form Query Imports |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Sqoop can also import the result set of an arbitrary SQL query. Instead of |
| using the +\--table+, +\--columns+ and +\--where+ arguments, you can specify |
| a SQL statement with the +\--query+ argument. |
| |
| When importing a free-form query, you must specify a destination directory |
| with +\--target-dir+. |
| |
| If you want to import the results of a query in parallel, then each map task |
| will need to execute a copy of the query, with results partitioned by bounding |
| conditions inferred by Sqoop. Your query must include the token +$CONDITIONS+ |
| which each Sqoop process will replace with a unique condition expression. |
| You must also select a splitting column with +\--split-by+. |
| |
| For example: |
| |
| ---- |
| $ sqoop import \ |
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
| --split-by a.id --target-dir /user/foo/joinresults |
| ---- |
| |
| Alternately, the query can be executed once and imported serially, by |
| specifying a single map task with +-m 1+: |
| |
| ---- |
| $ sqoop import \ |
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
| -m 1 --target-dir /user/foo/joinresults |
| ---- |
| |
| NOTE: If you are issuing the query wrapped with double quotes ("), |
| you will have to use +\$CONDITIONS+ instead of just +$CONDITIONS+ |
to prevent your shell from treating it as a shell variable.
| For example, a double quoted query may look like: |
| +"SELECT * FROM x WHERE a=\'foo' AND \$CONDITIONS"+ |
| |
NOTE: The facility of using free-form queries in the current version of Sqoop
is limited to simple queries where there are no ambiguous projections and
no +OR+ conditions in the +WHERE+ clause. Use of complex queries, such as
queries that have sub-queries or joins leading to ambiguous projections, can
lead to unexpected results.
| |
| Controlling Parallelism |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Sqoop imports data in parallel from most database sources. You can |
| specify the number |
| of map tasks (parallel processes) to use to perform the import by |
| using the +-m+ or +\--num-mappers+ argument. Each of these arguments |
| takes an integer value which corresponds to the degree of parallelism |
| to employ. By default, four tasks are used. Some databases may see |
| improved performance by increasing this value to 8 or 16. Do not |
increase the degree of parallelism beyond that available within
your MapReduce cluster; tasks will run serially and will likely
increase the amount of time required to perform the import. Likewise,
do not increase the degree of parallelism beyond what your
database can reasonably support. Connecting 100 concurrent clients to
| your database may increase the load on the database server to a point |
| where performance suffers as a result. |
| |
| When performing parallel imports, Sqoop needs a criterion by which it |
| can split the workload. Sqoop uses a _splitting column_ to split the |
| workload. By default, Sqoop will identify the primary key column (if |
| present) in a table and use it as the splitting column. The low and |
| high values for the splitting column are retrieved from the database, |
| and the map tasks operate on evenly-sized components of the total |
| range. For example, if you had a table with a primary key column of |
| +id+ whose minimum value was 0 and maximum value was 1000, and Sqoop |
| was directed to use 4 tasks, Sqoop would run four processes which each |
| execute SQL statements of the form +SELECT * FROM sometable WHERE id |
| >= lo AND id < hi+, with +(lo, hi)+ set to (0, 250), (250, 500), |
| (500, 750), and (750, 1001) in the different tasks. |
| |
| If the actual values for the primary key are not uniformly distributed |
| across its range, then this can result in unbalanced tasks. You should |
| explicitly choose a different column with the +\--split-by+ argument. |
| For example, +\--split-by employee_id+. Sqoop cannot currently split on |
| multi-column indices. If your table has no index column, or has a |
| multi-column key, then you must also manually choose a splitting |
| column. |
| |
You can override the number of mappers determined by +\--num-mappers+ with the
+\--split-limit+ option. The +\--split-limit+ parameter places an upper limit on
the size of each split created. If the size of a split is larger than the size
specified in this parameter, the splits are resized to fit within this limit,
and the number of splits changes accordingly. This affects the actual number of
mappers: if the split size calculated from the +\--num-mappers+ parameter
exceeds the +\--split-limit+ value, the actual number of mappers is increased.
If the value specified for +\--split-limit+ is 0 or negative, the parameter is
ignored altogether and the split size is calculated according to the number of
mappers.
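
For example, a minimal sketch (the table, column, and limit are illustrative)
that requests 4 mappers but caps each split at 10000 values of +id+, so more
mappers are used if the +id+ range is wider than 40000:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by id --num-mappers 4 --split-limit 10000
----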
| |
If a table does not have a primary key defined and the +--split-by <col>+
option is not provided, then the import will fail unless the number
of mappers is explicitly set to one with the +--num-mappers 1+ option
| or the +--autoreset-to-one-mapper+ option is used. The option |
| +--autoreset-to-one-mapper+ is typically used with the import-all-tables |
| tool to automatically handle tables without a primary key in a schema. |
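
For example, a minimal sketch (the connect string and table name are
illustrative) that falls back to a single mapper if the table has no primary
key:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table SomeTable \
    --autoreset-to-one-mapper
----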
| |
| include::distributed-cache.txt[] |
| |
| Controlling the Import Process |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| By default, the import process will use JDBC which provides a |
| reasonable cross-vendor import channel. Some databases can perform |
| imports in a more high-performance fashion by using database-specific |
| data movement tools. For example, MySQL provides the +mysqldump+ tool |
| which can export data from MySQL to other systems very quickly. By |
| supplying the +\--direct+ argument, you are specifying that Sqoop |
| should attempt the direct import channel. This channel may be |
| higher performance than using JDBC. |
| |
| Details about use of direct mode with each specific RDBMS, installation requirements, available |
| options and limitations can be found in <<connectors>>. |
| |
| By default, Sqoop will import a table named +foo+ to a directory named |
| +foo+ inside your home directory in HDFS. For example, if your |
| username is +someuser+, then the import tool will write to |
| +/user/someuser/foo/(files)+. You can adjust the parent directory of |
| the import with the +\--warehouse-dir+ argument. For example: |
| |
| ---- |
| $ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \ |
| ... |
| ---- |
| |
| This command would write to a set of files in the +/shared/foo/+ directory. |
| |
| You can also explicitly choose the target directory, like so: |
| |
| ---- |
| $ sqoop import --connect <connect-str> --table foo --target-dir /dest \ |
| ... |
| ---- |
| |
| This will import the files into the +/dest+ directory. +\--target-dir+ is |
| incompatible with +\--warehouse-dir+. |
| |
| When using direct mode, you can specify additional arguments which |
| should be passed to the underlying tool. If the argument |
| +\--+ is given on the command-line, then subsequent arguments are sent |
| directly to the underlying tool. For example, the following adjusts |
| the character set used by +mysqldump+: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \ |
| --direct -- --default-character-set=latin1 |
| ---- |
| |
| By default, imports go to a new target location. If the destination directory |
| already exists in HDFS, Sqoop will refuse to import and overwrite that |
| directory's contents. If you use the +\--append+ argument, Sqoop will import |
| data to a temporary directory and then rename the files into the normal |
| target directory in a manner that does not conflict with existing filenames |
| in that directory. |
| |
| |
| Controlling transaction isolation |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
By default, Sqoop uses the read committed transaction isolation level in the
mappers to import data. This may not be ideal in all ETL workflows, and it may
be desirable to relax the isolation guarantees. The +\--relaxed-isolation+ option
can be used to instruct Sqoop to use the read uncommitted isolation level.

The +read-uncommitted+ isolation level is not supported by all databases
(for example, Oracle), so specifying the option +\--relaxed-isolation+
may not work with every database.
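
A minimal sketch (the connect string and table name are illustrative) that
instructs the mappers to read with uncommitted isolation:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --relaxed-isolation
----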
| |
| Controlling type mapping |
| ^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
Sqoop is preconfigured to map most SQL types to appropriate Java or Hive
types. However, the default mapping might not be suitable for
everyone and can be overridden with +--map-column-java+ (for changing the
mapping to Java) or +--map-column-hive+ (for changing the Hive mapping).
| |
| .Parameters for overriding mapping |
| [grid="all"] |
| `---------------------------------`-------------------------------------- |
| Argument Description |
| ------------------------------------------------------------------------- |
| +\--map-column-java <mapping>+ Override mapping from SQL to Java type\ |
| for configured columns. |
| +\--map-column-hive <mapping>+ Override mapping from SQL to Hive type\ |
| for configured columns. |
| ------------------------------------------------------------------------- |
| |
Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:
| |
| ---- |
| $ sqoop import ... --map-column-java id=String,value=Integer |
| ---- |
| |
Note that when specifying commas in the --map-column-hive option, you should use URL-encoded
keys and values; for example, use DECIMAL(1%2C%201) instead of DECIMAL(1, 1).
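
For example, a sketch (the column name and Hive type are illustrative) that
maps a column to Hive's +DECIMAL(10, 2)+ using the URL-encoded comma:

----
$ sqoop import ... --map-column-hive value=DECIMAL(10%2C%202)
----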
| |
Sqoop will raise an exception if a configured mapping is not used.
| |
| Schema name handling |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
When Sqoop imports data from an enterprise store, table and column names
may contain characters that are not valid Java identifier characters or
Avro/Parquet identifiers. To address this, Sqoop translates these characters
to _ as part of the schema creation. Any column name starting with an _
(underscore) character will be translated to have two underscore characters.
For example, _AVRO will be converted to __AVRO.

In the case of HCatalog imports, column names are converted to lower case when
mapped to HCatalog columns. This may change in a future release.
| |
| Incremental Imports |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| Sqoop provides an incremental import mode which can be used to retrieve |
| only rows newer than some previously-imported set of rows. |
| |
| The following arguments control incremental imports: |
| |
| |
| .Incremental import arguments: |
| [grid="all"] |
| `-----------------------------`-------------------------------------- |
| Argument Description |
| --------------------------------------------------------------------- |
| +\--check-column (col)+ Specifies the column to be examined \ |
| when determining which rows to import.\ |
| (the column should not be of type \ |
| CHAR/NCHAR/VARCHAR/VARNCHAR/\ |
| LONGVARCHAR/LONGNVARCHAR) |
| +\--incremental (mode)+ Specifies how Sqoop determines which \ |
| rows are new. Legal values for +mode+\ |
| include +append+ and +lastmodified+. |
| +\--last-value (value)+ Specifies the maximum value of the \ |
| check column from the previous import. |
| --------------------------------------------------------------------- |
| |
| |
| Sqoop supports two types of incremental imports: +append+ and +lastmodified+. |
| You can use the +\--incremental+ argument to specify the type of incremental |
| import to perform. |
| |
| You should specify +append+ mode when importing a table where new rows are |
| continually being added with increasing row id values. You specify the column |
| containing the row's id with +\--check-column+. Sqoop imports rows where the |
| check column has a value greater than the one specified with +\--last-value+. |
| |
| An alternate table update strategy supported by Sqoop is called +lastmodified+ |
| mode. You should use this when rows of the source table may be updated, and |
| each such update will set the value of a last-modified column to the current |
| timestamp. Rows where the check column holds a timestamp more recent than the |
| timestamp specified with +\--last-value+ are imported. |
| |
| At the end of an incremental import, the value which should be specified as |
| +\--last-value+ for a subsequent import is printed to the screen. When running |
| a subsequent import, you should specify +\--last-value+ in this way to ensure |
| you import only the new or updated data. This is handled automatically by |
| creating an incremental import as a saved job, which is the preferred |
| mechanism for performing a recurring incremental import. See the section on |
| saved jobs later in this document for more information. |
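
For example, a minimal sketch (the table, check column, and last value are
illustrative) of an +append+-mode incremental import that picks up rows whose
+id+ is greater than 100000:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --incremental append --check-column id --last-value 100000
----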
| |
| |
| |
| File Formats |
| ^^^^^^^^^^^^ |
| |
You can import data in several file formats: delimited text, SequenceFiles,
Avro data files, or Parquet files.
| |
| Delimited text is the default import format. You can also specify it |
| explicitly by using the +\--as-textfile+ argument. This argument will write |
| string-based representations of each record to the output files, with |
| delimiter characters between individual columns and rows. These |
| delimiters may be commas, tabs, or other characters. (The delimiters |
| can be selected; see "Output line formatting arguments.") The |
following is the result of an example text-based import:
| |
| ---- |
| 1,here is a message,2010-05-01 |
| 2,happy new year!,2010-01-01 |
| 3,another message,2009-11-12 |
| ---- |
| |
| Delimited text is appropriate for most non-binary data types. It also |
| readily supports further manipulation by other tools, such as Hive. |
| |
| SequenceFiles are a binary format that store individual records in |
| custom record-specific data types. These data types are manifested as |
| Java classes. Sqoop will automatically generate these data types for |
| you. This format supports exact storage of all data in binary |
| representations, and is appropriate for storing binary data |
(for example, +VARBINARY+ columns), or data that will be principally
| manipulated by custom MapReduce programs (reading from SequenceFiles |
| is higher-performance than reading from text files, as records do not |
| need to be parsed). |
| |
| Avro data files are a compact, efficient binary format that provides |
| interoperability with applications written in other programming |
| languages. Avro also supports versioning, so that when, e.g., columns |
| are added or removed from a table, previously imported data files can |
| be processed along with new ones. |
| |
| By default, data is not compressed. You can compress your data by |
| using the deflate (gzip) algorithm with the +-z+ or +\--compress+ |
| argument, or specify any Hadoop compression codec using the |
| +\--compression-codec+ argument. This applies to SequenceFile, text, |
| and Avro files. |
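
For example, a minimal sketch (the connect string and table name are
illustrative) that writes Avro data files compressed with the standard Hadoop
Snappy codec:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --as-avrodatafile --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
----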
| |
| Enabling Logical Types in Avro and Parquet import for numbers |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
To enable the use of logical types in Sqoop's Avro schema generation
(used during both Avro and Parquet imports), you have to use the
sqoop.avro.logical_types.decimal.enable flag. This is necessary if you
want to store values as decimals in the Avro file format.
| |
| Padding number types in avro import |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
Certain databases, such as Oracle and Postgres, store number and decimal
values without padding. For example, 1.5 in a column declared
as NUMBER(20,5) is stored as is in Oracle, while the equivalent
DECIMAL(20,5) is stored as 1.50000 in a SQL Server instance.
This leads to a scale mismatch during Avro import.
| |
To avoid this mismatch, you can use the sqoop.avro.decimal_padding.enable flag
to turn on padding with 0s. This flag has to be used together with the
sqoop.avro.logical_types.decimal.enable flag set to true.
| |
| Default precision and scale in avro import |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
All of the databases allow users to specify numeric columns without
a precision or scale. While MS SQL and MySQL translate these into
valid precision and scale values, Oracle and Postgres do not.

Therefore, when a table contains a NUMBER column in Oracle or
a NUMERIC/DECIMAL column in Postgres, you can specify a default precision and scale
to be used in the Avro schema with the +sqoop.avro.logical_types.decimal.default.precision+
and +sqoop.avro.logical_types.decimal.default.scale+ flags.
Avro padding also has to be enabled if the values are shorter than
the specified default scale.
| |
| Large Objects |
| ^^^^^^^^^^^^^ |
| |
| Sqoop handles large objects (+BLOB+ and +CLOB+ columns) in particular |
| ways. If this data is truly large, then these columns should not be |
| fully materialized in memory for manipulation, as most columns are. |
| Instead, their data is handled in a streaming fashion. Large objects |
| can be stored inline with the rest of the data, in which case they are |
| fully materialized in memory on every access, or they can be stored in |
| a secondary storage file linked to the primary data storage. By |
| default, large objects less than 16 MB in size are stored inline with |
| the rest of the data. At a larger size, they are stored in files in |
| the +_lobs+ subdirectory of the import target directory. These files |
| are stored in a separate format optimized for large record storage, |
which can accommodate records of up to 2^63 bytes each. The size at
| which lobs spill into separate files is controlled by the |
| +\--inline-lob-limit+ argument, which takes a parameter specifying the |
| largest lob size to keep inline, in bytes. If you set the inline LOB |
| limit to 0, all large objects will be placed in external |
| storage. |
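
For example, a minimal sketch (the connect string and table name are
illustrative) that sets the inline LOB limit to 0 so that every large object
is written to the external +_lobs+ storage:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table SomeTable \
    --inline-lob-limit 0
----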
| |
| include::output-args.txt[] |
| |
| When importing to delimited files, the choice of delimiter is |
| important. Delimiters which appear inside string-based fields may |
| cause ambiguous parsing of the imported data by subsequent analysis |
| passes. For example, the string +"Hello, pleased to meet you"+ should |
| not be imported with the end-of-field delimiter set to a comma. |
| |
| Delimiters may be specified as: |
| |
| - a character (+\--fields-terminated-by X+) |
| - an escape character (+\--fields-terminated-by \t+). Supported escape |
| characters are: |
| * +\b+ (backspace) |
| * +\n+ (newline) |
| * +\r+ (carriage return) |
| * +\t+ (tab) |
| * +\"+ (double-quote) |
| * +\\'+ (single-quote) |
| * +\\+ (backslash) |
| * +\0+ (NUL) - This will insert NUL characters between fields or lines, |
| or will disable enclosing/escaping if used for one of the +\--enclosed-by+, |
| +\--optionally-enclosed-by+, or +\--escaped-by+ arguments. |
| - The octal representation of a UTF-8 character's code point. This |
| should be of the form +\0ooo+, where _ooo_ is the octal value. |
| For example, +\--fields-terminated-by \001+ would yield the +^A+ character. |
| - The hexadecimal representation of a UTF-8 character's code point. This |
| should be of the form +\0xhhh+, where _hhh_ is the hex value. |
For example, +\--fields-terminated-by \0x10+ would yield the character
with code point 0x10.
| |
The default delimiters are a comma (+,+) for fields, a newline (+\n+) for records, no quote
character, and no escape character. Note that this can lead to
ambiguous/unparsable records if you import database records containing
commas or newlines in the field data. For unambiguous parsing, both a quote
character and an escape character must be enabled, for example via +\--mysql-delimiters+.
| |
| If unambiguous delimiters cannot be presented, then use _enclosing_ and |
| _escaping_ characters. The combination of (optional) |
| enclosing and escaping characters will allow unambiguous parsing of |
| lines. For example, suppose one column of a dataset contained the |
| following values: |
| |
| ---- |
| Some string, with a comma. |
| Another "string with quotes" |
| ---- |
| |
| The following arguments would provide delimiters which can be |
| unambiguously parsed: |
| |
| ---- |
| $ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ... |
| ---- |
| |
| (Note that to prevent the shell from mangling the enclosing character, |
| we have enclosed that argument itself in single-quotes.) |
| |
| The result of the above arguments applied to the above dataset would |
| be: |
| |
| ---- |
| "Some string, with a comma.","1","2","3"... |
| "Another \"string with quotes\"","4","5","6"... |
| ---- |
| |
| Here the imported strings are shown in the context of additional |
| columns (+"1","2","3"+, etc.) to demonstrate the full effect of enclosing |
| and escaping. The enclosing character is only strictly necessary when |
| delimiter characters appear in the imported text. The enclosing |
| character can therefore be specified as optional: |
| |
| ---- |
| $ sqoop import --optionally-enclosed-by '\"' (the rest as above)... |
| ---- |
| |
| Which would result in the following import: |
| |
| ---- |
| "Some string, with a comma.",1,2,3... |
| "Another \"string with quotes\"",4,5,6... |
| ---- |
| |
NOTE: Even though Hive supports escaping characters, it does not
handle escaping of the new-line character. Also, it does not support
| the notion of enclosing characters that may include field delimiters |
| in the enclosed string. It is therefore recommended that you choose |
| unambiguous field and record-terminating delimiters without the help |
| of escaping and enclosing characters when working with Hive; this is |
| due to limitations of Hive's input parsing abilities. |
| |
| The +\--mysql-delimiters+ argument is a shorthand argument which uses |
| the default delimiters for the +mysqldump+ program. |
| If you use the +mysqldump+ delimiters in conjunction with a |
| direct-mode import (with +\--direct+), very fast imports can be |
| achieved. |
| |
| While the choice of delimiters is most important for a text-mode |
| import, it is still relevant if you import to SequenceFiles with |
| +\--as-sequencefile+. The generated class' +toString()+ method |
| will use the delimiters you specify, so subsequent formatting of |
| the output data will rely on the delimiters you choose. |
| |
| include::input-args.txt[] |
| |
| When Sqoop imports data to HDFS, it generates a Java class which can |
| reinterpret the text files that it creates when doing a |
| delimited-format import. The delimiters are chosen with arguments such |
| as +\--fields-terminated-by+; this controls both how the data is |
| written to disk, and how the generated +parse()+ method reinterprets |
| this data. The delimiters used by the +parse()+ method can be chosen |
| independently of the output arguments, by using |
| +\--input-fields-terminated-by+, and so on. This is useful, for example, to |
| generate classes which can parse records created with one set of |
| delimiters, and emit the records to a different set of files using a |
| separate set of delimiters. |
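
As an illustration, a sketch (the connect string and table name are
illustrative) that writes tab-separated output while configuring the
generated class' +parse()+ method to read comma-separated records:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' \
    --input-fields-terminated-by ','
----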
| |
| include::hive-args.txt[] |
| |
| include::hive.txt[] |
| |
| include::hbase-args.txt[] |
| include::hbase.txt[] |
| |
| include::accumulo-args.txt[] |
| include::accumulo.txt[] |
| |
| include::codegen-args.txt[] |
| |
| As mentioned earlier, a byproduct of importing a table to HDFS is a |
| class which can manipulate the imported data. If the data is stored in |
| SequenceFiles, this class will be used for the data's serialization |
| container. Therefore, you should use this class in your subsequent |
| MapReduce processing of the data. |
| |
| The class is typically named after the table; a table named +foo+ will |
| generate a class named +foo+. You may want to override this class |
| name. For example, if your table is named +EMPLOYEES+, you may want to |
| specify +\--class-name Employee+ instead. Similarly, you can specify |
| just the package name with +\--package-name+. The following import |
| generates a class named +com.foocorp.SomeTable+: |
| |
| ---- |
| $ sqoop import --connect <connect-str> --table SomeTable --package-name com.foocorp |
| ---- |
| |
| The +.java+ source file for your class will be written to the current |
| working directory when you run +sqoop+. You can control the output |
| directory with +\--outdir+. For example, +\--outdir src/generated/+. |
| |
| The import process compiles the source into +.class+ and +.jar+ files; |
| these are ordinarily stored under +/tmp+. You can select an alternate |
| target directory with +\--bindir+. For example, +\--bindir /scratch+. |
| |
| If you already have a compiled class that can be used to perform the |
| import and want to suppress the code-generation aspect of the import |
| process, you can use an existing jar and class by |
| providing the +\--jar-file+ and +\--class-name+ options. For example: |
| |
| ---- |
| $ sqoop import --table SomeTable --jar-file mydatatypes.jar \ |
| --class-name SomeTableType |
| ---- |
| |
| This command will load the +SomeTableType+ class out of +mydatatypes.jar+. |
| |
| Additional Import Configuration Properties |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
There are some additional properties which can be configured by modifying
+conf/sqoop-site.xml+. Properties can be specified in the same way as in Hadoop
configuration files, for example:
| |
| ---- |
| <property> |
| <name>property.name</name> |
| <value>property.value</value> |
| </property> |
| ---- |
| |
| They can also be specified on the command line in the generic arguments, for |
| example: |
| |
| ---- |
| sqoop import -D property.name=property.value ... |
| ---- |
| |
| .Additional import configuration properties: |
| [grid="all"] |
| `-------------------------------------`---------------------------------------- |
Property                              Description
| ------------------------------------------------------------------------------- |
+sqoop.bigdecimal.format.string+      Controls how BigDecimal columns will be \
| formatted when stored as a String. A \ |
| value of +true+ (default) will use \ |
| toPlainString to store them without an \ |
| exponent component (0.0000001); while \ |
| a value of +false+ will use toString \ |
| which may include an exponent (1E-7) |
| +sqoop.hbase.add.row.key+ When set to +false+ (default), Sqoop \ |
| will not add the column used as a row \ |
| key into the row data in HBase. When \ |
| set to +true+, the column used as a \ |
| row key will be added to the row data \ |
| in HBase. |
| ------------------------------------------------------------------------------- |
| |
| |
| Example Invocations |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| The following examples illustrate how to use the import tool in a variety |
| of situations. |
| |
| A basic import of a table named +EMPLOYEES+ in the +corp+ database: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES |
| ---- |
| |
| A basic import requiring a login: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| --username SomeUser -P |
| Enter password: (hidden) |
| ---- |
| |
| Selecting specific columns from the +EMPLOYEES+ table: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| --columns "employee_id,first_name,last_name,job_title" |
| ---- |
| |
| Controlling the import parallelism (using 8 parallel tasks): |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| -m 8 |
| ---- |
| |
| Storing data in SequenceFiles, and setting the generated class name to |
| +com.foocorp.Employee+: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| --class-name com.foocorp.Employee --as-sequencefile |
| ---- |
| |
| Specifying the delimiters to use in a text-mode import: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| --fields-terminated-by '\t' --lines-terminated-by '\n' \ |
| --optionally-enclosed-by '\"' |
| ---- |
| |
| Importing the data to Hive: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| --hive-import |
| ---- |
| |
| Importing only new employees: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| --where "start_date > '2010-01-01'" |
| ---- |
| |
| Changing the splitting column from the default: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ |
| --split-by dept_id |
| ---- |
| |
| Verifying that an import was successful: |
| |
| ---- |
| $ hadoop fs -ls EMPLOYEES |
| Found 5 items |
| drwxr-xr-x - someuser somegrp 0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs |
| -rw-r--r-- 1 someuser somegrp 2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000 |
| -rw-r--r-- 1 someuser somegrp 1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001 |
| -rw-r--r-- 1 someuser somegrp 7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002 |
| -rw-r--r-- 1 someuser somegrp 7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003 |
| |
| $ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10 |
| 0,joe,smith,engineering |
| 1,jane,doe,marketing |
| ... |
| ---- |
| |
| Performing an incremental import of new data, after having already |
| imported the first 100,000 rows of a table: |
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \ |
| --where "id > 100000" --target-dir /incremental_dataset --append |
| ---- |
| |
An import of a table named +EMPLOYEES+ in the +corp+ database that
validates the import by comparing the source table's row count with the
number of rows copied into HDFS:
<<validation,More Details>>
| |
| ---- |
| $ sqoop import --connect jdbc:mysql://db.foo.com/corp \ |
| --table EMPLOYEES --validate |
| ---- |
| |
| Enabling logical types in avro import and also turning on padding with 0s: |
| |
| ---- |
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true \
    --connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS" \
    --target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
| ---- |
| |
| Enabling logical types in avro import and also turning on padding with 0s, while specifying default precision and scale as well: |
| |
| ---- |
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true \
    -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10 \
    --connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS" \
    --target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
| ---- |