////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
+sqoop-export+
--------------
Purpose
~~~~~~~
include::export-purpose.txt[]
Syntax
~~~~~~
----
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
----
Although the Hadoop generic arguments must precede any export arguments,
the export arguments can be entered in any order with respect to one
another.
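For example, a Hadoop generic argument such as a +-D+ property must come
before the export arguments. A minimal sketch (the property and value shown
are illustrative; any Hadoop configuration property can be passed this way):
----
$ sqoop export -D mapreduce.job.queuename=default \
    --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data
----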
include::common-args.txt[]
include::validation-args.txt[]
.Export control arguments:
[grid="all"]
`----------------------------------------`------------------------------
Argument                                 Description
------------------------------------------------------------------------
+\--columns <col,col,col...>+            Columns to export to table
+\--direct+                              Use direct export fast path
+\--export-dir <dir>+                    HDFS source path for the export
+-m,\--num-mappers <n>+                  Use 'n' map tasks to export in\
                                         parallel
+\--table <table-name>+                  Table to populate
+\--call <stored-proc-name>+             Stored Procedure to call
+\--update-key <col-name>+               Anchor column to use for updates.\
                                         Use a comma-separated list of columns\
                                         if there is more than one column.
+\--update-mode <mode>+                  Specify how updates are performed\
                                         when new rows are found with\
                                         non-matching keys in database.\
                                         Legal values for +mode+ include\
                                         +updateonly+ (default) and\
                                         +allowinsert+.
+\--input-null-string <null-string>+     The string to be interpreted as\
                                         null for string columns
+\--input-null-non-string <null-string>+ The string to be interpreted as\
                                         null for non-string columns
+\--staging-table <staging-table-name>+  The table in which data will be\
                                         staged before being inserted into\
                                         the destination table.
+\--clear-staging-table+                 Indicates that any data present in\
                                         the staging table can be deleted.
+\--batch+                               Use batch mode for underlying\
                                         statement execution.
------------------------------------------------------------------------
The +\--export-dir+ argument and one of +\--table+ or +\--call+ are
required. These specify the table to populate in the database (or the
stored procedure to call), and the directory in HDFS that contains
the source data.
By default, all columns within a table are selected for export. You
can select a subset of columns and control their ordering by using the
+\--columns+ argument. This should include a comma-delimited list
of columns to export. For example: +\--columns "col1,col2,col3"+. Note
that columns that are not included in the +\--columns+ parameter need
to have either a defined default value or allow +NULL+ values. Otherwise
your database will reject the exported data, which in turn will cause
the Sqoop job to fail.
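For example, a sketch that exports only three columns (the table, connect
string, and column names here are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --columns "col1,col2,col3" --export-dir /results/bar_data
----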
You can control the number of mappers independently from the number of
files present in the directory. Export performance depends on the
degree of parallelism. By default, Sqoop will use four tasks in
parallel for the export process. This may not be optimal; you will
need to experiment with your own particular setup. Additional tasks
may offer better concurrency, but if the database is already
bottlenecked on updating indices, invoking triggers, and so on, then
additional load may decrease performance. The +\--num-mappers+ or +-m+
arguments control the number of map tasks, which is the degree of
parallelism used.
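For example, to use eight map tasks instead of the default four (the degree
of parallelism shown is illustrative; tune it for your own setup):
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data -m 8
----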
Some databases also provide a direct mode for exports. Use the +\--direct+ argument
to specify this codepath. This may be higher-performance than the standard JDBC codepath.
Details about use of direct mode with each specific RDBMS, installation requirements, available
options and limitations can be found in <<connectors>>.
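For example, a direct-mode export might look like the following sketch,
assuming the connector for your database supports direct export
(see <<connectors>>):
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --direct
----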
The +\--input-null-string+ and +\--input-null-non-string+ arguments are
optional. If +\--input-null-string+ is not specified, then the string
"null" will be interpreted as null for string-type columns.
If +\--input-null-non-string+ is not specified, then both the string
"null" and the empty string will be interpreted as null for non-string
columns. Note that the empty string will always be interpreted as null
for non-string columns, in addition to any other string specified by
+\--input-null-non-string+.
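For example, if the HDFS files encode nulls as +\N+ (as Hive does), the
following sketch tells the export to interpret that sequence as null; the
exact shell escaping shown is illustrative and may vary with your shell:
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --input-null-string '\\N' --input-null-non-string '\\N'
----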
Since Sqoop breaks down the export process into multiple transactions, it
is possible that a failed export job may result in partial data being
committed to the database. This can further lead to subsequent jobs
failing due to insert collisions in some cases, or lead to duplicated data
in others. You can overcome this problem by specifying a staging table via
the +\--staging-table+ option which acts as an auxiliary table that is used
to stage exported data. The staged data is finally moved to the destination
table in a single transaction.
In order to use the staging facility, you must create the staging table
prior to running the export job. This table must be structurally
identical to the target table. It must either be empty before
the export job runs, or the +\--clear-staging-table+ option must be specified.
If the staging table contains data and the +\--clear-staging-table+ option is
specified, Sqoop will delete all of the data before starting the export job.
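For example, the following sketch stages the data in a hypothetical table
named +bar_stage+ (created beforehand with the same structure as +bar+) and
clears it before the export begins:
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --staging-table bar_stage --clear-staging-table \
    --export-dir /results/bar_data
----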
NOTE: Support for staging data prior to pushing it into the destination
table is not always available for +--direct+ exports. It is also not available when
export is invoked using the +--update-key+ option for updating existing data,
and when stored procedures are used to insert the data. It is best to check the <<connectors>> section to validate.
Inserts vs. Updates
~~~~~~~~~~~~~~~~~~~
By default, +sqoop-export+ appends new rows to a table; each input
record is transformed into an +INSERT+ statement that adds a row to the
target database table. If your table has constraints (e.g., a primary
key column whose values must be unique) and already contains data, you
must take care to avoid inserting records that violate these
constraints. The export process will fail if an +INSERT+ statement
fails. This mode is primarily intended for exporting records to a new,
empty table that will receive these results.
If you specify the +\--update-key+ argument, Sqoop will instead modify
an existing dataset in the database. Each input record is treated as
an +UPDATE+ statement that modifies an existing row. The row a
statement modifies is determined by the column name(s) specified with
+\--update-key+. For example, consider the following table
definition:
----
CREATE TABLE foo(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);
----
Consider also a dataset in HDFS containing records like these:
----
0,this is a test,42
1,some more data,100
...
----
Running +sqoop-export \--table foo \--update-key id \--export-dir
/path/to/data \--connect ...+ will run an export job that executes SQL
statements based on the data like so:
----
UPDATE foo SET msg='this is a test', bar=42 WHERE id=0;
UPDATE foo SET msg='some more data', bar=100 WHERE id=1;
...
----
If an +UPDATE+ statement modifies no rows, this is not considered an
error; the export will silently continue. (In effect, this means that
an update-based export will not insert new rows into the database.)
Likewise, if the column specified with +\--update-key+ does not
uniquely identify rows and multiple rows are updated by a single
statement, this condition is also undetected.
The +\--update-key+ argument can also be given a comma-separated list of
column names, in which case Sqoop will match all of these columns before
updating any existing record.
Depending on the target database, you may also specify the +\--update-mode+
argument with the +allowinsert+ mode to update rows that already exist in the
database and insert rows that do not exist yet.
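For example, the following sketch updates rows of the +foo+ table above by
+id+ and, where the target database supports this mode, inserts rows whose
+id+ is not yet present (the connect string here is illustrative):
----
$ sqoop export --connect jdbc:mysql://db.example.com/corp --table foo \
    --export-dir /path/to/data \
    --update-key id --update-mode allowinsert
----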
include::input-args.txt[]
include::output-args.txt[]
Sqoop automatically generates code to parse and interpret records of the
files containing the data to be exported back to the database. If
these files were created with non-default delimiters (anything other than
the default comma-separated fields with newline-separated records), you
should specify the same delimiters again so that Sqoop can parse your files.
If you specify incorrect delimiters, Sqoop will fail to find enough
columns per line. This will cause export map tasks to fail by throwing
+ParseExceptions+.
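For example, if the source files were written with tab-separated fields, a
sketch of the corresponding export (the delimiter choices shown are
illustrative):
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --input-fields-terminated-by '\t' --input-lines-terminated-by '\n'
----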
include::codegen-args.txt[]
If the records to be exported were generated as the result of a
previous import, then the original generated class can be used to read
the data back. Specifying +\--jar-file+ and +\--class-name+ obviates
the need to specify delimiters in this case.
The use of existing generated code is incompatible with
+\--update-key+; an update-mode export requires new code generation to
perform the update. You cannot use +\--jar-file+, and must fully specify
any non-default delimiters.
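For example, reusing the class generated by a previous import might look like
the following sketch (the jar and class names shown are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --jar-file bar.jar --class-name bar
----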
Exports and Transactions
~~~~~~~~~~~~~~~~~~~~~~~~
Exports are performed by multiple writers in parallel. Each writer
uses a separate connection to the database; these have separate
transactions from one another. Sqoop uses the multi-row +INSERT+
syntax to insert up to 100 records per statement. Every 100
statements, the current transaction within a writer task is committed,
causing a commit every 10,000 rows. This ensures that transaction
buffers do not grow without bound, which would cause out-of-memory conditions.
Therefore, an export is not an atomic process. Partial results from
the export will become visible before the export is complete.
Failed Exports
~~~~~~~~~~~~~~
Exports may fail for a number of reasons:
- Loss of connectivity from the Hadoop cluster to the database (either
due to hardware fault, or server software crashes)
- Attempting to +INSERT+ a row which violates a consistency constraint
(for example, inserting a duplicate primary key value)
- Attempting to parse an incomplete or malformed record from the HDFS
source data
- Attempting to parse records using incorrect delimiters
- Capacity issues (such as insufficient RAM or disk space)
If an export map task fails due to these or other reasons, it will
cause the export job to fail. The results of a failed export are
undefined. Each export map task operates in a separate transaction.
Furthermore, individual map tasks +commit+ their current transaction
periodically. If a task fails, the current transaction will be rolled
back. Any previously-committed transactions will remain durable in the
database, leading to a partially-complete export.
Example Invocations
~~~~~~~~~~~~~~~~~~~
A basic export to populate a table named +bar+:
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data
----
This example takes the files in +/results/bar_data+ and injects their
contents into the +bar+ table in the +foo+ database on +db.example.com+.
The target table must already exist in the database. Sqoop performs
a set of +INSERT INTO+ operations, without regard for existing content. If
Sqoop attempts to insert rows which violate constraints in the database
(for example, a particular primary key value already exists), then the export
fails.
Alternatively, you can specify the columns to be exported by providing
+--columns "col1,col2,col3"+. Please note that columns that are not included
in the +--columns+ parameter need to have either defined default value or
allow +NULL+ values. Otherwise your database will reject the imported data
which in turn will make Sqoop job fail.
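For example (reusing the +bar+ table above, with illustrative column names):
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --columns "col1,col2,col3" --export-dir /results/bar_data
----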
Another basic export to populate a table named +bar+ with validation enabled:
<<validation,More Details>>
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data --validate
----
An export that calls a stored procedure named +barproc+ for every record in
+/results/bar_data+ would look like:
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
--export-dir /results/bar_data
----