////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Sqoop-Amazon S3 Integration
---------------------------
Background
~~~~~~~~~~
Amazon Simple Storage Service (Amazon S3) is a cloud storage web service offered by Amazon Web Services that
provides highly scalable, secure and low-latency data storage in the cloud.
To learn more about S3 please see the official documentation at https://aws.amazon.com/documentation/s3/.
Sqoop can be used to transfer data between a relational database management system (RDBMS)
and Amazon S3 by exploiting the capabilities of the Hadoop-Amazon Web Services integration.
To learn more about the Hadoop-AWS module please see the Hadoop documentation at
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html.
Authentication
~~~~~~~~~~~~~~
Users authenticate to an S3 bucket using AWS credentials. The standard way to authenticate is with an access key
and a secret key set via the corresponding properties in the Hadoop configuration file. The AWS credentials can also
be passed to the Sqoop job by setting the following properties: +fs.s3a.access.key+, +fs.s3a.secret.key+ and
+fs.s3a.session.token+, the last one only when temporary AWS credentials are used.
To learn more about the S3A authentication methods please see the Hadoop documentation at
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods.
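For example, temporary AWS credentials might be passed to an import on the command line as sketched below (depending
on the Hadoop version in use, the +fs.s3a.aws.credentials.provider+ property may also have to be set to
+org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider+):
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
-Dfs.s3a.session.token=$AWS_SESSION_TOKEN \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory
----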
Import Into Amazon S3
~~~~~~~~~~~~~~~~~~~~~
Sqoop import is supported into the S3A (+s3a://+) filesystem only.
To import data from an RDBMS into an S3 bucket the +--target-dir+ option has to be set to the target location in
the S3 bucket. Example usage:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory
----
Data from an RDBMS can also be imported into S3 in SequenceFile, Avro and Parquet file formats.
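For example, the +--as-avrodatafile+ option can be added to store the imported data as Avro data files (a sketch
using the same placeholder parameters as above):
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--as-avrodatafile
----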
Incremental Import
^^^^^^^^^^^^^^^^^^
To import data from an RDBMS into an S3 bucket in incremental mode, the +--temporary-rootdir+ option always has to
be set and has to point to a location in the S3 bucket.
Incremental Append Mode
+++++++++++++++++++++++
In case of +append+ mode the location of the temporary root directory has to be in the same bucket as the target directory,
for example +s3a://example-bucket/temporary-rootdir+ or +s3a://example-bucket/target-directory/temporary-rootdir+.
Example usage:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--incremental append \
--check-column $CHECK_COLUMN \
--last-value $LAST_VALUE \
--temporary-rootdir s3a://example-bucket/temporary-rootdir
----
Data from an RDBMS can also be imported into S3 in incremental +append+ mode in SequenceFile, Avro and Parquet file formats.
Incremental Lastmodified Mode
+++++++++++++++++++++++++++++
In case of +lastmodified+ mode the location of the temporary root directory has to be in the same bucket and in the
same parent directory as the target directory, for example +s3a://example-bucket/temporary-rootdir+ when the target
directory is +s3a://example-bucket/target-directory+. Example usage:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--incremental lastmodified \
--check-column $CHECK_COLUMN \
--merge-key $MERGE_KEY \
--last-value $LAST_VALUE \
--temporary-rootdir s3a://example-bucket/temporary-rootdir
----
Data from an RDBMS can also be imported into S3 in incremental +lastmodified+ mode in Parquet file format.
Import Into External Hive Table Backed By S3
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To import data from an RDBMS into an external Hive table backed by S3 the AWS credentials have to be set in the Hive
configuration file (+hive-site.xml+) as well. To learn more about Hive on Amazon Web Services please see the Hive
documentation at https://cwiki.apache.org/confluence/display/Hive/HiveAws.
The current implementation of Sqoop requires that both the +--target-dir+ and +--external-table-dir+ options are set,
where +--external-table-dir+ has to point to the Hive table location in the S3 bucket.
Example of importing into an external Hive table backed by S3:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--hive-import \
--target-dir s3a://example-bucket/target-directory \
--external-table-dir s3a://example-bucket/external-directory
----
Example of creating an external Hive table backed by S3:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--hive-import \
--create-hive-table \
--hive-table $HIVE_TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--external-table-dir s3a://example-bucket/external-directory
----
Data from an RDBMS can also be imported into an external Hive table backed by S3 in Parquet file format.
Hadoop S3Guard usage with Sqoop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Amazon S3 offers eventual consistency for PUTS and DELETES in all regions, which means that the visibility of the files
is not guaranteed within a specific time after creation. Due to this behavior the data may not be visible immediately
after a Sqoop import. To learn more about the core concepts of Amazon S3 please see the official documentation at
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#CoreConcepts.
S3Guard is an experimental feature of the Hadoop S3A client which can use a database as a store of metadata about
objects in an S3 bucket. To learn more about S3Guard please see the Hadoop documentation at
https://hadoop.apache.org/docs/r3.0.3/hadoop-aws/tools/hadoop-aws/s3guard.html.
S3Guard can be enabled during Sqoop imports by setting the properties described in the linked documentation.
Example usage with S3Guard enabled:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
-Dfs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore \
-Dfs.s3a.s3guard.ddb.region=$BUCKET_REGION \
-Dfs.s3a.s3guard.ddb.table.create=true \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory
----