////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Sqoop-Amazon S3 Integration
---------------------------
Background
~~~~~~~~~~
Amazon Simple Storage Service (Amazon S3) is a cloud storage web service offered by Amazon Web Services that
provides highly scalable, secure and low-latency data storage in the cloud.
To learn more about S3 please see the official documentation at https://aws.amazon.com/documentation/s3/.
Sqoop can be used to transfer data between a relational database management system (RDBMS)
and Amazon S3 by exploiting the capabilities of the Hadoop-Amazon Web Services integration.
To learn more about the Hadoop-AWS module please see the Hadoop documentation at
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html.
Authentication
~~~~~~~~~~~~~~
Users authenticate to an S3 bucket using AWS credentials. The standard way to authenticate is with an access key
and a secret key set via the corresponding properties in the Hadoop configuration file. The AWS credentials can also
be passed to the Sqoop job by setting the following properties: +fs.s3a.access.key+, +fs.s3a.secret.key+ and
+fs.s3a.session.token+, the last one only when temporary AWS credentials are used.
To learn more about the S3A authentication methods please see the Hadoop documentation at
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods.
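For example, temporary AWS credentials might be passed to an import on the command line as sketched below (depending
on the Hadoop version in use, the +fs.s3a.aws.credentials.provider+ property may also have to be set to
+org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider+):
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
-Dfs.s3a.session.token=$AWS_SESSION_TOKEN \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory
----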
Import Into Amazon S3
~~~~~~~~~~~~~~~~~~~~~
Sqoop import is supported into the S3A (+s3a://+) filesystem only.
To import data from an RDBMS into an S3 bucket the +--target-dir+ option has to be set to the target location in
the S3 bucket. Example usage:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory
----
Data from an RDBMS can also be imported into S3 in SequenceFile, Avro and Parquet file formats.
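For example, the +--as-avrodatafile+ option can be added to store the imported data as Avro data files (a sketch
using the same placeholder parameters as above):
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--as-avrodatafile
----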
Incremental Import
^^^^^^^^^^^^^^^^^^
To import data from an RDBMS into an S3 bucket in incremental mode, the +--temporary-rootdir+ option always has to
be set and has to point to a location in the S3 bucket.
Incremental Append Mode
+++++++++++++++++++++++
In case of +append+ mode the location of the temporary root directory has to be in the same bucket as the target directory,
for example +s3a://example-bucket/temporary-rootdir+ or +s3a://example-bucket/target-directory/temporary-rootdir+.
Example usage:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--incremental append \
--check-column $CHECK_COLUMN \
--last-value $LAST_VALUE \
--temporary-rootdir s3a://example-bucket/temporary-rootdir
----
Data from an RDBMS can also be imported into S3 in incremental +append+ mode in SequenceFile, Avro and Parquet file formats.
Incremental Lastmodified Mode
+++++++++++++++++++++++++++++
In case of +lastmodified+ mode the location of the temporary root directory has to be in the same bucket and in the
same parent directory as the target directory, for example +s3a://example-bucket/temporary-rootdir+ when the target
directory is +s3a://example-bucket/target-directory+. Example usage:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--incremental lastmodified \
--check-column $CHECK_COLUMN \
--merge-key $MERGE_KEY \
--last-value $LAST_VALUE \
--temporary-rootdir s3a://example-bucket/temporary-rootdir
----
Data from an RDBMS can also be imported into S3 in incremental +lastmodified+ mode in Parquet file format.
Import Into External Hive Table Backed By S3
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To import data from an RDBMS into an external Hive table backed by S3 the AWS credentials have to be set in the Hive
configuration file (+hive-site.xml+) as well. To learn more about Hive on Amazon Web Services please see the Hive
documentation at https://cwiki.apache.org/confluence/display/Hive/HiveAws.
The current implementation of Sqoop requires that both the +--target-dir+ and +--external-table-dir+ options are set,
where +--external-table-dir+ has to point to the Hive table location in the S3 bucket.
Example of importing into an external Hive table backed by S3:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--hive-import \
--target-dir s3a://example-bucket/target-directory \
--external-table-dir s3a://example-bucket/external-directory
----
Example of creating an external Hive table backed by S3:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--hive-import \
--create-hive-table \
--hive-table $HIVE_TABLE_NAME \
--target-dir s3a://example-bucket/target-directory \
--external-table-dir s3a://example-bucket/external-directory
----
Data from an RDBMS can also be imported into an external Hive table backed by S3 in Parquet file format.
Hadoop S3Guard usage with Sqoop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Amazon S3 offers eventual consistency for PUTS and DELETES in all regions, which means that the visibility of the files
is not guaranteed within a specific time after creation. Due to this behavior the data may not be visible immediately
after a Sqoop import. To learn more about the core concepts of Amazon S3 please see the official documentation at
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#CoreConcepts.
S3Guard is an experimental feature of the Hadoop S3A client which can use a database as a store of metadata about
objects in an S3 bucket. To learn more about S3Guard please see the Hadoop documentation at
https://hadoop.apache.org/docs/r3.0.3/hadoop-aws/tools/hadoop-aws/s3guard.html.
S3Guard can be enabled during Sqoop imports by setting the properties described in the linked documentation.
Example usage with S3Guard enabled:
----
$ sqoop import \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY \
-Dfs.s3a.secret.key=$AWS_SECRET_KEY \
-Dfs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore \
-Dfs.s3a.s3guard.ddb.region=$BUCKET_REGION \
-Dfs.s3a.s3guard.ddb.table.create=true \
--connect $CONN \
--username $USER \
--password $PWD \
--table $TABLE_NAME \
--target-dir s3a://example-bucket/target-directory
----