////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Saved Jobs
----------
Imports and exports can be performed repeatedly by issuing the same command
multiple times. This is especially common when using the incremental import
capability.
Sqoop allows you to define _saved jobs_ which make this process easier. A
saved job records the configuration information required to execute a
Sqoop command at a later time. The section on the +sqoop-job+ tool
describes how to create and work with saved jobs.
By default, job descriptions are saved to a private repository stored
in +$HOME/.sqoop/+. You can configure Sqoop to instead use a shared
_metastore_, which makes saved jobs available to multiple users across a
shared cluster. Starting the metastore is covered by the section on the
+sqoop-metastore+ tool.
+sqoop-job+
-----------
Purpose
~~~~~~~
include::job-purpose.txt[]
Syntax
~~~~~~
----
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
----
Although the Hadoop generic arguments must precede any job arguments,
the job arguments can be entered in any order with respect to one
another.
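For example, a Hadoop property passed with the generic +-D+ option has to
appear before the job arguments (the property shown here is only
illustrative):
----
$ sqoop job -D mapreduce.job.queuename=default --list
----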
.Job management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--create <job-id>+         Define a new saved job with the specified \
                             job-id (name). A second Sqoop \
                             command-line, separated by a +\--+, should \
                             be specified; this defines the saved job.
+\--delete <job-id>+         Delete a saved job.
+\--exec <job-id>+           Given a job defined with +\--create+, run \
                             the saved job.
+\--show <job-id>+           Show the parameters for a saved job.
+\--list+                    List all saved jobs.
-----------------------------------------------------------------------
Creating saved jobs is done with the +\--create+ action. This operation
requires a +\--+ followed by a tool name and its arguments. The tool and
its arguments will form the basis of the saved job. Consider:
----
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
----
This creates a job named +myjob+ which can be executed later. The job is not
run. This job is now available in the list of saved jobs:
----
$ sqoop job --list
Available jobs:
myjob
----
We can inspect the configuration of a job with the +show+ action. As the
example below shows, even if the password is stored in the metastore, the
+show+ action redacts its value in the output:
----
$ sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
db.password = ********
...
----
And if we are satisfied with it, we can run the job with +exec+:
----
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
----
The +exec+ action allows you to override arguments of the saved job
by supplying them after a +\--+. For example, if the database were
changed to require a username, we could specify the username and
password with:
----
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
----
.Metastore connection options:
[grid="all"]
`----------------------------`-----------------------------------------
Argument                      Description
-----------------------------------------------------------------------
+\--meta-connect <jdbc-uri>+  Specifies the JDBC connect string used \
                              to connect to the metastore
+\--meta-username <username>+ Specifies the username for the \
                              metastore database
+\--meta-password <password>+ Specifies the password for the \
                              metastore database
-----------------------------------------------------------------------
By default, a private metastore is instantiated in +$HOME/.sqoop+. If
you have configured a hosted metastore with the +sqoop-metastore+
tool, you can connect to it by specifying the +\--meta-connect+
argument. This is a JDBC connect string just like the ones used to
connect to databases for import.
In +conf/sqoop-site.xml+, you can configure
+sqoop.metastore.client.autoconnect.url+ with this address, so you do not have
to supply +\--meta-connect+ to use a remote metastore. This parameter can
also be modified to move the private metastore to a location on your
filesystem other than your home directory.
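For example, a shared metastore URL could be recorded in
+conf/sqoop-site.xml+ roughly as follows (the host, port, and database name
are illustrative; the file uses the usual Hadoop XML configuration format):
----
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop</value>
</property>
----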
If you configure +sqoop.metastore.client.enable.autoconnect+ with the
value +false+, then you must explicitly supply +\--meta-connect+.
If the +\--meta-connect+ option is present, Sqoop will try to connect to the
metastore database specified in this parameter value. It will use the username
and password specified in the +\--meta-username+ and +\--meta-password+
parameters. If they are not present, Sqoop will use an empty username and
password. If the database in the connection string is not supported, Sqoop
will throw an exception.
If the +\--meta-connect+ parameter is not present and the
+sqoop.metastore.client.enable.autoconnect+ configuration parameter is set to
+false+ (the default value is +true+), Sqoop will throw an error, since there
is no applicable metastore implementation.
Job data can be stored in MySQL, PostgreSQL, DB2, SQL Server, and Oracle via
the +\--meta-connect+ argument. The +\--meta-username+ and +\--meta-password+
arguments are necessary if the database containing the saved jobs requires a
username and password. When using any of these implementations, you have to
ensure that the database is online and accessible when Sqoop tries to access
it.
Examples
~~~~~~~~
Listing available jobs in the metastore:
----
$ sqoop job --list --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----
Creating a new job in the metastore:
----
$ sqoop job --create myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password \
    -- import --connect jdbc:mysql://mysqlhost:3306/sqoop --username sqoop \
    --password sqoop --table "TestTable" -m 1
----
Executing an existing job:
----
$ sqoop job --exec myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----
Showing the definition of an existing job:
----
$ sqoop job --show myjob2 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----
Deleting an existing job:
----
$ sqoop job --delete myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----
Using an HSQLDB metastore:
----
$ sqoop job --exec myjob --meta-connect jdbc:hsqldb:hsql://localhost:3000/ --meta-username *username* --meta-password *password*
----
.Common options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--help+                    Print usage instructions
+\--verbose+                 Print more information while working
-----------------------------------------------------------------------
Saved jobs and passwords
~~~~~~~~~~~~~~~~~~~~~~~~
The Sqoop metastore is not a secure resource. Multiple users can access
its contents. For this reason, Sqoop does not store passwords in the
metastore. If you create a job that requires a password, you will be
prompted for that password each time you execute the job.
You can enable passwords in the metastore by setting
+sqoop.metastore.client.record.password+ to +true+ in the configuration.
Note that you have to set +sqoop.metastore.client.record.password+ to +true+
if you are executing saved jobs via Oozie because Sqoop cannot prompt the user
to enter passwords while being executed as Oozie tasks.
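A minimal sketch of the corresponding entry in +conf/sqoop-site.xml+,
assuming the standard Hadoop XML configuration format:
----
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
----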
Saved jobs and incremental imports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Incremental imports are performed by comparing the values in a _check column_
against a reference value for the most recent import. For example, if the
+\--incremental append+ argument was specified, along with +\--check-column
id+ and +\--last-value 100+, all rows with +id > 100+ will be imported.
If an incremental import is run from the command line, the value which
should be specified as +\--last-value+ in a subsequent incremental import
will be printed to the screen for your reference. If an incremental import is
run from a saved job, this value will be retained in the saved job. Subsequent
runs of +sqoop job \--exec someIncrementalJob+ will continue to import only
newer rows than those previously imported.
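For example, a saved job for such an incremental import could be defined
roughly like this (the connect string, table, and column names are
illustrative); each subsequent +\--exec+ run then picks up the stored last
value automatically:
----
$ sqoop job --create someIncrementalJob -- import \
    --connect jdbc:mysql://example.com/db --table mytable \
    --incremental append --check-column id --last-value 100
----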
+sqoop-metastore+
-----------------
Purpose
~~~~~~~
include::metastore-purpose.txt[]
Syntax
~~~~~~
----
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
----
Although the Hadoop generic arguments must precede any metastore arguments,
the metastore arguments can be entered in any order with respect to one
another.
.Metastore management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--shutdown+                Shuts down a running metastore instance \
                             on the same machine.
-----------------------------------------------------------------------
Running +sqoop-metastore+ launches a shared HSQLDB database instance on
the current machine. Clients can connect to this metastore and create jobs
which can be shared between users for execution.
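For example, the shared metastore could be started on its host with:
----
$ sqoop metastore
----
and later stopped from the same machine with:
----
$ sqoop metastore --shutdown
----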
The location of the metastore's files on disk is controlled by the
+sqoop.metastore.server.location+ property in +conf/sqoop-site.xml+.
This should point to a directory on the local filesystem.
The metastore is available over TCP/IP. The port is controlled by the
+sqoop.metastore.server.port+ configuration parameter, and defaults to 16000.
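For example, both server properties could be set in +conf/sqoop-site.xml+
along these lines (the storage directory is illustrative):
----
<property>
  <name>sqoop.metastore.server.location</name>
  <value>/var/lib/sqoop/metastore</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>
----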
Clients should connect to the metastore by specifying
+sqoop.metastore.client.autoconnect.url+ or +\--meta-connect+ with a
JDBC-URI string. For example,
+jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop+.
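For example, a client on any machine could list the shared jobs with (host
name as above):
----
$ sqoop job --list --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop
----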
Alternatively, one can start an RDBMS to host the metastore and pass the
connection parameters to Sqoop. This metastore may be hosted on a machine
within the Hadoop cluster, or elsewhere on the network. Sqoop supports the
following database implementations: MySQL, Oracle, PostgreSQL, Microsoft SQL
Server, and DB2.
+sqoop-merge+
-------------
Purpose
~~~~~~~
include::merge-purpose.txt[]
Syntax
~~~~~~
----
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
----
Although the Hadoop generic arguments must precede any merge arguments,
the merge arguments can be entered in any order with respect to one
another.
.Merge options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--class-name <class>+      Specify the name of the record-specific \
                             class to use during the merge job.
+\--jar-file <file>+         Specify the name of the jar to load the \
                             record class from.
+\--merge-key <col>+         Specify the name of a column to use as \
                             the merge key.
+\--new-data <path>+         Specify the path of the newer dataset.
+\--onto <path>+             Specify the path of the older dataset.
+\--target-dir <path>+       Specify the target path for the output \
                             of the merge job.
-----------------------------------------------------------------------
The +merge+ tool runs a MapReduce job that takes two directories as
input: a newer dataset, and an older one. These are specified with
+\--new-data+ and +\--onto+ respectively. The output of the MapReduce
job will be placed in the directory in HDFS specified by +\--target-dir+.
When merging the datasets, it is assumed that there is a unique primary
key value in each record. The column for the primary key is specified
with +\--merge-key+. Multiple rows in the same dataset should not
have the same primary key, or else data loss may occur.
To parse the dataset and extract the key column, the auto-generated
class from a previous import must be used. You should specify the
class name and jar file with +\--class-name+ and +\--jar-file+. If
this is not available, you can recreate the class using the +codegen+
tool.
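For example, the class and jar could be regenerated roughly as follows (the
connect string and table name are illustrative), and the resulting jar and
class name then passed to +\--jar-file+ and +\--class-name+:
----
$ sqoop codegen --connect jdbc:mysql://example.com/db --table mytable
----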
The merge tool is typically run after an incremental import with the
date-last-modified mode (+sqoop import --incremental lastmodified ...+).
Supposing two incremental imports were performed, where some older data
is in an HDFS directory named +older+ and newer data is in an HDFS
directory named +newer+, these could be merged like so:
----
$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id
----
This would run a MapReduce job where the value in the +id+ column
of each row is used to join rows; rows in the +newer+ dataset will
be used in preference to rows in the +older+ dataset.
This can be used with SequenceFile-, Avro-, and text-based
incremental imports. The file types of the newer and older datasets
must be the same.