////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Saved Jobs
----------

Imports and exports can be repeatedly performed by issuing the same command
multiple times. Especially when using the incremental import capability,
this is an expected scenario.

Sqoop allows you to define _saved jobs_ which make this process easier. A
saved job records the configuration information required to execute a
Sqoop command at a later time. The section on the +sqoop-job+ tool
describes how to create and work with saved jobs.

By default, job descriptions are saved to a private repository stored
in +$HOME/.sqoop/+. You can configure Sqoop to instead use a shared
_metastore_, which makes saved jobs available to multiple users across a
shared cluster. Starting the metastore is covered by the section on the
+sqoop-metastore+ tool.


+sqoop-job+
-----------

Purpose
~~~~~~~

include::job-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
----

Although the Hadoop generic arguments must precede any job arguments,
the job arguments can be entered in any order with respect to one
another.
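
For example, a Hadoop generic argument such as +-D <property>=<value>+ must
come directly after the tool name, before any of the job arguments. A minimal
sketch (the job name here is hypothetical; the property is the one described
later in this section):

----
$ sqoop job -D sqoop.metastore.client.record.password=true \
    --create pwjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
----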

.Job management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--create <job-id>+         Define a new saved job with the specified \
                             job-id (name). A second Sqoop \
                             command-line, separated by a +\--+, should \
                             be specified; this defines the saved job.
+\--delete <job-id>+         Delete a saved job.
+\--exec <job-id>+           Given a job defined with +\--create+, run \
                             the saved job.
+\--show <job-id>+           Show the parameters for a saved job.
+\--list+                    List all saved jobs.
-----------------------------------------------------------------------

Creating saved jobs is done with the +\--create+ action. This operation
requires a +\--+ followed by a tool name and its arguments. The tool and
its arguments will form the basis of the saved job. Consider:

----
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
----

This creates a job named +myjob+ which can be executed later. The job is not
run. This job is now available in the list of saved jobs:

----
$ sqoop job --list
Available jobs:
  myjob
----

We can inspect the configuration of a job with the +show+ action. As the
example below shows, even if the password is stored in the metastore, the
+show+ action redacts its value in the output:

----
$ sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
db.password = ********
...
----

And if we are satisfied with it, we can run the job with +exec+:

----
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
----

The +exec+ action allows you to override arguments of the saved job
by supplying them after a +\--+. For example, if the database were
changed to require a username, we could specify the username and
password with:

----
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
----

.Metastore connection options:
[grid="all"]
`----------------------------`-----------------------------------------
Argument                      Description
-----------------------------------------------------------------------
+\--meta-connect <jdbc-uri>+  Specifies the JDBC connect string used \
                              to connect to the metastore
+\--meta-username <username>+ Specifies the username for the metastore \
                              database
+\--meta-password <password>+ Specifies the password for the metastore \
                              database
-----------------------------------------------------------------------

By default, a private metastore is instantiated in +$HOME/.sqoop+. If
you have configured a hosted metastore with the +sqoop-metastore+
tool, you can connect to it by specifying the +\--meta-connect+
argument. This is a JDBC connect string just like the ones used to
connect to databases for import.

In +conf/sqoop-site.xml+, you can configure
+sqoop.metastore.client.autoconnect.url+ with this address, so you do not have
to supply +\--meta-connect+ to use a remote metastore. This parameter can
also be modified to move the private metastore to a location on your
filesystem other than your home directory.
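
For example, such an entry in +conf/sqoop-site.xml+ might look like the
following sketch (the host name is a placeholder for your own metastore
server):

----
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop</value>
</property>
----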

If you configure +sqoop.metastore.client.enable.autoconnect+ with the
value +false+, then you must explicitly supply +\--meta-connect+.

If the +\--meta-connect+ option is present, then Sqoop will try to connect to
the metastore database specified in this parameter value, using the username
and password supplied in the +\--meta-username+ and +\--meta-password+
parameters. If they are not present, Sqoop will use an empty username and
password. If the database in the connection string is not supported, then
Sqoop will throw an exception.

If the +\--meta-connect+ parameter is not present and the
+sqoop.metastore.client.enable.autoconnect+ configuration parameter is set to
+false+ (the default value is +true+), then Sqoop will throw an error, since
there is no applicable metastore implementation.

Job data can be stored in MySQL, PostgreSQL, DB2, SQL Server, and Oracle via
the +\--meta-connect+ argument. The +\--meta-username+ and +\--meta-password+
arguments are necessary if the database containing the saved jobs requires a
username and password. When using any of these implementations, you must
ensure that the database is online and accessible when Sqoop tries to
connect to it.

Examples
~~~~~~~~

Listing available jobs in the metastore:
----
$ sqoop job --list --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----

Creating a new job in the metastore:
----
$ sqoop job --create myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password -- import \
    --connect jdbc:mysql://mysqlhost:3306/sqoop --username sqoop --password sqoop \
    --table "TestTable" -m 1
----

Executing an existing job:
----
$ sqoop job --exec myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----

Showing the definition of an existing job:
----
$ sqoop job --show myjob2 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----

Deleting an existing job:
----
$ sqoop job --delete myjob1 --meta-connect jdbc:oracle:thin:@//myhost:1521/ORCLCDB \
    --meta-username ms_user --meta-password ms_password
----

Using HSQLDB:
----
$ sqoop job --exec myjob --meta-connect jdbc:hsqldb:hsql://localhost:3000/ \
    --meta-username <username> --meta-password <password>
----

.Common options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--help+                    Print usage instructions
+\--verbose+                 Print more information while working
-----------------------------------------------------------------------

Saved jobs and passwords
~~~~~~~~~~~~~~~~~~~~~~~~

The Sqoop metastore is not a secure resource. Multiple users can access
its contents. For this reason, Sqoop does not store passwords in the
metastore. If you create a job that requires a password, you will be
prompted for that password each time you execute the job.

You can enable passwords in the metastore by setting
+sqoop.metastore.client.record.password+ to +true+ in the configuration.

Note that you have to set +sqoop.metastore.client.record.password+ to +true+
if you are executing saved jobs via Oozie because Sqoop cannot prompt the user
to enter passwords while being executed as Oozie tasks.
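
A sketch of the corresponding entry in +conf/sqoop-site.xml+:

----
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
----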

Saved jobs and incremental imports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Incremental imports are performed by comparing the values in a _check column_
against a reference value for the most recent import. For example, if the
+\--incremental append+ argument was specified, along with +\--check-column
id+ and +\--last-value 100+, all rows with +id > 100+ will be imported.
If an incremental import is run from the command line, the value which
should be specified as +\--last-value+ in a subsequent incremental import
will be printed to the screen for your reference. If an incremental import is
run from a saved job, this value will be retained in the saved job. Subsequent
runs of +sqoop job \--exec someIncrementalJob+ will continue to import only
newer rows than those previously imported.
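
For example, such a saved incremental job could be defined as follows,
reusing the hypothetical connect string and table from the earlier examples:

----
$ sqoop job --create someIncrementalJob -- import \
    --connect jdbc:mysql://example.com/db --table mytable \
    --incremental append --check-column id --last-value 100
----

Each subsequent +sqoop job \--exec someIncrementalJob+ run updates the
stored last value automatically, so you do not need to supply
+\--last-value+ again.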


+sqoop-metastore+
-----------------

Purpose
~~~~~~~

include::metastore-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
----

Although the Hadoop generic arguments must precede any metastore arguments,
the metastore arguments can be entered in any order with respect to one
another.

.Metastore management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--shutdown+                Shuts down a running metastore instance \
                             on the same machine.
-----------------------------------------------------------------------

Running +sqoop-metastore+ launches a shared HSQLDB database instance on
the current machine. Clients can connect to this metastore and create jobs
which can be shared between users for execution.

The location of the metastore's files on disk is controlled by the
+sqoop.metastore.server.location+ property in +conf/sqoop-site.xml+.
This should point to a directory on the local filesystem.

The metastore is available over TCP/IP. The port is controlled by the
+sqoop.metastore.server.port+ configuration parameter, and defaults to 16000.
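
A sketch of these two entries in +conf/sqoop-site.xml+ (the storage
directory shown is a hypothetical choice):

----
<property>
  <name>sqoop.metastore.server.location</name>
  <value>/var/lib/sqoop/metastore</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>
----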

Clients should connect to the metastore by specifying
+sqoop.metastore.client.autoconnect.url+ or +\--meta-connect+ with a
JDBC-URI string. For example,
+jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop+.
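
For example, a client could list the jobs held by that shared metastore with
a command like the following sketch:

----
$ sqoop job --list \
    --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop
----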

Alternatively, one can start an RDBMS to host the metastore and pass the
connection parameters to Sqoop. This metastore may be hosted on a machine
within the Hadoop cluster, or elsewhere on the network. Sqoop supports the
following database implementations: MySQL, Oracle, PostgreSQL, SQL Server,
and DB2.

+sqoop-merge+
-------------

Purpose
~~~~~~~

include::merge-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
----

Although the Hadoop generic arguments must precede any merge arguments,
the merge arguments can be entered in any order with respect to one
another.

.Merge options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--class-name <class>+      Specify the name of the record-specific \
                             class to use during the merge job.
+\--jar-file <file>+         Specify the name of the jar to load the \
                             record class from.
+\--merge-key <col>+         Specify the name of a column to use as \
                             the merge key.
+\--new-data <path>+         Specify the path of the newer dataset.
+\--onto <path>+             Specify the path of the older dataset.
+\--target-dir <path>+       Specify the target path for the output \
                             of the merge job.
-----------------------------------------------------------------------

The +merge+ tool runs a MapReduce job that takes two directories as
input: a newer dataset, and an older one. These are specified with
+\--new-data+ and +\--onto+ respectively. The output of the MapReduce
job will be placed in the directory in HDFS specified by +\--target-dir+.

When merging the datasets, it is assumed that there is a unique primary
key value in each record. The column for the primary key is specified
with +\--merge-key+. Multiple rows in the same dataset should not
have the same primary key, or else data loss may occur.

To parse the dataset and extract the key column, the auto-generated
class from a previous import must be used. You should specify the
class name and jar file with +\--class-name+ and +\--jar-file+. If
these are not available, you can recreate the class using the +codegen+
tool.
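
For example, the class could be regenerated with a sketch like the following,
reusing the hypothetical connect string and table from the import examples;
the tool reports where it writes the generated jar, which can then be passed
to +\--jar-file+ along with the class name for +\--class-name+:

----
$ sqoop codegen --connect jdbc:mysql://example.com/db --table mytable
----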

The merge tool is typically run after an incremental import with the
date-last-modified mode (+sqoop import --incremental lastmodified ...+).

Supposing two incremental imports were performed, where some older data
is in an HDFS directory named +older+ and newer data is in an HDFS
directory named +newer+, these could be merged like so:

----
$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id
----

This would run a MapReduce job where the value in the +id+ column
of each row is used to join rows; rows in the +newer+ dataset will
be used in preference to rows in the +older+ dataset.

This can be used with SequenceFile-, Avro-, and text-based incremental
imports. The file types of the newer and older datasets must be the same.