| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> |
| <?asciidoc-toc?> |
| <?asciidoc-numbered?> |
| |
| <article lang="en"> |
| <articleinfo> |
| <title>Sqoop User Guide (v1.4.7-SNAPSHOT)</title> |
| </articleinfo> |
| <screen> Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License.</screen> |
| <section id="_introduction"> |
| <title>Introduction</title> |
| <simpara>Sqoop is a tool designed to transfer data between Hadoop and |
| relational databases or mainframes. You can use Sqoop to import data from a |
| relational database management system (RDBMS) such as MySQL or Oracle or a |
| mainframe into the Hadoop Distributed File System (HDFS), |
| transform the data in Hadoop MapReduce, and then export the data back |
| into an RDBMS.</simpara> |
| <simpara>Sqoop automates most of this process, relying on the database to |
| describe the schema for the data to be imported. Sqoop uses MapReduce |
| to import and export the data, which provides parallel operation as |
| well as fault tolerance.</simpara> |
| <simpara>This document describes how to get started using Sqoop to move data |
| between databases and Hadoop or mainframe to Hadoop and provides reference |
| information for the operation of the Sqoop command-line tool suite. This |
| document is intended for:</simpara> |
| <itemizedlist> |
| <listitem> |
| <simpara> |
| System and application programmers |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| System administrators |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| Database administrators |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| Data analysts |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| Data engineers |
| </simpara> |
| </listitem> |
| </itemizedlist> |
| </section> |
| <section id="_supported_releases"> |
| <title>Supported Releases</title> |
| <simpara>This documentation applies to Sqoop v1.4.7-SNAPSHOT.</simpara> |
| </section> |
| <section id="_sqoop_releases"> |
| <title>Sqoop Releases</title> |
| <simpara>Sqoop is an open source software product of the Apache Software Foundation.</simpara> |
<simpara>Software development for Sqoop occurs at <ulink url="http://sqoop.apache.org">http://sqoop.apache.org</ulink>.
At that site you can obtain:</simpara>
| <itemizedlist> |
| <listitem> |
| <simpara> |
| New releases of Sqoop as well as its most recent source code |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| An issue tracker |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| A wiki that contains Sqoop documentation |
| </simpara> |
| </listitem> |
| </itemizedlist> |
| </section> |
| <section id="_prerequisites"> |
| <title>Prerequisites</title> |
| <simpara>The following prerequisite knowledge is required for this product:</simpara> |
| <itemizedlist> |
| <listitem> |
| <simpara> |
| Basic computer technology and terminology |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| Familiarity with command-line interfaces such as <literal>bash</literal> |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| Relational database management systems |
| </simpara> |
| </listitem> |
| <listitem> |
| <simpara> |
| Basic familiarity with the purpose and operation of Hadoop |
| </simpara> |
| </listitem> |
| </itemizedlist> |
| <simpara>Before you can use Sqoop, a release of Hadoop must be installed and |
configured. Sqoop currently supports four major Hadoop release lines: 0.20,
0.23, 1.0, and 2.0.</simpara>
| <simpara>This document assumes you are using a Linux or Linux-like environment. |
If you are using Windows, you may be able to use Cygwin to accomplish
| most of the following tasks. If you are using Mac OS X, you should see |
| few (if any) compatibility errors. Sqoop is predominantly operated and |
| tested on Linux.</simpara> |
| </section> |
| <section id="_basic_usage"> |
| <title>Basic Usage</title> |
| <simpara>With Sqoop, you can <emphasis>import</emphasis> data from a relational database system or a |
mainframe into HDFS. The input to the import process is either a database
table or a set of mainframe datasets. For databases, Sqoop will read the table row-by-row
| into HDFS. For mainframe datasets, Sqoop will read records from each mainframe |
| dataset into HDFS. The output of this import process is a set of files |
| containing a copy of the imported table or datasets. |
| The import process is performed in parallel. For this reason, the |
| output will be in multiple files. These files may be delimited text |
| files (for example, with commas or tabs separating each field), or |
| binary Avro or SequenceFiles containing serialized record data.</simpara> |
| <simpara>A by-product of the import process is a generated Java class which |
| can encapsulate one row of the imported table. This class is used |
| during the import process by Sqoop itself. The Java source code for |
| this class is also provided to you, for use in subsequent MapReduce |
| processing of the data. This class can serialize and deserialize data |
| to and from the SequenceFile format. It can also parse the |
| delimited-text form of a record. These abilities allow you to quickly |
| develop MapReduce applications that use the HDFS-stored records in |
your processing pipeline. You are also free to parse the delimited
| record data yourself, using any other tools you prefer.</simpara> |
| <simpara>After manipulating the imported records (for example, with MapReduce |
| or Hive) you may have a result data set which you can then <emphasis>export</emphasis> |
| back to the relational database. Sqoop’s export process will read |
| a set of delimited text files from HDFS in parallel, parse them into |
| records, and insert them as new rows in a target database table, for |
| consumption by external applications or users.</simpara> |
| <simpara>Sqoop includes some other commands which allow you to inspect the |
| database you are working with. For example, you can list the available |
| database schemas (with the <literal>sqoop-list-databases</literal> tool) and tables |
| within a schema (with the <literal>sqoop-list-tables</literal> tool). Sqoop also |
| includes a primitive SQL execution shell (the <literal>sqoop-eval</literal> tool).</simpara> |
| <simpara>Most aspects of the import, code generation, and export processes can |
| be customized. For databases, you can control the specific row range or |
| columns imported. You can specify particular delimiters and escape characters |
| for the file-based representation of the data, as well as the file format |
| used. You can also control the class or package names used in |
| generated code. Subsequent sections of this document explain how to |
| specify these and other arguments to Sqoop.</simpara> |
| </section> |
| <section id="_sqoop_tools"> |
| <title>Sqoop Tools</title> |
| <simpara>Sqoop is a collection of related tools. To use Sqoop, you specify the |
| tool you want to use and the arguments that control the tool.</simpara> |
| <simpara>If Sqoop is compiled from its own source, you can run Sqoop without a formal |
| installation process by running the <literal>bin/sqoop</literal> program. Users |
| of a packaged deployment of Sqoop (such as an RPM shipped with Apache Bigtop) |
| will see this program installed as <literal>/usr/bin/sqoop</literal>. The remainder of this |
| documentation will refer to this program as <literal>sqoop</literal>. For example:</simpara> |
| <screen>$ sqoop tool-name [tool-arguments]</screen> |
| <note><simpara>The following examples that begin with a <literal>$</literal> character indicate |
| that the commands must be entered at a terminal prompt (such as |
| <literal>bash</literal>). The <literal>$</literal> character represents the prompt itself; you should |
| not start these commands by typing a <literal>$</literal>. You can also enter commands |
| inline in the text of a paragraph; for example, <literal>sqoop help</literal>. These |
| examples do not show a <literal>$</literal> prefix, but you should enter them the same |
| way. Don’t confuse the <literal>$</literal> shell prompt in the examples with the <literal>$</literal> |
| that precedes an environment variable name. For example, the string |
| literal <literal>$HADOOP_HOME</literal> includes a "<literal>$</literal>".</simpara></note> |
| <simpara>Sqoop ships with a help tool. To display a list of all available |
| tools, type the following command:</simpara> |
| <screen>$ sqoop help |
| usage: sqoop COMMAND [ARGS] |
| |
| Available commands: |
| codegen Generate code to interact with database records |
| create-hive-table Import a table definition into Hive |
| eval Evaluate a SQL statement and display the results |
| export Export an HDFS directory to a database table |
| help List available commands |
| import Import a table from a database to HDFS |
| import-all-tables Import tables from a database to HDFS |
| import-mainframe Import mainframe datasets to HDFS |
| list-databases List available databases on a server |
| list-tables List available tables in a database |
| version Display version information |
| |
| See 'sqoop help COMMAND' for information on a specific command.</screen> |
| <simpara>You can display help for a specific tool by entering: <literal>sqoop help |
| (tool-name)</literal>; for example, <literal>sqoop help import</literal>.</simpara> |
| <simpara>You can also add the <literal>--help</literal> argument to any command: <literal>sqoop import |
| --help</literal>.</simpara> |
| <section id="_using_command_aliases"> |
| <title>Using Command Aliases</title> |
| <simpara>In addition to typing the <literal>sqoop (toolname)</literal> syntax, you can use alias |
| scripts that specify the <literal>sqoop-(toolname)</literal> syntax. For example, the |
| scripts <literal>sqoop-import</literal>, <literal>sqoop-export</literal>, etc. each select a specific |
| tool.</simpara> |
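<simpara>For example, the following two invocations are equivalent (the connect
string shown is hypothetical):</simpara>
<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees
$ sqoop-import --connect jdbc:mysql://database.example.com/employees</screen>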
| </section> |
| <section id="_controlling_the_hadoop_installation"> |
| <title>Controlling the Hadoop Installation</title> |
| <simpara>You invoke Sqoop through the program launch capability provided by |
| Hadoop. The <literal>sqoop</literal> command-line program is a wrapper which runs the |
| <literal>bin/hadoop</literal> script shipped with Hadoop. If you have multiple |
| installations of Hadoop present on your machine, you can select the |
| Hadoop installation by setting the <literal>$HADOOP_COMMON_HOME</literal> and |
| <literal>$HADOOP_MAPRED_HOME</literal> environment variables.</simpara> |
| <simpara>For example:</simpara> |
| <screen>$ HADOOP_COMMON_HOME=/path/to/some/hadoop \ |
| HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \ |
| sqoop import --arguments...</screen> |
| <simpara>or:</simpara> |
| <screen>$ export HADOOP_COMMON_HOME=/some/path/to/hadoop |
| $ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce |
| $ sqoop import --arguments...</screen> |
<simpara>If either of these variables is not set, Sqoop will fall back to
| <literal>$HADOOP_HOME</literal>. If it is not set either, Sqoop will use the default |
| installation locations for Apache Bigtop, <literal>/usr/lib/hadoop</literal> and |
| <literal>/usr/lib/hadoop-mapreduce</literal>, respectively.</simpara> |
| <simpara>The active Hadoop configuration is loaded from <literal>$HADOOP_HOME/conf/</literal>, |
| unless the <literal>$HADOOP_CONF_DIR</literal> environment variable is set.</simpara> |
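<simpara>For example, to point Sqoop at an alternate configuration directory (the
path below is an assumption for illustration):</simpara>
<screen>$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ sqoop import --arguments...</screen>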
| </section> |
| <section id="_using_generic_and_specific_arguments"> |
| <title>Using Generic and Specific Arguments</title> |
| <simpara>To control the operation of each Sqoop tool, you use generic and |
| specific arguments.</simpara> |
| <simpara>For example:</simpara> |
| <screen>$ sqoop help import |
| usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS] |
| |
| Common arguments: |
| --connect <jdbc-uri> Specify JDBC connect string |
| --connect-manager <class-name> Specify connection manager class to use |
| --driver <class-name> Manually specify JDBC driver class to use |
| --hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME |
| --help Print usage instructions |
| --password-file Set path for file containing authentication password |
| -P Read password from console |
| --password <password> Set authentication password |
| --username <username> Set authentication username |
| --verbose Print more information while working |
| --hadoop-home <dir> Deprecated. Override $HADOOP_HOME |
| |
| [...] |
| |
| Generic Hadoop command-line arguments: |
| (must preceed any tool-specific arguments) |
| Generic options supported are |
| -conf <configuration file> specify an application configuration file |
| -D <property=value> use value for given property |
| -fs <local|namenode:port> specify a namenode |
| -jt <local|jobtracker:port> specify a job tracker |
| -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster |
| -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath. |
| -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines. |
| |
| The general command line syntax is |
| bin/hadoop command [genericOptions] [commandOptions]</screen> |
| <simpara>You must supply the generic arguments <literal>-conf</literal>, <literal>-D</literal>, and so on after the |
| tool name but <emphasis role="strong">before</emphasis> any tool-specific arguments (such as |
<literal>--connect</literal>). Note that generic Hadoop arguments are preceded by a
| single dash character (<literal>-</literal>), whereas tool-specific arguments start |
| with two dashes (<literal>--</literal>), unless they are single character arguments such as <literal>-P</literal>.</simpara> |
| <simpara>The <literal>-conf</literal>, <literal>-D</literal>, <literal>-fs</literal> and <literal>-jt</literal> arguments control the configuration |
and Hadoop server settings. For example, <literal>-D mapred.job.name=<job_name></literal> can
be used to set the name of the MapReduce job that Sqoop launches. If it is not
specified, the name defaults to the jar name for the job, which is derived
from the table name used.</simpara>
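<simpara>As a sketch, the job name could be set like this (the connect string,
table, and job name are hypothetical):</simpara>
<screen>$ sqoop import -D mapred.job.name=nightly-employees-import \
    --connect jdbc:mysql://database.example.com/employees --table employees</screen>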
| <simpara>The <literal>-files</literal>, <literal>-libjars</literal>, and <literal>-archives</literal> arguments are not typically used with |
| Sqoop, but they are included as part of Hadoop’s internal argument-parsing |
| system.</simpara> |
| </section> |
| <section id="_using_options_files_to_pass_arguments"> |
| <title>Using Options Files to Pass Arguments</title> |
| <simpara>When using Sqoop, the command line options that do not change from |
| invocation to invocation can be put in an options file for convenience. |
| An options file is a text file where each line identifies an option in |
| the order that it appears otherwise on the command line. Option files |
| allow specifying a single option on multiple lines by using the |
| back-slash character at the end of intermediate lines. Also supported |
| are comments within option files that begin with the hash character. |
| Comments must be specified on a new line and may not be mixed with |
| option text. All comments and empty lines are ignored when option |
| files are expanded. Unless options appear as quoted strings, any |
| leading or trailing spaces are ignored. Quoted strings if used must |
| not extend beyond the line on which they are specified.</simpara> |
| <simpara>Option files can be specified anywhere in the command line as long as |
| the options within them follow the otherwise prescribed rules of |
| options ordering. For instance, regardless of where the options are |
| loaded from, they must follow the ordering such that generic options |
| appear first, tool specific options next, finally followed by options |
| that are intended to be passed to child programs.</simpara> |
| <simpara>To specify an options file, simply create an options file in a |
convenient location and pass it to the command line via the
<literal>--options-file</literal> argument.</simpara>
| <simpara>Whenever an options file is specified, it is expanded on the |
| command line before the tool is invoked. You can specify more than |
one options file within the same invocation if needed.</simpara>
<simpara>For example, the following Sqoop import invocation can alternatively be
specified as shown below:</simpara>
| <screen>$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST |
| |
| $ sqoop --options-file /users/homer/work/import.txt --table TEST</screen> |
| <simpara>where the options file <literal>/users/homer/work/import.txt</literal> contains the following:</simpara> |
| <screen>import |
| --connect |
| jdbc:mysql://localhost/db |
| --username |
| foo</screen> |
| <simpara>The options file can have empty lines and comments for readability purposes. |
| So the above example would work exactly the same if the options file |
| <literal>/users/homer/work/import.txt</literal> contained the following:</simpara> |
| <screen># |
| # Options file for Sqoop import |
| # |
| |
| # Specifies the tool being invoked |
| import |
| |
| # Connect parameter and value |
| --connect |
| jdbc:mysql://localhost/db |
| |
| # Username parameter and value |
| --username |
| foo |
| |
| # |
| # Remaining options should be specified in the command line. |
| #</screen> |
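<simpara>Because options files are expanded in place, more than one options file
can be combined in a single invocation; for example (both file names are
hypothetical):</simpara>
<screen>$ sqoop --options-file /users/homer/work/import.txt \
    --options-file /users/homer/work/credentials.txt --table TEST</screen>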
| </section> |
| <section id="_using_tools"> |
| <title>Using Tools</title> |
| <simpara>The following sections will describe each tool’s operation. The |
| tools are listed in the most likely order you will find them useful.</simpara> |
| </section> |
| </section> |
| <section id="_literal_sqoop_import_literal"> |
| <title><literal>sqoop-import</literal></title> |
| <section id="_purpose"> |
| <title>Purpose</title> |
| <simpara>The <literal>import</literal> tool imports an individual table from an RDBMS to HDFS. |
| Each row from a table is represented as a separate record in HDFS. |
| Records can be stored as text files (one record per line), or in |
| binary representation as Avro or SequenceFiles.</simpara> |
| </section> |
| <section id="_syntax"> |
| <title>Syntax</title> |
| <screen>$ sqoop import (generic-args) (import-args) |
| $ sqoop-import (generic-args) (import-args)</screen> |
| <simpara>While the Hadoop generic arguments must precede any import arguments, |
| you can type the import arguments in any order with respect to one |
| another.</simpara> |
| <note><simpara>In this document, arguments are grouped into collections |
| organized by function. Some collections are present in several tools |
| (for example, the "common" arguments). An extended description of their |
| functionality is given only on the first presentation in this |
| document.</simpara></note> |
| <table pgwide="0" |
| frame="topbot" |
| rowsep="1" colsep="1" |
| > |
| <title>Common arguments</title> |
| <tgroup cols="2"> |
| <colspec colwidth="248*" align="left"/> |
| <colspec colwidth="230*" align="left"/> |
| <thead> |
| <row> |
| <entry> |
| Argument |
| </entry> |
| <entry> |
| Description |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| <literal>--connect <jdbc-uri></literal> |
| </entry> |
| <entry> |
| Specify JDBC connect string |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--connection-manager <class-name></literal> |
| </entry> |
| <entry> |
| Specify connection manager class to use |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--driver <class-name></literal> |
| </entry> |
| <entry> |
| Manually specify JDBC driver class to use |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--hadoop-mapred-home <dir></literal> |
| </entry> |
| <entry> |
| Override $HADOOP_MAPRED_HOME |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--help</literal> |
| </entry> |
| <entry> |
| Print usage instructions |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--password-file</literal> |
| </entry> |
| <entry> |
| Set path for a file containing the authentication password |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>-P</literal> |
| </entry> |
| <entry> |
| Read password from console |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--password <password></literal> |
| </entry> |
| <entry> |
| Set authentication password |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--username <username></literal> |
| </entry> |
| <entry> |
| Set authentication username |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--verbose</literal> |
| </entry> |
| <entry> |
| Print more information while working |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--connection-param-file <filename></literal> |
| </entry> |
| <entry> |
| Optional properties file that provides connection parameters |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--relaxed-isolation</literal> |
| </entry> |
| <entry> |
| Set connection transaction isolation to read uncommitted for the mappers. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| <section id="_connecting_to_a_database_server"> |
| <title>Connecting to a Database Server</title> |
| <simpara>Sqoop is designed to import tables from a database into HDFS. To do |
| so, you must specify a <emphasis>connect string</emphasis> that describes how to connect to the |
| database. The <emphasis>connect string</emphasis> is similar to a URL, and is communicated to |
| Sqoop with the <literal>--connect</literal> argument. This describes the server and |
| database to connect to; it may also specify the port. For example:</simpara> |
| <screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees</screen> |
| <simpara>This string will connect to a MySQL database named <literal>employees</literal> on the |
| host <literal>database.example.com</literal>. It’s important that you <emphasis role="strong">do not</emphasis> use the URL |
| <literal>localhost</literal> if you intend to use Sqoop with a distributed Hadoop |
| cluster. The connect string you supply will be used on TaskTracker nodes |
| throughout your MapReduce cluster; if you specify the |
| literal name <literal>localhost</literal>, each node will connect to a different |
| database (or more likely, no database at all). Instead, you should use |
| the full hostname or IP address of the database host that can be seen |
| by all your remote nodes.</simpara> |
| <simpara>You might need to authenticate against the database before you can |
access it. You can use the <literal>--username</literal> argument to supply a username to
the database. Sqoop provides a couple of different ways, secure and non-secure,
to supply a password to the database, as detailed below.</simpara>
<formalpara><title>Secure way of supplying a password to the database</title><para>Save the password in a file in the user's home directory with 400
permissions and specify the path to that file using the <emphasis role="strong"><literal>--password-file</literal></emphasis>
argument; this is the preferred method of entering credentials. Sqoop will
then read the password from the file and pass it to the MapReduce cluster
using secure means without exposing the password in the job configuration.
The file containing the password can be either on the local FS or HDFS.
For example:</para></formalpara>
| <screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \ |
| --username venkatesh --password-file ${user.home}/.password</screen> |
<warning><simpara>Sqoop will read the entire content of the password file and use it as
the password. This will include any trailing white space characters, such as
the newline characters that most text editors add by default.
You need to make sure that your password file contains only the characters
that belong to your password. On the command line you can use the
<literal>echo</literal> command with the <literal>-n</literal> switch to store a password without any
trailing white space characters. For example, to store the password <literal>secret</literal>
you would run <literal>echo -n "secret" > password.file</literal>.</simpara></warning>
| <simpara>Another way of supplying passwords is using the <literal>-P</literal> argument which will |
| read a password from a console prompt.</simpara> |
<formalpara><title>Protecting password from prying eyes</title><para>Hadoop 2.6.0 provides an API to separate password storage from applications.
This API is called the credential provider API, and there is a new
<literal>credential</literal> command line tool to manage passwords and their aliases.
The passwords are stored with their aliases in a keystore that is password
protected. The keystore password can be provided at a password prompt
on the command line, via an environment variable, or defaulted to a software
defined constant. Please check the Hadoop documentation on the usage
of this facility.</para></formalpara>
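<simpara>As a sketch only (the alias and provider path are assumptions, and the
exact syntax may vary across Hadoop versions), creating an alias with the
Hadoop credential tool might look like:</simpara>
<screen>$ hadoop credential create mydb.password.alias \
    -provider jceks://hdfs/user/homer/mydb.password.jceks</screen>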
| <simpara>Once the password is stored using the Credential Provider facility and |
| the Hadoop configuration has been suitably updated, all applications can |
| optionally use the alias in place of the actual password and at runtime |
| resolve the alias for the password to use.</simpara> |
<simpara>Since the keystore or similar technology backing the credential
provider is shared across components, passwords for various applications and
databases can be securely stored in it, and only the alias needs to be
exposed in configuration files, protecting the passwords from being
visible.</simpara>
<simpara>Sqoop has been enhanced to allow usage of this functionality if it is
available in the underlying Hadoop version being used. One new option
has been introduced to provide the alias on the command line instead of the
actual password (<literal>--password-alias</literal>). The value of this argument is the
alias in the credential store associated with the actual password.
| Example usage is as follows:</simpara> |
| <screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \ |
| --username dbuser --password-alias mydb.password.alias</screen> |
<simpara>Similarly, if the command line option is not preferred, the alias can be saved
in the file provided with the <literal>--password-file</literal> option. Along with this, the
Sqoop configuration parameter <literal>org.apache.sqoop.credentials.loader.class</literal>
should be set to the classname that provides the alias resolution:
<literal>org.apache.sqoop.util.password.CredentialProviderPasswordLoader</literal></simpara>
<simpara>Example usage is as follows (assuming <literal>.password.alias</literal> contains the alias
for the real password):</simpara>
| <screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \ |
| --username dbuser --password-file ${user.home}/.password-alias</screen> |
| <warning><simpara>The <literal>--password</literal> parameter is insecure, as other users may |
| be able to read your password from the command-line arguments via |
| the output of programs such as <literal>ps</literal>. The <emphasis role="strong"><literal>-P</literal></emphasis> argument is the preferred |
| method over using the <literal>--password</literal> argument. Credentials may still be |
| transferred between nodes of the MapReduce cluster using insecure means. |
| For example:</simpara></warning> |
| <screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \ |
| --username aaron --password 12345</screen> |
| <simpara>Sqoop automatically supports several databases, including MySQL. Connect |
| strings beginning with <literal>jdbc:mysql://</literal> are handled automatically in Sqoop. (A |
| full list of databases with built-in support is provided in the "Supported |
| Databases" section. For some, you may need to install the JDBC driver |
| yourself.)</simpara> |
| <simpara>You can use Sqoop with any other |
| JDBC-compliant database. First, download the appropriate JDBC |
| driver for the type of database you want to import, and install the .jar |
| file in the <literal>$SQOOP_HOME/lib</literal> directory on your client machine. (This will |
| be <literal>/usr/lib/sqoop/lib</literal> if you installed from an RPM or Debian package.) |
| Each driver <literal>.jar</literal> file also has a specific driver class which defines |
| the entry-point to the driver. For example, MySQL’s Connector/J library has |
| a driver class of <literal>com.mysql.jdbc.Driver</literal>. Refer to your database |
| vendor-specific documentation to determine the main driver class. |
| This class must be provided as an argument to Sqoop with <literal>--driver</literal>.</simpara> |
| <simpara>For example, to connect to a SQLServer database, first download the driver from |
| microsoft.com and install it in your Sqoop lib path.</simpara> |
| <simpara>Then run Sqoop. For example:</simpara> |
| <screen>$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \ |
| --connect <connect-string> ...</screen> |
| <simpara>When connecting to a database using JDBC, you can optionally specify extra |
| JDBC parameters via a property file using the option |
| <literal>--connection-param-file</literal>. The contents of this file are parsed as standard |
| Java properties and passed into the driver while creating a connection.</simpara> |
| <note><simpara>The parameters specified via the optional property file are only |
| applicable to JDBC connections. Any fastpath connectors that use connections |
| other than JDBC will ignore these parameters.</simpara></note> |
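<simpara>For illustration, a hypothetical properties file passed via
<literal>--connection-param-file</literal> might contain driver properties such as the
following (the property names depend on your JDBC driver):</simpara>
<screen># Hypothetical driver-specific connection properties
characterEncoding=UTF-8
autoReconnect=true</screen>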
| <table pgwide="0" |
| frame="topbot" |
| rowsep="1" colsep="1" |
| > |
| <title>Validation arguments <link linkend="validation">More Details</link></title> |
| <tgroup cols="2"> |
| <colspec colwidth="267*" align="left"/> |
| <colspec colwidth="230*" align="left"/> |
| <thead> |
| <row> |
| <entry> |
| Argument |
| </entry> |
| <entry> |
| Description |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| <literal>--validate</literal> |
| </entry> |
| <entry> |
Enable validation of data copied; supports single table copy only.
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--validator <class-name></literal> |
| </entry> |
| <entry> |
| Specify validator class to use. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--validation-threshold <class-name></literal> |
| </entry> |
| <entry> |
| Specify validation threshold class to use. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--validation-failurehandler <class-name></literal> |
| </entry> |
| <entry> |
| Specify validation failure handler class to use. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| <table pgwide="0" |
| frame="topbot" |
| rowsep="1" colsep="1" |
| > |
| <title>Import control arguments:</title> |
| <tgroup cols="2"> |
| <colspec colwidth="206*" align="left"/> |
| <colspec colwidth="236*" align="left"/> |
| <thead> |
| <row> |
| <entry> |
| Argument |
| </entry> |
| <entry> |
| Description |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| <literal>--append</literal> |
| </entry> |
| <entry> |
| Append data to an existing dataset in HDFS |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--as-avrodatafile</literal> |
| </entry> |
| <entry> |
| Imports data to Avro Data Files |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--as-sequencefile</literal> |
| </entry> |
| <entry> |
| Imports data to SequenceFiles |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--as-textfile</literal> |
| </entry> |
| <entry> |
| Imports data as plain text (default) |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--as-parquetfile</literal> |
| </entry> |
| <entry> |
| Imports data to Parquet Files |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--boundary-query <statement></literal> |
| </entry> |
| <entry> |
| Boundary query to use for creating splits |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--columns <col,col,col…></literal> |
| </entry> |
| <entry> |
| Columns to import from table |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--delete-target-dir</literal> |
| </entry> |
| <entry> |
| Delete the import target directory if it exists |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--direct</literal> |
| </entry> |
| <entry> |
| Use direct connector if exists for the database |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--fetch-size <n></literal> |
| </entry> |
| <entry> |
| Number of entries to read from database at once. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--inline-lob-limit <n></literal> |
| </entry> |
| <entry> |
| Set the maximum size for an inline LOB |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>-m,--num-mappers <n></literal> |
| </entry> |
| <entry> |
| Use <emphasis>n</emphasis> map tasks to import in parallel |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>-e,--query <statement></literal> |
| </entry> |
| <entry> |
| Import the results of <emphasis><literal>statement</literal></emphasis>. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--split-by <column-name></literal> |
| </entry> |
| <entry> |
| Column of the table used to split work units |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--table <table-name></literal> |
| </entry> |
| <entry> |
| Table to read |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--target-dir <dir></literal> |
| </entry> |
| <entry> |
| HDFS destination dir |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--warehouse-dir <dir></literal> |
| </entry> |
| <entry> |
| HDFS parent for table destination |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--where <where clause></literal> |
| </entry> |
| <entry> |
| WHERE clause to use during import |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>-z,--compress</literal> |
| </entry> |
| <entry> |
| Enable compression |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--compression-codec <c></literal> |
| </entry> |
| <entry> |
| Use Hadoop codec (default gzip) |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--null-string <null-string></literal> |
| </entry> |
| <entry> |
| The string to be written for a null value for string columns |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--null-non-string <null-string></literal> |
| </entry> |
| <entry> |
| The string to be written for a null value for non-string columns |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
<simpara>The <literal>--null-string</literal> and <literal>--null-non-string</literal> arguments are optional.
| If not specified, then the string "null" will be used.</simpara> |
| </section> |
| <section id="_selecting_the_data_to_import"> |
| <title>Selecting the Data to Import</title> |
| <simpara>Sqoop typically imports data in a table-centric fashion. Use the |
| <literal>--table</literal> argument to select the table to import. For example, <literal>--table |
| employees</literal>. This argument can also identify a <literal>VIEW</literal> or other table-like |
| entity in a database.</simpara> |
| <simpara>By default, all columns within a table are selected for import. |
| Imported data is written to HDFS in its "natural order;" that is, a |
table containing columns A, B, and C results in an import of data such
| as:</simpara> |
| <screen>A1,B1,C1 |
| A2,B2,C2 |
| ...</screen> |
| <simpara>You can select a subset of columns and control their ordering by using |
| the <literal>--columns</literal> argument. This should include a comma-delimited list |
| of columns to import. For example: <literal>--columns "name,employee_id,jobtitle"</literal>.</simpara> |
| <simpara>You can control which rows are imported by adding a SQL <literal>WHERE</literal> clause |
| to the import statement. By default, Sqoop generates statements of the |
| form <literal>SELECT <column list> FROM <table name></literal>. You can append a |
| <literal>WHERE</literal> clause to this with the <literal>--where</literal> argument. For example: <literal>--where |
| "id > 400"</literal>. Only rows where the <literal>id</literal> column has a value greater than |
| 400 will be imported.</simpara> |
<simpara>By default Sqoop will use the query <literal>select min(<split-by>), max(<split-by>) from
<table name></literal> to find out the boundaries for creating splits. In some cases this
query is not the most optimal, so you can specify any arbitrary query returning
two numeric columns using the <literal>--boundary-query</literal> argument.</simpara>
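<simpara>For example, a sketch that computes split boundaries over a subset of rows
(the table, column, and predicate are hypothetical):</simpara>
<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by id \
    --boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE active = 1"</screen>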
| </section> |
| <section id="_free_form_query_imports"> |
| <title>Free-form Query Imports</title> |
| <simpara>Sqoop can also import the result set of an arbitrary SQL query. Instead of |
| using the <literal>--table</literal>, <literal>--columns</literal> and <literal>--where</literal> arguments, you can specify |
| a SQL statement with the <literal>--query</literal> argument.</simpara> |
| <simpara>When importing a free-form query, you must specify a destination directory |
| with <literal>--target-dir</literal>.</simpara> |
| <simpara>If you want to import the results of a query in parallel, then each map task |
| will need to execute a copy of the query, with results partitioned by bounding |
| conditions inferred by Sqoop. Your query must include the token <literal>$CONDITIONS</literal> |
| which each Sqoop process will replace with a unique condition expression. |
| You must also select a splitting column with <literal>--split-by</literal>.</simpara> |
| <simpara>For example:</simpara> |
| <screen>$ sqoop import \ |
| --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \ |
| --split-by a.id --target-dir /user/foo/joinresults</screen> |
<simpara>Alternatively, the query can be executed once and imported serially, by
| specifying a single map task with <literal>-m 1</literal>:</simpara> |
| <screen>$ sqoop import \ |
| --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \ |
| -m 1 --target-dir /user/foo/joinresults</screen> |
| <note><simpara>If you are issuing the query wrapped with double quotes ("), |
| you will have to use <literal>\$CONDITIONS</literal> instead of just <literal>$CONDITIONS</literal> |
| to disallow your shell from treating it as a shell variable. |
| For example, a double quoted query may look like: |
| <literal>"SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"</literal></simpara></note> |
| <note><simpara>The facility of using free-form query in the current version of Sqoop |
| is limited to simple queries where there are no ambiguous projections and |
| no <literal>OR</literal> conditions in the <literal>WHERE</literal> clause. Use of complex queries such as |
| queries that have sub-queries or joins leading to ambiguous projections can |
| lead to unexpected results.</simpara></note> |
| </section> |
| <section id="_controlling_parallelism"> |
| <title>Controlling Parallelism</title> |
| <simpara>Sqoop imports data in parallel from most database sources. You can |
| specify the number |
| of map tasks (parallel processes) to use to perform the import by |
| using the <literal>-m</literal> or <literal>--num-mappers</literal> argument. Each of these arguments |
| takes an integer value which corresponds to the degree of parallelism |
| to employ. By default, four tasks are used. Some databases may see |
| improved performance by increasing this value to 8 or 16. Do not |
increase the degree of parallelism beyond that available within
your MapReduce cluster; tasks will run serially and will likely
increase the amount of time required to perform the import. Likewise,
do not increase the degree of parallelism beyond what your
| database can reasonably support. Connecting 100 concurrent clients to |
| your database may increase the load on the database server to a point |
| where performance suffers as a result.</simpara> |
| <simpara>When performing parallel imports, Sqoop needs a criterion by which it |
| can split the workload. Sqoop uses a <emphasis>splitting column</emphasis> to split the |
| workload. By default, Sqoop will identify the primary key column (if |
| present) in a table and use it as the splitting column. The low and |
| high values for the splitting column are retrieved from the database, |
| and the map tasks operate on evenly-sized components of the total |
| range. For example, if you had a table with a primary key column of |
| <literal>id</literal> whose minimum value was 0 and maximum value was 1000, and Sqoop |
| was directed to use 4 tasks, Sqoop would run four processes which each |
| execute SQL statements of the form <literal>SELECT * FROM sometable WHERE id |
| >= lo AND id < hi</literal>, with <literal>(lo, hi)</literal> set to (0, 250), (250, 500), |
| (500, 750), and (750, 1001) in the different tasks.</simpara> |
| <simpara>If the actual values for the primary key are not uniformly distributed |
| across its range, then this can result in unbalanced tasks. You should |
| explicitly choose a different column with the <literal>--split-by</literal> argument. |
| For example, <literal>--split-by employee_id</literal>. Sqoop cannot currently split on |
| multi-column indices. If your table has no index column, or has a |
| multi-column key, then you must also manually choose a splitting |
| column.</simpara> |
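<simpara>Putting these together, a sketch of an import with raised parallelism and
an explicit splitting column (all names are hypothetical):</simpara>
<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by employee_id --num-mappers 8</screen>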
| </section> |
| <section id="_controlling_distributed_cache"> |
| <title>Controlling Distributed Cache</title> |
<simpara>Sqoop will copy the jars in the $SQOOP_HOME/lib folder to the job cache every
time it starts a Sqoop job. When launched by Oozie this is unnecessary,
since Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies
in the distributed cache. Oozie will do the localization on each
worker node for the Sqoop dependencies only once, during the first Sqoop
job, and reuse the jars on the worker node for subsequent jobs. Using the
option <literal>--skip-dist-cache</literal> in the Sqoop command when launched by Oozie will
skip the step in which Sqoop copies its dependencies to the job cache and save
massive I/O.</simpara>
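<simpara>For example, a sketch of an Oozie-launched import that skips the
distributed-cache copy (the connect string and table are hypothetical):</simpara>
<screen>$ sqoop import --connect jdbc:mysql://database.example.com/db \
    --table TEST --skip-dist-cache</screen>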
| </section> |
| <section id="_controlling_the_import_process"> |
| <title>Controlling the Import Process</title> |
| <simpara>By default, the import process will use JDBC which provides a |
| reasonable cross-vendor import channel. Some databases can perform |
| imports in a more high-performance fashion by using database-specific |
| data movement tools. For example, MySQL provides the <literal>mysqldump</literal> tool |
| which can export data from MySQL to other systems very quickly. By |
| supplying the <literal>--direct</literal> argument, you are specifying that Sqoop |
| should attempt the direct import channel. This channel may be |
| higher performance than using JDBC.</simpara> |
| <simpara>Details about use of direct mode with each specific RDBMS, installation requirements, available |
| options and limitations can be found in <xref linkend="connectors"/>.</simpara> |
| <simpara>By default, Sqoop will import a table named <literal>foo</literal> to a directory named |
| <literal>foo</literal> inside your home directory in HDFS. For example, if your |
| username is <literal>someuser</literal>, then the import tool will write to |
| <literal>/user/someuser/foo/(files)</literal>. You can adjust the parent directory of |
| the import with the <literal>--warehouse-dir</literal> argument. For example:</simpara> |
<screen>$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
| ...</screen> |
| <simpara>This command would write to a set of files in the <literal>/shared/foo/</literal> directory.</simpara> |
| <simpara>You can also explicitly choose the target directory, like so:</simpara> |
<screen>$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
| ...</screen> |
| <simpara>This will import the files into the <literal>/dest</literal> directory. <literal>--target-dir</literal> is |
| incompatible with <literal>--warehouse-dir</literal>.</simpara> |
| <simpara>When using direct mode, you can specify additional arguments which |
| should be passed to the underlying tool. If the argument |
| <literal>--</literal> is given on the command-line, then subsequent arguments are sent |
| directly to the underlying tool. For example, the following adjusts |
| the character set used by <literal>mysqldump</literal>:</simpara> |
| <screen>$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \ |
| --direct -- --default-character-set=latin1</screen> |
| <simpara>By default, imports go to a new target location. If the destination directory |
| already exists in HDFS, Sqoop will refuse to import and overwrite that |
| directory’s contents. If you use the <literal>--append</literal> argument, Sqoop will import |
| data to a temporary directory and then rename the files into the normal |
| target directory in a manner that does not conflict with existing filenames |
| in that directory.</simpara> |
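<simpara>For example, a sketch of appending a fresh import to an existing target
directory (the connect string and paths are hypothetical):</simpara>
<screen>$ sqoop import --connect jdbc:mysql://database.example.com/db \
    --table foo --append --target-dir /dest</screen>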
| </section> |
| <section id="_controlling_transaction_isolation"> |
| <title>Controlling transaction isolation</title> |
<simpara>By default, Sqoop uses the read committed transaction isolation level in the
mappers to import data. This may not be ideal in all ETL workflows, and it may
be desirable to reduce the isolation guarantees. The <literal>--relaxed-isolation</literal> option
can be used to instruct Sqoop to use the read uncommitted isolation level.</simpara>
| <simpara>The <literal>read-uncommitted</literal> isolation level is not supported on all databases |
| (for example, Oracle), so specifying the option <literal>--relaxed-isolation</literal> |
| may not be supported on all databases.</simpara> |
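<simpara>A minimal sketch of requesting the relaxed isolation level (the connect
string and table are hypothetical):</simpara>
<screen>$ sqoop import --connect jdbc:mysql://database.example.com/db \
    --table TEST --relaxed-isolation</screen>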
| </section> |
| <section id="_controlling_type_mapping"> |
| <title>Controlling type mapping</title> |
<simpara>Sqoop is preconfigured to map most SQL types to appropriate Java or Hive
representatives. However, the default mapping might not be suitable for
everyone and can be overridden by <literal>--map-column-java</literal> (for changing the
mapping to Java) or <literal>--map-column-hive</literal> (for changing the Hive mapping).</simpara>
| <table pgwide="0" |
| frame="topbot" |
| rowsep="1" colsep="1" |
| > |
| <title>Parameters for overriding mapping</title> |
| <tgroup cols="2"> |
| <colspec colwidth="206*" align="left"/> |
| <colspec colwidth="236*" align="left"/> |
| <thead> |
| <row> |
| <entry> |
| Argument |
| </entry> |
| <entry> |
| Description |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| <literal>--map-column-java <mapping></literal> |
| </entry> |
| <entry> |
| Override mapping from SQL to Java type for configured columns. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--map-column-hive <mapping></literal> |
| </entry> |
| <entry> |
| Override mapping from SQL to Hive type for configured columns. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
<simpara>Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:</simpara>
| <screen>$ sqoop import ... --map-column-java id=String,value=Integer</screen> |
<simpara>Sqoop will raise an exception if a configured mapping is not used.</simpara>
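<simpara>Similarly, a sketch of overriding the Hive mapping for selected columns
(the column names and types are hypothetical):</simpara>
<screen>$ sqoop import ... --map-column-hive id=STRING,price=DECIMAL</screen>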
| </section> |
| <section id="_incremental_imports"> |
| <title>Incremental Imports</title> |
| <simpara>Sqoop provides an incremental import mode which can be used to retrieve |
| only rows newer than some previously-imported set of rows.</simpara> |
| <simpara>The following arguments control incremental imports:</simpara> |
| <table pgwide="0" |
| frame="topbot" |
| rowsep="1" colsep="1" |
| > |
| <title>Incremental import arguments:</title> |
| <tgroup cols="2"> |
| <colspec colwidth="182*" align="left"/> |
| <colspec colwidth="236*" align="left"/> |
| <thead> |
| <row> |
| <entry> |
| Argument |
| </entry> |
| <entry> |
| Description |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| <literal>--check-column (col)</literal> |
| </entry> |
| <entry> |
Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--incremental (mode)</literal> |
| </entry> |
| <entry> |
| Specifies how Sqoop determines which rows are new. Legal values for <literal>mode</literal> include <literal>append</literal> and <literal>lastmodified</literal>. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <literal>--last-value (value)</literal> |
| </entry> |
| <entry> |
| Specifies the maximum value of the check column from the previous import. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| <simpara>Sqoop supports two types of incremental imports: <literal>append</literal> and <literal>lastmodified</literal>. |
| You can use the <literal>--incremental</literal> argument to specify the type of incremental |
| import to perform.</simpara> |
| <simpara>You should specify <literal>append</literal> mode when importing a table where new rows are |
| continually being added with increasing row id values. You specify the column |
| containing the row’s id with <literal>--check-column</literal>. Sqoop imports rows where the |
| check column has a value greater than the one specified with <literal>--last-value</literal>.</simpara> |
| <simpara>An alternate table update strategy supported by Sqoop is called <literal>lastmodified</literal> |
| mode. You should use this when rows of the source table may be updated, and |
| each such update will set the value of a last-modified column to the current |
| timestamp. Rows where the check column holds a timestamp more recent than the |
| timestamp specified with <literal>--last-value</literal> are imported.</simpara> |
| <simpara>At the end of an incremental import, the value which should be specified as |
| <literal>--last-value</literal> for a subsequent import is printed to the screen. When running |
| a subsequent import, you should specify <literal>--last-value</literal> in this way to ensure |
| you import only the new or updated data. This is handled automatically by |
| creating an incremental import as a saved job, which is the preferred |
| mechanism for performing a recurring incremental import. See the section on |
| saved jobs later in this document for more information.</simpara> |
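<simpara>For example, a sketch of an append-mode incremental import that picks up
rows whose <literal>id</literal> exceeds the previously imported maximum (the connect
string, table, and value are hypothetical):</simpara>
<screen>$ sqoop import --connect jdbc:mysql://database.example.com/db \
    --table orders --incremental append --check-column id --last-value 1000</screen>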
| </section> |
| <section id="_file_formats"> |
| <title>File Formats</title> |
| <simpara>You can import data in one of two file formats: delimited text or |
| SequenceFiles.</simpara> |
| <simpara>Delimited text is the default import format. You can also specify it |
| explicitly by using the <literal>--as-textfile</literal> argument. This argument will write |
| string-based representations of each record to the output files, with |
| delimiter characters between individual columns and rows. These |
| delimiters may be commas, tabs, or other characters. (The delimiters |
| can be selected; see "Output line formatting arguments.") The |
following is the result of an example text-based import:</simpara>
| <screen>1,here is a message,2010-05-01 |
| 2,happy new year!,2010-01-01 |
| 3,another message,2009-11-12</screen> |
| <simpara>Delimited text is appropriate for most non-binary data types. It also |
| readily supports further manipulation by other tools, such as Hive.</simpara> |