SQOOP-2264: Exclude and remove SqoopUserGuide.xml from git repository
(Qian Xu via Jarek Jarcec Cecho)
diff --git a/src/docs/.gitignore b/src/docs/.gitignore
index 1ac4643..519ebf4 100644
--- a/src/docs/.gitignore
+++ b/src/docs/.gitignore
@@ -17,3 +17,5 @@
/Sqoop-manpage.xml
/sqoop.1
/Sqoop-web.html
+/dev/SqoopDevGuide.xml
+/user/SqoopUserGuide.xml
diff --git a/src/docs/Makefile b/src/docs/Makefile
index 5bdf2f8..224c023 100644
--- a/src/docs/Makefile
+++ b/src/docs/Makefile
@@ -105,6 +105,7 @@
clean:
-rm $(BUILD_DIR)/sqoop-*.1.gz
+ -rm $(BUILD_DIR)/SqoopDevGuide.html
-rm $(BUILD_DIR)/SqoopUserGuide.html
-rm user/SqoopUserGuide.html
-rm dev/SqoopDevGuide.html
diff --git a/src/docs/user/SqoopUserGuide.xml b/src/docs/user/SqoopUserGuide.xml
deleted file mode 100644
index dc16e9c..0000000
--- a/src/docs/user/SqoopUserGuide.xml
+++ /dev/null
@@ -1,1151 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
-<?asciidoc-toc?>
-<?asciidoc-numbered?>
-
-<article lang="en">
-<articleinfo>
- <title>Sqoop User Guide (v1.4.7-SNAPSHOT)</title>
-</articleinfo>
-<screen> Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.</screen>
-<section id="_introduction">
-<title>Introduction</title>
-<simpara>Sqoop is a tool designed to transfer data between Hadoop and
-relational databases or mainframes. You can use Sqoop to import data from a
-relational database management system (RDBMS) such as MySQL or Oracle or a
-mainframe into the Hadoop Distributed File System (HDFS),
-transform the data in Hadoop MapReduce, and then export the data back
-into an RDBMS.</simpara>
-<simpara>Sqoop automates most of this process, relying on the database to
-describe the schema for the data to be imported. Sqoop uses MapReduce
-to import and export the data, which provides parallel operation as
-well as fault tolerance.</simpara>
-<simpara>This document describes how to get started using Sqoop to move data
-between databases and Hadoop or mainframe to Hadoop and provides reference
-information for the operation of the Sqoop command-line tool suite. This
-document is intended for:</simpara>
-<itemizedlist>
-<listitem>
-<simpara>
-System and application programmers
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-System administrators
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-Database administrators
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-Data analysts
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-Data engineers
-</simpara>
-</listitem>
-</itemizedlist>
-</section>
-<section id="_supported_releases">
-<title>Supported Releases</title>
-<simpara>This documentation applies to Sqoop v1.4.7-SNAPSHOT.</simpara>
-</section>
-<section id="_sqoop_releases">
-<title>Sqoop Releases</title>
-<simpara>Sqoop is an open source software product of the Apache Software Foundation.</simpara>
-<simpara>Software development for Sqoop occurs at <ulink url="http://sqoop.apache.org">http://sqoop.apache.org</ulink>.
-At that site you can obtain:</simpara>
-<itemizedlist>
-<listitem>
-<simpara>
-New releases of Sqoop as well as its most recent source code
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-An issue tracker
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-A wiki that contains Sqoop documentation
-</simpara>
-</listitem>
-</itemizedlist>
-</section>
-<section id="_prerequisites">
-<title>Prerequisites</title>
-<simpara>The following prerequisite knowledge is required for this product:</simpara>
-<itemizedlist>
-<listitem>
-<simpara>
-Basic computer technology and terminology
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-Familiarity with command-line interfaces such as <literal>bash</literal>
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-Relational database management systems
-</simpara>
-</listitem>
-<listitem>
-<simpara>
-Basic familiarity with the purpose and operation of Hadoop
-</simpara>
-</listitem>
-</itemizedlist>
-<simpara>Before you can use Sqoop, a release of Hadoop must be installed and
-configured. Sqoop currently supports four major Hadoop release lines: 0.20,
-0.23, 1.0, and 2.0.</simpara>
-<simpara>This document assumes you are using a Linux or Linux-like environment.
-If you are using Windows, you may be able to use cygwin to accomplish
-most of the following tasks. If you are using Mac OS X, you should see
-few (if any) compatibility errors. Sqoop is predominantly operated and
-tested on Linux.</simpara>
-</section>
-<section id="_basic_usage">
-<title>Basic Usage</title>
-<simpara>With Sqoop, you can <emphasis>import</emphasis> data from a relational database system or a
-mainframe into HDFS. The input to the import process is either a database
-table or a set of mainframe datasets. For databases, Sqoop reads the table row-by-row
-into HDFS. For mainframe datasets, Sqoop will read records from each mainframe
-dataset into HDFS. The output of this import process is a set of files
-containing a copy of the imported table or datasets.
-The import process is performed in parallel. For this reason, the
-output will be in multiple files. These files may be delimited text
-files (for example, with commas or tabs separating each field), or
-binary Avro or SequenceFiles containing serialized record data.</simpara>
-<simpara>A by-product of the import process is a generated Java class which
-can encapsulate one row of the imported table. This class is used
-during the import process by Sqoop itself. The Java source code for
-this class is also provided to you, for use in subsequent MapReduce
-processing of the data. This class can serialize and deserialize data
-to and from the SequenceFile format. It can also parse the
-delimited-text form of a record. These abilities allow you to quickly
-develop MapReduce applications that use the HDFS-stored records in
-your processing pipeline. You are also free to parse the delimited
-record data yourself, using any other tools you prefer.</simpara>
-<simpara>After manipulating the imported records (for example, with MapReduce
-or Hive) you may have a result data set which you can then <emphasis>export</emphasis>
-back to the relational database. Sqoop’s export process will read
-a set of delimited text files from HDFS in parallel, parse them into
-records, and insert them as new rows in a target database table, for
-consumption by external applications or users.</simpara>
-<simpara>Sqoop includes some other commands which allow you to inspect the
-database you are working with. For example, you can list the available
-database schemas (with the <literal>sqoop-list-databases</literal> tool) and tables
-within a schema (with the <literal>sqoop-list-tables</literal> tool). Sqoop also
-includes a primitive SQL execution shell (the <literal>sqoop-eval</literal> tool).</simpara>
-<simpara>Most aspects of the import, code generation, and export processes can
-be customized. For databases, you can control the specific row range or
-columns imported. You can specify particular delimiters and escape characters
-for the file-based representation of the data, as well as the file format
-used. You can also control the class or package names used in
-generated code. Subsequent sections of this document explain how to
-specify these and other arguments to Sqoop.</simpara>
-</section>
-<section id="_sqoop_tools">
-<title>Sqoop Tools</title>
-<simpara>Sqoop is a collection of related tools. To use Sqoop, you specify the
-tool you want to use and the arguments that control the tool.</simpara>
-<simpara>If Sqoop is compiled from its own source, you can run Sqoop without a formal
-installation process by running the <literal>bin/sqoop</literal> program. Users
-of a packaged deployment of Sqoop (such as an RPM shipped with Apache Bigtop)
-will see this program installed as <literal>/usr/bin/sqoop</literal>. The remainder of this
-documentation will refer to this program as <literal>sqoop</literal>. For example:</simpara>
-<screen>$ sqoop tool-name [tool-arguments]</screen>
-<note><simpara>Examples that begin with a <literal>$</literal> character indicate
-commands that must be entered at a terminal prompt (such as
-<literal>bash</literal>). The <literal>$</literal> character represents the prompt itself; you should
-not start these commands by typing a <literal>$</literal>. You can also enter commands
-inline in the text of a paragraph; for example, <literal>sqoop help</literal>. These
-examples do not show a <literal>$</literal> prefix, but you should enter them the same
-way. Don’t confuse the <literal>$</literal> shell prompt in the examples with the <literal>$</literal>
-that precedes an environment variable name. For example, the string
-literal <literal>$HADOOP_HOME</literal> includes a "<literal>$</literal>".</simpara></note>
-<simpara>Sqoop ships with a help tool. To display a list of all available
-tools, type the following command:</simpara>
-<screen>$ sqoop help
-usage: sqoop COMMAND [ARGS]
-
-Available commands:
- codegen Generate code to interact with database records
- create-hive-table Import a table definition into Hive
- eval Evaluate a SQL statement and display the results
- export Export an HDFS directory to a database table
- help List available commands
- import Import a table from a database to HDFS
- import-all-tables Import tables from a database to HDFS
- import-mainframe Import mainframe datasets to HDFS
- list-databases List available databases on a server
- list-tables List available tables in a database
- version Display version information
-
-See 'sqoop help COMMAND' for information on a specific command.</screen>
-<simpara>You can display help for a specific tool by entering: <literal>sqoop help
-(tool-name)</literal>; for example, <literal>sqoop help import</literal>.</simpara>
-<simpara>You can also add the <literal>--help</literal> argument to any command: <literal>sqoop import
---help</literal>.</simpara>
-<section id="_using_command_aliases">
-<title>Using Command Aliases</title>
-<simpara>In addition to typing the <literal>sqoop (toolname)</literal> syntax, you can use alias
-scripts that specify the <literal>sqoop-(toolname)</literal> syntax. For example, the
-scripts <literal>sqoop-import</literal>, <literal>sqoop-export</literal>, etc. each select a specific
-tool.</simpara>
-</section>
-<section id="_controlling_the_hadoop_installation">
-<title>Controlling the Hadoop Installation</title>
-<simpara>You invoke Sqoop through the program launch capability provided by
-Hadoop. The <literal>sqoop</literal> command-line program is a wrapper which runs the
-<literal>bin/hadoop</literal> script shipped with Hadoop. If you have multiple
-installations of Hadoop present on your machine, you can select the
-Hadoop installation by setting the <literal>$HADOOP_COMMON_HOME</literal> and
-<literal>$HADOOP_MAPRED_HOME</literal> environment variables.</simpara>
-<simpara>For example:</simpara>
-<screen>$ HADOOP_COMMON_HOME=/path/to/some/hadoop \
- HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \
- sqoop import --arguments...</screen>
-<simpara>or:</simpara>
-<screen>$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
-$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
-$ sqoop import --arguments...</screen>
-<simpara>If either of these variables is not set, Sqoop will fall back to
-<literal>$HADOOP_HOME</literal>. If it is not set either, Sqoop will use the default
-installation locations for Apache Bigtop, <literal>/usr/lib/hadoop</literal> and
-<literal>/usr/lib/hadoop-mapreduce</literal>, respectively.</simpara>
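The fallback order described above can be sketched in Python (a hypothetical helper for illustration only; Sqoop's real resolution happens in its launcher scripts, and the exact per-variable fallback behavior is an assumption here):

```python
# Sketch of the lookup order described above: the specific variables win,
# then $HADOOP_HOME, then the Apache Bigtop default locations.
def resolve_hadoop_homes(env):
    fallback = env.get("HADOOP_HOME")
    common = env.get("HADOOP_COMMON_HOME") or fallback or "/usr/lib/hadoop"
    mapred = env.get("HADOOP_MAPRED_HOME") or fallback or "/usr/lib/hadoop-mapreduce"
    return common, mapred
```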
-<simpara>The active Hadoop configuration is loaded from <literal>$HADOOP_HOME/conf/</literal>,
-unless the <literal>$HADOOP_CONF_DIR</literal> environment variable is set.</simpara>
-</section>
-<section id="_using_generic_and_specific_arguments">
-<title>Using Generic and Specific Arguments</title>
-<simpara>To control the operation of each Sqoop tool, you use generic and
-specific arguments.</simpara>
-<simpara>For example:</simpara>
-<screen>$ sqoop help import
-usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
-
-Common arguments:
- --connect <jdbc-uri> Specify JDBC connect string
- --connect-manager <class-name> Specify connection manager class to use
- --driver <class-name> Manually specify JDBC driver class to use
- --hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
- --help Print usage instructions
- --password-file Set path for file containing authentication password
- -P Read password from console
- --password <password> Set authentication password
- --username <username> Set authentication username
- --verbose Print more information while working
- --hadoop-home <dir> Deprecated. Override $HADOOP_HOME
-
-[...]
-
-Generic Hadoop command-line arguments:
-(must preceed any tool-specific arguments)
-Generic options supported are
--conf <configuration file> specify an application configuration file
--D <property=value> use value for given property
--fs <local|namenode:port> specify a namenode
--jt <local|jobtracker:port> specify a job tracker
--files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
--libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
--archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
-
-The general command line syntax is
-bin/hadoop command [genericOptions] [commandOptions]</screen>
-<simpara>You must supply the generic arguments <literal>-conf</literal>, <literal>-D</literal>, and so on after the
-tool name but <emphasis role="strong">before</emphasis> any tool-specific arguments (such as
-<literal>--connect</literal>). Note that generic Hadoop arguments are preceeded by a
-single dash character (<literal>-</literal>), whereas tool-specific arguments start
-with two dashes (<literal>--</literal>), unless they are single character arguments such as <literal>-P</literal>.</simpara>
-<simpara>The <literal>-conf</literal>, <literal>-D</literal>, <literal>-fs</literal> and <literal>-jt</literal> arguments control the configuration
-and Hadoop server settings. For example, <literal>-D mapred.job.name=<job_name></literal> can
-be used to set the name of the MapReduce job that Sqoop launches. If not
-specified, the name defaults to the jar name for the job, which is derived
-from the table name used.</simpara>
-<simpara>The <literal>-files</literal>, <literal>-libjars</literal>, and <literal>-archives</literal> arguments are not typically used with
-Sqoop, but they are included as part of Hadoop’s internal argument-parsing
-system.</simpara>
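The ordering rule above (generic single-dash Hadoop options first, tool-specific options after) can be sketched as follows. This is an illustrative parser, not Sqoop's actual argument handling, and it assumes each generic option takes exactly one value:

```python
# Hypothetical splitter: consume known single-dash generic options (each
# followed by one value) from the front, leaving the rest as tool arguments.
GENERIC_OPTS = {"-conf", "-D", "-fs", "-jt", "-files", "-libjars", "-archives"}

def split_args(argv):
    generic = []
    i = 0
    while i + 1 < len(argv) and argv[i] in GENERIC_OPTS:
        generic.extend(argv[i:i + 2])  # the option plus its value
        i += 2
    return generic, argv[i:]
```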
-</section>
-<section id="_using_options_files_to_pass_arguments">
-<title>Using Options Files to Pass Arguments</title>
-<simpara>When using Sqoop, the command line options that do not change from
-invocation to invocation can be put in an options file for convenience.
-An options file is a text file where each line identifies an option in
-the order that it appears otherwise on the command line. Option files
-allow specifying a single option on multiple lines by using the
-back-slash character at the end of intermediate lines. Also supported
-are comments within option files that begin with the hash character.
-Comments must be specified on a new line and may not be mixed with
-option text. All comments and empty lines are ignored when option
-files are expanded. Unless options appear as quoted strings, any
-leading or trailing spaces are ignored. Quoted strings, if used, must
-not extend beyond the line on which they are specified.</simpara>
-<simpara>Option files can be specified anywhere in the command line as long as
-the options within them follow the otherwise prescribed rules of
-options ordering. For instance, regardless of where the options are
-loaded from, they must follow the ordering such that generic options
-appear first, tool specific options next, finally followed by options
-that are intended to be passed to child programs.</simpara>
-<simpara>To specify an options file, simply create an options file in a
-convenient location and pass it to the command line via
-<literal>--options-file</literal> argument.</simpara>
-<simpara>Whenever an options file is specified, it is expanded on the
-command line before the tool is invoked. You can specify more than
-one options file within the same invocation if needed.</simpara>
-<simpara>For example, the following Sqoop invocation for import can
-be specified alternatively as shown below:</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
-
-$ sqoop --options-file /users/homer/work/import.txt --table TEST</screen>
-<simpara>where the options file <literal>/users/homer/work/import.txt</literal> contains the following:</simpara>
-<screen>import
---connect
-jdbc:mysql://localhost/db
---username
-foo</screen>
-<simpara>The options file can have empty lines and comments for readability purposes.
-So the above example would work exactly the same if the options file
-<literal>/users/homer/work/import.txt</literal> contained the following:</simpara>
-<screen>#
-# Options file for Sqoop import
-#
-
-# Specifies the tool being invoked
-import
-
-# Connect parameter and value
---connect
-jdbc:mysql://localhost/db
-
-# Username parameter and value
---username
-foo
-
-#
-# Remaining options should be specified in the command line.
-#</screen>
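The expansion rules described above (comments, blank lines, backslash continuation, single-line quoted strings) can be sketched in Python. This is an illustrative reimplementation, not Sqoop's actual options-file parser:

```python
def expand_options_file(text):
    """Expand an options file into a flat list of command-line tokens:
    skip comments and blank lines, join backslash-continued lines, and
    strip surrounding quotes from single-line quoted strings."""
    options = []
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if pending:
            pending += line  # continuation of a backslash-split option
        elif not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        else:
            pending = line
        if pending.endswith("\\"):
            pending = pending[:-1]  # option continues on the next line
            continue
        if len(pending) >= 2 and pending[0] == pending[-1] and pending[0] in "\"'":
            pending = pending[1:-1]  # drop surrounding quotes
        options.append(pending)
        pending = ""
    return options
```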
-</section>
-<section id="_using_tools">
-<title>Using Tools</title>
-<simpara>The following sections will describe each tool’s operation. The
-tools are listed in the most likely order you will find them useful.</simpara>
-</section>
-</section>
-<section id="_literal_sqoop_import_literal">
-<title><literal>sqoop-import</literal></title>
-<section id="_purpose">
-<title>Purpose</title>
-<simpara>The <literal>import</literal> tool imports an individual table from an RDBMS to HDFS.
-Each row from a table is represented as a separate record in HDFS.
-Records can be stored as text files (one record per line), or in
-binary representation as Avro or SequenceFiles.</simpara>
-</section>
-<section id="_syntax">
-<title>Syntax</title>
-<screen>$ sqoop import (generic-args) (import-args)
-$ sqoop-import (generic-args) (import-args)</screen>
-<simpara>While the Hadoop generic arguments must precede any import arguments,
-you can type the import arguments in any order with respect to one
-another.</simpara>
-<note><simpara>In this document, arguments are grouped into collections
-organized by function. Some collections are present in several tools
-(for example, the "common" arguments). An extended description of their
-functionality is given only on the first presentation in this
-document.</simpara></note>
-<table pgwide="0"
-frame="topbot"
-rowsep="1" colsep="1"
->
-<title>Common arguments</title>
-<tgroup cols="2">
-<colspec colwidth="248*" align="left"/>
-<colspec colwidth="230*" align="left"/>
-<thead>
- <row>
- <entry>
- Argument
- </entry>
- <entry>
- Description
- </entry>
- </row>
-</thead>
-<tbody>
- <row>
- <entry>
- <literal>--connect <jdbc-uri></literal>
- </entry>
- <entry>
- Specify JDBC connect string
- </entry>
- </row>
- <row>
- <entry>
- <literal>--connection-manager <class-name></literal>
- </entry>
- <entry>
- Specify connection manager class to use
- </entry>
- </row>
- <row>
- <entry>
- <literal>--driver <class-name></literal>
- </entry>
- <entry>
- Manually specify JDBC driver class to use
- </entry>
- </row>
- <row>
- <entry>
- <literal>--hadoop-mapred-home <dir></literal>
- </entry>
- <entry>
- Override $HADOOP_MAPRED_HOME
- </entry>
- </row>
- <row>
- <entry>
- <literal>--help</literal>
- </entry>
- <entry>
- Print usage instructions
- </entry>
- </row>
- <row>
- <entry>
- <literal>--password-file</literal>
- </entry>
- <entry>
- Set path for a file containing the authentication password
- </entry>
- </row>
- <row>
- <entry>
- <literal>-P</literal>
- </entry>
- <entry>
- Read password from console
- </entry>
- </row>
- <row>
- <entry>
- <literal>--password <password></literal>
- </entry>
- <entry>
- Set authentication password
- </entry>
- </row>
- <row>
- <entry>
- <literal>--username <username></literal>
- </entry>
- <entry>
- Set authentication username
- </entry>
- </row>
- <row>
- <entry>
- <literal>--verbose</literal>
- </entry>
- <entry>
- Print more information while working
- </entry>
- </row>
- <row>
- <entry>
- <literal>--connection-param-file <filename></literal>
- </entry>
- <entry>
- Optional properties file that provides connection parameters
- </entry>
- </row>
- <row>
- <entry>
- <literal>--relaxed-isolation</literal>
- </entry>
- <entry>
- Set connection transaction isolation to read uncommitted for the mappers.
- </entry>
- </row>
-</tbody>
-</tgroup>
-</table>
-<section id="_connecting_to_a_database_server">
-<title>Connecting to a Database Server</title>
-<simpara>Sqoop is designed to import tables from a database into HDFS. To do
-so, you must specify a <emphasis>connect string</emphasis> that describes how to connect to the
-database. The <emphasis>connect string</emphasis> is similar to a URL, and is communicated to
-Sqoop with the <literal>--connect</literal> argument. This describes the server and
-database to connect to; it may also specify the port. For example:</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees</screen>
-<simpara>This string will connect to a MySQL database named <literal>employees</literal> on the
-host <literal>database.example.com</literal>. It’s important that you <emphasis role="strong">do not</emphasis> use the URL
-<literal>localhost</literal> if you intend to use Sqoop with a distributed Hadoop
-cluster. The connect string you supply will be used on TaskTracker nodes
-throughout your MapReduce cluster; if you specify the
-literal name <literal>localhost</literal>, each node will connect to a different
-database (or more likely, no database at all). Instead, you should use
-the full hostname or IP address of the database host that can be seen
-by all your remote nodes.</simpara>
-<simpara>You might need to authenticate against the database before you can
-access it. You can use the <literal>--username</literal> argument to supply a username to the database.
-Sqoop provides a couple of different ways, secure and non-secure,
-to supply a password to the database, as detailed below.</simpara>
-<formalpara><title>Secure way of supplying a password to the database</title><para>You should save the password in a file in the user's home directory with 400
-permissions and specify the path to that file using the <emphasis role="strong"><literal>--password-file</literal></emphasis>
-argument; this is the preferred method of entering credentials. Sqoop will
-then read the password from the file and pass it to the MapReduce cluster
-using secure means without exposing the password in the job configuration.
-The file containing the password can be either on the local FS or on HDFS.
-For example:</para></formalpara>
-<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \
- --username venkatesh --password-file ${user.home}/.password</screen>
-<warning><simpara>Sqoop will read the entire content of the password file and use it as
-the password. This will include any trailing white space characters, such as
-the newline characters that are added by default by most text editors.
-You need to make sure that your password file contains only the characters
-that belong to your password. On the command line you can use the <literal>echo</literal>
-command with the <literal>-n</literal> switch to store a password without any trailing white space
-characters. For example, to store the password <literal>secret</literal> you would call
-<literal>echo -n "secret" > password.file</literal>.</simpara></warning>
-<simpara>Another way of supplying passwords is using the <literal>-P</literal> argument which will
-read a password from a console prompt.</simpara>
-<formalpara><title>Protecting the password from prying eyes</title><para>Hadoop 2.6.0 provides an API to separate password storage from applications.
-This API is called the credential provider API, and there is a new
-<literal>credential</literal> command line tool to manage passwords and their aliases.
-The passwords are stored with their aliases in a keystore that is password
-protected. The keystore password can be provided at a password prompt
-on the command line, via an environment variable, or defaulted to a software
-defined constant. Please check the Hadoop documentation on the usage
-of this facility.</para></formalpara>
-<simpara>Once the password is stored using the Credential Provider facility and
-the Hadoop configuration has been suitably updated, all applications can
-optionally use the alias in place of the actual password and have the alias
-resolved to the actual password at runtime.</simpara>
-<simpara>Since the keystore or similar technology used for storing the credentials
-is shared across components, passwords for various applications and
-various databases can be securely stored in it, and only
-the alias needs to be exposed in configuration files, protecting the password
-from being visible.</simpara>
-<simpara>Sqoop has been enhanced to allow usage of this functionality if it is
-available in the underlying Hadoop version being used. One new option
-has been introduced to provide the alias on the command line instead of the
-actual password (--password-alias). The argument value of this option is
-the alias in the store associated with the actual password.
-Example usage is as follows:</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \
- --username dbuser --password-alias mydb.password.alias</screen>
-<simpara>Similarly, if the command line option is not preferred, the alias can be saved
-in the file provided with the --password-file option. Along with this, the
-Sqoop configuration parameter org.apache.sqoop.credentials.loader.class
-should be set to the classname that provides the alias resolution:
-<literal>org.apache.sqoop.util.password.CredentialProviderPasswordLoader</literal></simpara>
-<simpara>Example usage is as follows (assuming .password.alias has the alias for
-the real password):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \
- --username dbuser --password-file ${user.home}/.password-alias</screen>
-<warning><simpara>The <literal>--password</literal> parameter is insecure, as other users may
-be able to read your password from the command-line arguments via
-the output of programs such as <literal>ps</literal>. The <emphasis role="strong"><literal>-P</literal></emphasis> argument is preferred
-over using the <literal>--password</literal> argument. Credentials may still be
-transferred between nodes of the MapReduce cluster using insecure means.
-For example:</simpara></warning>
-<screen>$ sqoop import --connect jdbc:mysql://database.example.com/employees \
- --username aaron --password 12345</screen>
-<simpara>Sqoop automatically supports several databases, including MySQL. Connect
-strings beginning with <literal>jdbc:mysql://</literal> are handled automatically in Sqoop. (A
-full list of databases with built-in support is provided in the "Supported
-Databases" section. For some, you may need to install the JDBC driver
-yourself.)</simpara>
-<simpara>You can use Sqoop with any other
-JDBC-compliant database. First, download the appropriate JDBC
-driver for the type of database you want to import, and install the .jar
-file in the <literal>$SQOOP_HOME/lib</literal> directory on your client machine. (This will
-be <literal>/usr/lib/sqoop/lib</literal> if you installed from an RPM or Debian package.)
-Each driver <literal>.jar</literal> file also has a specific driver class which defines
-the entry-point to the driver. For example, MySQL’s Connector/J library has
-a driver class of <literal>com.mysql.jdbc.Driver</literal>. Refer to your database
-vendor-specific documentation to determine the main driver class.
-This class must be provided as an argument to Sqoop with <literal>--driver</literal>.</simpara>
-<simpara>For example, to connect to a SQLServer database, first download the driver from
-microsoft.com and install it in your Sqoop lib path.</simpara>
-<simpara>Then run Sqoop. For example:</simpara>
-<screen>$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
- --connect <connect-string> ...</screen>
-<simpara>When connecting to a database using JDBC, you can optionally specify extra
-JDBC parameters via a property file using the option
-<literal>--connection-param-file</literal>. The contents of this file are parsed as standard
-Java properties and passed into the driver while creating a connection.</simpara>
-<note><simpara>The parameters specified via the optional property file are only
-applicable to JDBC connections. Any fastpath connectors that use connections
-other than JDBC will ignore these parameters.</simpara></note>
-<table pgwide="0"
-frame="topbot"
-rowsep="1" colsep="1"
->
-<title>Validation arguments <link linkend="validation">More Details</link></title>
-<tgroup cols="2">
-<colspec colwidth="267*" align="left"/>
-<colspec colwidth="230*" align="left"/>
-<thead>
- <row>
- <entry>
- Argument
- </entry>
- <entry>
- Description
- </entry>
- </row>
-</thead>
-<tbody>
- <row>
- <entry>
- <literal>--validate</literal>
- </entry>
- <entry>
- Enable validation of data copied, supports single table copy only.
- </entry>
- </row>
- <row>
- <entry>
- <literal>--validator <class-name></literal>
- </entry>
- <entry>
- Specify validator class to use.
- </entry>
- </row>
- <row>
- <entry>
- <literal>--validation-threshold <class-name></literal>
- </entry>
- <entry>
- Specify validation threshold class to use.
- </entry>
- </row>
- <row>
- <entry>
- <literal>--validation-failurehandler <class-name></literal>
- </entry>
- <entry>
- Specify validation failure handler class to use.
- </entry>
- </row>
-</tbody>
-</tgroup>
-</table>
-<table pgwide="0"
-frame="topbot"
-rowsep="1" colsep="1"
->
-<title>Import control arguments:</title>
-<tgroup cols="2">
-<colspec colwidth="206*" align="left"/>
-<colspec colwidth="236*" align="left"/>
-<thead>
- <row>
- <entry>
- Argument
- </entry>
- <entry>
- Description
- </entry>
- </row>
-</thead>
-<tbody>
- <row>
- <entry>
- <literal>--append</literal>
- </entry>
- <entry>
- Append data to an existing dataset in HDFS
- </entry>
- </row>
- <row>
- <entry>
- <literal>--as-avrodatafile</literal>
- </entry>
- <entry>
- Imports data to Avro Data Files
- </entry>
- </row>
- <row>
- <entry>
- <literal>--as-sequencefile</literal>
- </entry>
- <entry>
- Imports data to SequenceFiles
- </entry>
- </row>
- <row>
- <entry>
- <literal>--as-textfile</literal>
- </entry>
- <entry>
- Imports data as plain text (default)
- </entry>
- </row>
- <row>
- <entry>
- <literal>--as-parquetfile</literal>
- </entry>
- <entry>
- Imports data to Parquet Files
- </entry>
- </row>
- <row>
- <entry>
- <literal>--boundary-query <statement></literal>
- </entry>
- <entry>
- Boundary query to use for creating splits
- </entry>
- </row>
- <row>
- <entry>
- <literal>--columns <col,col,col…></literal>
- </entry>
- <entry>
- Columns to import from table
- </entry>
- </row>
- <row>
- <entry>
- <literal>--delete-target-dir</literal>
- </entry>
- <entry>
- Delete the import target directory if it exists
- </entry>
- </row>
- <row>
- <entry>
- <literal>--direct</literal>
- </entry>
- <entry>
- Use the direct connector for the database, if one exists
- </entry>
- </row>
- <row>
- <entry>
- <literal>--fetch-size <n></literal>
- </entry>
- <entry>
- Number of entries to read from database at once.
- </entry>
- </row>
- <row>
- <entry>
- <literal>--inline-lob-limit <n></literal>
- </entry>
- <entry>
- Set the maximum size for an inline LOB
- </entry>
- </row>
- <row>
- <entry>
- <literal>-m,--num-mappers <n></literal>
- </entry>
- <entry>
- Use <emphasis>n</emphasis> map tasks to import in parallel
- </entry>
- </row>
- <row>
- <entry>
- <literal>-e,--query <statement></literal>
- </entry>
- <entry>
- Import the results of <emphasis><literal>statement</literal></emphasis>.
- </entry>
- </row>
- <row>
- <entry>
- <literal>--split-by <column-name></literal>
- </entry>
- <entry>
- Column of the table used to split work units
- </entry>
- </row>
- <row>
- <entry>
- <literal>--table <table-name></literal>
- </entry>
- <entry>
- Table to read
- </entry>
- </row>
- <row>
- <entry>
- <literal>--target-dir <dir></literal>
- </entry>
- <entry>
- HDFS destination dir
- </entry>
- </row>
- <row>
- <entry>
- <literal>--warehouse-dir <dir></literal>
- </entry>
- <entry>
- HDFS parent for table destination
- </entry>
- </row>
- <row>
- <entry>
- <literal>--where <where clause></literal>
- </entry>
- <entry>
- WHERE clause to use during import
- </entry>
- </row>
- <row>
- <entry>
- <literal>-z,--compress</literal>
- </entry>
- <entry>
- Enable compression
- </entry>
- </row>
- <row>
- <entry>
- <literal>--compression-codec <c></literal>
- </entry>
- <entry>
- Use Hadoop codec (default gzip)
- </entry>
- </row>
- <row>
- <entry>
- <literal>--null-string <null-string></literal>
- </entry>
- <entry>
- The string to be written for a null value for string columns
- </entry>
- </row>
- <row>
- <entry>
- <literal>--null-non-string <null-string></literal>
- </entry>
- <entry>
- The string to be written for a null value for non-string columns
- </entry>
- </row>
-</tbody>
-</tgroup>
-</table>
-<simpara>The <literal>--null-string</literal> and <literal>--null-non-string</literal> arguments are optional.
-If not specified, then the string "null" will be used.</simpara>
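-<simpara>For example, to write Hive-style <literal>\N</literal> markers for null values, an
-invocation might look like the following (the doubled backslash protects the
-sequence so that a literal <literal>\N</literal> reaches the output; the elided arguments
-are whatever your import otherwise requires):</simpara>
-<screen>$ sqoop import ... --null-string '\\N' --null-non-string '\\N'</screen>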
-</section>
-<section id="_selecting_the_data_to_import">
-<title>Selecting the Data to Import</title>
-<simpara>Sqoop typically imports data in a table-centric fashion. Use the
-<literal>--table</literal> argument to select the table to import. For example, <literal>--table
-employees</literal>. This argument can also identify a <literal>VIEW</literal> or other table-like
-entity in a database.</simpara>
-<simpara>By default, all columns within a table are selected for import.
-Imported data is written to HDFS in its "natural order"; that is, a
-table containing columns A, B, and C results in an import of data such
-as:</simpara>
-<screen>A1,B1,C1
-A2,B2,C2
-...</screen>
-<simpara>You can select a subset of columns and control their ordering by using
-the <literal>--columns</literal> argument. This should include a comma-delimited list
-of columns to import. For example: <literal>--columns "name,employee_id,jobtitle"</literal>.</simpara>
-<simpara>You can control which rows are imported by adding a SQL <literal>WHERE</literal> clause
-to the import statement. By default, Sqoop generates statements of the
-form <literal>SELECT <column list> FROM <table name></literal>. You can append a
-<literal>WHERE</literal> clause to this with the <literal>--where</literal> argument. For example: <literal>--where
-"id > 400"</literal>. Only rows where the <literal>id</literal> column has a value greater than
-400 will be imported.</simpara>
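-<simpara>For example, an import restricted by a <literal>WHERE</literal> clause might look like
-this (the connect string and table name here are illustrative):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --where "start_date > '2010-01-01'"</screen>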
-<simpara>By default, Sqoop uses the query <literal>select min(<split-by>), max(<split-by>) from
-<table name></literal> to determine the boundaries for creating splits. If this query is
-not optimal for your data, you can specify an arbitrary query returning two
-numeric columns using the <literal>--boundary-query</literal> argument.</simpara>
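-<simpara>For example, a custom boundary query might look like this (the connect
-string, table, and query are illustrative):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --boundary-query "SELECT MIN(id), MAX(id) FROM EMPLOYEES"</screen>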
-</section>
-<section id="_free_form_query_imports">
-<title>Free-form Query Imports</title>
-<simpara>Sqoop can also import the result set of an arbitrary SQL query. Instead of
-using the <literal>--table</literal>, <literal>--columns</literal> and <literal>--where</literal> arguments, you can specify
-a SQL statement with the <literal>--query</literal> argument.</simpara>
-<simpara>When importing a free-form query, you must specify a destination directory
-with <literal>--target-dir</literal>.</simpara>
-<simpara>If you want to import the results of a query in parallel, then each map task
-will need to execute a copy of the query, with results partitioned by bounding
-conditions inferred by Sqoop. Your query must include the token <literal>$CONDITIONS</literal>
-which each Sqoop process will replace with a unique condition expression.
-You must also select a splitting column with <literal>--split-by</literal>.</simpara>
-<simpara>For example:</simpara>
-<screen>$ sqoop import \
- --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
- --split-by a.id --target-dir /user/foo/joinresults</screen>
-<simpara>Alternately, the query can be executed once and imported serially, by
-specifying a single map task with <literal>-m 1</literal>:</simpara>
-<screen>$ sqoop import \
- --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
- -m 1 --target-dir /user/foo/joinresults</screen>
-<note><simpara>If you are issuing the query wrapped in double quotes ("),
-you will have to use <literal>\$CONDITIONS</literal> instead of just <literal>$CONDITIONS</literal>
-to prevent your shell from treating it as a shell variable.
-For example, a double-quoted query may look like:
-<literal>"SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"</literal></simpara></note>
-<note><simpara>Support for free-form queries in the current version of Sqoop
-is limited to simple queries with no ambiguous projections and
-no <literal>OR</literal> conditions in the <literal>WHERE</literal> clause. Complex queries, such as
-queries with sub-queries or joins leading to ambiguous projections, can
-produce unexpected results.</simpara></note>
-</section>
-<section id="_controlling_parallelism">
-<title>Controlling Parallelism</title>
-<simpara>Sqoop imports data in parallel from most database sources. You can
-specify the number
-of map tasks (parallel processes) to use to perform the import by
-using the <literal>-m</literal> or <literal>--num-mappers</literal> argument. Each of these arguments
-takes an integer value which corresponds to the degree of parallelism
-to employ. By default, four tasks are used. Some databases may see
-improved performance by increasing this value to 8 or 16. Do not
-increase the degree of parallelism greater than that available within
-your MapReduce cluster; tasks will run serially and will likely
-increase the amount of time required to perform the import. Likewise,
-do not increase the degree of parallelism higher than that which your
-database can reasonably support. Connecting 100 concurrent clients to
-your database may increase the load on the database server to a point
-where performance suffers as a result.</simpara>
-<simpara>When performing parallel imports, Sqoop needs a criterion by which it
-can split the workload. Sqoop uses a <emphasis>splitting column</emphasis> to split the
-workload. By default, Sqoop will identify the primary key column (if
-present) in a table and use it as the splitting column. The low and
-high values for the splitting column are retrieved from the database,
-and the map tasks operate on evenly-sized components of the total
-range. For example, if you had a table with a primary key column of
-<literal>id</literal> whose minimum value was 0 and maximum value was 1000, and Sqoop
-was directed to use 4 tasks, Sqoop would run four processes which each
-execute SQL statements of the form <literal>SELECT * FROM sometable WHERE id
->= lo AND id < hi</literal>, with <literal>(lo, hi)</literal> set to (0, 250), (250, 500),
-(500, 750), and (750, 1001) in the different tasks.</simpara>
-<simpara>If the actual values for the primary key are not uniformly distributed
-across its range, then this can result in unbalanced tasks. You should
-explicitly choose a different column with the <literal>--split-by</literal> argument.
-For example, <literal>--split-by employee_id</literal>. Sqoop cannot currently split on
-multi-column indices. If your table has no index column, or has a
-multi-column key, then you must also manually choose a splitting
-column.</simpara>
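-<simpara>For example, an import split on an explicitly chosen column might look like
-this (the connect string, table, and column are illustrative):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --split-by employee_id -m 8</screen>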
-</section>
-<section id="_controlling_distributed_cache">
-<title>Controlling Distributed Cache</title>
-<simpara>Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every
-time a Sqoop job is started. This is unnecessary when the job is launched by
-Oozie, because Oozie uses its own Sqoop share lib, which keeps the Sqoop
-dependencies in the distributed cache. Oozie localizes the Sqoop dependencies
-on each worker node only once, during the first Sqoop job, and reuses the
-jars on the worker nodes for subsequent jobs. Passing the option
-<literal>--skip-dist-cache</literal> on the Sqoop command line when launching via Oozie
-skips the step in which Sqoop copies its dependencies to the job cache,
-saving substantial I/O.</simpara>
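-<simpara>For example, a job launched from an Oozie workflow might pass the option
-like this (the connect string and table are illustrative):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --skip-dist-cache</screen>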
-</section>
-<section id="_controlling_the_import_process">
-<title>Controlling the Import Process</title>
-<simpara>By default, the import process will use JDBC which provides a
-reasonable cross-vendor import channel. Some databases can perform
-imports in a more high-performance fashion by using database-specific
-data movement tools. For example, MySQL provides the <literal>mysqldump</literal> tool
-which can export data from MySQL to other systems very quickly. By
-supplying the <literal>--direct</literal> argument, you are specifying that Sqoop
-should attempt the direct import channel. This channel may be
-higher performance than using JDBC.</simpara>
-<simpara>Details about use of direct mode with each specific RDBMS, installation requirements, available
-options and limitations can be found in <xref linkend="connectors"/>.</simpara>
-<simpara>By default, Sqoop will import a table named <literal>foo</literal> to a directory named
-<literal>foo</literal> inside your home directory in HDFS. For example, if your
-username is <literal>someuser</literal>, then the import tool will write to
-<literal>/user/someuser/foo/(files)</literal>. You can adjust the parent directory of
-the import with the <literal>--warehouse-dir</literal> argument. For example:</simpara>
-<screen>$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
- ...</screen>
-<simpara>This command would write to a set of files in the <literal>/shared/foo/</literal> directory.</simpara>
-<simpara>You can also explicitly choose the target directory, like so:</simpara>
-<screen>$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
- ...</screen>
-<simpara>This will import the files into the <literal>/dest</literal> directory. <literal>--target-dir</literal> is
-incompatible with <literal>--warehouse-dir</literal>.</simpara>
-<simpara>When using direct mode, you can specify additional arguments which
-should be passed to the underlying tool. If the argument
-<literal>--</literal> is given on the command-line, then subsequent arguments are sent
-directly to the underlying tool. For example, the following adjusts
-the character set used by <literal>mysqldump</literal>:</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
- --direct -- --default-character-set=latin1</screen>
-<simpara>By default, imports go to a new target location. If the destination directory
-already exists in HDFS, Sqoop will refuse to import and overwrite that
-directory’s contents. If you use the <literal>--append</literal> argument, Sqoop will import
-data to a temporary directory and then rename the files into the normal
-target directory in a manner that does not conflict with existing filenames
-in that directory.</simpara>
-</section>
-<section id="_controlling_transaction_isolation">
-<title>Controlling transaction isolation</title>
-<simpara>By default, Sqoop uses the read-committed transaction isolation level in the
-mappers to import data. This may not be ideal in all ETL workflows, and it
-may be desirable to reduce the isolation guarantees. The
-<literal>--relaxed-isolation</literal> option instructs Sqoop to use the
-read-uncommitted isolation level instead.</simpara>
-<simpara>The <literal>read-uncommitted</literal> isolation level is not supported by all databases
-(for example, Oracle), so the <literal>--relaxed-isolation</literal> option may not be
-usable against every database.</simpara>
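-<simpara>For example (the connect string and table are illustrative):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --relaxed-isolation</screen>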
-</section>
-<section id="_controlling_type_mapping">
-<title>Controlling type mapping</title>
-<simpara>Sqoop is preconfigured to map most SQL types to appropriate Java or Hive
-types. However, the default mapping might not be suitable for everyone; it
-can be overridden with <literal>--map-column-java</literal> (for changing the Java
-mapping) or <literal>--map-column-hive</literal> (for changing the Hive mapping).</simpara>
-<table pgwide="0"
-frame="topbot"
-rowsep="1" colsep="1"
->
-<title>Parameters for overriding mapping</title>
-<tgroup cols="2">
-<colspec colwidth="206*" align="left"/>
-<colspec colwidth="236*" align="left"/>
-<thead>
- <row>
- <entry>
- Argument
- </entry>
- <entry>
- Description
- </entry>
- </row>
-</thead>
-<tbody>
- <row>
- <entry>
- <literal>--map-column-java <mapping></literal>
- </entry>
- <entry>
- Override mapping from SQL to Java type for configured columns.
- </entry>
- </row>
- <row>
- <entry>
- <literal>--map-column-hive <mapping></literal>
- </entry>
- <entry>
- Override mapping from SQL to Hive type for configured columns.
- </entry>
- </row>
-</tbody>
-</tgroup>
-</table>
-<simpara>Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:</simpara>
-<screen>$ sqoop import ... --map-column-java id=String,value=Integer</screen>
-<simpara>Sqoop will raise an exception if any configured mapping is not used.</simpara>
-</section>
-<section id="_incremental_imports">
-<title>Incremental Imports</title>
-<simpara>Sqoop provides an incremental import mode which can be used to retrieve
-only rows newer than some previously-imported set of rows.</simpara>
-<simpara>The following arguments control incremental imports:</simpara>
-<table pgwide="0"
-frame="topbot"
-rowsep="1" colsep="1"
->
-<title>Incremental import arguments:</title>
-<tgroup cols="2">
-<colspec colwidth="182*" align="left"/>
-<colspec colwidth="236*" align="left"/>
-<thead>
- <row>
- <entry>
- Argument
- </entry>
- <entry>
- Description
- </entry>
- </row>
-</thead>
-<tbody>
- <row>
- <entry>
- <literal>--check-column (col)</literal>
- </entry>
- <entry>
- Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)
- </entry>
- </row>
- <row>
- <entry>
- <literal>--incremental (mode)</literal>
- </entry>
- <entry>
- Specifies how Sqoop determines which rows are new. Legal values for <literal>mode</literal> include <literal>append</literal> and <literal>lastmodified</literal>.
- </entry>
- </row>
- <row>
- <entry>
- <literal>--last-value (value)</literal>
- </entry>
- <entry>
- Specifies the maximum value of the check column from the previous import.
- </entry>
- </row>
-</tbody>
-</tgroup>
-</table>
-<simpara>Sqoop supports two types of incremental imports: <literal>append</literal> and <literal>lastmodified</literal>.
-You can use the <literal>--incremental</literal> argument to specify the type of incremental
-import to perform.</simpara>
-<simpara>You should specify <literal>append</literal> mode when importing a table where new rows are
-continually being added with increasing row id values. You specify the column
-containing the row’s id with <literal>--check-column</literal>. Sqoop imports rows where the
-check column has a value greater than the one specified with <literal>--last-value</literal>.</simpara>
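-<simpara>For example, an <literal>append</literal>-mode import that picks up rows whose id is
-greater than 400 might look like this (the connect string, table, and values
-are illustrative):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --incremental append --check-column id --last-value 400</screen>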
-<simpara>An alternate table update strategy supported by Sqoop is called <literal>lastmodified</literal>
-mode. You should use this when rows of the source table may be updated, and
-each such update will set the value of a last-modified column to the current
-timestamp. Rows where the check column holds a timestamp more recent than the
-timestamp specified with <literal>--last-value</literal> are imported.</simpara>
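-<simpara>For example, a <literal>lastmodified</literal>-mode import might look like this (the
-connect string, table, column, and timestamp are illustrative):</simpara>
-<screen>$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --incremental lastmodified --check-column last_update_date \
-    --last-value "2010-05-01 01:01:01"</screen>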
-<simpara>At the end of an incremental import, the value which should be specified as
-<literal>--last-value</literal> for a subsequent import is printed to the screen. When running
-a subsequent import, you should specify <literal>--last-value</literal> in this way to ensure
-you import only the new or updated data. This is handled automatically by
-creating an incremental import as a saved job, which is the preferred
-mechanism for performing a recurring incremental import. See the section on
-saved jobs later in this document for more information.</simpara>
-</section>
-<section id="_file_formats">
-<title>File Formats</title>
-<simpara>You can import data in one of two file formats: delimited text or
-SequenceFiles. (Avro and Parquet formats are also available via the
-<literal>--as-avrodatafile</literal> and <literal>--as-parquetfile</literal> arguments.)</simpara>
-<simpara>Delimited text is the default import format. You can also specify it
-explicitly by using the <literal>--as-textfile</literal> argument. This argument will write
-string-based representations of each record to the output files, with
-delimiter characters between individual columns and rows. These
-delimiters may be commas, tabs, or other characters. (The delimiters
-can be selected; see "Output line formatting arguments.") The
-following is the result of an example text-based import:</simpara>
-<screen>1,here is a message,2010-05-01
-2,happy new year!,2010-01-01
-3,another message,2009-11-12</screen>
-<simpara>Delimited text is appropriate for most non-binary data types. It also
-readily supports further manipulation by other tools, such as Hive.</simpara>