blob: 6f31c602a9d08d296d2262fecf8e655e7f590f7c [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Connecting to a Database Server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop is designed to import tables from a database into HDFS. To do
so, you must specify a _connect string_ that describes how to connect to the
database. The _connect string_ is similar to a URL, and is communicated to
Sqoop with the +\--connect+ argument. This describes the server and
database to connect to; it may also specify the port. For example:
----
$ sqoop import --connect jdbc:mysql://database.example.com/employees
----
This string will connect to a MySQL database named +employees+ on the
host +database.example.com+. It's important that you *do not* use the URL
+localhost+ if you intend to use Sqoop with a distributed Hadoop
cluster. The connect string you supply will be used on TaskTracker nodes
throughout your MapReduce cluster; if you specify the
literal name +localhost+, each node will connect to a different
database (or more likely, no database at all). Instead, you should use
the full hostname or IP address of the database host that can be seen
by all your remote nodes.
You might need to authenticate against the database before you can
access it. You can use the +\--username+ to supply a username to the database.
Sqoop provides couple of different ways to supply a password,
secure and non-secure, to the database which is detailed below.
.Secure way of supplying password to the database
You should save the password in a file on the users home directory with 400
permissions and specify the path to that file using the *+--password-file+*
argument, and is the preferred method of entering credentials. Sqoop will
then read the password from the file and pass it to the MapReduce cluster
using secure means with out exposing the password in the job configuration.
The file containing the password can either be on the Local FS or HDFS.
For example:
----
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username venkatesh --password-file ${user.home}/.password
----
WARNING: Sqoop will read entire content of the password file and use it as
a password. This will include any trailing white space characters such as
new line characters that are added by default by most of the text editors.
You need to make sure that your password file contains only characters
that belongs to your password. On the command line you can use command
+echo+ with switch +-n+ to store password without any trailing white space
characters. For example to store password +secret+ you would call
+echo -n "secret" > password.file+.
Another way of supplying passwords is using the +-P+ argument which will
read a password from a console prompt.
.Protecting password from preying eyes
Hadoop 2.6.0 provides an API to separate password storage from applications.
This API is called the credential provided API and there is a new
+credential+ command line tool to manage passwords and their aliases.
The passwords are stored with their aliases in a keystore that is password
protected. The keystore password can be the provided to a password prompt
on the command line, via an environment variable or defaulted to a software
defined constant. Please check the Hadoop documentation on the usage
of this facility.
Once the password is stored using the Credential Provider facility and
the Hadoop configuration has been suitably updated, all applications can
optionally use the alias in place of the actual password and at runtime
resolve the alias for the password to use.
Since the keystore or similar technology used for storing the credential
provider is shared across components, passwords for various applications,
various database and other passwords can be securely stored in them and only
the alias needs to be exposed in configuration files, protecting the password
from being visible.
Sqoop has been enhanced to allow usage of this funcionality if it is
available in the underlying Hadoop version being used. One new option
has been introduced to provide the alias on the command line instead of the
actual password (--password-alias). The argument value this option is
the alias on the storage associated with the actual password.
Example usage is as follows:
----
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username dbuser --password-alias mydb.password.alias
----
Similarly, if the command line option is not preferred, the alias can be saved
in the file provided with --password-file option. Along with this, the
Sqoop configuration parameter org.apache.sqoop.credentials.loader.class
should be set to the classname that provides the alias resolution:
+org.apache.sqoop.util.password.CredentialProviderPasswordLoader+
Example usage is as follows (assuming .password.alias has the alias for
the real password) :
----
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username dbuser --password-file ${user.home}/.password-alias
----
.Non-secure way of passing password
WARNING: The +\--password+ parameter is insecure, as other users may
be able to read your password from the command-line arguments via
the output of programs such as `ps`. The *+-P+* argument is the preferred
method over using the +\--password+ argument. Credentials may still be
transferred between nodes of the MapReduce cluster using insecure means.
For example:
----
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username aaron --password 12345
----
Sqoop automatically supports several databases, including MySQL. Connect
strings beginning with +jdbc:mysql://+ are handled automatically in Sqoop. (A
full list of databases with built-in support is provided in the "Supported
Databases" section. For some, you may need to install the JDBC driver
yourself.)
You can use Sqoop with any other
JDBC-compliant database. First, download the appropriate JDBC
driver for the type of database you want to import, and install the .jar
file in the +$SQOOP_HOME/lib+ directory on your client machine. (This will
be +/usr/lib/sqoop/lib+ if you installed from an RPM or Debian package.)
Each driver +.jar+ file also has a specific driver class which defines
the entry-point to the driver. For example, MySQL's Connector/J library has
a driver class of +com.mysql.jdbc.Driver+. Refer to your database
vendor-specific documentation to determine the main driver class.
This class must be provided as an argument to Sqoop with +\--driver+.
For example, to connect to a SQLServer database, first download the driver from
microsoft.com and install it in your Sqoop lib path.
Then run Sqoop. For example:
----
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
--connect <connect-string> ...
----
When connecting to a database using JDBC, you can optionally specify extra
JDBC parameters via a property file using the option
+\--connection-param-file+. The contents of this file are parsed as standard
Java properties and passed into the driver while creating a connection.
NOTE: The parameters specified via the optional property file are only
applicable to JDBC connections. Any fastpath connectors that use connections
other than JDBC will ignore these parameters.