src/docs/user/connecting.txt - sqoop - Git at Google


 ////
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
 ////


 Connecting to a Database Server
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Sqoop is designed to import tables from a database into HDFS. To do
 so, you must specify a _connect string_ that describes how to connect to the
 database. The _connect string_ is similar to a URL, and is communicated to
 Sqoop with the +\--connect+ argument. This describes the server and
 database to connect to; it may also specify the port. For example:

 ----
 $ sqoop import --connect jdbc:mysql://database.example.com/employees
 ----

 This string will connect to a MySQL database named +employees+ on the
 host +database.example.com+. It's important that you *do not* use the URL
 +localhost+ if you intend to use Sqoop with a distributed Hadoop
 cluster. The connect string you supply will be used on TaskTracker nodes
 throughout your MapReduce cluster; if you specify the
 literal name +localhost+, each node will connect to a different
 database (or more likely, no database at all). Instead, you should use
 the full hostname or IP address of the database host that can be seen
 by all your remote nodes.

 You might need to authenticate against the database before you can
 access it. You can use the +\--username+ to supply a username to the database.
 Sqoop provides couple of different ways to supply a password,
 secure and non-secure, to the database which is detailed below.

 .Secure way of supplying password to the database
 You should save the password in a file on the users home directory with 400
 permissions and specify the path to that file using the *+--password-file+*
 argument, and is the preferred method of entering credentials. Sqoop will
 then read the password from the file and pass it to the MapReduce cluster
 using secure means with out exposing the password in the job configuration.
 The file containing the password can either be on the Local FS or HDFS.
 For example:

 ----
 $ sqoop import --connect jdbc:mysql://database.example.com/employees \
     --username venkatesh --password-file ${user.home}/.password
 ----

 WARNING: Sqoop will read entire content of the password file and use it as
 a password. This will include any trailing white space characters such as
 new line characters that are added by default by most of the text editors.
 You need to make sure that your password file contains only characters
 that belongs to your password. On the command line you can use command
 +echo+ with switch +-n+ to store password without any trailing white space
 characters. For example to store password +secret+ you would call
 +echo -n "secret" > password.file+.

 Another way of supplying passwords is using the +-P+ argument which will
 read a password from a console prompt.

 .Protecting password from preying eyes
 Hadoop 2.6.0 provides an API to separate password storage from applications.
 This API is called the credential provided API and there is a new
 +credential+  command line tool to manage passwords and their aliases.
 The passwords are stored with their aliases in a keystore that is password
 protected.   The keystore password can be the provided to a password prompt
 on the command line, via an environment variable or defaulted to a software
 defined constant.   Please check the Hadoop documentation on the usage
 of this facility.

 Once the password is stored using the Credential Provider facility and
 the Hadoop configuration has been suitably updated, all applications can
 optionally use the alias in place of the actual password and at runtime
 resolve the alias for the password to use.

 Since the keystore or similar technology used for storing the credential
 provider is shared across components, passwords for various applications,
 various database and other passwords can be securely stored in them and only
 the alias needs to be exposed in configuration files, protecting the password
 from being visible.

 Sqoop has been enhanced to allow usage of this funcionality if it is
 available in the underlying Hadoop version being used.   One new option
 has been introduced to provide the alias on the command line instead of the
 actual password (--password-alias).  The argument value this option is
 the alias on the storage associated with the actual password.
 Example usage is as follows:

 ----
 $ sqoop import --connect jdbc:mysql://database.example.com/employees \
     --username dbuser --password-alias mydb.password.alias
 ----

 Similarly, if the command line option is not preferred, the alias can be saved
 in the file provided with --password-file option.  Along with this, the
 Sqoop configuration parameter org.apache.sqoop.credentials.loader.class
 should be set to the classname that provides the alias resolution:
 +org.apache.sqoop.util.password.CredentialProviderPasswordLoader+

 Example usage is as follows (assuming .password.alias has the alias for
 the real password) :

 ----
 $ sqoop import --connect jdbc:mysql://database.example.com/employees \
     --username dbuser --password-file ${user.home}/.password-alias
 ----

 .Non-secure way of passing password

 WARNING: The +\--password+ parameter is insecure, as other users may
 be able to read your password from the command-line arguments via
 the output of programs such as `ps`. The *+-P+* argument is the preferred
 method over using the +\--password+ argument. Credentials may still be
 transferred between nodes of the MapReduce cluster using insecure means.
 For example:

 ----
 $ sqoop import --connect jdbc:mysql://database.example.com/employees \
     --username aaron --password 12345
 ----

 Sqoop automatically supports several databases, including MySQL.  Connect
 strings beginning with +jdbc:mysql://+ are handled automatically in Sqoop.  (A
 full list of databases with built-in support is provided in the "Supported
 Databases" section. For some, you may need to install the JDBC driver
 yourself.)

 You can use Sqoop with any other
 JDBC-compliant database. First, download the appropriate JDBC
 driver for the type of database you want to import, and install the .jar
 file in the +$SQOOP_HOME/lib+ directory on your client machine. (This will
 be +/usr/lib/sqoop/lib+ if you installed from an RPM or Debian package.)
 Each driver +.jar+ file also has a specific driver class which defines
 the entry-point to the driver. For example, MySQL's Connector/J library has
 a driver class of +com.mysql.jdbc.Driver+. Refer to your database
 vendor-specific documentation to determine the main driver class.
 This class must be provided as an argument to Sqoop with +\--driver+.

 For example, to connect to a SQLServer database, first download the driver from
 microsoft.com and install it in your Sqoop lib path.

 Then run Sqoop. For example:

 ----
 $ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
     --connect <connect-string> ...
 ----

 When connecting to a database using JDBC, you can optionally specify extra
 JDBC parameters via a property file using the option
 +\--connection-param-file+. The contents of this file are parsed as standard
 Java properties and passed into the driver while creating a connection.

 NOTE: The parameters specified via the optional property file are only
 applicable to JDBC connections. Any fastpath connectors that use connections
 other than JDBC will ignore these parameters.

	////
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	////


	Connecting to a Database Server
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Sqoop is designed to import tables from a database into HDFS. To do
	so, you must specify a _connect string_ that describes how to connect to the
	database. The _connect string_ is similar to a URL, and is communicated to
	Sqoop with the +\--connect+ argument. This describes the server and
	database to connect to; it may also specify the port. For example:

	----
	$ sqoop import --connect jdbc:mysql://database.example.com/employees
	----

	This string will connect to a MySQL database named +employees+ on the
	host +database.example.com+. It's important that you do not use the URL
	+localhost+ if you intend to use Sqoop with a distributed Hadoop
	cluster. The connect string you supply will be used on TaskTracker nodes
	throughout your MapReduce cluster; if you specify the
	literal name +localhost+, each node will connect to a different
	database (or more likely, no database at all). Instead, you should use
	the full hostname or IP address of the database host that can be seen
	by all your remote nodes.

	You might need to authenticate against the database before you can
	access it. You can use the +\--username+ to supply a username to the database.
	Sqoop provides couple of different ways to supply a password,
	secure and non-secure, to the database which is detailed below.

	.Secure way of supplying password to the database
	You should save the password in a file on the users home directory with 400
	permissions and specify the path to that file using the +--password-file+
	argument, and is the preferred method of entering credentials. Sqoop will
	then read the password from the file and pass it to the MapReduce cluster
	using secure means with out exposing the password in the job configuration.
	The file containing the password can either be on the Local FS or HDFS.
	For example:

	----
	$ sqoop import --connect jdbc:mysql://database.example.com/employees \
	--username venkatesh --password-file ${user.home}/.password
	----

	WARNING: Sqoop will read entire content of the password file and use it as
	a password. This will include any trailing white space characters such as
	new line characters that are added by default by most of the text editors.
	You need to make sure that your password file contains only characters
	that belongs to your password. On the command line you can use command
	+echo+ with switch +-n+ to store password without any trailing white space
	characters. For example to store password +secret+ you would call
	+echo -n "secret" > password.file+.

	Another way of supplying passwords is using the +-P+ argument which will
	read a password from a console prompt.

	.Protecting password from preying eyes
	Hadoop 2.6.0 provides an API to separate password storage from applications.
	This API is called the credential provided API and there is a new
	+credential+ command line tool to manage passwords and their aliases.
	The passwords are stored with their aliases in a keystore that is password
	protected. The keystore password can be the provided to a password prompt
	on the command line, via an environment variable or defaulted to a software
	defined constant. Please check the Hadoop documentation on the usage
	of this facility.

	Once the password is stored using the Credential Provider facility and
	the Hadoop configuration has been suitably updated, all applications can
	optionally use the alias in place of the actual password and at runtime
	resolve the alias for the password to use.

	Since the keystore or similar technology used for storing the credential
	provider is shared across components, passwords for various applications,
	various database and other passwords can be securely stored in them and only
	the alias needs to be exposed in configuration files, protecting the password
	from being visible.

	Sqoop has been enhanced to allow usage of this funcionality if it is
	available in the underlying Hadoop version being used. One new option
	has been introduced to provide the alias on the command line instead of the
	actual password (--password-alias). The argument value this option is
	the alias on the storage associated with the actual password.
	Example usage is as follows:

	----
	$ sqoop import --connect jdbc:mysql://database.example.com/employees \
	--username dbuser --password-alias mydb.password.alias
	----

	Similarly, if the command line option is not preferred, the alias can be saved
	in the file provided with --password-file option. Along with this, the
	Sqoop configuration parameter org.apache.sqoop.credentials.loader.class
	should be set to the classname that provides the alias resolution:
	+org.apache.sqoop.util.password.CredentialProviderPasswordLoader+

	Example usage is as follows (assuming .password.alias has the alias for
	the real password) :

	----
	$ sqoop import --connect jdbc:mysql://database.example.com/employees \
	--username dbuser --password-file ${user.home}/.password-alias
	----

	.Non-secure way of passing password

	WARNING: The +\--password+ parameter is insecure, as other users may
	be able to read your password from the command-line arguments via
	the output of programs such as `ps`. The +-P+ argument is the preferred
	method over using the +\--password+ argument. Credentials may still be
	transferred between nodes of the MapReduce cluster using insecure means.
	For example:

	----
	$ sqoop import --connect jdbc:mysql://database.example.com/employees \
	--username aaron --password 12345
	----

	Sqoop automatically supports several databases, including MySQL. Connect
	strings beginning with +jdbc:mysql://+ are handled automatically in Sqoop. (A
	full list of databases with built-in support is provided in the "Supported
	Databases" section. For some, you may need to install the JDBC driver
	yourself.)

	You can use Sqoop with any other
	JDBC-compliant database. First, download the appropriate JDBC
	driver for the type of database you want to import, and install the .jar
	file in the +$SQOOP_HOME/lib+ directory on your client machine. (This will
	be +/usr/lib/sqoop/lib+ if you installed from an RPM or Debian package.)
	Each driver +.jar+ file also has a specific driver class which defines
	the entry-point to the driver. For example, MySQL's Connector/J library has
	a driver class of +com.mysql.jdbc.Driver+. Refer to your database
	vendor-specific documentation to determine the main driver class.
	This class must be provided as an argument to Sqoop with +\--driver+.

	For example, to connect to a SQLServer database, first download the driver from
	microsoft.com and install it in your Sqoop lib path.

	Then run Sqoop. For example:

	----
	$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
	--connect <connect-string> ...
	----

	When connecting to a database using JDBC, you can optionally specify extra
	JDBC parameters via a property file using the option
	+\--connection-param-file+. The contents of this file are parsed as standard
	Java properties and passed into the driver while creating a connection.

	NOTE: The parameters specified via the optional property file are only
	applicable to JDBC connections. Any fastpath connectors that use connections
	other than JDBC will ignore these parameters.