////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Developer API Reference
-----------------------
This section is intended to specify the APIs available to application writers
integrating with Sqoop, and those modifying Sqoop. The next three subsections
are written from the following three perspectives: those using classes generated
by Sqoop, and its public library; those writing Sqoop extensions (i.e.,
additional +ConnManager+ implementations that interact with more databases); and
those modifying Sqoop's internals. Each section describes the system in
successively greater depth.
The External API
~~~~~~~~~~~~~~~~
Sqoop auto-generates classes that represent the tables imported into HDFS. The
class contains member fields for each column of the imported table; an instance
of the class holds one row of the table. The generated classes implement the
serialization APIs used in Hadoop, namely the _Writable_ and _DBWritable_
interfaces. They also contain other convenience methods: a +parse()+ method
that interprets delimited text fields, and a +toString()+ method that preserves
the user's chosen delimiters. The full set of methods guaranteed to exist in an
auto-generated class is specified in the interface
+org.apache.hadoop.sqoop.lib.SqoopRecord+.
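For illustration only, client code might interact with a generated class along
the lines of the sketch below. The class name +EmployeeRecord+, its accessor,
and the exact method signatures are invented for this sketch; real generated
classes are named after the imported table and its columns.

[source,java]
----
// Hypothetical use of a Sqoop-generated class. "EmployeeRecord" and
// get_name() are invented names; exact signatures may differ (see
// org.apache.hadoop.sqoop.lib.SqoopRecord).
EmployeeRecord record = new EmployeeRecord();

// parse() populates the member fields from one line of delimited text,
// using the delimiters chosen at import time.
record.parse("1,Aaron,engineering");

// Column values are exposed through generated accessor methods.
System.out.println(record.get_name());

// toString() re-emits the row with the user's chosen delimiters, so a
// record can round-trip through its text form.
String line = record.toString();
----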
Instances of _SqoopRecord_ may depend on Sqoop's public API, which comprises all
classes in the +org.apache.hadoop.sqoop.lib+ package. These are briefly
described below. Clients of Sqoop should not need to interact directly with any
of these classes, although classes generated by Sqoop will depend on them.
Therefore, these APIs are considered public, and care will be taken when
forward-evolving them.
* The +RecordParser+ class will parse a line of text into a list of fields,
using controllable delimiters and quote characters.
* The static +FieldFormatter+ class provides a method that handles quoting and
escaping of characters in a field; it is used in +SqoopRecord.toString()+
implementations.
* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and
_SqoopRecords_ is done via +JdbcWritableBridge+.
* +BigDecimalSerializer+ contains a pair of methods that facilitate
serialization of +BigDecimal+ objects over the _Writable_ interface.
The Extension API
~~~~~~~~~~~~~~~~~
This section covers the API and primary classes used by Sqoop extensions, which
allow Sqoop to interface with more database vendors.
While Sqoop uses JDBC and +DBInputFormat+ (and +DataDrivenDBInputFormat+) to
read from databases, differences in the SQL supported by different vendors, as
well as in JDBC metadata, necessitate vendor-specific codepaths for most
databases. Sqoop's solution to this problem is the ConnManager API
(+org.apache.hadoop.sqoop.manager.ConnManager+).
+ConnManager+ is an abstract class defining all methods that interact with the
database itself. Most implementations of +ConnManager+ will extend the
+org.apache.hadoop.sqoop.manager.SqlManager+ abstract class, which uses standard
SQL to perform most actions. Subclasses are required to implement the
+getConnection()+ method which returns the actual JDBC connection to the
database. Subclasses are free to override all other methods as well. The
+SqlManager+ class itself exposes a protected API that allows developers to
selectively override behavior. For example, the +getColNamesQuery()+ method
allows the SQL query used by +getColNames()+ to be modified without needing to
rewrite the majority of +getColNames()+.
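A minimal vendor-specific manager might be sketched as follows. The method
signatures and +SqoopOptions+ accessors shown are assumptions and should be
verified against the actual +SqlManager+ source; this is not a definitive
implementation.

[source,java]
----
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.apache.hadoop.sqoop.SqoopOptions;
import org.apache.hadoop.sqoop.manager.SqlManager;

// Illustrative sketch only; signatures and accessor names are assumptions.
public class MyVendorManager extends SqlManager {
  private final SqoopOptions opts;

  public MyVendorManager(final SqoopOptions opts) {
    super(opts);
    this.opts = opts;
  }

  @Override
  protected Connection getConnection() throws SQLException {
    // Return a live JDBC connection for this vendor's driver.
    return DriverManager.getConnection(opts.getConnectString(),
        opts.getUsername(), opts.getPassword());
  }

  @Override
  protected String getColNamesQuery(String tableName) {
    // Change the column-discovery SQL without rewriting getColNames().
    return "SELECT * FROM " + tableName + " WHERE 1 = 0";
  }
}
----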
+ConnManager+ implementations receive a lot of their configuration data from a
Sqoop-specific class, +SqoopOptions+. While +SqoopOptions+ does not currently
contain many setter methods, clients should not assume it is immutable. More
setter methods may be added in the future. +SqoopOptions+ does
not directly store specific per-manager options. Instead, it contains a
reference to the +Configuration+ returned by +Tool.getConf()+ after parsing
command-line arguments with the +GenericOptionsParser+. This allows extension
arguments via "+-D any.specific.param=any.value+" without requiring any layering
of options parsing or modification of +SqoopOptions+.
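For example, an extension could read such a parameter roughly as follows,
assuming +SqoopOptions+ exposes the underlying +Configuration+ through an
accessor such as +getConf()+ (the parameter name is the placeholder from
above).

[source,java]
----
// Illustrative: read a vendor-specific parameter supplied on the command
// line as -D any.specific.param=any.value. getConf() is an assumed accessor.
org.apache.hadoop.conf.Configuration conf = options.getConf();
String value = conf.get("any.specific.param", "some-default");
----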
All existing +ConnManager+ implementations are stateless. Thus, the system that
instantiates +ConnManagers+ may create multiple instances of the same
+ConnManager+ class over Sqoop's lifetime. If a caching layer is required, we
can add one later, but it is not currently available.
+ConnManagers+ are currently created by instances of the abstract class +ManagerFactory+ (see
MAPREDUCE-750). One +ManagerFactory+ implementation currently serves all of
Sqoop: +org.apache.hadoop.sqoop.manager.DefaultManagerFactory+. Extensions
should not modify +DefaultManagerFactory+. Instead, an extension-specific
+ManagerFactory+ implementation should be provided with the new +ConnManager+.
+ManagerFactory+ has a single method of note, named +accept()+. This method will
determine whether it can instantiate a +ConnManager+ for the user's
+SqoopOptions+. If so, it returns the +ConnManager+ instance. Otherwise, it
returns +null+.
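An extension's factory might be sketched as follows; the +accept()+ signature
is assumed here to take the user's +SqoopOptions+ directly and should be
checked against the +ManagerFactory+ base class.

[source,java]
----
import org.apache.hadoop.sqoop.SqoopOptions;
import org.apache.hadoop.sqoop.manager.ConnManager;
import org.apache.hadoop.sqoop.manager.ManagerFactory;

// Illustrative sketch; the accept() signature and the SqoopOptions
// accessor are assumptions.
public class MyVendorManagerFactory extends ManagerFactory {
  @Override
  public ConnManager accept(SqoopOptions options) {
    String connectStr = options.getConnectString();
    if (connectStr != null && connectStr.startsWith("jdbc:myvendor:")) {
      return new MyVendorManager(options);  // the ConnManager sketched earlier
    }
    return null;  // not ours; another factory may accept it
  }
}
----

Such a factory is then registered through the +sqoop.connection.factories+
setting described next.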
The +ManagerFactory+ implementations used are governed by the
+sqoop.connection.factories+ setting in sqoop-site.xml. Users of extension
libraries can install the 3rd-party library containing a new +ManagerFactory+
and +ConnManager+(s), and configure sqoop-site.xml to use the new
+ManagerFactory+. The +DefaultManagerFactory+ principally discriminates between
databases by parsing the connect string stored in +SqoopOptions+.
Extension authors may make use of classes in the +org.apache.hadoop.sqoop.io+,
+mapred+, +mapreduce+, and +util+ packages to facilitate their implementations.
These packages and classes are described in more detail in the following
section.
Sqoop Internals
~~~~~~~~~~~~~~~
This section describes the internal architecture of Sqoop.
The Sqoop program is driven by the +org.apache.hadoop.sqoop.Sqoop+ main class.
A limited number of additional classes are in the same package: +SqoopOptions+
(described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+
instances).
General program flow
^^^^^^^^^^^^^^^^^^^^
The general program flow is as follows:
+org.apache.hadoop.sqoop.Sqoop+ is the main class and implements _Tool_. A new
instance is launched with +ToolRunner+. It parses its arguments using the
+SqoopOptions+ class. Within the +SqoopOptions+, an +ImportAction+ will be
chosen by the user. This may be to import all tables, import one specific
table, execute a SQL statement, or another action.
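The launch step follows the standard Hadoop _Tool_ pattern, roughly as sketched
below; Sqoop's actual +main()+ may differ in details such as argument
preprocessing and exit handling.

[source,java]
----
// Standard Hadoop ToolRunner launch pattern (details are illustrative).
public static void main(String[] args) throws Exception {
  int ret = org.apache.hadoop.util.ToolRunner.run(new Sqoop(), args);
  System.exit(ret);
}
----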
A +ConnManager+ is then instantiated based on the data in the +SqoopOptions+.
The +ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+; the
mechanics of this were described in an earlier section.
Then, in its +run()+ method, +Sqoop+ uses a case statement over the
+ImportAction+ enum to determine which actions the user needs performed.
Usually this involves determining a list of tables to import, generating user
code for them, and running a MapReduce job per table to read the data. The
import itself does not specifically need to be run via a MapReduce job; the
+ConnManager.importTable()+ method is left to determine how best to run the
import. Each of these actions is controlled by the +ConnManager+, except for
the generation of code, which is done by the +CompilationManager+ and
+ClassWriter+ (both in the +org.apache.hadoop.sqoop.orm+ package). Importing
into Hive is also handled by the +org.apache.hadoop.sqoop.hive.HiveImport+
class after +importTable()+ has completed. This is done without concern for
the +ConnManager+ implementation used.
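In greatly simplified form, the dispatch looks something like the sketch below.
The enum constants, helper names, and constructor arguments are illustrative
rather than exact; the +ImportJobContext+ argument is described next.

[source,java]
----
// Highly simplified sketch of the dispatch in Sqoop.run(); names are
// illustrative and not taken from the actual Sqoop source.
switch (options.getAction()) {
case ImportTable:
  String table = options.getTableName();
  String jarFile = generateOrm(table);   // CompilationManager + ClassWriter
  manager.importTable(new ImportJobContext(table, jarFile, options));
  if (options.doHiveImport()) {
    hiveImport.importTable(table);       // org.apache.hadoop.sqoop.hive.HiveImport
  }
  break;
// ... cases for importing all tables, executing a SQL statement, etc. ...
}
----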
A +ConnManager+'s +importTable()+ method receives a single argument of type
+ImportJobContext+ which contains parameters to the method. This class may be
extended with additional parameters in the future, which optionally further
direct the import operation. Similarly, the +exportTable()+ method receives an
argument of type +ExportJobContext+. These classes contain the name of the table
to import/export, a reference to the +SqoopOptions+ object, and other related
data.
Subpackages
^^^^^^^^^^^
The following subpackages under +org.apache.hadoop.sqoop+ exist:
* +hive+ - Facilitates importing data to Hive.
* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and
_Writer_).
* +lib+ - The external public API (described earlier).
* +manager+ - The +ConnManager+ and +ManagerFactory+ interface and their
implementations.
* +mapred+ - Classes interfacing with the old (pre-0.20) MapReduce API.
* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
* +orm+ - Code auto-generation.
* +util+ - Miscellaneous utility classes.
The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations
used by direct writers to HDFS. The +SplittableBufferedWriter+ allows a single
+BufferedWriter+ to be opened to a client; under the hood, it writes to
multiple files in series as each reaches a target threshold size. This allows
unsplittable compression libraries (e.g., gzip) to be used in conjunction with
Sqoop imports while still allowing subsequent MapReduce jobs to use multiple
input splits per dataset.
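The rolling behaviour can be illustrated with the conceptual sketch below; this
is not Sqoop's actual +SplittableBufferedWriter+, just the general idea of
switching to a new file once a size threshold is crossed.

[source,java]
----
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

// Conceptual sketch only (not Sqoop's implementation): a writer that rolls
// to a new file once a size threshold is crossed, so each output file can
// serve as its own input split even when gzip-compressed.
public class RollingTextWriter {
  private final String basePath;
  private final long targetBytes;
  private long written = 0;
  private int fileIndex = 0;
  private BufferedWriter current;

  public RollingTextWriter(String basePath, long targetBytes) throws IOException {
    this.basePath = basePath;
    this.targetBytes = targetBytes;
    openNextFile();
  }

  // Close the current file (if any) and start the next one in the series.
  private void openNextFile() throws IOException {
    if (current != null) {
      current.close();
    }
    current = new BufferedWriter(new FileWriter(basePath + "-" + fileIndex++));
    written = 0;
  }

  // Write one full record; roll over only at a record boundary so no
  // record is split across files.
  public void writeRecord(String line) throws IOException {
    current.write(line);
    written += line.length();
    if (written >= targetBytes) {
      openNextFile();
    }
  }

  public void close() throws IOException {
    current.close();
  }
}
----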
Code in the +mapred+ package should be considered deprecated. The +mapreduce+
package contains +DataDrivenImportJob+, which uses the +DataDrivenDBInputFormat+
introduced in 0.21. The mapred package contains +ImportJob+, which uses the
older +DBInputFormat+. Most +ConnManager+ implementations use
+DataDrivenImportJob+; +DataDrivenDBInputFormat+ does not currently work with
Oracle in all circumstances, so it remains on the old code-path.
The +orm+ package contains code used for class generation. It depends on the
JDK's +tools.jar+, which provides the +com.sun.tools.javac+ package.
The +util+ package contains various utilities used throughout Sqoop:
* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the
current thread. This is principally used to load auto-generated code into the
current thread when running MapReduce in local (standalone) mode.
* +DirectImportUtils+ contains convenience methods used by direct HDFS
importers.
* +Executor+ launches external processes and connects these to stream handlers
generated by an AsyncSink (see more detail below).
* +ExportException+ is thrown by +ConnManagers+ when exports fail.
* +ImportException+ is thrown by +ConnManagers+ when imports fail.
* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not
specification-conforming. (In particular, JDBC connect strings may have
+multi:part:scheme://+ components.)
* +PerfCounters+ are used to estimate transfer rates for display to the user.
* +ResultSetPrinter+ will pretty-print a _ResultSet_.
In several places, Sqoop reads the stdout from external processes. The most
straightforward cases are direct-mode imports as performed by the
+LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by
+Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr
(+Process.getErrorStream()+) must be handled. Failure to read enough data from
both of these streams will cause the external process to block before writing
more. Consequently, these must both be handled, and preferably asynchronously.
In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and
reads it to completion. These are realized by +AsyncSink+ implementations. The
+org.apache.hadoop.sqoop.util.AsyncSink+ abstract class defines the operations
these sinks must perform. +processStream()+ will spawn another thread to
immediately begin handling the data read from the +InputStream+ argument; it
must read this stream to completion. The +join()+ method allows external threads
to wait until this processing is complete.
Some "stock" +AsyncSink+ implementations are provided: the +LoggingAsyncSink+ will
repeat everything on the +InputStream+ as log4j INFO statements. The
+NullAsyncSink+ consumes all its input and does nothing.
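Typical use pairs +Runtime.exec()+ with one sink per stream, roughly as in the
sketch below; the command line is illustrative, the sink constructors shown
are assumptions, and exception handling is omitted.

[source,java]
----
// Illustrative: launch an external tool and drain both of its output
// streams asynchronously so the child never blocks on a full pipe buffer.
// The AsyncSink constructors are assumptions; exception handling omitted.
Process p = Runtime.getRuntime().exec(new String[] { "mysqldump", "..." });

AsyncSink outSink = new LoggingAsyncSink(LOG);  // echo stdout as log4j INFO
AsyncSink errSink = new NullAsyncSink();        // discard stderr

outSink.processStream(p.getInputStream());      // each spawns a reader thread
errSink.processStream(p.getErrorStream());

int status = p.waitFor();                       // wait for the child to exit
outSink.join();                                 // then wait for the readers
errSink.join();
----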
The various +ConnManagers+ that make use of external processes have their own
+AsyncSink+ implementations as inner classes, which read from the database tools
and forward the data along to HDFS, possibly performing formatting conversions
in the meantime.