////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Developer API Reference
-----------------------
This section specifies the APIs available to application writers who
want to integrate with Sqoop, and those who want to modify Sqoop.
The next three subsections are written for the following use cases:
- Using classes generated by Sqoop and its public library
- Writing Sqoop extensions (that is, additional ConnManager implementations
that interact with more databases)
- Modifying Sqoop's internals
Each section describes the system in successively greater depth.
The External API
~~~~~~~~~~~~~~~~
Sqoop automatically generates classes that represent the tables
imported into the Hadoop Distributed File System (HDFS). Each generated
class contains member fields for each column of the imported table; an
instance of the class holds one row of the table. The generated
classes implement the serialization APIs used in Hadoop, namely the
_Writable_ and _DBWritable_ interfaces. They also contain these other
convenience methods:
- A +parse()+ method that interprets delimited text fields
- A +toString()+ method that preserves the user's chosen delimiters
The full set of methods guaranteed to exist in an auto-generated class
is specified in the abstract class
+org.apache.sqoop.lib.SqoopRecord+.
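
A minimal sketch of how client code might use such a generated class. The
class name +employees+ is hypothetical; in practice it matches the name of
the imported table:

----
// Assumes Sqoop generated a class named "employees" for the imported
// table of the same name.
public class GeneratedRecordExample {
  public static void main(String[] args) throws Exception {
    employees record = new employees();

    // parse() interprets a line of delimited text using the delimiters
    // chosen at import time.
    record.parse("1,Aaron,engineering");

    // toString() re-emits the record using those same delimiters, so a
    // round-trip through text preserves the user's formatting choices.
    System.out.println(record.toString());
  }
}
----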
Instances of +SqoopRecord+ may depend on Sqoop's public API, which comprises all
classes in the +org.apache.sqoop.lib+ package. These are briefly described below.
Clients of Sqoop should not need to directly interact with any of these classes,
although classes generated by Sqoop will depend on them. Therefore, these APIs
are considered public and care will be taken when forward-evolving them.
* The +RecordParser+ class will parse a line of text into a list of fields,
using controllable delimiters and quote characters; see the sketch following
this list.
* The static +FieldFormatter+ class provides a method that handles quoting and
escaping of characters in a field; it is used in
+SqoopRecord.toString()+ implementations.
* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and
_SqoopRecords_ is done via +JdbcWritableBridge+.
* +BigDecimalSerializer+ contains a pair of methods that facilitate
serialization of +BigDecimal+ objects over the _Writable_ interface.
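
For example, a minimal sketch of +RecordParser+ in action, assuming the
+DelimiterSet+ constructor that takes field, record, enclose, and escape
characters plus an enclose-required flag:

----
import java.util.List;
import org.apache.sqoop.lib.DelimiterSet;
import org.apache.sqoop.lib.RecordParser;

public class RecordParserExample {
  public static void main(String[] args) throws RecordParser.ParseError {
    // Comma-separated fields, newline-terminated records, double-quote
    // enclosing, backslash escaping; enclosing is optional.
    DelimiterSet delimiters = new DelimiterSet(',', '\n', '"', '\\', false);
    RecordParser parser = new RecordParser(delimiters);

    // The quoted comma is treated as field content, not a separator.
    List<String> fields = parser.parseRecord("1,\"Smith, Jane\",engineering");
    System.out.println(fields); // [1, Smith, Jane, engineering]
  }
}
----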
The full specification of the public API is available on the Sqoop
Development Wiki as
http://wiki.github.com/cloudera/sqoop/sip-4[SIP-4].
The Extension API
~~~~~~~~~~~~~~~~~
This section covers the API and primary classes used by extensions for Sqoop
which allow Sqoop to interface with more database vendors.
While Sqoop uses JDBC and +DataDrivenDBInputFormat+ to
read from databases, differences in the SQL supported by different vendors, as
well as in JDBC metadata, necessitate vendor-specific codepaths for most databases.
Sqoop solves this problem by introducing the +ConnManager+ API
(+org.apache.sqoop.manager.ConnManager+).
+ConnManager+ is an abstract class defining all methods that interact with the
database itself. Most implementations of +ConnManager+ will extend the
+org.apache.sqoop.manager.SqlManager+ abstract class, which uses standard
SQL to perform most actions. Subclasses are required to implement the
+getConnection()+ method which returns the actual JDBC connection to the
database. Subclasses are free to override all other methods as well. The
+SqlManager+ class itself exposes a protected API that allows developers to
selectively override behavior. For example, the +getColNamesQuery()+ method
allows the SQL query used by +getColNames()+ to be modified without needing to
rewrite the majority of +getColNames()+.
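
A minimal sketch of such a subclass, for a hypothetical "AcmeDB" vendor; only
+getConnection()+ (and the driver-class accessor declared by +ConnManager+)
must be supplied, while +getColNamesQuery()+ shows the selective-override
pattern:

----
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.manager.SqlManager;

// Hypothetical vendor-specific manager; "AcmeDB" and its driver class
// are illustrative, not real.
public class AcmeDbManager extends SqlManager {
  public AcmeDbManager(SqoopOptions opts) {
    super(opts);
  }

  @Override
  public String getDriverClass() {
    return "com.acme.jdbc.Driver"; // hypothetical JDBC driver
  }

  @Override
  public Connection getConnection() throws SQLException {
    // Return the actual JDBC connection to the database.
    return DriverManager.getConnection(options.getConnectString(),
        options.getUsername(), options.getPassword());
  }

  @Override
  protected String getColNamesQuery(String tableName) {
    // Change only the SQL used by getColNames(); the rest of the
    // inherited getColNames() logic is reused as-is.
    return "SELECT * FROM " + escapeTableName(tableName) + " WHERE 1=0";
  }
}
----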
+ConnManager+ implementations receive a lot of their configuration
data from a Sqoop-specific class, +SqoopOptions+. +SqoopOptions+ objects are
mutable. +SqoopOptions+ does not directly store specific per-manager
options. Instead, it contains a reference to the +Configuration+
returned by +Tool.getConf()+ after parsing command-line arguments with
the +GenericOptionsParser+. This allows extension arguments via "+-D
any.specific.param=any.value+" without requiring any layering of
options parsing or modification of +SqoopOptions+. This
+Configuration+ forms the basis of the +Configuration+ passed to any
MapReduce +Job+ invoked in the workflow, so that users can set on the
command-line any necessary custom Hadoop state.
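
A minimal sketch of an extension reading its own parameter this way; the
property name +acme.fetch.size+ is hypothetical:

----
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.SqoopOptions;

public class ExtensionConfigExample {
  static int getFetchSize(SqoopOptions options) {
    // The Configuration was populated by GenericOptionsParser, so
    // "-D acme.fetch.size=500" on the command line is visible here and
    // in any MapReduce Job built from this Configuration.
    Configuration conf = options.getConf();
    return conf.getInt("acme.fetch.size", 1000);
  }
}
----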
All existing +ConnManager+ implementations are stateless. Thus, the
system which instantiates +ConnManagers+ may create multiple
instances of the same +ConnManager+ class over Sqoop's lifetime. It
is currently assumed that instantiating a +ConnManager+ is a
lightweight operation, and is done reasonably infrequently. Therefore,
+ConnManagers+ are not cached between operations, etc.
+ConnManagers+ are currently created by instances of the abstract
class +ManagerFactory+ (See
http://issues.apache.org/jira/browse/MAPREDUCE-750[]). One
+ManagerFactory+ implementation currently serves all of Sqoop:
+org.apache.sqoop.manager.DefaultManagerFactory+. Extensions
should not modify +DefaultManagerFactory+. Instead, an
extension-specific +ManagerFactory+ implementation should be provided
with the new +ConnManager+. +ManagerFactory+ has a single method of
note, named +accept()+. This method will determine whether it can
instantiate a +ConnManager+ for the user's +SqoopOptions+. If so, it
returns the +ConnManager+ instance. Otherwise, it returns +null+.
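
A minimal sketch of an extension-specific +ManagerFactory+. The +accept()+
signature has varied across Sqoop versions; this assumes the variant receiving
a +JobData+ wrapper around the user's +SqoopOptions+. The +jdbc:acmedb:+
scheme and +AcmeDbManager+ (from the earlier sketch) are hypothetical:

----
import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.manager.ConnManager;
import org.apache.sqoop.manager.ManagerFactory;
import org.apache.sqoop.metastore.JobData;

public class AcmeDbManagerFactory extends ManagerFactory {
  @Override
  public ConnManager accept(JobData data) {
    SqoopOptions options = data.getSqoopOptions();
    String connectStr = options.getConnectString();
    if (connectStr != null && connectStr.startsWith("jdbc:acmedb:")) {
      return new AcmeDbManager(options);
    }
    return null; // decline; another registered factory may accept.
  }
}
----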
The +ManagerFactory+ implementations used are governed by the
+sqoop.connection.factories+ setting in +sqoop-site.xml+. Users of extension
libraries can install the 3rd-party library containing a new +ManagerFactory+
and +ConnManager+(s), and configure +sqoop-site.xml+ to use the new
+ManagerFactory+. The +DefaultManagerFactory+ principally discriminates between
databases by parsing the connect string stored in +SqoopOptions+.
Extension authors may make use of classes in the +org.apache.sqoop.io+,
+mapreduce+, and +util+ packages to facilitate their implementations.
These packages and classes are described in more detail in the following
section.
HBase Serialization Extensions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop supports imports from databases to HBase. When copying data into
HBase, the data must be transformed into a format HBase can accept. Specifically:
* Data must be placed into one (or more) tables in HBase.
* Columns of input data must be placed into a column family.
* Values must be serialized to byte arrays to put into cells.
All of this is done via +Put+ operations in the HBase client API.
Sqoop's interaction with HBase is performed in the +org.apache.sqoop.hbase+
package. Records are deserialized from the database and emitted from the mapper.
The +OutputFormat+ is responsible for inserting the results into HBase. This is
done through an interface called +PutTransformer+. The +PutTransformer+
has a method called +getPutCommand()+ that
takes as input a +Map<String, Object>+ representing the fields of the dataset.
It returns a +List<Put>+ describing how to insert the cells into HBase.
The default +PutTransformer+ implementation is the +ToStringPutTransformer+
that uses the string-based representation of each field to serialize the
fields to HBase.
You can override this implementation by implementing your own +PutTransformer+
and adding it to the classpath for the map tasks (e.g., with the +-libjars+
option). To tell Sqoop to use your implementation, set the
+sqoop.hbase.insert.put.transformer.class+ property to identify your class
with +-D+.
Within your +PutTransformer+ implementation, the specified row key
column and column family are
available via the +getRowKeyColumn()+ and +getColumnFamily()+ methods.
You are free to make additional Put operations outside these constraints;
for example, to inject additional rows representing a secondary index.
However, Sqoop will execute all +Put+ operations against the table
specified with +\--hbase-table+.
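
A minimal sketch of a custom +PutTransformer+ that mirrors the shape of
+ToStringPutTransformer+ but (as a hypothetical transformation) upper-cases
cell values. The exact HBase client call (+Put.add()+ here) depends on the
HBase version in use:

----
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.sqoop.hbase.PutTransformer;

public class UpperCasePutTransformer extends PutTransformer {
  @Override
  public List<Put> getPutCommand(Map<String, Object> fields)
      throws IOException {
    Object rowKey = fields.get(getRowKeyColumn());
    if (rowKey == null) {
      return null; // No row key in this record; skip it.
    }
    Put put = new Put(Bytes.toBytes(rowKey.toString()));
    byte[] family = Bytes.toBytes(getColumnFamily());
    for (Map.Entry<String, Object> entry : fields.entrySet()) {
      if (entry.getKey().equals(getRowKeyColumn())
          || entry.getValue() == null) {
        continue;
      }
      // Hypothetical transformation: store all values upper-cased.
      put.add(family, Bytes.toBytes(entry.getKey()),
          Bytes.toBytes(entry.getValue().toString().toUpperCase()));
    }
    return Collections.singletonList(put);
  }
}
----

This class would then be enabled with "+-D
sqoop.hbase.insert.put.transformer.class=UpperCasePutTransformer+" after
placing its jar on the task classpath via +-libjars+.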
Sqoop Internals
~~~~~~~~~~~~~~~
This section describes the internal architecture of Sqoop.
The Sqoop program is driven by the +org.apache.sqoop.Sqoop+ main class.
A limited number of additional classes are in the same package: +SqoopOptions+
(described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+
instances).
General program flow
^^^^^^^^^^^^^^^^^^^^
The general program flow is as follows:
+org.apache.sqoop.Sqoop+ is the main class and implements _Tool_. A new
instance is launched with +ToolRunner+. The first argument to Sqoop is
a string identifying the name of a +SqoopTool+ to run. The +SqoopTool+
itself drives the execution of the user's requested operation (e.g.,
import, export, codegen, etc).
The +SqoopTool+ API is specified fully in
http://wiki.github.com/cloudera/sqoop/sip-1[SIP-1].
The chosen +SqoopTool+ will parse the remainder of the arguments,
setting the appropriate fields in the +SqoopOptions+ class. It will
then run its body.
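
A minimal sketch of this launch path from client code, using the static
+Sqoop.runTool()+ entry point; the connection details are placeholders:

----
import org.apache.sqoop.Sqoop;

public class LaunchExample {
  public static void main(String[] args) {
    String[] sqoopArgs = new String[] {
      "import",                      // names the SqoopTool to run
      "--connect", "jdbc:mysql://db.example.com/corp",
      "--table", "employees",
    };
    // Sqoop implements Tool and is dispatched through ToolRunner, so
    // generic "-D" options are parsed before the tool's own arguments.
    int ret = Sqoop.runTool(sqoopArgs);
    System.exit(ret);
  }
}
----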
The +SqoopTool+'s +run()+ method then executes the import, export, or
other requested action. Typically, a +ConnManager+ is then
instantiated based on the data in the +SqoopOptions+. The
+ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+;
the mechanics of this were described in an earlier section. Imports
and exports and other large data motion tasks typically run a
MapReduce job to operate on a table in a parallel, reliable fashion.
An import does not specifically need to be run via a MapReduce job;
the +ConnManager.importTable()+ method is left to determine how best
to run the import. Each main action is actually controlled by the
+ConnManager+, except for code generation, which is done by
the +CompilationManager+ and +ClassWriter+. (Both are in the
+org.apache.sqoop.orm+ package.) Importing into Hive is also
taken care of via the +org.apache.sqoop.hive.HiveImport+ class
after +importTable()+ has completed. This is done without concern
for the +ConnManager+ implementation used.
A ConnManager's +importTable()+ method receives a single argument of
type +ImportJobContext+ which contains parameters to the method. This
class may be extended with additional parameters in the future, which
optionally further direct the import operation. Similarly, the
+exportTable()+ method receives an argument of type
+ExportJobContext+. These classes contain the name of the table to
import/export, a reference to the +SqoopOptions+ object, and other
related data.
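
A minimal sketch of a +ConnManager+ inspecting its context before delegating;
+AcmeDbManager+ is the hypothetical manager from the extension section, and
the logging is purely illustrative:

----
import java.io.IOException;
import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.manager.ImportJobContext;
import org.apache.sqoop.util.ImportException;

public class AcmeDbManagerWithLogging extends AcmeDbManager {
  public AcmeDbManagerWithLogging(SqoopOptions opts) {
    super(opts);
  }

  @Override
  public void importTable(ImportJobContext context)
      throws IOException, ImportException {
    // The context bundles the table name, the SqoopOptions, and other
    // job data; future versions may add further parameters to it.
    System.err.println("Importing table: " + context.getTableName());
    super.importTable(context); // reuse the inherited import strategy.
  }
}
----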
Subpackages
^^^^^^^^^^^
The following subpackages under +org.apache.sqoop+ exist:
* +hive+ - Facilitates importing data to Hive.
* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and
_Writer_).
* +lib+ - The external public API (described earlier).
* +manager+ - The +ConnManager+ and +ManagerFactory+ classes and their
implementations.
* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
* +orm+ - Code auto-generation.
* +tool+ - Implementations of +SqoopTool+.
* +util+ - Miscellaneous utility classes.
The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations
used by direct writers to HDFS. The +SplittableBufferedWriter+ allows a single
BufferedWriter to be opened to a client which will, under the hood, write to
multiple files in series as they reach a target threshold size. This allows
unsplittable compression libraries (e.g., gzip) to be used in conjunction with
Sqoop import while still allowing subsequent MapReduce jobs to use multiple
input splits per dataset. The large object file storage (see
http://wiki.github.com/cloudera/sqoop/sip-3[SIP-3]) system's code
lies in the +io+ package as well.
The +mapreduce+ package contains code that interfaces directly with
Hadoop MapReduce. This package's contents are described in more detail
in the next section.
The +orm+ package contains code used for class generation. It depends on the
JDK's +tools.jar+, which provides the +com.sun.tools.javac+ package.
The +util+ package contains various utilities used throughout Sqoop:
* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the
current thread. This is principally used to load auto-generated code into the
current thread when running MapReduce in local (standalone) mode.
* +DirectImportUtils+ contains convenience methods used by direct HDFS
importers.
* +Executor+ launches external processes and connects these to stream handlers
generated by an +AsyncSink+ (see more detail below).
* +ExportException+ is thrown by +ConnManagers+ when exports fail.
* +ImportException+ is thrown by +ConnManagers+ when imports fail.
* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not
specification-conforming. (In particular, JDBC connect strings may have
+multi:part:scheme://+ components.)
* +PerfCounters+ are used to estimate transfer rates for display to the user.
* +ResultSetPrinter+ will pretty-print a _ResultSet_.
In several places, Sqoop reads the stdout from external processes. The most
straightforward cases are direct-mode imports as performed by the
+LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by
+Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr
(+Process.getErrorStream()+) must be handled. Failure to read enough data from
both of these streams will cause the external process to block before writing
more. Consequently, these must both be handled, and preferably asynchronously.
In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and
reads it to completion. These are realized by +AsyncSink+ implementations. The
+org.apache.sqoop.util.AsyncSink+ abstract class defines the operations
an implementation must perform. +processStream()+ will spawn another thread to
immediately begin handling the data read from the +InputStream+ argument; it
must read this stream to completion. The +join()+ method allows external threads
to wait until this processing is complete.
Some "stock" +AsyncSink+ implementations are provided: the +LoggingAsyncSink+ will
repeat everything on the +InputStream+ as log4j INFO statements. The
+NullAsyncSink+ consumes all its input and does nothing.
The various +ConnManagers+ that make use of external processes have their own
+AsyncSink+ implementations as inner classes, which read from the database tools
and forward the data along to HDFS, possibly performing formatting conversions
in the meantime.
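
A minimal sketch of a custom +AsyncSink+ that just counts lines; the stock
implementations follow the same shape:

----
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.sqoop.util.AsyncSink;

public class CountingAsyncSink extends AsyncSink {
  private Thread worker;
  private volatile int lines;

  @Override
  public void processStream(final InputStream is) {
    // Start reading immediately on a separate thread so the external
    // process never blocks on a full stdout/stderr pipe.
    worker = new Thread(new Runnable() {
      public void run() {
        try {
          BufferedReader r = new BufferedReader(new InputStreamReader(is));
          while (r.readLine() != null) {
            lines++;
          }
          r.close();
        } catch (IOException ioe) {
          // The stream may close abruptly when the process exits.
        }
      }
    });
    worker.start();
  }

  @Override
  public int join() throws InterruptedException {
    worker.join(); // block until the stream has been fully consumed.
    return 0;      // status code for the caller.
  }
}
----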
Interfacing with MapReduce
^^^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop schedules MapReduce jobs to effect imports and exports.
Configuration and execution of MapReduce jobs follows a few common
steps (configuring the +InputFormat+; configuring the +OutputFormat+;
setting the +Mapper+ implementation; etc...). These steps are
formalized in the +org.apache.sqoop.mapreduce.JobBase+ class.
The +JobBase+ allows a user to specify the +InputFormat+,
+OutputFormat+, and +Mapper+ to use.
+JobBase+ itself is subclassed by +ImportJobBase+ and +ExportJobBase+
which offer better support for the particular configuration steps
common to import or export-related jobs, respectively.
+ImportJobBase.runImport()+ will call the configuration steps and run
a job to import a table to HDFS.
Subclasses of these base classes exist as well. For example,
+DataDrivenImportJob+ uses the +DataDrivenDBInputFormat+ to run an
import. This is the most common type of import used by the various
+ConnManager+ implementations available. MySQL uses a different class
(+MySQLDumpImportJob+) to run a direct-mode import. Its custom
+Mapper+ and +InputFormat+ implementations reside in this package as
well.
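
As a reference point, a minimal sketch of the three configuration steps that
+JobBase+ formalizes, expressed directly against the stock Hadoop (0.20+)
MapReduce API; the format and mapper classes here are placeholders rather
than Sqoop's actual defaults:

----
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class JobBaseStyleExample {
  public static Job configureJob(Configuration conf) throws IOException {
    Job job = new Job(conf, "sqoop-style-job");

    // The three steps JobBase lets subclasses customize:
    job.setInputFormatClass(TextInputFormat.class);   // the InputFormat
    job.setOutputFormatClass(TextOutputFormat.class); // the OutputFormat
    job.setMapperClass(Mapper.class);                 // the Mapper

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    return job;
  }
}
----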