////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Developer API Reference
-----------------------
This section specifies the APIs available to application writers who
want to integrate with Sqoop, and those who want to modify Sqoop.
The next three subsections are written for the following use cases:
- Using classes generated by Sqoop and its public library
- Writing Sqoop extensions (that is, additional ConnManager implementations
that interact with more databases)
- Modifying Sqoop's internals
Each section describes the system in successively greater depth.
The External API
~~~~~~~~~~~~~~~~
Sqoop automatically generates classes that represent the tables
imported into the Hadoop Distributed File System (HDFS). Each generated
class contains member fields for each column of the imported table; an
instance of the class holds one row of the table. The generated
classes implement the serialization APIs used in Hadoop, namely the
_Writable_ and _DBWritable_ interfaces. They also contain these other
convenience methods:
- A +parse()+ method that interprets delimited text fields
- A +toString()+ method that preserves the user's chosen delimiters
The full set of methods guaranteed to exist in an auto-generated class
is specified in the abstract class
+org.apache.sqoop.lib.SqoopRecord+.
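
A minimal sketch of how client code might use such a generated class. The
class name +employees+ is hypothetical; in practice it matches the name of
the imported table:

----
// Assumes Sqoop generated a class named "employees" for the imported
// table of the same name.
public class GeneratedRecordExample {
  public static void main(String[] args) throws Exception {
    employees record = new employees();

    // parse() interprets a line of delimited text using the delimiters
    // chosen at import time.
    record.parse("1,Aaron,engineering");

    // toString() re-emits the record using those same delimiters, so a
    // round-trip through text preserves the user's formatting choices.
    System.out.println(record.toString());
  }
}
----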
Instances of +SqoopRecord+ may depend on Sqoop's public API, which comprises all
classes in the +org.apache.sqoop.lib+ package. These are briefly described below.
Clients of Sqoop should not need to directly interact with any of these classes,
although classes generated by Sqoop will depend on them. Therefore, these APIs
are considered public and care will be taken when forward-evolving them.
* The +RecordParser+ class will parse a line of text into a list of fields,
using controllable delimiters and quote characters; see the sketch following
this list.
* The static +FieldFormatter+ class provides a method that handles quoting and
escaping of characters in a field; it is used in
+SqoopRecord.toString()+ implementations.
* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and
_SqoopRecords_ is done via +JdbcWritableBridge+.
* +BigDecimalSerializer+ contains a pair of methods that facilitate
serialization of +BigDecimal+ objects over the _Writable_ interface.
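
For example, a minimal sketch of +RecordParser+ in action, assuming the
+DelimiterSet+ constructor that takes field, record, enclose, and escape
characters plus an enclose-required flag:

----
import java.util.List;
import org.apache.sqoop.lib.DelimiterSet;
import org.apache.sqoop.lib.RecordParser;

public class RecordParserExample {
  public static void main(String[] args) throws RecordParser.ParseError {
    // Comma-separated fields, newline-terminated records, double-quote
    // enclosing, backslash escaping; enclosing is optional.
    DelimiterSet delimiters = new DelimiterSet(',', '\n', '"', '\\', false);
    RecordParser parser = new RecordParser(delimiters);

    // The quoted comma is treated as field content, not a separator.
    List<String> fields = parser.parseRecord("1,\"Smith, Jane\",engineering");
    System.out.println(fields); // [1, Smith, Jane, engineering]
  }
}
----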
The full specification of the public API is available on the Sqoop
Development Wiki as
http://wiki.github.com/cloudera/sqoop/sip-4[SIP-4].
The Extension API
~~~~~~~~~~~~~~~~~
This section covers the API and primary classes used by extensions for Sqoop
which allow Sqoop to interface with more database vendors.
While Sqoop uses JDBC and +DataDrivenDBInputFormat+ to
read from databases, differences in the SQL supported by different vendors, as
well as in JDBC metadata, necessitate vendor-specific codepaths for most databases.
Sqoop solves this problem by introducing the +ConnManager+ API
(+org.apache.sqoop.manager.ConnManager+).
+ConnManager+ is an abstract class defining all methods that interact with the
database itself. Most implementations of +ConnManager+ will extend the
+org.apache.sqoop.manager.SqlManager+ abstract class, which uses standard
SQL to perform most actions. Subclasses are required to implement the
+getConnection()+ method which returns the actual JDBC connection to the
database. Subclasses are free to override all other methods as well. The
+SqlManager+ class itself exposes a protected API that allows developers to
selectively override behavior. For example, the +getColNamesQuery()+ method
allows the SQL query used by +getColNames()+ to be modified without needing to
rewrite the majority of +getColNames()+.
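
A minimal sketch of such a subclass, for a hypothetical "AcmeDB" vendor; only
+getConnection()+ (and the driver-class accessor declared by +ConnManager+)
must be supplied, while +getColNamesQuery()+ shows the selective-override
pattern:

----
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.manager.SqlManager;

// Hypothetical vendor-specific manager; "AcmeDB" and its driver class
// are illustrative, not real.
public class AcmeDbManager extends SqlManager {
  public AcmeDbManager(SqoopOptions opts) {
    super(opts);
  }

  @Override
  public String getDriverClass() {
    return "com.acme.jdbc.Driver"; // hypothetical JDBC driver
  }

  @Override
  public Connection getConnection() throws SQLException {
    // Return the actual JDBC connection to the database.
    return DriverManager.getConnection(options.getConnectString(),
        options.getUsername(), options.getPassword());
  }

  @Override
  protected String getColNamesQuery(String tableName) {
    // Change only the SQL used by getColNames(); the rest of the
    // inherited getColNames() logic is reused as-is.
    return "SELECT * FROM " + escapeTableName(tableName) + " WHERE 1=0";
  }
}
----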
+ConnManager+ implementations receive a lot of their configuration
data from a Sqoop-specific class, +SqoopOptions+. +SqoopOptions+ objects are
mutable. +SqoopOptions+ does not directly store specific per-manager
options. Instead, it contains a reference to the +Configuration+
returned by +Tool.getConf()+ after parsing command-line arguments with
the +GenericOptionsParser+. This allows extension arguments via "+-D
any.specific.param=any.value+" without requiring any layering of
options parsing or modification of +SqoopOptions+. This
+Configuration+ forms the basis of the +Configuration+ passed to any
MapReduce +Job+ invoked in the workflow, so that users can set on the
command-line any necessary custom Hadoop state.
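
A minimal sketch of an extension reading its own parameter this way; the
property name +acme.fetch.size+ is hypothetical:

----
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.SqoopOptions;

public class ExtensionConfigExample {
  static int getFetchSize(SqoopOptions options) {
    // The Configuration was populated by GenericOptionsParser, so
    // "-D acme.fetch.size=500" on the command line is visible here and
    // in any MapReduce Job built from this Configuration.
    Configuration conf = options.getConf();
    return conf.getInt("acme.fetch.size", 1000);
  }
}
----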
All existing +ConnManager+ implementations are stateless. Thus, the
system which instantiates +ConnManagers+ may create multiple
instances of the same +ConnManager+ class over Sqoop's lifetime. It
is currently assumed that instantiating a +ConnManager+ is a
lightweight operation, and is done reasonably infrequently. Therefore,
+ConnManagers+ are not cached between operations, etc.
+ConnManagers+ are currently created by instances of the abstract
class +ManagerFactory+ (See
http://issues.apache.org/jira/browse/MAPREDUCE-750[]). One
+ManagerFactory+ implementation currently serves all of Sqoop:
+org.apache.sqoop.manager.DefaultManagerFactory+. Extensions
should not modify +DefaultManagerFactory+. Instead, an
extension-specific +ManagerFactory+ implementation should be provided
with the new +ConnManager+. +ManagerFactory+ has a single method of
note, named +accept()+. This method will determine whether it can
instantiate a +ConnManager+ for the user's +SqoopOptions+. If so, it
returns the +ConnManager+ instance. Otherwise, it returns +null+.
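
A minimal sketch of an extension-specific +ManagerFactory+. The +accept()+
signature has varied across Sqoop versions; this assumes the variant receiving
a +JobData+ wrapper around the user's +SqoopOptions+. The +jdbc:acmedb:+
scheme and +AcmeDbManager+ (from the earlier sketch) are hypothetical:

----
import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.manager.ConnManager;
import org.apache.sqoop.manager.ManagerFactory;
import org.apache.sqoop.metastore.JobData;

public class AcmeDbManagerFactory extends ManagerFactory {
  @Override
  public ConnManager accept(JobData data) {
    SqoopOptions options = data.getSqoopOptions();
    String connectStr = options.getConnectString();
    if (connectStr != null && connectStr.startsWith("jdbc:acmedb:")) {
      return new AcmeDbManager(options);
    }
    return null; // decline; another registered factory may accept.
  }
}
----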
The +ManagerFactory+ implementations used are governed by the
+sqoop.connection.factories+ setting in +sqoop-site.xml+. Users of extension
libraries can install the 3rd-party library containing a new +ManagerFactory+
and +ConnManager+(s), and configure +sqoop-site.xml+ to use the new
+ManagerFactory+. The +DefaultManagerFactory+ principally discriminates between
databases by parsing the connect string stored in +SqoopOptions+.
Extension authors may make use of classes in the +org.apache.sqoop.io+,
+mapreduce+, and +util+ packages to facilitate their implementations.
These packages and classes are described in more detail in the following
section.
HBase Serialization Extensions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop supports imports from databases to HBase. When copying data into
HBase, the data must be transformed into a format HBase can accept. Specifically:
* Data must be placed into one (or more) tables in HBase.
* Columns of input data must be placed into a column family.
* Values must be serialized to byte arrays to put into cells.
All of this is done via +Put+ operations in the HBase client API.
Sqoop's interaction with HBase is performed in the +org.apache.sqoop.hbase+
package. Records are deserialized from the database and emitted from the mapper.
The +OutputFormat+ is responsible for inserting the results into HBase. This is
done through an interface called +PutTransformer+. The +PutTransformer+
has a method called +getPutCommand()+ that
takes as input a +Map<String, Object>+ representing the fields of the dataset.
It returns a +List<Put>+ describing how to insert the cells into HBase.
The default +PutTransformer+ implementation is the +ToStringPutTransformer+
that uses the string-based representation of each field to serialize the
fields to HBase.
You can override this implementation by implementing your own +PutTransformer+
and adding it to the classpath for the map tasks (e.g., with the +-libjars+
option). To tell Sqoop to use your implementation, set the
+sqoop.hbase.insert.put.transformer.class+ property to identify your class
with +-D+.
Within your +PutTransformer+ implementation, the specified row key
column and column family are
available via the +getRowKeyColumn()+ and +getColumnFamily()+ methods.
You are free to make additional Put operations outside these constraints;
for example, to inject additional rows representing a secondary index.
However, Sqoop will execute all +Put+ operations against the table
specified with +\--hbase-table+.
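
A minimal sketch of a custom +PutTransformer+ that mirrors the shape of
+ToStringPutTransformer+ but (as a hypothetical transformation) upper-cases
cell values. The exact HBase client call (+Put.add()+ here) depends on the
HBase version in use:

----
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.sqoop.hbase.PutTransformer;

public class UpperCasePutTransformer extends PutTransformer {
  @Override
  public List<Put> getPutCommand(Map<String, Object> fields)
      throws IOException {
    Object rowKey = fields.get(getRowKeyColumn());
    if (rowKey == null) {
      return null; // No row key in this record; skip it.
    }
    Put put = new Put(Bytes.toBytes(rowKey.toString()));
    byte[] family = Bytes.toBytes(getColumnFamily());
    for (Map.Entry<String, Object> entry : fields.entrySet()) {
      if (entry.getKey().equals(getRowKeyColumn())
          || entry.getValue() == null) {
        continue;
      }
      // Hypothetical transformation: store all values upper-cased.
      put.add(family, Bytes.toBytes(entry.getKey()),
          Bytes.toBytes(entry.getValue().toString().toUpperCase()));
    }
    return Collections.singletonList(put);
  }
}
----

This class would then be enabled with "+-D
sqoop.hbase.insert.put.transformer.class=UpperCasePutTransformer+" after
placing its jar on the task classpath via +-libjars+.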
Sqoop Internals
~~~~~~~~~~~~~~~
This section describes the internal architecture of Sqoop.
The Sqoop program is driven by the +org.apache.sqoop.Sqoop+ main class.
A limited number of additional classes are in the same package: +SqoopOptions+
(described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+
instances).
General program flow
^^^^^^^^^^^^^^^^^^^^
The general program flow is as follows:
+org.apache.sqoop.Sqoop+ is the main class and implements _Tool_. A new
instance is launched with +ToolRunner+. The first argument to Sqoop is
a string identifying the name of a +SqoopTool+ to run. The +SqoopTool+
itself drives the execution of the user's requested operation (e.g.,
import, export, codegen, etc).
The +SqoopTool+ API is specified fully in
http://wiki.github.com/cloudera/sqoop/sip-1[SIP-1].
The chosen +SqoopTool+ will parse the remainder of the arguments,
setting the appropriate fields in the +SqoopOptions+ class. It will
then run its body.
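
A minimal sketch of this launch path from client code, using the static
+Sqoop.runTool()+ entry point; the connection details are placeholders:

----
import org.apache.sqoop.Sqoop;

public class LaunchExample {
  public static void main(String[] args) {
    String[] sqoopArgs = new String[] {
      "import",                      // names the SqoopTool to run
      "--connect", "jdbc:mysql://db.example.com/corp",
      "--table", "employees",
    };
    // Sqoop implements Tool and is dispatched through ToolRunner, so
    // generic "-D" options are parsed before the tool's own arguments.
    int ret = Sqoop.runTool(sqoopArgs);
    System.exit(ret);
  }
}
----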
The +SqoopTool+'s +run()+ method then executes the import, export, or
other requested action. Typically, a +ConnManager+ is then
instantiated based on the data in the +SqoopOptions+. The
+ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+;
the mechanics of this were described in an earlier section. Imports
and exports and other large data motion tasks typically run a
MapReduce job to operate on a table in a parallel, reliable fashion.
An import does not specifically need to be run via a MapReduce job;
the +ConnManager.importTable()+ method is left to determine how best
to run the import. Each main action is actually controlled by the
+ConnManager+, except for code generation, which is done by
the +CompilationManager+ and +ClassWriter+. (Both are in the
+org.apache.sqoop.orm+ package.) Importing into Hive is also
taken care of via the +org.apache.sqoop.hive.HiveImport+ class
after +importTable()+ has completed. This is done without concern
for the +ConnManager+ implementation used.
A ConnManager's +importTable()+ method receives a single argument of
type +ImportJobContext+ which contains parameters to the method. This
class may be extended with additional parameters in the future, which
optionally further direct the import operation. Similarly, the
+exportTable()+ method receives an argument of type
+ExportJobContext+. These classes contain the name of the table to
import/export, a reference to the +SqoopOptions+ object, and other
related data.
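
A minimal sketch of a +ConnManager+ inspecting its context before delegating;
+AcmeDbManager+ is the hypothetical manager from the extension section, and
the logging is purely illustrative:

----
import java.io.IOException;
import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.manager.ImportJobContext;
import org.apache.sqoop.util.ImportException;

public class AcmeDbManagerWithLogging extends AcmeDbManager {
  public AcmeDbManagerWithLogging(SqoopOptions opts) {
    super(opts);
  }

  @Override
  public void importTable(ImportJobContext context)
      throws IOException, ImportException {
    // The context bundles the table name, the SqoopOptions, and other
    // job data; future versions may add further parameters to it.
    System.err.println("Importing table: " + context.getTableName());
    super.importTable(context); // reuse the inherited import strategy.
  }
}
----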
Subpackages
^^^^^^^^^^^
The following subpackages under +org.apache.sqoop+ exist:
* +hive+ - Facilitates importing data to Hive.
* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and
_Writer_).
* +lib+ - The external public API (described earlier).
* +manager+ - The +ConnManager+ and +ManagerFactory+ classes and their
implementations.
* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
* +orm+ - Code auto-generation.
* +tool+ - Implementations of +SqoopTool+.
* +util+ - Miscellaneous utility classes.
The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations
used by direct writers to HDFS. The +SplittableBufferedWriter+ allows a single
BufferedWriter to be opened to a client which will, under the hood, write to
multiple files in series as they reach a target threshold size. This allows
unsplittable compression libraries (e.g., gzip) to be used in conjunction with
Sqoop import while still allowing subsequent MapReduce jobs to use multiple
input splits per dataset. The large object file storage (see
http://wiki.github.com/cloudera/sqoop/sip-3[SIP-3]) system's code
lies in the +io+ package as well.
The +mapreduce+ package contains code that interfaces directly with
Hadoop MapReduce. This package's contents are described in more detail
in the next section.
The +orm+ package contains code used for class generation. It depends on the
JDK's +tools.jar+, which provides the +com.sun.tools.javac+ package.
The +util+ package contains various utilities used throughout Sqoop:
* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the
current thread. This is principally used to load auto-generated code into the
current thread when running MapReduce in local (standalone) mode.
* +DirectImportUtils+ contains convenience methods used by direct HDFS
importers.
* +Executor+ launches external processes and connects these to stream handlers
generated by an +AsyncSink+ (see more detail below).
* +ExportException+ is thrown by +ConnManagers+ when exports fail.
* +ImportException+ is thrown by +ConnManagers+ when imports fail.
* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not
specification-conforming. (In particular, JDBC connect strings may have
+multi:part:scheme://+ components.)
* +PerfCounters+ are used to estimate transfer rates for display to the user.
* +ResultSetPrinter+ will pretty-print a _ResultSet_.
In several places, Sqoop reads the stdout from external processes. The most
straightforward cases are direct-mode imports as performed by the
+LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by
+Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr
(+Process.getErrorStream()+) must be handled. Failure to read enough data from
both of these streams will cause the external process to block before writing
more. Consequently, these must both be handled, and preferably asynchronously.
In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and
reads it to completion. These are realized by +AsyncSink+ implementations. The
+org.apache.sqoop.util.AsyncSink+ abstract class defines the operations
an implementation must perform. +processStream()+ will spawn another thread to
immediately begin handling the data read from the +InputStream+ argument; it
must read this stream to completion. The +join()+ method allows external threads
to wait until this processing is complete.
Some "stock" +AsyncSink+ implementations are provided: the +LoggingAsyncSink+ will
repeat everything on the +InputStream+ as log4j INFO statements. The
+NullAsyncSink+ consumes all its input and does nothing.
The various +ConnManagers+ that make use of external processes have their own
+AsyncSink+ implementations as inner classes, which read from the database tools
and forward the data along to HDFS, possibly performing formatting conversions
in the meantime.
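
A minimal sketch of a custom +AsyncSink+ that just counts lines; the stock
implementations follow the same shape:

----
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.sqoop.util.AsyncSink;

public class CountingAsyncSink extends AsyncSink {
  private Thread worker;
  private volatile int lines;

  @Override
  public void processStream(final InputStream is) {
    // Start reading immediately on a separate thread so the external
    // process never blocks on a full stdout/stderr pipe.
    worker = new Thread(new Runnable() {
      public void run() {
        try {
          BufferedReader r = new BufferedReader(new InputStreamReader(is));
          while (r.readLine() != null) {
            lines++;
          }
          r.close();
        } catch (IOException ioe) {
          // The stream may close abruptly when the process exits.
        }
      }
    });
    worker.start();
  }

  @Override
  public int join() throws InterruptedException {
    worker.join(); // block until the stream has been fully consumed.
    return 0;      // status code for the caller.
  }
}
----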
Interfacing with MapReduce
^^^^^^^^^^^^^^^^^^^^^^^^^^
Sqoop schedules MapReduce jobs to effect imports and exports.
Configuration and execution of MapReduce jobs follows a few common
steps (configuring the +InputFormat+; configuring the +OutputFormat+;
setting the +Mapper+ implementation; etc...). These steps are
formalized in the +org.apache.sqoop.mapreduce.JobBase+ class.
The +JobBase+ allows a user to specify the +InputFormat+,
+OutputFormat+, and +Mapper+ to use.
+JobBase+ itself is subclassed by +ImportJobBase+ and +ExportJobBase+
which offer better support for the particular configuration steps
common to import or export-related jobs, respectively.
+ImportJobBase.runImport()+ will call the configuration steps and run
a job to import a table to HDFS.
Subclasses of these base classes exist as well. For example,
+DataDrivenImportJob+ uses the +DataDrivenDBInputFormat+ to run an
import. This is the most common type of import used by the various
+ConnManager+ implementations available. MySQL uses a different class
(+MySQLDumpImportJob+) to run a direct-mode import. Its custom
+Mapper+ and +InputFormat+ implementations reside in this package as
well.
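
As a reference point, a minimal sketch of the three configuration steps that
+JobBase+ formalizes, expressed directly against the stock Hadoop (0.20+)
MapReduce API; the format and mapper classes here are placeholders rather
than Sqoop's actual defaults:

----
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class JobBaseStyleExample {
  public static Job configureJob(Configuration conf) throws IOException {
    Job job = new Job(conf, "sqoop-style-job");

    // The three steps JobBase lets subclasses customize:
    job.setInputFormatClass(TextInputFormat.class);   // the InputFormat
    job.setOutputFormatClass(TextOutputFormat.class); // the OutputFormat
    job.setMapperClass(Mapper.class);                 // the Mapper

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    return job;
  }
}
----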