
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

Developer API Reference
-----------------------

This section specifies the APIs available to application writers who
want to integrate with Sqoop, and those who want to modify Sqoop.

The next three subsections are written for the following use cases:

- Using classes generated by Sqoop and its public library
- Writing Sqoop extensions (that is, additional +ConnManager+ implementations
  that interact with more databases)
- Modifying Sqoop's internals

Each section describes the system in successively greater depth.


The External API
~~~~~~~~~~~~~~~~

Sqoop automatically generates classes that represent the tables
imported into the Hadoop Distributed File System (HDFS). The class
contains member fields for each column of the imported table; an
instance of the class holds one row of the table. The generated
classes implement the serialization APIs used in Hadoop, namely the
_Writable_ and _DBWritable_ interfaces. They also contain these other
convenience methods:

- A +parse()+ method that interprets delimited text fields
- A +toString()+ method that preserves the user's chosen delimiters

The full set of methods guaranteed to exist in an auto-generated class
is specified in the abstract class
+org.apache.sqoop.lib.SqoopRecord+.
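
For example, a program can use a generated class to round-trip a record
through its delimited text form. The following is a minimal sketch;
+SomeTable+ stands in for a hypothetical class produced by +sqoop codegen+
for a two-column table:

[source,java]
----
import org.apache.sqoop.lib.RecordParser;

public class GeneratedClassDemo {
  public static void main(String[] args) throws RecordParser.ParseError {
    // SomeTable is a hypothetical Sqoop-generated class for a table
    // with columns (id INT, name VARCHAR).
    SomeTable record = new SomeTable();

    // parse() populates the fields from one line of delimited text,
    // using the delimiters the class was generated with.
    record.parse("1,Alice\n");

    // toString() re-emits the row with the same delimiters, so the
    // record round-trips through its text form unchanged.
    System.out.print(record.toString());
  }
}
----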

Instances of +SqoopRecord+ may depend on Sqoop's public API, which
comprises all classes in the +org.apache.sqoop.lib+ package. These are
briefly described below. Clients of Sqoop should not need to interact
with any of these classes directly, although classes generated by Sqoop
will depend on them. Therefore, these APIs are considered public and
care will be taken when forward-evolving them.

* The +RecordParser+ class parses a line of text into a list of fields,
  using controllable delimiters and quote characters (see the sketch
  after this list).
* The +FieldFormatter+ class provides a static method that handles the
  quoting and escaping of characters in a field; it is used in
  +SqoopRecord.toString()+ implementations.
* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and
  +SqoopRecords+ is done via +JdbcWritableBridge+.
* +BigDecimalSerializer+ contains a pair of methods that facilitate
  serialization of +BigDecimal+ objects over the _Writable_ interface.
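
The first two of these can be exercised directly. Below is a minimal
sketch; the +DelimiterSet+ helper class and the exact signatures shown
are assumptions based on the Sqoop 1.4 codebase:

[source,java]
----
import java.util.List;

import org.apache.sqoop.lib.DelimiterSet;
import org.apache.sqoop.lib.FieldFormatter;
import org.apache.sqoop.lib.RecordParser;

public class LibDemo {
  public static void main(String[] args) throws RecordParser.ParseError {
    // Fields separated by ',', records by '\n', optionally enclosed in
    // '"' and escaped with '\\'; enclosing is not required.
    DelimiterSet delims = new DelimiterSet(',', '\n', '"', '\\', false);

    // Split one line of delimited text into its constituent fields.
    RecordParser parser = new RecordParser(delims);
    List<String> fields = parser.parseRecord("1,\"Smith, Alice\"\n");

    // Quote and escape a single field the way generated toString()
    // implementations do.
    String formatted = FieldFormatter.escapeAndEnclose("Smith, Alice", delims);
    System.out.println(fields.size() + " fields; formatted: " + formatted);
  }
}
----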

The full specification of the public API is available on the Sqoop
Development Wiki as
http://wiki.github.com/cloudera/sqoop/sip-4[SIP-4].


The Extension API
~~~~~~~~~~~~~~~~~

This section covers the API and primary classes used by extensions to Sqoop
that allow Sqoop to interface with more database vendors.

While Sqoop uses JDBC and +DataDrivenDBInputFormat+ to
read from databases, differences in the SQL supported by different vendors,
as well as in JDBC metadata, necessitate vendor-specific codepaths for most
databases. Sqoop addresses this problem with the +ConnManager+ API
(+org.apache.sqoop.manager.ConnManager+).

+ConnManager+ is an abstract class defining all methods that interact with the
database itself. Most implementations of +ConnManager+ will extend the
+org.apache.sqoop.manager.SqlManager+ abstract class, which uses standard
SQL to perform most actions. Subclasses are required to implement the
+getConnection()+ method, which returns the actual JDBC connection to the
database. Subclasses are free to override all other methods as well. The
+SqlManager+ class itself exposes a protected API that allows developers to
selectively override behavior. For example, the +getColNamesQuery()+ method
allows the SQL query used by +getColNames()+ to be modified without needing to
rewrite the majority of +getColNames()+.
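
As an illustration, a vendor-specific manager might look like the
following sketch. "AcmeDB" and its driver class are invented for
illustration, and the exact signatures may differ between Sqoop versions:

[source,java]
----
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.manager.SqlManager;

// A sketch of a manager for the hypothetical AcmeDB database.
public class AcmeDbManager extends SqlManager {

  public AcmeDbManager(SqoopOptions opts) {
    super(opts);
  }

  @Override
  public String getDriverClass() {
    return "com.acme.jdbc.Driver"; // hypothetical JDBC driver
  }

  @Override
  public Connection getConnection() throws SQLException {
    // The SqoopOptions held by SqlManager carry the user's credentials.
    return DriverManager.getConnection(options.getConnectString(),
        options.getUsername(), options.getPassword());
  }

  // Selectively override a protected hook instead of rewriting the
  // whole method: here, the query used by getColNames().
  @Override
  protected String getColNamesQuery(String tableName) {
    return "SELECT * FROM " + escapeTableName(tableName) + " WHERE 1=0";
  }
}
----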

+ConnManager+ implementations receive much of their configuration
data from a Sqoop-specific class, +SqoopOptions+. Instances of
+SqoopOptions+ are mutable. +SqoopOptions+ does not directly store
specific per-manager options. Instead, it contains a reference to the
+Configuration+ returned by +Tool.getConf()+ after parsing command-line
arguments with the +GenericOptionsParser+. This allows extension
arguments via "+-D any.specific.param=any.value+" without requiring any
layering of options parsing or modification of +SqoopOptions+. This
+Configuration+ forms the basis of the +Configuration+ passed to any
MapReduce +Job+ invoked in the workflow, so that users can set any
necessary custom Hadoop state on the command line.
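
For example, a manager could recover a vendor-specific setting supplied
on the command line as +-D acme.fetch.size=5000+ through the embedded
+Configuration+. The property name and helper class here are hypothetical:

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.SqoopOptions;

// A sketch: read an extension-specific "-D" property back out of the
// Configuration embedded in SqoopOptions.
public final class AcmeSettings {
  private AcmeSettings() { }

  public static int fetchSize(SqoopOptions opts) {
    Configuration conf = opts.getConf(); // parsed by GenericOptionsParser
    return conf.getInt("acme.fetch.size", 1000); // 1000 is a default
  }
}
----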

All existing +ConnManager+ implementations are stateless. Thus, the
system which instantiates +ConnManagers+ may create multiple
instances of the same +ConnManager+ class over Sqoop's lifetime. It
is currently assumed that instantiating a +ConnManager+ is a
lightweight operation, and is done reasonably infrequently. Therefore,
+ConnManagers+ are not cached between operations, etc.

+ConnManagers+ are currently created by instances of the abstract
class +ManagerFactory+ (see
http://issues.apache.org/jira/browse/MAPREDUCE-750[MAPREDUCE-750]). One
+ManagerFactory+ implementation currently serves all of Sqoop:
+org.apache.sqoop.manager.DefaultManagerFactory+. Extensions
should not modify +DefaultManagerFactory+. Instead, an
extension-specific +ManagerFactory+ implementation should be provided
with the new +ConnManager+. +ManagerFactory+ has a single method of
note, named +accept()+. This method determines whether the factory can
instantiate a +ConnManager+ for the user's +SqoopOptions+. If so, it
returns the +ConnManager+ instance. Otherwise, it returns +null+.
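
A minimal factory for the hypothetical AcmeDB manager sketched earlier
might look like this. In recent Sqoop versions +accept()+ receives a
+JobData+ wrapper around the +SqoopOptions+; the +jdbc:acmedb:+ prefix
is invented for illustration:

[source,java]
----
import org.apache.sqoop.manager.ConnManager;
import org.apache.sqoop.manager.ManagerFactory;
import org.apache.sqoop.metastore.JobData;

// A sketch of an extension-specific ManagerFactory.
public class AcmeDbManagerFactory extends ManagerFactory {

  @Override
  public ConnManager accept(JobData data) {
    String connectStr = data.getSqoopOptions().getConnectString();
    if (connectStr != null && connectStr.startsWith("jdbc:acmedb:")) {
      return new AcmeDbManager(data.getSqoopOptions());
    }
    return null; // not ours; let another factory try
  }
}
----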

The +ManagerFactory+ implementations used are governed by the
+sqoop.connection.factories+ setting in +sqoop-site.xml+. Users of extension
libraries can install the third-party library containing a new +ManagerFactory+
and +ConnManager+(s), and configure +sqoop-site.xml+ to use the new
+ManagerFactory+. The +DefaultManagerFactory+ principally discriminates between
databases by parsing the connect string stored in +SqoopOptions+.

Extension authors may make use of classes in the +org.apache.sqoop.io+,
+mapreduce+, and +util+ packages to facilitate their implementations.
These packages and classes are described in more detail in the following
section.

HBase Serialization Extensions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sqoop supports imports from databases to HBase. Data copied into HBase
must be transformed into a format HBase can accept. Specifically:

* Data must be placed into one (or more) tables in HBase.
* Columns of input data must be placed into a column family.
* Values must be serialized to byte arrays to put into cells.

All of this is done via +Put+ operations in the HBase client API.
Sqoop's interaction with HBase is performed in the +org.apache.sqoop.hbase+
package. Records are deserialized from the database and emitted from the mapper.
The +OutputFormat+ is responsible for inserting the results into HBase. This is
done through the +PutTransformer+ abstraction. The +PutTransformer+
has a method called +getPutCommand()+ that
takes as input a +Map<String, Object>+ representing the fields of the dataset.
It returns a +List<Put>+ describing how to insert the cells into HBase.
The default +PutTransformer+ implementation is the +ToStringPutTransformer+,
which uses the string-based representation of each field to serialize the
fields to HBase.

You can override this implementation by implementing your own +PutTransformer+
and adding it to the classpath for the map tasks (e.g., with the +-libjars+
option). To tell Sqoop to use your implementation, set the
+sqoop.hbase.insert.put.transformer.class+ property to identify your class
with +-D+.

Within your +PutTransformer+ implementation, the specified row key
column and column family are
available via the +getRowKeyColumn()+ and +getColumnFamily()+ methods.
You are free to make additional +Put+ operations outside these constraints;
for example, to inject additional rows representing a secondary index.
However, Sqoop will execute all +Put+ operations against the table
specified with +\--hbase-table+.
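
As a sketch, a transformer that upper-cases string values before writing
might look like the following. This assumes +PutTransformer+ is an
abstract base class exposing the getters above, and uses the older HBase
client API in which +Put.add()+ adds a cell:

[source,java]
----
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.sqoop.hbase.PutTransformer;

// Select with:
// -D sqoop.hbase.insert.put.transformer.class=com.example.UpperCasePutTransformer
public class UpperCasePutTransformer extends PutTransformer {

  @Override
  public List<Put> getPutCommand(Map<String, Object> fields)
      throws IOException {
    Object rowKey = fields.get(getRowKeyColumn());
    if (rowKey == null) {
      return Collections.emptyList(); // no row key; skip this record
    }
    byte[] family = Bytes.toBytes(getColumnFamily());
    Put put = new Put(Bytes.toBytes(rowKey.toString()));
    for (Map.Entry<String, Object> e : fields.entrySet()) {
      if (e.getKey().equals(getRowKeyColumn()) || e.getValue() == null) {
        continue; // the row key itself is not stored as a cell
      }
      String value = e.getValue().toString().toUpperCase();
      put.add(family, Bytes.toBytes(e.getKey()), Bytes.toBytes(value));
    }
    return Collections.singletonList(put);
  }
}
----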

Sqoop Internals
~~~~~~~~~~~~~~~

This section describes the internal architecture of Sqoop.

The Sqoop program is driven by the +org.apache.sqoop.Sqoop+ main class.
A limited number of additional classes are in the same package: +SqoopOptions+
(described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+
instances).

General program flow
^^^^^^^^^^^^^^^^^^^^

The general program flow is as follows:

+org.apache.sqoop.Sqoop+ is the main class and implements _Tool_. A new
instance is launched with +ToolRunner+. The first argument to Sqoop is
a string identifying the name of a +SqoopTool+ to run. The +SqoopTool+
itself drives the execution of the user's requested operation (e.g.,
import, export, codegen, etc.).
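
The same flow can be driven programmatically. This is a minimal sketch,
assuming the static +Sqoop.runTool()+ helper; the connect string and
table name are placeholders:

[source,java]
----
import org.apache.sqoop.Sqoop;

public class EmbeddedSqoop {
  public static void main(String[] args) {
    // The first element names the SqoopTool; the rest are that
    // tool's arguments.
    String[] sqoopArgs = {
      "import",
      "--connect", "jdbc:mysql://db.example.com/corp",
      "--table", "employees",
    };
    int ret = Sqoop.runTool(sqoopArgs);
    System.exit(ret);
  }
}
----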

The +SqoopTool+ API is specified fully in
http://wiki.github.com/cloudera/sqoop/sip-1[SIP-1].

The chosen +SqoopTool+ will parse the remainder of the arguments,
setting the appropriate fields in the +SqoopOptions+ class. It will
then run its body.
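
The shape of that contract can be sketched as a trivial tool. The
+run(SqoopOptions)+ entry point and the name passed to the constructor
follow the +SqoopTool+ API; the tool itself is hypothetical, and wiring
it into Sqoop's tool table is omitted:

[source,java]
----
import org.apache.sqoop.SqoopOptions;
import org.apache.sqoop.tool.SqoopTool;

// A minimal, hypothetical SqoopTool. Only run() is required; argument
// parsing hooks may be overridden to accept tool-specific flags.
public class HelloTool extends SqoopTool {

  public HelloTool() {
    super("hello"); // the name a user passes as Sqoop's first argument
  }

  @Override
  public int run(SqoopOptions options) {
    System.out.println("Connect string: " + options.getConnectString());
    return 0; // exit status returned through ToolRunner
  }
}
----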

Then in the +SqoopTool+'s +run()+ method, the import or export or other
action proper is executed. Typically, a +ConnManager+ is then
instantiated based on the data in the +SqoopOptions+. The
+ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+;
the mechanics of this were described in an earlier section. Imports
and exports and other large data motion tasks typically run a
MapReduce job to operate on a table in a parallel, reliable fashion.
An import does not necessarily need to be run via a MapReduce job; the
+ConnManager.importTable()+ method is left to determine how best
to run the import. Each main action is actually controlled by the
+ConnManager+, except for code generation, which is done by
the +CompilationManager+ and +ClassWriter+ (both in the
+org.apache.sqoop.orm+ package). Importing into Hive is also
taken care of via the +org.apache.sqoop.hive.HiveImport+ class
after +importTable()+ has completed. This is done without concern
for the +ConnManager+ implementation used.

A +ConnManager+'s +importTable()+ method receives a single argument of
type +ImportJobContext+ which contains parameters to the method. This
class may be extended with additional parameters in the future, which
optionally further direct the import operation. Similarly, the
+exportTable()+ method receives an argument of type
+ExportJobContext+. These classes contain the name of the table to
import/export, a reference to the +SqoopOptions+ object, and other
related data.

Subpackages
^^^^^^^^^^^

The following subpackages under +org.apache.sqoop+ exist:

* +hive+ - Facilitates importing data to Hive.
* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and
  _Writer_).
* +lib+ - The external public API (described earlier).
* +manager+ - The +ConnManager+ and +ManagerFactory+ interfaces and their
  implementations.
* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
* +orm+ - Code auto-generation.
* +tool+ - Implementations of +SqoopTool+.
* +util+ - Miscellaneous utility classes.

The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations
used by direct writers to HDFS. The +SplittableBufferedWriter+ allows a single
_BufferedWriter_ to be opened to a client which will, under the hood, write to
multiple files in series as they reach a target threshold size. This allows
unsplittable compression libraries (e.g., gzip) to be used in conjunction with
Sqoop import while still allowing subsequent MapReduce jobs to use multiple
input splits per dataset. The large object file storage (see
http://wiki.github.com/cloudera/sqoop/sip-3[SIP-3]) system's code
lies in the +io+ package as well.

The +mapreduce+ package contains code that interfaces directly with
Hadoop MapReduce. This package's contents are described in more detail
in the next section.

The +orm+ package contains code used for class generation. It depends on the
JDK's +tools.jar+, which provides the +com.sun.tools.javac+ package.

The +util+ package contains various utilities used throughout Sqoop:

* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the
  current thread. This is principally used to load auto-generated code into the
  current thread when running MapReduce in local (standalone) mode.
* +DirectImportUtils+ contains convenience methods used by direct HDFS
  importers.
* +Executor+ launches external processes and connects these to stream handlers
  generated by an +AsyncSink+ (see more detail below).
* +ExportException+ is thrown by +ConnManagers+ when exports fail.
* +ImportException+ is thrown by +ConnManagers+ when imports fail.
* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not
  specification-conforming. (In particular, JDBC connect strings may have
  +multi:part:scheme://+ components.)
* +PerfCounters+ are used to estimate transfer rates for display to the user.
* +ResultSetPrinter+ will pretty-print a _ResultSet_.

In several places, Sqoop reads the stdout from external processes. The most
straightforward cases are direct-mode imports as performed by the
+LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by
+Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr
(+Process.getErrorStream()+) must be handled. Failure to read enough data from
both of these streams will cause the external process to block before writing
more. Consequently, these must both be handled, and preferably asynchronously.

In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and
reads it to completion. These are realized by +AsyncSink+ implementations. The
+org.apache.sqoop.util.AsyncSink+ abstract class defines the operations an
implementation must perform. +processStream()+ will spawn another thread to
immediately begin handling the data read from the +InputStream+ argument; it
must read this stream to completion. The +join()+ method allows external threads
to wait until this processing is complete.
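
A minimal sketch of an implementation that simply counts lines follows;
the convention that +join()+ returns an integer exit status is an
assumption, and error handling is kept deliberately thin:

[source,java]
----
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.sqoop.util.AsyncSink;

public class CountingAsyncSink extends AsyncSink {
  private Thread worker;
  private volatile long lines;

  @Override
  public void processStream(final InputStream is) {
    worker = new Thread(new Runnable() {
      @Override
      public void run() {
        try (BufferedReader r =
            new BufferedReader(new InputStreamReader(is))) {
          while (r.readLine() != null) {
            lines++; // consume to completion so the child never blocks
          }
        } catch (IOException ioe) {
          // A real sink would record the failure and surface it in join().
        }
      }
    });
    worker.start();
  }

  @Override
  public int join() throws InterruptedException {
    worker.join();
    return 0; // status code; 0 indicates success
  }

  public long getLineCount() {
    return lines;
  }
}
----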

Some "stock" +AsyncSink+ implementations are provided: the +LoggingAsyncSink+ will
repeat everything on the +InputStream+ as log4j INFO statements. The
+NullAsyncSink+ consumes all its input and does nothing.

The various +ConnManagers+ that make use of external processes have their own
+AsyncSink+ implementations as inner classes, which read from the database tools
and forward the data along to HDFS, possibly performing formatting conversions
in the meantime.


Interfacing with MapReduce
^^^^^^^^^^^^^^^^^^^^^^^^^^

Sqoop schedules MapReduce jobs to effect imports and exports.
Configuration and execution of MapReduce jobs follows a few common
steps (configuring the +InputFormat+; configuring the +OutputFormat+;
setting the +Mapper+ implementation; etc.). These steps are
formalized in the +org.apache.sqoop.mapreduce.JobBase+ class.
+JobBase+ allows a user to specify the +InputFormat+,
+OutputFormat+, and +Mapper+ to use.

+JobBase+ is itself subclassed by +ImportJobBase+ and +ExportJobBase+,
which offer better support for the particular configuration steps
common to import- or export-related jobs, respectively.
+ImportJobBase.runImport()+ will call the configuration steps and run
a job to import a table to HDFS.

Subclasses of these base classes exist as well. For example,
+DataDrivenImportJob+ uses the +DataDrivenDBInputFormat+ to run an
import. This is the most common type of import used by the various
+ConnManager+ implementations available. MySQL uses a different class
(+MySQLDumpImportJob+) to run a direct-mode import. Its custom
+Mapper+ and +InputFormat+ implementations reside in this package as
well.