////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

Developer API Reference
-----------------------

This section specifies the APIs available to application writers integrating
with Sqoop, and to those modifying Sqoop. The next three subsections are
written for three audiences: those using classes generated by Sqoop and its
public library; those writing Sqoop extensions (i.e., additional
+ConnManager+ implementations that interact with more databases); and those
modifying Sqoop's internals. Each section describes the system in
successively greater depth.


The External API
~~~~~~~~~~~~~~~~

Sqoop auto-generates classes that represent the tables imported into HDFS. Each
generated class contains member fields for each column of the imported table;
an instance of the class holds one row of the table. The generated classes
implement the serialization APIs used in Hadoop, namely the _Writable_ and
_DBWritable_ interfaces. They also contain other convenience methods: a
+parse()+ method that interprets delimited text fields, and a +toString()+
method that preserves the user's chosen delimiters. The full set of methods
guaranteed to exist in an auto-generated class is specified in the interface
+org.apache.hadoop.sqoop.lib.SqoopRecord+.

Instances of _SqoopRecord_ may depend on Sqoop's public API, which consists of
all classes in the +org.apache.hadoop.sqoop.lib+ package. These are briefly
described below. Clients of Sqoop should not need to interact directly with any
of these classes, although classes generated by Sqoop will depend on them.
Therefore, these APIs are considered public, and care will be taken when
forward-evolving them.

* The +RecordParser+ class parses a line of text into a list of fields, using
  controllable delimiters and quote characters.
* The static +FieldFormatter+ class provides a method that handles quoting and
  escaping of characters in a field; it is used in +SqoopRecord.toString()+
  implementations.
* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and
  _SqoopRecords_ is done via +JdbcWritableBridge+.
* +BigDecimalSerializer+ contains a pair of methods that facilitate
  serialization of +BigDecimal+ objects over the _Writable_ interface.
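
To make the pattern concrete, here is a minimal, hand-written sketch of what a
generated class looks like. The class name, field names, and comma delimiter
are hypothetical, and a real generated class additionally implements
_Writable_, _DBWritable_, and the rest of +SqoopRecord+ (including
+readFields()+ and +write()+ for Hadoop serialization):

```java
// Hand-written sketch of the pattern Sqoop's generated classes follow.
// Class and field names are hypothetical; a real generated class also
// implements Writable, DBWritable, and the full SqoopRecord interface.
class EmployeeRecord {
    private Integer id;
    private String name;

    public Integer get_id() { return id; }
    public String get_name() { return name; }

    // parse(): interpret one line of delimited text into member fields.
    // A ',' field delimiter is assumed here for illustration.
    public void parse(String line) {
        String[] fields = line.split(",", -1);
        this.id = Integer.valueOf(fields[0]);
        this.name = fields[1];
    }

    // toString(): re-emit the row using the chosen delimiters, so that
    // parse() and toString() round-trip a record.
    @Override
    public String toString() {
        return id + "," + name;
    }
}
```

This sketch shows only the +parse()+/+toString()+ round trip that the
delimited-text import and export paths rely on.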

The Extension API
~~~~~~~~~~~~~~~~~

This section covers the API and primary classes used by Sqoop extensions, which
allow Sqoop to interface with more database vendors.

While Sqoop uses JDBC and +DBInputFormat+ (and +DataDrivenDBInputFormat+) to
read from databases, differences in the SQL supported by different vendors, as
well as in JDBC metadata, necessitate vendor-specific code paths for most
databases. Sqoop solves this problem by introducing the +ConnManager+ API
(+org.apache.hadoop.sqoop.manager.ConnManager+).

+ConnManager+ is an abstract class defining all methods that interact with the
database itself. Most implementations of +ConnManager+ will extend the
+org.apache.hadoop.sqoop.manager.SqlManager+ abstract class, which uses standard
SQL to perform most actions. Subclasses are required to implement the
+getConnection()+ method, which returns the actual JDBC connection to the
database. Subclasses are free to override all other methods as well. The
+SqlManager+ class itself exposes a protected API that allows developers to
selectively override behavior. For example, the +getColNamesQuery()+ method
allows the SQL query used by +getColNames()+ to be modified without needing to
rewrite the majority of +getColNames()+.
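
The protected-override pattern can be sketched as follows. These stand-in
classes are illustrative only; the exact SQL text and method surroundings are
assumptions, not +SqlManager+'s actual code:

```java
// Sketch of the protected-override pattern used by SqlManager.
// These classes are simplified stand-ins, not Sqoop's actual code.
abstract class SketchSqlManager {
    // Protected hook: subclasses can change just the SQL text.
    protected String getColNamesQuery(String tableName) {
        return "SELECT t.* FROM " + tableName + " AS t WHERE 1=0";
    }

    // The surrounding logic (run the query, inspect its metadata)
    // is reused as-is by every subclass.
    public String describeColumns(String tableName) {
        return "executing: " + getColNamesQuery(tableName);
    }
}

class SketchVendorManager extends SketchSqlManager {
    // A hypothetical vendor whose dialect rejects "AS" table aliases:
    // only the query text changes, not the logic around it.
    @Override
    protected String getColNamesQuery(String tableName) {
        return "SELECT t.* FROM " + tableName + " t WHERE 1=0";
    }
}
```

This keeps vendor differences confined to small, well-named hook methods
rather than duplicated top-level logic.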

+ConnManager+ implementations receive much of their configuration data from a
Sqoop-specific class, +SqoopOptions+. While +SqoopOptions+ does not currently
contain many setter methods, clients should not assume +SqoopOptions+ is
immutable; more setter methods may be added in the future. +SqoopOptions+ does
not directly store specific per-manager options. Instead, it contains a
reference to the +Configuration+ returned by +Tool.getConf()+ after parsing
command-line arguments with the +GenericOptionsParser+. This allows extension
arguments via "+-D any.specific.param=any.value+" without requiring any layering
of options parsing or modification of +SqoopOptions+.

All existing +ConnManager+ implementations are stateless. Thus, the system that
instantiates +ConnManagers+ may create multiple instances of the same
+ConnManager+ class over Sqoop's lifetime. If a caching layer is required, one
can be added later, but it is not currently available.

+ConnManagers+ are currently created by instances of the abstract class
+ManagerFactory+ (see MAPREDUCE-750). One +ManagerFactory+ implementation
currently serves all of Sqoop:
+org.apache.hadoop.sqoop.manager.DefaultManagerFactory+. Extensions should not
modify +DefaultManagerFactory+. Instead, an extension-specific +ManagerFactory+
implementation should be provided with the new +ConnManager+. +ManagerFactory+
has a single method of note, named +accept()+. This method determines whether
the factory can instantiate a +ConnManager+ for the user's +SqoopOptions+. If
so, it returns the +ConnManager+ instance; otherwise, it returns +null+.
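
An extension-specific factory might look like the following sketch. The types
here are minimal stand-ins for +ManagerFactory+, +ConnManager+, and
+SqoopOptions+, and the +jdbc:acmedb:+ scheme is hypothetical; only the
+accept()+ contract is illustrated:

```java
// Minimal stand-ins for SqoopOptions, ConnManager, and ManagerFactory;
// only the accept() contract is illustrated here.
class SketchOptions {
    final String connectString;
    SketchOptions(String cs) { this.connectString = cs; }
}

interface SketchConnManager { }

abstract class SketchManagerFactory {
    // Return a ConnManager if this factory handles the options; else null.
    public abstract SketchConnManager accept(SketchOptions opts);
}

class AcmeDbManagerFactory extends SketchManagerFactory {
    @Override
    public SketchConnManager accept(SketchOptions opts) {
        if (opts.connectString.startsWith("jdbc:acmedb:")) {
            // A real factory would return its vendor-specific manager.
            return new SketchConnManager() { };
        }
        return null; // not ours; let another factory try
    }
}
```

Returning +null+ lets the system fall through to the next configured factory,
so extension factories coexist with +DefaultManagerFactory+.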

The +ManagerFactory+ implementations used are governed by the
+sqoop.connection.factories+ setting in sqoop-site.xml. Users of extension
libraries can install a third-party library containing a new +ManagerFactory+
and +ConnManager+(s), and configure sqoop-site.xml to use the new
+ManagerFactory+. The +DefaultManagerFactory+ principally discriminates between
databases by parsing the connect string stored in +SqoopOptions+.
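
For example, a sqoop-site.xml entry along these lines could register a
third-party factory. The +com.example+ class name is hypothetical, and the
comma-separated value format is an assumption about how the setting lists
multiple factories:

```xml
<!-- sqoop-site.xml: register a hypothetical extension ManagerFactory
     alongside the default one. -->
<property>
  <name>sqoop.connection.factories</name>
  <value>com.example.sqoop.AcmeDbManagerFactory,org.apache.hadoop.sqoop.manager.DefaultManagerFactory</value>
</property>
```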

Extension authors may make use of classes in the +org.apache.hadoop.sqoop.io+,
+mapred+, +mapreduce+, and +util+ packages to facilitate their implementations.
These packages and classes are described in more detail in the following
section.


Sqoop Internals
~~~~~~~~~~~~~~~

This section describes the internal architecture of Sqoop.

The Sqoop program is driven by the +org.apache.hadoop.sqoop.Sqoop+ main class.
A limited number of additional classes are in the same package: +SqoopOptions+
(described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+
instances).

General program flow
^^^^^^^^^^^^^^^^^^^^

The general program flow is as follows:

+org.apache.hadoop.sqoop.Sqoop+ is the main class and implements _Tool_. A new
instance is launched with +ToolRunner+. It parses its arguments using the
+SqoopOptions+ class. Within the +SqoopOptions+, an +ImportAction+ is chosen by
the user. This may be to import all tables, to import one specific table, to
execute a SQL statement, or another action.

A +ConnManager+ is then instantiated based on the data in the +SqoopOptions+.
The +ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+; the
mechanics of this were described in an earlier section.

In the +run()+ method, a case statement over the +ImportAction+ enum then
determines which actions the user needs performed. Usually this involves
determining a list of tables to import, generating user code for them, and
running a MapReduce job per table to read the data. The import itself does not
specifically need to be run via a MapReduce job; the +ConnManager.importTable()+
method is left to determine how best to run the import. Each of these actions is
controlled by the +ConnManager+, except for code generation, which is done by
the +CompilationManager+ and +ClassWriter+ (both in the
+org.apache.hadoop.sqoop.orm+ package). Importing into Hive is also taken care
of via the +org.apache.hadoop.sqoop.hive.HiveImport+ class after
+importTable()+ has completed. This is done without concern for the
+ConnManager+ implementation used.

A +ConnManager+'s +importTable()+ method receives a single argument of type
+ImportJobContext+, which contains parameters to the method. This class may be
extended with additional parameters in the future, which optionally further
direct the import operation. Similarly, the +exportTable()+ method receives an
argument of type +ExportJobContext+. These classes contain the name of the table
to import/export, a reference to the +SqoopOptions+ object, and other related
data.

Subpackages
^^^^^^^^^^^

The following subpackages under +org.apache.hadoop.sqoop+ exist:

* +hive+ - Facilitates importing data to Hive.
* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and
  _Writer_).
* +lib+ - The external public API (described earlier).
* +manager+ - The +ConnManager+ and +ManagerFactory+ classes and their
  implementations.
* +mapred+ - Classes interfacing with the old (pre-0.20) MapReduce API.
* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
* +orm+ - Code auto-generation.
* +util+ - Miscellaneous utility classes.

The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations
used by direct writers to HDFS. The +SplittableBufferedWriter+ allows a single
+BufferedWriter+ to be opened to a client which will, under the hood, write to
multiple files in series as they reach a target threshold size. This allows
unsplittable compression libraries (e.g., gzip) to be used in conjunction with
Sqoop imports while still allowing subsequent MapReduce jobs to use multiple
input splits per dataset.
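
The rollover idea behind +SplittableBufferedWriter+ can be sketched as
follows. The class, its threshold handling, and the use of in-memory buffers
in place of HDFS files are all illustrative assumptions, not Sqoop's actual
implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the idea behind SplittableBufferedWriter: present one logical
// writer to the caller, but roll over to a new "file" (here, a plain
// StringBuilder) once the current one reaches a target size.
class SketchSplittingWriter {
    private final int targetBytes;
    private final List<StringBuilder> files = new ArrayList<>();
    private StringBuilder current;

    SketchSplittingWriter(int targetBytes) {
        this.targetBytes = targetBytes;
        rollover();
    }

    private void rollover() {
        // In Sqoop this would open the next file in the HDFS series.
        current = new StringBuilder();
        files.add(current);
    }

    // Splits happen only at record boundaries, so no row is ever
    // divided across two output files.
    void writeRecord(String record) {
        current.append(record).append('\n');
        if (current.length() >= targetBytes) {
            rollover();
        }
    }

    int fileCount() { return files.size(); }
}
```

Because each file in the series stays near the target size, gzip-compressed
output still yields one usable input split per file for later MapReduce jobs.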

Code in the +mapred+ package should be considered deprecated. The +mapreduce+
package contains +DataDrivenImportJob+, which uses the +DataDrivenDBInputFormat+
introduced in 0.21. The +mapred+ package contains +ImportJob+, which uses the
older +DBInputFormat+. Most +ConnManager+ implementations use
+DataDrivenImportJob+; +DataDrivenDBInputFormat+ does not currently work with
Oracle in all circumstances, so the Oracle manager remains on the old code path.

The +orm+ package contains code used for class generation. It depends on the
JDK's +tools.jar+, which provides the +com.sun.tools.javac+ package.

The +util+ package contains various utilities used throughout Sqoop:

* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the
  current thread. This is principally used to load auto-generated code into the
  current thread when running MapReduce in local (standalone) mode.
* +DirectImportUtils+ contains convenience methods used by direct HDFS
  importers.
* +Executor+ launches external processes and connects them to stream handlers
  generated by an +AsyncSink+ (see more detail below).
* +ExportException+ is thrown by +ConnManagers+ when exports fail.
* +ImportException+ is thrown by +ConnManagers+ when imports fail.
* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not
  specification-conforming. (In particular, JDBC connect strings may have
  +multi:part:scheme://+ components.)
* +PerfCounters+ are used to estimate transfer rates for display to the user.
* +ResultSetPrinter+ will pretty-print a _ResultSet_.
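
The parsing wrinkle that +JdbcUrl+ works around can be sketched as follows.
This helper is illustrative only, not Sqoop's actual implementation: standard
URI parsing would stop the scheme at the first colon, losing the rest of a
+multi:part:scheme+:

```java
// Sketch of why JDBC connect strings defeat standard URI parsing: the
// scheme may itself contain colons (e.g. "jdbc:mysql"). Illustrative
// stand-in, not Sqoop's JdbcUrl class.
class SketchConnectString {
    // Treat everything before "//" (minus the trailing ':') as the
    // full multi-part scheme chain.
    static String schemeOf(String connect) {
        int slashes = connect.indexOf("//");
        if (slashes < 1 || connect.charAt(slashes - 1) != ':') {
            throw new IllegalArgumentException("no scheme in: " + connect);
        }
        return connect.substring(0, slashes - 1);
    }
}
```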

In several places, Sqoop reads the stdout from external processes. The most
straightforward cases are direct-mode imports as performed by the
+LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by
+Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr
(+Process.getErrorStream()+) must be handled. Failure to read enough data from
both of these streams will cause the external process to block before writing
more. Consequently, both must be handled, and preferably asynchronously.

In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and
reads it to completion. Async sinks are realized by +AsyncSink+ implementations.
The +org.apache.hadoop.sqoop.util.AsyncSink+ abstract class defines the
operations an implementation must perform: +processStream()+ spawns another
thread to immediately begin handling the data read from the +InputStream+
argument, and must read this stream to completion; the +join()+ method allows
external threads to wait until this processing is complete.
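
The contract can be sketched with a simplified stand-in (not Sqoop's actual
class; the captured-contents accessor is an addition for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Sketch of the AsyncSink pattern: spawn a thread that drains an
// InputStream to completion, with a join() for callers to await it.
class SketchAsyncSink {
    private final StringBuilder captured = new StringBuilder();
    private Thread worker;

    public void processStream(InputStream in) {
        worker = new Thread(() -> {
            try (BufferedReader r =
                     new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // A real sink might log, reformat, or discard the data;
                    // the essential point is that the stream is fully drained
                    // so the child process never blocks on a full pipe.
                    captured.append(line).append('\n');
                }
            } catch (IOException e) {
                // A real sink would record the failure for join() to report.
            }
        });
        worker.start();
    }

    public void join() throws InterruptedException {
        worker.join();
    }

    public String contents() { return captured.toString(); }
}
```

A caller typically attaches one sink to stdout and another to stderr right
after spawning the process, then calls +join()+ on both once the process exits.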

Some "stock" +AsyncSink+ implementations are provided: +LoggingAsyncSink+ will
repeat everything on the +InputStream+ as log4j INFO statements, and
+NullAsyncSink+ consumes all its input and does nothing.

The various +ConnManagers+ that make use of external processes have their own
+AsyncSink+ implementations as inner classes, which read from the database tools
and forward the data along to HDFS, possibly performing formatting conversions
along the way.