== <!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--> ==
|SIP | 4 |
|Title | Public API for Sqoop v1.0.0 |
|Author | Aaron Kimball (aaron at cloudera dot com) |
|Created | May 14, 2010 |
|Status | Accepted |
|Discussion | "http://github.com/cloudera/sqoop/issues/issue/13":http://github.com/cloudera/sqoop/issues/issue/13 |
|Implementation| "http://review.hbase.org/r/73/":http://review.hbase.org/r/73/ |
h2. Abstract
This SIP defines the public API to be exposed in the first release of Sqoop. The @org.apache.hadoop.sqoop.lib@ package contains the public API relied upon by external clients of Sqoop. Generated code produced by Sqoop depends on these modules. Clients of imported data may also rely on additional modules specified here.
h2. Problem statement
To deal with the unique table schemas of each database a Sqoop user imports, Sqoop's current design requires that it generate a per-table class. This class is used to interact with the data after it is imported to Hadoop; data can be stored in SequenceFiles, which require this class to deserialize records. Subsequent re-exports of the data rely on this class to push records back to the RDBMS. The generated class also includes support for parsing text-based representations of the data.
This class, however, relies on reusable code modules provided with Sqoop. These code modules are all placed in the @org.apache.hadoop.sqoop.lib@ package. Clients of generated code must be able to rely on previously-generated code to work with later versions of Sqoop. While code regeneration is possible, Sqoop users should see the @lib@ package as the most stable API provided by Sqoop.
Sqoop also provides a file format for large object data; while large objects can be manipulated in the context of their encapsulating records (e.g., through @BlobRef@ or @ClobRef@ references to the data), the large object file store may be inspected directly.
This SIP defines the official "surface area" of the public packages which will be maintained. In order to ensure that future versions remain backwards compatible, some existing class definitions must be modified. Such "breaking changes" are intended to occur only before the major version number is incremented (1.0, 2.0, etc.), and should thus be infrequent disruptions to Sqoop users. Sqoop clients who target only the APIs specified here may be confident that their programs will work properly with all subsequent Sqoop releases in the 1.0 series (in accordance with the compatibility and deprecation policy specified in [[SIP-2]]).
h2. Specification
h3. lib package
As of May 14, 2010, the lib package contains the following classes:
* @BigDecimalSerializer@
* @BlobRef@
* @ClobRef@
* @FieldFormatter@
* @JdbcWritableBridge@
* @LargeObjectLoader@
* @LobRef@
* @LobSerializer@
* @RecordParser@
* @TaskId@
and the following interface:
* @SqoopRecord@
Classes generated by Sqoop fulfill the interface of @SqoopRecord@. The first change necessary in this package is to transform @SqoopRecord@ from an interface into an abstract class. This way, subsequent releases in the 1.0 series can introduce additional methods required by SqoopRecords along with a default implementation for previously-generated clients.
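The benefit of an abstract class over an interface can be illustrated with a minimal sketch. The names @SketchRecord@, @OldGeneratedRecord@, @fields()@ and @fieldCount()@ are hypothetical and are not part of the Sqoop API; they only show how a later release could add a method with a default body without breaking code generated earlier.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed abstract SqoopRecord base class.
// Only the evolution pattern is shown; this is not Sqoop's actual API.
abstract class SketchRecord {
    // Assumed to exist since the first release: generated classes implement it.
    public abstract List<Object> fields();

    // Assumed to be added in a later 1.x release. Because it ships with a
    // default body, classes generated before it existed keep working.
    public int fieldCount() {
        return fields().size();
    }
}

// Stands in for a class generated by an earlier release.
// It does not know about fieldCount() and does not override it.
class OldGeneratedRecord extends SketchRecord {
    @Override
    public List<Object> fields() {
        return Arrays.<Object>asList(1, "alice");
    }
}

class SketchRecordDemo {
    public static void main(String[] args) {
        SketchRecord record = new OldGeneratedRecord();
        // The old class inherits the newer method from the base class.
        System.out.println(record.fieldCount());
    }
}
```

Had @SketchRecord@ been an interface, adding @fieldCount()@ would have required every previously generated class to be regenerated or edited by hand.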
The @TaskId@ class is improperly placed in this package. This class is Sqoop-internal and should be moved to the @util@ package.
We should add a class called @DelimiterSet@ which encapsulates the parameters regarding formatting of delimiters around fields: the field terminator, the record terminator, the escape character, the enclosing character, and whether the enclosing character is optional. This would allow sets of delimiters to be manipulated easily. The @SqoopRecord@ class could then be extended with a @toString(DelimiterSet)@ method that allows users to format output with delimiters other than the ones specified at code generation time.
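A rough sketch of how such a class and method might fit together follows. The class names @DelimiterSetSketch@ and @PersonRecordSketch@, the field names, and the simplified enclosing and escaping rules are all assumptions made for illustration; this SIP does not fix the actual layout of @DelimiterSet@.

```java
// Hypothetical value class holding the five delimiter parameters.
final class DelimiterSetSketch {
    final char fieldTerminator;
    final char recordTerminator;
    final char enclosedBy;
    final char escapedBy;
    final boolean encloseRequired; // false means enclosing is optional

    DelimiterSetSketch(char fieldTerminator, char recordTerminator,
                       char enclosedBy, char escapedBy, boolean encloseRequired) {
        this.fieldTerminator = fieldTerminator;
        this.recordTerminator = recordTerminator;
        this.enclosedBy = enclosedBy;
        this.escapedBy = escapedBy;
        this.encloseRequired = encloseRequired;
    }
}

// Hypothetical stand-in for a generated record class.
class PersonRecordSketch {
    // Delimiters assumed to have been chosen at code generation time.
    private static final DelimiterSetSketch CODEGEN_DELIMITERS =
        new DelimiterSetSketch(',', '\n', '"', '\\', false);

    private final int id;
    private final String name;

    PersonRecordSketch(int id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public String toString() {
        return toString(CODEGEN_DELIMITERS);
    }

    // Formats the record with caller-supplied delimiters.
    public String toString(DelimiterSetSketch d) {
        return formatField(Integer.toString(id), d) + d.fieldTerminator
            + formatField(name, d) + d.recordTerminator;
    }

    // Simplified rule: enclose when required or when the value contains a terminator.
    private static String formatField(String value, DelimiterSetSketch d) {
        boolean enclose = d.encloseRequired
            || value.indexOf(d.fieldTerminator) >= 0
            || value.indexOf(d.recordTerminator) >= 0;
        if (!enclose) {
            return value;
        }
        StringBuilder sb = new StringBuilder().append(d.enclosedBy);
        for (char c : value.toCharArray()) {
            if (c == d.enclosedBy || c == d.escapedBy) {
                sb.append(d.escapedBy); // escape characters that would end or confuse the enclosure
            }
            sb.append(c);
        }
        return sb.append(d.enclosedBy).toString();
    }
}
```

Bundling the parameters into one object keeps the @toString@ signature stable if further delimiter options are introduced later.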
@LobRef@ is an abstract base class that encapsulates common code in @BlobRef@ and @ClobRef@. The constructors for @LobRef@ are marked as @protected@. Clients of Sqoop should not subclass @LobRef@ directly.
Classes in the lib package may depend on classes elsewhere in Sqoop's implementation. Clients should not do so directly.
h3. io package
Clients of Sqoop who have imported large objects into HDFS may have large object files holding their data; this file format is defined in [[SIP-3]]. The large objects may be manipulated by iterating over their encapsulating records and calling @{B,C}lobRef.getDataStream()@, which will retrieve the data for a large object from its underlying store. However, the large objects may also be directly retrieved from their underlying LobFile storage.
The @org.apache.hadoop.sqoop.io.LobFile@ class is considered part of the public API. Clients of Sqoop may depend on the @LobFile.Writer@ and @LobFile.Reader@ APIs. Clients should never instantiate subclasses of @Writer@ and @Reader@ directly; instead they should use the static methods @LobFile.create()@ and @LobFile.open()@ respectively. The underlying concrete Writer and Reader implementation classes are considered private.
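The division between public abstract types and private implementations is a static factory pattern, sketched below with an in-memory store. Everything here is hypothetical: @LobStoreSketch@, its method signatures, and the string records are illustrative only and do not reflect the real @LobFile@ signatures or file format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the factory pattern; not the real LobFile API.
final class LobStoreSketch {
    // In-memory stand-in for files on disk, keyed by name.
    private static final Map<String, List<String>> STORE = new HashMap<>();

    private LobStoreSketch() { }

    // Public-facing abstract types that clients may depend on.
    abstract static class Writer {
        abstract void append(String record);
    }

    abstract static class Reader {
        abstract boolean next();
        abstract String current();
    }

    // Clients obtain instances only through these static methods.
    static Writer create(String name) {
        List<String> records = new ArrayList<>();
        STORE.put(name, records);
        return new InMemoryWriter(records);
    }

    static Reader open(String name) {
        List<String> records = STORE.get(name);
        if (records == null) {
            throw new IllegalArgumentException("no such store: " + name);
        }
        return new InMemoryReader(records);
    }

    // Concrete classes are private, so they can change between releases.
    private static final class InMemoryWriter extends Writer {
        private final List<String> records;
        InMemoryWriter(List<String> records) { this.records = records; }
        @Override void append(String record) { records.add(record); }
    }

    private static final class InMemoryReader extends Reader {
        private final List<String> records;
        private int position = -1;
        InMemoryReader(List<String> records) { this.records = records; }
        @Override boolean next() { return ++position < records.size(); }
        @Override String current() { return records.get(position); }
    }
}

class LobStoreSketchDemo {
    public static void main(String[] args) {
        LobStoreSketch.Writer writer = LobStoreSketch.create("demo");
        writer.append("first");
        writer.append("second");
        LobStoreSketch.Reader reader = LobStoreSketch.open("demo");
        while (reader.next()) {
            System.out.println(reader.current());
        }
    }
}
```

Because callers only ever name the abstract types, the concrete implementations can be replaced without affecting client code.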
To allow users to verify the compression formats available in LobFiles, the @CodecMap.getCodecNames()@ method is also public.
h3. Entry-points to Sqoop
A well-defined programmatic entry-point to Sqoop is *not* defined by this specification. The only method of @org.apache.hadoop.sqoop.Sqoop@ considered stable is its @main()@ method; all others are currently internal. This restriction will be relaxed in a future specification, allowing programmatic client interaction with Sqoop.
h3. Base package
The base package in Sqoop is currently @org.apache.hadoop.sqoop@.
h2. Compatibility Issues
The modification of @SqoopRecord@ from an interface to an abstract class will cause existing generated code to break. Such a change is expected prior to the 1.0.0 release. This is the last interface in Sqoop; once it is transitioned to an abstract class, subsequent changes to the @SqoopRecord@ API should be backwards-compatible.
h2. Test Plan
The changes required to implement this specification are minimal; the existing unit test suite should cover all necessary testing.
h2. Discussion
Please provide feedback and comments at "http://github.com/cloudera/sqoop/issues/issue/13":http://github.com/cloudera/sqoop/issues/issue/13