////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[validation]]
+validation+
--------------
Purpose
~~~~~~~
Validate the data copied by an import or an export by comparing the row
counts of the source and the target after the copy.
Introduction
~~~~~~~~~~~~
There are 3 basic interfaces:

* +ValidationThreshold+ - Determines whether the error margin between the
  source and the target is acceptable: Absolute, Percentage Tolerant, etc.
  The default implementation is +AbsoluteValidationThreshold+, which ensures
  that the row counts from the source and the target are the same.
* +ValidationFailureHandler+ - Responsible for handling failures: log an
  error/warning, abort, etc.
  +LogOnFailureHandler+ logs a warning message to the configured logger,
  while +AbortOnFailureHandler+, the default value of the
  +validation-failurehandler+ property, aborts on failure.
* +Validator+ - Drives the validation logic by delegating the decision to
  +ValidationThreshold+ and the failure handling to
  +ValidationFailureHandler+.
  The default implementation is +RowCountValidator+, which validates the row
  counts from the source and the target.
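
The delegation described above can be sketched in a few lines of Java. The
types below (+Threshold+, +FailureHandler+, +RowCountCheck+ and the rest) are
simplified, hypothetical stand-ins written for this illustration; they are not
the interfaces in +org.apache.sqoop.validation+, whose method signatures may
differ.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; NOT Sqoop's real interfaces.
interface Threshold {
    boolean isAcceptable(long sourceRows, long targetRows);
}

interface FailureHandler {
    void handle(String message);
}

// Mirrors the described absolute threshold: counts must match exactly.
class AbsoluteThreshold implements Threshold {
    @Override
    public boolean isAcceptable(long sourceRows, long targetRows) {
        return sourceRows == targetRows;
    }
}

// Records a warning instead of aborting (a real handler would use a logger).
class RecordingHandler implements FailureHandler {
    final List<String> warnings = new ArrayList<>();

    @Override
    public void handle(String message) {
        warnings.add(message);
    }
}

// Drives validation by delegating the decision to the threshold and the
// failure handling to the handler.
class RowCountCheck {
    private final Threshold threshold;
    private final FailureHandler handler;

    RowCountCheck(Threshold threshold, FailureHandler handler) {
        this.threshold = threshold;
        this.handler = handler;
    }

    boolean validate(long sourceRows, long targetRows) {
        if (threshold.isAcceptable(sourceRows, targetRows)) {
            return true;
        }
        handler.handle("Row count mismatch: source=" + sourceRows
                + ", target=" + targetRows);
        return false;
    }
}

class ValidationSketch {
    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        RowCountCheck check = new RowCountCheck(new AbsoluteThreshold(), handler);
        System.out.println(check.validate(100, 100)); // counts match
        System.out.println(check.validate(100, 98));  // counts differ
        System.out.println(handler.warnings);
    }
}
```

Keeping the decision and the failure handling behind separate interfaces is
what lets either one be swapped without touching the driver.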
Syntax
~~~~~~
----
$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
----
Validation arguments are part of import and export arguments.
Configuration
~~~~~~~~~~~~~
The validation framework is extensible and pluggable. It ships with default
implementations, and custom implementations of the interfaces can be supplied
through the command line arguments described below.
.Validator
* Property: +validator+
* Description: Driver for validation; must implement
  +org.apache.sqoop.validation.Validator+
* Supported values: The value has to be a fully qualified class name.
* Default value: +org.apache.sqoop.validation.RowCountValidator+

.Validation Threshold
* Property: +validation-threshold+
* Description: Drives the decision based on whether the validation meets the
  threshold; must implement +org.apache.sqoop.validation.ValidationThreshold+
* Supported values: The value has to be a fully qualified class name.
* Default value: +org.apache.sqoop.validation.AbsoluteValidationThreshold+

.Validation Failure Handler
* Property: +validation-failurehandler+
* Description: Responsible for handling failures; must implement
  +org.apache.sqoop.validation.ValidationFailureHandler+
* Supported values: The value has to be a fully qualified class name.
* Default value: +org.apache.sqoop.validation.AbortOnFailureHandler+
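
As an illustration of what a custom threshold might decide, here is a minimal
sketch of a percentage-tolerant comparison. The interface +RowCountThreshold+
and the class +PercentageTolerantSketch+ are hypothetical names made up for
this example; a real implementation would have to implement
+org.apache.sqoop.validation.ValidationThreshold+, whose method signature is
not shown here and may differ.

```java
// Hypothetical stand-in for a threshold interface; NOT Sqoop's real API.
interface RowCountThreshold {
    boolean compare(long sourceRows, long targetRows);
}

// Accepts a copy when the row counts differ by at most a given percentage
// of the source row count.
class PercentageTolerantSketch implements RowCountThreshold {
    private final double tolerancePercent;

    PercentageTolerantSketch(double tolerancePercent) {
        if (tolerancePercent < 0) {
            throw new IllegalArgumentException("tolerance must be >= 0");
        }
        this.tolerancePercent = tolerancePercent;
    }

    @Override
    public boolean compare(long sourceRows, long targetRows) {
        if (sourceRows == 0) {
            // Avoid dividing by zero: an empty source only matches an empty target.
            return targetRows == 0;
        }
        double differencePercent =
                Math.abs(sourceRows - targetRows) * 100.0 / sourceRows;
        return differencePercent <= tolerancePercent;
    }
}
```

With a tolerance of 5 percent, a source of 1000 rows and a target of 960 rows
would be accepted, while a target of 900 rows would not. A real class of this
kind would be named by its fully qualified class name as the value of
+validation-threshold+.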
Limitations
~~~~~~~~~~~
Validation currently only validates data copied from a single table into HDFS.
The current implementation does not support the following:
* all-tables option
* free-form query option
* Data imported into Hive, HBase or Accumulo
* table import with --where argument
* incremental imports
Example Invocations
~~~~~~~~~~~~~~~~~~~
A basic import of a table named +EMPLOYEES+ in the +corp+ database that
enables validation of the row counts:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate
----
A basic export to populate a table named +bar+ with validation enabled:
----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data --validate
----
Another example that sets each validation argument explicitly:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--validate --validator org.apache.sqoop.validation.RowCountValidator \
--validation-threshold \
org.apache.sqoop.validation.AbsoluteValidationThreshold \
--validation-failurehandler \
org.apache.sqoop.validation.AbortOnFailureHandler
----