FALCON-1812 Data Import and Export documentation (By Venkatesan Ramachandran)
diff --git a/CHANGES.txt b/CHANGES.txt
index 3ab7726..465bc9d 100755
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -13,6 +13,8 @@
INCOMPATIBLE CHANGES
NEW FEATURES
+ FALCON-1812 Data Import and Export documentation (Venkatesan Ramachandran via Pallavi Rao)
+
FALCON-1742 Implement instance summary api for native scheduler (Pallavi Rao)
FALCON-1677 Support re-tries for timed-out instances (Narayan Periwal via Pallavi Rao)
diff --git a/docs/src/site/twiki/FalconDocumentation.twiki b/docs/src/site/twiki/FalconDocumentation.twiki
index 21b22ff..122435a 100644
--- a/docs/src/site/twiki/FalconDocumentation.twiki
+++ b/docs/src/site/twiki/FalconDocumentation.twiki
@@ -18,6 +18,7 @@
* <a href="#Email_Notification">Email Notification</a>
* <a href="#Backwards_Compatibility">Backwards Compatibility Instructions</a>
* <a href="#Proxyuser_support">Proxyuser support</a>
+ * <a href="#ImportExport">Data Import and Export</a>
---++ Architecture
@@ -768,5 +769,9 @@
-doAs option via CLI or doAs query parameter can be appended if using API to enable impersonation.
+---++ ImportExport
+
+Data Import and Export is detailed in [[ImportExport][Data Import and Export]].
+
diff --git a/docs/src/site/twiki/ImportExport.twiki b/docs/src/site/twiki/ImportExport.twiki
new file mode 100644
index 0000000..5df9365
--- /dev/null
+++ b/docs/src/site/twiki/ImportExport.twiki
@@ -0,0 +1,236 @@
+---+Falcon Data Import and Export
+
+
+---++Overview
+
+Falcon provides constructs to periodically bring raw data from external data sources (like databases, drop boxes etc)
+onto Hadoop and push derived data computed on Hadoop onto external data sources.
+
+As of this release, Falcon supports only Relational Databases (e.g., Oracle, MySQL) via JDBC as external data sources.
+Future releases will add support for other external data sources.
+
+
+---++Prerequisites
+
+The following are the prerequisites for importing data from and exporting data to databases.
+
+ * *Sqoop 1.4.6+*
+ * *Oozie 4.2.0+*
+ * *Appropriate database connector*
+
+
+*Note:* Falcon uses Sqoop for import/export operations. Sqoop requires the appropriate database driver to connect
+to the relational database. Please refer to the Sqoop documentation for Sqoop-related questions. Please make sure
+the database driver jar is copied into the Oozie share lib for Sqoop.
+
+<verbatim>
+For example, in order to import and export with MySQL, please make sure the MySQL connector jar
+mysql-connector-java-5.1.31.jar is copied into Oozie's Sqoop share lib:
+
+/user/oozie/share/lib/{lib-dir}/sqoop/mysql-connector-java-5.1.31.jar
+
+where the {lib-dir} value varies across Oozie deployments.
+
+</verbatim>
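+
+A minimal sketch of staging the driver jar from the command line, assuming the example lib directory name used in
+the datasource example below (lib_20150721010816); verify the actual directory on your deployment first:
+
+<verbatim>
+## copy the connector jar into the Sqoop share lib on HDFS
+hdfs dfs -put mysql-connector-java-5.1.31.jar /user/oozie/share/lib/lib_20150721010816/sqoop/
+
+## ask the Oozie server to refresh its view of the share lib
+oozie admin -sharelibupdate
+
+## confirm the jar is now listed
+oozie admin -shareliblist sqoop
+</verbatim>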
+
+---++ Usage
+---+++ Entity Definition and Setup
+ * *Datasource Entity*
+ Datasource entity abstracts connection and credential details to external data sources. The Datasource entity
+ supports read and write interfaces with specific credentials. The default credential will be used if the read
+ or write interface does not have its own credentials. In general, the Datasource entity will be defined by
+ a system administrator. Please refer to the datasource XSD for more details.
+
+ The following example defines a Datasource entity for a MySQL database. The import operation will use
+ the read interface with url "jdbc:mysql://dbhost/test", user name "import_usr" and password text "sqoop".
+ Whereas, the export operation will use the write interface with url "jdbc:mysql://dbhost/test", user
+ name "export_usr" and a password specified in an HDFS file at the location "/user/ambari-qa/password-store/password_write_user".
+
+ The default credential specifies the password using password text and will be used if either read or write interface
+ does not provide credentials.
+
+ The available read and write interfaces enable database administrators to segregate read and write workloads.
+
+ <verbatim>
+
+ File: mysql-database.xml
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <datasource colo="west-coast" description="MySQL database on west coast" type="mysql" name="mysql-db" xmlns="uri:falcon:datasource:0.1">
+ <tags>owner=foobar@ambari.apache.org, consumer=phoe@ambari.apache.org</tags>
+ <interfaces>
+ <!-- ***** read interface ***** -->
+ <interface type="readonly" endpoint="jdbc:mysql://dbhost/test">
+ <credential type="password-text">
+ <userName>import_usr</userName>
+ <passwordText>sqoop</passwordText>
+ </credential>
+ </interface>
+
+ <!-- ***** write interface ***** -->
+ <interface type="write" endpoint="jdbc:mysql://dbhost/test">
+ <credential type="password-file">
+ <userName>export_usr</userName>
+ <passwordFile>/user/ambari-qa/password-store/password_write_user</passwordFile>
+ </credential>
+ </interface>
+
+ <!-- *** default credential *** -->
+ <credential type="password-text">
+ <userName>sqoop2_user</userName>
+ <passwordText>sqoop</passwordText>
+ </credential>
+
+ </interfaces>
+
+ <driver>
+ <clazz>com.mysql.jdbc.Driver</clazz>
+ <jar>/user/oozie/share/lib/lib_20150721010816/sqoop/mysql-connector-java-5.1.31.jar</jar>
+ </driver>
+ </datasource>
+ </verbatim>
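+
+ Before submitting, the entity definition can be checked against the Falcon server; a minimal sketch, assuming the
+ file name used in the example above:
+
+ <verbatim>
+ ## validate the datasource definition before submitting it
+ falcon entity -validate -type datasource -file mysql-database.xml
+ </verbatim>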
+
+ * *Feed Entity*
+ Feed entity now enables users to define IMPORT and EXPORT policies in addition to RETENTION and REPLICATION.
+ The IMPORT and EXPORT policies will refer to an already defined Datasource entity for connection and credential
+ details and take a table name from the policy to operate on. Please refer to the feed entity XSD for details.
+
+ The following example defines a Feed entity with IMPORT and EXPORT policies. Both the IMPORT and EXPORT operations
+ refer to a datasource entity "mysql-db". The IMPORT operation will use the read interface and credentials while
+ the EXPORT operation will use the write interface and credentials. A feed instance is created every hour
+ since the frequency of the Feed is hours(1), and the Feed instances are deleted after 90 days because of the
+ retention policy.
+
+
+ <verbatim>
+
+ File: customer_email_feed.xml
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <!--
+ A feed representing Hourly customer email data retained for 90 days
+ -->
+ <feed description="Raw customer email feed" name="customer_feed" xmlns="uri:falcon:feed:0.1">
+ <tags>externalSystem=USWestEmailServers,classification=secure</tags>
+ <groups>DataImportPipeline</groups>
+ <frequency>hours(1)</frequency>
+ <late-arrival cut-off="hours(4)"/>
+ <clusters>
+ <cluster name="primaryCluster" type="source">
+ <validity start="2015-12-15T00:00Z" end="2016-03-31T00:00Z"/>
+ <retention limit="days(90)" action="delete"/>
+ <import>
+ <source name="mysql-db" tableName="simple">
+ <extract type="full">
+ <mergepolicy>snapshot</mergepolicy>
+ </extract>
+ <fields>
+ <includes>
+ <field>id</field>
+ <field>name</field>
+ </includes>
+ </fields>
+ </source>
+ <arguments>
+ <argument name="--split-by" value="id"/>
+ <argument name="--num-mappers" value="2"/>
+ </arguments>
+ </import>
+ <export>
+ <target name="mysql-db" tableName="simple_export">
+ <load type="insert"/>
+ <fields>
+ <includes>
+ <field>id</field>
+ <field>name</field>
+ </includes>
+ </fields>
+ </target>
+ <arguments>
+ <argument name="--update-key" value="id"/>
+ </arguments>
+ </export>
+ </cluster>
+ </clusters>
+
+ <locations>
+ <location type="data" path="/user/ambari-qa/falcon/demo/primary/importfeed/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}"/>
+ <location type="stats" path="/none"/>
+ <location type="meta" path="/none"/>
+ </locations>
+
+ <ACL owner="ambari-qa" group="users" permission="0755"/>
+ <schema location="/none" provider="none"/>
+
+ </feed>
+ </verbatim>
+
+ * *Import policy*
+ The import policy uses the datasource entity specified in the "source" to connect to the database. The tableName
+ specified should exist in the source datasource.
+
+ Extraction type specifies whether to pull data from the external datasource in "full" every time or "incrementally".
+ The mergepolicy specifies how to organize the data on Hadoop (snapshot or append, i.e., time series partitions).
+ The valid combinations are:
+ * [full,snapshot] - data is extracted in full and dumped into the feed instance location.
+ * [incremental, append] - data is extracted incrementally using the key specified in the *deltacolumn*
+ and added as a partition to the feed instance location.
+ * [incremental, snapshot] - data is extracted incrementally and merged with the data already on Hadoop to
+ produce one latest feed instance. *This feature is not supported currently.* The use case for this feature is
+ to efficiently import very large dimension tables that have updates and inserts onto Hadoop and make them available
+ as a snapshot with the latest updates to consumers.
+
+ The following example defines an incremental extraction with append organization:
+
+ <verbatim>
+ <import>
+ <source name="mysql-db" tableName="simple">
+ <extract type="incremental">
+ <deltacolumn>modified_time</deltacolumn>
+ <mergepolicy>append</mergepolicy>
+ </extract>
+ <fields>
+ <includes>
+ <field>id</field>
+ <field>name</field>
+ </includes>
+ </fields>
+ </source>
+ <arguments>
+ <argument name="--split-by" value="id"/>
+ <argument name="--num-mappers" value="2"/>
+ </arguments>
+ </import>
+ </verbatim>
+
+
+ The fields option enables users to control which fields get imported. By default, all fields are imported. The "includes"
+ option imports only the specified fields. The "excludes" option imports all fields except those specified.
+
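+ For illustration, a minimal sketch of the "excludes" form, assuming the same source element as the examples above
+ (the field name is hypothetical):
+
+ <verbatim>
+ <fields>
+ <excludes>
+ <field>internal_notes</field>
+ </excludes>
+ </fields>
+ </verbatim>
+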
+ The arguments section enables users to pass extra arguments for fine-grained control over the underlying
+ implementation -- in this case, Sqoop.
+
+ * *Export policy*
+
+ The export policy, like the import policy, uses the datasource entity for connecting to the database. The load type
+ specifies whether to insert new data or only update existing data in the external table. The fields option behaves
+ the same way as in the import policy. The tableName specified should exist in the external datasource.
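+
+ A sketch of an update-style export mirroring the insert example above; the load type value for update semantics
+ is an assumption based on the description here, so please confirm the exact value against the feed XSD:
+
+ <verbatim>
+ <export>
+ <target name="mysql-db" tableName="simple_export">
+ <!-- "update" is assumed from the description above; check the feed XSD for the exact value -->
+ <load type="update"/>
+ </target>
+ <arguments>
+ <argument name="--update-key" value="id"/>
+ </arguments>
+ </export>
+ </verbatim>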
+
+---+++ Operation
+ Once the Datasource and Feed entities with import and export policies are defined, users can submit and schedule
+ the import and export operations via the CLI or REST API as shown below:
+
+ <verbatim>
+
+ ## submit the mysql-db datasource defined in the file mysql_datasource.xml
+ falcon entity -submit -type datasource -file mysql_datasource.xml
+
+ ## submit the customer_feed specified in the customer_email_feed.xml
+ falcon entity -submit -type feed -file customer_email_feed.xml
+
+ ## schedule the customer_feed
+ falcon entity -schedule -type feed -name customer_feed
+
+ </verbatim>
+
+ Falcon will create the corresponding Oozie bundles with a coordinator and workflow for the import and export operations.
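+
+ Once scheduled, instance progress can be checked from the CLI as well; a minimal sketch, assuming the feed name and
+ validity start time from the example above:
+
+ <verbatim>
+ ## check the status of the import/export instances of the scheduled feed
+ falcon instance -type feed -name customer_feed -status -start "2015-12-15T00:00Z" -end "2015-12-16T00:00Z"
+ </verbatim>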