FALCON-1812 Data Import and Export documentation (By Venkatesan Ramachandran)
diff --git a/CHANGES.txt b/CHANGES.txt
index 3ab7726..465bc9d 100755
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -13,6 +13,8 @@
INCOMPATIBLE CHANGES
NEW FEATURES
+ FALCON-1812 Data Import and Export documentation (Venkatesan Ramachandran via Pallavi Rao)
+
FALCON-1742 Implement instance summary api for native scheduler (Pallavi Rao)
FALCON-1677 Support re-tries for timed-out instances (Narayan Periwal via Pallavi Rao)
diff --git a/docs/src/site/twiki/FalconDocumentation.twiki b/docs/src/site/twiki/FalconDocumentation.twiki
index 21b22ff..122435a 100644
--- a/docs/src/site/twiki/FalconDocumentation.twiki
+++ b/docs/src/site/twiki/FalconDocumentation.twiki
@@ -18,6 +18,7 @@
* <a href="#Email_Notification">Email Notification</a>
* <a href="#Backwards_Compatibility">Backwards Compatibility Instructions</a>
* <a href="#Proxyuser_support">Proxyuser support</a>
+ * <a href="#ImportExport">Data Import and Export</a>
---++ Architecture
@@ -768,5 +769,9 @@
-doAs option via CLI or doAs query parameter can be appended if using API to enable impersonation.
+---++ ImportExport
+
+Data Import and Export is detailed in [[ImportExport][Data Import and Export]].
+
diff --git a/docs/src/site/twiki/ImportExport.twiki b/docs/src/site/twiki/ImportExport.twiki
new file mode 100644
index 0000000..5df9365
--- /dev/null
+++ b/docs/src/site/twiki/ImportExport.twiki
@@ -0,0 +1,236 @@
+---+Falcon Data Import and Export
+
+
+---++Overview
+
+Falcon provides constructs to periodically bring raw data from external data sources (like databases, drop boxes etc)
+onto Hadoop and push derived data computed on Hadoop onto external data sources.
+
+As of this release, Falcon supports only Relational Databases (e.g., Oracle, MySQL) via JDBC as external data sources.
+Future releases will add support for other external data sources.
+
+
+---++Prerequisites
+
+The following are the prerequisites for importing data from and exporting data to databases.
+
+ * *Sqoop 1.4.6+*
+ * *Oozie 4.2.0+*
+ * *Appropriate database connector*
+
+
+*Note:* Falcon uses Sqoop for import/export operations. Sqoop requires the appropriate database driver to connect
+to the relational database. Please refer to the Sqoop documentation for Sqoop-related questions. Please make sure
+the database driver jar is copied into the Oozie share lib for Sqoop.
+
+<verbatim>
+For example, in order to import and export with MySQL, please make sure the MySQL connector jar
+mysql-connector-java-5.1.31.jar is copied into Oozie's Sqoop share lib:
+
+/user/oozie/share/lib/{lib-dir}/sqoop/mysql-connector-java-5.1.31.jar
+
+where the {lib-dir} value varies across Oozie deployments.
+
+</verbatim>
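+
+A minimal sketch of staging the driver jar from the command line, assuming the example lib directory name used in
+the datasource example below (lib_20150721010816); verify the actual directory on your deployment first:
+
+<verbatim>
+## copy the connector jar into the Sqoop share lib on HDFS
+hdfs dfs -put mysql-connector-java-5.1.31.jar /user/oozie/share/lib/lib_20150721010816/sqoop/
+
+## ask the Oozie server to refresh its view of the share lib
+oozie admin -sharelibupdate
+
+## confirm the jar is now listed
+oozie admin -shareliblist sqoop
+</verbatim>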
+
+---++ Usage
+---+++ Entity Definition and Setup
+ * *Datasource Entity*
+ Datasource entity abstracts connection and credential details to external data sources. The Datasource entity
+ supports read and write interfaces with specific credentials. The default credential will be used if the read
+ or write interface does not have its own credentials. In general, the Datasource entity will be defined by
+ a system administrator. Please refer to the datasource XSD for more details.
+
+ The following example defines a Datasource entity for a MySQL database. The import operation will use
+ the read interface with url "jdbc:mysql://dbhost/test", user name "import_usr" and password text "sqoop".
+ Whereas, the export operation will use the write interface with url "jdbc:mysql://dbhost/test", user
+ name "export_usr" and a password specified in an HDFS file at the location "/user/ambari-qa/password-store/password_write_user".
+
+ The default credential specifies the password using password text and will be used if either read or write interface
+ does not provide credentials.
+
+ The available read and write interfaces enable database administrators to segregate read and write workloads.
+
+ <verbatim>
+
+ File: mysql-database.xml
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <datasource colo="west-coast" description="MySQL database on west coast" type="mysql" name="mysql-db" xmlns="uri:falcon:datasource:0.1">
+ <tags>owner=foobar@ambari.apache.org, consumer=phoe@ambari.apache.org</tags>
+ <interfaces>
+ <!-- ***** read interface ***** -->
+ <interface type="readonly" endpoint="jdbc:mysql://dbhost/test">
+ <credential type="password-text">
+ <userName>import_usr</userName>
+ <passwordText>sqoop</passwordText>
+ </credential>
+ </interface>
+
+ <!-- ***** write interface ***** -->
+ <interface type="write" endpoint="jdbc:mysql://dbhost/test">
+ <credential type="password-file">
+ <userName>export_usr</userName>
+ <passwordFile>/user/ambari-qa/password-store/password_write_user</passwordFile>
+ </credential>
+ </interface>
+
+ <!-- *** default credential *** -->
+ <credential type="password-text">
+ <userName>sqoop2_user</userName>
+ <passwordText>sqoop</passwordText>
+ </credential>
+
+ </interfaces>
+
+ <driver>
+ <clazz>com.mysql.jdbc.Driver</clazz>
+ <jar>/user/oozie/share/lib/lib_20150721010816/sqoop/mysql-connector-java-5.1.31.jar</jar>
+ </driver>
+ </datasource>
+ </verbatim>
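+
+ Before submitting, the entity definition can be checked against the Falcon server; a minimal sketch, assuming the
+ file name used in the example above:
+
+ <verbatim>
+ ## validate the datasource definition before submitting it
+ falcon entity -validate -type datasource -file mysql-database.xml
+ </verbatim>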
+
+ * *Feed Entity*
+ Feed entity now enables users to define IMPORT and EXPORT policies in addition to RETENTION and REPLICATION.
+ The IMPORT and EXPORT policies will refer to an already defined Datasource entity for connection and credential
+ details and take a table name from the policy to operate on. Please refer to the feed entity XSD for details.
+
+ The following example defines a Feed entity with IMPORT and EXPORT policies. Both the IMPORT and EXPORT operations
+ refer to a datasource entity "mysql-db". The IMPORT operation will use the read interface and credentials while
+ the EXPORT operation will use the write interface and credentials. A feed instance is created every hour
+ since the frequency of the Feed is hours(1), and the Feed instances are deleted after 90 days because of the
+ retention policy.
+
+
+ <verbatim>
+
+ File: customer_email_feed.xml
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <!--
+ A feed representing Hourly customer email data retained for 90 days
+ -->
+ <feed description="Raw customer email feed" name="customer_feed" xmlns="uri:falcon:feed:0.1">
+ <tags>externalSystem=USWestEmailServers,classification=secure</tags>
+ <groups>DataImportPipeline</groups>
+ <frequency>hours(1)</frequency>
+ <late-arrival cut-off="hours(4)"/>
+ <clusters>
+ <cluster name="primaryCluster" type="source">
+ <validity start="2015-12-15T00:00Z" end="2016-03-31T00:00Z"/>
+ <retention limit="days(90)" action="delete"/>
+ <import>
+ <source name="mysql-db" tableName="simple">
+ <extract type="full">
+ <mergepolicy>snapshot</mergepolicy>
+ </extract>
+ <fields>
+ <includes>
+ <field>id</field>
+ <field>name</field>
+ </includes>
+ </fields>
+ </source>
+ <arguments>
+ <argument name="--split-by" value="id"/>
+ <argument name="--num-mappers" value="2"/>
+ </arguments>
+ </import>
+ <export>
+ <target name="mysql-db" tableName="simple_export">
+ <load type="insert"/>
+ <fields>
+ <includes>
+ <field>id</field>
+ <field>name</field>
+ </includes>
+ </fields>
+ </target>
+ <arguments>
+ <argument name="--update-key" value="id"/>
+ </arguments>
+ </export>
+ </cluster>
+ </clusters>
+
+ <locations>
+ <location type="data" path="/user/ambari-qa/falcon/demo/primary/importfeed/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}"/>
+ <location type="stats" path="/none"/>
+ <location type="meta" path="/none"/>
+ </locations>
+
+ <ACL owner="ambari-qa" group="users" permission="0755"/>
+ <schema location="/none" provider="none"/>
+
+ </feed>
+ </verbatim>
+
+ * *Import policy*
+ The import policy uses the datasource entity specified in the "source" to connect to the database. The tableName
+ specified should exist in the source datasource.
+
+ Extraction type specifies whether to pull data from the external datasource in "full" every time or "incrementally".
+ The mergepolicy specifies how to organize the data on Hadoop (snapshot or append, i.e., time series partitions).
+ The valid combinations are:
+ * [full,snapshot] - data is extracted in full and dumped into the feed instance location.
+ * [incremental, append] - data is extracted incrementally using the key specified in the *deltacolumn*
+ and added as a partition to the feed instance location.
+ * [incremental, snapshot] - data is extracted incrementally and merged with the data already on Hadoop to
+ produce one latest feed instance. *This feature is not supported currently.* The use case for this feature is
+ to efficiently import very large dimension tables that have updates and inserts onto Hadoop and make them available
+ as a snapshot with the latest updates to consumers.
+
+ The following example defines an incremental extraction with append organization:
+
+ <verbatim>
+ <import>
+ <source name="mysql-db" tableName="simple">
+ <extract type="incremental">
+ <deltacolumn>modified_time</deltacolumn>
+ <mergepolicy>append</mergepolicy>
+ </extract>
+ <fields>
+ <includes>
+ <field>id</field>
+ <field>name</field>
+ </includes>
+ </fields>
+ </source>
+ <arguments>
+ <argument name="--split-by" value="id"/>
+ <argument name="--num-mappers" value="2"/>
+ </arguments>
+ </import>
+ </verbatim>
+
+
+ The fields option enables users to control which fields get imported. By default, all fields are imported. The "includes"
+ option imports only the specified fields. The "excludes" option imports all fields except those specified.
+
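+ For illustration, a minimal sketch of the "excludes" form, assuming the same source element as the examples above
+ (the field name is hypothetical):
+
+ <verbatim>
+ <fields>
+ <excludes>
+ <field>internal_notes</field>
+ </excludes>
+ </fields>
+ </verbatim>
+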
+ The arguments section enables users to pass extra arguments for fine-grained control over the underlying
+ implementation -- in this case, Sqoop.
+
+ * *Export policy*
+
+ The export policy, like the import policy, uses the datasource entity for connecting to the database. The load type
+ specifies whether to insert new data or only update existing data in the external table. The fields option behaves
+ the same way as in the import policy. The tableName specified should exist in the external datasource.
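+
+ A sketch of an update-style export mirroring the insert example above; the load type value for update semantics
+ is an assumption based on the description here, so please confirm the exact value against the feed XSD:
+
+ <verbatim>
+ <export>
+ <target name="mysql-db" tableName="simple_export">
+ <!-- "update" is assumed from the description above; check the feed XSD for the exact value -->
+ <load type="update"/>
+ </target>
+ <arguments>
+ <argument name="--update-key" value="id"/>
+ </arguments>
+ </export>
+ </verbatim>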
+
+---+++ Operation
+ Once the Datasource and Feed entities with import and export policies are defined, users can submit and schedule
+ the import and export operations via the CLI or REST API as shown below:
+
+ <verbatim>
+
+ ## submit the mysql-db datasource defined in the file mysql_datasource.xml
+ falcon entity -submit -type datasource -file mysql_datasource.xml
+
+ ## submit the customer_feed specified in the customer_email_feed.xml
+ falcon entity -submit -type feed -file customer_email_feed.xml
+
+ ## schedule the customer_feed
+ falcon entity -schedule -type feed -name customer_feed
+
+ </verbatim>
+
+ Falcon will create the corresponding Oozie bundles with a coordinator and workflow for the import and export operations.
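+
+ Once scheduled, instance progress can be checked from the CLI as well; a minimal sketch, assuming the feed name and
+ validity start time from the example above:
+
+ <verbatim>
+ ## check the status of the import/export instances of the scheduled feed
+ falcon instance -type feed -name customer_feed -status -start "2015-12-15T00:00Z" -end "2015-12-16T00:00Z"
+ </verbatim>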