| ---+ Falcon Data Import and Export |
| |
| |
| ---++ Overview |
| |
| Falcon provides constructs to periodically bring raw data from external data sources (such as databases or drop boxes) |
| onto Hadoop and to push derived data computed on Hadoop back to those external data sources. |
| |
| As of this release, Falcon supports only relational databases (e.g. Oracle, MySQL) via JDBC as external data sources. |
| Future releases will add support for other kinds of external data sources. |
| |
| |
| ---++ Prerequisites |
| |
| The following are the prerequisites for importing data from and exporting data to databases. |
| |
| * *Sqoop 1.4.6+* |
| * *Oozie 4.2.0+* |
| * *Appropriate database connector* |
| |
| |
| *Note:* Falcon uses Sqoop for import/export operations. Sqoop requires the appropriate database driver to connect to |
| the relational database. Please refer to the Sqoop documentation for any Sqoop-related questions, and make sure |
| the database driver jar is copied into the Oozie share lib for Sqoop. |
| |
| <verbatim> |
| For example, in order to import from and export to MySQL, make sure the latest MySQL connector |
| mysql-connector-java-5.1.31.jar is copied into Oozie's Sqoop share lib: |
| |
| /user/oozie/share/lib/{lib-dir}/sqoop/mysql-connector-java-5.1.31.jar |
| |
| where the {lib-dir} value varies across Oozie deployments. |
| |
| </verbatim> |
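| |
| A minimal sketch of that copy step, assuming an HDFS-backed Oozie share lib and an Oozie server at |
| http://oozie-host:11000/oozie (both hypothetical here); the actual {lib-dir} value (e.g. lib_20150721010816) |
| must be looked up in your deployment: |
| |
| <verbatim> |
| ## copy the connector jar into the Sqoop share lib directory on HDFS |
| hdfs dfs -put mysql-connector-java-5.1.31.jar /user/oozie/share/lib/{lib-dir}/sqoop/ |
| |
| ## refresh the Oozie share lib so the newly added jar is picked up |
| oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate |
| </verbatim> |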
| |
| ---++ Usage |
| ---+++ Entity Definition and Setup |
| * *Datasource Entity* |
| The Datasource entity abstracts the connection and credential details for an external data source. The Datasource entity |
| supports separate read and write interfaces, each with its own credentials. The default credential is used if the read |
| or write interface does not have its own credentials. In general, the Datasource entity will be defined by a |
| system administrator. Please refer to the datasource XSD for more details. |
| |
| The following example defines a Datasource entity for a MySQL database. The import operation will use |
| the read interface with the url "jdbc:mysql://dbhost/test", user name "import_usr" and password text "sqoop", |
| whereas the export operation will use the write interface with the url "jdbc:mysql://dbhost/test", user |
| name "export_usr" and a password read from the HDFS file "/user/ambari-qa/password-store/password_write_user". |
| |
| The default credential specifies the password as password text and is used when either the read or write interface |
| does not provide its own credentials. |
| |
| The available read and write interfaces enable database administrators to segregate read and write workloads. |
| |
| <verbatim> |
| |
| File: mysql-database.xml |
| |
| <?xml version="1.0" encoding="UTF-8"?> |
| <datasource colo="west-coast" description="MySQL database on west coast" type="mysql" name="mysql-db" xmlns="uri:falcon:datasource:0.1"> |
| <tags>owner=foobar@ambari.apache.org, consumer=phoe@ambari.apache.org</tags> |
| <interfaces> |
| <!-- ***** read interface ***** --> |
| <interface type="readonly" endpoint="jdbc:mysql://dbhost/test"> |
| <credential type="password-text"> |
| <userName>import_usr</userName> |
| <passwordText>sqoop</passwordText> |
| </credential> |
| </interface> |
| |
| <!-- ***** write interface ***** --> |
| <interface type="write" endpoint="jdbc:mysql://dbhost/test"> |
| <credential type="password-file"> |
| <userName>export_usr</userName> |
| <passwordFile>/user/ambari-qa/password-store/password_write_user</passwordFile> |
| </credential> |
| </interface> |
| |
| <!-- *** default credential *** --> |
| <credential type="password-text"> |
| <userName>sqoop2_user</userName> |
| <passwordText>sqoop</passwordText> |
| </credential> |
| |
| </interfaces> |
| |
| <driver> |
| <clazz>com.mysql.jdbc.Driver</clazz> |
| <jar>/user/oozie/share/lib/lib_20150721010816/sqoop/mysql-connector-java-5.1.31.jar</jar> |
| </driver> |
| </datasource> |
| </verbatim> |
| |
| * *Feed Entity* |
| The Feed entity now enables users to define IMPORT and EXPORT policies in addition to RETENTION and REPLICATION. |
| The IMPORT and EXPORT policies refer to an already defined Datasource entity for connection and credential |
| details and take a table name from the policy to operate on. Please refer to the feed entity XSD for details. |
| |
| The following example defines a Feed entity with IMPORT and EXPORT policies. Both the IMPORT and EXPORT operations |
| refer to the datasource entity "mysql-db". The IMPORT operation will use the read interface and credentials, while |
| the EXPORT operation will use the write interface and credentials. A feed instance is created every hour |
| since the frequency of the Feed is hours(1), and the Feed instances are deleted after 90 days because of the |
| retention policy. |
| |
| |
| <verbatim> |
| |
| File: customer_email_feed.xml |
| |
| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| A feed representing Hourly customer email data retained for 90 days |
| --> |
| <feed description="Raw customer email feed" name="customer_feed" xmlns="uri:falcon:feed:0.1"> |
| <tags>externalSystem=USWestEmailServers,classification=secure</tags> |
| <groups>DataImportPipeline</groups> |
| <frequency>hours(1)</frequency> |
| <late-arrival cut-off="hours(4)"/> |
| <clusters> |
| <cluster name="primaryCluster" type="source"> |
| <validity start="2015-12-15T00:00Z" end="2016-03-31T00:00Z"/> |
| <retention limit="days(90)" action="delete"/> |
| <import> |
| <source name="mysql-db" tableName="simple"> |
| <extract type="full"> |
| <mergepolicy>snapshot</mergepolicy> |
| </extract> |
| <fields> |
| <includes> |
| <field>id</field> |
| <field>name</field> |
| </includes> |
| </fields> |
| </source> |
| <arguments> |
| <argument name="--split-by" value="id"/> |
| <argument name="--num-mappers" value="2"/> |
| </arguments> |
| </import> |
| <export> |
| <target name="mysql-db" tableName="simple_export"> |
| <load type="insert"/> |
| <fields> |
| <includes> |
| <field>id</field> |
| <field>name</field> |
| </includes> |
| </fields> |
| </target> |
| <arguments> |
| <argument name="--update-key" value="id"/> |
| </arguments> |
| </export> |
| </cluster> |
| </clusters> |
| |
| <locations> |
| <location type="data" path="/user/ambari-qa/falcon/demo/primary/importfeed/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}"/> |
| <location type="stats" path="/none"/> |
| <location type="meta" path="/none"/> |
| </locations> |
| |
| <ACL owner="ambari-qa" group="users" permission="0755"/> |
| <schema location="/none" provider="none"/> |
| |
| </feed> |
| </verbatim> |
| |
| * *Import policy* |
| The import policy uses the datasource entity specified in the "source" to connect to the database. The tableName |
| specified should exist in the source datasource. |
| |
| The extraction type specifies whether to pull data from the external datasource in "full" every time or "incrementally". |
| The mergepolicy specifies how to organize the data on Hadoop (as a snapshot, or appended as time-series partitions). |
| The valid combinations are: |
| * [full, snapshot] - data is extracted in full and dumped into the feed instance location. |
| * [incremental, append] - data is extracted incrementally using the key specified in the *deltacolumn* |
| and added as a partition to the feed instance location. |
| * [incremental, snapshot] - data is extracted incrementally and merged with the data already existing on Hadoop to |
| produce one latest feed instance. *This feature is not supported currently.* The use case for this feature is |
| to efficiently import very large dimension tables that have updates and inserts into Hadoop and make them available |
| as a snapshot with the latest updates to consumers. |
| |
| The following example defines an incremental extraction with append organization: |
| |
| <verbatim> |
| <import> |
| <source name="mysql-db" tableName="simple"> |
| <extract type="incremental"> |
| <deltacolumn>modified_time</deltacolumn> |
| <mergepolicy>append</mergepolicy> |
| </extract> |
| <fields> |
| <includes> |
| <field>id</field> |
| <field>name</field> |
| </includes> |
| </fields> |
| </source> |
| <arguments> |
| <argument name="--split-by" value="id"/> |
| <argument name="--num-mappers" value="2"/> |
| </arguments> |
| </import> |
| </verbatim> |
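| |
| Under the hood, Falcon passes these options to Sqoop. The following is a rough, illustrative sketch of the kind of |
| Sqoop import command that results from the incremental example above; the exact options, ordering and target |
| directory are assumptions and vary by Falcon and Sqoop version: |
| |
| <verbatim> |
| ## illustrative only -- Falcon generates and runs this via Oozie; do not copy verbatim |
| sqoop import \ |
|   --connect jdbc:mysql://dbhost/test --username import_usr --password sqoop \ |
|   --table simple --columns "id,name" \ |
|   --incremental lastmodified --check-column modified_time \ |
|   --split-by id --num-mappers 2 \ |
|   --target-dir /user/ambari-qa/falcon/demo/primary/importfeed/2015-12-15-00-00 |
| </verbatim> |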
| |
| |
| The fields option controls which fields get imported; by default, all fields are imported. The "includes" option |
| imports only the specified fields, while the "excludes" option imports all fields other than those specified. |
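| |
| For instance, a minimal sketch of an "excludes" list (the field names here are hypothetical): |
| |
| <verbatim> |
| <fields> |
| <excludes> |
| <field>ssn</field> |
| <field>internal_notes</field> |
| </excludes> |
| </fields> |
| </verbatim> |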
| |
| The arguments section allows passing any extra arguments needed for fine control over the underlying implementation -- |
| in this case, Sqoop. |
| |
| * *Export policy* |
| |
| The export policy, like the import policy, uses the datasource for connecting to the database. The load type specifies |
| whether to insert new data into, or only update existing data in, the external table. The fields option behaves the |
| same way as in the import policy. The tableName specified should exist in the external datasource. |
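| |
| For instance, a minimal sketch of an export policy that only updates existing rows; the "update" load type used |
| here is an assumption based on the load types described above, so check the feed XSD for the exact values your |
| version accepts: |
| |
| <verbatim> |
| <export> |
| <target name="mysql-db" tableName="simple_export"> |
| <load type="update"/> |
| <fields> |
| <includes> |
| <field>id</field> |
| <field>name</field> |
| </includes> |
| </fields> |
| </target> |
| <arguments> |
| <!-- rows are matched on this key, mirroring Sqoop's --update-key option --> |
| <argument name="--update-key" value="id"/> |
| </arguments> |
| </export> |
| </verbatim> |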
| |
| ---+++ Operation |
| Once the Datasource entity and the Feed entity with import and export policies are defined, users can submit and |
| schedule the import and export operations via the CLI or REST API as shown below: |
| |
| <verbatim> |
| |
| ## submit the mysql-db datasource defined in the file mysql_datasource.xml |
| falcon entity -submit -type datasource -file mysql_datasource.xml |
| |
| ## submit the customer_feed specified in the customer_email_feed.xml |
| falcon entity -submit -type feed -file customer_email_feed.xml |
| |
| ## schedule the customer_feed |
| falcon entity -schedule -type feed -name customer_feed |
| |
| </verbatim> |
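| |
| After scheduling, a quick sketch of how progress might be checked from the same CLI (the date range here is |
| illustrative and should fall within the feed's validity window): |
| |
| <verbatim> |
| ## check the status of the scheduled feed entity |
| falcon entity -status -type feed -name customer_feed |
| |
| ## check the status of individual feed instances in a time range |
| falcon instance -status -type feed -name customer_feed -start 2015-12-15T00:00Z -end 2015-12-16T00:00Z |
| </verbatim> |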
| |
| Falcon will create the corresponding Oozie bundles, with a coordinator and workflow, for the import and export operations. |