<!DOCTYPE html>
<!--
| Generated by Apache Maven Doxia at 2018-03-12
| Rendered using Apache Maven Fluido Skin 1.3.0
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="Date-Revision-yyyymmdd" content="20180312" />
<meta http-equiv="Content-Language" content="en" />
<title>Falcon - Falcon Native Scheduler</title>
<link rel="stylesheet" href="./css/apache-maven-fluido-1.3.0.min.css" />
<link rel="stylesheet" href="./css/site.css" />
<link rel="stylesheet" href="./css/print.css" media="print" />
<script type="text/javascript" src="./js/apache-maven-fluido-1.3.0.min.js"></script>
<script type="text/javascript">$( document ).ready( function() { $( '.carousel' ).carousel( { interval: 3500 } ) } );</script>
</head>
<body class="topBarDisabled">
<div class="container">
<div id="banner">
<div class="pull-left">
<div id="bannerLeft">
<img src="images/falcon-logo.png" alt="Apache Falcon" width="200px" height="45px"/>
</div>
</div>
<div class="pull-right"> </div>
<div class="clear"><hr/></div>
</div>
<div id="breadcrumbs">
<ul class="breadcrumb">
<li class="">
<a href="index.html" title="Falcon">
Falcon</a>
</li>
<li class="divider ">/</li>
<li class="">Falcon Native Scheduler</li>
<li id="publishDate" class="pull-right">Last Published: 2018-03-12</li> <li class="divider pull-right">|</li>
<li id="projectVersion" class="pull-right">Version: 0.11</li>
</ul>
</div>
<div id="bodyColumn" >
<div class="section">
<h2>Falcon Native Scheduler<a name="Falcon_Native_Scheduler"></a></h2></div>
<div class="section">
<h3>Overview<a name="Overview"></a></h3>
<p>Falcon has so far used Oozie as its scheduling engine. Although Oozie works reasonably well, there are scenarios where Oozie scheduling proves to be a limiting factor. In its current form, Falcon relies on Oozie for both scheduling and workflow execution, so scheduling is limited to time-based (cron-based) scheduling with additional gating conditions on data availability. This also restricts datasets to being periodic in nature. To offer better scheduling capabilities, Falcon comes with its own native scheduler.</p></div>
<div class="section">
<h3>Capabilities<a name="Capabilities"></a></h3>
<p>The native scheduler will offer the capabilities of the Oozie coordinator and more. It will be built and released over the next few releases of Falcon, giving users an opportunity to use it and provide feedback.</p>
<p>Currently, the native scheduler offers the following capabilities:</p>
<ol style="list-style-type: decimal">
<li>Submit and schedule a Falcon process that runs periodically (without data dependency). The process can be a Pig script, an Oozie workflow, or a Hive script (all the engine types currently supported).</li>
<li>Monitor, query, and modify the scheduled process. All applicable entity APIs and instance APIs should work as they do now. Falcon provides data management functions for feeds declaratively. It allows users to represent feed locations as time-based partition directories on HDFS containing files.</li></ol>
<p><b>NOTE: Execution order is FIFO. LIFO and LAST_ONLY are not supported yet.</b></p>
<p>In the near future, the Falcon scheduler will reach feature parity with the Oozie scheduler, and subsequent releases will provide the following features:</p>
<ul>
<li>Periodic, cron-based, calendar-based scheduling.</li>
<li>Data availability based scheduling.</li>
<li>External trigger/notification based scheduling.</li>
<li>Support for periodic/a-periodic datasets.</li>
<li>Support for optional/mandatory datasets. Option to specify minimum/maximum/exactly-N instances of data to consume.</li>
<li>Handle dependencies across entities during re-run.</li></ul></div>
<div class="section">
<h3>Configuring Native Scheduler<a name="Configuring_Native_Scheduler"></a></h3>
<p>You can enable the native scheduler by making the following changes to <b><i>$FALCON_HOME/conf/startup.properties</i></b>. You will need to restart the Falcon Server for the changes to take effect.</p>
<div class="source">
<pre>
*.dag.engine.impl=org.apache.falcon.workflow.engine.OozieDAGEngine
*.application.services=org.apache.falcon.security.AuthenticationInitializationService,\
org.apache.falcon.workflow.WorkflowJobEndNotificationService,\
org.apache.falcon.service.ProcessSubscriberService,\
org.apache.falcon.service.EntitySLAMonitoringService,\
org.apache.falcon.service.LifecyclePolicyMap,\
org.apache.falcon.service.FalconJPAService,\
org.apache.falcon.entity.store.ConfigurationStore,\
org.apache.falcon.rerun.service.RetryService,\
org.apache.falcon.rerun.service.LateRunService,\
org.apache.falcon.metadata.MetadataMappingService,\
org.apache.falcon.service.LogCleanupService,\
org.apache.falcon.service.GroupsService,\
org.apache.falcon.service.ProxyUserService,\
org.apache.falcon.notification.service.impl.JobCompletionService,\
org.apache.falcon.notification.service.impl.SchedulerService,\
org.apache.falcon.notification.service.impl.AlarmService,\
org.apache.falcon.notification.service.impl.DataAvailabilityService,\
org.apache.falcon.execution.FalconExecutionService
</pre></div></div>
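<div class="section">
<p>As a rough sketch of applying the change, the commands below confirm that the DAG engine property is present and then restart the server. They assume a standard installation where the <i>falcon-stop</i> and <i>falcon-start</i> scripts live under <b><i>$FALCON_HOME/bin</i></b>; adjust the paths if your layout differs.</p>
<div class="source">
<pre>
# Confirm the native scheduler setting is present (prints the matching line, if any)
grep -F 'dag.engine.impl' "$FALCON_HOME/conf/startup.properties"

# Restart Falcon Server so the new configuration is loaded
"$FALCON_HOME/bin/falcon-stop"
"$FALCON_HOME/bin/falcon-start"
</pre></div></div>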
<div class="section">
<h4>Making the Native Scheduler the default scheduler<a name="Making_the_Native_Scheduler_the_default_scheduler"></a></h4>
<p>To ensure backward compatibility, the default scheduler is still Oozie even when the native scheduler is enabled. This means users will schedule entities on the Oozie scheduler by default. They need to explicitly specify the scheduler as native if they wish to schedule entities using the native scheduler.</p>
<p><a href="#Scheduling_new_entities_on_Native_Scheduler">This section</a> has more details on how to schedule on either of the schedulers.</p>
<p>If you wish to make the Falcon Native Scheduler your default scheduler and remove Oozie as the scheduler, set the following property in <b><i>$FALCON_HOME/conf/startup.properties</i></b>:</p>
<div class="source">
<pre>
## If you wish to use Falcon native scheduler as your default scheduler, set the workflow engine to FalconWorkflowEngine instead of OozieWorkflowEngine. ##
*.workflow.engine.impl=org.apache.falcon.workflow.engine.FalconWorkflowEngine
</pre></div></div>
<div class="section">
<h4>Configuring the state store for Native Scheduler<a name="Configuring_the_state_store_for_Native_Scheduler"></a></h4>
<p>You can configure the state store by making the following changes to <b><i>$FALCON_HOME/conf/statestore.properties</i></b>. You will need to restart the Falcon Server for the changes to take effect.</p>
<p>Falcon Server needs to maintain the state of entities and instances in a persistent store for the system to be recoverable. Since Prism only federates, it does not need to maintain any state information. The following properties need to be set in statestore.properties of Falcon Servers:</p>
<div class="source">
<pre>
######### StateStore Properties #####
*.falcon.state.store.impl=org.apache.falcon.state.store.jdbc.JDBCStateStore
*.falcon.statestore.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
*.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db
# StateStore credentials file where the username, password and other properties can be stored securely.
# Set the permissions of this file to 400 so that only the user who starts Falcon can read it.
# Give the absolute path to the credentials file (including the file name), or put a file named statestore.credentials on the classpath.
# The credentials file must be present either at the given location or on the classpath; otherwise Falcon won't start.
*.falcon.statestore.credentials.file=
*.falcon.statestore.jdbc.username=sa
*.falcon.statestore.jdbc.password=
*.falcon.statestore.connection.data.source=org.apache.commons.dbcp.BasicDataSource
# Maximum number of active connections that can be allocated from this pool at the same time.
*.falcon.statestore.pool.max.active.conn=10
*.falcon.statestore.connection.properties=
# Indicates the interval (in milliseconds) between eviction runs.
*.falcon.statestore.validate.db.connection.eviction.interval=300000
## The number of objects to examine during each run of the idle object evictor thread.
*.falcon.statestore.validate.db.connection.eviction.num=10
## Creates Falcon DB.
## If set to true, it creates the DB schema if it does not exist. If the DB schema exists, this is a no-op.
## If set to false, it does not create the DB schema. If the DB schema does not exist, startup fails.
*.falcon.statestore.create.db.schema=true
</pre></div>
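<p>The comments above ask for a credentials file readable only by the user who starts Falcon. A minimal sketch of locking the file down is shown here; the path is a hypothetical example, not a Falcon default, and should match whatever absolute path you set in <i>*.falcon.statestore.credentials.file</i>.</p>
<div class="source">
<pre>
# Hypothetical location of the credentials file; not a Falcon default
CRED_FILE="$FALCON_HOME/conf/statestore.credentials"

# Owner read-only (mode 400), as recommended in statestore.properties
chmod 400 "$CRED_FILE"

# Verify: the mode column should read -r--------
ls -l "$CRED_FILE"
</pre></div>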
<p>The <i>*.falcon.statestore.jdbc.url</i> property in statestore.properties determines the DB and data location. All other properties are common across RDBMS.</p>
<p><b>NOTE : Although multiple Falcon Servers can share a DB (not applicable for Derby DB), it is recommended that you have different DBs for different Falcon Servers for better performance.</b></p>
<p>You will need to create the state DB and tables before starting the Falcon Server. A tool for creating the tables comes bundled with the Falcon installation: use the <i>falcon-db.sh</i> script to create tables in the DB. The script needs to be run only for Falcon Servers and can be run by any user that has execute permission on the script. The script picks up the DB connection details from <b><i>$FALCON_HOME/conf/statestore.properties</i></b>. Ensure that you have granted the right privileges to the user mentioned in <b><i>statestore.properties</i></b>, so the tables can be created.</p>
<p>You can use the help command to get details on the sub-commands supported:</p>
<div class="source">
<pre>
./bin/falcon-db.sh help
Hadoop home is set, adding libraries from '/Users/pallavi.rao/falcon/hadoop-2.6.0/bin/hadoop classpath' into falcon classpath
usage:
  Falcon DB initialization tool currently supports Derby DB/ Mysql
  falcondb help : Display usage for all commands or specified command
  falcondb version : Show Falcon DB version information
  falcondb create &lt;OPTIONS&gt; : Create Falcon DB schema
      -run             Confirmation option regarding DB schema creation/upgrade
      -sqlfile &lt;arg&gt;   Generate SQL script instead of creating/upgrading the DB schema
  falcondb upgrade &lt;OPTIONS&gt; : Upgrade Falcon DB schema
      -run             Confirmation option regarding DB schema creation/upgrade
      -sqlfile &lt;arg&gt;   Generate SQL script instead of creating/upgrading the DB schema
</pre></div>
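<p>Based on the usage text above, creating the schema could look like the following sketch. The SQL file path is an arbitrary example, and the commands are assumed to be run from <b><i>$FALCON_HOME</i></b>.</p>
<div class="source">
<pre>
# Write the schema DDL to a file for review instead of touching the DB
./bin/falcon-db.sh create -sqlfile /tmp/falcon-schema.sql

# Create the schema in the DB configured in statestore.properties
./bin/falcon-db.sh create -run

# Show the Falcon DB version information
./bin/falcon-db.sh version
</pre></div>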
<p>Currently, MySQL, PostgreSQL and Derby are supported as state stores. We may extend support to other DBs in the future. Falcon has been tested against MySQL v5.5 and PostgreSQL v9.5. If you are using MySQL, ensure that you also copy mysql-connector-java-&lt;version&gt;.jar to <b><i>$FALCON_HOME/server/webapp/falcon/WEB-INF/lib</i></b> and <b><i>$FALCON_HOME/client/lib</i></b>.</p></div>
<div class="section">
<h5>Using Derby as the State Store<a name="Using_Derby_as_the_State_Store"></a></h5>
<p>Derby is ideal for QA and staging setups. Falcon comes bundled with a Derby connector, and no explicit setup is required (although you can set it up) in terms of creating the DB or tables. For example,</p>
<div class="source">
<pre> *.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db;create=true
</pre></div>
<p>tells Falcon to use the Derby JDBC connector, with the data directory $FALCON_HOME/data/ and the DB name 'falcon'. If <i>create=true</i> is specified, you will not need to create a DB up front; a database will be created if it does not exist.</p></div>
<div class="section">
<h5>Using MySQL as the State Store<a name="Using_MySQL_as_the_State_Store"></a></h5>
<p>The jdbc.url property in statestore.properties determines the DB and data location. For example,</p>
<div class="source">
<pre> *.falcon.statestore.jdbc.url=jdbc:mysql://localhost:3306/falcon
</pre></div>
<p>tells Falcon to use the MySQL JDBC connector to reach the server at localhost:3306, with the DB name 'falcon'.</p></div>
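<div class="section">
<p>Since the state DB must exist and the configured user needs privileges on it before the tables can be created, a setup along the following lines may help. This is only a sketch: the MySQL user <i>falcon</i> and its password are placeholders, not values Falcon prescribes, and should match the credentials in statestore.properties. The connector jar name keeps the &lt;version&gt; placeholder used earlier on this page.</p>
<div class="source">
<pre>
# Create the 'falcon' database named in the JDBC URL and grant a
# placeholder user full access to it (run as a MySQL admin user)
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS falcon;
CREATE USER 'falcon'@'localhost' IDENTIFIED BY 'change-me';
GRANT ALL PRIVILEGES ON falcon.* TO 'falcon'@'localhost';"

# Make the MySQL JDBC driver available to the Falcon server and client
cp mysql-connector-java-&lt;version&gt;.jar "$FALCON_HOME/server/webapp/falcon/WEB-INF/lib/"
cp mysql-connector-java-&lt;version&gt;.jar "$FALCON_HOME/client/lib/"
</pre></div></div>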
<div class="section">
<h3>Scheduling new entities on Native Scheduler<a name="Scheduling_new_entities_on_Native_Scheduler"></a></h3>
<p>To schedule an entity (currently only processes are supported) using the native scheduler, you need to specify the scheduler in the schedule command as shown below:</p>
<div class="source">
<pre>
$FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -schedule -properties falcon.scheduler:native
</pre></div>
<p>If Oozie is configured as the default scheduler, you can skip the scheduler option or explicitly set it to <i>oozie</i>, as shown below:</p>
<div class="source">
<pre>
$FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -schedule
OR
$FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -schedule -properties falcon.scheduler:oozie
</pre></div>
<p>If the native scheduler is configured as the default scheduler, you can omit the scheduler option, as shown below:</p>
<div class="source">
<pre>
$FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -schedule
</pre></div></div>
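<div class="section">
<p>After scheduling, you may want to confirm that the process was picked up. One way, assuming the standard entity <i>-status</i> option of the Falcon CLI, is sketched here:</p>
<div class="source">
<pre>
# Query the current status of the scheduled process
$FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -status
</pre></div></div>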
<div class="section">
<h3>Migrating entities from Oozie Scheduler to Native Scheduler<a name="Migrating_entities_from_Oozie_Scheduler_to_Native_Scheduler"></a></h3>
<p>Currently, users have to delete and re-create entities in order to move them across schedulers. Attempting to schedule an already scheduled entity on a different scheduler will result in an error. Note that the history of instances prior to scheduling on the native scheduler will not be available via the instance APIs. However, users can retrieve that information using the metadata APIs. The native scheduler must be enabled before migrating entities to it.</p>
<p><a href="#Configuring_Native_Scheduler">Configuring Native Scheduler</a> has more details on how to enable native scheduler.</p></div>
<div class="section">
<h4>Migrating from Oozie to Native Scheduler<a name="Migrating_from_Oozie_to_Native_Scheduler"></a></h4>
<ul>
<li>Delete the entity (process).</li></ul>
<div class="source">
<pre>$FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -delete
</pre></div>
<ul>
<li>Submit the entity (process) with its start time set to where the Oozie scheduler left off.</li></ul>
<div class="source">
<pre>$FALCON_HOME/bin/falcon entity -type process -submit &lt;path to process xml&gt;
</pre></div>
<ul>
<li>Schedule the entity on native scheduler.</li></ul>
<div class="source">
<pre> $FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -schedule -properties falcon.scheduler:native
</pre></div></div>
<div class="section">
<h4>Reverting to Oozie from Native Scheduler<a name="Reverting_to_Oozie_from_Native_Scheduler"></a></h4>
<ul>
<li>Delete the entity (process).</li></ul>
<div class="source">
<pre>$FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -delete
</pre></div>
<ul>
<li>Submit the entity (process) with its start time set to where the native scheduler left off.</li></ul>
<div class="source">
<pre>$FALCON_HOME/bin/falcon entity -type process -submit &lt;path to process xml&gt;
</pre></div>
<ul>
<li>Schedule the entity on the default scheduler (Oozie).</li></ul>
<div class="source">
<pre> $FALCON_HOME/bin/falcon entity -type process -name &lt;process name&gt; -schedule
</pre></div></div>
<div class="section">
<h4>Differences in API responses between Oozie and Native Scheduler<a name="Differences_in_API_responses_between_Oozie_and_Native_Scheduler"></a></h4>
<p>Most API responses are similar whether the entity is scheduled via Oozie or via Native scheduler. However, there are a few exceptions and those are listed below.</p></div>
<div class="section">
<h5>Rerun API<a name="Rerun_API"></a></h5>
<p>When a user performs a rerun using the Oozie scheduler, Falcon directly reruns the workflow on Oozie and the instance is moved to 'RUNNING'.</p>
<p>Example response:</p>
<div class="source">
<pre>
$ falcon instance -rerun processMerlinOozie -start 2016-01-08T12:13Z -end 2016-01-08T12:15Z
Consolidated Status: SUCCEEDED
Instances:
Instance Cluster SourceCluster Status Start End Details Log
-----------------------------------------------------------------------------------------------
2016-01-08T12:13Z ProcessMultipleClustersTest-corp-9706f068 - RUNNING 2016-01-08T13:03Z 2016-01-08T13:03Z - http://8RPCG32.corp.inmobi.com:11000/oozie?job=0001811-160104160825636-oozie-oozi-W
2016-01-08T12:13Z ProcessMultipleClustersTest-corp-0b270a1d - RUNNING 2016-01-08T13:03Z 2016-01-08T13:03Z - http://lda01:11000/oozie?job=0002247-160104115615658-oozie-oozi-W
Additional Information:
Response: ua1/RERUN
ua2/RERUN
Request Id: ua1/871377866@qtp-630572412-35 - 7190c4c8-bacb-4639-8d48-c9e639f544da
ua2/1554129706@qtp-536122141-13 - bc18127b-1bf8-4ea1-99e6-b1f10ba3a441
</pre></div>
<p>However, when a user performs a rerun on the native scheduler, the instance is scheduled again. This is intentional, so that the limit on the number of instances running in parallel is not violated. Hence, the user will see the status of the instance as 'READY'.</p>
<p>Example response:</p>
<div class="source">
<pre>
$ falcon instance -rerun ProcessMultipleClustersTest-agregator-coord16-8f55f59b -start 2016-01-08T12:13Z -end 2016-01-08T12:15Z
Consolidated Status: SUCCEEDED
Instances:
Instance Cluster SourceCluster Status Start End Details Log
-----------------------------------------------------------------------------------------------
2016-01-08T12:13Z ProcessMultipleClustersTest-corp-9706f068 - READY 2016-01-08T13:03Z 2016-01-08T13:03Z - http://8RPCG32.corp.inmobi.com:11000/oozie?job=0001812-160104160825636-oozie-oozi-W
2016-01-08T12:13Z ProcessMultipleClustersTest-corp-0b270a1d - READY 2016-01-08T13:03Z 2016-01-08T13:03Z - http://lda01:11000/oozie?job=0002248-160104115615658-oozie-oozi-W
Additional Information:
Response: ua1/RERUN
ua2/RERUN
Request Id: ua1/871377866@qtp-630572412-35 - 8d118d4d-c0ef-4335-a9af-10364498ec4f
ua2/1554129706@qtp-536122141-13 - c2a3fc50-8b05-47ce-9c85-ca432b96d923
</pre></div></div>
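<div class="section">
<p>To check whether a rerun instance on the native scheduler has progressed beyond 'READY', you can query the instance status for the same time window. The sketch below assumes the standard instance <i>-status</i> option of the Falcon CLI and reuses the window from the example above.</p>
<div class="source">
<pre>
# Query instance status for the rerun window
$FALCON_HOME/bin/falcon instance -type process -name &lt;process name&gt; -status -start 2016-01-08T12:13Z -end 2016-01-08T12:15Z
</pre></div></div>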
</div>
</div>
<hr/>
<footer>
<div class="container">
<div class="row span12">Copyright &copy; 2013-2018
<a href="http://www.apache.org">Apache Software Foundation</a>.
All Rights Reserved.
</div>
<p id="poweredBy" class="pull-right">
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img class="builtBy" alt="Built by Maven" src="./images/logos/maven-feather.png" />
</a>
</p>
</div>
</footer>
</body>
</html>