<noautolink>
[[index][::Go back to Oozie Documentation Index::]]
-----
---+!! Oozie Bundle Specification
This document defines an Oozie abstraction called the bundle system, which specializes in submitting and maintaining a set of coordinator applications.
%TOC%
---++ Changelog
---++ 1. Bundle Overview
A bundle is a higher-level Oozie abstraction that batches a set of coordinator applications. The user can start, stop, suspend, resume and rerun jobs at the bundle level, which gives better and easier operational control.
More specifically, the Oozie *Bundle* system allows the user to define and execute a set of coordinator applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a bundle. However, a user could use the data dependencies of coordinator applications to create an implicit data application pipeline.
---++ 2. Definitions
*Kick-off-time:* The time when a bundle should start and submit coordinator applications.
*Bundle Application:* A bundle application defines a set of coordinator applications and when to start those. Normally, bundle applications are parameterized. A bundle application is written in XML.
*Bundle Job:* A bundle job is an executable instance of a bundle application. A job submission is done by submitting a job configuration that resolves all parameters in the application definition.
*Bundle Definition Language:* The language used to describe bundle applications.
---++ 3. Expression Language for Parameterization
Bundle application definitions can be parameterized with variables.
At job submission time all the parameters are resolved into concrete values.
The parameterization of bundle definitions is done using JSP Expression Language syntax from the [[http://jcp.org/aboutJava/communityprocess/final/jsr152/][JSP 2.0 Specification (JSP.2.3)]], which supports not only variables as parameters but also complex expressions.
EL expressions can be used in XML attribute values and XML text element values. They cannot be used in XML element and XML attribute names.
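As a small illustration of where EL expressions may appear, the fragment below uses one in an attribute value and one in element text. The bundle name and the variables =coordName= and =coordAppPath= are hypothetical; they would have to be supplied in the job configuration.
<verbatim>
<bundle-app name='el-demo' xmlns='uri:oozie:bundle:0.1'>
  <!-- EL in an XML attribute value -->
  <coordinator name='${coordName}'>
    <!-- EL in an XML text element value -->
    <app-path>${coordAppPath}</app-path>
  </coordinator>
  <!-- Not allowed: EL in an element or attribute name, e.g. <${tagName}> -->
</bundle-app>
</verbatim>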
---++ 4. Bundle Job
---+++ 4.1. Bundle Job Status
At any time, a bundle job is in one of the following statuses: *PREP, RUNNING, RUNNINGWITHERROR, SUSPENDED, PREPSUSPENDED, SUSPENDEDWITHERROR, PAUSED, PAUSEDWITHERROR, PREPPAUSED, SUCCEEDED, DONEWITHERROR, KILLED, FAILED*.
---+++ 4.2. Transitions of Bundle Job Status
Valid bundle job status transitions are:
* *PREP --> PREPSUSPENDED | PREPPAUSED | RUNNING | KILLED*
* *RUNNING --> RUNNINGWITHERROR | SUSPENDED | PAUSED | SUCCEEDED | KILLED*
* *RUNNINGWITHERROR --> RUNNING | SUSPENDEDWITHERROR | PAUSEDWITHERROR | DONEWITHERROR | FAILED | KILLED*
* *PREPSUSPENDED --> PREP | KILLED*
* *SUSPENDED --> RUNNING | KILLED*
* *SUSPENDEDWITHERROR --> RUNNINGWITHERROR | KILLED*
* *PREPPAUSED --> PREP | KILLED*
* *PAUSED --> SUSPENDED | RUNNING | KILLED*
* *PAUSEDWITHERROR --> SUSPENDEDWITHERROR | RUNNINGWITHERROR | KILLED*
---+++ 4.3. Details of Status Transitions
When a bundle job is submitted, oozie parses the bundle job XML. Oozie then creates a record for the bundle with status *PREP* and returns a unique ID.
When a user requests to suspend a bundle job that is in *PREP* state, oozie puts the job in status *PREPSUSPENDED*. Similarly, when the pause time is reached for a bundle job with *PREP* status, oozie puts the job in status *PREPPAUSED*.
Conversely, when a user requests to resume a *PREPSUSPENDED* bundle job, oozie puts the job in status *PREP*. And when pause time is reset for a bundle job that is in *PREPPAUSED* state, oozie puts the job in status *PREP*.
There are two ways a bundle job can be started:
* The =kick-off-time= (defined in the bundle XML) is reached. The default value is null, which means the coordinators are started immediately (NOW).
* The user sends a start request to start the bundle.
When a bundle job starts, oozie puts the job in status *RUNNING* and submits all the coordinator jobs. If any coordinator job goes to *FAILED/KILLED/DONEWITHERROR* state, the bundle job is put in *RUNNINGWITHERROR*.
When a user requests to kill a bundle job, oozie puts the job in status *KILLED* and sends a kill request to all submitted coordinator jobs.
When a user requests to suspend a bundle job that is in *RUNNING* status, oozie puts the job in status *SUSPENDED* and suspends all submitted coordinator jobs. Similarly, when a user requests to suspend a bundle job that is in *RUNNINGWITHERROR* status, oozie puts the job in status *SUSPENDEDWITHERROR* and suspends all submitted coordinator jobs.
When the pause time is reached for a bundle job that is in *RUNNING* status, oozie puts the job in status *PAUSED*. When the pause time is reached for a bundle job that is in *RUNNINGWITHERROR* status, oozie puts the job in status *PAUSEDWITHERROR*.
Conversely, when a user requests to resume a *SUSPENDED* bundle job, oozie puts the job in status *RUNNING*. Similarly, when a user requests to resume a *SUSPENDEDWITHERROR* bundle job, oozie puts the job in status *RUNNINGWITHERROR*. When the pause time is reset for a bundle job whose status is *PAUSED*, oozie puts the job in status *RUNNING*. Similarly, when the pause time is reset for a bundle job whose status is *PAUSEDWITHERROR*, oozie puts the job in status *RUNNINGWITHERROR*.
When all the coordinator jobs finish, oozie updates the bundle status accordingly. If all coordinators reach the _same_ terminal state, the bundle job moves to that same status. For example, if all coordinators are *SUCCEEDED*, oozie puts the bundle job into *SUCCEEDED* status. However, if the coordinator jobs do not all finish with the same status, oozie puts the bundle job into *DONEWITHERROR*.
---+++ 4.3. Bundle Application Definition
A bundle definition is defined in XML by a name, controls and one or more coordinator application specifications:
   * *%BLUE% name: %ENDCOLOR%* The name for the bundle job.
   * *%BLUE% controls: %ENDCOLOR%* The control specification for the bundle.
      * *%BLUE% kick-off-time: %ENDCOLOR%* Defines when the bundle job should start and submit the coordinator applications. This field is optional; the default is *NOW*, which means the job should start right away.
   * *%BLUE% coordinator: %ENDCOLOR%* Coordinator application specification. There must be at least one coordinator application in any bundle.
      * *%BLUE% name: %ENDCOLOR%* Name of the coordinator application. It can be used to refer to this application through the bundle for control operations such as kill, suspend and rerun.
      * *%BLUE% enabled: %ENDCOLOR%* Enables or disables a coordinator. It is optional. The default value is true.
      * *%BLUE% app-path: %ENDCOLOR%* Path of the coordinator application definition in HDFS. This is a mandatory element.
      * *%BLUE% configuration: %ENDCOLOR%* A Hadoop-like configuration to parameterize the corresponding coordinator application. This is optional.
   * *%BLUE% Parameterization: %ENDCOLOR%* Configuration properties whose names are valid Java identifiers, =[A-Za-z_][0-9A-Za-z_]*=, are available as =${NAME}= variables within the bundle application definition. Configuration properties whose names are not valid Java identifiers, for example =job.tracker=, are available via the =${bundle:conf(String name)}= function. Properties with valid Java identifier names are available via this function as well.
*%PURPLE% Syntax: %ENDCOLOR%*
<verbatim>
<bundle-app name=[NAME] xmlns='uri:oozie:bundle:0.1'>
  <controls>
    <kick-off-time>[DATETIME]</kick-off-time>
  </controls>
  <coordinator name=[NAME] enabled=[TRUE | FALSE] >
    <app-path>[COORD-APPLICATION-PATH]</app-path>
    <configuration>
      <property>
        <name>[PROPERTY-NAME]</name>
        <value>[PROPERTY-VALUE]</value>
      </property>
      ...
    </configuration>
  </coordinator>
  ...
</bundle-app>
</verbatim>
*%GREEN% Examples: %ENDCOLOR%*
*A Bundle Job that maintains two coordinator applications:*
<verbatim>
<bundle-app name='APPNAME' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns='uri:oozie:bundle:0.1'>
  <controls>
    <kick-off-time>${kickOffTime}</kick-off-time>
  </controls>
  <coordinator name="${bundle:conf('coordName1')}" >
    <app-path>${appPath}</app-path>
    <configuration>
      <property>
        <name>startTime1</name>
        <value>${bundle:conf('coord1.startTime1')}</value>
      </property>
      <property>
        <name>endTime1</name>
        <value>${END_TIME}</value>
      </property>
    </configuration>
  </coordinator>
  <coordinator name='coordJobFromBundle2' >
    <app-path>${appPath2}</app-path>
    <configuration>
      <property>
        <name>startTime2</name>
        <value>${START_TIME2}</value>
      </property>
      <property>
        <name>endTime2</name>
        <value>${END_TIME2}</value>
      </property>
    </configuration>
  </coordinator>
</bundle-app>
</verbatim>
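*A Bundle Job with one disabled coordinator (illustrative sketch):*
This sketch illustrates the optional =enabled= attribute described above. The bundle and coordinator names are hypothetical, and the path variables are assumed to be supplied at submission time. The 0.2 namespace is used because the schemas in Appendix A declare =enabled= only from version 0.2 on.
<verbatim>
<bundle-app name='enabled-demo' xmlns='uri:oozie:bundle:0.2'>
  <coordinator name='coordActive'>
    <app-path>${appPath}</app-path>
  </coordinator>
  <!-- enabled defaults to true; setting it to false disables this coordinator -->
  <coordinator name='coordDisabled' enabled='false'>
    <app-path>${appPath2}</app-path>
  </coordinator>
</bundle-app>
</verbatim>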
---+++ 4.4. Bundle Formal Parameters
As of schema 0.2, a list of formal parameters can be provided. This allows Oozie to verify at submission time that those properties are actually specified (i.e. before the job is executed and fails). Default values can also be provided.
*Example:*
The previous Bundle Job application definition with formal parameters:
<verbatim>
<bundle-app name='APPNAME' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns='uri:oozie:bundle:0.2'>
  <parameters>
    <property>
      <name>appPath</name>
    </property>
    <property>
      <name>appPath2</name>
      <value>hdfs://foo:8020/user/joe/job/job.properties</value>
    </property>
  </parameters>
  <controls>
    <kick-off-time>${kickOffTime}</kick-off-time>
  </controls>
  <coordinator name='coordJobFromBundle1' >
    <app-path>${appPath}</app-path>
    <configuration>
      <property>
        <name>startTime1</name>
        <value>${START_TIME}</value>
      </property>
      <property>
        <name>endTime1</name>
        <value>${END_TIME}</value>
      </property>
    </configuration>
  </coordinator>
  <coordinator name='coordJobFromBundle2' >
    <app-path>${appPath2}</app-path>
    <configuration>
      <property>
        <name>startTime2</name>
        <value>${START_TIME2}</value>
      </property>
      <property>
        <name>endTime2</name>
        <value>${END_TIME2}</value>
      </property>
    </configuration>
  </coordinator>
</bundle-app>
</verbatim>
In the above example, if =appPath= is not specified, Oozie will print an error message instead of submitting the job. If
=appPath2= is not specified, Oozie will use the default value, =hdfs://foo:8020/user/joe/job/job.properties=.
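For illustration, a job configuration sketch that would pass this check: it supplies =appPath= and leaves =appPath2= to its declared default. The host, user and paths are hypothetical, and the remaining properties the bundle uses (such as =kickOffTime= or =START_TIME=) are elided.
<verbatim>
<configuration>
  <property>
    <name>user.name</name>
    <value>joe</value>
  </property>
  <property>
    <name>oozie.bundle.application.path</name>
    <value>hdfs://foo:8020/user/joe/mybundles/hello-bundle1.xml</value>
  </property>
  <!-- Formal parameter without a default: must be set here -->
  <property>
    <name>appPath</name>
    <value>hdfs://foo:8020/user/joe/mycoords/coord1</value>
  </property>
  <!-- appPath2 is omitted on purpose; the default declared in <parameters> applies -->
  ...
</configuration>
</verbatim>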
---++ 5. User Propagation
When submitting a bundle job, the configuration must contain a =user.name= property. If security is enabled, Oozie must ensure that the value of the =user.name= property in the configuration matches the user credentials present in the protocol (web services) request.
When submitting a bundle job, the configuration may contain the =oozie.job.acl= property (the =group.name= property has been deprecated). If authorization is enabled, this property is treated as the ACL for the job; it can contain user and group IDs separated by commas.
The specified user and ACL are assigned to the created bundle job.
Oozie must propagate the specified user and ACL to the system executing its children jobs (coordinator jobs).
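A sketch of the relevant part of a job configuration follows. The user =joe= and the ACL entries =jane= and =analysts= are hypothetical values chosen for illustration.
<verbatim>
<configuration>
  <property>
    <name>user.name</name>
    <value>joe</value>
  </property>
  <!-- Comma-separated user and group IDs; treated as the job ACL when authorization is enabled -->
  <property>
    <name>oozie.job.acl</name>
    <value>jane,analysts</value>
  </property>
  ...
</configuration>
</verbatim>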
---++ 6. Bundle Application Deployment
A bundle application consists exclusively of a bundle application definition and the associated coordinator application specifications. They must be installed in an HDFS directory. To submit a job for a bundle application, the full HDFS path to the bundle application definition must be specified.
---+++ 6.1. Organizing Bundle Applications
TBD.
---++ 7. Bundle Job Submission
When a bundle job is submitted to Oozie, the submitter must specify all the required job properties plus the HDFS path to the bundle application definition for the job.
The bundle application definition HDFS path must be specified in the 'oozie.bundle.application.path' job property.
All the bundle job properties, the HDFS path for the bundle application, the 'user.name' and 'oozie.job.acl' must be submitted to Oozie using an XML configuration file (Hadoop XML configuration file).
*%GREEN% Example: %ENDCOLOR%*
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>user.name</name>
    <value>joe</value>
  </property>
  <property>
    <name>oozie.bundle.application.path</name>
    <value>hdfs://foo:8020/user/joe/mybundles/hello-bundle1.xml</value>
  </property>
  ...
</configuration>
</verbatim>
---++ 8. Bundle Rerun
---+++ Rerunning a Bundle Job
Oozie provides a way of rerunning a bundle job. The user can request to rerun a subset of the coordinators within a bundle by giving a list of coordinator names. In addition, the user can give a list of dates or date ranges (in UTC) to rerun for those time windows.
The user can choose whether all output directories are cleaned up before the rerun. By default, oozie removes all output directories. In addition, the user can ask oozie to re-calculate the dynamic input directories defined by the latest function in coordinators.
---+++ Rerun Arguments
<verbatim>
$ oozie job -rerun <bundle_Job_id> [-coordinator <list of coordinator names separated by commas>]
[-date 2009-01-01T01:00Z::2009-05-31T23:59Z, 2009-11-10T01:00Z, 2009-12-31T22:00Z]
[-nocleanup] [-refresh]
</verbatim>
* The =rerun= option reruns a bundle job that is *not* in one of the statuses =KILLED=, =FAILED=, =PREP=, =PREPPAUSED=, =PREPSUSPENDED=.
* Rerunning a bundle job that is in =PAUSED= state resets the pause time.
* The =-coordinator= option names the coordinators to rerun. By default all coordinators are rerun.
* Multiple dates and date ranges can be given in =-date=. See the example above.
* The dates specified in =-date= must be UTC.
* If =-nocleanup= is given, the corresponding coordinator directories are not removed; otherwise the 'output-event' is deleted.
* If =-refresh= is set, the datasets are re-evaluated for latest() and future() for the corresponding coordinators.
* If =-refresh= is set, all dependencies are re-checked; otherwise only missed dependencies are checked for the corresponding coordinators.
After the command is executed, the rerun bundle job will be in =RUNNING= status.
Refer to [[DG_CoordinatorRerun][Rerunning Coordinator Actions]] for details on rerunning a coordinator job.
---++ Appendixes
---+++ Appendix A, Oozie Bundle XML-Schema
---++++ Oozie Bundle Schema 0.1
<verbatim>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:bundle="uri:oozie:bundle:0.1"
           elementFormDefault="qualified" targetNamespace="uri:oozie:bundle:0.1">
  <xs:element name="bundle-app" type="bundle:BUNDLE-APP"/>
  <xs:simpleType name="IDENTIFIER">
    <xs:restriction base="xs:string">
      <xs:pattern value="([a-zA-Z]([\-_a-zA-Z0-9])*){1,39}"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:complexType name="BUNDLE-APP">
    <xs:sequence>
      <xs:element name="controls" type="bundle:CONTROLS" minOccurs="0" maxOccurs="1"/>
      <xs:element name="coordinator" type="bundle:COORDINATOR" minOccurs="1" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="name" type="bundle:IDENTIFIER" use="required"/>
  </xs:complexType>
  <xs:complexType name="CONTROLS">
    <xs:sequence minOccurs="0" maxOccurs="1">
      <xs:element name="kick-off-time" type="xs:string" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="COORDINATOR">
    <xs:sequence minOccurs="1" maxOccurs="1">
      <xs:element name="app-path" type="xs:string" minOccurs="1" maxOccurs="1"/>
      <xs:element name="configuration" type="bundle:CONFIGURATION" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
    <xs:attribute name="name" type="bundle:IDENTIFIER" use="required"/>
    <xs:attribute name="critical" type="xs:string" use="optional"/>
  </xs:complexType>
  <xs:complexType name="CONFIGURATION">
    <xs:sequence>
      <xs:element name="property" minOccurs="1" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" minOccurs="1" maxOccurs="1" type="xs:string"/>
            <xs:element name="value" minOccurs="1" maxOccurs="1" type="xs:string"/>
            <xs:element name="description" minOccurs="0" maxOccurs="1" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
</verbatim>
---++++ Oozie Bundle Schema 0.2
<verbatim>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:bundle="uri:oozie:bundle:0.2"
           elementFormDefault="qualified" targetNamespace="uri:oozie:bundle:0.2">
  <xs:element name="bundle-app" type="bundle:BUNDLE-APP"/>
  <xs:simpleType name="IDENTIFIER">
    <xs:restriction base="xs:string">
      <xs:pattern value="([a-zA-Z]([\-_a-zA-Z0-9])*){1,39}"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:complexType name="BUNDLE-APP">
    <xs:sequence>
      <xs:element name="parameters" type="bundle:PARAMETERS" minOccurs="0" maxOccurs="1"/>
      <xs:element name="controls" type="bundle:CONTROLS" minOccurs="0" maxOccurs="1"/>
      <xs:element name="coordinator" type="bundle:COORDINATOR" minOccurs="1" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="name" type="xs:string" use="required"/>
  </xs:complexType>
  <xs:complexType name="PARAMETERS">
    <xs:sequence>
      <xs:element name="property" minOccurs="1" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" minOccurs="1" maxOccurs="1" type="xs:string"/>
            <xs:element name="value" minOccurs="0" maxOccurs="1" type="xs:string"/>
            <xs:element name="description" minOccurs="0" maxOccurs="1" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="CONTROLS">
    <xs:sequence minOccurs="0" maxOccurs="1">
      <xs:element name="kick-off-time" type="xs:string" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="COORDINATOR">
    <xs:sequence minOccurs="1" maxOccurs="1">
      <xs:element name="app-path" type="xs:string" minOccurs="1" maxOccurs="1"/>
      <xs:element name="configuration" type="bundle:CONFIGURATION" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
    <xs:attribute name="name" type="bundle:IDENTIFIER" use="required"/>
    <xs:attribute name="critical" type="xs:string" use="optional"/>
    <xs:attribute name="enabled" type="xs:string" use="optional"/>
  </xs:complexType>
  <xs:complexType name="CONFIGURATION">
    <xs:sequence>
      <xs:element name="property" minOccurs="1" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" minOccurs="1" maxOccurs="1" type="xs:string"/>
            <xs:element name="value" minOccurs="1" maxOccurs="1" type="xs:string"/>
            <xs:element name="description" minOccurs="0" maxOccurs="1" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
</verbatim>
[[index][::Go back to Oozie Documentation Index::]]
</noautolink>