<noautolink>

[[index][::Go back to Oozie Documentation Index::]]

---+!! HCatalog Integration (Since Oozie 4.x)

%TOC%

---++ HCatalog Overview
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing
tools - Pig, MapReduce, and Hive - to more easily read and write data on the grid. HCatalog's table abstraction presents
users with a relational view of data in the Hadoop distributed file system (HDFS).

Read the [[http://incubator.apache.org/hcatalog/docs/r0.5.0/index.html][HCatalog Documentation]] to learn more about HCatalog.
Working with HCatalog from Pig is detailed in
[[http://incubator.apache.org/hcatalog/docs/r0.5.0/loadstore.html][HCatLoader and HCatStorer]].
Working with HCatalog directly from MapReduce is detailed in
[[http://incubator.apache.org/hcatalog/docs/r0.5.0/inputoutput.html][HCatInputFormat and HCatOutputFormat]].

---+++ HCatalog notifications
HCatalog sends notifications through a JMS provider such as ActiveMQ when a new partition is added to a table in the
database. This allows applications to consume those events and schedule the work that depends on them. In the case of
Oozie, the notifications are used to determine the availability of HCatalog partitions defined as data dependencies in
the Coordinator and to trigger workflows.

Read [[http://incubator.apache.org/hcatalog/docs/r0.5.0/notification.html][HCatalog Notification]] to learn more about
notifications in HCatalog.

---++ Oozie HCatalog Integration
Oozie Coordinators have so far supported HDFS directories as input data dependencies. When an HDFS URI
template is specified as a dataset and input events are defined in the Coordinator for that dataset, Oozie performs data
availability checks by polling the HDFS directory URIs resolved based on the nominal time. When all the data
dependencies are met, the Coordinator's workflow is triggered, which then consumes the available HDFS data.

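For illustration, a purely HDFS-based dependency looks like the following minimal sketch (the coordinator name,
dataset name, paths and times here are hypothetical):
<verbatim>
<coordinator-app name="logproc-coord" frequency="${coord:days(1)}"
                 start="2009-02-15T08:15Z" end="2009-03-15T08:15Z" timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="rawlogs" frequency="${coord:days(1)}"
                 initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
            <uri-template>hdfs://namenode:8020/app/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="logsIn" dataset="rawlogs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://namenode:8020/app/workflows/logproc</app-path>
        </workflow>
    </action>
</coordinator-app>
</verbatim>
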
With the addition of HCatalog support, Coordinators can also specify a set of HCatalog table partitions as a dataset.
The workflow is triggered when the HCatalog table partitions are available, and the workflow actions can then read the
partition data. A mix of HDFS and HCatalog dependencies can be specified as input data dependencies.
Similar to HDFS directories, HCatalog table partitions can also be specified as output dataset events.

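For example, the snippet below mixes both dependency types: an HDFS input, an HCatalog input and an HCatalog output
(a minimal sketch; the table, server and dataset names are hypothetical):
<verbatim>
<datasets>
    <dataset name="rawlogs" frequency="${coord:days(1)}"
             initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
        <uri-template>hdfs://namenode:8020/app/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
    <dataset name="clicks" frequency="${coord:days(1)}"
             initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
        <uri-template>hcat://myhcatmetastore:9080/database1/clicks/datestamp=${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
    <dataset name="dailystats" frequency="${coord:days(1)}"
             initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
        <uri-template>hcat://myhcatmetastore:9080/database1/stats/datestamp=${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
</datasets>
<input-events>
    <data-in name="logsIn" dataset="rawlogs">
        <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="clicksIn" dataset="clicks">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>
<output-events>
    <data-out name="statsOut" dataset="dailystats">
        <instance>${coord:current(0)}</instance>
    </data-out>
</output-events>
</verbatim>
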
With HDFS data dependencies, Oozie has to repeatedly poll HDFS to determine the availability of a directory.
If the HCatalog server is configured to publish partition availability notifications to a JMS provider, Oozie can be
configured to subscribe to it and trigger jobs immediately. This pub-sub model reduces pressure on the Namenode and also
cuts down on delays caused by polling intervals.

In the absence of a message bus in the deployment, Oozie always polls the HCatalog server directly for partition
availability, with the same frequency as the HDFS polling. Even when subscribed to notifications, Oozie falls back to
polling the HCatalog server for partitions that were available before the coordinator action was materialized, and to
deal with notifications missed due to system downtime. The frequency of the fallback polling is usually lower than
that of the regular polling. The defaults are 10 minutes and 1 minute respectively.

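These intervals can be tuned in oozie-site.xml. As a sketch (the property names below are those found in
oozie-default.xml of Oozie 4.x releases; verify them against your version), the regular and fallback check intervals
map to the following properties, both in milliseconds:
<verbatim>
<!-- Regular (pull) data availability check interval: 1 minute -->
<property>
    <name>oozie.service.coord.input.check.requeue.interval</name>
    <value>60000</value>
</property>
<!-- Fallback check interval for push (notification-based) dependencies: 10 minutes -->
<property>
    <name>oozie.service.coord.push.check.requeue.interval</name>
    <value>600000</value>
</property>
</verbatim>
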
---+++ Oozie Server Configuration
Refer to the [[AG_Install#HCatalog_Configuration][HCatalog Configuration]] section of the [[AG_Install][Oozie Install]]
documentation for the Oozie server-side configuration required to support HCatalog table partitions as a data dependency.

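As an illustrative sketch of one piece of that configuration (the property name and value format are those documented
for Oozie 4.x; the broker URL here is hypothetical), pointing Oozie at an ActiveMQ broker for HCatalog notifications
looks like:
<verbatim>
<!-- JMS connection used to receive HCatalog partition notifications.
     The 'default' mapping applies to any HCatalog server without a more specific entry. -->
<property>
    <name>oozie.service.HCatAccessorService.jmsconnections</name>
    <value>
        default=java.naming.factory.initial#org.apache.activemq.jndi.ActiveMQInitialContextFactory;java.naming.provider.url#tcp://activemq.example.com:61616;connectionFactoryNames#ConnectionFactory
    </value>
</property>
</verbatim>
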
---+++ HCatalog URI Format

Oozie supports specifying HCatalog partitions as a data dependency through a URI notation. An HCatalog partition URI
identifies a set of table partitions, for example: hcat://bar:8020/logsDB/logsTable/dt=20090415;region=US.

The format of an HCatalog table partition URI is
hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value];...

For example,
<verbatim>
<dataset name="logs" frequency="${coord:days(1)}"
         initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
    <uri-template>
        hcat://myhcatmetastore:9080/database1/table1/datestamp=${YEAR}${MONTH}${DAY}${HOUR};region=USA
    </uri-template>
</dataset>
</verbatim>

#HCatalogLibraries
---+++ HCatalog Libraries

A workflow action interacting with HCatalog requires the following jars in the classpath:
hcatalog-core.jar, hcatalog-pig-adapter.jar, webhcat-java-client.jar, hive-common.jar, hive-exec.jar,
hive-metastore.jar, hive-serde.jar and libfb303.jar.
hive-site.xml, which has the configuration to talk to the HCatalog server, also needs to be in the classpath. The
versions of the HCatalog and Hive jars placed in the classpath should match the version of HCatalog installed on the
cluster.

The jars can be added to the classpath of the action in one of the following ways:
   * You can place the jars and hive-site.xml in the system shared library. The shared library for a pig, hive or java action can be overridden to include the hcatalog shared library along with the action's shared library; refer to [[WorkflowFunctionalSpec.html#a17_HDFS_Share_Libraries_for_Workflow_Applications_since_Oozie_2.3][Shared Libraries]] for more information, and see the sketch after this list. The oozie-sharelib-[version].tar.gz in the Oozie distribution bundles the required HCatalog jars in a hcatalog sharelib. If using a different version of HCatalog than the one bundled in the sharelib, copy the required HCatalog jars from that version into the sharelib.
   * You can place the jars and hive-site.xml in the workflow application lib/ path.
   * You can specify the location of the jar files in the =archive= tag and the hive-site.xml in the =file= tag of the corresponding pig, hive or java action.

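As an example of the sharelib override mentioned in the first option, a pig action can request both the pig and
hcatalog sharelibs through its configuration (a minimal sketch; the rest of the action is elided):
<verbatim>
<pig>
    ...
    <configuration>
        <!-- Load the hcatalog sharelib in addition to the pig sharelib for this action -->
        <property>
            <name>oozie.action.sharelib.for.pig</name>
            <value>pig,hcatalog</value>
        </property>
    </configuration>
    ...
</pig>
</verbatim>
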
---+++ Coordinator

Refer to the [[CoordinatorFunctionalSpec][Coordinator Functional Specification]] for more information about:
   * how to specify HCatalog partitions as a data dependency using input dataset events
   * how to specify HCatalog partitions as output dataset events
   * the various EL functions available to work with HCatalog dataset events and how to use them to access HCatalog partitions in pig, hive or java actions in a workflow (a sketch follows this list)

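For instance, the coordinator EL functions from the last bullet can hand HCatalog dependency details to the triggered
workflow as properties (a minimal sketch; the event and property names are hypothetical and reuse the mixed dataset
example above):
<verbatim>
<action>
    <workflow>
        <app-path>hdfs://namenode:8020/app/workflows/logproc</app-path>
        <configuration>
            <!-- Partition filter for the 'clicksIn' input event, formatted for pig -->
            <property>
                <name>inputPartitions</name>
                <value>${coord:dataInPartitionFilter('clicksIn', 'pig')}</value>
            </property>
            <!-- Partition key=value pairs to create for the 'statsOut' output event -->
            <property>
                <name>outputPartitions</name>
                <value>${coord:dataOutPartitions('statsOut')}</value>
            </property>
        </configuration>
    </workflow>
</action>
</verbatim>
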
---+++ Workflow
Refer to the [[WorkflowFunctionalSpec][Workflow Functional Specification]] for more information about:
   * how to drop HCatalog partitions in the prepare block of an action (a sketch follows this list)
   * the HCatalog EL functions available to use in workflows

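For instance, dropping a previously created output partition before a rerun can be expressed in an action's prepare
block (a minimal sketch; the URI is hypothetical and the rest of the action is elided):
<verbatim>
<pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
        <!-- Drop the partition left over from a previous run, if any -->
        <delete path="hcat://myhcatmetastore:9080/database1/stats/datestamp=20090215"/>
    </prepare>
    ...
</pig>
</verbatim>
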
Refer to [[DG_ActionAuthentication][Action Authentication]] for more information about:
   * how to access a secure HCatalog from any action (e.g. hive, pig, etc.) in a workflow

---+++ Known Issues
   * When rerunning a coordinator action without specifying the -nocleanup option, 'output-event' HDFS directories are deleted, but an 'output-event' HCatalog partition is currently not dropped.

</noautolink>