<!DOCTYPE html>
<!--
| Generated by Apache Maven Doxia Site Renderer 1.8 from src/site/twiki/Hook-Hive.twiki at 2018-09-06
| Rendered using Apache Maven Fluido Skin 1.7
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="Date-Revision-yyyymmdd" content="20180906" />
<meta http-equiv="Content-Language" content="en" />
<title>Apache Atlas &#x2013; Apache Atlas Hook &amp; Bridge for Apache Hive</title>
<link rel="stylesheet" href="./css/apache-maven-fluido-1.7.min.css" />
<link rel="stylesheet" href="./css/site.css" />
<link rel="stylesheet" href="./css/print.css" media="print" />
<script type="text/javascript" src="./js/apache-maven-fluido-1.7.min.js"></script>
</head>
<body class="topBarEnabled">
<div id="topbar" class="navbar navbar-fixed-top ">
<div class="navbar-inner">
<div class="container" style="width: 68%;"><div class="nav-collapse">
<ul class="nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Apache Atlas <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="index.html" title="Overview">Overview</a></li>
<li><a href="license.html" title="License">License</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="Downloads">Downloads</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/ATLAS" title="Wiki">Wiki</a></li>
<li><a href="https://git-wip-us.apache.org/repos/asf/atlas.git" title="Git">Git</a></li>
<li><a href="https://issues.apache.org/jira/browse/ATLAS" title="Jira">Jira</a></li>
<li><a href="https://reviews.apache.org/groups/atlas/?sort=-time_added" title="Review Board">Review Board</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Project Information <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="project-info.html" title="Summary">Summary</a></li>
<li><a href="mail-lists.html" title="Mailing Lists">Mailing Lists</a></li>
<li><a href="team-list.html" title="Team">Team</a></li>
<li><a href="issue-tracking.html" title="Issue Tracking">Issue Tracking</a></li>
<li><a href="source-repository.html" title="Source Repository">Source Repository</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Downloads <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="1.1.0">1.1.0</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="1.0.0">1.0.0</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="0.8.2">0.8.2</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="0.8.1">0.8.1</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="0.8-incubating">0.8-incubating</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="0.7.1-incubating">0.7.1-incubating</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="0.7-incubating">0.7-incubating</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="0.6-incubating">0.6-incubating</a></li>
<li><a href="http://atlas.apache.org/#/Downloads" target="_blank" title="0.5-incubating">0.5-incubating</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Documentation <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="../index.html" title="latest">latest</a></li>
<li><a href="../1.1.0/index.html" title="1.1.0">1.1.0</a></li>
<li><a href="../1.0.0/index.html" title="1.0.0">1.0.0</a></li>
<li><a href="../0.8.2/index.html" title="0.8.2">0.8.2</a></li>
<li><a href="../0.8.1/index.html" title="0.8.1">0.8.1</a></li>
<li><a href="../0.8.0-incubating/index.html" title="0.8-incubating">0.8-incubating</a></li>
<li><a href="../0.7.1-incubating/index.html" title="0.7.1-incubating">0.7.1-incubating</a></li>
<li><a href="../0.7.0-incubating/index.html" title="0.7-incubating">0.7-incubating</a></li>
<li><a href="../0.6.0-incubating/index.html" title="0.6-incubating">0.6-incubating</a></li>
<li><a href="../0.5.0-incubating/index.html" title="0.5-incubating">0.5-incubating</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">ASF <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="http://www.apache.org/foundation/how-it-works.html" title="How Apache Works">How Apache Works</a></li>
<li><a href="https://www.apache.org/events/current-event" title="Events">Events</a></li>
<li><a href="https://www.apache.org/licenses/" title="License">License</a></li>
<li><a href="http://www.apache.org/foundation/" title="Foundation">Foundation</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsoring Apache">Sponsoring Apache</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li>
</ul>
</li>
</ul>
<form id="search-form" action="https://www.google.com/search" method="get" class="navbar-search pull-right" >
<input value="http://atlas.apache.org" name="sitesearch" type="hidden"/>
<input class="search-query" name="q" id="query" type="text" />
</form>
<script type="text/javascript">asyncJs( 'https://cse.google.com/brand?form=search-form' )</script>
</div>
</div>
</div>
</div>
<div class="container">
<div id="banner">
<div class="pull-left"><a href=".." id="bannerLeft"><img src="images/atlas-logo.png" alt="Apache Atlas" width="200px" height="45px"/></a></div>
<div class="pull-right"></div>
<div class="clear"><hr/></div>
</div>
<div id="breadcrumbs">
<ul class="breadcrumb">
<li class=""><a href="http://www.apache.org" class="externalLink" title="Apache">Apache</a><span class="divider">/</span></li>
<li class=""><a href="index.html" title="Atlas">Atlas</a><span class="divider">/</span></li>
<li class="active ">Apache Atlas Hook &amp; Bridge for Apache Hive</li>
<li id="publishDate" class="pull-right"><span class="divider">|</span> Last Published: 2018-09-18</li>
<li id="projectVersion" class="pull-right">Version: 1.1.0</li>
</ul>
</div>
<div id="bodyColumn" >
<div class="section">
<h2><a name="Apache_Atlas_Hook_.26_Bridge_for_Apache_Hive"></a>Apache Atlas Hook &amp; Bridge for Apache Hive</h2></div>
<div class="section">
<h3><a name="Hive_Model"></a>Hive Model</h3>
<p>The Hive model includes the following types:</p>
<ul>
<li>Entity types:
<ul>
<li>hive_db
<ul>
<li>super-types: Asset</li>
<li>attributes: qualifiedName, name, description, owner, clusterName, location, parameters, ownerName</li></ul></li>
<li>hive_table
<ul>
<li>super-types: DataSet</li>
<li>attributes: qualifiedName, name, description, owner, db, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary</li></ul></li>
<li>hive_column
<ul>
<li>super-types: DataSet</li>
<li>attributes: qualifiedName, name, description, owner, type, comment, table</li></ul></li>
<li>hive_storagedesc
<ul>
<li>super-types: Referenceable</li>
<li>attributes: qualifiedName, table, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories</li></ul></li>
<li>hive_process
<ul>
<li>super-types: Process</li>
<li>attributes: qualifiedName, name, description, owner, inputs, outputs, startTime, endTime, userName, operationType, queryText, queryPlan, queryId, clusterName</li></ul></li>
<li>hive_column_lineage
<ul>
<li>super-types: Process</li>
<li>attributes: qualifiedName, name, description, owner, inputs, outputs, query, depenendencyType, expression</li></ul></li></ul></li></ul>
<p></p>
<ul>
<li>Enum types:
<ul>
<li>hive_principal_type
<ul>
<li>values: USER, ROLE, GROUP</li></ul></li></ul></li></ul>
<p></p>
<ul>
<li>Struct types:
<ul>
<li>hive_order
<ul>
<li>attributes: col, order</li></ul></li>
<li>hive_serde
<ul>
<li>attributes: name, serializationLib, parameters</li></ul></li></ul></li></ul>
<p>Hive entities are created and de-duplicated in Atlas using the unique attribute qualifiedName, whose value should be formatted as detailed below. Note that dbName, tableName and columnName should be in lower case.</p>
<div class="source"><pre class="prettyprint">
hive_db.qualifiedName: &lt;dbName&gt;@&lt;clusterName&gt;
hive_table.qualifiedName: &lt;dbName&gt;.&lt;tableName&gt;@&lt;clusterName&gt;
hive_column.qualifiedName: &lt;dbName&gt;.&lt;tableName&gt;.&lt;columnName&gt;@&lt;clusterName&gt;
hive_process.queryString: trimmed query string in lower case
</pre></div></div>
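<p>As a rough illustration of the formats above, the following shell sketch builds qualifiedName values from their parts. The helper function names are hypothetical and are not part of Atlas. Only dbName, tableName and columnName are lower-cased; the cluster name is passed through unchanged.</p>
<div class="source"><pre class="prettyprint">
# Hypothetical helpers, not part of Atlas: build qualifiedName strings.
# Lower-case a name; dbName, tableName and columnName must be lower case.
to_lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# Arguments: dbName clusterName
hive_db_qualified_name() {
  printf '%s@%s\n' "$(to_lower "$1")" "$2"
}

# Arguments: dbName tableName clusterName
hive_table_qualified_name() {
  printf '%s.%s@%s\n' "$(to_lower "$1")" "$(to_lower "$2")" "$3"
}

# Arguments: dbName tableName columnName clusterName
hive_column_qualified_name() {
  printf '%s.%s.%s@%s\n' "$(to_lower "$1")" "$(to_lower "$2")" "$(to_lower "$3")" "$4"
}

hive_db_qualified_name Sales primary                       # sales@primary
hive_table_qualified_name Sales Orders primary             # sales.orders@primary
hive_column_qualified_name Sales Orders Order_ID primary   # sales.orders.order_id@primary
</pre></div>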
<div class="section">
<h3><a name="Hive_Hook"></a>Hive Hook</h3>
<p>The Atlas Hive hook registers with Hive to listen for create/update/delete operations and, for each change in Hive, updates the metadata in Atlas via Kafka notifications. Follow the instructions below to set up the Atlas hook in Hive:</p>
<ul>
<li>Set up the Atlas hook in hive-site.xml by adding the following:</li></ul>
<div class="source"><pre class="prettyprint">
&lt;property&gt;
&lt;name&gt;hive.exec.post.hooks&lt;/name&gt;
&lt;value&gt;org.apache.atlas.hive.hook.HiveHook&lt;/value&gt;
&lt;/property&gt;
</pre></div>
<p></p>
<ul>
<li>Untar apache-atlas-${project.version}-hive-hook.tar.gz</li>
<li>cd apache-atlas-hive-hook-${project.version}</li>
<li>Copy the entire contents of the folder apache-atlas-hive-hook-${project.version}/hook/hive to &lt;atlas package&gt;/hook/hive</li>
<li>Add 'export HIVE_AUX_JARS_PATH=&lt;atlas package&gt;/hook/hive' to hive-env.sh in your Hive configuration</li>
<li>Copy &lt;atlas-conf&gt;/atlas-application.properties to the Hive conf directory.</li></ul>
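<p>The installation steps above could be scripted roughly as follows. This is only a sketch: the function name and its three arguments are hypothetical, and it assumes that the atlas conf directory is the conf subdirectory of the atlas package and that hive-env.sh lives in the Hive configuration directory passed in. Adjust the paths to match your installation.</p>
<div class="source"><pre class="prettyprint">
# Sketch only: install_atlas_hive_hook and its arguments are hypothetical.
#   $1: path to the apache-atlas-VERSION-hive-hook.tar.gz package
#   $2: the atlas package directory (assumed to contain conf/atlas-application.properties)
#   $3: the Hive configuration directory (assumed to hold hive-env.sh)
install_atlas_hive_hook() {
  tarball="$1"; atlas_pkg="$2"; hive_conf="$3"
  workdir="$(mktemp -d)" || return 1

  # Untar the hook package into a scratch directory.
  tar -xzf "$tarball" -C "$workdir" || return 1

  # Copy the entire contents of hook/hive into the atlas package.
  mkdir -p "$atlas_pkg/hook/hive" || return 1
  cp -R "$workdir"/apache-atlas-hive-hook-*/hook/hive/. "$atlas_pkg/hook/hive/" || return 1

  # Make the hook jars visible to Hive via hive-env.sh.
  echo "export HIVE_AUX_JARS_PATH=$atlas_pkg/hook/hive" | tee -a "$hive_conf/hive-env.sh" || return 1

  # Copy the Atlas client configuration to the Hive conf directory.
  cp "$atlas_pkg/conf/atlas-application.properties" "$hive_conf/" || return 1

  rm -rf "$workdir"
}

# Example call with placeholder paths:
# install_atlas_hive_hook ./apache-atlas-1.1.0-hive-hook.tar.gz /opt/atlas /etc/hive/conf
</pre></div>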
<p>The following properties in atlas-application.properties control the thread pool and notification details:</p>
<div class="source"><pre class="prettyprint">
# Whether to run the hook synchronously. false is recommended to avoid delays in Hive query completion. Default: false
atlas.hook.hive.synchronous=false
# Number of retries on notification failure. Default: 3
atlas.hook.hive.numRetries=3
# Queue size for the thread pool. Default: 10000
atlas.hook.hive.queueSize=10000
# clusterName to use in the qualifiedName of entities. Default: primary
atlas.cluster.name=primary
# ZooKeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connect=
# ZooKeeper connection timeout in milliseconds. Default: 30000
atlas.kafka.zookeeper.connection.timeout.ms=30000
# ZooKeeper session timeout in milliseconds. Default: 60000
atlas.kafka.zookeeper.session.timeout.ms=60000
# ZooKeeper sync time in milliseconds. Default: 20
atlas.kafka.zookeeper.sync.time.ms=20
</pre></div>
<p>Other configurations for the Kafka notification producer can be specified by prefixing the configuration name with &quot;atlas.kafka.&quot;. For the list of configurations supported by the Kafka producer, please refer to <a class="externalLink" href="http://kafka.apache.org/documentation/#producerconfigs">Kafka Producer Configs</a>.</p></div>
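<p>To illustrate the prefixing rule, the sketch below prints Kafka producer settings in the form expected in atlas-application.properties. The helper name is hypothetical; bootstrap.servers and acks are standard Kafka producer configuration names, and the values shown are placeholders rather than recommendations.</p>
<div class="source"><pre class="prettyprint">
# Hypothetical helper: print each Kafka producer setting with the atlas.kafka. prefix.
atlas_kafka_properties() {
  for setting in "$@"; do
    printf 'atlas.kafka.%s\n' "$setting"
  done
}

# The values below are placeholders for illustration only.
atlas_kafka_properties "bootstrap.servers=localhost:9092" "acks=all"
</pre></div>
<p>The printed lines can then be added to atlas-application.properties alongside the properties listed above.</p>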
<div class="section">
<h3><a name="Column_Level_Lineage"></a>Column Level Lineage</h3>
<p>Starting with the 0.8-incubating version of Atlas, column-level lineage is captured in Atlas. The details are given below.</p></div>
<div class="section">
<h4><a name="Model"></a>Model</h4>
<p></p>
<ul>
<li>ColumnLineageProcess type is a subtype of Process</li></ul>
<p></p>
<ul>
<li>It relates an output Column to a set of input Columns or to the input Table</li></ul>
<p></p>
<ul>
<li>The lineage also captures the kind of dependency, as listed below:
<ul>
<li>SIMPLE: output column has the same value as the input</li>
<li>EXPRESSION: output column is transformed by some expression at runtime (e.g. a Hive SQL expression) on the input columns.</li>
<li>SCRIPT: output column is transformed by a user provided script.</li></ul></li></ul>
<p></p>
<ul>
<li>In the case of an EXPRESSION dependency, the expression attribute contains the expression in string form</li></ul>
<p></p>
<ul>
<li>Since Process links input and output DataSets, Column is a subtype of DataSet</li></ul></div>
<div class="section">
<h4><a name="Examples"></a>Examples</h4>
<p>For the simple CTAS below:</p>
<div class="source"><pre class="prettyprint">
create table t2 as select id, name from T1
</pre></div>
<p>The lineage is captured as:</p>
<p><img src="images/column_lineage_ex1.png" alt="Column lineage captured for the CTAS example" /></p></div>
<div class="section">
<h4><a name="Extracting_Lineage_from_Hive_commands"></a>Extracting Lineage from Hive commands</h4>
<ul>
<li>The HiveHook maps the LineageInfo in the HookContext to Column lineage instances</li>
<li>The LineageInfo in Hive provides column-level lineage for the final FileSinkOperator, linking its columns to the input columns in the Hive query</li></ul></div>
<div class="section">
<h3><a name="NOTES"></a>NOTES</h3>
<p></p>
<ul>
<li>Column-level lineage works with Hive version 1.2.1 after the patch for <a href="https://issues.apache.org/jira/browse/HIVE-13112">HIVE-13112</a> is applied to the Hive source</li>
<li>Since database, table and column names are case-insensitive in Hive, the corresponding names in entities are lowercase. Searches that query on entity names should therefore use lowercase</li>
<li>The following Hive operations are currently captured by the Hive hook:
<ul>
<li>create database</li>
<li>create table/view, create table as select</li>
<li>load, import, export</li>
<li>DMLs (insert)</li>
<li>alter database</li>
<li>alter table (skewed table information, stored as and protection are not supported)</li>
<li>alter view</li></ul></li></ul></div>
<div class="section">
<h3><a name="Importing_Hive_Metadata"></a>Importing Hive Metadata</h3>
<p>Apache Atlas provides a command-line utility, import-hive.sh, to import metadata of Apache Hive databases and tables into Apache Atlas. This utility can be used to initialize Apache Atlas with databases/tables present in Apache Hive. This utility supports importing metadata of a specific table, tables in a specific database or all databases and tables.</p>
<div class="source"><pre class="prettyprint">
Usage 1: &lt;atlas package&gt;/hook-bin/import-hive.sh
Usage 2: &lt;atlas package&gt;/hook-bin/import-hive.sh [-d &lt;database regex&gt; OR --database &lt;database regex&gt;] [-t &lt;table regex&gt; OR --table &lt;table regex&gt;]
Usage 3: &lt;atlas package&gt;/hook-bin/import-hive.sh [-f &lt;filename&gt;]
File Format:
database1:tbl1
database1:tbl2
database2:tbl1
</pre></div></div>
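<p>As a sketch of the file-based usage, the snippet below generates a list in the database:table format shown above and passes it to import-hive.sh with -f. The helper name and the ATLAS_PKG variable (standing in for the atlas package directory, with /opt/atlas as a placeholder default) are assumptions for illustration; the import only runs if the script is found at that location.</p>
<div class="source"><pre class="prettyprint">
# Hypothetical helper: print one database:table line per pair of arguments.
make_import_list() {
  while [ "$#" -ge 2 ]; do
    printf '%s:%s\n' "$1" "$2"
    shift 2
  done
}

# Write the list to a scratch file (tee also echoes it to the terminal).
list_file="$(mktemp)"
make_import_list database1 tbl1 database1 tbl2 database2 tbl1 | tee "$list_file"

# ATLAS_PKG is an assumed variable standing in for the atlas package directory.
import_script="${ATLAS_PKG:-/opt/atlas}/hook-bin/import-hive.sh"
if [ -x "$import_script" ]; then
  "$import_script" -f "$list_file"
fi
</pre></div>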
</div>
</div>
<hr/>
<footer>
<div class="container">
<div class="row">
<p><a href="https://www.apache.org/foundation/contributing"><img src="https://www.apache.org/images/SupportApache-small.png" alt="Support the ASF" id="asf-logo" height="20" width="20" /></a>Copyright © 2011-2018 The Apache Software Foundation. Licensed under the <a href="https://www.apache.org/licenses/">Apache License, Version 2.0</a>.<br/>
Apache Atlas, Atlas, Apache, the Apache feather logo are trademarks of the <a href="https://www.apache.org">Apache Software Foundation</a>.<br/>
All other marks mentioned may be trademarks or registered trademarks of their respective owners.</p>
</div>
<p id="poweredBy" class="pull-right"><a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"><img class="builtBy" alt="Built by Maven" src="./images/logos/maven-feather.png" /></a>
</p>
</div>
</footer>
</body>
</html>