<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Apache Software Foundation">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Gobblin Compliance Design - Apache Gobblin</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<link href="../../css/extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Gobblin Compliance Design";
var mkdocs_page_input_path = "developer-guide/Gobblin-Compliance-Design.md";
var mkdocs_page_url = null;
</script>
<script src="../../js/jquery-2.1.1.min.js" defer></script>
<script src="../../js/modernizr-2.8.3.min.js" defer></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Apache Gobblin</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" title="Type search term here" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="/">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Powered-By/">Companies Powered By Gobblin</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Getting-Started/">Getting Started</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Gobblin-Architecture/">Architecture</a>
</li>
<li class="toctree-l1">
<span class="caption-text">User Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../../user-guide/Working-with-Job-Configuration-Files/">Job Configuration Files</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-Deployment/">Deployment</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-as-a-Library/">Gobblin as a Library</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-CLI/">Gobblin CLI</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-Compliance/">Gobblin Compliance</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-on-Yarn/">Gobblin on Yarn</a>
</li>
<li class="">
<a class="" href="../../user-guide/Compaction/">Compaction</a>
</li>
<li class="">
<a class="" href="../../user-guide/State-Management-and-Watermarks/">State Management and Watermarks</a>
</li>
<li class="">
<a class="" href="../../user-guide/Working-with-the-ForkOperator/">Fork Operator</a>
</li>
<li class="">
<a class="" href="../../user-guide/Configuration-Properties-Glossary/">Configuration Glossary</a>
</li>
<li class="">
<a class="" href="../../user-guide/Source-schema-and-Converters/">Source schema and Converters</a>
</li>
<li class="">
<a class="" href="../../user-guide/Partitioned-Writers/">Partitioned Writers</a>
</li>
<li class="">
<a class="" href="../../user-guide/Monitoring/">Monitoring</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-template/">Template</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-Schedulers/">Schedulers</a>
</li>
<li class="">
<a class="" href="../../user-guide/Job-Execution-History-Store/">Job Execution History Store</a>
</li>
<li class="">
<a class="" href="../../user-guide/Building-Gobblin/">Building Gobblin</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-genericLoad/">Generic Configuration Loading</a>
</li>
<li class="">
<a class="" href="../../user-guide/Hive-Registration/">Hive Registration</a>
</li>
<li class="">
<a class="" href="../../user-guide/Config-Management/">Config Management</a>
</li>
<li class="">
<a class="" href="../../user-guide/Docker-Integration/">Docker Integration</a>
</li>
<li class="">
<a class="" href="../../user-guide/Troubleshooting/">Troubleshooting</a>
</li>
<li class="">
<a class="" href="../../user-guide/FAQs/">FAQs</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sources</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sources/AvroFileSource/">Avro files</a>
</li>
<li class="">
<a class="" href="../../sources/CopySource/">File copy</a>
</li>
<li class="">
<a class="" href="../../sources/QueryBasedSource/">Query based</a>
</li>
<li class="">
<a class="" href="../../sources/RestApiSource/">Rest Api</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleAnalyticsSource/">Google Analytics</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleDriveSource/">Google Drive</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleWebmaster/">Google Webmaster</a>
</li>
<li class="">
<a class="" href="../../sources/HadoopTextInputSource/">Hadoop Text Input</a>
</li>
<li class="">
<a class="" href="../../sources/HelloWorldSource/">Hello World</a>
</li>
<li class="">
<a class="" href="../../sources/HiveAvroToOrcSource/">Hive Avro-to-ORC</a>
</li>
<li class="">
<a class="" href="../../sources/HivePurgerSource/">Hive compliance purging</a>
</li>
<li class="">
<a class="" href="../../sources/SimpleJsonSource/">JSON</a>
</li>
<li class="">
<a class="" href="../../sources/KafkaSource/">Kafka</a>
</li>
<li class="">
<a class="" href="../../sources/MySQLSource/">MySQL</a>
</li>
<li class="">
<a class="" href="../../sources/OracleSource/">Oracle</a>
</li>
<li class="">
<a class="" href="../../sources/SalesforceSource/">Salesforce</a>
</li>
<li class="">
<a class="" href="../../sources/SftpSource/">SFTP</a>
</li>
<li class="">
<a class="" href="../../sources/SqlServerSource/">SQL Server</a>
</li>
<li class="">
<a class="" href="../../sources/TeradataSource/">Teradata</a>
</li>
<li class="">
<a class="" href="../../sources/WikipediaSource/">Wikipedia</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sinks (Writers)</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sinks/AvroHdfsDataWriter/">Avro HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/ParquetHdfsDataWriter/">Parquet HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/SimpleBytesWriter/">HDFS Byte array</a>
</li>
<li class="">
<a class="" href="../../sinks/ConsoleWriter/">Console</a>
</li>
<li class="">
<a class="" href="../../sinks/CouchbaseWriter/">Couchbase</a>
</li>
<li class="">
<a class="" href="../../sinks/Http/">HTTP</a>
</li>
<li class="">
<a class="" href="../../sinks/Gobblin-JDBC-Writer/">JDBC</a>
</li>
<li class="">
<a class="" href="../../sinks/Kafka/">Kafka</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Adaptors</span>
<ul class="subnav">
<li class="">
<a class="" href="../../adaptors/Gobblin-Distcp/">Gobblin Distcp</a>
</li>
<li class="">
<a class="" href="../../adaptors/Hive-Avro-To-ORC-Converter/">Hive Avro-To-Orc Converter</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Case Studies</span>
<ul class="subnav">
<li class="">
<a class="" href="../../case-studies/Kafka-HDFS-Ingestion/">Kafka-HDFS Ingestion</a>
</li>
<li class="">
<a class="" href="../../case-studies/Publishing-Data-to-S3/">Publishing Data to S3</a>
</li>
<li class="">
<a class="" href="../../case-studies/Writing-ORC-Data/">Writing ORC Data</a>
</li>
<li class="">
<a class="" href="../../case-studies/Hive-Distcp/">Hive Distcp</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Data Management</span>
<ul class="subnav">
<li class="">
<a class="" href="../../data-management/Gobblin-Retention/">Retention</a>
</li>
<li class="">
<a class="" href="../../data-management/DistcpNgEvents/">Distcp-NG events</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Metrics</span>
<ul class="subnav">
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics/">Quick Start</a>
</li>
<li class="">
<a class="" href="../../metrics/Existing-Reporters/">Existing Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Metrics-for-Gobblin-ETL/">Metrics for Gobblin ETL</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Architecture/">Gobblin Metrics Architecture</a>
</li>
<li class="">
<a class="" href="../../metrics/Implementing-New-Reporters/">Implementing New Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Performance/">Gobblin Metrics Performance</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Developer Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../Customization-for-New-Source/">Customization for New Source</a>
</li>
<li class="">
<a class="" href="../Customization-for-Converter-and-Operator/">Customization for Converter and Operator</a>
</li>
<li class="">
<a class="" href="../CodingStyle/">Code Style Guide</a>
</li>
<li class=" current">
<a class="current" href="./">Gobblin Compliance Design</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#introduction">Introduction</a></li>
<li class="toctree-l3"><a href="#design">Design</a></li>
<ul>
<li><a class="toctree-l4" href="#onboarding">Onboarding</a></li>
<li><a class="toctree-l4" href="#purger">Purger</a></li>
<li><a class="toctree-l4" href="#retention">Retention</a></li>
<li><a class="toctree-l4" href="#restore">Restore</a></li>
</ul>
</ul>
</li>
<li class="">
<a class="" href="../IDE-setup/">IDE setup</a>
</li>
<li class="">
<a class="" href="../Monitoring-Design/">Monitoring Design</a>
</li>
<li class="">
<a class="" href="../Documentation-Architecture/">Documentation Architecture</a>
</li>
<li class="">
<a class="" href="../Contributing/">Contributing</a>
</li>
<li class="">
<a class="" href="../GobblinModules/">Gobblin Modules</a>
</li>
<li class="">
<a class="" href="../HighLevelConsumer/">High Level Consumer</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Project</span>
<ul class="subnav">
<li class="">
<a class="" href="../../project/Feature-List/">Feature List</a>
</li>
<li class="">
<a class="" href="/people">Contributors and Team</a>
</li>
<li class="">
<a class="" href="../../project/Talks-and-Tech-Blogs/">Talks and Tech Blog Posts</a>
</li>
<li class="">
<a class="" href="../../project/Posts/">Posts</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Miscellaneous</span>
<ul class="subnav">
<li class="">
<a class="" href="../../miscellaneous/Camus-to-Gobblin-Migration/">Camus to Gobblin Migration</a>
</li>
<li class="">
<a class="" href="../../miscellaneous/Exactly-Once-Support/">Exactly Once Support</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Apache Gobblin</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>Developer Guide &raquo;</li>
<li>Gobblin Compliance Design</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<div class="toc">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#design">Design</a><ul>
<li><a href="#onboarding">Onboarding</a></li>
<li><a href="#purger">Purger</a><ul>
<li><a href="#gobblin-constructs">Gobblin constructs</a></li>
<li><a href="#hive-operations">Hive operations</a></li>
</ul>
</li>
<li><a href="#retention">Retention</a></li>
<li><a href="#restore">Restore</a></li>
</ul>
</li>
</ul>
</div>
<h1 id="introduction">Introduction</h1>
<hr />
<p>The Gobblin Compliance module allows for data purging to meet regulatory compliance requirements. The module includes purging, retention and restore functionality for datasets.</p>
<p>Purging is performed using Hive, so datasets can be purged in any format that Hive can read from and write to, for example ORC and Parquet. The purger is also built on top of the Gobblin framework, so it takes advantage of the fault tolerance, scalability and flexibility that Gobblin provides.</p>
<p>The <a href="../../user-guide/Gobblin-Compliance/">User Guide</a> describes how to onboard a dataset for purging.</p>
<h1 id="design">Design</h1>
<hr />
<p>The elements of the Compliance design are:</p>
<ul>
<li>The onboarding process</li>
<li>The purge process</li>
<li>The retention process</li>
<li>The restore process</li>
</ul>
<h2 id="onboarding">Onboarding</h2>
<hr />
<p>A dataset is onboarded to the Purger with these steps:</p>
<ol>
<li>The database or table that is to be considered for purging is added to the whitelist</li>
<li>Every table that is to be purged includes the necessary information for purging (dataset descriptor) as a JSON string in its TBLPROPERTIES</li>
</ol>
<p>The purger iterates over all whitelisted tables and, for each of them, looks for the dataset descriptor, which provides the information the purger requires to proceed with the purge process.</p>
<p>With this information, the purger iterates over the partitions of the table that needs to be purged and proceeds to purge each partition of the table individually.</p>
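<p>As an illustration of the second onboarding step, a dataset descriptor could be attached to a table with a statement along these lines. The property key <code>dataset.descriptor</code> and the JSON field <code>columnName</code> are assumptions made for this sketch, as is the <code>tracking.event</code> table; the User Guide is the reference for the names the purger actually expects.</p>
<pre><code class="hive">-- Hypothetical property key and JSON field; check the User Guide for the real names.
ALTER TABLE tracking.event
SET TBLPROPERTIES ('dataset.descriptor' = '{"columnName":"metadata.guestid"}');
</code></pre>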
<h2 id="purger">Purger</h2>
<hr />
<p>The purger code is mostly in the <code>gobblin.compliance.purger</code> package.</p>
<p>The elements of the purger are:</p>
<ul>
<li>The Gobblin constructs</li>
<li>The Hive operations</li>
</ul>
<h3 id="gobblin-constructs">Gobblin constructs</h3>
<hr />
<p>The Gobblin constructs that make up the Purger are:</p>
<ul>
<li><code>HivePurgerSource</code> generates a WorkUnit per partition that needs to be purged</li>
<li><code>HivePurgerExtractor</code> instantiates a <code>PurgeableHivePartitionDataset</code> object that encapsulates all the information required to purge the partition</li>
<li>For each partition, <code>HivePurgerConverter</code> populates the purge queries into the <code>PurgeableHivePartitionDataset</code> object</li>
<li>The purge queries are executed by <code>HivePurgerWriter</code> </li>
<li>The <code>HivePurgerPublisher</code> moves successful WorkUnits to the <code>COMMITTED</code> state</li>
</ul>
<h3 id="hive-operations">Hive operations</h3>
<hr />
<p>The purging process operates as follows:</p>
<ul>
<li>The partition information including location and partitioning scheme is determined from the metadata of the partition</li>
<li>A new external staging table is created using the Hive <code>LIKE</code> construct of the current table that is being purged</li>
<li>The location of this staging table on HDFS is a new folder within the table location with the current timestamp</li>
<li>The purge query executes a <code>LEFT OUTER JOIN</code> of the original table against the table containing the ids whose data is to be purged, and <code>INSERT OVERWRITE</code>s the result into the staging table and thereby into its location. Once this query returns, the location contains the purged data, that is, the partition's data without the records to be purged</li>
<li>Because the next step <code>ALTER</code>s the original partition location to the new staging table location, the current (original) location of the partition is first preserved by creating a backup table pointing to it. The original data is not moved immediately, to avoid breaking any in-flight queries</li>
<li>The next step is to <code>ALTER</code> the partition location to the location containing the purged data</li>
<li>The final step is to <code>DROP</code> the staging table. This drops only the metadata, not the data</li>
</ul>
<p>Taking as an example a <code>tracking.event</code> table and its <code>datepartition=2017-02-16-00/is_guest=0</code> partition, the purge process would be as follows:</p>
<ul>
<li>Let's assume the <code>tracking.event</code> table is located at the location <code>/user/tracking/event/</code></li>
<li>The full partition name would be <code>tracking@event@datepartition=2017-02-16-00/is_guest=0</code> per Hive, and let's assume the data is located at <code>/user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0/</code></li>
<li>A staging table <code>tracking.event_staging_1234567890123</code> is created <code>LIKE tracking.event</code> with the location <code>/user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0/</code>, which is within the original table location (<code>1234567890123</code> is an example timestamp used for clarity; a real timestamp looks more like <code>1487154972824</code>)</li>
<li>The purge query would be similar to the following (assuming <code>u_purger.guestids</code> holds the ids whose data is to be purged):</li>
</ul>
<pre><code class="hive">INSERT OVERWRITE TABLE tracking.event_staging_1234567890123
PARTITION (datepartition='2017-02-16-00',is_guest='0')
SELECT /*+MAPJOIN(b) */ a.metadata.guestid, a.col_a, a.col_b
FROM tracking.event a
LEFT JOIN u_purger.guestids b
ON a.metadata.guestid=b.guestid
WHERE b.guestid IS NULL AND a.datepartition='2017-02-16-00' AND a.is_guest='0'
</code></pre>
<ul>
<li>A backup table <code>tracking.event_backup_1234567890123</code> is created with PARTITION <code>datepartition=2017-02-16-00,is_guest=0</code> pointing to the original location <code>/user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0</code></li>
<li>The partition location of <code>tracking@event@datepartition=2017-02-16-00/is_guest=0</code> is updated to be <code>/user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0</code></li>
<li>The <code>tracking.event_staging_1234567890123</code> table is dropped</li>
</ul>
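<p>The last three steps could be expressed roughly as the Hive statements below. This is a sketch based on the description above, not the exact queries the purger generates. In particular, the backup table is created here without an explicit table location, which is an assumption, and the bare paths mirror the example; Hive generally expects a fully qualified URI (for example with an <code>hdfs://</code> scheme) in <code>SET LOCATION</code>.</p>
<pre><code class="hive">-- Backup table: same schema as tracking.event, with a partition that points
-- at the original (unpurged) data so that it is preserved.
CREATE EXTERNAL TABLE IF NOT EXISTS tracking.event_backup_1234567890123
LIKE tracking.event;

ALTER TABLE tracking.event_backup_1234567890123
ADD IF NOT EXISTS PARTITION (datepartition='2017-02-16-00', is_guest='0')
LOCATION '/user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0';

-- Point the original partition at the location holding the purged data.
ALTER TABLE tracking.event
PARTITION (datepartition='2017-02-16-00', is_guest='0')
SET LOCATION '/user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0';

-- The staging table is external, so dropping it removes only its metadata.
DROP TABLE IF EXISTS tracking.event_staging_1234567890123;
</code></pre>
<p>Creating the backup partition before altering the original partition keeps a reference to the unpurged data at every point in the sequence, which is what later allows retention to clean it up or restore to bring it back.</p>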
<h2 id="retention">Retention</h2>
<hr />
<p>The retention code is mostly in the <code>gobblin.compliance.retention</code> package.</p>
<p>The retention process builds on top of <a href="../../data-management/Gobblin-Retention/">Gobblin Retention</a> and performs the following operations:</p>
<ul>
<li>Cleanup of backup data beyond a specified policy</li>
<li>Cleanup of any staging tables not cleaned up in case of failures</li>
<li>Reaping of backup locations from the original location</li>
<li>Cleanup of trash data from the restore process beyond a specified policy</li>
</ul>
<h2 id="restore">Restore</h2>
<hr />
<p>The restore code is mostly in the <code>gobblin.compliance.restore</code> package.</p>
<p>The restore process allows a dataset to be restored to a backup dataset if required.</p>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../IDE-setup/" class="btn btn-neutral float-right" title="IDE setup">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../CodingStyle/" class="btn btn-neutral" title="Code Style Guide"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org" rel="nofollow">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme" rel="nofollow">theme</a> provided by <a href="https://readthedocs.org" rel="nofollow">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../CodingStyle/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../IDE-setup/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js" defer></script>
<script src="../../js/extra.js" defer></script>
<script src="../../search/main.js" defer></script>
</body>
</html>