<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Apache Software Foundation">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Writing ORC Data - Apache Gobblin</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<link href="../../css/extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Writing ORC Data";
var mkdocs_page_input_path = "case-studies/Writing-ORC-Data.md";
var mkdocs_page_url = null;
</script>
<script src="../../js/jquery-2.1.1.min.js" defer></script>
<script src="../../js/modernizr-2.8.3.min.js" defer></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Apache Gobblin</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" title="Type search term here" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="/">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Powered-By/">Companies Powered By Gobblin</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Getting-Started/">Getting Started</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Gobblin-Architecture/">Architecture</a>
</li>
<li class="toctree-l1">
<span class="caption-text">User Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../../user-guide/Working-with-Job-Configuration-Files/">Job Configuration Files</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-Deployment/">Deployment</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-as-a-Library/">Gobblin as a Library</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-CLI/">Gobblin CLI</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-Compliance/">Gobblin Compliance</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-on-Yarn/">Gobblin on Yarn</a>
</li>
<li class="">
<a class="" href="../../user-guide/Compaction/">Compaction</a>
</li>
<li class="">
<a class="" href="../../user-guide/State-Management-and-Watermarks/">State Management and Watermarks</a>
</li>
<li class="">
<a class="" href="../../user-guide/Working-with-the-ForkOperator/">Fork Operator</a>
</li>
<li class="">
<a class="" href="../../user-guide/Configuration-Properties-Glossary/">Configuration Glossary</a>
</li>
<li class="">
<a class="" href="../../user-guide/Source-schema-and-Converters/">Source schema and Converters</a>
</li>
<li class="">
<a class="" href="../../user-guide/Partitioned-Writers/">Partitioned Writers</a>
</li>
<li class="">
<a class="" href="../../user-guide/Monitoring/">Monitoring</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-template/">Template</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-Schedulers/">Schedulers</a>
</li>
<li class="">
<a class="" href="../../user-guide/Job-Execution-History-Store/">Job Execution History Store</a>
</li>
<li class="">
<a class="" href="../../user-guide/Building-Gobblin/">Building Gobblin</a>
</li>
<li class="">
<a class="" href="../../user-guide/Gobblin-genericLoad/">Generic Configuration Loading</a>
</li>
<li class="">
<a class="" href="../../user-guide/Hive-Registration/">Hive Registration</a>
</li>
<li class="">
<a class="" href="../../user-guide/Config-Management/">Config Management</a>
</li>
<li class="">
<a class="" href="../../user-guide/Docker-Integration/">Docker Integration</a>
</li>
<li class="">
<a class="" href="../../user-guide/Troubleshooting/">Troubleshooting</a>
</li>
<li class="">
<a class="" href="../../user-guide/FAQs/">FAQs</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sources</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sources/AvroFileSource/">Avro files</a>
</li>
<li class="">
<a class="" href="../../sources/CopySource/">File copy</a>
</li>
<li class="">
<a class="" href="../../sources/QueryBasedSource/">Query based</a>
</li>
<li class="">
<a class="" href="../../sources/RestApiSource/">Rest Api</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleAnalyticsSource/">Google Analytics</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleDriveSource/">Google Drive</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleWebmaster/">Google Webmaster</a>
</li>
<li class="">
<a class="" href="../../sources/HadoopTextInputSource/">Hadoop Text Input</a>
</li>
<li class="">
<a class="" href="../../sources/HelloWorldSource/">Hello World</a>
</li>
<li class="">
<a class="" href="../../sources/HiveAvroToOrcSource/">Hive Avro-to-ORC</a>
</li>
<li class="">
<a class="" href="../../sources/HivePurgerSource/">Hive compliance purging</a>
</li>
<li class="">
<a class="" href="../../sources/SimpleJsonSource/">JSON</a>
</li>
<li class="">
<a class="" href="../../sources/KafkaSource/">Kafka</a>
</li>
<li class="">
<a class="" href="../../sources/MySQLSource/">MySQL</a>
</li>
<li class="">
<a class="" href="../../sources/OracleSource/">Oracle</a>
</li>
<li class="">
<a class="" href="../../sources/SalesforceSource/">Salesforce</a>
</li>
<li class="">
<a class="" href="../../sources/SftpSource/">SFTP</a>
</li>
<li class="">
<a class="" href="../../sources/SqlServerSource/">SQL Server</a>
</li>
<li class="">
<a class="" href="../../sources/TeradataSource/">Teradata</a>
</li>
<li class="">
<a class="" href="../../sources/WikipediaSource/">Wikipedia</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sinks (Writers)</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sinks/AvroHdfsDataWriter/">Avro HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/ParquetHdfsDataWriter/">Parquet HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/SimpleBytesWriter/">HDFS Byte array</a>
</li>
<li class="">
<a class="" href="../../sinks/ConsoleWriter/">Console</a>
</li>
<li class="">
<a class="" href="../../sinks/CouchbaseWriter/">Couchbase</a>
</li>
<li class="">
<a class="" href="../../sinks/Http/">HTTP</a>
</li>
<li class="">
<a class="" href="../../sinks/Gobblin-JDBC-Writer/">JDBC</a>
</li>
<li class="">
<a class="" href="../../sinks/Kafka/">Kafka</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Adaptors</span>
<ul class="subnav">
<li class="">
<a class="" href="../../adaptors/Gobblin-Distcp/">Gobblin Distcp</a>
</li>
<li class="">
<a class="" href="../../adaptors/Hive-Avro-To-ORC-Converter/">Hive Avro-To-Orc Converter</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Case Studies</span>
<ul class="subnav">
<li class="">
<a class="" href="../Kafka-HDFS-Ingestion/">Kafka-HDFS Ingestion</a>
</li>
<li class="">
<a class="" href="../Publishing-Data-to-S3/">Publishing Data to S3</a>
</li>
<li class=" current">
<a class="current" href="./">Writing ORC Data</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#table-of-contents">Table of Contents</a></li>
<li class="toctree-l3"><a href="#introduction">Introduction</a></li>
<li class="toctree-l3"><a href="#hive-serde-integration">Hive SerDe Integration</a></li>
<li class="toctree-l3"><a href="#writing-to-an-orc-file">Writing to an ORC File</a></li>
<ul>
<li><a class="toctree-l4" href="#data-flow">Data Flow</a></li>
</ul>
<li class="toctree-l3"><a href="#extending-gobblins-serde-integration">Extending Gobblin's SerDe Integration</a></li>
</ul>
</li>
<li class="">
<a class="" href="../Hive-Distcp/">Hive Distcp</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Data Management</span>
<ul class="subnav">
<li class="">
<a class="" href="../../data-management/Gobblin-Retention/">Retention</a>
</li>
<li class="">
<a class="" href="../../data-management/DistcpNgEvents/">Distcp-NG events</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Metrics</span>
<ul class="subnav">
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics/">Quick Start</a>
</li>
<li class="">
<a class="" href="../../metrics/Existing-Reporters/">Existing Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Metrics-for-Gobblin-ETL/">Metrics for Gobblin ETL</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Architecture/">Gobblin Metrics Architecture</a>
</li>
<li class="">
<a class="" href="../../metrics/Implementing-New-Reporters/">Implementing New Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Performance/">Gobblin Metrics Performance</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Developer Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../../developer-guide/Customization-for-New-Source/">Customization for New Source</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Customization-for-Converter-and-Operator/">Customization for Converter and Operator</a>
</li>
<li class="">
<a class="" href="../../developer-guide/CodingStyle/">Code Style Guide</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Gobblin-Compliance-Design/">Gobblin Compliance Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/IDE-setup/">IDE setup</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Monitoring-Design/">Monitoring Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Documentation-Architecture/">Documentation Architecture</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Contributing/">Contributing</a>
</li>
<li class="">
<a class="" href="../../developer-guide/GobblinModules/">Gobblin Modules</a>
</li>
<li class="">
<a class="" href="../../developer-guide/HighLevelConsumer/">High Level Consumer</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Project</span>
<ul class="subnav">
<li class="">
<a class="" href="../../project/Feature-List/">Feature List</a>
</li>
<li class="">
<a class="" href="/people">Contributors and Team</a>
</li>
<li class="">
<a class="" href="../../project/Talks-and-Tech-Blogs/">Talks and Tech Blog Posts</a>
</li>
<li class="">
<a class="" href="../../project/Posts/">Posts</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Miscellaneous</span>
<ul class="subnav">
<li class="">
<a class="" href="../../miscellaneous/Camus-to-Gobblin-Migration/">Camus to Gobblin Migration</a>
</li>
<li class="">
<a class="" href="../../miscellaneous/Exactly-Once-Support/">Exactly Once Support</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Apache Gobblin</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>Case Studies &raquo;</li>
<li>Writing ORC Data</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="table-of-contents">Table of Contents</h1>
<div class="toc">
<ul>
<li><a href="#table-of-contents">Table of Contents</a></li>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#hive-serde-integration">Hive SerDe Integration</a></li>
<li><a href="#writing-to-an-orc-file">Writing to an ORC File</a><ul>
<li><a href="#data-flow">Data Flow</a></li>
</ul>
</li>
<li><a href="#extending-gobblins-serde-integration">Extending Gobblin's SerDe Integration</a></li>
</ul>
</div>
<h1 id="introduction">Introduction</h1>
<p>Gobblin can write data to ORC files by leveraging Hive's SerDe library. Gobblin integrates natively with Hive's SerDe library through the <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-hive-registration/src/main/java/org/apache/gobblin/hive/HiveSerDeManager.java" rel="nofollow">HiveSerDeWrapper</a> class.</p>
<p>This document briefly explains how Gobblin integrates with Hive's SerDe library, and shows an example of writing ORC files.</p>
<h1 id="hive-serde-integration">Hive SerDe Integration</h1>
<p><a href="https://cwiki.apache.org/confluence/display/Hive/SerDe">Hive's SerDe library</a> defines the interface Hive uses for serialization and deserialization of data. The library ships with built-in SerDes for Avro, ORC, Parquet, CSV, and JSON, and users are free to define custom SerDes.</p>
<p>Gobblin integrates with Hive SerDes in a few different places. The integration points relevant to this document are:</p>
<ul>
<li><code>HiveSerDeWrapper</code> is a wrapper around Hive's SerDe library that provides utilities and structure that the rest of Gobblin can interact with</li>
<li><code>HiveSerDeConverter</code> takes a <code>Writable</code> object in a specific format and converts it to the <code>Writable</code> of another format (e.g. from <code>AvroGenericRecordWritable</code> to <code>OrcSerdeRow</code>)</li>
<li><code>HiveWritableHdfsDataWriter</code> writes a <code>Writable</code> object to a specific file; typically it writes the output of a <code>HiveSerDeConverter</code></li>
</ul>
<h1 id="writing-to-an-orc-file">Writing to an ORC File</h1>
<p>An end-to-end example of writing to an ORC file is provided in the configuration found <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia-orc.pull" rel="nofollow">here</a>. This <code>.pull</code> file is almost identical to the Wikipedia example discussed in the <a href="../../Getting-Started/">Getting Started Guide</a>; the only difference is that the output is written in ORC instead of Avro. The configuration file can be used directly as a template for writing data to ORC files. The list that follows explains which configuration options need to change, and why.</p>
<ul>
<li><code>converter.classes</code> requires two additional converters: <code>gobblin.converter.avro.AvroRecordToAvroWritableConverter</code> and <code>gobblin.converter.serde.HiveSerDeConverter</code><ul>
<li>The first converter (the <code>WikipediaConverter</code>) outputs Avro <code>GenericRecord</code>s</li>
<li>These records must be converted to <code>Writable</code> objects before the Hive SerDe can process them, which is where the <code>AvroRecordToAvroWritableConverter</code> comes in</li>
<li>The <code>HiveSerDeConverter</code> does the actual heavy lifting of converting the Avro records to ORC records</li>
</ul>
</li>
<li>To configure the <code>HiveSerDeConverter</code>, the following properties need to be added:<ul>
<li><code>serde.deserializer.type=AVRO</code> says that the records being fed into the converter are Avro records<ul>
<li><code>avro.schema.literal</code> or <code>avro.schema.url</code> must be set when using this deserializer so that the Hive SerDe knows what Avro Schema to use when converting the record</li>
</ul>
</li>
<li><code>serde.serializer.type=ORC</code> says that the records returned by the converter should be ORC records</li>
</ul>
</li>
<li><code>writer.builder.class</code> should be set to <code>gobblin.writer.HiveWritableHdfsDataWriterBuilder</code><ul>
<li>This writer class will take the output of the <code>HiveSerDeConverter</code> and write the actual ORC records to an ORC file</li>
</ul>
</li>
<li><code>writer.output.format</code> should be set to <code>ORC</code>; this ensures the files produced end with the <code>.orc</code> file extension</li>
<li><code>fork.record.queue.capacity</code> should be set to <code>1</code><ul>
<li>This ensures that records are not cached before they are passed to the writer. This is necessary because the <code>OrcSerde</code> caches the object it uses to serialize records, and it does not allow copying of ORC records</li>
</ul>
</li>
</ul>
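<p>As a rough sketch, the options above could be collected into a <code>.pull</code> fragment like the one that follows. Only the property names and values quoted in the list come from this document. The package name of <code>WikipediaConverter</code> and the schema placeholder are assumptions made for illustration, so consult the linked <code>wikipedia-orc.pull</code> file for the authoritative values.</p>
<pre><code class="properties"># Converter chain: Wikipedia JSON -&gt; Avro GenericRecord -&gt; Avro Writable -&gt; ORC Writable.
# Assumption: the package of WikipediaConverter is illustrative, not taken from this document.
converter.classes=gobblin.example.wikipedia.WikipediaConverter,gobblin.converter.avro.AvroRecordToAvroWritableConverter,gobblin.converter.serde.HiveSerDeConverter

# HiveSerDeConverter input and output formats.
serde.deserializer.type=AVRO
serde.serializer.type=ORC

# Required by the AVRO deserializer; set either avro.schema.literal or avro.schema.url.
# Placeholder value (hypothetical): replace with the Avro schema of the records being converted.
avro.schema.literal=YOUR_AVRO_SCHEMA_JSON

# Writer that takes the HiveSerDeConverter output and writes ORC files with the .orc extension.
writer.builder.class=gobblin.writer.HiveWritableHdfsDataWriterBuilder
writer.output.format=ORC

# Pass records to the writer one at a time, since OrcSerde reuses its serialization object.
fork.record.queue.capacity=1
</code></pre>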
<p>The example job can be run the same way the regular Wikipedia job is run, except the output will be in the ORC format.</p>
<h2 id="data-flow">Data Flow</h2>
<p>For the Wikipedia to ORC example, data flows in the following manner:</p>
<ul>
<li>It is extracted from Wikipedia via the <code>WikipediaExtractor</code>, which also converts each Wikipedia entry into a <code>JsonElement</code></li>
<li>The <code>WikipediaConverter</code> then converts the Wikipedia JSON entry into an Avro <code>GenericRecord</code></li>
<li>The <code>AvroRecordToAvroWritableConverter</code> converts the Avro <code>GenericRecord</code> to an <code>AvroGenericRecordWritable</code></li>
<li>The <code>HiveSerDeConverter</code> converts the <code>AvroGenericRecordWritable</code> to an <code>OrcSerdeRow</code></li>
<li>The <code>HiveWritableHdfsDataWriter</code> uses the <code>OrcOutputFormat</code> to write the <code>OrcSerdeRow</code> to an <code>OrcFile</code></li>
</ul>
<h1 id="extending-gobblins-serde-integration">Extending Gobblin's SerDe Integration</h1>
<p>While this tutorial only discusses Avro-to-ORC conversion, it should be relatively straightforward to apply the same approach to convert CSV, JSON, and other data into ORC.</p>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../Hive-Distcp/" class="btn btn-neutral float-right" title="Hive Distcp">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../Publishing-Data-to-S3/" class="btn btn-neutral" title="Publishing Data to S3"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org" rel="nofollow">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme" rel="nofollow">theme</a> provided by <a href="https://readthedocs.org" rel="nofollow">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../Publishing-Data-to-S3/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../Hive-Distcp/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js" defer></script>
<script src="../../js/extra.js" defer></script>
<script src="../../search/main.js" defer></script>
</body>
</html>