<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Apache Software Foundation">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Gobblin as a Library - Apache Gobblin</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<link href="../../css/extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Gobblin as a Library";
var mkdocs_page_input_path = "user-guide/Gobblin-as-a-Library.md";
var mkdocs_page_url = null;
</script>
<script src="../../js/jquery-2.1.1.min.js" defer></script>
<script src="../../js/modernizr-2.8.3.min.js" defer></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Apache Gobblin</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" title="Type search term here" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="/">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Powered-By/">Companies Powered By Gobblin</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Getting-Started/">Getting Started</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Gobblin-Architecture/">Architecture</a>
</li>
<li class="toctree-l1">
<span class="caption-text">User Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../Working-with-Job-Configuration-Files/">Job Configuration Files</a>
</li>
<li class="">
<a class="" href="../Gobblin-Deployment/">Deployment</a>
</li>
<li class=" current">
<a class="current" href="./">Gobblin as a Library</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#table-of-contents">Table of Contents</a></li>
<li class="toctree-l3"><a href="#using-gobblin-as-a-library">Using Gobblin as a Library</a></li>
<li class="toctree-l3"><a href="#creating-an-embedded-gobblin-instance">Creating an Embedded Gobblin instance</a></li>
<li class="toctree-l3"><a href="#configuring-embedded-gobblin">Configuring Embedded Gobblin</a></li>
<li class="toctree-l3"><a href="#running-embedded-gobblin">Running Embedded Gobblin</a></li>
<li class="toctree-l3"><a href="#extending-embedded-gobblin">Extending Embedded Gobblin</a></li>
</ul>
</li>
<li class="">
<a class="" href="../Gobblin-CLI/">Gobblin CLI</a>
</li>
<li class="">
<a class="" href="../Gobblin-Compliance/">Gobblin Compliance</a>
</li>
<li class="">
<a class="" href="../Gobblin-on-Yarn/">Gobblin on Yarn</a>
</li>
<li class="">
<a class="" href="../Compaction/">Compaction</a>
</li>
<li class="">
<a class="" href="../State-Management-and-Watermarks/">State Management and Watermarks</a>
</li>
<li class="">
<a class="" href="../Working-with-the-ForkOperator/">Fork Operator</a>
</li>
<li class="">
<a class="" href="../Configuration-Properties-Glossary/">Configuration Glossary</a>
</li>
<li class="">
<a class="" href="../Source-schema-and-Converters/">Source schema and Converters</a>
</li>
<li class="">
<a class="" href="../Partitioned-Writers/">Partitioned Writers</a>
</li>
<li class="">
<a class="" href="../Monitoring/">Monitoring</a>
</li>
<li class="">
<a class="" href="../Gobblin-template/">Template</a>
</li>
<li class="">
<a class="" href="../Gobblin-Schedulers/">Schedulers</a>
</li>
<li class="">
<a class="" href="../Job-Execution-History-Store/">Job Execution History Store</a>
</li>
<li class="">
<a class="" href="../Building-Gobblin/">Building Gobblin</a>
</li>
<li class="">
<a class="" href="../Gobblin-genericLoad/">Generic Configuration Loading</a>
</li>
<li class="">
<a class="" href="../Hive-Registration/">Hive Registration</a>
</li>
<li class="">
<a class="" href="../Config-Management/">Config Management</a>
</li>
<li class="">
<a class="" href="../Docker-Integration/">Docker Integration</a>
</li>
<li class="">
<a class="" href="../Troubleshooting/">Troubleshooting</a>
</li>
<li class="">
<a class="" href="../FAQs/">FAQs</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sources</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sources/AvroFileSource/">Avro files</a>
</li>
<li class="">
<a class="" href="../../sources/CopySource/">File copy</a>
</li>
<li class="">
<a class="" href="../../sources/QueryBasedSource/">Query based</a>
</li>
<li class="">
<a class="" href="../../sources/RestApiSource/">Rest Api</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleAnalyticsSource/">Google Analytics</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleDriveSource/">Google Drive</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleWebmaster/">Google Webmaster</a>
</li>
<li class="">
<a class="" href="../../sources/HadoopTextInputSource/">Hadoop Text Input</a>
</li>
<li class="">
<a class="" href="../../sources/HelloWorldSource/">Hello World</a>
</li>
<li class="">
<a class="" href="../../sources/HiveAvroToOrcSource/">Hive Avro-to-ORC</a>
</li>
<li class="">
<a class="" href="../../sources/HivePurgerSource/">Hive compliance purging</a>
</li>
<li class="">
<a class="" href="../../sources/SimpleJsonSource/">JSON</a>
</li>
<li class="">
<a class="" href="../../sources/KafkaSource/">Kafka</a>
</li>
<li class="">
<a class="" href="../../sources/MySQLSource/">MySQL</a>
</li>
<li class="">
<a class="" href="../../sources/OracleSource/">Oracle</a>
</li>
<li class="">
<a class="" href="../../sources/SalesforceSource/">Salesforce</a>
</li>
<li class="">
<a class="" href="../../sources/SftpSource/">SFTP</a>
</li>
<li class="">
<a class="" href="../../sources/SqlServerSource/">SQL Server</a>
</li>
<li class="">
<a class="" href="../../sources/TeradataSource/">Teradata</a>
</li>
<li class="">
<a class="" href="../../sources/WikipediaSource/">Wikipedia</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sinks (Writers)</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sinks/AvroHdfsDataWriter/">Avro HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/ParquetHdfsDataWriter/">Parquet HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/SimpleBytesWriter/">HDFS Byte array</a>
</li>
<li class="">
<a class="" href="../../sinks/ConsoleWriter/">Console</a>
</li>
<li class="">
<a class="" href="../../sinks/CouchbaseWriter/">Couchbase</a>
</li>
<li class="">
<a class="" href="../../sinks/Http/">HTTP</a>
</li>
<li class="">
<a class="" href="../../sinks/Gobblin-JDBC-Writer/">JDBC</a>
</li>
<li class="">
<a class="" href="../../sinks/Kafka/">Kafka</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Adaptors</span>
<ul class="subnav">
<li class="">
<a class="" href="../../adaptors/Gobblin-Distcp/">Gobblin Distcp</a>
</li>
<li class="">
<a class="" href="../../adaptors/Hive-Avro-To-ORC-Converter/">Hive Avro-To-Orc Converter</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Case Studies</span>
<ul class="subnav">
<li class="">
<a class="" href="../../case-studies/Kafka-HDFS-Ingestion/">Kafka-HDFS Ingestion</a>
</li>
<li class="">
<a class="" href="../../case-studies/Publishing-Data-to-S3/">Publishing Data to S3</a>
</li>
<li class="">
<a class="" href="../../case-studies/Writing-ORC-Data/">Writing ORC Data</a>
</li>
<li class="">
<a class="" href="../../case-studies/Hive-Distcp/">Hive Distcp</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Data Management</span>
<ul class="subnav">
<li class="">
<a class="" href="../../data-management/Gobblin-Retention/">Retention</a>
</li>
<li class="">
<a class="" href="../../data-management/DistcpNgEvents/">Distcp-NG events</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Metrics</span>
<ul class="subnav">
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics/">Quick Start</a>
</li>
<li class="">
<a class="" href="../../metrics/Existing-Reporters/">Existing Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Metrics-for-Gobblin-ETL/">Metrics for Gobblin ETL</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Architecture/">Gobblin Metrics Architecture</a>
</li>
<li class="">
<a class="" href="../../metrics/Implementing-New-Reporters/">Implementing New Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Performance/">Gobblin Metrics Performance</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Developer Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../../developer-guide/Customization-for-New-Source/">Customization for New Source</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Customization-for-Converter-and-Operator/">Customization for Converter and Operator</a>
</li>
<li class="">
<a class="" href="../../developer-guide/CodingStyle/">Code Style Guide</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Gobblin-Compliance-Design/">Gobblin Compliance Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/IDE-setup/">IDE setup</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Monitoring-Design/">Monitoring Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Documentation-Architecture/">Documentation Architecture</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Contributing/">Contributing</a>
</li>
<li class="">
<a class="" href="../../developer-guide/GobblinModules/">Gobblin Modules</a>
</li>
<li class="">
<a class="" href="../../developer-guide/HighLevelConsumer/">High Level Consumer</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Project</span>
<ul class="subnav">
<li class="">
<a class="" href="../../project/Feature-List/">Feature List</a>
</li>
<li class="">
<a class="" href="/people">Contributors and Team</a>
</li>
<li class="">
<a class="" href="../../project/Talks-and-Tech-Blogs/">Talks and Tech Blog Posts</a>
</li>
<li class="">
<a class="" href="../../project/Posts/">Posts</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Miscellaneous</span>
<ul class="subnav">
<li class="">
<a class="" href="../../miscellaneous/Camus-to-Gobblin-Migration/">Camus to Gobblin Migration</a>
</li>
<li class="">
<a class="" href="../../miscellaneous/Exactly-Once-Support/">Exactly Once Support</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Apache Gobblin</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>User Guide &raquo;</li>
<li>Gobblin as a Library</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h2 id="table-of-contents">Table of Contents</h2>
<div class="toc">
<ul>
<li><a href="#table-of-contents">Table of Contents</a></li>
<li><a href="#using-gobblin-as-a-library">Using Gobblin as a Library</a></li>
<li><a href="#creating-an-embedded-gobblin-instance">Creating an Embedded Gobblin instance</a></li>
<li><a href="#configuring-embedded-gobblin">Configuring Embedded Gobblin</a></li>
<li><a href="#running-embedded-gobblin">Running Embedded Gobblin</a></li>
<li><a href="#extending-embedded-gobblin">Extending Embedded Gobblin</a></li>
</ul>
</div>
<h2 id="using-gobblin-as-a-library">Using Gobblin as a Library</h2>
<p>A Gobblin ingestion flow can be embedded into a Java application using the <code>EmbeddedGobblin</code> class.</p>
<p>The following code runs a Hello-World Gobblin job as an embedded application using a template. The job simply prints "Hello World &lt;i&gt;!" to stdout a few times.</p>
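<pre><code class="java">// Build an embedded job from the hello-world template resource.
EmbeddedGobblin embeddedGobblin = new EmbeddedGobblin(&quot;TestJob&quot;)
    .setTemplate(ResourceBasedJobTemplate.forResourcePath(&quot;templates/hello-world.template&quot;));
// run() blocks until the job finishes and returns its result.
JobExecutionResult result = embeddedGobblin.run();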
</code></pre>
<p>Note: <code>EmbeddedGobblin</code> starts and destroys an embedded Gobblin instance every time <code>run()</code> is called. If an application needs to run a large number of Gobblin jobs, it should instantiate and manage its own Gobblin driver.</p>
<h2 id="creating-an-embedded-gobblin-instance">Creating an Embedded Gobblin instance</h2>
<p>The code snippet above creates an <code>EmbeddedGobblin</code> instance. This instance can run arbitrary Gobblin ingestion jobs, and allows the use of templates. However, the user needs to configure the job by using the exact key needed for each feature.</p>
<p>An alternative is to use a subclass of <code>EmbeddedGobblin</code> which provides methods to more easily configure the job. For example, an easier way to run a Gobblin distcp job is to use <code>EmbeddedGobblinDistcp</code>:</p>
<pre><code class="java">EmbeddedGobblinDistcp distcp = new EmbeddedGobblinDistcp(sourcePath, targetPath).delete();
distcp.run();
</code></pre>
<p>This subclass automatically knows which template to use, the required configurations for the job (which are included as constructor parameters), and also provides convenience methods for the most common configurations (in the case above, the method <code>delete()</code> instructs the job to delete files that exist in the target but not the source).</p>
<p>The following is a non-exhaustive list of available subclasses of <code>EmbeddedGobblin</code>:</p>
<ul>
<li><code>EmbeddedGobblinDistcp</code>: distributed copy between Hadoop-compatible file systems.</li>
<li><code>EmbeddedWikipediaExample</code>: a getting-started example that pulls page updates from Wikipedia.</li>
</ul>
<h2 id="configuring-embedded-gobblin">Configuring Embedded Gobblin</h2>
<p><code>EmbeddedGobblin</code> allows any configuration that a standalone Gobblin job would allow. <code>EmbeddedGobblin</code> itself provides a few convenience methods to alter the behavior of the Gobblin framework. Other methods allow users to set a job template to use or set job level configurations.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>Parameters</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>mrMode</code></td>
<td>N/A</td>
<td>Runs Gobblin in MapReduce (MR) mode.</td>
</tr>
<tr>
<td><code>setTemplate</code></td>
<td>Template object to use</td>
<td>Use a job template.</td>
</tr>
<tr>
<td><code>useStateStore</code></td>
<td>State store directory</td>
<td>By default, embedded Gobblin is stateless and disables the state store. This method enables the state store at the indicated location, allowing the job to use watermarks from previous jobs.</td>
</tr>
<tr>
<td><code>distributeJar</code></td>
<td>Path to jar in local fs</td>
<td>Indicates that a specific jar is needed by Gobblin workers when running in distributed mode (e.g. MR mode). Gobblin will automatically add this jar to the classpath of the workers.</td>
</tr>
<tr>
<td><code>setConfiguration</code></td>
<td>Key-value pair</td>
<td>Sets a job level configuration.</td>
</tr>
<tr>
<td><code>setJobTimeout</code></td>
<td>timeout and time unit, or ISO period</td>
<td>Sets the timeout for the Gobblin job. <code>run()</code> will throw a <code>TimeoutException</code> if the job is not done after this period. (Default: 10 days)</td>
</tr>
<tr>
<td><code>setLaunchTimeout</code></td>
<td>timeout and time unit, or ISO period</td>
<td>Sets the timeout for launching the Gobblin job. <code>run()</code> will throw a <code>TimeoutException</code> if the job has not started after this period. (Default: 10 seconds)</td>
</tr>
<tr>
<td><code>setShutdownTimeout</code></td>
<td>timeout and time unit, or ISO period</td>
<td>Sets the timeout for shutting down embedded Gobblin after the job has finished. <code>run()</code> will throw a <code>TimeoutException</code> if the method has not returned within the timeout after the job finishes. Note that a <code>TimeoutException</code> may indicate that Gobblin could not release JVM resources, including threads.</td>
</tr>
</tbody>
</table>
<p>In addition to the above, subclasses of <code>EmbeddedGobblin</code> might offer their own convenience methods.</p>
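<p>The timeout behavior described in the table can be pictured with plain JDK classes. The sketch below is not Gobblin code: <code>TimeoutSketch</code>, <code>parseIsoMillis</code>, and <code>runWithTimeout</code> are hypothetical names, and <code>java.time.Duration.parse</code> is only a stand-in for whatever ISO period parsing Gobblin performs. It shows the general pattern of waiting on a job with a bound and reacting to a <code>TimeoutException</code> when the bound is exceeded.</p>

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in, not part of Gobblin: illustrates a bounded wait on a job.
class TimeoutSketch {

  // Converts an ISO-8601 duration such as "PT10S" to milliseconds.
  static long parseIsoMillis(String isoDuration) {
    return Duration.parse(isoDuration).toMillis();
  }

  // Runs the task on a worker thread and waits at most timeout/unit for its result.
  // Returns "TIMEOUT" instead of rethrowing so callers have no checked exceptions to handle.
  static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<String> future = executor.submit(task);
    try {
      return future.get(timeout, unit);
    } catch (TimeoutException te) {
      future.cancel(true); // interrupt the task so the worker thread can exit
      return "TIMEOUT";
    } catch (Exception e) {
      return "FAILED: " + e;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    long limit = parseIsoMillis("PT0.2S"); // 200 ms
    // Finishes well within the limit.
    System.out.println(runWithTimeout(() -> "done", limit, TimeUnit.MILLISECONDS));
    // Sleeps far longer than the limit, so the bounded wait gives up.
    System.out.println(runWithTimeout(() -> { Thread.sleep(5_000); return "done"; },
        limit, TimeUnit.MILLISECONDS));
  }
}
```

<p>In this sketch the timeout is turned into a return value to keep the example short. According to the table above, <code>EmbeddedGobblin#run()</code> instead throws the <code>TimeoutException</code> to the caller.</p>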
<h2 id="running-embedded-gobblin">Running Embedded Gobblin</h2>
<p>After <code>EmbeddedGobblin</code> has been configured it can be run with one of two methods:</p>
<ul>
<li><code>run()</code>: blocking call. Returns a <code>JobExecutionResult</code> after the job finishes and Gobblin shuts down.</li>
<li><code>runAsync()</code>: asynchronous call. Returns a <code>JobExecutionDriver</code>, which implements <code>Future&lt;JobExecutionResult&gt;</code>.</li>
</ul>
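<p>The difference between the two calls mirrors the usual blocking versus <code>Future</code>-based pattern in Java. The following sketch uses only JDK classes. <code>RunModesSketch</code>, <code>FakeJob</code>, and the <code>String</code> result are hypothetical stand-ins for <code>EmbeddedGobblin</code>, <code>JobExecutionDriver</code>, and <code>JobExecutionResult</code>; the sketch does not call any Gobblin API and makes no claim about how Gobblin implements these methods.</p>

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical stand-in for an embedded job; not the Gobblin API.
class RunModesSketch {

  static class FakeJob {
    private final String name;

    FakeJob(String name) {
      this.name = name;
    }

    // Asynchronous variant: returns immediately with a handle to the eventual result.
    CompletableFuture<String> runAsync() {
      return CompletableFuture.supplyAsync(() -> name + ": SUCCESS");
    }

    // Blocking variant: in this sketch it starts the job asynchronously, then waits.
    String run() {
      try {
        return runAsync().get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for " + name, e);
      } catch (ExecutionException e) {
        throw new IllegalStateException(name + " failed", e.getCause());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FakeJob job = new FakeJob("TestJob");

    // Blocking: the result is available when the call returns.
    System.out.println(job.run());

    // Asynchronous: the caller can do other work and collect the result later.
    Future<String> pending = job.runAsync();
    System.out.println("doing other work...");
    System.out.println(pending.get());
  }
}
```

<p>The asynchronous form is useful when the calling application needs to stay responsive or start several jobs before waiting on any of them.</p>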
<h2 id="extending-embedded-gobblin">Extending Embedded Gobblin</h2>
<p>Developers can extend <code>EmbeddedGobblin</code> to provide users with easier ways to launch a particular type of job. For an example, see <code>EmbeddedGobblinDistcp</code>.</p>
<p>Best practices:</p>
<ul>
<li>Generally, a subclass of <code>EmbeddedGobblin</code> is based on a template. The template should be automatically loaded on construction and the constructor should call <code>setTemplate(myTemplate)</code>.</li>
<li>All required configurations for a job should be parsed from the constructor arguments. Users should be able to run <code>new MyEmbeddedGobblinExtension(params...).run()</code> and get a sensible job run.</li>
<li>Convenience methods should be added for the most common configurations users would want to change. In general, a convenience method will call a few other methods transparently to the user. For example:
<pre><code class="java">// Convenience method: hides the CopySource.SIMULATE key from the user.
public EmbeddedGobblinDistcp simulate() {
  this.setConfiguration(CopySource.SIMULATE, Boolean.toString(true));
  return this;
}
</code></pre>
</li>
<li>If the job requires additional jars in the workers that are not part of the minimal Gobblin ingestion classpath (see <code>EmbeddedGobblin#getCoreGobblinJars</code> for this list), then the constructor should call <code>distributeJar(myJar)</code> for the additional jars.</li>
</ul>
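<p>Taken together, the practices above amount to a small fluent-builder pattern. The sketch below is self-contained and does not depend on Gobblin: <code>EmbeddedJobSketch</code> is a hypothetical stand-in for <code>EmbeddedGobblin</code>, and <code>CopyJobSketch</code>, the template name, and the configuration keys are invented purely for illustration.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for EmbeddedGobblin; models only a template and key-value configuration.
class EmbeddedJobSketch {
  private final Map<String, String> config = new LinkedHashMap<>();
  private String template;

  EmbeddedJobSketch setTemplate(String template) {
    this.template = template;
    return this;
  }

  EmbeddedJobSketch setConfiguration(String key, String value) {
    config.put(key, value);
    return this;
  }

  // Stands in for run(): just reports what would be executed.
  String describe() {
    return template + " " + config;
  }
}

// Hypothetical subclass that follows the practices listed above.
class CopyJobSketch extends EmbeddedJobSketch {

  // The template is set on construction, and the required configuration
  // comes from the constructor arguments.
  CopyJobSketch(String from, String to) {
    setTemplate("copy.template");
    setConfiguration("from", from);
    setConfiguration("to", to);
  }

  // Convenience method: hides the exact key and returns the subclass type for chaining.
  CopyJobSketch simulate() {
    setConfiguration("simulate", Boolean.toString(true));
    return this;
  }

  public static void main(String[] args) {
    // prints: copy.template {from=/src, to=/dst, simulate=true}
    System.out.println(new CopyJobSketch("/src", "/dst").simulate().describe());
  }
}
```

<p>Returning the subclass type from each convenience method keeps subclass-specific methods available further along the chain, which is the same choice the <code>simulate()</code> example above makes.</p>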
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../Gobblin-CLI/" class="btn btn-neutral float-right" title="Gobblin CLI">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../Gobblin-Deployment/" class="btn btn-neutral" title="Deployment"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../Gobblin-Deployment/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../Gobblin-CLI/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js" defer></script>
<script src="../../js/extra.js" defer></script>
<script src="../../search/main.js" defer></script>
</body>
</html>