<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Apache Software Foundation">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Docker Integration - Apache Gobblin</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<link href="../../css/extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Docker Integration";
var mkdocs_page_input_path = "user-guide/Docker-Integration.md";
var mkdocs_page_url = null;
</script>
<script src="../../js/jquery-2.1.1.min.js" defer></script>
<script src="../../js/modernizr-2.8.3.min.js" defer></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Apache Gobblin</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" title="Type search term here" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="/">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Powered-By/">Companies Powered By Gobblin</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Getting-Started/">Getting Started</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Gobblin-Architecture/">Architecture</a>
</li>
<li class="toctree-l1">
<span class="caption-text">User Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../Working-with-Job-Configuration-Files/">Job Configuration Files</a>
</li>
<li class="">
<a class="" href="../Gobblin-Deployment/">Deployment</a>
</li>
<li class="">
<a class="" href="../Gobblin-as-a-Library/">Gobblin as a Library</a>
</li>
<li class="">
<a class="" href="../Gobblin-CLI/">Gobblin CLI</a>
</li>
<li class="">
<a class="" href="../Gobblin-Compliance/">Gobblin Compliance</a>
</li>
<li class="">
<a class="" href="../Gobblin-on-Yarn/">Gobblin on Yarn</a>
</li>
<li class="">
<a class="" href="../Compaction/">Compaction</a>
</li>
<li class="">
<a class="" href="../State-Management-and-Watermarks/">State Management and Watermarks</a>
</li>
<li class="">
<a class="" href="../Working-with-the-ForkOperator/">Fork Operator</a>
</li>
<li class="">
<a class="" href="../Configuration-Properties-Glossary/">Configuration Glossary</a>
</li>
<li class="">
<a class="" href="../Source-schema-and-Converters/">Source schema and Converters</a>
</li>
<li class="">
<a class="" href="../Partitioned-Writers/">Partitioned Writers</a>
</li>
<li class="">
<a class="" href="../Monitoring/">Monitoring</a>
</li>
<li class="">
<a class="" href="../Gobblin-template/">Template</a>
</li>
<li class="">
<a class="" href="../Gobblin-Schedulers/">Schedulers</a>
</li>
<li class="">
<a class="" href="../Job-Execution-History-Store/">Job Execution History Store</a>
</li>
<li class="">
<a class="" href="../Building-Gobblin/">Building Gobblin</a>
</li>
<li class="">
<a class="" href="../Gobblin-genericLoad/">Generic Configuration Loading</a>
</li>
<li class="">
<a class="" href="../Hive-Registration/">Hive Registration</a>
</li>
<li class="">
<a class="" href="../Config-Management/">Config Management</a>
</li>
<li class=" current">
<a class="current" href="./">Docker Integration</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#table-of-contents">Table of Contents</a></li>
<li class="toctree-l3"><a href="#introduction">Introduction</a></li>
<li class="toctree-l3"><a href="#docker">Docker</a></li>
<li class="toctree-l3"><a href="#docker-repositories">Docker Repositories</a></li>
<ul>
<li><a class="toctree-l4" href="#gobblin-wikipedia-repository">Gobblin-Wikipedia Repository</a></li>
<li><a class="toctree-l4" href="#gobblin-standalone-repository">Gobblin-Standalone Repository</a></li>
</ul>
<li class="toctree-l3"><a href="#future-work">Future Work</a></li>
</ul>
</li>
<li class="">
<a class="" href="../Troubleshooting/">Troubleshooting</a>
</li>
<li class="">
<a class="" href="../FAQs/">FAQs</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sources</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sources/AvroFileSource/">Avro files</a>
</li>
<li class="">
<a class="" href="../../sources/CopySource/">File copy</a>
</li>
<li class="">
<a class="" href="../../sources/QueryBasedSource/">Query based</a>
</li>
<li class="">
<a class="" href="../../sources/RestApiSource/">Rest Api</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleAnalyticsSource/">Google Analytics</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleDriveSource/">Google Drive</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleWebmaster/">Google Webmaster</a>
</li>
<li class="">
<a class="" href="../../sources/HadoopTextInputSource/">Hadoop Text Input</a>
</li>
<li class="">
<a class="" href="../../sources/HelloWorldSource/">Hello World</a>
</li>
<li class="">
<a class="" href="../../sources/HiveAvroToOrcSource/">Hive Avro-to-ORC</a>
</li>
<li class="">
<a class="" href="../../sources/HivePurgerSource/">Hive compliance purging</a>
</li>
<li class="">
<a class="" href="../../sources/SimpleJsonSource/">JSON</a>
</li>
<li class="">
<a class="" href="../../sources/KafkaSource/">Kafka</a>
</li>
<li class="">
<a class="" href="../../sources/MySQLSource/">MySQL</a>
</li>
<li class="">
<a class="" href="../../sources/OracleSource/">Oracle</a>
</li>
<li class="">
<a class="" href="../../sources/SalesforceSource/">Salesforce</a>
</li>
<li class="">
<a class="" href="../../sources/SftpSource/">SFTP</a>
</li>
<li class="">
<a class="" href="../../sources/SqlServerSource/">SQL Server</a>
</li>
<li class="">
<a class="" href="../../sources/TeradataSource/">Teradata</a>
</li>
<li class="">
<a class="" href="../../sources/WikipediaSource/">Wikipedia</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sinks (Writers)</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sinks/AvroHdfsDataWriter/">Avro HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/ParquetHdfsDataWriter/">Parquet HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/SimpleBytesWriter/">HDFS Byte array</a>
</li>
<li class="">
<a class="" href="../../sinks/ConsoleWriter/">Console</a>
</li>
<li class="">
<a class="" href="../../sinks/CouchbaseWriter/">Couchbase</a>
</li>
<li class="">
<a class="" href="../../sinks/Http/">HTTP</a>
</li>
<li class="">
<a class="" href="../../sinks/Gobblin-JDBC-Writer/">JDBC</a>
</li>
<li class="">
<a class="" href="../../sinks/Kafka/">Kafka</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Adaptors</span>
<ul class="subnav">
<li class="">
<a class="" href="../../adaptors/Gobblin-Distcp/">Gobblin Distcp</a>
</li>
<li class="">
<a class="" href="../../adaptors/Hive-Avro-To-ORC-Converter/">Hive Avro-To-Orc Converter</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Case Studies</span>
<ul class="subnav">
<li class="">
<a class="" href="../../case-studies/Kafka-HDFS-Ingestion/">Kafka-HDFS Ingestion</a>
</li>
<li class="">
<a class="" href="../../case-studies/Publishing-Data-to-S3/">Publishing Data to S3</a>
</li>
<li class="">
<a class="" href="../../case-studies/Writing-ORC-Data/">Writing ORC Data</a>
</li>
<li class="">
<a class="" href="../../case-studies/Hive-Distcp/">Hive Distcp</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Data Management</span>
<ul class="subnav">
<li class="">
<a class="" href="../../data-management/Gobblin-Retention/">Retention</a>
</li>
<li class="">
<a class="" href="../../data-management/DistcpNgEvents/">Distcp-NG events</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Metrics</span>
<ul class="subnav">
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics/">Quick Start</a>
</li>
<li class="">
<a class="" href="../../metrics/Existing-Reporters/">Existing Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Metrics-for-Gobblin-ETL/">Metrics for Gobblin ETL</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Architecture/">Gobblin Metrics Architecture</a>
</li>
<li class="">
<a class="" href="../../metrics/Implementing-New-Reporters/">Implementing New Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Performance/">Gobblin Metrics Performance</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Developer Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../../developer-guide/Customization-for-New-Source/">Customization for New Source</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Customization-for-Converter-and-Operator/">Customization for Converter and Operator</a>
</li>
<li class="">
<a class="" href="../../developer-guide/CodingStyle/">Code Style Guide</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Gobblin-Compliance-Design/">Gobblin Compliance Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/IDE-setup/">IDE setup</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Monitoring-Design/">Monitoring Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Documentation-Architecture/">Documentation Architecture</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Contributing/">Contributing</a>
</li>
<li class="">
<a class="" href="../../developer-guide/GobblinModules/">Gobblin Modules</a>
</li>
<li class="">
<a class="" href="../../developer-guide/HighLevelConsumer/">High Level Consumer</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Project</span>
<ul class="subnav">
<li class="">
<a class="" href="../../project/Feature-List/">Feature List</a>
</li>
<li class="">
<a class="" href="/people">Contributors and Team</a>
</li>
<li class="">
<a class="" href="../../project/Talks-and-Tech-Blogs/">Talks and Tech Blog Posts</a>
</li>
<li class="">
<a class="" href="../../project/Posts/">Posts</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Miscellaneous</span>
<ul class="subnav">
<li class="">
<a class="" href="../../miscellaneous/Camus-to-Gobblin-Migration/">Camus to Gobblin Migration</a>
</li>
<li class="">
<a class="" href="../../miscellaneous/Exactly-Once-Support/">Exactly Once Support</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Apache Gobblin</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>User Guide &raquo;</li>
<li>Docker Integration</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="table-of-contents">Table of Contents</h1>
<div class="toc">
<ul>
<li><a href="#table-of-contents">Table of Contents</a></li>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#docker">Docker</a></li>
<li><a href="#docker-repositories">Docker Repositories</a><ul>
<li><a href="#gobblin-wikipedia-repository">Gobblin-Wikipedia Repository</a></li>
<li><a href="#gobblin-standalone-repository">Gobblin-Standalone Repository</a></li>
</ul>
</li>
<li><a href="#future-work">Future Work</a></li>
</ul>
</div>
<h1 id="introduction">Introduction</h1>
<p>Gobblin integrates with Docker by running a Gobblin standalone service inside a Docker container. The Gobblin service inside the container can monitor the host filesystem for new job configuration files, run the jobs, and write the resulting data to the host filesystem. The Gobblin Docker images can be found on Docker Hub at: https://hub.docker.com/u/gobblin/</p>
<h1 id="docker">Docker</h1>
<p>For more information on Docker, including how to install it, check out the documentation at: https://docs.docker.com/</p>
<h1 id="docker-repositories">Docker Repositories</h1>
<p>Gobblin currently has four different repositories, and all are on Docker Hub <a href="https://hub.docker.com/u/gobblin/" rel="nofollow">here</a>.</p>
<p>The <code>gobblin/gobblin-wikipedia</code> repository contains images that run the Gobblin Wikipedia job found in the <a href="../Getting-Started">getting started guide</a>. These images are useful for users new to Docker or Gobblin; they primarily act as a "Hello World" example for the Gobblin Docker integration.</p>
<p>The <code>gobblin/gobblin-standalone</code> repository contains images that run a <a href="Gobblin-Deployment#standalone-architecture">Gobblin standalone service</a> inside a Docker container. These images provide a simple way to set up a Gobblin standalone service on any Docker-compatible machine.</p>
<p>The <code>gobblin/gobblin-base</code> and <code>gobblin/gobblin-distributions</code> repositories are for internal use only, and are primarily useful for Gobblin developers.</p>
<h2 id="gobblin-wikipedia-repository">Gobblin-Wikipedia Repository</h2>
<p>The Docker images for this repository can be found on Docker Hub <a href="https://hub.docker.com/r/gobblin/gobblin-wikipedia/" rel="nofollow">here</a>. These images are mainly meant to act as a "Hello World" example for the Gobblin-Docker integration, and to provide a sanity check to see if the Gobblin-Docker integration is working on a given machine. The image contains the Gobblin configuration files to run the <a href="../Getting-Started">Gobblin Wikipedia job</a>. When a container is launched using the <code>gobblin-wikipedia</code> image, Gobblin starts up, runs the Wikipedia example, and then exits.</p>
<p>Running the <code>gobblin-wikipedia</code> image requires the following steps (let's assume we want to use an Ubuntu-based image):</p>
<ul>
<li>Download the images from the <code>gobblin/gobblin-wikipedia</code> repository</li>
</ul>
<pre><code>docker pull gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
</code></pre>
<ul>
<li>Run the <code>gobblin/gobblin-wikipedia:ubuntu-gobblin-latest</code> image in a Docker container</li>
</ul>
<pre><code>docker run gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
</code></pre>
<p>The logs are printed to the console, and no errors should pop up. This should provide a nice sanity check to ensure that everything is working as expected. The output of the job will be written to a directory inside the container. When the container exits that data will be lost. In order to preserve the output of the job, continue to the next step.</p>
<ul>
<li>Preserving the output of a Docker container requires using a <a href="https://docs.docker.com/engine/tutorials/dockervolumes/" rel="nofollow">data volume</a>. To do this, run the below command:</li>
</ul>
<pre><code>docker run -v /home/gobblin/work-dir:/home/gobblin/work-dir gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
</code></pre>
<p>The output of the Gobblin-Wikipedia job should now be written to <code>/home/gobblin/work-dir/job-output</code>. The <code>-v</code> option uses a Docker feature called <a href="https://docs.docker.com/engine/tutorials/dockervolumes/" rel="nofollow">data volumes</a>. It mounts a host directory into a container and takes the form <code>[host-directory]:[container-directory]</code>. Any modification to the host directory is visible inside the container directory, and any modification to the container directory is visible inside the host directory. This is a standard way to ensure data persists even after a Docker container finishes. Note that the <code>[host-directory]</code> in the <code>-v</code> option can be changed to any directory (on OSX it must be under the <code>/Users/</code> directory), but the <code>[container-directory]</code> must remain <code>/home/gobblin/work-dir</code> (at least for now).</p>
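<p>As a minimal sketch of the <code>[host-directory]:[container-directory]</code> form described above, the snippet below builds the <code>-v</code> argument from a host directory of your choice while keeping the container side fixed. The variable names and the default path under the temporary directory are illustrative assumptions, not part of the Gobblin images. The actual <code>docker run</code> line is left commented out so the snippet can be tried without a running Docker daemon.</p>

```shell
#!/bin/sh
# Hypothetical host directory; any directory works
# (on OSX it must live under /Users/).
HOST_WORK_DIR="${HOST_WORK_DIR:-${TMPDIR:-/tmp}/gobblin-demo/work-dir}"

# The container side must stay fixed, as noted above.
CONTAINER_WORK_DIR="/home/gobblin/work-dir"

# Create the host directory before mounting it.
mkdir -p "$HOST_WORK_DIR"

# Compose the volume argument as [host-directory]:[container-directory].
VOLUME_ARG="$HOST_WORK_DIR:$CONTAINER_WORK_DIR"

# Print the command that would be run.
echo "docker run -v $VOLUME_ARG gobblin/gobblin-wikipedia:ubuntu-gobblin-latest"

# Uncomment to actually launch the container:
# docker run -v "$VOLUME_ARG" gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
```

<p>Once the container has finished, the job output should appear under the <code>job-output</code> subdirectory of whichever host directory was chosen.</p>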
<h2 id="gobblin-standalone-repository">Gobblin-Standalone Repository</h2>
<p>The Docker images for this repository can be found on Docker Hub <a href="https://hub.docker.com/r/gobblin/gobblin-standalone/" rel="nofollow">here</a>. These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long-running process that can run Gobblin jobs defined in a <code>.job</code> or <code>.pull</code> file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found <a href="Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files">here</a>). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back to the host filesystem. All the heavy lifting is done inside the Docker container; the user only needs to define and submit job / pull files. The goal is to provide an environment for the Gobblin standalone service that is easy to set up.</p>
<p>Running the <code>gobblin-standalone</code> image requires taking the following steps:</p>
<ul>
<li>Download the images from the <code>gobblin/gobblin-standalone</code> repository</li>
</ul>
<pre><code>docker pull gobblin/gobblin-standalone:ubuntu-gobblin-latest
</code></pre>
<ul>
<li>Run the <code>gobblin/gobblin-standalone:ubuntu-gobblin-latest</code> image in a Docker container</li>
</ul>
<pre><code>docker run -v /home/gobblin/conf:/etc/opt/job-conf \
-v /home/gobblin/work-dir:/home/gobblin/work-dir \
-v /home/gobblin/logs:/var/log/gobblin \
gobblin/gobblin-standalone:ubuntu-gobblin-latest
</code></pre>
<p>A data volume needs to be created for the job configuration directory (contains all the job configuration files), the work directory (contains all the job output data), and the logs directory (contains all the Gobblin standalone logs).</p>
<p>The <code>-v /home/gobblin/conf:/etc/opt/job-conf</code> option makes any new job / pull files added to the <code>/home/gobblin/conf</code> directory on the host filesystem visible to the Gobblin standalone service inside the container. So any job / pull file added to the <code>/home/gobblin/conf</code> directory on the host filesystem will be run by the Gobblin standalone service running inside the Docker container. Note that the container directory (<code>/etc/opt/job-conf</code>) should not be modified, while the host directory (<code>/home/gobblin/conf</code>) can be any directory on the host filesystem that contains job / pull files.</p>
<p>The <code>-v /home/gobblin/work-dir:/home/gobblin/work-dir</code> option allows the container to write data to the host filesystem, so that the data persists after the container is shut down. Once again, the container directory (<code>/home/gobblin/work-dir</code>) should not be modified, while the host directory (<code>/home/gobblin/work-dir</code>) can be any directory on the host filesystem.</p>
<p>The <code>-v /home/gobblin/logs:/var/log/gobblin</code> option allows the Gobblin standalone logs to be written to the host filesystem, so that they can be read on the host machine. This is useful for monitoring and debugging purposes. Once again, the container directory (<code>/var/log/gobblin</code>) should not be modified, while the host directory (<code>/home/gobblin/logs</code>) can be any directory on the host filesystem.</p>
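<p>The three mounts described above can be sketched as a small script that prepares the host directories and assembles the <code>-v</code> options. The <code>GOBBLIN_HOST_HOME</code> variable and its default location are assumptions made for illustration; only the container-side paths and the image name come from the text above. The <code>docker run</code> call itself is left commented out so the script can be tried without a Docker daemon.</p>

```shell
#!/bin/sh
# Hypothetical base directory on the host; substitute any location you prefer.
GOBBLIN_HOST_HOME="${GOBBLIN_HOST_HOME:-${TMPDIR:-/tmp}/gobblin-demo}"

# Host directories (free to change) for job files, job output, and logs.
mkdir -p "$GOBBLIN_HOST_HOME/conf" \
         "$GOBBLIN_HOST_HOME/work-dir" \
         "$GOBBLIN_HOST_HOME/logs"

# Store the -v options as positional parameters so that paths
# containing spaces survive quoting. The container-side paths
# are fixed by the image and should not be changed.
set -- \
  -v "$GOBBLIN_HOST_HOME/conf:/etc/opt/job-conf" \
  -v "$GOBBLIN_HOST_HOME/work-dir:/home/gobblin/work-dir" \
  -v "$GOBBLIN_HOST_HOME/logs:/var/log/gobblin"

# Print the command that would be run.
echo docker run "$@" gobblin/gobblin-standalone:ubuntu-gobblin-latest

# Uncomment to actually start the standalone service:
# docker run "$@" gobblin/gobblin-standalone:ubuntu-gobblin-latest
```

<p>With the container running, copying a job / pull file into the host <code>conf</code> directory is what submits it to the standalone service; output and logs then appear in the host <code>work-dir</code> and <code>logs</code> directories.</p>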
<h1 id="future-work">Future Work</h1>
<ul>
<li>Create <code>gobblin-dev</code> images that provide a development environment for Gobblin contributors</li>
<li>Create <code>gobblin-kafka</code> images that provide an end-to-end service for writing to Kafka and ingesting the Kafka data through Gobblin</li>
<li>Test and write a tutorial on using <code>gobblin-standalone</code> images to write to an HDFS cluster</li>
<li>Create images based on <a href="https://hub.docker.com/_/alpine/" rel="nofollow">Alpine Linux</a> (a lightweight Linux distro)</li>
</ul>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../Troubleshooting/" class="btn btn-neutral float-right" title="Troubleshooting">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../Config-Management/" class="btn btn-neutral" title="Config Management"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org" rel="nofollow">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme" rel="nofollow">theme</a> provided by <a href="https://readthedocs.org" rel="nofollow">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../Config-Management/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../Troubleshooting/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js" defer></script>
<script src="../../js/extra.js" defer></script>
<script src="../../search/main.js" defer></script>
</body>
</html>