blob: f37c230d584b8ea694f0962e0e2074dbc536553b [file] [log] [blame]
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Apache Software Foundation">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Gobblin on Yarn - Apache Gobblin</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<link href="../../css/extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Gobblin on Yarn";
var mkdocs_page_input_path = "user-guide/Gobblin-on-Yarn.md";
var mkdocs_page_url = null;
</script>
<script src="../../js/jquery-2.1.1.min.js" defer></script>
<script src="../../js/modernizr-2.8.3.min.js" defer></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Apache Gobblin</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" title="Type search term here" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="/">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Powered-By/">Companies Powered By Gobblin</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Getting-Started/">Getting Started</a>
</li>
<li class="toctree-l1">
<a class="" href="../../Gobblin-Architecture/">Architecture</a>
</li>
<li class="toctree-l1">
<span class="caption-text">User Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../Working-with-Job-Configuration-Files/">Job Configuration Files</a>
</li>
<li class="">
<a class="" href="../Gobblin-Deployment/">Deployment</a>
</li>
<li class="">
<a class="" href="../Gobblin-as-a-Library/">Gobblin as a Library</a>
</li>
<li class="">
<a class="" href="../Gobblin-CLI/">Gobblin CLI</a>
</li>
<li class="">
<a class="" href="../Gobblin-Compliance/">Gobblin Compliance</a>
</li>
<li class=" current">
<a class="current" href="./">Gobblin on Yarn</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#table-of-contents">Table of Contents</a></li>
<li class="toctree-l3"><a href="#introduction">Introduction</a></li>
<li class="toctree-l3"><a href="#architecture">Architecture</a></li>
<ul>
<li><a class="toctree-l4" href="#overview">Overview</a></li>
<li><a class="toctree-l4" href="#the-role-of-apache-helix">The Role of Apache Helix</a></li>
<li><a class="toctree-l4" href="#gobblin-yarn-application-launcher">Gobblin Yarn Application Launcher</a></li>
<li><a class="toctree-l4" href="#gobblin-applicationmaster">Gobblin ApplicationMaster</a></li>
<li><a class="toctree-l4" href="#gobblin-workunitrunner">Gobblin WorkUnitRunner</a></li>
<li><a class="toctree-l4" href="#failure-handling">Failure Handling</a></li>
</ul>
<li class="toctree-l3"><a href="#log-aggregation">Log Aggregation</a></li>
<li class="toctree-l3"><a href="#security-and-delegation-token-management">Security and Delegation Token Management</a></li>
<li class="toctree-l3"><a href="#configuration">Configuration</a></li>
<ul>
<li><a class="toctree-l4" href="#configuration-properties">Configuration Properties</a></li>
<li><a class="toctree-l4" href="#job-lock">Job Lock</a></li>
<li><a class="toctree-l4" href="#configuration-system">Configuration System</a></li>
</ul>
<li class="toctree-l3"><a href="#deployment">Deployment</a></li>
<ul>
<li><a class="toctree-l4" href="#deployment-on-a-unsecured-yarn-cluster">Deployment on a Unsecured Yarn Cluster</a></li>
<li><a class="toctree-l4" href="#deployment-on-a-secured-yarn-cluster">Deployment on a Secured Yarn Cluster</a></li>
<li><a class="toctree-l4" href="#supporting-existing-gobblin-jobs">Supporting Existing Gobblin Jobs</a></li>
</ul>
<li class="toctree-l3"><a href="#monitoring">Monitoring</a></li>
</ul>
</li>
<li class="">
<a class="" href="../Compaction/">Compaction</a>
</li>
<li class="">
<a class="" href="../State-Management-and-Watermarks/">State Management and Watermarks</a>
</li>
<li class="">
<a class="" href="../Working-with-the-ForkOperator/">Fork Operator</a>
</li>
<li class="">
<a class="" href="../Configuration-Properties-Glossary/">Configuration Glossary</a>
</li>
<li class="">
<a class="" href="../Source-schema-and-Converters/">Source schema and Converters</a>
</li>
<li class="">
<a class="" href="../Partitioned-Writers/">Partitioned Writers</a>
</li>
<li class="">
<a class="" href="../Monitoring/">Monitoring</a>
</li>
<li class="">
<a class="" href="../Gobblin-template/">Template</a>
</li>
<li class="">
<a class="" href="../Gobblin-Schedulers/">Schedulers</a>
</li>
<li class="">
<a class="" href="../Job-Execution-History-Store/">Job Execution History Store</a>
</li>
<li class="">
<a class="" href="../Building-Gobblin/">Building Gobblin</a>
</li>
<li class="">
<a class="" href="../Gobblin-genericLoad/">Generic Configuration Loading</a>
</li>
<li class="">
<a class="" href="../Hive-Registration/">Hive Registration</a>
</li>
<li class="">
<a class="" href="../Config-Management/">Config Management</a>
</li>
<li class="">
<a class="" href="../Docker-Integration/">Docker Integration</a>
</li>
<li class="">
<a class="" href="../Troubleshooting/">Troubleshooting</a>
</li>
<li class="">
<a class="" href="../FAQs/">FAQs</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sources</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sources/AvroFileSource/">Avro files</a>
</li>
<li class="">
<a class="" href="../../sources/CopySource/">File copy</a>
</li>
<li class="">
<a class="" href="../../sources/QueryBasedSource/">Query based</a>
</li>
<li class="">
<a class="" href="../../sources/RestApiSource/">Rest Api</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleAnalyticsSource/">Google Analytics</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleDriveSource/">Google Drive</a>
</li>
<li class="">
<a class="" href="../../sources/GoogleWebmaster/">Google Webmaster</a>
</li>
<li class="">
<a class="" href="../../sources/HadoopTextInputSource/">Hadoop Text Input</a>
</li>
<li class="">
<a class="" href="../../sources/HelloWorldSource/">Hello World</a>
</li>
<li class="">
<a class="" href="../../sources/HiveAvroToOrcSource/">Hive Avro-to-ORC</a>
</li>
<li class="">
<a class="" href="../../sources/HivePurgerSource/">Hive compliance purging</a>
</li>
<li class="">
<a class="" href="../../sources/SimpleJsonSource/">JSON</a>
</li>
<li class="">
<a class="" href="../../sources/KafkaSource/">Kafka</a>
</li>
<li class="">
<a class="" href="../../sources/MySQLSource/">MySQL</a>
</li>
<li class="">
<a class="" href="../../sources/OracleSource/">Oracle</a>
</li>
<li class="">
<a class="" href="../../sources/SalesforceSource/">Salesforce</a>
</li>
<li class="">
<a class="" href="../../sources/SftpSource/">SFTP</a>
</li>
<li class="">
<a class="" href="../../sources/SqlServerSource/">SQL Server</a>
</li>
<li class="">
<a class="" href="../../sources/TeradataSource/">Teradata</a>
</li>
<li class="">
<a class="" href="../../sources/WikipediaSource/">Wikipedia</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Sinks (Writers)</span>
<ul class="subnav">
<li class="">
<a class="" href="../../sinks/AvroHdfsDataWriter/">Avro HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/ParquetHdfsDataWriter/">Parquet HDFS</a>
</li>
<li class="">
<a class="" href="../../sinks/SimpleBytesWriter/">HDFS Byte array</a>
</li>
<li class="">
<a class="" href="../../sinks/ConsoleWriter/">Console</a>
</li>
<li class="">
<a class="" href="../../sinks/CouchbaseWriter/">Couchbase</a>
</li>
<li class="">
<a class="" href="../../sinks/Http/">HTTP</a>
</li>
<li class="">
<a class="" href="../../sinks/Gobblin-JDBC-Writer/">JDBC</a>
</li>
<li class="">
<a class="" href="../../sinks/Kafka/">Kafka</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Adaptors</span>
<ul class="subnav">
<li class="">
<a class="" href="../../adaptors/Gobblin-Distcp/">Gobblin Distcp</a>
</li>
<li class="">
<a class="" href="../../adaptors/Hive-Avro-To-ORC-Converter/">Hive Avro-To-Orc Converter</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Case Studies</span>
<ul class="subnav">
<li class="">
<a class="" href="../../case-studies/Kafka-HDFS-Ingestion/">Kafka-HDFS Ingestion</a>
</li>
<li class="">
<a class="" href="../../case-studies/Publishing-Data-to-S3/">Publishing Data to S3</a>
</li>
<li class="">
<a class="" href="../../case-studies/Writing-ORC-Data/">Writing ORC Data</a>
</li>
<li class="">
<a class="" href="../../case-studies/Hive-Distcp/">Hive Distcp</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Data Management</span>
<ul class="subnav">
<li class="">
<a class="" href="../../data-management/Gobblin-Retention/">Retention</a>
</li>
<li class="">
<a class="" href="../../data-management/DistcpNgEvents/">Distcp-NG events</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Gobblin Metrics</span>
<ul class="subnav">
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics/">Quick Start</a>
</li>
<li class="">
<a class="" href="../../metrics/Existing-Reporters/">Existing Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Metrics-for-Gobblin-ETL/">Metrics for Gobblin ETL</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Architecture/">Gobblin Metrics Architecture</a>
</li>
<li class="">
<a class="" href="../../metrics/Implementing-New-Reporters/">Implementing New Reporters</a>
</li>
<li class="">
<a class="" href="../../metrics/Gobblin-Metrics-Performance/">Gobblin Metrics Performance</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Developer Guide</span>
<ul class="subnav">
<li class="">
<a class="" href="../../developer-guide/Customization-for-New-Source/">Customization for New Source</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Customization-for-Converter-and-Operator/">Customization for Converter and Operator</a>
</li>
<li class="">
<a class="" href="../../developer-guide/CodingStyle/">Code Style Guide</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Gobblin-Compliance-Design/">Gobblin Compliance Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/IDE-setup/">IDE setup</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Monitoring-Design/">Monitoring Design</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Documentation-Architecture/">Documentation Architecture</a>
</li>
<li class="">
<a class="" href="../../developer-guide/Contributing/">Contributing</a>
</li>
<li class="">
<a class="" href="../../developer-guide/GobblinModules/">Gobblin Modules</a>
</li>
<li class="">
<a class="" href="../../developer-guide/HighLevelConsumer/">High Level Consumer</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Project</span>
<ul class="subnav">
<li class="">
<a class="" href="../../project/Feature-List/">Feature List</a>
</li>
<li class="">
<a class="" href="/people">Contributors and Team</a>
</li>
<li class="">
<a class="" href="../../project/Talks-and-Tech-Blogs/">Talks and Tech Blog Posts</a>
</li>
<li class="">
<a class="" href="../../project/Posts/">Posts</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Miscellaneous</span>
<ul class="subnav">
<li class="">
<a class="" href="../../miscellaneous/Camus-to-Gobblin-Migration/">Camus to Gobblin Migration</a>
</li>
<li class="">
<a class="" href="../../miscellaneous/Exactly-Once-Support/">Exactly Once Support</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Apache Gobblin</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>User Guide &raquo;</li>
<li>Gobblin on Yarn</li>
<li class="wy-breadcrumbs-aside">
<a href="https://github.com/apache/incubator-gobblin/edit/master/docs/user-guide/Gobblin-on-Yarn.md" rel="nofollow"> Edit on Gobblin</a>
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="table-of-contents">Table of Contents</h1>
<div class="toc">
<ul>
<li><a href="#table-of-contents">Table of Contents</a></li>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#architecture">Architecture</a><ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#the-role-of-apache-helix">The Role of Apache Helix</a></li>
<li><a href="#gobblin-yarn-application-launcher">Gobblin Yarn Application Launcher</a><ul>
<li><a href="#yarnappsecuritymanager">YarnAppSecurityManager</a></li>
<li><a href="#logcopier">LogCopier</a></li>
</ul>
</li>
<li><a href="#gobblin-applicationmaster">Gobblin ApplicationMaster</a><ul>
<li><a href="#yarnservice">YarnService</a></li>
<li><a href="#gobblinhelixjobscheduler">GobblinHelixJobScheduler</a></li>
<li><a href="#logcopier_1">LogCopier</a></li>
<li><a href="#yarncontainersecuritymanager">YarnContainerSecurityManager</a></li>
</ul>
</li>
<li><a href="#gobblin-workunitrunner">Gobblin WorkUnitRunner</a><ul>
<li><a href="#taskexecutor">TaskExecutor</a></li>
<li><a href="#gobblinhelixtaskstatetracker">GobblinHelixTaskStateTracker</a></li>
<li><a href="#logcopier_2">LogCopier</a></li>
<li><a href="#yarncontainersecuritymanager_1">YarnContainerSecurityManager</a></li>
</ul>
</li>
<li><a href="#failure-handling">Failure Handling</a><ul>
<li><a href="#applicationmaster-failure-handling">ApplicationMaster Failure Handling</a></li>
<li><a href="#container-failure-handling">Container Failure Handling</a></li>
<li><a href="#handling-failures-to-get-applicationreport">Handling Failures to get ApplicationReport</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#log-aggregation">Log Aggregation</a></li>
<li><a href="#security-and-delegation-token-management">Security and Delegation Token Management</a></li>
<li><a href="#configuration">Configuration</a><ul>
<li><a href="#configuration-properties">Configuration Properties</a></li>
<li><a href="#job-lock">Job Lock</a></li>
<li><a href="#configuration-system">Configuration System</a></li>
</ul>
</li>
<li><a href="#deployment">Deployment</a><ul>
<li><a href="#deployment-on-a-unsecured-yarn-cluster">Deployment on a Unsecured Yarn Cluster</a></li>
<li><a href="#deployment-on-a-secured-yarn-cluster">Deployment on a Secured Yarn Cluster</a></li>
<li><a href="#supporting-existing-gobblin-jobs">Supporting Existing Gobblin Jobs</a></li>
</ul>
</li>
<li><a href="#monitoring">Monitoring</a></li>
</ul>
</div>
<h1 id="introduction">Introduction</h1>
<p>Gobblin currently is capable of running in the standalone mode on a single machine or in the MapReduce (MR) mode as a MR job on a Hadoop cluster. A Gobblin job is typically running on a schedule through a scheduler, e.g., the built-in <code>JobScheduler</code>, Azkaban, or Oozie, and each job run ingests new data or data updated since the last run. So this is essentially a batch model for data ingestion and how soon new data becomes available on Hadoop depends on the schedule of the job. </p>
<p>On another aspect, for high data volume data sources such as Kafka, Gobblin typically runs in the MR mode with a considerable number of tasks running in the mappers of a MR job. This helps Gobblin to scale out for data sources with large volumes of data. The MR mode, however, suffers from problems such as large overhead mostly due to the overhead of submitting and launching a MR job and poor cluster resource usage. The MR mode is also fundamentally not appropriate for real-time data ingestion given its batch nature. These deficiencies are summarized in more details below:</p>
<ul>
<li>In the MR mode, every Gobblin job run starts a new MR job, which costs a considerable amount of time to allocate and start the containers for running the mapper/reducer tasks. This cost can be totally eliminated if the containers are already up and running.</li>
<li>Each Gobblin job running in the MR mode requests a new set of containers and releases them upon job completion. So it's impossible for two jobs to share the containers even though the containers are perfectly capable of running tasks of both jobs.</li>
<li>In the MR mode, All <code>WorkUnit</code>s are pre-assigned to the mappers before launching the MR job. The assignment is fixed by evenly distributing the <code>WorkUnit</code>s to the mappers so each mapper gets a fair share of the work in terms of the <em>number of <code>WorkUnits</code></em>. However, an evenly distributed number of <code>WorkUnit</code>s per mapper does not always guarantee a fair share of the work in terms of the volume of data to pull. This, combined with the fact that the mappers that finish earlier cannot "steal" <code>WorkUnit</code>s assigned to other mappers, means the responsibility of load balancing is on the <code>Source</code> implementations, which is not trivial to do, and is virtually impossible in heterogeneous Hadoop clusters where different nodes have different capacity. This also means the duration of a job is determined by the slowest mapper.</li>
<li>A MR job can only hold its containers for a limited of time, beyond which the job may get killed. Real-time data ingestion, however, requires the ingestion tasks to be running all the time or alternatively dividing a continuous data stream into well-defined mini-batches (as in Spark Streaming) that can be promptly executed once created. Both require long-running containers, which are not supported in the MR mode. </li>
</ul>
<p>Those deficiencies motivated the work on making Gobblin run on Yarn as a native Yarn application. Running Gobblin as a native Yarn application allows much more control over container provisioning and lifecycle management so it's possible to keep the containers running continuously. It also makes it possible to dynamically change the number of containers at runtime depending on the load to further improve the resource efficiency, something that's impossible in the MR mode. </p>
<p>This wiki page documents the design and architecture of the native Gobblin Yarn application and some implementation details. It also covers the configuration system and properties for the application, as well as deployment settings on both unsecured and secured Yarn clusters. </p>
<h1 id="architecture">Architecture</h1>
<h2 id="overview">Overview</h2>
<p>The architecture of Gobblin on Yarn is illustrated in the following diagram. In addition to Yarn, Gobblin on Yarn also leverages <a href="http://helix.apache.org/">Apache Helix</a>, whose role is discussed in <a href="#the-role-of-apache-helix">The Role of Apache Helix</a>. A Gobblin Yarn application consists of three components: the Yarn Application Launcher, the Yarn ApplicationMaster (serving as the Helix <em>controller</em>), and the Yarn WorkUnitRunner (serving as the Helix <em>participant</em>). The following sections describe each component in details.</p>
<p align="center">
<figure>
<img src=../../img/Gobblin-on-Yarn-with-Helix.png alt="Gobblin on Yarn with Helix" width="800">
</figure>
</p>
<h2 id="the-role-of-apache-helix">The Role of Apache Helix</h2>
<p><a href="http://helix.apache.org/">Apache Helix</a> is mainly used for managing the cluster of containers and running the <code>WorkUnit</code>s through its <a href="http://helix.apache.org/0.7.1-docs/recipes/task_dag_execution.html">Distributed Task Execution Framework</a>. </p>
<p>The assignment of tasks to available containers (or participants in Helix's term) is handled by Helix through a finite state model named the <code>TaskStateModel</code>. Using this <code>TaskStateModel</code>, Helix is also able to do task rebalancing in case new containers get added or some existing containers die. Clients can also choose to force a task rebalancing if some tasks take much longer time than the others. </p>
<p>Helix also supports a way of doing messaging between different components of a cluster, e.g., between the controller to the participants, or between the client and the controller. The Gobblin Yarn application uses this messaging mechanism to implement graceful shutdown initiated by the client as well as delegation token renew notifications from the client to the ApplicationMaster and the WorkUnitRunner containers.</p>
<p>Heiix relies on ZooKeeper for its operations, and particularly for maintaining the state of the cluster and the resources (tasks in this case). Both the Helix controller and participants connect to ZooKeeper during their entire lifetime. The ApplicationMaster serves as the Helix controller and the worker containers serve as the Helix participants, respectively, as discussed in details below. </p>
<h2 id="gobblin-yarn-application-launcher">Gobblin Yarn Application Launcher</h2>
<p>The Gobblin Yarn Application Launcher (implemented by class <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java" rel="nofollow"><code>GobblinYarnAppLauncher</code></a>) is the client/driver of a Gobblin Yarn application. The first thing the <code>GobblinYarnAppLauncher</code> does when it starts is to register itself with Helix as a <em>spectator</em> and creates a new Helix cluster with name specified through the configuration property <code>gobblin.yarn.helix.cluster.name</code>, if no cluster with the name exists. </p>
<p>The <code>GobblinYarnAppLauncher</code> then sets up the Gobblin Yarn application and submits it to run on Yarn. Once the Yarn application successfully starts running, it starts an application state monitor that periodically checks the state of the Gobblin Yarn application. If the state is one of the exit states (<code>FINISHED</code>, <code>FAILED</code>, or <code>KILLED</code>), the <code>GobblinYarnAppLauncher</code> shuts down itself. </p>
<p>Upon successfully submitting the application to run on Yarn, the <code>GobblinYarnAppLauncher</code> also starts a <code>ServiceManager</code> that manages the following services that auxiliate the running of the application:</p>
<h3 id="yarnappsecuritymanager">YarnAppSecurityManager</h3>
<p>The <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAppSecurityManager.java" rel="nofollow"><code>YarnAppSecurityManager</code></a> works with the <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnContainerSecurityManager.java" rel="nofollow"><code>YarnContainerSecurityManager</code></a> running in the ApplicationMaster and the WorkUnitRunner for a complete solution for security and delegation token management. The <code>YarnAppSecurityManager</code> is responsible for periodically logging in through a Kerberos keytab and getting the delegation token refreshed regularly after each login. Each time the delegation token is refreshed, the <code>YarnContainerSecurityManager</code> writes the new token to a file on HDFS and sends a message to the ApplicationMaster and each WorkUnitRunner, notifying them the refresh of the delegation token. Checkout <a href="#yarncontainersecuritymanager"><code>YarnContainerSecurityManager</code></a> on how the other side of this system works.</p>
<h3 id="logcopier">LogCopier</h3>
<p>The service <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-utility/src/main/java/org/apache/gobblin/util/logs/LogCopier.java" rel="nofollow"><code>LogCopier</code></a> in <code>GobblinYarnAppLauncher</code> streams the ApplicationMaster and WorkUnitRunner logs in near real-time from the central location on HDFS where the logs are streamed to from the ApplicationMaster and WorkUnitRunner containers, to the local directory specified through the configuration property <code>gobblin.yarn.logs.sink.root.dir</code> on the machine where the <code>GobblinYarnAppLauncher</code> runs. More details on this can be found in <a href="#log-aggregation">Log Aggregation</a>.</p>
<h2 id="gobblin-applicationmaster">Gobblin ApplicationMaster</h2>
<p>The ApplicationMaster process runs the <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinApplicationMaster.java" rel="nofollow"><code>GobblinApplicationMaster</code></a>, which uses a <code>ServiceManager</code> to manage the services supporting the operation of the ApplicationMaster process. The services running in <code>GobblinApplicationMaster</code> will be discussed later. When it starts, the first thing <code>GobblinApplicationMaster</code> does is to connect to ZooKeeper and register itself as a Helix <em>controller</em>. It then starts the <code>ServiceManager</code>, which in turn starts the services it manages, as described below. </p>
<h3 id="yarnservice">YarnService</h3>
<p>The service <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java" rel="nofollow"><code>YarnService</code></a> handles all Yarn-related task including the following:</p>
<ul>
<li>Registering and un-registering the ApplicationMaster with the Yarn ResourceManager.</li>
<li>Requesting the initial set of containers from the Yarn ResourceManager.</li>
<li>Handling any container changes at runtime, e.g., adding more containers or shutting down containers no longer needed. This also includes stopping running containers when the application is asked to stop.</li>
</ul>
<p>This design makes it switch to a different resource manager, e.g., Mesos, by replacing the service <code>YarnService</code> with something else specific to the resource manager, e.g., <code>MesosService</code>.</p>
<h3 id="gobblinhelixjobscheduler">GobblinHelixJobScheduler</h3>
<p><a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinApplicationMaster.java"><code>GobblinApplicationMaster</code></a> runs the <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java"><code>GobblinHelixJobScheduler</code></a> that schedules jobs to run through the Helix <a href="http://helix.apache.org/0.7.1-docs/recipes/task_dag_execution.html">Distributed Task Execution Framework</a>. For each Gobblin job run, the <code>GobblinHelixJobScheduler</code> starts a <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java" rel="nofollow"><code>GobblinHelixJobLauncher</code></a> that wraps the Gobblin job into a <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJob.java" rel="nofollow"><code>GobblinHelixJob</code></a> and each Gobblin <code>Task</code> into a <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixTask.java" rel="nofollow"><code>GobblinHelixTask</code></a>, which implements the Helix's <code>Task</code> interface so Helix knows how to execute it. The <code>GobblinHelixJobLauncher</code> then submits the job to a Helix job queue named after the Gobblin job name, from which the Helix Distributed Task Execution Framework picks up the job and runs its tasks through the live participants (available containers).</p>
<p>Like the <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-runtime/src/main/java/org/apache/gobblin/runtime/local/LocalJobLauncher.java" rel="nofollow"><code>LocalJobLauncher</code></a> and <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-runtime/src/main/java/org/apache/gobblin/runtime/mapreduce/MRJobLauncher.java" rel="nofollow"><code>MRJobLauncher</code></a>, the <code>GobblinHelixJobLauncher</code> handles output data commit and job state persistence. </p>
<h3 id="logcopier_1">LogCopier</h3>
<p>The service <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-utility/src/main/java/org/apache/gobblin/util/logs/LogCopier.java" rel="nofollow"><code>LogCopier</code></a> in <code>GobblinApplicationMaster</code> streams the ApplicationMaster logs in near real-time from the machine running the ApplicationMaster container to a central location on HDFS so the logs can be accessed at runtime. More details on this can be found in <a href="#log-aggregation">Log Aggregation</a>.</p>
<h3 id="yarncontainersecuritymanager">YarnContainerSecurityManager</h3>
<p>The <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnContainerSecurityManager.java" rel="nofollow"><code>YarnContainerSecurityManager</code></a> runs in both the ApplicationMaster and the WorkUnitRunner. When it starts, it registers a message handler with the <code>HelixManager</code> for handling messages on refreshes of the delegation token. Once such a message is received, the <code>YarnContainerSecurityManager</code> gets the path to the token file on HDFS from the message, and updated the the current login user with the new token read from the file.</p>
<h2 id="gobblin-workunitrunner">Gobblin WorkUnitRunner</h2>
<p>The WorkUnitRunner process runs as part of <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java" rel="nofollow"><code>GobblinTaskRunner</code></a>, which uses a <code>ServiceManager</code> to manage the services supporting the operation of the WorkUnitRunner process. The services running in <code>GobblinWorkUnitRunner</code> will be discussed later. When it starts, the first thing <code>GobblinWorkUnitRunner</code> does is to connect to ZooKeeper and register itself as a Helix <em>participant</em>. It then starts the <code>ServiceManager</code>, which in turn starts the services it manages, as discussed below. </p>
<h3 id="taskexecutor">TaskExecutor</h3>
<p>The <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-runtime/src/main/java/org/apache/gobblin/runtime/TaskExecutor.java" rel="nofollow"><code>TaskExecutor</code></a> remains the same as in the standalone and MR modes, and is purely responsible for running tasks assigned to a WorkUnitRunner. </p>
<h3 id="gobblinhelixtaskstatetracker">GobblinHelixTaskStateTracker</h3>
<p>The <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixTaskStateTracker.java" rel="nofollow"><code>GobblinHelixTaskStateTracker</code></a> has a similar responsibility as the <code>LocalTaskStateTracker</code> and <code>MRTaskStateTracker</code>: keeping track of the state of running tasks including operational metrics, e.g., total records pulled, records pulled per second, total bytes pulled, bytes pulled per second, etc.</p>
<h3 id="logcopier_2">LogCopier</h3>
<p>The service <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-utility/src/main/java/org/apache/gobblin/util/logs/LogCopier.java" rel="nofollow"><code>LogCopier</code></a> in <code>GobblinWorkUnitRunner</code> streams the WorkUnitRunner logs in near real-time from the machine running the WorkUnitRunner container to a central location on HDFS so the logs can be accessed at runtime. More details on this can be found in <a href="#log-aggregation">Log Aggregation</a>.</p>
<h3 id="yarncontainersecuritymanager_1">YarnContainerSecurityManager</h3>
<p>The <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnContainerSecurityManager.java" rel="nofollow"><code>YarnContainerSecurityManager</code></a> in <code>GobblinWorkUnitRunner</code> works in the same way as it in <code>GobblinApplicationMaster</code>. </p>
<h2 id="failure-handling">Failure Handling</h2>
<h3 id="applicationmaster-failure-handling">ApplicationMaster Failure Handling</h3>
<p>Under normal operation, the Gobblin ApplicationMaster stays alive unless being asked to stop through a message sent from the launcher (the <code>GobblinYarnAppLauncher</code>) as part of the orderly shutdown process. It may, however, fail or get killed by the Yarn ResourceManager for various reasons. For example, the container running the ApplicationMaster may fail and exit due to node failures, or get killed because of using more memory than claimed. When a shutdown of the ApplicationMaster is triggered (e.g., when the shutdown hook is triggered) for any reason, it does so gracefully, i.e., it attempts to stop every services it manages, stop all the running containers, and unregister itself with the ResourceManager. Shutting down the ApplicationMaster shuts down the Yarn application and the application launcher will eventually know that the application completes through a periodic check on the application status. </p>
<h3 id="container-failure-handling">Container Failure Handling</h3>
<p>Under normal operation, a Gobblin Yarn container stays alive unless being released and stopped by the Gobblin ApplicationMaster, and in this case the exit status of the container will be zero. However, a container may exit unexpectedly due to various reasons. For example, a container may fail and exit due to node failures, or be killed because of using more memory than claimed. In this case when a container exits abnormally with a non-zero exit code, Gobblin Yarn tries to restart the Helix instance running in the container by requesting a new Yarn container as a replacement to run the instance. The maximum number of retries can be configured through the key <code>gobblin.yarn.helix.instance.max.retries</code>.</p>
<p>When requesting a new container to replace the one that completes and exits abnormally, the application has a choice of specifying the same host that runs the completed container as the preferred host, depending on the boolean value of configuration key <code>gobblin.yarn.container.affinity.enabled</code>. Note that for certain exit codes that indicate something wrong with the host, the value of <code>gobblin.yarn.container.affinity.enabled</code> is ignored and no preferred host gets specified, leaving Yarn to figure out a good candidate node for the new container. </p>
<h3 id="handling-failures-to-get-applicationreport">Handling Failures to get ApplicationReport</h3>
<p>As mentioned above, once the Gobblin Yarn application successfully starts running, the <code>GobblinYarnAppLauncher</code> starts an application state monitor that periodically checks the state of the Yarn application by getting an <code>ApplicationReport</code>. It may fail to do so and throw an exception, however, if the Yarn client is having some problem connecting and communicating with the Yarn cluster. For example, if the Yarn cluster is down for maintenance, the Yarn client will not be able to get an <code>ApplicationReport</code>. The <code>GobblinYarnAppLauncher</code> keeps track of the number of consecutive failures to get an <code>ApplicationReport</code> and initiates a shutdown if this number exceeds the threshold as specified through the configuration property <code>gobblin.yarn.max.get.app.report.failures</code>. The shutdown will trigger an email notification if the configuration property <code>gobblin.yarn.email.notification.on.shutdown</code> is set to <code>true</code>.</p>
<h1 id="log-aggregation">Log Aggregation</h1>
<p>Yarn provides both a Web UI and a command-line tool to access the logs of an application, and also does log aggregation so the logs of all the containers become available on the client side upon requested. However, there are a few limitations that make it hard to access the logs of an application at runtime:</p>
<ul>
<li>The command-line utility for downloading the aggregated logs will only be able to do so after the application finishes, making it useless for getting access to the logs at the application runtime. </li>
<li>The Web UI does allow logs to be viewed at runtime, but only when the user that access the UI is the same as the user that launches the application. On a Yarn cluster where security is enabled, the user launching the Gobblin Yarn application is typically a user of some headless account.</li>
</ul>
<p>Because Gobblin runs on Yarn as a long-running native Yarn application, getting access to the logs at runtime is critical to know what's going on in the application and to detect any issues in the application as early as possible. Unfortunately we cannot use the log facility provided by Yarn here due to the above limitations. Alternatively, Gobblin on Yarn has its own mechanism for doing log aggregation and providing access to the logs at runtime, described as follows.</p>
<p>Both the Gobblin ApplicationMaster and WorkUnitRunner run a <code>LogCopier</code> that periodically copies new entries of both <code>stdout</code> and <code>stderr</code> logs of the corresponding processes from the containers to a central location on HDFS under the directory <code>${gobblin.yarn.work.dir}/_applogs</code> in the subdirectories named after the container IDs, one per container. The names of the log files on HDFS combine the container IDs and the original log file names so it's easy to tell which container generates which log file. More specifically, the log files produced by the ApplicationMaster are named <code>&lt;container id&gt;.GobblinApplicationMaster.{stdout,stderr}</code>, and the log files produced by the WorkUnitRunner are named <code>&lt;container id&gt;.GobblinWorkUnitRunner.{stdout,stderr}</code>.</p>
<p>The Gobblin YarnApplicationLauncher also runs a <code>LogCopier</code> that periodically copies new log entries from log files under <code>${gobblin.yarn.work.dir}/_applogs</code> on HDFS to the local filesystem under the directory configured by the property <code>gobblin.yarn.logs.sink.root.dir</code>. By default, the <code>LogCopier</code> checks for new log entries every 60 seconds and will keep reading new log entries until it reaches the end of the log file. This setup enables the Gobblin Yarn application to stream container process logs near real-time all the way to the client/driver. </p>
<h1 id="security-and-delegation-token-management">Security and Delegation Token Management</h1>
<p>On a Yarn cluster with security enabled (e.g., Kerberos authentication is required to access HDFS), security and delegation token management is necessary to allow Gobblin run as a long-running Yarn application. Specifically, Gobblin running on a secured Yarn cluster needs to get its delegation token for accessing HDFS renewed periodically, which also requires periodic keytab re-logins because a delegation token can only be renewed up to a limited number of times in one login.</p>
<p>The Gobblin Yarn application supports Kerberos-based authentication and login through a keytab file. The <code>YarnAppSecurityManager</code> running in the Yarn Application Launcher and the <code>YarnContainerSecurityManager</code> running in the ApplicationMaster and WorkUnitRunner work together to get every Yarn containers updated whenever the delegation token gets updated on the client side by the <code>YarnAppSecurityManager</code>. More specifically, the <code>YarnAppSecurityManager</code> periodically logins through the keytab and gets the delegation token refreshed regularly after each successful login. Every time the <code>YarnAppSecurityManager</code> refreshes the delegation token, the <code>YarnContainerSecurityManager</code> writes the new token to a file on HDFS and sends a <code>TOKEN_FILE_UPDATED</code> message to the ApplicationMaster and each WorkUnitRunner, notifying them the refresh of the delegation token. Upon receiving such a message, the <code>YarnContainerSecurityManager</code> running in the ApplicationMaster or WorkUnitRunner gets the path to the token file on HDFS from the message, and updated the the current login user with the new token read from the file.</p>
<p>Both the interval between two Kerberos keytab logins and the interval between two delegation token refreshes are configurable, through the configuration properties <code>gobblin.yarn.login.interval.minutes</code> and <code>gobblin.yarn.token.renew.interval.minutes</code>, respectively. </p>
<h1 id="configuration">Configuration</h1>
<h2 id="configuration-properties">Configuration Properties</h2>
<p>In additional to the common Gobblin configuration properties, documented in <a href="Configuration-Properties-Glossary"><code>Configuration Properties Glossary</code></a>, Gobblin on Yarn uses the following configuration properties. </p>
<table>
<thead>
<tr>
<th>Property</th>
<th>Default Value</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>gobblin.yarn.app.name</code></td>
<td><code>GobblinYarn</code></td>
<td>The Gobblin Yarn appliation name.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.queue</code></td>
<td><code>default</code></td>
<td>The Yarn queue the Gobblin Yarn application will run in.</td>
</tr>
<tr>
<td><code>gobblin.yarn.work.dir</code></td>
<td><code>/gobblin</code></td>
<td>The working directory (typically on HDFS) for the Gobblin Yarn application.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.report.interval.minutes</code></td>
<td>5</td>
<td>The interval in minutes between two Gobblin Yarn application status reports.</td>
</tr>
<tr>
<td><code>gobblin.yarn.max.get.app.report.failures</code></td>
<td>4</td>
<td>Maximum allowed number of consecutive failures to get a Yarn <code>ApplicationReport</code>.</td>
</tr>
<tr>
<td><code>gobblin.yarn.email.notification.on.shutdown</code></td>
<td><code>false</code></td>
<td>Whether email notification is enabled or not on shutdown of the <code>GobblinYarnAppLauncher</code>. If this is set to <code>true</code>, the following configuration properties also need to be set for email notification to work: <code>email.host</code>, <code>email.smtp.port</code>, <code>email.user</code>, <code>email.password</code>, <code>email.from</code>, and <code>email.tos</code>. Refer to <a href="Configuration-Properties-Glossary#Email-Alert-Properties">Email Alert Properties</a> for more information on those configuration properties.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.master.memory.mbs</code></td>
<td>512</td>
<td>How much memory in MBs to request for the container running the Gobblin ApplicationMaster.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.master.cores</code></td>
<td>1</td>
<td>The number of vcores to request for the container running the Gobblin ApplicationMaster.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.master.jars</code></td>
<td></td>
<td>A comma-separated list of jars the Gobblin ApplicationMaster depends on but not in the <code>lib</code> directory.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.master.files.local</code></td>
<td></td>
<td>A comma-separated list of files on the local filesystem the Gobblin ApplicationMaster depends on.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.master.files.remote</code></td>
<td></td>
<td>A comma-separated list of files on a remote filesystem (typically HDFS) the Gobblin ApplicationMaster depends on.</td>
</tr>
<tr>
<td><code>gobblin.yarn.app.master.jvm.args</code></td>
<td></td>
<td>Additional JVM arguments for the JVM process running the Gobblin ApplicationMaster, e.g., <code>-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m</code> <code>-XX:CompressedClassSpaceSize=256m -Dconfig.trace=loads</code>.</td>
</tr>
<tr>
<td><code>gobblin.yarn.initial.containers</code></td>
<td>1</td>
<td>The number of containers to request initially when the application starts to run the WorkUnitRunner.</td>
</tr>
<tr>
<td><code>gobblin.yarn.container.memory.mbs</code></td>
<td>512</td>
<td>How much memory in MBs to request for the container running the Gobblin WorkUnitRunner.</td>
</tr>
<tr>
<td><code>gobblin.yarn.container.cores</code></td>
<td>1</td>
<td>The number of vcores to request for the container running the Gobblin WorkUnitRunner.</td>
</tr>
<tr>
<td><code>gobblin.yarn.container.jars</code></td>
<td></td>
<td>A comma-separated list of jars the Gobblin WorkUnitRunner depends on but not in the <code>lib</code> directory.</td>
</tr>
<tr>
<td><code>gobblin.yarn.container.files.local</code></td>
<td></td>
<td>A comma-separated list of files on the local filesystem the Gobblin WorkUnitRunner depends on.</td>
</tr>
<tr>
<td><code>gobblin.yarn.container.files.remote</code></td>
<td></td>
<td>A comma-separated list of files on a remote filesystem (typically HDFS) the Gobblin WorkUnitRunner depends on.</td>
</tr>
<tr>
<td><code>gobblin.yarn.container.jvm.args</code></td>
<td></td>
<td>Additional JVM arguments for the JVM process running the Gobblin WorkUnitRunner, e.g., <code>-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m</code> <code>-XX:CompressedClassSpaceSize=256m -Dconfig.trace=loads</code>.</td>
</tr>
<tr>
<td><code>gobblin.yarn.container.affinity.enabled</code></td>
<td><code>true</code></td>
<td>Whether the same host should be used as the preferred host when requesting a replacement container for the one that exits.</td>
</tr>
<tr>
<td><code>gobblin.yarn.helix.cluster.name</code></td>
<td><code>GobblinYarn</code></td>
<td>The name of the Helix cluster that will be registered with ZooKeeper.</td>
</tr>
<tr>
<td><code>gobblin.yarn.zk.connection.string</code></td>
<td><code>localhost:2181</code></td>
<td>The ZooKeeper connection string used by Helix.</td>
</tr>
<tr>
<td><code>helix.instance.max.retries</code></td>
<td>2</td>
<td>Maximum number of times the application tries to restart a failed Helix instance (corresponding to a Yarn container).</td>
</tr>
<tr>
<td><code>gobblin.yarn.lib.jars.dir</code></td>
<td></td>
<td>The directory where library jars are stored, typically <code>gobblin-dist/lib</code>.</td>
</tr>
<tr>
<td><code>gobblin.yarn.job.conf.path</code></td>
<td></td>
<td>The path to either a directory where Gobblin job configuration files are stored or a single job configuration file. Internally Gobblin Yarn will package the configuration files as a tarball so you don't need to.</td>
</tr>
<tr>
<td><code>gobblin.yarn.logs.sink.root.dir</code></td>
<td></td>
<td>The directory on local filesystem on the driver/client side where the aggregated container logs of both the ApplicationMaster and WorkUnitRunner are stored.</td>
</tr>
<tr>
<td><code>gobblin.yarn.log.copier.max.file.size</code></td>
<td>Unbounded</td>
<td>The maximum bytes per log file. When this is exceeded a new log file will be created.</td>
</tr>
<tr>
<td><code>gobblin.yarn.log.copier.scheduler</code></td>
<td><code>ScheduledExecutorService</code></td>
<td>The scheduler to use to copy the log files. Possible values: <code>ScheduledExecutorService</code>, <code>HashedWheelTimer</code>. The <code>HashedWheelTimer</code> scheduler is experimental but is expected to become the default after a sufficient burn in period.</td>
</tr>
<tr>
<td><code>gobblin.yarn.keytab.file.path</code></td>
<td></td>
<td>The path to the Kerberos keytab file used for keytab-based authentication/login.</td>
</tr>
<tr>
<td><code>gobblin.yarn.keytab.principal.name</code></td>
<td></td>
<td>The principal name of the keytab.</td>
</tr>
<tr>
<td><code>gobblin.yarn.login.interval.minutes</code></td>
<td>1440</td>
<td>The interval in minutes between two keytab logins.</td>
</tr>
<tr>
<td><code>gobblin.yarn.token.renew.interval.minutes</code></td>
<td>720</td>
<td>The interval in minutes between two delegation token renews.</td>
</tr>
</tbody>
</table>
<h2 id="job-lock">Job Lock</h2>
<p>It is recommended to use zookeeper for maintaining job locks. See <a href="Configuration-Properties-Glossary#ZookeeperBasedJobLock-Properties">ZookeeperBasedJobLock Properties</a> for the relevant configuration properties.</p>
<h2 id="configuration-system">Configuration System</h2>
<p>The Gobblin Yarn application uses the <a href="https://github.com/typesafehub/config" rel="nofollow">Typesafe Config</a> library to handle the application configuration. Following <a href="https://github.com/typesafehub/config" rel="nofollow">Typesafe Config</a>'s model, the Gobblin Yarn application uses a single file named <code>application.conf</code> for all configuration properties and another file named <code>reference.conf</code> for default values. A sample <code>application.conf</code> is shown below: </p>
<pre><code># Yarn/Helix configuration properties
gobblin.yarn.helix.cluster.name=GobblinYarnTest
gobblin.yarn.app.name=GobblinYarnTest
gobblin.yarn.lib.jars.dir=&quot;/home/gobblin/gobblin-dist/lib/&quot;
gobblin.yarn.app.master.files.local=&quot;/home/gobblin/gobblin-dist/conf/log4j-yarn.properties,/home/gobblin/gobblin-dist/conf/application.conf,/home/gobblin/gobblin-dist/conf/reference.conf&quot;
gobblin.yarn.container.files.local=${gobblin.yarn.app.master.files.local}
gobblin.yarn.job.conf.path=&quot;/home/gobblin/gobblin-dist/job-conf&quot;
gobblin.yarn.keytab.file.path=&quot;/home/gobblin/gobblin.headless.keytab&quot;
gobblin.yarn.keytab.principal.name=gobblin
gobblin.yarn.app.master.jvm.args=&quot;-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m&quot;
gobblin.yarn.container.jvm.args=&quot;-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m&quot;
gobblin.yarn.logs.sink.root.dir=/home/gobblin/gobblin-dist/applogs
# File system URIs
writer.fs.uri=${fs.uri}
state.store.fs.uri=${fs.uri}
# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${gobblin.yarn.work.dir}/task-staging
writer.output.dir=${gobblin.yarn.work.dir}/task-output
# Data publisher related configuration properties
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${gobblin.yarn.work.dir}/job-output
data.publisher.replace.final.dir=false
# Directory where job/task state files are stored
state.store.dir=${gobblin.yarn.work.dir}/state-store
# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${gobblin.yarn.work.dir}/err
# Use zookeeper for maintaining the job lock
job.lock.enabled=true
job.lock.type=ZookeeperBasedJobLock
# Directory where job locks are stored
job.lock.dir=${gobblin.yarn.work.dir}/locks
# Directory where metrics log files are stored
metrics.log.dir=${gobblin.yarn.work.dir}/metrics
</code></pre>
<p>A sample <code>reference.conf</code> is shown below:</p>
<pre><code># Yarn/Helix configuration properties
gobblin.yarn.app.queue=default
gobblin.yarn.helix.cluster.name=GobblinYarn
gobblin.yarn.app.name=GobblinYarn
gobblin.yarn.app.master.memory.mbs=512
gobblin.yarn.app.master.cores=1
gobblin.yarn.app.report.interval.minutes=5
gobblin.yarn.max.get.app.report.failures=4
gobblin.yarn.email.notification.on.shutdown=false
gobblin.yarn.initial.containers=1
gobblin.yarn.container.memory.mbs=512
gobblin.yarn.container.cores=1
gobblin.yarn.container.affinity.enabled=true
gobblin.yarn.helix.instance.max.retries=2
gobblin.yarn.keytab.login.interval.minutes=1440
gobblin.yarn.token.renew.interval.minutes=720
gobblin.yarn.work.dir=/user/gobblin/gobblin-yarn
gobblin.yarn.zk.connection.string=${zookeeper.connection.string}
fs.uri=&quot;hdfs://localhost:9000&quot;
zookeeper.connection.string=&quot;localhost:2181&quot;
</code></pre>
<h1 id="deployment">Deployment</h1>
<p>A standard deployment of Gobblin on Yarn requires a Yarn cluster running Hadoop 2.x (<code>2.3.0</code> and above recommended) and a ZooKeeper cluster. Make sure the client machine (typically the gateway of the Yarn cluster) is able to access the ZooKeeper instance. </p>
<h2 id="deployment-on-a-unsecured-yarn-cluster">Deployment on a Unsecured Yarn Cluster</h2>
<p>To do a deployment of the Gobblin Yarn application, first build Gobblin using the following command from the root directory of the Gobblin project.</p>
<pre><code>./gradlew clean build
</code></pre>
<p>To build Gobblin against a specific version of Hadoop 2.x, e.g., <code>2.7.0</code>, run the following command instead:</p>
<pre><code>./gradlew clean build -PhadoopVersion=2.7.0
</code></pre>
<p>After Gobblin is successfully built, a tarball named <code>gobblin-dist-[project-version].tar.gz</code> should have been created under the root directory of the project. To deploy the Gobblin Yarn application on a unsecured Yarn cluster, uncompress the tarball somewhere and run the following commands: </p>
<pre><code>cd gobblin-dist
bin/gobblin-yarn.sh
</code></pre>
<p>Note that for the above commands to work, the Hadoop/Yarn configuration directory must be on the classpath and the configuration must be pointing to the right Yarn cluster, or specifically the right ResourceManager and NameNode URLs. This is defined like the following in <code>gobblin-yarn.sh</code>:</p>
<pre><code>CLASSPATH=${FWDIR_CONF}:${GOBBLIN_JARS}:${YARN_CONF_DIR}:${HADOOP_YARN_HOME}/lib
</code></pre>
<h2 id="deployment-on-a-secured-yarn-cluster">Deployment on a Secured Yarn Cluster</h2>
<p>When deploying the Gobblin Yarn application on a secured Yarn cluster, make sure the keytab file path is correctly specified in <code>application.conf</code> and the correct principal for the keytab is used as follows. The rest of the deployment is the same as that on a unsecured Yarn cluster.</p>
<pre><code>gobblin.yarn.keytab.file.path=&quot;/home/gobblin/gobblin.headless.keytab&quot;
gobblin.yarn.keytab.principal.name=gobblin
</code></pre>
<h2 id="supporting-existing-gobblin-jobs">Supporting Existing Gobblin Jobs</h2>
<p>Gobblin on Yarn is backward compatible and supports existing Gobblin jobs running in the standalone and MR modes. To run existing Gobblin jobs, simply put the job configuration files into a directory on the local file system of the driver and setting the configuration property <code>gobblin.yarn.job.conf.path</code> to point to the directory. When the Gobblin Yarn application starts, Yarn will package the configuration files as a tarball and make sure the tarball gets copied to the ApplicationMaster and properly uncompressed. The <code>GobblinHelixJobScheduler</code> then loads the job configuration files and schedule the jobs to run.</p>
<h1 id="monitoring">Monitoring</h1>
<p>Gobblin Yarn uses the <a href="../metrics/Gobblin-Metrics">Gobblin Metrics</a> library for collecting and reporting metrics at the container, job, and task levels. Each <code>GobblinWorkUnitRunner</code> maintains a <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/ContainerMetrics.java" rel="nofollow"><code>ContainerMetrics</code></a> that is the parent of the <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-runtime/src/main/java/org/apache/gobblin/runtime/util/JobMetrics.java" rel="nofollow"><code>JobMetrics</code></a> of each job run the container is involved, which is the parent of the <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-runtime/src/main/java/org/apache/gobblin/runtime/util/TaskMetrics.java" rel="nofollow"><code>TaskMetrics</code></a> of each task of the job run. This hierarchical structure allows us to do pre-aggregation in the containers before reporting the metrics to the backend. </p>
<p>Collected metrics can be reported to various sinks such as Kafka, files, and JMX, depending on the configuration. Specifically, <code>metrics.enabled</code> controls whether metrics collecting and reporting are enabled or not. <code>metrics.reporting.kafka.enabled</code>, <code>metrics.reporting.file.enabled</code>, and <code>metrics.reporting.jmx.enabled</code> control whether collected metrics should be reported or not to Kafka, files, and JMX, respectively. Please refer to <a href="Configuration-Properties-Glossary#Metrics-Properties">Metrics Properties</a> for the available configuration properties related to metrics collecting and reporting. </p>
<p>In addition to metric collecting and reporting, Gobblin Yarn also supports writing job execution information to a MySQL-backed job execution history store, which keeps track of job execution information. Please refer to the <a href="https://github.com/apache/incubator-gobblin/tree/master/gobblin-metastore/src/main/resources/db/migration" rel="nofollow">DDL</a> for the relevant MySQL tables. Detailed information on the job execution history store including how to configure it can be found <a href="Job-Execution-History-Store">here</a>. </p>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../Compaction/" class="btn btn-neutral float-right" title="Compaction">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../Gobblin-Compliance/" class="btn btn-neutral" title="Gobblin Compliance"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org" rel="nofollow">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme" rel="nofollow">theme</a> provided by <a href="https://readthedocs.org" rel="nofollow">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../Gobblin-Compliance/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../Compaction/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js" defer></script>
<script src="../../js/extra.js" defer></script>
<script src="../../search/main.js" defer></script>
</body>
</html>