blob: 785e650830253887f56ecfad10db5fddf8702d0a [file] [log] [blame]
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Apache Mesos - Agent Recovery</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta property="og:locale" content="en_US"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="Apache Mesos"/>
<meta property="og:site_name" content="Apache Mesos"/>
<meta property="og:url" content="http://mesos.apache.org/"/>
<meta property="og:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/>
<meta property="og:description"
content="Apache Mesos abstracts resources away from machines,
enabling fault-tolerant and elastic distributed systems
to easily be built and run effectively."/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:site" content="@ApacheMesos"/>
<meta name="twitter:title" content="Apache Mesos"/>
<meta name="twitter:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/>
<meta name="twitter:description"
content="Apache Mesos abstracts resources away from machines,
enabling fault-tolerant and elastic distributed systems
to easily be built and run effectively."/>
<link href="//netdna.bootstrapcdn.com/bootstrap/3.1.1/css/bootstrap.min.css" rel="stylesheet">
<link rel="alternate" type="application/atom+xml" title="Apache Mesos Blog" href="/blog/feed.xml">
<link href="../../../assets/css/main.css" media="screen" rel="stylesheet" type="text/css" />
<!-- Google Analytics Magic -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-20226872-1']);
_gaq.push(['_setDomainName', 'apache.org']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<!-- magical breadcrumbs -->
<div class="topnav">
<div class="container">
<ul class="breadcrumb">
<li>
<div class="dropdown">
<a data-toggle="dropdown" href="#">Apache Software Foundation <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu" aria-labelledby="dLabel">
<li><a href="http://www.apache.org">Apache Homepage</a></li>
<li><a href="http://www.apache.org/licenses/">License</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
<li><a href="http://www.apache.org/security/">Security</a></li>
</ul>
</div>
</li>
<li><a href="http://mesos.apache.org">Apache Mesos</a></li>
<li><a href="/documentation
/">Documentation
</a></li>
</ul><!-- /.breadcrumb -->
</div><!-- /.container -->
</div><!-- /.topnav -->
<!-- navbar excitement -->
<div class="navbar navbar-default navbar-static-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#mesos-menu" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/"><img src="/assets/img/mesos_logo.png" alt="Apache Mesos logo"/></a>
</div><!-- /.navbar-header -->
<div class="navbar-collapse collapse" id="mesos-menu">
<ul class="nav navbar-nav navbar-right">
<li><a href="/gettingstarted/">Getting Started</a></li>
<li><a href="/blog/">Blog</a></li>
<li><a href="/documentation/latest/">Documentation</a></li>
<li><a href="/downloads/">Downloads</a></li>
<li><a href="/community/">Community</a></li>
</ul>
</div><!-- /#mesos-menu -->
</div><!-- /.container -->
</div><!-- /.navbar -->
<div class="content">
<div class="container">
<div class="row-fluid">
<div class="col-md-4">
<h4>If you're new to Mesos</h4>
<p>See the <a href="/gettingstarted/">getting started</a> page for more
information about downloading, building, and deploying Mesos.</p>
<h4>If you'd like to get involved or you're looking for support</h4>
<p>See our <a href="/community/">community</a> page for more details.</p>
</div>
<div class="col-md-8">
<h1>Agent Recovery</h1>
<p>If the <code>mesos-agent</code> process on a host exits (perhaps due to a Mesos bug or
because the operator kills the process while <a href="/documentation/latest/./upgrades/">upgrading Mesos</a>),
any executors/tasks that were being managed by the <code>mesos-agent</code> process will
continue to run. When <code>mesos-agent</code> is restarted, the operator can control how
those old executors/tasks are handled:</p>
<ol>
<li>By default, all the executors/tasks that were being managed by the old
<code>mesos-agent</code> process are killed.</li>
<li>If a framework enabled <em>checkpointing</em> when it registered with the master,
any executors belonging to that framework can reconnect to the new
<code>mesos-agent</code> process and continue running uninterrupted.</li>
</ol>
<p>Hence, enabling framework checkpointing enables tasks to tolerate Mesos agent
upgrades and unexpected <code>mesos-agent</code> crashes without experiencing any
downtime.</p>
<p>Agent recovery works by having the agent <em>checkpoint</em> information (e.g., Task
Info, Executor Info, Status Updates) about the tasks and executors it is
managing to local disk. If a framework enables checkpointing, any subsequent
agent restarts will recover the checkpointed information and reconnect with any
executors that are still running.</p>
<p>Note that if the operating system on the agent is rebooted, all executors and
tasks running on the host are killed and are not automatically restarted when
the host comes back up.</p>
<h2>Framework Configuration</h2>
<p>A framework can control whether its executors will be recovered by setting the <code>checkpoint</code> flag in its <code>FrameworkInfo</code> when registering with the master. Enabling this feature results in increased I/O overhead at each agent that runs tasks launched by the framework. By default, frameworks do <strong>not</strong> checkpoint their state.</p>
<h2>Agent Configuration</h2>
<p>Three <a href="/documentation/latest/./configuration/">configuration flags</a> control the recovery behavior of a Mesos agent:</p>
<ul>
<li><p><code>strict</code>: Whether to do agent recovery in strict mode [Default: true].</p>
<ul>
<li>If strict=true, all recovery errors are considered fatal.</li>
<li>If strict=false, any errors (e.g., corruption in checkpointed data) during recovery are
ignored and as much state as possible is recovered.</li>
</ul>
</li>
<li><p><code>recover</code>: Whether to recover status updates and reconnect with old executors [Default: reconnect].</p>
<ul>
<li>If recover=reconnect, reconnect with any old live executors, provided the executor&rsquo;s framework enabled checkpointing.</li>
<li>If recover=cleanup, kill any old live executors and exit. Use this option when doing an incompatible agent or executor upgrade!
<blockquote><p>NOTE: If no checkpointing information exists, no recovery is performed
and the agent registers with the master as a new agent.</p></blockquote></li>
</ul>
</li>
<li><p><code>recovery_timeout</code>: Amount of time allotted for the agent to recover [Default: 15 mins].</p>
<ul>
<li>If the agent takes longer than <code>recovery_timeout</code> to recover, any executors that are waiting to
reconnect to the agent will self-terminate.</li>
</ul>
</li>
</ul>
<blockquote><p>NOTE: If none of the frameworks have enabled checkpointing,
the executors and tasks running at an agent die when the agent dies
and are not recovered.</p></blockquote>
<p>A restarted agent should re-register with master within a timeout (75 seconds by default: see the <code>--max_agent_ping_timeouts</code> and <code>--agent_ping_timeout</code> <a href="/documentation/latest/./configuration/">configuration flags</a>). If the agent takes longer than this timeout to re-register, the master shuts down the agent, which in turn will shutdown any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting an agent (e.g., using a process supervisor such as <a href="http://mmonit.com/monit/">monit</a> or <code>systemd</code>).</p>
<h2>Known issues with <code>systemd</code> and process lifetime</h2>
<p>There is a known issue when using <code>systemd</code> to launch the <code>mesos-agent</code>. A description of the problem can be found in <a href="https://issues.apache.org/jira/browse/MESOS-3425">MESOS-3425</a> and all relevant work can be tracked in the epic <a href="https://issues.apache.org/jira/browse/MESOS-3007">MESOS-3007</a>.
This problem was fixed in Mesos <code>0.25.0</code> for the mesos containerizer when cgroups isolation is enabled. Further fixes for the posix isolators and docker containerizer are available in <code>0.25.1</code>, <code>0.26.1</code>, <code>0.27.1</code>, and <code>0.28.0</code>.</p>
<p>It is recommended that you use the default <a href="http://www.freedesktop.org/software/systemd/man/systemd.kill.html">KillMode</a> for systemd processes, which is <code>control-group</code>, which kills all child processes when the agent stops. This ensures that &ldquo;side-car&rdquo; processes such as the <code>fetcher</code> and <code>perf</code> are terminated alongside the agent.
The systemd patches for Mesos explicitly move executors and their children into a separate systemd slice, dissociating their lifetime from the agent. This ensures the executors survive agent restarts.</p>
<p>The following excerpt of a <code>systemd</code> unit configuration file shows how to set the flag explicitly:</p>
<pre><code>[Service]
ExecStart=/usr/bin/mesos-agent
KillMode=control-cgroup
</code></pre>
</div>
</div>
</div><!-- /.container -->
</div><!-- /.content -->
<hr>
<!-- footer -->
<div class="footer">
<div class="container">
<div class="col-md-4 social-blk">
<span class="social">
<a href="https://twitter.com/ApacheMesos"
class="twitter-follow-button"
data-show-count="false" data-size="large">Follow @ApacheMesos</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a href="https://twitter.com/intent/tweet?button_hashtag=mesos"
class="twitter-hashtag-button"
data-size="large"
data-related="ApacheMesos">Tweet #mesos</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</span>
</div>
<div class="col-md-8 trademark">
<p>&copy; 2012-2017 <a href="http://apache.org">The Apache Software Foundation</a>.
Apache Mesos, the Apache feather logo, and the Apache Mesos project logo are trademarks of The Apache Software Foundation.
<p>
</div>
</div><!-- /.container -->
</div><!-- /.footer -->
<!-- JS -->
<script src="//code.jquery.com/jquery-1.11.0.min.js" type="text/javascript"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.1.1/js/bootstrap.min.js" type="text/javascript"></script>
</body>
</html>