| <!DOCTYPE html> |
| <html> |
| <head> |
| <meta charset="utf-8"> |
| <title>Apache Mesos - Agent Recovery</title> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| |
| <meta property="og:locale" content="en_US"/> |
| <meta property="og:type" content="website"/> |
| <meta property="og:title" content="Apache Mesos"/> |
| <meta property="og:site_name" content="Apache Mesos"/> |
| <meta property="og:url" content="http://mesos.apache.org/"/> |
| <meta property="og:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/> |
| <meta property="og:description" |
| content="Apache Mesos abstracts resources away from machines, |
| enabling fault-tolerant and elastic distributed systems |
| to easily be built and run effectively."/> |
| |
| <meta name="twitter:card" content="summary"/> |
| <meta name="twitter:site" content="@ApacheMesos"/> |
| <meta name="twitter:title" content="Apache Mesos"/> |
| <meta name="twitter:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/> |
| <meta name="twitter:description" |
| content="Apache Mesos abstracts resources away from machines, |
| enabling fault-tolerant and elastic distributed systems |
| to easily be built and run effectively."/> |
| |
| <link href="//netdna.bootstrapcdn.com/bootstrap/3.1.1/css/bootstrap.min.css" rel="stylesheet"> |
| <link rel="alternate" type="application/atom+xml" title="Apache Mesos Blog" href="/blog/feed.xml"> |
| <link href="../../assets/css/main.css" media="screen" rel="stylesheet" type="text/css" /> |
| |
| |
| |
| <!-- Google Analytics Magic --> |
| <script type="text/javascript"> |
| var _gaq = _gaq || []; |
| _gaq.push(['_setAccount', 'UA-20226872-1']); |
| _gaq.push(['_setDomainName', 'apache.org']); |
| _gaq.push(['_trackPageview']); |
| |
| (function() { |
| var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; |
| ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; |
| var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); |
| })(); |
| </script> |
| |
| </head> |
| <body> |
| <!-- magical breadcrumbs --> |
| <div class="topnav"> |
| <div class="container"> |
| <ul class="breadcrumb"> |
| <li> |
| <div class="dropdown"> |
| <a data-toggle="dropdown" href="#">Apache Software Foundation <span class="caret"></span></a> |
| <ul class="dropdown-menu" role="menu" aria-labelledby="dLabel"> |
| <li><a href="http://www.apache.org">Apache Homepage</a></li> |
| <li><a href="http://www.apache.org/licenses/">License</a></li> |
| <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> |
| <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> |
| <li><a href="http://www.apache.org/security/">Security</a></li> |
| </ul> |
| </div> |
| </li> |
| |
| <li><a href="http://mesos.apache.org">Apache Mesos</a></li> |
| |
| |
| <li><a href="/documentation |
| /">Documentation |
| </a></li> |
| |
| |
| </ul><!-- /.breadcrumb --> |
| </div><!-- /.container --> |
| </div><!-- /.topnav --> |
| |
| <!-- navbar excitement --> |
| <div class="navbar navbar-default navbar-static-top" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#mesos-menu" aria-expanded="false"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| <a class="navbar-brand" href="/"><img src="/assets/img/mesos_logo.png" alt="Apache Mesos logo"/></a> |
| </div><!-- /.navbar-header --> |
| |
| <div class="navbar-collapse collapse" id="mesos-menu"> |
| <ul class="nav navbar-nav navbar-right"> |
| <li><a href="/gettingstarted/">Getting Started</a></li> |
| <li><a href="/blog/">Blog</a></li> |
| <li><a href="/documentation/latest/">Documentation</a></li> |
| <li><a href="/downloads/">Downloads</a></li> |
| <li><a href="/community/">Community</a></li> |
| </ul> |
| </div><!-- /#mesos-menu --> |
| </div><!-- /.container --> |
| </div><!-- /.navbar --> |
| |
| <div class="content"> |
| <div class="container"> |
| <div class="row-fluid"> |
| <div class="col-md-4"> |
| <h4>If you're new to Mesos</h4> |
| <p>See the <a href="/gettingstarted/">getting started</a> page for more |
| information about downloading, building, and deploying Mesos.</p> |
| |
| <h4>If you'd like to get involved or you're looking for support</h4> |
| <p>See our <a href="/community/">community</a> page for more details.</p> |
| </div> |
| <div class="col-md-8"> |
| <h1>Agent Recovery</h1> |
| |
| <p>If the <code>mesos-agent</code> process on a host exits (perhaps due to a Mesos bug or |
| because the operator kills the process while <a href="/documentation/latest/./upgrades/">upgrading Mesos</a>), |
| any executors/tasks that were being managed by the <code>mesos-agent</code> process will |
| continue to run. When <code>mesos-agent</code> is restarted, the operator can control how |
| those old executors/tasks are handled:</p> |
| |
| <ol> |
| <li>By default, all the executors/tasks that were being managed by the old |
| <code>mesos-agent</code> process are killed.</li> |
| <li>If a framework enabled <em>checkpointing</em> when it registered with the master, |
| any executors belonging to that framework can reconnect to the new |
| <code>mesos-agent</code> process and continue running uninterrupted.</li> |
| </ol> |
| |
| |
| <p>Hence, enabling framework checkpointing enables tasks to tolerate Mesos agent |
| upgrades and unexpected <code>mesos-agent</code> crashes without experiencing any |
| downtime.</p> |
| |
| <p>Agent recovery works by having the agent <em>checkpoint</em> information (e.g., Task |
| Info, Executor Info, Status Updates) about the tasks and executors it is |
| managing to local disk. If a framework enables checkpointing, any subsequent |
| agent restarts will recover the checkpointed information and reconnect with any |
| executors that are still running.</p> |
| |
| <p>Note that if the operating system on the agent is rebooted, all executors and |
| tasks running on the host are killed and are not automatically restarted when |
| the host comes back up.</p> |
| |
| <h2>Framework Configuration</h2> |
| |
| <p>A framework can control whether its executors will be recovered by setting the <code>checkpoint</code> flag in its <code>FrameworkInfo</code> when registering with the master. Enabling this feature results in increased I/O overhead at each agent that runs tasks launched by the framework. By default, frameworks do <strong>not</strong> checkpoint their state.</p> |
| |
| <h2>Agent Configuration</h2> |
| |
| <p>Three <a href="/documentation/latest/./configuration/">configuration flags</a> control the recovery behavior of a Mesos agent:</p> |
| |
| <ul> |
| <li><p><code>strict</code>: Whether to do agent recovery in strict mode [Default: true].</p> |
| |
| <ul> |
| <li>If strict=true, all recovery errors are considered fatal.</li> |
| <li>If strict=false, any errors (e.g., corruption in checkpointed data) during recovery are |
| ignored and as much state as possible is recovered.</li> |
| </ul> |
| </li> |
| <li><p><code>recover</code>: Whether to recover status updates and reconnect with old executors [Default: reconnect].</p> |
| |
| <ul> |
| <li>If recover=reconnect, reconnect with any old live executors, provided the executor’s framework enabled checkpointing.</li> |
| <li>If recover=cleanup, kill any old live executors and exit. Use this option when doing an incompatible agent or executor upgrade! |
| |
| <blockquote><p>NOTE: If no checkpointing information exists, no recovery is performed |
| and the agent registers with the master as a new agent.</p></blockquote></li> |
| </ul> |
| </li> |
| <li><p><code>recovery_timeout</code>: Amount of time allotted for the agent to recover [Default: 15 mins].</p> |
| |
| <ul> |
| <li>If the agent takes longer than <code>recovery_timeout</code> to recover, any executors that are waiting to |
| reconnect to the agent will self-terminate.</li> |
| </ul> |
| </li> |
| </ul> |
| |
| |
| <blockquote><p>NOTE: If none of the frameworks have enabled checkpointing, |
| the executors and tasks running at an agent die when the agent dies |
| and are not recovered.</p></blockquote> |
| |
| <p>A restarted agent should re-register with master within a timeout (75 seconds by default: see the <code>--max_agent_ping_timeouts</code> and <code>--agent_ping_timeout</code> <a href="/documentation/latest/./configuration/">configuration flags</a>). If the agent takes longer than this timeout to re-register, the master shuts down the agent, which in turn will shutdown any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting an agent (e.g., using a process supervisor such as <a href="http://mmonit.com/monit/">monit</a> or <code>systemd</code>).</p> |
| |
| <h2>Known issues with <code>systemd</code> and process lifetime</h2> |
| |
| <p>There is a known issue when using <code>systemd</code> to launch the <code>mesos-agent</code>. A description of the problem can be found in <a href="https://issues.apache.org/jira/browse/MESOS-3425">MESOS-3425</a> and all relevant work can be tracked in the epic <a href="https://issues.apache.org/jira/browse/MESOS-3007">MESOS-3007</a>. |
| This problem was fixed in Mesos <code>0.25.0</code> for the mesos containerizer when cgroups isolation is enabled. Further fixes for the posix isolators and docker containerizer are available in <code>0.25.1</code>, <code>0.26.1</code>, <code>0.27.1</code>, and <code>0.28.0</code>.</p> |
| |
| <p>It is recommended that you use the default <a href="http://www.freedesktop.org/software/systemd/man/systemd.kill.html">KillMode</a> for systemd processes, which is <code>control-group</code>, which kills all child processes when the agent stops. This ensures that “side-car” processes such as the <code>fetcher</code> and <code>perf</code> are terminated alongside the agent. |
| The systemd patches for Mesos explicitly move executors and their children into a separate systemd slice, dissociating their lifetime from the agent. This ensures the executors survive agent restarts.</p> |
| |
| <p>The following excerpt of a <code>systemd</code> unit configuration file shows how to set the flag explicitly:</p> |
| |
| <pre><code>[Service] |
| ExecStart=/usr/bin/mesos-agent |
| KillMode=control-cgroup |
| </code></pre> |
| |
| </div> |
| </div> |
| |
| </div><!-- /.container --> |
| </div><!-- /.content --> |
| |
| <hr> |
| |
| |
| |
| <!-- footer --> |
| <div class="footer"> |
| <div class="container"> |
| <div class="col-md-4 social-blk"> |
| <span class="social"> |
| <a href="https://twitter.com/ApacheMesos" |
| class="twitter-follow-button" |
| data-show-count="false" data-size="large">Follow @ApacheMesos</a> |
| <script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script> |
| <a href="https://twitter.com/intent/tweet?button_hashtag=mesos" |
| class="twitter-hashtag-button" |
| data-size="large" |
| data-related="ApacheMesos">Tweet #mesos</a> |
| <script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script> |
| </span> |
| </div> |
| |
| <div class="col-md-8 trademark"> |
| <p>© 2012-2017 <a href="http://apache.org">The Apache Software Foundation</a>. |
| Apache Mesos, the Apache feather logo, and the Apache Mesos project logo are trademarks of The Apache Software Foundation. |
| <p> |
| </div> |
| </div><!-- /.container --> |
| </div><!-- /.footer --> |
| |
| <!-- JS --> |
| <script src="//code.jquery.com/jquery-1.11.0.min.js" type="text/javascript"></script> |
| <script src="//netdna.bootstrapcdn.com/bootstrap/3.1.1/js/bootstrap.min.js" type="text/javascript"></script> |
| </body> |
| </html> |