blob: fcbc0a974629cb72e36e066e8967f24fc7c65724 [file] [log] [blame]
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Apache Mesos - Operational Guide</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta property="og:locale" content="en_US"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="Apache Mesos"/>
<meta property="og:site_name" content="Apache Mesos"/>
<meta property="og:url" content="http://mesos.apache.org/"/>
<meta property="og:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/>
<meta property="og:description"
content="Apache Mesos abstracts resources away from machines,
enabling fault-tolerant and elastic distributed systems
to easily be built and run effectively."/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:site" content="@ApacheMesos"/>
<meta name="twitter:title" content="Apache Mesos"/>
<meta name="twitter:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/>
<meta name="twitter:description"
content="Apache Mesos abstracts resources away from machines,
enabling fault-tolerant and elastic distributed systems
to easily be built and run effectively."/>
<link href="//netdna.bootstrapcdn.com/bootstrap/3.1.1/css/bootstrap.min.css" rel="stylesheet">
<link rel="alternate" type="application/atom+xml" title="Apache Mesos Blog" href="/blog/feed.xml">
<link href="../../assets/css/main.css" media="screen" rel="stylesheet" type="text/css" />
<!-- Google Analytics Magic -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-20226872-1']);
_gaq.push(['_setDomainName', 'apache.org']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<!-- magical breadcrumbs -->
<div class="topnav">
<div class="container">
<ul class="breadcrumb">
<li>
<div class="dropdown">
<a data-toggle="dropdown" href="#">Apache Software Foundation <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu" aria-labelledby="dLabel">
<li><a href="http://www.apache.org">Apache Homepage</a></li>
<li><a href="http://www.apache.org/licenses/">License</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
<li><a href="http://www.apache.org/security/">Security</a></li>
</ul>
</div>
</li>
<li><a href="http://mesos.apache.org">Apache Mesos</a></li>
<li><a href="/documentation
/">Documentation
</a></li>
</ul><!-- /.breadcrumb -->
</div><!-- /.container -->
</div><!-- /.topnav -->
<!-- navbar excitement -->
<div class="navbar navbar-default navbar-static-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#mesos-menu" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/"><img src="/assets/img/mesos_logo.png" alt="Apache Mesos logo"/></a>
</div><!-- /.navbar-header -->
<div class="navbar-collapse collapse" id="mesos-menu">
<ul class="nav navbar-nav navbar-right">
<li><a href="/gettingstarted/">Getting Started</a></li>
<li><a href="/blog/">Blog</a></li>
<li><a href="/documentation/latest/">Documentation</a></li>
<li><a href="/downloads/">Downloads</a></li>
<li><a href="/community/">Community</a></li>
</ul>
</div><!-- /#mesos-menu -->
</div><!-- /.container -->
</div><!-- /.navbar -->
<div class="content">
<div class="container">
<div class="row-fluid">
<div class="col-md-4">
<h4>If you're new to Mesos</h4>
<p>See the <a href="/gettingstarted/">getting started</a> page for more
information about downloading, building, and deploying Mesos.</p>
<h4>If you'd like to get involved or you're looking for support</h4>
<p>See our <a href="/community/">community</a> page for more details.</p>
</div>
<div class="col-md-8">
<h1>Operational Guide</h1>
<h2>Using a process supervisor</h2>
<p>Mesos uses a &ldquo;<a href="https://en.wikipedia.org/wiki/Fail-fast">fail-fast</a>&rdquo; approach to error handling: if a serious error occurs, Mesos will typically exit rather than trying to continue running in a possibly erroneous state. For example, when Mesos is configured for <a href="/documentation/latest/./high-availability/">high availability</a>, the leading master will abort itself when it discovers it has been partitioned away from the Zookeeper quorum. This is a safety precaution to ensure the previous leader doesn&rsquo;t continue communicating in an unsafe state.</p>
<p>To ensure that such failures are handled appropriately, production deployments of Mesos typically use a <em>process supervisor</em> (such as systemd or supervisord) to detect when Mesos processes exit. The supervisor can be configured to restart the failed process automatically and/or to notify the cluster operator to investigate the situation.</p>
<h2>Changing the master quorum</h2>
<p>The master leverages a <a href="/documentation/latest/./replicated-log-internals/">Paxos-based replicated log</a> as its storage backend (<code>--registry=replicated_log</code> is the only storage backend currently supported). Each master participates in the ensemble as a log replica. The <code>--quorum</code> flag determines a majority of the masters.</p>
<p>The following table shows the tolerance to master failures for each quorum size:</p>
<table>
<thead>
<tr>
<th style="text-align:right;"> Masters </th>
<th style="text-align:right;"> Quorum Size </th>
<th style="text-align:right;"> Failure Tolerance </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:right;"> 3 </td>
<td style="text-align:right;"> 2 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 5 </td>
<td style="text-align:right;"> 3 </td>
<td style="text-align:right;"> 2 </td>
</tr>
<tr>
<td style="text-align:right;"> &hellip; </td>
<td style="text-align:right;"> &hellip; </td>
<td style="text-align:right;"> &hellip; </td>
</tr>
<tr>
<td style="text-align:right;"> 2N - 1 </td>
<td style="text-align:right;"> N </td>
<td style="text-align:right;"> N - 1 </td>
</tr>
</tbody>
</table>
<p>It is recommended to run with 3 or 5 masters, when desiring high availability.</p>
<h3>NOTE</h3>
<p>When configuring the quorum, it is essential to ensure that there are only so many masters running as specified in the table above. If additional masters are running, this violates the quorum and the log may be corrupted! As a result, it is recommended to gate the running of the master process with something that enforces a static whitelist of the master hosts. See <a href="https://issues.apache.org/jira/browse/MESOS-1546">MESOS-1546</a> for adding a safety whitelist within Mesos itself.</p>
<p>For online reconfiguration of the log, see: <a href="https://issues.apache.org/jira/browse/MESOS-683">MESOS-683</a>.</p>
<h3>Increasing the quorum size</h3>
<p>As the size of a cluster grows, it may be desired to increase the quorum size for additional fault tolerance.</p>
<p>The following steps indicate how to increment the quorum size, using 3 -> 5 masters as an example (quorum size 2 -> 3):</p>
<ol>
<li>Initially, 3 masters are running with <code>--quorum=2</code></li>
<li>Restart the original 3 masters with <code>--quorum=3</code></li>
<li>Start 2 additional masters with <code>--quorum=3</code></li>
</ol>
<p>To increase the quorum by N, repeat this process to increment the quorum size N times.</p>
<p>NOTE: Currently, moving out of a single master setup requires wiping the replicated log
state and starting fresh. This will wipe all persistent data (e.g., agents, maintenance
information, quota information, etc). To move from 1 master to 3 masters:</p>
<ol>
<li>Stop the standalone master.</li>
<li>Remove the replicated log data (<code>replicated_log</code> under the <code>--work_dir</code>).</li>
<li>Start the original master and two new masters with <code>--quorum=2</code></li>
</ol>
<h3>Decreasing the quorum size</h3>
<p>The following steps indicate how to decrement the quorum size, using 5 -> 3 masters as an example (quorum size 3 -> 2):</p>
<ol>
<li>Initially, 5 masters are running with <code>--quorum=3</code></li>
<li>Remove 2 masters from the cluster, ensure they will not be restarted (see NOTE section above). Now 3 masters are running with <code>--quorum=3</code></li>
<li>Restart the 3 masters with <code>--quorum=2</code></li>
</ol>
<p>To decrease the quorum by N, repeat this process to decrement the quorum size N times.</p>
<h3>Replacing a master</h3>
<p>Please see the NOTE section above. So long as the failed master is guaranteed to not re-join the ensemble, it is safe to start a new master <em>with an empty log</em> and allow it to catch up.</p>
<h2>External access for Mesos master</h2>
<p>If the default IP (or the command line arg <code>--ip</code>) is an internal IP, then external entities such as framework schedulers will be unable to reach the master. To address that scenario, an externally accessible IP:port can be setup via the <code>--advertise_ip</code> and <code>--advertise_port</code> command line arguments of <code>mesos-master</code>. If configured, external entities such as framework schedulers interact with the advertise_ip:advertise_port from where the request needs to be proxied to the internal IP:port on which the Mesos master is listening.</p>
<h2>HTTP requests to non-leading master</h2>
<p>HTTP requests to some master endpoints (e.g., <a href="/documentation/latest/./endpoints/master/state/">/state</a>, <a href="/documentation/latest/./endpoints/master/machine/down/">/machine/down</a>) can only be answered by the leading master. Such requests made to a non-leading master will result in either a <code>307 Temporary Redirect</code> (with the location of the leading master) or <code>503 Service Unavailable</code> (if the master does not know who the current leader is).</p>
</div>
</div>
</div><!-- /.container -->
</div><!-- /.content -->
<hr>
<!-- footer -->
<div class="footer">
<div class="container">
<div class="col-md-4 social-blk">
<span class="social">
<a href="https://twitter.com/ApacheMesos"
class="twitter-follow-button"
data-show-count="false" data-size="large">Follow @ApacheMesos</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a href="https://twitter.com/intent/tweet?button_hashtag=mesos"
class="twitter-hashtag-button"
data-size="large"
data-related="ApacheMesos">Tweet #mesos</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</span>
</div>
<div class="col-md-8 trademark">
<p>&copy; 2012-2017 <a href="http://apache.org">The Apache Software Foundation</a>.
Apache Mesos, the Apache feather logo, and the Apache Mesos project logo are trademarks of The Apache Software Foundation.
<p>
</div>
</div><!-- /.container -->
</div><!-- /.footer -->
<!-- JS -->
<script src="//code.jquery.com/jquery-1.11.0.min.js" type="text/javascript"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.1.1/js/bootstrap.min.js" type="text/javascript"></script>
</body>
</html>