<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Apache Mesos - Designing Highly Available Mesos Frameworks</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta property="og:locale" content="en_US"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="Apache Mesos"/>
<meta property="og:site_name" content="Apache Mesos"/>
<meta property="og:url" content="http://mesos.apache.org/"/>
<meta property="og:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/>
<meta property="og:description"
content="Apache Mesos abstracts resources away from machines,
enabling fault-tolerant and elastic distributed systems
to easily be built and run effectively."/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:site" content="@ApacheMesos"/>
<meta name="twitter:title" content="Apache Mesos"/>
<meta name="twitter:image" content="http://mesos.apache.org/assets/img/mesos_logo_fb_preview.png"/>
<meta name="twitter:description"
content="Apache Mesos abstracts resources away from machines,
enabling fault-tolerant and elastic distributed systems
to easily be built and run effectively."/>
<link href="//netdna.bootstrapcdn.com/bootstrap/3.1.1/css/bootstrap.min.css" rel="stylesheet">
<link rel="alternate" type="application/atom+xml" title="Apache Mesos Blog" href="/blog/feed.xml">
<link href="../../../assets/css/main.css" media="screen" rel="stylesheet" type="text/css" />
</head>
<body>
<!-- navbar excitement -->
<div class="navbar navbar-default navbar-static-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#mesos-menu" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/"><img src="/assets/img/mesos_logo.png" alt="Apache Mesos logo"/></a>
</div><!-- /.navbar-header -->
<div class="navbar-collapse collapse" id="mesos-menu">
<ul class="nav navbar-nav navbar-right">
<li><a href="/gettingstarted/">Getting Started</a></li>
<li><a href="/blog/">Blog</a></li>
<li><a href="/documentation/latest/">Documentation</a></li>
<li><a href="/downloads/">Downloads</a></li>
<li><a href="/community/">Community</a></li>
</ul>
</div><!-- /#mesos-menu -->
</div><!-- /.container -->
</div><!-- /.navbar -->
<div class="content">
<div class="container">
<div class="row-fluid">
<div class="col-md-4">
<h4>If you're new to Mesos</h4>
<p>See the <a href="/gettingstarted/">getting started</a> page for more
information about downloading, building, and deploying Mesos.</p>
<h4>If you'd like to get involved or you're looking for support</h4>
<p>See our <a href="/community/">community</a> page for more details.</p>
</div>
<div class="col-md-8">
<h1>Designing Highly Available Mesos Frameworks</h1>
<p>A Mesos framework manages tasks. For a Mesos framework to be highly available,
it must continue to manage tasks correctly in the presence of a variety of
failure scenarios. The most common failure conditions that framework authors
should consider include:</p>
<ul>
<li><p>The Mesos master that a framework scheduler is connected to might fail, for
example by crashing or by losing network connectivity. If the master has been
configured to use <a href="/documentation/latest/./high-availability/">high-availability mode</a>, this will
result in promoting another Mesos master replica to become the current
leader. In this situation, the scheduler should re-register with the new
master and ensure that task state is consistent.</p></li>
<li><p>The host where a framework scheduler is running might fail. To ensure that the
framework remains available and can continue to schedule new tasks, framework
authors should ensure that multiple copies of the scheduler run on different
nodes, and that a backup copy is promoted to become the new leader when the
previous leader fails. Mesos itself does not dictate how framework authors
should handle this situation, although we provide some suggestions below. It
can be useful to deploy multiple copies of your framework scheduler using
a long-running task scheduler such as Apache Aurora or Marathon.</p></li>
<li><p>The host where a task is running might fail. Alternatively, the node itself
might not have failed but the Mesos agent on the node might be unable to
communicate with the Mesos master, e.g., due to a network partition.</p></li>
</ul>
<p>Note that more than one of these failures might occur simultaneously.</p>
<h2>Mesos Architecture</h2>
<p>Before discussing the specific failure scenarios outlined above, it is worth
highlighting some aspects of how Mesos is designed that influence high
availability:</p>
<ul>
<li><p>Mesos provides unreliable messaging between components by default: messages
are delivered &ldquo;at-most-once&rdquo; (they might be dropped). Framework authors should
expect that messages they send might not be received and be prepared to take
appropriate corrective action. To detect that a message might be lost,
frameworks typically use timeouts. For example, if a framework attempts to
launch a task, that message might not be received by the Mesos master (e.g.,
due to a transient network failure). To address this, the framework scheduler
should set a timeout after attempting to launch a new task. If the scheduler
hasn&rsquo;t seen a status update for the new task before the timeout fires, it
should take corrective action&mdash;for example, by performing <a href="/documentation/latest/./reconciliation/">task state reconciliation</a>,
and then launching a new copy of the task if necessary.</p>
<ul>
<li><p>In general, distributed systems cannot distinguish between &ldquo;lost&rdquo; messages
and messages that are merely delayed. In the example above, the scheduler
might see a status update for the first task launch attempt immediately
<em>after</em> its timeout has fired and it has already begun taking corrective
action. Scheduler authors should be aware of this possibility and program
accordingly.</p></li>
<li><p>Mesos actually provides ordered (but unreliable) message delivery between
any pair of processes: for example, if a framework sends messages M1 and M2
to the master, the master might receive no messages, just M1, just M2, or M1
followed by M2 &ndash; it will <em>not</em> receive M2 followed by M1.</p></li>
<li><p>As a convenience for framework authors, Mesos provides reliable delivery of
task status updates. The agent persists task status updates to disk and then
forwards them to the master. The master sends status updates to the
appropriate framework scheduler. When a scheduler acknowledges a status
update, the master forwards the acknowledgment back to the agent, which
allows the stored status update to be garbage collected. If the agent does
not receive an acknowledgment for a task status update within a certain
amount of time, it will repeatedly resend the status update to the master,
which will again forward the update to the scheduler. Hence, task status
updates will be delivered &ldquo;at least once&rdquo;, assuming that the agent and the
scheduler both remain available. To handle the fact that task status updates
might be delivered more than once, it can be helpful to make the framework
logic that processes them <a href="https://en.wikipedia.org/wiki/Idempotence">idempotent</a>
(a sketch of this appears after this list).</p></li>
</ul>
</li>
<li><p>The Mesos master stores information about the active tasks and registered
frameworks <em>in memory</em>: it does not persist it to disk or attempt to ensure
that this information is preserved after a master failover. This helps the
Mesos master scale to large clusters with many tasks and frameworks. A
downside of this design is that after a failure, more work is required to
recover the lost in-memory master state.</p></li>
<li><p>If all the Mesos masters are unavailable (e.g., crashed or unreachable), the
cluster should continue to operate: existing Mesos agents and user tasks should
continue running. However, new tasks cannot be scheduled, and frameworks will
not receive resource offers or status updates about previously launched tasks.</p></li>
<li><p>Mesos does not dictate how frameworks should be implemented and does not try
to assume responsibility for how frameworks should deal with failures.
Instead, Mesos tries to provide framework developers with the tools they need
to implement this behavior themselves. Different frameworks might choose to
handle failures differently, depending on their exact requirements.</p></li>
</ul>
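<p>As a concrete illustration of the timeout, reconciliation, and idempotence
patterns described above, here is a minimal sketch of a scheduler fragment. It
assumes the legacy Python scheduler bindings that ship with Mesos
(<code>mesos.interface</code> / <code>mesos.native</code>), a driver created with
implicit acknowledgements disabled, and hypothetical bookkeeping structures
(<code>self.pending</code>, <code>self.tasks</code>); it is not a complete
framework.</p>
<pre><code># Sketch only: the bookkeeping structures and the periodic call to
# check_launch_timeouts() are assumptions, not part of the Mesos API.
import time

from mesos.interface import Scheduler, mesos_pb2


class HAScheduler(Scheduler):
    LAUNCH_TIMEOUT_SECS = 60  # example value; tune for your workload

    def __init__(self):
        # resourceOffers() (not shown) records time.time() in self.pending
        # whenever it sends a launch message for a task.
        self.pending = {}  # task ID -&gt; time the launch message was sent
        self.tasks = {}    # task ID -&gt; last known task state

    def statusUpdate(self, driver, update):
        task_id = update.task_id.value

        # Status updates are delivered at least once, so this handler must be
        # idempotent: applying the same update twice has no further effect.
        self.tasks[task_id] = update.state
        self.pending.pop(task_id, None)

        # With implicit acknowledgements disabled, acknowledge only after the
        # update has been recorded by the framework.
        driver.acknowledgeStatusUpdate(update)

    def check_launch_timeouts(self, driver):
        # Called periodically from the framework's own event loop (this is
        # not a driver callback). Any launch we have heard nothing about
        # within the timeout is handed to the master for reconciliation.
        overdue = [task_id for task_id, sent in self.pending.items()
                   if time.time() - sent &gt; self.LAUNCH_TIMEOUT_SECS]
        statuses = []
        for task_id in overdue:
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task_id
            status.state = mesos_pb2.TASK_STAGING  # required field; the master replies with its own view
            statuses.append(status)
        if statuses:
            driver.reconcileTasks(statuses)
</code></pre>
<p>Remember that the reconciliation reply might race with a late status update
for the original launch attempt, so the handler above must tolerate updates
arriving in either order.</p>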
<h2>Recommendations for Highly Available Frameworks</h2>
<p>Highly available framework designs typically follow a few common patterns:</p>
<ol>
<li><p>To tolerate scheduler failures, frameworks run multiple scheduler instances
(three instances is typical). At any given time, only one of these scheduler
instances is the <em>leader</em>: this instance is connected to the Mesos master,
receives resource offers and task status updates, and launches new tasks. The
other scheduler replicas are <em>followers</em>: they are used only when the leader
fails, in which case one of the followers is chosen to become the new leader.</p></li>
<li><p>Schedulers need a mechanism to decide when the current scheduler leader has
failed and to elect a new leader. This is typically accomplished using a
coordination service like <a href="https://zookeeper.apache.org/">Apache ZooKeeper</a>
or <a href="https://github.com/coreos/etcd">etcd</a>. Consult the documentation of the
coordination system you are using for more information on how to correctly
implement leader election.</p></li>
<li><p>After electing a new leading scheduler, the new leader should reconnect to
the Mesos master. When registering with the master, the framework should set
the <code>id</code> field in its <code>FrameworkInfo</code> to the ID that was assigned to the
failed scheduler instance. This ensures that the master will recognize that
the connection does not start a new session, but rather continues (and
replaces) the session used by the failed scheduler instance.</p>
<blockquote><p>NOTE: When the old scheduler leader disconnects from the master, by default
the master will immediately kill all the tasks and executors associated with
the failed framework. For a typical production framework, this default
behavior is very undesirable! To avoid this, highly available frameworks
should set the <code>failover_timeout</code> field in their <code>FrameworkInfo</code> to a
generous value. To avoid accidental destruction of tasks in production
environments, many frameworks use a <code>failover_timeout</code> of 1 week or more.</p></blockquote>
<ul>
<li> In the current implementation, a framework&rsquo;s <code>failover_timeout</code> is not
preserved during master failover. Hence, if a framework fails and the
leading master then also fails before the <code>failover_timeout</code> is reached,
the newly elected leading master won&rsquo;t know that the framework&rsquo;s tasks
should be killed after a period of time. As a result, if the framework
never reregisters, those tasks will continue to run indefinitely but will
be orphaned. This behavior will likely be fixed in a future version of
Mesos (<a href="https://issues.apache.org/jira/browse/MESOS-4659">MESOS-4659</a>).</li>
</ul>
</li>
<li><p>After connecting to the Mesos master, the new leading scheduler should ensure
that its local state is consistent with the current state of the cluster. For
example, suppose that the previous leading scheduler attempted to launch a
new task and then immediately failed. The task might have launched
successfully, at which point the newly elected leader will begin to receive
status updates about it. To handle this situation, frameworks typically use a
strongly consistent distributed data store to record information about active
and pending tasks. In fact, the same coordination service that is used for
leader election (such as ZooKeeper or etcd) can often be used for this
purpose. Some Mesos frameworks (such as Apache Aurora) use the Mesos
<a href="/documentation/latest/./replicated-log-internals/">replicated log</a> for this purpose.</p>
<ul>
<li><p>The data store should be used to record the actions that the scheduler
<em>intends</em> to take, before it takes them. For example, if a scheduler
decides to launch a new task, it <em>first</em> writes this intent to its data
store. Then it sends a &ldquo;launch task&rdquo; message to the Mesos master. If this
instance of the scheduler fails and a new scheduler is promoted to become
the leader, the new leader can consult the data store to find <em>all possible
tasks</em> that might be running on the cluster. This is an instance of the
<a href="https://en.wikipedia.org/wiki/Write-ahead_logging">write-ahead logging</a>
pattern often employed by database systems and filesystems to improve
reliability. Two aspects of this design are worth emphasizing.</p>
<ol>
<li><p>The scheduler must persist its intent <em>before</em> launching the task: if
the task is launched first and then the scheduler fails before it can
write to the data store, the new leading scheduler won&rsquo;t know about the
new task. If this occurs, the new scheduler instance will begin
receiving task status updates for a task that it has no knowledge of;
there is often not a good way to recover from this situation.</p></li>
<li><p>The scheduler should ensure that its intent has been durably
recorded in the data store before continuing to launch the task (for
example, it should wait for a quorum of replicas in the data store to
acknowledge receipt of the write operation). For more details on
how to do this, consult the documentation for the data store you are
using. A sketch combining leader election, failover registration, and
write-ahead intent logging appears after this list.</p></li>
</ol>
</li>
</ul>
</li>
</ol>
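<p>The following sketch ties steps 2&ndash;4 together: leader election via
ZooKeeper, reregistration with the previous framework ID and a generous
<code>failover_timeout</code>, and write-ahead logging of launch intents. It
assumes the legacy Python scheduler bindings and the
<a href="https://kazoo.readthedocs.io/">kazoo</a> ZooKeeper client; the
ZooKeeper paths, the <code>MyScheduler</code> class, and the serialization of
intents are hypothetical, and error handling is omitted.</p>
<pre><code># Sketch only: a real framework must also handle ZooKeeper session expiry,
# loss of leadership, retries, and errors.
from kazoo.client import KazooClient
from mesos.interface import mesos_pb2
from mesos.native import MesosSchedulerDriver

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"          # example addresses
MESOS_MASTER = "zk://" + ZK_HOSTS + "/mesos"
FRAMEWORK_ID_PATH = "/myframework/framework_id"  # written by MyScheduler.registered()
INTENT_PATH = "/myframework/intents/"

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()


def record_intent(task_id, serialized_task):
    # Write-ahead logging: durably record what is about to be launched
    # *before* sending the launch message to the master. kazoo's create()
    # returns only once the write has been committed by a ZooKeeper quorum.
    zk.create(INTENT_PATH + task_id, serialized_task, makepath=True)


def run_as_leader():
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""            # let Mesos fill in the current user
    framework.name = "my-ha-framework"
    framework.checkpoint = True
    framework.failover_timeout = 7 * 24 * 3600  # one week, in seconds

    # Reuse the framework ID recorded by the previous leader, if any, so the
    # master treats this connection as a failover rather than a new framework.
    if zk.exists(FRAMEWORK_ID_PATH):
        framework_id, _stat = zk.get(FRAMEWORK_ID_PATH)
        framework.id.value = framework_id.decode("utf-8")

    # MyScheduler (not shown) stores the assigned framework ID in its
    # registered() callback and calls record_intent() before launchTasks().
    driver = MesosSchedulerDriver(MyScheduler(zk), framework, MESOS_MASTER)
    driver.run()


if __name__ == "__main__":
    # Block until this instance wins the election, then act as leader; if
    # run_as_leader() returns, leadership is released and contention resumes.
    election = zk.Election("/myframework/leader", "scheduler-instance-1")
    while True:
        election.run(run_as_leader)
</code></pre>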
<h2>The Life Cycle of a Task</h2>
<p>A Mesos task transitions through a sequence of states. The authoritative &ldquo;source
of truth&rdquo; for the current state of a task is the agent on which the task is
running. A framework scheduler learns about the current state of a task by
communicating with the Mesos master&mdash;specifically, by listening for task status
updates and by performing task state reconciliation.</p>
<p>Frameworks can represent the state of a task using a state machine, with one
initial state and several possible terminal states:</p>
<ul>
<li><p>A task begins in the <code>TASK_STAGING</code> state. A task is in this state when the
master has received the framework&rsquo;s request to launch the task but the task
has not yet started to run. In this state, the task&rsquo;s dependencies are
fetched&mdash;for example, using the <a href="/documentation/latest/./fetcher/">Mesos fetcher cache</a>.</p></li>
<li><p>The <code>TASK_STARTING</code> state is optional and intended primarily for use by
custom executors. It can be used to describe the fact that a custom executor
has learned about the task (and maybe started fetching its dependencies) but has
not yet started to run it.</p></li>
<li><p>A task transitions to the <code>TASK_RUNNING</code> state after it has begun running
successfully (if the task fails to start, it transitions to one of the
terminal states listed below).</p>
<ul>
<li><p>If a framework attempts to launch a task but does not receive a status
update for it within a timeout, the framework should perform
<a href="/documentation/latest/./reconciliation/">reconciliation</a>. That is, it should ask the master for
the current state of the task. The master will reply with <code>TASK_LOST</code> status
updates for unknown tasks. The framework can then use this to distinguish
between tasks that are slow to launch and tasks that the master has never
heard about (e.g., because the task launch message was dropped).</p>
<ul>
<li>Note that the correctness of this technique depends on the fact that
messaging between the scheduler and the master is ordered.</li>
</ul>
</li>
</ul>
</li>
<li><p>The <code>TASK_KILLING</code> state is optional and is intended to indicate that the
request to kill the task has been received by the executor, but the task has
not yet been killed. This is useful for tasks that require some time to
terminate gracefully. Executors must not generate this state unless the
framework has the <code>TASK_KILLING_STATE</code> framework capability.</p></li>
<li><p>There are several terminal states:</p>
<ul>
<li><code>TASK_FINISHED</code> is used when a task completes successfully.</li>
<li><code>TASK_FAILED</code> indicates that a task aborted with an error.</li>
<li><code>TASK_KILLED</code> indicates that a task was killed by the executor.</li>
<li><code>TASK_LOST</code> indicates that the task was running on an agent that has lost
contact with the current master (typically due to a network partition or an
agent host failure). This case is described further below.</li>
<li><code>TASK_ERROR</code> indicates that a task launch attempt failed because of an error
in the task specification.</li>
</ul>
</li>
</ul>
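<p>Opting in to the optional <code>TASK_KILLING</code> state from the list above
is done by advertising the corresponding capability when the framework
registers. A minimal sketch, assuming the legacy Python protobuf bindings:</p>
<pre><code>from mesos.interface import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""
framework.name = "my-ha-framework"

# Advertise the TASK_KILLING_STATE capability so executors are allowed to
# send TASK_KILLING updates for this framework's tasks.
capability = framework.capabilities.add()
capability.type = mesos_pb2.FrameworkInfo.Capability.TASK_KILLING_STATE
</code></pre>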
<p>Note that the same task status can be used in several different (but usually
related) situations. For example, <code>TASK_ERROR</code> is used when the framework&rsquo;s
principal is not authorized to launch tasks as a certain user, and also when the
task description is syntactically malformed (e.g., the task ID contains an
invalid character). The <code>reason</code> field of the <code>TaskStatus</code> message can be used
to disambiguate between such situations.</p>
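<p>A status update handler typically separates terminal from non-terminal
states and consults the <code>reason</code> field before deciding how to react.
A minimal sketch, again assuming the legacy Python bindings; the
<code>reschedule_task()</code> helper is hypothetical:</p>
<pre><code>from mesos.interface import mesos_pb2

TERMINAL_STATES = frozenset([
    mesos_pb2.TASK_FINISHED,
    mesos_pb2.TASK_FAILED,
    mesos_pb2.TASK_KILLED,
    mesos_pb2.TASK_LOST,
    mesos_pb2.TASK_ERROR,
])


def handle_status_update(update):
    task_id = update.task_id.value

    if update.state not in TERMINAL_STATES:
        return  # TASK_STAGING, TASK_STARTING, TASK_RUNNING, or TASK_KILLING

    if update.state == mesos_pb2.TASK_ERROR:
        # The same state can cover several situations; the reason field
        # disambiguates, e.g. an authorization failure versus a malformed
        # task description.
        print("task %s rejected: %s (%s)" % (
            task_id, update.message,
            mesos_pb2.TaskStatus.Reason.Name(update.reason)))
        return

    if update.state == mesos_pb2.TASK_LOST:
        # The agent may merely be partitioned from the master, so the task
        # could still be running; see the next section.
        reschedule_task(task_id)
</code></pre>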
<h2>Dealing with Partitioned or Failed Agents</h2>
<p>The Mesos master tracks the availability and health of the registered agents
using two different mechanisms:</p>
<ol>
<li><p>The state of a persistent TCP connection between the master and the agent.</p></li>
<li><p><em>Health checks</em> using periodic ping messages to the agent. The master sends
&ldquo;ping&rdquo; messages to the agent and expects a &ldquo;pong&rdquo; response message within a
configurable timeout. The agent is considered to have failed if it does not
respond promptly to a certain number of ping messages in a row. This behavior
is controlled by the <code>--agent_ping_timeout</code> and <code>--max_agent_ping_timeouts</code>
master flags.</p></li>
</ol>
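<p>For example, with <code>--agent_ping_timeout=15secs</code> and
<code>--max_agent_ping_timeouts=5</code>, an agent that stops responding to
pings is considered to have failed after roughly 15 &times; 5 = 75 seconds.</p>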
<p>If the persistent TCP connection to the agent breaks or the agent fails health
checks, the master decides that the agent has failed and takes steps to remove
it from the cluster. Specifically:</p>
<ul>
<li><p>If the TCP connection breaks, the agent is considered disconnected. The
semantics when a registered agent gets disconnected are as follows for each
framework running on that agent:</p>
<ul>
<li><p>If the framework is <a href="/documentation/latest/./agent-recovery/">checkpointing</a>: no immediate action
is taken. The agent is given a chance to reconnect until health checks time
out.</p></li>
<li><p>If the framework is not checkpointing: all the framework&rsquo;s tasks and
executors are considered lost. The master immediately sends <code>TASK_LOST</code>
status updates for the tasks. These updates are not delivered reliably to
the scheduler (see NOTE below). The agent is given a chance to reconnect
until health checks time out. If the agent does reconnect, any tasks for
which <code>TASK_LOST</code> updates were previously sent will be killed.</p>
<ul>
<li>The rationale for this behavior is that, using typical TCP settings, an
error in the persistent TCP connection between the master and the agent is
more likely to correspond to an agent error (e.g., the <code>mesos-agent</code>
process terminating unexpectedly) than a network partition, because the
Mesos health-check timeouts are much smaller than the typical values of
the corresponding TCP-level timeouts. Since non-checkpointing frameworks
will not survive a restart of the <code>mesos-agent</code> process, the master sends
<code>TASK_LOST</code> status updates so that these tasks can be rescheduled
promptly. Of course, the heuristic that TCP errors do not correspond to
network partitions may not be true in some environments.</li>
</ul>
</li>
</ul>
</li>
<li><p>If the agent fails health checks, it is scheduled for removal. The removals can
be rate limited by the master (see <code>--agent_removal_rate_limit</code> master flag)
to avoid removing a slew of agents at once (e.g., during a network partition).</p></li>
<li><p>When it is time to remove an agent, the master removes the agent from the list
of registered agents in the master&rsquo;s <a href="/documentation/latest/./replicated-log-internals/">durable state</a>
(this will survive master failover). The master sends a <code>slaveLost</code> callback
to every registered scheduler driver; it also sends <code>TASK_LOST</code> status updates
for every task that was running on the removed agent.</p>
<blockquote><p>NOTE: Neither the callback nor the task status updates are delivered
reliably by the master. For example, if the master or scheduler fails over
or there is a network connectivity issue during the delivery of these
messages, they will not be resent.</p></blockquote></li>
<li><p>Meanwhile, any tasks at the removed agent will continue to run and the agent
will repeatedly attempt to reconnect to the master. Once a removed agent is
able to reconnect to the master (e.g., because the network partition has
healed), the reregistration attempt will be refused and the agent will be
asked to shut down. The agent will then shut down all running tasks and
executors. Persistent volumes and dynamic reservations on the removed agent
will be preserved.</p>
<ul>
<li>A removed agent can rejoin the cluster by restarting the <code>mesos-agent</code>
process. When a removed agent is shut down by the master, Mesos ensures that
the next time <code>mesos-agent</code> is started (using the same work directory on the
same host), the agent will receive a new agent ID; in effect, the agent will
be treated as a newly joined agent. The agent will retain any previously
created persistent volumes and dynamic reservations, although the agent ID
associated with these resources will have changed.</li>
</ul>
</li>
</ul>
<p>Typically, frameworks respond to failed or partitioned agents by scheduling new
copies of the tasks that were running on the lost agent. This should be done
with caution, however: it is possible that the lost agent is still alive, but is
partitioned from the master and is unable to communicate with it. Depending on
the nature of the network partition, tasks on the agent might still be able to
communicate with external clients or other hosts in the cluster. Frameworks can
take steps to prevent this (e.g., by having tasks connect to ZooKeeper and cease
operation if their ZooKeeper session expires), but Mesos leaves such details to
framework authors.</p>
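<p>A sketch of this response in scheduler code, assuming the legacy Python
bindings; <code>tasks_on_agent()</code> and <code>launch_replacement()</code> are
hypothetical framework helpers:</p>
<pre><code>from mesos.interface import Scheduler


class HAScheduler(Scheduler):
    def slaveLost(self, driver, slaveId):
        # The master has removed this agent, but its tasks may still be
        # running behind a network partition. Launch replacements only if
        # the workload tolerates brief duplication or the old tasks are
        # known to stop themselves (e.g., when their ZooKeeper session
        # expires), as discussed above.
        for task_id in tasks_on_agent(slaveId.value):
            launch_replacement(task_id)
</code></pre>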
<h2>Dealing with Partitioned or Failed Masters</h2>
<p>The behavior described above does not apply during the period immediately after
a new Mesos master is elected. As noted above, most Mesos master state is only
kept in memory; hence, when the leading master fails and a new master is
elected, the new master will have little knowledge of the current state of the
cluster. Instead, it rebuilds this information as the frameworks and agents
notice that a new master has been elected and then <em>reregister</em> with it.</p>
<h3>Framework Reregistration</h3>
<p>When master failover occurs, frameworks that were connected to the previous
leading master should reconnect to the new leading
master. <code>MesosSchedulerDriver</code> handles most of the details of detecting when the
previous leading master has failed and connecting to the new leader; when the
framework has successfully reregistered with the new leading master, the
<code>reregistered</code> scheduler driver callback will be invoked.</p>
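<p>The <code>reregistered</code> callback is also a natural place to trigger
task state reconciliation, since the new master has rebuilt its state from
scratch. A minimal sketch, assuming the legacy Python bindings and a
hypothetical <code>self.known_tasks</code> collection:</p>
<pre><code>from mesos.interface import Scheduler, mesos_pb2


class HAScheduler(Scheduler):
    def reregistered(self, driver, masterInfo):
        # Ask the new leading master for its view of every task we believe
        # we own. Note that the master cannot answer for tasks on agents
        # that have not yet reregistered (see the next section).
        statuses = []
        for task_id in self.known_tasks:
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task_id
            status.state = mesos_pb2.TASK_STAGING  # required field; the master replies with its own view
            statuses.append(status)
        driver.reconcileTasks(statuses)
</code></pre>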
<h3>Agent Reregistration</h3>
<p>During the period after a new master has been elected but before a given agent
has reregistered or the <code>agent_reregister_timeout</code> has fired, attempting to
reconcile the state of a task running on that agent will not return any
information (because the master cannot accurately determine the state of the
task).</p>
<p>If an agent does not reregister with the new master within a timeout (controlled
by the <code>--agent_reregister_timeout</code> configuration flag), the master marks the
agent as failed and follows the same steps described above. However, there is
one difference: by default, agents are <em>allowed to reconnect</em> following master
failover, even after the <code>agent_reregister_timeout</code> has fired. This means that
frameworks might see a <code>TASK_LOST</code> update for a task but then later discover
that the task is running (because the agent where it was running was allowed to
reconnect).</p>
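<p>Frameworks should therefore be prepared for a running status update to
arrive for a task they have already marked as lost. A minimal sketch of that
logic, with hypothetical bookkeeping (<code>task_records</code>,
<code>marked_lost</code>, <code>replacement_launched</code>):</p>
<pre><code>from mesos.interface import mesos_pb2


def handle_possible_comeback(driver, update, task_records):
    record = task_records.get(update.task_id.value)
    if record is None or not record.marked_lost:
        return

    if update.state == mesos_pb2.TASK_RUNNING:
        # The agent reconnected after the master failover: the original task
        # is still running. Either adopt it again or kill it, depending on
        # whether a replacement has already been launched.
        if record.replacement_launched:
            driver.killTask(update.task_id)
        else:
            record.marked_lost = False
</code></pre>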
</div>
</div>
</div><!-- /.container -->
</div><!-- /.content -->
<hr>
<!-- footer -->
<div class="footer">
<div class="container">
<div class="col-md-4 social-blk">
<span class="social">
<a href="https://twitter.com/ApacheMesos"
class="twitter-follow-button"
data-show-count="false" data-size="large">Follow @ApacheMesos</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a href="https://twitter.com/intent/tweet?button_hashtag=mesos"
class="twitter-hashtag-button"
data-size="large"
data-related="ApacheMesos">Tweet #mesos</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</span>
</div>
<div class="col-md-8 trademark">
<p>&copy; 2012-2017 <a href="http://apache.org">The Apache Software Foundation</a>.
Apache Mesos, the Apache feather logo, and the Apache Mesos project logo are trademarks of The Apache Software Foundation.
</p>
</div>
</div><!-- /.container -->
</div><!-- /.footer -->
<!-- JS -->
<script src="//code.jquery.com/jquery-1.11.0.min.js" type="text/javascript"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.1.1/js/bootstrap.min.js" type="text/javascript"></script>
</body>
</html>