| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
| <title>Apache Aurora</title> |
| <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> |
| <link href="/assets/css/main.css" rel="stylesheet"> |
| <!-- Analytics --> |
| <script type="text/javascript"> |
| var _gaq = _gaq || []; |
| _gaq.push(['_setAccount', 'UA-45879646-1']); |
| _gaq.push(['_setDomainName', 'apache.org']); |
| _gaq.push(['_trackPageview']); |
| |
| (function() { |
| var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; |
| ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; |
| var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); |
| })(); |
| </script> |
| </head> |
| <body> |
| <div class="container-fluid section-header"> |
| <div class="container"> |
| <div class="nav nav-bar"> |
| <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> |
| <ul class="nav navbar-nav navbar-right"> |
| <li><a href="/documentation/latest/">Documentation</a></li> |
| <li><a href="/community/">Community</a></li> |
| <li><a href="/downloads/">Downloads</a></li> |
| <li><a href="/blog/">Blog</a></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| |
| <div class="container-fluid"> |
| <div class="container content"> |
| <div class="col-md-12 documentation"> |
| <h5 class="page-header text-uppercase">Documentation |
| <select onChange="window.location.href='/documentation/' + this.value + '/deploying-aurora-scheduler/'" |
| value="0.8.0"> |
| <option value="0.22.0" |
| > |
| 0.22.0 |
| (latest) |
| </option> |
| <option value="0.21.0" |
| > |
| 0.21.0 |
| </option> |
| <option value="0.20.0" |
| > |
| 0.20.0 |
| </option> |
| <option value="0.19.1" |
| > |
| 0.19.1 |
| </option> |
| <option value="0.19.0" |
| > |
| 0.19.0 |
| </option> |
| <option value="0.18.1" |
| > |
| 0.18.1 |
| </option> |
| <option value="0.18.0" |
| > |
| 0.18.0 |
| </option> |
| <option value="0.17.0" |
| > |
| 0.17.0 |
| </option> |
| <option value="0.16.0" |
| > |
| 0.16.0 |
| </option> |
| <option value="0.15.0" |
| > |
| 0.15.0 |
| </option> |
| <option value="0.14.0" |
| > |
| 0.14.0 |
| </option> |
| <option value="0.13.0" |
| > |
| 0.13.0 |
| </option> |
| <option value="0.12.0" |
| > |
| 0.12.0 |
| </option> |
| <option value="0.11.0" |
| > |
| 0.11.0 |
| </option> |
| <option value="0.10.0" |
| > |
| 0.10.0 |
| </option> |
| <option value="0.9.0" |
| > |
| 0.9.0 |
| </option> |
| <option value="0.8.0" |
| selected="selected"> |
| 0.8.0 |
| </option> |
| <option value="0.7.0-incubating" |
| > |
| 0.7.0-incubating |
| </option> |
| <option value="0.6.0-incubating" |
| > |
| 0.6.0-incubating |
| </option> |
| <option value="0.5.0-incubating" |
| > |
| 0.5.0-incubating |
| </option> |
| </select> |
| </h5> |
| <h1 id="deploying-the-aurora-scheduler">Deploying the Aurora Scheduler</h1> |
| |
| <p>When setting up your cluster, you will install the scheduler on a small number (usually 3 or 5) of |
| machines. This guide helps you get the scheduler set up and troubleshoot some common hurdles.</p> |
| |
| <ul> |
| <li><a href="#installing-aurora">Installing Aurora</a> |
| |
| <ul> |
| <li><a href="#creating-the-distribution-zip-file-optional">Creating the Distribution .zip File (Optional)</a></li> |
| <li><a href="#installing-aurora-1">Installing Aurora</a></li> |
| </ul></li> |
| <li><a href="#configuring-aurora">Configuring Aurora</a> |
| |
| <ul> |
| <li><a href="#a-note-on-configuration">A Note on Configuration</a></li> |
| <li><a href="#replicated-log-configuration">Replicated Log Configuration</a></li> |
| <li><a href="#initializing-the-replicated-log">Initializing the Replicated Log</a></li> |
| <li><a href="#storage-performance-considerations">Storage Performance Considerations</a></li> |
| <li><a href="#network-considerations">Network considerations</a></li> |
| <li><a href="#considerations-for-running-jobs-in-docker">Considerations for running jobs in docker</a></li> |
| <li><a href="#security-considerations">Security Considerations</a></li> |
| </ul></li> |
| <li><a href="#running-aurora">Running Aurora</a> |
| |
| <ul> |
| <li><a href="#maintaining-an-aurora-installation">Maintaining an Aurora Installation</a></li> |
| <li><a href="#monitoring">Monitoring</a></li> |
| <li><a href="#running-stateful-services">Running stateful services</a></li> |
| <li><a href="#dedicated-attribute">Dedicated attribute</a> |
| |
| <ul> |
| <li><a href="#syntax">Syntax</a></li> |
| <li><a href="#example">Example</a></li> |
| </ul></li> |
| </ul></li> |
| <li><a href="#common-problems">Common problems</a> |
| |
| <ul> |
| <li><a href="#replicated-log-not-initialized">Replicated log not initialized</a></li> |
| <li><a href="#symptoms">Symptoms</a></li> |
| <li><a href="#solution">Solution</a></li> |
| <li><a href="#scheduler-not-registered">Scheduler not registered</a></li> |
| <li><a href="#symptoms-1">Symptoms</a></li> |
| <li><a href="#solution-1">Solution</a></li> |
| <li><a href="#tasks-are-stuck-in-pending-forever">Tasks are stuck in PENDING forever</a></li> |
| <li><a href="#symptoms-2">Symptoms</a></li> |
| <li><a href="#solution-2">Solution</a></li> |
| </ul></li> |
| </ul> |
| |
| <h2 id="installing-aurora">Installing Aurora</h2> |
| |
| <p>The Aurora scheduler is a standalone Java server. As part of the build process it creates a bundle |
| of all its dependencies, with the notable exceptions of the JVM and libmesos. Each target server |
| should have a JVM (Java 7 or higher) and libmesos (0.22.0) installed.</p> |
| |
| <h3 id="creating-the-distribution-zip-file-optional">Creating the Distribution .zip File (Optional)</h3> |
| |
| <p>To create a distribution for installation you will need build tools installed. On Ubuntu this can be |
| done with <code>sudo apt-get install build-essential default-jdk</code>.</p> |
| <pre class="highlight plaintext"><code>git clone http://gitbox.apache.org/repos/asf/aurora.git |
| cd aurora |
| ./gradlew distZip |
| </code></pre> |
| |
| <p>Copy the generated <code>dist/distributions/aurora-scheduler-*.zip</code> to each node that will run a scheduler.</p> |
| |
| <h3 id="installing-aurora">Installing Aurora</h3> |
| |
| <p>Extract the aurora-scheduler zip file. The example configurations assume it is extracted to |
| <code>/usr/local/aurora-scheduler</code>.</p> |
| <pre class="highlight plaintext"><code>sudo unzip dist/distributions/aurora-scheduler-*.zip -d /usr/local |
| sudo ln -nfs "$(ls -dt /usr/local/aurora-scheduler-* | head -1)" /usr/local/aurora-scheduler |
| </code></pre> |
| |
| <h2 id="configuring-aurora">Configuring Aurora</h2> |
| |
| <h3 id="a-note-on-configuration">A Note on Configuration</h3> |
| |
| <p>Like Mesos, Aurora uses command-line flags for runtime configuration. As such the Aurora |
| “configuration file” is typically a <code>scheduler.sh</code> shell script of the form.</p> |
| <pre class="highlight shell"><code><span style="color: #999988;font-style: italic">#!/bin/bash</span> |
| <span style="color: #008080">AURORA_HOME</span><span style="color: #000000;font-weight: bold">=</span>/usr/local/aurora-scheduler |
| |
| <span style="color: #999988;font-style: italic"># Flags controlling the JVM.</span> |
| <span style="color: #008080">JAVA_OPTS</span><span style="color: #000000;font-weight: bold">=(</span> |
| -Xmx2g |
| -Xms2g |
| <span style="color: #999988;font-style: italic"># GC tuning, etc.</span> |
| <span style="color: #000000;font-weight: bold">)</span> |
| |
| <span style="color: #999988;font-style: italic"># Flags controlling the scheduler.</span> |
| <span style="color: #008080">AURORA_FLAGS</span><span style="color: #000000;font-weight: bold">=(</span> |
| -http_port<span style="color: #000000;font-weight: bold">=</span>8081 |
| <span style="color: #999988;font-style: italic"># Log configuration, etc.</span> |
| <span style="color: #000000;font-weight: bold">)</span> |
| |
| <span style="color: #999988;font-style: italic"># Environment variables controlling libmesos</span> |
| <span style="color: #0086B3">export </span><span style="color: #008080">JAVA_HOME</span><span style="color: #000000;font-weight: bold">=</span>... |
| <span style="color: #0086B3">export </span><span style="color: #008080">GLOG_v</span><span style="color: #000000;font-weight: bold">=</span>1 |
| <span style="color: #0086B3">export </span><span style="color: #008080">LIBPROCESS_PORT</span><span style="color: #000000;font-weight: bold">=</span>8083 |
| |
| <span style="color: #008080">JAVA_OPTS</span><span style="color: #000000;font-weight: bold">=</span><span style="color: #d14">"</span><span style="color: #000000;font-weight: bold">${</span><span style="color: #008080">JAVA_OPTS</span><span style="background-color: #f8f8f8">[*]</span><span style="color: #000000;font-weight: bold">}</span><span style="color: #d14">"</span> <span style="color: #0086B3">exec</span> <span style="color: #d14">"</span><span style="color: #008080">$AURORA_HOME</span><span style="color: #d14">/bin/aurora-scheduler"</span> <span style="color: #d14">"</span><span style="color: #000000;font-weight: bold">${</span><span style="color: #008080">AURORA_FLAGS</span><span style="background-color: #f8f8f8">[@]</span><span style="color: #000000;font-weight: bold">}</span><span style="color: #d14">"</span> |
| </code></pre> |
| |
| <p>That way Aurora’s current flags are visible in <code>ps</code> and in the <code>/vars</code> admin endpoint.</p> |
| |
| <p>Examples are available under <code>examples/scheduler/</code>. For a list of available Aurora flags and their |
| documentation run</p> |
| <pre class="highlight plaintext"><code>/usr/local/aurora-scheduler/bin/aurora-scheduler -help |
| </code></pre> |
| |
| <h3 id="replicated-log-configuration">Replicated Log Configuration</h3> |
| |
| <p>All Aurora state is persisted to a replicated log. This includes all jobs Aurora is running |
| including where in the cluster they are being run and the configuration for running them, as |
| well as other information such as metadata needed to reconnect to the Mesos master, resource |
| quotas, and any other locks in place.</p> |
| |
| <p>Aurora schedulers use ZooKeeper to discover log replicas and elect a leader. Only one scheduler is |
| leader at a given time - the other schedulers follow log writes and prepare to take over as leader |
| but do not communicate with the Mesos master. Either 3 or 5 schedulers are recommended in a |
| production deployment depending on failure tolerance and they must have persistent storage.</p> |
| |
| <p>In a cluster with <code>N</code> schedulers, the flag <code>-native_log_quorum_size</code> should be set to |
| <code>floor(N/2) + 1</code>. So in a cluster with 1 scheduler it should be set to <code>1</code>, in a cluster with 3 it |
| should be set to <code>2</code>, and in a cluster of 5 it should be set to <code>3</code>.</p> |
| |
| <table><thead> |
| <tr> |
| <th>Number of schedulers (N)</th> |
| <th><code>-native_log_quorum_size</code> setting (<code>floor(N/2) + 1</code>)</th> |
| </tr> |
| </thead><tbody> |
| <tr> |
| <td>1</td> |
| <td>1</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>2</td> |
| </tr> |
| <tr> |
| <td>5</td> |
| <td>3</td> |
| </tr> |
| <tr> |
| <td>7</td> |
| <td>4</td> |
| </tr> |
| </tbody></table> |
| |
| <p><em>Incorrectly setting this flag will cause data corruption to occur!</em></p> |
| |
| <p>See <a href="/documentation/0.8.0/storage-config/#scheduler-storage-configuration-flags">this document</a> for more replicated |
| log and storage configuration options.</p> |
| |
| <h2 id="initializing-the-replicated-log">Initializing the Replicated Log</h2> |
| |
| <p>Before you start Aurora you will also need to initialize the log on a majority of the masters.</p> |
| <pre class="highlight plaintext"><code>mesos-log initialize --path="/path/to/native/log" |
| </code></pre> |
| |
| <p>The <code>--path</code> flag should match the <code>--native_log_file_path</code> flag to the scheduler. |
| Failing to do this will result the following message when you try to start the scheduler.</p> |
| <pre class="highlight plaintext"><code>Replica in EMPTY status received a broadcasted recover request |
| </code></pre> |
| |
| <h3 id="storage-performance-considerations">Storage Performance Considerations</h3> |
| |
| <p>See <a href="/documentation/0.8.0/scheduler-storage/">this document</a> for scheduler storage performance considerations.</p> |
| |
| <h3 id="network-considerations">Network considerations</h3> |
| |
| <p>The Aurora scheduler listens on 2 ports - an HTTP port used for client RPCs and a web UI, |
| and a libprocess (HTTP+Protobuf) port used to communicate with the Mesos master and for the log |
| replication protocol. These can be left unconfigured (the scheduler publishes all selected ports |
| to ZooKeeper) or explicitly set in the startup script as follows:</p> |
| <pre class="highlight plaintext"><code># ... |
| AURORA_FLAGS=( |
| # ... |
| -http_port=8081 |
| # ... |
| ) |
| # ... |
| export LIBPROCESS_PORT=8083 |
| # ... |
| </code></pre> |
| |
| <h3 id="considerations-for-running-jobs-in-docker-containers">Considerations for running jobs in docker containers</h3> |
| |
| <p><em>Note: Docker support is currently EXPERIMENTAL.</em></p> |
| |
| <p>In order for Aurora to launch jobs using docker containers, a few extra configuration options |
| must be set. The <a href="http://mesos.apache.org/documentation/latest/docker-containerizer/">docker containerizer</a> |
| must be enabled on the mesos slaves by launching them with the <code>--containerizers=docker,mesos</code> option.</p> |
| |
| <p>By default, Aurora will configure Mesos to copy the file specified in <code>-thermos_executor_path</code> |
| into the container’s sandbox. If using a wrapper script to launch the thermos executor, |
| specify the path to the wrapper in that argument. In addition, the path to the executor pex itself |
| must be included in the <code>-thermos_executor_resources</code> option. Doing so will ensure that both the |
| wrapper script and executor are correctly copied into the sandbox. Finally, ensure the wrapper |
| script does not access resources outside of the sandbox, as when the script is run from within a |
| docker container those resources will not exist.</p> |
| |
| <p>A scheduler flag, <code>-global_container_mounts</code> allows mounting paths from the host (i.e., the slave) |
| into all containers on that host. The format is a comma seperated list of host<em>path:container</em>path[:mode] |
| tuples. For example <code>-global_container_mounts=/opt/secret_keys_dir:/mnt/secret_keys_dir:ro</code> mounts |
| <code>/opt/secret_keys_dir</code> from the slaves into all launched containers. Valid modes are <code>ro</code> and <code>rw</code>.</p> |
| |
| <p>In order to correctly execute processes inside a job, the docker container must have python 2.7 |
| installed.</p> |
| |
| <h2 id="running-aurora">Running Aurora</h2> |
| |
| <p>Configure a supervisor like <a href="http://mmonit.com/monit/">Monit</a> or |
| <a href="http://supervisord.org/">supervisord</a> to run the created <code>scheduler.sh</code> file and restart it |
| whenever it fails. Aurora expects to be restarted by an external process when it fails. Aurora |
| supports an active health checking protocol on its admin HTTP interface - if a <code>GET /health</code> times |
| out or returns anything other than <code>200 OK</code> the scheduler process is unhealthy and should be |
| restarted.</p> |
| |
| <p>For example, monit can be configured with</p> |
| <pre class="highlight plaintext"><code>if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart |
| </code></pre> |
| |
| <p>assuming you set <code>-http_port=8081</code>.</p> |
| |
| <h2 id="security-considerations">Security Considerations</h2> |
| |
| <p>See <a href="/documentation/0.8.0/security/">security.md</a>.</p> |
| |
| <h3 id="maintaining-an-aurora-installation">Maintaining an Aurora Installation</h3> |
| |
| <h3 id="monitoring">Monitoring</h3> |
| |
| <p>Please see our dedicated <a href="/documentation/0.8.0/monitoring/">monitoring guide</a> for in-depth discussion on monitoring.</p> |
| |
| <h3 id="running-stateful-services">Running stateful services</h3> |
| |
| <p>Aurora is best suited to run stateless applications, but it also accommodates for stateful services |
| like databases, or services that otherwise need to always run on the same machines.</p> |
| |
| <h4 id="dedicated-attribute">Dedicated attribute</h4> |
| |
| <p>The Mesos slave has the <code>--attributes</code> command line argument which can be used to mark a slave with |
| static attributes (not to be confused with <code>--resources</code>, which are dynamic and accounted).</p> |
| |
| <p>Aurora makes these attributes available for matching with scheduling |
| <a href="/documentation/0.8.0/configuration-reference/#specifying-scheduling-constraints">constraints</a>. Most of these |
| constraints are arbitrary and available for custom use. There is one exception, though: the |
| <code>dedicated</code> attribute. Aurora treats this specially, and only allows matching jobs to run on these |
| machines, and will only schedule matching jobs on these machines.</p> |
| |
| <h5 id="syntax">Syntax</h5> |
| |
| <p>The dedicated attribute has semantic meaning. The format is <code>$role(/.*)?</code>. When a job is created, |
| the scheduler requires that the <code>$role</code> component matches the <code>role</code> field in the job |
| configuration, and will reject the job creation otherwise. The remainder of the attribute is |
| free-form. We’ve developed the idiom of formatting this attribute as <code>$role/$job</code>, but do not |
| enforce this.</p> |
| |
| <h5 id="example">Example</h5> |
| |
| <p>Consider the following slave command line:</p> |
| <pre class="highlight plaintext"><code>mesos-slave --attributes="host:$HOST;rack:$RACK;dedicated:db_team/redis" ... |
| </code></pre> |
| |
| <p>And this job configuration:</p> |
| <pre class="highlight plaintext"><code>Service( |
| name = 'redis', |
| role = 'db_team', |
| constraints = { |
| 'dedicated': 'db_team/redis' |
| } |
| ... |
| ) |
| </code></pre> |
| |
| <p>The job configuration is indicating that it should only be scheduled on slaves with the attribute |
| <code>dedicated:db_team/redis</code>. Additionally, Aurora will prevent any tasks that do <em>not</em> have that |
| constraint from running on those slaves.</p> |
| |
| <h2 id="common-problems">Common problems</h2> |
| |
| <p>So you’ve started your first cluster and are running into some issues? We’ve collected some common |
| stumbling blocks and solutions here to help get you moving.</p> |
| |
| <h3 id="replicated-log-not-initialized">Replicated log not initialized</h3> |
| |
| <h4 id="symptoms">Symptoms</h4> |
| |
| <ul> |
| <li>Scheduler RPCs and web interface claim <code>Storage is not READY</code></li> |
| <li>Scheduler log repeatedly prints messages like</li> |
| </ul> |
| <pre class="highlight plaintext"><code> I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status |
| received a broadcasted recover request |
| I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response |
| from a replica in EMPTY status |
| </code></pre> |
| |
| <h4 id="solution">Solution</h4> |
| |
| <p>When you create a new cluster, you need to inform a quorum of schedulers that they are safe to |
| consider their database to be empty by <a href="#initializing-the-replicated-log">initializing</a> the |
| replicated log. This is done to prevent the scheduler from modifying the cluster state in the event |
| of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path.</p> |
| |
| <h3 id="scheduler-not-registered">Scheduler not registered</h3> |
| |
| <h4 id="symptoms">Symptoms</h4> |
| |
| <p>Scheduler log contains</p> |
| <pre class="highlight plaintext"><code>Framework has not been registered within the tolerated delay. |
| </code></pre> |
| |
| <h4 id="solution">Solution</h4> |
| |
| <p>Double-check that the scheduler is configured correctly to reach the master. If you are registering |
| the master in ZooKeeper, make sure command line argument to the master:</p> |
| <pre class="highlight plaintext"><code>--zk=zk://$ZK_HOST:2181/mesos/master |
| </code></pre> |
| |
| <p>is the same as the one on the scheduler:</p> |
| <pre class="highlight plaintext"><code>-mesos_master_address=zk://$ZK_HOST:2181/mesos/master |
| </code></pre> |
| |
| <h3 id="tasks-are-stuck-in-pending-forever">Tasks are stuck in <code>PENDING</code> forever</h3> |
| |
| <h4 id="symptoms">Symptoms</h4> |
| |
| <p>The scheduler is registered, and (receiving offers](docs/monitoring.md#scheduler<em>resource</em>offers), |
| but tasks are perpetually shown as <code>PENDING - Constraint not satisfied: host</code>.</p> |
| |
| <h4 id="solution">Solution</h4> |
| |
| <p>Check that your slaves are configured with <code>host</code> and <code>rack</code> attributes. Aurora requires that |
| slaves are tagged with these two common failure domains to ensure that it can safely place tasks |
| such that jobs are resilient to failure.</p> |
| |
| <p>See our <a href="examples/vagrant/upstart/mesos-slave.conf">vagrant example</a> for details.</p> |
| |
| </div> |
| |
| </div> |
| </div> |
| <div class="container-fluid section-footer buffer"> |
| <div class="container"> |
| <div class="row"> |
| <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> |
| <ul> |
| <li><a href="/downloads/">Downloads</a></li> |
| <li><a href="/community/">Mailing Lists</a></li> |
| <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> |
| <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> |
| </ul> |
| </div> |
| <div class="col-md-2"><h3>The ASF</h3> |
| <ul> |
| <li><a href="http://www.apache.org/licenses/">License</a></li> |
| <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> |
| <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> |
| <li><a href="http://www.apache.org/security/">Security</a></li> |
| </ul> |
| </div> |
| <div class="col-md-6"> |
| <p class="disclaimer">© 2014-2017 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> |
| </div> |
| </div> |
| </div> |
| |
| </body> |
| </html> |