| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
| <title>Apache Aurora</title> |
| <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> |
| <link href="/assets/css/main.css" rel="stylesheet"> |
| <!-- Analytics --> |
| <script type="text/javascript"> |
| var _gaq = _gaq || []; |
| _gaq.push(['_setAccount', 'UA-45879646-1']); |
| _gaq.push(['_setDomainName', 'apache.org']); |
| _gaq.push(['_trackPageview']); |
| |
| (function() { |
| var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; |
| ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; |
| var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); |
| })(); |
| </script> |
| </head> |
| <body> |
| |
| <div class="container-fluid section-header"> |
| <div class="container"> |
| <div class="nav nav-bar"> |
| <a href="/"><img src="/assets/img/aurora_logo_white_bkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> |
| <ul class="nav navbar-nav navbar-right"> |
| <li><a href="/documentation/latest/">Documentation</a></li> |
| <li><a href="/community/">Community</a></li> |
| <li><a href="/downloads/">Downloads</a></li> |
| <li><a href="/blog/">Blog</a></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="container-fluid"> |
| <div class="container content"> |
| <h2 id="aurora-sla-measurement">Aurora SLA Measurement</h2> |
| |
| <ul> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#metric-details">Metric Details</a> |
| |
| <ul> |
| <li><a href="#platform-uptime">Platform Uptime</a></li> |
| <li><a href="#job-uptime">Job Uptime</a></li> |
| <li><a href="#median-time-to-assigned-(mtta)">Median Time To Assigned (MTTA)</a></li> |
| <li><a href="#median-time-to-running-(mttr)">Median Time To Running (MTTR)</a></li> |
| </ul></li> |
| <li><a href="#limitations">Limitations</a></li> |
| </ul> |
| |
| <h2 id="overview">Overview</h2> |
| |
| <p>The primary goal of the feature is collection and monitoring of Aurora job SLA (Service Level |
| Agreements) metrics that defining a contractual relationship between the Aurora/Mesos platform |
| and hosted services.</p> |
| |
| <p>The Aurora SLA feature currently supports stat collection only for service (non-cron) |
| production jobs (<code>"production = True"</code> in your <code>.aurora</code> config).</p> |
| |
| <p>Counters that track SLA measurements are computed periodically within the scheduler. |
| The individual instance metrics are refreshed every minute (configurable via |
| <code>sla_stat_refresh_interval</code>). The instance counters are subsequently aggregated by |
| relevant grouping types before exporting to scheduler <code>/vars</code> endpoint (when using <code>vagrant</code> |
| that would be <code>http://192.168.33.7:8081/vars</code>)</p> |
| |
| <h2 id="metric-details">Metric Details</h2> |
| |
| <h3 id="platform-uptime">Platform Uptime</h3> |
| |
| <p><em>Aggregate amount of time a job spends in a non-runnable state due to platform unavailability |
| or scheduling delays. This metric tracks Aurora/Mesos uptime performance and reflects on any |
| system-caused downtime events (tasks LOST or DRAINED). Any user-initiated task kills/restarts |
| will not degrade this metric.</em></p> |
| |
| <p><strong>Collection scope:</strong></p> |
| |
| <ul> |
| <li>Per job - <code>sla_<job_key>_platform_uptime_percent</code></li> |
| <li>Per cluster - <code>sla_cluster_platform_uptime_percent</code></li> |
| </ul> |
| |
| <p><strong>Units:</strong> percent</p> |
| |
| <p>A fault in the task environment may cause the Aurora/Mesos to have different views on the task state |
| or lose track of the task existence. In such cases, the service task is marked as LOST and |
| rescheduled by Aurora. For example, this may happen when the task stays in ASSIGNED or STARTING |
| for too long or the Mesos slave becomes unhealthy (or disappears completely). The time between |
| task entering LOST and its replacement reaching RUNNING state is counted towards platform downtime.</p> |
| |
| <p>Another example of a platform downtime event is the administrator-requested task rescheduling. This |
| happens during planned Mesos slave maintenance when all slave tasks are marked as DRAINED and |
| rescheduled elsewhere.</p> |
| |
| <p>To accurately calculate Platform Uptime, we must separate platform incurred downtime from user |
| actions that put a service instance in a non-operational state. It is simpler to isolate |
| user-incurred downtime and treat all other downtime as platform incurred.</p> |
| |
| <p>Currently, a user can cause a healthy service (task) downtime in only two ways: via <code>killTasks</code> |
| or <code>restartShards</code> RPCs. For both, their affected tasks leave an audit state transition trail |
| relevant to uptime calculations. By applying a special “SLA meaning” to exposed task state |
| transition records, we can build a deterministic downtime trace for every given service instance.</p> |
| |
| <p>A task going through a state transition carries one of three possible SLA meanings |
| (see <a href="../src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java">SlaAlgorithm.java</a> for |
| sla-to-task-state mapping):</p> |
| |
| <ul> |
| <li><p>Task is UP: starts a period where the task is considered to be up and running from the Aurora |
| platform standpoint.</p></li> |
| <li><p>Task is DOWN: starts a period where the task cannot reach the UP state for some |
| non-user-related reason. Counts towards instance downtime.</p></li> |
| <li><p>Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to |
| user initiated action or failure. We ignore this period for the uptime calculation purposes.</p></li> |
| </ul> |
| |
| <p>This metric is recalculated over the last sampling period (last minute) to account for |
| any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the |
| sampling interval as well as adjacent REMOVED events.</p> |
| |
| <h3 id="job-uptime">Job Uptime</h3> |
| |
| <p><em>Percentage of the job instances considered to be in RUNNING state for the specified duration |
| relative to request time. This is a purely application side metric that is considering aggregate |
| uptime of all RUNNING instances. Any user- or platform initiated restarts directly affect |
| this metric.</em></p> |
| |
| <p><strong>Collection scope:</strong> We currently expose job uptime values at 5 pre-defined |
| percentiles (50th,75th,90th,95th and 99th):</p> |
| |
| <ul> |
| <li><code>sla_<job_key>_job_uptime_50_00_sec</code></li> |
| <li><code>sla_<job_key>_job_uptime_75_00_sec</code></li> |
| <li><code>sla_<job_key>_job_uptime_90_00_sec</code></li> |
| <li><code>sla_<job_key>_job_uptime_95_00_sec</code></li> |
| <li><code>sla_<job_key>_job_uptime_99_00_sec</code></li> |
| </ul> |
| |
| <p><strong>Units:</strong> seconds |
| You can also get customized real-time stats from aurora client. See <code>aurora sla -h</code> for |
| more details.</p> |
| |
| <h3 id="median-time-to-assigned-(mtta)">Median Time To Assigned (MTTA)</h3> |
| |
| <p><em>Median time a job spends waiting for its tasks to be assigned to a host. This is a combined |
| metric that helps track the dependency of scheduling performance on the requested resources |
| (user scope) as well as the internal scheduler bin-packing algorithm efficiency (platform scope).</em></p> |
| |
| <p><strong>Collection scope:</strong></p> |
| |
| <ul> |
| <li>Per job - <code>sla_<job_key>_mtta_ms</code></li> |
| <li>Per cluster - <code>sla_cluster_mtta_ms</code></li> |
| <li>Per instance size (small, medium, large, x-large, xx-large). Size are defined in: |
| <a href="../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java">ResourceAggregates.java</a> |
| |
| <ul> |
| <li>By CPU:</li> |
| <li><code>sla_cpu_small_mtta_ms</code></li> |
| <li><code>sla_cpu_medium_mtta_ms</code></li> |
| <li><code>sla_cpu_large_mtta_ms</code></li> |
| <li><code>sla_cpu_xlarge_mtta_ms</code></li> |
| <li><code>sla_cpu_xxlarge_mtta_ms</code></li> |
| <li>By RAM:</li> |
| <li><code>sla_ram_small_mtta_ms</code></li> |
| <li><code>sla_ram_medium_mtta_ms</code></li> |
| <li><code>sla_ram_large_mtta_ms</code></li> |
| <li><code>sla_ram_xlarge_mtta_ms</code></li> |
| <li><code>sla_ram_xxlarge_mtta_ms</code></li> |
| <li>By DISK:</li> |
| <li><code>sla_disk_small_mtta_ms</code></li> |
| <li><code>sla_disk_medium_mtta_ms</code></li> |
| <li><code>sla_disk_large_mtta_ms</code></li> |
| <li><code>sla_disk_xlarge_mtta_ms</code></li> |
| <li><code>sla_disk_xxlarge_mtta_ms</code></li> |
| </ul></li> |
| </ul> |
| |
| <p><strong>Units:</strong> milliseconds</p> |
| |
| <p>MTTA only considers instances that have already reached ASSIGNED state and ignores those |
| that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource |
| constraints) do not affect metric curves.</p> |
| |
| <h3 id="median-time-to-running-(mttr)">Median Time To Running (MTTR)</h3> |
| |
| <p><em>Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric |
| reflecting on the overall time it takes for the Aurora/Mesos to start executing user content.</em></p> |
| |
| <p><strong>Collection scope:</strong></p> |
| |
| <ul> |
| <li>Per job - <code>sla_<job_key>_mttr_ms</code></li> |
| <li>Per cluster - <code>sla_cluster_mttr_ms</code></li> |
| <li>Per instance size (small, medium, large, x-large, xx-large). Size are defined in: |
| <a href="../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java">ResourceAggregates.java</a> |
| |
| <ul> |
| <li>By CPU:</li> |
| <li><code>sla_cpu_small_mttr_ms</code></li> |
| <li><code>sla_cpu_medium_mttr_ms</code></li> |
| <li><code>sla_cpu_large_mttr_ms</code></li> |
| <li><code>sla_cpu_xlarge_mttr_ms</code></li> |
| <li><code>sla_cpu_xxlarge_mttr_ms</code></li> |
| <li>By RAM:</li> |
| <li><code>sla_ram_small_mttr_ms</code></li> |
| <li><code>sla_ram_medium_mttr_ms</code></li> |
| <li><code>sla_ram_large_mttr_ms</code></li> |
| <li><code>sla_ram_xlarge_mttr_ms</code></li> |
| <li><code>sla_ram_xxlarge_mttr_ms</code></li> |
| <li>By DISK:</li> |
| <li><code>sla_disk_small_mttr_ms</code></li> |
| <li><code>sla_disk_medium_mttr_ms</code></li> |
| <li><code>sla_disk_large_mttr_ms</code></li> |
| <li><code>sla_disk_xlarge_mttr_ms</code></li> |
| <li><code>sla_disk_xxlarge_mttr_ms</code></li> |
| </ul></li> |
| </ul> |
| |
| <p><strong>Units:</strong> milliseconds</p> |
| |
| <p>MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with |
| unreasonable resource constraints) do not affect metric curves.</p> |
| |
| <h2 id="limitations">Limitations</h2> |
| |
| <ul> |
| <li><p>The availability of Aurora SLA metrics is bound by the scheduler availability.</p></li> |
| <li><p>All metrics are calculated at a pre-defined interval (currently set at 1 minute). |
| Scheduler restarts may result in missed collections.</p></li> |
| </ul> |
| |
| </div> |
| </div> |
| |
| <div class="container-fluid section-footer buffer"> |
| <div class="container"> |
| <div class="row"> |
| <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> |
| <ul> |
| <li><a href="/downloads/">Downloads</a></li> |
| <li><a href="/community/">Mailing Lists</a></li> |
| <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> |
| <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> |
| </ul> |
| </div> |
| <div class="col-md-2"><h3>The ASF</h3> |
| <ul> |
| <li><a href="http://www.apache.org/licenses/">License</a></li> |
| <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> |
| <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> |
| <li><a href="http://www.apache.org/security/">Security</a></li> |
| </ul> |
| </div> |
| <div class="col-md-6"> |
| <p class="disclaimer">Apache Aurora is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</p> |
| <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> |
| </div> |
| </div> |
| </div> |
| </body> |
| </html> |