
 <!DOCTYPE html>
 <html lang="en">
   <head>
     <meta charset="utf-8">
     <meta name="viewport" content="width=device-width, initial-scale=1">
 	<title>Apache Aurora</title>
     <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css">
     <link href="/assets/css/main.css" rel="stylesheet">
 	<!-- Analytics -->
 	<script type="text/javascript">
 		  var _gaq = _gaq || [];
 		  _gaq.push(['_setAccount', 'UA-45879646-1']);
 		  _gaq.push(['_setDomainName', 'apache.org']);
 		  _gaq.push(['_trackPageview']);

 		  (function() {
 		    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
 		    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
 		    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
 		  })();
 	</script>
   </head>
   <body>

         <div class="container-fluid section-header">
   <div class="container">
     <div class="nav nav-bar">
     <a href="/"><img src="/assets/img/aurora_logo_white_bkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a>
 	<ul class="nav navbar-nav navbar-right">
       <li><a href="/documentation/latest/">Documentation</a></li>
       <li><a href="/community/">Community</a></li>
       <li><a href="/downloads/">Downloads</a></li>
       <li><a href="/blog/">Blog</a></li>
     </ul>
     </div>
   </div>
 </div>
   	  <div class="container-fluid">
   	  	<div class="container content">
           <h2 id="aurora-sla-measurement">Aurora SLA Measurement</h2>

 <ul>
 <li><a href="#overview">Overview</a></li>
 <li><a href="#metric-details">Metric Details</a>

 <ul>
 <li><a href="#platform-uptime">Platform Uptime</a></li>
 <li><a href="#job-uptime">Job Uptime</a></li>
 <li><a href="#median-time-to-assigned-(mtta)">Median Time To Assigned (MTTA)</a></li>
 <li><a href="#median-time-to-running-(mttr)">Median Time To Running (MTTR)</a></li>
 </ul></li>
 <li><a href="#limitations">Limitations</a></li>
 </ul>

 <h2 id="overview">Overview</h2>

<p>The primary goal of this feature is the collection and monitoring of Aurora job SLA (Service Level
Agreement) metrics that define a contractual relationship between the Aurora/Mesos platform
and hosted services.</p>

 <p>The Aurora SLA feature currently supports stat collection only for service (non-cron)
 production jobs (<code>&quot;production = True&quot;</code> in your <code>.aurora</code> config).</p>

<p>Counters that track SLA measurements are computed periodically within the scheduler.
The individual instance metrics are refreshed every minute (configurable via
<code>sla_stat_refresh_interval</code>). The instance counters are then aggregated by
relevant grouping types before being exported to the scheduler <code>/vars</code> endpoint (when using
<code>vagrant</code>, that would be <code>http://192.168.33.7:8081/vars</code>).</p>
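
<p>As a rough illustration, the SLA counters can be picked out of the <code>/vars</code> output with a
few lines of Python. This sketch assumes the endpoint returns plain text with one
<code>name value</code> pair per line. The sample text and the helper name
<code>parse_sla_vars</code> are hypothetical.</p>

<pre><code>def parse_sla_vars(text):
    """Return a dict of SLA counters (names starting with 'sla_') from /vars-style text."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        # Assumed format: exactly "name value" on each line; skip anything else.
        if len(parts) != 2 or not parts[0].startswith("sla_"):
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue  # non-numeric value, so not a counter we can use
    return metrics


# Hypothetical sample; real output would come from the scheduler /vars endpoint.
sample = """\
sla_cluster_platform_uptime_percent 99.5
sla_cluster_mtta_ms 1200
jvm_uptime_secs 4242
"""
print(parse_sla_vars(sample))
</code></pre>

<p>Filtering on the <code>sla_</code> prefix keeps unrelated scheduler stats out of the result.</p>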

 <h2 id="metric-details">Metric Details</h2>

 <h3 id="platform-uptime">Platform Uptime</h3>

<p><em>Aggregate amount of time a job spends in a non-runnable state due to platform unavailability
or scheduling delays. This metric tracks Aurora/Mesos uptime performance and reflects any
system-caused downtime events (tasks LOST or DRAINED). User-initiated task kills and restarts
do not degrade this metric.</em></p>

 <p><strong>Collection scope:</strong></p>

 <ul>
 <li>Per job - <code>sla_&lt;job_key&gt;_platform_uptime_percent</code></li>
 <li>Per cluster - <code>sla_cluster_platform_uptime_percent</code></li>
 </ul>

 <p><strong>Units:</strong> percent</p>

<p>A fault in the task environment may cause Aurora and Mesos to have different views of the task state
or to lose track of the task's existence. In such cases, the service task is marked as LOST and
rescheduled by Aurora. For example, this may happen when the task stays in ASSIGNED or STARTING
for too long, or when the Mesos slave becomes unhealthy (or disappears completely). The time between
the task entering LOST and its replacement reaching the RUNNING state counts towards platform downtime.</p>

<p>Another example of a platform downtime event is administrator-requested task rescheduling. This
happens during planned Mesos slave maintenance, when all tasks on the slave are marked as DRAINED and
rescheduled elsewhere.</p>

<p>To accurately calculate Platform Uptime, we must separate platform-incurred downtime from user
actions that put a service instance in a non-operational state. It is simpler to isolate
user-incurred downtime and treat all other downtime as platform-incurred.</p>

<p>Currently, a user can cause downtime for a healthy service (task) in only two ways: via the
<code>killTasks</code> or <code>restartShards</code> RPCs. In both cases, the affected tasks leave an
audit trail of state transitions relevant to uptime calculations. By applying a special
&ldquo;SLA meaning&rdquo; to exposed task state transition records, we can build a deterministic
downtime trace for every service instance.</p>

 <p>A task going through a state transition carries one of three possible SLA meanings
 (see <a href="../src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java">SlaAlgorithm.java</a> for
 sla-to-task-state mapping):</p>

 <ul>
 <li><p>Task is UP: starts a period where the task is considered to be up and running from the Aurora
 platform standpoint.</p></li>
 <li><p>Task is DOWN: starts a period where the task cannot reach the UP state for some
 non-user-related reason. Counts towards instance downtime.</p></li>
<li><p>Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to
a user-initiated action or failure. We ignore this period for uptime calculation purposes.</p></li>
 </ul>

 <p>This metric is recalculated over the last sampling period (last minute) to account for
 any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the
 sampling interval as well as adjacent REMOVED events.</p>
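
<p>The three SLA meanings above can be turned into an uptime percentage roughly as follows. This is a
simplified sketch, not the logic in <code>SlaAlgorithm.java</code>: it assumes a time-ordered list of
<code>(timestamp, meaning)</code> events for a single instance, counts UP periods as uptime and DOWN
periods as downtime, and leaves REMOVED periods out of the total. The function name and the event
format are hypothetical.</p>

<pre><code>def platform_uptime_percent(events, window_end):
    """events: time-ordered (timestamp_secs, meaning) pairs; meaning is UP, DOWN or REMOVED."""
    up = 0
    down = 0
    # Each period ends where the next event starts (or at the end of the window).
    ends = [ts for ts, _ in events[1:]] + [window_end]
    for (start, meaning), end in zip(events, ends):
        duration = end - start
        if meaning == "UP":
            up += duration
        elif meaning == "DOWN":
            down += duration
        # REMOVED periods are ignored: they count as neither uptime nor downtime.
    total = up + down
    # With no counted time there is no downtime to report.
    return 100.0 if total == 0 else 100.0 * up / total


# 44 seconds UP, 6 seconds DOWN, 10 seconds REMOVED (ignored).
events = [(0, "UP"), (30, "DOWN"), (36, "UP"), (50, "REMOVED")]
print(platform_uptime_percent(events, 60))  # prints 88.0
</code></pre>

<p>Because the REMOVED period is excluded from the denominator, a user-initiated kill does not lower
the result, which matches the intent described above.</p>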

 <h3 id="job-uptime">Job Uptime</h3>

<p><em>Percentage of the job instances considered to be in RUNNING state for the specified duration
relative to request time. This is a purely application-side metric that considers the aggregate
uptime of all RUNNING instances. Any user- or platform-initiated restarts directly affect
this metric.</em></p>

<p><strong>Collection scope:</strong> We currently expose job uptime values at 5 pre-defined
percentiles (50th, 75th, 90th, 95th and 99th):</p>

 <ul>
 <li><code>sla_&lt;job_key&gt;_job_uptime_50_00_sec</code></li>
 <li><code>sla_&lt;job_key&gt;_job_uptime_75_00_sec</code></li>
 <li><code>sla_&lt;job_key&gt;_job_uptime_90_00_sec</code></li>
 <li><code>sla_&lt;job_key&gt;_job_uptime_95_00_sec</code></li>
 <li><code>sla_&lt;job_key&gt;_job_uptime_99_00_sec</code></li>
 </ul>

<p><strong>Units:</strong> seconds</p>

<p>You can also get customized real-time stats from the Aurora client. See <code>aurora sla -h</code> for
more details.</p>
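
<p>One way to read these counters is that <code>sla_&lt;job_key&gt;_job_uptime_90_00_sec</code> is the
uptime, in seconds, that 90 percent of the job's instances have reached or exceeded. The sketch below
computes such a value from a list of per-instance uptimes. This interpretation and the index rounding
are assumptions made for illustration, and the function name <code>job_uptime_at</code> is
hypothetical; the scheduler's own calculation may differ in detail.</p>

<pre><code>import math


def job_uptime_at(uptimes_secs, percentile):
    """Largest uptime that at least `percentile` percent of instances meet or exceed."""
    if not uptimes_secs:
        return 0
    ordered = sorted(uptimes_secs, reverse=True)  # longest-running instance first
    # Number of instances needed to cover the requested share of the job.
    needed = math.ceil(len(ordered) * percentile / 100.0)
    return ordered[max(needed, 1) - 1]


# Hypothetical uptimes, in seconds, for a job with ten instances.
uptimes = [600, 540, 480, 300, 120, 60, 45, 30, 20, 10]
print(job_uptime_at(uptimes, 50))  # prints 120
print(job_uptime_at(uptimes, 90))  # prints 20
</code></pre>

<p>In this sample a single recently restarted instance is enough to pull the 99th percentile value
down to 10 seconds, which is why restarts of any origin show up in this metric.</p>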

 <h3 id="median-time-to-assigned-(mtta)">Median Time To Assigned (MTTA)</h3>

 <p><em>Median time a job spends waiting for its tasks to be assigned to a host. This is a combined
 metric that helps track the dependency of scheduling performance on the requested resources
 (user scope) as well as the internal scheduler bin-packing algorithm efficiency (platform scope).</em></p>

 <p><strong>Collection scope:</strong></p>

 <ul>
 <li>Per job - <code>sla_&lt;job_key&gt;_mtta_ms</code></li>
 <li>Per cluster - <code>sla_cluster_mtta_ms</code></li>
<li>Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
 <a href="../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java">ResourceAggregates.java</a>

 <ul>
 <li>By CPU:</li>
 <li><code>sla_cpu_small_mtta_ms</code></li>
 <li><code>sla_cpu_medium_mtta_ms</code></li>
 <li><code>sla_cpu_large_mtta_ms</code></li>
 <li><code>sla_cpu_xlarge_mtta_ms</code></li>
 <li><code>sla_cpu_xxlarge_mtta_ms</code></li>
 <li>By RAM:</li>
 <li><code>sla_ram_small_mtta_ms</code></li>
 <li><code>sla_ram_medium_mtta_ms</code></li>
 <li><code>sla_ram_large_mtta_ms</code></li>
 <li><code>sla_ram_xlarge_mtta_ms</code></li>
 <li><code>sla_ram_xxlarge_mtta_ms</code></li>
 <li>By DISK:</li>
 <li><code>sla_disk_small_mtta_ms</code></li>
 <li><code>sla_disk_medium_mtta_ms</code></li>
 <li><code>sla_disk_large_mtta_ms</code></li>
 <li><code>sla_disk_xlarge_mtta_ms</code></li>
 <li><code>sla_disk_xxlarge_mtta_ms</code></li>
 </ul></li>
 </ul>

 <p><strong>Units:</strong> milliseconds</p>

 <p>MTTA only considers instances that have already reached ASSIGNED state and ignores those
 that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource
 constraints) do not affect metric curves.</p>
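
<p>The filtering rule above can be sketched in a few lines of Python. The record layout
(<code>pending_ms</code> and <code>assigned_ms</code> timestamps per instance) and the function name
<code>mtta_ms</code> are hypothetical and chosen only for illustration; instances without an
<code>assigned_ms</code> value are treated as still PENDING and left out of the median.</p>

<pre><code>from statistics import median


def mtta_ms(instances):
    """instances: dicts with 'pending_ms' and, once assigned, 'assigned_ms' timestamps."""
    waits = [
        i["assigned_ms"] - i["pending_ms"]
        for i in instances
        if i.get("assigned_ms") is not None  # still PENDING: not counted
    ]
    return median(waits) if waits else 0


instances = [
    {"pending_ms": 0, "assigned_ms": 800},
    {"pending_ms": 100, "assigned_ms": 1300},
    {"pending_ms": 200, "assigned_ms": 700},
    {"pending_ms": 300},  # straggler that is still PENDING
]
print(mtta_ms(instances))  # prints 800
</code></pre>

<p>The straggler contributes nothing to the result, so a single instance with unsatisfiable
constraints cannot skew the curve.</p>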

 <h3 id="median-time-to-running-(mttr)">Median Time To Running (MTTR)</h3>

<p><em>Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric
reflecting the overall time it takes for Aurora/Mesos to start executing user content.</em></p>

 <p><strong>Collection scope:</strong></p>

 <ul>
 <li>Per job - <code>sla_&lt;job_key&gt;_mttr_ms</code></li>
 <li>Per cluster - <code>sla_cluster_mttr_ms</code></li>
<li>Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
 <a href="../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java">ResourceAggregates.java</a>

 <ul>
 <li>By CPU:</li>
 <li><code>sla_cpu_small_mttr_ms</code></li>
 <li><code>sla_cpu_medium_mttr_ms</code></li>
 <li><code>sla_cpu_large_mttr_ms</code></li>
 <li><code>sla_cpu_xlarge_mttr_ms</code></li>
 <li><code>sla_cpu_xxlarge_mttr_ms</code></li>
 <li>By RAM:</li>
 <li><code>sla_ram_small_mttr_ms</code></li>
 <li><code>sla_ram_medium_mttr_ms</code></li>
 <li><code>sla_ram_large_mttr_ms</code></li>
 <li><code>sla_ram_xlarge_mttr_ms</code></li>
 <li><code>sla_ram_xxlarge_mttr_ms</code></li>
 <li>By DISK:</li>
 <li><code>sla_disk_small_mttr_ms</code></li>
 <li><code>sla_disk_medium_mttr_ms</code></li>
 <li><code>sla_disk_large_mttr_ms</code></li>
 <li><code>sla_disk_xlarge_mttr_ms</code></li>
 <li><code>sla_disk_xxlarge_mttr_ms</code></li>
 </ul></li>
 </ul>

 <p><strong>Units:</strong> milliseconds</p>

 <p>MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with
 unreasonable resource constraints) do not affect metric curves.</p>

 <h2 id="limitations">Limitations</h2>

 <ul>
 <li><p>The availability of Aurora SLA metrics is bound by the scheduler availability.</p></li>
 <li><p>All metrics are calculated at a pre-defined interval (currently set at 1 minute).
 Scheduler restarts may result in missed collections.</p></li>
 </ul>

   		</div>
   	  </div>

       	<div class="container-fluid section-footer buffer">
       <div class="container">
         <div class="row">
 		  <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3>
 		  <ul>
 		    <li><a href="/downloads/">Downloads</a></li>
             <li><a href="/community/">Mailing Lists</a></li>
 			<li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li>
 			<li><a href="/documentation/latest/contributing/">How To Contribute</a></li>
 		  </ul>
 	      </div>
 		  <div class="col-md-2"><h3>The ASF</h3>
           <ul>
             <li><a href="http://www.apache.org/licenses/">License</a></li>
             <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
             <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
             <li><a href="http://www.apache.org/security/">Security</a></li>
           </ul>
 		  </div>
 		  <div class="col-md-6">
 		    <p class="disclaimer">Apache Aurora is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</p>
 			<p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p>
         </div>
       </div>
     </div>
 	</body>
 </html>