| --- |
| layout: documentation |
| --- |
| |
| |
| # Mesos Observability Metrics |
| |
| This document describes the observability metrics provided by Mesos master and |
| slave nodes. It also provides initial guidance on which metrics you should |
| monitor to detect abnormal situations in your cluster. |
| |
| |
| ## Overview |
| |
| Mesos master and slave nodes report a set of statistics and metrics that enable |
| you to monitor resource usage and detect abnormal situations early. The |
| information reported by Mesos includes details about available resources, used |
| resources, registered frameworks, active slaves, and task state. You can use |
| this information to create automated alerts and to plot different metrics over |
| time inside a monitoring dashboard. |
| |
| |
| ## Metric Types |
| |
| Mesos provides two different kinds of metrics: counters and gauges. |
| |
| **Counters** keep track of discrete events and are monotonically increasing. The |
| value of a metric of this type is always a natural number. Examples include the |
| number of failed tasks and the number of slave registrations. For some metrics |
| of this type, the rate of change is often more useful than the value itself. |
| |
| **Gauges** represent an instantaneous sample of some magnitude. Examples include |
| the amount of used memory in the cluster and the number of connected slaves. For |
| some metrics of this type, it is often useful to determine whether the value is |
| above or below a threshold for a sustained period of time. |
| |
| The tables in this document indicate the type of each available metric. |
| |
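As an illustration of how the two types are typically consumed, the following Python sketch derives a per-second rate from two counter samples and checks whether a gauge stayed above a threshold across a window of samples. The function names, the sample values, and the window length are assumptions made for this example; they are not part of Mesos.

```python
from typing import Sequence


def counter_rate(prev_value: float, curr_value: float, interval_secs: float) -> float:
    """Per-second rate of change between two samples of a counter."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    delta = curr_value - prev_value
    # A counter should never decrease. If it does, this sketch assumes the
    # process restarted and the counter started again from zero, so the
    # current value is taken as the increase.
    if delta < 0:
        delta = curr_value
    return delta / interval_secs


def sustained_above(samples: Sequence[float], threshold: float) -> bool:
    """True if the window is non-empty and every gauge sample exceeds the threshold."""
    return len(samples) > 0 and all(value > threshold for value in samples)


# Hypothetical samples: a counter read twice, 60 seconds apart,
# and three consecutive readings of a gauge.
print(counter_rate(120, 150, 60))                # 0.5
print(sustained_above([0.93, 0.95, 0.97], 0.9))  # True
```

Requiring every sample in the window to exceed the threshold is one simple reading of "sustained"; a monitoring system may instead use averages or a minimum duration.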
| |
| ## Master Nodes |
| |
| Metrics from the master node are available at the following URL: |
| |
| http://<mesos-master-ip>:5050/metrics/snapshot |
| |
| The response is a JSON object that contains metric names and values as |
| key-value pairs. |
| |
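A minimal sketch of reading this endpoint from Python, using only the standard library, might look as follows. The helper names (`snapshot_url`, `parse_snapshot`, `fetch_snapshot`), the timeout, and the sample response body are assumptions for illustration; only the URL path and port come from this document.

```python
import json
from urllib.request import urlopen


def snapshot_url(host: str, port: int = 5050) -> str:
    """Build the metrics snapshot URL for a master (port 5050 as documented above)."""
    return f"http://{host}:{port}/metrics/snapshot"


def parse_snapshot(body: str) -> dict:
    """Decode a snapshot response body into a dict of metric name to value."""
    metrics = json.loads(body)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object of metric names to values")
    return metrics


def fetch_snapshot(host: str, port: int = 5050, timeout_secs: float = 5.0) -> dict:
    """Fetch and decode the snapshot from a running node."""
    with urlopen(snapshot_url(host, port), timeout=timeout_secs) as response:
        return parse_snapshot(response.read().decode("utf-8"))


# Hypothetical response body; a real snapshot contains many more keys.
sample_body = '{"master/elected": 1.0, "master/uptime_secs": 86400.5}'
sample = parse_snapshot(sample_body)
print(sample["master/elected"])  # 1.0
```

Calling `fetch_snapshot("<mesos-master-ip>")` against a live master would return the full set of metrics listed in the tables below.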
| ### Observability Metrics |
| |
| This section lists all available metrics from Mesos master nodes grouped by |
| category. |
| |
| #### Resources |
| |
| The following metrics provide information about the total resources available in |
| the cluster and their current usage. High resource usage for sustained periods |
| of time may indicate that you need to add capacity to your cluster or that a |
| framework is misbehaving. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>master/cpus_percent</code> |
| </td> |
| <td>Percentage of allocated CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/cpus_used</code> |
| </td> |
| <td>Number of allocated CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/cpus_total</code> |
| </td> |
| <td>Number of CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/cpus_revocable_percent</code> |
| </td> |
| <td>Percentage of allocated revocable CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/cpus_revocable_total</code> |
| </td> |
| <td>Number of revocable CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/cpus_revocable_used</code> |
| </td> |
| <td>Number of allocated revocable CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/disk_percent</code> |
| </td> |
| <td>Percentage of allocated disk space</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/disk_used</code> |
| </td> |
| <td>Allocated disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/disk_total</code> |
| </td> |
| <td>Disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/disk_revocable_percent</code> |
| </td> |
| <td>Percentage of allocated revocable disk space</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/disk_revocable_total</code> |
| </td> |
| <td>Revocable disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/disk_revocable_used</code> |
| </td> |
| <td>Allocated revocable disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/mem_percent</code> |
| </td> |
| <td>Percentage of allocated memory</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/mem_used</code> |
| </td> |
| <td>Allocated memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/mem_total</code> |
| </td> |
| <td>Memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/mem_revocable_percent</code> |
| </td> |
| <td>Percentage of allocated revocable memory</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/mem_revocable_total</code> |
| </td> |
| <td>Revocable memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/mem_revocable_used</code> |
| </td> |
| <td>Allocated revocable memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
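The used and total gauges above can also be combined directly, for example to compute the allocated share of a resource or the remaining headroom. The sketch below assumes a snapshot that has already been decoded into a Python dict; the function names and sample values are hypothetical. It works from the `_used` and `_total` gauges, so it does not depend on how the `_percent` gauges are scaled.

```python
from typing import Dict


def allocation_fraction(snapshot: Dict[str, float], resource: str) -> float:
    """Allocated share of a cluster resource (0.0 to 1.0), from used and total."""
    used = snapshot[f"master/{resource}_used"]
    total = snapshot[f"master/{resource}_total"]
    # Guard against an empty cluster, where the total is zero.
    return used / total if total > 0 else 0.0


def headroom(snapshot: Dict[str, float], resource: str) -> float:
    """Unallocated amount of a resource, in the unit of the underlying gauges."""
    return snapshot[f"master/{resource}_total"] - snapshot[f"master/{resource}_used"]


# Hypothetical snapshot values: 24 CPUs and 64 GB of memory (in MB).
sample = {
    "master/cpus_used": 18.0,
    "master/cpus_total": 24.0,
    "master/mem_used": 49152.0,
    "master/mem_total": 65536.0,
}
print(allocation_fraction(sample, "cpus"))  # 0.75
print(headroom(sample, "mem"))              # 16384.0
```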
| #### Master |
| |
| The following metrics provide information about whether a master is currently |
| elected and how long it has been running. A cluster with no elected master |
| for sustained periods of time indicates a malfunctioning cluster. This |
| points to either leader election issues (check the connection to ZooKeeper) |
| or a flapping master process. A low uptime value indicates that the master |
| has restarted recently. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>master/elected</code> |
| </td> |
| <td>Whether this is the elected master</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/uptime_secs</code> |
| </td> |
| <td>Uptime in seconds</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### System |
| |
| The following metrics provide information about the resources available on this |
| master node and their current usage. High resource usage in a master node for |
| sustained periods of time may degrade the performance of the cluster. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>system/cpus_total</code> |
| </td> |
| <td>Number of CPUs available in this master node</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/load_15min</code> |
| </td> |
| <td>Load average for the past 15 minutes</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/load_5min</code> |
| </td> |
| <td>Load average for the past 5 minutes</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/load_1min</code> |
| </td> |
| <td>Load average for the past minute</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/mem_free_bytes</code> |
| </td> |
| <td>Free memory in bytes</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/mem_total_bytes</code> |
| </td> |
| <td>Total memory in bytes</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Slaves |
| |
| The following metrics provide information about slave events, slave counts, and |
| slave states. A low number of active slaves may indicate that slaves are |
| unhealthy or that they are not able to connect to the elected master. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>master/slave_registrations</code> |
| </td> |
| <td>Number of slaves that were able to cleanly re-join the cluster and |
| connect back to the master after the master is disconnected.</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_removals</code> |
| </td> |
| <td>Number of slaves removed for various reasons, including maintenance</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_reregistrations</code> |
| </td> |
| <td>Number of slave re-registrations</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_shutdowns_scheduled</code> |
| </td> |
| <td>Number of slaves that have failed their health check and are scheduled |
| to be removed. They are not removed immediately because of the slave |
| removal rate limit, but <code>master/slave_shutdowns_completed</code> |
| starts increasing as they are removed.</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_shutdowns_canceled</code> |
| </td> |
| <td>Number of canceled slave shutdowns. This happens when the slave removal |
| rate limit gives a slave enough time to reconnect and send a <code>PONG</code> |
| to the master before being removed.</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_shutdowns_completed</code> |
| </td> |
| <td>Number of slaves that failed their health check. These are slaves that |
| were not heard from despite the slave removal rate limit and have been |
| removed from the master's slave registry.</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slaves_active</code> |
| </td> |
| <td>Number of active slaves</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slaves_connected</code> |
| </td> |
| <td>Number of connected slaves</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slaves_disconnected</code> |
| </td> |
| <td>Number of disconnected slaves</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slaves_inactive</code> |
| </td> |
| <td>Number of inactive slaves</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Frameworks |
| |
| The following metrics provide information about the registered frameworks in the |
| cluster. No active or connected frameworks may indicate that a scheduler is not |
| registered or that it is misbehaving. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>master/frameworks_active</code> |
| </td> |
| <td>Number of active frameworks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/frameworks_connected</code> |
| </td> |
| <td>Number of connected frameworks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/frameworks_disconnected</code> |
| </td> |
| <td>Number of disconnected frameworks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/frameworks_inactive</code> |
| </td> |
| <td>Number of inactive frameworks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/outstanding_offers</code> |
| </td> |
| <td>Number of outstanding resource offers</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Tasks |
| |
| The following metrics provide information about active and terminated tasks. A |
| high rate of lost tasks may indicate that there is a problem with the cluster. |
| The task states listed here match those of the task state machine. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>master/tasks_error</code> |
| </td> |
| <td>Number of tasks that were invalid</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/tasks_failed</code> |
| </td> |
| <td>Number of failed tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/tasks_finished</code> |
| </td> |
| <td>Number of finished tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/tasks_killed</code> |
| </td> |
| <td>Number of killed tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/tasks_lost</code> |
| </td> |
| <td>Number of lost tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/tasks_running</code> |
| </td> |
| <td>Number of running tasks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/tasks_staging</code> |
| </td> |
| <td>Number of staging tasks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/tasks_starting</code> |
| </td> |
| <td>Number of starting tasks</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Messages |
| |
| The following metrics provide information about messages between the master and |
| the slaves and between frameworks and executors. A high rate of dropped |
| messages may indicate that there is a problem with the network. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>master/invalid_executor_to_framework_messages</code> |
| </td> |
| <td>Number of invalid executor to framework messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/invalid_framework_to_executor_messages</code> |
| </td> |
| <td>Number of invalid framework to executor messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/invalid_status_update_acknowledgements</code> |
| </td> |
| <td>Number of invalid status update acknowledgements</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/invalid_status_updates</code> |
| </td> |
| <td>Number of invalid status updates</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/dropped_messages</code> |
| </td> |
| <td>Number of dropped messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_authenticate</code> |
| </td> |
| <td>Number of authentication messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_deactivate_framework</code> |
| </td> |
| <td>Number of framework deactivation messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_decline_offers</code> |
| </td> |
| <td>Number of offers declined</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_executor_to_framework</code> |
| </td> |
| <td>Number of executor to framework messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_exited_executor</code> |
| </td> |
| <td>Number of terminated executor messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_framework_to_executor</code> |
| </td> |
| <td>Number of messages from a framework to an executor</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_kill_task</code> |
| </td> |
| <td>Number of kill task messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_launch_tasks</code> |
| </td> |
| <td>Number of launch task messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_reconcile_tasks</code> |
| </td> |
| <td>Number of reconcile task messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_register_framework</code> |
| </td> |
| <td>Number of framework registration messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_register_slave</code> |
| </td> |
| <td>Number of slave registration messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_reregister_framework</code> |
| </td> |
| <td>Number of framework re-registration messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_reregister_slave</code> |
| </td> |
| <td>Number of slave re-registration messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_resource_request</code> |
| </td> |
| <td>Number of resource request messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_revive_offers</code> |
| </td> |
| <td>Number of offer revival messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_status_update</code> |
| </td> |
| <td>Number of status update messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_status_update_acknowledgement</code> |
| </td> |
| <td>Number of status update acknowledgement messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_unregister_framework</code> |
| </td> |
| <td>Number of framework unregistration messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_unregister_slave</code> |
| </td> |
| <td>Number of slave unregistration messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/messages_update_slave</code> |
| </td> |
| <td>Number of update slave messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/recovery_slave_removals</code> |
| </td> |
| <td>Number of slaves not re-registered during master failover</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_removals/reason_registered</code> |
| </td> |
| <td>Number of slaves removed when new slaves registered at the same address</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_removals/reason_unhealthy</code> |
| </td> |
| <td>Number of slaves removed due to failed health checks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/slave_removals/reason_unregistered</code> |
| </td> |
| <td>Number of slaves unregistered</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/valid_framework_to_executor_messages</code> |
| </td> |
| <td>Number of valid framework to executor messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/valid_status_update_acknowledgements</code> |
| </td> |
| <td>Number of valid status update acknowledgement messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/valid_status_updates</code> |
| </td> |
| <td>Number of valid status update messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/task_lost/source_master/reason_invalid_offers</code> |
| </td> |
| <td>Number of tasks lost due to invalid offers</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/task_lost/source_master/reason_slave_removed</code> |
| </td> |
| <td>Number of tasks lost due to slave removal</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/task_lost/source_slave/reason_executor_terminated</code> |
| </td> |
| <td>Number of tasks lost due to executor termination</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/valid_executor_to_framework_messages</code> |
| </td> |
| <td>Number of valid executor to framework messages</td> |
| <td>Counter</td> |
| </tr> |
| </table> |
| |
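The paired `valid_*` and `invalid_*` counters in the table above can be combined into a single ratio that is easier to alert on than two raw totals. The following sketch assumes a snapshot already decoded into a dict; the function name and sample values are hypothetical. Because counters accumulate from process start, the same calculation applied to the difference between two snapshots would give a more recent picture.

```python
from typing import Dict


def invalid_ratio(snapshot: Dict[str, float], kind: str) -> float:
    """Share of invalid messages of one kind, out of all valid plus invalid ones."""
    valid = snapshot.get(f"master/valid_{kind}", 0)
    invalid = snapshot.get(f"master/invalid_{kind}", 0)
    total = valid + invalid
    # Avoid dividing by zero when no message of this kind has been seen yet.
    return invalid / total if total > 0 else 0.0


# Hypothetical counter values.
sample = {
    "master/valid_status_updates": 75,
    "master/invalid_status_updates": 25,
}
print(invalid_ratio(sample, "status_updates"))  # 0.25
```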
| #### Event queue |
| |
| The following metrics provide information about different types of events in the |
| event queue. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>master/event_queue_dispatches</code> |
| </td> |
| <td>Number of dispatches in the event queue</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/event_queue_http_requests</code> |
| </td> |
| <td>Number of HTTP requests in the event queue</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>master/event_queue_messages</code> |
| </td> |
| <td>Number of messages in the event queue</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Registrar |
| |
| The following metrics provide information about read and write latency to the |
| slave registrar. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>registrar/state_fetch_ms</code> |
| </td> |
| <td>Registry read latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms</code> |
| </td> |
| <td>Registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/max</code> |
| </td> |
| <td>Maximum registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/min</code> |
| </td> |
| <td>Minimum registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/p50</code> |
| </td> |
| <td>Median registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/p90</code> |
| </td> |
| <td>90th percentile registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/p95</code> |
| </td> |
| <td>95th percentile registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/p99</code> |
| </td> |
| <td>99th percentile registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/p999</code> |
| </td> |
| <td>99.9th percentile registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>registrar/state_store_ms/p9999</code> |
| </td> |
| <td>99.99th percentile registry write latency in ms</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| |
| ### Basic Alerts |
| |
| This section lists some examples of basic alerts that you can use to detect |
| abnormal situations in a cluster. |
| |
| #### master/uptime_secs is low |
| |
| The master has restarted. |
| |
| #### master/uptime_secs < 60 for sustained periods of time |
| |
| The cluster has a flapping master node. |
| |
| #### master/tasks_lost is increasing rapidly |
| |
| Tasks in the cluster are disappearing. Possible causes include hardware |
| failures, bugs in one of the frameworks, or bugs in Mesos. |
| |
| #### master/slaves_active is low |
| |
| Slaves are having trouble connecting to the master. |
| |
| #### master/cpus_percent > 0.9 for sustained periods of time |
| |
| Cluster CPU utilization is close to capacity. |
| |
| #### master/mem_percent > 0.9 for sustained periods of time |
| |
| Cluster memory utilization is close to capacity. |
| |
| #### master/elected is 0 for sustained periods of time |
| |
| No master is currently elected. |
| |
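The alerts above can be expressed as simple checks over a window of consecutive snapshots from one master. The sketch below is one possible reading, not a recommended configuration: the function name, the alert labels, the meaning of "sustained" (every snapshot in the window), and the `min_active_slaves` threshold are all assumptions. The `master/tasks_lost` alert is left out because it needs a rate threshold that depends on the cluster.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, float]


def evaluate_alerts(window: List[Snapshot], min_active_slaves: int = 1) -> List[str]:
    """Return the alerts that fire for a window of consecutive master snapshots."""
    if not window:
        return []

    def sustained(predicate: Callable[[Snapshot], bool]) -> bool:
        # "Sustained" here means the condition holds in every snapshot of the window.
        return all(predicate(snapshot) for snapshot in window)

    alerts = []
    if sustained(lambda s: s["master/uptime_secs"] < 60):
        alerts.append("flapping master")
    if sustained(lambda s: s["master/cpus_percent"] > 0.9):
        alerts.append("cluster CPU close to capacity")
    if sustained(lambda s: s["master/mem_percent"] > 0.9):
        alerts.append("cluster memory close to capacity")
    if sustained(lambda s: s["master/elected"] == 0):
        alerts.append("no elected master")
    # The threshold for "low" is cluster-specific; the default is a placeholder.
    if window[-1]["master/slaves_active"] < min_active_slaves:
        alerts.append("few active slaves")
    return alerts


# Hypothetical window of three snapshots taken one minute apart.
window = [
    {"master/uptime_secs": 3600 + 60 * i, "master/cpus_percent": 0.95,
     "master/mem_percent": 0.5, "master/elected": 1, "master/slaves_active": 10}
    for i in range(3)
]
print(evaluate_alerts(window))  # ['cluster CPU close to capacity']
```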
| |
| |
| |
| ## Slave Nodes |
| |
| Metrics from each slave node are available at the following URL: |
| |
| http://<mesos-slave>:5051/metrics/snapshot |
| |
| The response is a JSON object that contains metric names and values as |
| key-value pairs. |
| |
| |
| ### Observability Metrics |
| |
| This section lists all available metrics from Mesos slave nodes grouped by |
| category. |
| |
| #### Resources |
| |
| The following metrics provide information about the total resources available in |
| the slave and their current usage. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>slave/cpus_percent</code> |
| </td> |
| <td>Percentage of allocated CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/cpus_used</code> |
| </td> |
| <td>Number of allocated CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/cpus_total</code> |
| </td> |
| <td>Number of CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/cpus_revocable_percent</code> |
| </td> |
| <td>Percentage of allocated revocable CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/cpus_revocable_total</code> |
| </td> |
| <td>Number of revocable CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/cpus_revocable_used</code> |
| </td> |
| <td>Number of allocated revocable CPUs</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/disk_percent</code> |
| </td> |
| <td>Percentage of allocated disk space</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/disk_used</code> |
| </td> |
| <td>Allocated disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/disk_total</code> |
| </td> |
| <td>Disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/disk_revocable_percent</code> |
| </td> |
| <td>Percentage of allocated revocable disk space</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/disk_revocable_total</code> |
| </td> |
| <td>Revocable disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/disk_revocable_used</code> |
| </td> |
| <td>Allocated revocable disk space in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/mem_percent</code> |
| </td> |
| <td>Percentage of allocated memory</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/mem_used</code> |
| </td> |
| <td>Allocated memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/mem_total</code> |
| </td> |
| <td>Memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/mem_revocable_percent</code> |
| </td> |
| <td>Percentage of allocated revocable memory</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/mem_revocable_total</code> |
| </td> |
| <td>Revocable memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/mem_revocable_used</code> |
| </td> |
| <td>Allocated revocable memory in MB</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Slave |
| |
| The following metrics provide information about whether a slave is currently |
| registered with a master and how long it has been running. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>slave/registered</code> |
| </td> |
| <td>Whether this slave is registered with a master</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/uptime_secs</code> |
| </td> |
| <td>Uptime in seconds</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
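Because each slave exposes its own snapshot, a monitoring script usually has to look at several of them together. The sketch below takes already-decoded snapshots keyed by host name and lists the slaves that do not report themselves as registered. It assumes that `slave/registered` is 1 for a registered slave and 0 otherwise; the function name and host names are hypothetical.

```python
from typing import Dict, List


def unregistered_slaves(snapshots_by_host: Dict[str, Dict[str, float]]) -> List[str]:
    """Hosts whose snapshot does not report slave/registered as 1."""
    return sorted(
        host
        for host, snapshot in snapshots_by_host.items()
        # A missing key is treated the same as "not registered".
        if snapshot.get("slave/registered", 0) != 1
    )


# Hypothetical snapshots from three slaves.
snapshots = {
    "slave-a": {"slave/registered": 1, "slave/uptime_secs": 5000.0},
    "slave-b": {"slave/registered": 0, "slave/uptime_secs": 42.0},
    "slave-c": {"slave/uptime_secs": 10.0},
}
print(unregistered_slaves(snapshots))  # ['slave-b', 'slave-c']
```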
| #### System |
| |
| The following metrics provide information about the slave system. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>system/cpus_total</code> |
| </td> |
| <td>Number of CPUs available</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/load_15min</code> |
| </td> |
| <td>Load average for the past 15 minutes</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/load_5min</code> |
| </td> |
| <td>Load average for the past 5 minutes</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/load_1min</code> |
| </td> |
| <td>Load average for the past minute</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/mem_free_bytes</code> |
| </td> |
| <td>Free memory in bytes</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>system/mem_total_bytes</code> |
| </td> |
| <td>Total memory in bytes</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Executors |
| |
| The following metrics provide information about the executor instances running |
| on the slave. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>containerizer/mesos/container_destroy_errors</code> |
| </td> |
| <td>Number of containers destroyed due to launch errors</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/container_launch_errors</code> |
| </td> |
| <td>Number of container launch errors</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/executors_preempted</code> |
| </td> |
| <td>Number of executors destroyed due to preemption</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/frameworks_active</code> |
| </td> |
| <td>Number of active frameworks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/executor_directory_max_allowed_age_secs</code> |
| </td> |
| <td>Maximum allowed age in seconds to delete executor directory</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/executors_registering</code> |
| </td> |
| <td>Number of executors registering</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/executors_running</code> |
| </td> |
| <td>Number of executors running</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/executors_terminated</code> |
| </td> |
| <td>Number of terminated executors</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/executors_terminating</code> |
| </td> |
| <td>Number of terminating executors</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/recovery_errors</code> |
| </td> |
| <td>Number of errors encountered during slave recovery</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Tasks |
| |
| The following metrics provide information about active and terminated tasks. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>slave/tasks_failed</code> |
| </td> |
| <td>Number of failed tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/tasks_finished</code> |
| </td> |
| <td>Number of finished tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/tasks_killed</code> |
| </td> |
| <td>Number of killed tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/tasks_lost</code> |
| </td> |
| <td>Number of lost tasks</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/tasks_running</code> |
| </td> |
| <td>Number of running tasks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/tasks_staging</code> |
| </td> |
| <td>Number of staging tasks</td> |
| <td>Gauge</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/tasks_starting</code> |
| </td> |
| <td>Number of starting tasks</td> |
| <td>Gauge</td> |
| </tr> |
| </table> |
| |
| #### Messages |
| |
| The following metrics provide information about messages between the slave and |
| the master it is registered with. |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr><th>Metric</th><th>Description</th><th>Type</th> |
| </thead> |
| <tr> |
| <td> |
| <code>slave/invalid_framework_messages</code> |
| </td> |
| <td>Number of invalid framework messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/invalid_status_updates</code> |
| </td> |
| <td>Number of invalid status updates</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/valid_framework_messages</code> |
| </td> |
| <td>Number of valid framework messages</td> |
| <td>Counter</td> |
| </tr> |
| <tr> |
| <td> |
| <code>slave/valid_status_updates</code> |
| </td> |
| <td>Number of valid status updates</td> |
| <td>Counter</td> |
| </tr> |
| </table> |