This document describes the observability metrics provided by Mesos master and slave nodes, and offers initial guidance on which metrics to monitor to detect abnormal situations in your cluster.
Mesos master and slave nodes report a set of statistics and metrics that enable you to monitor resource usage and detect abnormal situations early. The information reported by Mesos includes details about available resources, used resources, registered frameworks, active slaves, and task state. You can use this information to create automated alerts and to plot different metrics over time inside a monitoring dashboard.
Mesos provides two different kinds of metrics: counters and gauges.
Counters keep track of discrete events and are monotonically increasing. The value of a metric of this type is always a natural number. Examples include the number of failed tasks and the number of slave registrations. For some metrics of this type, the rate of change is more useful than the value itself.
Gauges represent an instantaneous sample of some magnitude. Examples include the amount of used memory in the cluster and the number of connected slaves. For some metrics of this type, it is often useful to determine whether the value is above or below a threshold for a sustained period of time.
The tables in this document indicate the type of each available metric.
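The difference between the two metric types shows up in how you evaluate them. The following is a minimal sketch, not part of Mesos itself: the function names `counter_rate` and `gauge_sustained_above` are hypothetical, and the handling of a counter that goes backwards (treated here as a process restart) is an assumption about how you might want to deal with resets.

```python
from typing import Sequence


def counter_rate(prev_value: float, curr_value: float, interval_secs: float) -> float:
    """Per-second rate of change of a counter between two samples."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    delta = curr_value - prev_value
    # Counters only increase. A negative delta is assumed to mean the
    # process restarted and the counter began again from zero, so only
    # the new value is counted.
    if delta < 0:
        delta = curr_value
    return delta / interval_secs


def gauge_sustained_above(samples: Sequence[float], threshold: float, min_samples: int) -> bool:
    """True if the most recent `min_samples` gauge samples all exceed `threshold`."""
    if min_samples <= 0:
        raise ValueError("min_samples must be positive")
    if len(samples) < min_samples:
        # Not enough history yet to call the condition "sustained".
        return False
    return all(value > threshold for value in samples[-min_samples:])
```

With samples taken at a fixed interval, `min_samples` stands in for the "sustained period of time": three samples at a 60-second interval cover roughly three minutes.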
Metrics from the master node are available at the following URL:
http://<mesos-master-ip>:5050/metrics/snapshot
The response is a JSON object that contains metric names and values as key-value pairs.
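As an illustration of reading this endpoint, the sketch below fetches and parses a snapshot using only the Python standard library. The helper names (`snapshot_url`, `parse_snapshot`, `fetch_snapshot`) and the five-second timeout are hypothetical choices, not part of Mesos; only the URL shape and the JSON-object response come from the description above.

```python
import json
from urllib.request import urlopen


def snapshot_url(host: str, port: int = 5050) -> str:
    """Build the metrics snapshot URL for a node (5050 is the master port)."""
    return f"http://{host}:{port}/metrics/snapshot"


def parse_snapshot(body: str) -> dict:
    """Parse a snapshot response into a dict of metric name -> value."""
    metrics = json.loads(body)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object of metric names to values")
    return metrics


def fetch_snapshot(host: str, port: int = 5050, timeout_secs: float = 5.0) -> dict:
    """Fetch and parse the current metrics snapshot from a node."""
    with urlopen(snapshot_url(host, port), timeout=timeout_secs) as response:
        return parse_snapshot(response.read().decode("utf-8"))
```

Calling `fetch_snapshot` with the address of a master returns the current metrics as a dictionary. Passing `port=5051` would target a slave node instead.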
This section lists all available metrics from Mesos master nodes grouped by category.
The following metrics provide information about the total resources available in the cluster and their current usage. High resource usage for sustained periods of time may indicate that you need to add capacity to your cluster or that a framework is misbehaving.
The following metrics provide information about whether a master is currently elected and how long it has been running. A cluster that has no elected master for a sustained period of time is malfunctioning. This points to either leadership election issues (so check the connection to ZooKeeper) or a flapping master process. A low uptime value indicates that the master has restarted recently.
The following metrics provide information about the resources available on this master node and their current usage. High resource usage in a master node for sustained periods of time may degrade the performance of the cluster.
The following metrics provide information about slave events, slave counts, and slave states. A low number of active slaves may indicate that slaves are unhealthy or that they are not able to connect to the elected master.
The following metrics provide information about the registered frameworks in the cluster. No active or connected frameworks may indicate that a scheduler is not registered or that it is misbehaving.
The following metrics provide information about active and terminated tasks. A high rate of lost tasks may indicate that there is a problem with the cluster. The task states listed here match those of the task state machine.
The following metrics provide information about messages between the master and the slaves and between frameworks and executors. A high rate of dropped messages may indicate that there is a problem with the network.
The following metrics provide information about different types of events in the event queue.
The following metrics provide information about read and write latency to the slave registrar.
This section lists some examples of basic alerts that you can use to detect abnormal situations in a cluster.
The master has restarted.
The cluster has a flapping master node.
Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos.
Slaves are having trouble connecting to the master.
Cluster CPU utilization is close to capacity.
Cluster memory utilization is close to capacity.
No master is currently elected.
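The alerts above can be sketched as checks over two consecutive master snapshots. This is a rough illustration under several assumptions: the metric keys (`master/uptime_secs`, `master/tasks_lost`, `master/slaves_active`, `master/cpus_percent`, `master/mem_percent`, `master/elected`) should be verified against the snapshot your cluster actually returns, the thresholds are hypothetical and need tuning, and the function name `evaluate_alerts` is invented for this example. It also checks a single point in time, whereas several of the alerts are meant to fire only when a condition holds for a sustained period.

```python
from typing import Dict, List

Snapshot = Dict[str, float]

# Hypothetical thresholds; tune them for your cluster.
RESTART_UPTIME_SECS = 60.0
HIGH_UTILIZATION = 0.9
LOST_TASKS_PER_SEC = 0.1


def evaluate_alerts(prev: Snapshot, curr: Snapshot,
                    interval_secs: float, min_active_slaves: int) -> List[str]:
    """Return the alerts triggered by the current snapshot of one master."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    alerts: List[str] = []

    # Gauge: a low uptime means the master restarted recently.
    uptime = curr.get("master/uptime_secs")
    if uptime is not None and uptime < RESTART_UPTIME_SECS:
        alerts.append("master restarted recently")

    # Counter: alert on the rate of change, not on the absolute value.
    key = "master/tasks_lost"
    if key in prev and key in curr:
        if (curr[key] - prev[key]) / interval_secs > LOST_TASKS_PER_SEC:
            alerts.append("tasks are being lost")

    # Gauge: too few active slaves for the expected cluster size.
    active = curr.get("master/slaves_active")
    if active is not None and active < min_active_slaves:
        alerts.append("few active slaves")

    # Gauges: utilization expressed as a fraction of capacity.
    if curr.get("master/cpus_percent", 0.0) > HIGH_UTILIZATION:
        alerts.append("cluster CPU utilization is high")
    if curr.get("master/mem_percent", 0.0) > HIGH_UTILIZATION:
        alerts.append("cluster memory utilization is high")

    # Gauge: this master does not currently hold leadership.
    if curr.get("master/elected") == 0:
        alerts.append("master not elected")

    return alerts
```

In a cluster with several masters, a "no master elected" alert would need to query every master and fire only if none of them reports being elected. Detecting a flapping master would similarly require watching the uptime value over several consecutive samples.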
Metrics from each slave node are available at the following URL:
http://<mesos-slave>:5051/metrics/snapshot
The response is a JSON object that contains metric names and values as key-value pairs.
This section lists all available metrics from Mesos slave nodes grouped by category.
The following metrics provide information about the total resources available in the slave and their current usage.
The following metrics provide information about whether a slave is currently registered with a master and for how long it has been running.
The following metrics provide information about the slave system.
The following metrics provide information about the executor instances running on the slave.
The following metrics provide information about active and terminated tasks.
The following metrics provide information about messages between the slave and the master it is registered with.