
Mesos Observability Metrics

This document describes the observability metrics provided by Mesos master and slave nodes. It also gives initial guidance on which metrics to monitor to detect abnormal situations in your cluster.

Overview

Mesos master and slave nodes report a set of statistics and metrics that enable you to monitor resource usage and detect abnormal situations early. The information reported by Mesos includes details about available resources, used resources, registered frameworks, active slaves, and task state. You can use this information to create automated alerts and to plot different metrics over time inside a monitoring dashboard.

Metric Types

Mesos provides two different kinds of metrics: counters and gauges.

Counters keep track of discrete events and are monotonically increasing. The value of a metric of this type is always a natural number. Examples include the number of failed tasks and the number of slave registrations. For some metrics of this type, the rate of change is often more useful than the value itself.
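As an illustration of why the rate of change matters for counters, the following minimal sketch derives a per-second rate from two samples of the same counter. The function name and the handling of a counter that appears to go backwards (treated here as a reset after a process restart) are assumptions made for this example, not part of Mesos.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Return the per-second rate of change between two counter samples.

    Times are in seconds; values are raw counter readings.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: a counter that went backwards was reset (for example
        # by a restart), so count only what accumulated since the reset.
        delta = curr_value
    return delta / elapsed


if __name__ == "__main__":
    # Two hypothetical readings of a counter taken 60 seconds apart.
    print(counter_rate(10, 0.0, 40, 60.0))
```

A rate computed this way can be plotted or alerted on directly, which is usually more informative than the ever-growing raw value.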

Gauges represent an instantaneous sample of some magnitude. Examples include the amount of used memory in the cluster and the number of connected slaves. For some metrics of this type, it is often useful to determine whether the value is above or below a threshold for a sustained period of time.
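The idea of a gauge staying above a threshold for a sustained period can be sketched as follows. This is only one possible interpretation: the function name, the sample format (a list of timestamp and value pairs), and the rule that the history must cover the whole window are assumptions for this example.

```python
def sustained_above(samples, threshold, duration_secs):
    """Return True if the gauge stayed above threshold for duration_secs.

    samples is a list of (timestamp_secs, value) pairs in increasing
    time order. The check looks at the window ending at the latest sample.
    """
    if not samples:
        return False
    latest_time = samples[-1][0]
    window_start = latest_time - duration_secs
    # Not enough history to cover the whole window yet.
    if samples[0][0] > window_start:
        return False
    recent = [value for (ts, value) in samples if ts >= window_start]
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    # Hypothetical readings of a utilization gauge, one every 30 seconds.
    history = [(0, 0.95), (30, 0.92), (60, 0.97)]
    print(sustained_above(history, 0.9, 60))
```

Requiring every sample in the window to exceed the threshold avoids alerting on a single short spike.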

The tables in this document indicate the type of each available metric.

Master Nodes

Metrics from the master node are available at the following URL:

http://<mesos-master-ip>:5050/metrics/snapshot

The response is a JSON object that contains metric names and values as key-value pairs.
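A minimal sketch of reading this endpoint with the Python standard library is shown below. The helper names `parse_snapshot` and `fetch_snapshot`, the timeout, and the example host are assumptions for illustration; only the URL path and the default port 5050 come from this document.

```python
import json
from urllib.request import urlopen


def parse_snapshot(body):
    """Decode a snapshot response body into a dict of metric name -> value."""
    snapshot = json.loads(body)
    if not isinstance(snapshot, dict):
        raise ValueError("expected a JSON object of metric names and values")
    return snapshot


def fetch_snapshot(host, port=5050, timeout=5.0):
    """Fetch and decode the metrics snapshot from one node."""
    url = "http://{}:{}/metrics/snapshot".format(host, port)
    with urlopen(url, timeout=timeout) as response:
        return parse_snapshot(response.read().decode("utf-8"))


if __name__ == "__main__":
    # "mesos-master.example.com" is a placeholder host name.
    metrics = fetch_snapshot("mesos-master.example.com")
    for name in sorted(metrics):
        print(name, metrics[name])
```

Keeping the parsing step separate from the HTTP request makes it easy to test the decoding logic without a running cluster.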

Observability Metrics

This section lists all available metrics from Mesos master nodes grouped by category.

Resources

The following metrics provide information about the total resources available in the cluster and their current usage. High resource usage for sustained periods of time may indicate that you need to add capacity to your cluster or that a framework is misbehaving.

Master

The following metrics provide information about whether a master is currently elected and how long it has been running. A cluster with no elected master for sustained periods of time is malfunctioning. The cause is either a leader election problem (check the connection to ZooKeeper) or a flapping master process. A low uptime value indicates that the master has restarted recently.

System

The following metrics provide information about the resources available on this master node and their current usage. High resource usage in a master node for sustained periods of time may degrade the performance of the cluster.

Slaves

The following metrics provide information about slave events, slave counts, and slave states. A low number of active slaves may indicate that slaves are unhealthy or that they are not able to connect to the elected master.

Frameworks

The following metrics provide information about the registered frameworks in the cluster. An absence of active or connected frameworks may indicate that a scheduler is not registered or that it is misbehaving.

Tasks

The following metrics provide information about active and terminated tasks. A high rate of lost tasks may indicate that there is a problem with the cluster. The task states listed here match those of the task state machine.

Messages

The following metrics provide information about messages between the master and the slaves and between frameworks and their executors. A high rate of dropped messages may indicate that there is a problem with the network.

Event Queue

The following metrics provide information about different types of events in the event queue.

Registrar

The following metrics provide information about read and write latency to the slave registrar.

Basic Alerts

This section lists some examples of basic alerts that you can use to detect abnormal situations in a cluster.

master/uptime_secs is low

The master has restarted.

master/uptime_secs < 60 for sustained periods of time

The cluster has a flapping master node.

master/tasks_lost is increasing rapidly

Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos.

master/slaves_active is low

Slaves are having trouble connecting to the master.

master/cpus_percent > 0.9 for sustained periods of time

Cluster CPU utilization is close to capacity.

master/mem_percent > 0.9 for sustained periods of time

Cluster memory utilization is close to capacity.

master/elected is 0 for sustained periods of time

No master is currently elected.
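The threshold-based alerts above can be sketched as a small rule table evaluated against one snapshot from a master. This is an illustration under stated assumptions, not a complete alerting system: the function and variable names are hypothetical, `min_active_slaves` is an example parameter because this document does not define what "low" means, and the "sustained" and "increasing rapidly" conditions need several snapshots over time, which a single-snapshot check cannot express.

```python
# Each rule: (metric name, predicate on the value, meaning when it fires).
# Thresholds are the ones quoted in the alert list above.
RULES = [
    ("master/elected", lambda v: v == 0,
     "this master is not elected"),
    ("master/uptime_secs", lambda v: v < 60,
     "the master has restarted recently"),
    ("master/cpus_percent", lambda v: v > 0.9,
     "cluster CPU utilization is close to capacity"),
    ("master/mem_percent", lambda v: v > 0.9,
     "cluster memory utilization is close to capacity"),
]


def check_master_snapshot(snapshot, min_active_slaves=1):
    """Return a list of alert messages for a single master snapshot.

    Metrics missing from the snapshot are skipped rather than treated
    as zero. min_active_slaves is a hypothetical, site-specific floor.
    """
    alerts = []
    for name, is_abnormal, meaning in RULES:
        value = snapshot.get(name)
        if value is not None and is_abnormal(value):
            alerts.append("{}={}: {}".format(name, value, meaning))
    active = snapshot.get("master/slaves_active")
    if active is not None and active < min_active_slaves:
        alerts.append("master/slaves_active={}: fewer than {} active slaves"
                      .format(active, min_active_slaves))
    return alerts


if __name__ == "__main__":
    # Hypothetical snapshot values for demonstration only.
    sample = {"master/elected": 1, "master/uptime_secs": 12,
              "master/cpus_percent": 0.95, "master/slaves_active": 3}
    for line in check_master_snapshot(sample):
        print(line)
```

In practice each rule would be evaluated over a time window so that a brief spike or a planned restart does not trigger an alert.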

Slave Nodes

Metrics from each slave node are available at the following URL:

http://<mesos-slave>:5051/metrics/snapshot

The response is a JSON object that contains metric names and values as key-value pairs.
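Because every slave exposes its own endpoint, a monitoring script typically polls each node in turn. The sketch below shows one way to do that and to record which nodes did not answer. The function names, the host list, and the choice to treat connection and decoding errors as "unreachable" are assumptions for this example; only the URL path and port 5051 come from this document.

```python
import json
from urllib.request import urlopen


def slave_snapshot_url(host, port=5051):
    """Build the snapshot URL for one slave node."""
    return "http://{}:{}/metrics/snapshot".format(host, port)


def http_get_json(url, timeout=5.0):
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_slave_snapshots(hosts, fetch=http_get_json):
    """Poll each slave and return (snapshots by host, unreachable hosts).

    fetch is injectable so the loop can be exercised without a network.
    """
    snapshots, unreachable = {}, []
    for host in hosts:
        try:
            snapshots[host] = fetch(slave_snapshot_url(host))
        except (OSError, ValueError):
            # OSError covers connection failures and timeouts;
            # ValueError covers a body that is not valid JSON.
            unreachable.append(host)
    return snapshots, unreachable


if __name__ == "__main__":
    # Placeholder host names; replace with the slaves in your cluster.
    snaps, down = collect_slave_snapshots(["slave-1.example.com",
                                           "slave-2.example.com"])
    print("reachable:", sorted(snaps))
    print("unreachable:", down)
```

A slave that repeatedly lands in the unreachable list is worth investigating alongside the slave counts reported by the master.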

Observability Metrics

This section lists all available metrics from Mesos slave nodes grouped by category.

Resources

The following metrics provide information about the total resources available in the slave and their current usage.

Slave

The following metrics provide information about whether a slave is currently registered with a master and how long it has been running.

System

The following metrics provide information about the slave system.

Executors

The following metrics provide information about the executor instances running on the slave.

Tasks

The following metrics provide information about active and terminated tasks.

Messages

The following metrics provide information about messages between the slave and the master it is registered with.