This document describes the observability metrics provided by Mesos master and agent nodes. This document also provides some initial guidance on which metrics you should monitor to detect abnormal situations in your cluster.
Mesos master and agent nodes report a set of statistics and metrics that enable cluster operators to monitor resource usage and detect abnormal situations early. The information reported by Mesos includes details about available resources, used resources, registered frameworks, active agents, and task state. You can use this information to create automated alerts and to plot different metrics over time inside a monitoring dashboard.
Metric information is not persisted to disk at either master or agent nodes, which means that metrics will be reset when masters and agents are restarted. Similarly, if the current leading master fails and a new leading master is elected, metrics at the new master will be reset.
Mesos provides two different kinds of metrics: counters and gauges.
Counters keep track of discrete events and are monotonically increasing. The value of a metric of this type is always a natural number. Examples include the number of failed tasks and the number of agent registrations. For some metrics of this type, the rate of change is often more useful than the value itself.
Gauges represent an instantaneous sample of some magnitude. Examples include the amount of used memory in the cluster and the number of connected agents. For some metrics of this type, it is often useful to determine whether the value is above or below a threshold for a sustained period of time.
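To illustrate the point about counters, a per-second rate of change can be derived from two successive samples of a counter metric. The sketch below is a minimal illustration, not part of Mesos itself; it also accounts for the counter resets on restart described above:

```python
def counter_rate(prev_value, curr_value, interval_secs):
    """Per-second rate of change between two samples of a counter.

    Counters are monotonically increasing, so a negative delta means
    the process restarted and the counter was reset; in that case the
    current value is the best available estimate of the delta.
    """
    delta = curr_value - prev_value
    if delta < 0:  # counter was reset by a master/agent restart
        delta = curr_value
    return delta / interval_secs

# Two samples of master/tasks_failed taken 60 seconds apart:
print(counter_rate(120, 132, 60))  # 0.2 failed tasks per second
```

A monitoring dashboard would typically apply this kind of derivation automatically when plotting counter metrics.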
The tables in this document indicate the type of each available metric.
Metrics from each master node are available via the /metrics/snapshot master endpoint. The response is a JSON object that contains metric names and values as key-value pairs.
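For example, the snapshot can be fetched and decoded with a few lines of Python. This is a minimal sketch; the master address and port below are assumptions, so adjust them for your deployment:

```python
import json
import urllib.request

def parse_snapshot(raw_json):
    """Decode a /metrics/snapshot response body into a flat dict of
    metric name -> numeric value."""
    return json.loads(raw_json)

def fetch_metrics(url="http://localhost:5050/metrics/snapshot"):
    """Fetch and decode the snapshot from a master (address assumed)."""
    with urllib.request.urlopen(url) as response:
        return parse_snapshot(response.read())

if __name__ == "__main__":
    metrics = fetch_metrics()
    # Keys are flat, slash-separated metric names:
    print(metrics.get("master/uptime_secs"))
    print(metrics.get("master/tasks_failed"))
```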
This section lists all available metrics from Mesos master nodes grouped by category.
The following metrics provide information about the total resources available in the cluster and their current usage. High resource usage for sustained periods of time may indicate that you need to add capacity to your cluster or that a framework is misbehaving.
The following metrics provide information about whether a master is currently elected and how long it has been running. A cluster with no elected master for sustained periods of time indicates a malfunctioning cluster. This points either to leadership election issues (check the connection to ZooKeeper) or to a flapping master process. A low uptime value indicates that the master has restarted recently.
The following metrics provide information about the resources available on this master node and their current usage. High resource usage in a master node for sustained periods of time may degrade the performance of the cluster.
The following metrics provide information about agent events, agent counts, and agent states. A low number of active agents may indicate that agents are unhealthy or that they are not able to connect to the elected master.
The following metrics provide information about the registered frameworks in the cluster. No active or connected frameworks may indicate that a scheduler is not registered or that it is misbehaving.
The following metrics are added for each framework that registers with the master, in order to provide detailed information about the behavior of the framework. The framework name is percent-encoded before creating these metrics; the actual name can be recovered by percent-decoding.
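For example, the percent-encoding and decoding described above can be done with Python's standard urllib.parse; the framework name used here is a made-up example:

```python
from urllib.parse import quote, unquote

# A hypothetical framework name as it would appear in a metric key:
encoded = quote("My Framework (prod)", safe="")
print(encoded)           # My%20Framework%20%28prod%29
print(unquote(encoded))  # My Framework (prod)
```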
The following metrics provide information about active and terminated tasks. A high rate of lost tasks may indicate that there is a problem with the cluster. The task states listed here match those of the task state machine.
The following metrics provide information about offer operations on the master.
Below, OPERATION_TYPE refers to any one of reserve, unreserve, create, destroy, grow_volume, shrink_volume, create_disk, or destroy_disk.
NOTE: The counter for terminal operation states can over-count over time. In particular, if an agent contained unacknowledged terminal status updates when it was marked gone or marked unreachable, these operations will be double-counted as both their original state and OPERATION_GONE/OPERATION_UNREACHABLE.
The following metrics provide information about messages between the master and the agents and between the framework and the executors. A high rate of dropped messages may indicate that there is a problem with the network.
The following metrics provide information about different types of events in the event queue.
The following metrics provide information about read and write latency to the agent registrar.
The following metrics provide information about the replicated log underneath the registrar, which is the persistent store for masters.
The following metrics provide information about performance and resource allocations in the allocator.
This section lists some examples of basic alerts that you can use to detect abnormal situations in a cluster.
The master has restarted.
The cluster has a flapping master node.
Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos.
Agents are having trouble connecting to the master.
Cluster CPU utilization is close to capacity.
Cluster memory utilization is close to capacity.
No master is currently elected.
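The alerts above can be evaluated against a metrics snapshot with a few simple rules. The sketch below is illustrative only; the thresholds are assumptions, not recommendations, and the "sustained period" logic would be handled by your monitoring system:

```python
def check_alerts(metrics):
    """Evaluate a metrics snapshot (dict of name -> value) against a
    few example rules and return the list of alerts that fired."""
    alerts = []
    # Master restarted recently (low uptime; 60s threshold is an assumption).
    if metrics.get("master/uptime_secs", float("inf")) < 60:
        alerts.append("master restarted")
    # No master currently elected.
    if metrics.get("master/elected") == 0:
        alerts.append("no elected master")
    # Cluster CPU utilization close to capacity (illustrative 90% threshold).
    total = metrics.get("master/cpus_total", 0)
    if total and metrics.get("master/cpus_used", 0) / total > 0.9:
        alerts.append("cluster CPU near capacity")
    return alerts

sample = {"master/uptime_secs": 30, "master/elected": 1,
          "master/cpus_total": 100, "master/cpus_used": 95}
print(check_alerts(sample))  # ['master restarted', 'cluster CPU near capacity']
```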
Metrics from each agent node are available via the /metrics/snapshot agent endpoint. The response is a JSON object that contains metric names and values as key-value pairs.
This section lists all available metrics from Mesos agent nodes grouped by category.
The following metrics provide information about the total resources available in the agent and their current usage.
The following metrics provide information about whether an agent is currently registered with a master and for how long it has been running.
The following metrics provide information about the agent system.
The following metrics provide information about the executor instances running on the agent.
The following metrics provide information about active and terminated tasks.
The following metrics provide information about messages between the agent and the master it is registered with.
The following metrics provide information about both Mesos and Docker containerizers.
The following metrics provide information about ongoing and completed operations that apply to resources provided by a resource provider with the given type and name. In the following metrics, the operation placeholder refers to the name of a particular operation type, which is described in the list of supported operation types.
Since the supported operation types may vary among different resource providers, the following is a comprehensive list of operation types and the corresponding resource providers that support them. Note that the name column is for the operation placeholder in the above metrics.
For example, cluster operators can monitor the number of successful CREATE_DISK operations applied to the resource provider with type org.apache.mesos.rp.local.storage and name lvm through the resource_providers/org.apache.mesos.rp.local.storage.lvm/operations/create_disk/finished metric.
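The metric key in the example above follows a fixed pattern, which can be built programmatically. This helper is a sketch inferred from that example, not part of any Mesos API:

```python
def rp_operation_metric(rp_type, rp_name, operation, state):
    """Build the metric key for an operation on a resource provider,
    following the pattern resource_providers/TYPE.NAME/operations/OP/STATE."""
    return f"resource_providers/{rp_type}.{rp_name}/operations/{operation}/{state}"

print(rp_operation_metric("org.apache.mesos.rp.local.storage", "lvm",
                          "create_disk", "finished"))
# resource_providers/org.apache.mesos.rp.local.storage.lvm/operations/create_disk/finished
```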
Storage resource providers in Mesos are backed by CSI plugins running in standalone containers. To monitor the health of these CSI plugins for a storage resource provider with the given type and name, the following metrics provide information about plugin terminations and about ongoing and completed CSI calls made to the plugin.