This document describes the observability metrics provided by Mesos master and agent nodes. This document also provides some initial guidance on which metrics you should monitor to detect abnormal situations in your cluster.
Mesos master and agent nodes report a set of statistics and metrics that enable cluster operators to monitor resource usage and detect abnormal situations early. The information reported by Mesos includes details about available resources, used resources, registered frameworks, active agents, and task state. You can use this information to create automated alerts and to plot different metrics over time inside a monitoring dashboard.
Metric information is not persisted to disk at either master or agent nodes, which means that metrics will be reset when masters and agents are restarted. Similarly, if the current leading master fails and a new leading master is elected, metrics at the new master will be reset.
Mesos provides two different kinds of metrics: counters and gauges.
Counters keep track of discrete events and are monotonically increasing. The value of a metric of this type is always a natural number. Examples include the number of failed tasks and the number of agent registrations. For some metrics of this type, the rate of change is often more useful than the value itself.
Gauges represent an instantaneous sample of some magnitude. Examples include the amount of used memory in the cluster and the number of connected agents. For some metrics of this type, it is often useful to determine whether the value is above or below a threshold for a sustained period of time.
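To illustrate the point about counters, a per-second rate of change can be derived from two successive samples of a counter metric. The sketch below is a minimal illustration, not part of Mesos itself; it also accounts for the counter resets on restart described above:

```python
def counter_rate(prev_value, curr_value, interval_secs):
    """Per-second rate of change between two samples of a counter.

    Counters are monotonically increasing, so a negative delta means
    the process restarted and the counter was reset; in that case the
    current value is the best available estimate of the delta.
    """
    delta = curr_value - prev_value
    if delta < 0:  # counter was reset by a master/agent restart
        delta = curr_value
    return delta / interval_secs

# Two samples of master/tasks_failed taken 60 seconds apart:
print(counter_rate(120, 132, 60))  # 0.2 failed tasks per second
```

A monitoring dashboard would typically apply this kind of derivation automatically when plotting counter metrics.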
The tables in this document indicate the type of each available metric.
Metrics from each master node are available via the /metrics/snapshot master endpoint. The response is a JSON object that contains metric names and values as key-value pairs.
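For example, the snapshot can be fetched and decoded with a few lines of Python. This is a minimal sketch; the master address and port below are assumptions, so adjust them for your deployment:

```python
import json
import urllib.request

def parse_snapshot(raw_json):
    """Decode a /metrics/snapshot response body into a flat dict of
    metric name -> numeric value."""
    return json.loads(raw_json)

def fetch_metrics(url="http://localhost:5050/metrics/snapshot"):
    """Fetch and decode the snapshot from a master (address assumed)."""
    with urllib.request.urlopen(url) as response:
        return parse_snapshot(response.read())

if __name__ == "__main__":
    metrics = fetch_metrics()
    # Keys are flat, slash-separated metric names:
    print(metrics.get("master/uptime_secs"))
    print(metrics.get("master/tasks_failed"))
```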
This section lists all available metrics from Mesos master nodes grouped by category.
The following metrics provide information about the total resources available in the cluster and their current usage. High resource usage for sustained periods of time may indicate that you need to add capacity to your cluster or that a framework is misbehaving.
The following metrics provide information about whether a master is currently elected and how long it has been running. A cluster with no elected master for sustained periods of time indicates a malfunctioning cluster. This points either to leadership election issues (check the connection to ZooKeeper) or to a flapping master process. A low uptime value indicates that the master has restarted recently.
The following metrics provide information about the resources available on this master node and their current usage. High resource usage in a master node for sustained periods of time may degrade the performance of the cluster.
The following metrics provide information about agent events, agent counts, and agent states. A low number of active agents may indicate that agents are unhealthy or that they are not able to connect to the elected master.
The following metrics provide information about the registered frameworks in the cluster. No active or connected frameworks may indicate that a scheduler is not registered or that it is misbehaving.
The following metrics are added for each framework that registers with the master, in order to provide detailed information about the behavior of the framework. The framework name is percent-encoded before creating these metrics; the actual name can be recovered by percent-decoding.
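For example, the percent-encoding and decoding described above can be done with Python's standard urllib.parse; the framework name used here is a made-up example:

```python
from urllib.parse import quote, unquote

# A hypothetical framework name as it would appear in a metric key:
encoded = quote("My Framework (prod)", safe="")
print(encoded)           # My%20Framework%20%28prod%29
print(unquote(encoded))  # My Framework (prod)
```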
The following metrics provide information about active and terminated tasks. A high rate of lost tasks may indicate that there is a problem with the cluster. The task states listed here match those of the task state machine.
The following metrics provide information about offer operations on the master.
Below, OPERATION_TYPE refers to any one of reserve, unreserve, create, destroy, grow_volume, shrink_volume, create_disk, or destroy_disk.
NOTE: The counter for terminal operation states can over-count over time. In particular, if an agent contained unacknowledged terminal status updates when it was marked gone or marked unreachable, these operations will be double-counted as both their original state and OPERATION_GONE/OPERATION_UNREACHABLE.
The following metrics provide information about messages between the master and the agents and between the framework and the executors. A high rate of dropped messages may indicate that there is a problem with the network.
The following metrics provide information about different types of events in the event queue.
The following metrics provide information about read and write latency to the agent registrar.
The following metrics provide information about the replicated log underneath the registrar, which is the persistent store for masters.
The following metrics provide information about performance and resource allocations in the allocator.
This section lists some examples of basic alerts that you can use to detect abnormal situations in a cluster.
The master has restarted.
The cluster has a flapping master node.
Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos.
Agents are having trouble connecting to the master.
Cluster CPU utilization is close to capacity.
Cluster memory utilization is close to capacity.
No master is currently elected.
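The alerts above can be evaluated against a metrics snapshot with a few simple rules. The sketch below is illustrative only; the thresholds are assumptions, not recommendations, and the "sustained period" logic would be handled by your monitoring system:

```python
def check_alerts(metrics):
    """Evaluate a metrics snapshot (dict of name -> value) against a
    few example rules and return the list of alerts that fired."""
    alerts = []
    # Master restarted recently (low uptime; 60s threshold is an assumption).
    if metrics.get("master/uptime_secs", float("inf")) < 60:
        alerts.append("master restarted")
    # No master currently elected.
    if metrics.get("master/elected") == 0:
        alerts.append("no elected master")
    # Cluster CPU utilization close to capacity (illustrative 90% threshold).
    total = metrics.get("master/cpus_total", 0)
    if total and metrics.get("master/cpus_used", 0) / total > 0.9:
        alerts.append("cluster CPU near capacity")
    return alerts

sample = {"master/uptime_secs": 30, "master/elected": 1,
          "master/cpus_total": 100, "master/cpus_used": 95}
print(check_alerts(sample))  # ['master restarted', 'cluster CPU near capacity']
```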
Metrics from each agent node are available via the /metrics/snapshot agent endpoint. The response is a JSON object that contains metric names and values as key-value pairs.
This section lists all available metrics from Mesos agent nodes grouped by category.
The following metrics provide information about the total resources available in the agent and their current usage.
The following metrics provide information about whether an agent is currently registered with a master and for how long it has been running.
The following metrics provide information about the agent system.
The following metrics provide information about the executor instances running on the agent.
The following metrics provide information about active and terminated tasks.
The following metrics provide information about messages between the agent and the master it is registered with.
The following metrics provide information about both Mesos and Docker containerizers.
The following metrics provide information about ongoing and completed operations that apply to resources provided by a resource provider with the given type and name. In the following metrics, the operation placeholder refers to the name of a particular operation type, which is described in the list of supported operation types.
Since the supported operation types may vary among different resource providers, the following is a comprehensive list of operation types and the corresponding resource providers that support them. Note that the name column is for the operation placeholder in the above metrics.
For example, cluster operators can monitor the number of successful CREATE_DISK operations applied to the resource provider with type org.apache.mesos.rp.local.storage and name lvm through the resource_providers/org.apache.mesos.rp.local.storage.lvm/operations/create_disk/finished metric.
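The metric key in the example above follows a fixed pattern, which can be built programmatically. This helper is a sketch inferred from that example, not part of any Mesos API:

```python
def rp_operation_metric(rp_type, rp_name, operation, state):
    """Build the metric key for an operation on a resource provider,
    following the pattern resource_providers/TYPE.NAME/operations/OP/STATE."""
    return f"resource_providers/{rp_type}.{rp_name}/operations/{operation}/{state}"

print(rp_operation_metric("org.apache.mesos.rp.local.storage", "lvm",
                          "create_disk", "finished"))
# resource_providers/org.apache.mesos.rp.local.storage.lvm/operations/create_disk/finished
```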
Storage resource providers in Mesos are backed by CSI plugins running in standalone containers. To monitor the health of these CSI plugins for a storage resource provider with the given type and name, the following metrics provide information about plugin terminations and about ongoing and completed CSI calls made to the plugin.