---
title: Cluster Metrics
layout: documentation
documentation: true
---

# Cluster Metrics

There are many metrics available to help you monitor a running cluster. Both these metrics and the metrics system itself are still a work in progress, so any of them may change, even between minor version releases. We will try to keep them as stable as possible, but they should all be considered somewhat unstable. Some metrics may also cover experimental or incomplete features, so please read a metric's description before using it for monitoring or alerting.

Also be aware that, depending on the metrics system you use, the names are likely to be translated into a different format that is compatible with that system. Typically this means that the `:` separating character will be replaced with a `.` character.
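As an illustration (this is not Storm's actual reporter code, and the function name is hypothetical), a reporter-side name translation might look like:

```python
def translate_metric_name(name: str, separator: str = ".") -> str:
    """Translate a Storm metric name into a reporter-friendly form.

    Many reporting systems do not allow ':' in metric names, so a
    reporter typically replaces it with its own separator character.
    """
    return name.replace(":", separator)

# A name like "cluster:num-supervisors" becomes "cluster.num-supervisors".
print(translate_metric_name("cluster:num-supervisors"))
```

Keep this translation in mind when searching for a metric from this page in your reporting system's UI.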

Most metrics include the units they are reported in as part of the description. For timers, the reporting unit is often configured by the reporter that uploads them to your system. Pay attention: even if the metric name contains a time unit, it may not match the unit actually reported.

Also, most metrics, except for gauges and counters, are a collection of numbers rather than a single value. These often result in multiple metrics being uploaded to a reporting system, such as percentiles for a histogram or rates for a meter. How this happens, and how a name here corresponds to the metric in your reporting system, depends on the configured metrics reporter.
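To make the fan-out concrete, here is a sketch of how one histogram could expand into several reported values. The statistic names (`count`, `mean`, `p95`, `p99`) and the dotted sub-metric naming are illustrative assumptions, not what any particular reporter is guaranteed to emit:

```python
def expand_histogram(name: str, snapshot: dict) -> dict:
    """Fan a histogram snapshot out into one reported value per statistic.

    A single Storm histogram such as topologies:num-workers typically
    shows up in a reporting system as several metrics, one per statistic.
    """
    return {f"{name}.{stat}": value for stat, value in snapshot.items()}

# A hypothetical snapshot of topologies:num-workers across all topologies.
snapshot = {"count": 120, "mean": 3.5, "p95": 9.0, "p99": 14.0}
for full_name, value in expand_histogram("topologies:num-workers", snapshot).items():
    print(full_name, value)
```

So a single row in the tables below may correspond to several time series in your backend.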

## Cluster Metrics (From Nimbus)

These are metrics that come from the active nimbus instance and report the state of the cluster as a whole, as seen by nimbus.

| Metric Name | Type | Description |
|-------------|------|-------------|
| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a leader. This should only ever be 1 in a healthy cluster, or 0 for a short period of time while a failover happens. |
| cluster:num-nimbuses | gauge | Number of nimbuses, leader or standby. |
| cluster:num-supervisors | gauge | Number of supervisors. |
| cluster:num-topologies | gauge | Number of topologies. |
| cluster:num-total-used-workers | gauge | Number of used workers/slots. |
| cluster:num-total-workers | gauge | Number of workers/slots. |
| cluster:total-fragmented-cpu-non-negative | gauge | Total fragmented CPU (% of a core). This is CPU that the system thinks it cannot use because other resources on the node are used up. |
| cluster:total-fragmented-memory-non-negative | gauge | Total fragmented memory (MB). This is memory that the system thinks it cannot use because other resources on the node are used up. |
| topologies:assigned-cpu | histogram | CPU scheduled per topology (% of a core). |
| topologies:assigned-mem-off-heap | histogram | Off-heap memory scheduled per topology (MB). |
| topologies:assigned-mem-on-heap | histogram | On-heap memory scheduled per topology (MB). |
| topologies:num-executors | histogram | Number of executors per topology. |
| topologies:num-tasks | histogram | Number of tasks per topology. |
| topologies:num-workers | histogram | Number of workers per topology. |
| topologies:replication-count | histogram | Replication count per topology. |
| topologies:requested-cpu | histogram | CPU requested per topology (% of a core). |
| topologies:requested-mem-off-heap | histogram | Off-heap memory requested per topology (MB). |
| topologies:requested-mem-on-heap | histogram | On-heap memory requested per topology (MB). |
| topologies:uptime-secs | histogram | Uptime per topology (seconds). |
| nimbus:available-cpu-non-negative | gauge | Available CPU on the cluster (% of a core). |
| nimbus:total-cpu | gauge | Total CPU on the cluster (% of a core). |
| nimbus:total-memory | gauge | Total memory on the cluster (MB). |
| supervisors:fragmented-cpu | histogram | Fragmented CPU per supervisor (% of a core). |
| supervisors:fragmented-mem | histogram | Fragmented memory per supervisor (MB). |
| supervisors:num-used-workers | histogram | Workers used per supervisor. |
| supervisors:num-workers | histogram | Number of workers per supervisor. |
| supervisors:uptime-secs | histogram | Uptime of supervisors (seconds). |
| supervisors:used-cpu | histogram | CPU used per supervisor (% of a core). |
| supervisors:used-mem | histogram | Memory used per supervisor (MB). |
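For example, `cluster:num-nimbus-leaders` lends itself to a simple alert rule: a healthy cluster has exactly one leader, zero leaders is expected only briefly during failover, and more than one should never happen. A sketch of that classification (the function name and severity labels are our own, not part of Storm):

```python
def check_nimbus_leaders(num_leaders: int) -> str:
    """Classify the value of the cluster:num-nimbus-leaders gauge."""
    if num_leaders == 1:
        return "ok"
    if num_leaders == 0:
        # Expected for a short window during failover; alert only if it persists.
        return "warn"
    # More than one leader should never happen in a healthy cluster.
    return "critical"

print(check_nimbus_leaders(1))
```

In practice you would feed this from whatever backend your metrics reporter uploads to, and suppress the `warn` case unless it lasts longer than your expected failover time.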

## Nimbus Metrics

These metrics are specific to a nimbus instance. In many cases only the active nimbus reports them, but they can come from standby nimbus instances as well.

| Metric Name | Type | Description |
|-------------|------|-------------|
| nimbus:files-upload-duration-ms | timer | Time it takes to upload a file from start to finish (not blobs, but this may change). |
| nimbus:longest-scheduling-time-ms | gauge | Longest time ever taken so far to schedule. This includes the current scheduling run, which is intended to detect if scheduling is stuck for some reason. |
| nimbus:mkAssignments-Errors | meter | Tracks exceptions from mkAssignments. |
| nimbus:num-activate-calls | meter | Calls to the activate thrift method. |
| nimbus:num-added-executors-per-scheduling | histogram | Number of executors added after a scheduling run. |
| nimbus:num-added-slots-per-scheduling | histogram | Number of slots added after a scheduling run. |
| nimbus:num-beginFileUpload-calls | meter | Calls to the beginFileUpload thrift method. |
| nimbus:num-blacklisted-supervisor | gauge | Number of supervisors currently marked as blacklisted because they appear to be somewhat unstable. |
| nimbus:num-deactivate-calls | meter | Calls to the deactivate thrift method. |
| nimbus:num-debug-calls | meter | Calls to the debug thrift method. |
| nimbus:num-downloadChunk-calls | meter | Calls to the downloadChunk thrift method. |
| nimbus:num-finishFileUpload-calls | meter | Calls to the finishFileUpload thrift method. |
| nimbus:num-gained-leadership | meter | Number of times this nimbus gained leadership. |
| nimbus:num-getClusterInfo-calls | meter | Calls to the getClusterInfo thrift method. |
| nimbus:num-getComponentPageInfo-calls | meter | Calls to the getComponentPageInfo thrift method. |
| nimbus:num-getComponentPendingProfileActions-calls | meter | Calls to the getComponentPendingProfileActions thrift method. |
| nimbus:num-getLeader-calls | meter | Calls to the getLeader thrift method. |
| nimbus:num-getLogConfig-calls | meter | Calls to the getLogConfig thrift method. |
| nimbus:num-getNimbusConf-calls | meter | Calls to the getNimbusConf thrift method. |
| nimbus:num-getOwnerResourceSummaries-calls | meter | Calls to the getOwnerResourceSummaries thrift method. |
| nimbus:num-getSupervisorPageInfo-calls | meter | Calls to the getSupervisorPageInfo thrift method. |
| nimbus:num-getTopology-calls | meter | Calls to the getTopology thrift method. |
| nimbus:num-getTopologyConf-calls | meter | Calls to the getTopologyConf thrift method. |
| nimbus:num-getTopologyInfo-calls | meter | Calls to the getTopologyInfo thrift method. |
| nimbus:num-getTopologyInfoWithOpts-calls | meter | Calls to the getTopologyInfoWithOpts thrift method; includes calls to getTopologyInfo. |
| nimbus:num-getTopologyPageInfo-calls | meter | Calls to the getTopologyPageInfo thrift method. |
| nimbus:num-getUserTopology-calls | meter | Calls to the getUserTopology thrift method. |
| nimbus:num-isTopologyNameAllowed-calls | meter | Calls to the isTopologyNameAllowed thrift method. |
| nimbus:num-killTopology-calls | meter | Calls to the killTopology thrift method. |
| nimbus:num-killTopologyWithOpts-calls | meter | Calls to the killTopologyWithOpts thrift method; includes calls to killTopology. |
| nimbus:num-launched | meter | Number of times a nimbus was launched. |
| nimbus:num-lost-leadership | meter | Number of times this nimbus lost leadership. |
| nimbus:num-negative-resource-events | meter | Any time a resource goes negative (either CPU or memory). This metric is not ideal, as it is measured in a data structure used for internal calculations; it may go negative without actually representing over-scheduling of a resource. |
| nimbus:num-net-executors-increase-per-scheduling | histogram | Added executors minus removed executors after a scheduling run. |
| nimbus:num-net-slots-increase-per-scheduling | histogram | Added slots minus removed slots after a scheduling run. |
| nimbus:num-rebalance-calls | meter | Calls to the rebalance thrift method. |
| nimbus:num-removed-executors-per-scheduling | histogram | Number of executors removed after a scheduling run. |
| nimbus:num-scheduling-timeouts | meter | Number of timeouts during scheduling. |
| nimbus:num-removed-slots-per-scheduling | histogram | Number of slots removed after a scheduling run. |
| nimbus:num-setLogConfig-calls | meter | Calls to the setLogConfig thrift method. |
| nimbus:num-setWorkerProfiler-calls | meter | Calls to the setWorkerProfiler thrift method. |
| nimbus:num-shutdown-calls | meter | Times nimbus is shut down (this may not actually be reported, as nimbus is in the middle of shutting down). |
| nimbus:num-submitTopology-calls | meter | Calls to the submitTopology thrift method. |
| nimbus:num-submitTopologyWithOpts-calls | meter | Calls to the submitTopologyWithOpts thrift method; includes calls to submitTopology. |
| nimbus:num-uploadChunk-calls | meter | Calls to the uploadChunk thrift method. |
| nimbus:num-uploadNewCredentials-calls | meter | Calls to the uploadNewCredentials thrift method. |
| nimbus:process-worker-metric-calls | meter | Calls to the processWorkerMetrics thrift method. |
| nimbus:scheduler-internal-errors | meter | Tracks internal scheduling errors. |
| nimbus:topology-scheduling-duration-ms | timer | Time it takes to do a scheduling run. |
| nimbus:total-available-memory-non-negative | gauge | Available memory on the cluster (MB). |
| nimbuses:uptime-secs | histogram | Uptime of nimbuses (seconds). |
| MetricsCleaner:purgeTimestamp | gauge | Last time metrics were purged (unfinished feature). |
| RocksDB:metric-failures | meter | Generally any failure that happens in the RocksDB metrics store (unfinished feature). |
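The description of `nimbus:longest-scheduling-time-ms` suggests a simple stuck-scheduler check. Note that the gauge is a running maximum that includes the in-progress run, so crossing a threshold means some run (possibly the current one) exceeded it. A sketch, with a threshold that is purely an assumption to be tuned per cluster:

```python
def scheduling_possibly_stuck(longest_ms: int,
                              threshold_ms: int = 5 * 60 * 1000) -> bool:
    """Flag a possible scheduling problem from nimbus:longest-scheduling-time-ms.

    Because the gauge never decreases, a True result means scheduling is
    either stuck right now or exceeded the threshold at some point in the
    past; correlate with nimbus:topology-scheduling-duration-ms to tell
    the two cases apart.
    """
    return longest_ms > threshold_ms

print(scheduling_possibly_stuck(12_000))   # a 12-second run is under the default
print(scheduling_possibly_stuck(600_000))  # a 10-minute run trips the default
```
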

## DRPC Metrics

Metrics related to DRPC servers.

| Metric Name | Type | Description |
|-------------|------|-------------|
| drpc:HTTP-request-response-duration | timer | How long it takes to execute an HTTP DRPC request. |
| drpc:num-execute-calls | meter | Calls to execute a DRPC request. |
| drpc:num-execute-http-requests | meter | HTTP requests to the DRPC server. |
| drpc:num-failRequest-calls | meter | Calls to failRequest. |
| drpc:num-fetchRequest-calls | meter | Calls to fetchRequest. |
| drpc:num-result-calls | meter | Calls to returnResult. |
| drpc:num-server-timedout-requests | meter | Times a DRPC request timed out without a response. |
| drpc:num-shutdown-calls | meter | Number of times shutdown is called on the DRPC server. |

## Logviewer Metrics

Metrics related to the logviewer process. This process currently also handles cleaning up worker logs when they get too large or too old.

| Metric Name | Type | Description |
|-------------|------|-------------|
| logviewer:cleanup-routine-duration-ms | timer | How long it takes to run the log cleanup routine. |
| logviewer:deep-search-request-duration-ms | timer | How long it takes for /deepSearch/{topoId}. |
| logviewer:disk-space-freed-in-bytes | histogram | Number of bytes cleaned up each time through the cleanup routine. |
| logviewer:download-file-size-rounded-MB | histogram | Size in MB of files being downloaded. |
| logviewer:num-daemonlog-page-http-requests | meter | Calls to /daemonlog. |
| logviewer:num-deep-search-no-result | meter | Number of deep search requests that did not return any results. |
| logviewer:num-deep-search-requests-with-archived | meter | Calls to /deepSearch/{topoId} with ?search-archived=true. |
| logviewer:num-deep-search-requests-without-archived | meter | Calls to /deepSearch/{topoId} with ?search-archived=false. |
| logviewer:num-download-daemon-log-exceptions | meter | Number of errors in calls to /daemondownload. |
| logviewer:num-download-dump-exceptions | meter | Number of errors in calls to /dumps/{topo-id}/{host-port}/(unknown). |
| logviewer:num-download-log-daemon-file-http-requests | meter | Calls to /daemondownload. |
| logviewer:num-download-log-exceptions | meter | Number of errors in calls to /download. |
| logviewer:num-download-log-file-http-requests | meter | Calls to /download. |
| logviewer:num-file-download-exceptions | meter | Errors while trying to download files. |
| logviewer:num-file-download-exceptions | meter | Number of exceptions trying to download a log file. |
| logviewer:num-file-open-exceptions | meter | Errors trying to open a file (when deleting logs). |
| logviewer:num-file-open-exceptions | meter | Number of exceptions trying to open a log file for serving. |
| logviewer:num-file-read-exceptions | meter | Number of exceptions trying to read from a log file for serving. |
| logviewer:num-file-removal-exceptions | meter | Number of exceptions trying to clean up files. |
| logviewer:num-files-cleaned-up | histogram | Number of files cleaned up each time through the cleanup routine. |
| logviewer:num-files-scanned-per-deep-search | histogram | Number of files scanned per deep search. |
| logviewer:num-list-dump-files-exceptions | meter | Number of errors in calls to /dumps/{topo-id}/{host-port}. |
| logviewer:num-list-logs-http-request | meter | Calls to /listLogs. |
| logviewer:num-log-page-http-requests | meter | Calls to /log. |
| logviewer:num-other-cleanup-exceptions | meter | Number of exceptions in the cleanup loop not directly related to deleting files. |
| logviewer:num-page-read | meter | Number of pages (parts of a log file) that are served up. |
| logviewer:num-read-daemon-log-exceptions | meter | Number of errors in calls to /daemonlog. |
| logviewer:num-read-log-exceptions | meter | Number of errors in calls to /log. |
| logviewer:num-search-exceptions | meter | Number of errors in calls to /search. |
| logviewer:num-search-log-exceptions | meter | Number of errors in calls to /listLogs. |
| logviewer:num-search-logs-requests | meter | Calls to /search. |
| logviewer:num-search-request-no-result | meter | Number of regular search results that were empty. |
| logviewer:num-set-permission-exceptions | meter | Number of errors running set permissions to open up files for reading. |
| logviewer:num-shutdown-calls | meter | Number of times shutdown was called on the logviewer. |
| logviewer:search-requests-duration-ms | timer | How long it takes for /search. |
| logviewer:worker-log-dir-size | gauge | Size in bytes of the worker logs directory. |

## Supervisor Metrics

Metrics associated with the supervisor, which launches the workers for a topology. The supervisor also has a state machine for each slot. Some of the metrics are associated with that state machine and can be confusing if you do not understand the state machine.

| Metric Name | Type | Description |
|-------------|------|-------------|
| supervisor:blob-cache-update-duration | timer | How long it takes to update all of the blobs in the cache (frequently just checking if they have changed, but may also include downloading them). |
| supervisor:blob-fetching-rate-MB/s | histogram | Download rate of a blob in MB/sec. Blobs are downloaded rarely, so it is very bursty. |
| supervisor:blob-localization-duration | timer | Approximately how long it takes to get the blob we want after it is requested. |
| supervisor:current-reserved-memory-mb | gauge | Total amount of memory reserved for workers on the supervisor (MB). |
| supervisor:current-used-memory-mb | gauge | Memory currently used as measured by the supervisor (this typically requires cgroups) (MB). |
| supervisor:local-resource-file-not-found-when-releasing-slot | meter | Number of times a file-not-found exception happens when reading local blobs upon releasing slots. |
| supervisor:num-blob-update-version-changed | meter | Number of times a version of a blob changes. |
| supervisor:num-cleanup-exceptions | meter | Exceptions thrown during container cleanup. |
| supervisor:num-force-kill-exceptions | meter | Exceptions thrown during force kill. |
| supervisor:num-kill-exceptions | meter | Exceptions thrown during kill. |
| supervisor:num-launched | meter | Number of times the supervisor is launched. |
| supervisor:num-shell-exceptions | meter | Number of exceptions calling shell commands. |
| supervisor:num-slots-used-gauge | gauge | Number of slots used on the supervisor. |
| supervisor:num-worker-start-timed-out | meter | Number of times worker start timed out. |
| supervisor:num-worker-transitions-into-empty | meter | Number of transitions into the empty state. |
| supervisor:num-worker-transitions-into-kill | meter | Number of transitions into the kill state. |
| supervisor:num-worker-transitions-into-kill-and-relaunch | meter | Number of transitions into the kill-and-relaunch state. |
| supervisor:num-worker-transitions-into-kill-blob-update | meter | Number of transitions into the kill-blob-update state. |
| supervisor:num-worker-transitions-into-running | meter | Number of transitions into the running state. |
| supervisor:num-worker-transitions-into-waiting-for-blob-localization | meter | Number of transitions into the waiting-for-blob-localization state. |
| supervisor:num-worker-transitions-into-waiting-for-blob-update | meter | Number of transitions into the waiting-for-blob-update state. |
| supervisor:num-worker-transitions-into-waiting-for-worker-start | meter | Number of transitions into the waiting-for-worker-start state. |
| supervisor:num-workers-force-kill | meter | Number of times a worker was force killed. This may mean that the worker did not exit cleanly/quickly. |
| supervisor:num-workers-killed-assignment-changed | meter | Workers killed because the assignment changed. |
| supervisor:num-workers-killed-blob-changed | meter | Workers killed because a blob changed and they needed to be relaunched. |
| supervisor:num-workers-killed-hb-null | meter | Workers killed because there was no heartbeat at all from the worker. This would typically only happen when a worker is launched for the first time. |
| supervisor:num-workers-killed-hb-timeout | meter | Workers killed because the heartbeat from the worker was too old. This often happens because of GC issues in the worker that prevent it from sending a heartbeat, but could also mean the worker process exited when the supervisor was not the parent of the process and so could not know that it exited. |
| supervisor:num-workers-killed-memory-violation | meter | Workers killed because the worker was using too much memory. If the supervisor can monitor memory usage of the worker (typically through cgroups) and the worker goes over the limit, it may be shot. |
| supervisor:num-workers-killed-process-exit | meter | Workers killed because the process exited and the supervisor was the parent process. |
| supervisor:num-workers-launched | meter | Number of workers launched. |
| supervisor:single-blob-localization-duration | timer | How long it takes for a blob to be updated (downloaded, unzipped, slots informed, and the move made). |
| supervisor:time-worker-spent-in-state-empty-ms | timer | Time spent in the empty state as it transitions out. Not necessarily in ms. |
| supervisor:time-worker-spent-in-state-kill-and-relaunch-ms | timer | Time spent in the kill-and-relaunch state as it transitions out. Not necessarily in ms. |
| supervisor:time-worker-spent-in-state-kill-blob-update-ms | timer | Time spent in the kill-blob-update state as it transitions out. Not necessarily in ms. |
| supervisor:time-worker-spent-in-state-kill-ms | timer | Time spent in the kill state as it transitions out. Not necessarily in ms. |
| supervisor:time-worker-spent-in-state-running-ms | timer | Time spent in the running state as it transitions out. Not necessarily in ms. |
| supervisor:time-worker-spent-in-state-waiting-for-blob-localization-ms | timer | Time spent in the waiting-for-blob-localization state as it transitions out. Not necessarily in ms. |
| supervisor:time-worker-spent-in-state-waiting-for-blob-update-ms | timer | Time spent in the waiting-for-blob-update state as it transitions out. Not necessarily in ms. |
| supervisor:time-worker-spent-in-state-waiting-for-worker-start-ms | timer | Time spent in the waiting-for-worker-start state as it transitions out. Not necessarily in ms. |
| supervisor:worker-launch-duration | timer | Time taken for a worker to launch. |
| supervisor:worker-per-call-clean-up-duration-ns | meter | How long it takes to clean up a worker (ns). |
| supervisor:worker-shutdown-duration-ns | meter | How long it takes to shut down a worker (ns). |

## UI Metrics

Metrics associated with a single UI daemon.

| Metric Name | Type | Description |
|-------------|------|-------------|
| ui:num-activate-topology-http-requests | meter | Calls to /topology/{id}/activate. |
| ui:num-all-topologies-summary-http-requests | meter | Calls to /topology/summary. |
| ui:num-build-visualization-http-requests | meter | Calls to /topology/{id}/visualization. |
| ui:num-cluster-configuration-http-requests | meter | Calls to /cluster/configuration. |
| ui:num-cluster-summary-http-requests | meter | Calls to /cluster/summary. |
| ui:num-component-op-response-http-requests | meter | Calls to /topology/{id}/component/{component}/debug/{action}/{spct}. |
| ui:num-component-page-http-requests | meter | Calls to /topology/{id}/component/{component}. |
| ui:num-deactivate-topology-http-requests | meter | Calls to /topology/{id}/deactivate. |
| ui:num-debug-topology-http-requests | meter | Calls to /topology/{id}/debug/{action}/{spct}. |
| ui:num-get-owner-resource-summaries-http-request | meter | Calls to /owner-resources or /owner-resources/{id}. |
| ui:num-log-config-http-requests | meter | Calls to /topology/{id}/logconfig. |
| ui:num-main-page-http-requests | meter | Number of requests to /index.html. |
| ui:num-mk-visualization-data-http-requests | meter | Calls to /topology/{id}/visualization-init. |
| ui:num-nimbus-summary-http-requests | meter | Calls to /nimbus/summary. |
| ui:num-supervisor-http-requests | meter | Calls to /supervisor. |
| ui:num-supervisor-summary-http-requests | meter | Calls to /supervisor/summary. |
| ui:num-topology-lag-http-requests | meter | Calls to /topology/{id}/lag. |
| ui:num-topology-metric-http-requests | meter | Calls to /topology/{id}/metrics. |
| ui:num-topology-op-response-http-requests | meter | Calls to /topology/{id}/logconfig, /topology/{id}/rebalance/{wait-time}, or /topology/{id}/kill/{wait-time}. |
| ui:num-topology-page-http-requests | meter | Calls to /topology/{id}. |
| num-web-requests | meter | Nominally the total number of web requests being made. |

## Pacemaker Metrics (Deprecated)

The pacemaker process is deprecated and only still exists for backwards compatibility.

| Metric Name | Type | Description |
|-------------|------|-------------|
| pacemaker:get-pulse=count | meter | Number of times getPulse was called. Yes, the `=` is in the name, but typically it is mapped to a `-` by the metrics reporters. |
| pacemaker:heartbeat-size | histogram | Size in bytes of heartbeats. |
| pacemaker:send-pulse-count | meter | Number of times sendPulse was called. |
| pacemaker:size-total-keys | gauge | Total number of keys in this pacemaker instance. |
| pacemaker:total-receive-size | meter | Total size in bytes of heartbeats received. |
| pacemaker:total-sent-size | meter | Total size in bytes of heartbeats sent. |