---
id: metrics
title: Scheduler Metrics
keywords:
---
YuniKorn leverages Prometheus to record metrics. The metrics system keeps track of the scheduler's critical execution paths to reveal potential performance bottlenecks. Currently, the metrics fall into three categories, listed in the tables below in this order: scheduler, queue, and event metrics. All metrics are declared in the `yunikorn` namespace.

Scheduler metrics:

Metrics Name | Metrics Type | Description |
---|---|---|
containerAllocation | Counter | Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, and `released`. Increase only. |
applicationSubmission | Counter | Total number of application submissions. State of the submission includes `accepted` and `rejected`. Increase only. |
applicationStatus | Gauge | Number of applications in each status. State of the application includes `running` and `completed`. |
totalNodeActive | Gauge | Total number of active nodes. |
totalNodeFailed | Gauge | Total number of failed nodes. |
nodeResourceUsage | Gauge | Total resource usage of node, by resource name. |
schedulingLatency | Histogram | Latency of the main scheduling routine, in seconds. |
nodeSortingLatency | Histogram | Latency of all nodes sorting, in seconds. |
appSortingLatency | Histogram | Latency of all applications sorting, in seconds. |
queueSortingLatency | Histogram | Latency of all queues sorting, in seconds. |
tryNodeLatency | Histogram | Latency of node condition checks for container allocations, such as placement constraints, in seconds. |

Queue metrics:

Metrics Name | Metrics Type | Description |
---|---|---|
appMetrics | Counter | Application metrics, recording the total number of applications. State of the application includes `accepted`, `rejected`, and `Completed`. |
usedResourceMetrics | Gauge | Queue used resource. |
pendingResourceMetrics | Gauge | Queue pending resource. |
availableResourceMetrics | Gauge | Queue available resource. |

Event metrics:

Metrics Name | Metrics Type | Description |
---|---|---|
totalEventsCreated | Gauge | Total events created. |
totalEventsChanneled | Gauge | Total events channeled. |
totalEventsNotChanneled | Gauge | Total events not channeled. |
totalEventsProcessed | Gauge | Total events processed. |
totalEventsStored | Gauge | Total events stored. |
totalEventsNotStored | Gauge | Total events not stored. |
totalEventsCollected | Gauge | Total events collected. |

YuniKorn metrics are collected through the Prometheus client library and exposed via the scheduler's RESTful service. Once the scheduler has started, the metrics can be accessed at the endpoint http://localhost:9080/ws/v1/metrics.
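As a rough illustration of reading this endpoint from a script, the Python sketch below fetches the response and turns it into a dictionary. It assumes the endpoint returns the standard Prometheus text exposition format. The function names `fetch_metrics` and `parse_metrics` are hypothetical helpers, not part of YuniKorn, and the parser is deliberately simplified: it skips `# HELP` and `# TYPE` comment lines and ignores optional timestamps.

```python
import urllib.request

# Endpoint taken from the text above; adjust host and port for your deployment.
METRICS_URL = "http://localhost:9080/ws/v1/metrics"


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Download the raw metrics text from the scheduler's REST service."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition-format text into {series: value} (simplified sketch)."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comments.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the closing brace.
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            # Series without labels: the name is the first token.
            parts = line.split(None, 1)
            series = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
        fields = rest.split()
        if fields:
            # The first field is the value; a trailing timestamp is ignored.
            samples[series] = float(fields[0])
    return samples
```

With a scheduler running locally, `parse_metrics(fetch_metrics())` would return one entry per exposed series, keyed by the metric name plus its label set.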
It's simple to set up a Prometheus server to scrape YuniKorn metrics periodically. Follow these steps:

- Set up Prometheus (read more in the Prometheus docs).

- Configure Prometheus rules. A sample configuration:

```yaml
global:
  scrape_interval: 3s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'yunikorn'
    # Job-level interval overrides the global scrape_interval.
    scrape_interval: 1s
    # Path where the scheduler exposes its metrics.
    metrics_path: '/ws/v1/metrics'
    static_configs:
      - targets: ['docker.for.mac.host.internal:9080']
```

- Start Prometheus, for example in a local docker container:

```shell
docker pull prom/prometheus:latest
docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
```
Use `docker.for.mac.host.internal` instead of `localhost` if you are running Prometheus in a local docker container on macOS.

Once Prometheus has started, open its web UI at http://localhost:9090/graph. You'll see all available metrics from the YuniKorn scheduler.
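Besides the web UI, the scraped data can also be read through the Prometheus HTTP API. The Python sketch below builds an instant query against `/api/v1/query` and extracts the values from the JSON response. The helper names are hypothetical, and the base URL assumes the port mapping used above. The example expression `up{job="yunikorn"}` relies on the `up` series that Prometheus records for each scrape target, so it does not assume any particular YuniKorn metric name.

```python
import json
import urllib.parse
import urllib.request

# Assumes Prometheus is reachable on the port mapped by the docker command above.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the instant-query URL for a PromQL expression."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(body):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    # Each result carries its label set and a [timestamp, "value"] pair.
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def query(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Run an instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=timeout) as resp:
        return parse_instant_vector(resp.read().decode("utf-8"))
```

For instance, `query('up{job="yunikorn"}')` is a quick way to check whether Prometheus is scraping the `yunikorn` job defined in the sample configuration: a value of 1 means the last scrape succeeded.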