Flink monitoring

Flink server performance from built-in metrics data

SkyWalking leverages OpenTelemetry Collector to transfer the flink metrics to OpenTelemetry receiver and into the Meter System.

Data flow

  1. Configure Flink jobManager and TaskManager to expose metrics data for scraping through Prometheus endpoint.
  2. OpenTelemetry Collector fetches metrics from Flink jobManager and TaskManager through Prometheus endpoint, and pushes metrics to SkyWalking OAP Server via OpenTelemetry gRPC exporter.
  3. The SkyWalking OAP Server parses the expression with MAL to filter/calculate/aggregate and store the results.

Setup

  1. Set up built-in prometheus endpoint.
  2. Set up OpenTelemetry Collector . Please note that the OpenTelemetry Collector uses the job_name label by default, which may conflict with the job_name label in Flink. Please modify the Flink label name in the configuration to avoid this conflict, you can refer to here for details on Prometheus Receiver in OpenTelemetry Collector.
  3. Config SkyWalking OpenTelemetry receiver.

Flink Monitoring

Flink monitoring provides multidimensional metrics monitoring of Flink cluster as Layer: Flink Service in the OAP. In each cluster, the taskManager is represented as Instance and the job is represented as Endpoint.

Flink service Supported Metrics

Monitoring PanelUnitMetric NameDescriptionData Source
Running JobsCountmeter_flink_jobManager_running_job_numberThe number of running jobs.Flink JobManager
TaskManagersCountmeter_flink_jobManager_taskManagers_registered_numberThe number of taskManagers.Flink JobManager
JVM CPU Load%meter_flink_jobManager_jvm_cpu_loadThe number of the jobManager JVM CPU load.Flink JobManager
JVM thread countCountmeter_flink_jobManager_jvm_thread_countThe total number of the jobManager JVM live threads.Flink JobManager
JVM Memory Heap UsedMBmeter_flink_jobManager_jvm_memory_heap_usedThe amount of the jobManager JVM memory heap used.Flink JobManager
JVM Memory NonHeap UsedMBmeter_flink_jobManager_jvm_memory_NonHeap_usedThe amount of the jobManager JVM nonHeap memory used.Flink JobManager
Task Managers Slots TotalCountmeter_flink_jobManager_taskManagers_slots_totalThe number of total slots.Flink JobManager
Task Managers Slots AvailableCountmeter_flink_jobManager_taskManagers_slots_availableThe number of available slots.Flink JobManager
JVM CPU Timemsmeter_flink_jobManager_jvm_cpu_timeThe jobManager CPU time used by the JVM increase per minute.Flink JobManager
JVM Memory Heap AvailableMBmeter_flink_jobManager_jvm_memory_heap_availableThe amount of the jobManager available JVM memory Heap.Flink JobManager
JVM Memory NoHeap AvailableMBmeter_flink_jobManager_jvm_memory_nonHeap_availableThe amount of the jobManager available JVM memory noHeap.Flink JobManager
JVM Memory Metaspace UsedMBmeter_flink_jobManager_jvm_memory_metaspace_usedThe amount of the jobManager Used JVM metaspace memory.Flink JobManager
JVM Metaspace AvailableMBmeter_flink_jobManager_jvm_memory_metaspace_availableThe amount of the jobManager available JVM Metaspace Memory.Flink JobManager
JVM G1 Young Generation CountCountmeter_flink_jobManager_jvm_g1_young_generation_countThe incremental number of the jobManager JVM G1 young generation count per minute.Flink JobManager
JVM G1 Old Generation CountCountmeter_flink_jobManager_jvm_g1_old_generation_countThe incremental number of the jobManager JVM G1 old generation count per minute.Flink JobManager
JVM G1 Young Generation TimeCountmeter_flink_jobManager_jvm_g1_young_generation_timeThe incremental time of the jobManager JVM G1 young generation per minute.Flink JobManager
JVM G1 Old Generation Timemsmeter_flink_jobManager_jvm_g1_old_generation_timeThe incremental time of JVM G1 old generation increase per minute.Flink JobManager
JVM G1 Old Generation CountCountmeter_flink_jobManager_jvm_all_garbageCollector_countThe incremental number of the jobManager JVM all garbageCollector count per minute.Flink JobManager
JVM All GarbageCollector Timemsmeter_flink_jobManager_jvm_all_garbageCollector_timeThe incremental time spent performing garbage collection for the given (or all) collector for the jobManager per minute.Flink JobManager

Flink instance Supported Metrics

Monitoring PanelUnitMetric NameDescriptionData Source
JVM CPU Load%meter_flink_taskManager_jvm_cpu_loadThe number of the JVM CPU load.Flink TaskManager
JVM Thread CountCountmeter_flink_taskManager_jvm_thread_countThe total number of JVM live threads.Flink TaskManager
JVM Memory Heap UsedMBmeter_flink_taskManager_jvm_memory_heap_usedThe amount of JVM memory heap used.Flink TaskManager
JVM Memory NonHeap UsedMBmeter_flink_taskManager_jvm_memory_nonHeap_usedThe amount of JVM nonHeap memory used.Flink TaskManager
JVM CPU Timemsmeter_flink_taskManager_jvm_cpu_timeThe CPU time used by the JVM increase per minute.Flink TaskManager
JVM Memory Heap AvailableMBmeter_flink_taskManager_jvm_memory_heap_availableThe amount of available JVM memory Heap.Flink TaskManager
JVM Memory NonHeap AvailableMBmeter_flink_taskManager_jvm_memory_nonHeap_availableThe amount of available JVM memory nonHeap.Flink TaskManager
JVM Memory Metaspace UsedMBmeter_flink_taskManager_jvm_memory_metaspace_usedThe amount of Used JVM metaspace memory.Flink TaskManager
JVM Metaspace AvailableMBmeter_flink_taskManager_jvm_memory_metaspace_availableThe amount of Available JVM Metaspace Memory.Flink TaskManager
NumRecordsInCountmeter_flink_taskManager_numRecordsInThe incremental number of records this task has received per minute.Flink TaskManager
NumRecordsOutCountmeter_flink_taskManager_numRecordsOutThe incremental number of records this task has emitted per minute.Flink TaskManager
NumBytesInPerSecondBytes/smeter_flink_taskManager_numBytesInPerSecondThe number of bytes received per second.Flink TaskManager
NumBytesOutPerSecondBytes/smeter_flink_taskManager_numBytesOutPerSecondThe number of bytes this task emits per second.Flink TaskManager
Netty UsedMemoryMBmeter_flink_taskManager_netty_usedMemoryThe amount of used netty memory.Flink TaskManager
Netty AvailableMemoryMBmeter_flink_taskManager_netty_availableMemoryThe amount of available netty memory.Flink TaskManager
IsBackPressuredCountmeter_flink_taskManager_isBackPressuredWhether the task is back-pressured.Flink TaskManager
InPoolUsage%meter_flink_taskManager_inPoolUsageAn estimate of the input buffers usage. (ignores LocalInputChannels).Flink TaskManager
OutPoolUsage%meter_flink_taskManager_outPoolUsageAn estimate of the output buffers usage. The pool usage can be > 100% if overdraft buffers are being used.Flink TaskManager
SoftBackPressuredTimeMsPerSecondmsmeter_flink_taskManager_softBackPressuredTimeMsPerSecondThe time this task is softly back pressured per second.Softly back pressured task will be still responsive and capable of for example triggering unaligned checkpoints.Flink TaskManager
HardBackPressuredTimeMsPerSecondmsmeter_flink_taskManager_hardBackPressuredTimeMsPerSecondThe time this task is back pressured in a hard way per second.During hard back pressured task is completely blocked and unresponsive preventing for example unaligned checkpoints from triggering.Flink TaskManager
IdleTimeMsPerSecondmsmeter_flink_taskManager_idleTimeMsPerSecondThe time this task is idle (has no data to process) per second. Idle time excludes back pressured time, so if the task is back pressured it is not idle.Flink TaskManager
BusyTimeMsPerSecondmsmeter_flink_taskManager_busyTimeMsPerSecondThe time this task is busy (neither idle nor back pressured) per second. Can be NaN, if the value could not be calculated.Flink TaskManager

Flink Endpoint Supported Metrics

Monitoring PanelUnitMetric NameDescriptionData Source
Job RunningTimeminmeter_flink_job_runningTimeThe job running time.Flink JobManager
Job Restart NumberCountmeter_flink_job_restart_numberThe number of job restart.Flink JobManager
Job RestartingTimeminmeter_flink_job_restartingTimeThe job restarting Time.Flink JobManager
Job CancellingTimeminmeter_flink_job_cancellingTimeThe job cancelling time.Flink JobManager
Checkpoints TotalCountmeter_flink_job_checkpoints_totalThe total number of checkpoints.Flink JobManager
Checkpoints FailedCountmeter_flink_job_checkpoints_failedThe number of failed checkpoints.Flink JobManager
Checkpoints CompletedCountmeter_flink_job_checkpoints_completedThe number of completed checkpoints.Flink JobManager
Checkpoints InProgressCountmeter_flink_job_checkpoints_inProgressThe number of inProgress checkpoints.Flink JobManager
CurrentEmitEventTimeLagmsmeter_flink_job_currentEmitEventTimeLagThe latency between a data record's event time and its emission time from the source.Flink TaskManager
NumRecordsInCountmeter_flink_job_numRecordsInThe total number of records this operator/task has received.Flink TaskManager
NumRecordsOutCountmeter_flink_job_numRecordsOutThe total number of records this operator/task has emitted.Flink TaskManager
NumBytesInPerSecondBytes/smeter_flink_job_numBytesInPerSecondThe number of bytes this task received per second.Flink TaskManager
NumBytesOutPerSecondBytes/smeter_flink_job_numBytesOutPerSecondThe number of bytes this task emits per second.Flink TaskManager
LastCheckpointSizeBytesmeter_flink_job_lastCheckpointSizeThe checkPointed size of the last checkpoint (in bytes), this metric could be different from lastCheckpointFullSize if incremental checkpoint or changelog is enabled.Flink JobManager
LastCheckpointDurationmsmeter_flink_job_lastCheckpointDurationThe time it took to complete the last checkpoint.Flink JobManager

Customizations

You can customize your own metrics/expression/dashboard panel. The metrics definition and expression rules are found in otel-rules/flink/flink-jobManager.yaml, otel-rules/flink/flink-taskManager.yaml, otel-rules/flink/flink-job.yaml. The Flink dashboard panel configurations are found in ui-initialized-templates/flink.