Helix Monitoring Metrics

Helix exposes its monitoring metrics as MBean attributes. Which MBeans are registered depends on the role of the instance (controller or participant).

The easiest way to see the available metrics is to point jconsole at a running Helix instance and browse the MBeans over JMX.
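
For scripted checks, the same MBeans can be listed programmatically. The following is a minimal sketch, assuming the code runs inside the JVM that hosts the Helix controller or participant. The class name `HelixMBeanLister` is hypothetical; the JMX domains are the ones that appear in the ObjectNames on this page.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of Helix.
final class HelixMBeanLister {
  // JMX domains taken from the ObjectNames documented on this page.
  static final String[] DOMAINS = {
    "HelixZkClient", "HelixCallback", "ClusterStatus",
    "CLMParticipantReport", "HelixThreadPoolExecutor"
  };

  // Returns the canonical names of all registered MBeans in the domains above.
  static Set<String> listNames(MBeanServer server) {
    Set<String> names = new TreeSet<>();
    for (String domain : DOMAINS) {
      try {
        // "domain:*" matches every MBean registered in that domain.
        ObjectName pattern = new ObjectName(domain + ":*");
        for (ObjectName name : server.queryNames(pattern, null)) {
          names.add(name.getCanonicalName());
        }
      } catch (MalformedObjectNameException e) {
        throw new IllegalStateException("Bad pattern for domain " + domain, e);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    // Prints nothing unless Helix monitors are registered in this JVM.
    listNames(ManagementFactory.getPlatformMBeanServer()).forEach(System.out::println);
  }
}
```

In a JVM without Helix the result is simply empty. Listing the MBeans of a remote process would need a JMX connector instead of the platform MBean server.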

Note that unless the attribute name indicates otherwise, every attribute is a gauge.

Metrics on Both Controller and Participant

MBean ZkClientMonitor

ObjectName: “HelixZkClient:type=[client-type],key=[specified-client-key],PATH=[zk-client-listening-path]”

| Attributes | Description |
| --- | --- |
| ReadCounter | ZK read counter. Can be used to identify unusually high or low ZK traffic. |
| WriteCounter | ZK write counter. Same use as above. |
| ReadBytesCounter | ZK read bytes counter. Same use as above. |
| WriteBytesCounter | ZK write bytes counter. Same use as above. |
| StateChangeEventCounter | ZK connection state change counter. Can be used to identify an unstable ZkClient connection. |
| DataChangeEventCounter | ZK node data change counter. Can be used to identify an unusually high or low rate of ZK events, or slow event processing. |
| PendingCallbackGauge | Number of pending ZK callbacks. |
| TotalCallbackCounter | Total number of ZK callbacks received. |
| TotalCallbackHandledCounter | Total number of ZK callbacks handled. |
| ReadTotalLatencyCounter | Total read latency in ms. |
| WriteTotalLatencyCounter | Total write latency in ms. |
| WriteFailureCounter | Total write failures. |
| ReadFailureCounter | Total read failures. |
| ReadLatencyGauge | Histogram (with all statistic data) of read latency. |
| WriteLatencyGauge | Histogram (with all statistic data) of write latency. |
| ReadBytesGauge | Histogram (with all statistic data) of bytes read in a single ZK access. |
| WriteBytesGauge | Histogram (with all statistic data) of bytes written in a single ZK access. |
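
Because ReadCounter and ReadTotalLatencyCounter are cumulative, an average latency over a time window has to be derived from the difference between two samples. The sketch below shows that arithmetic. It assumes both counters only grow between the two samples, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Helix.
final class ZkCounterMath {
  // Average latency in ms per operation between two samples of a
  // total-latency counter and the matching operation counter.
  // Returns NaN when no operation happened in the window.
  static double meanLatencyMs(long latencyBefore, long latencyAfter,
                              long opsBefore, long opsAfter) {
    long ops = opsAfter - opsBefore;
    if (ops <= 0) {
      return Double.NaN;
    }
    return (double) (latencyAfter - latencyBefore) / ops;
  }

  public static void main(String[] args) {
    // Example: 30 reads took 600 ms in total between the two samples.
    System.out.println(meanLatencyMs(1000, 1600, 50, 80));
  }
}
```

The same calculation applies to the write counters. The latency histograms already provide statistics directly, so this is mainly useful when only the counters are collected.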

MBean HelixCallbackMonitor

ObjectName: “HelixCallback:Type=[callback-type],Key=[cluster-name].[instance-name],Change=[callback-change-type]”

| Attributes | Description |
| --- | --- |
| Counter | ZK callback counter for each Helix callback type. |
| UnbatchedCounter | Unbatched ZK callback counter for each Helix callback type. |
| LatencyCounter | Callback handler latency counter in ms. |
| LatencyGauge | Histogram (with all statistic data) of callback handler latency. |

MBean MessageQueueMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],messageQueue=[instance-name]”

| Attributes | Description |
| --- | --- |
| MessageQueueBacklog | Size of the message queue. |

Metrics on Controller only

MBean ClusterStatusMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name]”

| Attributes | Description |
| --- | --- |
| DisabledInstancesGauge | Current number of disabled instances. |
| DisabledPartitionsGauge | Current number of disabled partitions. |
| DownInstanceGauge | Current number of down instances. |
| InstanceMessageQueueBacklog | Sum of all message queue sizes for the instances in this cluster. |
| InstancesGauge | Current number of live instances. |
| MaxMessageQueueSizeGauge | Maximum message queue size across all instances, including the controller. |
| RebalanceFailureGauge | Non-zero if the previous rebalance failed unexpectedly. The gauge is set every time a rebalance is done. |
| RebalanceFailureCounter | Number of failures in the rebalance pipeline. |
| Enabled | 1 if the cluster is enabled, otherwise 0. |
| Maintenance | 1 if the cluster is in maintenance mode, otherwise 0. |
| Paused | 1 if the cluster is paused, otherwise 0. |
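
A monitoring job can read these attributes through the standard JMX API. The following is a minimal sketch, assuming it runs inside the controller JVM and that the attributes it reads are numeric. The class `ClusterStatusReader` and the cluster name `MyCluster` are hypothetical.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper, not part of Helix.
final class ClusterStatusReader {
  // Builds the ObjectName string documented for ClusterStatusMonitor.
  static String clusterStatusName(String clusterName) {
    return "ClusterStatus:cluster=" + clusterName;
  }

  // Reads one numeric attribute; fails if the MBean or attribute is missing.
  static long readLong(MBeanServer server, String objectName, String attribute) {
    try {
      Object value = server.getAttribute(new ObjectName(objectName), attribute);
      return ((Number) value).longValue();
    } catch (JMException e) {
      throw new IllegalStateException(
          "Cannot read " + attribute + " from " + objectName, e);
    }
  }

  public static void main(String[] args) {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    String name = clusterStatusName("MyCluster"); // hypothetical cluster name
    // Throws IllegalStateException when no controller runs in this JVM.
    System.out.println("DownInstanceGauge = " + readLong(server, name, "DownInstanceGauge"));
    System.out.println("Maintenance = " + readLong(server, name, "Maintenance"));
  }
}
```

Cluster names that contain characters with special meaning in ObjectNames (such as commas or colons) would need quoting, which this sketch does not handle.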

MBean ClusterEventMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],eventName=ClusterEvent,phaseName=[event-handling-phase]”

| Attributes | Description |
| --- | --- |
| TotalDurationCounter | Total event processing duration for each stage. |
| MaxSingleDurationGauge | Maximum event processing duration for each stage within the most recent hour. |
| EventCounter | Number of events processed in each stage. |
| DurationGauge | Histogram (with all statistic data) of event processing duration for each stage. |

MBean InstanceMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],instanceName=[instance-name]”

| Attributes | Description |
| --- | --- |
| Online | Whether this instance is online (1) or offline (0). |
| Enabled | Whether this instance is enabled (1) or disabled (0). |
| TotalMessageReceived | Number of messages sent to this instance by the controller. |
| DisabledPartitions | Total number of disabled partitions for this instance. |

MBean ResourceMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],resourceName=[resource-name]”

| Attributes | Description |
| --- | --- |
| PartitionGauge | Number of partitions in the best possible ideal state for this resource. |
| ErrorPartitionGauge | Number of partitions currently in ERROR state for this resource. |
| DifferenceWithIdealStateGauge | Number of replicas whose current state differs from the ideal state for this resource. |
| MissingTopStatePartitionGauge | Number of partitions that do not have a top state for this resource. |
| ExternalViewPartitionGauge | Number of partitions in the ExternalView for this resource. |
| TotalMessageReceived | Number of messages sent to this resource by the controller. |
| LoadRebalanceThrottledPartitionGauge | Number of partitions that need load rebalance but were throttled. |
| RecoveryRebalanceThrottledPartitionGauge | Number of partitions that need recovery rebalance but were throttled. |
| PendingLoadRebalancePartitionGauge | Number of partitions that have pending load rebalance requests. |
| PendingRecoveryRebalancePartitionGauge | Number of partitions that have pending recovery rebalance requests. |
| MissingReplicaPartitionGauge | Number of partitions that have fewer replicas than expected. |
| MissingMinActiveReplicaPartitionGauge | Number of partitions that have fewer replicas than the minimum requirement. |
| MaxSinglePartitionTopStateHandoffDurationGauge | Maximum duration recorded while the top state was missing in any single partition. |
| FailedTopStateHandoffCounter | Total number of failed top state transitions. |
| SucceededTopStateHandoffCounter | Total number of top state transitions completed successfully. |
| SuccessfulTopStateHandoffDurationCounter | Total duration of all top state transitions. |
| PartitionTopStateHandoffDurationGauge | Histogram (with all statistic data) of top state transition duration. |

MBean PerInstanceResourceMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],instanceName=[instance-name],resourceName=[resource-name]”

| Attributes | Description |
| --- | --- |
| PartitionGauge | Number of partitions in the best possible ideal state for this resource on the specific instance. |

MBean JobMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],jobType=[job-type]”

| Attributes | Description |
| --- | --- |
| SuccessfulJobCount | Number of succeeded jobs. |
| FailedJobCount | Number of failed jobs. |
| AbortedJobCount | Number of aborted jobs. |
| ExistingJobGauge | Number of existing registered jobs. |
| QueuedJobGauge | Number of queued jobs, that is, jobs that are not yet running. |
| RunningJobGauge | Number of running jobs. |
| MaximumJobLatencyGauge | Maximum job running time. It is cleared every hour. |
| JobLatencyCount | Total job latency counter. |

MBean WorkflowMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],workflowType=[workflow-type]”

| Attributes | Description |
| --- | --- |
| SuccessfulWorkflowCount | Number of succeeded workflows. |
| FailedWorkflowCount | Number of failed workflows. |
| FailedWorkflowGauge | Number of currently failed workflows. |
| ExistingWorkflowGauge | Number of currently existing workflows. |
| QueuedWorkflowGauge | Number of workflows that are queued but not started. |
| RunningWorkflowGauge | Number of running workflows. |
| WorkflowLatencyCount | Workflow latency count. |
| MaximumWorkflowLatencyGauge | Maximum workflow latency. It is reset every hour. |

Metrics on Participant only

MBean StateTransitionStatMonitor

ObjectName: “CLMParticipantReport:Cluster=[cluster-name],Resource=[resource-name],Transition=[transaction-id]”

| Attributes | Description |
| --- | --- |
| TotalStateTransitionGauge | Total number of state transitions. |
| TotalFailedTransitionGauge | Total number of failed state transitions. |
| TotalSuccessTransitionGauge | Total number of succeeded state transitions. |
| MeanTransitionLatency | Average state transition latency (from message read to finish). |
| MaxTransitionLatency | Maximum state transition latency. |
| MinTransitionLatency | Minimum state transition latency. |
| PercentileTransitionLatency | Percentile of state transition latency. |
| MeanTransitionExecuteLatency | Average execution latency of state transitions (from task start to finish). |
| MaxTransitionExecuteLatency | Maximum execution latency of state transitions. |
| MinTransitionExecuteLatency | Minimum execution latency of state transitions. |
| PercentileTransitionExecuteLatency | Percentile of execution latency of state transitions. |
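
Since one StateTransitionStatMonitor MBean is registered per transition, a JMX property-list pattern is a convenient way to select all of them for one resource. The sketch below builds such a pattern from the documented ObjectName layout. The class name and the cluster, resource, and transition values are hypothetical, and the values are assumed to contain no characters that require ObjectName quoting.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of Helix.
final class TransitionMonitorPattern {
  // Pattern matching every transition MBean of one resource in one cluster.
  static ObjectName forResource(String cluster, String resource) {
    try {
      // The trailing ",*" leaves the remaining key properties (Transition) open.
      return new ObjectName(
          "CLMParticipantReport:Cluster=" + cluster + ",Resource=" + resource + ",*");
    } catch (MalformedObjectNameException e) {
      throw new IllegalArgumentException(e);
    }
  }

  // True if the concrete MBean name is selected by the pattern.
  static boolean matches(ObjectName pattern, String concreteName) {
    try {
      return pattern.apply(new ObjectName(concreteName));
    } catch (MalformedObjectNameException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    ObjectName pattern = forResource("MyCluster", "MyResource");
    System.out.println(matches(pattern,
        "CLMParticipantReport:Cluster=MyCluster,Resource=MyResource,Transition=t1"));
  }
}
```

The resulting pattern can be passed to `MBeanServer.queryNames` to enumerate the matching MBeans in a running participant.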

MBean ThreadPoolExecutorMonitor

ObjectName: “HelixThreadPoolExecutor:Type=[threadpool-type]” (threadpool-type is one of the Message.MessageType values, BatchMessageExecutor, or Task)

| Attributes | Description |
| --- | --- |
| ThreadPoolCoreSizeGauge | Core thread pool size, as configured. Aggregate total thread pool size for the whole cluster. |
| ThreadPoolMaxSizeGauge | Same as above, for the maximum thread pool size. |
| NumOfActiveThreadsGauge | Number of running threads. |
| QueueSizeGauge | Queue size. Can be used to identify whether too many HelixTasks are blocked in the participant. |
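
One way to turn these gauges into an alert signal is sketched below. This is an illustration only: the class, the method names, and the queue threshold are hypothetical, and a suitable threshold depends on the workload.

```java
// Hypothetical helper, not part of Helix.
final class ThreadPoolUtilization {
  // Fraction of the maximum pool size that is currently running tasks,
  // computed from NumOfActiveThreadsGauge and ThreadPoolMaxSizeGauge.
  static double utilization(long activeThreads, long maxPoolSize) {
    if (maxPoolSize <= 0) {
      return 0.0;
    }
    return (double) activeThreads / maxPoolSize;
  }

  // Flags a backlog when QueueSizeGauge exceeds a caller-chosen threshold.
  static boolean isBacklogged(long queueSize, long threshold) {
    return queueSize > threshold;
  }

  public static void main(String[] args) {
    System.out.println(utilization(5, 20));
    System.out.println(isBacklogged(150, 100));
  }
}
```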

MBean MessageLatencyMonitor

ObjectName: “CLMParticipantReport:ParticipantName=[instance-name],MonitorType=MessageLatencyMonitor”

| Attributes | Description |
| --- | --- |
| TotalMessageCount | Total message count. |
| TotalMessageLatency | Total message latency in ms. |
| MessagelatencyGauge | Histogram (with all statistic data) of message processing latency. |

MBean ParticipantMessageMonitor

ObjectName: “CLMParticipantReport:ParticipantName=[instance-name]”

| Attributes | Description |
| --- | --- |
| ReceivedMessages | Number of received messages. |
| DiscardedMessages | Number of discarded messages. |
| CompletedMessages | Number of completed messages. |
| FailedMessages | Number of failed messages. |
| PendingMessages | Number of pending messages to be processed. |