Overview

Metrics are statistical information exposed by Hadoop daemons, used for monitoring, performance tuning and debugging. Many metrics are available by default, and they are very useful for troubleshooting. This page details the available metrics.

Each section below describes one of the contexts into which metrics are grouped.

The documentation of the Metrics 2.0 framework can be found in the Javadoc of the org.apache.hadoop.metrics2 package.
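
Besides the configurable metrics2 sinks, every Hadoop daemon also serves its current metric values as JSON from the /jmx servlet of its web UI, which is handy for ad-hoc inspection of the metrics documented below. A minimal sketch, assuming a NameNode web UI reachable at namenode:9870 (hostname and port are deployment-specific assumptions):

```python
import json
import urllib.request

def fetch_beans(base_url, qry=None):
    # Fetch the metrics beans the daemon exposes; the optional 'qry'
    # parameter narrows the result to a single bean by its JMX name.
    url = base_url + "/jmx" + (("?qry=" + qry) if qry else "")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["beans"]

# Example: dump the JvmMetrics record of a NameNode (assumed address).
for bean in fetch_beans("http://namenode:9870",
                        "Hadoop:service=NameNode,name=JvmMetrics"):
    for key, value in bean.items():
        print(key, "=", value)
```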

jvm context

JvmMetrics

Each metrics record contains tags such as ProcessName, SessionID and Hostname as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| MemNonHeapUsedM | Current non-heap memory used in MB |
| MemNonHeapCommittedM | Current non-heap memory committed in MB |
| MemNonHeapMaxM | Max non-heap memory size in MB |
| MemHeapUsedM | Current heap memory used in MB |
| MemHeapCommittedM | Current heap memory committed in MB |
| MemHeapMaxM | Max heap memory size in MB |
| MemMaxM | Max memory size in MB |
| ThreadsNew | Current number of NEW threads |
| ThreadsRunnable | Current number of RUNNABLE threads |
| ThreadsBlocked | Current number of BLOCKED threads |
| ThreadsWaiting | Current number of WAITING threads |
| ThreadsTimedWaiting | Current number of TIMED_WAITING threads |
| ThreadsTerminated | Current number of TERMINATED threads |
| GcInfo | Total GC count and GC time in msec, grouped by the kind of GC, e.g. GcCountPS Scavenge=6, GcTimeMillisPS Scavenge=40, GcCountPS MarkSweep=0, GcTimeMillisPS MarkSweep=0 |
| GcCount | Total GC count |
| GcTimeMillis | Total GC time in msec |
| LogFatal | Total number of FATAL logs |
| LogError | Total number of ERROR logs |
| LogWarn | Total number of WARN logs |
| LogInfo | Total number of INFO logs |
| GcNumWarnThresholdExceeded | Number of times that the GC warn threshold is exceeded |
| GcNumInfoThresholdExceeded | Number of times that the GC info threshold is exceeded |
| GcTotalExtraSleepTime | Total GC extra sleep time in msec |
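
GcCount and GcTimeMillis are cumulative counters, so for monitoring it is their rate of change that matters. A sketch that estimates the fraction of wall-clock time spent in GC between two samples; the /jmx address and the NameNode service name in the bean query are assumptions, and any daemon exposing JvmMetrics works the same way:

```python
import json
import time
import urllib.request

URL = "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics"

def gc_time_millis():
    with urllib.request.urlopen(URL) as resp:
        return json.load(resp)["beans"][0]["GcTimeMillis"]

# Sample the monotonically increasing GC-time counter twice and report
# the fraction of the interval the JVM spent in garbage collection.
before = gc_time_millis()
time.sleep(60)
delta = gc_time_millis() - before
print("GC overhead over the last minute: %.2f%%" % (delta / 600.0))
```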

rpc context

rpc

Each metrics record contains tags such as Hostname and port (the port number to which the server is bound) as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| ReceivedBytes | Total number of received bytes |
| SentBytes | Total number of sent bytes |
| RpcQueueTimeNumOps | Total number of RPC calls |
| RpcQueueTimeAvgTime | Average queue time in milliseconds |
| RpcLockWaitTimeNumOps | Total number of RPC calls (same as RpcQueueTimeNumOps) |
| RpcLockWaitTimeAvgTime | Average time waiting for lock acquisition in milliseconds |
| RpcProcessingTimeNumOps | Total number of RPC calls (same as RpcQueueTimeNumOps) |
| RpcProcessingTimeAvgTime | Average processing time in milliseconds |
| RpcAuthenticationFailures | Total number of authentication failures |
| RpcAuthenticationSuccesses | Total number of authentication successes |
| RpcAuthorizationFailures | Total number of authorization failures |
| RpcAuthorizationSuccesses | Total number of authorization successes |
| NumOpenConnections | Current number of open connections |
| CallQueueLength | Current length of the call queue |
| numDroppedConnections | Total number of dropped connections |
| rpcQueueTime*num*sNumOps | Shows total number of RPC calls (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcQueueTime*num*s50thPercentileLatency | Shows the 50th percentile of RPC queue time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcQueueTime*num*s75thPercentileLatency | Shows the 75th percentile of RPC queue time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcQueueTime*num*s90thPercentileLatency | Shows the 90th percentile of RPC queue time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcQueueTime*num*s95thPercentileLatency | Shows the 95th percentile of RPC queue time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcQueueTime*num*s99thPercentileLatency | Shows the 99th percentile of RPC queue time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcProcessingTime*num*sNumOps | Shows total number of RPC calls (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcProcessingTime*num*s50thPercentileLatency | Shows the 50th percentile of RPC processing time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcProcessingTime*num*s75thPercentileLatency | Shows the 75th percentile of RPC processing time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcProcessingTime*num*s90thPercentileLatency | Shows the 90th percentile of RPC processing time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcProcessingTime*num*s95thPercentileLatency | Shows the 95th percentile of RPC processing time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcProcessingTime*num*s99thPercentileLatency | Shows the 99th percentile of RPC processing time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcLockWaitTime*num*sNumOps | Shows total number of RPC calls (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcLockWaitTime*num*s50thPercentileLatency | Shows the 50th percentile of RPC lock wait time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcLockWaitTime*num*s75thPercentileLatency | Shows the 75th percentile of RPC lock wait time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcLockWaitTime*num*s90thPercentileLatency | Shows the 90th percentile of RPC lock wait time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcLockWaitTime*num*s95thPercentileLatency | Shows the 95th percentile of RPC lock wait time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
| rpcLockWaitTime*num*s99thPercentileLatency | Shows the 99th percentile of RPC lock wait time in milliseconds (*num* seconds granularity) if rpc.metrics.quantile.enable is set to true. *num* is specified by rpc.metrics.percentiles.intervals. |
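
The percentile metrics above appear only once quantile collection is turned on via rpc.metrics.quantile.enable and rpc.metrics.percentiles.intervals. A sketch comparing average queue time with processing time to spot handler saturation; the bean name follows the RpcActivityForPort(port) pattern, and the host and ports are assumptions:

```python
import json
import urllib.request

# Assumes core-site.xml has, e.g.:
#   rpc.metrics.quantile.enable = true
#   rpc.metrics.percentiles.intervals = 60,300
URL = ("http://namenode:9870/jmx"
       "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020")

with urllib.request.urlopen(URL) as resp:
    rpc = json.load(resp)["beans"][0]

# Queue time that dwarfs processing time suggests the handlers are
# saturated and calls are piling up in the call queue.
print("queue avg (ms):     ", rpc["RpcQueueTimeAvgTime"])
print("processing avg (ms):", rpc["RpcProcessingTimeAvgTime"])
print("call queue length:  ", rpc["CallQueueLength"])
print("open connections:   ", rpc["NumOpenConnections"])
```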

RetryCache/NameNodeRetryCache

RetryCache metrics are useful for monitoring NameNode failover. Each metrics record contains the Hostname tag.

| Name | Description |
|:---- |:---- |
| CacheHit | Total number of RetryCache hits |
| CacheCleared | Total number of RetryCache entries cleared |
| CacheUpdated | Total number of RetryCache entries updated |

FairCallQueue

FairCallQueue metrics only exist if FairCallQueue is enabled. Each metric is emitted once per priority level.

| Name | Description |
|:---- |:---- |
| FairCallQueueSize_p*Priority* | Current number of calls in the priority queue |
| FairCallQueueOverflowedCalls_p*Priority* | Total number of overflowed calls in the priority queue |

rpcdetailed context

Metrics in the rpcdetailed context are exposed by the RPC layer in a unified manner. Two metrics are exposed for each RPC, based on its method name: a metric named "(RPC method name)NumOps" indicates the total number of calls to that method, and a metric named "(RPC method name)AvgTime" shows the average turnaround time for those calls in milliseconds. Please note that the AvgTime metrics do not include time spent waiting to acquire locks on data structures (see RpcLockWaitTimeAvgTime).
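
Since the metric names embed the RPC method name, consumers typically pair each *NumOps attribute with its *AvgTime counterpart, as the table below documents. A sketch over the RpcDetailedActivityForPort(port) bean of a NameNode (host and port are assumptions):

```python
import json
import urllib.request

URL = ("http://namenode:9870/jmx"
       "?qry=Hadoop:service=NameNode,name=RpcDetailedActivityForPort8020")

with urllib.request.urlopen(URL) as resp:
    bean = json.load(resp)["beans"][0]

# Rebuild (method -> NumOps, AvgTime) pairs from the flat attribute names.
for key, value in sorted(bean.items()):
    if key.endswith("NumOps"):
        method = key[:-len("NumOps")]
        avg = bean.get(method + "AvgTime", 0.0)
        print("%-30s calls=%-8d avg=%.2f ms" % (method, value, avg))
```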

rpcdetailed

Each metrics record contains tags such as Hostname and port (the port number to which the server is bound) as additional information along with metrics.

Metrics for RPC methods that have not been called are not included in the metrics record.

| Name | Description |
|:---- |:---- |
| *methodname*NumOps | Total number of times the method is called |
| *methodname*AvgTime | Average turnaround time of the method in milliseconds |

dfs context

namenode

Each metrics record contains tags such as ProcessName, SessionId, and Hostname as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| CreateFileOps | Total number of files created |
| FilesCreated | Total number of files and directories created by create or mkdir operations |
| FilesAppended | Total number of files appended |
| GetBlockLocations | Total number of getBlockLocations operations |
| FilesRenamed | Total number of rename operations (NOT number of files/dirs renamed) |
| GetListingOps | Total number of directory listing operations |
| DeleteFileOps | Total number of delete operations |
| FilesDeleted | Total number of files and directories deleted by delete or rename operations |
| FileInfoOps | Total number of getFileInfo and getLinkFileInfo operations |
| AddBlockOps | Total number of addBlock operations that succeeded |
| GetAdditionalDatanodeOps | Total number of getAdditionalDatanode operations |
| CreateSymlinkOps | Total number of createSymlink operations |
| GetLinkTargetOps | Total number of getLinkTarget operations |
| FilesInGetListingOps | Total number of files and directories listed by directory listing operations |
| SuccessfulReReplications | Total number of successful block re-replications |
| NumTimesReReplicationNotScheduled | Total number of times a block re-replication could not be scheduled |
| TimeoutReReplications | Total number of timed-out block re-replications |
| AllowSnapshotOps | Total number of allowSnapshot operations |
| DisallowSnapshotOps | Total number of disallowSnapshot operations |
| CreateSnapshotOps | Total number of createSnapshot operations |
| DeleteSnapshotOps | Total number of deleteSnapshot operations |
| RenameSnapshotOps | Total number of renameSnapshot operations |
| ListSnapshottableDirOps | Total number of snapshottableDirectoryStatus operations |
| SnapshotDiffReportOps | Total number of getSnapshotDiffReport operations |
| TransactionsNumOps | Total number of Journal transactions |
| TransactionsAvgTime | Average time of Journal transactions in milliseconds |
| SyncsNumOps | Total number of Journal syncs |
| SyncsAvgTime | Average time of Journal syncs in milliseconds |
| TransactionsBatchedInSync | Total number of Journal transactions batched in sync |
| BlockReportNumOps | Total number of block reports processed from DataNodes |
| BlockReportAvgTime | Average time of processing block reports in milliseconds |
| CacheReportNumOps | Total number of cache reports processed from DataNodes |
| CacheReportAvgTime | Average time of processing cache reports in milliseconds |
| SafeModeTime | The interval in milliseconds between FSNameSystem starting and the last time safemode was left (sometimes not equal to the time spent in safemode; see HDFS-5156) |
| FsImageLoadTime | Time loading FS Image at startup in milliseconds |
| GetEditNumOps | Total number of edits downloads from SecondaryNameNode |
| GetEditAvgTime | Average edits download time in milliseconds |
| GetImageNumOps | Total number of fsimage downloads from SecondaryNameNode |
| GetImageAvgTime | Average fsimage download time in milliseconds |
| PutImageNumOps | Total number of fsimage uploads to SecondaryNameNode |
| PutImageAvgTime | Average fsimage upload time in milliseconds |
| TotalFileOps | Total number of file operations performed |
| NNStarted | Deprecated: use NNStartedTimeInMillis instead |
| NNStartedTimeInMillis | NameNode start time in milliseconds |
| GenerateEDEKTimeNumOps | Total number of EDEK generations |
| GenerateEDEKTimeAvgTime | Average time of generating an EDEK in milliseconds |
| WarmUpEDEKTimeNumOps | Total number of EDEK warm-ups |
| WarmUpEDEKTimeAvgTime | Average EDEK warm-up time in milliseconds |
| ResourceCheckTime*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of NameNode resource check latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| BlockReport*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of storage block report latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
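
SyncsAvgTime and TransactionsAvgTime are worth watching closely, since slow journal syncs stall every namespace write. A hedged sketch against the NameNodeActivity bean, which carries this table's metrics (the address and the 50 ms threshold are assumptions to tune per deployment):

```python
import json
import urllib.request

URL = ("http://namenode:9870/jmx"
       "?qry=Hadoop:service=NameNode,name=NameNodeActivity")

with urllib.request.urlopen(URL) as resp:
    nn = json.load(resp)["beans"][0]

# Slow journal syncs back up every write operation on the NameNode.
if nn["SyncsAvgTime"] > 50.0:  # threshold in ms, an assumption
    print("WARN: journal syncs averaging %.1f ms" % nn["SyncsAvgTime"])
print("transactions:", nn["TransactionsNumOps"],
      "batched in sync:", nn["TransactionsBatchedInSync"])
```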

FSNamesystem

Each metrics record contains tags such as HAState and Hostname as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| MissingBlocks | Current number of missing blocks |
| ExpiredHeartbeats | Total number of expired heartbeats |
| TransactionsSinceLastCheckpoint | Total number of transactions since the last checkpoint |
| TransactionsSinceLastLogRoll | Total number of transactions since the last edit log roll |
| LastWrittenTransactionId | Last transaction ID written to the edit log |
| LastCheckpointTime | Time in milliseconds since epoch of the last checkpoint |
| CapacityTotal | Current raw capacity of DataNodes in bytes |
| CapacityTotalGB | Current raw capacity of DataNodes in GB |
| CapacityUsed | Current used capacity across all DataNodes in bytes |
| CapacityUsedGB | Current used capacity across all DataNodes in GB |
| CapacityRemaining | Current remaining capacity in bytes |
| CapacityRemainingGB | Current remaining capacity in GB |
| CapacityUsedNonDFS | Current space used by DataNodes for non-DFS purposes in bytes |
| TotalLoad | Current number of connections |
| SnapshottableDirectories | Current number of snapshottable directories |
| Snapshots | Current number of snapshots |
| NumEncryptionZones | Current number of encryption zones |
| BlocksTotal | Current number of allocated blocks in the system |
| FilesTotal | Current number of files and directories |
| PendingReplicationBlocks | Current number of blocks pending replication |
| UnderReplicatedBlocks | Current number of under-replicated blocks |
| CorruptBlocks | Current number of blocks with corrupt replicas |
| ScheduledReplicationBlocks | Current number of blocks scheduled for replication |
| PendingDeletionBlocks | Current number of blocks pending deletion |
| ExcessBlocks | Current number of excess blocks |
| PostponedMisreplicatedBlocks | (HA-only) Current number of blocks whose replication is postponed |
| PendingDataNodeMessageCount | (HA-only) Current number of pending block-related messages for later processing in the standby NameNode |
| MillisSinceLastLoadedEdits | (HA-only) Time in milliseconds since the standby NameNode last loaded the edit log; set to 0 on the active NameNode |
| BlockCapacity | Current block capacity |
| NumLiveDataNodes | Number of DataNodes which are currently live |
| NumDeadDataNodes | Number of DataNodes which are currently dead |
| NumDecomLiveDataNodes | Number of DataNodes which have been decommissioned and are now live |
| NumDecomDeadDataNodes | Number of DataNodes which have been decommissioned and are now dead |
| NumDecommissioningDataNodes | Number of DataNodes in decommissioning state |
| VolumeFailuresTotal | Total number of volume failures across all DataNodes |
| EstimatedCapacityLostTotal | An estimate of the total capacity lost due to volume failures |
| StaleDataNodes | Current number of DataNodes marked stale due to delayed heartbeats |
| NumStaleStorages | Number of storages marked as content stale (after a NameNode restart/failover, before the first block report is received) |
| TotalFiles | Deprecated: use FilesTotal instead |
| MissingReplOneBlocks | Current number of missing blocks with replication factor 1 |
| NumFilesUnderConstruction | Current number of files under construction |
| NumActiveClients | Current number of active clients holding a lease |
| HAState | (HA-only) Current state of the NameNode: initializing, active, standby, or stopping |
| FSState | Current state of the file system: Safemode or Operational |
| LockQueueLength | Number of threads waiting to acquire the FSNameSystem lock |
| TotalSyncCount | Total number of sync operations performed by the edit log |
| TotalSyncTimes | Total number of milliseconds spent by various edit logs in sync operations |
| NameDirSize | NameNode name directories size in bytes |
| NumTimedOutPendingReplications | The number of timed-out replications, not the number of unique blocks that timed out. Note: the metric name will be changed to NumTimedOutPendingReconstructions in the Hadoop 3 release. |
| NumInMaintenanceLiveDataNodes | Number of live DataNodes which are in maintenance state |
| NumInMaintenanceDeadDataNodes | Number of dead DataNodes which are in maintenance state |
| NumEnteringMaintenanceDataNodes | Number of DataNodes that are entering the maintenance state |
| FSN(Read/Write)Lock*OperationName*NumOps | Total number of lock acquisitions, per operation |
| FSN(Read/Write)Lock*OperationName*AvgTime | Average lock hold time per operation, in milliseconds |
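
A few of these gauges make a compact health check: capacity utilization plus the block counters that should stay at zero. A sketch against the FSNamesystem bean (NameNode address assumed):

```python
import json
import urllib.request

URL = ("http://namenode:9870/jmx"
       "?qry=Hadoop:service=NameNode,name=FSNamesystem")

with urllib.request.urlopen(URL) as resp:
    fsn = json.load(resp)["beans"][0]

if fsn["CapacityTotal"]:
    print("DFS used: %.1f%%"
          % (100.0 * fsn["CapacityUsed"] / fsn["CapacityTotal"]))

# All three of these should normally be zero on a healthy cluster.
for name in ("MissingBlocks", "CorruptBlocks", "UnderReplicatedBlocks"):
    if fsn[name] > 0:
        print("ALERT: %s = %d" % (name, fsn[name]))
```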

JournalNode

These are the server-side metrics for a journal, from the JournalNode's perspective. Each metrics record contains the Hostname tag as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| Syncs60sNumOps | Number of sync operations (1 minute granularity) |
| Syncs60s50thPercentileLatencyMicros | The 50th percentile of sync latency in microseconds (1 minute granularity) |
| Syncs60s75thPercentileLatencyMicros | The 75th percentile of sync latency in microseconds (1 minute granularity) |
| Syncs60s90thPercentileLatencyMicros | The 90th percentile of sync latency in microseconds (1 minute granularity) |
| Syncs60s95thPercentileLatencyMicros | The 95th percentile of sync latency in microseconds (1 minute granularity) |
| Syncs60s99thPercentileLatencyMicros | The 99th percentile of sync latency in microseconds (1 minute granularity) |
| Syncs300sNumOps | Number of sync operations (5 minutes granularity) |
| Syncs300s50thPercentileLatencyMicros | The 50th percentile of sync latency in microseconds (5 minutes granularity) |
| Syncs300s75thPercentileLatencyMicros | The 75th percentile of sync latency in microseconds (5 minutes granularity) |
| Syncs300s90thPercentileLatencyMicros | The 90th percentile of sync latency in microseconds (5 minutes granularity) |
| Syncs300s95thPercentileLatencyMicros | The 95th percentile of sync latency in microseconds (5 minutes granularity) |
| Syncs300s99thPercentileLatencyMicros | The 99th percentile of sync latency in microseconds (5 minutes granularity) |
| Syncs3600sNumOps | Number of sync operations (1 hour granularity) |
| Syncs3600s50thPercentileLatencyMicros | The 50th percentile of sync latency in microseconds (1 hour granularity) |
| Syncs3600s75thPercentileLatencyMicros | The 75th percentile of sync latency in microseconds (1 hour granularity) |
| Syncs3600s90thPercentileLatencyMicros | The 90th percentile of sync latency in microseconds (1 hour granularity) |
| Syncs3600s95thPercentileLatencyMicros | The 95th percentile of sync latency in microseconds (1 hour granularity) |
| Syncs3600s99thPercentileLatencyMicros | The 99th percentile of sync latency in microseconds (1 hour granularity) |
| BatchesWritten | Total number of batches written since startup |
| TxnsWritten | Total number of transactions written since startup |
| BytesWritten | Total number of bytes written since startup |
| BatchesWrittenWhileLagging | Total number of batches written while this node was lagging |
| LastWriterEpoch | Current writer's epoch number |
| CurrentLagTxns | The number of transactions that this JournalNode is lagging |
| LastWrittenTxId | The highest transaction id stored on this JournalNode |
| LastPromisedEpoch | The last epoch for which this node has promised not to accept any lower epoch, or 0 if no promises have been made |
| LastJournalTimestamp | The timestamp of the last successfully written transaction |
| TxnsServedViaRpc | Number of transactions served via the RPC mechanism |
| BytesServedViaRpc | Number of bytes served via the RPC mechanism |
| RpcRequestCacheMissAmountNumMisses | Number of RPC requests which could not be served due to lack of data in the cache |
| RpcRequestCacheMissAmountAvgTxns | The average number of transactions by which a request missed the cache; for example, if transaction ID 10 is requested and the cache's oldest transaction is ID 15, the value 5 is added to this average |
| RpcEmptyResponses | Number of RPC requests with zero edits returned |

datanode

Each metrics record contains tags such as SessionId and Hostname as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| BytesWritten | Total number of bytes written to DataNode |
| BytesRead | Total number of bytes read from DataNode |
| BlocksWritten | Total number of blocks written to DataNode |
| BlocksRead | Total number of blocks read from DataNode |
| BlocksReplicated | Total number of blocks replicated |
| BlocksRemoved | Total number of blocks removed |
| BlocksVerified | Total number of blocks verified |
| BlockVerificationFailures | Total number of verification failures |
| BlocksCached | Total number of blocks cached |
| BlocksUncached | Total number of blocks uncached |
| ReadsFromLocalClient | Total number of read operations from local clients |
| ReadsFromRemoteClient | Total number of read operations from remote clients |
| WritesFromLocalClient | Total number of write operations from local clients |
| WritesFromRemoteClient | Total number of write operations from remote clients |
| BlocksGetLocalPathInfo | Total number of operations to get local path names of blocks |
| RamDiskBlocksWrite | Total number of blocks written to memory |
| RamDiskBlocksWriteFallback | Total number of block writes to memory that could not be satisfied (failed over to disk) |
| RamDiskBytesWrite | Total number of bytes written to memory |
| RamDiskBlocksReadHits | Total number of times a block in memory was read |
| RamDiskBlocksEvicted | Total number of blocks evicted from memory |
| RamDiskBlocksEvictedWithoutRead | Total number of blocks evicted from memory without ever being read from memory |
| RamDiskBlocksEvictionWindowMsNumOps | Number of blocks evicted from memory |
| RamDiskBlocksEvictionWindowMsAvgTime | Average time of blocks in memory before being evicted, in milliseconds |
| RamDiskBlocksEvictionWindows*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of latency between memory write and eviction in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| RamDiskBlocksLazyPersisted | Total number of blocks written to disk by the lazy writer |
| RamDiskBlocksDeletedBeforeLazyPersisted | Total number of blocks deleted by the application before being persisted to disk |
| RamDiskBytesLazyPersisted | Total number of bytes written to disk by the lazy writer |
| RamDiskBlocksLazyPersistWindowMsNumOps | Number of blocks written to disk by the lazy writer |
| RamDiskBlocksLazyPersistWindowMsAvgTime | Average time of blocks written to disk by the lazy writer, in milliseconds |
| RamDiskBlocksLazyPersistWindows*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of latency between memory write and disk persist in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| FsyncCount | Total number of fsync operations |
| VolumeFailures | Total number of volume failures that occurred |
| ReadBlockOpNumOps | Total number of read operations |
| ReadBlockOpAvgTime | Average time of read operations in milliseconds |
| WriteBlockOpNumOps | Total number of write operations |
| WriteBlockOpAvgTime | Average time of write operations in milliseconds |
| BlockChecksumOpNumOps | Total number of blockChecksum operations |
| BlockChecksumOpAvgTime | Average time of blockChecksum operations in milliseconds |
| CopyBlockOpNumOps | Total number of block copy operations |
| CopyBlockOpAvgTime | Average time of block copy operations in milliseconds |
| ReplaceBlockOpNumOps | Total number of block replace operations |
| ReplaceBlockOpAvgTime | Average time of block replace operations in milliseconds |
| HeartbeatsNumOps | Total number of heartbeats |
| HeartbeatsAvgTime | Average heartbeat time in milliseconds |
| HeartbeatsTotalNumOps | Total number of heartbeats (a duplicate of HeartbeatsNumOps) |
| HeartbeatsTotalAvgTime | Average total heartbeat time in milliseconds |
| LifelinesNumOps | Total number of lifeline messages |
| LifelinesAvgTime | Average lifeline message processing time in milliseconds |
| BlockReportsNumOps | Total number of block report operations |
| BlockReportsAvgTime | Average time of block report operations in milliseconds |
| IncrementalBlockReportsNumOps | Total number of incremental block report operations |
| IncrementalBlockReportsAvgTime | Average time of incremental block report operations in milliseconds |
| CacheReportsNumOps | Total number of cache report operations |
| CacheReportsAvgTime | Average time of cache report operations in milliseconds |
| PacketAckRoundTripTimeNanosNumOps | Total number of ack round trips |
| PacketAckRoundTripTimeNanosAvgTime | Average time from ack send to receive, minus the downstream ack time, in nanoseconds |
| FlushNanosNumOps | Total number of flushes |
| FlushNanosAvgTime | Average flush time in nanoseconds |
| FsyncNanosNumOps | Total number of fsync operations |
| FsyncNanosAvgTime | Average fsync time in nanoseconds |
| SendDataPacketBlockedOnNetworkNanosNumOps | Total number of packets sent |
| SendDataPacketBlockedOnNetworkNanosAvgTime | Average waiting time of sending packets in nanoseconds |
| SendDataPacketTransferNanosNumOps | Total number of packets sent |
| SendDataPacketTransferNanosAvgTime | Average transfer time of sending packets in nanoseconds |
| TotalWriteTime | Total number of milliseconds spent on write operations |
| TotalReadTime | Total number of milliseconds spent on read operations |
| RemoteBytesRead | Number of bytes read by remote clients |
| RemoteBytesWritten | Number of bytes written by remote clients |
| BPServiceActorInfo | The information about a block pool service actor |
| BlocksInPendingIBR | Number of blocks in pending incremental block report (IBR) |
| BlocksReceivingInPendingIBR | Number of blocks at receiving status in pending incremental block report (IBR) |
| BlocksReceivedInPendingIBR | Number of blocks at received status in pending incremental block report (IBR) |
| BlocksDeletedInPendingIBR | Number of blocks at deleted status in pending incremental block report (IBR) |
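
The local/remote read counters are a quick way to gauge data locality. A sketch that finds the DataNode activity bean, whose JMX name embeds the hostname and port (DataNodeActivity-(hostname)-(port)), and computes the local-read ratio; the DataNode web port 9864 is an assumption:

```python
import json
import urllib.request

# The bean name embeds hostname and port, so match by prefix rather
# than issuing an exact 'qry' filter.
with urllib.request.urlopen("http://datanode:9864/jmx") as resp:
    beans = json.load(resp)["beans"]
dn = next(b for b in beans
          if b["name"].startswith("Hadoop:service=DataNode,"
                                  "name=DataNodeActivity"))

# A low local-read ratio often points at poor task/block locality.
local, remote = dn["ReadsFromLocalClient"], dn["ReadsFromRemoteClient"]
if local + remote:
    print("local read ratio: %.1f%%" % (100.0 * local / (local + remote)))
```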

FsVolume

Per-volume metrics contain DataNode volume IO-related statistics. They are off by default, and can be enabled by setting dfs.datanode.fileio.profiling.percentage.fraction to an integer value between 1 and 100; setting it to 0 disables profiling. Note that enabling per-volume metrics may have a performance impact. Each metrics record contains tags such as Hostname as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| TotalMetadataOperations | Total number (monotonically increasing) of metadata operations. Metadata operations include stat, list, mkdir, delete, move, open and posix_fadvise. |
| MetadataOperationRateNumOps | The number of metadata operations within the metric interval |
| MetadataOperationRateAvgTime | Mean time of metadata operations in milliseconds |
| MetadataOperationLatency*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of metadata operation latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| TotalDataFileIos | Total number (monotonically increasing) of data file io operations |
| DataFileIoRateNumOps | The number of data file io operations within the metric interval |
| DataFileIoRateAvgTime | Mean time of data file io operations in milliseconds |
| DataFileIoLatency*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of data file io operation latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| FlushIoRateNumOps | The number of file flush io operations within the metric interval |
| FlushIoRateAvgTime | Mean time of file flush io operations in milliseconds |
| FlushIoLatency*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of file flush io operation latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| SyncIoRateNumOps | The number of file sync io operations within the metric interval |
| SyncIoRateAvgTime | Mean time of file sync io operations in milliseconds |
| SyncIoLatency*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of file sync io operation latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| ReadIoRateNumOps | The number of file read io operations within the metric interval |
| ReadIoRateAvgTime | Mean time of file read io operations in milliseconds |
| ReadIoLatency*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of file read io operation latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| WriteIoRateNumOps | The number of file write io operations within the metric interval |
| WriteIoRateAvgTime | Mean time of file write io operations in milliseconds |
| WriteIoLatency*num*s(50/75/90/95/99)thPercentileLatency | The 50/75/90/95/99th percentile of file write io operation latency in milliseconds. Percentile measurement is off by default (no intervals are watched); the intervals are specified by dfs.metrics.percentiles.intervals. |
| TotalFileIoErrors | Total number (monotonically increasing) of file io error operations |
| FileIoErrorRateNumOps | The number of file io error operations within the metric interval |
| FileIoErrorRateAvgTime | Mean time in milliseconds from the start of an operation to hitting a failure |

RouterRPCMetrics

RouterRPCMetrics shows the statistics of the Router component in Router-based federation.

| Name | Description |
|:---- |:---- |
| ProcessingOp | Number of operations the Router processed internally |
| ProxyOp | Number of operations the Router proxied to a Namenode |
| ProxyOpFailureStandby | Number of operations that hit a standby NN |
| ProxyOpFailureCommunicate | Number of operations that failed to reach the NN |
| ProxyOpNotImplemented | Number of operations not implemented |
| RouterFailureStateStore | Number of requests that failed because the State Store was unavailable |
| RouterFailureReadOnly | Number of requests that failed due to a read-only mount point |
| RouterFailureLocked | Number of requests that failed due to a locked path |
| RouterFailureSafemode | Number of requests that failed due to safe mode |
| ProcessingNumOps | Number of operations the Router processed internally within the metric interval |
| ProcessingAvgTime | Average time for the Router to process operations, in nanoseconds |
| ProxyNumOps | Number of operations the Router proxied to the Namenodes within the metric interval |
| ProxyAvgTime | Average time for the Router to proxy operations to the Namenodes, in nanoseconds |

StateStoreMetrics

StateStoreMetrics shows the statistics of the State Store component in Router-based federation.

| Name | Description |
|:---- |:---- |
| ReadsNumOps | Number of GET transactions for the State Store within the metric interval |
| ReadsAvgTime | Average time of GET transactions for the State Store in milliseconds |
| WritesNumOps | Number of PUT transactions for the State Store within the metric interval |
| WritesAvgTime | Average time of PUT transactions for the State Store in milliseconds |
| RemovesNumOps | Number of REMOVE transactions for the State Store within the metric interval |
| RemovesAvgTime | Average time of REMOVE transactions for the State Store in milliseconds |
| FailuresNumOps | Number of failed transactions for the State Store within the metric interval |
| FailuresAvgTime | Average time of failed transactions for the State Store in milliseconds |
| Cache*BaseRecord*Size | Number of store records to cache in the State Store |

yarn context

ClusterMetrics

ClusterMetrics shows the metrics of the YARN cluster from the ResourceManager's perspective. Each metrics record contains Hostname tag as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| NumActiveNMs | Current number of active NodeManagers |
| numDecommissioningNMs | Current number of NodeManagers being decommissioned |
| NumDecommissionedNMs | Current number of decommissioned NodeManagers |
| NumShutdownNMs | Current number of NodeManagers shut down gracefully. Note that this does not count NodeManagers that are forcefully killed. |
| NumLostNMs | Current number of NodeManagers lost for not sending heartbeats |
| NumUnhealthyNMs | Current number of unhealthy NodeManagers |
| NumRebootedNMs | Current number of rebooted NodeManagers |
| AMLaunchDelayNumOps | Total number of AMs launched |
| AMLaunchDelayAvgTime | Average time in milliseconds the RM spends launching AM containers after the AM container is allocated |
| AMRegisterDelayNumOps | Total number of AMs registered |
| AMRegisterDelayAvgTime | Average time in milliseconds an AM spends registering with the RM after the AM container is launched |
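
NumLostNMs and NumUnhealthyNMs quietly shrink cluster capacity, so they are natural alerting targets. A sketch against the ClusterMetrics bean (the ResourceManager address is an assumption):

```python
import json
import urllib.request

# ResourceManager web UI port 8088 assumed.
URL = ("http://resourcemanager:8088/jmx"
       "?qry=Hadoop:service=ResourceManager,name=ClusterMetrics")

with urllib.request.urlopen(URL) as resp:
    cm = json.load(resp)["beans"][0]

print("active NMs:", cm["NumActiveNMs"])
# Lost and unhealthy NodeManagers reduce cluster capacity silently.
for name in ("NumLostNMs", "NumUnhealthyNMs"):
    if cm[name] > 0:
        print("WARN: %s = %d" % (name, cm[name]))
```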

QueueMetrics

QueueMetrics shows an application queue from the ResourceManager's perspective. Each metrics record shows the statistics of each queue, and contains tags such as queue name and Hostname as additional information along with metrics.

For the running_*num* metrics such as running_0, you can set the property yarn.resourcemanager.metrics.runtime.buckets in yarn-site.xml to change the buckets. The default value is 60,300,1440.
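
The bucket list translates into metric names by prefixing each boundary with running_, with an implicit leading 0 bucket. A small sketch of that name derivation (no cluster access needed):

```python
# Derive the running_* metric names produced by a bucket configuration;
# 60,300,1440 is the default of yarn.resourcemanager.metrics.runtime.buckets.
def running_metric_names(buckets=(60, 300, 1440)):
    edges = [0] + sorted(buckets)
    return ["running_%d" % edge for edge in edges]

print(running_metric_names())
# ['running_0', 'running_60', 'running_300', 'running_1440']
print(running_metric_names((30, 120)))
# ['running_0', 'running_30', 'running_120']
```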

| Name | Description |
|:---- |:---- |
| running_0 | Current number of running applications whose elapsed time is less than 60 minutes |
| running_60 | Current number of running applications whose elapsed time is between 60 and 300 minutes |
| running_300 | Current number of running applications whose elapsed time is between 300 and 1440 minutes |
| running_1440 | Current number of running applications whose elapsed time is more than 1440 minutes |
| AppsSubmitted | Total number of submitted applications |
| AppsRunning | Current number of running applications |
| AppsPending | Current number of applications that have not yet been assigned any containers |
| AppsCompleted | Total number of completed applications |
| AppsKilled | Total number of killed applications |
| AppsFailed | Total number of failed applications |
| AllocatedMB | Current allocated memory in MB |
| AllocatedVCores | Current allocated CPU in virtual cores |
| AllocatedContainers | Current number of allocated containers |
| AggregateContainersAllocated | Total number of allocated containers |
| aggregateNodeLocalContainersAllocated | Total number of node-local containers allocated |
| aggregateRackLocalContainersAllocated | Total number of rack-local containers allocated |
| aggregateOffSwitchContainersAllocated | Total number of off-switch containers allocated |
| AggregateContainersReleased | Total number of released containers |
| AvailableMB | Current available memory in MB |
| AvailableVCores | Current available CPU in virtual cores |
| PendingMB | Current memory requests in MB that are pending to be fulfilled by the scheduler |
| PendingVCores | Current CPU requests in virtual cores that are pending to be fulfilled by the scheduler |
| PendingContainers | Current number of containers that are pending to be fulfilled by the scheduler |
| ReservedMB | Current reserved memory in MB |
| ReservedVCores | Current reserved CPU in virtual cores |
| ReservedContainers | Current number of reserved containers |
| ActiveUsers | Current number of active users |
| ActiveApplications | Current number of active applications |
| AppAttemptFirstContainerAllocationDelayNumOps | Total number of first containers allocated across all attempts |
| AppAttemptFirstContainerAllocationDelayAvgTime | Average time the RM spends allocating the first container for all attempts. For a managed AM, the first container is the AM container, so this indicates the time taken to allocate the AM container. For an unmanaged AM, this is the time taken to allocate the first container requested by the unmanaged AM. |
| FairShareMB | (FairScheduler only) Current fair share of memory in MB |
| FairShareVCores | (FairScheduler only) Current fair share of CPU in virtual cores |
| MinShareMB | (FairScheduler only) Minimum share of memory in MB |
| MinShareVCores | (FairScheduler only) Minimum share of CPU in virtual cores |
| MaxShareMB | (FairScheduler only) Maximum share of memory in MB |
| MaxShareVCores | (FairScheduler only) Maximum share of CPU in virtual cores |

NodeManagerMetrics

NodeManagerMetrics shows the statistics of the containers in the node. Each metrics record contains Hostname tag as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| containersLaunched | Total number of launched containers |
| containersCompleted | Total number of successfully completed containers |
| containersFailed | Total number of failed containers |
| containersKilled | Total number of killed containers |
| containersIniting | Current number of initializing containers |
| containersRunning | Current number of running containers |
| allocatedContainers | Current number of allocated containers |
| allocatedGB | Current allocated memory in GB |
| availableGB | Current available memory in GB |
| allocatedVcores | Current used vcores |
| availableVcores | Current available vcores |
| containerLaunchDuration | Average time in milliseconds the NM takes to launch a container |
| badLocalDirs | Current number of bad local directories. A disk that cannot be read/written/executed by the NM process, or that is full, is considered bad. |
| badLogDirs | Current number of bad log directories. A disk that cannot be read/written/executed by the NM process, or that is full, is considered bad. |
| goodLocalDirsDiskUtilizationPerc | Current disk utilization percentage across all good local directories |
| goodLogDirsDiskUtilizationPerc | Current disk utilization percentage across all good log directories |
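
badLocalDirs and badLogDirs are early signals of disk trouble on a node. A sketch against the NodeManagerMetrics bean; the NodeManager web port 8042 is an assumption, and the lookup is case-insensitive because metric-name capitalization can differ between sinks:

```python
import json
import urllib.request

URL = ("http://nodemanager:8042/jmx"
       "?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics")

with urllib.request.urlopen(URL) as resp:
    nm = json.load(resp)["beans"][0]

def metric(name):
    # Match attribute names case-insensitively to stay robust to
    # capitalization differences between sinks.
    return next(v for k, v in nm.items() if k.lower() == name.lower())

bad_local, bad_log = metric("badLocalDirs"), metric("badLogDirs")
if bad_local or bad_log:
    print("WARN: bad dirs local=%d log=%d" % (bad_local, bad_log))
print("good local dirs disk use: %s%%"
      % metric("goodLocalDirsDiskUtilizationPerc"))
```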

ContainerMetrics

ContainerMetrics shows the resource utilization statistics of a container. Each metrics record contains tags such as ContainerPid and Hostname as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| pMemLimitMBs | Physical memory limit of the container in MB |
| vMemLimitMBs | Virtual memory limit of the container in MB |
| vCoreLimit | CPU limit of the container in number of vcores |
| launchDurationMs | Container launch duration in msec |
| localizationDurationMs | Container localization duration in msec |
| StartTime | Time in msec when the container starts |
| FinishTime | Time in msec when the container finishes |
| ExitCode | Container's exit code |
| PMemUsageMBsNumUsage | Total number of physical memory used metrics |
| PMemUsageMBsAvgMBs | Average physical memory used in MB |
| PMemUsageMBsStdevMBs | Standard deviation of the physical memory used in MB |
| PMemUsageMBsMinMBs | Minimum physical memory used in MB |
| PMemUsageMBsMaxMBs | Maximum physical memory used in MB |
| PMemUsageMBsIMinMBs | Minimum physical memory used in MB in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| PMemUsageMBsIMaxMBs | Maximum physical memory used in MB in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| PMemUsageMBsINumUsage | Total number of physical memory used metrics in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| PCpuUsagePercentNumUsage | Total number of physical CPU cores percent used metrics |
| PCpuUsagePercentAvgPercents | Average physical CPU cores percent used |
| PCpuUsagePercentStdevPercents | Standard deviation of physical CPU cores percent used |
| PCpuUsagePercentMinPercents | Minimum physical CPU cores percent used |
| PCpuUsagePercentMaxPercents | Maximum physical CPU cores percent used |
| PCpuUsagePercentIMinPercents | Minimum physical CPU cores percent used in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| PCpuUsagePercentIMaxPercents | Maximum physical CPU cores percent used in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| PCpuUsagePercentINumUsage | Total number of physical CPU cores used metrics in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| MilliVcoreUsageNumUsage | Total number of vcores used metrics |
| MilliVcoreUsageAvgMilliVcores | 1000 times the average vcores used |
| MilliVcoreUsageStdevMilliVcores | 1000 times the standard deviation of vcores used |
| MilliVcoreUsageMinMilliVcores | 1000 times the minimum vcores used |
| MilliVcoreUsageMaxMilliVcores | 1000 times the maximum vcores used |
| MilliVcoreUsageIMinMilliVcores | 1000 times the minimum vcores used in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| MilliVcoreUsageIMaxMilliVcores | 1000 times the maximum vcores used in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| MilliVcoreUsageINumUsage | Total number of vcores used metrics in the current interval (the interval is specified by yarn.nodemanager.container-metrics.period-ms) |
| PMemUsageMBHistogramNumUsage | Total number of physical memory used metrics (1 second granularity) |
| PMemUsageMBHistogram50thPercentileMBs | The 50th percentile of physical memory used in MB (1 second granularity) |
| PMemUsageMBHistogram75thPercentileMBs | The 75th percentile of physical memory used in MB (1 second granularity) |
| PMemUsageMBHistogram90thPercentileMBs | The 90th percentile of physical memory used in MB (1 second granularity) |
| PMemUsageMBHistogram95thPercentileMBs | The 95th percentile of physical memory used in MB (1 second granularity) |
| PMemUsageMBHistogram99thPercentileMBs | The 99th percentile of physical memory used in MB (1 second granularity) |
| PCpuUsagePercentHistogramNumUsage | Total number of physical CPU cores used metrics (1 second granularity) |
| PCpuUsagePercentHistogram50thPercentilePercents | The 50th percentile of physical CPU cores percent used (1 second granularity) |
| PCpuUsagePercentHistogram75thPercentilePercents | The 75th percentile of physical CPU cores percent used (1 second granularity) |
| PCpuUsagePercentHistogram90thPercentilePercents | The 90th percentile of physical CPU cores percent used (1 second granularity) |
| PCpuUsagePercentHistogram95thPercentilePercents | The 95th percentile of physical CPU cores percent used (1 second granularity) |
| PCpuUsagePercentHistogram99thPercentilePercents | The 99th percentile of physical CPU cores percent used (1 second granularity) |

ugi context

UgiMetrics

UgiMetrics is related to user and group information. Each metrics record contains Hostname tag as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| LoginSuccessNumOps | Total number of successful Kerberos logins |
| LoginSuccessAvgTime | Average time for successful Kerberos logins in milliseconds |
| LoginFailureNumOps | Total number of failed Kerberos logins |
| LoginFailureAvgTime | Average time for failed Kerberos logins in milliseconds |
| getGroupsNumOps | Total number of group resolutions |
| getGroupsAvgTime | Average time for group resolution in milliseconds |
| getGroups*num*sNumOps | Total number of group resolutions (*num* seconds granularity). *num* is specified by hadoop.user.group.metrics.percentiles.intervals. |
| getGroups*num*s50thPercentileLatency | Shows the 50th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by hadoop.user.group.metrics.percentiles.intervals. |
| getGroups*num*s75thPercentileLatency | Shows the 75th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by hadoop.user.group.metrics.percentiles.intervals. |
| getGroups*num*s90thPercentileLatency | Shows the 90th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by hadoop.user.group.metrics.percentiles.intervals. |
| getGroups*num*s95thPercentileLatency | Shows the 95th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by hadoop.user.group.metrics.percentiles.intervals. |
| getGroups*num*s99thPercentileLatency | Shows the 99th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by hadoop.user.group.metrics.percentiles.intervals. |

metricssystem context

MetricsSystem

MetricsSystem shows the statistics for metrics snapshots and publishes. Each metrics record contains Hostname tag as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| NumActiveSources | Current number of active metrics sources |
| NumAllSources | Total number of metrics sources |
| NumActiveSinks | Current number of active sinks |
| NumAllSinks | Total number of sinks (but usually less than NumActiveSinks; see HADOOP-9946) |
| SnapshotNumOps | Total number of operations to snapshot statistics from a metrics source |
| SnapshotAvgTime | Average time in milliseconds to snapshot statistics from a metrics source |
| PublishNumOps | Total number of operations to publish statistics to a sink |
| PublishAvgTime | Average time in milliseconds to publish statistics to a sink |
| DroppedPubAll | Total number of dropped publishes |
| Sink_*instance*NumOps | Total number of sink operations for the instance |
| Sink_*instance*AvgTime | Average time in milliseconds of sink operations for the instance |
| Sink_*instance*Dropped | Total number of dropped sink operations for the instance |
| Sink_*instance*Qsize | Current queue length of the sink |

default context

StartupProgress

StartupProgress metrics show the statistics of NameNode startup. Four metrics are exposed for each startup phase, based on the phase name. The startup phases are LoadingFsImage, LoadingEdits, SavingCheckpoint, and SafeMode. Each metrics record contains Hostname tag as additional information along with metrics.

| Name | Description |
|:---- |:---- |
| ElapsedTime | Total elapsed time in milliseconds |
| PercentComplete | Current rate completed in NameNode startup progress (the max value is 1.0, not 100) |
| *phase*Count | Total number of steps completed in the phase |
| *phase*ElapsedTime | Total elapsed time in the phase in milliseconds |
| *phase*Total | Total number of steps in the phase |
| *phase*PercentComplete | Current rate completed in the phase (the max value is 1.0, not 100) |
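
Because PercentComplete tops out at 1.0, a startup watcher just polls until that value is reached. A sketch against the StartupProgress bean (the NameNode address is an assumption):

```python
import json
import time
import urllib.request

URL = ("http://namenode:9870/jmx"
       "?qry=Hadoop:service=NameNode,name=StartupProgress")

# Poll until startup completes; PercentComplete runs from 0.0 to 1.0.
while True:
    with urllib.request.urlopen(URL) as resp:
        sp = json.load(resp)["beans"][0]
    print("startup %.0f%% (elapsed %d ms)"
          % (sp["PercentComplete"] * 100, sp["ElapsedTime"]))
    if sp["PercentComplete"] >= 1.0:
        break
    time.sleep(5)
```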