Druid generates metrics related to queries, ingestion, and coordination.
Metrics are emitted as JSON objects to a runtime log file or over HTTP (to a service such as Apache Kafka). Metric emission is disabled by default.
All Druid metrics share a common set of fields:
- timestamp: the time the metric was created
- metric: the name of the metric
- service: the service name that emitted the metric
- host: the host name that emitted the metric
- value: some numeric value associated with the metric

Metrics may have additional dimensions beyond those listed above.
Most metric values reset each emission period. By default, the Druid emission period is 1 minute; you can change it by setting the property druid.monitoring.emissionPeriod.
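As a minimal sketch of how a consumer might handle one emitted JSON object, the snippet below separates the common fields listed above from any additional dimensions. The sample event and its values are hypothetical and for illustration only; they are not captured Druid output, and `split_event` is not part of Druid.

```python
import json

# Hypothetical sample event: the values are illustrative, not real Druid output.
SAMPLE_EVENT = """
{"timestamp": "2024-01-01T00:00:00.000Z", "metric": "query/time",
 "service": "druid/broker", "host": "localhost:8082", "value": 42,
 "dataSource": "wikipedia"}
"""

# The common fields shared by all Druid metrics, as listed above.
COMMON_FIELDS = ("timestamp", "metric", "service", "host", "value")


def split_event(raw):
    """Split one metric event into (common fields, additional dimensions)."""
    event = json.loads(raw)
    missing = [name for name in COMMON_FIELDS if name not in event]
    if missing:
        raise ValueError(f"missing common fields: {missing}")
    common = {name: event[name] for name in COMMON_FIELDS}
    dimensions = {k: v for k, v in event.items() if k not in COMMON_FIELDS}
    return common, dimensions


common, dimensions = split_event(SAMPLE_EVENT)
print(common["metric"], common["value"], dimensions)
```

Anything not in the common set is treated as a dimension, so the same function works for metrics with different dimension lists.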
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
query/bytes | Number of bytes returned in query response. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | |
query/node/time | Milliseconds taken to query individual historical/realtime processes. | id, status, server. | < 1s |
query/node/bytes | Number of bytes returned from querying individual historical/realtime processes. | id, status, server. | |
query/node/ttfb | Time to first byte. Milliseconds elapsed until Broker starts receiving the response from individual historical/realtime processes. | id, status, server. | < 1s |
query/node/backpressure | Milliseconds that the channel to this process has spent suspended due to backpressure. | id, status, server. | |
query/count | Number of total queries. Available only if the QueryCountStatsMonitor module is included. | | |
query/success/count | Number of queries successfully processed. Available only if the QueryCountStatsMonitor module is included. | | |
query/failed/count | Number of failed queries. Available only if the QueryCountStatsMonitor module is included. | | |
query/interrupted/count | Number of queries interrupted due to cancellation or timeout. Available only if the QueryCountStatsMonitor module is included. | | |
sqlQuery/time | Milliseconds taken to complete a SQL query. | id, nativeQueryIds, dataSource, remoteAddress, success. | < 1s |
sqlQuery/bytes | Number of bytes returned in SQL query response. | id, nativeQueryIds, dataSource, remoteAddress, success. | |
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
query/segment/time | Milliseconds taken to query individual segment. Includes time to page in the segment from disk. | id, status, segment. | several hundred milliseconds |
query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment. | < several hundred milliseconds |
segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0 |
query/segmentAndCache/time | Milliseconds taken to query individual segment or hit the cache (if it is enabled on the Historical process). | id, segment. | several hundred milliseconds |
query/cpu/time | Microseconds of CPU time taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | Varies |
query/count | Number of total queries. Available only if the QueryCountStatsMonitor module is included. | | |
query/success/count | Number of queries successfully processed. Available only if the QueryCountStatsMonitor module is included. | | |
query/failed/count | Number of failed queries. Available only if the QueryCountStatsMonitor module is included. | | |
query/interrupted/count | Number of queries interrupted due to cancellation or timeout. Available only if the QueryCountStatsMonitor module is included. | | |
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment. | several hundred milliseconds |
segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0 |
query/count | Number of total queries. Available only if the QueryCountStatsMonitor module is included. | | |
query/success/count | Number of queries successfully processed. Available only if the QueryCountStatsMonitor module is included. | | |
query/failed/count | Number of failed queries. Available only if the QueryCountStatsMonitor module is included. | | |
query/interrupted/count | Number of queries interrupted due to cancellation or timeout. Available only if the QueryCountStatsMonitor module is included. | | |
Metric | Description | Normal Value |
---|---|---|
jetty/numOpenConnections | Number of open jetty connections. | Not much higher than number of jetty threads. |
Metric | Description | Normal Value |
---|---|---|
query/cache/delta/* | Cache metrics since the last emission. | |
query/cache/total/* | Total cache metrics. |
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
*/numEntries | Number of cache entries. | | Varies. |
*/sizeBytes | Size in bytes of cache entries. | | Varies. |
*/hits | Number of cache hits. | | Varies. |
*/misses | Number of cache misses. | | Varies. |
*/evictions | Number of cache evictions. | | Varies. |
*/hitRate | Cache hit rate. | | ~40% |
*/averageByte | Average cache entry byte size. | | Varies. |
*/timeouts | Number of cache timeouts. | | 0 |
*/errors | Number of cache errors. | | 0 |
*/put/ok | Number of new cache entries successfully cached. | | Varies, but more than zero. |
*/put/error | Number of new cache entries that could not be cached due to errors. | | Varies, but more than zero. |
*/put/oversized | Number of potential new cache entries that were skipped due to being too large (based on druid.{broker,historical,realtime}.cache.maxEntrySize properties). | | Varies. |
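To relate the hit, miss, and hit rate metrics in the table above, here is a rough sketch that recomputes a hit rate from the two counters. It assumes the hit rate is hits divided by total lookups (hits plus misses); that formula is an assumption for illustration, not something the table states, and `hit_rate` is a hypothetical helper.

```python
def hit_rate(hits, misses):
    """Fraction of cache lookups that were hits.

    Assumption: hit rate = hits / (hits + misses). Returns 0.0 when
    there were no lookups, to avoid dividing by zero.
    """
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


# Hypothetical values of */hits and */misses for one emission period.
print(hit_rate(hits=40, misses=60))
```

Applying it to the delta counters gives the rate for the last emission period, while the total counters give the rate over the lifetime of the process.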
Memcached client metrics are reported as follows. These metrics come directly from the client, as opposed to from the cache retrieval layer.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
query/cache/memcached/total | Cache metrics unique to memcached (only if druid.cache.type=memcached) as their actual values. | Variable | N/A |
query/cache/memcached/delta | Cache metrics unique to memcached (only if druid.cache.type=memcached) as their delta from the prior event emission. | Variable | N/A |
If SQL is enabled, the Broker will emit the following metrics for SQL.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
sqlQuery/time | Milliseconds taken to complete a SQL query. | id, nativeQueryIds, dataSource, remoteAddress, success. | < 1s |
sqlQuery/bytes | Number of bytes returned in SQL query response. | id, nativeQueryIds, dataSource, remoteAddress, success. | |
These metrics are applicable for the Kafka Indexing Service.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
ingest/kafka/lag | Total lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number |
ingest/kafka/maxLag | Max lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number |
ingest/kafka/avgLag | Average lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number |
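The three lag metrics above are different summaries of the same per-partition lag. The sketch below shows that relationship under the assumption that per-partition lag is the latest offset in Kafka minus the offset consumed by the indexing tasks; the offsets and the `lag_summary` helper are hypothetical.

```python
def lag_summary(latest_offsets, consumed_offsets):
    """Return (total, max, average) lag across all partitions.

    Both arguments map a partition id to an offset and must cover the
    same partitions. Assumption: lag = latest offset - consumed offset.
    """
    lags = [latest_offsets[p] - consumed_offsets[p] for p in latest_offsets]
    if not lags:
        return 0, 0, 0.0
    return sum(lags), max(lags), sum(lags) / len(lags)


# Hypothetical offsets for a three-partition topic.
latest = {0: 100, 1: 250, 2: 80}
consumed = {0: 90, 1: 200, 2: 80}

total, worst, average = lag_summary(latest, consumed)
print(total, worst, average)
```

A high total with a low maximum suggests lag spread evenly across partitions, while a maximum close to the total points at a single slow partition.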
These metrics are only available if the RealtimeMetricsMonitor is included in the monitors list for the Realtime process. These metrics are deltas for each emission period.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
ingest/events/thrownAway | Number of events rejected because they are outside the windowPeriod. | dataSource, taskId, taskType. | 0 |
ingest/events/unparseable | Number of events rejected because the events are unparseable. | dataSource, taskId, taskType. | 0 |
ingest/events/duplicate | Number of events rejected because the events are duplicated. | dataSource, taskId, taskType. | 0 |
ingest/events/processed | Number of events successfully processed per emission period. | dataSource, taskId, taskType. | Equal to your # of events per emission period. |
ingest/rows/output | Number of Druid rows persisted. | dataSource, taskId, taskType. | Your # of events with rollup. |
ingest/persists/count | Number of times persist occurred. | dataSource, taskId, taskType. | Depends on configuration. |
ingest/persists/time | Milliseconds spent doing intermediate persist. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/persists/cpu | CPU time in nanoseconds spent on doing intermediate persist. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/persists/backPressure | Milliseconds spent creating persist tasks and blocking waiting for them to finish. | dataSource, taskId, taskType. | 0 or very low |
ingest/persists/failed | Number of persists that failed. | dataSource, taskId, taskType. | 0 |
ingest/handoff/failed | Number of handoffs that failed. | dataSource, taskId, taskType. | 0 |
ingest/merge/time | Milliseconds spent merging intermediate segments. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/merge/cpu | CPU time in nanoseconds spent on merging intermediate segments. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/handoff/count | Number of handoffs that happened. | dataSource, taskId, taskType. | Varies. Generally greater than 0 once every segment granular period if the cluster is operating normally. |
ingest/sink/count | Number of sinks not yet handed off. | dataSource, taskId, taskType. | 1~3 |
ingest/events/messageGap | Time gap between the data time in event and current system time. | dataSource, taskId, taskType. | Greater than 0, depends on the time carried in event |
Note: If the JVM does not support CPU time measurement for the current thread, ingest/merge/cpu and ingest/persists/cpu will be 0.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
task/run/time | Milliseconds taken to run a task. | dataSource, taskId, taskType, taskStatus. | Varies. |
task/action/log/time | Milliseconds taken to log a task action to the audit log. | dataSource, taskId, taskType | < 1000 (subsecond) |
task/action/run/time | Milliseconds taken to execute a task action. | dataSource, taskId, taskType | Varies from subsecond to a few seconds, based on action type. |
segment/added/bytes | Size in bytes of new segments created. | dataSource, taskId, taskType, interval. | Varies. |
segment/moved/bytes | Size in bytes of segments moved/archived via the Move Task. | dataSource, taskId, taskType, interval. | Varies. |
segment/nuked/bytes | Size in bytes of segments deleted via the Kill Task. | dataSource, taskId, taskType, interval. | Varies. |
task/success/count | Number of successful tasks per emission period. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies. |
task/failed/count | Number of failed tasks per emission period. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies. |
task/running/count | Number of current running tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies. |
task/pending/count | Number of current pending tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies. |
task/waiting/count | Number of current waiting tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies. |
These metrics are for the Druid Coordinator and are reset each time the Coordinator runs the coordination logic.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
segment/assigned/count | Number of segments assigned to be loaded in the cluster. | tier. | Varies. |
segment/moved/count | Number of segments moved in the cluster. | tier. | Varies. |
segment/dropped/count | Number of segments dropped due to being overshadowed. | tier. | Varies. |
segment/deleted/count | Number of segments dropped due to rules. | tier. | Varies. |
segment/unneeded/count | Number of segments dropped due to being marked as unused. | tier. | Varies. |
segment/cost/raw | Used in cost balancing. The raw cost of hosting segments. | tier. | Varies. |
segment/cost/normalization | Used in cost balancing. The normalization of hosting segments. | tier. | Varies. |
segment/cost/normalized | Used in cost balancing. The normalized cost of hosting segments. | tier. | Varies. |
segment/loadQueue/size | Size in bytes of segments to load. | server. | Varies. |
segment/loadQueue/failed | Number of segments that failed to load. | server. | 0 |
segment/loadQueue/count | Number of segments to load. | server. | Varies. |
segment/dropQueue/count | Number of segments to drop. | server. | Varies. |
segment/size | Total size of used segments in a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource. | Varies. |
segment/count | Number of used segments belonging to a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource. | < max |
segment/overShadowed/count | Number of overshadowed segments. | | Varies. |
segment/unavailable/count | Number of segments (not including replicas) left to load until segments that should be loaded in the cluster are available for queries. | dataSource. | 0 |
segment/underReplicated/count | Number of segments (including replicas) left to load until segments that should be loaded in the cluster are available for queries. | tier, dataSource. | 0 |
tier/historical/count | Number of available historical nodes in each tier. | tier. | Varies. |
tier/replication/factor | Configured maximum replication factor in each tier. | tier. | Varies. |
tier/required/capacity | Total capacity in bytes required in each tier. | tier. | Varies. |
tier/total/capacity | Total capacity in bytes available in each tier. | tier. | Varies. |
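One way to use the two capacity metrics at the end of the table is to compare them per tier. The sketch below is illustrative only: the tier names and byte counts are hypothetical, `tier_headroom` is not a Druid function, and treating negative headroom as a warning sign is an assumption rather than documented Druid behavior.

```python
def tier_headroom(required_by_tier, total_by_tier):
    """Return total capacity minus required capacity, in bytes, per tier.

    required_by_tier holds tier/required/capacity values and
    total_by_tier holds tier/total/capacity values, keyed by tier name.
    """
    return {
        tier: total_by_tier[tier] - required
        for tier, required in required_by_tier.items()
    }


# Hypothetical metric values in bytes.
required = {"hot": 800, "cold": 1500}
total = {"hot": 1000, "cold": 1200}

for tier, headroom in tier_headroom(required, total).items():
    status = "ok" if headroom >= 0 else "short on capacity"
    print(tier, headroom, status)
```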
If emitBalancingStats is set to true in the Coordinator dynamic configuration, then log entries for class org.apache.druid.server.coordinator.duty.EmitClusterStatsAndMetrics will have extra information on balancing decisions.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
segment/max | Maximum byte limit available for segments. | | Varies. |
segment/used | Bytes used for served segments. | dataSource, tier, priority. | < max |
segment/usedPercent | Percentage of space used by served segments. | dataSource, tier, priority. | < 100% |
segment/count | Number of served segments. | dataSource, tier, priority. | Varies. |
segment/pendingDelete | On-disk size in bytes of segments that are waiting to be cleared out. | | Varies. |
These metrics are only available if the JVMMonitor module is included.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
jvm/pool/committed | Committed pool. | poolKind, poolName. | close to max pool |
jvm/pool/init | Initial pool. | poolKind, poolName. | Varies. |
jvm/pool/max | Max pool. | poolKind, poolName. | Varies. |
jvm/pool/used | Pool used. | poolKind, poolName. | < max pool |
jvm/bufferpool/count | Bufferpool count. | bufferpoolName. | Varies. |
jvm/bufferpool/used | Bufferpool used. | bufferpoolName. | close to capacity |
jvm/bufferpool/capacity | Bufferpool capacity. | bufferpoolName. | Varies. |
jvm/mem/init | Initial memory. | memKind. | Varies. |
jvm/mem/max | Max memory. | memKind. | Varies. |
jvm/mem/used | Used memory. | memKind. | < max memory |
jvm/mem/committed | Committed memory. | memKind. | close to max memory |
jvm/gc/count | Garbage collection count. | gcName (cms/g1/parallel/etc.), gcGen (old/young) | Varies. |
jvm/gc/cpu | CPU time in nanoseconds spent on garbage collection. Note: jvm/gc/cpu represents the total time over multiple GC cycles; divide by jvm/gc/count to get the mean GC time per cycle. | gcName, gcGen | Sum of jvm/gc/cpu should be within 10-30% of sum of jvm/cpu/total (reported by JvmCpuMonitor), depending on the GC algorithm used. |
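The note in the jvm/gc/cpu row describes a simple division; a small sketch of it follows. The helper name and the sample values are hypothetical, and the conversion to milliseconds is added only for readability.

```python
NANOS_PER_MILLI = 1_000_000


def mean_gc_millis(gc_cpu_nanos, gc_count):
    """Mean GC time per cycle in milliseconds.

    gc_cpu_nanos is a jvm/gc/cpu value (nanoseconds over all cycles) and
    gc_count is the matching jvm/gc/count value. Returns 0.0 when no
    collections were counted.
    """
    if gc_count == 0:
        return 0.0
    return gc_cpu_nanos / gc_count / NANOS_PER_MILLI


# Hypothetical values for one emission period and one gcName/gcGen pair.
print(mean_gc_millis(gc_cpu_nanos=1_200_000_000, gc_count=40))
```

Both inputs should come from the same emission period and the same gcName and gcGen dimensions, otherwise the mean mixes unrelated collectors.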
The following metrics are only available if the EventReceiverFirehoseMonitor module is included.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
ingest/events/buffered | Number of events queued in the EventReceiverFirehose's buffer. | serviceName, dataSource, taskId, taskType, bufferCapacity. | Equal to current # of events in the buffer queue. |
ingest/bytes/received | Number of bytes received by the EventReceiverFirehose. | serviceName, dataSource, taskId, taskType. | Varies. |
These metrics are only available if the SysMonitor module is included.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
sys/swap/free | Free swap. | | Varies. |
sys/swap/max | Max swap. | | Varies. |
sys/swap/pageIn | Paged in swap. | | Varies. |
sys/swap/pageOut | Paged out swap. | | Varies. |
sys/disk/write/count | Writes to disk. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies. |
sys/disk/read/count | Reads from disk. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies. |
sys/disk/write/size | Bytes written to disk. Can be used to determine how much paging is occurring with regard to segments. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies. |
sys/disk/read/size | Bytes read from disk. Can be used to determine how much paging is occurring with regard to segments. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies. |
sys/net/write/size | Bytes written to the network. | netName, netAddress, netHwaddr | Varies. |
sys/net/read/size | Bytes read from the network. | netName, netAddress, netHwaddr | Varies. |
sys/fs/used | Filesystem bytes used. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | < max |
sys/fs/max | Filesystem bytes max. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies. |
sys/mem/used | Memory used. | | < max |
sys/mem/max | Memory max. | | Varies. |
sys/storage/used | Disk space used. | fsDirName. | Varies. |
sys/cpu | CPU used. | cpuName, cpuTime. | Varies. |