What is metrics?

Along with IoTDB running, some metrics reflecting current system's status will be collected continuously, which will provide some useful information helping us resolving system problems and detecting potential system risks.

When to use metrics?

Belows are some typical application scenarios

  1. System is running slowly

    When system is running slowly, we always hope to have information about system's running status as detail as possible, such as

    • JVM:Is there FGC?How long does it cost?How much does the memory usage decreased after GC?Are there lots of threads?
    • System:Is the CPU usage too hi?Are there many disk IOs?
    • Connections:How many connections are there in the current time?
    • Interface:What is the TPS and latency of every interface?
    • ThreadPool:Are there many pending tasks?
    • Cache Hit Ratio
  2. No space left on device

    When meet a “no space left on device” error, we really want to know which kind of data file had a rapid rise in the past hours.

  3. Is the system running in abnormal status

    We could use the count of error logs、the alive status of nodes in cluster, etc, to determine whether the system is running abnormally.

Who will use metrics?

Any person cares about the system's status, including but not limited to RD, QA, SRE, DBA, can use the metrics to work more efficiently.

What metrics does IoTDB have?

For now, we have provided some metrics for several core modules of IoTDB, and more metrics will be added or updated along with the development of new features and optimization or refactoring of architecture.

Key Concept

Before step into next, we'd better stop to have a look into some key concepts about metrics.

Every metric data has two properties

  • Metric Name

    The name of this metric,for example, logback_events_total indicates the total count of log events。

  • Tag

    Each metric could have 0 or several sub classes (Tag), for the same example, the logback_events_total metric has a sub class named level, which means the total count of log events at the specific level

Data Format

IoTDB provides metrics data both in JMX and Prometheus format. For JMX, you can get these metrics via org.apache.iotdb.metrics.

Next, we will choose Prometheus format data as samples to describe each kind of metric.

IoTDB Metrics

API

MetricTagDescriptionSample
entry_seconds_countname=“interface name”The total request count of the interfaceentry_seconds_count{name=“openSession”,} 1.0
entry_seconds_sumname=“interface name”The total cost seconds of the interfaceentry_seconds_sum{name=“openSession”,} 0.024
entry_seconds_maxname=“interface name”The max latency of the interfaceentry_seconds_max{name=“openSession”,} 0.024
quantity_totalname=“pointsIn”The total points inserted into IoTDBquantity_total{name=“pointsIn”,} 1.0

File

MetricTagDescriptionSample
file_sizename=“wal/seq/unseq”The current file size of wal/seq/unseq in bytesfile_size{name=“wal”,} 67.0
file_countname=“wal/seq/unseq”The current count of wal/seq/unseq filesfile_count{name=“seq”,} 1.0

Flush

MetricTagDescriptionSample
queuename=“flush”,
status=“running/waiting”
The count of current flushing tasks in running and waiting statusqueue{name=“flush”,status=“waiting”,} 0.0
queue{name=“flush”,status=“running”,} 0.0
cost_task_seconds_countname=“flush”The total count of flushing occurs till nowcost_task_seconds_count{name=“flush”,} 1.0
cost_task_seconds_maxname=“flush”The seconds of the longest flushing task takes till nowcost_task_seconds_max{name=“flush”,} 0.363
cost_task_seconds_sumname=“flush”The total cost seconds of all flushing tasks till nowcost_task_seconds_sum{name=“flush”,} 0.363

Compaction

MetricTagDescriptionSample
queuename=“compaction_inner/compaction_cross”,
status=“running/waiting”
The count of current compaction tasks in running and waiting statusqueue{name=“compaction_inner”,status=“waiting”,} 0.0
cost_task_seconds_countname=“compaction”The total count of compaction occurs till nowcost_task_seconds_count{name=“compaction”,} 1.0
cost_task_seconds_maxname=“compaction”The seconds of the longest compaction task takes till nowcost_task_seconds_max{name=“compaction”,} 0.363
cost_task_seconds_sumname=“compaction”The total cost seconds of all compaction tasks till nowcost_task_seconds_sum{name=“compaction”,} 0.363

Memory Usage

MetricTagDescriptionSample
memname=“chunkMetaData/storageGroup/mtree”Current memory size of chunkMetaData/storageGroup/mtree data in bytesmem{name=“chunkMetaData”,} 2050.0

Cache Hit Ratio

MetricTagDescriptionSample
cache_hitname=“chunk/timeSeriesMeta/bloomFilter”Cache hit ratio of chunk/timeSeriesMeta and prevention ratio of bloom filtercache_hit{name=“chunk”,} 80

Business Data

MetricTagDescriptionSample
quantityname=“timeSeries/storageGroup/device”The current count of timeSeries/storageGroup/devices in IoTDBquantity{name=“timeSeries”,} 1.0

Cluster

MetricTagDescriptionSample
cluster_node_leader_countname=“{{ip}}”The count of dataGroupLeader on each node, which reflects the distribution of leaderscluster_node_leader_count{name=“127.0.0.1”,} 2.0
cluster_uncommitted_logname=“{{ip_datagroupHeader}}”The count of uncommitted_log on each node in data groups it belongs tocluster_uncommitted_log{name=“127.0.0.1_Data-127.0.0.1-40010-raftId-0”,} 0.0
cluster_node_statusname=“{{ip}}”The current node status, 1=online 2=offlinecluster_node_status{name=“127.0.0.1”,} 1.0
cluster_elect_totalname=“{{ip}}”,status=“fail/win”The count and result (won or failed) of elections the node participated in.cluster_elect_total{name=“127.0.0.1”,status=“win”,} 1.0

Log Events

MetricTagDescriptionSample
logback_events_total{level=“trace/debug/info/warn/error”,}The count of trace/debug/info/warn/error log events till nowlogback_events_total{level=“warn”,} 0.0

JVM

Threads

MetricTagDescriptionSample
jvm_threads_live_threadsNoneThe current count of threadsjvm_threads_live_threads 25.0
jvm_threads_daemon_threadsNoneThe current count of daemon threadsjvm_threads_daemon_threads 12.0
jvm_threads_peak_threadsNoneThe max count of threads till nowjvm_threads_peak_threads 28.0
jvm_threads_states_threadsstate=“runnable/blocked/waiting/timed-waiting/new/terminated”The count of threads in each statusjvm_threads_states_threads{state=“runnable”,} 10.0

GC

MetricTagDescriptionSample
jvm_gc_pause_seconds_countaction=“end of major GC/end of minor GC”,cause=“xxxx”The total count of YGC/FGC events and its causejvm_gc_pause_seconds_count{action=“end of major GC”,cause=“Metadata GC Threshold”,} 1.0
jvm_gc_pause_seconds_sumaction=“end of major GC/end of minor GC”,cause=“xxxx”The total cost seconds of YGC/FGC and its causejvm_gc_pause_seconds_sum{action=“end of major GC”,cause=“Metadata GC Threshold”,} 0.03
jvm_gc_pause_seconds_maxaction=“end of major GC”,cause=“Metadata GC Threshold”The max cost seconds of YGC/FGC till now and its causejvm_gc_pause_seconds_max{action=“end of major GC”,cause=“Metadata GC Threshold”,} 0.0
jvm_gc_overhead_percentNoneAn approximation of the percent of CPU time used by GC activities over the last lookback period or since monitoring began, whichever is shorter, in the range [0..1]jvm_gc_overhead_percent 0.0
jvm_gc_memory_promoted_bytes_totalNoneCount of positive increases in the size of the old generation memory pool before GC to after GCjvm_gc_memory_promoted_bytes_total 8425512.0
jvm_gc_max_data_size_bytesNoneMax size of long-lived heap memory pooljvm_gc_max_data_size_bytes 2.863661056E9
jvm_gc_live_data_size_bytesSize of long-lived heap memory pool after reclamationjvm_gc_live_data_size_bytes 8450088.0
jvm_gc_memory_allocated_bytes_totalNoneIncremented for an increase in the size of the (young) heap memory pool after one GC to before the nextjvm_gc_memory_allocated_bytes_total 4.2979144E7

Memory

MetricTagDescriptionSample
jvm_buffer_memory_used_bytesid=“direct/mapped”An estimate of the memory that the Java virtual machine is using for this buffer pooljvm_buffer_memory_used_bytes{id=“direct”,} 3.46728099E8
jvm_buffer_total_capacity_bytesid=“direct/mapped”An estimate of the total capacity of the buffers in this pooljvm_buffer_total_capacity_bytes{id=“mapped”,} 0.0
jvm_buffer_count_buffersid=“direct/mapped”An estimate of the number of buffers in the pooljvm_buffer_count_buffers{id=“direct”,} 183.0
jvm_memory_committed_bytes{area=“heap/nonheap”,id=“xxx”,}The amount of memory in bytes that is committed for the Java virtual machine to usejvm_memory_committed_bytes{area=“heap”,id=“Par Survivor Space”,} 2.44252672E8
jvm_memory_committed_bytes{area=“nonheap”,id=“Metaspace”,} 3.9051264E7
jvm_memory_max_bytes{area=“heap/nonheap”,id=“xxx”,}The maximum amount of memory in bytes that can be used for memory managementjvm_memory_max_bytes{area=“heap”,id=“Par Survivor Space”,} 2.44252672E8
jvm_memory_max_bytes{area=“nonheap”,id=“Compressed Class Space”,} 1.073741824E9
jvm_memory_used_bytes{area=“heap/nonheap”,id=“xxx”,}The amount of used memoryjvm_memory_used_bytes{area=“heap”,id=“Par Eden Space”,} 1.000128376E9
jvm_memory_used_bytes{area=“nonheap”,id=“Code Cache”,} 2.9783808E7

Classes

MetricTagDescriptionSample
jvm_classes_unloaded_classes_totalThe total number of classes unloaded since the Java virtual machine has started executionjvm_classes_unloaded_classes_total 680.0
jvm_classes_loaded_classesThe number of classes that are currently loaded in the Java virtual machinejvm_classes_loaded_classes 5975.0
jvm_compilation_time_ms_total{compiler=“HotSpot 64-Bit Tiered Compilers”,}The approximate accumulated elapsed time spent in compilationjvm_compilation_time_ms_total{compiler=“HotSpot 64-Bit Tiered Compilers”,} 107092.0

If you want add your own metrics data in IoTDB, please see the [IoTDB Metric Framework] (https://github.com/apache/iotdb/tree/master/metrics) document.

How to get these metrics?

The metrics collection switch is disabled by default,you need to enable it from conf/iotdb-metric.yml

Iotdb-metric.yml

# The default value is false,change it to true and start/restart you IoTDB server, then you will get the metrics data.
enableMetric: false            

# IoTDB provides metrics data both in JMX and Prometheus format.
metricReporterList:						
  - jmx                    
  - prometheus

# You can choose the underlying inplementation of the framework, dropwizard or micrometer, the latter is recommended.
monitorType: micrometer         

# you can set the period time of push.
# This param works only when monitorType=dropwizard
pushPeriodInSecond: 5								
########################################################
#                                                      #
# if the reporter is prometheus,                       #
# then the following must be set                       #
#                                                      #
########################################################
prometheusReporterConfig:
  prometheusExporterUrl: http://localhost
  
  # From this port, you can get metrics data by http request
  prometheusExporterPort: 9091						

Then you can get metrics data as follows

  1. Enable metrics switch in iotdb-metric.yml
  2. You can just stay other config params as default.
  3. Start/Restart your IoTDB server/cluster
  4. Open your browser or use the curl command to request http://servier_ip:9001/metrics,then you will get metrics data like follows:
# HELP file_count
# TYPE file_count gauge
file_count{name="wal",} 0.0
file_count{name="unseq",} 0.0
file_count{name="seq",} 2.0
# HELP file_size
# TYPE file_size gauge
file_size{name="wal",} 0.0
file_size{name="unseq",} 0.0
file_size{name="seq",} 560.0
# HELP queue
# TYPE queue gauge
queue{name="flush",status="waiting",} 0.0
queue{name="flush",status="running",} 0.0
# HELP quantity
# TYPE quantity gauge
quantity{name="timeSeries",} 1.0
quantity{name="storageGroup",} 1.0
quantity{name="device",} 1.0
# HELP logback_events_total Number of error level events that made it to the logs
# TYPE logback_events_total counter
logback_events_total{level="warn",} 0.0
logback_events_total{level="debug",} 2760.0
logback_events_total{level="error",} 0.0
logback_events_total{level="trace",} 0.0
logback_events_total{level="info",} 71.0
# HELP mem
# TYPE mem gauge
mem{name="storageGroup",} 0.0
mem{name="mtree",} 1328.0

Integrating with Prometheus and Grafana

As above descriptions,IoTDB provides metrics data in standard Prometheus format,so we can integrate with Prometheus and Grafana directly.

The following picture describes the relationships among IoTDB, Prometheus and Grafana

iotdb_prometheus_grafana

  1. Along with running, IoTDB will collect its metrics continuously.
  2. Prometheus scrapes metrics from IoTDB at a constant interval (can be configured).
  3. Prometheus saves these metrics to its inner TSDB.
  4. Grafana queries metrics from Prometheus at a constant interval (can be configured) and then presents them on the graph.

So, we need to do some additional works to configure and deploy Prometheus and Grafana.

For instance, you can config your Prometheus as follows to get metrics data from IoTDB:

job_name: pull-metrics
honor_labels: true
honor_timestamps: true
scrape_interval: 15s
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
follow_redirects: true
static_configs:
- targets:
  - localhost:9091

The following documents may help you have a good journey with Prometheus and Grafana.

Prometheus getting_started

Prometheus scrape metrics

Grafana getting_started

Grafana query metrics from Prometheus

Here are two demo pictures of IoTDB's metrics data in Grafana.

metrics_demo_1

metrics_demo_2