Monitor Tool

1. Prometheus Integration

1.1 Prometheus Metric Mapping

The following mappings show how IoTDB metrics are translated into the Prometheus-compatible format. For a metric with name name and tags k1=V1, ..., kn=Vn, each metric type maps to the pattern below, where value represents the actual measurement.

Counter:
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value

AutoGauge, Gauge:
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value

Histogram:
  name_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value
  name_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value
  name_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", quantile="0.5"} value
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", quantile="0.99"} value

Rate:
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", rate="m1"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", rate="m5"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", rate="m15"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", rate="mean"} value

Timer:
  name_seconds_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value
  name_seconds_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value
  name_seconds_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn"} value
  name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", quantile="0.5"} value
  name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", k1="V1", ..., kn="Vn", quantile="0.99"} value
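As a concrete illustration (the metric name, tag, and label values below are hypothetical placeholders, not actual IoTDB metrics), a Counter named entry_count with the tag type=insert reported by DataNode 1 of a cluster named defaultCluster would be exposed roughly as:

entry_count_total{cluster="defaultCluster", nodeType="DataNode", nodeId="1", type="insert"} 42.0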

1.2 Configuration File

To enable Prometheus metric collection in IoTDB, modify the configuration file as follows:

  1. Taking DataNode as an example, modify the iotdb-system.properties configuration file as follows:
dn_metric_reporter_list=PROMETHEUS
dn_metric_level=CORE
dn_metric_prometheus_reporter_port=9091
  2. Start the IoTDB DataNodes.
  3. Use a web browser or curl to access http://server_ip:9091/metrics to retrieve metric data, such as:
...
# HELP file_count
# TYPE file_count gauge
file_count{name="wal",} 0.0
file_count{name="unseq",} 0.0
file_count{name="seq",} 2.0
...
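For a quick command-line check, the output can also be filtered to a single metric, for example the file_count gauge shown above (this sketch assumes the DataNode runs locally with the default reporter port 9091):

curl -s http://localhost:9091/metrics | grep '^file_count'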

1.3 Prometheus + Grafana Integration

IoTDB exposes monitoring data in the standard Prometheus-compatible format. Prometheus collects and stores these metrics, while Grafana is used for visualization.

Integration Workflow

The following picture describes the relationships among IoTDB, Prometheus and Grafana:

Figure: IoTDB-Prometheus-Grafana Workflow

  1. IoTDB continuously collects monitoring metrics.
  2. Prometheus collects metrics from IoTDB at a configurable interval.
  3. Prometheus stores the collected metrics in its internal time-series database (TSDB).
  4. Grafana queries Prometheus at a configurable interval and visualizes the metrics.
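For step 4, Grafana needs Prometheus configured as a data source. This can be done in the Grafana UI or via a provisioning file; the following is a minimal sketch, assuming Prometheus runs locally on its default port 9090 (the data source name and file location are deployment-specific assumptions):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true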

Prometheus Configuration Example

To configure Prometheus to collect IoTDB metrics, add a scrape job to the prometheus.yml file as follows:

scrape_configs:
  - job_name: pull-metrics
    honor_labels: true
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets:
          - localhost:9091
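Once Prometheus is scraping the endpoint, ingestion can be verified in the Prometheus web UI (http://localhost:9090 by default) with a simple query such as the one below, which uses the file_count gauge from the sample output above (adjust the label to your deployment):

file_count{name="seq"}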

For more details, refer to the official Prometheus documentation.

2. Apache IoTDB Dashboard

The Apache IoTDB Dashboard is designed for unified, centralized operations and management, enabling multiple clusters to be monitored from a single panel.

Figure: Apache IoTDB Dashboard

The Dashboard's JSON file is available in TimechoDB.

2.1 Cluster Overview

Including but not limited to:

  • Total number of CPU cores, memory capacity, and disk space in the cluster.
  • Number of ConfigNodes and DataNodes in the cluster.
  • Cluster uptime.
  • Cluster write throughput.
  • Current CPU, memory, and disk utilization across all nodes.
  • Detailed information for individual nodes.

2.2 Data Writing

Including but not limited to:

  • Average write latency, median latency, and 99th-percentile latency.
  • Number and size of WAL files.
  • WAL flush SyncBuffer latency per node.

2.3 Data Querying

Including but not limited to:

  • Time series metadata query load time per node.
  • Time series data read duration per node.
  • Time series metadata modification duration per node.
  • Chunk metadata list loading time per node.
  • Chunk metadata modification duration per node.
  • Chunk metadata-based filtering duration per node.
  • Average time required to construct a Chunk Reader.

2.4 Storage Engine

Including but not limited to:

  • File count and size by type.
  • Number and size of TsFiles at different processing stages.
  • Task count and execution duration for various operations.

2.5 System Monitoring

Including but not limited to:

  • System memory, swap memory, and process memory usage.
  • Disk space, file count, and file size statistics.
  • JVM garbage collection (GC) time percentage, GC events by type, GC data volume, and heap memory utilization across generations.
  • Network throughput and packet transmission rate.