| <!-- |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| --> |
| # Monitor Tool |
| |
| ## 1. **Prometheus Integration** |
| |
| ### 1.1 **Prometheus Metric Mapping** |
| |
| The following table illustrates the mapping of IoTDB metrics to the Prometheus-compatible format. For a given metric with `Metric Name = name` and tags `K1=V1, ..., Kn=Vn`, the mapping follows this pattern, where `value` represents the actual measurement. |
| |
| | **Metric Type** | **Mapping** | |
| | ---------------- | ------------------------------------------------------------ | |
| | Counter | name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value | |
| | AutoGauge, Gauge | name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value | |
| | Histogram | name_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.5"} value <br> name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.99"} value | |
| | Rate | name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m1"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m5"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m15"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="mean"} value | |
| | Timer | name_seconds_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.5"} value <br> name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.99"} value | |
| |
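The mapping in the table can be sketched in a few lines of Python. This is only an illustration of the naming pattern, not IoTDB's actual reporter code: the function names `format_labels` and `series_names`, and the sample metric and tag values in the usage note, are hypothetical.

```python
def format_labels(base, tags, extra=None):
    """Render labels in the order used by the table: node labels, tags, extras."""
    labels = {**base, **tags, **(extra or {})}
    return "{" + ", ".join(f'{key}="{value}"' for key, value in labels.items()) + "}"


def series_names(metric_type, name, tags, base):
    """Return the Prometheus series (name plus labels, without values)
    that one IoTDB metric maps to, following the table above."""
    kind = metric_type.lower()
    plain = format_labels(base, tags)
    if kind == "counter":
        return [f"{name}_total{plain}"]
    if kind in ("gauge", "autogauge"):
        return [f"{name}{plain}"]
    if kind in ("histogram", "timer"):
        # Timers follow the Histogram pattern with a "_seconds" suffix.
        prefix = name if kind == "histogram" else f"{name}_seconds"
        series = [f"{prefix}_{suffix}{plain}" for suffix in ("max", "sum", "count")]
        series += [
            prefix + format_labels(base, tags, {"quantile": q}) for q in ("0.5", "0.99")
        ]
        return series
    if kind == "rate":
        series = [f"{name}_total{plain}"]
        series += [
            f"{name}_total" + format_labels(base, tags, {"rate": r})
            for r in ("m1", "m5", "m15", "mean")
        ]
        return series
    raise ValueError(f"unknown metric type: {metric_type}")
```

For example, calling `series_names("Timer", "cost_task", {}, base)` with `base = {"cluster": "c1", "nodeType": "datanode", "nodeId": "1"}` yields five series, starting with `cost_task_seconds_max{cluster="c1", nodeType="datanode", nodeId="1"}`.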
| ### 1.2 **Configuration File** |
| |
| To enable Prometheus metric collection in IoTDB, follow these steps: |
| |
| 1. Taking a DataNode as an example, set the following parameters in the `iotdb-system.properties` configuration file: |
| |
| ```Properties |
| dn_metric_reporter_list=PROMETHEUS |
| dn_metric_level=CORE |
| dn_metric_prometheus_reporter_port=9091 |
| ``` |
| |
| 2. Start the IoTDB DataNodes. |
| 3. Use a web browser or `curl` to access `http://server_ip:9091/metrics` (replace `server_ip` with the DataNode address) and retrieve the metric data, for example: |
| |
| ```Plain |
| ... |
| # HELP file_count |
| # TYPE file_count gauge |
| file_count{name="wal",} 0.0 |
| file_count{name="unseq",} 0.0 |
| file_count{name="seq",} 2.0 |
| ... |
| ``` |
| |
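If you want to inspect the endpoint from a script rather than a browser, the following is a minimal sketch that downloads the text and parses the sample lines. It handles only the simple `name{labels} value` form shown above, not the full Prometheus exposition format; the function names `fetch_metrics` and `parse_metrics` are illustrative and not part of IoTDB.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces, e.g. name="wal".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url, timeout=5.0):
    """Download the raw metrics text, e.g. from http://server_ip:9091/metrics."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments
    (# HELP, # TYPE), blank lines, and anything that is not a sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

A typical use would be `parse_metrics(fetch_metrics("http://server_ip:9091/metrics"))`, with `server_ip` replaced by the address of a running DataNode.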
| ### 1.3 **Prometheus + Grafana Integration** |
| |
| IoTDB exposes monitoring data in the standard Prometheus-compatible format. Prometheus collects and stores these metrics, while Grafana is used for visualization. |
| |
| **Integration Workflow** |
| |
| The following picture describes the relationships among IoTDB, Prometheus and Grafana: |
| |
|  |
| |
| IoTDB-Prometheus-Grafana Workflow |
| |
| 1. IoTDB continuously collects monitoring metrics. |
| 2. Prometheus collects metrics from IoTDB at a configurable interval. |
| 3. Prometheus stores the collected metrics in its internal time-series database (TSDB). |
| 4. Grafana queries Prometheus at a configurable interval and visualizes the metrics. |
| |
| **Prometheus Configuration Example** |
| |
| To configure Prometheus to collect IoTDB metrics, add a scrape job such as the following under `scrape_configs` in the `prometheus.yml` file: |
| |
| ```YAML |
| job_name: pull-metrics |
| honor_labels: true |
| honor_timestamps: true |
| scrape_interval: 15s |
| scrape_timeout: 10s |
| metrics_path: /metrics |
| scheme: http |
| follow_redirects: true |
| static_configs: |
|   - targets: |
|       - localhost:9091 |
| ``` |
| |
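Once Prometheus is running with this job, one way to confirm that the IoTDB target is being scraped is to query the built-in `up` metric through the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes Prometheus listens on its default port 9090 on `localhost`; the helper names `build_query_url` and `targets_up` are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def targets_up(response_body):
    """Map each scraped instance to True (up) or False (down),
    given the JSON body returned for an `up` query."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return {
        sample["metric"].get("instance", ""): sample["value"][1] == "1"
        for sample in payload["data"]["result"]
    }


if __name__ == "__main__":
    # Assumption: Prometheus is reachable at localhost:9090.
    url = build_query_url("http://localhost:9090", 'up{job="pull-metrics"}')
    with urlopen(url, timeout=5.0) as response:
        print(targets_up(response.read().decode("utf-8")))
```

An instance mapped to `False` means the last scrape failed; check that the DataNode is running and that port 9091 is reachable from the Prometheus host.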
| For more details, refer to: |
| |
| - Prometheus Documentation: |
| - [Prometheus getting_started](https://prometheus.io/docs/prometheus/latest/getting_started/) |
| - [Prometheus scrape metrics](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) |
| - Grafana Documentation: |
| - [Grafana getting_started](https://grafana.com/docs/grafana/latest/getting-started/getting-started/) |
| - [Grafana query metrics from Prometheus](https://prometheus.io/docs/visualization/grafana/#grafana-support-for-prometheus) |
| |
| ## 2. **Apache IoTDB Dashboard** |
| |
| The Apache IoTDB Dashboard is designed for centralized operations and management, and lets you monitor multiple clusters from a single panel. |
| |
|  |
| |
|  |
| |
| You can access the Dashboard's JSON file in TimechoDB. |
| |
| ### 2.1 **Cluster Overview** |
| |
| Including but not limited to: |
| |
| - Total number of CPU cores, memory capacity, and disk space in the cluster. |
| - Number of ConfigNodes and DataNodes in the cluster. |
| - Cluster uptime. |
| - Cluster write throughput. |
| - Current CPU, memory, and disk utilization across all nodes. |
| - Detailed information for individual nodes. |
| |
|  |
| |
| ### 2.2 **Data Writing** |
| |
| Including but not limited to: |
| |
| - Average write latency, median latency, and 99th percentile latency. |
| - Number and size of WAL files. |
| - WAL flush SyncBuffer latency per node. |
| |
|  |
| |
| ### 2.3 **Data Querying** |
| |
| Including but not limited to: |
| |
| - Time series metadata query load time per node. |
| - Time series data read duration per node. |
| - Time series metadata modification duration per node. |
| - Chunk metadata list loading time per node. |
| - Chunk metadata modification duration per node. |
| - Chunk metadata-based filtering duration per node. |
| - Average time required to construct a Chunk Reader. |
| |
|  |
| |
| ### 2.4 **Storage Engine** |
| |
| Including but not limited to: |
| |
| - File count and size by type. |
| - Number and size of TsFiles at different processing stages. |
| - Task count and execution duration for various operations. |
| |
|  |
| |
| ### 2.5 **System Monitoring** |
| |
| Including but not limited to: |
| |
| - System memory, swap memory, and process memory usage. |
| - Disk space, file count, and file size statistics. |
| - JVM garbage collection (GC) time percentage, GC events by type, GC data volume, and heap memory utilization across generations. |
| - Network throughput and packet transmission rate. |
| |
|  |
| |
|  |
| |
|  |