blob: dbd777aefbeca77176effc5d24771ab2dd5d3786 [file] [log] [blame] [view]
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Monitor Tool
The deployment of monitoring tools can be referenced in the document [Monitoring Panel Deployment](../Deployment-and-Maintenance/Monitoring-panel-deployment.md) chapter.
## 1. Prometheus Mapping Relationship for Monitoring Metrics
> For a monitoring metric with Metric Name as `name`, Tags as `K1=V1, ..., Kn=Vn`, the following mapping applies, where `value` is the specific value.
| Monitoring Metric Type | Mapping Relationship |
| ---------------------- | ------------------------------------------------------------ |
| Counter | `name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` |
| AutoGauge, Gauge | `name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` |
| Histogram | `name_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` <br> `name_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` <br> `name_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` <br> `name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", quantile="0.5"} value` <br> `name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", quantile="0.99"} value` |
| Rate | `name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` <br> `name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", rate="m1"} value` <br> `name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", rate="m5"} value` <br> `name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", rate="m15"} value` <br> `name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", rate="mean"} value` |
| Timer | `name_seconds_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` <br> `name_seconds_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` <br> `name_seconds_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn"} value` <br> `name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", quantile="0.5"} value` <br> `name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId",k1="V1" , ..., Kn="Vn", quantile="0.99"} value` |
## 2. Modifying Configuration Files
1) Taking DataNode as an example, modify the `iotdb-system.properties` configuration file as follows:
```properties
dn_metric_reporter_list=PROMETHEUS
dn_metric_level=CORE
dn_metric_prometheus_reporter_port=9091
```
2) Start the IoTDB DataNode.
3) Open a browser or use `curl` to access `http://server_ip:9091/metrics`, and you will get metric data as follows:
```
...
# HELP file_count
# TYPE file_count gauge
file_count{name="wal",} 0.0
file_count{name="unseq",} 0.0
file_count{name="seq",} 2.0
...
```
## 3. Prometheus + Grafana
As shown above, IoTDB exposes monitoring metrics in the standard Prometheus format. You can use Prometheus to collect and store these metrics and Grafana to visualize them.
The relationship between IoTDB, Prometheus, and Grafana is illustrated below:
![iotdb_prometheus_grafana](/img/UserGuide/System-Tools/Metrics/iotdb_prometheus_grafana.png)
1. IoTDB continuously collects monitoring metrics during operation.
2. Prometheus pulls monitoring metrics from IoTDB's HTTP interface at fixed intervals (configurable).
3. Prometheus stores the pulled monitoring metrics in its TSDB.
4. Grafana queries monitoring metrics from Prometheus at fixed intervals (configurable) and visualizes them.
From the interaction flow, it is clear that additional work is required to deploy and configure Prometheus and Grafana.
For example, you can configure Prometheus as follows (some parameters can be adjusted as needed) to pull metrics from IoTDB:
```yaml
job_name: pull-metrics
honor_labels: true
honor_timestamps: true
scrape_interval: 15s
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
follow_redirects: true
static_configs:
- targets:
- localhost:9091
```
For more details, refer to the following documents:
- [Prometheus Installation and Usage Documentation](https://prometheus.io/docs/prometheus/latest/getting_started/)
- [Prometheus Configuration for Pulling Metrics from HTTP Interface](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)
- [Grafana Installation and Usage Documentation](https://grafana.com/docs/grafana/latest/getting-started/getting-started/)
- [Grafana Querying Data from Prometheus and Plotting Documentation](https://prometheus.io/docs/visualization/grafana/#grafana-support-for-prometheus)
## 4. Apache IoTDB Dashboard
The `Apache IoTDB Dashboard` is a companion product of TimechoDB, supporting unified centralized operation and maintenance management. It allows monitoring multiple clusters through a single monitoring panel. You can contact the business team to obtain the Dashboard's JSON file.
![Apache IoTDB Dashboard](/img/%E7%9B%91%E6%8E%A7%20default%20cluster.png)
![Apache IoTDB Dashboard](/img/%E7%9B%91%E6%8E%A7%20cluster2.png)
### 4.1 Cluster Overview
You can monitor, but not limited to:
- Total CPU cores, total memory space, total disk space of the cluster.
- Number of ConfigNodes and DataNodes in the cluster.
- Cluster uptime.
- Cluster write speed.
- Current CPU, memory, and disk usage of each node in the cluster.
- Node-specific information.
![](/img/%E7%9B%91%E6%8E%A7%20%E6%A6%82%E8%A7%88.png)
### 4.2 Data Writing
You can monitor, but not limited to:
- Average write latency, median latency, 99th percentile latency.
- Number and size of WAL files.
- Node WAL flush SyncBuffer latency.
![](/img/%E7%9B%91%E6%8E%A7%20%E5%86%99%E5%85%A5.png)
### 4.3 Data Query
You can monitor, but not limited to:
- Node query loading time series metadata latency.
- Node query reading time series latency.
- Node query modifying time series metadata latency.
- Node query loading Chunk metadata list latency.
- Node query modifying Chunk metadata latency.
- Node query filtering by Chunk metadata latency.
- Node query constructing Chunk Reader latency average.
![](/img/%E7%9B%91%E6%8E%A7%20%E6%9F%A5%E8%AF%A2.png)
### 4.4 Storage Engine
You can monitor, but not limited to:
- Number and size of files by type.
- Number and size of TsFiles in various stages.
- Number and latency of various tasks.
![](/img/%E7%9B%91%E6%8E%A7%20%E5%AD%98%E5%82%A8%E5%BC%95%E6%93%8E.png)
### 4.5 System Monitoring
You can monitor, but not limited to:
- System memory, swap memory, process memory.
- Disk space, file count, file size.
- JVM GC time ratio, GC count by type, GC data volume, heap memory usage by generation.
- Network transmission rate, packet sending rate.
![](/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9F%20%E5%86%85%E5%AD%98%E4%B8%8E%E7%A1%AC%E7%9B%98.png)
![](/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9Fjvm.png)
![](/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9F%20%E7%BD%91%E7%BB%9C.png)