IoTDB 告警功能预计支持两种模式:
写入触发:用户写入原始数据到原始时间序列,每插入一条数据都会触发 Trigger 的判断逻辑, 若满足告警要求则发送告警到下游数据接收器, 数据接收器再转发告警到外部终端。这种模式:
持续查询:用户写入原始数据到原始时间序列, ContinousQuery 定时查询原始时间序列,将查询结果写入新的时间序列, 每一次写入触发 Trigger 的判断逻辑, 若满足告警要求则发送告警到下游数据接收器, 数据接收器再转发告警到外部终端。这种模式:
随着 Trigger 模块的引入,可以实现写入触发模式的告警。
预编译好的二进制文件可在 这里 下载。
运行方法:
./alertmanager --config.file=<your_file>
可在 Quay.io 或 Docker Hub 获得。
运行方法:
docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager
如下是一个示例,可以覆盖到大部分配置规则,详细的配置规则参见 这里。
示例:
# alertmanager.yml global: # The smarthost and SMTP sender used for mail notifications. smtp_smarthost: 'localhost:25' smtp_from: 'alertmanager@example.org' # The root route on which each incoming alert enters. route: # The root route must not have any matchers as it is the entry point for # all alerts. It needs to have a receiver configured so alerts that do not # match any of the sub-routes are sent to someone. receiver: 'team-X-mails' # The labels by which incoming alerts are grouped together. For example, # multiple alerts coming in for cluster=A and alertname=LatencyHigh would # be batched into a single group. # # To aggregate by all possible labels use '...' as the sole label name. # This effectively disables aggregation entirely, passing through all # alerts as-is. This is unlikely to be what you want, unless you have # a very low alert volume or your upstream notification system performs # its own grouping. Example: group_by: [...] group_by: ['alertname', 'cluster'] # When a new group of alerts is created by an incoming alert, wait at # least 'group_wait' to send the initial notification. # This way ensures that you get multiple alerts for the same group that start # firing shortly after another are batched together on the first # notification. group_wait: 30s # When the first notification was sent, wait 'group_interval' to send a batch # of new alerts that started firing for that group. group_interval: 5m # If an alert has successfully been sent, wait 'repeat_interval' to # resend them. repeat_interval: 3h # All the above attributes are inherited by all child routes and can # overwritten on each. # The child route trees. routes: # This routes performs a regular expression match on alert labels to # catch alerts that are related to a list of services. - match_re: service: ^(foo1|foo2|baz)$ receiver: team-X-mails # The service has a sub-route for critical alerts, any alerts # that do not match, i.e. severity != critical, fall-back to the # parent node and are sent to 'team-X-mails' routes: - match: severity: critical receiver: team-X-pager - match: service: files receiver: team-Y-mails routes: - match: severity: critical receiver: team-Y-pager # This route handles all alerts coming from a database service. If there's # no team to handle it, it defaults to the DB team. - match: service: database receiver: team-DB-pager # Also group alerts by affected database. group_by: [alertname, cluster, database] routes: - match: owner: team-X receiver: team-X-pager - match: owner: team-Y receiver: team-Y-pager # Inhibition rules allow to mute a set of alerts given that another alert is # firing. # We use this to mute any warning-level notifications if the same alert is # already critical. inhibit_rules: - source_match: severity: 'critical' target_match: severity: 'warning' # Apply inhibition if the alertname is the same. # CAUTION: # If all label names listed in `equal` are missing # from both the source and target alerts, # the inhibition rule will apply! equal: ['alertname'] receivers: - name: 'team-X-mails' email_configs: - to: 'team-X+alerts@example.org, team-Y+alerts@example.org' - name: 'team-X-pager' email_configs: - to: 'team-X+alerts-critical@example.org' pagerduty_configs: - routing_key: <team-X-key> - name: 'team-Y-mails' email_configs: - to: 'team-Y+alerts@example.org' - name: 'team-Y-pager' pagerduty_configs: - routing_key: <team-Y-key> - name: 'team-DB-pager' pagerduty_configs: - routing_key: <team-DB-key>
在后面的示例中,我们采用的配置如下:
# alertmanager.yml global: smtp_smarthost: '' smtp_from: '' smtp_auth_username: '' smtp_auth_password: '' smtp_require_tls: false route: group_by: ['alertname'] group_wait: 1m group_interval: 10m repeat_interval: 10h receiver: 'email' receivers: - name: 'email' email_configs: - to: '' inhibit_rules: - source_match: severity: 'critical' target_match: severity: 'warning' equal: ['alertname']
AlertManager API 分为 v1 和 v2 两个版本,当前 AlertManager API 版本为 v2 (配置参见 api/v2/openapi.yaml)。
默认配置的前缀为 /api/v1 或 /api/v2, 发送告警的 endpoint 为 /api/v1/alerts 或 /api/v2/alerts。 如果用户指定了 --web.route-prefix, 例如 --web.route-prefix=/alertmanager/, 那么前缀将会变为 /alertmanager/api/v1 或 /alertmanager/api/v2, 发送告警的 endpoint 变为 /alertmanager/api/v1/alerts 或 /alertmanager/api/v2/alerts。
用户通过自行创建 Java 类、编写钩子中的逻辑来定义一个触发器。 具体配置流程参见 Triggers。
下面的示例创建了 org.apache.iotdb.trigger.ClusterAlertingExample 类, 其 alertManagerHandler 成员变量可发送告警至地址为 http://127.0.0.1:9093/ 的 AlertManager 实例。
当 value > 100.0 时,发送 severity 为 critical 的告警; 当 50.0 < value <= 100.0 时,发送 severity 为 warning 的告警。
package org.apache.iotdb.trigger; import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerConfiguration; import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerEvent; import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerHandler; import org.apache.iotdb.trigger.api.Trigger; import org.apache.iotdb.trigger.api.TriggerAttributes; import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType; import org.apache.iotdb.tsfile.write.record.Tablet; import org.apache.iotdb.tsfile.write.schema.MeasurementSchema; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.io.IOException; import java.util.HashMap; import java.util.List; public class ClusterAlertingExample implements Trigger { private static final Logger LOGGER = LoggerFactory.getLogger(ClusterAlertingExample.class); private final AlertManagerHandler alertManagerHandler = new AlertManagerHandler(); private final AlertManagerConfiguration alertManagerConfiguration = new AlertManagerConfiguration("http://127.0.0.1:9093/api/v2/alerts"); private String alertname; private final HashMap<String, String> labels = new HashMap<>(); private final HashMap<String, String> annotations = new HashMap<>(); @Override public void onCreate(TriggerAttributes attributes) throws Exception { alertname = "alert_test"; labels.put("series", "root.ln.wf01.wt01.temperature"); labels.put("value", ""); labels.put("severity", ""); annotations.put("summary", "high temperature"); annotations.put("description", "{{.alertname}}: {{.series}} is {{.value}}"); alertManagerHandler.open(alertManagerConfiguration); } @Override public void onDrop() throws IOException { alertManagerHandler.close(); } @Override public boolean fire(Tablet tablet) throws Exception { List<MeasurementSchema> measurementSchemaList = tablet.getSchemas(); for (int i = 0, n = measurementSchemaList.size(); i < n; i++) { if (measurementSchemaList.get(i).getType().equals(TSDataType.DOUBLE)) { // for example, we only deal with the columns of Double type double[] values = (double[]) tablet.values[i]; for (double value : values) { if (value > 100.0) { LOGGER.info("trigger value > 100"); labels.put("value", String.valueOf(value)); labels.put("severity", "critical"); AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations); alertManagerHandler.onEvent(alertManagerEvent); } else if (value > 50.0) { LOGGER.info("trigger value > 50"); labels.put("value", String.valueOf(value)); labels.put("severity", "warning"); AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations); alertManagerHandler.onEvent(alertManagerEvent); } } } } return true; } }
如下的 sql 语句在 root.ln.wf01.wt01.temperature 时间序列上注册了名为 root-ln-wf01-wt01-alert、 运行逻辑由 org.apache.iotdb.trigger.ClusterAlertingExample 类定义的触发器。
CREATE STATELESS TRIGGER `root-ln-wf01-wt01-alert` AFTER INSERT ON root.ln.wf01.wt01.temperature AS "org.apache.iotdb.trigger.ClusterAlertingExample" USING URI 'http://jar/ClusterAlertingExample.jar'
当我们完成 AlertManager 的部署和启动、Trigger 的创建, 可以通过向时间序列写入数据来测试告警功能。
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (1, 0); INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (2, 30); INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (3, 60); INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (4, 90); INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (5, 120);
执行完上述写入语句后,可以收到告警邮件。由于我们的 AlertManager 配置中设定 severity 为 critical 的告警 会抑制 severity 为 warning 的告警,我们收到的告警邮件中只包含写入 (5, 120) 后触发的告警。