The alarm core is driven by a collection of rules, which are defined in config/alarm-settings.yml. An alarm rule definition has three parts.
Defines the relation between scope and entity name.
There are two types of rules: individual rules and composite rules. A composite rule is a combination of individual rules.
An alarm rule is made up of the following elements:
Rule name: a unique name, which must end with _rule.
Tags: searchable tag keys are configured through core/default/searchableAlarmTags, or through the system environment variable SW_SEARCHABLE_ALARM_TAG_KEYS. The key level is supported by default.
Label settings are required by the meter-system. They are used to store metrics from label-system platforms, such as Prometheus and Micrometer. The four label settings must implement LabeledValueHolder.
Threshold: for multiple-value metrics, such as percentile, the threshold is a list of the form value1, value2, value3, value4, value5. Each value serves as the threshold for the corresponding value of the metrics. Set a value to - if you do not wish to trigger the alarm by that value. For example, value1 is the threshold of P50, and -, -, value3, value4, value5 means that there is no threshold for P50 and P75 in the percentile alarm rule.
OP: the operator. The supported operators are >, >=, <, <=, and =. We welcome contributions of all OPs.
Count: if the number of times the metrics match the condition within the period reaches count, then an alarm will be sent.

NOTE: Composite rules are only applicable to alarm rules targeting the same entity level, such as service-level alarm rules (service_percent_rule && service_resp_time_percentile_rule). Do not compose alarm rules of different entity levels, such as an alarm rule of the service metrics with another rule of the endpoint metrics.
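The interplay of threshold, op, period, and count described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OAP server's implementation: the names RuleWindow, parse_thresholds, and matches are hypothetical, a multiple-value data point is assumed to match when any value with a configured threshold satisfies the operator, and silence-period is left out.

```python
import operator
from collections import deque

# The operators listed in the rule definition.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


def parse_thresholds(raw):
    """Turn "1000,-,1000" into [1000.0, None, 1000.0]; "-" disables a slot."""
    return [None if part.strip() == "-" else float(part)
            for part in str(raw).split(",")]


def matches(values, thresholds, op):
    """Assumption: a data point matches if ANY value with a threshold satisfies op."""
    compare = OPS[op]
    return any(t is not None and compare(v, t)
               for v, t in zip(values, thresholds))


class RuleWindow:
    """Hypothetical helper that keeps the last `period` checks of one rule."""

    def __init__(self, threshold, op, period, count):
        self.thresholds = parse_thresholds(threshold)
        self.op = op
        self.count = count
        self.points = deque(maxlen=period)  # old checks fall out automatically

    def add(self, values):
        """Record one data point and report whether the alarm would fire."""
        self.points.append(matches(values, self.thresholds, self.op))
        return sum(self.points) >= self.count


# No threshold for P50 and P75; P90, P95, and P99 alarm above 1000.
window = RuleWindow(threshold="-,-,1000,1000,1000", op=">", period=10, count=3)
results = [window.add([2000, 2000, p90, 500, 500])
           for p90 in (1500, 900, 1200, 1300)]
print(results)  # [False, False, False, True]
```

The P50 and P75 values of 2000 never count because their slots are set to -, and the alarm fires only once the third matching data point arrives within the window.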
A composite rule is made up of the following elements:
Rule name: a unique name, which must end with _rule.
Expression: specifies how to compose rules. It supports &&, ||, and ().

rules:
  # Rule unique name, must end with `_rule`.
  endpoint_percent_rule:
    # Metrics value need to be long, double or int
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 3
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 10
    # Specify if the rule can send notification or just as a condition of composite rule
    only-as-condition: false
    tags:
      level: WARNING
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] Default, match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    # Single value metrics threshold.
    threshold: 85
    op: <
    period: 10
    count: 4
    only-as-condition: false
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    # Multiple value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    only-as-condition: false
  meter_service_status_code_rule:
    metrics-name: meter_status_code
    exclude-labels:
      - "200"
    op: ">"
    threshold: 10
    period: 10
    count: 3
    silence-period: 5
    message: The request number of entity {name} non-200 status is more than expected.
    only-as-condition: false
composite-rules:
  comp_rule:
    # Must satisfy percent rule and resp time rule
    expression: service_percent_rule && service_resp_time_percentile_rule
    message: Service {name} successful rate is less than 80% and P50 of response time is over 1000ms
    tags:
      level: CRITICAL
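To make the composite expression syntax concrete, here is a minimal Python sketch that evaluates an expression built from rule names, &&, ||, and () against the current trigger state of each individual rule. It is an illustration only: the function names are hypothetical, giving && a higher precedence than || is an assumption (the usual convention), and the OAP server may evaluate expressions differently.

```python
import re

# A token is an operator, a parenthesis, or a rule name.
TOKEN = re.compile(r"\s*(&&|\|\||\(|\)|[A-Za-z0-9_]+)")


def tokenize(expression):
    """Split a composite expression into tokens."""
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression, triggered):
    """Evaluate the expression against a {rule_name: bool} mapping."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "||":
            pos += 1
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&&":
            pos += 1
            right = parse_atom()
            value = value and right
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&&", "||", ")"):
            raise ValueError(f"unexpected token {token!r}")
        # Rules that are unknown or not triggered count as False.
        return bool(triggered.get(token, False))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result


state = {"service_percent_rule": True,
         "service_resp_time_percentile_rule": False}
print(evaluate("service_percent_rule && service_resp_time_percentile_rule", state))  # False
print(evaluate("service_percent_rule || service_resp_time_percentile_rule", state))  # True
```

A small hand-written parser is used instead of eval so that rule names are never executed as code.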
For convenience's sake, we have provided a default alarm-settings.yml in our release, which includes a set of default rules.
The metrics names are defined in the official OAL scripts and MAL scripts. Event names can also serve as metrics names; all possible event names can be found in the Event doc.
Currently, metrics from the Service, Service Instance, Endpoint, Service Relation, Service Instance Relation, and Endpoint Relation scopes can be used in alarms. The Database access scope is treated the same as Service.
Submit an issue or a pull request if you want to support any other scopes in alarm.
The Webhook requires the peer to be a web container. The alarm message will be sent through an HTTP POST with the application/json content type. The JSON format is based on List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage> with the following key information:
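scopeId, scope: all scopes are defined in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine.
ruleName: the rule name configured in alarm-settings.yml.
tags: the tags configured in alarm-settings.yml.

See the following example: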
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceA",
  "id0": "12",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000,
  "tags": [{
    "key": "level",
    "value": "WARNING"
  }]
}, {
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceB",
  "id0": "23",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage yyy",
  "startTime": 1560524171000,
  "tags": [{
    "key": "level",
    "value": "CRITICAL"
  }]
}]
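On the receiving side, a webhook endpoint only needs to accept the POST and decode the JSON array. The following Python sketch uses only the standard library; the names summarize, AlarmHandler, and serve, as well as port 8080, are hypothetical choices for illustration, and the field names are taken from the example payload above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(body):
    """Parse the JSON array of alarm messages and return one line per alarm."""
    lines = []
    for alarm in json.loads(body):
        tags = {tag["key"]: tag["value"] for tag in alarm.get("tags", [])}
        level = tags.get("level", "UNKNOWN")
        lines.append(f"[{level}] {alarm['scope']} {alarm['name']}: "
                     f"{alarm['alarmMessage']} ({alarm['ruleName']})")
    return lines


class AlarmHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver that prints every alarm it is sent."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the receiver; call this manually, since it blocks until interrupted."""
    HTTPServer(("", port), AlarmHandler).serve_forever()


sample = """[{
  "scopeId": 1, "scope": "SERVICE", "name": "serviceA",
  "id0": "12", "id1": "", "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx", "startTime": 1560524171000,
  "tags": [{"key": "level", "value": "WARNING"}]
}]"""
print(summarize(sample))
```

Calling serve() and pointing the webhook URL at the chosen port is enough to try it locally.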
The alarm message will be sent through a remote gRPC method in Protobuf format. The message contains key information, which is defined in oap-server/server-alarm-plugin/src/main/proto/alarm-hook.proto.
Part of the protocol looks like this:
message AlarmMessage {
    int64 scopeId = 1;
    string scope = 2;
    string name = 3;
    string id0 = 4;
    string id1 = 5;
    string ruleName = 6;
    string alarmMessage = 7;
    int64 startTime = 8;
    AlarmTags tags = 9;
}

message AlarmTags {
    // String key, String value pair.
    repeated KeyStringValuePair data = 1;
}

message KeyStringValuePair {
    string key = 1;
    string value = 2;
}
Follow the Getting Started with Incoming Webhooks guide and create new Webhooks.
The alarm message will be sent through an HTTP POST with the application/json content type if you have configured Slack Incoming Webhooks as follows:
slackHooks:
  textTemplate: |-
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": ":alarm_clock: *Apache Skywalking Alarm* \n **%s**."
      }
    }
  webhooks:
    - https://hooks.slack.com/services/x/y/z
Note that only the WeChat Company Edition (WeCom) supports WebHooks. To use the WeChat WebHook, follow the WeChat Webhooks guide. The alarm message will be sent through an HTTP POST with the application/json content type after you have set up WeChat Webhooks as follows:
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=dummy_key
Follow the Dingtalk Webhooks guide and create new Webhooks. For security purposes, you can configure an optional secret for an individual webhook URL. The alarm message will be sent through an HTTP POST with the application/json content type if you have configured Dingtalk Webhooks as follows:
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=dummy_token
      secret: dummysecret
Follow the Feishu Webhooks guide and create new Webhooks. For security purposes, you can configure an optional secret for an individual webhook URL. If you would like to direct a text to specific users, you can configure ats, a comma-separated list of Feishu user_ids. The alarm message will be sent through an HTTP POST with the application/json content type if you have configured Feishu Webhooks as follows:
feishuHooks:
  textTemplate: |-
    {
      "msg_type": "text",
      "content": {
        "text": "Apache SkyWalking Alarm: \n %s."
      },
      "ats":"feishu_user_id_1,feishu_user_id_2"
    }
  webhooks:
    - url: https://open.feishu.cn/open-apis/bot/v2/hook/dummy_token
      secret: dummysecret
Follow the WeLink Webhooks guide and create new Webhooks. The alarm message will be sent through an HTTP POST with the application/json content type if you have configured WeLink Webhooks as follows:
welinkHooks:
  textTemplate: "Apache SkyWalking Alarm: \n %s."
  webhooks:
    # you may find your own client_id and client_secret in your app, below are dummy, need to change.
    - client_id: "dummy_client_id"
      client_secret: dummy_secret_key
      access_token_url: https://open.welink.huaweicloud.com/api/auth/v2/tickets
      message_url: https://open.welink.huaweicloud.com/api/welinkim/v1/im-service/chat/group-chat
      # if you send to multi group at a time, separate group_ids with commas, e.g. "123xx","456xx"
      group_ids: "dummy_group_id"
      # make a name you like for the robot, it will display in group
      robot_name: robot
Since 6.5.0, the alarm settings can be updated dynamically at runtime by Dynamic Configuration, which will override the settings in alarm-settings.yml.
To determine whether an alarm rule is triggered, SkyWalking needs to cache the metrics of a time window for each alarm rule. If any attribute (metrics-name, op, threshold, period, count, etc.) of a rule is changed, the sliding window will be destroyed and re-created, causing the alarm evaluation of this specific rule to start over.
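The reset behaviour can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the OAP server's code: WindowRegistry and apply are hypothetical names, and a rule is represented as a plain dictionary of its attributes.

```python
from collections import deque


class WindowRegistry:
    """Hypothetical registry that keeps one cached window per rule."""

    def __init__(self):
        self._rules = {}    # rule name -> rule settings last applied
        self._windows = {}  # rule name -> cached data points

    def apply(self, name, settings):
        """Install settings; the window survives only if nothing changed."""
        if self._rules.get(name) != settings:
            self._rules[name] = dict(settings)
            # Any changed attribute discards the cached data points.
            self._windows[name] = deque(maxlen=settings["period"])
        return self._windows[name]


registry = WindowRegistry()
rule = {"metrics-name": "service_percent", "op": "<",
        "threshold": 85, "period": 10, "count": 4}

window = registry.apply("service_percent_rule", rule)
window.extend([True, True, True])  # three matches cached so far

same = registry.apply("service_percent_rule", dict(rule))
changed = registry.apply("service_percent_rule", {**rule, "threshold": 80})
print(len(same), len(changed))  # 3 0
```

Re-applying identical settings keeps the three cached matches, while changing the threshold yields an empty window, so the rule has to accumulate matches again before it can fire.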