The alarm core is driven by a collection of rules, which are defined in config/alarm-settings.yml. An alarm rule definition has three parts.
Define the relation between scope and entity name.
An alarm rule consists of the following keys. The rule name must be unique and must end with _rule.
The label settings are required by the meter system, which stores metrics from label-based platforms such as Prometheus and Micrometer. A function that supports the above four settings should implement LabeledValueHolder.
For metrics with multiple values, such as percentiles, the threshold is written as value1, value2, value3, value4, value5. Each value is the threshold for the corresponding value of the metrics. Set a value to - if you do not want that value to trigger an alarm. For example, in a percentile rule, value1 is the threshold of P50, and -, -, value3, value4, value5 means that there is no threshold for P50 and P75.
The supported operators are >, >=, <, <=, and =. Contributions of additional operators are welcome.

    rules:
      # Rule unique name, must be ended with `_rule`.
      endpoint_percent_rule:
        # Metrics value need to be long, double or int
        metrics-name: endpoint_percent
        threshold: 75
        op: <
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 3
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 10
      service_percent_rule:
        metrics-name: service_percent
        # [Optional] Default, match all services in this metrics
        include-names:
          - service_a
          - service_b
        exclude-names:
          - service_c
        # Single value metrics threshold.
        threshold: 85
        op: <
        period: 10
        count: 4
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        # Multiple value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 5
        message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
      meter_service_status_code_rule:
        metrics-name: meter_status_code
        exclude-labels:
          - "200"
        op: ">"
        threshold: 10
        period: 10
        count: 3
        silence-period: 5
        message: The request number of entity {name} non-200 status is more than expected.
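The threshold and operator semantics described above can be sketched in a few lines of Python. This is only an illustration, not SkyWalking's implementation: the function names parse_thresholds and matches are hypothetical, and the sketch assumes that a multi-value rule matches as soon as any value with a threshold satisfies the operator.

```python
import operator

# Operators named in the text above, mapped to Python comparisons.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def parse_thresholds(raw):
    """Turn '1000,-,1000' into [1000.0, None, 1000.0]; '-' means no threshold."""
    parts = [p.strip() for p in str(raw).split(",")]
    return [None if p == "-" else float(p) for p in parts]


def matches(values, raw_threshold, op):
    """Return True if any value that has a threshold satisfies `value op threshold`.

    Assumption: one match among the values is enough for the rule to match.
    """
    compare = OPS[op]
    thresholds = parse_thresholds(raw_threshold)
    if len(values) != len(thresholds):
        raise ValueError("expected one threshold per metric value")
    return any(
        t is not None and compare(v, t)
        for v, t in zip(values, thresholds)
    )
```

For instance, with the threshold -,-,1000,1000,1000 and the operator >, a high P50 alone is ignored, while a P90 above 1000 counts as a match.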
For convenience only, our distribution provides a default alarm-settings.yml, which includes the following rules.
The metrics names are defined in the official OAL scripts. Currently, metrics from the Service, Service Instance, Endpoint, Service Relation, Service Instance Relation, and Endpoint Relation scopes can be used in alarms, and Database access is treated the same as the Service scope.
Submit an issue or a pull request if you want alarms to support any other scope.
The webhook requires the peer to be a web container. The alarm message is sent through an HTTP POST with the application/json content type. The JSON format is based on List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage> with the following key information.
ruleName is the rule name configured in alarm-settings.yml.
An example follows:
    [{
      "scopeId": 1,
      "scope": "SERVICE",
      "name": "serviceA",
      "id0": "12",
      "id1": "",
      "ruleName": "service_resp_time_rule",
      "alarmMessage": "alarmMessage xxxx",
      "startTime": 1560524171000
    }, {
      "scopeId": 1,
      "scope": "SERVICE",
      "name": "serviceB",
      "id0": "23",
      "id1": "",
      "ruleName": "service_resp_time_rule",
      "alarmMessage": "alarmMessage yyy",
      "startTime": 1560524171000
    }]
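On the receiving side, a webhook peer only needs to accept the POST and decode the JSON array. The following is a minimal sketch using Python's standard library; the names parse_alarm_payload, AlarmHandler, and serve, as well as the port, are assumptions made for illustration and are not part of SkyWalking.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_alarm_payload(body):
    """Decode the JSON array posted by the webhook into a list of dicts."""
    messages = json.loads(body)
    if not isinstance(messages, list):
        raise ValueError("expected a JSON array of alarm messages")
    return messages


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            messages = parse_alarm_payload(self.rfile.read(length))
        except ValueError:
            # Malformed JSON or an unexpected top-level type.
            self.send_response(400)
            self.end_headers()
            return
        for m in messages:
            scope, name = m.get("scope"), m.get("name")
            rule, text = m.get("ruleName"), m.get("alarmMessage")
            print(f"[{scope}] {name}: {text} (rule {rule})")
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Run the receiver; the port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()
```

Calling serve() starts the receiver; the URL of this peer would then be the one configured as the webhook target.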
The alarm message is sent through a remote gRPC method using the Protobuf content type. The message format, with the following key information, is defined in oap-server/server-alarm-plugin/src/main/proto/alarm-hook.proto.
Part of the protocol looks as follows:
    message AlarmMessage {
        int64 scopeId = 1;
        string scope = 2;
        string name = 3;
        string id0 = 4;
        string id1 = 5;
        string ruleName = 6;
        string alarmMessage = 7;
        int64 startTime = 8;
    }
To use the Slack hook, follow the Getting Started with Incoming Webhooks guide and create new webhooks.
The alarm message is sent through an HTTP POST with the application/json content type if you have configured Slack Incoming Webhooks as follows:
    slackHooks:
      textTemplate: |-
        {
          "type": "section",
          "text": {
            "type": "mrkdwn",
            "text": ":alarm_clock: *Apache Skywalking Alarm* \n **%s**."
          }
        }
      webhooks:
        - https://hooks.slack.com/services/x/y/z
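To make the role of the %s placeholder in textTemplate concrete, here is a rough sketch of how such a template could be turned into a Slack payload. It assumes that %s is replaced by the alarm message and that each rendered section is placed in a "blocks" array, which is a shape Slack Incoming Webhooks accept; the helper names render_block and build_slack_payload are hypothetical, and the actual rendering inside SkyWalking may differ.

```python
import json

# The template from the configuration above; "\\n" keeps the literal
# backslash-n that JSON later decodes as a newline.
TEXT_TEMPLATE = """{
  "type": "section",
  "text": {
    "type": "mrkdwn",
    "text": ":alarm_clock: *Apache Skywalking Alarm* \\n **%s**."
  }
}"""


def render_block(template, alarm_message):
    """Substitute %s with the JSON-escaped alarm message and parse the result."""
    # json.dumps escapes quotes and control characters; strip its outer quotes.
    escaped = json.dumps(alarm_message)[1:-1]
    return json.loads(template.replace("%s", escaped))


def build_slack_payload(template, alarm_messages):
    # Assumption: one section block per alarm message.
    return {"blocks": [render_block(template, m) for m in alarm_messages]}
```

Escaping the message before substitution keeps the rendered template valid JSON even when the alarm text contains double quotes.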
Note that only WeCom (WeChat Company Edition) supports webhooks. To use the WeChat webhook, follow the WeChat Webhooks guide. The alarm message is sent through an HTTP POST with the application/json content type after you set up WeChat webhooks as follows:
    wechatHooks:
      textTemplate: |-
        {
          "msgtype": "text",
          "text": {
            "content": "Apache SkyWalking Alarm: \n %s."
          }
        }
      webhooks:
        - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=dummy_key
Since 6.5.0, the alarm settings can be updated dynamically at runtime by Dynamic Configuration, which will override the settings in alarm-settings.yml.
To determine whether an alarm rule is triggered, SkyWalking needs to cache the metrics of a time window for each alarm rule. If any attribute (metrics-name, op, threshold, period, count, etc.) of a rule is changed, the sliding window is destroyed and re-created, causing the alarm of that specific rule to restart.
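The reset behaviour described above can be illustrated with a small sketch. It is a simplified model, not SkyWalking's code: Rule, Window, and RuleRegistry are hypothetical names, the rule carries only a single-value threshold, and silence-period handling is left out.

```python
import operator
from collections import deque
from dataclasses import dataclass

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


@dataclass(frozen=True)
class Rule:
    metrics_name: str
    op: str
    threshold: float
    period: int  # number of cached values in the window
    count: int   # matches needed inside the window to trigger


class Window:
    """Caches the most recent `period` values for one rule."""

    def __init__(self, rule):
        self.rule = rule
        self.values = deque(maxlen=rule.period)

    def add(self, value):
        self.values.append(value)

    def is_triggered(self):
        compare = OPS[self.rule.op]
        hits = sum(1 for v in self.values if compare(v, self.rule.threshold))
        return hits >= self.rule.count


class RuleRegistry:
    def __init__(self):
        self.windows = {}

    def apply(self, name, rule):
        """Keep the window if the rule is unchanged, otherwise re-create it."""
        current = self.windows.get(name)
        if current is None or current.rule != rule:
            # Any changed attribute discards the cached values.
            self.windows[name] = Window(rule)
        return self.windows[name]
```

Because the dataclass compares all attributes, re-applying an identical rule keeps the cached values, while changing even one attribute (for example the threshold) starts from an empty window.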