| <!-- |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| --> |
| |
| # Anomaly Detection |
| |
| ## IQR |
| |
| ### Usage |
| |
| This function detects anomalies based on the interquartile range (IQR). Points lying more than 1.5 times the IQR below $Q_1$ or above $Q_3$ are output as anomalies. |
| |
| **Name:** IQR |
| |
| **Input Series:** Only a single input series is supported. The type is INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Parameter:** |
| |
| + `method`: When set to "batch", anomaly detection is performed after all data points have been imported; when set to "stream", the upper and lower quartiles must be provided. The default method is "batch". |
| + `q1`: The lower quartile, required when `method` is set to "stream". |
| + `q3`: The upper quartile, required when `method` is set to "stream". |
| |
| **Output Series:** Output a single series. The type is DOUBLE. |
| |
| **Note:** $IQR=Q_3-Q_1$ |
| |
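The rule can be sketched in a few lines of Python. This is only an illustration, not the function's implementation: it assumes the usual fences $Q_1 - 1.5 \cdot IQR$ and $Q_3 + 1.5 \cdot IQR$, and it estimates the quartiles with `statistics.quantiles`, whose interpolation may differ from the one the function uses. The helper name `iqr_anomalies` is hypothetical.

```python
from statistics import quantiles

def iqr_anomalies(values, q1=None, q3=None):
    """Return the values lying more than 1.5 * IQR outside [q1, q3]."""
    if q1 is None or q3 is None:
        # "batch"-style: estimate the quartiles from all points (assumed estimator).
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

data = [0, 0, 1, -1, 0, 0, -2, 2, 0, 0, 1, -1, -1, 1, 0, 0, 10, 2, -2, 0]
print(iqr_anomalies(data))  # [10]
```

Passing `q1` and `q3` explicitly mirrors the "stream" method, where the quartiles are supplied rather than estimated.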
| ### Examples |
| |
| #### Batch computing |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+------------+ |
| | Time|root.test.s1| |
| +-----------------------------+------------+ |
| |1970-01-01T08:00:00.100+08:00| 0.0| |
| |1970-01-01T08:00:00.200+08:00| 0.0| |
| |1970-01-01T08:00:00.300+08:00| 1.0| |
| |1970-01-01T08:00:00.400+08:00| -1.0| |
| |1970-01-01T08:00:00.500+08:00| 0.0| |
| |1970-01-01T08:00:00.600+08:00| 0.0| |
| |1970-01-01T08:00:00.700+08:00| -2.0| |
| |1970-01-01T08:00:00.800+08:00| 2.0| |
| |1970-01-01T08:00:00.900+08:00| 0.0| |
| |1970-01-01T08:00:01.000+08:00| 0.0| |
| |1970-01-01T08:00:01.100+08:00| 1.0| |
| |1970-01-01T08:00:01.200+08:00| -1.0| |
| |1970-01-01T08:00:01.300+08:00| -1.0| |
| |1970-01-01T08:00:01.400+08:00| 1.0| |
| |1970-01-01T08:00:01.500+08:00| 0.0| |
| |1970-01-01T08:00:01.600+08:00| 0.0| |
| |1970-01-01T08:00:01.700+08:00| 10.0| |
| |1970-01-01T08:00:01.800+08:00| 2.0| |
| |1970-01-01T08:00:01.900+08:00| -2.0| |
| |1970-01-01T08:00:02.000+08:00| 0.0| |
| +-----------------------------+------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select iqr(s1) from root.test |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+-----------------+ |
| | Time|iqr(root.test.s1)| |
| +-----------------------------+-----------------+ |
| |1970-01-01T08:00:01.700+08:00| 10.0| |
| +-----------------------------+-----------------+ |
| ``` |
| |
| ## KSigma |
| |
| ### Usage |
| |
| This function detects anomalies based on the Dynamic K-Sigma Algorithm. |
| Within a sliding window, an input value that deviates from the mean by more than k times the standard deviation is output as an anomaly. |
| |
| **Name:** KSIGMA |
| |
| **Input Series:** Only a single input series is supported. The type is INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Parameter:** |
| |
| + `k`: The number of standard deviations beyond which a value is considered an anomaly. The default value is 3. |
| + `window`: The window size of the Dynamic K-Sigma Algorithm. The default value is 10000. |
| |
| **Output Series:** Output a single series. The type is the same as the input series. |
| |
| **Note:** Anomaly detection is performed only when `k` is larger than 0. Otherwise, nothing is output. |
| |
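As a rough illustration of the k-sigma rule, the sketch below treats the whole input as a single window, which is a simplification of the sliding-window algorithm. It assumes that NaN values are skipped and that the population standard deviation is used; both are assumptions, and `ksigma_anomalies` is a hypothetical helper.

```python
import math
from statistics import fmean, pstdev

def ksigma_anomalies(points, k=3.0):
    """Return the (time, value) pairs farther than k standard deviations from the mean."""
    if k <= 0:
        return []  # no detection unless k > 0
    finite = [(t, v) for t, v in points if not math.isnan(v)]  # assumption: NaN is skipped
    if not finite:
        return []
    values = [v for _, v in finite]
    mean, std = fmean(values), pstdev(values)  # assumption: population std
    return [(t, v) for t, v in finite if abs(v - mean) > k * std]

points = [(1, 10.0), (2, 11.0), (3, 9.0), (4, 10.0), (5, 30.0), (6, float("nan"))]
print(ksigma_anomalies(points, k=1.5))  # [(5, 30.0)]
```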
| ### Examples |
| |
| #### Assigning k |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+ |
| | Time|root.test.d1.s1| |
| +-----------------------------+---------------+ |
| |2020-01-01T00:00:02.000+08:00| 0.0| |
| |2020-01-01T00:00:03.000+08:00| 50.0| |
| |2020-01-01T00:00:04.000+08:00| 100.0| |
| |2020-01-01T00:00:06.000+08:00| 150.0| |
| |2020-01-01T00:00:08.000+08:00| 200.0| |
| |2020-01-01T00:00:10.000+08:00| 200.0| |
| |2020-01-01T00:00:14.000+08:00| 200.0| |
| |2020-01-01T00:00:15.000+08:00| 200.0| |
| |2020-01-01T00:00:16.000+08:00| 200.0| |
| |2020-01-01T00:00:18.000+08:00| 200.0| |
| |2020-01-01T00:00:20.000+08:00| 150.0| |
| |2020-01-01T00:00:22.000+08:00| 100.0| |
| |2020-01-01T00:00:26.000+08:00| 50.0| |
| |2020-01-01T00:00:28.000+08:00| 0.0| |
| |2020-01-01T00:00:30.000+08:00| NaN| |
| +-----------------------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select ksigma(s1,"k"="1.0") from root.test.d1 where time <= 2020-01-01 00:00:30 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+---------------------------------+ |
| |Time |ksigma(root.test.d1.s1,"k"="1.0")| |
| +-----------------------------+---------------------------------+ |
| |2020-01-01T00:00:02.000+08:00| 0.0| |
| |2020-01-01T00:00:03.000+08:00| 50.0| |
| |2020-01-01T00:00:26.000+08:00| 50.0| |
| |2020-01-01T00:00:28.000+08:00| 0.0| |
| +-----------------------------+---------------------------------+ |
| ``` |
| |
| ## LOF |
| |
| ### Usage |
| |
| This function detects density anomalies in time series. Based on the k-th distance, it computes the local outlier factor (LOF) of each set of input values and outputs that value, which can then be compared against a threshold to judge whether the set of values is a density anomaly. |
| |
| **Name:** LOF |
| |
| **Input Series:** Multiple input series. The type is INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Parameter:** |
| |
| + `method`: The detection method. The default value is "default", which is used when the input data has multiple dimensions. The alternative is "series", which transforms a single input series into high-dimensional data. |
| + `k`: The k-th distance is used to calculate the LOF. The default value is 3. |
| + `window`: The size of the window used to split the original data points. The default value is 10000. |
| + `windowsize`: The dimension that the series is transformed into when `method` is "series". The default value is 5. |
| |
| **Output Series:** Output a single series. The type is DOUBLE. |
| |
| **Note:** Incomplete rows are ignored. They take no part in the calculation and are not marked as anomalies. |
| |
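For orientation, the sketch below computes the textbook local outlier factor from the k-th distance in plain Python. It is an assumption that this matches the function only in spirit: the function's windowing and exact formula may differ, so these scores should not be expected to reproduce the output tables in this section. `lof_scores` is a hypothetical helper and assumes distinct points and more than `k` points.

```python
import math

def lof_scores(points, k=3):
    """Textbook local outlier factor of each point, using its k nearest neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point (ties broken by index).
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
           for i in range(n)]
    # k-distance: distance to the k-th nearest neighbour.
    kdist = [dist[i][knn[i][-1]] for i in range(n)]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(kdist[j], dist[i][j]) for j in knn[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    # LOF: mean density of the neighbours relative to the point's own density.
    return [sum(density[j] for j in knn[i]) / (k * density[i]) for i in range(n)]

pts = [(0, 0), (0, 1), (1, 1), (1, 0), (5, 5)]
scores = lof_scores(pts, k=2)
print([round(s, 2) for s in scores])  # [1.0, 1.0, 1.0, 1.0, 6.03]
```

In this textbook form, a score close to 1 means the point is about as dense as its neighbours, while a much larger score marks an isolated point.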
| ### Examples |
| |
| #### Using default parameters |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+---------------+ |
| | Time|root.test.d1.s1|root.test.d1.s2| |
| +-----------------------------+---------------+---------------+ |
| |1970-01-01T08:00:00.100+08:00| 0.0| 0.0| |
| |1970-01-01T08:00:00.200+08:00| 0.0| 1.0| |
| |1970-01-01T08:00:00.300+08:00| 1.0| 1.0| |
| |1970-01-01T08:00:00.400+08:00| 1.0| 0.0| |
| |1970-01-01T08:00:00.500+08:00| 0.0| -1.0| |
| |1970-01-01T08:00:00.600+08:00| -1.0| -1.0| |
| |1970-01-01T08:00:00.700+08:00| -1.0| 0.0| |
| |1970-01-01T08:00:00.800+08:00| 2.0| 2.0| |
| |1970-01-01T08:00:00.900+08:00| 0.0| null| |
| +-----------------------------+---------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select lof(s1,s2) from root.test.d1 where time<1000 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+-------------------------------------+ |
| | Time|lof(root.test.d1.s1, root.test.d1.s2)| |
| +-----------------------------+-------------------------------------+ |
| |1970-01-01T08:00:00.100+08:00| 3.8274824267668244| |
| |1970-01-01T08:00:00.200+08:00| 3.0117631741126156| |
| |1970-01-01T08:00:00.300+08:00| 2.838155437762879| |
| |1970-01-01T08:00:00.400+08:00| 3.0117631741126156| |
| |1970-01-01T08:00:00.500+08:00| 2.73518261244453| |
| |1970-01-01T08:00:00.600+08:00| 2.371440975708148| |
| |1970-01-01T08:00:00.700+08:00| 2.73518261244453| |
| |1970-01-01T08:00:00.800+08:00| 1.7561416374270742| |
| +-----------------------------+-------------------------------------+ |
| ``` |
| |
| #### Diagnosing 1-D time series |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+ |
| | Time|root.test.d1.s1| |
| +-----------------------------+---------------+ |
| |1970-01-01T08:00:00.100+08:00| 1.0| |
| |1970-01-01T08:00:00.200+08:00| 2.0| |
| |1970-01-01T08:00:00.300+08:00| 3.0| |
| |1970-01-01T08:00:00.400+08:00| 4.0| |
| |1970-01-01T08:00:00.500+08:00| 5.0| |
| |1970-01-01T08:00:00.600+08:00| 6.0| |
| |1970-01-01T08:00:00.700+08:00| 7.0| |
| |1970-01-01T08:00:00.800+08:00| 8.0| |
| |1970-01-01T08:00:00.900+08:00| 9.0| |
| |1970-01-01T08:00:01.000+08:00| 10.0| |
| |1970-01-01T08:00:01.100+08:00| 11.0| |
| |1970-01-01T08:00:01.200+08:00| 12.0| |
| |1970-01-01T08:00:01.300+08:00| 13.0| |
| |1970-01-01T08:00:01.400+08:00| 14.0| |
| |1970-01-01T08:00:01.500+08:00| 15.0| |
| |1970-01-01T08:00:01.600+08:00| 16.0| |
| |1970-01-01T08:00:01.700+08:00| 17.0| |
| |1970-01-01T08:00:01.800+08:00| 18.0| |
| |1970-01-01T08:00:01.900+08:00| 19.0| |
| |1970-01-01T08:00:02.000+08:00| 20.0| |
| +-----------------------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select lof(s1, "method"="series") from root.test.d1 where time<=2000 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+--------------------+ |
| | Time|lof(root.test.d1.s1)| |
| +-----------------------------+--------------------+ |
| |1970-01-01T08:00:00.100+08:00| 3.77777777777778| |
| |1970-01-01T08:00:00.200+08:00| 4.32727272727273| |
| |1970-01-01T08:00:00.300+08:00| 4.85714285714286| |
| |1970-01-01T08:00:00.400+08:00| 5.40909090909091| |
| |1970-01-01T08:00:00.500+08:00| 5.94999999999999| |
| |1970-01-01T08:00:00.600+08:00| 6.43243243243243| |
| |1970-01-01T08:00:00.700+08:00| 6.79999999999999| |
| |1970-01-01T08:00:00.800+08:00| 7.0| |
| |1970-01-01T08:00:00.900+08:00| 7.0| |
| |1970-01-01T08:00:01.000+08:00| 6.79999999999999| |
| |1970-01-01T08:00:01.100+08:00| 6.43243243243243| |
| |1970-01-01T08:00:01.200+08:00| 5.94999999999999| |
| |1970-01-01T08:00:01.300+08:00| 5.40909090909091| |
| |1970-01-01T08:00:01.400+08:00| 4.85714285714286| |
| |1970-01-01T08:00:01.500+08:00| 4.32727272727273| |
| |1970-01-01T08:00:01.600+08:00| 3.77777777777778| |
| +-----------------------------+--------------------+ |
| ``` |
| |
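The 16 output rows for 20 input points are consistent with a sliding-window embedding: with the default `windowsize` of 5, each run of 5 consecutive values becomes one 5-dimensional point. The sketch below shows that transformation under this assumption; `embed` is a hypothetical helper, not part of the function's API.

```python
def embed(values, windowsize=5):
    """Turn a 1-D series into overlapping windowsize-dimensional points."""
    count = len(values) - windowsize + 1
    return [tuple(values[i:i + windowsize]) for i in range(max(count, 0))]

points = embed([float(v) for v in range(1, 21)])
print(len(points))  # 16
print(points[0])    # (1.0, 2.0, 3.0, 4.0, 5.0)
```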
| ## MissDetect |
| |
| ### Usage |
| |
| This function detects missing anomalies. |
| In some datasets, missing values have been filled by linear interpolation, |
| which leaves long, perfectly linear segments in the series. |
| The function finds these segments |
| and labels their points as missing anomalies. |
| |
| **Name:** MISSDETECT |
| |
| **Input Series:** Only a single input series is supported. The data type is INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Parameter:** |
| |
| `minlen`: The minimum length of the detected missing anomalies, which is an integer greater than or equal to 10. By default, it is 10. |
| |
| **Output Series:** Output a single series. The type is BOOLEAN. Each data point that belongs to a missing anomaly is labeled true. |
| |
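One way to picture the detection is to scan for maximal runs of points that share a single slope and to label the runs that contain at least `minlen` points. The sketch below does this with exact slope comparison; how the function handles tolerance and timestamps is not described here, so treat those details, and the helper name `label_linear_runs`, as assumptions.

```python
def label_linear_runs(times, values, minlen=10):
    """Label points that lie on a perfectly linear run of at least minlen points."""
    n = len(values)
    labels = [False] * n

    def slope(a, b):
        return (values[b] - values[a]) / (times[b] - times[a])

    start = 0  # index of the first point of the current run
    for i in range(1, n + 1):
        # A run ends at the end of the series or where the slope changes.
        if i == n or (i - start >= 2 and slope(i - 1, i) != slope(start, start + 1)):
            if i - start >= minlen:
                labels[start:i] = [True] * (i - start)
            start = i - 1  # the breaking point also opens the next run
    return labels

values = [0, 1, 0, 1] + [0] * 12 + [1, 0, 1, 0, 1]
labels = label_linear_runs(list(range(len(values))), values, minlen=10)
print(labels.index(True), labels.count(True))  # 4 12
```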
| ### Examples |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+ |
| | Time|root.test.d2.s2| |
| +-----------------------------+---------------+ |
| |2021-07-01T12:00:00.000+08:00| 0.0| |
| |2021-07-01T12:00:01.000+08:00| 1.0| |
| |2021-07-01T12:00:02.000+08:00| 0.0| |
| |2021-07-01T12:00:03.000+08:00| 1.0| |
| |2021-07-01T12:00:04.000+08:00| 0.0| |
| |2021-07-01T12:00:05.000+08:00| 0.0| |
| |2021-07-01T12:00:06.000+08:00| 0.0| |
| |2021-07-01T12:00:07.000+08:00| 0.0| |
| |2021-07-01T12:00:08.000+08:00| 0.0| |
| |2021-07-01T12:00:09.000+08:00| 0.0| |
| |2021-07-01T12:00:10.000+08:00| 0.0| |
| |2021-07-01T12:00:11.000+08:00| 0.0| |
| |2021-07-01T12:00:12.000+08:00| 0.0| |
| |2021-07-01T12:00:13.000+08:00| 0.0| |
| |2021-07-01T12:00:14.000+08:00| 0.0| |
| |2021-07-01T12:00:15.000+08:00| 0.0| |
| |2021-07-01T12:00:16.000+08:00| 1.0| |
| |2021-07-01T12:00:17.000+08:00| 0.0| |
| |2021-07-01T12:00:18.000+08:00| 1.0| |
| |2021-07-01T12:00:19.000+08:00| 0.0| |
| |2021-07-01T12:00:20.000+08:00| 1.0| |
| +-----------------------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select missdetect(s2,'minlen'='10') from root.test.d2 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+------------------------------------------+ |
| | Time|missdetect(root.test.d2.s2, "minlen"="10")| |
| +-----------------------------+------------------------------------------+ |
| |2021-07-01T12:00:00.000+08:00| false| |
| |2021-07-01T12:00:01.000+08:00| false| |
| |2021-07-01T12:00:02.000+08:00| false| |
| |2021-07-01T12:00:03.000+08:00| false| |
| |2021-07-01T12:00:04.000+08:00| true| |
| |2021-07-01T12:00:05.000+08:00| true| |
| |2021-07-01T12:00:06.000+08:00| true| |
| |2021-07-01T12:00:07.000+08:00| true| |
| |2021-07-01T12:00:08.000+08:00| true| |
| |2021-07-01T12:00:09.000+08:00| true| |
| |2021-07-01T12:00:10.000+08:00| true| |
| |2021-07-01T12:00:11.000+08:00| true| |
| |2021-07-01T12:00:12.000+08:00| true| |
| |2021-07-01T12:00:13.000+08:00| true| |
| |2021-07-01T12:00:14.000+08:00| true| |
| |2021-07-01T12:00:15.000+08:00| true| |
| |2021-07-01T12:00:16.000+08:00| false| |
| |2021-07-01T12:00:17.000+08:00| false| |
| |2021-07-01T12:00:18.000+08:00| false| |
| |2021-07-01T12:00:19.000+08:00| false| |
| |2021-07-01T12:00:20.000+08:00| false| |
| +-----------------------------+------------------------------------------+ |
| ``` |
| |
| ## Range |
| |
| ### Usage |
| |
| This function detects range anomalies in time series. Given lower and upper bound parameters, it checks whether each input value lies outside the range (a range anomaly) and outputs the anomalous values as a new time series. |
| |
| **Name:** RANGE |
| |
| **Input Series:** Only a single input series is supported. The type is INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Parameter:** |
| |
| + `lower_bound`: The lower bound of range anomaly detection. |
| + `upper_bound`: The upper bound of range anomaly detection. |
| |
| **Output Series:** Output a single series. The type is the same as the input. |
| |
| **Note:** Anomaly detection is performed only when `upper_bound` is larger than `lower_bound`. Otherwise, nothing is output. |
| |
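The check amounts to a simple filter. The sketch below assumes that values equal to a bound count as in range and that NaN is never reported; both are assumptions rather than documented behavior, and `range_anomalies` is a hypothetical helper.

```python
def range_anomalies(points, lower_bound, upper_bound):
    """Return the (time, value) pairs whose value lies outside [lower_bound, upper_bound]."""
    if upper_bound <= lower_bound:
        return []  # no detection unless upper_bound > lower_bound
    # Comparisons with NaN are always False, so NaN values are never returned.
    return [(t, v) for t, v in points if v < lower_bound or v > upper_bound]

points = [(1, 99.0), (2, 101.0), (3, 125.0), (4, 130.0), (5, float("nan"))]
print(range_anomalies(points, 101.0, 125.0))  # [(1, 99.0), (4, 130.0)]
```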
| ### Examples |
| |
| #### Assigning Lower and Upper Bound |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+ |
| | Time|root.test.d1.s1| |
| +-----------------------------+---------------+ |
| |2020-01-01T00:00:02.000+08:00| 100.0| |
| |2020-01-01T00:00:03.000+08:00| 101.0| |
| |2020-01-01T00:00:04.000+08:00| 102.0| |
| |2020-01-01T00:00:06.000+08:00| 104.0| |
| |2020-01-01T00:00:08.000+08:00| 126.0| |
| |2020-01-01T00:00:10.000+08:00| 108.0| |
| |2020-01-01T00:00:14.000+08:00| 112.0| |
| |2020-01-01T00:00:15.000+08:00| 113.0| |
| |2020-01-01T00:00:16.000+08:00| 114.0| |
| |2020-01-01T00:00:18.000+08:00| 116.0| |
| |2020-01-01T00:00:20.000+08:00| 118.0| |
| |2020-01-01T00:00:22.000+08:00| 120.0| |
| |2020-01-01T00:00:26.000+08:00| 124.0| |
| |2020-01-01T00:00:28.000+08:00| 126.0| |
| |2020-01-01T00:00:30.000+08:00| NaN| |
| +-----------------------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select range(s1,"lower_bound"="101.0","upper_bound"="125.0") from root.test.d1 where time <= 2020-01-01 00:00:30 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+------------------------------------------------------------------+ |
| |Time |range(root.test.d1.s1,"lower_bound"="101.0","upper_bound"="125.0")| |
| +-----------------------------+------------------------------------------------------------------+ |
| |2020-01-01T00:00:02.000+08:00| 100.0| |
| |2020-01-01T00:00:08.000+08:00| 126.0| |
| |2020-01-01T00:00:28.000+08:00| 126.0| |
| +-----------------------------+------------------------------------------------------------------+ |
| ``` |
| |
| ## TwoSidedFilter |
| |
| ### Usage |
| |
| This function filters anomalies out of a numeric time series based on two-sided window detection. |
| |
| **Name:** TWOSIDEDFILTER |
| |
| **Input Series:** Only a single input series is supported. The data type is INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Output Series:** Output a single series. The type is the same as the input. It is the input without anomalies. |
| |
| **Parameter:** |
| |
| - `len`: The size of the window, which is a positive integer. By default, it's 5. When `len`=3, the algorithm examines a forward window and a backward window, each of length 3, and calculates the outlierness of the current point. |
| |
| - `threshold`: The threshold of outlierness, which is a floating-point number in (0,1). By default, it's 0.3. The strictness of anomaly detection is proportional to the threshold. |
| |
| ### Examples |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+------------+ |
| | Time|root.test.s0| |
| +-----------------------------+------------+ |
| |1970-01-01T08:00:00.000+08:00| 2002.0| |
| |1970-01-01T08:00:01.000+08:00| 1946.0| |
| |1970-01-01T08:00:02.000+08:00| 1958.0| |
| |1970-01-01T08:00:03.000+08:00| 2012.0| |
| |1970-01-01T08:00:04.000+08:00| 2051.0| |
| |1970-01-01T08:00:05.000+08:00| 1898.0| |
| |1970-01-01T08:00:06.000+08:00| 2014.0| |
| |1970-01-01T08:00:07.000+08:00| 2052.0| |
| |1970-01-01T08:00:08.000+08:00| 1935.0| |
| |1970-01-01T08:00:09.000+08:00| 1901.0| |
| |1970-01-01T08:00:10.000+08:00| 1972.0| |
| |1970-01-01T08:00:11.000+08:00| 1969.0| |
| |1970-01-01T08:00:12.000+08:00| 1984.0| |
| |1970-01-01T08:00:13.000+08:00| 2018.0| |
| |1970-01-01T08:00:37.000+08:00| 1484.0| |
| |1970-01-01T08:00:38.000+08:00| 1055.0| |
| |1970-01-01T08:00:39.000+08:00| 1050.0| |
| |1970-01-01T08:01:05.000+08:00| 1023.0| |
| |1970-01-01T08:01:06.000+08:00| 1056.0| |
| |1970-01-01T08:01:07.000+08:00| 978.0| |
| |1970-01-01T08:01:08.000+08:00| 1050.0| |
| |1970-01-01T08:01:09.000+08:00| 1123.0| |
| |1970-01-01T08:01:10.000+08:00| 1150.0| |
| |1970-01-01T08:01:11.000+08:00| 1034.0| |
| |1970-01-01T08:01:12.000+08:00| 950.0| |
| |1970-01-01T08:01:13.000+08:00| 1059.0| |
| +-----------------------------+------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select TwoSidedFilter(s0, 'len'='5', 'threshold'='0.3') from root.test |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+------------+ |
| | Time|root.test.s0| |
| +-----------------------------+------------+ |
| |1970-01-01T08:00:00.000+08:00| 2002.0| |
| |1970-01-01T08:00:01.000+08:00| 1946.0| |
| |1970-01-01T08:00:02.000+08:00| 1958.0| |
| |1970-01-01T08:00:03.000+08:00| 2012.0| |
| |1970-01-01T08:00:04.000+08:00| 2051.0| |
| |1970-01-01T08:00:05.000+08:00| 1898.0| |
| |1970-01-01T08:00:06.000+08:00| 2014.0| |
| |1970-01-01T08:00:07.000+08:00| 2052.0| |
| |1970-01-01T08:00:08.000+08:00| 1935.0| |
| |1970-01-01T08:00:09.000+08:00| 1901.0| |
| |1970-01-01T08:00:10.000+08:00| 1972.0| |
| |1970-01-01T08:00:11.000+08:00| 1969.0| |
| |1970-01-01T08:00:12.000+08:00| 1984.0| |
| |1970-01-01T08:00:13.000+08:00| 2018.0| |
| |1970-01-01T08:01:05.000+08:00| 1023.0| |
| |1970-01-01T08:01:06.000+08:00| 1056.0| |
| |1970-01-01T08:01:07.000+08:00| 978.0| |
| |1970-01-01T08:01:08.000+08:00| 1050.0| |
| |1970-01-01T08:01:09.000+08:00| 1123.0| |
| |1970-01-01T08:01:10.000+08:00| 1150.0| |
| |1970-01-01T08:01:11.000+08:00| 1034.0| |
| |1970-01-01T08:01:12.000+08:00| 950.0| |
| |1970-01-01T08:01:13.000+08:00| 1059.0| |
| +-----------------------------+------------+ |
| ``` |