blob: 2d7d1228a087e8c9e92da3dcc61143727146e240 [file] [log] [blame] [view]
<!--
​ Licensed to the Apache Software Foundation (ASF) under one
​ or more contributor license agreements. See the NOTICE file
​ distributed with this work for additional information
​ regarding copyright ownership. The ASF licenses this file
​ to you under the Apache License, Version 2.0 (the
​ "License"); you may not use this file except in compliance
​ with the License. You may obtain a copy of the License at
​ http://www.apache.org/licenses/LICENSE-2.0
​ Unless required by applicable law or agreed to in writing,
​ software distributed under the License is distributed on an
​ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
​ KIND, either express or implied. See the License for the
​ specific language governing permissions and limitations
​ under the License.
-->
# Data Matching
## Cov
### Usage
This function is used to calculate the population covariance.
**Name:** COV
**Input Series:** Only support two input series. The types are both INT32 / INT64 / FLOAT / DOUBLE.
**Output Series:** Output a single series. The type is DOUBLE. There is only one data point in the series, whose timestamp is 0 and value is the population covariance.
**Note:**
+ If a row contains missing points, null points or `NaN`, it will be ignored;
+ If all rows are ignored, `NaN` will be output.
### Examples
Input series:
```
+-----------------------------+---------------+---------------+
| Time|root.test.d2.s1|root.test.d2.s2|
+-----------------------------+---------------+---------------+
|2020-01-01T00:00:02.000+08:00| 100.0| 101.0|
|2020-01-01T00:00:03.000+08:00| 101.0| null|
|2020-01-01T00:00:04.000+08:00| 102.0| 101.0|
|2020-01-01T00:00:06.000+08:00| 104.0| 102.0|
|2020-01-01T00:00:08.000+08:00| 126.0| 102.0|
|2020-01-01T00:00:10.000+08:00| 108.0| 103.0|
|2020-01-01T00:00:12.000+08:00| null| 103.0|
|2020-01-01T00:00:14.000+08:00| 112.0| 104.0|
|2020-01-01T00:00:15.000+08:00| 113.0| null|
|2020-01-01T00:00:16.000+08:00| 114.0| 104.0|
|2020-01-01T00:00:18.000+08:00| 116.0| 105.0|
|2020-01-01T00:00:20.000+08:00| 118.0| 105.0|
|2020-01-01T00:00:22.000+08:00| 100.0| 106.0|
|2020-01-01T00:00:26.000+08:00| 124.0| 108.0|
|2020-01-01T00:00:28.000+08:00| 126.0| 108.0|
|2020-01-01T00:00:30.000+08:00| NaN| 108.0|
+-----------------------------+---------------+---------------+
```
SQL for query:
```sql
select cov(s1,s2) from root.test.d2
```
Output series:
```
+-----------------------------+-------------------------------------+
| Time|cov(root.test.d2.s1, root.test.d2.s2)|
+-----------------------------+-------------------------------------+
|1970-01-01T08:00:00.000+08:00| 12.291666666666666|
+-----------------------------+-------------------------------------+
```
## DTW
### Usage
This function is used to calculate the DTW distance between two input series.
**Name:** DTW
**Input Series:** Only support two input series. The types are both INT32 / INT64 / FLOAT / DOUBLE.
**Output Series:** Output a single series. The type is DOUBLE. There is only one data point in the series, whose timestamp is 0 and value is the DTW distance.
**Note:**
+ If a row contains missing points, null points or `NaN`, it will be ignored;
+ If all rows are ignored, `0` will be output.
### Examples
Input series:
```
+-----------------------------+---------------+---------------+
| Time|root.test.d2.s1|root.test.d2.s2|
+-----------------------------+---------------+---------------+
|1970-01-01T08:00:00.001+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.002+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.003+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.004+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.005+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.006+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.007+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.008+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.009+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.010+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.011+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.012+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.013+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.014+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.015+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.016+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.017+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.018+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.019+08:00| 1.0| 2.0|
|1970-01-01T08:00:00.020+08:00| 1.0| 2.0|
+-----------------------------+---------------+---------------+
```
SQL for query:
```sql
select dtw(s1,s2) from root.test.d2
```
Output series:
```
+-----------------------------+-------------------------------------+
| Time|dtw(root.test.d2.s1, root.test.d2.s2)|
+-----------------------------+-------------------------------------+
|1970-01-01T08:00:00.000+08:00| 20.0|
+-----------------------------+-------------------------------------+
```
## Pearson
### Usage
This function is used to calculate the Pearson Correlation Coefficient.
**Name:** PEARSON
**Input Series:** Only support two input series. The types are both INT32 / INT64 / FLOAT / DOUBLE.
**Output Series:** Output a single series. The type is DOUBLE. There is only one data point in the series, whose timestamp is 0 and value is the Pearson Correlation Coefficient.
**Note:**
+ If a row contains missing points, null points or `NaN`, it will be ignored;
+ If all rows are ignored, `NaN` will be output.
### Examples
Input series:
```
+-----------------------------+---------------+---------------+
| Time|root.test.d2.s1|root.test.d2.s2|
+-----------------------------+---------------+---------------+
|2020-01-01T00:00:02.000+08:00| 100.0| 101.0|
|2020-01-01T00:00:03.000+08:00| 101.0| null|
|2020-01-01T00:00:04.000+08:00| 102.0| 101.0|
|2020-01-01T00:00:06.000+08:00| 104.0| 102.0|
|2020-01-01T00:00:08.000+08:00| 126.0| 102.0|
|2020-01-01T00:00:10.000+08:00| 108.0| 103.0|
|2020-01-01T00:00:12.000+08:00| null| 103.0|
|2020-01-01T00:00:14.000+08:00| 112.0| 104.0|
|2020-01-01T00:00:15.000+08:00| 113.0| null|
|2020-01-01T00:00:16.000+08:00| 114.0| 104.0|
|2020-01-01T00:00:18.000+08:00| 116.0| 105.0|
|2020-01-01T00:00:20.000+08:00| 118.0| 105.0|
|2020-01-01T00:00:22.000+08:00| 100.0| 106.0|
|2020-01-01T00:00:26.000+08:00| 124.0| 108.0|
|2020-01-01T00:00:28.000+08:00| 126.0| 108.0|
|2020-01-01T00:00:30.000+08:00| NaN| 108.0|
+-----------------------------+---------------+---------------+
```
SQL for query:
```sql
select pearson(s1,s2) from root.test.d2
```
Output series:
```
+-----------------------------+-----------------------------------------+
| Time|pearson(root.test.d2.s1, root.test.d2.s2)|
+-----------------------------+-----------------------------------------+
|1970-01-01T08:00:00.000+08:00| 0.5630881927754872|
+-----------------------------+-----------------------------------------+
```
## PtnSym
### Usage
This function is used to find all symmetric subseries in the input whose degree of symmetry is less than the threshold.
The degree of symmetry is calculated by DTW.
The smaller the degree, the more symmetrical the series is.
**Name:** PATTERNSYMMETRIC
**Input Series:** Only support a single input series. The type is INT32 / INT64 / FLOAT / DOUBLE
**Parameter:**
+ `window`: The length of the symmetric subseries. It's a positive integer and the default value is 10.
+ `threshold`: The threshold of the degree of symmetry. It's non-negative. Only the subseries whose degree of symmetry is below it will be output. By default, all subseries will be output.
**Output Series:** Output a single series. The type is DOUBLE. Each data point in the output series corresponds to a symmetric subseries. The output timestamp is the starting timestamp of the subseries and the output value is the degree of symmetry.
### Example
Input series:
```
+-----------------------------+---------------+
| Time|root.test.d1.s4|
+-----------------------------+---------------+
|2021-01-01T12:00:00.000+08:00| 1.0|
|2021-01-01T12:00:01.000+08:00| 2.0|
|2021-01-01T12:00:02.000+08:00| 3.0|
|2021-01-01T12:00:03.000+08:00| 2.0|
|2021-01-01T12:00:04.000+08:00| 1.0|
|2021-01-01T12:00:05.000+08:00| 1.0|
|2021-01-01T12:00:06.000+08:00| 1.0|
|2021-01-01T12:00:07.000+08:00| 1.0|
|2021-01-01T12:00:08.000+08:00| 2.0|
|2021-01-01T12:00:09.000+08:00| 3.0|
|2021-01-01T12:00:10.000+08:00| 2.0|
|2021-01-01T12:00:11.000+08:00| 1.0|
+-----------------------------+---------------+
```
SQL for query:
```sql
select ptnsym(s4, 'window'='5', 'threshold'='0') from root.test.d1
```
Output series:
```
+-----------------------------+------------------------------------------------------+
| Time|ptnsym(root.test.d1.s4, "window"="5", "threshold"="0")|
+-----------------------------+------------------------------------------------------+
|2021-01-01T12:00:00.000+08:00| 0.0|
|2021-01-01T12:00:07.000+08:00| 0.0|
+-----------------------------+------------------------------------------------------+
```
## XCorr
### Usage
This function is used to calculate the cross correlation function of given two time series.
For discrete time series, cross correlation is given by
$$CR(n) = \frac{1}{N} \sum_{m=1}^N S_1[m]S_2[m+n]$$
which represent the similarities between two series with different index shifts.
**Name:** XCORR
**Input Series:** Only support two input numeric series. The type is INT32 / INT64 / FLOAT / DOUBLE.
**Output Series:** Output a single series with DOUBLE as datatype.
There are $2N-1$ data points in the series, the center of which represents the cross correlation
calculated with pre-aligned series(that is $CR(0)$ in the formula above),
and the previous(or post) values represent those with shifting the latter series forward(or backward otherwise)
until the two series are no longer overlapped(not included).
In short, the values of output series are given by(index starts from 1)
$$OS[i] = CR(-N+i) = \frac{1}{N} \sum_{m=1}^{i} S_1[m]S_2[N-i+m],\ if\ i <= N$$
$$OS[i] = CR(i-N) = \frac{1}{N} \sum_{m=1}^{2N-i} S_1[i-N+m]S_2[m],\ if\ i > N$$
**Note:**
+ `null` and `NaN` values in the input series will be ignored and treated as 0.
### Examples
Input series:
```
+-----------------------------+---------------+---------------+
| Time|root.test.d1.s1|root.test.d1.s2|
+-----------------------------+---------------+---------------+
|2020-01-01T00:00:01.000+08:00| null| 6|
|2020-01-01T00:00:02.000+08:00| 2| 7|
|2020-01-01T00:00:03.000+08:00| 3| NaN|
|2020-01-01T00:00:04.000+08:00| 4| 9|
|2020-01-01T00:00:05.000+08:00| 5| 10|
+-----------------------------+---------------+---------------+
```
SQL for query:
```sql
select xcorr(s1, s2) from root.test.d1 where time <= 2020-01-01 00:00:05
```
Output series:
```
+-----------------------------+---------------------------------------+
| Time|xcorr(root.test.d1.s1, root.test.d1.s2)|
+-----------------------------+---------------------------------------+
|1970-01-01T08:00:00.001+08:00| 0.0|
|1970-01-01T08:00:00.002+08:00| 4.0|
|1970-01-01T08:00:00.003+08:00| 9.6|
|1970-01-01T08:00:00.004+08:00| 13.4|
|1970-01-01T08:00:00.005+08:00| 20.0|
|1970-01-01T08:00:00.006+08:00| 15.6|
|1970-01-01T08:00:00.007+08:00| 9.2|
|1970-01-01T08:00:00.008+08:00| 11.8|
|1970-01-01T08:00:00.009+08:00| 6.0|
+-----------------------------+---------------------------------------+
```