| <!-- |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| --> |
| |
| # Data Matching |
| |
| ## Cov |
| |
| ### Usage |
| |
| This function is used to calculate the population covariance. |
| |
| **Name:** COV |
| |
| **Input Series:** Only support two input series. The types are both INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Output Series:** Output a single series. The type is DOUBLE. There is only one data point in the series, whose timestamp is 0 and value is the population covariance. |
| |
| **Note:** |
| |
| + If a row contains missing points, null points or `NaN`, it will be ignored; |
| + If all rows are ignored, `NaN` will be output. |
| |
| |
| ### Examples |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+---------------+ |
| | Time|root.test.d2.s1|root.test.d2.s2| |
| +-----------------------------+---------------+---------------+ |
| |2020-01-01T00:00:02.000+08:00| 100.0| 101.0| |
| |2020-01-01T00:00:03.000+08:00| 101.0| null| |
| |2020-01-01T00:00:04.000+08:00| 102.0| 101.0| |
| |2020-01-01T00:00:06.000+08:00| 104.0| 102.0| |
| |2020-01-01T00:00:08.000+08:00| 126.0| 102.0| |
| |2020-01-01T00:00:10.000+08:00| 108.0| 103.0| |
| |2020-01-01T00:00:12.000+08:00| null| 103.0| |
| |2020-01-01T00:00:14.000+08:00| 112.0| 104.0| |
| |2020-01-01T00:00:15.000+08:00| 113.0| null| |
| |2020-01-01T00:00:16.000+08:00| 114.0| 104.0| |
| |2020-01-01T00:00:18.000+08:00| 116.0| 105.0| |
| |2020-01-01T00:00:20.000+08:00| 118.0| 105.0| |
| |2020-01-01T00:00:22.000+08:00| 100.0| 106.0| |
| |2020-01-01T00:00:26.000+08:00| 124.0| 108.0| |
| |2020-01-01T00:00:28.000+08:00| 126.0| 108.0| |
| |2020-01-01T00:00:30.000+08:00| NaN| 108.0| |
| +-----------------------------+---------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select cov(s1,s2) from root.test.d2 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+-------------------------------------+ |
| | Time|cov(root.test.d2.s1, root.test.d2.s2)| |
| +-----------------------------+-------------------------------------+ |
| |1970-01-01T08:00:00.000+08:00| 12.291666666666666| |
| +-----------------------------+-------------------------------------+ |
| ``` |
| |
| ## DTW |
| |
| ### Usage |
| |
| This function is used to calculate the DTW distance between two input series. |
| |
| **Name:** DTW |
| |
| **Input Series:** Only support two input series. The types are both INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Output Series:** Output a single series. The type is DOUBLE. There is only one data point in the series, whose timestamp is 0 and value is the DTW distance. |
| |
| **Note:** |
| |
| + If a row contains missing points, null points or `NaN`, it will be ignored; |
| + If all rows are ignored, `0` will be output. |
| |
| |
| ### Examples |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+---------------+ |
| | Time|root.test.d2.s1|root.test.d2.s2| |
| +-----------------------------+---------------+---------------+ |
| |1970-01-01T08:00:00.001+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.002+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.003+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.004+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.005+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.006+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.007+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.008+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.009+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.010+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.011+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.012+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.013+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.014+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.015+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.016+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.017+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.018+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.019+08:00| 1.0| 2.0| |
| |1970-01-01T08:00:00.020+08:00| 1.0| 2.0| |
| +-----------------------------+---------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select dtw(s1,s2) from root.test.d2 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+-------------------------------------+ |
| | Time|dtw(root.test.d2.s1, root.test.d2.s2)| |
| +-----------------------------+-------------------------------------+ |
| |1970-01-01T08:00:00.000+08:00| 20.0| |
| +-----------------------------+-------------------------------------+ |
| ``` |
| |
| ## Pearson |
| |
| ### Usage |
| |
| This function is used to calculate the Pearson Correlation Coefficient. |
| |
| **Name:** PEARSON |
| |
| **Input Series:** Only support two input series. The types are both INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Output Series:** Output a single series. The type is DOUBLE. There is only one data point in the series, whose timestamp is 0 and value is the Pearson Correlation Coefficient. |
| |
| **Note:** |
| |
| + If a row contains missing points, null points or `NaN`, it will be ignored; |
| + If all rows are ignored, `NaN` will be output. |
| |
| |
| ### Examples |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+---------------+ |
| | Time|root.test.d2.s1|root.test.d2.s2| |
| +-----------------------------+---------------+---------------+ |
| |2020-01-01T00:00:02.000+08:00| 100.0| 101.0| |
| |2020-01-01T00:00:03.000+08:00| 101.0| null| |
| |2020-01-01T00:00:04.000+08:00| 102.0| 101.0| |
| |2020-01-01T00:00:06.000+08:00| 104.0| 102.0| |
| |2020-01-01T00:00:08.000+08:00| 126.0| 102.0| |
| |2020-01-01T00:00:10.000+08:00| 108.0| 103.0| |
| |2020-01-01T00:00:12.000+08:00| null| 103.0| |
| |2020-01-01T00:00:14.000+08:00| 112.0| 104.0| |
| |2020-01-01T00:00:15.000+08:00| 113.0| null| |
| |2020-01-01T00:00:16.000+08:00| 114.0| 104.0| |
| |2020-01-01T00:00:18.000+08:00| 116.0| 105.0| |
| |2020-01-01T00:00:20.000+08:00| 118.0| 105.0| |
| |2020-01-01T00:00:22.000+08:00| 100.0| 106.0| |
| |2020-01-01T00:00:26.000+08:00| 124.0| 108.0| |
| |2020-01-01T00:00:28.000+08:00| 126.0| 108.0| |
| |2020-01-01T00:00:30.000+08:00| NaN| 108.0| |
| +-----------------------------+---------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select pearson(s1,s2) from root.test.d2 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+-----------------------------------------+ |
| | Time|pearson(root.test.d2.s1, root.test.d2.s2)| |
| +-----------------------------+-----------------------------------------+ |
| |1970-01-01T08:00:00.000+08:00| 0.5630881927754872| |
| +-----------------------------+-----------------------------------------+ |
| ``` |
| |
| ## PtnSym |
| |
| ### Usage |
| |
| This function is used to find all symmetric subseries in the input whose degree of symmetry is less than the threshold. |
| The degree of symmetry is calculated by DTW. |
| The smaller the degree, the more symmetrical the series is. |
| |
| **Name:** PATTERNSYMMETRIC |
| |
| **Input Series:** Only support a single input series. The type is INT32 / INT64 / FLOAT / DOUBLE |
| |
| **Parameter:** |
| |
| + `window`: The length of the symmetric subseries. It's a positive integer and the default value is 10. |
| + `threshold`: The threshold of the degree of symmetry. It's non-negative. Only the subseries whose degree of symmetry is below it will be output. By default, all subseries will be output. |
| |
| |
| **Output Series:** Output a single series. The type is DOUBLE. Each data point in the output series corresponds to a symmetric subseries. The output timestamp is the starting timestamp of the subseries and the output value is the degree of symmetry. |
| |
| ### Example |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+ |
| | Time|root.test.d1.s4| |
| +-----------------------------+---------------+ |
| |2021-01-01T12:00:00.000+08:00| 1.0| |
| |2021-01-01T12:00:01.000+08:00| 2.0| |
| |2021-01-01T12:00:02.000+08:00| 3.0| |
| |2021-01-01T12:00:03.000+08:00| 2.0| |
| |2021-01-01T12:00:04.000+08:00| 1.0| |
| |2021-01-01T12:00:05.000+08:00| 1.0| |
| |2021-01-01T12:00:06.000+08:00| 1.0| |
| |2021-01-01T12:00:07.000+08:00| 1.0| |
| |2021-01-01T12:00:08.000+08:00| 2.0| |
| |2021-01-01T12:00:09.000+08:00| 3.0| |
| |2021-01-01T12:00:10.000+08:00| 2.0| |
| |2021-01-01T12:00:11.000+08:00| 1.0| |
| +-----------------------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select ptnsym(s4, 'window'='5', 'threshold'='0') from root.test.d1 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+------------------------------------------------------+ |
| | Time|ptnsym(root.test.d1.s4, "window"="5", "threshold"="0")| |
| +-----------------------------+------------------------------------------------------+ |
| |2021-01-01T12:00:00.000+08:00| 0.0| |
| |2021-01-01T12:00:07.000+08:00| 0.0| |
| +-----------------------------+------------------------------------------------------+ |
| ``` |
| |
| ## XCorr |
| |
| ### Usage |
| |
| This function is used to calculate the cross correlation function of given two time series. |
| For discrete time series, cross correlation is given by |
| $$CR(n) = \frac{1}{N} \sum_{m=1}^N S_1[m]S_2[m+n]$$ |
| which represent the similarities between two series with different index shifts. |
| |
| **Name:** XCORR |
| |
| **Input Series:** Only support two input numeric series. The type is INT32 / INT64 / FLOAT / DOUBLE. |
| |
| **Output Series:** Output a single series with DOUBLE as datatype. |
| There are $2N-1$ data points in the series, the center of which represents the cross correlation |
| calculated with pre-aligned series(that is $CR(0)$ in the formula above), |
| and the previous(or post) values represent those with shifting the latter series forward(or backward otherwise) |
| until the two series are no longer overlapped(not included). |
| In short, the values of output series are given by(index starts from 1) |
| $$OS[i] = CR(-N+i) = \frac{1}{N} \sum_{m=1}^{i} S_1[m]S_2[N-i+m],\ if\ i <= N$$ |
| $$OS[i] = CR(i-N) = \frac{1}{N} \sum_{m=1}^{2N-i} S_1[i-N+m]S_2[m],\ if\ i > N$$ |
| |
| **Note:** |
| |
| + `null` and `NaN` values in the input series will be ignored and treated as 0. |
| |
| ### Examples |
| |
| Input series: |
| |
| ``` |
| +-----------------------------+---------------+---------------+ |
| | Time|root.test.d1.s1|root.test.d1.s2| |
| +-----------------------------+---------------+---------------+ |
| |2020-01-01T00:00:01.000+08:00| null| 6| |
| |2020-01-01T00:00:02.000+08:00| 2| 7| |
| |2020-01-01T00:00:03.000+08:00| 3| NaN| |
| |2020-01-01T00:00:04.000+08:00| 4| 9| |
| |2020-01-01T00:00:05.000+08:00| 5| 10| |
| +-----------------------------+---------------+---------------+ |
| ``` |
| |
| SQL for query: |
| |
| ```sql |
| select xcorr(s1, s2) from root.test.d1 where time <= 2020-01-01 00:00:05 |
| ``` |
| |
| Output series: |
| |
| ``` |
| +-----------------------------+---------------------------------------+ |
| | Time|xcorr(root.test.d1.s1, root.test.d1.s2)| |
| +-----------------------------+---------------------------------------+ |
| |1970-01-01T08:00:00.001+08:00| 0.0| |
| |1970-01-01T08:00:00.002+08:00| 4.0| |
| |1970-01-01T08:00:00.003+08:00| 9.6| |
| |1970-01-01T08:00:00.004+08:00| 13.4| |
| |1970-01-01T08:00:00.005+08:00| 20.0| |
| |1970-01-01T08:00:00.006+08:00| 15.6| |
| |1970-01-01T08:00:00.007+08:00| 9.2| |
| |1970-01-01T08:00:00.008+08:00| 11.8| |
| |1970-01-01T08:00:00.009+08:00| 6.0| |
| +-----------------------------+---------------------------------------+ |
| ``` |