Apache Otava implements a nonparametric change point detection algorithm designed to identify statistically significant distribution changes in time-ordered data. The method is primarily based on the E-Divisive family of algorithms for multivariate change point detection, with some practical adaptations.
At a high level, the algorithm:
The current implementation prioritizes:
A representative example of algorithm application:
Figure 1. An example of running the compute_change_points function on the Tigerbeetle dataset. The parameters for the compute_change_points function are displayed in the bottom right corner. Here the algorithm detected 5 change points, with a statistical test showing that the behavior of the time series changes at those points. In other words, the data have different distributions to the left and to the right of each change point.
That said, the difference could be merely a change in the mean, but the algorithm is not restricted to mean shifts: it can also detect other kinds of changes, such as a change in variance or skewness, and other shifts in the overall shape of the distribution, even if the mean stays constant. Once potential change points are identified, a statistical test is performed to validate which of the points are in fact change points, i.e., statistically significant. The user chooses the threshold (also called the p-value threshold or significance threshold) used for the statistical test. The threshold adjusts the balance between missing real change points and reporting false positives. For example, a p-value threshold of 0.01 roughly means that at most about 1% of the reported change points will be false positives. The rate of false negatives is harder to estimate and depends on both the chosen threshold and the actual data distribution.
The main idea is to use a divergence measure between distributions to identify points at which the characteristics of the time series change. Formally, given a time series $Z = \{Z_0, \cdots, Z_{T-1}\}$ (which may be multidimensional, i.e., from $\mathbb{R}^d$ with $d\geq1$), we test non-empty subsequences $X_\tau = \{ Z_0, Z_1, \cdots, Z_{\tau-1} \}$ and $Y_\tau(\kappa)=\{ Z_\tau, Z_{\tau+1}, \cdots, Z_{\kappa-1} \}$ for all possible $1 \leq \tau < \kappa \leq T$ to find the pair $\hat{\tau}, \hat{\kappa}$ that maximizes the probability that $X_\tau$ and $Y_\tau(\kappa)$ come from different distributions. If the probability for the best pair $\hat{\tau}, \hat{\kappa}$ is above a certain threshold, then the candidate $\hat{\tau}$ is a change point. The process is repeated recursively to the left and to the right of $\hat{\tau}$ until no candidate reaches a high enough probability. This yields a series of change points $1 \leq \hat{\tau}_1 < \cdots < \hat{\tau}_k \leq T - 1$. See an example in Figure 2.
Figure 2. On the left subfigure, there is a time series $Z=\{ Z_0, \cdots, Z_{21}\}$ in red color, with two subseries $X_6 = \{ Z_0, \cdots, Z_5 \}$ in green and $Y_6(12) = \{Z_6, \cdots, Z_{11}\}$ in blue. Among all possible consecutive subseries, the subseries $X_6$ and $Y_6(12)$ have the highest probability of coming from different distributions, i.e., they are the most distinct. These subseries are defined by the pair $(\hat{\tau}_1, \hat{\kappa}_1) = (6, 12)$. If the difference between these subsequences is big enough (above the significance threshold), then $\hat{\tau}_1=6$ is a change point, and the algorithm will continue to the next step with the next best pair being $(\hat{\tau}_2, \hat{\kappa}_2) = (12, 22)$. Note that $\kappa$ defines the end of the subseries $Y_\tau(\kappa)$, which might be before the end of series $Z$, i.e., $\kappa \leq T$. The reason can be seen in the right subfigures. Specifically, they show that if we restrict the $Y$ subseries so that it does not go all the way to the end of series $Z$, the algorithm might detect the changes in time series $Z$ better. Consider a loosely defined distance between the distributions (cyan) in the top right subfigure $X_6$ vs $Y_6(12)$ and in the bottom right subfigure $X_6$ vs $Y_6(22)$. The tail $\{ Z_{12}, \cdots, Z_{21} \}$ in $Y_6(22)$ muddies the signal, compared to the truncated series $Y_6(12)$.
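The recursive search can be sketched as follows. This is an illustrative, brute-force version, not Otava's implementation: the function names are ours, and a plain threshold on the divergence measure (defined below) stands in for the statistical significance test.

```python
import numpy as np

def q_stat(d, s, tau, kappa):
    """Divergence of slices [s:tau) vs [tau:kappa) (alpha = 1),
    read off a precomputed pairwise-distance matrix d."""
    m, n = tau - s, kappa - tau
    e = (2.0 / (m * n)) * d[s:tau, tau:kappa].sum() \
        - d[s:tau, s:tau].sum() / (m * (m - 1)) \
        - d[tau:kappa, tau:kappa].sum() / (n * (n - 1))
    return m * n / (kappa - s) * e

def find_change_points(z, q_threshold=5.0, min_size=2):
    """Find the (tau, kappa) maximizing the divergence in a segment, accept
    tau if the divergence clears the (placeholder) threshold, then recurse
    to the left and to the right of tau."""
    z = np.asarray(z, dtype=float)
    d = np.abs(z[:, None] - z[None, :])
    points = []

    def split(start, end):
        best_q, best_tau = -np.inf, None
        for kappa in range(start + 2 * min_size, end + 1):
            for tau in range(start + min_size, kappa - min_size + 1):
                q = q_stat(d, start, tau, kappa)
                if q > best_q:
                    best_q, best_tau = q, tau
        if best_tau is not None and best_q > q_threshold:
            split(start, best_tau)
            points.append(best_tau)   # keeps the result sorted
            split(best_tau, end)

    split(0, len(z))
    return points
```

For example, a series of 8 zeros followed by 8 fives yields the single change point 8, while a constant series yields none. The $O(T^4)$ scan is only for exposition; the recurrence-based computation described later avoids it.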
There are a couple of additional nuances and notes that we want to mention:
Figure 3. Divergence measure between distributions of $X_\tau$ and $Y_\tau(\kappa)$ for all valid values of $(\tau, \kappa)$. The largest value corresponds to the most promising pair of values (marked by a red cross).
The original work was presented in “A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data” by Matteson and James. The authors provided extensive theoretical reasoning on using the following empirical divergence measure:
$\hat{\mathcal{Q}}(X_\tau,Y_\tau(\kappa);\alpha)=\dfrac{\tau(\kappa - \tau)}{\kappa}\hat{\mathcal{E}}(X_\tau,Y_\tau(\kappa);\alpha),$
$\hat{\mathcal{E}}(X_\tau,Y_\tau(\kappa);\alpha)=\dfrac{2}{\tau(\kappa - \tau)}\sum\limits_{i=0}^{\tau-1}\sum\limits_{j=\tau}^{\kappa-1} |Z_i - Z_j|^\alpha - {\displaystyle \binom{\tau}{2}^{-1}} \sum\limits_{0\leq i < j \leq\tau-1}|Z_i - Z_j|^\alpha - {\displaystyle \binom{\kappa - \tau}{2}^{-1}} \sum\limits_{\tau\leq i < j \leq\kappa-1}|Z_i - Z_j|^\alpha,$
where $\alpha \in (0, 2)$; usually we take $\alpha=1$; $|\cdot|$ is the Euclidean distance; and the coefficients in front of the second and third terms in $\hat{\mathcal{E}}$ are reciprocals of binomial coefficients. The idea behind each term in $\hat{\mathcal{E}}$ is quite simple. The first term is the average of all possible pairwise distances between points of $X_\tau$ and points of $Y_\tau(\kappa)$. The second and third terms are the averages of all possible pairwise distances within $X_\tau$ and within $Y_\tau(\kappa)$, respectively. The intuition behind this divergence measure is to compare how large the distance between the distributions is relative to the average distances within each distribution. The coefficient in front of $\hat{\mathcal{E}}$ in $\hat{\mathcal{Q}}$ is a normalization factor required for some of the theoretical results to hold.
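To make the terms concrete, here is a direct, unoptimized NumPy transcription for a one-dimensional series. The function name is ours; both subsequences need at least two points for the binomial coefficients to be defined.

```python
import numpy as np

def empirical_divergence(z, tau, kappa, alpha=1.0):
    """E-hat and Q-hat for X = z[:tau] vs Y = z[tau:kappa] (1-D series).

    For d-dimensional data, replace the absolute differences with
    Euclidean distances between observations.
    """
    x, y = z[:tau], z[tau:kappa]
    m, n = len(x), len(y)  # both must be >= 2
    # Term 1: twice the average pairwise distance between X and Y.
    between = np.abs(x[:, None] - y[None, :]) ** alpha

    def within_avg(v):
        # Average pairwise distance within v; the full matrix counts
        # each pair twice and the zero diagonal contributes nothing.
        dv = np.abs(v[:, None] - v[None, :]) ** alpha
        k = len(v)
        return dv.sum() / (k * (k - 1))

    e_hat = 2.0 * between.mean() - within_avg(x) - within_avg(y)
    q_hat = m * n / kappa * e_hat
    return e_hat, q_hat
```

For instance, for ten zeros followed by ten ones split at $\tau=10$, $\kappa=20$: the between-average is 1 and both within-averages are 0, so $\hat{\mathcal{E}}=2$ and $\hat{\mathcal{Q}}=\frac{10\cdot10}{20}\cdot2=10$.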
The most “dissimilar” subsequences are given by
$(\hat{\tau}, \hat{\kappa}) = \text{arg}\max\limits_{(\tau, \kappa)}\hat{\mathcal{Q}}(X_\tau,Y_\tau(\kappa);\alpha).$
After the subsequences are found, one needs to estimate the probability that $X_{\hat{\tau}}$ and $Y_{\hat{\tau}}(\hat{\kappa})$ come from different distributions. Generally speaking, the subseries $X$ and $Y$ could come from any distribution(s), so the authors proposed a non-parametric permutation test for a significant difference between them. If a candidate is shown to be significant, the process is repeated recursively via hierarchical segmentation. For more details, read the linked paper.
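The permutation-test idea can be sketched as follows. This is a minimal illustration of the approach proposed in the original paper, not Otava's implementation; the names are ours, and as a simplification we re-maximize only over $\tau$ (with $\kappa$ fixed) on each shuffle.

```python
import numpy as np

def max_q(z, kappa):
    """Maximal Q-hat over candidate splits tau of z[:kappa] (alpha = 1)."""
    d = np.abs(z[:kappa, None] - z[None, :kappa])
    best = -np.inf
    for tau in range(2, kappa - 1):   # >= 2 points on each side
        m, n = tau, kappa - tau
        e = (2.0 / (m * n)) * d[:tau, tau:].sum() \
            - d[:tau, :tau].sum() / (m * (m - 1)) \
            - d[tau:, tau:].sum() / (n * (n - 1))
        best = max(best, m * n / kappa * e)
    return best

def permutation_pvalue(z, kappa, n_perm=199, seed=0):
    """Estimate how often shuffled (i.e., change-free) data produce a
    divergence at least as large as the observed one. A small p-value
    means the best split is unlikely under the no-change hypothesis."""
    rng = np.random.default_rng(seed)
    observed = max_q(z, kappa)
    hits = sum(max_q(rng.permutation(z[:kappa]), kappa) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```

On data with a strong level shift the p-value comes out tiny, because shuffling destroys the time ordering that makes the two halves separable.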
While the original paper was theoretically sound, there were a few practical issues with the methodology. They were outlined clearly in Hunter: Using Change Point Detection to Hunt for Performance Regressions by Fleming et al. Here is a short outline, with more details in the linked paper:
The authors proposed a few innovations to resolve the issues. Namely,
The current implementation in Apache Otava is conceptually the one from the Hunter paper with a newly rewritten implementation of the algorithm.
Starting with a zero-indexed time series $Z = \{Z_0, Z_1, \cdots, Z_{T-1} \}$, we are interested in computing $\hat{\mathcal{Q}}(X_\tau,Y_\tau(\kappa);\alpha)$. Moreover, because we are going to recursively split the time series, the series in the arguments of the function $\mathcal{Q}$ do not have to start from the beginning of the series. For simplicity, let us fix $\alpha=1$ and define the following function instead:
$Q(s, \tau, \kappa) = \left.\mathcal{Q}(\{Z_s, Z_{s+1}, \cdots, Z_{\tau-1}\}, \{Z_\tau, Z_{\tau + 1}, \cdots, Z_{\kappa-1}\}; \alpha)\right|_{\alpha=1},$
which is equivalent to comparing the time series slices Z[s:tau] and Z[tau:kappa] in Python notation. Next, let us define a symmetric distance matrix $D$, with $D_{ij} = |Z_i - Z_j|$ and the following auxiliary functions:
$V(s, \tau, \kappa) = \sum\limits_{i=s}^{\tau-1} D_{i\tau},$
$H(s, \tau, \kappa) = \sum\limits_{j=\tau}^{\kappa-1} D_{\tau j}.$
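Read off a precomputed distance matrix, these auxiliary functions are one-liners (the helper names are illustrative, not Otava's API):

```python
import numpy as np

def v_func(d, s, tau):
    # V(s, tau, .): distances from points Z_s..Z_{tau-1} to Z_tau;
    # does not depend on kappa.
    return d[s:tau, tau].sum()

def h_func(d, tau, kappa):
    # H(., tau, kappa): distances from Z_tau to points Z_tau..Z_{kappa-1};
    # does not depend on s, and D[tau, tau] = 0 contributes nothing.
    return d[tau, tau:kappa].sum()
```

Here `d` would be built once per series, e.g. `d = np.abs(z[:, None] - z[None, :])` for a one-dimensional `z`.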
Note that $V(s, \tau, \kappa) = V(s, \tau, \cdot)$ and $H(s, \tau, \kappa) = H(\cdot, \tau, \kappa)$, i.e., $V$ does not depend on $\kappa$ and $H$ does not depend on $s$. Finally, we can use the functions $V$ and $H$ to define recurrence equations for $Q$. We start by rewriting the function $Q$ as a combination of three functions:
$Q(s, \tau, \kappa) = A(s, \tau, \kappa) - B(s, \tau, \kappa) - C(s, \tau, \kappa),$
Here, the functions $A$, $B$, and $C$ correspond to the average pairwise distances between and within the subsequences. See the Original Work section for more details.
$A(s, \tau, \kappa) = A(s, \tau - 1, \kappa) - \dfrac{2}{\kappa - s} V(s, \tau - 1, \kappa) + \dfrac{2}{\kappa - s} H(s, \tau - 1, \kappa),$
$B(s, \tau, \kappa) = \dfrac{2(\kappa - \tau)}{(\kappa - s)(\tau - s - 1)} \left(\dfrac{(\kappa - s)(\tau - s - 2)}{2 (\kappa - \tau + 1)} B(s, \tau - 1, \kappa) + V(s, \tau - 1, \kappa) \right),$
$C(s, \tau, \kappa) = \dfrac{2(\tau - s)}{(\kappa - s) (\kappa - \tau - 1)} \left( \dfrac{(\kappa - s)(\kappa - \tau - 2)}{2 (\tau + 1 - s)} C(s, \tau + 1, \kappa) + H(s, \tau, \kappa) \right).$
Note that $A(s, s, \kappa)=0$, and $B(s, s, \kappa)=0$, and $C(s, \kappa, \kappa)=0$ because they correspond to empty subsequences. Moreover, $B(s, s + 1, \kappa) = 0$ and $C(s, \kappa - 1, \kappa)=0$ because each of them corresponds to an average pairwise distance within a subsequence consisting of a single point, i.e., each of them is the sum with zero terms. Using that, we can compute values for $A$, $B$, and $C$ iteratively. Note that functions $A$ and $B$ are computed as $\tau$ increases, and function $C$ is computed as $\tau$ decreases.
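The iteration can be sketched as follows (an unvectorized, illustrative transcription of the recurrences, not Otava's production code). It computes $Q(s, \tau, \kappa)$ for every admissible $\tau$ with at least two points on each side, and agrees with the direct definition of $Q$:

```python
import numpy as np

def q_values(z, s, kappa):
    """Q(s, tau, kappa) for all admissible tau via the A/B/C recurrences
    (alpha = 1). Returns a dict mapping tau -> Q."""
    d = np.abs(z[:, None] - z[None, :])
    n = kappa - s
    v = lambda tau: d[s:tau, tau].sum()      # V(s, tau, .)
    h = lambda tau: d[tau, tau:kappa].sum()  # H(., tau, kappa)

    a = {s: 0.0}                             # A is filled as tau increases
    for tau in range(s + 1, kappa):
        a[tau] = a[tau - 1] - 2.0 / n * v(tau - 1) + 2.0 / n * h(tau - 1)

    b = {s: 0.0, s + 1: 0.0}                 # B is filled as tau increases
    for tau in range(s + 2, kappa):
        b[tau] = (2.0 * (kappa - tau) / (n * (tau - s - 1))) * (
            n * (tau - s - 2) / (2.0 * (kappa - tau + 1)) * b[tau - 1]
            + v(tau - 1))

    c = {kappa: 0.0, kappa - 1: 0.0}         # C is filled as tau decreases
    for tau in range(kappa - 2, s, -1):
        # Moving tau down adds point Z_tau to Y, which adds H(s, tau, kappa)
        # to the within-Y sum of pairwise distances.
        c[tau] = (2.0 * (tau - s) / (n * (kappa - tau - 1))) * (
            n * (kappa - tau - 2) / (2.0 * (tau - s + 1)) * c[tau + 1]
            + h(tau))

    # Q needs at least two points on each side of tau.
    return {tau: a[tau] - b[tau] - c[tau] for tau in range(s + 2, kappa - 1)}
```

For example, for $Z = \{0, 1, 2, 3\}$ with $s=0$, $\kappa=4$, the only admissible split is $\tau=2$ and $Q(0, 2, 4) = 4 - 1 - 1 = 2$. The payoff is that each step of the iteration costs $O(T)$ via $V$ and $H$, instead of recomputing $O(T^2)$ pairwise sums from scratch.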
These recurrences are what Apache Otava effectively computes, after some NumPy vectorization. For details, see change_point_divisive/calculator.py.