
 = Time Series
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 This section of the user guide provides an overview of some of the time series capabilities available
 in Streaming Expressions and Math Expressions.

 == Time Series Aggregation

 The `timeseries` function performs fast, distributed time
 series aggregation leveraging Solr's built-in faceting and date math capabilities.

 The example below performs a monthly time series aggregation over a collection of daily stock price data.
 In this example the average monthly closing price is calculated for the stock
 ticker *AMZN* within a specific date range.

 [source,text]
 ----
 timeseries(stocks,
            q=ticker_s:amzn,
            field="date_dt",
            start="2010-01-01T00:00:00Z",
            end="2017-11-01T00:00:00Z",
            gap="+1MONTH",
            format="YYYY-MM",
            avg(close_d))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "date_dt": "2010-01",
         "avg(close_d)": 127.42315789473685
       },
       {
         "date_dt": "2010-02",
         "avg(close_d)": 118.02105263157895
       },
       {
         "date_dt": "2010-03",
         "avg(close_d)": 130.89739130434782
       },
       {
         "date_dt": "2010-04",
         "avg(close_d)": 141.07
       },
       {
         "date_dt": "2010-05",
         "avg(close_d)": 127.606
       },
       {
         "date_dt": "2010-06",
         "avg(close_d)": 121.66681818181816
       },
       {
         "date_dt": "2010-07",
         "avg(close_d)": 117.5190476190476
       }
 ]}}
 ----

 Using Zeppelin-Solr this time series can be visualized using a line chart.

 image::images/math-expressions/timeseries1.png[]
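
The bucketing arithmetic behind a monthly aggregation can be sketched in plain Python. This only illustrates the idea of grouping daily values by a `YYYY-MM` key and averaging them; it is not how Solr computes the result, and the function name `monthly_average` and the sample records are hypothetical.

```python
from collections import defaultdict

def monthly_average(records):
    """Average the closing price per calendar month (bucket key 'YYYY-MM')."""
    buckets = defaultdict(list)
    for date_str, close in records:
        # '2010-01-04T00:00:00Z' -> '2010-01'
        buckets[date_str[:7]].append(close)
    return [{"date_dt": month, "avg(close_d)": sum(values) / len(values)}
            for month, values in sorted(buckets.items())]

# Hypothetical daily closes, not taken from the stocks collection.
daily = [("2010-01-04T00:00:00Z", 10.0),
         ("2010-01-05T00:00:00Z", 20.0),
         ("2010-02-01T00:00:00Z", 30.0)]
result = monthly_average(daily)
```

Each output record has the same two fields as the tuples returned by the `timeseries` expression above.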


 == Vectorizing the Time Series

 Before a time series can be smoothed or modeled the data will need to be vectorized.
 The `col` function can be used
 to copy a column of data from a list of tuples into an array.

 The expression below demonstrates the vectorization of the `date_dt` and `avg(close_d)` fields.
 The `zplot` function is then used to plot the months on the x-axis and the average closing prices on the y-axis.

 image::images/math-expressions/timeseries2.png[]
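
As a rough illustration of what vectorizing means, the Python sketch below copies one field out of a list of tuples into an array. The helper is named `col` only to mirror the Solr function and is an assumption for illustration; the sample tuples reuse the first three results shown earlier.

```python
def col(tuples, field):
    """Copy one field from a list of tuples (dicts) into a plain list."""
    return [t[field] for t in tuples]

docs = [{"date_dt": "2010-01", "avg(close_d)": 127.42315789473685},
        {"date_dt": "2010-02", "avg(close_d)": 118.02105263157895},
        {"date_dt": "2010-03", "avg(close_d)": 130.89739130434782}]

x = col(docs, "date_dt")        # months for the x-axis
y = col(docs, "avg(close_d)")   # average closing prices for the y-axis
```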


 == Smoothing

 Time series smoothing is often used to remove the noise from a time series and help spot the underlying trend.
 The math expressions library has three *sliding window* approaches
 for time series smoothing.
 These approaches use a summary value from a sliding window of the data to calculate a new set of smoothed data points.

 The three *sliding window* functions are lagging indicators, which means
 they don't start to move in the direction of the trend until the trend affects
 the summary value of the sliding window.
 Because of this lagging quality these smoothing functions are often used to confirm the direction of the trend.

 === Moving Average

 The `movingAvg` function computes a simple moving average over a sliding window of data.
 The example below generates a time series, vectorizes the `avg(close_d)` field and computes the
 moving average with a window size of 5.

 The moving average function returns an array that is shorter
 than the original vector. This is because results are generated only when a full window of data
 is available for computing the average. With a window size of five the moving average will
 begin generating results at the 5th value. The prior values are not included in the result.

 The `zplot` function is then used to plot the months on the x-axis, and the average close and moving
 average on the y-axis. Notice that the `ltrim` function is used to trim the first 4 values from
 the x-axis and the average closing prices. This is done to line up the three arrays so they start
 from the 5th value.

 image::images/math-expressions/movingavg.png[]
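
The length and alignment behavior described above can be sketched in plain Python. The helpers `moving_avg` and `ltrim` are illustrative stand-ins written for this example, not Solr's implementation, and the input values are made up.

```python
def moving_avg(values, window):
    """Simple moving average; emits a value only once a full window is available."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def ltrim(values, n):
    """Drop the first n values from the left of the array."""
    return values[n:]

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ma = moving_avg(y, 5)       # one result per full window of 5
y_trimmed = ltrim(y, 4)     # trimmed so it starts at the 5th value, like ma
```

With seven input values and a window of 5, both `ma` and `y_trimmed` contain three values, so they can be plotted against the same trimmed x-axis.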

 === Exponential Moving Average

 The `expMovingAvg` function uses a different formula for computing the moving average that
 responds faster to changes in the underlying data. This means that it is
 less of a lagging indicator than the simple moving average.

 Below is an example that computes a moving average and exponential moving average and plots them
 along with the original y values. Notice how the exponential moving average is more sensitive
 to changes in the y values.

 image::images/math-expressions/expmoving.png[]
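
To see why an exponential moving average reacts faster, the sketch below uses a common textbook formulation in plain Python. The smoothing factor `2 / (window + 1)` and the seeding with the simple average of the first window are assumptions made for this example; Solr's `expMovingAvg` may differ in these details.

```python
def moving_avg(values, window):
    """Simple moving average over full windows only."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def exp_moving_avg(values, window):
    """Textbook exponential moving average (assumed alpha = 2 / (window + 1))."""
    alpha = 2.0 / (window + 1)
    ema = sum(values[:window]) / window   # seed with the first full window
    out = [ema]
    for v in values[window:]:
        # Each new value is weighted by alpha, the previous average by 1 - alpha.
        ema = alpha * v + (1 - alpha) * ema
        out.append(ema)
    return out

y = [1.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0]   # made-up series with a sudden jump
sma = moving_avg(y, 3)
ema = exp_moving_avg(y, 3)
```

At the first value after the jump, the exponential average in this sketch has moved further toward the new level than the simple average.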


 === Moving Median

 The `movingMedian` function uses the median of the sliding window rather than the average.
 In many cases the moving median will be more *robust* to outliers than moving averages.

 Below is an example computing the moving median:

 image::images/math-expressions/movingMedian.png[]
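
The robustness to outliers can be illustrated with a small Python sketch. The function `moving_median` and the sample values are hypothetical and only demonstrate the arithmetic, not Solr's implementation.

```python
from statistics import median

def moving_median(values, window):
    """Median of each full sliding window."""
    return [median(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]

y = [10.0, 11.0, 500.0, 12.0, 13.0]   # 500.0 is a deliberate outlier
med = moving_median(y, 3)
avg = [sum(y[i - 2:i + 1]) / 3 for i in range(2, len(y))]
```

Every window here contains the outlier, so each moving average is pulled above 170, while the moving medians stay between 11 and 13.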


 == Differencing

 Differencing can be used to make
 a time series stationary by removing the trend or seasonality from the series.

 === First Difference

 The technique used in differencing is to use the difference between values rather than the
 original values. The *first difference* takes the difference between a value and the value
 that came directly before it. The first difference is often used to remove the trend
 from a time series.

 The example below uses the first difference to make two time series stationary so they can be compared
 without the trend.

 In this example we'll be comparing the average monthly closing price for two stocks: Amazon and Google.
 The image below plots both time series before differencing is applied.

 image::images/math-expressions/timecompare.png[]

 In the next example the `diff` function is applied to both time series inside the `zplot` function.
 The `diff` function can be applied inside the `zplot` function or, like any other function, inside the `let`
 function.

 Notice that both time series now have the trend removed and the monthly movements of the stock price
 can be studied without being influenced by the trend.

 image::images/math-expressions/diff1.png[]

 In the next example the `zoom` function of the time series visualization is used to zoom into a specific
 range of months. This allows for closer inspection of the data. With closer inspection of the data there appears
 to be some correlation between the monthly movements of the two stocks.

 image::images/math-expressions/diffzoom.png[]

 In the final example the differenced time series are correlated with the `corr` function.

 image::images/math-expressions/diffcorr.png[]
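
The two steps used in this comparison, first differencing and then correlating, can be sketched in plain Python. The helpers below are illustrative only: `diff` subtracts the preceding value, and `corr` computes Pearson's correlation, which is assumed here to be the measure of interest. The two price series are made up.

```python
from math import sqrt

def diff(values):
    """First difference: each value minus the value directly before it."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]

def corr(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

# Hypothetical monthly averages for two stocks.
stock_a = [10.0, 12.0, 11.0, 15.0, 14.0]
stock_b = [100.0, 104.0, 102.0, 110.0, 108.0]

da = diff(stock_a)   # [2.0, -1.0, 4.0, -1.0]
db = diff(stock_b)   # [4.0, -2.0, 8.0, -2.0]
r = corr(da, db)
```

In this contrived data the monthly movements of the second series are exactly twice those of the first, so the correlation of the differenced series is 1.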


 === Lagged Differences

 The `diff` function has an optional second parameter to specify a lag in the difference.
 If a lag is specified the difference is taken between a value and the value at a specified
 lag in the past. Lagged differences are often used to remove seasonality from a time series.

 The simple example below demonstrates how lagged differencing works.
 Notice that the array in the example follows a simple repeated pattern. This type of pattern
 is typical of seasonal data.

 image::images/math-expressions/season.png[]

 In this example we remove this pattern using
 the `diff` function with a lag of 4. This will subtract the value lagging four indexes
 behind the current index. Notice that the result set size is the original array size minus the lag.
 This is because the `diff` function only returns results for values where the lag of 4
 is possible to compute.

 image::images/math-expressions/seasondiff.png[]
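
A minimal Python sketch of lagged differencing follows. The `diff` helper with a `lag` argument is written for illustration and the repeating array is made up; it shows how a lag equal to the length of the repeating pattern cancels the pattern, and why the result is shorter than the input by the lag.

```python
def diff(values, lag=1):
    """Difference between each value and the value `lag` positions before it."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

# A pattern that repeats every four values.
seasonal = [1, 2, 5, 7, 1, 2, 5, 7, 1, 2, 5, 7]
deseasonalized = diff(seasonal, 4)
```

The result has `len(seasonal) - 4` entries, all zero, because every value equals the value four positions earlier.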


 == Anomaly Detection

 The `movingMAD` (moving mean absolute deviation) function can be used to surface anomalies
 in a time series by measuring dispersion (deviation from the mean) within a sliding window.

 The `movingMAD` function operates in a similar manner to a moving average, except it
 measures the mean absolute deviation within the window rather than the average. By
 looking for unusually high or low dispersion we can find anomalies in the time
 series.

 For this example we'll be working with daily stock prices for Amazon over a two year
 period. The daily stock data will provide a larger data set to study.

 In the example below the `search` expression is used to return the daily closing price
 for the ticker *AMZN* over a two year period.

 image::images/math-expressions/anomaly.png[]

 The next step is to apply the `movingMAD` function to the data to calculate
 the moving mean absolute deviation over a 10 day window. The example below shows the function being
 applied and visualized.

 image::images/math-expressions/mad.png[]
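
The dispersion measure can be sketched in plain Python as follows. The function `moving_mad` is a hypothetical helper that follows the description above (the mean of the absolute deviations from the window mean); it is not Solr's implementation, and the prices are made up.

```python
def moving_mad(values, window):
    """Mean absolute deviation from the mean, computed over each full window."""
    out = []
    for i in range(window - 1, len(values)):
        w = values[i - window + 1:i + 1]
        mean = sum(w) / window
        out.append(sum(abs(v - mean) for v in w) / window)
    return out

prices = [10.0, 10.0, 10.0, 10.0, 20.0, 10.0]   # one sudden spike
mad = moving_mad(prices, 4)
```

The first window is flat, so its deviation is 0. Once the spike enters the window the deviation rises to 3.75, which is the kind of jump in dispersion that signals a possible anomaly.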

 Once the moving MAD has been calculated we can visualize the distribution of dispersion
 with the `empiricalDistribution` function. The example below plots the empirical
 distribution with 10 bins, creating a 10 bin histogram of the dispersion of the
 time series.

 This visualization shows that most of the mean absolute deviations fall between 0 and
 9.2 with the mean of the final bin at 11.94.

 image::images/math-expressions/maddist.png[]

 The final step is to detect outliers in the series using the `outliers` function.
 The `outliers` function uses a probability distribution to find outliers in a numeric vector.
 The `outliers` function takes five parameters:

 * Probability distribution
 * Numeric vector
 * Low probability threshold
 * High probability threshold
 * List of results that the numeric vector was selected from

 The `outliers` function iterates the numeric vector and uses the probability
 distribution to calculate the cumulative probability of each value. If the cumulative
 probability is below the low probability threshold or above the high threshold it considers
 the value an outlier. When the `outliers` function encounters an outlier it returns
 the corresponding result from the list of results provided by the fifth parameter.
 It also includes the cumulative probability and the value of the outlier.

 The example below shows the `outliers` function applied to the Amazon stock
 price data set. The empirical distribution of the moving mean absolute deviation is
 the first parameter. The vector containing the moving mean absolute
 deviations is the second parameter. The low probability threshold is `-1` and the high probability
 threshold is `.99`. `-1` means that low outliers will not be considered. The final parameter
 is the original result set containing the `close_d` and `date_dt` fields.

 The output of the `outliers` function contains the results where an outlier was detected.
 In this case 5 results above the .99 probability threshold were detected.


 image::images/math-expressions/outliers.png[]
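
The thresholding logic described above can be sketched in plain Python. This is an illustration under stated assumptions, not Solr's implementation: the cumulative probability is taken from a simple empirical CDF (the fraction of sample values less than or equal to the value), the output keys `probability` and `value` are hypothetical names, and the dispersion values and result records are made up.

```python
def cumulative_probability(sample, x):
    """Empirical CDF: fraction of sample values less than or equal to x."""
    return sum(1 for s in sample if s <= x) / len(sample)

def outliers(sample, vector, low, high, results):
    """Return the results whose value falls below `low` or above `high`."""
    found = []
    for value, result in zip(vector, results):
        p = cumulative_probability(sample, value)
        if p < low or p > high:
            # Copy the original record and attach the probability and the value.
            found.append({**result, "probability": p, "value": value})
    return found

# Hypothetical dispersion values, one per result record.
mad = [1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 1.3, 0.8, 1.2, 1.1]
results = [{"date_dt": "day-%d" % i, "close_d": 100.0 + i} for i in range(10)]

# A low threshold of -1 can never be undercut, so only high outliers are reported.
found = outliers(mad, mad, -1, 0.95, results)
```

Only the record aligned with the dispersion value 9.0 is returned, because it is the only value whose cumulative probability exceeds the high threshold.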


 == Modeling

 Math expressions support in Solr includes a number of functions that can be used to model a time series.
 These functions include linear regression, polynomial and harmonic curve fitting, loess regression, and KNN regression.

 Each of these functions can model a time series and be used for
 interpolation (predicting values within the data set), and several
 can be used for extrapolation (predicting values beyond the data set).

 The various regression functions are covered in detail in the Linear Regression, Curve
 Fitting and Machine Learning sections of the user guide.

 The example below uses the `polyfit` function (polynomial regression) to
 fit a non-linear model to a time series. The data set being used is the
 monthly average closing price for Amazon over an eight year period.

 In this example the `polyfit` function returns a fitted model for the *y*
 axis, which is the average monthly closing prices, using a 4th degree polynomial.
 The degree of the polynomial determines the number of curves in the
 model. The fitted model is set to the variable `y1`. The fitted model
 is then directly plotted with `zplot` along with the original `y`
 values.

 The visualization shows the smooth line fit through the average closing
 price data.

 image::images/math-expressions/timemodel.png[]
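
For readers who want to see what a polynomial fit involves, the plain-Python sketch below performs a least-squares fit by solving the normal equations. It is an illustration only: the helper names, the choice of `0, 1, 2, ...` as the x-axis, and the sample data are assumptions for this example, and it says nothing about how Solr's `polyfit` is implemented.

```python
def polyfit(y, degree):
    """Least-squares polynomial coefficients (lowest order first) for x = 0..n-1."""
    x = range(len(y))
    m = degree + 1
    # Normal equations A c = b with A[j][k] = sum(x^(j+k)) and b[j] = sum(y * x^j).
    A = [[float(sum(xi ** (j + k) for xi in x)) for k in range(m)] for j in range(m)]
    b = [float(sum(yi * xi ** j for xi, yi in zip(x, y))) for j in range(m)]
    # Gaussian elimination with partial pivoting.
    for c in range(m):
        pivot = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        b[c], b[pivot] = b[pivot], b[c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            for k in range(c, m):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    # Back substitution.
    coeffs = [0.0] * m
    for j in range(m - 1, -1, -1):
        s = sum(A[j][k] * coeffs[k] for k in range(j + 1, m))
        coeffs[j] = (b[j] - s) / A[j][j]
    return coeffs

def evaluate(coeffs, xs):
    """Evaluate the fitted polynomial at each x."""
    return [sum(c * xi ** p for p, c in enumerate(coeffs)) for xi in xs]

y = [xi ** 2 + 1.0 for xi in range(6)]    # made-up data lying on y = x^2 + 1
coeffs = polyfit(y, 2)
y1 = evaluate(coeffs, range(len(y)))      # fitted curve over the observed x-axis
```

Because the sample data lie exactly on a quadratic, the fitted curve `y1` reproduces `y` up to floating-point error. With real price data the curve would smooth through the points instead.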


 == Forecasting

 The `polyfit` function can also be used to extrapolate a time series to forecast
 future stock prices. The example below demonstrates a 10 month forecast.

 In the example the `polyfit` function fits a model to the y-axis and the model
 is set to the variable *`m`*.
 Then to create a forecast 10 zeros are appended
 to the y-axis to create a new vector called `y10`.
 Then a new x-axis is created using
 the `natural` function which returns a sequence of whole numbers 0 to the length of `y10`.
 The new x-axis is stored in the variable `x10`.

 The `predict` function uses the fitted model to predict values for the new x-axis stored in
 variable `x10`.

 The `zplot` function is then used to plot the `x10` vector on the x-axis and the `y10` vector and extrapolated
 model on the y-axis. Notice that the `y10` vector drops to zero where the observed data
 ends, but the forecast continues along the fitted curve
 of the model.

 image::images/math-expressions/forecast.png[]
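
The forecasting steps above can be sketched in plain Python. To keep the example short it fits a straight line in closed form instead of a higher-degree polynomial, which is a deliberate simplification and not what `polyfit` does in the example above. The helper `linear_fit` and the observed values are hypothetical.

```python
def linear_fit(y):
    """Closed-form least-squares line for x = 0..n-1; returns (intercept, slope)."""
    n = len(y)
    x = range(n)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope

y = [2.0, 4.0, 6.0, 8.0]                  # made-up observed values
intercept, slope = linear_fit(y)          # the fitted model

y10 = y + [0.0] * 10                      # observed values padded with 10 zeros
x10 = list(range(len(y10)))               # new x-axis covering the forecast period
forecast = [intercept + slope * xi for xi in x10]   # model evaluated on x10
```

Plotting `y10` and `forecast` against `x10` would show the padded observations dropping to zero after the fourth point while the forecast continues along the fitted line.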