 = Interpolation, Derivatives and Integrals
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 This section explores the interrelated math expressions for interpolation and numerical calculus.

 == Interpolation

 Interpolation is used to construct new data points between a set of known control points.
 The ability to predict new data points allows for sampling along the curve defined by the
 control points.

 The interpolation functions described below all return an _interpolation function_
 that can be passed to other functions which make use of the sampling capability.

 If returned directly, an interpolation function returns an array containing predictions for each of the
 control points. This is useful in the case of `loess` interpolation, which first smooths the control points
 and then interpolates the smoothed points. All other interpolation functions simply return the original
 control points, because they predict a curve that passes through the original control points.

 There are different algorithms for interpolation that will result in different predictions
 along the curve. The math expressions library currently supports the following
 interpolation functions:

 * `lerp`: Linear interpolation predicts points that pass through each control point and
   form straight lines between control points.
 * `spline`: Spline interpolation predicts points that pass through each control point
 and form a smooth curve between control points.
 * `akima`: Akima spline interpolation is similar to spline interpolation but is stable to outliers.
 * `loess`: Loess interpolation first performs a non-linear local regression to smooth the original
 control points. Then a spline is used to interpolate the smoothed control points.

 === Sampling Along the Curve

 One way to better understand interpolation is to visualize what it means to sample along a curve. The example
 below zooms in on a specific region of a curve by sampling the curve between a specific x-axis range.

 image::images/math-expressions/interpolate1.png[]

 The visualization above first creates two arrays with x- and y-axis points. Notice that the x-axis ranges from
 0 to 9. Then the `akima`, `spline` and `lerp`
 functions are applied to the vectors to create three interpolation functions.

 Then 500 random samples are drawn from a uniform distribution between 0 and 3. These are
 the new zoomed-in x-axis points, between 0 and 3. Notice that we are sampling a specific
 area of the curve.

 Then the `predict` function is used to predict y-axis points for
 the sampled x-axis, for all three interpolation functions. Finally all three prediction vectors
 are plotted with the sampled x-axis points.

 The red line is the `lerp` interpolation, the blue line is the `akima` and the purple line is
 the `spline` interpolation. You can see they each produce different curves in between the control
 points.
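
 The sampling steps described above can be sketched as a single expression. This is a minimal sketch, not the expression behind the image: the `y` values are made up for illustration, and the `y`, `y1` and `y2` series names passed to `zplot` are assumptions. Lines starting with `#` are intended as comments and can be dropped if your Solr version rejects them.

 [source,text]
 ----
 # Hypothetical control points; only the x-axis range (0 to 9) comes from the text above.
 let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
     y=array(5, 7, 6, 9, 12, 10, 11, 14, 13, 15),
     a=akima(x, y),
     s=spline(x, y),
     l=lerp(x, y),
 # Draw 500 x-axis samples uniformly between 0 and 3.
     xs=sample(uniformDistribution(0, 3), 500),
 # Predict a y-axis value for every sampled x with each interpolation function.
     ya=predict(a, xs),
     ys=predict(s, xs),
     yl=predict(l, xs),
     zplot(x=xs, y=ya, y1=ys, y2=yl))
 ----

 Because every sample falls between 0 and 3, all predictions stay inside the range covered by the control points.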


 === Smoothing Interpolation

 The `loess` function is a smoothing interpolator which means it doesn't derive
 a function that passes through the original control points. Instead the `loess` function
 returns a function that smooths the original control points.

 A technique known as local regression is used to compute the smoothed curve. The size of the
 neighborhood of the local regression can be adjusted
 to control how closely the new curve conforms to the original control points.

 The `loess` function is passed x- and y-axes and fits a smooth curve to the data.
 If only a single array is provided it is treated as the y-axis and a sequence is generated
 for the x-axis.
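
 As a small sketch of the single-array form, the expression below passes only y-axis values to `loess` and returns the result directly. Based on the description earlier in this section, this should yield the smoothed value at each control point. The data is borrowed from the integral examples later in this section, and the variable name `smoothed` is arbitrary. The `#` line is intended as a comment.

 [source,text]
 ----
 # Only y-axis values are supplied, so a sequence is generated for the x-axis.
 let(y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
     smoothed=loess(y, bandwidth=.3))
 ----

 Raising or lowering `bandwidth` changes the neighborhood size and therefore how closely the smoothed values follow the original points.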

 The example below shows the `loess` function being used to model a monthly
 time series. In the example the `timeseries` function is used to generate
 a monthly time series of average closing prices for the stock ticker
 *AMZN*. The `date_dt` and `avg(close_d)` fields from the time series
 are then vectorized and stored in variables `x` and `y`. The `loess`
 function is then applied to the *y* vector containing the average closing
 prices. The `bandwidth` named parameter specifies the percentage
 of the data set used to compute the local regression. The `loess` function
 returns the fitted model of smoothed data points.

 The `zplot` function is then used to plot the `x`, `y` and `y1`
 variables, where `y1` holds the smoothed values returned by `loess`.

 image::images/math-expressions/loess.png[]
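
 An expression along the following lines would produce a plot like the one above. Treat it as a sketch under stated assumptions: the collection name `stocks`, the `ticker_s` field, the date range and the `bandwidth` value are all hypothetical and need to be adapted to your own index. The `#` line is intended as a comment.

 [source,text]
 ----
 # Hypothetical collection, ticker field, date range and bandwidth.
 let(ts=timeseries(stocks,
                   q="ticker_s:amzn",
                   field="date_dt",
                   start="2010-01-01T00:00:00Z",
                   end="2017-11-01T00:00:00Z",
                   gap="+1MONTH",
                   avg(close_d)),
     x=col(ts, date_dt),
     y=col(ts, avg(close_d)),
     y1=loess(y, bandwidth=.3),
     zplot(x=x, y=y, y1=y1))
 ----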


 == Derivatives

 The derivative of a function measures the rate of change of the `y` value with respect to
 the `x` value.

 The `derivative` function can compute the derivative for any of the
 interpolation functions described above. Each interpolation function
 will produce different derivatives that match the characteristics
 of the function.

 === The First Derivative (Velocity)

 A simple example shows how the `derivative` function is used to calculate
 the rate of change or *velocity*.

 In the example two vectors are created, one representing hours and
 one representing miles traveled. The `lerp` function is then used to
 create a linear interpolation of the `hours` and `miles` vectors.
 The `derivative` function is then applied to the
 linear interpolation. `zplot` is then used to plot `hours`
 on the x-axis and `miles` on the y-axis, along with the derivative as `mph`,
 at each x-axis point.


 image::images/math-expressions/derivative.png[]

 Notice that the *miles_traveled* line has a slope of 10 until the
 5th hour, where it changes to a slope of 50. The *mph* line, which is
 the derivative, visualizes the *velocity* of the
 *miles_traveled* line.
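
 The steps above can be sketched as the following expression. The `hours` and `miles` values are hypothetical, chosen only so that the slope is 10 up to the 5th hour and 50 afterwards. The `y` and `y1` series names are also assumptions. The `#` lines are intended as comments.

 [source,text]
 ----
 # Hypothetical data: 10 miles per hour through hour 5, then 50 miles per hour.
 let(hours=array(1, 2, 3, 4, 5, 6, 7, 8),
     miles=array(10, 20, 30, 40, 50, 100, 150, 200),
     l=lerp(hours, miles),
 # The derivative of the linear interpolation is the velocity at each x-axis point.
     mph=derivative(l),
     zplot(x=hours, y=miles, y1=mph))
 ----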

 Also notice that the derivative is calculated along
 straight lines showing immediate change from one point to the next. This
 is because linear interpolation (`lerp`) is used as the interpolation
 function. If the `spline` or `akima` function had been used instead, the
 derivative would have followed rounded curves.


 === The Second Derivative (Acceleration)

 While the first derivative represents velocity, the second derivative
 represents *acceleration*. The second derivative is the derivative
 of the first derivative.

 The example below builds on the first example and adds the second derivative.
 Notice that the second derivative `d2` is taken by applying the
 derivative function to a linear interpolation of the first derivative.

 The second derivative is plotted as *acceleration* on the chart.

 image::images/math-expressions/derivatives.png[]

 Notice that the acceleration line is 0 until the *mph* line increases from 10 to 50. At this
 point the *acceleration* line moves to 40. As the *mph* line stays at 50, the acceleration
 line drops to 0.
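
 A sketch of how the second derivative might be computed is shown below. It reuses hypothetical `hours` and `miles` values (slope 10 through hour 5, then 50). The `y`, `y1` and `y2` series names are assumptions, and the `#` lines are intended as comments.

 [source,text]
 ----
 # Hypothetical data: 10 miles per hour through hour 5, then 50 miles per hour.
 let(hours=array(1, 2, 3, 4, 5, 6, 7, 8),
     miles=array(10, 20, 30, 40, 50, 100, 150, 200),
     mph=derivative(lerp(hours, miles)),
 # Interpolate the first derivative, then take the derivative again.
     d2=derivative(lerp(hours, mph)),
     zplot(x=hours, y=miles, y1=mph, y2=d2))
 ----

 Interpolating `mph` before differentiating is needed because `derivative` operates on an interpolation function rather than on a plain array.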

 === Price Velocity

 The example below shows how to plot the `derivative` for a time series generated
 by the `timeseries` function. In the example a monthly time series is
 generated for the average closing price for the stock ticker `amzn`.
 The `avg(close)` column is vectorized and interpolated using linear
 interpolation (`lerp`).  The `zplot` function is then used to plot the derivative
 of the time series.

 image::images/math-expressions/derivative2.png[]

 Notice that the derivative plot clearly shows the rates of change in the stock price over time.
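
 A possible form of this expression is sketched below. The collection name `stocks`, the `ticker_s` field, the `close_d` field name and the date range are assumptions for illustration only. The sketch also assumes that `lerp` accepts a single array and generates a sequence for the x-axis, as described for `loess` earlier in this section. The `#` lines are intended as comments.

 [source,text]
 ----
 # Hypothetical collection, field names and date range.
 let(ts=timeseries(stocks,
                   q="ticker_s:amzn",
                   field="date_dt",
                   start="2010-01-01T00:00:00Z",
                   end="2017-11-01T00:00:00Z",
                   gap="+1MONTH",
                   avg(close_d)),
     x=col(ts, date_dt),
     y=col(ts, avg(close_d)),
 # Assumes lerp(y) uses a generated sequence for the x-axis.
     d=derivative(lerp(y)),
     zplot(x=x, y=d))
 ----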


 == Integrals

 An integral is a measure of the area underneath a curve.
 The `integral` function computes the cumulative integrals for a curve or the integral for a specific
 range of an interpolated curve. Like the `derivative` function the `integral` function operates
 over interpolation functions.

 === Single Integral

 If the `integral` function is passed a *start* and *end* range, it computes the area under the
 curve within that specific range.

 In the example below the `integral` function computes the
 integral over the range 0 through 10, the first half of a curve whose x-axis spans 0 through 20. Notice that the `integral` function is passed
 the interpolated curve and the start and end of the range, and returns the integral for that range.

 [source,text]
 ----
 let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
     y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
     curve=loess(x, y, bandwidth=.3),
     integral=integral(curve, 0, 10))
 ----

 When this expression is sent to the `/stream` handler it
 responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "integral": 45.300912584519914
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----

 === Cumulative Integral Plot

 If the `integral` function is passed only an interpolated curve, it returns a vector of the cumulative
 integrals for the curve. The vector contains one cumulative integral
 for each x-axis point, calculated by taking the
 integral of the range between the *first* x-axis point and that x-axis point. For the curve in the example above this is
 equivalent to calculating a vector of integrals as follows:

 [source,text]
 ----
 let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
     y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
     curve=loess(x, y, bandwidth=.3),
     integrals=array(0, integral(curve, 0, 1), integral(curve, 0, 2), integral(curve, 0, 3), ...))
 ----

 The plot of cumulative integrals visualizes how much area has accumulated under the curve
 up to each x-axis point.
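
 In practice the vector does not have to be assembled by hand. As a short sketch using the same data, calling `integral` with only the curve returns the cumulative integrals. The variable name `integrals` is arbitrary, and the `#` line is intended as a comment.

 [source,text]
 ----
 # With no start and end range, integral returns one cumulative value per x-axis point.
 let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
     y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
     curve=loess(x, y, bandwidth=.3),
     integrals=integral(curve))
 ----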

 The example below shows the cumulative integral plot for a time series generated by
 the `timeseries` function. In the example a monthly time series is
 generated for the average closing price for the stock ticker `amzn`.
 The `avg(close)` column is vectorized and interpolated using a `spline`.

 The `zplot` function is then used to plot the cumulative integral
 of the time series.

 image::images/math-expressions/integral.png[]

 The plot above visualizes the area under the curve as the *AMZN* stock
 price changes over time. Because this plot is cumulative, the area under
 a stock price time series which stays the same over time will
 have a positive *linear* slope. A stock that has rising prices will produce a *convex* shape that curves upward, and
 a stock with falling prices will produce a *concave* shape that flattens out.

 In this particular example the integral plot curves upward more steeply over time,
 showing accelerating increases in stock price.

 == Bicubic Spline

 The `bicubicSpline` function can be used to interpolate and predict values
 anywhere within a grid of data.

 A simple example will make this more clear:

 [source,text]
 ----
 let(years=array(1998, 2000, 2002, 2004, 2006),
     floors=array(1, 5, 9, 13, 17, 19),
     prices = matrix(array(300000, 320000, 330000, 350000, 360000, 370000),
                     array(320000, 330000, 340000, 350000, 365000, 380000),
                     array(400000, 410000, 415000, 425000, 430000, 440000),
                     array(410000, 420000, 425000, 435000, 445000, 450000),
                     array(420000, 430000, 435000, 445000, 450000, 470000)),
     bspline=bicubicSpline(years, floors, prices),
     prediction=predict(bspline, 2003, 8))
 ----

 In this example a bicubic spline is used to interpolate a matrix of real estate data.
 Each row of the matrix represents a year from the `years` array. Each column of the matrix
 represents a floor from the `floors` array. The grid of numbers is the average selling price of
 an apartment for each year and floor. For example, in 2002 the average selling price for
 the 9th floor was `415000` (row 3, column 3).

 The `bicubicSpline` function is then used to
 interpolate the grid, and the `predict` function is used to predict a value for year 2003, floor 8.
 Notice that the matrix does not include a data point for year 2003, floor 8. The `bicubicSpline`
 function creates that data point based on the surrounding data in the matrix:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "prediction": 418279.5009328358
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----
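
 One way to sanity-check the interpolation is to predict a point that is already in the grid. The sketch below reuses the same data and asks for year 2002, floor 9. Because an interpolating spline passes through its control points, the result would be expected to match the grid value `415000`, up to floating-point error. The variable name `known` is arbitrary, and the `#` line is intended as a comment.

 [source,text]
 ----
 let(years=array(1998, 2000, 2002, 2004, 2006),
     floors=array(1, 5, 9, 13, 17, 19),
     prices = matrix(array(300000, 320000, 330000, 350000, 360000, 370000),
                     array(320000, 330000, 340000, 350000, 365000, 380000),
                     array(400000, 410000, 415000, 425000, 430000, 440000),
                     array(410000, 420000, 425000, 435000, 445000, 450000),
                     array(420000, 430000, 435000, 445000, 450000, 470000)),
     bspline=bicubicSpline(years, floors, prices),
 # Year 2002, floor 9 is an existing grid point (row 3, column 3).
     known=predict(bspline, 2002, 9))
 ----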