 = Monte Carlo Simulations
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 Monte Carlo simulations are commonly used to model the behavior of
 stochastic (random) systems. This section of the user guide covers
 the basics of performing Monte Carlo simulations with Math Expressions.

 <<Random Time Series, Random Time Series>> -
 <<Autocorrelation, Autocorrelation>> -
 <<Visualizing the Distribution, Visualizing & Fitting the Distribution>> -
 <<Monte Carlo, Monte Carlo>> -
 <<Random Walk, Random Walk>> -
 <<Multivariate Normal Distribution, Multivariate Normal Distribution>>

 == Random Time Series

 The daily movement of stock prices is often described as a "random walk".
 But what does that really mean, and how does a random walk differ from a random time series?
 The examples below will use Monte Carlo simulations to explore both "random walks"
 and random time series.

 A useful first step in understanding the difference is to visualize
 daily stock returns, calculated as closing price minus opening price, as a time series.

 The example below uses the `search` function to return 1000 days of daily stock
 returns for the ticker *cvx* (Chevron). The *change_d* field, which is the
 change in price for the day, is then plotted as a time series.

 image::images/math-expressions/randomwalk1.png[]

 Notice that the time series of daily price changes moves randomly above and
 below zero. Some days the stock is up, some days it's down, but there
 does not seem to be a noticeable pattern or any dependency between steps. This is a hint
 that this is a *random time series*.

 === Autocorrelation

 Autocorrelation measures the degree to which a signal is correlated with itself.
 Autocorrelation can be used to determine
 if a vector contains a signal or if there is dependency between values in a time series. If there is no
 signal and no dependency between values in the time series, then the time series is random.

 It's useful to plot the autocorrelation of the *change_d* vector to confirm that it is indeed random.

 In the example below the search results are set to a variable and then the *change_d* field
 is vectorized and stored in variable *b*. Then the
 `conv` (convolution) function is used to autocorrelate
 the *change_d* vector.
 Notice that the `conv` function is simply "convolving" the *change_d* vector
 with a reversed copy of itself.
 This is the technique for performing autocorrelation using convolution.
 The <<dsp.adoc#dsp,Signal Processing>> section
 of the user guide covers both convolution and autocorrelation in detail.
 In this section we'll just discuss the plot.

 The plot shows the intensity of correlation that is calculated as the *change_d* vector is slid across
 itself by the `conv` function.
 Notice in the plot there is a long period of low-intensity correlation that appears
 to be random. Then in the center there is a peak of high-intensity correlation where the vectors
 are directly lined up.
 This is followed by another long period of low-intensity correlation.

 This is the autocorrelation plot of pure noise. The daily stock changes appear
 to be a random time series.

 image::images/math-expressions/randomwalk2.png[]
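
The technique described above, convolving a vector with a reversed copy of itself, can be sketched outside of Solr. The block below is plain Python rather than Math Expression syntax, and the `noise` vector is a hypothetical stand-in for the *change_d* vector. It only illustrates why pure noise produces a single central peak.

```python
import random

def convolve(a, b):
    """Full discrete convolution (output length len(a) + len(b) - 1)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def autocorrelate(v):
    """Autocorrelation by convolving a vector with a reversed copy of itself."""
    return convolve(v, v[::-1])

# Hypothetical stand-in for the change_d vector: 200 samples of pure noise.
rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]

ac = autocorrelate(noise)
center = len(noise) - 1  # lag 0: the two copies are exactly lined up

# The central value is the sum of squares; every other lag is much smaller.
print(ac[center], max(abs(x) for i, x in enumerate(ac) if i != center))
```

At the center index every value is multiplied by itself, so the products cannot cancel. At every other lag the products of independent values have mixed signs and largely cancel out.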

 === Visualizing the Distribution

 The random daily changes in stock prices cannot be predicted, but they can be modeled with a probability distribution.
 To model the time series we'll start by visualizing the distribution of the *change_d* vector. In the example
 below the *change_d* vector is plotted using the `empiricalDistribution` function to create an 11-bin
 histogram of the data. Notice that the data appears to be normally distributed. Daily stock price
 changes do tend to be approximately normally distributed, although *cvx* was chosen specifically
 for this example because of this characteristic.

 image::images/math-expressions/randomwalk3.png[]


 === Fitting the Distribution

 The `ks` (Kolmogorov-Smirnov) test can be used to determine if the distribution of a vector of data fits a
 reference distribution.
 In the example below the `ks` test is performed with a *normal distribution* with the *mean*
 and *standard deviation* of the *change_d* vector as the reference distribution. The `ks` test
 compares the *change_d* vector against this reference distribution to see if the vector
 fits a normal distribution.

 Notice in the example below the `ks` test reports a p-value of .16278. A p-value of .05 or less is typically
 used to reject the null hypothesis of the test, which is that the vector could have been
 drawn from the reference distribution. Because .16278 is above that threshold, the null hypothesis is not rejected.

 image::images/math-expressions/randomwalk4.png[]


 The `ks` test, which tends to be fairly sensitive, is consistent with the visualization, which appeared to be normal. Because of this the
 normal distribution with the *mean* and *standard deviation* of the *change_d* vector will be used to represent the daily stock returns
 for Chevron in the Monte Carlo simulations below.
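
To make the test less of a black box, here is a minimal sketch of a one-sample Kolmogorov-Smirnov test in plain Python. It is not the Solr implementation: the p-value uses the standard asymptotic Kolmogorov series, the `sample` data is hypothetical, and the function names are made up for illustration. The statistic is the largest gap between the empirical cumulative distribution of the data and the reference distribution.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def ks_statistic(data, dist):
    """Largest gap between the empirical CDF of data and dist.cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_p_value(d, n, terms=100):
    """Asymptotic p-value for statistic d and sample size n (an approximation)."""
    if d <= 0.0:
        return 1.0
    lam = math.sqrt(n) * d
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical data standing in for the change_d vector.
rng = random.Random(7)
sample = [rng.gauss(0.2, 0.9) for _ in range(1000)]

# Reference distribution built from the sample's own mean and standard deviation.
reference = NormalDist(fmean(sample), stdev(sample))
d = ks_statistic(sample, reference)
p = ks_p_value(d, len(sample))
print(d, p)
```

A p-value above .05 means the normal reference distribution is not rejected for this data.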

 === Monte Carlo

 Now that we have fit a distribution to the daily stock return data we can use the
 `monteCarlo` function to run a simulation using the distribution.

 The `monteCarlo` function runs a specified number of times. On each run it sets
 a series of variables and runs one final function which returns a single numeric value. The
 `monteCarlo` function collects the results of each run in a vector and returns it.
 The final function typically has one or more variables that are drawn from probability
 distributions on each run. The `sample` function is used to draw the samples.

 The simulation's result array can then be treated as an empirical distribution to understand
 the probabilities of the simulation results.

 The example below uses the `monteCarlo` function to simulate a distribution for the total return
 of 100 days of stock returns.

 In the example a `normalDistribution` is created from the *mean* and *standard deviation*
 of the *change_d* vector. The `monteCarlo` function then draws 100 samples from the
 normal distribution to represent 100 days of stock returns and sets
 the vector of samples to the variable *d*.

 The `add` function then calculates the total return
 from the 100 day sample. The output of the `add` function is collected by the
 `monteCarlo` function. This is repeated
 50000 times, with each run drawing a different set of samples from
 the normal distribution.

 The result of the simulation is set to variable *s*, which contains
 the total returns from the 50000 runs.

 The `empiricalDistribution` function is then used to visualize the output of the simulation
 as a 50 bin histogram. The distribution visualizes the probability of the different total
 returns from 100 days of stock returns for ticker *cvx*.

 image::images/math-expressions/randomwalk5.png[]
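
The loop that `monteCarlo` performs in this example can be sketched in plain Python. This is not Math Expression syntax: the function name `monte_carlo_totals` and the mean and standard deviation below are hypothetical placeholders, not the values fitted to *cvx*. The sketch uses 5000 runs rather than 50000 to keep it quick.

```python
import random
from statistics import NormalDist, fmean, stdev

def monte_carlo_totals(dist, days, runs, seed=None):
    """Return one simulated total return per run."""
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        # Draw one sample per day, like sampling `days` values into variable d.
        d = [rng.gauss(dist.mean, dist.stdev) for _ in range(days)]
        # The final function of each run: sum the samples into a single number.
        totals.append(sum(d))
    return totals

# Hypothetical daily-return distribution (mean 0.05, standard deviation 1.0).
daily = NormalDist(mu=0.05, sigma=1.0)
s = monte_carlo_totals(daily, days=100, runs=5000, seed=1)

# s plays the role of the result vector that is then plotted as a histogram.
print(fmean(s), stdev(s))
```

With independent daily samples, the totals center near 100 times the daily mean and spread by about 10 times (the square root of 100) the daily standard deviation.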

 The `probability` and `cumulativeProbability` functions can then be used to
 learn more about the `empiricalDistribution`.
 For example, the `probability` function can be used to
 calculate the probability of a non-negative return from 100 days of stock returns.

 The example below uses the `probability` function to return the probability of a
 return in the range of 0 to 40 from the `empiricalDistribution`
 of the simulation.

 image::images/math-expressions/randomwalk5.1.png[]
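
Conceptually, a range probability taken from simulation output is the share of simulated outcomes that fall inside the range. The plain-Python sketch below assumes that interpretation. The helper name `probability_between` and the simulated data are hypothetical, and Solr's `probability` function may compute the value differently (for example from binned data).

```python
import random

def probability_between(outcomes, low, high):
    """Fraction of outcomes with low <= value <= high."""
    hits = sum(1 for x in outcomes if low <= x <= high)
    return hits / len(outcomes)

# Hypothetical simulation output: 5000 totals of 100 daily samples each,
# drawn from a normal distribution with mean 0.05 and standard deviation 1.0.
rng = random.Random(3)
s = [sum(rng.gauss(0.05, 1.0) for _ in range(100)) for _ in range(5000)]

p = probability_between(s, 0, 40)
print(p)
```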

 === Random Walk

 The `monteCarlo` function can also be used to model a random walk of
 daily stock prices from the `normalDistribution` of daily stock returns.
 A random walk is a time series where each step is calculated by adding a random sample to the previous
 step. This creates a time series where each value is dependent on the previous value,
 which simulates the autocorrelation of stock prices.

 In the example below the random walk is achieved by adding a random sample to the
 variable `v` on each Monte Carlo iteration. The variable `v` is maintained between
 iterations so each iteration uses the previous value of `v`. The `double` function
 is the final function run each iteration, which simply returns the value of `v` as a
 double. The example iterates 1000 times to create a random walk with 1000 steps.

 image::images/math-expressions/randomwalk6.png[]

 Notice that the autocorrelation in the daily stock prices, caused by the dependency
 between steps, produces a very different plot than the
 random daily change in stock price.
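
The difference between the two kinds of series can also be checked numerically. The plain-Python sketch below is not Math Expression syntax, and its distribution parameters are hypothetical. It builds a random walk by carrying `v` from one iteration to the next, then compares the lag-1 correlation of the walk with that of its individual steps.

```python
import math
import random
from statistics import fmean

def random_walk(mean, stdev, steps, start=0.0, seed=None):
    """Each value is the previous value plus a fresh random sample."""
    rng = random.Random(seed)
    v = start
    path = []
    for _ in range(steps):
        v += rng.gauss(mean, stdev)  # v is carried over between iterations
        path.append(v)
    return path

def lag1_correlation(series):
    """Pearson correlation between a series and itself shifted by one step."""
    x, y = series[:-1], series[1:]
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

walk = random_walk(mean=0.0, stdev=1.0, steps=1000, seed=5)
# Recover the individual steps (the "daily changes") from the walk.
steps = [b - a for a, b in zip([0.0] + walk[:-1], walk)]

print(lag1_correlation(walk), lag1_correlation(steps))
```

The walk's lag-1 correlation is close to 1 because each value contains the previous one, while the steps themselves show almost no correlation.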

 == Multivariate Normal Distribution

 The `multiVariateNormalDistribution` function can be used to model and simulate
 two or more normally distributed variables. It also incorporates the
 *correlation* between variables into the model, which allows for the study of
 how correlation affects the possible outcomes.

 In the examples below a simulation of the total daily returns of two
 stocks is explored. The *all* ticker (*Allstate*) is used along with the
 *cvx* ticker (*Chevron*) from the previous examples.

 === Correlation and Covariance

 The multivariate simulations show the effect of correlation on possible
 outcomes. Before getting started with actual simulations it's useful
 to first understand the correlation and covariance between
 the Allstate and Chevron stock returns.

 The example below runs two searches to retrieve the daily stock returns
 for Allstate and Chevron. The *change_d* vectors from both result sets
 are read into variables (*all* and *cvx*) and Pearson's correlation is
 calculated for the two vectors with the `corr` function.

 image::images/math-expressions/corrsim1.png[]

 Covariance is an unscaled measure of correlation. Covariance is the measure
 used by the multivariate simulations so it's useful to also compute the
 covariance for the two stock returns. The example below computes
 the covariance.

 image::images/math-expressions/corrsim2.png[]
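
The relationship between the two measures can be sketched in plain Python. The vectors below are hypothetical, and the sketch assumes the sample (n - 1) form of covariance, since this guide does not state which normalization is applied. Pearson's correlation is the covariance divided by the product of the two standard deviations.

```python
import math
from statistics import fmean

def covariance(x, y):
    """Sample covariance (divides by n - 1, an assumption of this sketch)."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson's correlation: covariance rescaled into the range [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical daily changes for two tickers.
all_changes = [0.10, -0.20, 0.30, 0.00, -0.10]
cvx_changes = [0.50, -0.40, 0.60, -0.20, 0.10]

# Rescaling one vector changes the covariance but not the correlation.
scaled = [10 * v for v in cvx_changes]
print(covariance(all_changes, cvx_changes), covariance(all_changes, scaled))
print(correlation(all_changes, cvx_changes), correlation(all_changes, scaled))
```

This is what "unscaled" means here: covariance depends on the magnitude of the values, while correlation does not.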

 === Covariance Matrix

 A covariance matrix is actually what is needed by the
 `multiVariateNormalDistribution` as it contains both the variance of the
 two stock return vectors and the covariance between the two
 vectors. The `cov` function will compute the covariance matrix for
 the columns of a matrix.

 The example below demonstrates how
 to compute the covariance matrix by adding the `all` and `cvx` vectors
 as rows to a matrix. The matrix is then transposed with the `transpose`
 function so that the `all` vector
 is the first column and the `cvx` vector is the second column.

 The `cov` function then computes the covariance matrix for the
 columns of the matrix and returns the result.

 image::images/math-expressions/corrsim3.png[]

 The covariance matrix is a square matrix which contains the
 variance of each vector and the covariance between the
 vectors as follows:

 [source,text]
 ----
           all                 cvx
 all [0.12294442137237226, 0.13106056985285258],
 cvx [0.13106056985285258, 0.7409729840230235]
 ----
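
As a rough illustration of the steps above, the plain-Python sketch below builds a matrix from two row vectors, transposes it, and computes the covariance matrix of its columns. The input vectors are hypothetical and the sample (n - 1) normalization is an assumption. The last lines use the matrix values printed above to recover the correlation between the two tickers.

```python
import math
from statistics import fmean

def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]

def cov_matrix(matrix):
    """Sample covariance matrix of the columns of a row-major matrix."""
    cols = transpose(matrix)
    means = [fmean(c) for c in cols]
    n = len(matrix)
    return [[sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
             for cj, mj in zip(cols, means)]
            for ci, mi in zip(cols, means)]

# Hypothetical daily changes for two tickers, added as rows then transposed
# so that each ticker becomes a column.
all_changes = [0.10, -0.20, 0.30, 0.00, -0.10]
cvx_changes = [0.50, -0.40, 0.60, -0.20, 0.10]
m = transpose([all_changes, cvx_changes])
c = cov_matrix(m)
print(c)

# The matrix shown in the text: variances on the diagonal, covariance off it.
published = [[0.12294442137237226, 0.13106056985285258],
             [0.13106056985285258, 0.7409729840230235]]
corr = published[0][1] / math.sqrt(published[0][0] * published[1][1])
print(corr)
```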

 === Multivariate Simulation

 The example below demonstrates a Monte Carlo simulation with two stock tickers using the
 `multiVariateNormalDistribution`.

 In the example, result sets with the *change_d* field for both stock tickers, *all* (Allstate) and *cvx*
 (Chevron),
 are retrieved and read into vectors.

 A matrix is then created from the two vectors and is transposed so
 the matrix contains two columns, one with the *all* vector and one with the *cvx* vector.

 Then the `multiVariateNormalDistribution` is created with two parameters. The first parameter
 is an array of *mean* values, in this case the means of the *all* vector and the *cvx* vector. The
 second parameter is the covariance matrix which was created from the two-column matrix of the two vectors.

 The `monteCarlo` function then performs the simulation by drawing 100 samples from the `multiVariateNormalDistribution` on
 each iteration. Each sample set is a matrix with 100 rows and 2 columns containing stock return samples
 from the *all* and *cvx* distributions. The distributions of the columns will approximate the normal
 distributions used to create the `multiVariateNormalDistribution`. The covariance of the sample columns
 will approximate the covariance matrix.

 On each iteration the `grandSum` function is used to sum all the values of the sample matrix to get the total
 stock returns for both stocks.

 The output of the simulation is a vector which can be treated as an empirical distribution in exactly the
 same manner as the single stock ticker simulation. In this example it is plotted as a 50 bin histogram which
 visualizes the probability of the different total returns from 100 days of stock returns
 for the tickers *all* and *cvx*.


 image::images/math-expressions/mnorm.png[]
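
One common way to draw correlated normal samples is to transform independent standard normal samples with a Cholesky factor of the covariance matrix. The plain-Python sketch below assumes that approach. It is not a description of how `multiVariateNormalDistribution` is implemented, the mean values are hypothetical, and 3000 runs are used to keep it quick. The covariance matrix is the one shown earlier in this section.

```python
import math
import random
from statistics import fmean, stdev

def cholesky_2x2(cov):
    """Lower-triangular factor [[a, 0], [b, c]] of a 2x2 covariance matrix."""
    a = math.sqrt(cov[0][0])
    b = cov[1][0] / a
    c = math.sqrt(cov[1][1] - b * b)
    return a, b, c

def sample_bivariate(means, cov, n, rng):
    """Return an n x 2 matrix of correlated normal samples."""
    a, b, c = cholesky_2x2(cov)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        rows.append([means[0] + a * z1, means[1] + b * z1 + c * z2])
    return rows

def grand_sum(matrix):
    """Sum of every value in the matrix."""
    return sum(sum(row) for row in matrix)

def simulate(means, cov, days, runs, seed=None):
    """One grand total per run, each from a fresh days x 2 sample matrix."""
    rng = random.Random(seed)
    return [grand_sum(sample_bivariate(means, cov, days, rng)) for _ in range(runs)]

cov = [[0.12294442137237226, 0.13106056985285258],
       [0.13106056985285258, 0.7409729840230235]]
means = [0.01, 0.05]  # hypothetical mean daily changes for all and cvx

totals = simulate(means, cov, days=100, runs=3000, seed=11)
print(fmean(totals), stdev(totals))
```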

 === The Effect of Correlation

 The covariance matrix can be changed to study the effect on the simulation. The example
 below demonstrates this by providing a hard-coded covariance matrix with a higher covariance
 value for the two vectors. This results in a simulated outcome distribution with a higher standard deviation,
 or larger spread from the mean. This shows the degree to which higher correlation produces higher volatility
 in the simulated total returns.

 image::images/math-expressions/mnorm2.png[]
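
The size of this effect can be estimated without running a simulation. If the days are independent, the variance of the combined total is the number of days multiplied by the sum of the two variances plus twice the covariance. The plain-Python sketch below applies that formula to the covariance matrix shown earlier and to a matrix with a higher covariance. The value 0.25 is a hypothetical choice for illustration, not the value used in the example above.

```python
import math

def total_return_stdev(cov, days):
    """Standard deviation of the summed returns of two assets over `days`
    independent days, from a 2x2 covariance matrix."""
    variance_per_day = cov[0][0] + cov[1][1] + 2.0 * cov[0][1]
    return math.sqrt(days * variance_per_day)

observed = [[0.12294442137237226, 0.13106056985285258],
            [0.13106056985285258, 0.7409729840230235]]

# Same variances, higher covariance (0.25 is a hypothetical value).
higher = [[observed[0][0], 0.25],
          [0.25, observed[1][1]]]

sd_observed = total_return_stdev(observed, 100)
sd_higher = total_return_stdev(higher, 100)
print(sd_observed, sd_higher)
```

Only the off-diagonal term changes between the two matrices, so the entire increase in spread comes from the higher covariance.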