 = Monte Carlo Simulations
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.


 Monte Carlo simulations are commonly used to model the behavior of
 stochastic systems. This section describes
 how to perform both uncorrelated and correlated Monte Carlo simulations
 using the sampling capabilities of the probability distribution framework.

 == Uncorrelated Simulations

 Uncorrelated Monte Carlo simulations model stochastic systems with the assumption
 that the underlying random variables move independently of each other.
 A simple example of a Monte Carlo simulation using two independently changing random variables
 is described below.

 In this example a Monte Carlo simulation is used to determine the probability that a simple hinge assembly will
 fall within a required length specification.

 The hinge has two components, A and B. The combined length of the two components must be less than 5 centimeters
 to fall within specification.

 A random sampling of lengths for component A has shown that its length conforms to a
 normal distribution with a mean of 2.2 centimeters and a standard deviation of .0195
 centimeters.

 A random sampling of lengths for component B has shown that its length conforms
 to a normal distribution with a mean of 2.71 centimeters and a standard deviation of .0198 centimeters.

 [source,text]
 ----
 let(componentA=normalDistribution(2.2, .0195),  <1>
     componentB=normalDistribution(2.71, .0198),  <2>
     simresults=monteCarlo(sampleA=sample(componentA),  <3>
                           sampleB=sample(componentB),
                           add(sampleA, sampleB),  <4>
                           100000),  <5>
     simmodel=empiricalDistribution(simresults),  <6>
     prob=cumulativeProbability(simmodel,  5))  <7>
 ----

 The Monte Carlo simulation above performs the following steps:

 <1> A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of `componentA`.
 <2> A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of `componentB`.
 <3> The `monteCarlo` function samples from the `componentA` and `componentB` distributions and assigns the values to the variables `sampleA` and `sampleB`.
 <4> It then calls the `add(sampleA, sampleB)` function to find the combined length of the two samples.
 <5> The `monteCarlo` function runs a set number of times, 100000 in this example. Each
   time the function is called, new samples are drawn from the `componentA`
   and `componentB` distributions, and the `add` function adds the two samples to calculate the combined length.
   The result of each run is collected in an array and assigned to the `simresults` variable.
 <6> An empirical distribution is then created from the `simresults` array with the `empiricalDistribution` function
   to model the distribution of the simulation results.
 <7> Finally, the `cumulativeProbability` function is called on the `simmodel` to determine the cumulative probability
   that the combined length of the components is 5 or less.

 Based on the simulation there is a 0.9994371944629039 probability that the combined length of a component pair will
 be 5 centimeters or less:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "prob": 0.9994371944629039
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 660
       }
     ]
   }
 }
 ----
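
As a rough cross-check outside Solr, the same experiment can be sketched in plain Python. The function names `simulate_uncorrelated` and `fraction_within` are hypothetical, and the plain fraction of samples used here is only a stand-in for the empirical distribution model; it is not a description of how `empiricalDistribution` is implemented. Because the sum of two independent normal variables is itself normal, the sketch also computes the closed-form probability for comparison.

[source,python]
----
import random
from statistics import NormalDist


def simulate_uncorrelated(runs=100_000, seed=42):
    """Draw independent lengths for A and B and return the combined lengths."""
    rng = random.Random(seed)
    return [rng.gauss(2.2, 0.0195) + rng.gauss(2.71, 0.0198) for _ in range(runs)]


def fraction_within(results, limit=5.0):
    """Fraction of simulated combined lengths that are <= limit."""
    return sum(1 for r in results if r <= limit) / len(results)


# Closed form: adding two NormalDist objects models the sum of
# two independent normal variables.
combined = NormalDist(2.2, 0.0195) + NormalDist(2.71, 0.0198)
analytic = combined.cdf(5.0)

simulated = fraction_within(simulate_uncorrelated())
print(f"simulated={simulated:.5f} analytic={analytic:.5f}")
----

The two values should agree to about three decimal places; the remaining gap is sampling noise from the finite number of runs.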

 == Correlated Simulations

 The simulation above assumes that the lengths of `componentA` and `componentB` vary independently.
 What would happen to the probability model if there was a correlation between the lengths of
 `componentA` and `componentB`?

 In the example below a database containing assembled pairs of components is used to determine
 if there is a correlation between the lengths of the components, and how the correlation affects the model.

 Before simulating the effects of correlation on the probability model, it is
 useful to understand what the correlation is between the lengths of `componentA` and `componentB`.

 [source,text]
 ----
 let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"), <1>
     b=col(a, componentA_d), <2>
     c=col(a, componentB_d),
     d=corr(b, c))  <3>
 ----

 <1> In the example, 5000 random samples are selected from a collection of assembled hinges.
 Each sample contains lengths of the components in the fields `componentA_d` and `componentB_d`.
 <2> Both fields are then vectorized. The `componentA_d` vector is stored in
 variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.
 <3> Then the correlation of the two vectors is calculated using the `corr` function.

 The result shows that `corr` returns 0.9996931313216989.
 This means that `componentA_d` and `componentB_d` are almost perfectly correlated.

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "d": 0.9996931313216989
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 309
       }
     ]
   }
 }
 ----
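
For readers who want to see what a correlation coefficient measures, the following is a minimal Python sketch of the Pearson correlation between two vectors. It assumes `corr` reports a Pearson coefficient, which is consistent with the result above but is not confirmed by this section. The helper name `pearson_corr` and the sample lengths are made up for illustration and are not data from the hinge collection.

[source,python]
----
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    co = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co / math.sqrt(var_x * var_y)


# Made-up lengths for illustration only: B is always A plus a constant,
# so the two vectors rise and fall together.
a_lengths = [2.18, 2.19, 2.20, 2.21, 2.23]
b_lengths = [2.69, 2.70, 2.71, 2.72, 2.74]
print(pearson_corr(a_lengths, b_lengths))
----

A value near 1 means the two lengths move together, a value near -1 means one rises as the other falls, and a value near 0 means there is no linear relationship.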

 === Correlation Effects on the Probability Model

 The example below explores how to use a multivariate normal distribution function
 to model how correlation affects the probability of hinge defects.

 In this example 5000 random samples are selected from a collection
 containing length data for assembled hinges. Each sample contains
 the fields `componentA_d` and `componentB_d`.

 Both fields are then vectorized. The `componentA_d` vector is stored in
 variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.

 An array is created that contains the means of the two vectorized fields.

 Then both vectors are added to a matrix which is transposed. This creates
 an observation matrix where each row contains one observation of
 `componentA_d` and `componentB_d`. A covariance matrix is then created from the columns of
 the observation matrix with the
 `cov` function. The covariance matrix describes the covariance between `componentA_d` and `componentB_d`.
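
The observation-matrix step can be sketched in plain Python as follows. The helper `covariance_matrix` is hypothetical, and the `n - 1` normalisation is an assumption about how `cov` scales its result. The input has the same shape as `transpose(matrix(b, c))`: one row per observed hinge, one column per component.

[source,python]
----
def covariance_matrix(observations):
    """Sample covariance matrix of the columns of a row-per-observation matrix."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    width = len(observations[0])
    # Mean of each column.
    means = [sum(row[k] for row in observations) / n for k in range(width)]
    # Entry (i, j) is the covariance of column i with column j.
    # Assumption: bias-corrected (n - 1) normalisation.
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in observations)
            / (n - 1)
            for j in range(width)
        ]
        for i in range(width)
    ]


# Made-up observations for illustration: each tuple is one (A, B) pair.
observations = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
cov = covariance_matrix(observations)
print(cov)
----

The diagonal entries are the variances of each component, and the off-diagonal entries are the covariance between the two components.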

 The `multiVariateNormalDistribution` function is then called with the
 array of means for the two fields and the covariance matrix. The model
 for the multivariate normal distribution is stored in variable *`g`*.

 The `monteCarlo` function then calls the function `add(sample(g))` 50000 times
 and collects the results in a vector. Each time the function is called a single sample
 is drawn from the multivariate normal distribution. Each sample is a vector containing
 one `componentA` and `componentB` pair. The `add` function adds the values in the vector to
 calculate the length of the pair. Over the long term the samples drawn from the
 multivariate normal distribution will conform to the covariance matrix used to construct it.

 Just as in the uncorrelated example, an empirical distribution is used to model probabilities
 of the simulation vector, and the `cumulativeProbability` function is used to compute the cumulative
 probability that the combined component length will be 5 centimeters or less.

 Notice that the probability of a hinge meeting specification has dropped to 0.9889517439980468.
 This is because the strong positive correlation
 between the lengths of the components means that their lengths rise and fall together, which widens the
 spread of the combined length and causes more hinges to fall outside the 5 centimeter specification.

 [source,text]
 ----
 let(a=random(hinges, q="*:*", rows="5000", fl="componentA_d, componentB_d"),
     b=col(a, componentA_d),
     c=col(a, componentB_d),
     cor=corr(b,c),
     d=array(mean(b), mean(c)),
     e=transpose(matrix(b, c)),
     f=cov(e),
     g=multiVariateNormalDistribution(d, f),
     h=monteCarlo(add(sample(g)), 50000),
     i=empiricalDistribution(h),
     j=cumulativeProbability(i, 5))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "j": 0.9889517439980468
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 599
       }
     ]
   }
 }
 ----
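
The drop in probability can be reproduced approximately outside Solr with a short Python sketch. It assumes, purely for illustration, that the components in the collection have the same means and standard deviations as in the uncorrelated example and that their correlation is the value reported by `corr`; the section does not state the actual statistics of the collection. The function names are hypothetical. Correlated pairs are drawn with a two-variable Cholesky construction, and the closed form uses Var(A + B) = sd_a^2 + sd_b^2 + 2 * rho * sd_a * sd_b.

[source,python]
----
import math
import random
from statistics import NormalDist


def simulate_correlated(mean_a, sd_a, mean_b, sd_b, rho, runs=50_000, seed=7):
    """Draw correlated (A, B) pairs from a bivariate normal and sum each pair."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        a = mean_a + sd_a * z1
        # Mixing z1 into B gives A and B a correlation of rho.
        b = mean_b + sd_b * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        results.append(a + b)
    return results


def analytic_prob(mean_a, sd_a, mean_b, sd_b, rho, limit=5.0):
    """P(A + B <= limit) for jointly normal A and B with correlation rho."""
    sd_sum = math.sqrt(sd_a ** 2 + sd_b ** 2 + 2.0 * rho * sd_a * sd_b)
    return NormalDist(mean_a + mean_b, sd_sum).cdf(limit)


# Assumed parameters: marginals from the uncorrelated example, rho from corr.
params = (2.2, 0.0195, 2.71, 0.0198, 0.9996931313216989)
results = simulate_correlated(*params)
simulated = sum(1 for r in results if r <= 5.0) / len(results)
analytic = analytic_prob(*params)
print(f"simulated={simulated:.4f} analytic={analytic:.4f}")
----

Under these assumptions both values land close to 0.989. Setting `rho` to 0 in `analytic_prob` recovers the higher probability of the uncorrelated case, which isolates the correlation as the cause of the drop.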