
 = Probability Distributions
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 This section of the user guide covers the probability distribution
 framework included in the math expressions library.

 == Visualization

 Probability distributions can be visualized with Zeppelin-Solr using the
 `zplot` function with the `dist` parameter, which visualizes the
 probability density function (PDF) of the distribution.

 Example visualizations are shown with each distribution below.

 === Continuous Distributions

 Continuous probability distributions work with continuous numbers (floating-point values). Below
 are the supported continuous probability distributions.

 ==== empiricalDistribution

 The `empiricalDistribution` function creates a continuous probability
 distribution from actual data.

 Empirical distributions can be used to conveniently visualize the probability density function of a random sample from a SolrCloud collection.
 The example below shows the `zplot` function visualizing the probability
 density of a random sample with a 32 bin histogram.

 image::images/math-expressions/empirical.png[]

 ==== normalDistribution

 The visualization below shows a normal distribution with a `mean` of 0 and `standard deviation` of 1.

 image::images/math-expressions/dist.png[]
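
 As a rough cross-check of the curve in the plot, the density of a normal distribution can be evaluated directly.
 The sketch below is plain Python rather than a Solr expression, and `normal_pdf` is a hypothetical helper name used only for illustration.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Evaluate the standard normal density on a coarse grid from -3 to 3.
xs = [i / 2 for i in range(-6, 7)]
densities = [normal_pdf(x) for x in xs]

# The density is symmetric and peaks at the mean.
peak_x = xs[densities.index(max(densities))]
print(peak_x, max(densities))
```

 The peak sits at the mean, and the density falls off symmetrically on either side, which is the shape shown in the visualization.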


 ==== logNormalDistribution

 The visualization below shows a log normal distribution with a `shape` of 0.25 and `scale`
 of 0.

 image::images/math-expressions/lognormal.png[]

 ==== gammaDistribution

 The visualization below shows a gamma distribution with a `shape` of 7.5 and `scale`
 of 1.

 image::images/math-expressions/gamma.png[]

 ==== betaDistribution

 The visualization below shows a beta distribution with a `shape1` of 2 and `shape2`
 of 2.

 image::images/math-expressions/beta.png[]

 ==== uniformDistribution

 The visualization below shows a uniform distribution between 0 and 10.

 image::images/math-expressions/uniformr.png[]

 ==== weibullDistribution

 The visualization below shows a Weibull distribution with a `shape` of 5 and `scale`
 of 1.

 image::images/math-expressions/weibull.png[]

 ==== triangularDistribution

 The visualization below shows a triangular distribution with a low value of 5, a mode of 10,
 and a high value of 20.

 image::images/math-expressions/triangular.png[]

 ==== constantDistribution

 The visualization below shows a constant distribution of 10.5.

 image::images/math-expressions/constant.png[]


 === Discrete Distributions

 Discrete probability distributions work with discrete numbers (integers). Below are the
 supported discrete probability distributions.

 ==== enumeratedDistribution

 The `enumeratedDistribution` function creates a discrete distribution function
 from an enumerated list of values and probabilities, or
 from a data set of discrete values.

 The visualization below shows an enumerated distribution created from a list of
 discrete values and probabilities.

 image::images/math-expressions/enum1.png[]

 The visualization below shows an enumerated distribution generated from a search
 result that has been transformed into a vector of discrete values.

 image::images/math-expressions/enum2.png[]

 ==== poissonDistribution

 The visualization below shows a Poisson distribution with a `mean` of 15.

 image::images/math-expressions/poisson.png[]


 ==== binomialDistribution

 The visualization below shows a binomial distribution with 100 trials and a
 probability of success of 0.15.

 image::images/math-expressions/binomial.png[]
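
 To make the shape of this plot concrete, the binomial probability mass function can be computed by hand.
 The following is a minimal Python sketch, not Solr code; `binomial_pmf` is a hypothetical helper, and `math.comb` assumes Python 3.8 or later.

```python
import math

def binomial_pmf(k, trials, p):
    """Probability of exactly k successes in the given number of independent trials."""
    return math.comb(trials, k) * p**k * (1 - p) ** (trials - k)

trials, p = 100, 0.15
pmf = [binomial_pmf(k, trials, p) for k in range(trials + 1)]

# The most likely number of successes lies near trials * p.
most_likely = max(range(trials + 1), key=lambda k: pmf[k])
print(most_likely, sum(pmf))
```

 The probabilities sum to 1 and the mass concentrates around 15 successes, which is where the plotted distribution peaks.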


 ==== uniformIntegerDistribution

 The visualization below shows a uniform integer distribution between 0 and 10.

 image::images/math-expressions/uniform.png[]


 ==== geometricDistribution

 The visualization below shows a geometric distribution with a probability of success of
 0.25.

 image::images/math-expressions/geometric.png[]


 ==== zipFDistribution

 The visualization below shows a Zipf distribution with a size of 50 and an exponent of 1.

 image::images/math-expressions/zipf.png[]
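
 The steep decay in this plot follows from the standard definition of the Zipf probability mass function, in which the probability of rank `k` is proportional to `1 / k^exponent`.
 The Python sketch below illustrates that definition; `zipf_pmf` is a hypothetical helper and is not part of Solr.

```python
def zipf_pmf(k, size, exponent):
    """Probability of rank k (1-based) in a Zipf distribution over `size` ranks."""
    # Normalizing constant: the generalized harmonic number H(size, exponent).
    norm = sum(1.0 / i**exponent for i in range(1, size + 1))
    return (1.0 / k**exponent) / norm

size, exponent = 50, 1
pmf = [zipf_pmf(k, size, exponent) for k in range(1, size + 1)]

# With an exponent of 1, rank 1 is twice as likely as rank 2.
ratio = pmf[0] / pmf[1]
print(ratio, sum(pmf))
```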


 === Cumulative Probability

 The `cumulativeProbability` function can be used with all
 probability distributions to calculate the cumulative probability
 of a value: the probability that a random variable drawn from the
 distribution is less than or equal to that value.

 Below is an example of calculating the cumulative probability
 of a value within a normal distribution.

 [source,text]
 ----
 let(a=normalDistribution(10, 5),
     b=cumulativeProbability(a, 12))
 ----

 In this example a normal distribution function is created
 with a mean of 10 and a standard deviation of 5. Then
 the cumulative probability of the value 12 is calculated for this
 specific distribution.

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": 0.6554217416103242
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----
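
 The value returned above can be checked independently with the closed form of the normal cumulative distribution function, which is expressed through the error function.
 This is a Python sketch for verification only; `normal_cdf` is a hypothetical helper name.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Same parameters as the expression above: mean 10, standard deviation 5, value 12.
b = normal_cdf(12, 10, 5)
print(b)
```

 The result agrees with the `/stream` response to within floating-point rounding: roughly 65.5% of values drawn from this distribution are less than or equal to 12.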

 === Probability

 All probability distributions can calculate the probability
 of a value falling within a range of values.

 In the following example an empirical distribution is created
 from a sample of file sizes drawn from the logs collection.
 Then the probability of a file size falling between 40000
 and 41000 is calculated to be approximately 19%.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="filesize_d", rows="50000"),
     b=col(a, filesize_d),
     c=empiricalDistribution(b, 100),
     d=probability(c, 40000, 41000))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "d": 0.19006540560734791
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 550
       }
     ]
   }
 }
 ----
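
 Conceptually, a range probability is the difference between two cumulative probabilities.
 The Python sketch below illustrates the idea with a plain empirical CDF over a tiny, made-up list of file sizes.
 It is a simplification: the binning and smoothing that Solr's `empiricalDistribution` applies are not reproduced here, and the helper names and data are hypothetical.

```python
from bisect import bisect_right

def empirical_cdf(sorted_values, x):
    """Fraction of observations that are less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def range_probability(values, low, high):
    """Estimate P(low < X <= high) as CDF(high) - CDF(low)."""
    ordered = sorted(values)
    return empirical_cdf(ordered, high) - empirical_cdf(ordered, low)

# Hypothetical stand-in for the filesize_d vector.
file_sizes = [38500, 39200, 40100, 40400, 40900, 41500, 42000, 43100, 39900, 40750]
d = range_probability(file_sizes, 40000, 41000)
print(d)
```

 Four of the ten made-up observations fall inside the range, so the estimate is about 0.4.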

 === Discrete Probability

 The `probability` function can be used with any discrete
 distribution function to compute the probability of a
 discrete value.

 Below is an example which calculates the probability
 of a discrete value within a Poisson distribution.

 In the example a Poisson distribution function is created
 with a mean of `100`. Then the
 probability of encountering a sample of the discrete value 101 is calculated for this
 specific distribution.

 [source,text]
 ----
 let(a=poissonDistribution(100),
     b=probability(a, 101))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": 0.039466333474403106
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----
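
 The result above can be reproduced from the Poisson probability mass function.
 The Python sketch below evaluates it in log space to avoid overflow in the factorial; `poisson_pmf` is a hypothetical helper, not a Solr function.

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution, computed in log space."""
    # log(mean^k * e^-mean / k!) = k*log(mean) - mean - log(k!)
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

# Same parameters as the expression above: mean 100, discrete value 101.
b = poisson_pmf(101, 100)
print(b)
```

 The computed value matches the `/stream` response to within floating-point rounding, a little under 4%.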


 === Sampling

 All probability distributions support sampling. The `sample`
 function returns one or more random samples from a probability distribution.

 Below is an example drawing a single sample from a normal distribution.

 [source,text]
 ----
 let(a=normalDistribution(10, 5),
     b=sample(a))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": 11.24578055004963
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----

 The `sample` function can also return a vector of samples. Vectors of samples
 can be visualized as scatter plots to gain an intuitive understanding
 of the underlying distribution.

 The first example shows the scatter plot of a normal distribution with
 a mean of 0 and a standard deviation of 5.

 image::images/math-expressions/sample-scatter.png[]

 The next example shows a scatter plot of the same distribution with
 an ascending sort applied to the sample vector.

 image::images/math-expressions/sample-scatter1.png[]

 The next example shows two different distributions overlaid
 in the same scatter plot.

 image::images/math-expressions/sample-overlay.png[]
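
 The data behind scatter plots like these can be approximated outside Solr.
 The Python sketch below draws a vector of samples from a normal distribution with a mean of 0 and a standard deviation of 5, then sorts it in ascending order.
 The seed and the sample size of 10000 are arbitrary choices made for this illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Draw a vector of samples from a normal distribution (mean 0, standard deviation 5).
samples = [rng.gauss(0, 5) for _ in range(10000)]

# Sorting the vector turns the scatter plot into a smooth S-shaped curve.
sorted_samples = sorted(samples)

sample_mean = sum(samples) / len(samples)
print(sample_mean, sorted_samples[0], sorted_samples[-1])
```

 Plotting `samples` against their index gives a cloud centered on the mean, while plotting `sorted_samples` traces the quantiles of the distribution.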


 === Multivariate Normal Distribution

 The multivariate normal distribution is a generalization of the
 univariate normal distribution to higher dimensions.

 The multivariate normal distribution models two or more random
 variables that are normally distributed. The relationship between the variables is defined by a covariance matrix.

 ==== Sampling

 The `sample` function can be used to draw samples
 from a multivariate normal distribution in much the same
 way as a univariate normal distribution.

 The difference is that each sample will be an array containing a sample
 drawn from each of the underlying normal distributions.
 If multiple samples are drawn, the `sample` function returns a matrix with a
 sample in each row. Over the long term the columns of the sample
 matrix will conform to the covariance matrix used to parametrize the
 multivariate normal distribution.

 The example below demonstrates how to initialize and draw samples
 from a multivariate normal distribution.

 In this example 5000 random samples are selected from a collection of log records. Each sample contains
 the fields `filesize_d` and `response_d`. The values of both fields conform to a normal distribution.

 Both fields are then vectorized. The `filesize_d` vector is stored in
 variable `b` and the `response_d` vector is stored in variable `c`.

 An array is created that contains the means of the two vectorized fields.

 Then both vectors are added to a matrix which is transposed. This creates
 an observation matrix where each row contains one observation of
 `filesize_d` and `response_d`. A covariance matrix is then created from the columns of
 the observation matrix with the `cov` function. The covariance matrix describes the covariance between
 `filesize_d` and `response_d`.

 The `multiVariateNormalDistribution` function is then called with the
 array of means for the two fields and the covariance matrix. The model for the
 multivariate normal distribution is assigned to variable `g`.

 Finally, five samples are drawn from the multivariate normal distribution.

 [source,text]
 ----
 let(a=random(logs, q="*:*", rows="5000", fl="filesize_d, response_d"),
     b=col(a, filesize_d),
     c=col(a, response_d),
     d=array(mean(b), mean(c)),
     e=transpose(matrix(b, c)),
     f=cov(e),
     g=multiVariateNormalDistribution(d, f),
     h=sample(g, 5))
 ----

 The samples are returned as a matrix, with each row representing one sample. There are two
 columns in the matrix. The first column contains samples for `filesize_d` and the second
 column contains samples for `response_d`. Over the long term the covariance between
 the columns will conform to the covariance matrix used to instantiate the
 multivariate normal distribution.

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "h": [
           [
             41974.85669321393,
             779.4097049705296
           ],
           [
             42869.19876441414,
             834.2599296790783
           ],
           [
             38556.30444839889,
             720.3683470060988
           ],
           [
             37689.31290928216,
             686.5549428100018
           ],
           [
             40564.74398214547,
             769.9328090774
           ]
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 162
       }
     ]
   }
 }
 ----
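
 The same pipeline (means, covariance matrix, sampling) can be sketched in plain Python for the two-variable case.
 This is an illustration under stated assumptions, not the method Solr uses internally: the input vectors are synthetic stand-ins for `filesize_d` and `response_d`, the helper names are hypothetical, and samples are generated from a Cholesky factor of the covariance matrix.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance of two equal-length vectors."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_bivariate_normal(means, cov, n, rng):
    """Draw n two-element samples using the 2x2 Cholesky factor of cov."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([means[0] + l11 * z1, means[1] + l21 * z1 + l22 * z2])
    return rows

rng = random.Random(7)
# Synthetic, correlated stand-ins for the two vectorized fields.
b = [rng.gauss(40000, 2000) for _ in range(5000)]
c = [0.02 * x + rng.gauss(0, 20) for x in b]

d = [mean(b), mean(c)]                       # array of means
f = [[covariance(b, b), covariance(b, c)],   # covariance matrix
     [covariance(c, b), covariance(c, c)]]
h = sample_bivariate_normal(d, f, 5, rng)    # five samples, one per row
print(h)
```

 Drawing a much larger number of samples and recomputing the covariance of the two columns should yield values close to `f`, which is the long-term behavior described above.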