
 = Statistics
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.


 This section of the user guide covers the core statistical functions
 available in math expressions.


 <<descriptive-statistics, Descriptive Statistics>> - <<Histograms and Frequency Tables, Histograms>> -
 <<Frequency Tables, Frequency Tables>> - <<Percentiles, Percentiles>> - <<Quantile Plots, Quantile Plots>> -
 <<Correlation and Covariance, Correlation and Covariance>> - <<Statistical Inference Tests, Inference Tests>> -
 <<Transformations, Transformations>> - <<Z-scores, Z-scores>>


 == Descriptive Statistics

 The `describe` function returns descriptive statistics for a
 numeric array. The `describe` function returns a single *tuple* with name/value
 pairs containing the descriptive statistics.

 Below is a simple example that selects a random sample of documents from the *logs* collection,
 vectorizes the *response_d* field in the result set and uses the `describe` function to
 return descriptive statistics about the vector.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="response_d", rows="50000"),
     b=col(a, response_d),
     c=describe(b))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "sumsq": 36674200601.78738,
         "max": 1068.854686837548,
         "var": 1957.9752647562789,
         "geometricMean": 854.1445499569674,
         "sum": 42764648.83319176,
         "kurtosis": 0.013189848821424377,
         "N": 50000,
         "min": 656.023249311864,
         "mean": 855.2929766638425,
         "popVar": 1957.936105250984,
         "skewness": 0.0014560741802307174,
         "stdev": 44.24901428005237
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 430
       }
     ]
   }
 }
 ----

 Notice that the random sample contains 50,000 records and the response
 time is only 430 milliseconds. Samples of this size can be used to
 reliably estimate the statistics for very large underlying
 data sets with sub-second performance.
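The relationships between the fields in the `describe` output can be sketched in plain Python. This is a minimal illustration that uses only the standard library, not Solr's implementation; the function name `describe` and the input values are chosen for the example. It shows that `var` is the sample variance (dividing by N - 1), `popVar` is the population variance (dividing by N), and `stdev` is the square root of `var`, which is consistent with the values in the response above.

```python
import math
import statistics


def describe(values):
    """Return summary statistics similar in spirit to the describe output."""
    var = statistics.variance(values)       # sample variance, divides by N - 1
    pop_var = statistics.pvariance(values)  # population variance, divides by N
    return {
        "N": len(values),
        "min": min(values),
        "max": max(values),
        "sum": math.fsum(values),
        "sumsq": math.fsum(v * v for v in values),
        "mean": statistics.fmean(values),
        "var": var,
        "popVar": pop_var,
        "stdev": math.sqrt(var),
        # The geometric mean is only defined for positive values.
        "geometricMean": statistics.geometric_mean(values),
    }


# Hypothetical data, used only to exercise the function.
stats = describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(stats)
```

For any sample, `popVar` equals `var * (N - 1) / N`, so the two values converge as the sample grows.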


 The `describe` function can also be visualized in a table with Zeppelin-Solr:

 image::images/math-expressions/describe.png[]


 == Histograms and Frequency Tables

 Histograms and frequency tables are tools for visualizing the distribution
 of a random variable.

 The `hist` function creates a histogram designed for usage with continuous data. The
 `freqTable` function creates a frequency table for use with discrete data.

 === Histograms

 In the example below a histogram is used to visualize a random sample of
 response times from the logs collection. The example retrieves the
 random sample with the `random` function and creates a vector from the *response_d* field
 in the result set. Then the `hist` function is applied to the vector
 to return a histogram with 22 bins. The `hist` function returns a
 list of tuples with summary statistics for each bin.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="response_d", rows="50000"),
     b=col(a, response_d),
     c=hist(b,  22))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,text]
 ----
 {
   "result-set": {
     "docs": [
       {
         "prob": 0.00004896007228311655,
         "min": 675.573084576817,
         "max": 688.3309631697003,
         "mean": 683.805542728906,
         "var": 50.9974629924082,
         "cumProb": 0.000030022417162809913,
         "sum": 2051.416628186718,
         "stdev": 7.141250800273591,
         "N": 3
       },
       {
         "prob": 0.00029607514624062624,
         "min": 696.2875238591652,
         "max": 707.9706315779541,
         "mean": 702.1110569558929,
         "var": 14.136444379466969,
         "cumProb": 0.00022705264963879807,
         "sum": 11233.776911294284,
         "stdev": 3.759846323916307,
         "N": 16
       },
       {
         "prob": 0.0011491235433157194,
         "min": 709.1574910598678,
         "max": 724.9027194369135,
         "mean": 717.8554290699951,
         "var": 20.6935845290122,
         "cumProb": 0.0009858515418689757,
         "sum": 41635.61488605971,
         "stdev": 4.549020172412098,
         "N": 58
       },
       ...
       ]}}
 ----

 With Zeppelin-Solr the histogram can be first visualized as a table:

 image::images/math-expressions/histtable.png[]

 Then the histogram can be visualized with an area chart by plotting the *mean* of
 the bins on the *x-axis* and the *prob* (probability) on the *y-axis*:

 image::images/math-expressions/hist.png[]

 The cumulative probability can be plotted by switching the *y-axis* to the *cumProb* column:

 image::images/math-expressions/cumProb.png[]
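The idea of summarizing each bin can be sketched in plain Python. This is a simplified illustration that assumes equal-width bins between the minimum and maximum of the sample; it is not Solr's implementation. Note that the `prob` and `cumProb` values in the response above are not simply `N` divided by the sample size, so the sketch reports plain fractions under the hypothetical names `fraction` and `cumFraction` instead.

```python
def hist(values, bins):
    """Summarize values in equal-width bins (an assumption of this sketch)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    buckets = [[] for _ in range(bins)]
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), bins - 1)
        buckets[idx].append(v)

    total = len(values)
    cumulative = 0
    summary = []
    for bucket in buckets:
        cumulative += len(bucket)
        summary.append({
            "N": len(bucket),
            "min": min(bucket) if bucket else None,
            "max": max(bucket) if bucket else None,
            "mean": sum(bucket) / len(bucket) if bucket else None,
            "fraction": len(bucket) / total,
            "cumFraction": cumulative / total,
        })
    return summary


# Hypothetical data, used only to exercise the function.
result = hist([1, 2, 2, 3, 3, 3, 4, 4, 5, 10], 3)
for row in result:
    print(row)
```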

 === Frequency Tables

 The `freqTable` function returns a frequency distribution for a discrete data set.
 The `freqTable` function doesn't create bins like the histogram. Instead it counts
 the occurrence of each discrete data value and returns a list of tuples with the
 frequency statistics for each value.

 Below is an example of a frequency table built from a result set
 of rounded *differences* in daily opening stock prices for the stock ticker *amzn*.

 This example is interesting because it shows a multi-step process to arrive
 at the result. The first step is to *search* for records in the *stocks*
 collection with a ticker of *amzn*. Notice that the result set is sorted by
 date ascending and it returns the *open_d* field which is the opening price for
 the day.

 The *open_d* field is then vectorized and set to variable *b*, which now contains
 a vector of opening prices ordered by date ascending.

 The `diff` function is then used to calculate the *first difference* for the
 vector of opening prices. The first difference simply subtracts the previous value
 from each value in the array. This will provide an array of price differences
 for each day which will show daily change in opening price.

 Then the `round` function is used to round the price differences to the nearest
 integer to create a vector of discrete values. The `round` function in this
 example is effectively *binning* continuous data at integer boundaries.

 Finally the `freqTable` function is run on the discrete values to calculate
 the frequency table.

 [source,text]
 ----
 let(a=search(stocks,
              q="ticker_s:amzn",
              fl="open_d, date_dt",
              sort="date_dt asc",
              rows=25000),
     b=col(a, open_d),
     c=diff(b),
     d=round(c),
     e=freqTable(d))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,text]
 ----
  {
    "result-set": {
      "docs": [
        {
          "pct": 0.00019409937888198756,
          "count": 1,
          "cumFreq": 1,
          "cumPct": 0.00019409937888198756,
          "value": -57
        },
        {
          "pct": 0.00019409937888198756,
          "count": 1,
          "cumFreq": 2,
          "cumPct": 0.00038819875776397513,
          "value": -51
        },
        {
          "pct": 0.00019409937888198756,
          "count": 1,
          "cumFreq": 3,
          "cumPct": 0.0005822981366459627,
          "value": -49
        },
        ...
        ]}}
 ----

 With Zeppelin-Solr the frequency table can be first visualized as a table:

 image::images/math-expressions/freqTable.png[]

 The frequency table can then be plotted by switching to a scatter chart and selecting
 the *value* column for the *x-axis* and the *count* column for the *y-axis*:

 image::images/math-expressions/freqTable1.png[]

 Notice that the visualization nicely displays the frequency of daily change in stock prices
 rounded to integers. The most frequently occurring value is 0 with 1494 occurrences followed by
  -1 and 1 with around 700 occurrences.
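The difference, round, and count steps described above can be sketched in plain Python. This is an illustration using only the standard library, not Solr's implementation; the opening prices are hypothetical, and rounding halves upward is an assumption about how `round` treats ties. The `pct` and `cumPct` columns are computed as `count` and `cumFreq` divided by the number of values, which is consistent with the response above.

```python
import math
from collections import Counter


def diff(values):
    """First difference: subtract the previous value from each value."""
    return [current - previous for previous, current in zip(values, values[1:])]


def round_half_up(x):
    """Round to the nearest integer, with halves rounded upward (assumption)."""
    return math.floor(x + 0.5)


def freq_table(values):
    """Count each discrete value and accumulate frequencies in value order."""
    counts = Counter(values)
    total = len(values)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append({
            "value": value,
            "count": counts[value],
            "pct": counts[value] / total,
            "cumFreq": cumulative,
            "cumPct": cumulative / total,
        })
    return rows


# Hypothetical opening prices ordered by date ascending.
opens = [10.0, 10.4, 12.1, 11.9, 11.8, 13.0, 12.9]
table = freq_table([round_half_up(d) for d in diff(opens)])
for row in table:
    print(row)
```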


 == Percentiles

 The `percentile` function returns the estimated value for a specific percentile in
 a sample set. The example below returns a random sample containing the *response_d* field
 from the logs collection. The *response_d* field is vectorized and the 20th percentile
 is calculated for the vector:

 [source,text]
 ----
 let(a=random(logs, q="*:*", rows="15000", fl="response_d"),
     b=col(a, response_d),
     c=percentile(b, 20))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
  {
    "result-set": {
      "docs": [
        {
          "c": 818.073554
        },
        {
          "EOF": true,
          "RESPONSE_TIME": 286
        }
      ]
    }
  }
 ----

 The `percentile` function can also compute an array of percentile values.
 The example below is computing the 20th, 40th, 60th and 80th percentiles for a random sample
 of the *response_d* field:

 [source,text]
 ----
 let(a=random(logs, q="*:*", rows="15000", fl="response_d"),
     b=col(a, response_d),
     c=percentile(b, array(20,40,60,80)))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "c": [
           818.0835543394625,
           843.5590348165282,
           866.1789509894824,
           892.5033386599067
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 291
       }
     ]
   }
 }
 ----
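One common way to estimate a percentile is to sort the sample, locate the position `p * (N + 1) / 100`, and interpolate linearly between the two neighboring values. The Python sketch below illustrates that estimator; it is an assumption made for illustration, and the estimator used by the `percentile` function may differ in detail.

```python
def percentile(values, p):
    """Estimate the p-th percentile by linear interpolation at p * (N + 1) / 100."""
    xs = sorted(values)
    n = len(xs)
    pos = p * (n + 1) / 100.0  # 1-based position in the sorted sample
    if pos < 1:
        return xs[0]
    if pos >= n:
        return xs[-1]
    lower = int(pos)       # index (1-based) of the value at or below pos
    fraction = pos - lower
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])


# Hypothetical data, used only to exercise the function.
data = [10, 20, 30, 40]
median = percentile(data, 50)
several = [percentile(data, p) for p in (20, 40, 60, 80)]
print(median, several)
```

Passing a list of percentiles, as in the last line, mirrors the array form of the function shown above.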

 === Quantile Plots

 A quantile plot, or QQ Plot, plots the percentiles from two distributions on
 the same scatter plot for comparison.

 The example below uses the sampling capability
 described in the <<probability-distributions.adoc#probability-distributions,Probability>> section of the user guide. The same
 technique can be used with random samples drawn with the `random` function on empirical data
 stored in Solr Cloud collections. But it's very useful to be able to
 use the probability distribution framework to sample from different distributions
 to learn how to read QQ plots.

 In the example 50000 samples from two normal distributions are drawn. Both distributions
 have a mean of 500 but have different standard deviations. A sequence is then created
 with 98 integers starting from 1 with a stride of 1. This sequence will be used
 to specify the
 percentiles to calculate and also serve as the *x-axis* in the plot.
 Then the percentile function is used to calculate the percentiles for
 both distributions.

 Finally `zplot` is used to plot the percentiles sequence on the *x-axis* and the calculated
 percentile values for both distributions on the *y-axis*. A scatter plot is used
 to visualize the QQ plot.

 image::images/math-expressions/quantile-plot.png[]

 Notice there are two scatter plots that intersect at 500 which is the mean
 of both distributions. But the red scatter plot, which has the
 higher standard deviation,
 has a steeper slope. The higher standard deviation creates a steeper slope
 because the percentile values are dispersed farther from the mean.
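The data behind such a plot can be sketched in plain Python with the standard library. This is an illustration only, not the Solr expression used to produce the image; the standard deviations of 20 and 40 and the seeds are assumptions, since only the shared mean of 500 is stated above.

```python
import statistics

# Two normal distributions with the same mean; the standard deviations
# (20 and 40) are hypothetical values chosen for this sketch.
narrow = statistics.NormalDist(mu=500, sigma=20).samples(50000, seed=1)
wide = statistics.NormalDist(mu=500, sigma=40).samples(50000, seed=2)

# 98 integers starting from 1 with a stride of 1: the x-axis of the plot.
percentiles = list(range(1, 99))


def percentile_values(sample):
    """Return the value at each requested percentile of the sample."""
    cuts = statistics.quantiles(sample, n=100)  # 99 cut points: 1st to 99th
    return [cuts[p - 1] for p in percentiles]


narrow_q = percentile_values(narrow)
wide_q = percentile_values(wide)

# Both series pass close to 500 at the 50th percentile, but the series with
# the higher standard deviation covers a wider range, hence the steeper slope.
print(narrow_q[49], wide_q[49])
print(narrow_q[-1] - narrow_q[0], wide_q[-1] - wide_q[0])
```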


 == Correlation and Covariance

 Correlation and Covariance measure how random variables fluctuate
 together.

 === Correlation and Correlation Matrices

 Correlation measures the strength of the linear relationship between two vectors. Correlation values range from
 -1 to 1.

 Three correlation types are supported:

 * *pearsons* (default)
 * *kendalls*
 * *spearmans*

 The type of correlation is specified by adding the *type* named parameter in the
 function call.

 In the example below a random sample containing two fields, *filesize_d* and *response_d*, is drawn from
 the logs collection using the `random` function. The fields are vectorized into the
 variables *x* and *y* and then *Spearman's* correlation for
 the two vectors is calculated using the `corr` function.

 image::images/math-expressions/correlation.png[]
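The difference between Pearson's and Spearman's correlation can be sketched in plain Python. This is an illustration of the two statistics, not Solr's implementation; the function names and data are chosen for the example. Spearman's correlation is computed here as Pearson's correlation of the ranks, with tied values sharing the average of their ranks.

```python
import math


def pearson(x, y):
    """Pearson's correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
```

Because `y` always increases with `x`, the rank-based measure reaches 1 while the linear measure stays slightly below it.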

 ==== Correlation Matrices

 Correlation matrices are powerful tools for visualizing the correlation between two or more
 vectors.

 The `corr` function builds a correlation matrix
 if a matrix is passed as the parameter. The correlation matrix is computed by correlating the *columns*
 of the matrix.

 The example below demonstrates the power of correlation matrices combined with 2 dimensional faceting.

 In this example the `facet2D` function is used to generate a two dimensional facet aggregation
 over the fields *complaint_type_s* and *zip_s* from the *nyc311* complaints database.
 The *top 20* complaint types and the *top 25* zip codes for each complaint type are aggregated.
 The result is a stream of tuples each containing the fields *complaint_type_s*, *zip_s* and
 the count for the pair.

 The `pivot` function is then used to pivot the fields into a *matrix* with the *zip_s*
 field as the *rows* and the *complaint_type_s* field as the *columns*. The `count(*)` field populates
 the values in the cells of the matrix.

 The `corr` function is then used to correlate the *columns* of the matrix. This produces a correlation matrix
 that shows how complaint types are correlated based on the zip codes they appear in. Another way to look at this
 is it shows how the different complaint types tend to co-occur across zip codes.

 Finally the `zplot` function is used to plot the correlation matrix as a heat map.

 image::images/math-expressions/corrmatrix.png[]

 Notice in the example the correlation matrix is square with complaint types shown on both
 the *x* and *y* axes. The color of the cells in the heat map shows the
 intensity of the correlation between the complaint types.

 The heat map is interactive, so mousing over one of the cells pops up the values
 for the cell.

 image::images/math-expressions/corrmatrix2.png[]

 Notice that HEAT/HOT WATER and UNSANITARY CONDITION complaints have a correlation of .8 (rounded to the nearest
 tenth).
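The pivot and column-correlation steps can be sketched in plain Python. This is an illustration only: the tuples below are hypothetical stand-ins for the facet aggregation, filling missing pairs with 0 is an assumption of the sketch, and Pearson's correlation is used for the cells.

```python
import math

# Hypothetical (zip, complaint type, count) tuples standing in for the
# output of a two dimensional facet aggregation.
tuples = [
    ("10001", "NOISE", 10), ("10001", "HEAT", 4), ("10001", "WATER", 5),
    ("10002", "NOISE", 2), ("10002", "HEAT", 9), ("10002", "WATER", 8),
    ("10003", "NOISE", 6), ("10003", "HEAT", 6), ("10003", "WATER", 7),
]


def pivot(rows):
    """Build a matrix with zip codes as rows and complaint types as columns."""
    row_keys = sorted({r[0] for r in rows})
    col_keys = sorted({r[1] for r in rows})
    cells = {(r[0], r[1]): r[2] for r in rows}
    # Missing (row, column) pairs are filled with 0 (assumption).
    matrix = [[cells.get((rk, ck), 0) for ck in col_keys] for rk in row_keys]
    return row_keys, col_keys, matrix


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corr_matrix(matrix):
    """Correlate every pair of columns, producing a square matrix."""
    columns = list(zip(*matrix))
    return [[pearson(a, b) for b in columns] for a in columns]


zips, complaint_types, m = pivot(tuples)
cm = corr_matrix(m)
print(complaint_types)
for row in cm:
    print(row)
```

The result has one row and one column per complaint type, with 1 on the diagonal because every column is perfectly correlated with itself.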

 === Covariance and Covariance Matrices

 Covariance is an unscaled measure of correlation.

 The `cov` function calculates the covariance of two vectors of data.

 In the example below a random sample containing two fields, *filesize_d* and *response_d*, is drawn from
 the logs collection using the `random` function. The fields are vectorized into the
 variables *x* and *y* and then the covariance for
 the two vectors is calculated using the `cov` function.

 image::images/math-expressions/covariance.png[]

 If a matrix is passed to the `cov` function it will automatically compute a covariance
 matrix for the *columns* of the matrix.

 Notice in the example below that the *x* and *y* vectors are added to a matrix.
 The matrix is then transposed to turn the rows into columns,
 and the covariance matrix is computed for the columns of the matrix.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="filesize_d, response_d", rows=50000),
     x=col(a, filesize_d),
     y=col(a, response_d),
     m=transpose(matrix(x, y)),
     covariance=cov(m))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
  {
    "result-set": {
      "docs": [
        {
          "covariance": [
            [
              4018404.072532102,
              80243.3948172242
            ],
            [
              80243.3948172242,
              1948.3216661122592
            ]
          ]
        },
        {
          "EOF": true,
          "RESPONSE_TIME": 534
        }
      ]
    }
  }
 ----

 The covariance matrix contains both the variance for the two vectors and the covariance between the vectors
 in the following format:


 [source,text]
 ----
          x                 y
  x [4018404.072532102, 80243.3948172242],
  y [80243.3948172242,  1948.3216661122592]
 ----

 The covariance matrix is always square. So a covariance matrix created from 3 vectors will produce a 3 x 3 matrix.
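The layout of a covariance matrix can be sketched in plain Python. This is an illustration rather than Solr's implementation; it assumes the sample form of covariance (dividing by N - 1), and the vectors are hypothetical. The diagonal holds the variance of each vector and the off-diagonal cells hold the covariance between pairs of vectors.

```python
def cov(x, y):
    """Sample covariance of two equal-length vectors (divides by N - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def cov_matrix(vectors):
    """Covariance of every pair of vectors; k vectors give a k x k matrix."""
    return [[cov(a, b) for b in vectors] for a in vectors]


# Hypothetical data, used only to exercise the functions.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]
matrix = cov_matrix([x, y])
for row in matrix:
    print(row)
```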


 == Statistical Inference Tests

 Statistical inference tests test a hypothesis on *random samples* and return p-values which
 can be used to infer the reliability of the test for the entire population.

 The following statistical inference tests are available:

 * `anova`: One-Way-Anova tests if there is a statistically significant difference in the
 means of two or more random samples.

 * `ttest`: The T-test tests if there is a statistically significant difference in the means of two
 random samples.

 * `pairedTtest`: The paired t-test tests if there is a statistically significant difference
 in the means of two random samples with paired data.

 * `gTestDataSet`: The G-test tests if two samples of binned discrete data were drawn
 from the same population.

 * `chiSquareDataset`: The Chi-Squared test tests if two samples of binned discrete data were
 drawn from the same population.

 * `mannWhitney`: The Mann-Whitney test is a non-parametric test that tests if two
 samples of continuous data were pulled
 from the same population. The Mann-Whitney test is often used instead of the T-test when the
 underlying assumptions of the T-test are not
 met.

 * `ks`: The Kolmogorov-Smirnov test tests if two samples of continuous data were drawn from
 the same distribution.

 Below is a simple example of a T-test performed on two random samples.
 The returned p-value of .93 means we cannot reject the null hypothesis
 that there is no statistically significant difference in the means of the two samples.

 [source,text]
 ----
 let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
     b=random(collection1, q="*:*", rows="1500", fl="price_f"),
     c=col(a, price_f),
     d=col(b, price_f),
     e=ttest(c, d))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "e": {
           "p-value": 0.9350135639249795,
           "t-statistic": 0.081545541074817
         }
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 48
       }
     ]
   }
 }
 ----
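The relationship between the t-statistic and the p-value can be sketched in plain Python. This is an illustration, not Solr's implementation: it assumes the unequal-variance (Welch) form of the t-statistic, and it approximates the two-sided p-value with the standard normal distribution, which is only reasonable for large samples such as the 1,500 values per sample used above.

```python
import math
import statistics


def welch_t(a, b):
    """t-statistic for the difference in means, allowing unequal variances."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


def two_sided_p_normal(t):
    """Large-sample approximation: treat the t-statistic as standard normal."""
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))


# Hypothetical small samples, used only to exercise welch_t.
t = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t)

# Applying the approximation to the t-statistic from the response above.
p = two_sided_p_normal(0.081545541074817)
print(p)
```

Applied to the t-statistic in the response above, the approximation gives a p-value of about 0.935, in line with the reported value.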

 == Transformations

 In statistical analysis it's often useful to transform data sets before performing
 statistical calculations. The statistical function library includes the following
 commonly used transformations:

 * `rank`: Returns a numeric array with the rank-transformed value of each element of the original
 array.

 * `log`: Returns a numeric array with the natural log of each element of the original array.

 * `log10`: Returns a numeric array with the base 10 log of each element of the original array.

 * `sqrt`: Returns a numeric array with the square root of each element of the original array.

 * `cbrt`: Returns a numeric array with the cube root of each element of the original array.

 * `recip`: Returns a numeric array with the reciprocal of each element of the original array.

 Below is an example of a `ttest` performed on log-transformed data sets:

 [source,text]
 ----
 let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
     b=random(collection1, q="*:*", rows="1500", fl="price_f"),
     c=log(col(a, price_f)),
     d=log(col(b, price_f)),
     e=ttest(c, d))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "e": {
           "p-value": 0.9655110070265056,
           "t-statistic": -0.04324265449471238
         }
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 58
       }
     ]
   }
 }
 ----

 == Back Transformations

 Vectors that have been transformed with the `log`, `log10`, `sqrt` and `cbrt` functions
 can be back transformed using the `pow` function.

 The example below shows how to back transform data that has been transformed by the
 `sqrt` function.


 [source,text]
 ----
 let(echo="b,c",
     a=array(100, 200, 300),
     b=sqrt(a),
     c=pow(b, 2))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           10,
           14.142135623730951,
           17.320508075688775
         ],
         "c": [
           100,
           200.00000000000003,
           300.00000000000006
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----

 The example below shows how to back transform data that has been transformed by the
 `log10` function.


 [source,text]
 ----
 let(echo="b,c",
     a=array(100, 200, 300),
     b=log10(a),
     c=pow(10, b))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           2,
           2.3010299956639813,
           2.4771212547196626
         ],
         "c": [
           100,
           200.00000000000003,
           300.0000000000001
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----

 Vectors that have been transformed with the `recip` function can be back-transformed by taking the reciprocal
 of the reciprocal.

 The example below shows the back-transformation of the `recip` function.

 [source,text]
 ----
 let(echo="b,c",
     a=array(100, 200, 300),
     b=recip(a),
     c=recip(b))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           0.01,
           0.005,
           0.0033333333333333335
         ],
         "c": [
           100,
           200,
           300
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----
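The pairing of each transformation with its inverse can be sketched in plain Python. This is an illustration using the standard library, not Solr's implementation; the `TRANSFORMS` table and `round_trip` helper are hypothetical names introduced for the example. As in the responses above, the restored values can differ from the originals by a tiny floating-point error.

```python
import math

# Each forward transform is paired with the function that undoes it.
TRANSFORMS = {
    "log": (math.log, math.exp),
    "log10": (math.log10, lambda v: 10.0 ** v),
    "sqrt": (math.sqrt, lambda v: v ** 2),
    "cbrt": (lambda v: v ** (1.0 / 3.0), lambda v: v ** 3),
    "recip": (lambda v: 1.0 / v, lambda v: 1.0 / v),
}


def round_trip(values, name):
    """Apply a transform and then its inverse to every element."""
    forward, inverse = TRANSFORMS[name]
    return [inverse(forward(v)) for v in values]


a = [100.0, 200.0, 300.0]
restored = {name: round_trip(a, name) for name in TRANSFORMS}
for name, values in restored.items():
    print(name, values)
```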

 == Z-scores

 The `zscores` function converts a numeric array to an array of z-scores. The z-score
 is the number of standard deviations a number is from the mean.

 The example below computes the z-scores for the values in an array.


 [source,text]
 ----
 let(a=array(1,2,3),
     b=zscores(a))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           -1,
           0,
           1
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 27
       }
     ]
   }
 }
 ----
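The calculation can be sketched in plain Python. This is an illustration, not Solr's implementation. It uses the sample standard deviation (dividing by N - 1), which reproduces the values in the response above; the population standard deviation would instead give approximately -1.22, 0 and 1.22 for the same input.

```python
import statistics


def zscores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (N - 1)
    return [(v - mean) / stdev for v in values]


z = zscores([1, 2, 3])
print(z)
```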
	= Statistics
	// Licensed to the Apache Software Foundation (ASF) under one
	// or more contributor license agreements. See the NOTICE file
	// distributed with this work for additional information
	// regarding copyright ownership. The ASF licenses this file
	// to you under the Apache License, Version 2.0 (the
	// "License"); you may not use this file except in compliance
	// with the License. You may obtain a copy of the License at
	//
	// http://www.apache.org/licenses/LICENSE-2.0
	//
	// Unless required by applicable law or agreed to in writing,
	// software distributed under the License is distributed on an
	// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	// KIND, either express or implied. See the License for the
	// specific language governing permissions and limitations
	// under the License.


	This section of the user guide covers the core statistical functions
	available in math expressions.


	<<descriptive-statistics, Descriptive Statistics>> - <<Histograms and Frequency Tables, Histograms>> -
	<<Frequency Tables, Frequency Tables>> - <<Percentiles, Percentiles>> - <<Quantile Plots, Quantile Plots>> -
	<<Correlation and Covariance, Correlation and Covariance>> - <<Statistical Inference Tests, Inference Tests>> -
	<<Transformations, Transformations>> - <<Z-scores, Z-scores>>


	== Descriptive Statistics

	The `describe` function returns descriptive statistics for a
	numeric array. The `describe` function returns a single tuple with name/value
	pairs containing the descriptive statistics.

	Below is a simple example that selects a random sample of documents from the logs collection,
	vectorizes the response_d field in the result set and uses the `describe` function to
	return descriptive statistics about the vector.

	[source,text]
	----
	let(a=random(logs, q=":", fl="response_d", rows="50000"),
	b=col(a, response_d),
	c=describe(b))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"sumsq": 36674200601.78738,
	"max": 1068.854686837548,
	"var": 1957.9752647562789,
	"geometricMean": 854.1445499569674,
	"sum": 42764648.83319176,
	"kurtosis": 0.013189848821424377,
	"N": 50000,
	"min": 656.023249311864,
	"mean": 855.2929766638425,
	"popVar": 1957.936105250984,
	"skewness": 0.0014560741802307174,
	"stdev": 44.24901428005237
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 430
	}
	]
	}
	}
	----

	Notice that the random sample contains 50,000 records and the response
	time is only 430 milliseconds. Samples of this size can be used to
	reliably estimate the statistics for very large underlying
	data sets with sub-second performance.


	The `describe` function can also be visualized in a table with Zeppelin-Solr:

	image::images/math-expressions/describe.png[]


	== Histograms and Frequency Tables

	Histograms and frequency tables are tools for visualizing the distribution
	of a random variable.

	The `hist` function creates a histogram designed for usage with continuous data. The
	`freqTable` function creates a frequency table for use with discrete data.

	=== histograms

	In the example below a histogram is used to visualize a random sample of
	response times from the logs collection. The example retrieves the
	random sample with the `random` function and creates a vector from the response_d field
	in the result set. Then the `hist` function is applied to the vector
	to return a histogram with 22 bins. The `hist` function returns a
	list of tuples with summary statistics for each bin.

	[source,text]
	----
	let(a=random(logs, q=":", fl="response_d", rows="50000"),
	b=col(a, response_d),
	c=hist(b, 22))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,text]
	----
	{
	"result-set": {
	"docs": [
	{
	"prob": 0.00004896007228311655,
	"min": 675.573084576817,
	"max": 688.3309631697003,
	"mean": 683.805542728906,
	"var": 50.9974629924082,
	"cumProb": 0.000030022417162809913,
	"sum": 2051.416628186718,
	"stdev": 7.141250800273591,
	"N": 3
	},
	{
	"prob": 0.00029607514624062624,
	"min": 696.2875238591652,
	"max": 707.9706315779541,
	"mean": 702.1110569558929,
	"var": 14.136444379466969,
	"cumProb": 0.00022705264963879807,
	"sum": 11233.776911294284,
	"stdev": 3.759846323916307,
	"N": 16
	},
	{
	"prob": 0.0011491235433157194,
	"min": 709.1574910598678,
	"max": 724.9027194369135,
	"mean": 717.8554290699951,
	"var": 20.6935845290122,
	"cumProb": 0.0009858515418689757,
	"sum": 41635.61488605971,
	"stdev": 4.549020172412098,
	"N": 58
	},
	...
	]}}
	----

	With Zeppelin-Solr the histogram can be first visualized as a table:

	image::images/math-expressions/histtable.png[]

	Then the histogram can be visualized with an area chart by plotting the mean of
	the bins on the x-axis and the prob (probability) on the y-axis:

	image::images/math-expressions/hist.png[]

	The cumulative probability can be plotted by switching the y-axis to the cumProb column:

	image::images/math-expressions/cumProb.png[]

	=== Frequency Tables

	The `freqTable` function returns a frequency distribution for a discrete data set.
	The `freqTable` function doesn't create bins like the histogram. Instead it counts
	the occurrence of each discrete data value and returns a list of tuples with the
	frequency statistics for each value.

	Below is an example of a frequency table built from a result set
	of rounded differences in daily opening stock prises for the stock ticker amzn.

	This example is interesting because it shows a multi-step process to arrive
	at the result. The first step is to search for records in the the stocks
	collection with a ticker of amzn. Notice that the result set is sorted by
	date ascending and it returns the open_d field which is the opening price for
	the day.

	The open_d field is then vectorized and set to variable b, which now contains
	a vector of opening prices ordered by date ascending.

	The `diff` function is then used to calculate the first difference for the
	vector of opening prices. The first difference simply subtracts the previous value
	from each value in the array. This will provide an array of price differences
	for each day which will show daily change in opening price.

	Then the `round` function is used to round the price differences to the nearest
	integer to create a vector of discrete values. The `round` function in this
	example is effectively binning continuous data at integer boundaries.

	Finally the `freqTable` function is run on the discrete values to calculate
	the frequency table.

	[source,text]
	----
	let(a=search(stocks,
	q="ticker_s:amzn",
	fl="open_d, date_dt",
	sort="date_dt asc",
	rows=25000),
	b=col(a, open_d),
	c=diff(b),
	d=round(c),
	e=freqTable(d))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,text]
	----
	{
	"result-set": {
	"docs": [
	{
	"pct": 0.00019409937888198756,
	"count": 1,
	"cumFreq": 1,
	"cumPct": 0.00019409937888198756,
	"value": -57
	},
	{
	"pct": 0.00019409937888198756,
	"count": 1,
	"cumFreq": 2,
	"cumPct": 0.00038819875776397513,
	"value": -51
	},
	{
	"pct": 0.00019409937888198756,
	"count": 1,
	"cumFreq": 3,
	"cumPct": 0.0005822981366459627,
	"value": -49
	},
	...
	]}}
	----

	With Zeppelin-Solr the frequency table can be first visualized as a table:

	image::images/math-expressions/freqTable.png[]

	The frequency table can then be plotted by switching to a scatter chart and selecting
	the value column for the x-axis and the count column for the y-axis

	image::images/math-expressions/freqTable1.png[]

	Notice that the visualization nicely displays the frequency of daily change in stock prices
	rounded to integers. The most frequently occurring value is 0 with 1494 occurrences followed by
	-1 and 1 with around 700 occurrences.


	== Percentiles

	The `percentile` function returns the estimated value for a specific percentile in
	a sample set. The example below returns a random sample containing the response_d field
	from the logs collection. The response_d field is vectorized and the 20th percentile
	is calculated for the vector:

	[source,text]
	----
	let(a=random(logs, q=":", rows="15000", fl="response_d"),
	b=col(a, response_d),
	c=percentile(b, 20))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"c": 818.073554
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 286
	}
	]
	}
	}
	----

	The `percentile` function can also compute an array of percentile values.
	The example below is computing the 20th, 40th, 60th and 80th percentiles for a random sample
	of the response_d field:

	[source,text]
	----
	let(a=random(logs, q=":", rows="15000", fl="response_d"),
	b=col(a, response_d),
	c=percentile(b, array(20,40,60,80)))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"c": [
	818.0835543394625,
	843.5590348165282,
	866.1789509894824,
	892.5033386599067
	]
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 291
	}
	]
	}
	}
	----

	=== Quantile Plots

	A quantile plot, or QQ plot, plots the percentiles from two distributions on
	the same scatter plot for comparison.

	The example below uses the sampling capability
	described in the <<probability-distributions.adoc#probability-distributions,Probability>> section of the user guide. The same
	technique can be used with random samples drawn with the `random` function from empirical data
	stored in SolrCloud collections. Sampling from known distributions with the probability
	distribution framework, however, is a useful way to learn how to read QQ plots.

	In the example, 50000 samples from two normal distributions are drawn. Both distributions
	have a mean of 500 but have different standard deviations. A sequence is then created
	with 98 integers starting from 1 with a stride of 1. This sequence is used
	to specify the percentiles to calculate and also serves as the x-axis in the plot.
	Then the `percentile` function is used to calculate the percentiles for
	both distributions.

	Finally `zplot` is used to plot the percentile sequence on the x-axis and the calculated
	percentile values for both distributions on the y-axis. A scatter plot is used
	to visualize the QQ plot.

	image::images/math-expressions/quantile-plot.png[]

	Notice there are two scatter plots that intersect at 500, which is the mean
	of both distributions. The red scatter plot, which has the
	higher standard deviation, has a steeper slope. The higher standard deviation creates a steeper slope
	because the percentile values are dispersed farther from the mean.
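
The numbers behind such a plot can be approximated outside Solr with the Python standard library. The sketch below is illustrative only: the standard deviations of 20 and 60 and the fixed random seed are assumptions, and `statistics.quantiles` may estimate percentiles differently from Solr's `percentile` function.

[source,python]
----
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Two normal samples sharing a mean of 500. The standard deviations
# 20 and 60 are assumed here purely for illustration.
narrow = [rng.gauss(500, 20) for _ in range(50000)]
wide = [rng.gauss(500, 60) for _ in range(50000)]

# quantiles(n=100) returns 99 cut points; keep the first 98 so the
# x-axis matches the 1..98 percentile sequence described above.
percentiles = list(range(1, 99))
narrow_q = statistics.quantiles(narrow, n=100)[:98]
wide_q = statistics.quantiles(wide, n=100)[:98]

# Each (percentile, value) pair is one point on the scatter plot.
for p in (10, 50, 90):
    print(p, round(narrow_q[p - 1], 1), round(wide_q[p - 1], 1))
----

Both series pass close to 500 at the 50th percentile, while the sample with the larger standard deviation covers a wider range between the 10th and 90th percentiles, which is what produces the steeper slope.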


	== Correlation and Covariance

	Correlation and covariance measure how random variables fluctuate
	together.

	=== Correlation and Correlation Matrices

	Correlation is a measure of the linear correlation between two vectors. Correlation is scaled between
	-1 and 1.

	Three correlation types are supported:

	* pearsons (default)
	* kendalls
	* spearmans

	The type of correlation is specified by adding the `type` named parameter to the
	function call.

	In the example below a random sample containing two fields, filesize_d and response_d, is drawn from
	the logs collection using the `random` function. The fields are vectorized into the
	variables x and y and then Spearman's correlation for
	the two vectors is calculated using the `corr` function.

	image::images/math-expressions/correlation.png[]
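
As background on how the correlation types differ, Spearman's correlation can be understood as Pearson's correlation applied to rank-transformed data. The Python sketch below illustrates that relationship with hypothetical input values; it is not Solr's implementation, and the average-rank handling of ties is an assumption.

[source,python]
----
import math


def ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical vectors with a monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
----

Because `y` always increases with `x`, the rank-based measure is 1 while Pearson's correlation is slightly lower, which is why the choice of type matters for non-linear data.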

	==== Correlation Matrices

	Correlation matrices are powerful tools for visualizing the correlation between two or more
	vectors.

	The `corr` function builds a correlation matrix
	if a matrix is passed as the parameter. The correlation matrix is computed by correlating the columns
	of the matrix.

	The example below demonstrates the power of correlation matrices combined with 2 dimensional faceting.

	In this example the `facet2D` function is used to generate a two dimensional facet aggregation
	over the fields complaint_type_s and zip_s from the nyc311 complaints database.
	The top 20 complaint types and the top 25 zip codes for each complaint type are aggregated.
	The result is a stream of tuples each containing the fields complaint_type_s, zip_s and
	the count for the pair.

	The `pivot` function is then used to pivot the fields into a matrix with the zip_s
	field as the rows and the complaint_type_s field as the columns. The `count(*)` field populates
	the values in the cells of the matrix.

	The `corr` function is then used to correlate the columns of the matrix. This produces a correlation matrix
	that shows how complaint types are correlated based on the zip codes they appear in. Put another way,
	it shows how the different complaint types tend to co-occur across zip codes.

	Finally the `zplot` function is used to plot the correlation matrix as a heat map.

	image::images/math-expressions/corrmatrix.png[]

	Notice in the example that the correlation matrix is square, with complaint types shown on both
	the x and y axes. The color of the cells in the heat map shows the
	intensity of the correlation between the complaint types.

	The heat map is interactive, so mousing over one of the cells pops up the values
	for the cell.

	image::images/math-expressions/corrmatrix2.png[]

	Notice that HEAT/HOT WATER and UNSANITARY CONDITION complaints have a correlation of 0.8 (rounded to the nearest
	tenth).
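
The pivot-and-correlate steps described above can be sketched in plain Python. The tuples below are hypothetical stand-ins for a two dimensional facet result, and the code only illustrates the idea of correlating matrix columns; it is not how Solr computes the result.

[source,python]
----
import math

# Hypothetical (complaint type, zip code, count) tuples.
tuples = [
    ("HEAT", "10001", 30), ("NOISE", "10001", 5), ("RODENT", "10001", 28),
    ("HEAT", "10002", 10), ("NOISE", "10002", 20), ("RODENT", "10002", 12),
    ("HEAT", "10003", 50), ("NOISE", "10003", 2), ("RODENT", "10003", 45),
]

rows = sorted({zip_code for _, zip_code, _ in tuples})
cols = sorted({complaint for complaint, _, _ in tuples})
counts = {(zip_code, complaint): n for complaint, zip_code, n in tuples}

# Pivot: one row per zip code, one column per complaint type.
# Pairs that never occur are filled with 0.
matrix = [[counts.get((z, c), 0) for c in cols] for z in rows]


def pearson(x, y):
    """Pearson's correlation of two equal-length vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Correlate every column with every other column.
columns = list(zip(*matrix))
corr = [[pearson(a, b) for b in columns] for a in columns]

for name, row in zip(cols, corr):
    print(name, [round(v, 2) for v in row])
----

The result has one row and one column per complaint type, so it is square and symmetric, with values of 1 on the diagonal.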

	=== Covariance and Covariance Matrices

	Covariance is an unscaled measure of correlation.

	The `cov` function calculates the covariance of two vectors of data.

	In the example below a random sample containing two fields, filesize_d and response_d, is drawn from
	the logs collection using the `random` function. The fields are vectorized into the
	variables x and y and then the covariance for
	the two vectors is calculated using the `cov` function.

	image::images/math-expressions/covariance.png[]

	If a matrix is passed to the `cov` function it will automatically compute a covariance
	matrix for the columns of the matrix.

	Notice in the example below that the x and y vectors are added to a matrix.
	The matrix is then transposed to turn the rows into columns,
	and the covariance matrix is computed for the columns of the matrix.

	[source,text]
	----
	let(a=random(logs, q="*:*", fl="filesize_d, response_d", rows=50000),
	    x=col(a, filesize_d),
	    y=col(a, response_d),
	    m=transpose(matrix(x, y)),
	    covariance=cov(m))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"covariance": [
	[
	4018404.072532102,
	80243.3948172242
	],
	[
	80243.3948172242,
	1948.3216661122592
	]
	]
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 534
	}
	]
	}
	}
	----

	The covariance matrix contains both the variance for the two vectors and the covariance between the vectors
	in the following format:


	[source,text]
	----
	          x                   y
	x [4018404.072532102, 80243.3948172242],
	y [80243.3948172242,  1948.3216661122592]
	----

	The covariance matrix is always square, so a covariance matrix created from 3 vectors is a 3 x 3 matrix.
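
The layout of a covariance matrix, with variances on the diagonal and covariances elsewhere, can be checked with a short Python sketch. The vectors are hypothetical, and the bias-corrected estimator (dividing by n - 1) is an assumption made for illustration.

[source,python]
----
import math
import statistics


def covariance(x, y):
    """Sample covariance of two equal-length vectors (divides by n - 1)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


# Hypothetical vectors, used only for illustration.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]
vectors = [x, y]

# Entry [i][j] is the covariance of vector i with vector j.
cov_matrix = [[covariance(a, b) for b in vectors] for a in vectors]

for row in cov_matrix:
    print(row)

# The diagonal holds each vector's variance.
print(math.isclose(cov_matrix[0][0], statistics.variance(x)))
----

Because the covariance of `x` with `y` equals the covariance of `y` with `x`, the two off-diagonal entries are identical, just as in the Solr output above.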



	== Statistical Inference Tests

	Statistical inference tests test a hypothesis on random samples and return p-values which
	can be used to infer the reliability of the test for the entire population.

	The following statistical inference tests are available:

	* `anova`: One-Way-Anova tests if there is a statistically significant difference in the
	means of two or more random samples.

	* `ttest`: The T-test tests if there is a statistically significant difference in the means of two
	random samples.

	* `pairedTtest`: The paired t-test tests if there is a statistically significant difference
	in the means of two random samples with paired data.

	* `gTestDataSet`: The G-test tests if two samples of binned discrete data were drawn
	from the same population.

	* `chiSquareDataset`: The Chi-Squared test tests if two samples of binned discrete data were
	drawn from the same population.

	* `mannWhitney`: The Mann-Whitney test is a non-parametric test that tests if two
	samples of continuous data were pulled
	from the same population. The Mann-Whitney test is often used instead of the T-test when the
	underlying assumptions of the T-test are not met.

	* `ks`: The Kolmogorov-Smirnov test tests if two samples of continuous data were drawn from
	the same distribution.

	Below is a simple example of a T-test performed on two random samples.
	The returned p-value of .93 means we cannot reject the null hypothesis
	that there is no statistically significant difference in the means of the two samples.

	[source,text]
	----
	let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
	    b=random(collection1, q="*:*", rows="1500", fl="price_f"),
	    c=col(a, price_f),
	    d=col(b, price_f),
	    e=ttest(c, d))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"e": {
	"p-value": 0.9350135639249795,
	"t-statistic": 0.081545541074817
	}
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 48
	}
	]
	}
	}
	----
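
To show where a t-statistic comes from, the Python sketch below computes it for two small hypothetical samples. It assumes the unequal-variance (Welch) form of the two-sample test, which may or may not match the variant Solr applies, and it omits the p-value because that requires the Student's t distribution, which is not in the Python standard library.

[source,python]
----
import math
import statistics


def welch_t(x, y):
    """Two-sample t-statistic without assuming equal variances."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    # Standard error of the difference between the two sample means.
    std_err = math.sqrt(var_x / len(x) + var_y / len(y))
    return (mean_x - mean_y) / std_err


# Hypothetical samples, used only for illustration.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.0, 3.0, 4.0, 5.0, 6.0]
print(welch_t(first, second))
----

A t-statistic close to 0, like the 0.08 in the Solr response above, indicates that the difference between the sample means is small relative to its standard error.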

	== Transformations

	In statistical analysis it's often useful to transform data sets before performing
	statistical calculations. The statistical function library includes the following
	commonly used transformations:

	* `rank`: Returns a numeric array with the rank-transformed value of each element of the original
	array.

	* `log`: Returns a numeric array with the natural log of each element of the original array.

	* `log10`: Returns a numeric array with the base 10 log of each element of the original array.

	* `sqrt`: Returns a numeric array with the square root of each element of the original array.

	* `cbrt`: Returns a numeric array with the cube root of each element of the original array.

	* `recip`: Returns a numeric array with the reciprocal of each element of the original array.

	Below is an example of a `ttest` performed on log-transformed data sets:

	[source,text]
	----
	let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
	    b=random(collection1, q="*:*", rows="1500", fl="price_f"),
	    c=log(col(a, price_f)),
	    d=log(col(b, price_f)),
	    e=ttest(c, d))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"e": {
	"p-value": 0.9655110070265056,
	"t-statistic": -0.04324265449471238
	}
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 58
	}
	]
	}
	}
	----

	== Back Transformations

	Vectors that have been transformed with the `log`, `log10`, `sqrt` and `cbrt` functions
	can be back transformed using the `pow` function.

	The example below shows how to back transform data that has been transformed by the
	`sqrt` function.


	[source,text]
	----
	let(echo="b,c",
	    a=array(100, 200, 300),
	    b=sqrt(a),
	    c=pow(b, 2))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"b": [
	10,
	14.142135623730951,
	17.320508075688775
	],
	"c": [
	100,
	200.00000000000003,
	300.00000000000006
	]
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 0
	}
	]
	}
	}
	----

	The example below shows how to back transform data that has been transformed by the
	`log10` function.


	[source,text]
	----
	let(echo="b,c",
	    a=array(100, 200, 300),
	    b=log10(a),
	    c=pow(10, b))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"b": [
	2,
	2.3010299956639813,
	2.4771212547196626
	],
	"c": [
	100,
	200.00000000000003,
	300.0000000000001
	]
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 0
	}
	]
	}
	}
	----

	Vectors that have been transformed with the `recip` function can be back-transformed by taking the reciprocal
	of the reciprocal.

	The example below shows the back-transformation of the `recip` function.

	[source,text]
	----
	let(echo="b,c",
	    a=array(100, 200, 300),
	    b=recip(a),
	    c=recip(b))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"b": [
	0.01,
	0.005,
	0.0033333333333333335
	],
	"c": [
	100,
	200,
	300
	]
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 0
	}
	]
	}
	}
	----
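
The forward and back transformations discussed in this section can be verified as round trips with a few lines of Python. This is an illustrative sketch using the Python `math` module, not Solr functions. The pairing of the natural log with the exponential function is the standard mathematical inverse and is included here as an addition to the examples above.

[source,python]
----
import math

a = [100.0, 200.0, 300.0]

# Each entry pairs a forward transform with its matching back transform.
round_trips = {
    "sqrt": (math.sqrt, lambda v: v ** 2),
    "cbrt": (lambda v: v ** (1.0 / 3.0), lambda v: v ** 3),
    "log10": (math.log10, lambda v: 10 ** v),
    "log": (math.log, math.exp),
    "recip": (lambda v: 1.0 / v, lambda v: 1.0 / v),
}

for name, (forward, back) in round_trips.items():
    restored = [back(forward(v)) for v in a]
    # Floating-point rounding can leave tiny errors, so compare with a tolerance.
    ok = all(math.isclose(r, v) for r, v in zip(restored, a))
    print(name, restored, ok)
----

The small deviations from the original values, such as 200.00000000000003 in the Solr responses above, are ordinary floating-point rounding and not a sign of a faulty back transformation.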

	== Z-scores

	The `zscores` function converts a numeric array to an array of z-scores. The z-score
	is the number of standard deviations a number is from the mean.

	The example below computes the z-scores for the values in an array.


	[source,text]
	----
	let(a=array(1,2,3),
	    b=zscores(a))
	----

	When this expression is sent to the `/stream` handler it responds with:

	[source,json]
	----
	{
	"result-set": {
	"docs": [
	{
	"b": [
	-1,
	0,
	1
	]
	},
	{
	"EOF": true,
	"RESPONSE_TIME": 27
	}
	]
	}
	}
	----
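
The result above can be reproduced by hand: the mean of 1, 2, 3 is 2 and the sample standard deviation (dividing by n - 1) is 1. The Python sketch below performs the same calculation. It assumes the sample standard deviation is the one used, which is consistent with the output shown but is not taken from Solr's source.

[source,python]
----
import statistics


def zscores(values):
    """Convert values to z-scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


print(zscores([1, 2, 3]))  # [-1.0, 0.0, 1.0]
----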