 = Statistics
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 This section of the user guide covers the core statistical functions
 available in math expressions.

 == Descriptive Statistics

 The `describe` function returns descriptive statistics for a
 numeric array as a single *tuple* of name/value pairs.

 Below is a simple example that selects a random sample of documents from the *logs* collection,
 vectorizes the `response_d` field in the result set and uses the `describe` function to
 return descriptive statistics about the vector.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="response_d", rows="50000"),
     b=col(a, response_d),
     c=describe(b))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "sumsq": 36674200601.78738,
         "max": 1068.854686837548,
         "var": 1957.9752647562789,
         "geometricMean": 854.1445499569674,
         "sum": 42764648.83319176,
         "kurtosis": 0.013189848821424377,
         "N": 50000,
         "min": 656.023249311864,
         "mean": 855.2929766638425,
         "popVar": 1957.936105250984,
         "skewness": 0.0014560741802307174,
         "stdev": 44.24901428005237
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 430
       }
     ]
   }
 }
 ----

 Notice that the random sample contains 50,000 records and the response
 time is only 430 milliseconds. Samples of this size can be used to
 reliably estimate the statistics for very large underlying
 data sets with sub-second performance.


 The `describe` function can also be visualized in a table with Zeppelin-Solr:

 image::images/math-expressions/describe.png[]


 == Histograms and Frequency Tables

 Histograms and frequency tables are tools for visualizing the distribution
 of a random variable.

 The `hist` function creates a histogram designed for usage with continuous data. The
 `freqTable` function creates a frequency table for use with discrete data.

 === Histograms

 In the example below a histogram is used to visualize a random sample of
 response times from the logs collection. The example retrieves the
 random sample with the `random` function and creates a vector from the `response_d` field
 in the result set. Then the `hist` function is applied to the vector
 to return a histogram with 22 bins. The `hist` function returns a
 list of tuples with summary statistics for each bin.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="response_d", rows="50000"),
     b=col(a, response_d),
     c=hist(b,  22))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,text]
 ----
 {
   "result-set": {
     "docs": [
       {
         "prob": 0.00004896007228311655,
         "min": 675.573084576817,
         "max": 688.3309631697003,
         "mean": 683.805542728906,
         "var": 50.9974629924082,
         "cumProb": 0.000030022417162809913,
         "sum": 2051.416628186718,
         "stdev": 7.141250800273591,
         "N": 3
       },
       {
         "prob": 0.00029607514624062624,
         "min": 696.2875238591652,
         "max": 707.9706315779541,
         "mean": 702.1110569558929,
         "var": 14.136444379466969,
         "cumProb": 0.00022705264963879807,
         "sum": 11233.776911294284,
         "stdev": 3.759846323916307,
         "N": 16
       },
       {
         "prob": 0.0011491235433157194,
         "min": 709.1574910598678,
         "max": 724.9027194369135,
         "mean": 717.8554290699951,
         "var": 20.6935845290122,
         "cumProb": 0.0009858515418689757,
         "sum": 41635.61488605971,
         "stdev": 4.549020172412098,
         "N": 58
       },
       ...
       ]}}
 ----

 With Zeppelin-Solr the histogram can be first visualized as a table:

 image::images/math-expressions/histtable.png[]

 Then the histogram can be visualized with an area chart by plotting the *mean* of
 the bins on the *x-axis* and the *prob* (probability) on the *y-axis*:

 image::images/math-expressions/hist.png[]

 The cumulative probability can be plotted by switching the *y-axis* to the *cumProb* column:

 image::images/math-expressions/cumProb.png[]

 === Custom Histograms

 Custom histograms can be defined and visualized by combining the output from multiple
 `stats` functions into a single histogram. Instead of automatically binning a numeric
 field the custom histogram allows for comparison of bins based on queries.

 NOTE: The `stats` function is first discussed in the *Searching, Sampling and Aggregation* section of the
 user guide.

 A simple example will illustrate how to define and visualize a custom histogram.

 In this example, three `stats` functions are wrapped in a `plist` function. The
 `plist` (parallel list) function executes each of its internal functions in parallel
 and concatenates the results into a single stream, preserving the order
 of the outputs from the sub-functions. Each `stats` function
 computes the count of documents that match a specific query, in this case the
 number of documents that contain the terms copper, gold and silver. The list of tuples
 with the counts is then stored in variable *a*.

 Then an `array` of labels is created and set to variable *l*.

 Finally the `zplot` function is used to plot the labels vector and the `count(*)` column.
 Notice the `col` function is used inside of the `zplot` function to extract the
 counts from the `stats` results.
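
 A minimal sketch of the expression described above might look like the following. The collection
 name `testapp` and the field name `test_t` are assumptions made for illustration, as is the
 quoting of the labels; substitute the names used in your own index.

 [source,text]
 ----
 let(a=plist(stats(testapp, q="test_t:copper", count(*)),
             stats(testapp, q="test_t:gold", count(*)),
             stats(testapp, q="test_t:silver", count(*))),
     l=array("copper", "gold", "silver"),
     zplot(x=l, y=col(a, count(*))))
 ----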

 image::images/math-expressions/custom-hist.png[]


 === Frequency Tables

 The `freqTable` function returns a frequency distribution for a discrete data set.
 The `freqTable` function doesn't create bins like the histogram. Instead it counts
 the occurrence of each discrete data value and returns a list of tuples with the
 frequency statistics for each value.

 Below is an example of a frequency table built from a result set
 of rounded *differences* in daily opening stock prices for the stock ticker *amzn*.

 This example is interesting because it shows a multi-step process to arrive
 at the result. The first step is to *search* for records in the *stocks*
 collection with a ticker of *amzn*. Notice that the result set is sorted by
 date ascending and it returns the `open_d` field which is the opening price for
 the day.

 The `open_d` field is then vectorized and set to variable *b*, which now contains
 a vector of opening prices ordered by date ascending.

 The `diff` function is then used to calculate the *first difference* for the
 vector of opening prices. The first difference simply subtracts the previous value
 from each value in the array. The result is an array of day-over-day changes
 in the opening price.

 Then the `round` function is used to round the price differences to the nearest
 integer to create a vector of discrete values. The `round` function in this
 example is effectively *binning* continuous data at integer boundaries.

 Finally the `freqTable` function is run on the discrete values to calculate
 the frequency table.

 [source,text]
 ----
 let(a=search(stocks,
              q="ticker_s:amzn",
              fl="open_d, date_dt",
              sort="date_dt asc",
              rows=25000),
     b=col(a, open_d),
     c=diff(b),
     d=round(c),
     e=freqTable(d))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,text]
 ----
  {
    "result-set": {
      "docs": [
        {
          "pct": 0.00019409937888198756,
          "count": 1,
          "cumFreq": 1,
          "cumPct": 0.00019409937888198756,
          "value": -57
        },
        {
          "pct": 0.00019409937888198756,
          "count": 1,
          "cumFreq": 2,
          "cumPct": 0.00038819875776397513,
          "value": -51
        },
        {
          "pct": 0.00019409937888198756,
          "count": 1,
          "cumFreq": 3,
          "cumPct": 0.0005822981366459627,
          "value": -49
        },
        ...
        ]}}
 ----

 With Zeppelin-Solr the frequency table can be first visualized as a table:

 image::images/math-expressions/freqTable.png[]

 The frequency table can then be plotted by switching to a scatter chart and selecting
 the *value* column for the *x-axis* and the *count* column for the *y-axis*.

 image::images/math-expressions/freqTable1.png[]

 Notice that the visualization displays the frequency of daily changes in stock price
 rounded to integers. The most frequently occurring value is 0 with 1494 occurrences, followed by
 -1 and 1 with around 700 occurrences each.


 == Percentiles

 The `percentile` function returns the estimated value for a specific percentile in
 a sample set. The example below returns a random sample containing the `response_d` field
 from the logs collection. The `response_d` field is vectorized and the 20th percentile
 is calculated for the vector:

 [source,text]
 ----
 let(a=random(logs, q="*:*", rows="15000", fl="response_d"),
     b=col(a, response_d),
     c=percentile(b, 20))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
  {
    "result-set": {
      "docs": [
        {
          "c": 818.073554
        },
        {
          "EOF": true,
          "RESPONSE_TIME": 286
        }
      ]
    }
  }
 ----

 The `percentile` function can also compute an array of percentile values.
 The example below computes the 20th, 40th, 60th and 80th percentiles for a random sample
 of the `response_d` field:

 [source,text]
 ----
 let(a=random(logs, q="*:*", rows="15000", fl="response_d"),
     b=col(a, response_d),
     c=percentile(b, array(20,40,60,80)))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "c": [
           818.0835543394625,
           843.5590348165282,
           866.1789509894824,
           892.5033386599067
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 291
       }
     ]
   }
 }
 ----

 === Quantile Plots

 Quantile plots, or QQ plots, are powerful tools for visually comparing two or more distributions.

 A quantile plot shows the percentiles from two or more distributions in the same visualization. This allows
 for visual comparison of the distributions at each percentile. A simple example will help illustrate the power
 of quantile plots.

 In this example the distributions of daily stock price changes for two stock tickers, *goog* and
 *amzn*, are visualized with a quantile plot.

 The example first creates an array of values representing the percentiles that will be calculated and sets this array
 to variable *p*. Then random samples of the `change_d` field are drawn for the tickers *amzn* and *goog*. The `change_d` field
 represents the change in stock price for one day. Then the `change_d` field is vectorized for both samples and placed
 in the variables *amzn* and *goog*. The `percentile` function is then used to calculate the percentiles for both vectors. Notice that
 the variable *p* is used to specify the list of percentiles that are calculated.

 Finally `zplot` is used to plot the percentiles sequence on the *x-axis* and the calculated
 percentile values for both distributions on the *y-axis*. A line plot is then used
 to visualize the QQ plot.
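
 The steps above can be sketched as the following expression. The specific percentiles in *p*,
 the sample size, and the `ticker_s` query field (borrowed from the frequency table example) are
 assumptions made for illustration, and the series names passed to `zplot` are likewise illustrative.

 [source,text]
 ----
 let(p=array(10, 20, 30, 40, 50, 60, 70, 80, 90),
     a=random(stocks, q="ticker_s:amzn", fl="change_d", rows="5000"),
     b=random(stocks, q="ticker_s:goog", fl="change_d", rows="5000"),
     amzn=col(a, change_d),
     goog=col(b, change_d),
     zplot(x=p, amzn=percentile(amzn, p), goog=percentile(goog, p)))
 ----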

 image::images/math-expressions/quantile-plot.png[]

 This quantile plot provides a clear picture of the distributions of daily price changes for *amzn*
 and *goog*. In the plot the *x-axis* is the percentiles and the *y-axis* is the percentile value calculated.

 Notice that the *goog* percentile values start lower and end higher than the *amzn* values, and that
 the *goog* line has a steeper slope. This shows the greater variability in the *goog* price change distribution.
 The plot gives a clear picture of the difference in the distributions across the full range of percentiles.


 == Correlation and Covariance

 Correlation and covariance measure how random variables fluctuate
 together.

 === Correlation and Correlation Matrices

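 Correlation measures the strength of the relationship between two vectors. Pearson's correlation
 measures the linear relationship, while Kendall's and Spearman's correlations are rank-based.
 Correlation is scaled between -1 and 1.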

 Three correlation types are supported:

 * *pearsons* (default)
 * *kendalls*
 * *spearmans*

 The type of correlation is specified by adding the *type* named parameter in the
 function call.

 In the example below a random sample containing two fields, `filesize_d` and `response_d`, is drawn from
 the logs collection using the `random` function. The fields are vectorized into the
 variables *x* and *y* and then *Spearman's* correlation for
 the two vectors is calculated using the `corr` function.
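
 A minimal sketch of the expression described above is shown below. The sample size of 50000 is an
 assumption made for illustration.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="filesize_d, response_d", rows=50000),
     x=col(a, filesize_d),
     y=col(a, response_d),
     c=corr(x, y, type=spearmans))
 ----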

 image::images/math-expressions/correlation.png[]

 ==== Correlation Matrices

 Correlation matrices are powerful tools for visualizing the correlation between two or more
 vectors.

 The `corr` function builds a correlation matrix
 if a matrix is passed as the parameter. The correlation matrix is computed by correlating the *columns*
 of the matrix.

 The example below demonstrates the power of correlation matrices combined with two-dimensional faceting.

 In this example the `facet2D` function is used to generate a two dimensional facet aggregation
 over the fields `complaint_type_s` and `zip_s` from the *nyc311* complaints database.
 The *top 20* complaint types and the *top 25* zip codes for each complaint type are aggregated.
 The result is a stream of tuples each containing the fields `complaint_type_s`, `zip_s` and
 the count for the pair.

 The `pivot` function is then used to pivot the fields into a *matrix* with the `zip_s`
 field as the *rows* and the `complaint_type_s` field as the *columns*. The `count(*)` field populates
 the values in the cells of the matrix.

 The `corr` function is then used to correlate the *columns* of the matrix. This produces a correlation matrix
 that shows how complaint types are correlated based on the zip codes they appear in. Put another way,
 it shows how the different complaint types tend to co-occur across zip codes.

 Finally the `zplot` function is used to plot the correlation matrix as a heat map.
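
 The steps above can be sketched as the following expression. The `q` parameter and the
 order of the arguments to `pivot` (row field, column field, value field) are assumptions
 made for illustration and should be checked against the function reference.

 [source,text]
 ----
 let(a=facet2D(nyc311, q="*:*", x="complaint_type_s", y="zip_s", dimensions="20,25", count(*)),
     b=pivot(a, zip_s, complaint_type_s, count(*)),
     c=corr(b),
     zplot(heat=c))
 ----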

 image::images/math-expressions/corrmatrix.png[]

 Notice in the example that the correlation matrix is square, with complaint types shown on both
 the *x-axis* and the *y-axis*. The color of the cells in the heat map shows the
 intensity of the correlation between the complaint types.

 The heat map is interactive, so mousing over one of the cells pops up the values
 for the cell.

 image::images/math-expressions/corrmatrix2.png[]

 Notice that HEAT/HOT WATER and UNSANITARY CONDITION complaints have a correlation of 0.8 (rounded to the nearest
 tenth).

 === Covariance and Covariance Matrices

 Covariance is an unscaled measure of correlation.

 The `cov` function calculates the covariance of two vectors of data.

 In the example below a random sample containing two fields, `filesize_d` and `response_d`, is drawn from
 the logs collection using the `random` function. The fields are vectorized into the
 variables *x* and *y* and then the covariance for
 the two vectors is calculated using the `cov` function.
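
 A minimal sketch of the expression described above is shown below. The sample size of 50000 is an
 assumption made for illustration.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="filesize_d, response_d", rows=50000),
     x=col(a, filesize_d),
     y=col(a, response_d),
     c=cov(x, y))
 ----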

 image::images/math-expressions/covariance.png[]

 If a matrix is passed to the `cov` function it will automatically compute a covariance
 matrix for the *columns* of the matrix.

 Notice in the example below that the *x* and *y* vectors are added to a matrix.
 The matrix is then transposed to turn the rows into columns,
 and the covariance matrix is computed for the columns of the matrix.

 [source,text]
 ----
 let(a=random(logs, q="*:*", fl="filesize_d, response_d", rows=50000),
     x=col(a, filesize_d),
     y=col(a, response_d),
     m=transpose(matrix(x, y)),
     covariance=cov(m))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
  {
    "result-set": {
      "docs": [
        {
          "covariance": [
            [
              4018404.072532102,
              80243.3948172242
            ],
            [
              80243.3948172242,
              1948.3216661122592
            ]
          ]
        },
        {
          "EOF": true,
          "RESPONSE_TIME": 534
        }
      ]
    }
  }
 ----

 The covariance matrix contains both the variance for the two vectors and the covariance between the vectors
 in the following format:


 [source,text]
 ----
          x                 y
  x [4018404.072532102, 80243.3948172242],
  y [80243.3948172242,  1948.3216661122592]
 ----

 The covariance matrix is always square, so a covariance matrix created from 3 vectors is a 3 x 3 matrix.


 == Statistical Inference Tests

 Statistical inference tests evaluate a hypothesis on *random samples* and return p-values, which
 can be used to infer whether the result is likely to hold for the entire population.

 The following statistical inference tests are available:

 * `anova`: One-way ANOVA tests if there is a statistically significant difference in the
 means of two or more random samples.

 * `ttest`: The T-test tests if there is a statistically significant difference in the means of two
 random samples.

 * `pairedTtest`: The paired t-test tests if there is a statistically significant difference
 in the means of two random samples with paired data.

 * `gtestDataSet`: The G-test tests if two samples of binned discrete data were drawn
 from the same population.

 * `chiSquareDataSet`: The Chi-Squared test tests if two samples of binned discrete data were
 drawn from the same population.

 * `mannWhitney`: The Mann-Whitney test is a non-parametric test that tests if two
 samples of continuous data were pulled from the same population. The Mann-Whitney test is
 often used instead of the T-test when the underlying assumptions of the T-test are not met.

 * `ks`: The Kolmogorov-Smirnov test tests if two samples of continuous data were drawn from
 the same distribution.

 Below is a simple example of a T-test performed on two random samples.
 The returned p-value of .93 means we fail to reject the null hypothesis
 that there is no statistically significant difference in the means of the two samples.

 [source,text]
 ----
 let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
     b=random(collection1, q="*:*", rows="1500", fl="price_f"),
     c=col(a, price_f),
     d=col(b, price_f),
     e=ttest(c, d))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "e": {
           "p-value": 0.9350135639249795,
           "t-statistic": 0.081545541074817
         }
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 48
       }
     ]
   }
 }
 ----
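
 The other two-sample tests in the list above can be applied to the same vectors. As a sketch,
 assuming the same hypothetical `collection1` collection and `price_f` field, and assuming
 `mannWhitney` accepts the two vectors in the same way `ttest` does, a Mann-Whitney test
 would look like this:

 [source,text]
 ----
 let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
     b=random(collection1, q="*:*", rows="1500", fl="price_f"),
     c=col(a, price_f),
     d=col(b, price_f),
     e=mannWhitney(c, d))
 ----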

 == Transformations

 In statistical analysis it's often useful to transform data sets before performing
 statistical calculations. The statistical function library includes the following
 commonly used transformations:

 * `rank`: Returns a numeric array with the rank-transformed value of each element of the original
 array.

 * `log`: Returns a numeric array with the natural log of each element of the original array.

 * `log10`: Returns a numeric array with the base 10 log of each element of the original array.

 * `sqrt`: Returns a numeric array with the square root of each element of the original array.

 * `cbrt`: Returns a numeric array with the cube root of each element of the original array.

 * `recip`: Returns a numeric array with the reciprocal of each element of the original array.

 Below is an example of a T-test performed on log-transformed data sets:

 [source,text]
 ----
 let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
     b=random(collection1, q="*:*", rows="1500", fl="price_f"),
     c=log(col(a, price_f)),
     d=log(col(b, price_f)),
     e=ttest(c, d))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "e": {
           "p-value": 0.9655110070265056,
           "t-statistic": -0.04324265449471238
         }
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 58
       }
     ]
   }
 }
 ----
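
 The other transformations in the list above are applied to a vector in the same way. A minimal
 sketch using `rank` on a small literal array (the values are chosen only for illustration):

 [source,text]
 ----
 let(a=array(30, 10, 20),
     b=rank(a))
 ----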

 == Back Transformations

 Vectors that have been transformed with the `log`, `log10`, `sqrt` and `cbrt` functions
 can be back transformed using the `pow` function.

 The example below shows how to back transform data that has been transformed by the
 `sqrt` function.


 [source,text]
 ----
 let(echo="b,c",
     a=array(100, 200, 300),
     b=sqrt(a),
     c=pow(b, 2))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           10,
           14.142135623730951,
           17.320508075688775
         ],
         "c": [
           100,
           200.00000000000003,
           300.00000000000006
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----

 The example below shows how to back transform data that has been transformed by the
 `log10` function.


 [source,text]
 ----
 let(echo="b,c",
     a=array(100, 200, 300),
     b=log10(a),
     c=pow(10, b))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           2,
           2.3010299956639813,
           2.4771212547196626
         ],
         "c": [
           100,
           200.00000000000003,
           300.0000000000001
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----
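
 Following the same pattern, a vector transformed with `cbrt` can be back transformed by
 raising it to the power of 3. The sketch below uses illustrative values:

 [source,text]
 ----
 let(echo="b,c",
     a=array(8, 27, 64),
     b=cbrt(a),
     c=pow(b, 3))
 ----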

 Vectors that have been transformed with the `recip` function can be back-transformed by taking the reciprocal
 of the reciprocal.

 The example below shows the back-transformation of the `recip` function.

 [source,text]
 ----
 let(echo="b,c",
     a=array(100, 200, 300),
     b=recip(a),
     c=recip(b))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           0.01,
           0.005,
           0.0033333333333333335
         ],
         "c": [
           100,
           200,
           300
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }
 ----

 == Z-scores

 The `zscores` function converts a numeric array to an array of z-scores. The z-score
 is the number of standard deviations a number is from the mean.

 The example below computes the z-scores for the values in an array.


 [source,text]
 ----
 let(a=array(1,2,3),
     b=zscores(a))
 ----

 When this expression is sent to the `/stream` handler it responds with:

 [source,json]
 ----
 {
   "result-set": {
     "docs": [
       {
         "b": [
           -1,
           0,
           1
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 27
       }
     ]
   }
 }
 ----
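
 As a worked check of the output above, and assuming the standard deviation used is the
 sample standard deviation (an inference from the output, not something stated in this section),
 the z-score of each element is:

 [latexmath]
 ++++
 z_i = \frac{x_i - \bar{x}}{s}, \qquad
 \bar{x} = \frac{1 + 2 + 3}{3} = 2, \qquad
 s = \sqrt{\frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3 - 1}} = 1
 ++++

 This gives z-scores of -1, 0 and 1 for the values 1, 2 and 3.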