| = The Stats Component |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| The Stats component returns simple statistics for numeric, string, and date fields within the document set. |
| |
| The sample queries in this section assume you are running the "```techproducts```" example included with Solr: |
| |
| [source,bash] |
| ---- |
| bin/solr -e techproducts |
| ---- |
| |
| == Stats Component Parameters |
| |
| The Stats Component accepts the following parameters: |
| |
| `stats`:: |
| If `true`, then invokes the Stats component. |
| |
| `stats.field`:: |
| Specifies a field for which statistics should be generated. This parameter may be invoked multiple times in a query in order to request statistics on multiple fields. |
| + |
| <<local-parameters-in-queries.adoc#,Local Parameters>> may be used to indicate which subset of the supported statistics should be computed, and/or that statistics should be computed over the results of an arbitrary numeric function (or query) instead of a simple field name. See the examples below. |
| |
| |
| === Stats Component Example |
| |
| The query below demonstrates computing stats against two different fields numeric fields, as well as stats over the results of a `termfreq()` function call using the `text` field: |
| |
| [source,text] |
| ---- |
| http://localhost:8983/solr/techproducts/select?q=*:*&wt=xml&stats=true&stats.field={!func}termfreq('text','memory')&stats.field=price&stats.field=popularity&rows=0&indent=true |
| ---- |
| |
| [source,xml] |
| ---- |
| <lst name="stats"> |
| <lst name="stats_fields"> |
| <lst name="termfreq(text,memory)"> |
| <double name="min">0.0</double> |
| <double name="max">3.0</double> |
| <long name="count">32</long> |
| <long name="missing">0</long> |
| <double name="sum">10.0</double> |
| <double name="sumOfSquares">22.0</double> |
| <double name="mean">0.3125</double> |
| <double name="stddev">0.7803018439949604</double> |
| <lst name="facets"/> |
| </lst> |
| <lst name="price"> |
| <double name="min">0.0</double> |
| <double name="max">2199.0</double> |
| <long name="count">16</long> |
| <long name="missing">16</long> |
| <double name="sum">5251.270030975342</double> |
| <double name="sumOfSquares">6038619.175900028</double> |
| <double name="mean">328.20437693595886</double> |
| <double name="stddev">536.3536996709846</double> |
| <lst name="facets"/> |
| </lst> |
| <lst name="popularity"> |
| <double name="min">0.0</double> |
| <double name="max">10.0</double> |
| <long name="count">15</long> |
| <long name="missing">17</long> |
| <double name="sum">85.0</double> |
| <double name="sumOfSquares">603.0</double> |
| <double name="mean">5.666666666666667</double> |
| <double name="stddev">2.943920288775949</double> |
| <lst name="facets"/> |
| </lst> |
| </lst> |
| </lst> |
| ---- |
| |
| == Statistics Supported |
| |
| The table below explains the statistics supported by the Stats component. Not all statistics are supported for all field types, and not all statistics are computed by default (see <<Local Parameters with the Stats Component>> below for details) |
| |
| `min`:: |
| The minimum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default. |
| |
| `max`:: |
| The maximum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default. |
| |
| `sum`:: |
| The sum of all values of the field/function in all documents in the set. This statistic is computed for numeric and date field types and is computed by default. |
| |
| `count`:: |
| The number of values found in all documents in the set for this field/function. This statistic is computed for all field types and is computed by default. |
| |
| `missing`:: |
| The number of documents in the set which do not have a value for this field/function. This statistic is computed for all field types and is computed by default. |
| |
| `sumOfSquares`:: |
| Sum of all values squared (a by product of computing stddev). This statistic is computed for numeric and date field types and is computed by default. |
| |
| `mean`:: |
| The average `(v1 + v2 .... + vN)/N`. This statistic is computed for numeric and date field types and is computed by default. |
| |
| `stddev`:: |
| Standard deviation, measuring how widely spread the values in the data set are. This statistic is computed for numeric and date field types and is computed by default. |
| |
| `percentiles`:: |
| A list of percentile values based on cut-off points specified by the parameter value, such as `1,99,99.9`. These values are an approximation, using the https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[t-digest algorithm]. This statistic is computed for numeric field types and is not computed by default. |
| |
| `distinctValues`:: |
| The set of all distinct values for the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a tiny cardinality. This statistic is computed for all field types but is not computed by default. |
| |
| `countDistinct`:: |
| The exact number of distinct values in the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a tiny cardinality. This statistic is computed for all field types but is not computed by default. |
| |
| `cardinality`:: |
| A statistical approximation (currently using the https://en.wikipedia.org/wiki/HyperLogLog[HyperLogLog] algorithm) of the number of distinct values in the field/function in all of the documents in the set. This calculation is much more efficient then using the `countDistinct` option, but may not be 100% accurate. |
| + |
| Input for this option can be floating point number between `0.0` and `1.0` indicating how aggressively the algorithm should try to be accurate: `0.0` means use as little memory as possible; `1.0` means use as much memory as needed to be as accurate as possible. `true` is supported as an alias for `0.3`. |
| + |
| This statistic is computed for all field types but is not computed by default. |
| |
| == Local Parameters with the Stats Component |
| |
| Similar to the <<faceting.adoc#,Facet Component>>, the `stats.field` parameter supports local parameters for: |
| |
| * Tagging & Excluding Filters: `stats.field={!ex=filterA}price` |
| * Changing the Output Key: `stats.field={!key=my_price_stats}price` |
| * Tagging stats for <<The Stats Component and Faceting,use with `facet.pivot`>>: `stats.field={!tag=my_pivot_stats}price` |
| |
| Local parameters can also be used to specify individual statistics by name, overriding the set of statistics computed by default, e.g., `stats.field={!min=true max=true percentiles='99,99.9,99.99'}price`. |
| |
| [IMPORTANT] |
| ==== |
| If any supported statistics are specified via local parameters, then the entire set of default statistics is overridden and only the requested statistics are computed. |
| ==== |
| |
| Additional "Expert" local params are supported in some cases for affecting the behavior of some statistics: |
| |
| * `percentiles` |
| ** `tdigestCompression` - a positive numeric value defaulting to `100.0` controlling the compression factor of the T-Digest. Larger values means more accuracy, but also uses more memory. |
| * `cardinality` |
| ** `hllPreHashed` - a boolean option indicating that the statistics are being computed over a "long" field that has already been hashed at index time – allowing the HLL computation to skip this step. |
| ** `hllLog2m` - an integer value specifying an explicit "log2m" value to use, overriding the heuristic value determined by the cardinality local param and the field type – see the https://github.com/aggregateknowledge/java-hll/[java-hll] documentation for more details |
| ** `hllRegwidth` - an integer value specifying an explicit "regwidth" value to use, overriding the heuristic value determined by the cardinality local param and the field type – see the https://github.com/aggregateknowledge/java-hll/[java-hll] documentation for more details |
| |
| === Examples with Local Parameters |
| |
| Here we compute some statistics for the price field. The min, max, mean, 90th, and 99th percentile price values are computed against all products that are in stock (`q=*:*` and `fq=inStock:true`), and independently all of the default statistics are computed against all products regardless of whether they are in stock or not (by excluding that filter). |
| |
| [source,text] |
| http://localhost:8983/solr/techproducts/select?q=*:*&fq={!tag=stock_check}inStock:true&stats=true&stats.field={!ex=stock_check+key=instock_prices+min=true+max=true+mean=true+percentiles='90,99'}price&stats.field={!key=all_prices}price&rows=0&indent=true&wt=xml |
| |
| [source,xml] |
| ---- |
| <lst name="stats"> |
| <lst name="stats_fields"> |
| <lst name="instock_prices"> |
| <double name="min">0.0</double> |
| <double name="max">2199.0</double> |
| <double name="mean">328.20437693595886</double> |
| <lst name="percentiles"> |
| <double name="90.0">564.9700012207031</double> |
| <double name="99.0">1966.6484985351556</double> |
| </lst> |
| </lst> |
| <lst name="all_prices"> |
| <double name="min">0.0</double> |
| <double name="max">2199.0</double> |
| <long name="count">12</long> |
| <long name="missing">5</long> |
| <double name="sum">4089.880027770996</double> |
| <double name="sumOfSquares">5385249.921747174</double> |
| <double name="mean">340.823335647583</double> |
| <double name="stddev">602.3683083752779</double> |
| </lst> |
| </lst> |
| </lst> |
| ---- |
| |
| == The Stats Component and Faceting |
| |
| Sets of `stats.field` parameters can be referenced by `tag` when using Pivot Faceting to compute multiple statistics at every level (i.e., field) in the tree of pivot constraints. |
| |
| For more information and a detailed example, please see <<faceting.adoc#combining-stats-component-with-pivots,Combining Stats Component With Pivots>>. |