blob: 0db83a0908b7c3b020563c9d0a4c2db5d962bd29 [file] [log] [blame]
= The Stats Component
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
The Stats component returns simple statistics for numeric, string, and date fields within the document set.
The sample queries in this section assume you are running the "```techproducts```" example included with Solr:
[source,bash]
----
bin/solr -e techproducts
----
== Stats Component Parameters
The Stats Component accepts the following parameters:
`stats`::
If `true`, then invokes the Stats component.
`stats.field`::
Specifies a field for which statistics should be generated. This parameter may be invoked multiple times in a query in order to request statistics on multiple fields.
+
<<local-parameters-in-queries.adoc#,Local Parameters>> may be used to indicate which subset of the supported statistics should be computed, and/or that statistics should be computed over the results of an arbitrary numeric function (or query) instead of a simple field name. See the examples below.
=== Stats Component Example
The query below demonstrates computing stats against two different fields numeric fields, as well as stats over the results of a `termfreq()` function call using the `text` field:
[source,text]
----
http://localhost:8983/solr/techproducts/select?q=*:*&wt=xml&stats=true&stats.field={!func}termfreq('text','memory')&stats.field=price&stats.field=popularity&rows=0&indent=true
----
[source,xml]
----
<lst name="stats">
<lst name="stats_fields">
<lst name="termfreq(text,memory)">
<double name="min">0.0</double>
<double name="max">3.0</double>
<long name="count">32</long>
<long name="missing">0</long>
<double name="sum">10.0</double>
<double name="sumOfSquares">22.0</double>
<double name="mean">0.3125</double>
<double name="stddev">0.7803018439949604</double>
<lst name="facets"/>
</lst>
<lst name="price">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">16</long>
<long name="missing">16</long>
<double name="sum">5251.270030975342</double>
<double name="sumOfSquares">6038619.175900028</double>
<double name="mean">328.20437693595886</double>
<double name="stddev">536.3536996709846</double>
<lst name="facets"/>
</lst>
<lst name="popularity">
<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.666666666666667</double>
<double name="stddev">2.943920288775949</double>
<lst name="facets"/>
</lst>
</lst>
</lst>
----
== Statistics Supported
The table below explains the statistics supported by the Stats component. Not all statistics are supported for all field types, and not all statistics are computed by default (see <<Local Parameters with the Stats Component>> below for details)
`min`::
The minimum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default.
`max`::
The maximum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default.
`sum`::
The sum of all values of the field/function in all documents in the set. This statistic is computed for numeric and date field types and is computed by default.
`count`::
The number of values found in all documents in the set for this field/function. This statistic is computed for all field types and is computed by default.
`missing`::
The number of documents in the set which do not have a value for this field/function. This statistic is computed for all field types and is computed by default.
`sumOfSquares`::
Sum of all values squared (a by product of computing stddev). This statistic is computed for numeric and date field types and is computed by default.
`mean`::
The average `(v1 + v2 .... + vN)/N`. This statistic is computed for numeric and date field types and is computed by default.
`stddev`::
Standard deviation, measuring how widely spread the values in the data set are. This statistic is computed for numeric and date field types and is computed by default.
`percentiles`::
A list of percentile values based on cut-off points specified by the parameter value, such as `1,99,99.9`. These values are an approximation, using the https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[t-digest algorithm]. This statistic is computed for numeric field types and is not computed by default.
`distinctValues`::
The set of all distinct values for the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a tiny cardinality. This statistic is computed for all field types but is not computed by default.
`countDistinct`::
The exact number of distinct values in the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a tiny cardinality. This statistic is computed for all field types but is not computed by default.
`cardinality`::
A statistical approximation (currently using the https://en.wikipedia.org/wiki/HyperLogLog[HyperLogLog] algorithm) of the number of distinct values in the field/function in all of the documents in the set. This calculation is much more efficient then using the `countDistinct` option, but may not be 100% accurate.
+
Input for this option can be floating point number between `0.0` and `1.0` indicating how aggressively the algorithm should try to be accurate: `0.0` means use as little memory as possible; `1.0` means use as much memory as needed to be as accurate as possible. `true` is supported as an alias for `0.3`.
+
This statistic is computed for all field types but is not computed by default.
== Local Parameters with the Stats Component
Similar to the <<faceting.adoc#,Facet Component>>, the `stats.field` parameter supports local parameters for:
* Tagging & Excluding Filters: `stats.field={!ex=filterA}price`
* Changing the Output Key: `stats.field={!key=my_price_stats}price`
* Tagging stats for <<The Stats Component and Faceting,use with `facet.pivot`>>: `stats.field={!tag=my_pivot_stats}price`
Local parameters can also be used to specify individual statistics by name, overriding the set of statistics computed by default, e.g., `stats.field={!min=true max=true percentiles='99,99.9,99.99'}price`.
[IMPORTANT]
====
If any supported statistics are specified via local parameters, then the entire set of default statistics is overridden and only the requested statistics are computed.
====
Additional "Expert" local params are supported in some cases for affecting the behavior of some statistics:
* `percentiles`
** `tdigestCompression` - a positive numeric value defaulting to `100.0` controlling the compression factor of the T-Digest. Larger values means more accuracy, but also uses more memory.
* `cardinality`
** `hllPreHashed` - a boolean option indicating that the statistics are being computed over a "long" field that has already been hashed at index time allowing the HLL computation to skip this step.
** `hllLog2m` - an integer value specifying an explicit "log2m" value to use, overriding the heuristic value determined by the cardinality local param and the field type see the https://github.com/aggregateknowledge/java-hll/[java-hll] documentation for more details
** `hllRegwidth` - an integer value specifying an explicit "regwidth" value to use, overriding the heuristic value determined by the cardinality local param and the field type see the https://github.com/aggregateknowledge/java-hll/[java-hll] documentation for more details
=== Examples with Local Parameters
Here we compute some statistics for the price field. The min, max, mean, 90th, and 99th percentile price values are computed against all products that are in stock (`q=*:*` and `fq=inStock:true`), and independently all of the default statistics are computed against all products regardless of whether they are in stock or not (by excluding that filter).
[source,text]
http://localhost:8983/solr/techproducts/select?q=*:*&fq={!tag=stock_check}inStock:true&stats=true&stats.field={!ex=stock_check+key=instock_prices+min=true+max=true+mean=true+percentiles='90,99'}price&stats.field={!key=all_prices}price&rows=0&indent=true&wt=xml
[source,xml]
----
<lst name="stats">
<lst name="stats_fields">
<lst name="instock_prices">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<double name="mean">328.20437693595886</double>
<lst name="percentiles">
<double name="90.0">564.9700012207031</double>
<double name="99.0">1966.6484985351556</double>
</lst>
</lst>
<lst name="all_prices">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">12</long>
<long name="missing">5</long>
<double name="sum">4089.880027770996</double>
<double name="sumOfSquares">5385249.921747174</double>
<double name="mean">340.823335647583</double>
<double name="stddev">602.3683083752779</double>
</lst>
</lst>
</lst>
----
== The Stats Component and Faceting
Sets of `stats.field` parameters can be referenced by `tag` when using Pivot Faceting to compute multiple statistics at every level (i.e., field) in the tree of pivot constraints.
For more information and a detailed example, please see <<faceting.adoc#combining-stats-component-with-pivots,Combining Stats Component With Pivots>>.