= JSON Facet API
:page-tocclass: right
:solr-root-path: ../../
:example-source-dir: {solr-root-path}solrj/src/test/org/apache/solr/client/ref_guide_examples/
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
== Facet & Analytics Module
The new Facet & Analytics Module exposed via the JSON Facet API is a rewrite of Solr's previous faceting capabilities, with the following goals:
* First class native JSON API to control faceting and analytics
** The structured nature of nested sub-facets is more naturally expressed in JSON than in the flat namespace provided by normal query parameters.
* First class integrated analytics support
* Nest any facet type under any other facet type (such as range facet, field facet, query facet)
* Ability to sort facet buckets by any calculated metric
* Easier programmatic construction of complex nested facet commands
* Support a more canonical response format that is easier for clients to parse
* Support a cleaner way to implement distributed faceting
* Support better integration with other search features
* Full integration with the JSON Request API
== Faceted Search
Faceted search is about aggregating data and calculating metrics about that data.
There are two main types of facets:
* Facets that partition or categorize data (the domain) into multiple *buckets*
* Facets that calculate data for a given bucket (normally a metric, statistic or analytic function)
=== Metrics Example
By default, the *domain* for facets starts with all documents that match the base query and any filters. Here's an example that requests various metrics about the root domain:
[source,bash]
----
curl http://localhost:8983/solr/techproducts/query -d '
q=memory&
fq=inStock:true&
json.facet={
"avg_price" : "avg(price)",
"num_suppliers" : "unique(manu_exact)",
"median_weight" : "percentile(weight,50)"
}'
----
The response to the facet request above will start with documents matching the root domain (docs containing "memory" with inStock:true) then calculate and return the requested metrics:
[...]
[source,java]
----
"facets" : {
"count" : 4,
"avg_price" : 109.9950008392334,
"num_suppliers" : 3,
"median_weight" : 352.0
}
----
=== Bucketing Facet Example
Here's an example of a bucketing facet that partitions documents into buckets based on the `cat` field (short for category), and returns the top 3 buckets:
[.dynamic-tabs]
--
[example.tab-pane#curljsonsimpletermsfacet]
====
[.tab-label]*curl*
[source,bash]
----
curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
json.facet={
categories : {
type : terms,
field : cat, // bucket documents based on the "cat" field
limit : 3 // retrieve the top 3 buckets ranked by the number of docs in each bucket
}
}'
----
====
[example.tab-pane#solrjjsonsimpletermsfacet]
====
[.tab-label]*SolrJ*
[source,java,indent=0]
----
include::{example-source-dir}JsonRequestApiTest.java[tag=solrj-json-simple-terms-facet]
----
====
--
The response below shows us that 32 documents match the default root domain, 12 documents have `cat:electronics`, 4 documents have `cat:currency`, etc.
[source,java]
----
[...]
"facets":{
"count":32,
"categories":{
"buckets":[{
"val":"electronics",
"count":12},
{
"val":"currency",
"count":4},
{
"val":"memory",
"count":3}
]
}
}
----
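
As a rough illustration of how a client might consume this response, the sketch below parses the facet section shown above with Python's standard `json` module and collects the bucket counts into a dictionary. The variable names are illustrative only and are not part of any Solr client library.

[source,python]
----
import json

# Facet section of the response above, written out as strict JSON.
response_text = """
{"facets": {"count": 32,
            "categories": {"buckets": [
                {"val": "electronics", "count": 12},
                {"val": "currency", "count": 4},
                {"val": "memory", "count": 3}]}}}
"""

facets = json.loads(response_text)["facets"]

# Map each bucket value to its document count.
bucket_counts = {b["val"]: b["count"] for b in facets["categories"]["buckets"]}

print(facets["count"], bucket_counts)
----

Every bucketing facet in this guide returns the same `buckets` list of `val`/`count` objects, so the same loop applies to other facet names.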
=== Making a Facet Request
In this guide, we will often just present the **facet command block**:
[source,java]
----
{
x: "avg(mul(price,popularity))"
}
----
To execute a facet command block such as this, you'll need to use the `json.facet` parameter, and provide at least a base query such as `q=\*:*`:
[source,bash]
----
curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&json.facet=
{
x: "avg(mul(price,popularity))"
}
'
----
Another option is to use the JSON Request API to provide the entire request in JSON:
[.dynamic-tabs]
--
[example.tab-pane#curljsontermsfacet2]
====
[.tab-label]*curl*
[source,bash]
----
curl http://localhost:8983/solr/techproducts/query -d '
{
query: "*:*", // this is the base query
filter: [ "inStock:true" ], // a list of filters
facet: {
x: "avg(mul(price,popularity))" // and our funky metric of average of price * popularity
}
}
'
----
====
[example.tab-pane#solrjjsontermsfacet2]
====
[.tab-label]*SolrJ*
[source,java,indent=0]
----
include::{example-source-dir}JsonRequestApiTest.java[tag=solrj-json-terms-facet2]
----
====
--
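
The two request styles above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; it builds the request bodies but does not send them, and the variable names are assumptions made for illustration.

[source,python]
----
import json
from urllib.parse import urlencode, parse_qs

facet_block = {"x": "avg(mul(price,popularity))"}

# Style 1: ordinary form parameters, with the facet block in json.facet.
form_body = urlencode({"q": "*:*", "json.facet": json.dumps(facet_block)})

# Style 2: the whole request expressed as JSON (JSON Request API).
json_body = json.dumps({
    "query": "*:*",
    "filter": ["inStock:true"],
    "facet": facet_block,
})

# Round-trip check: the facet block survives form encoding unchanged.
decoded = json.loads(parse_qs(form_body)["json.facet"][0])
print(decoded == facet_block)
----

Either body could then be posted to the `/query` handler used in the curl examples, with whatever HTTP client the application already uses.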
=== JSON Extensions
The *Noggit* JSON parser used by Solr accepts a number of JSON extensions, such as:
* Bare words can be left unquoted
* Single-line comments using either `//` or `#`
* Multi-line comments using C style `/* comments in here */`
* Single-quoted strings
* Backslash escaping of any character
* Trailing commas and extra commas. Example: `[9,4,3,]`
* nbsp (non-break space, `\u00a0`) is handled as whitespace
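
These extensions are a convenience for hand-written requests; most JSON libraries do not accept them. The sketch below shows that Python's strict `json` parser rejects a command block using a bare key and a trailing comma, and that programmatically generated requests can simply emit strict JSON instead.

[source,python]
----
import json

# Relaxed syntax as it might be typed by hand: bare key, trailing comma.
relaxed = '{ x : "avg(price)", }'

try:
    json.loads(relaxed)
    strict_parser_accepts = True
except json.JSONDecodeError:
    strict_parser_accepts = False

# The strict-JSON form of the same command block.
strict = json.dumps({"x": "avg(price)"})

print(strict_parser_accepts, strict)
----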
== Terms Facet
The terms facet (or field facet) buckets the domain based on the unique terms / values of a field.
[source,bash]
----
curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
json.facet={
categories:{
terms: {
field : cat, // bucket documents based on the "cat" field
limit : 5 // retrieve the top 5 buckets ranked by the number of docs in each bucket
}
}
}'
----
// TODO: This table has cells that won't work with PDF: https://github.com/ctargett/refguide-asciidoc-poc/issues/13
[width="100%",cols="20%,80%",options="header",]
|===
|Parameter |Description
|field |The field name to facet over.
|offset |Used for paging, this skips the first N buckets. Defaults to 0.
|limit |Limits the number of buckets returned. Defaults to 10.
|sort |Specifies how to sort the buckets produced.
`count` sorts by document count; `index` sorts by the index (natural) order of the bucket value. One can also sort by any <<json-facet-api.adoc#aggregation-functions,facet function / statistic>> that occurs in the bucket. The default is `count desc`. This parameter may also be specified in JSON like `sort:{count:desc}`. The sort order may be either `asc` or `desc`.
|overrequest a|
Number of buckets beyond the `limit` to internally request from shards during a distributed search.
Larger values can increase the accuracy of the final "Top Terms" returned when the individual shards have very different top terms.
The default of `-1` causes a heuristic to be applied based on the other options specified.
|refine |If `true`, turns on distributed facet refining. This uses a second phase to retrieve any buckets needed for the final result from shards that did not include those buckets in their initial internal results, so that every shard contributes to every returned bucket in this facet and any sub-facets. This makes counts & stats for returned buckets exact.
|overrefine a|
Number of buckets beyond the `limit` to consider internally during a distributed search when determining which buckets to refine.
Larger values can increase the accuracy of the final "Top Terms" returned when the individual shards have very different top terms, and the current `sort` option can result in refinement pushing terms lower down the sorted list (ex: `sort:"count asc"`).
The default of `-1` causes a heuristic to be applied based on the other options specified.
|mincount |Only return buckets with a count of at least this number. Defaults to 1.
|missing |A boolean that specifies if a special missing bucket should be returned that is defined by documents without a value in the field. Defaults to false.
|numBuckets |A boolean. If true, adds numBuckets to the response, an integer representing the number of buckets for the facet (as opposed to the number of buckets returned). Defaults to false.
|allBuckets |A boolean. If true, adds an allBuckets bucket to the response, representing the union of all of the buckets. For multi-valued fields, this is different than a bucket for all of the documents in the domain since a single document can belong to multiple buckets. Defaults to false.
|prefix |Only produce buckets for terms starting with the specified prefix.
|facet |Aggregations, metrics or nested facets that will be calculated for every returned bucket
|method a|
This parameter indicates the facet algorithm to use:
* "dv" DocValues, collect into ordinal array
* "uif" UnInvertedField, collect into ordinal array
* "dvhash" DocValues, collect into hash - improves efficiency over high cardinality fields
* "enum" TermsEnum then intersect DocSet (stream-able)
* "stream" Presently equivalent to "enum"
* "smart" Pick the best method for the field type (this is the default)
|prelim_sort |An optional parameter for specifying an approximation of the final `sort` to use during initial collection of top buckets when the <<json-facet-api.adoc#sorting-facets-by-nested-functions,`sort` param is very costly>>.
|===
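
To illustrate how `offset` and `limit` combine for paging, here is a small sketch that builds a terms facet command for a given page of buckets. The helper name `terms_facet_page` and its arguments are hypothetical; only the keys in the returned dictionary come from the table above.

[source,python]
----
import json


def terms_facet_page(field, page, page_size=10, sort="count desc"):
    """Build a terms facet command that returns one page of buckets.

    `page` is zero-based, so page 0 starts at offset 0.
    """
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size must be >= 1")
    return {
        "type": "terms",
        "field": field,
        "offset": page * page_size,  # skip the buckets of earlier pages
        "limit": page_size,          # number of buckets on this page
        "sort": sort,
    }


# Third page (zero-based page 2) of five buckets each: buckets 11-15.
command = {"categories": terms_facet_page("cat", page=2, page_size=5)}
print(json.dumps(command))
----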
== Query Facet
The query facet produces a single bucket of documents that match the domain as well as the specified query.
An example of the simplest form of the query facet is `"query":"query string"`.
[source,java]
----
{
high_popularity: {query: "popularity:[8 TO 10]" }
}
----
An expanded form allows for more parameters and a facet command block to specify sub-facets (either nested facets or metrics):
[source,java]
----
{
high_popularity : {
type: query,
q : "popularity:[8 TO 10]",
facet : { average_price : "avg(price)" }
}
}
----
Example response:
[source,java]
----
"high_popularity" : {
"count" : 36,
"average_price" : 36.75
}
----
== Range Facet
The range facet produces multiple buckets over a date field or numeric field.
Example:
[source,java]
----
{
prices : {
type: range,
field : price,
start : 0,
end : 100,
gap : 20
}
}
----
[source,java]
----
"prices":{
"buckets":[
{
"val":0.0, // the bucket value represents the start of each range. This bucket covers 0-20
"count":5},
{
"val":20.0,
"count":3},
{
"val":40.0,
"count":2},
{
"val":60.0,
"count":1},
{
"val":80.0,
"count":1}
]
}
----
=== Range Facet Parameters
To ease migration, the range facet parameter names and semantics largely mirror facet.range query-parameter style faceting. For example "start" here corresponds to "facet.range.start" in a facet.range command.
// TODO: This table has cells that won't work with PDF: https://github.com/ctargett/refguide-asciidoc-poc/issues/13
[width="100%",cols="10%,90%",options="header",]
|===
|Parameter |Description
|field |The numeric field or date field to produce range buckets from.
|start |Lower bound of the ranges.
|end |Upper bound of the ranges.
|gap |Size of each range bucket produced.
|hardend |A boolean. If true, the last bucket ends at `end` even if that makes it narrower than `gap`. If false, the last bucket is `gap` wide and may extend past `end`.
|other a|
This parameter indicates that in addition to the counts for each range constraint between `start` and `end`, counts should also be computed for
* "before" all records with field values lower than the lower bound of the first range
* "after" all records with field values greater than the upper bound of the last range
* "between" all records with field values between the start and end bounds of all ranges
* "none" compute none of this information
* "all" shortcut for before, between, and after
|include a|
By default, the ranges used to compute range faceting between `start` and `end` are inclusive of their lower bounds and exclusive of the upper bounds. The before range is exclusive and the after range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. The `include` parameter may be any combination of the following options:
* "lower" all gap based ranges include their lower bound
* "upper" all gap based ranges include their upper bound
* "edge" the first and last gap ranges include their edge bounds (i.e., lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified
* "outer" the before and after ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.
* "all" shorthand for lower, upper, edge, outer
|facet |Aggregations, metrics, or nested facets that will be calculated for every returned bucket
|===
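
The interaction of `start`, `end`, `gap` and `hardend` can be sketched in a few lines of Python. This is only an illustration of the rules described in the table for plain numeric values; it is not Solr's implementation, it ignores date math gaps, and the function name is hypothetical.

[source,python]
----
def range_buckets(start, end, gap, hardend=False):
    """Return (lower, upper) bounds for each gap-based range bucket."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    buckets = []
    lower = start
    while lower < end:
        upper = lower + gap
        if hardend and upper > end:
            upper = end  # clip the last bucket at `end`
        buckets.append((lower, upper))
        lower = upper
    return buckets


even = range_buckets(0, 100, 20)                  # five buckets of width 20
soft = range_buckets(0, 100, 30)                  # last bucket is (90, 120)
hard = range_buckets(0, 100, 30, hardend=True)    # last bucket is (90, 100)
print(even, soft[-1], hard[-1])
----

With the default `include` setting, each pair would be read as inclusive of its lower bound and exclusive of its upper bound.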
== Heatmap Facet
The `heatmap` facet generates a 2D grid of facet counts for documents having spatial data in each grid cell.
This feature is primarily documented in the <<spatial-search.adoc#heatmap-faceting,spatial>> section of the reference guide.
The key parameters are `type` to specify `heatmap` and `field` to indicate a spatial RPT field.
The remaining parameters use the same names and semantics as facet.heatmap query-parameter style faceting, but without the "facet.heatmap." prefix.
For example `geom` here corresponds to `facet.heatmap.geom` in a facet.heatmap command.
NOTE: Unlike other facets that partition the domain into buckets, `heatmap` facets do not currently support <<Nested Facets>>.
Here's an example query:
[source,java]
----
{
hm : {
type : heatmap,
field : points_srpt,
geom : "[-49.492,-180 TO 64.701,73.125]",
distErrPct : 0.5
}
}
----
And the facet response will look like:
[source,json]
----
{
"facets":{
"count":145725,
"hm":{
"gridLevel":1,
"columns":6,
"rows":4,
"minX":-180.0,
"maxX":90.0,
"minY":-90.0,
"maxY":90.0,
"counts_ints2D":[[68,1270,459,5359,39456,1713],[123,10472,13620,7777,18376,6239],[88,6,3898,989,1314,255],[0,0,30,1,0,1]]
}}}
----
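
As a rough sketch of how a client might read this response, the Python below derives the size of each grid cell from the reported bounds and dimensions, and sums the cell counts. It assumes the grid divides the bounding box evenly, and it deliberately makes no assumption about the ordering of the rows, which is covered in the spatial section.

[source,python]
----
# The "hm" object from the response above.
hm = {
    "gridLevel": 1,
    "columns": 6,
    "rows": 4,
    "minX": -180.0,
    "maxX": 90.0,
    "minY": -90.0,
    "maxY": 90.0,
    "counts_ints2D": [
        [68, 1270, 459, 5359, 39456, 1713],
        [123, 10472, 13620, 7777, 18376, 6239],
        [88, 6, 3898, 989, 1314, 255],
        [0, 0, 30, 1, 0, 1],
    ],
}

# Width and height of one cell, in the units of the bounds (degrees here).
cell_width = (hm["maxX"] - hm["minX"]) / hm["columns"]
cell_height = (hm["maxY"] - hm["minY"]) / hm["rows"]

# Sum of all cell counts in the grid.
total = sum(sum(row) for row in hm["counts_ints2D"])

print(cell_width, cell_height, total)
----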
== Aggregation Functions
Unlike all the facets discussed so far, aggregation functions (also called *facet functions*, *analytic functions*, or *metrics*) do not partition data into buckets. Instead, they calculate something over all the documents in the domain.
[width="100%",cols="10%,30%,60%",options="header",]
|===
|Aggregation |Example |Description
|sum |`sum(sales)` |summation of numeric values
|avg |`avg(popularity)` |average of numeric values
|min |`min(salary)` |minimum value
|max |`max(mul(price,popularity))` |maximum value
|unique |`unique(author)` |number of unique values of the given field. Beyond 100 values it yields an estimate rather than an exact count
|uniqueBlock |`uniqueBlock(\_root_)` |same as above with smaller footprint strictly for <<json-facet-api.adoc#block-join-counts,counting the number of Block Join blocks>>. The given field must be unique across blocks, and only single-valued string fields are supported; docValues are recommended.
|hll |`hll(author)` |distributed cardinality estimate via hyper-log-log algorithm
|percentile |`percentile(salary,50,75,99,99.9)` |Percentile estimates via t-digest algorithm. When sorting by this metric, the first percentile listed is used as the sort value.
|sumsq |`sumsq(rent)` |sum of squares of field or function
|variance |`variance(rent)` |variance of numeric field or function
|stddev |`stddev(rent)` |standard deviation of field or function
|relatedness |`relatedness('popularity:[100 TO *]','inStock:true')`|A function for computing a relatedness score of the documents in the domain to a Foreground set, relative to a Background set (both defined as queries). This is primarily for use when building <<Semantic Knowledge Graphs>>.
|===
Numeric aggregation functions such as `avg` can be on any numeric field, or on a <<function-queries.adoc#function-queries,nested function>> of multiple numeric fields such as `avg(div(popularity,price))`.
The most common way of requesting an aggregation function is as a simple string containing the expression you wish to compute:
[source,javascript]
----
{
"average_roi": "avg(div(popularity,price))"
}
----
An expanded form allows for <<local-parameters-in-queries.adoc#local-parameters-in-queries,Local Parameters>> to be specified. These may be used explicitly by some specialized aggregations such as `<<json-facet-api.adoc#relatedness-options,relatedness()>>`, but can also be used as parameter references to make aggregation expressions more readable, without needing to use (global) request parameters:
[source,javascript]
----
{
"average_roi" : {
"type": "func",
"func": "avg(div($numer,$denom))",
"numer": "mul(popularity,rating)",
"denom": "mul(price,size)"
}
}
----
== Nested Facets
Nested facets, or **sub-facets**, allow one to nest facet commands under any facet command that partitions the domain into buckets (i.e., `terms`, `range`, `query`). These sub-facets are then evaluated against the domains defined by the set of all documents in each bucket of their parent.
The syntax is identical to top-level facets - just add a `facet` command to the facet command block of the parent facet. Technically, every facet command is actually a sub-facet since we start off with a single facet bucket with a domain defined by the main query and filters.
=== Nested Facet Example
Let's start off with a simple non-nested terms facet on the genre field:
[source,java]
----
top_genres:{
type: terms,
field: genre,
limit: 5
}
----
Now if we wanted to add a nested facet to find the top 2 authors for each genre bucket:
[source,java]
----
top_genres:{
type: terms,
field: genre,
limit: 5,
facet:{
top_authors:{
type: terms, // nested terms facet on author will be calculated for each parent bucket (genre)
field: author,
limit: 2
}
}
}
----
And the response will look something like:
[source,java]
----
"facets":{
"top_genres":{
"buckets":[
{
"val":"Fantasy",
"count":5432,
"top_authors":{ // these are the top authors in the "Fantasy" genre
"buckets":[{
"val":"Mercedes Lackey",
"count":121},
{
"val":"Piers Anthony",
"count":98}
]
}
},
{
"val":"Mystery",
"count":4322,
"top_authors":{ // these are the top authors in the "Mystery" genre
"buckets":[{
"val":"James Patterson",
"count":146},
{
"val":"Patricia Cornwell",
"count":132}
]
}
}
----
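
Because each sub-facet appears as a key inside its parent bucket, a nested response can be flattened with two loops. The sketch below does this for the response above, rewritten as complete strict JSON; the variable names are illustrative only.

[source,python]
----
import json

# The nested response above, with comments removed and brackets closed.
response_text = """
{"facets": {"top_genres": {"buckets": [
    {"val": "Fantasy", "count": 5432,
     "top_authors": {"buckets": [
         {"val": "Mercedes Lackey", "count": 121},
         {"val": "Piers Anthony", "count": 98}]}},
    {"val": "Mystery", "count": 4322,
     "top_authors": {"buckets": [
         {"val": "James Patterson", "count": 146},
         {"val": "Patricia Cornwell", "count": 132}]}}]}}}
"""

facets = json.loads(response_text)["facets"]

# One (genre, author, count) row per nested bucket.
rows = [
    (genre["val"], author["val"], author["count"])
    for genre in facets["top_genres"]["buckets"]
    for author in genre["top_authors"]["buckets"]
]

for row in rows:
    print(row)
----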
By default "top authors" is defined by simple document count descending, but we could use our aggregation functions to sort by more interesting metrics.
=== Sorting Facets By Nested Functions
The default sort for a field or terms facet is by bucket count descending. We can optionally `sort` ascending or descending by any facet function that appears in each bucket.
[source,java]
----
{
categories:{
type : terms, // terms facet creates a bucket for each indexed term in the field
field : cat,
sort : "x desc", // can also use sort:{x:desc}
facet : {
x : "avg(price)", // x = average price for each facet bucket
y : "max(popularity)" // y = max popularity value in each facet bucket
}
}
}
----
In some situations the desired `sort` may be an aggregation function that is very costly to compute for every bucket. A `prelim_sort` option can be used to specify an approximation of the `sort`, for initially ranking the buckets to determine the top candidates (based on the `limit` and `overrequest`). Only after the top candidate buckets have been refined, will the actual `sort` be used.
[source,java]
----
{
categories:{
type : terms,
field : cat,
refine: true,
limit: 10,
overrequest: 100,
prelim_sort: "sales_rank desc",
sort : "prod_quality desc",
facet : {
prod_quality : "avg(div(prod(rating,sales_rank),prod(num_returns,price)))",
sales_rank : "sum(sales_rank)"
}
}
}
----
== Changing the Domain
As discussed above, facets compute buckets or statistics based on a "domain" which is typically implicit:
* By default, facets use the set of all documents matching the main query as their domain.
* Nested "sub-facets" are computed for every bucket of their parent facet, using a domain containing all documents in that bucket.
But users can also override the "domain" of a facet that partitions data, using an explicit `domain` attribute whose value is a JSON Object that can support various options for restricting, expanding, or completely changing the original domain before the buckets are computed for the associated facet.
[NOTE]
====
`domain` changes can only be specified on individual facets that do data partitioning -- not statistical/metric facets, or groups of facets.
A `\*:*` query facet with a `domain` change can be used to group multiple sub-facets of any type, for the purpose of applying a common domain change.
====
=== Adding Domain Filters
The simplest example of a domain change is to specify an additional filter which will be applied to the existing domain. This can be done via the `filter` keyword in the `domain` block of the facet.
Example:
[source,json]
----
{
categories: {
type: terms,
field: cat,
domain: {filter: "popularity:[5 TO 10]" }
}
}
----
The value of `filter` can be a single query to treat as a filter, or a JSON list of filter queries. Each query can be:
* a string containing a query in Solr query syntax
* a reference to a request parameter containing Solr query syntax, of the form: `{param : <request_param_name>}`
When a `filter` option is combined with other `domain` changing options, the filtering is applied _after_ the other domain changes take place.
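
The list form and the request parameter reference can be combined in one `filter`. The following sketch builds such a request body with the Python standard library; the parameter name `pop_filter` is a hypothetical name chosen for this example, and the body is built but not sent.

[source,python]
----
import json
from urllib.parse import urlencode, parse_qs

facet_block = {
    "categories": {
        "type": "terms",
        "field": "cat",
        "domain": {
            "filter": [
                "inStock:true",           # a query string used as a filter
                {"param": "pop_filter"},  # a reference to a request parameter
            ]
        },
    }
}

body = urlencode({
    "q": "*:*",
    "pop_filter": "popularity:[5 TO 10]",  # the referenced parameter
    "json.facet": json.dumps(facet_block),
})

# Decode again to confirm what the server would receive.
sent = parse_qs(body)
print(sent["pop_filter"][0])
----

Keeping the filter text in its own request parameter lets several facets share it without repeating the query string.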
=== Filter Exclusions
Domains can also exclude the top-level query or filters via the `excludeTags` keyword in the `domain` block of the facet, expanding the existing domain.
Example:
[source,bash]
----
&q={!tag=top}"running shorts"
&fq={!tag=COLOR}color:Blue
&json={
filter:"{!tag=BRAND}brand:Bosco",
facet:{
sizes:{type:terms, field:size},
colors:{type:terms, field:color, domain:{excludeTags:COLOR} },
brands:{type:terms, field:brand, domain:{excludeTags:"BRAND,top"} }
}
}
----
The value of `excludeTags` can be a single string tag, an array of string tags, or comma-separated tags in a single string.
When an `excludeTags` option is combined with other `domain` changing options, it expands the domain _before_ any other domain changes take place.
See also the section on <<faceting.adoc#tagging-and-excluding-filters,multi-select faceting>>.
=== Arbitrary Domain Query
A `query` domain can be specified when you wish to compute a facet against an arbitrary set of documents, regardless of the original domain. The most common use case would be to compute a top level facet against a specific subset of the collection, regardless of the main query. But it can also be useful on nested facets when building <<Semantic Knowledge Graphs>>.
Example:
[source,json]
----
{
"categories": {
"type": "terms",
"field": "cat",
"domain": {"query": "*:*" }
}
}
----
The value of `query` can be a single query, or a JSON list of queries. Each query can be:
* a string containing a query in Solr query syntax
* a reference to a request parameter containing Solr query syntax, of the form: `{param: <request_param_name>}`
NOTE: While a `query` domain can be combined with an additional domain `filter`, it is not possible to also use `excludeTags`, because the tags would be meaningless: the `query` domain already completely ignores the top-level query and all previous filters.
=== Block Join Domain Changes
When a collection contains <<uploading-data-with-index-handlers.adoc#nested-child-documents, Block Join child documents>>, the `blockChildren` or `blockParent` domain options can be used to transform an existing domain containing one type of document into a domain containing the documents with the specified relationship (child or parent of) to the documents from the original domain.
Both of these options work similarly to the corresponding <<other-parsers.adoc#block-join-query-parsers,Block Join Query Parsers>> by taking in a single String query that exclusively matches all parent documents in the collection. If `blockParent` is used, then the resulting domain will contain all parent documents of the children from the original domain. If `blockChildren` is used, then the resulting domain will contain all child documents of the parents from the original domain.
Example:
[source,json,subs="verbatim,callouts"]
----
{
"colors": { // <1>
"type": "terms",
"field": "sku_color", // <2>
"facet" : {
"brands" : {
"type": "terms",
"field": "product_brand", // <3>
"domain": {
"blockParent": "doc_type:product"
}
}}}}
----
<1> This example assumes we have parent documents corresponding to Products, with child documents corresponding to individual SKUs with unique colors, and that our original query was against SKU documents.
<2> The `colors` facet will be computed against all of the original SKU documents matching our search.
<3> For each bucket in the `colors` facet, the set of all matching SKU documents will be transformed into the set of corresponding parent Product documents. The resulting `brands` sub-facet will count how many Product documents (that have SKUs with the associated color) exist for each Brand.
=== Join Query Domain Changes
A `join` domain change option can be used to specify arbitrary `from` and `to` fields to use in transforming from the existing domain to a related set of documents.
This works very similarly to the <<other-parsers.adoc#join-query-parser,Join Query Parser>>, and has the same limitations when dealing with multi-shard collections.
Example:
[source,json]
----
{
"colors": {
"type": "terms",
"field": "sku_color",
"facet": {
"brands": {
"type": "terms",
"field": "product_brand",
"domain" : {
"join" : {
"from": "product_id_of_this_sku",
"to": "id"
},
"filter": "doc_type:product"
}
}
}
}
}
----
=== Graph Traversal Domain Changes
A `graph` domain change option works similarly to the `join` domain option, but can traverse multiple hops `from` the existing domain `to` other documents.
This works very similarly to the <<other-parsers.adoc#graph-query-parser,Graph Query Parser>>, supporting all of its optional parameters, and has the same limitations when dealing with multi-shard collections.
Example:
[source,json]
----
{
"related_brands": {
"type": "terms",
"field": "brand",
"domain": {
"graph": {
"from": "related_product_ids",
"to": "id",
"maxDepth": 3
}
}
}
}
----
== Block Join Counts
When a collection contains <<uploading-data-with-index-handlers.adoc#nested-child-documents, Block Join child documents>>, the `blockChildren` and `blockParent` domain changes mentioned above can be useful when searching for parent documents and you want to compute stats against all of the affected child documents (or vice versa). But in the situation where the _count_ of all the blocks that exist in the current domain is sufficient, a more efficient option is the `uniqueBlock()` aggregate function.
=== Block Join Counts Example
Suppose we have products with multiple SKUs, and we want to count products for each color.
[source,json]
----
{
"id": "1", "type": "product", "name": "Solr T-Shirt",
"_childDocuments_": [
{ "id": "11", "type": "SKU", "color": "Red", "size": "L" },
{ "id": "12", "type": "SKU", "color": "Blue", "size": "L" },
{ "id": "13", "type": "SKU", "color": "Red", "size": "M" }
]
},
{
"id": "2", "type": "product", "name": "Solr T-Shirt",
"_childDocuments_": [
{ "id": "21", "type": "SKU", "color": "Blue", "size": "S" }
]
}
----
When searching against a set of SKU documents, we can ask for a facet on color, with a nested statistic counting all the "blocks" -- aka: products:
[source,java]
----
color: {
type: terms,
field: color,
limit: -1,
facet: {
productsCount: "uniqueBlock(_root_)"
}
}
----
and get:
[source,java]
----
color:{
buckets:[
{ val:Blue, count:2, productsCount:2 },
{ val:Red, count:2, productsCount:1 }
]
}
----
Note that `\_root_` is an internal field added by Lucene to each child document to reference its parent document.
The aggregation `uniqueBlock(\_root_)` is functionally equivalent to `unique(\_root_)`, but is optimized for the block structure of nested documents.
It is recommended to set `limit: -1` when calculating `uniqueBlock`, as in the example above,
since the default `limit` is `10`, and `uniqueBlock` is expected to be much faster with `-1`.
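
The counts in this example can be reproduced by hand. The sketch below is a plain-Python simulation of what the facet computes for the four sample SKU documents, not Solr code; it assumes that each SKU's `\_root_` value is the `id` of its parent product.

[source,python]
----
from collections import defaultdict

# The four child (SKU) documents from the sample data, with the id of
# their parent product in "_root_" (an assumption made for this sketch).
sku_docs = [
    {"id": "11", "_root_": "1", "color": "Red"},
    {"id": "12", "_root_": "1", "color": "Blue"},
    {"id": "13", "_root_": "1", "color": "Red"},
    {"id": "21", "_root_": "2", "color": "Blue"},
]

sku_count = defaultdict(int)  # documents per color bucket
blocks = defaultdict(set)     # distinct parent blocks per color bucket

for doc in sku_docs:
    sku_count[doc["color"]] += 1
    blocks[doc["color"]].add(doc["_root_"])

buckets = [
    {"val": color, "count": sku_count[color], "productsCount": len(blocks[color])}
    for color in sorted(sku_count)
]

for bucket in buckets:
    print(bucket)
----

Red has two SKUs but both belong to product `1`, which is why its `productsCount` is 1 while its `count` is 2.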
== Semantic Knowledge Graphs
The `relatedness(...)` aggregation function allows for sets of documents to be scored relative to Foreground and Background sets of documents, for the purposes of finding ad-hoc relationships that make up a "Semantic Knowledge Graph":
[quote, Grainger et al., 'https://arxiv.org/abs/1609.00464[The Semantic Knowledge Graph]']
____
At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes.
____
The `relatedness(...)` function is used to "score" these relationships, relative to "Foreground" and "Background" sets of documents, specified in the function params as queries.
Unlike most aggregation functions, the `relatedness(...)` function is aware of whether and how it's used in <<nested-facets,Nested Facets>>. It evaluates the query defining the current bucket _independently_ from its parent/ancestor buckets, and intersects those documents with a "Foreground Set" defined by the foreground query _combined with the ancestor buckets_. The result is then compared to a similar intersection done against the "Background Set" (defined exclusively by background query) to see if there is a positive, or negative, correlation between the current bucket and the Foreground Set, relative to the Background Set.
NOTE: While it's very common to define the Background Set as `\*:*`, or some other super-set of the Foreground Query, it is not strictly required. The `relatedness(...)` function can be used to compare the statistical relatedness of sets of documents to orthogonal foreground/background queries.
[[relatedness-options]]
=== relatedness() Options
When using the extended `type:func` syntax for specifying a `relatedness()` aggregation, an optional `min_popularity` (float) option can be used to specify a lower bound on the `foreground_popularity` and `background_popularity` values that must be met in order for the `relatedness` score to be valid -- if this `min_popularity` is not met, then the `relatedness` score will be `-Infinity`.
[source,json]
----
{ "type": "func",
"func": "relatedness($fore,$back)",
"min_popularity": 0.001
}
----
This can be particularly useful when using a descending sorting on `relatedness()` with foreground and background queries that are disjoint, to ensure the "top buckets" are all relevant to both sets.
[TIP]
====
When sorting on `relatedness(...)`, requests can be processed much more quickly by adding a `prelim_sort: "count desc"` option. Increasing the `overrequest` can help improve the accuracy of the top buckets.
====
=== Semantic Knowledge Graph Example
.Sample Documents
[source,bash,subs="verbatim,callouts"]
----
curl -sS -X POST 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '[
{"id":"01",age:15,"state":"AZ","hobbies":["soccer","painting","cycling"]},
{"id":"02",age:22,"state":"AZ","hobbies":["swimming","darts","cycling"]},
{"id":"03",age:27,"state":"AZ","hobbies":["swimming","frisbee","painting"]},
{"id":"04",age:33,"state":"AZ","hobbies":["darts"]},
{"id":"05",age:42,"state":"AZ","hobbies":["swimming","golf","painting"]},
{"id":"06",age:54,"state":"AZ","hobbies":["swimming","golf"]},
{"id":"07",age:67,"state":"AZ","hobbies":["golf","painting"]},
{"id":"08",age:71,"state":"AZ","hobbies":["painting"]},
{"id":"09",age:14,"state":"CO","hobbies":["soccer","frisbee","skiing","swimming","skating"]},
{"id":"10",age:23,"state":"CO","hobbies":["skiing","darts","cycling","swimming"]},
{"id":"11",age:26,"state":"CO","hobbies":["skiing","golf"]},
{"id":"12",age:35,"state":"CO","hobbies":["golf","frisbee","painting","skiing"]},
{"id":"13",age:47,"state":"CO","hobbies":["skiing","darts","painting","skating"]},
{"id":"14",age:51,"state":"CO","hobbies":["skiing","golf"]},
{"id":"15",age:64,"state":"CO","hobbies":["skating","cycling"]},
{"id":"16",age:73,"state":"CO","hobbies":["painting"]},
]'
----
.Example Query
[source,bash,subs="verbatim,callouts"]
----
curl -sS -X POST http://localhost:8983/solr/gettingstarted/query -d 'rows=0&q=*:*
&back=*:* # <1>
&fore=age:[35 TO *] # <2>
&json.facet={
hobby : {
type : terms,
field : hobbies,
limit : 5,
sort : { r1: desc }, # <3>
facet : {
r1 : "relatedness($fore,$back)", # <4>
location : {
type : terms,
field : state,
limit : 2,
sort : { r2: desc }, # <3>
facet : {
r2 : "relatedness($fore,$back)" # <4>
}
}
}
}
}'
----
<1> Use the entire collection as our "Background Set"
<2> Use a query for "age >= 35" to define our (initial) "Foreground Set"
<3> For both the top level `hobbies` facet & the sub-facet on `state` we will be sorting on the `relatedness(...)` values
<4> In both calls to the `relatedness(...)` function, we use <<local-parameters-in-queries.adoc#parameter-dereferencing,Parameter Variables>> to refer to the previously defined `fore` and `back` queries.
.The Facet Response
[source,json,subs="verbatim,callouts"]
----
"facets":{
"count":16,
"hobby":{
"buckets":[{
"val":"golf",
"count":6, // <1>
"r1":{
"relatedness":0.01225,
"foreground_popularity":0.3125, // <2>
"background_popularity":0.375}, // <3>
"location":{
"buckets":[{
"val":"az",
"count":3,
"r2":{
"relatedness":0.00496, // <4>
"foreground_popularity":0.1875, // <6>
"background_popularity":0.5}}, // <7>
{
"val":"co",
"count":3,
"r2":{
"relatedness":-0.00496, // <5>
"foreground_popularity":0.125,
"background_popularity":0.5}}]}},
{
"val":"painting",
"count":8, // <1>
"r1":{
"relatedness":0.01097,
"foreground_popularity":0.375,
"background_popularity":0.5},
"location":{
"buckets":[{
...
----
<1> Even though `hobbies:golf` has a lower total facet `count` than `hobbies:painting`, it has a higher `relatedness` score, indicating that relative to the Background Set (the entire collection) Golf has a stronger correlation to our Foreground Set (people age 35+) than Painting.
<2> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` is 31.25% of the total number of documents in the Background Set
<3> 37.5% of the documents in the Background Set match `hobbies:golf`
<4> The state of Arizona (AZ) has a _positive_ relatedness correlation with the _nested_ Foreground Set (people ages 35+ who play Golf) compared to the Background Set -- i.e., "People in Arizona are statistically more likely to be '35+ year old Golfers' than the country as a whole."
<5> The state of Colorado (CO) has a _negative_ correlation with the nested Foreground Set -- i.e., "People in Colorado are statistically less likely to be '35+ year old Golfers' than the country as a whole."
<6> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` _and_ `state:AZ` is 18.75% of the total number of documents in the Background Set
<7> 50% of the documents in the Background Set match `state:AZ`
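
The popularity figures in callouts 2, 3, 6 and 7 can be checked by hand against the sixteen sample documents. The sketch below does that in plain Python, following the definitions given in the callouts. It does not attempt to reproduce the `relatedness` score itself, and the helper names are illustrative only.

[source,python]
----
# (age, state, hobbies) for the 16 sample documents indexed above.
DOCS = [
    (15, "AZ", {"soccer", "painting", "cycling"}),
    (22, "AZ", {"swimming", "darts", "cycling"}),
    (27, "AZ", {"swimming", "frisbee", "painting"}),
    (33, "AZ", {"darts"}),
    (42, "AZ", {"swimming", "golf", "painting"}),
    (54, "AZ", {"swimming", "golf"}),
    (67, "AZ", {"golf", "painting"}),
    (71, "AZ", {"painting"}),
    (14, "CO", {"soccer", "frisbee", "skiing", "swimming", "skating"}),
    (23, "CO", {"skiing", "darts", "cycling", "swimming"}),
    (26, "CO", {"skiing", "golf"}),
    (35, "CO", {"golf", "frisbee", "painting", "skiing"}),
    (47, "CO", {"skiing", "darts", "painting", "skating"}),
    (51, "CO", {"skiing", "golf"}),
    (64, "CO", {"skating", "cycling"}),
    (73, "CO", {"painting"}),
]


def popularity(*predicates):
    """Fraction of the Background Set (all documents) matching every predicate."""
    matches = sum(1 for doc in DOCS if all(p(doc) for p in predicates))
    return matches / len(DOCS)


def fore(doc):   # fore=age:[35 TO *]
    return doc[0] >= 35


def golf(doc):   # hobbies:golf
    return "golf" in doc[2]


def in_az(doc):  # state:AZ
    return doc[1] == "AZ"


golf_fg = popularity(fore, golf)       # callout 2
golf_bg = popularity(golf)             # callout 3
az_fg = popularity(fore, golf, in_az)  # callout 6
az_bg = popularity(in_az)              # callout 7

print(golf_fg, golf_bg, az_fg, az_bg)  # 0.3125 0.375 0.1875 0.5
----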
== References
This documentation was originally adapted largely from the following blog pages:
http://yonik.com/json-facet-api/
http://yonik.com/solr-facet-functions/
http://yonik.com/solr-subfacets/
http://yonik.com/percentiles-for-solr-faceting/