= Other Parsers
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
In addition to the main query parsers discussed earlier, there are several other query parsers that can be used instead of or in conjunction with the main parsers for specific purposes.
This section details the other parsers, and gives examples for how they might be used.
Many of these parsers are expressed the same way as <<local-parameters-in-queries.adoc#,Local Parameters in Queries>>.
== Block Join Query Parsers
There are two query parsers that support block joins. These parsers allow indexing and searching for relational content that has been <<indexing-nested-documents.adoc#, indexed as Nested Documents>>.
The example usage of the query parsers below assumes the following documents have been indexed:
[source,xml]
----
<add>
<doc>
<field name="id">1</field>
<field name="content_type">parent</field>
<field name="title">Solr has block join support</field>
<doc>
<field name="id">2</field>
<field name="content_type">child</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
<doc>
<field name="id">3</field>
<field name="content_type">parent</field>
<field name="title">New Lucene and Solr release</field>
<doc>
<field name="id">4</field>
<field name="content_type">child</field>
<field name="comments">Lots of new features</field>
</doc>
</doc>
</add>
----
=== Block Join Children Query Parser
This parser wraps a query that matches some parent documents and returns the children of those documents.
The syntax for this parser is: `q={!child of=<blockMask>}<someParents>`.
* The inner subordinate query string (`someParents`) must be a query that will match some parent documents
* The `of` parameter must be a query string to use as a <<#block-mask,Block Mask>> -- typically a query that matches the set of all possible parent documents
The resulting query will match all documents which do _not_ match the `<blockMask>` query and are children (or descendants) of the documents matched by `<someParents>`.
Using the example documents above, we can construct a query such as `q={!child of="content_type:parent"}title:lucene`. We only get one document in response:
[source,xml]
----
<result name="response" numFound="1" start="0">
<doc>
<str name="id">4</str>
<arr name="content_type"><str>child</str></arr>
<str name="comments">Lots of new features</str>
</doc>
</result>
----
[CAUTION]
====
The query for `someParents` *MUST* match a strict subset of the documents matched by the <<#block-mask,Block Mask>> or your query may result in an Error:
[literal]
Parent query must not match any docs besides parent filter. Combine them as must (+) and must-not (-) clauses to find a problem doc.
You can search for `q=+(someParents) -(blockMask)` to find a cause if you encounter this type of error.
====
==== Filtering and Tagging
`{!child}` also supports `filters` and `excludeTags` local parameters like the following:
[source,text]
?q={!child of=<blockMask> filters=$parentfq excludeTags=certain}<someParents>
&parentfq=BRAND:Foo
&parentfq=NAME:Bar
&parentfq={!tag=certain}CATEGORY:Baz
This is equivalent to:
[source,text]
q={!child of=<blockMask>}+<someParents> +BRAND:Foo +NAME:Bar
Notice the "$" syntax in `filters` for referencing queries. Comma-separated tags in `excludeTags` allow excluding certain queries by tagging. Overall the idea is similar to <<faceting.adoc#tagging-and-excluding-filters, excluding fq in facets>>. Note that filtering is applied to the subordinate clause (`<someParents>`) first, and the intersection result is joined to the children.
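The way `filters` and `excludeTags` combine can be sketched in a few lines of Python. This is an illustrative model only, not Solr's implementation: the `resolve_filters` helper is hypothetical, and it recognizes nothing but a leading `{!tag=...}` local parameter.

```python
import re

# Matches a leading {!tag=...} local parameter; other local params are not handled.
TAG_PREFIX = re.compile(r"^\{!tag=([^ }]+)\}(.*)$")

def resolve_filters(main_query, referenced, exclude_tags=""):
    """Return the query string equivalent to main_query plus the non-excluded filters."""
    excluded = {t for t in exclude_tags.split(",") if t}
    clauses = ["+" + main_query]
    for raw in referenced:
        match = TAG_PREFIX.match(raw)
        tags = set(match.group(1).split(",")) if match else set()
        body = match.group(2) if match else raw
        if tags & excluded:
            continue  # a tagged query named in excludeTags is dropped
        clauses.append("+" + body)
    return " ".join(clauses)

parentfq = ["BRAND:Foo", "NAME:Bar", "{!tag=certain}CATEGORY:Baz"]
equivalent = resolve_filters("<someParents>", parentfq, exclude_tags="certain")
print(equivalent)  # +<someParents> +BRAND:Foo +NAME:Bar
```

Without `exclude_tags`, the same call would keep all three filters.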
==== All Children Syntax
When the subordinate clause (`<someParents>`) is omitted, it's parsed as a _segmented_ and _cached_ filter for children documents. More precisely, `q={!child of=<blockMask>}` is equivalent to `q=\*:* -<blockMask>`.
=== Block Join Parent Query Parser
This parser takes a query that matches child documents and returns their parents.
The syntax for this parser is similar to the `child` parser: `q={!parent which=<blockMask>}<someChildren>`.
* The inner subordinate query string (`someChildren`) must be a query that will match some child documents
* The `which` parameter must be a query string to use as a <<#block-mask,Block Mask>> -- typically a query that matches the set of all possible parent documents
The resulting query will match all documents which _do_ match the `<blockMask>` query and are parents (or ancestors) of the documents matched by `<someChildren>`.
Again using the example documents above, we can construct a query such as `q={!parent which="content_type:parent"}comments:SolrCloud`. We get this document in response:
[source,xml]
----
<result name="response" numFound="1" start="0">
<doc>
<str name="id">1</str>
<arr name="content_type"><str>parent</str></arr>
<arr name="title"><str>Solr has block join support</str></arr>
</doc>
</result>
----
[CAUTION]
====
The query for `someChildren` *MUST NOT* match any documents matched by the <<#block-mask,Block Mask>> or your query may result in an Error:
[literal]
Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc.
You can search for `q=+(blockMask) +(someChildren)` to find a cause.
====
==== Filtering and Tagging
The `{!parent}` query supports `filters` and `excludeTags` local parameters like the following:
[source,text]
?q={!parent which=<blockMask> filters=$childfq excludeTags=certain}<someChildren>
&childfq=COLOR:Red
&childfq=SIZE:XL
&childfq={!tag=certain}PRINT:Hatched
This is equivalent to:
[source,text]
q={!parent which=<blockMask>}+<someChildren> +COLOR:Red +SIZE:XL
Notice the "$" syntax in `filters` for referencing queries. Comma-separated tags in `excludeTags` allow excluding certain queries by tagging. Overall the idea is similar to <<faceting.adoc#tagging-and-excluding-filters, excluding fq in facets>>. Note that filtering is applied to the subordinate clause (`<someChildren>`) first, and the intersection result is joined to the parents.
==== Scoring with the Block Join Parent Query Parser
You can optionally use the `score` local parameter to return scores of the subordinate query. The value of this parameter defines the type of aggregation: `avg` (average), `max` (maximum), `min` (minimum), or `total` (sum). The implicit default is `none`, which returns `0.0`.
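As a rough illustration of the aggregation modes, the following Python sketch combines the scores of a parent's matching children. The function name and the sample scores are hypothetical; it only models the arithmetic, not how Solr computes scores internally.

```python
def aggregate_child_scores(child_scores, mode="none"):
    """Combine the scores of a parent's matching children into one parent score."""
    # A parent only matches when at least one child matches,
    # so child_scores is assumed to be non-empty here.
    if mode == "none":
        return 0.0
    if mode == "avg":
        return sum(child_scores) / len(child_scores)
    if mode == "max":
        return max(child_scores)
    if mode == "min":
        return min(child_scores)
    if mode == "total":
        return sum(child_scores)
    raise ValueError(f"unknown score mode: {mode}")

scores = [0.5, 1.5, 1.0]  # hypothetical scores of three matching children
results = {m: aggregate_child_scores(scores, m)
           for m in ("none", "avg", "max", "min", "total")}
print(results)
```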
==== All Parents Syntax
When the subordinate clause (`<someChildren>`) is omitted, it's parsed as a _segmented_ and _cached_ filter for all parent documents, or more precisely `q={!parent which=<blockMask>}` is equivalent to `q=<blockMask>`.
[#block-mask]
=== Block Masks: The `of` and `which` local params
The purpose of the "Block Mask" query specified as either an `of` or `which` parameter (depending on the parser used) is to identify the set of all documents in the index which should be treated as "parents" _(or their ancestors)_ and which documents should be treated as "children". This is important because in the "on disk" index, the relationships are flattened into "blocks" of documents, so the `of` / `which` params are needed to serve as a "mask" against the flat document blocks to identify the boundaries of every hierarchical relationship.
In the example queries above, we were able to use a very simple Block Mask of `content_type:parent` because our data is very simple: every document is either a `parent` or a `child`. So this query string easily distinguishes _all_ of our documents.
A common mistake is to try to use a `which` parameter that is more restrictive than the set of all parent documents, in order to filter the parents that are matched, as in this bad example:
----
// BAD! DO NOT USE!
q={!parent which="title:join"}comments:support
----
This type of query will frequently not work the way you might expect. Since the `which` parameter only identifies _some_ of the "parent" documents, the resulting query can match "parent" documents it should not, because it will mistakenly identify all documents which do _not_ match the `which="title:join"` Block Mask as children of the next "parent" document in the index (that does match this Mask).
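The effect of an overly restrictive mask can be illustrated with a small Python model. It is a deliberate simplification, not Solr's implementation: it assumes children are stored immediately before their parent in the flattened index, the `contains` helper is a hypothetical stand-in for a term query, and the restrictive mask `title:lucene` is chosen only to make the misassignment visible with the example documents.

```python
# Simplified model of a flattened block index: children are stored
# immediately before their parent, so each block ends with a parent.
INDEX = [
    {"id": "2", "content_type": "child", "comments": "SolrCloud supports it too!"},
    {"id": "1", "content_type": "parent", "title": "Solr has block join support"},
    {"id": "4", "content_type": "child", "comments": "Lots of new features"},
    {"id": "3", "content_type": "parent", "title": "New Lucene and Solr release"},
]

def contains(field, word):
    """Hypothetical stand-in for a term query: case-insensitive substring match."""
    return lambda doc: word.lower() in doc.get(field, "").lower()

def to_parent(block_mask, some_children):
    """Map each matching child to the next document that matches the mask."""
    parents = []
    for pos, doc in enumerate(INDEX):
        if not some_children(doc):
            continue
        for candidate in INDEX[pos:]:
            if block_mask(candidate):
                if candidate["id"] not in parents:
                    parents.append(candidate["id"])
                break
    return parents

all_parents = lambda doc: doc["content_type"] == "parent"
good = to_parent(all_parents, contains("comments", "SolrCloud"))
bad = to_parent(contains("title", "lucene"), contains("comments", "SolrCloud"))
print(good)  # ['1']
print(bad)   # ['3']
```

With the complete mask, the child with `id` 2 maps to its real parent. With the restrictive mask, document 1 is no longer recognized as a block boundary, so the same child is attributed to document 3.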
A similar problematic situation can arise when mixing parent/child documents with "simple" documents that have no children _and do not match the query used to identify 'parent' documents_. For example, if we add the following document to our existing parent/child example documents:
[source,xml]
----
<add>
<doc>
<field name="id">0</field>
<field name="content_type">plain</field>
<field name="title">Lucene and Solr are cool</field>
</doc>
</add>
----
...then our simple `content_type:parent` Block Mask would no longer be adequate. We would instead need to use `\*:* -content_type:child` or `content_type:(plain parent)` to prevent our "simple" document from mistakenly being treated as a "child" of an adjacent "parent" document.
The <<searching-nested-documents#searching-nested-documents,Searching Nested Documents>> section contains more detailed examples of specifying Block Mask queries with non-trivial hierarchies of documents.
== Boolean Query Parser
The `BoolQParser` creates a Lucene `BooleanQuery` which is a boolean combination of other queries. Sub-queries along with their typed occurrences indicate how documents will be matched and scored.
*Parameters*
`must`::
A list of queries that *must* appear in matching documents and contribute to the score.
`must_not`::
A list of queries that *must not* appear in matching documents.
`should`::
A list of queries that *should* appear in matching documents. For a BooleanQuery with no `must` queries, one or more `should` queries must match a document for the BooleanQuery to match.
`filter`::
A list of queries that *must* appear in matching documents. However, unlike `must`, the score of filter queries is ignored. Also, these queries are cached in the filter cache. To avoid caching, add either `cache=false` as a local parameter, or a `"cache":"false"` property to the underlying Query DSL Object.
`excludeTags`::
Comma-separated list of tags for excluding queries from parameters above. See explanation below.
*Examples*
[source,text]
----
{!bool must=foo must=bar}
----
[source,text]
----
{!bool filter=foo should=bar}
----
Parameters might also be multivalue references. The first example above is equivalent to:
[source,text]
----
q={!bool must=$ref}&ref=foo&ref=bar
----
Referred queries might be excluded via tags. Overall the idea is similar to <<faceting.adoc#tagging-and-excluding-filters, excluding fq in facets>>.
[source,text]
----
q={!bool must=$ref excludeTags=t2}&ref={!tag=t1}foo&ref={!tag=t2}bar
----
Since the latter query is excluded via `t2`, the resulting query is equivalent to:
[source,text]
----
q={!bool must=foo}
----
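When such a request is sent over HTTP, the local-parameter syntax and the repeated `ref` parameter both need care. The following Python sketch uses only the standard library to build the query string; it makes no assumption about the host or collection, which you would prepend yourself.

```python
from urllib.parse import urlencode, parse_qsl

# A list of pairs keeps the repeated "ref" parameter; a dict would collapse it.
params = [
    ("q", "{!bool must=$ref excludeTags=t2}"),
    ("ref", "{!tag=t1}foo"),
    ("ref", "{!tag=t2}bar"),
]

# urlencode percent-encodes characters such as "{", "!", "$" and "=".
query_string = urlencode(params)
print(query_string)

# Decoding the string gives back the original parameter list.
decoded = parse_qsl(query_string)
```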
== Boost Query Parser
`BoostQParser` extends the `QParserPlugin` and creates a boosted query from the input value. The main value is any query to be "wrapped" and "boosted" -- only documents which match that query will match the final query produced by this parser.
Parameter `b` is a <<function-queries.adoc#available-functions,function>> to be evaluated against each document that matches the original query, and the result of the function will be multiplied into the final score for that document.
=== Boost Query Parser Examples
Creates a query `name:foo` which is boosted (scores are multiplied) by the function query `log(popularity)`:
[source,text]
----
q={!boost b=log(popularity)}name:foo
----
Creates a query `name:foo` which has its scores multiplied by the _inverse_ of the numeric `price` field -- effectively "demoting" documents which have a high `price` by lowering their final score:
[source,text]
----
// NOTE: we "add 1" to the denominator to prevent divide by zero
q={!boost b=div(1,add(1,price))}name:foo
----
The `<<function-queries.adoc#query-function,query(...)>>` function is particularly useful for situations where you want to multiply (or divide) the score for each document matching your main query by the score that document would have from another query.
This example uses <<local-parameters-in-queries.adoc#parameter-dereferencing,local parameter variables>> to create a query for `name:foo` which is boosted by the scores from the independently specified query `category:electronics`:
[source,text]
----
q={!boost b=query($my_boost)}name:foo
my_boost=category:electronics
----
[[other-collapsing]]
== Collapsing Query Parser
The `CollapsingQParser` is really a _post filter_ that provides more performant field collapsing than Solr's standard approach when the number of distinct groups in the result set is high.
This parser collapses the result set to a single document per group before it forwards the result set to the rest of the search components. So all downstream components (faceting, highlighting, etc.) will work with the collapsed result set.
Details about using the `CollapsingQParser` can be found in the section <<collapse-and-expand-results.adoc#,Collapse and Expand Results>>.
== Complex Phrase Query Parser
The `ComplexPhraseQParser` provides support for wildcards, ORs, etc., inside phrase queries using Lucene's {lucene-javadocs}/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html[`ComplexPhraseQueryParser`].
Under the covers, this query parser makes use of the Span group of queries, e.g., spanNear, spanOr, etc., and is subject to the same limitations as that family of parsers.
*Parameters*
`inOrder`::
Set to true to force phrase queries to match terms in the order specified. The default is `true`.
`df`::
The default search field.
*Examples*
[source,text]
----
{!complexphrase inOrder=true}name:"Jo* Smith"
----
[source,text]
----
{!complexphrase inOrder=false}name:"(john jon jonathan~) peters*"
----
A mix of ordered and unordered complex phrase queries:
[source,text]
----
+_query_:"{!complexphrase inOrder=true}manu:\"a* c*\"" +_query_:"{!complexphrase inOrder=false df=name}\"bla* pla*\""
----
=== Complex Phrase Parser Limitations
Performance is sensitive to the number of unique terms that are associated with a pattern. For instance, searching for "a*" will form a large OR clause (technically a SpanOr with many terms) for all of the terms in your index for the indicated field that start with the single letter 'a'. It may be prudent to restrict wildcards to at least two or preferably three letters as a prefix. Allowing very short prefixes may result in too many low-quality documents being returned.
Notice that it also supports leading wildcards such as "*a", with consequent performance implications. Applying <<filter-descriptions.adoc#reversed-wildcard-filter,ReversedWildcardFilterFactory>> in index-time analysis is usually a good idea.
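The growth of the OR clause with shorter prefixes can be sketched with a toy term dictionary. The term list below is hypothetical, and the lookup is only a model of prefix expansion, not the parser's actual code.

```python
from bisect import bisect_left

# Hypothetical sorted term dictionary for one field.
TERMS = sorted(["a", "an", "and", "ape", "app", "apple",
                "apply", "apps", "are", "as", "at"])

def expand_prefix(prefix):
    """Return every indexed term that starts with the given prefix."""
    start = bisect_left(TERMS, prefix)
    matches = []
    for term in TERMS[start:]:
        if not term.startswith(prefix):
            break  # terms are sorted, so no later term can match
        matches.append(term)
    return matches

for pattern in ("a", "ap", "app"):
    print(f"{pattern}* expands to {len(expand_prefix(pattern))} clause(s)")
```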
==== MaxBooleanClauses with Complex Phrase Parser
You may need to increase MaxBooleanClauses in `solrconfig.xml` as a result of the term expansion above:
[source,xml]
----
<maxBooleanClauses>4096</maxBooleanClauses>
----
This property is described in more detail in the section <<query-settings-in-solrconfig.adoc#query-sizing-and-warming,Query Sizing and Warming>>.
==== Stopwords with Complex Phrase Parser
It is recommended not to use stopword elimination with this query parser.
Let's say we add the terms *the*, *up*, and *to* to `stopwords.txt` for your collection, and index a document containing the text _"Stores up to 15,000 songs, 25,000 photos, or 150 hours of video"_ in a field named "features".
While the query below does not use this parser:
[source,text]
----
q=features:"Stores up to 15,000"
----
the document is returned. However, the next query, which _does_ use the Complex Phrase Query Parser:
[source,text]
----
q=features:"sto* up to 15*"&defType=complexphrase
----
does _not_ return that document because SpanNearQuery has no good way to handle stopwords in a way analogous to PhraseQuery. If you must remove stopwords for your use case, use a custom filter factory or perhaps a customized synonyms filter that reduces given stopwords to some impossible token.
==== Escaping with Complex Phrase Parser
Special care has to be taken when escaping: clauses between double quotes (usually the whole query) are parsed twice, so these parts have to be escaped twice, e.g., `"foo\\: bar\\^"`.
== Field Query Parser
The `FieldQParser` extends the `QParserPlugin` and creates a field query from the input value, applying text analysis and constructing a phrase query if appropriate. The parameter `f` is the field to be queried.
Example:
[source,text]
----
{!field f=myfield}Foo Bar
----
This example creates a phrase query with "foo" followed by "bar" (assuming `myfield` is a text field with an analyzer that splits on whitespace and lowercases terms). This is generally equivalent to the Lucene query parser expression `myfield:"Foo Bar"`.
== Filters Query Parser
The syntax is:
[literal]
q={!filters param=$fqs excludeTags=sample}field:text&
fqs=COLOR:Red&
fqs=SIZE:XL&
fqs={!tag=sample}BRAND:Foo
which is equivalent to:
[literal]
q=+field:text +COLOR:Red +SIZE:XL
The `param` local parameter uses the "$" syntax to refer to several queries, and `excludeTags` can omit some of them by tag.
== Function Query Parser
The `FunctionQParser` extends the `QParserPlugin` and creates a function query from the input value. This is only one way to use function queries in Solr; for another, more integrated, approach, see the section on <<function-queries.adoc#,Function Queries>>.
Example:
[source,text]
----
{!func}log(foo)
----
== Function Range Query Parser
The `FunctionRangeQParser` extends the `QParserPlugin` and creates a range query over a function. This is also referred to as `frange`, as seen in the examples below.
*Parameters*
`l`::
The lower bound. This parameter is optional.
`u`::
The upper bound. This parameter is optional.
`incl`::
Include the lower bound. This parameter is optional. The default is `true`.
`incu`::
Include the upper bound. This parameter is optional. The default is `true`.
*Examples*
[source,text]
----
{!frange l=1000 u=50000}myfield
----
[source,text]
----
fq={!frange l=0 u=2.2} sum(user_ranking,editor_ranking)
----
Both of these examples restrict the results by a range of values found in a declared field or a function query. In the second example, we compute a sum and then specify that only documents whose sum is between 0 and 2.2 should be returned to the user.
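The interplay of `l`, `u`, `incl`, and `incu` can be modeled with a short Python predicate. This is a sketch of the range semantics described in the parameter list, applied to hypothetical documents; it is not how Solr evaluates the function internally.

```python
def frange(value, l=None, u=None, incl=True, incu=True):
    """Return True when value falls inside the requested range."""
    if l is not None and (value < l or (value == l and not incl)):
        return False
    if u is not None and (value > u or (value == u and not incu)):
        return False
    return True

# Hypothetical documents with the two ranking fields from the example.
docs = [
    {"id": "a", "user_ranking": 1.0, "editor_ranking": 1.0},
    {"id": "b", "user_ranking": 2.0, "editor_ranking": 0.5},
    {"id": "c", "user_ranking": 0.0, "editor_ranking": 0.0},
]
matches = [d["id"] for d in docs
           if frange(d["user_ranking"] + d["editor_ranking"], l=0, u=2.2)]
print(matches)  # ['a', 'c']
```

Passing `incl=False` would additionally drop document `c`, whose sum sits exactly on the lower bound.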
For more information about range queries over functions, see Yonik Seeley's introductory blog post https://lucidworks.com/2009/07/06/ranges-over-functions-in-solr-14/[Ranges over Functions in Solr 1.4].
== Graph Query Parser
The `graph` query parser does a breadth first, cyclic aware, graph traversal of all documents that are "reachable" from a starting set of root documents identified by a wrapped query.
The graph is built according to linkages between documents based on the terms found in `from` and `to` fields that you specify as part of the query.
Supported field types are point fields with docValues enabled, or string fields with `indexed=true` or `docValues=true`.
TIP: For string fields which are `indexed=false` and `docValues=true`, refer to the javadocs for {lucene-javadocs}sandbox/org/apache/lucene/search/DocValuesTermsQuery.html[`DocValuesTermsQuery`]
for its performance characteristics; `indexed=true` will perform better for most use cases.
=== Graph Query Parameters
`to`::
The field name of matching documents to inspect to identify outgoing edges for graph traversal. Defaults to `edge_ids`.
`from`::
The field name of candidate documents to inspect to identify incoming graph edges. Defaults to `node_id`.
`traversalFilter`::
An optional query that can be supplied to limit the scope of documents that are traversed.
`maxDepth`::
Integer specifying how deep the breadth first search of the graph should go beginning with the initial query. Defaults to `-1` (unlimited).
`returnRoot`::
Boolean to indicate if the documents that matched the original query (to define the starting points for graph) should be included in the final results. Defaults to `true`.
`returnOnlyLeaf`::
Boolean that indicates if the results of the query should be filtered so that only documents with no outgoing edges are returned. Defaults to `false`.
`useAutn`:: Boolean that indicates if Automatons should be compiled for each iteration of the breadth first search, which may be faster for some graphs. Defaults to `false`.
=== Graph Query Limitations
The `graph` parser only works in single node Solr installations, or with <<solrcloud.adoc#,SolrCloud>> collections that use exactly 1 shard.
=== Graph Query Examples
To understand how the graph parser works, consider the following Directed Cyclic Graph, containing 8 nodes (A to H) and 9 edges (1 to 9):
image::images/other-parsers/graph_qparser_example.png[image,height=100]
One way to model this graph as Solr documents would be to create one document per node, with multivalued fields identifying the incoming and outgoing edges for each node:
[source,bash]
----
curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_graph/update?commit=true' --data-binary '[
{"id":"A","foo": 7, "out_edge":["1","9"], "in_edge":["4","2"] },
{"id":"B","foo": 12, "out_edge":["3","6"], "in_edge":["1"] },
{"id":"C","foo": 10, "out_edge":["5","2"], "in_edge":["9"] },
{"id":"D","foo": 20, "out_edge":["4","7"], "in_edge":["3","5"] },
{"id":"E","foo": 17, "out_edge":[], "in_edge":["6"] },
{"id":"F","foo": 11, "out_edge":[], "in_edge":["7"] },
{"id":"G","foo": 7, "out_edge":["8"], "in_edge":[] },
{"id":"H","foo": 10, "out_edge":[], "in_edge":["8"] }
]'
----
With the model shown above, the following query demonstrates a simple traversal of all nodes reachable from node A:
[source,text]
----
http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge}id:A
----
[source,json]
----
"response":{"numFound":6,"start":0,"docs":[
{ "id":"A" },
{ "id":"B" },
{ "id":"C" },
{ "id":"D" },
{ "id":"E" },
{ "id":"F" } ]
}
----
We can also use the `traversalFilter` to limit the graph traversal to only nodes with a maximum value of 15 in the `foo` field. In this case that means D, E, and F are excluded (F has a value of `foo=11`, but it is unreachable because the traversal skipped D):
[source,text]
----
http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+traversalFilter='foo:[*+TO+15]'}id:A
----
[source,json]
----
...
"response":{"numFound":3,"start":0,"docs":[
{ "id":"A" },
{ "id":"B" },
{ "id":"C" } ]
}
----
The examples shown so far have all used a query for a single document (`"id:A"`) as the root node for the graph traversal, but any query can be used to identify multiple documents to use as root nodes. The next example demonstrates using the `maxDepth` parameter to find all nodes that are at most one edge away from a root node with a value in the `foo` field less than or equal to 10:
[source,text]
----
http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+maxDepth=1}foo:[*+TO+10]
----
[source,json]
----
...
"response":{"numFound":6,"start":0,"docs":[
{ "id":"A" },
{ "id":"B" },
{ "id":"C" },
{ "id":"D" },
{ "id":"G" },
{ "id":"H" } ]
}
----
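The three traversals above can be reproduced with a small in-memory model. The sketch below is not the parser's implementation; it is a plain breadth-first search over the eight example documents, with the hypothetical `graph` function's defaults set to the example's `in_edge` and `out_edge` fields rather than Solr's own defaults.

```python
# The eight node documents from the example above.
DOCS = {
    "A": {"foo": 7,  "out_edge": ["1", "9"], "in_edge": ["4", "2"]},
    "B": {"foo": 12, "out_edge": ["3", "6"], "in_edge": ["1"]},
    "C": {"foo": 10, "out_edge": ["5", "2"], "in_edge": ["9"]},
    "D": {"foo": 20, "out_edge": ["4", "7"], "in_edge": ["3", "5"]},
    "E": {"foo": 17, "out_edge": [],         "in_edge": ["6"]},
    "F": {"foo": 11, "out_edge": [],         "in_edge": ["7"]},
    "G": {"foo": 7,  "out_edge": ["8"],      "in_edge": []},
    "H": {"foo": 10, "out_edge": [],         "in_edge": ["8"]},
}

def graph(roots, from_field="in_edge", to_field="out_edge",
          traversal_filter=None, max_depth=-1):
    """Breadth-first traversal that follows to_field values into from_field."""
    visited = set(roots)
    frontier = set(roots)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # Collect the outgoing edge values of the current frontier.
        edges = {e for node in frontier for e in DOCS[node][to_field]}
        frontier = {
            doc_id for doc_id, doc in DOCS.items()
            if doc_id not in visited  # skipping visited nodes handles cycles
            and edges.intersection(doc[from_field])
            and (traversal_filter is None or traversal_filter(doc))
        }
        visited |= frontier
        depth += 1
    return sorted(visited)

at_most_15 = lambda doc: doc["foo"] <= 15
low_foo_roots = [doc_id for doc_id, doc in DOCS.items() if doc["foo"] <= 10]

print(graph(["A"]))                               # ['A', 'B', 'C', 'D', 'E', 'F']
print(graph(["A"], traversal_filter=at_most_15))  # ['A', 'B', 'C']
print(graph(low_foo_roots, max_depth=1))          # ['A', 'B', 'C', 'D', 'G', 'H']
```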
=== Simplified Models
The Document & Field modeling used in the above examples enumerated all of the outgoing and incoming edges for each node explicitly, to help demonstrate exactly how the "from" and "to" parameters work, and to give you an idea of what is possible. With multiple sets of fields like these for identifying incoming and outgoing edges, it's possible to model many independent Directed Graphs that contain some or all of the documents in your collection.
But in many cases it can also be possible to drastically simplify the model used.
For example, the same graph shown in the diagram above can be modeled by Solr Documents that represent each node and know only the ids of the nodes they link to, without knowing anything about the incoming links:
[source,bash]
----
curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/alt_graph/update?commit=true' --data-binary '[
{"id":"A","foo": 7, "out_edge":["B","C"] },
{"id":"B","foo": 12, "out_edge":["E","D"] },
{"id":"C","foo": 10, "out_edge":["A","D"] },
{"id":"D","foo": 20, "out_edge":["A","F"] },
{"id":"E","foo": 17, "out_edge":[] },
{"id":"F","foo": 11, "out_edge":[] },
{"id":"G","foo": 7, "out_edge":["H"] },
{"id":"H","foo": 10, "out_edge":[] }
]'
----
With this alternative document model, all of the same queries demonstrated above can still be executed, simply by changing the `from` parameter to replace the `in_edge` field with the `id` field:
[source,text]
----
http://localhost:8983/solr/alt_graph/query?fl=id&q={!graph+from=id+to=out_edge+maxDepth=1}foo:[*+TO+10]
----
[source,json]
----
...
"response":{"numFound":6,"start":0,"docs":[
{ "id":"A" },
{ "id":"B" },
{ "id":"C" },
{ "id":"D" },
{ "id":"G" },
{ "id":"H" } ]
}
----
== Hash Range Query Parser
The hash range query parser returns documents that have a field containing a value that hashes to a particular range. This is used by the join query when using `method=crossCollection`. The hash range query parser has a per-segment cache for each field that it operates on.
When specifying a min/max hash range and a field name with the hash range query parser, only documents that contain a field value that hashes into that range will be returned. If you want to query for a very large result set, you can query for various hash ranges to return a fraction of the documents with each range request. In the cross collection join case, the `hash_range` query parser is used to ensure that each shard only gets the set of join keys that would end up on that shard.
This query parser uses the MurmurHash3_x86_32 hash function, which is the same hashing used by the default composite ID router in Solr.
=== Hash Range Parameters
`f`::
The field name to operate on. This field should have docValues enabled and should be single-valued.
`l`::
The lower bound of the hash range for the query.
`u`::
The upper bound for the hash range for the query.
=== Hash Range Example
[source,text]
----
{!hash_range f="field_name" l="0" u="12345"}
----
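To fetch a large result set in slices, the hash space can be divided into contiguous ranges and queried one range at a time. The Python sketch below generates such queries under two assumptions that are not stated in this section: that the hash is a signed 32-bit integer and that both bounds are inclusive. The field name `join_key` is hypothetical.

```python
def split_hash_space(parts):
    """Split the signed 32-bit hash space into contiguous, non-overlapping ranges."""
    low, high = -2**31, 2**31 - 1
    step = (high - low + 1) // parts
    ranges = []
    start = low
    for i in range(parts):
        # The last range absorbs any remainder so the whole space is covered.
        end = high if i == parts - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

ranges = split_hash_space(4)
queries = ['{!hash_range f="join_key" l="%d" u="%d"}' % r for r in ranges]
for q in queries:
    print(q)
```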
=== Hash Range Cache Config
The hash range query parser uses a special cache to speed up queries. The following should be added to the `solrconfig.xml` for the various fields that you want to perform the hash range query on. Note the name of the cache should be the field name prefixed by "hash_".
[source,xml]
----
<cache name="hash_field_name"
class="solr.LRUCache"
size="128"
initialSize="0"
regenerator="solr.NoOpRegenerator"/>
----
== Join Query Parser
The Join query parser allows users to run queries that normalize relationships between documents.
Solr runs a subquery of the user's choosing (the `v` parameter), identifies all the values that matching documents have in a field of interest (the `from` parameter), and then returns documents where those values are contained in a second field of interest (the `to` parameter).
In practice, these semantics are much like "inner queries" in a SQL engine.
As an example, consider the Solr query below:
[source,text]
----
/solr/techproducts/select?q={!join from=manu_id_s to=id}title:ipod
----
This query, which returns a document for each manufacturer that makes a product with "ipod" in the title, is semantically identical to the SQL query below:
[source,text]
----
SELECT *
FROM techproducts
WHERE id IN (
SELECT manu_id_s
FROM techproducts
WHERE title='ipod'
)
----
The join operation is done on a term basis, so the `from` and `to` fields must use compatible field types.
For example, joining between a `StrField` and an `IntPointField` will not work.
Likewise, joining between a `StrField` and a `TextField` that uses `LowerCaseFilterFactory` will only work for values that are already lowercased in the string field.
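The three steps of a join (run the subquery, gather the `from` values, match them against the `to` field) can be sketched in Python. The documents below are hypothetical and only loosely modeled on the techproducts example, and the exact-value comparison stands in for the term-based matching described above.

```python
def as_list(value):
    """Normalize a missing, single, or multi-valued field to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def join(docs, from_field, to_field, subquery):
    """Return docs whose to_field holds a value found in from_field of subquery matches."""
    from_values = set()
    for doc in docs:
        if subquery(doc):
            from_values.update(as_list(doc.get(from_field)))
    return [doc for doc in docs
            if from_values.intersection(as_list(doc.get(to_field)))]

# Hypothetical documents: two manufacturers and three products.
docs = [
    {"id": "apple", "name": "Apple Inc."},
    {"id": "belkin", "name": "Belkin"},
    {"id": "p1", "title": "ipod video", "manu_id_s": "apple"},
    {"id": "p2", "title": "ipod cable", "manu_id_s": "belkin"},
    {"id": "p3", "title": "hard drive", "manu_id_s": "maxtor"},
]
result = join(docs, "manu_id_s", "id", lambda d: "ipod" in d.get("title", ""))
joined_ids = [d["id"] for d in result]
print(joined_ids)  # ['apple', 'belkin']
```

Because values are compared exactly, a `from` value of `Apple` would not match a `to` value of `apple`, which mirrors the field type compatibility requirement.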
=== Parameters
This query parser takes the following parameters:
`from`::
The name of a field which contains values to look for in the "to" field.
Can be single or multi-valued, but must have a field type compatible with the field represented in the "to" field.
This parameter is required.
`to`::
The name of a field whose value(s) will be checked against those found in the "from" field.
Can be single or multi-valued, but must have a field type compatible with the "from" field.
This parameter is required.
`fromIndex`::
The name of the index to run the "from" query (`v` parameter) on and where "from" values are gathered.
Must be located on the same node as the core processing the request.
This parameter is optional; it defaults to the value of the processing core if not specified.
See <<Joining Across Single Shard Collections,Joining Across Single Shard Collections>> or <<Cross Collection Join,Cross Collection Join>> below for more information.
`score`::
An optional parameter that instructs Solr to return information about the "from" query scores.
The value of this parameter controls what type of aggregation information is returned.
Options include `avg` (average), `max` (maximum), `min` (minimum), `total` (sum), or `none`.
+
If `method` is not specified but `score` is, then the `dvWithScore` method is used.
If `method` is specified and is not `dvWithScore`, then the `score` value is ignored.
See the `method` parameter documentation below for more details.
`method`::
An optional parameter used to determine which of several query implementations should be used by Solr.
Options are restricted to: `index`, `dvWithScore`, and `topLevelDV`.
If unspecified the default value is `index`, unless the `score` parameter is present which overrides it to `dvWithScore`.
Each implementation has its own performance characteristics, and users are encouraged to experiment to determine which implementation is most performant for their use-case.
Details and performance heuristics are given below.
+
`index` is the default `method` unless the `score` parameter is specified.
Uses the terms index structures to process the request.
Performance scales with the cardinality and number of postings (term occurrences) in the "from" field.
Consider this method when the "from" field has low cardinality, when the "to" side returns a large number of documents, or when sporadic post-commit slowdowns cannot be tolerated (this is a disadvantage of other methods that `index` avoids).
+
`dvWithScore` returns an optional "score" statistic alongside result documents.
Uses docValues structures if available, but falls back to the field cache when necessary.
The first access to the field cache slows down the initial requests following a commit and takes up additional space on the JVM heap, so docValues are recommended in most situations.
Performance scales linearly with the number of values matched in the "from" field.
This method must be used if score information is required, and should also be considered when the "from" query matches few documents, regardless of the number of "to" side documents returned.
+
.dvWithScore and single value numerics
[WARNING]
====
The `dvWithScore` method doesn't support single value numeric fields. Users migrating from versions prior to 7.0 are encouraged to change field types to string and rebuild indexes during migration.
====
+
`topLevelDV` can only be used when `to` and `from` fields have docValues data, and does not currently support numeric fields.
Uses top-level docValues data structures to find results.
These data structures outperform other methods as the number of values matched in the `from` field grows high.
But they are also expensive to build and need to be lazily populated after each commit, causing a sometimes-noticeable slowdown on the first query to use them after each commit.
If you commit frequently and your use-case can tolerate a static warming query, consider adding one to `solrconfig.xml` so that this work is done as a part of the commit itself and not attached directly to user requests.
Consider this method when the "from" query matches a large number of documents and the "to" result set is small to moderate in size, but only if sporadic post-commit slowness is tolerable.
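The `method` options above can be exercised from client code with a small helper that assembles the join local params. This is only a sketch: the function name `build_join_filter` and the field names in the usage line are hypothetical, and the helper merely formats a string rather than calling any Solr client library.

[source,python]
----
VALID_METHODS = {"index", "dvWithScore", "topLevelDV"}


def build_join_filter(from_field, to_field, query, method=None):
    """Return a join filter string such as '{!join from=a to=b}q'.

    Hypothetical helper: it only formats the local params described above.
    """
    if method is not None and method not in VALID_METHODS:
        raise ValueError(f"unsupported join method: {method!r}")
    params = [f"from={from_field}", f"to={to_field}"]
    if method is not None:
        # Omitting method leaves the choice of default to Solr.
        params.append(f"method={method}")
    return "{!join " + " ".join(params) + "}" + query


# Field names here are made up for illustration.
print(build_join_filter("manu_id_s", "id", "ipod", method="topLevelDV"))
----

Leaving `method` out of the local params keeps Solr's default behavior, which makes it easy to compare the implementations by changing a single argument.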
=== Joining Across Single Shard Collections
You can also specify a `fromIndex` parameter to join with a field from another core or a single shard collection. If running in SolrCloud mode, then the collection specified in the `fromIndex` parameter must have a single shard and a replica on all Solr nodes where the collection you're joining to has a replica.
Let's consider an example where you want to use a Solr join query to filter movies by directors that have won an Oscar. Specifically, imagine we have two collections with the following fields:
*movies*: id, title, director_id, ...
*movie_directors*: id, name, has_oscar, ...
To filter movies by directors that have won an Oscar using a Solr join on the *movie_directors* collection, you can send the following filter query to the *movies* collection:
[source,text]
----
fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
----
Notice that the query criteria of the filter (`has_oscar:true`) is based on a field in the collection specified using `fromIndex`. Keep in mind that you cannot return fields from the `fromIndex` collection using join queries; you can only use those fields to filter results in the "to" collection (movies).
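When the filter is sent over HTTP, the braces and other special characters in the local params need URL encoding. A minimal sketch using only the Python standard library follows; the host, port, main query `*:*`, and the `fl` list are assumptions made for illustration.

[source,python]
----
from urllib.parse import urlencode

params = {
    "q": "*:*",  # assumed main query; any query against movies would do
    "fq": "{!join from=id fromIndex=movie_directors to=director_id}has_oscar:true",
    "fl": "id,title",  # only fields of the "to" collection can be returned
}
query_string = urlencode(params)

# Assumed base URL of a node hosting the movies collection.
url = "http://localhost:8983/solr/movies/select?" + query_string
print(url)
----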
Next, let's understand how these collections need to be deployed in your cluster. Imagine the *movies* collection is deployed to a four node SolrCloud cluster and has two shards with a replication factor of two. Specifically, the *movies* collection has replicas on the following four nodes:
node 1: movies_shard1_replica1
node 2: movies_shard1_replica2
node 3: movies_shard2_replica1
node 4: movies_shard2_replica2
To use the *movie_directors* collection in Solr join queries with the *movies* collection, it needs to have a replica on each of the four nodes. In other words, *movie_directors* must have one shard and replication factor of four:
node 1: movie_directors_shard1_replica1
node 2: movie_directors_shard1_replica2
node 3: movie_directors_shard1_replica3
node 4: movie_directors_shard1_replica4
At query time, the `JoinQParser` accesses the local replica of the *movie_directors* collection to perform the join. If a local replica is not available or active, the query fails. Because you are limited to a single shard and the data must be replicated to every node where it is needed, this approach works best with smaller data sets where there is a one-to-many relationship between the "from" collection and the "to" collection. Moreover, if you add a replica to the "to" collection, you also need to add a replica for the "from" collection.
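The placement rule can be checked before running join queries. The sketch below assumes the node names hosting each collection are already known (they are hard-coded here; in a real cluster they would come from the cluster state), and the function name is hypothetical.

[source,python]
----
def missing_from_replicas(to_nodes, from_nodes):
    """Return, sorted, the nodes hosting a "to" replica but no "from" replica."""
    return sorted(set(to_nodes) - set(from_nodes))


movies_nodes = {"node1", "node2", "node3", "node4"}
directors_nodes = {"node1", "node2", "node3"}  # node4 lacks a replica

# Any node listed here would fail join queries routed to it.
print(missing_from_replicas(movies_nodes, directors_nodes))
----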
For more information, Erick Erickson has written a blog post about join performance titled https://lucidworks.com/2012/06/20/solr-and-joins/[Solr and Joins].
=== Cross Collection Join
The Cross Collection Join Filter is a method for the join parser that executes a query against a remote Solr collection to get back a set of join keys, which are then used as a filter query against the local Solr collection.
The crossCollection join query creates a CrossCollectionQuery object.
The CrossCollectionQuery will first query a remote Solr collection and get back a streaming expression result of the join keys.
As the join keys are streamed to the node, a bitset of the matching documents in the local index is built up.
This avoids keeping the full set of join keys in memory at any given time.
This bitset is then inserted into the filter cache upon successful execution as with the normal behavior of the Solr filter cache.
If the local index is sharded according to the join key field, the cross collection join can leverage a secondary query parser called the "hash_range" query parser.
The hash_range query parser is responsible for returning only the documents that hash to a given range of values.
This allows the CrossCollectionQuery to query the remote Solr collection and return only the join keys that would match a specific shard in the local Solr collection.
This has the benefit of making sure that network traffic doesn't increase as the number of shards increases and allows for much greater scalability.
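The idea behind the hash range optimization can be illustrated with a toy partitioning of join keys. This is a conceptual sketch only: `zlib.crc32` is a stand-in hash chosen for convenience and is not the hash function Solr's router uses, and the key names and shard count are made up.

[source,python]
----
import zlib


def key_hash(key):
    """Stand-in 32-bit hash; Solr's router uses its own hash function."""
    return zlib.crc32(key.encode("utf-8"))


def split_ranges(num_shards, space=2**32):
    """Divide [0, space) into num_shards contiguous half-open ranges."""
    step = space // num_shards
    bounds = [i * step for i in range(num_shards)] + [space]
    return [(bounds[i], bounds[i + 1]) for i in range(num_shards)]


def keys_for_range(keys, lo, hi):
    """Keep only the join keys whose hash falls inside [lo, hi)."""
    return [k for k in keys if lo <= key_hash(k) < hi]


remote_keys = [f"product-{i}" for i in range(20)]
per_shard = [keys_for_range(remote_keys, lo, hi) for lo, hi in split_ranges(4)]
for shard, keys in enumerate(per_shard):
    print(shard, keys)
----

Each key lands in exactly one range, so each local shard only has to receive its own slice of the remote join keys instead of the full set.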
The CrossCollection join query works with both String and Point types of fields.
The fields that are being used for the join key must be single-valued and have docValues enabled.
It's advised to shard the local collection by the join key as this allows for the optimization mentioned above to be utilized.
Cross collection join queries should generally not be used as part of the `q` parameter; they are designed to be used as filter queries (`fq` parameter) to ensure proper caching.
The remote Solr collection that is being queried should have a single-valued field for the join key with docValues enabled.
The remote Solr collection does not have any specific sharding requirements.
==== Join Query Parser Definition in solrconfig.xml
The cross collection join has some configuration options that can be specified in `solrconfig.xml`.
`routerField`::
If the documents are routed to shards using the CompositeID router by the join field, then that field name should be specified in the configuration here. This will allow the parser to optimize the resulting HashRange query.
`allowSolrUrls`::
If specified, this array of strings lists the whitelisted Solr URLs that may be passed to the `solrUrl` query parameter. Without this configuration, the `solrUrl` parameter cannot be used. This restriction is necessary to prevent an attacker from using Solr to explore the network.
[source,xml]
----
<queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin">
<str name="routerField">product_id_s</str>
<arr name="allowSolrUrls">
<str>http://othersolr.example.com:8983/solr</str>
</arr>
</queryParser>
----
==== Cross Collection Join Query Parameters
`fromIndex`::
The name of the external Solr collection to be queried to retrieve the set of join key values (required).
`zkHost`::
The connection string to be used to connect to ZooKeeper. `zkHost` and `solrUrl` are both optional parameters, and at most one of them should be specified. If neither `zkHost` nor `solrUrl` is specified, the local ZooKeeper cluster will be used (optional).
`solrUrl`::
The URL of the external Solr node to be queried. Must be a character-for-character exact match of a whitelisted URL (optional, disabled by default for security).
`from`::
The join key field name in the external collection (required).
`to`::
The join key field name in the local collection.
`v`::
The query substituted in as a local param. This is the query string that will match documents in the remote collection.
`routed`::
If `true`, the cross collection join query will use each shard's hash range to determine the set of join keys to retrieve for that shard.
This parameter improves the performance of the cross-collection join, but it depends on the local collection being routed by the `to` field.
If this parameter is not specified, the cross collection join query will try to determine the correct value automatically.
`ttl`::
The length of time that a cross collection join query in the cache will be considered valid, in seconds.
Defaults to `3600` (one hour).
The cross collection join query will not be aware of changes to the remote collection, so if the remote collection is updated, cached cross collection queries may give inaccurate results.
After the `ttl` period has expired, the cross collection join query will re-execute the join against the remote collection.
Other Parameters::
Any normal Solr query parameter can also be specified/passed through as a local param.
==== Cross Collection Query Examples
[source,text]
----
http://localhost:8983/solr/localCollection/query?fl=id&q={!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField" v="*:*"}
----
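Since these joins are intended for use as filter queries, a client would typically send the join in `fq`. The helper below is a hypothetical sketch that formats the local params listed above and URL-encodes the request; it does not escape double quotes inside values, and the main query `*:*` and `ttl` value are assumptions.

[source,python]
----
from urllib.parse import urlencode


def cross_collection_join(from_index, from_field, to_field, query, **extra):
    """Format a crossCollection join; extra local params (e.g. ttl) are appended.

    Values are wrapped in double quotes as-is, so they must not contain quotes.
    """
    parts = [
        'method="crossCollection"',
        f'fromIndex="{from_index}"',
        f'from="{from_field}"',
        f'to="{to_field}"',
    ]
    parts += [f'{name}="{value}"' for name, value in extra.items()]
    parts.append(f'v="{query}"')
    return "{!join " + " ".join(parts) + "}"


fq = cross_collection_join("otherCollection", "fromField", "toField", "*:*", ttl=600)
print(fq)
print(urlencode({"q": "*:*", "fq": fq, "fl": "id"}))
----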
== Lucene Query Parser
The `LuceneQParser` extends the `QParserPlugin` by parsing Solr's variant on the Lucene QueryParser syntax. This is effectively the same query parser that is used in Lucene. It accepts the parameters `q.op`, the default operator ("OR" or "AND"), and `df`, the default field name.
Example:
[source,text]
----
{!lucene q.op=AND df=text}myfield:foo +bar -baz
----
For more information about the syntax for the Lucene Query Parser, see the {lucene-javadocs}/queryparser/org/apache/lucene/queryparser/classic/package-summary.html[Classic QueryParser javadocs].
== Learning To Rank Query Parser
The `LTRQParserPlugin` is a special purpose parser for reranking the top results of a simple query using a more complex ranking query which is based on a machine learnt model.
Example:
[source,text]
----
{!ltr model=myModel reRankDocs=100}
----
Details about using the `LTRQParserPlugin` can be found in the <<learning-to-rank.adoc#,Learning To Rank>> section.
== Max Score Query Parser
The `MaxScoreQParser` extends the `LuceneQParser` but returns the Max score from the clauses. It does this by wrapping all `SHOULD` clauses in a `DisjunctionMaxQuery` with `tie=1.0`. Any `MUST` or `PROHIBITED` clauses are passed through as-is. Non-boolean queries, e.g., NumericRange, fall through to the `LuceneQParser` parser behavior.
Example:
[source,text]
----
{!maxscore tie=0.01}C OR (D AND E)
----
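For reference, a `DisjunctionMaxQuery` scores a document as its best sub-clause score plus `tie` times the sum of the other matching sub-clause scores. The sketch below only reproduces that arithmetic; the clause scores are made up and the function name is hypothetical.

[source,python]
----
def dismax_score(clause_scores, tie):
    """Best clause score plus `tie` times the sum of the other clause scores."""
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


scores = [2.0, 1.0]  # made-up scores for two matching SHOULD clauses
print(dismax_score(scores, 0.0))   # only the best clause counts
print(dismax_score(scores, 0.01))  # the other clause acts as a small tie-breaker
----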
== MinHash Query Parser
The `MinHashQParser` builds queries for fields analysed with the `MinHashFilterFactory`.
The queries measure Jaccard similarity between the query string and MinHash fields, allowing for faster, approximate matching if required.
The parser supports two modes of operation.
In the first, tokens are generated from text by normal analysis; in the second, explicit tokens are provided.
Currently the score returned by the query reflects the number of top level elements that match and is *not* normalised between 0 and 1.
`sim`::
The required minimum similarity.
The default behaviour is to find any similarity greater than zero.
A numeric value between `0.0` and `1.0`.
`tp`::
The required true positive rate.
The default is `1.0`.
For values lower than `1.0`, an optimised and faster banded query may be used.
The banding behaviour depends on the values of `sim` and `tp` requested.
`field`::
The field in which the MinHash value is indexed.
This field is normally used to analyse the text provided to the query parser.
It is also used for the query field.
`sep`::
A separator string.
By default, this is the empty string, "".
If a non-empty separator string is provided, the query string is interpreted as a list of pre-analysed values separated by the separator string.
In this case, no other analysis of the string is performed: the tokens are used as found.
`analyzer_field`::
This parameter can be used to define how text is analysed, distinct from the query field.
It is used to analyse query text when using a pre-analysed string `field` to store MinHash values.
See the example below.
This query parser is registered with the name `min_hash`.
=== Example with Analysed Fields
Typical analysis:
[source,xml]
----
<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
<filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
</fieldType>
...
<field name="min_hash_analysed" type="text_min_hash" multiValued="false" indexed="true" stored="false" />
----
Here, the input text is split on whitespace, the tokens normalised, the resulting token stream assembled into a stream of all the 5 word shingles which are then hashed.
The lowest hashes from each of 512 buckets are kept and produced as the output tokens.
Queries to this field would need to generate at least one shingle so would require 5 distinct tokens.
Example queries:
[source,text]
----
{!min_hash field="min_hash_analysed"}At least five or more tokens
{!min_hash field="min_hash_analysed" sim="0.5"}At least five or more tokens
{!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}At least five or more tokens
----
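To make the quantity being approximated concrete, the sketch below computes the exact Jaccard similarity of two texts over their 5-word shingles. It is a simplification under stated assumptions: whitespace splitting and lowercasing stand in for the ICU tokenizer and folding filter, and no hashing is performed.

[source,python]
----
def shingles(text, size=5):
    """Return the set of `size`-word shingles of whitespace-split, lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc = shingles("the quick brown fox jumps over the lazy dog")
query = shingles("quick brown fox jumps over the lazy cat")
print(len(doc), len(query), jaccard(doc, query))

# Fewer than five tokens produce no shingle at all.
print(shingles("only four tokens here"))
----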
=== Example with Pre-Analysed Fields
Here, the MinHash is pre-computed, most likely using Lucene analysis inline as shown below.
It would be more prudent to get the analyser from the schema.
[source,java]
----
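import java.io.StringReader;
import java.util.HashMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilterFactory;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory;
import org.apache.lucene.analysis.minhash.MinHashFilterFactory;
import org.apache.lucene.analysis.shingle.ShingleFilterFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// `text` is the input String to hash.
// Tokenize, mirroring the text_min_hash field type.
ICUTokenizerFactory factory = new ICUTokenizerFactory(new HashMap<>());
factory.inform(null);
Tokenizer tokenizer = factory.create();
tokenizer.setReader(new StringReader(text));
// Normalise the tokens.
ICUFoldingFilterFactory filter = new ICUFoldingFilterFactory(new HashMap<>());
TokenStream ts = filter.create(tokenizer);
// Emit 5-word shingles only.
HashMap<String, String> args = new HashMap<>();
args.put("minShingleSize", "5");
args.put("outputUnigrams", "false");
args.put("outputUnigramsIfNoShingles", "false");
args.put("maxShingleSize", "5");
args.put("tokenSeparator", " ");
ShingleFilterFactory sff = new ShingleFilterFactory(args);
ts = sff.create(ts);
// Hash the shingles, keeping the lowest hash from each of 512 buckets.
HashMap<String, String> args2 = new HashMap<>();
args2.put("bucketCount", "512");
args2.put("hashSetSize", "1");
args2.put("hashCount", "1");
MinHashFilterFactory mhff = new MinHashFilterFactory(args2);
ts = mhff.create(ts);
CharTermAttribute termAttribute = ts.getAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    // toString() copies only the valid characters; buffer() may hold stale ones.
    String hash = termAttribute.toString();
    ...
}
ts.end();
ts.close();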
----
The schema will just define a multi-valued string value and an optional field to use at analysis time, similar to above.
[source,xml]
----
<field name="min_hash_string" type="strings" multiValued="true" indexed="true" stored="true"/>
<!-- Optional -->
<field name="min_hash_analysed" type="text_min_hash" multiValued="false" indexed="true" stored="false"/>
<fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true"/>
<!-- Optional -->
<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
<filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
</fieldType>
----
Example queries:
[source,plain]
----
{!min_hash field="min_hash_string" sep=","}HASH1,HASH2,HASH3
{!min_hash field="min_hash_string" sim="0.9" analyzer_field="min_hash_analysed"}Lets hope the config and code for analysis are in sync
----
It is also possible to query analysed fields using known hashes (the reverse of the above).
[source,plain]
----
{!min_hash field="min_hash_analysed" analyzer_field="min_hash_string" sep=","}HASH1,HASH2,HASH3
----
Pre-analysed fields mean hash values can be recovered per document rather than re-hashed.
An initial query stage that returns the minhash stored field could be followed by a `min_hash` query to find similar documents.
=== Banded Queries
The default behaviour of the query parser, given the configuration above, is to generate a boolean query that ORs together 512 constant score term queries, one for each hash.
In this case, the score is 1 if one hash matches and 512 if they all match.
A banded query mixes conjunctions and disjunctions.
We could have 256 bands each of two queries ANDed together, 128 with 4 hashes ANDed together etc.
With fewer bands, query performance increases, but we may miss some matches.
There is a trade-off between speed and accuracy.
With 64 bands the score will range from 0 to 64 (the number of bands ORed together).
Given the required similarity and an acceptable true positive rate, the query parser computes the appropriate band size^[1]^.
It finds the minimum number of bands subject to
latexmath:[tp \leq 1 - (1 - sim^{rows})^{bands}]
If there are not enough hashes to fill the final band of the query it wraps to the start.
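The search for a band count can be sketched as a simple loop over the inequality above. This is an illustration, not the parser's code: it assumes `rows` is the integer division of the number of hashes by the number of bands, and the function name is hypothetical. The real implementation may round or compare differently.

[source,python]
----
def band_layout(num_hashes, sim, tp):
    """Return (bands, rows) for the smallest band count meeting the bound.

    Assumes rows = num_hashes // bands; falls back to one hash per band.
    """
    for bands in range(1, num_hashes + 1):
        rows = num_hashes // bands
        # The bound from the text: tp <= 1 - (1 - sim^rows)^bands
        if tp <= 1 - (1 - sim ** rows) ** bands:
            return bands, rows
    return num_hashes, 1


print(band_layout(512, 0.5, 0.5))
print(band_layout(512, 0.9, 0.9))
----

Lowering `tp` or raising `sim` lets the bound be met with fewer, wider bands, which is where the speed-up of a banded query comes from.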
=== A Note on Similarity
Low similarities can be meaningful.
The number of 5 word hashes is large.
Even a single match may indicate some kind of similarity either in meaning, style or structure.
=== Further Reading
For a general introduction see "Mining of Massive Datasets"^[1]^.
For documents of ~1500 words expect an index size overhead of ~10%; your mileage will vary.
512 hashes would be expected to represent ~2500 words well.
Using a set of MinHash values was proposed in the initial paper^[2]^ but provides a biased estimate of Jaccard similarity.
There may be cases where that bias is a good thing.
Likewise with rotation and short documents.
The implementation is derived from an unbiased method proposed in later work^[3]^.
^[1]^ Leskovec, Jure; Rajaraman, Anand & Ullman, Jeffrey D. "Mining of Massive Datasets", Cambridge University Press; 2 edition (December 29, 2014), Chapter 3, ISBN: 9781107077232.
^[2]^ Broder, Andrei Z. (1997), "On the resemblance and containment of documents", Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997 (PDF), IEEE, pp. 21–29, doi:10.1109/SEQUEN.1997.666900.
^[3]^ Shrivastava, Anshumali & Li, Ping (2014), "Improved Densification of One Permutation Hashing", 30th Conference on Uncertainty in Artificial Intelligence (UAI), Quebec City, Quebec, Canada, July 23-27, 2014, AUAI, pp. 225-234, http://www.auai.org/uai2014/proceedings/individuals/225.pdf
== More Like This Query Parser
The `MLTQParser` enables retrieving documents that are similar to a given document.
It uses Lucene's existing `MoreLikeThis` logic and also works in SolrCloud mode.
Information about how to use this query parser is with the documentation about MoreLikeThis, in the section <<morelikethis.adoc#morelikethis-query-parser,MoreLikeThis Query Parser>>.
== Nested Query Parser
The `NestedParser` extends the `QParserPlugin` and creates a nested query, with the ability for that query to redefine its type via local parameters. This is useful in specifying defaults in configuration and letting clients indirectly reference them.
Example:
[source,text]
----
{!query defType=func v=$q1}
----
If the `q1` parameter is `price`, then the query would be a function query on the price field. If the `q1` parameter is `{!lucene}inStock:true`, then a term query is created from the Lucene syntax string that matches documents with `inStock=true`. These parameters would be defined in `solrconfig.xml`, in the `defaults` section:
[source,xml]
----
<lst name="defaults">
<str name="q1">{!lucene}inStock:true</str>
</lst>
----
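A client can also supply the referenced parameter on the request itself. The sketch below builds such a query string with the Python standard library; passing `q1=price` on the request is an illustrative choice, and no request is actually sent.

[source,python]
----
from urllib.parse import urlencode

# q refers to q1 indirectly; q1 could equally come from the handler's
# defaults in solrconfig.xml instead of the request.
params = {
    "q": "{!query defType=func v=$q1}",
    "q1": "price",
}
nested_query_string = urlencode(params)
print(nested_query_string)
----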
For more information about the possibilities of nested queries, see Yonik Seeley's blog post https://lucidworks.com/2009/03/31/nested-queries-in-solr/[Nested Queries in Solr].
== Payload Query Parsers
These query parsers utilize payloads encoded on terms during indexing.
The main query, for both of these parsers, is parsed straightforwardly from the field type's query analysis into a `SpanQuery`. The generated `SpanQuery` will be either a `SpanTermQuery` or an ordered, zero slop `SpanNearQuery`, depending on how many tokens are emitted. Payloads can be encoded on terms using either the `DelimitedPayloadTokenFilter` or the `NumericPayloadTokenFilter`. The payload-based parsers are:
* `PayloadScoreQParser`
* `PayloadCheckQParser`
=== Payload Score Parser
`PayloadScoreQParser` incorporates each matching term's numeric (integer or float) payloads into the scores.
This parser accepts the following parameters:
`f`::
The field to use. This parameter is required.
`func`::
The payload function. The options are: `min`, `max`, `average`, or `sum`. This parameter is required.
`operator`::
A search operator. The options are `or` and `phrase`, which is the default. This defines if the search query should be an OR query or a phrase query.
`includeSpanScore`::
If `true`, multiplies the computed payload factor by the score of the original query. If `false`, the default, the computed payload factor is the score.
*Examples*
[source,text]
----
{!payload_score f=my_field_dpf v=some_term func=max}
----
[source,text]
----
{!payload_score f=payload_field func=sum operator=or}A B C
----
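The effect of `func` and `includeSpanScore` can be sketched with plain arithmetic. This is a simplified model, not Lucene's scoring code: the payload values and span score are made up, and the function name is hypothetical.

[source,python]
----
FUNCS = {
    "min": min,
    "max": max,
    "sum": sum,
    "average": lambda values: sum(values) / len(values),
}


def payload_score(payloads, func, span_score=1.0, include_span_score=False):
    """Combine matching-term payloads; optionally scale by the span query score."""
    factor = FUNCS[func](payloads)
    return factor * span_score if include_span_score else factor


payloads = [0.5, 2.0, 1.5]  # made-up payloads for three matching terms
for name in FUNCS:
    print(name, payload_score(payloads, name))
----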
=== Payload Check Parser
`PayloadCheckQParser` only matches when the matching terms also have the specified payloads.
This parser accepts the following parameters:
`f`::
The field to use (required).
`payloads`::
A space-separated list of payloads that must match the query terms (required).
+
Each specified payload is encoded for matching using the encoder determined from the field type.
+
`DelimitedPayloadTokenFilter` 'identity' encoded payloads also work here, as well as float and integer encoded ones.
*Example*
[source,text]
----
{!payload_check f=words_dps payloads="VERB NOUN"}searching stuff
----
== Prefix Query Parser
`PrefixQParser` extends the `QParserPlugin` by creating a prefix query from the input value. Currently no analysis or value transformation is done to create this prefix query.
The parameter is `f`, the field. The string after the prefix declaration is treated as a wildcard query.
Example:
[source,text]
----
{!prefix f=myfield}foo
----
This would be generally equivalent to the Lucene query parser expression `myfield:foo*`.
== Raw Query Parser
`RawQParser` extends the `QParserPlugin` by creating a term query from the input value without any text analysis or transformation. This is useful in debugging, or when raw terms are returned from the terms component (this is not the default).
The only parameter is `f`, which defines the field to search.
Example:
[source,text]
----
{!raw f=myfield}Foo Bar
----
This example constructs the query: `TermQuery(Term("myfield","Foo Bar"))`.
For easy filter construction to drill down in faceting, the <<Term Query Parser,TermQParserPlugin>> is recommended.
For full analysis on all fields, including text fields, you may want to use the <<Field Query Parser,FieldQParserPlugin>>.
== Ranking Query Parser
The `RankQParserPlugin` is a faster implementation of the ranking-related features of `FunctionQParser` and works together with the specialized {solr-javadocs}/solr-core/org/apache/solr/schema/RankField.html[`RankField`] field type.
It allows queries like:
[source,text]
----
http://localhost:8983/solr/techproducts?q=memory _query_:{!rank f='pagerank' function='log' scalingFactor='1.2'}
----
== Re-Ranking Query Parser
The `ReRankQParserPlugin` is a special purpose parser for Re-Ranking the top results of a simple query using a more complex ranking query.
Details about using the `ReRankQParserPlugin` can be found in the <<query-re-ranking.adoc#,Query Re-Ranking>> section.
== Simple Query Parser
The Simple query parser in Solr is based on Lucene's SimpleQueryParser. This query parser is designed to allow users to enter queries however they want, and it will do its best to interpret the query and return results.
This parser takes the following parameters:
q.operators::
Comma-separated list of names of parsing operators to enable. By default, all operations are enabled, and this parameter can be used to effectively disable specific operators as needed, by excluding them from the list. Passing an empty string with this parameter disables all operators.
+
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
[cols="15,20,50,15",options="header"]
|===
|Name |Operator |Description |Example query
|`AND` |`+` |Specifies AND |`token1+token2`
|`OR` |`\|` |Specifies OR |`token1\|token2`
|`NOT` |`-` |Specifies NOT |`-token3`
|`PREFIX` |`*` |Specifies a prefix query |`term*`
|`PHRASE` |`"` |Creates a phrase |`"term1 term2"`
|`PRECEDENCE` |`( )` |Specifies precedence; tokens inside the parentheses will be analyzed first. Otherwise, normal order is left to right. |`token1 + (token2 \| token3)`
|`ESCAPE` |`\` |Put it in front of operators to match them literally |`C\+\+`
|`WHITESPACE` |space or `[\r\t\n]` a|Delimits tokens on whitespace. If not enabled, whitespace splitting will not be performed prior to analysis – usually most desirable.
Not splitting on whitespace is a unique feature of this parser that enables multi-word synonyms to work. However, they will probably not work unless synonyms are configured to normalize instead of expand to all that match a given synonym. Such a configuration requires normalizing synonyms at both index time and query time. Solr's analysis screen can help here. |`term1 term2`
|`FUZZY` a|
`~`
`~_N_`
a|
At the end of terms, specifies a fuzzy query.
"N" is optional and may be either "1" or "2" (the default)
|`term~1`
|`NEAR` |`~_N_` |At the end of phrases, specifies a NEAR query |`"term1 term2"~5`
|===
q.op::
Defines the default operator to use if none is defined by the user. Allowed values are `AND` and `OR`. `OR` is used if none is specified.
qf::
A list of query fields and boosts to use when building the query.
df::
Defines the default field if none is defined in the Schema, or overrides the default field if it is already defined.
Any errors in syntax are ignored and the query parser will interpret queries as best it can. However, this can lead to odd results in some cases.
== Spatial Query Parsers
There are two spatial QParsers in Solr: `geofilt` and `bbox`. There are also other ways to query spatially: use the `frange` parser with a distance function; use the standard (lucene) query parser with the range syntax to pick the corners of a rectangle; or, with RPT and BBoxField, use the standard query parser with a special syntax within quotes that allows you to pick the spatial predicate.
All these options are documented further in the section <<spatial-search.adoc#,Spatial Search>>.
== Surround Query Parser
The `SurroundQParser` enables the Surround query syntax, which provides proximity search functionality. There are two positional operators: `w` creates an ordered span query and `n` creates an unordered one. Both operators take a numeric value to indicate distance between two terms. The default is 1, and the maximum is 99.
Note that the query string is not analyzed in any way.
Example:
[source,text]
----
{!surround} 3w(foo, bar)
----
This example finds documents where the terms "foo" and "bar" are no more than 3 terms away from each other (i.e., no more than 2 terms between them).
This query parser will also accept boolean operators (`AND`, `OR`, and `NOT`, in either upper- or lowercase), wildcards, quoting for phrase searches, and boosting. The `w` and `n` operators can also be expressed in upper- or lowercase.
The non-unary operators (everything but `NOT`) support both infix `(a AND b AND c)` and prefix `AND(a, b, c)` notation.
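The difference between the ordered `w` and unordered `n` operators can be modelled on a list of token positions. This is a toy model under stated assumptions, not the Surround parser itself: the function name is hypothetical and the document is a plain whitespace-split string.

[source,python]
----
def within(tokens, first, second, distance, ordered):
    """True if `first` and `second` occur at most `distance` positions apart.

    With ordered=True (like `w`), `first` must come before `second`;
    with ordered=False (like `n`), either order is accepted.
    """
    firsts = [i for i, t in enumerate(tokens) if t == first]
    seconds = [i for i, t in enumerate(tokens) if t == second]
    for i in firsts:
        for j in seconds:
            gap = j - i
            if ordered and 1 <= gap <= distance:
                return True
            if not ordered and 1 <= abs(gap) <= distance:
                return True
    return False


doc = "foo one two bar".split()
print(within(doc, "foo", "bar", 3, ordered=True))   # like 3w(foo, bar)
print(within(doc, "bar", "foo", 3, ordered=True))   # like 3w(bar, foo)
print(within(doc, "bar", "foo", 3, ordered=False))  # like 3n(bar, foo)
----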
== Switch Query Parser
`SwitchQParser` is a `QParserPlugin` that acts like a "switch" or "case" statement.
The primary input string is trimmed and then prefixed with `case.` for use as a key to lookup a "switch case" in the parser's local params. If a matching local param is found the resulting parameter value will then be parsed as a subquery, and returned as the parse result.
The `case` local param can optionally be specified as a switch case to match missing (or blank) input strings. The `default` local param can optionally be specified as a default case to use if the input string does not match any other switch case local params. If `default` is not specified, then any input which does not match a switch case local param will result in a syntax error.
In the examples below, the result of each query is "XXX":
[source,text]
----
{!switch case.foo=XXX case.bar=zzz case.yak=qqq}foo
----
.The extra whitespace between `}` and `bar` is trimmed automatically.
[source,text]
----
{!switch case.foo=qqq case.bar=XXX case.yak=zzz} bar
----
.The result will fallback to the default.
[source,text]
----
{!switch case.foo=qqq case.bar=zzz default=XXX}asdf
----
.No input uses the value for `case` instead.
[source,text]
----
{!switch case=XXX case.bar=zzz case.yak=qqq}
----
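The lookup rules can be summarised in a few lines of code. The sketch below models the local params as a dictionary and mirrors the four examples above; the function name is hypothetical, and a `ValueError` stands in for Solr's syntax error.

[source,python]
----
def switch(local_params, raw_input):
    """Pick the sub-query for `raw_input` following the rules described above."""
    value = raw_input.strip()
    # Blank input looks up the bare `case` param; anything else uses `case.<input>`.
    key = "case." + value if value else "case"
    if key in local_params:
        return local_params[key]
    if "default" in local_params:
        return local_params["default"]
    raise ValueError(f"no switch case matches {value!r} and no default is set")


print(switch({"case.foo": "XXX", "case.bar": "zzz", "case.yak": "qqq"}, "foo"))
print(switch({"case.foo": "qqq", "case.bar": "XXX", "case.yak": "zzz"}, " bar"))
print(switch({"case.foo": "qqq", "case.bar": "zzz", "default": "XXX"}, "asdf"))
print(switch({"case": "XXX", "case.bar": "zzz", "case.yak": "qqq"}, ""))
----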
A practical use of this parser is in specifying `appends` filter query (`fq`) parameters in the configuration of a SearchHandler, to provide a fixed set of filter options for clients using custom parameter names.
Using the example configuration below, clients can optionally specify the custom parameters `in_stock` and `shipping` to override the default filtering behavior, but are limited to the specific set of legal values (shipping=any|free, in_stock=yes|no|all).
[source,xml]
----
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="in_stock">yes</str>
<str name="shipping">any</str>
</lst>
<lst name="appends">
<str name="fq">{!switch case.all='*:*'
case.yes='inStock:true'
case.no='inStock:false'
v=$in_stock}</str>
<str name="fq">{!switch case.any='*:*'
case.free='shipping_cost:0.0'
v=$shipping}</str>
</lst>
</requestHandler>
----
== Term Query Parser
`TermQParser` extends the `QParserPlugin` by creating a single term query from the input value equivalent to `readableToIndexed()`. This is useful for generating filter queries from the external human readable terms returned by the faceting or terms components. The only parameter is `f`, for the field.
Example:
[source,text]
----
{!term f=weight}1.5
----
For text fields, no analysis is done since raw terms are already returned from the faceting and terms components. To apply analysis to text fields as well, see the <<Field Query Parser>>, above.
If no analysis or transformation is desired for any type of field, see the <<Raw Query Parser>>, above.
== Terms Query Parser
`TermsQParser` functions similarly to the <<Term Query Parser,Term Query Parser>> but takes in multiple values separated by commas and returns documents matching any of the specified values.
This can be useful for generating filter queries from the external human readable terms returned by the faceting or terms components, and may be more efficient in some cases than using the <<the-standard-query-parser.adoc#,Standard Query Parser>> to generate a boolean query since the default implementation `method` avoids scoring.
This query parser takes the following parameters:
`f`::
The field on which to search. This parameter is required.
`separator`::
Separator to use when parsing the input. If set to " " (a single blank space), will trim additional white space from the input terms. Defaults to a comma (`,`).
`method`::
An optional parameter used to determine which of several query implementations should be used by Solr. Options are restricted to: `termsFilter`, `booleanQuery`, `automaton`, `docValuesTermsFilterPerSegment`, `docValuesTermsFilterTopLevel` or `docValuesTermsFilter`. If unspecified, the default value is `termsFilter`. Each implementation has its own performance characteristics, and users are encouraged to experiment to determine which implementation is most performant for their use-case. Heuristics are given below.
+
`booleanQuery` creates a `BooleanQuery` representing the request. Scales well with index size, but poorly with the number of terms being searched for.
+
`termsFilter` the default `method`. Uses a `BooleanQuery` or a `TermInSetQuery` depending on the number of terms. Scales well with index size, but only moderately with the number of query terms.
+
`docValuesTermsFilter` can only be used on fields with docValues data. The `cache` parameter is false by default. Chooses between the `docValuesTermsFilterTopLevel` and `docValuesTermsFilterPerSegment` methods using the number of query terms as a rough heuristic. Users should typically use this method instead of using `docValuesTermsFilterTopLevel` or `docValuesTermsFilterPerSegment` directly, unless they've done performance testing to validate one of the methods on queries of all sizes. Depending on the implementation picked, this method may rely on expensive data structures which are lazily populated after each commit. If you commit frequently and your use-case can tolerate a static warming query, consider adding one to `solrconfig.xml` so that this work is done as a part of the commit itself and not attached directly to user requests.
+
`docValuesTermsFilterTopLevel` can only be used on fields with docValues data. The `cache` parameter is false by default. Uses top-level docValues data structures to find results. These data structures are more efficient as the number of query terms grows large (over several hundred). However, they are also expensive to build and are populated lazily after each commit, which causes a sometimes-noticeable slowdown on the first query after each commit. If you commit frequently and your use-case can tolerate a static warming query, consider adding one to `solrconfig.xml` so that this work is done as a part of the commit itself and not attached directly to user requests.
+
`docValuesTermsFilterPerSegment` can only be used on fields with docValues data. The `cache` parameter is false by default. It is more efficient than the "top-level" alternative with small to medium (~500) numbers of query terms, and doesn't suffer a slowdown on queries immediately following a commit (as `docValuesTermsFilterTopLevel` does; see above). However, it is less performant on very large numbers of query terms.
+
`automaton` creates an `AutomatonQuery` representing the request with each term forming a union. Scales well with index size and moderately with the number of query terms.
*Examples*
[source,text]
----
{!terms f=tags}software,apache,solr,lucene
----
[source,text]
----
{!terms f=categoryId method=booleanQuery separator=" "}8 6 7 5309
----
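
The static warming query suggested for the docValues-based methods above might look like the following in `solrconfig.xml`. This is only a sketch: it assumes the `categoryId` field from the previous example has docValues enabled, and the `*:*` main query and the listed terms are placeholders that should be replaced with values representative of real traffic.

[source,xml]
----
<!-- Sketch only: belongs inside the <query> section of solrconfig.xml -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- Placeholder main query -->
      <str name="q">*:*</str>
      <!-- Assumed docValues field and placeholder terms -->
      <str name="fq">{!terms f=categoryId method=docValuesTermsFilterTopLevel}8,6,7,5309</str>
    </lst>
  </arr>
</listener>
----

A `newSearcher` listener runs its queries while a new searcher is being warmed after a commit, so the lazily built structures should already be populated by the time user requests reach that searcher.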
== XML Query Parser
The {solr-javadocs}/solr-core/org/apache/solr/search/XmlQParserPlugin.html[XmlQParserPlugin] extends the {solr-javadocs}/solr-core/org/apache/solr/search/QParserPlugin.html[QParserPlugin] and supports the creation of queries from XML. Example:
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
[cols="30,70",options="header"]
|===
|Parameter |Value
|defType |`xmlparser`
|q a|
[source,xml]
----
<BooleanQuery fieldName="description">
  <Clause occurs="must">
    <TermQuery>shirt</TermQuery>
  </Clause>
  <Clause occurs="mustnot">
    <TermQuery>plain</TermQuery>
  </Clause>
  <Clause occurs="should">
    <TermQuery>cotton</TermQuery>
  </Clause>
  <Clause occurs="must">
    <BooleanQuery fieldName="size">
      <Clause occurs="should">
        <TermsQuery>S M L</TermsQuery>
      </Clause>
    </BooleanQuery>
  </Clause>
</BooleanQuery>
----
|===
The XmlQParser implementation uses the {solr-javadocs}/solr-core/org/apache/solr/search/SolrCoreParser.html[SolrCoreParser] class, which extends Lucene's {lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/CoreParser.html[CoreParser] class. XML elements are mapped to {lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/QueryBuilder.html[QueryBuilder] classes as follows:
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
[width="100%",cols="30,70",options="header"]
|===
|XML element |QueryBuilder class
|<BooleanQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/BooleanQueryBuilder.html[BooleanQueryBuilder]
|<BoostingTermQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/BoostingTermBuilder.html[BoostingTermBuilder]
|<ConstantScoreQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/ConstantScoreQueryBuilder.html[ConstantScoreQueryBuilder]
|<DisjunctionMaxQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/DisjunctionMaxQueryBuilder.html[DisjunctionMaxQueryBuilder]
|<MatchAllDocsQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/MatchAllDocsQueryBuilder.html[MatchAllDocsQueryBuilder]
|<RangeQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/RangeQueryBuilder.html[RangeQueryBuilder]
|<SpanFirst> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/SpanFirstBuilder.html[SpanFirstBuilder]
|<SpanPositionRange> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/SpanPositionRangeBuilder.html[SpanPositionRangeBuilder]
|<SpanNear> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.html[SpanNearBuilder]
|<SpanNot> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/SpanNotBuilder.html[SpanNotBuilder]
|<SpanOr> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/SpanOrBuilder.html[SpanOrBuilder]
|<SpanOrTerms> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/SpanOrTermsBuilder.html[SpanOrTermsBuilder]
|<SpanTerm> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/SpanTermBuilder.html[SpanTermBuilder]
|<TermQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/TermQueryBuilder.html[TermQueryBuilder]
|<TermsQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/TermsQueryBuilder.html[TermsQueryBuilder]
|<UserQuery> |{lucene-javadocs}/queryparser/org/apache/lucene/queryparser/xml/builders/UserInputQueryBuilder.html[UserInputQueryBuilder]
|<LegacyNumericRangeQuery> |LegacyNumericRangeQuery(Builder) is deprecated
|===
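
As a further illustration of the element mapping, a proximity query could be assembled from the span elements listed above. This is a hedged sketch, not a tested query: the `slop` and `inOrder` attributes are assumed from Lucene's `SpanNearBuilder`, the inherited `fieldName` attribute follows the pattern of the `BooleanQuery` example, and the `description` field and terms are reused from that example. Check the linked builder Javadocs for the attributes each element accepts in your version.

[source,xml]
----
<!-- Sketch only: match "cotton" and "shirt" close together, in either order -->
<SpanNear fieldName="description" slop="3" inOrder="false">
  <SpanTerm>cotton</SpanTerm>
  <SpanTerm>shirt</SpanTerm>
</SpanNear>
----

As with the earlier example, this XML would be passed as the `q` parameter with `defType` set to `xmlparser`.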
=== Customizing XML Query Parser
You can configure your own custom query builders for additional XML elements. The custom builders need to extend the {solr-javadocs}/solr-core/org/apache/solr/search/SolrQueryBuilder.html[SolrQueryBuilder] or the {solr-javadocs}/solr-core/org/apache/solr/search/SolrSpanQueryBuilder.html[SolrSpanQueryBuilder] class. Example `solrconfig.xml` snippet:
[source,xml]
----
<queryParser name="xmlparser" class="XmlQParserPlugin">
  <str name="MyCustomQuery">com.mycompany.solr.search.MyCustomQueryBuilder</str>
</queryParser>
----
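
Once registered, the element name given in the `name` attribute should be usable in queries alongside the built-in elements. The sketch below is purely hypothetical: `MyCustomQuery` is the illustrative name from the snippet above, and the attributes or child content it accepts depend entirely on how the custom builder reads the XML element.

[source,xml]
----
<BooleanQuery fieldName="description">
  <Clause occurs="must">
    <TermQuery>shirt</TermQuery>
  </Clause>
  <Clause occurs="must">
    <!-- Hypothetical element, assumed to be handled by MyCustomQueryBuilder -->
    <MyCustomQuery/>
  </Clause>
</BooleanQuery>
----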