= The DisMax Query Parser
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).
In general, the DisMax query parser's interface is more like that of Google than the interface of the 'lucene' Solr query parser. This similarity makes DisMax the appropriate query parser for many consumer applications. It accepts a simple syntax, and it rarely produces error messages.
The DisMax query parser supports an extremely simplified subset of the Lucene QueryParser syntax. As in Lucene, quotes can be used to group phrases, and +/- can be used to denote mandatory and prohibited clauses. All other Lucene query parser special characters (except AND and OR) are escaped to simplify the user experience. The DisMax query parser takes responsibility for building a good query from the user's input using Boolean clauses containing DisMax queries across fields and boosts specified by the user. It also lets the Solr administrator provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the request handler in the `solrconfig.xml` file or overridden in the Solr query URL.
Interested in the technical concept behind the DisMax name? DisMax stands for Maximum Disjunction. Here's a definition of a Maximum Disjunction or "DisMax" query:
[quote]
____
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
____
Whether or not you remember this explanation, do remember that the DisMax Query Parser was primarily designed to be easy to use and to accept almost any input without returning an error.
== DisMax Query Parser Parameters
In addition to the common request parameters, highlighting parameters, and simple facet parameters, the DisMax query parser supports the parameters described below. Like the standard query parser, the DisMax query parser allows default parameter values to be specified in `solrconfig.xml`, or overridden by query-time values in the request.
The sections below explain these parameters in detail.
=== q Parameter
The `q` parameter defines the main "query" constituting the essence of the search. The parameter supports raw input strings provided by users with no special escaping. The + and - characters are treated as "mandatory" and "prohibited" modifiers for terms. Text wrapped in balanced quote characters (for example, "San Jose") is treated as a phrase. Any query containing an odd number of quote characters is evaluated as if there were no quote characters at all.
IMPORTANT: The `q` parameter does not support wildcard characters such as *.
=== q.alt Parameter
If specified, the `q.alt` parameter defines a query (which by default will be parsed using standard query parsing syntax) to use when the main `q` parameter is not specified or is blank. The `q.alt` parameter comes in handy when you need something like a query to match all documents (don't forget `&rows=0` for that one!) in order to get collection-wide faceting counts.
=== qf (Query Fields) Parameter
The `qf` parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field's importance in the query. For example, the query below:
`qf="fieldOne^2.3 fieldTwo fieldThree^0.4"`
assigns `fieldOne` a boost of 2.3, leaves `fieldTwo` with the default boost (because no boost factor is specified), and assigns `fieldThree` a boost of 0.4. These boost factors make matches in `fieldOne` much more significant than matches in `fieldTwo`, which in turn are much more significant than matches in `fieldThree`.
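The shape of a `qf` value can be illustrated with a few lines of Python. This is only a sketch of the syntax, not Solr's own parsing code: `parse_qf` is a hypothetical helper, and it assumes that a field listed without a boost factor gets a neutral boost of 1.0.

```python
def parse_qf(qf: str) -> dict:
    """Split a qf string into a {field: boost} mapping (illustrative only)."""
    boosts = {}
    for token in qf.split():
        # "fieldOne^2.3" -> ("fieldOne", "^", "2.3"); "fieldTwo" -> ("fieldTwo", "", "")
        field, caret, boost = token.partition("^")
        # Assumption: a missing boost factor means the neutral boost 1.0.
        boosts[field] = float(boost) if caret else 1.0
    return boosts


field_boosts = parse_qf("fieldOne^2.3 fieldTwo fieldThree^0.4")
```

Here `field_boosts` maps `fieldOne` to 2.3, `fieldTwo` to 1.0, and `fieldThree` to 0.4.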
=== mm (Minimum Should Match) Parameter
When processing queries, Lucene/Solr recognizes three types of clauses: mandatory, prohibited, and "optional" (also known as "should" clauses). By default, all words or phrases specified in the `q` parameter are treated as "optional" clauses unless they are preceded by a "+" or a "-". When dealing with these "optional" clauses, the `mm` parameter makes it possible to say that a certain minimum number of those clauses must match. The DisMax query parser offers great flexibility in how the minimum number can be specified.
The table below explains the various ways that mm values can be specified.
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
[cols="30,10,60",options="header"]
|===
|Syntax |Example |Description
|Positive integer |3 |Defines the minimum number of clauses that must match, regardless of how many clauses there are in total.
|Negative integer |-2 |Sets the minimum number of matching clauses to the total number of optional clauses, minus this value.
|Percentage |75% |Sets the minimum number of matching clauses to this percentage of the total number of optional clauses. The number computed from the percentage is rounded down and used as the minimum.
|Negative percentage |-25% |Indicates that this percent of the total number of optional clauses can be missing. The number computed from the percentage is rounded down, before being subtracted from the total to determine the minimum number.
|An expression beginning with a positive integer followed by a > or < sign and another value |3<90% |Defines a conditional expression indicating that if the number of optional clauses is equal to (or less than) the integer, they are all required, but if it's greater than the integer, the specification applies. In this example: if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required.
|Multiple conditional expressions involving > or < signs |2\<-25% 9\<-3 |Defines multiple conditions, each one being valid only for numbers greater than the one before it. In the example at left, if there are 1 or 2 clauses, then both are required. If there are 3-9 clauses all but 25% are required. If there are more than 9 clauses, all but three are required.
|===
When specifying `mm` values, keep in mind the following:
* When dealing with percentages, negative values can be used to get different behavior in edge cases. 75% and -25% mean the same thing when dealing with 4 clauses, but when dealing with 5 clauses 75% means 3 are required, but -25% means 4 are required.
* If the calculations based on the parameter arguments determine that no optional clauses are needed, the usual rules about Boolean queries still apply at search time. (That is, a Boolean query containing no required clauses must still match at least one optional clause).
* No matter what number the calculation arrives at, Solr will never use a value greater than the number of optional clauses, or a value less than 1. In other words, no matter how low or how high the calculated result, the minimum number of required matches will never be less than 1 or greater than the number of clauses.
* When searching across multiple fields that are configured with different query analyzers, the number of optional clauses may differ between the fields. In such a case, the value specified by `mm` applies to the maximum number of optional clauses. For example, if a query clause is treated as a stopword for one of the fields, the number of optional clauses for that field will be smaller than for the other fields. A query with such a stopword clause would not return a match in that field if `mm` is set to 100% because the removed clause does not count as matched.
The default value of `mm` is 0% (all clauses optional), unless `q.op` is specified as "AND", in which case `mm` defaults to 100% (all clauses required).
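The rules in the table and notes above can be sketched in Python. This is a simplified model written from the descriptions in this section, not Solr's actual implementation. `min_should_match` is a hypothetical helper, it only handles the `<` form of conditional expressions shown in the examples, and it expects no spaces around the `<` sign.

```python
def min_should_match(total: int, spec: str) -> int:
    """Return how many of `total` optional clauses must match under `spec`."""

    def simple(expr: str) -> int:
        # Handle one unconditional value: "3", "-2", "75%" or "-25%".
        if expr.endswith("%"):
            percent = int(expr[:-1])
            # Integer division rounds the computed count down.
            count = total * abs(percent) // 100
            return total - count if percent < 0 else count
        value = int(expr)
        return total + value if value < 0 else value

    if "<" in spec:
        # Conditional form, e.g. "2<-25% 9<-3": start with "all required"
        # and apply each condition whose bound the clause count exceeds.
        result = total
        for condition in spec.split():
            bound, expr = condition.split("<", 1)
            if total <= int(bound):
                break
            result = simple(expr)
    else:
        result = simple(spec.strip())

    # Clamp as described in the notes: never below 1, never above the total.
    return max(1, min(result, total)) if total > 0 else 0
```

For instance, `min_should_match(5, "75%")` gives 3 while `min_should_match(5, "-25%")` gives 4, which is the edge case described in the notes above.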
=== pf (Phrase Fields) Parameter
Once the list of matching documents has been identified using the `fq` and `qf` parameters, the `pf` parameter can be used to "boost" the score of documents in cases where all of the terms in the `q` parameter appear in close proximity.
The format is the same as that used by the `qf` parameter: a list of fields and "boosts" to associate with each of them when making phrase queries out of the entire `q` parameter.
=== ps (Phrase Slop) Parameter
The `ps` parameter specifies the amount of "phrase slop" to apply to queries specified with the `pf` parameter. Phrase slop is the number of positions one token needs to be moved in relation to another token in order to match a phrase specified in a query.
=== qs (Query Phrase Slop) Parameter
The `qs` parameter specifies the amount of slop permitted on phrase queries explicitly included in the user's query string with the `qf` parameter. As explained above, slop refers to the number of positions one token needs to be moved in relation to another token in order to match a phrase specified in a query.
=== The tie (Tie Breaker) Parameter
The `tie` parameter specifies a float value (which should be something much less than 1) to use as tiebreaker in DisMax queries.
When a term from the user's input is tested against multiple fields, more than one field may match. If so, each field will generate a different score based on how common that word is in that field (for each document relative to all other documents). The `tie` parameter lets you control how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field.
A value of "0.0" - the default - makes the query a pure "disjunction max query": that is, only the maximum scoring subquery contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query" where it doesn't matter what the maximum scoring sub query is, because the final score will be the sum of the subquery scores. Typically a low value, such as 0.1, is useful.
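The effect of `tie` can be written as a small formula: the highest subquery score, plus `tie` times the sum of the remaining subquery scores. The Python sketch below illustrates that arithmetic only. `dismax_score` is a hypothetical helper and the per-field scores are made-up numbers, not values Solr would return.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores for one term the way a DisMax query does."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    # The best field counts in full; every other field is scaled by `tie`.
    return best + tie * (sum(field_scores) - best)


scores = [2.0, 1.0, 0.5]                 # hypothetical scores from three fields
pure_max = dismax_score(scores, tie=0.0)  # only the best field counts
pure_sum = dismax_score(scores, tie=1.0)  # all fields count in full
blended = dismax_score(scores, tie=0.1)   # best field plus 10% of the others
```

With these numbers, `tie=0.0` yields the maximum (2.0) and `tie=1.0` yields the plain sum (3.5). A small value such as 0.1 lands just above the maximum.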
=== bq (Boost Query) Parameter
The `bq` parameter specifies an additional, optional, query clause that will be _added_ to the user's main query as optional clauses that will influence the score. For example, if you wanted to add a boost for documents that are in a particular category you could use:
[source,text]
----
q=cheese
bq=category:food^10
----
You can specify multiple `bq` parameters, which will each be added as separate clauses with separate boosts.
[source,text]
----
q=cheese
bq=category:food^10
bq=category:deli^5
----
Using the `bq` parameter in this way is functionally equivalent to combining your `q` and `bq` parameters into a single larger boolean query, where the (original) `q` parameter is "mandatory" and the other clauses are optional:
[source,text]
----
q=(+cheese category:food^10 category:deli^5)
----
The only difference between the above examples is that using the `bq` parameter allows you to specify these extra clauses independently (i.e., as configuration defaults) from the main query.
[TIP]
[[bq-bf-shortcomings]]
.Additive Boosts vs Multiplicative Boosts
====
Generally speaking, using `bq` (or `bf`, below) is considered a poor way to "boost" documents by a secondary query because it has an "Additive" effect on the final score. The overall impact a particular `bq` parameter will have on a given document can vary a lot depending on the _absolute_ values of the scores from the original query as well as the `bq` query, which in turn depends on the complexity of the original query, and various scoring factors (TF, IDF, average field length, etc.)
"Multiplicative Boosting" is generally considered to be a more predictable method of influencing document score, because it acts as a "scaling factor" -- increasing (or decreasing) the scores of each document by a _relative_ amount.
The <<other-parsers.adoc#boost-query-parser,`{!boost}` QParser>> provides a convenient wrapper for implementing multiplicative boosting, and the <<the-extended-dismax-query-parser.adoc#extended-dismax-parameters,`{!edismax}` QParser>> offers a `boost` query parameter shortcut for using it.
====
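The difference between additive and multiplicative boosting can be seen with a toy calculation. The Python sketch below uses made-up scores and a hypothetical `rank` helper. It assumes document `B` matches the boost clause while document `A` does not, and that a more complex main query produces scores ten times larger.

```python
def rank(base_scores, boosted_ids, combine):
    """Order document ids by final score, applying `combine` to boosted docs."""
    final = {
        doc: combine(score) if doc in boosted_ids else score
        for doc, score in base_scores.items()
    }
    return sorted(final, key=final.get, reverse=True)


simple_query = {"A": 1.0, "B": 0.8}     # hypothetical scores, simple query
complex_query = {"A": 10.0, "B": 8.0}   # same ratio, larger absolute values

# Additive boost (bq/bf style): the outcome depends on the absolute scale.
add_simple = rank(simple_query, {"B"}, lambda s: s + 0.5)
add_complex = rank(complex_query, {"B"}, lambda s: s + 0.5)

# Multiplicative boost: the outcome is the same at both scales.
mul_simple = rank(simple_query, {"B"}, lambda s: s * 1.5)
mul_complex = rank(complex_query, {"B"}, lambda s: s * 1.5)
```

In this example the additive boost moves `B` ahead of `A` for the simple query but not for the complex one, whereas the multiplicative boost ranks `B` first in both cases.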
=== bf (Boost Functions) Parameter
The `bf` parameter specifies functions (with optional <<the-standard-query-parser.adoc#boosting-a-term-with,query boost>>) that will be used to construct FunctionQueries which will be _added_ to the user's main query as optional clauses that will influence the score. Any <<function-queries.adoc#available-functions,function supported natively by Solr>> can be used, along with a boost value. For example:
[source,text]
----
q=cheese
bf=div(1,sum(1,price))^1.5
----
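To get a feel for what this function contributes, its value can be computed by hand. The sketch below is a rough illustration only. It assumes the boost simply multiplies the function value and ignores any other score normalization. `price_boost` is a hypothetical helper, not part of Solr.

```python
def price_boost(price: float, weight: float = 1.5) -> float:
    """Evaluate div(1,sum(1,price))^1.5 for a single document's price."""
    # sum(1, price) keeps the divisor at 1 or more for non-negative prices.
    return weight * (1.0 / (1.0 + price))


cheap = price_boost(1.0)     # low price, larger additive bonus
pricey = price_boost(100.0)  # high price, bonus close to zero
```

Cheaper documents receive a larger bonus. Because the bonus is added to the main query score, its influence depends on how large that score is.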
Specifying functions with the bf parameter is essentially just shorthand for using the `bq` parameter (<<#bq-bf-shortcomings,with the same shortcomings>>) combined with the `{!func}` parser -- with the addition of the simplified "query boost" syntax.
For example, the two `bf` parameters listed below are completely equivalent to the two `bq` parameters below:
[source,text]
----
bf=div(sales_rank,ms(NOW,release_date))
bf=div(1,sum(1,price))^1.5
----
[source,text]
----
bq={!func}div(sales_rank,ms(NOW,release_date))
bq={!lucene}( {!func v='div(1,sum(1,price))'} )^1.5
----
== Examples of Queries Submitted to the DisMax Query Parser
All of the sample URLs in this section assume you are running Solr's "techproducts" example:
[source,bash]
----
bin/solr -e techproducts
----
The following request returns results for the word "video" using the standard query parser (assuming "df" points to a field to search):
`\http://localhost:8983/solr/techproducts/select?q=video&fl=name+score`
The "dismax" parser is configured to search across the text, features, name, sku, id, manu, and cat fields all with varying boosts designed to ensure that "better" matches appear first, specifically: documents which match on the name and cat fields get higher scores.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video`
Note that this instance is also configured with a default field list, which can be overridden in the URL.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=*,score`
You can also override which fields are searched on and how much boost each field gets.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features\^20.0+text^0.3`
You can boost results that have a field that matches a specific value.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&bq=cat:electronics^5.0`
Another request handler is registered at "/instock" and has slightly different configuration options, notably: a filter for (you guessed it) `inStock:true`.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=name,score,inStock`
`\http://localhost:8983/solr/techproducts/instock?defType=dismax&q=video&fl=name,score,inStock`
One of the other really cool features in this parser is robust support for specifying the "BooleanQuery.minimumNumberShouldMatch" you want to be used based on how many terms are in your user's query. This allows flexibility for typos and partial matches. For the dismax parser, one and two word queries require that all of the optional clauses match, but for three to five word queries one missing word is allowed.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod`
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gibberish`
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+apple`
Use the debugQuery option to see the parsed query, and the score explanations for each document.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gibberish&debugQuery=true`
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video+card&debugQuery=true`
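When requests like the ones above are built from application code, the parameter values need URL encoding (for example, `^` and `:`). A minimal Python sketch using only the standard library follows. `dismax_url` is a hypothetical helper, and the base URL assumes the "techproducts" example running locally.

```python
from urllib.parse import urlencode

# Assumption: the techproducts example is running on the default port.
SELECT_URL = "http://localhost:8983/solr/techproducts/select"


def dismax_url(q, base=SELECT_URL, **params):
    """Build a request URL that sends `q` to the DisMax query parser."""
    query = {"defType": "dismax", "q": q, **params}
    # doseq=True lets a list value (e.g. several bq clauses) repeat the key.
    return base + "?" + urlencode(query, doseq=True)


qf_url = dismax_url("video", qf="features^20.0 text^0.3")
bq_url = dismax_url("video", bq=["cat:electronics^5.0", "inStock:true^2"])
```

Passing a list for `bq` emits one `bq=` parameter per clause, which mirrors how multiple boost queries are specified.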