= MoreLikeThis
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
MoreLikeThis enables users to query for documents similar to a document in their result list.
It does this by using terms from the original document to find similar documents in the index.
There are several ways to use MoreLikeThis.
The first, and most common, is to use it as a request handler.
In this case, you would send text to the MoreLikeThis request handler as needed (for example, when a user clicks a "similar documents" link).
The second is to use it as a search component.
This is less desirable since it performs the MoreLikeThis analysis on every document that matches a user query, which may slow down search responses.
Another approach is to use it as a request handler but with externally supplied text.
This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
Finally, the MLT query parser can be used.
This operates in much the same way as the request handler but since it is a query parser it can be used in filter queries, boost queries, etc., and results can be paginated or highlighted as needed.
== How MoreLikeThis Works
`MoreLikeThis` constructs a Lucene query based on terms in a document.
It does this by pulling terms from the list of fields provided with the request.
For best results, the fields should have stored term vectors (`termVectors=true`), which can be <<defining-fields.adoc#,configured in the schema>>.
If term vectors are not stored, MoreLikeThis can generate terms from stored fields.
The field used for the `uniqueKey` must also be stored in order for MoreLikeThis to work properly.
Terms from the original document are filtered using thresholds defined with the MoreLikeThis parameters.
Once the terms have been selected, a query is run with any other query parameters as appropriate and a new document set is returned.
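The selection step can be illustrated with a simplified model.
The Python sketch below is not Lucene's implementation: the function name `select_terms`, the whitespace tokenizer, and the scoring formula are assumptions made for illustration.
Its keyword arguments loosely mirror the `mintf`, `mindf`, `maxdf`, `minwl`, `maxwl`, and `maxqt` thresholds, and surviving terms are ranked by a TF-IDF-style score.
[source,python]
----
import math
from collections import Counter


def select_terms(source_text, corpus, min_tf=2, min_df=5, max_df=None,
                 min_wl=0, max_wl=0, max_qt=25):
    """Pick 'interesting' terms from source_text using simple thresholds.

    Illustrative model only: tokenization is a lowercase whitespace split,
    and the score is tf * (log(N / df) + 1), not Lucene's actual formula.
    """
    tokenized = [set(doc.lower().split()) for doc in corpus]
    num_docs = len(tokenized)
    term_freq = Counter(source_text.lower().split())

    scored = []
    for term, tf in term_freq.items():
        df = sum(1 for doc in tokenized if term in doc)
        if df == 0:
            continue  # a term absent from the corpus cannot match anything
        if tf < min_tf or df < min_df:
            continue  # too rare in the source document or in the corpus
        if max_df is not None and df > max_df:
            continue  # too common in the corpus
        if min_wl and len(term) < min_wl:
            continue  # word too short
        if max_wl and len(term) > max_wl:
            continue  # word too long
        score = tf * (math.log(num_docs / df) + 1.0)
        scored.append((score, term))

    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_qt]]


corpus = [
    "a game of thrones",
    "a clash of kings",
    "a storm of swords",
    "the black company",
]
print(select_terms("a storm of swords and a storm of kings", corpus,
                   min_tf=1, min_df=1, max_df=2))
----
With these toy inputs, the common words "a" and "of" are dropped by the `max_df` threshold, and "and" is dropped because it does not occur in the corpus.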
== MoreLikeThis Handler and Component
The MoreLikeThis request handler and search component share several parameters, but also have some key differences in response and operation, as described below.
=== Common Handler and Component Parameters
The list below summarizes the `MoreLikeThis` parameters supported by Solr.
These parameters can be used with the MoreLikeThis search component or request handler.
`mlt.fl`::
+
[%autowidth,frame=none]
|===
s|Required |Default: none
|===
+
Specifies the fields to use for similarity.
A list of fields can be provided separated by commas.
If possible, the fields should have stored `termVectors`.
`mlt.mintf`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `2`
|===
+
Specifies the minimum term frequency: terms that occur fewer than this many times in the source document will be ignored.
`mlt.mindf`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `5`
|===
+
Specifies the minimum document frequency: terms that do not occur in at least this many documents will be ignored.
`mlt.maxdf`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Specifies the maximum document frequency: terms that occur in more than this many documents will be ignored.
`mlt.maxdfpct`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Specifies the maximum document frequency using a ratio relative to the number of documents in the index.
The value provided must be an integer between `0` and `100`.
For example, `mlt.maxdfpct=75` means the word will be ignored if it occurs in more than 75 percent of the documents in the index.
`mlt.minwl`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Sets the minimum word length below which words will be ignored.
`mlt.maxwl`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Sets the maximum word length above which words will be ignored.
`mlt.maxqt`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `25`
|===
+
Sets the maximum number of query terms that will be included in any generated query.
`mlt.maxntp`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `5000`
|===
+
Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.
`mlt.boost`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `false`
|===
+
Specifies if the query will be boosted by the interesting term relevance.
Possible values are `true` or `false`.
`mlt.qf`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Query fields and their boosts using the same format used by the <<the-dismax-query-parser.adoc#,DisMax Query Parser>>.
These fields must also be specified in `mlt.fl`.
`mlt.interestingTerms`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `none`
|===
+
Adds a section in the response that shows the top terms (based on TF/IDF) used for the MoreLikeThis query.
It supports three possible values:
+
* `list` lists the terms.
* `none` lists no terms (the default).
* `details` lists the terms along with the boost value used for each term.
Unless `mlt.boost=true`, all terms will have `boost=1.0`.
+
To use this parameter with the <<MoreLikeThis Search Component,search component>>, the query cannot be distributed.
In order to get interesting terms, the query must be sent to a single shard and limited to that shard only (with the <<distributed-requests.adoc#limiting-which-shards-are-queried,`shards`>> parameter).
Multi-shard support is, however, available with the MoreLikeThis request handler.
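
When requests are built in client code, a few of the constraints above can be checked before a request is sent.
The following Python sketch is illustrative only: `build_mlt_params` is a hypothetical helper, not part of Solr or any client library, and it treats the `0` to `100` range for `mlt.maxdfpct` as inclusive, which is an assumption.
[source,python]
----
from urllib.parse import urlencode

INTERESTING_TERMS_VALUES = {"list", "none", "details"}


def build_mlt_params(fl, maxdfpct=None, interesting_terms=None, **extra):
    """Return a dict of mlt.* request parameters after basic validation.

    Hypothetical helper: fl is a list of field names; extra keyword
    arguments (for example mintf=1) are passed through with an "mlt." prefix.
    """
    if not fl:
        raise ValueError("mlt.fl is required")
    params = {"mlt.fl": ",".join(fl)}
    if maxdfpct is not None:
        if type(maxdfpct) is not int or not 0 <= maxdfpct <= 100:
            raise ValueError("mlt.maxdfpct must be an integer from 0 to 100")
        params["mlt.maxdfpct"] = str(maxdfpct)
    if interesting_terms is not None:
        if interesting_terms not in INTERESTING_TERMS_VALUES:
            raise ValueError("mlt.interestingTerms must be list, none, or details")
        params["mlt.interestingTerms"] = interesting_terms
    for name, value in extra.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # Solr-style booleans
        params["mlt." + name] = str(value)
    return params


params = build_mlt_params(["author"], interesting_terms="details", mintf=0, mindf=0)
print(urlencode(params))
----
The resulting query string can be appended to a request handler or search handler URL.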
=== MoreLikeThis Request Handler
==== Request Handler Configuration
The MoreLikeThis request handler is not configured by default and needs to be set up before using it.
You can do this by manually editing `solrconfig.xml` or with the Config API:
[.dynamic-tabs]
--
[example.tab-pane#manualconfig]
====
[.tab-label]*Manual Configuration*
[source,xml]
----
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
<str name="mlt.fl">body</str>
</requestHandler>
----
====
[example.tab-pane#configapi]
====
[.tab-label]*Config API*
[source,bash]
----
curl -X POST -H 'Content-type:application/json' -d '{
  "add-requesthandler": {
    "name": "/mlt",
    "class": "solr.MoreLikeThisHandler",
    "defaults": {"mlt.fl": "body"}
  }
}' http://localhost:8983/solr/<collection>/config
----
====
--
Both of the above examples set the `mlt.fl` parameter to "body" for the request handler.
This means that all requests to the handler will use that value for the parameter unless specifically overridden in an individual request.
For more about request handler configuration in general, see the section <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#default-components,RequestHandlers and SearchComponents in Solrconfig>>.
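The Config API payload can also be assembled in code, which sidesteps shell quoting of inline JSON.
The Python sketch below is a minimal illustration: the helper name `add_mlt_handler_request` and the `gettingstarted` collection name are assumptions, and the request is only constructed, not sent.
[source,python]
----
import json
from urllib.request import Request


def add_mlt_handler_request(base_url, collection, default_fl="body"):
    """Build (but do not send) a Config API request that adds /mlt."""
    payload = {
        "add-requesthandler": {
            "name": "/mlt",
            "class": "solr.MoreLikeThisHandler",
            "defaults": {"mlt.fl": default_fl},
        }
    }
    return Request(
        url=f"{base_url}/{collection}/config",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = add_mlt_handler_request("http://localhost:8983/solr", "gettingstarted")
print(req.get_method(), req.full_url)
----
Passing the returned object to `urllib.request.urlopen` would submit it to a running Solr instance.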
==== Request Handler Parameters
The MoreLikeThis request handler supports the following parameters in addition to the <<Common Handler and Component Parameters,common parameters>> above.
It supports faceting, paging, and filtering using common query parameters, but does not work well with alternate query parsers.
`mlt.match.include`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `true`
|===
+
Specifies if the response should include the matched document.
If set to `false`, the response will look like a normal select response.
`mlt.match.offset`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Specifies an offset into the main query search results to locate the document on which the MoreLikeThis query should operate.
By default, the query operates on the first result for the `q` parameter.
==== Request Handler Query and Response
Queries to the MoreLikeThis request handler use the name defined when it was configured (`/mlt` in the above example).
The following example query uses a document (`q=id:0553573403`) found in Solr's example document set (`./example/exampledocs`), and asks that the author field be used to find similar documents (`mlt.fl=author`).
[source,bash]
http://localhost:8983/solr/gettingstarted/mlt?mlt.fl=author&mlt.interestingTerms=details&mlt.match.include=true&mlt.mindf=0&mlt.mintf=0&q=id%3A0553573403
This query also requests interesting terms with their boosts (`mlt.interestingTerms=details`) and that the original document also be returned (`mlt.match.include=true`).
The minimum term frequency and minimum word document frequency are set to `0`.
The response will include a section `match`, which includes the original document.
The `response` section includes the similar documents.
Finally, the `interestingTerms` section shows the terms from the author field that were used to find the similar documents.
Because we did not also specify `mlt.boost`, the boost values shown for the interesting terms all display `1.0`.
[source,json]
----
{
"match":{"numFound":1,"start":0,"numFoundExact":true,
"docs":[
{
"id":"0553573403",
"cat":["book"],
"name":["A Game of Thrones"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":1,
"genre_s":"fantasy",
"_version_":1693062911089442816}]
},
"response":{"numFound":2,"start":0,"numFoundExact":true,
"docs":[
{
"id":"0553579908",
"cat":["book"],
"name":["A Clash of Kings"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":2,
"genre_s":"fantasy",
"_version_":1693062911094685696},
{
"id":"055357342X",
"cat":["book"],
"name":["A Storm of Swords"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":3,
"genre_s":"fantasy",
"_version_":1693062911095734272}]
},
"interestingTerms":[
"author:r.r",1.0,
"author:george",1.0,
"author:martin",1.0]}
----
If we had not requested `mlt.match.include=true`, the response would not have included the `match` section.
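In the response above, `interestingTerms` is a flat list that alternates between a `field:term` string and its boost.
A client may prefer explicit tuples; the Python sketch below shows one way to regroup the list, assuming the `details` layout shown above (the function name is hypothetical).
[source,python]
----
def pair_interesting_terms(flat):
    """Turn ["field:term", boost, ...] into a list of (field, term, boost)."""
    if len(flat) % 2 != 0:
        raise ValueError("expected alternating term/boost entries")
    result = []
    for qualified, boost in zip(flat[0::2], flat[1::2]):
        # Split only on the first colon so the term itself may contain colons.
        field, _, term = qualified.partition(":")
        result.append((field, term, float(boost)))
    return result


terms = pair_interesting_terms(
    ["author:r.r", 1.0, "author:george", 1.0, "author:martin", 1.0]
)
for field, term, boost in terms:
    print(field, term, boost)
----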
==== Streaming External Content to MoreLikeThis
An external document (one not in the index) can be passed to the MoreLikeThis request handler to be used for recommended documents.
This is accomplished with the use of <<content-streams.adoc#,Content Streams>>.
The body of a document can be passed directly to the request handler with the `stream.body` parameter.
Alternatively, if remote streams are enabled, a URL or file could be passed.
[source,bash]
----
http://localhost:8983/solr/gettingstarted/mlt?stream.body=electronics%20memory&mlt.fl=manu,cat&mlt.interestingTerms=list&mlt.mintf=0
----
This query would pass the terms "electronics memory" to the request handler instead of using a document already in the index.
The response in this case would look similar to the response above that used a document already in the index.
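Because the external text travels in the URL, it must be percent-encoded.
The Python sketch below builds such a request URL with the standard library; the helper name `mlt_stream_url` and the handler URL passed to it (including the `gettingstarted` collection) are assumptions for illustration.
[source,python]
----
from urllib.parse import quote, urlencode


def mlt_stream_url(handler_url, text, fields, **mlt_params):
    """Build a MoreLikeThis handler URL that passes text via stream.body."""
    params = {"stream.body": text, "mlt.fl": ",".join(fields)}
    params.update({f"mlt.{name}": str(value) for name, value in mlt_params.items()})
    # quote_via=quote encodes spaces as %20; commas are left readable.
    return handler_url + "?" + urlencode(params, quote_via=quote, safe=",")


print(mlt_stream_url(
    "http://localhost:8983/solr/gettingstarted/mlt",
    "electronics memory",
    ["manu", "cat"],
    interestingTerms="list",
    mintf=0,
))
----
Longer documents are usually better sent in a request body than in a query string, since URL length limits vary between servers and clients.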
=== MoreLikeThis Search Component
Using MoreLikeThis as a search component returns similar documents for each document in the response set for another query.
Note that this can slow down searches, so it should be used only when the use case warrants it.
==== Search Component Configuration
The MoreLikeThis search component is a default search component that works with all search handlers (see also <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#default-components,Default Components>>).
Since it is configured already, it doesn't need any additional configuration unless you'd like to set parameters for a particular collection that override the MoreLikeThis defaults.
To do this, you could configure it like this:
[source,xml]
----
<searchComponent name="mlt" class="solr.MoreLikeThisComponent">
<str name="mlt">true</str>
<str name="mlt.fl">body</str>
</searchComponent>
----
The above example would always enable MoreLikeThis for all queries and will always use the "body" field.
This is probably not something you really want!
But the example serves to show how you might define whichever parameters you would like to be default for MoreLikeThis.
If you gave the search component a name other than "mlt" (the name used in the above example), you would need to explicitly add it to a request handler as described in the section <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#search-components,Search Components>>.
Because the above example uses the same name as the default, the parameters defined there override Solr's default.
==== Search Component Parameters
The MoreLikeThis search component supports the following parameters in addition to the <<Common Handler and Component Parameters,common parameters>> above.
`mlt`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
If set to `true`, activates the `MoreLikeThis` component and enables Solr to return `MoreLikeThis` results.
`mlt.count`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `5`
|===
+
Specifies the number of similar documents to be returned for each result.
==== Search Component Query and Response
The response when using MoreLikeThis as a search component differs from the response when using the request handler.
In this case, we are using the `/select` request handler and performing a regular query (`q=author:martin`).
We've asked for MoreLikeThis to be added to the response (`mlt=true`) and, as in the earlier example, set the minimum term and document frequencies to `0`.
[source,bash]
http://localhost:8983/solr/gettingstarted/select?mlt.fl=name&mlt.mindf=0&mlt.mintf=0&mlt=true&q=author%3Amartin
The response includes the results of our query, in this case 3 documents which have the term "martin" in the author field.
We've changed the field, however, to find documents that are similar to these based on values in the `name` field (`mlt.fl=name`).
In the response, a `moreLikeThis` section has been added.
For each document in the results that match our query, a list of document IDs is returned with score values.
Each of these documents is similar to the document in the result list to varying degrees.
[source,json]
----
{
"response":{"numFound":3,"start":0,"maxScore":0.43659902,"numFoundExact":true, "docs":[
{
"id":"0553573403",
"cat":["book"],
"name":["A Game of Thrones"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":1,
"genre_s":"fantasy",
"_version_":1693062911089442816},
{
"id":"0553579908",
"cat":["book"],
"name":["A Clash of Kings"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":2,
"genre_s":"fantasy",
"_version_":1693062911094685696},
{
"id":"055357342X",
"cat":["book"],
"name":["A Storm of Swords"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":3,
"genre_s":"fantasy",
"_version_":1693062911095734272}]
},
"moreLikeThis":[
"0553573403",{"numFound":6,"start":0,"maxScore":1.6554483,"numFoundExact":true,
"docs":[
{
"id":"055357342X",
"score":1.6554483},
{
"id":"0553579908",
"score":1.6554483},
{
"id":"0805080481",
"score":1.3422124},
{
"id":"0812550706",
"score":1.284826},
{
"id":"978-1423103349",
"score":0.7652973}]
},
"0553579908",{"numFound":5,"start":0,"maxScore":1.6554483,"numFoundExact":true,
"docs":[
{
"id":"055357342X",
"score":1.6554483},
{
"id":"0553573403",
"score":1.6554483},
{
"id":"0805080481",
"score":1.3422124},
{
"id":"978-1423103349",
"score":0.7652973},
{
"id":"VDBDB1A16",
"score":0.68205893}]
},
"055357342X",{"numFound":5,"start":0,"maxScore":1.6554483,"numFoundExact":true,
"docs":[
{
"id":"0553579908",
"score":1.6554483},
{
"id":"0553573403",
"score":1.6554483},
{
"id":"0805080481",
"score":1.3422124},
{
"id":"978-1423103349",
"score":0.7652973},
{
"id":"VDBDB1A16",
"score":0.68205893}]
}]}
----
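The `moreLikeThis` section above is a flat list that alternates between a source document ID and the result set of its similar documents.
The Python sketch below regroups it into a dictionary keyed by source document ID, assuming the alternating layout shown above; `similar_by_document` is a hypothetical name, and the sample data is an abbreviated copy of the response.
[source,python]
----
def similar_by_document(more_like_this):
    """Map each source document ID to its [(similar_id, score), ...] list."""
    grouped = {}
    for doc_id, result in zip(more_like_this[0::2], more_like_this[1::2]):
        grouped[doc_id] = [(doc["id"], doc["score"]) for doc in result["docs"]]
    return grouped


sample = [
    "0553573403", {"numFound": 6, "docs": [
        {"id": "055357342X", "score": 1.6554483},
        {"id": "0553579908", "score": 1.6554483},
    ]},
    "0553579908", {"numFound": 5, "docs": [
        {"id": "055357342X", "score": 1.6554483},
    ]},
]
for source_id, similar in similar_by_document(sample).items():
    print(source_id, similar)
----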
== MoreLikeThis Query Parser
The `mlt` query parser provides a mechanism to retrieve documents similar to a specific document, like the request handler.
It uses Lucene's existing `MoreLikeThis` logic and also works in SolrCloud mode.
The document identifier used here is the document's `uniqueKey` value and not the Lucene internal document id.
The list of returned documents excludes the queried document.
One benefit of the query parser is that it can be used in various places, not only in a standard `q` parameter.
This allows MoreLikeThis to be added to boost queries, filter queries, function queries, etc.
=== Query Parser Parameters
This query parser takes the following parameters:
`qf`::
+
[%autowidth,frame=none]
|===
s|Required |Default: none
|===
+
Defines the fields to use as the basis for similarity analysis.
`mintf`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `2`
|===
+
Defines the minimum term frequency: terms that occur fewer than this many times in the source document will be ignored.
`mindf`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `5`
|===
+
Defines the minimum document frequency: terms that do not occur in at least this many documents will be ignored.
`maxdf`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Sets the maximum document frequency: terms that occur in more than this many documents will be ignored.
`minwl`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Sets the minimum word length below which words will be ignored.
`maxwl`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Sets the maximum word length above which words will be ignored.
`maxqt`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `25`
|===
+
Sets the maximum number of query terms that will be included in any generated query.
`maxntp`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `5000`
|===
+
Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.
`boost`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `false`
|===
+
Specifies if the query will be boosted by the interesting term relevance. It can be either `true` or `false`.
=== Query Parser Query and Response
The structure of a MoreLikeThis query parser request is like a query using <<local-parameters-in-queries.adoc#,local params>>, as in:
[source,bash]
----
{!mlt qf=name}1
----
This would use the MoreLikeThis query parser to find documents similar to document "1", based on the "name" field.
Additional parameters are added inside the braces, for example to specify limits for `mintf` and `mindf`:
[source,bash]
----
{!mlt qf=name mintf=2 mindf=3}1
----
Consider a query such as the following, based on the example documents provided with Solr:
[source,bash]
http://localhost:8983/solr/gettingstarted/select?q={!mlt qf=author mintf=1 mindf=1}0553573403
The query parser response includes only the similar documents sorted by score:
[source,json]
----
{
"response":{"numFound":2,"start":0,"maxScore":1.309797,"numFoundExact":true,
"docs":[
{
"id":"0553579908",
"cat":["book"],
"name":["A Clash of Kings"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":2,
"genre_s":"fantasy",
"_version_":1693062911094685696},
{
"id":"055357342X",
"cat":["book"],
"name":["A Storm of Swords"],
"price":[7.99],
"inStock":[true],
"author":["George R.R. Martin"],
"series_t":"A Song of Ice and Fire",
"sequence_i":3,
"genre_s":"fantasy",
"_version_":1693062911095734272}]
}}
----
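When a query parser request is issued from client code, the local-params string contains braces, spaces, and other characters that need URL encoding.
The Python sketch below assembles the `q` value and encodes it; the helper name `mlt_local_params_query` is hypothetical, and it accepts a single field name for `qf`.
[source,python]
----
from urllib.parse import urlencode


def mlt_local_params_query(doc_id, qf, **options):
    """Build a {!mlt ...} query string for the given uniqueKey value."""
    parts = [f"qf={qf}"]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # Solr-style booleans
        parts.append(f"{name}={value}")
    return "{!mlt " + " ".join(parts) + "}" + doc_id


q = mlt_local_params_query("0553573403", "author", mintf=1, mindf=1)
print(q)
print("http://localhost:8983/solr/gettingstarted/select?" + urlencode({"q": q}))
----
The first line printed is the same query string used in the request above; the second is its URL-encoded form.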