| = MoreLikeThis |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| MoreLikeThis enables queries for documents similar to a document in their result list. |
| |
| It does this by using terms from the original document to find similar documents in the index. |
| |
| There are several ways to use MoreLikeThis. |
| The first, and most common, is to use it as a request handler. |
| In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link). |
| |
| The second is to use it as a search component. |
| This is less desirable since it performs the MoreLikeThis analysis on every document that matches a user query. This may slow search results. |
| |
| Another approach is to use it as a request handler but with externally supplied text. |
| This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document. |
| |
| Finally, the MLT query parser can be used. |
| This operates in much the same way as the request handler but since it is a query parser it can be used in filter queries, boost queries, etc., and results can be paginated or highlighted as needed. |
| |
| == How MoreLikeThis Works |
| |
| `MoreLikeThis` constructs a Lucene query based on terms in a document. |
| It does this by pulling terms from the list of fields provided with the request. |
| |
| For best results, the fields should have stored term vectors (`termVectors=true`), which can be <<defining-fields.adoc#,configured in the schema>>. |
| If term vectors are not stored, MoreLikeThis can generate terms from stored fields. |
| The field used for the `uniqueKey` must also be stored in order for MoreLikeThis to work properly. |
| |
| Terms from the original document are filtered using thresholds defined with the MoreLikeThis parameters. |
| Once the terms have been selected, a query is run with any other query parameters as appropriate and a new document set is returned. |
| |
| == MoreLikeThis Handler and Component |
| |
| The MoreLikeThis request handler and search component share several parameters, but also have some key differences in response and operation, as described below. |
| |
| === Common Handler and Component Parameters |
| |
| The list below summarizes the `MoreLikeThis` parameters supported by Solr. |
| These parameters can be used with the MoreLikeThis search component or request handler. |
| |
| `mlt.fl`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| s|Required |Default: none |
| |=== |
| + |
| Specifies the fields to use for similarity. |
| A list of fields can be provided separated by commas. |
| If possible, the fields should have stored `termVectors`. |
| |
| `mlt.mintf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `2` |
| |=== |
| + |
| Specifies the minimum frequency below which terms will be ignored in the source document. |
| |
| `mlt.mindf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `5` |
| |=== |
| + |
| Specifies the minimum frequency below which terms will be ignored which do not occur in at least this many documents. |
| |
| `mlt.maxdf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Specifies the maximum frequency above which terms will be ignored which occur in more than this many documents. |
| |
| `mlt.maxdfpct`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Specifies the maximum document frequency using a ratio relative to the number of documents in the index. |
| The value provided must be an integer between `0` and `100`. |
| For example, `mlt.maxdfpct=75` means the word will be ignored if it occurs in more than 75 percent of the documents in the index. |
| |
| `mlt.minwl`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Sets the minimum word length below which words will be ignored. |
| |
| `mlt.maxwl`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Sets the maximum word length above which words will be ignored. |
| |
| `mlt.maxqt`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `25` |
| |=== |
| + |
| Sets the maximum number of query terms that will be included in any generated query. |
| |
| `mlt.maxntp`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `5000` |
| |=== |
| + |
| Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support. |
| |
| `mlt.boost`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `false` |
| |=== |
| + |
| Specifies if the query will be boosted by the interesting term relevance. |
| Possible values are `true` or `false`. |
| |
| `mlt.qf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Query fields and their boosts using the same format used by the <<the-dismax-query-parser.adoc#,DisMax Query Parser>>. |
| These fields must also be specified in `mlt.fl`. |
| |
| `mlt.interestingTerms`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `none` |
| |=== |
| + |
| Adds a section in the response that shows the top terms (based on TF/IDF) used for the MoreLikeThis query. |
| It supports three possible values: |
| + |
| * `list` lists the terms. |
| * `none` lists no terms (the default). |
| * `details` lists the terms along with the boost value used for each term. |
| Unless `mlt.boost=true`, all terms will have `boost=1.0`. |
| |
| + |
| To use this parameter with the <<MoreLikeThis Search Component,search component>>, the query cannot be distributed. |
| In order to get interesting terms, the query must be sent to a single shard and limited to that shard only (with the <<distributed-requests.adoc#limiting-which-shards-are-queried,`shards`>> parameter). |
| Multi-shard support is, however, available with the MoreLikeThis request handler. |
| |
| === MoreLikeThis Request Handler |
| |
| ==== Request Handler Configuration |
| |
| The MoreLikeThis request handler is not configured by default and needs to be set up before using it. |
| You can do this by manually editing `solrconfig.xml` or with the Config API: |
| |
| [.dynamic-tabs] |
| -- |
| [example.tab-pane#manualconfig] |
| ==== |
| [.tab-label]*Manual Configuration* |
| |
| [source,xml] |
| ---- |
| <requestHandler name="/mlt" class="solr.MoreLikeThisHandler"> |
| <str name="mlt.fl">body</str> |
| </requestHandler> |
| ---- |
| ==== |
| |
| [example.tab-pane#configapi] |
| ==== |
| [.tab-label]*Config API* |
| |
| [source,bash] |
| ---- |
| curl -X POST -H 'Content-type:application/json' -d { |
| "add-requesthandler": { |
| "name": "/mlt", |
| "class": "solr.MoreLikeThisHandler", |
| "defaults": {"mlt.fl": "body"} |
| } |
| } http://localhost:8983/solr/<collection>/config |
| ---- |
| ==== |
| -- |
| |
| Both of the above examples set the `mlt.fl` parameter to "body" for the request handler. |
| This means that all requests to the handler will use that value for the parameter unless specifically overridden in an individual request. |
| |
| For more about request handler configuration in general, see the section <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#default-components,RequestHandlers and SearchComponents in Solrconfig>>. |
| |
| ==== Request Handler Parameters |
| |
| The MoreLikeThis request handler supports the following parameters in addition to the <<Common Handler and Component Parameters,common parameters>> above. |
| It supports faceting, paging, and filtering using common query parameters, but does not work well with alternate query parsers. |
| |
| `mlt.match.include`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `true` |
| |=== |
| + |
| Specifies if the response should include the matched document. |
| If set to `false`, the response will look like a normal select response. |
| |
| `mlt.match.offset`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Specifies an offset into the main query search results to locate the document on which the MoreLikeThis query should operate. |
| By default, the query operates on the first result for the `q` parameter. |
| |
| ==== Request Handler Query and Response |
| |
| Queries to the MoreLikeThis request handler use the name defined when it was configured (`/mlt` in the above example). |
| |
| The following example query uses a document (`q=id:0553573403`) found in Solr's example document set (`./example/exampledocs`), and asks that the author field be used to find similar documents (`mlt.fl=author`). |
| |
| [source,bash] |
| http://localhost:8983/solr/gettingstarted/mlt?mlt.fl=author&mlt.interestingTerms=details&mlt.match.include=true&mlt.mindf=0&mlt.mintf=0&q=id%3A0553573403 |
| |
| This query also requests interesting terms with their boosts (`mlt.interestingTerms=details`) and that the original document also be returned (`mlt.match.include=true`). |
| The minimum term frequency and minimum word document frequency are set to `0`. |
| |
| The response will include a section `match`, which includes the original document. |
| The `response` section includes the similar documents. |
| Finally, the `interestingTerms` section shows the terms from the author field that were used to find the similar documents. |
| Because we did not also specify `mlt.boost`, the boost values shown for the interesting terms all display `1.0`. |
| |
| [source,json] |
| ---- |
| { |
| "match":{"numFound":1,"start":0,"numFoundExact":true, |
| "docs":[ |
| { |
| "id":"0553573403", |
| "cat":["book"], |
| "name":["A Game of Thrones"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":1, |
| "genre_s":"fantasy", |
| "_version_":1693062911089442816}] |
| }, |
| "response":{"numFound":2,"start":0,"numFoundExact":true, |
| "docs":[ |
| { |
| "id":"0553579908", |
| "cat":["book"], |
| "name":["A Clash of Kings"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":2, |
| "genre_s":"fantasy", |
| "_version_":1693062911094685696}, |
| { |
| "id":"055357342X", |
| "cat":["book"], |
| "name":["A Storm of Swords"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":3, |
| "genre_s":"fantasy", |
| "_version_":1693062911095734272}] |
| }, |
| "interestingTerms":[ |
| "author:r.r",1.0, |
| "author:george",1.0, |
| "author:martin",1.0]} |
| ---- |
| |
| If we had not requested `mlt.match.include=true`, the response would not have included the `match` section. |
| |
| ==== Streaming External Content to MoreLikeThis |
| |
| An external document (one not in the index) can be passed to the MoreLikeThis request handler to be used for recommended documents. |
| |
| This is accomplished with the use of <<content-streams.adoc#,Content Streams>>. |
| The body of a document can be passed directly to the request handler with the `stream.body` parameter. |
| Alternatively, if remote streams are enabled, a URL or file could be passed. |
| |
| [source,bash] |
| ---- |
| http://localhost:8983/solr/mlt?stream.body=electronics%20memory&mlt.fl=manu,cat&mlt.interestingTerms=list&mlt.mintf=0 |
| ---- |
| |
| This query would pass the terms "electronics memory" to the request handler instead of using a document already in the index. |
| |
| The response in this case would look similar to the response above that used a document already in the index. |
| |
| === MoreLikeThis Search Component |
| |
| Using MoreLikeThis as a search component returns similar documents for each document in the response set for another query. |
| It's important to note this could incur a cost to search performance so should only be used when the use case warrants it. |
| |
| ==== Search Component Configuration |
| |
| The MoreLikeThis search component is a default search component that works with all search handlers (see also <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#default-components,Default Components>>). |
| |
| Since it is configured already, it doesn't need any additional configuration unless you'd like to set parameters for a particular collection that override the MoreLikeThis defaults. |
| To do this, you could configure it like this: |
| |
| [source,xml] |
| ---- |
| <searchComponent name="mlt" class="solr.MoreLikeThisComponent"> |
| <str name="mlt">true</str> |
| <str name="mlt.fl">body</str> |
| </searchComponent> |
| ---- |
| |
| The above example would always enable MoreLikeThis for all queries and will always use the "body" field. |
| This is probably not something you really want! |
| But the example serves to show how you might define whichever parameters you would like to be default for MoreLikeThis. |
| |
| If you gave the search component a name other than "mlt" as in the above example, you would need to explicitly add it to a request handler as described in the section <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#search-components,Search Components>>. |
| Because the above example uses the same name as the default, the parameters defined there override Solr's default. |
| |
| ==== Search Component Parameters |
| |
| The MoreLikeThis search component supports the following parameters in addition to the <<Common Handler and Component Parameters,common parameters>> above. |
| |
| `mlt`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| If set to `true`, activates the `MoreLikeThis` component and enables Solr to return `MoreLikeThis` results. |
| |
| `mlt.count`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `5` |
| |=== |
| + |
| Specifies the number of similar documents to be returned for each result. |
| |
| ==== Search Component Query and Response |
| |
| The response when using MoreLikeThis as a search component is different than when using the request handler. |
| |
| In this case, we are using the `/select` request handler and performing a regular query (`q=author:martin`). |
| We've asked for MoreLikeThis to be added to the response (`mlt=true`), but otherwise the parameters are the same as the earlier example (we've asked for interesting terms and set minimum term and document frequencies to `0`). |
| |
| [source,bash] |
| http://localhost:8983/solr/gettingstarted/select?mlt.fl=name&mlt.mindf=0&mlt.mintf=0&mlt=true&q=author%3Amartin |
| |
| The response includes the results of our query, in this case 3 documents which have the term "martin" in the author field. |
| We've changed the field, however, to find documents that are similar to these based on values in the `name` field (`mlt.fl=name`). |
| |
| In the response, a `moreLikeThis` section has been added. |
| For each document in the results that match our query, a list of document IDs is returned with score values. |
| Each of these documents are similar to the document in the result list to varying degrees. |
| |
| [source,json] |
| ---- |
| { |
| "response":{"numFound":3,"start":0,"maxScore":0.43659902,"numFoundExact":true, "docs":[ |
| { |
| "id":"0553573403", |
| "cat":["book"], |
| "name":["A Game of Thrones"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":1, |
| "genre_s":"fantasy", |
| "_version_":1693062911089442816}, |
| { |
| "id":"0553579908", |
| "cat":["book"], |
| "name":["A Clash of Kings"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":2, |
| "genre_s":"fantasy", |
| "_version_":1693062911094685696}, |
| { |
| "id":"055357342X", |
| "cat":["book"], |
| "name":["A Storm of Swords"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":3, |
| "genre_s":"fantasy", |
| "_version_":1693062911095734272}] |
| }, |
| "moreLikeThis":[ |
| "0553573403",{"numFound":6,"start":0,"maxScore":1.6554483,"numFoundExact":true, |
| "docs":[ |
| { |
| "id":"055357342X", |
| "score":1.6554483}, |
| { |
| "id":"0553579908", |
| "score":1.6554483}, |
| { |
| "id":"0805080481", |
| "score":1.3422124}, |
| { |
| "id":"0812550706", |
| "score":1.284826}, |
| { |
| "id":"978-1423103349", |
| "score":0.7652973}] |
| }, |
| "0553579908",{"numFound":5,"start":0,"maxScore":1.6554483,"numFoundExact":true, |
| "docs":[ |
| { |
| "id":"055357342X", |
| "score":1.6554483}, |
| { |
| "id":"0553573403", |
| "score":1.6554483}, |
| { |
| "id":"0805080481", |
| "score":1.3422124}, |
| { |
| "id":"978-1423103349", |
| "score":0.7652973}, |
| { |
| "id":"VDBDB1A16", |
| "score":0.68205893}] |
| }, |
| "055357342X",{"numFound":5,"start":0,"maxScore":1.6554483,"numFoundExact":true, |
| "docs":[ |
| { |
| "id":"0553579908", |
| "score":1.6554483}, |
| { |
| "id":"0553573403", |
| "score":1.6554483}, |
| { |
| "id":"0805080481", |
| "score":1.3422124}, |
| { |
| "id":"978-1423103349", |
| "score":0.7652973}, |
| { |
| "id":"VDBDB1A16", |
| "score":0.68205893}] |
| }]} |
| ---- |
| |
| == MoreLikeThis Query Parser |
| |
| The `mlt` query parser provides a mechanism to retrieve documents similar to a specific document, like the request handler. |
| |
| It uses Lucene's existing `MoreLikeThis` logic and also works in SolrCloud mode. |
| The document identifier used here is the document's `uniqueKey` value and not the Lucene internal document id. |
| The list of returned documents excludes the queried document. |
| |
| One benefit of the query parser is that it can be used in various places, not only in a standard `q` parameter. |
| This allows MoreLikeThis to be added to boost queries, filter queries, function queries, etc. |
| |
| === Query Parser Parameters |
| |
| This query parser takes the following parameters: |
| |
| `qf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| s|Required |Default: none |
| |=== |
| + |
| Defines the fields to use as the basis for similarity analysis. |
| |
| `mintf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `2` |
| |=== |
| + |
| Defines the minimum frequency below which terms will be ignored in the source document. |
| |
| `mindf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `5` |
| |=== |
| + |
| Defines the minimum frequency below which terms will be ignored which do not occur in at least this many documents. |
| |
| `maxdf`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Sets the maximum frequency above which terms will be ignored which occur in more than this many documents. |
| |
| `minwl`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Sets the minimum word length below which words will be ignored. |
| |
| `maxwl`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: none |
| |=== |
| + |
| Sets the maximum word length above which words will be ignored. |
| |
| `maxqt`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `25` |
| |=== |
| + |
| Sets the maximum number of query terms that will be included in any generated query. |
| |
| `maxntp`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `5000` |
| |=== |
| + |
| Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support. |
| |
| `boost`:: |
| + |
| [%autowidth,frame=none] |
| |=== |
| |Optional |Default: `false` |
| |=== |
| + |
| Specifies if the query will be boosted by the interesting term relevance. It can be either `true` or `false`. |
| |
| === Query Parser Query and Response |
| |
| The structure of a MoreLikeThis query parser request is like a query using <<local-parameters-in-queries.adoc#,local params>>, as in: |
| |
| [source,bash] |
| ---- |
| {!mlt qf=name}1 |
| ---- |
| |
| This would use the MoreLikeThis query parser to find documents similar to document "1", based on the "name" field. |
| |
| Additional parameters would be added inside the brackets, for example if we wanted to specify limits for `mintf` and `mindf`: |
| |
| [source,bash] |
| ---- |
| {!mlt qf=name mintf=2 mindf=3}1 |
| ---- |
| |
| If given a query such as the following based on the example documents provided with Solr: |
| |
| [source,bash] |
| http://localhost:8983/solr/gettingstarted/select?q={!mlt qf=author mintf=1 mindf=1}0553573403 |
| |
| The query parser response includes only the similar documents sorted by score: |
| |
| [source,json] |
| ---- |
| { |
| "response":{"numFound":2,"start":0,"maxScore":1.309797,"numFoundExact":true, |
| "docs":[ |
| { |
| "id":"0553579908", |
| "cat":["book"], |
| "name":["A Clash of Kings"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":2, |
| "genre_s":"fantasy", |
| "_version_":1693062911094685696}, |
| { |
| "id":"055357342X", |
| "cat":["book"], |
| "name":["A Storm of Swords"], |
| "price":[7.99], |
| "inStock":[true], |
| "author":["George R.R. Martin"], |
| "series_t":"A Song of Ice and Fire", |
| "sequence_i":3, |
| "genre_s":"fantasy", |
| "_version_":1693062911095734272}] |
| }} |
| ---- |