= The Terms Component
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
The Terms Component provides access to the indexed terms in a field and the number of documents that match each term. This can be useful for building an auto-suggest feature or any other feature that operates at the term level instead of the search or document level. Retrieving terms in index order is very fast since the implementation directly uses Lucene's TermsEnum to iterate over the term dictionary.
In a sense, this search component provides fast field-faceting over the whole index, not restricted by the base query or any filters. The document frequencies returned are the number of documents that match the term, including any documents that have been marked for deletion but not yet removed from the index.
== Configuring the Terms Component
By default, the Terms Component is already configured in `solrconfig.xml` for each collection.
=== Defining the Terms Component
Defining the Terms search component is straightforward: simply give it a name and use the class `solr.TermsComponent`.
[source,xml]
----
<searchComponent name="terms" class="solr.TermsComponent"/>
----
This makes the component available, but it cannot be used until it is included in a request handler.
=== Using the Terms Component in a Request Handler
The terms component is included with the `/terms` request handler, which is among Solr's out-of-the-box request handlers - see <<implicit-requesthandlers.adoc#,Implicit RequestHandlers>>.
Note that the defaults for this request handler set the parameter `terms` to true, which allows terms to be returned on request. The parameter `distrib` is set to false, so by default this handler operates on a single Solr core only.
You could add this component to another handler if you wanted to, and pass "terms=true" in the HTTP request in order to get terms back. If it is only defined in a separate handler, you must use that handler when querying in order to get terms and not regular documents as results.
=== Terms Component Parameters
The parameters below control which terms are returned. You can set any of them as defaults in the request handler configuration, or pass them with each request. These parameters are:
`terms`::
If set to `true`, enables the Terms Component. By default, the Terms Component is off (`false`).
+
Example: `terms=true`
`terms.fl`::
Specifies the field from which to retrieve terms. This parameter is required if `terms=true`.
+
Example: `terms.fl=title`
`terms.list`::
Fetches the document frequency for a comma-delimited list of terms. Terms are always returned in index order. If `terms.ttf` is set to true, also returns their total term frequency. If multiple `terms.fl` are defined, these statistics will be returned for each term in each requested field.
+
Example: `terms.list=termA,termB,termC`
`terms.limit`::
Specifies the maximum number of terms to return. The default is `10`. If the limit is set to a number less than 0, then no maximum limit is enforced. Although this is not required, either this parameter or `terms.upper` must be defined.
+
Example: `terms.limit=20`
`terms.lower`::
Specifies the term at which to start. If not specified, the empty string is used, causing Solr to start at the beginning of the field.
+
Example: `terms.lower=orange`
`terms.lower.incl`::
If set to true, includes the lower-bound term (specified with `terms.lower`) in the result set.
+
Example: `terms.lower.incl=false`
`terms.mincount`::
Specifies the minimum document frequency to return in order for a term to be included in a query response. Results are inclusive of the mincount (that is, >= mincount).
+
Example: `terms.mincount=5`
`terms.maxcount`::
Specifies the maximum document frequency a term must have in order to be included in a query response. The default setting is -1, which sets no upper bound. Results are inclusive of the maxcount (that is, \<= maxcount).
+
Example: `terms.maxcount=25`
`terms.prefix`::
Restricts matches to terms that begin with the specified string.
+
Example: `terms.prefix=inter`
`terms.raw`::
If set to true, returns the raw characters of the indexed term, regardless of whether it is human-readable. For instance, the indexed form of numeric values is not human-readable.
+
Example: `terms.raw=true`
`terms.regex`::
Restricts matches to terms that match the regular expression.
+
Example: `terms.regex=.*pedist`
`terms.regex.flag`::
Defines a Java regex flag to use when evaluating the regular expression defined with `terms.regex`. See http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html for details of each flag. Valid options are:
* `case_insensitive`
* `comments`
* `multiline`
* `literal`
* `dotall`
* `unicode_case`
* `canon_eq`
* `unix_lines`
+
Example: `terms.regex.flag=case_insensitive`
`terms.stats`::
Include index statistics in the results. Currently returns only the *numDocs* for a collection. When combined with `terms.list` it provides enough information to compute inverse document frequency (IDF) for a list of terms.
`terms.sort`::
Defines how to sort the terms returned. Valid options are `count`, which sorts by document frequency with the most frequent terms first, or `index`, which sorts in index order.
+
Example: `terms.sort=index`
`terms.ttf`::
If set to true, returns both `df` (docFreq) and `ttf` (totalTermFreq) statistics for each requested term in `terms.list`. In this case, the response format is:
+
[source,xml]
----
<lst name="terms">
<lst name="field">
<lst name="termA">
<long name="df">22</long>
<long name="ttf">73</long>
</lst>
</lst>
</lst>
----
`terms.upper`::
Specifies the term to stop at. Although this parameter is not required, either this parameter or `terms.limit` must be defined.
+
Example: `terms.upper=plum`
`terms.upper.incl`::
If set to true, the upper bound term is included in the result set. The default is false.
+
Example: `terms.upper.incl=true`
The response to a terms request is a list of the terms and their document frequency values.
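As a rough illustration of how these document frequencies can be combined with the `numDocs` value reported by `terms.stats`, the sketch below computes a textbook inverse document frequency, `ln(numDocs / df)`. The class name, method name, and sample numbers are hypothetical, and this is not necessarily the formula Solr's own similarity implementations use.

[source,java]
----
final class TermsIdfSketch {

    /**
     * Textbook IDF: ln(numDocs / docFreq).
     * Both inputs must be positive. Because document frequencies can count
     * deleted-but-not-yet-purged documents, docFreq may exceed numDocs,
     * in which case the result is negative.
     */
    static double idf(long numDocs, long docFreq) {
        if (numDocs <= 0 || docFreq <= 0) {
            throw new IllegalArgumentException("numDocs and docFreq must be positive");
        }
        return Math.log((double) numDocs / (double) docFreq);
    }

    public static void main(String[] args) {
        long numDocs = 32; // hypothetical value taken from a terms.stats response
        long docFreq = 3;  // hypothetical df taken from a terms.list response
        System.out.println(idf(numDocs, docFreq));
    }
}
----

A term that appears in every document gets an IDF of 0 under this formula; rarer terms score higher.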
You may also be interested in the {solr-javadocs}/solr-core/org/apache/solr/handler/component/TermsComponent.html[TermsComponent javadoc].
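The `terms.regex.flag` options share their names with the flag constants on `java.util.regex.Pattern`. The sketch below assumes each option maps to the like-named constant and that a term must match the expression in full; neither detail is spelled out on this page, so treat it as an illustration of the flags' effect rather than a description of the component's internals. The class and method names are hypothetical.

[source,java]
----
import java.util.regex.Pattern;

final class TermsRegexFlagSketch {

    /** Maps a terms.regex.flag option name to the assumed Pattern constant. */
    static int flagFor(String name) {
        switch (name) {
            case "case_insensitive": return Pattern.CASE_INSENSITIVE;
            case "comments":         return Pattern.COMMENTS;
            case "multiline":        return Pattern.MULTILINE;
            case "literal":          return Pattern.LITERAL;
            case "dotall":           return Pattern.DOTALL;
            case "unicode_case":     return Pattern.UNICODE_CASE;
            case "canon_eq":         return Pattern.CANON_EQ;
            case "unix_lines":       return Pattern.UNIX_LINES;
            default: throw new IllegalArgumentException("unknown flag: " + name);
        }
    }

    /** Returns true if the whole term matches the regex under the given flags. */
    static boolean termMatches(String regex, String term, String... flagNames) {
        int flags = 0;
        for (String flagName : flagNames) {
            flags |= flagFor(flagName); // flags combine with bitwise OR
        }
        return Pattern.compile(regex, flags).matcher(term).matches();
    }
}
----

Under these assumptions, `termMatches(".*pedist", "ORTHOPEDIST")` is false, while adding `"case_insensitive"` as a flag makes it true.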
== Terms Component Examples
All of the following sample queries work with Solr's "`bin/solr -e techproducts`" example.
=== Get Top 10 Terms
This query requests the ten terms with the highest document frequency in the name field:
[source,text]
http://localhost:8983/solr/techproducts/terms?terms.fl=name&wt=xml
Results:
[source,xml]
----
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="one">5</int>
<int name="184">3</int>
<int name="1gb">3</int>
<int name="3200">3</int>
<int name="400">3</int>
<int name="ddr">3</int>
<int name="gb">3</int>
<int name="ipod">3</int>
<int name="memory">3</int>
<int name="pc">3</int>
</lst>
</lst>
</response>
----
=== Get First 10 Terms Starting with Letter 'a'
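This query requests the first ten terms in the name field at or after `a`, in index order (instead of the top 10 results by document count). Because `terms.lower` sets only a starting point, terms beginning with later letters can also appear: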
[source,text]
http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.lower=a&terms.sort=index&wt=xml
Results:
[source,xml]
----
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="a">1</int>
<int name="all">1</int>
<int name="apple">1</int>
<int name="asus">1</int>
<int name="ata">1</int>
<int name="ati">1</int>
<int name="belkin">1</int>
<int name="black">1</int>
<int name="british">1</int>
<int name="cable">1</int>
</lst>
</lst>
</response>
----
=== SolrJ Invocation
[source,java]
----
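import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.TermsResponse.Term;

// Ask the /terms handler for up to 5 terms in "terms_s" that start with "s".
SolrQuery query = new SolrQuery();
query.setRequestHandler("/terms");
query.setTerms(true);
query.setTermsLimit(5);
query.setTermsLower("s");
query.setTermsPrefix("s");
query.addTermsField("terms_s");
query.setTermsMinCount(1);

// getSolrClient() is assumed to return an already configured SolrClient.
QueryRequest request = new QueryRequest(query);
List<Term> terms = request.process(getSolrClient()).getTermsResponse().getTerms("terms_s");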
----
== Using the Terms Component for an Auto-Suggest Feature
If the <<suggester.adoc#,Suggester>> doesn't suit your needs, you can use the Terms component in Solr to build a similar feature for your own search application. Simply submit a query specifying whatever characters the user has typed so far as a prefix. For example, if the user has typed "at", the search engine's interface would submit the following query:
[source,text]
http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.prefix=at&wt=xml
Result:
[source,xml]
----
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="ata">1</int>
<int name="ati">1</int>
</lst>
</lst>
</response>
----
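On the client side, the request above could be assembled along the following lines. This is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and the handler URL is the one from the example. User input is URL-encoded so that characters such as `&` or spaces cannot alter the query string.

[source,java]
----
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class TermsSuggestUrlSketch {

    /** Builds a /terms request URL that uses the typed characters as terms.prefix. */
    static String suggestUrl(String handlerUrl, String field, String typedSoFar) {
        return handlerUrl
            + "?terms.fl=" + URLEncoder.encode(field, StandardCharsets.UTF_8)
            + "&terms.prefix=" + URLEncoder.encode(typedSoFar, StandardCharsets.UTF_8)
            + "&wt=xml";
    }

    public static void main(String[] args) {
        System.out.println(
            suggestUrl("http://localhost:8983/solr/techproducts/terms", "name", "at"));
    }
}
----

The sketch passes the input through unchanged. Since the prefix is compared against indexed terms, an application may need to normalize what the user typed (for example, lowercasing it) to match how the field is analyzed.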
You can use the parameter `omitHeader=true` to omit the response header from the query response, as in this example, which also returns the response in JSON format:
[source,text]
http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.prefix=at&omitHeader=true
Result:
[source,json]
----
{
"terms": {
"name": [
"ata",
1,
"ati",
1
]
}
}
----
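In the JSON shown above, each field's terms arrive as a flat array that alternates between a term and its document frequency. A client typically needs to pair these up. The sketch below assumes the array has already been parsed into a `List` by whatever JSON library the application uses; the class and method names are hypothetical.

[source,java]
----
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FlatTermsListSketch {

    /**
     * Converts [term, count, term, count, ...] into an ordered map.
     * A LinkedHashMap preserves the order in which the terms were returned.
     */
    static Map<String, Long> toCounts(List<?> flat) {
        if (flat.size() % 2 != 0) {
            throw new IllegalArgumentException("expected alternating term/count entries");
        }
        Map<String, Long> counts = new LinkedHashMap<>();
        for (int i = 0; i < flat.size(); i += 2) {
            String term = String.valueOf(flat.get(i));
            long docFreq = ((Number) flat.get(i + 1)).longValue();
            counts.put(term, docFreq);
        }
        return counts;
    }
}
----

For the response above, `toCounts(List.of("ata", 1, "ati", 1))` yields a map with `ata` and `ati` each mapped to 1, in that order.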
== Distributed Search Support
The TermsComponent also supports distributed indexes. For the `/terms` request handler, you must provide the following two parameters:
`shards`::
Specifies the shards in your distributed indexing configuration. For more information about distributed indexing, see <<distributed-search-with-index-sharding.adoc#,Distributed Search with Index Sharding>>.
+
The `shards` parameter is subject to a host whitelist that has to be configured in the component's parameters using the configuration key `shardsWhitelist` and the list of hosts as values.
+
By default the whitelist will be populated with all live nodes when running in SolrCloud mode. If you need to disable this feature for backwards compatibility, you can set the system property `solr.disable.shardsWhitelist=true`.
+
See the section <<distributed-requests.adoc#configuring-the-shardhandlerfactory,Configuring the ShardHandlerFactory>> for more information about how the whitelist works.
`shards.qt`::
Specifies the request handler Solr uses for requests to shards.
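
As a sketch of how a client might put these two parameters together, the following code joins a list of shard addresses with commas and appends `shards.qt=/terms` to a terms request. The class and method names, and any shard addresses passed in, are hypothetical placeholders; real shards must also pass the whitelist described above.

[source,java]
----
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class DistributedTermsUrlSketch {

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Builds a /terms request URL carrying the shards and shards.qt parameters. */
    static String distributedTermsUrl(String handlerUrl, String field, List<String> shards) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        return handlerUrl
            + "?terms.fl=" + enc(field)
            + "&shards=" + enc(String.join(",", shards)) // comma-separated shard list
            + "&shards.qt=" + enc("/terms");             // handler used on each shard
    }
}
----

The shard list and handler path are percent-encoded here, which keeps the query string well-formed regardless of the characters in the shard addresses.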