blob: 2f05d08d0e9437be78d360f9c12f5168fa674232 [file] [log] [blame]
= DocValues
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
== Why DocValues?
The standard way that Solr builds the index is with an _inverted index_. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
== Enabling DocValues
To use docValues, you only need to enable it for a field that you will use it with. As with all schema design, you need to define a field type and then define fields of that type with docValues enabled. All of these actions are done in `schema.xml`.
Enabling a field for docValues only requires adding `docValues="true"` to the field (or field type) definition, as in this example from the `schema.xml` of Solr's `sample_techproducts_configs` <<config-sets.adoc#,configset>>:
[source,xml]
----
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
----
[IMPORTANT]
If you have already indexed data into your Solr index, you will need to completely reindex your content after changing your field definitions in `schema.xml` in order to successfully use docValues.
DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:
* `StrField`, and `UUIDField`:
** If the field is single-valued (i.e., multi-valued is false), Lucene will use the `SORTED` type.
** If the field is multi-valued, Lucene will use the `SORTED_SET` type. Entries are kept in sorted order and duplicates are removed.
* `BoolField`:
** If the field is single-valued (i.e., multi-valued is false), Lucene will use the `SORTED` type.
** If the field is multi-valued, Lucene will use the `SORTED_SET` type. Entries are kept in sorted order and duplicates are removed.
* Any `*PointField` Numeric or Date fields, `EnumFieldType`, and `CurrencyFieldType`:
** If the field is single-valued (i.e., multi-valued is false), Lucene will use the `NUMERIC` type.
** If the field is multi-valued, Lucene will use the `SORTED_NUMERIC` type. Entries are kept in sorted order and duplicates are kept.
* Any of the deprecated `Trie*` Numeric or Date fields, `EnumField` and `CurrencyField`:
** If the field is single-valued (i.e., multi-valued is false), Lucene will use the `NUMERIC` type.
** If the field is multi-valued, Lucene will use the `SORTED_SET` type. Entries are kept in sorted order and duplicates are removed.
These Lucene types are related to how the {lucene-javadocs}/core/org/apache/lucene/index/DocValuesType.html[values are sorted and stored].
There is an additional configuration option available, which is to modify the `docValuesFormat` <<field-type-definitions-and-properties.adoc#docvaluesformat,used by the field type>>. The default implementation employs a mixture of loading some things into memory and keeping some on disk. In some cases, however, you may choose to specify an alternative {lucene-javadocs}/core/org/apache/lucene/codecs/DocValuesFormat.html[DocValuesFormat implementation]. For example, you could choose to keep everything in memory by specifying `docValuesFormat="Direct"` on a field type:
[source,xml]
----
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Direct" />
----
Please note that the `docValuesFormat` option may change in future releases.
[NOTE]
Lucene index back-compatibility is only supported for the default codec. If you choose to customize the `docValuesFormat` in your `schema.xml`, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.
== Using DocValues
=== Sorting, Faceting & Functions
If `docValues="true"` for a field, then DocValues will automatically be used any time the field is used for <<common-query-parameters.adoc#sort-parameter,sorting>>, <<faceting.adoc#,faceting>> or <<function-queries.adoc#,function queries>>.
=== Retrieving DocValues During Search
Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields will be also returned along with other stored fields when all fields (or pattern matching globs) are specified to be returned (e.g., "`fl=*`") for search queries depending on the effective value of the `useDocValuesAsStored` parameter for each field. For schema versions >= 1.6, the implicit default is `useDocValuesAsStored="true"`. See <<field-type-definitions-and-properties.adoc#,Field Type Definitions and Properties>> & <<defining-fields.adoc#,Defining Fields>> for more details.
When `useDocValuesAsStored="false"`, non-stored DocValues fields can still be explicitly requested by name in the <<common-query-parameters.adoc#fl-field-list-parameter,`fl` parameter>>, but will not match glob patterns (`"*"`). Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields may not because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order rather than insertion order and may have duplicates removed, see above. If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires reindexing).
In cases where the query is returning _only_ docValues fields performance may improve since returning stored fields requires disk reads and decompression whereas returning docValues fields in the fl list only requires memory access.
When retrieving fields from their docValues form (such as when using the <<exporting-result-sets.adoc#,/export handler>>, <<streaming-expressions.adoc#,streaming expressions>> or if the field is requested in the `fl` parameter), two important differences between regular stored fields and docValues fields must be understood:
1. Order is _not_ preserved. When retrieving stored fields, the insertion order is the return order. For docValues, it is the _sorted_ order.
2. For field types using `SORTED_SET` (see above), multiple identical entries are collapsed into a single value. Thus if values 4, 5, 2, 4, 1 are inserted, the values returned will be 1, 2, 4, 5.