= Working with External Files and Processes
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
== The ExternalFileField Type
The `ExternalFileField` type makes it possible to specify the values for a field in a file outside the Solr index. For such a field, the file contains mappings from a key field to the field value. Another way to think of this is that, instead of specifying the field in documents as they are indexed, Solr finds values for this field in the external file.
[IMPORTANT]
====
External fields are not searchable. They can be used only for function queries or display. For more information on function queries, see the section on <<function-queries.adoc#,Function Queries>>.
====
The `ExternalFileField` type is handy for cases where you want to update a particular field in many documents more often than you want to update the rest of the documents. For example, suppose you have implemented a document rank based on the number of views. You might want to update the rank of all the documents daily or hourly, while the rest of the contents of the documents might be updated much less frequently. Without `ExternalFileField`, you would need to update each document just to change the rank. Using `ExternalFileField` is much more efficient because all document values for a particular field are stored in an external file that can be updated as frequently as you wish.
In `schema.xml`, the definition of this field type might look like this:
[source,xml]
----
<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField"/>
----
The `keyField` attribute defines the key that will be used in the external file. It is usually the unique key for the index, but it doesn't need to be, as long as the `keyField` can be used to identify documents in the index. The `defVal` attribute defines a default value that will be used if there is no entry in the external file for a particular document.
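A field using this type must also be declared in the schema. Below is a minimal sketch of such a declaration; the field is named `entryRankFile` here purely so that it matches the example file names in the next section:

[source,xml]
----
<field name="entryRankFile" type="entryRankFile"/>
----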
=== Format of the External File
The file itself is located in Solr's index directory, which by default is `$SOLR_HOME/data`. The name of the file should be `external_` followed by the name of the field, optionally with a file extension. For the example above, then, the file could be named `external_entryRankFile` or `external_entryRankFile.txt`.
[TIP]
====
If any files using the name pattern `.*` (such as `.txt`) appear, the last (after being sorted by name) will be used and previous versions will be deleted. This behavior supports implementations on systems where one may not be able to overwrite a file (for example, on Windows, if the file is in use).
====
The file contains entries that map a key field, on the left of the equals sign, to a value, on the right. Here are a few example entries:
[source,text]
----
doc33=1.414
doc34=3.14159
doc40=42
----
The keys listed in this file do not need to be unique. The file does not need to be sorted, but Solr will be able to perform the lookup faster if it is.
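Once the file is in place, the external values can be used in function queries, for example to return them alongside each document or to sort by them. The request below is only an illustrative sketch; the collection name `techproducts` is hypothetical, and it assumes a field named `entryRankFile` of the type defined above:

[source,text]
----
http://localhost:8983/solr/techproducts/select?q=*:*&fl=id,rank:field(entryRankFile)&sort=field(entryRankFile) desc
----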
=== Reloading an External File
It's possible to define an event listener to reload an external file when either a searcher is reloaded or when a new searcher is started. See the section <<query-settings-in-solrconfig.adoc#query-related-listeners,Query-Related Listeners>> for more information, but a sample definition in `solrconfig.xml` might look like this:
[source,xml]
----
<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
----
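Because these listeners run when a searcher is opened, replacing the external file and then opening a new searcher, for example by issuing a commit, causes the new values to be picked up. A hedged example, using a hypothetical collection name:

[source,text]
----
curl "http://localhost:8983/solr/techproducts/update?commit=true"
----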
== The PreAnalyzedField Type
The `PreAnalyzedField` type provides a way to send Solr serialized token streams, optionally with independent stored values of a field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful if you want to submit field content that was already processed by an existing external text processing pipeline (e.g., tokenized, annotated, stemmed, with synonyms inserted), while using all the rich attributes that Lucene's TokenStream provides (per-token attributes).
The serialization format is pluggable, using implementations of the PreAnalyzedParser interface. There are two out-of-the-box implementations:
* <<JsonPreAnalyzedParser>>: as the name suggests, it parses content that uses JSON to represent the field's content. This is the default parser used if the field type is not configured otherwise.
* <<SimplePreAnalyzedParser>>: uses a simple strict plain text format, which in some situations may be easier to create than JSON.
There is only one configuration parameter, `parserImpl`. The value of this parameter should be the fully qualified class name of a class that implements the PreAnalyzedParser interface. The default value of this parameter is `org.apache.solr.schema.JsonPreAnalyzedParser`.
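For example, a field type that uses the simple text format described below might be configured like this; the type name `pre_simple` is only illustrative:

[source,xml]
----
<fieldType name="pre_simple" class="solr.PreAnalyzedField"
           parserImpl="org.apache.solr.schema.SimplePreAnalyzedParser"/>
----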
By default, the query-time analyzer for fields of this type will be the same as the index-time analyzer, which expects serialized pre-analyzed text. You must add a query type analyzer to your fieldType in order to perform analysis on non-pre-analyzed queries. In the example below, the index-time analyzer expects the default JSON serialization format, and the query-time analyzer will employ StandardTokenizer/LowerCaseFilter:
[source,xml]
----
<fieldType name="pre_with_query_analyzer" class="solr.PreAnalyzedField">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
----
=== JsonPreAnalyzedParser
This is the default serialization format used by PreAnalyzedField type. It uses a top-level JSON map with the following keys:
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
[cols="20,60,20",options="header"]
|===
|Key |Description |Required
|`v` |Version key. Currently the supported version is `1`. |required
|`str` |Stored string value of a field. You can use at most one of `str` or `bin`. |optional
|`bin` |Stored binary value of a field. The binary value has to be Base64 encoded. |optional
|`tokens` |Serialized token stream. This is a JSON list. |optional
|===
Any other top-level key is silently ignored.
==== Token Stream Serialization
The token stream is expressed as a JSON list of JSON maps. The map for each token consists of the following keys and values:
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
[cols="10,20,20,30,20",options="header"]
|===
|Key |Description |Lucene Attribute |Value |Required?
|`t` |token |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/CharTermAttribute.html[CharTermAttribute] |UTF-8 string representing the current token |required
|`s` |start offset |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.html[OffsetAttribute] |Non-negative integer |optional
|`e` |end offset |OffsetAttribute |Non-negative integer |optional
|`i` |position increment |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html[PositionIncrementAttribute] |Non-negative integer - default is `1` |optional
|`p` |payload |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/PayloadAttribute.html[PayloadAttribute] |Base64 encoded payload |optional
|`y` |lexical type |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html[TypeAttribute] |UTF-8 string |optional
|`f` |flags |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/FlagsAttribute.html[FlagsAttribute] |String representing an integer value in hexadecimal format |optional
|===
Any other key is silently ignored.
==== JsonPreAnalyzedParser Example
[source,json]
----
{
"v":"1",
"str":"test ąćęłńóśźż",
"tokens": [
{"t":"two","s":5,"e":8,"i":1,"y":"word"},
{"t":"three","s":20,"e":22,"i":1,"y":"foobar"},
{"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"}
]
}
----
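When indexing, this serialized form is simply submitted as the string value of the field. Below is a sketch in Solr's XML update format; the field name `pre` and the document are hypothetical, and the field is assumed to use `solr.PreAnalyzedField` with the default JSON parser:

[source,xml]
----
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="pre">{"v":"1","str":"test","tokens":[{"t":"one","s":0,"e":3,"i":1}]}</field>
  </doc>
</add>
----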
=== SimplePreAnalyzedParser
The fully qualified class name to use when specifying this format via the `parserImpl` configuration parameter is `org.apache.solr.schema.SimplePreAnalyzedParser`.
==== SimplePreAnalyzedParser Syntax
The serialization format supported by this parser is as follows:
.Serialization format
[source,text]
----
content ::= version (stored)? tokens
version ::= digit+ " "
; stored field value - any "=" inside must be escaped!
stored ::= "=" text "="
tokens ::= (token ((" ") + token)*)*
token ::= text ("," attrib)*
attrib ::= name '=' value
name ::= text
value ::= text
----
Special characters in "text" values can be escaped using the escape character `\`. The following escape sequences are recognized:
[width="60%",options="header",]
|===
|Escape Sequence |Description
|`\` followed by a space |literal space character
|`\,` |literal `,` character
|`\=` |literal `=` character
|`\\` |literal `\` character
|`\n` |newline
|`\r` |carriage return
|`\t` |horizontal tab
|===
Please note that Unicode sequences (e.g., `\u0001`) are not supported.
==== Supported Attributes
The following token attributes are supported, and identified with short symbolic names:
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
[cols="10,30,30,30",options="header"]
|===
|Name |Description |Lucene attribute |Value format
|`i` |position increment |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html[PositionIncrementAttribute] |integer
|`s` |start offset |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.html[OffsetAttribute] |integer
|`e` |end offset |OffsetAttribute |integer
|`y` |lexical type |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html[TypeAttribute] |string
|`f` |flags |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/FlagsAttribute.html[FlagsAttribute] |hexadecimal integer
|`p` |payload |{lucene-javadocs}/core/org/apache/lucene/analysis/tokenattributes/PayloadAttribute.html[PayloadAttribute] |bytes in hexadecimal format; whitespace is ignored
|===
Token positions are tracked and implicitly added to the token stream - the start and end offsets consider only the term text and whitespace, and exclude the space taken by token attributes.
==== Example Token Streams
// TODO: in cwiki each of these examples was in its own "panel" ... do we want something like that here?
// TODO: these examples match what was in cwiki, but I'm honestly not sure if the formatting there was correct to start?
[source,text]
----
1 one two three
----
* version: 1
* stored: null
* token: (term=`one`,startOffset=0,endOffset=3)
* token: (term=`two`,startOffset=4,endOffset=7)
* token: (term=`three`,startOffset=8,endOffset=13)
[source,text]
----
1 one  two   three
----
* version: 1
* stored: null
* token: (term=`one`,startOffset=0,endOffset=3)
* token: (term=`two`,startOffset=5,endOffset=8)
* token: (term=`three`,startOffset=11,endOffset=16)
[source,text]
----
1 one,s=123,e=128,i=22 two three,s=20,e=22
----
* version: 1
* stored: null
* token: (term=`one`,positionIncrement=22,startOffset=123,endOffset=128)
* token: (term=`two`,positionIncrement=1,startOffset=5,endOffset=8)
* token: (term=`three`,positionIncrement=1,startOffset=20,endOffset=22)
[source,text]
----
1 \ one\ \,,i=22,a=\, two\=
\n,\ =\ \
----
* version: 1
* stored: null
* token: (term=`one ,`,positionIncrement=22,startOffset=0,endOffset=6)
* token: (term=`two=`,positionIncrement=1,startOffset=7,endOffset=15)
* token: (term=`\`,positionIncrement=1,startOffset=17,endOffset=18)
Note that unknown attributes and their values are ignored, so in this example, the "```a```" attribute on the first token and the " " (escaped space) attribute on the second token are ignored, along with their values, because they are not among the supported attribute names.
[source,text]
----
1 ,i=22 ,i=33,s=2,e=20 ,
----
* version: 1
* stored: null
* token: (term=,positionIncrement=22,startOffset=0,endOffset=0)
* token: (term=,positionIncrement=33,startOffset=2,endOffset=20)
* token: (term=,positionIncrement=1,startOffset=2,endOffset=2)
[source,text]
----
1 =This is the stored part with \=
\n \t escapes.=one two three
----
* version: 1
* stored: `This is the stored part with = \n \t escapes.`
* token: (term=`one`,startOffset=0,endOffset=3)
* token: (term=`two`,startOffset=4,endOffset=7)
* token: (term=`three`,startOffset=8,endOffset=13)
Note that the `\n` and `\t` in the above stored value are not literal; they're shown that way to visually indicate the actual newline and tab characters that are in the stored value.
[source,text]
----
1 ==
----
* version: 1
* stored: ""
* (no tokens)
[source,text]
----
1 =this is a test.=
----
* version: 1
* stored: `this is a test.`
* (no tokens)
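As with the JSON format, the serialized text is simply submitted as the field's value when indexing. Below is a hedged sketch in XML update format; the field name `pre_simple` and the document are hypothetical, and the field is assumed to use a type configured with `SimplePreAnalyzedParser`:

[source,xml]
----
<add>
  <doc>
    <field name="id">doc2</field>
    <field name="pre_simple">1 =stored value= one two,i=2 three,s=10,e=15</field>
  </doc>
</add>
----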