| = The Tagger Handler |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| The Tagger request handler, AKA the "SolrTextTagger", is a "text tagger". |
| |
| Given a dictionary (a Solr index) with a name-like field, |
| you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. |
| It's used for named entity recognition (NER). |
| |
| The tagger doesn't do any natural language processing (NLP) (outside of Lucene text analysis) so it's considered a "naive tagger", |
| but it's definitely useful as-is and a more complete NER or ERD (entity recognition and disambiguation) |
| system can be built with this as a key component. |
| The SolrTextTagger might be used on queries for query-understanding or large documents as well. |
| |
| To get a sense of how to use it, jump to the <<tutorial-with-geonames,tutorial>> below. |
| |
| The Tagger request handler *does not* yet support a sharded index. |
| It can be used in a cluster running in SolrCloud mode, but the collection that |
| stores the tag dictionary must be a single-sharded collection. |
| Despite this limitation, tens to hundreds of millions of names (documents) can |
| be supported; the maximum is mostly limited only by memory. |
| |
| == Tagger Configuration |
| |
| To configure the tagger, your Solr schema needs 2 fields: |
| |
| * A unique key field (see <<other-schema-elements.adoc#unique-key,Unique Key>> for how to define a unique key in your schema). |
| Recommended field settings: set `docValues=true`. |
| * A tag field, which must be a `TextField`, with `ConcatenateGraphFilterFactory` at the end of the index chain (not the query chain): |
| Set `preservePositionIncrements=false` on that filter. |
| Recommended field settings: `omitNorms=true`, `omitTermFreqAndPositions=true` and _maybe_ specify the postings format -- see <<tagger-performance-tips,performance tips>>. |
| |
| The text field's _index analysis chain_, aside from needing `ConcatenateGraphFilterFactory` at the end, |
| can otherwise have whatever tokenizer and filters suit your matching preferences. |
| It can have multi-word synonyms and use `WordDelimiterGraphFilterFactory` for example. |
| However, do _not_ use `FlattenGraphFilterFactory` as it will interfere with `ConcatenateGraphFilterFactory`. |
| Position gaps (e.g., stop words) get ignored; it's not (yet) supported for the gap to be significant. |
| |
| The text field's _query analysis chain_, on the other hand, is more limited. |
| There should not be tokens at the same position, thus no synonym expansion -- do that at index time instead. |
| Stop words (or any other filter introducing a position gap) are supported. |
| At runtime the tagger can be configured to either treat it as a tag break or to ignore it. |
| |
| Your `solrconfig.xml` needs the `solr.TagRequestHandler` defined, which supports `defaults`, `invariants`, and `appends` |
| sections just like the search handler. |
| |
| For configuration examples, jump to the <<tutorial-with-geonames,tutorial>> below. |
| |
| == Tagger Parameters |
| |
| The tagger's execution is completely configurable with request parameters. Only `field` is required. |
| |
| `field`:: |
| The tag field that serves as the dictionary. |
| This is required; you'll probably specify it in the request handler. |
| |
| `fq`:: |
| You can specify some number of _filter queries_ to limit the dictionary used for tagging. |
| This parameter is the same one used by the `solr.SearchHandler`. |
| |
| `rows`:: |
| The maximum number of documents to return, but defaulting to 10000 for a tag request. |
| This parameter is the same as is used by the `solr.SearchHandler`. |
| |
| `fl`:: |
| Solr's standard parameter for listing the fields to return. |
| This parameter is the same one used by the `solr.SearchHandler`. |
| |
| `overlaps`:: |
| Choose the algorithm to determine which tags in an overlapping set should be retained, versus being pruned away. |
| Options are: |
| |
| * `ALL`: Emit all tags. |
| * `NO_SUB`: Don't emit a tag that is completely within another tag (i.e., no subtag). |
| * `LONGEST_DOMINANT_RIGHT`: Given a cluster of overlapping tags, emit the longest one (by character length). |
| If there is a tie, pick the right-most. |
| Remove any tags overlapping with this tag then repeat the algorithm to potentially find other tags that can be emitted in the cluster. |
| |
| `matchText`:: |
| A boolean indicating whether to return the matched text in the tag response. |
| This will trigger the tagger to fully buffer the input before tagging. |
| |
| `tagsLimit`:: |
| The maximum number of tags to return in the response. |
| Tagging effectively stops after this point. |
| By default this is `1000`. |
| |
| `skipAltTokens`:: |
| A boolean flag used to suppress errors that can occur if, for example, |
| you enable synonym expansion at query time in the analyzer, which you normally shouldn't do. |
| Let this default to false unless you know that such tokens can't be avoided. |
| |
| `ignoreStopwords`:: |
| A boolean flag that causes stopwords (or any condition causing positions to skip like >255 char words) |
| to be ignored as if they aren't there. |
| Otherwise, the behavior is to treat them as breaks in tagging on the presumption your indexed text-analysis |
| configuration doesn't have a `StopWordFilter` defined. |
| By default the indexed analysis chain is checked for the presence of a `StopWordFilter` and if found |
| then `ignoreStopWords` is true if unspecified. |
| You probably shouldn't have a `StopWordFilter` configured and probably won't need to set this parameter either. |
| |
| `xmlOffsetAdjust`:: |
| A boolean indicating that the input is XML and furthermore that the offsets of returned tags should be adjusted as |
| necessary to allow for the client to insert an opening and closing element at the tag offset pair. |
| If it isn't possible to do so then the tag will be omitted. |
| You are expected to configure `HTMLStripCharFilterFactory` in the schema when using this option. |
| This will trigger the tagger to fully buffer the input before tagging. |
| |
| Solr's parameters for controlling the response format are also supported, such as `echoParams`, `wt`, `indent`, etc. |
| |
| == Tutorial with Geonames |
| |
| This is a tutorial that demonstrates how to configure and use the text |
| tagger with the popular http://www.geonames.org/[Geonames] data set. It's more than a tutorial; |
| it's a how-to with information that wasn't described above. |
| |
| === Create and Configure a Solr Collection |
| |
| Create a Solr collection named "geonames". For the tutorial, we'll |
| assume the default "data-driven" configuration. It's good for |
| experimentation and getting going fast but not for production or being |
| optimal. |
| |
| [source,bash] |
| bin/solr create -c geonames |
| |
| ==== Configuring the Tagger |
| |
| We need to configure the schema first. The "data driven" mode we're |
| using allows us to keep this step fairly minimal -- we just need to |
| declare a field type, 2 fields, and a copy-field. |
| |
| The critical part |
| up-front is to define the "tag" field type. There are many many ways to |
| configure text analysis; and we're not going to get into those choices |
| here. But an important bit is the `ConcatenateGraphFilterFactory` at the |
| end of the index analyzer chain. Another important bit for performance |
| is `postingsFormat=FST50` resulting in a compact FST based in-memory data |
| structure that is especially beneficial for the text tagger. |
| |
| Schema configuration: |
| |
| [source,bash] |
| ---- |
| curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/geonames/schema -d '{ |
| "add-field-type":{ |
| "name":"tag", |
| "class":"solr.TextField", |
| "postingsFormat":"FST50", |
| "omitNorms":true, |
| "omitTermFreqAndPositions":true, |
| "indexAnalyzer":{ |
| "tokenizer":{ |
| "class":"solr.StandardTokenizerFactory" }, |
| "filters":[ |
| {"class":"solr.EnglishPossessiveFilterFactory"}, |
| {"class":"solr.ASCIIFoldingFilterFactory"}, |
| {"class":"solr.LowerCaseFilterFactory"}, |
| {"class":"solr.ConcatenateGraphFilterFactory", "preservePositionIncrements":false } |
| ]}, |
| "queryAnalyzer":{ |
| "tokenizer":{ |
| "class":"solr.StandardTokenizerFactory" }, |
| "filters":[ |
| {"class":"solr.EnglishPossessiveFilterFactory"}, |
| {"class":"solr.ASCIIFoldingFilterFactory"}, |
| {"class":"solr.LowerCaseFilterFactory"} |
| ]} |
| }, |
| |
| "add-field":{"name":"name", "type":"text_general"}, |
| |
| "add-field":{"name":"name_tag", "type":"tag", "stored":false }, |
| |
| "add-copy-field":{"source":"name", "dest":["name_tag"]} |
| }' |
| ---- |
| |
| Configure a custom Solr Request Handler: |
| |
| [source,bash] |
| ---- |
| curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/geonames/config -d '{ |
| "add-requesthandler" : { |
| "name": "/tag", |
| "class":"solr.TaggerRequestHandler", |
| "defaults":{"field":"name_tag"} |
| } |
| }' |
| ---- |
| |
| [[tagger-load-some-sample-data]] |
| === Load Some Sample Data |
| |
| We'll go with some Geonames.org data in CSV format. Solr is quite |
| flexible in loading data in a variety of formats. This |
| http://download.geonames.org/export/dump/cities1000.zip[cities1000.zip] |
| should be almost 7MB file expanding to a cities1000.txt file around |
| 22.2MB containing 145k lines, each a city in the world of at least 1000 |
| population. |
| |
| Using bin/post: |
| [source,bash] |
| ---- |
| bin/post -c geonames -type text/csv \ |
| -params 'optimize=true&maxSegments=1&separator=%09&encapsulator=%00&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate' \ |
| /tmp/cities1000.txt |
| ---- |
| |
| or using curl: |
| |
| [source,bash] |
| ---- |
| curl -X POST --data-binary @/path/to/cities1000.txt -H 'Content-type:application/csv' \ |
| 'http://localhost:8983/solr/geonames/update?commit=true&optimize=true&maxSegments=1&separator=%09&encapsulator=%00&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate' |
| ---- |
| |
| That might take around 35 seconds; it depends. It can be a lot faster if |
| the schema were tuned to only have what we truly need (no text search if |
| not needed). |
| |
| In that command we said `optimize=true&maxSegments=1` to put the index in a state that |
| will make tagging faster. The `encapsulator=%00` is a bit of a hack to |
| disable the default double-quote. |
| |
| === Tag Time! |
| |
| This is a trivial example tagging a small piece of text. For more |
| options, see the earlier documentation. |
| |
| [source,bash] |
| ---- |
| curl -X POST \ |
| 'http://localhost:8983/solr/geonames/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id,name,countrycode&wt=json&indent=on' \ |
| -H 'Content-Type:text/plain' -d 'Hello New York City' |
| ---- |
| |
| The response should be this (the QTime may vary): |
| |
| [source,json] |
| ---- |
| { |
| "responseHeader":{ |
| "status":0, |
| "QTime":1}, |
| "tagsCount":1, |
| "tags":[[ |
| "startOffset",6, |
| "endOffset",19, |
| "ids",["5128581"]]], |
| "response":{"numFound":1,"start":0,"docs":[ |
| { |
| "id":"5128581", |
| "name":["New York City"], |
| "countrycode":["US"]}] |
| }} |
| ---- |
| |
| == Tagger Performance Tips |
| |
| * Follow the recommended configuration field settings above. |
| Additionally, for the best tagger performance, set `postingsFormat=FST50`. |
| However, non-default postings formats have no backwards-compatibility guarantees, and so if you upgrade Solr then you may find a nasty exception on startup as it fails to read the older index. |
| If the input text to be tagged is small (e.g., you are tagging queries or tweets) then the postings format choice isn't as important. |
| * "optimize" after loading your dictionary down to 1 Lucene segment, or at least to as few as possible. |
| * For bulk tagging lots of documents, there are some strategies, not mutually exclusive: |
| ** Batch them. |
| The tagger doesn't directly support batching but as a hack you can send a bunch of documents concatenated with |
| a nonsense word that is not in the dictionary like "ZZYYXXAABBCC" between them. |
| You'll need to keep track of the character offsets of these so you can subtract them from the results. |
| ** For reducing tagging latency even further, consider embedding Solr with `EmbeddedSolrServer`. |
| See `EmbeddedSolrNoSerializeTest`. |
| ** Use more than one thread -- perhaps as many as there are CPU cores available to Solr. |