[[the-tagger-handler]]
= The Tagger Handler
The "Tagger" Request Handler, AKA the "SolrTextTagger", is a "text tagger".
Given a dictionary (a Solr index) with a name-like field,
you can post text to this request handler and it will return every occurrence of one of those names found in the input, along with offsets and any other document metadata desired.
It's used for named entity recognition (NER).
It doesn't do any NLP (outside of Lucene text analysis) so it's said to be a "naive tagger",
but it's definitely useful as-is and a more complete NER or ERD (entity recognition and disambiguation)
system can be built with this as a key component.
The SolrTextTagger might be used on queries for query-understanding, or on large documents as well.
To get a sense of how to use it, jump to the tutorial below.
The tagger does not yet support a sharded index.
Tens, perhaps hundreds of millions of names (documents) are supported, mostly limited by memory.
[[tagger-configuration]]
== Configuration
The Solr schema needs 2 things:
* A unique key field (see `<uniqueKey>`).
Recommended field settings: set `docValues=true`
* A tag field, a TextField, with `ConcatenateGraphFilterFactory` at the end of the index chain (not the query chain).
Set `preservePositionIncrements=false` on that filter.
Recommended field settings: `omitNorms=true`, `omitTermFreqAndPositions=true` and `postingsFormat=FST50`
The text field's _index analysis chain_, aside from needing ConcatenateGraphFilterFactory at the end,
can otherwise have whatever tokenizer and filters suit your matching preferences.
It can have multi-word synonyms and use WordDelimiterGraphFilterFactory for example.
However, do _not_ use FlattenGraphFilterFactory as it will interfere with ConcatenateGraphFilterFactory.
Position gaps (e.g. stop words) get ignored; it's not (yet) supported for the gap to be significant.
The text field's _query analysis chain_, on the other hand, is more limited.
There should not be tokens at the same position, thus no synonym expansion -- do that at index time instead.
Stop words (or any other filter introducing a position gap) are supported.
At runtime the tagger can be configured to either treat it as a tag break or to ignore it.
The Solr config needs the `solr.TaggerRequestHandler` defined, which supports `defaults`, `invariants`, and `appends`
sections, just like the search handler.
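If you manage `solrconfig.xml` directly instead of using the Config API, the handler definition might look like this sketch (the handler name `/tag` and the default field `name_tag` are illustrative, matching the tutorial below):

....
<requestHandler name="/tag" class="solr.TaggerRequestHandler">
  <lst name="defaults">
    <str name="field">name_tag</str>
  </lst>
</requestHandler>
....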
[[tagger-parameters]]
== Tagger Parameters
The tagger's execution is completely configurable with request parameters. Only `field` is required.
`field`::
The tag field that serves as the dictionary.
This is required; you'll probably specify it in the request handler.
`fq`::
You can specify some number of _filter queries_ to limit the dictionary used for tagging.
This parameter is the same as is used by the `solr.SearchHandler`.
`rows`::
The maximum number of documents to return, defaulting to 10000 for a tag request (much higher than Solr's usual default of 10).
This parameter is the same as is used by the `solr.SearchHandler`.
`fl`::
Solr's standard param for listing the fields to return.
This parameter is the same as is used by the `solr.SearchHandler`.
`overlaps`::
Choose the algorithm to determine which tags in an overlapping set should be retained, versus being pruned away.
Options are:
* `ALL`: Emit all tags.
* `NO_SUB`: Don't emit a tag that is completely within another tag (i.e. no subtag).
* `LONGEST_DOMINANT_RIGHT`: Given a cluster of overlapping tags, emit the longest one (by character length).
If there is a tie, pick the right-most.
Remove any tags overlapping with this tag then repeat the algorithm to potentially find other tags
that can be emitted in the cluster.
`matchText`::
A boolean indicating whether to return the matched text in the tag response.
This will trigger the tagger to fully buffer the input before tagging.
`tagsLimit`::
The maximum number of tags to return in the response.
Tagging effectively stops after this point.
By default this is 1000.
`skipAltTokens`::
A boolean flag used to suppress errors that can occur if, for example,
you enable synonym expansion at query time in the analyzer, which you normally shouldn't do.
Let this default to false unless you know that such tokens can't be avoided.
`ignoreStopwords`::
A boolean flag that causes stopwords (or any condition causing positions to skip, such as >255 char words)
to be ignored as if they weren't there.
Otherwise, the behavior is to treat them as breaks in tagging on the presumption your indexed text-analysis
configuration doesn't have a StopWordFilter.
By default, the indexed analysis chain is checked for the presence of a StopWordFilter; if one is found,
then `ignoreStopwords` defaults to true.
You probably shouldn't have a StopWordFilter configured and probably won't need to set this param either.
`xmlOffsetAdjust`::
A boolean indicating that the input is XML and, furthermore, that the offsets of returned tags should be adjusted as
necessary to allow for the client to insert an opening and closing element at the tag offset pair.
If it isn't possible to do so then the tag will be omitted.
You are expected to configure `HTMLStripCharFilterFactory` in the schema when using this option.
This will trigger the tagger to fully buffer the input before tagging.
Solr's parameters for controlling the response format are also supported, such as
`echoParams`, `wt`, and `indent`.
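To make the `LONGEST_DOMINANT_RIGHT` option concrete, here is a rough Python sketch of the pruning rule described above. This is an illustration only, not Solr's actual implementation; tags are represented as `(startOffset, endOffset)` pairs:

```python
def longest_dominant_right(tags):
    """Prune overlapping (start, end) tags: repeatedly keep the longest
    tag (the right-most one on a tie) and drop anything overlapping it."""
    remaining = sorted(tags)
    kept = []
    while remaining:
        # Longest by character length; on a tie, the right-most (largest start).
        best = max(remaining, key=lambda t: (t[1] - t[0], t[0]))
        kept.append(best)
        # Remove the winner and everything overlapping it, then repeat.
        remaining = [t for t in remaining
                     if t[1] <= best[0] or t[0] >= best[1]]
    return sorted(kept)
```

Note that tags which merely touch (one ends where the next begins) do not overlap, so both survive.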
[[tagger-tutorial-with-geonames]]
== Tutorial with Geonames
This is a tutorial that demonstrates how to configure and use the text
tagger with the popular Geonames data set. It's more than a tutorial;
it's a how-to with information that wasn't described above.
[[tagger-create-and-configure-a-solr-collection]]
=== Create and Configure a Solr Collection
Create a Solr collection named "geonames". For the tutorial, we'll
assume the default "data-driven" configuration. It's good for
experimentation and getting going fast, but it is neither optimal nor
intended for production.
....
bin/solr create -c geonames
....
[[tagger-configuring]]
==== Configuring
We need to configure the schema first. The "data driven" mode we're
using allows us to keep this step fairly minimal -- we just need to
declare a field type, 2 fields, and a copy-field. The critical part
up-front is to define the "tag" field type. There are many ways to
configure text analysis, and we're not going to get into those choices
here, but an important bit is the `ConcatenateGraphFilterFactory` at the
end of the index analyzer chain. Another important bit for performance
is `postingsFormat=FST50`, resulting in a compact FST-based in-memory data
structure that is especially beneficial for the text tagger.
Schema configuration:
....
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/geonames/schema -d '{
"add-field-type":{
"name":"tag",
"class":"solr.TextField",
"postingsFormat":"FST50",
"omitNorms":true,
"omitTermFreqAndPositions":true,
"indexAnalyzer":{
"tokenizer":{
"class":"solr.StandardTokenizerFactory" },
"filters":[
{"class":"solr.EnglishPossessiveFilterFactory"},
{"class":"solr.ASCIIFoldingFilterFactory"},
{"class":"solr.LowerCaseFilterFactory"},
{"class":"solr.ConcatenateGraphFilterFactory", "preservePositionIncrements":false }
]},
"queryAnalyzer":{
"tokenizer":{
"class":"solr.StandardTokenizerFactory" },
"filters":[
{"class":"solr.EnglishPossessiveFilterFactory"},
{"class":"solr.ASCIIFoldingFilterFactory"},
{"class":"solr.LowerCaseFilterFactory"}
]}
},
"add-field":{ "name":"name", "type":"text_general"},
"add-field":{ "name":"name_tag", "type":"tag", "stored":false },
"add-copy-field":{ "source":"name", "dest":[ "name_tag" ]}
}'
....
Configure a custom Solr Request Handler:
....
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/geonames/config -d '{
"add-requesthandler" : {
"name": "/tag",
"class":"solr.TaggerRequestHandler",
"defaults":{ "field":"name_tag" }
}
}'
....
[[tagger-load-some-sample-data]]
=== Load Some Sample Data
We'll go with some Geonames.org data in CSV format. Solr is quite
flexible in loading data in a variety of formats. This
http://download.geonames.org/export/dump/cities1000.zip[cities1000.zip]
file is almost 7MB, expanding to a cities1000.txt file of around
22.2MB containing 145k lines, each a city in the world with at least
1000 population.
Using bin/post:
....
bin/post -c geonames -type text/csv \
-params 'optimize=true&separator=%09&encapsulator=%00&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate' \
/tmp/cities1000.txt
....
or using curl:
....
curl -X POST --data-binary @/path/to/cities1000.txt -H 'Content-type:application/csv' \
'http://localhost:8983/solr/geonames/update?commit=true&optimize=true&separator=%09&encapsulator=%00&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate'
....
That might take around 35 seconds; it depends. It could be a lot faster
if the schema were tuned to have only what we truly need (e.g., no text
search if it's not needed).
In that command, we set `optimize=true` to put the index into a state that
will make tagging faster. `encapsulator=%00` is a bit of a hack to
disable the default double-quote encapsulator.
[[tagger-tag-time]]
=== Tag Time!
This is a trivial example tagging a small piece of text. For more
options, see the earlier documentation.
....
curl -X POST \
'http://localhost:8983/solr/geonames/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id,name,countrycode&wt=json&indent=on' \
-H 'Content-Type:text/plain' -d 'Hello New York City'
....
The response should be this (the QTime may vary):
....
{
"responseHeader":{
"status":0,
"QTime":1},
"tagsCount":1,
"tags":[[
"startOffset",6,
"endOffset",19,
"ids",["5128581"]]],
"response":{"numFound":1,"start":0,"docs":[
{
"id":"5128581",
"name":["New York City"],
"countrycode":["US"]}]
}}
....
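The `tags` array is a flat list of alternating keys and values. A small Python sketch (illustrative client-side code, not part of Solr) that folds each tag into a dict and attaches the matching documents by id:

```python
import json

def parse_tags(response_json):
    """Convert the tagger's flat [key, value, key, value, ...] tag arrays
    into dicts, and attach the matching documents by id."""
    resp = json.loads(response_json)
    docs_by_id = {doc["id"]: doc for doc in resp["response"]["docs"]}
    tags = []
    for flat in resp["tags"]:
        # Every even element is a key, every odd element its value.
        tag = dict(zip(flat[0::2], flat[1::2]))
        tag["docs"] = [docs_by_id[i] for i in tag.get("ids", [])]
        tags.append(tag)
    return tags
```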
[[tagger-tips]]
== Tips
Performance Tips:
* Follow the recommended configuration field settings, especially `postingsFormat=FST50`.
* "Optimize" after loading your dictionary, getting down to 1 Lucene segment, or at least to as few segments as possible.
* For bulk tagging lots of documents, there are some strategies, not mutually exclusive:
** Batch them.
The tagger doesn't directly support batching, but as a hack you can send a bunch of documents concatenated with
a nonsense word that is not in the dictionary, like "ZZYYXXAABBCC", between them.
You'll need to keep track of the character offsets of these so you can subtract them from the results.
** For reducing tagging latency even further, consider embedding Solr with `EmbeddedSolrServer`.
See `EmbeddedSolrNoSerializeTest`.
** Use more than one thread -- perhaps as many as there are CPU cores available to Solr.
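The batching hack above can be sketched in Python; the sentinel and helper names are illustrative:

```python
import bisect

# The sentinel must be a "word" that can never match the dictionary.
SENTINEL = " ZZYYXXAABBCC "

def batch(docs):
    """Join docs with the sentinel; return the joined text and each
    doc's starting character offset within it."""
    starts, pos = [], 0
    for d in docs:
        starts.append(pos)
        pos += len(d) + len(SENTINEL)
    return SENTINEL.join(docs), starts

def to_doc_offset(global_offset, starts):
    """Map a tag offset in the joined text back to (doc_index, local_offset)."""
    i = bisect.bisect_right(starts, global_offset) - 1
    return i, global_offset - starts[i]
```

You would post the joined text to the tagger in one request, then run each returned offset through `to_doc_offset` to recover the originating document.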