= Text Analysis and Term Vectors
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Term frequency-inverse document frequency (TF-IDF) term vectors are often used to
represent text documents when performing text mining and machine learning operations. The math expressions
library can be used to perform text analysis and create TF-IDF term vectors.
== Text Analysis
The `analyze` function applies a Solr analyzer to a text field and returns the tokens
emitted by the analyzer in an array. Any analyzer chain that is attached to a field in Solr's
schema can be used with the `analyze` function.
In the example below, the text "hello world" is analyzed using the analyzer chain attached to the `subject` field in
the schema. Because the `subject` field is defined as the field type `text_general`, the text is analyzed using the
analysis chain configured for that field type.
[source,text]
----
analyze("hello world", subject)
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "return-value": [
          "hello",
          "world"
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
----
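Expressions such as this are sent to the `/stream` handler of a collection as the `expr` parameter of an
HTTP request. A minimal example with curl is shown below; the host, port, and collection name are
assumptions for illustration.

[source,bash]
----
curl --data-urlencode 'expr=analyze("hello world", subject)' \
     "http://localhost:8983/solr/collection1/stream"
----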
=== Annotating Documents
The `analyze` function can be used inside of a `select` function to annotate documents with the tokens
generated by the analysis.
The example below performs a `search` in "collection1". Each tuple returned by the `search` function
contains an `id` and `subject`. For each tuple, the
`select` function selects the `id` field and calls the `analyze` function on the `subject` field.
The analyzer chain attached to the `subject_bigram` field is configured to perform bigram analysis.
The tokens generated by the `analyze` function are added to each tuple in a field called `terms`.
[source,text]
----
select(search(collection1, q="*:*", fl="id, subject", sort="id asc"),
       id,
       analyze(subject, subject_bigram) as terms)
----
Notice in the output that an array of bigram terms has been added to each tuple:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "terms": [
          "text analysis",
          "analysis example"
        ],
        "id": "1"
      },
      {
        "terms": [
          "example number",
          "number two"
        ],
        "id": "2"
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 4
      }
    ]
  }
}
----
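For reference, bigram analysis such as that assumed for the `subject_bigram` field can be configured with
Lucene's `ShingleFilterFactory`. The field type below is a hypothetical sketch, not the exact definition
used to produce these results.

[source,xml]
----
<!-- Hypothetical field type that emits only two-word shingles (bigrams) -->
<fieldType name="text_bigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>
----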
== TF-IDF Term Vectors
The `termVectors` function can be used to build TF-IDF term vectors from the terms generated by the `analyze` function.
The `termVectors` function operates over a list of tuples that contain a field called `id` and a field called `terms`.
Notice that this is the exact output structure of the document annotation example above.
The `termVectors` function builds a matrix from the list of tuples. There is a row in the
matrix for each tuple in the list and a column in the matrix for each term in the `terms` field.

The example below builds on the document annotation example.
[source,text]
----
let(echo="c, d", <1>
    a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"), <2>
             id,
             analyze(subject, subject_bigram) as terms),
    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1), <3>
    c=getRowLabels(b), <4>
    d=getColumnLabels(b))
----
<1> The `echo` parameter will echo variables *`c`* and *`d`*, so the output includes
the row and column labels, which are defined later in the expression.
<2> The list of annotated tuples is stored in variable *`a`*.
<3> The `termVectors` function operates over variable *`a`* and builds a matrix with two rows and
four columns, which is stored in variable *`b`*. The row labels are the document IDs and the column labels are the terms.
<4> The `getRowLabels` and `getColumnLabels` functions return
the row and column labels, which are then stored in variables *`c`* and *`d`*.
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "c": [
          "1",
          "2"
        ],
        "d": [
          "analysis example",
          "example number",
          "number two",
          "text analysis"
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 5
      }
    ]
  }
}
----
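Each row of the matrix is the term vector for the document with the corresponding row label. As a sketch
of how an individual vector can be retrieved, the expression below assumes the `rowAt` matrix function
and pulls out the term vector for the first row (document "1"):

[source,text]
----
let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
             id,
             analyze(subject, subject_bigram) as terms),
    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1),
    c=rowAt(b, 0))
----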
=== TF-IDF Values
The values within the term vectors matrix are the TF-IDF values for each term in each document. The
example below shows the values of the matrix.
[source,text]
----
let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
             id,
             analyze(subject, subject_bigram) as terms),
    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1))
----
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "b": [
          [
            1.4054651081081644,
            0,
            0,
            1.4054651081081644
          ],
          [
            0,
            1.4054651081081644,
            1.4054651081081644,
            0
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 5
      }
    ]
  }
}
----
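The repeated value 1.4054651081081644 is consistent with a smoothed TF-IDF weighting; the exact formula
is an implementation detail of the `termVectors` function, so the derivation below is a sketch rather
than a specification. In this example there are latexmath:[N = 2] documents, each bigram appears in
latexmath:[df = 1] document, and each has a term frequency of latexmath:[tf = 1]:

[latexmath]
++++
\text{tf-idf}(t,d) = tf \cdot \left(\ln\frac{N+1}{df+1} + 1\right) = 1 \cdot \left(\ln\frac{3}{2} + 1\right) \approx 1.4054651081081644
++++

Terms that do not appear in a document receive a value of 0 in that document's row.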
=== Limiting the Noise
One of the key challenges when working with term vectors is that text often has a significant amount of noise
which can obscure the important terms in the data. The `termVectors` function has several parameters
designed to filter out the less meaningful terms. This is also important because eliminating
the noisy terms helps keep the term vector matrix small enough to fit comfortably in memory.
There are four parameters designed to filter noisy terms from the term vector matrix:
`minTermLength`::
The minimum term length required to include the term in the matrix.

`minDocFreq`::
The minimum percentage, expressed as a number between 0 and 1, of documents the term must appear in to be included in the matrix.

`maxDocFreq`::
The maximum percentage, expressed as a number between 0 and 1, of documents the term can appear in to be included in the matrix.

`exclude`::
A comma-delimited list of strings used to exclude terms. If a term contains any of the exclude strings, that
term will be excluded from the term vector.