| = Text Analysis and Term Vectors |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| This section of the user guide presents an overview of the text analysis, text analytics |
| and TF-IDF term vector functions in math expressions. |
| |
| == Text Analysis |
| |
| The `analyze` function applies a Solr analyzer to a text field and returns the tokens |
| emitted by the analyzer in an array. Any analyzer chain that is attached to a field in Solr's |
| schema can be used with the `analyze` function. |
| |
| In the example below, the text "hello world" is analyzed using the analyzer chain attached to the `subject` field in |
| the schema. The `subject` field is defined as the field type `text_general` and the text is analyzed using the |
| analysis chain configured for the `text_general` field type. |
| |
| [source,text] |
| ---- |
| analyze("hello world", subject) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "return-value": [ |
| "hello", |
| "world" |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 0 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| |
| === Annotating Documents |
| |
| The `analyze` function can be used inside of a `select` function to annotate documents with the tokens generated by the analysis. |
| |
| The example below performs a `search` in "collection1". |
| Each tuple returned by the `search` function contains an `id` and `subject`. |
| For each tuple, the `select` function selects the `id` field and calls the `analyze` function on the `subject` field. |
| The analyzer chain specified by the `subject_bigram` field is configured to perform a bigram analysis. |
| The tokens generated by the `analyze` function are added to each tuple in a field called `terms`. |
| |
| |
| [source,text] |
| ---- |
| select(search(collection1, q="*:*", fl="id, subject", sort="id asc"), |
| id, |
| analyze(subject, subject_bigram) as terms) |
| ---- |
| |
| Notice in the output that an array of bigram terms have been added to the tuples: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "terms": [ |
| "text analysis", |
| "analysis example" |
| ], |
| "id": "1" |
| }, |
| { |
| "terms": [ |
| "example number", |
| "number two" |
| ], |
| "id": "2" |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 4 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Text Analytics |
| |
| The `cartesianProduct` function can be used in conjunction |
| with the `analyze` function to perform a wide range |
| of text analytics. |
| |
| The `cartesianProduct` function explodes a multivalued field into a stream of tuples. |
| When the `analyze` function is used to create the multivalued field, the `cartesianProduct` function will explode the analyzed tokens into a stream of tuples. |
| This allows analytics to be performed over the stream of analyzed tokens and the result to be visualized with Zeppelin-Solr. |
| |
| *Example: Phrase Aggregation* |
| |
| An example performing phrase aggregation is used to illustrate the power of combining `cartesianProduct` and `analyze`. |
| |
| In this example the `search` expression is performed over a collection of movie reviews. |
| The phrase query "Man on Fire" is searched for and the top 5000 results, by score are returned. |
| A single field from the results is return which is the `review_t` field that |
| contains text of the movie review. |
| |
| Then `cartesianProduct` function is run over the search results. |
| The `cartesianProduct` function applies the `analyze` function, which takes the `review_t` field and analyzes it with the Lucene/Solr analyzer attached to the `text_bigrams` schema field. |
| This analyzer emits the bigrams found in the text field. |
| The `cartesianProduct` function explodes each bigram into its own tuple with the bigram stored in the field `term`. |
| |
| The stream of tuples, each containing a bigram, is then filtered by the `having` function |
| using regular expressions to select bigrams with a length of 12 or greater and to filter |
| out bigrams that contain specific characters. |
| |
| The `hashRollup` function then aggregates the bigrams and the `top` function emits the top 10 bigrams by count. |
| |
| Then Zeppelin-Solr is used to visualize the top 10 ten bigrams. |
| |
| image::images/math-expressions/text-analytics.png[] |
| |
| Lucene/Solr analyzers can be configured in many different ways to support |
| aggregations over NLP entities (people, places, companies, etc.) as well as |
| tokens extracted with regular expressions or dictionaries. |
| |
| == TF-IDF Term Vectors |
| |
| The `termVectors` function can be used to build TF-IDF term vectors from the terms generated by the `analyze` function. |
| |
| The `termVectors` function operates over a list of tuples that contain a field called `id` and a field called `terms`. |
| Notice that this is the exact output structure of the document annotation example above. |
| |
| The `termVectors` function builds a matrix from the list of tuples. |
| There is row in the matrix for each tuple in the list. |
| There is a column in the matrix for each term in the `terms` field. |
| |
| [source,text] |
| ---- |
| let(echo="c, d", <1> |
| a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"), <2> |
| id, |
| analyze(subject, subject_bigram) as terms), |
| b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1), <3> |
| c=getRowLabels(b), <4> |
| d=getColumnLabels(b)) |
| ---- |
| |
| The example below builds on the document annotation example. |
| |
| <1> The `echo` parameter will echo variables `c` and `d`, so the output includes |
| the row and column labels, which will be defined later in the expression. |
| <2> The list of tuples are stored in variable `a`. The `termVectors` function |
| operates over variable `a` and builds a matrix with 2 rows and 4 columns. |
| <3> The `termVectors` function sets the row and column labels of the term vectors matrix as variable `b`. |
| The row labels are the document ids and the column labels are the terms. |
| <4> The `getRowLabels` and `getColumnLabels` functions return |
| the row and column labels which are then stored in variables *`c`* and *`d`*. |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "c": [ |
| "1", |
| "2" |
| ], |
| "d": [ |
| "analysis example", |
| "example number", |
| "number two", |
| "text analysis" |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 5 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === TF-IDF Values |
| |
| The values within the term vectors matrix are the TF-IDF values for each term in each document. |
| The example below shows the values of the matrix. |
| |
| [source,text] |
| ---- |
| let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"), |
| id, |
| analyze(subject, subject_bigram) as terms), |
| b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "b": [ |
| [ |
| 1.4054651081081644, |
| 0, |
| 0, |
| 1.4054651081081644 |
| ], |
| [ |
| 0, |
| 1.4054651081081644, |
| 1.4054651081081644, |
| 0 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 5 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Limiting the Noise |
| |
| One of the key challenges when working with term vectors is that text often has a significant amount of noise which can obscure the important terms in the data. |
| The `termVectors` function has several parameters designed to filter out the less meaningful terms. |
| This is also important because eliminating the noisy terms helps keep the term vector matrix small enough to fit comfortably in memory. |
| |
| There are four parameters designed to filter noisy terms from the term vector matrix: |
| |
| `minTermLength`:: |
| The minimum term length required to include the term in the matrix. |
| |
| `minDocFreq`:: |
| The minimum percentage, expressed as a number between 0 and 1, of documents the term must appear in to be included in the index. |
| |
| `maxDocFreq`:: |
| The maximum percentage, expressed as a number between 0 and 1, of documents the term can appear in to be included in the index. |
| |
| `exclude`:: |
| A comma-delimited list of strings used to exclude terms. If a term contains any of the excluded strings that |
| term will be excluded from the term vector. |