| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <title> |
| Apache Lucene OpenNLP integration module |
| </title> |
| </head> |
| <body> |
| <p> |
| This module exposes functionality from |
| <a href="http://opennlp.apache.org">Apache OpenNLP</a> to Apache Lucene. |
| The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. |
| <p> |
| For an introduction to Lucene's analysis API, see the {@link org.apache.lucene.analysis} package documentation. |
| <p> |
| The OpenNLP Tokenizer behavior is similar to the WhiteSpaceTokenizer but is smart about |
| inter-word punctuation. The term stream looks very much like the way you parse words and |
| punctuation while reading. The major difference between this tokenizer and most other |
| tokenizers shipped with Lucene is that punctuation is tokenized. This is required for |
| the following taggers to operate properly. |
| <p> |
| The OpenNLP taggers annotate terms using the <code>TypeAttribute</code>. |
| <ul> |
| <li><code>OpenNLPTokenizer</code> segments text into sentences or words. This Tokenizer |
| uses the OpenNLP Sentence Detector and/or Tokenizer classes. When used together, the |
| Tokenizer receives sentences and can do a better job.</li> |
| <li><code>OpenNLPFilter</code> tags words using one or more technologies: Part-of-Speech, |
| Chunking, and Named Entity Recognition. These tags are assigned as token types. Note that |
| only of these operations will tag |
| </li> |
| </ul> |
| <p> |
| Since the <code>TypeAttribute</code> is not stored in the index, it is recommended that one |
| of these filters is used following <code>OpenNLPFilter</code> to enable search against the |
| assigned tags: |
| <ul> |
| <li><code>TypeAsPayloadFilter</code> copies the <code>TypeAttribute</code> value to the |
| <code>PayloadAttribute</code></li> |
| <li><code>TypeAsSynonymFilter</code> creates a cloned token at the same position as each |
| tagged token, and copies the {{TypeAttribute}} value to the {{CharTermAttribute}}, optionally |
| with a customized prefix (so that tags effectively occupy a different namespace from token |
| text).</li> |
| </ul> |
| </body> |
| </html> |