| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| http://www.apache.org/licenses/LICENSE-2.0 |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <document> |
| |
| <properties> |
| <title>Commons Text - User guide</title> |
| <author email="dev@commons.apache.org">Commons Documentation Team</author> |
| </properties> |
| |
| <body> |
| <!-- $Id$ --> |
| |
| <section name='User Guide for Commons "Text"'> |
| <div align="center"> |
| <h1>The Commons <em>Text</em> Package |
| </h1> |
| <h2>User Guide</h2> |
| <br/> |
| <a href="#Description">[Description]</a> |
| <a href="#text">[text]</a> |
| <a href="#text.diff">[text.diff]</a> |
| <a href="#text.lookup">[text.lookup]</a> |
| <a href="#text.similarity">[text.similarity]</a> |
| <a href="#text.translate">[text.translate]</a> |
| <br/> |
| <br/> |
| </div> |
| </section> |
| |
| <section name="Description"> |
| <p>The Commons Text library provides additions to the JDK's standard |
| text handling. Our goal is to provide a consistent set of tools for |
| processing text, ranging from computing distances between Strings |
| to efficiently escaping Strings in various ways. |
| </p> |
| </section> |
| |
| <section name="text"> |
| |
| <p>Originally the text package was added in Commons Lang 2.2. However, its |
| new home is here. It provides, amongst other |
| classes, a replacement for <code>StringBuffer</code> named <code> |
| StrBuilder</code>, a class for substituting variables within a String |
| named <code>StrSubstitutor</code> and a replacement for StringTokenizer |
| named <code>StrTokenizer</code>. While somewhat ungainly, the <code> |
| Str |
| </code> prefix has been used to ensure we don't clash with any current |
| or future standard Java classes. |
| </p> |
| |
| <p>Beyond the text utilities ported over from Lang, we have also included various |
| string similarity and distance functions. Lastly, there are utilities for |
| computing the differences between bodies of text so that those differences |
| can be viewed. |
| </p> |
| |
| <subsection name="StringEscapeUtils"> |
| <p>As of Lang 3.5, StringEscapeUtils and StrTokenizer have moved into Text. |
| Text also provides ways to generate pieces of text, such as might |
| be used for default passwords. StringEscapeUtils contains methods to |
| escape and unescape Java, JavaScript, HTML and XML. It is worth noting that |
| the package <code>org.apache.commons.text.translate</code> holds the |
| functionality underpinning StringEscapeUtils, with mappings and translations |
| between such mappings for the sake of doing String escaping. StrTokenizer is |
| an improved alternative to <code>java.util.StringTokenizer</code>. |
| </p> |
| </subsection> |
| |
| <subsection name="StringSubstitutor"> |
| <p> |
| The simplest example is to use this class to replace Java System properties. For example: |
| <pre> |
| StringSubstitutor.replaceSystemProperties( |
| "You are running with java.version = ${java.version} and os.name = ${os.name}."); |
| </pre> |
| </p> |
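| <p>To make the substitution above concrete, here is a minimal plain-JDK sketch of the idea. |
| It is not how <code>StringSubstitutor</code> is implemented: the class name |
| <code>SystemPropertySketch</code> is hypothetical, and the sketch assumes that unknown variables |
| are left as written and ignores defaults, escaping, and nested variables.</p> |

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Commons Text.
class SystemPropertySketch {
    // Matches ${name} and captures the name between the braces.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    static String replaceSystemProperties(String template) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = System.getProperty(matcher.group(1));
            // Assumption of this sketch: unknown variables are left as written.
            String replacement = (value != null) ? value : matcher.group();
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceSystemProperties(
                "You are running with java.version = ${java.version} and os.name = ${os.name}."));
    }
}
```

| <p>In real code, prefer the library call shown above; the sketch only illustrates what |
| replacing a variable reference with a system property value means.</p> |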
| <p> |
| For details see <a href="http://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/StringSubstitutor.html">StringSubstitutor</a>. |
| </p> |
| <p> |
| To build a default full-featured substitutor, use: |
| </p> |
| <ul> |
| <li>Commons Text &gt;= 1.8: |
| <a href="http://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/StringSubstitutor.html">org.apache.commons.text.StringSubstitutor.createInterpolator()</a></li> |
| <li>Commons Text &lt; 1.8: |
| <a href="http://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/StringSubstitutor.html">new StringSubstitutor(StringLookupFactory.INSTANCE.interpolatorStringLookup())</a></li> |
| </ul> |
| <p> |
| The available substitutions are defined in |
| <a href="http://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/lookup/StringLookupFactory.html">org.apache.commons.text.lookup.StringLookupFactory</a>. |
| </p> |
| </subsection> |
| |
| <subsection name="Similarity and Distance"> |
| <p>The <code>org.apache.commons.text.similarity</code> package contains various mechanisms for |
| calculating "similarity scores" as well as "edit distances" between Strings. Note that |
| the difference between a "similarity score" and a "distance function" is that |
| a distance function meets the following qualifications: |
| <ul> |
| <li><code>d(x,y) &gt;= 0</code>, non-negativity or separation axiom |
| </li> |
| <li><code>d(x,y) == 0</code>, if and only if, |
| <code>x == y</code> |
| </li> |
| <li><code>d(x,y) == d(y,x)</code>, symmetry, and |
| </li> |
| <li><code>d(x,z) &lt;= d(x,y) + d(y,z)</code>, the triangle inequality |
| </li> |
| </ul> |
| whereas a "similarity score" need not satisfy all of these properties. It is, however, |
| often fairly easy to "normalize" a similarity score to manufacture an "edit distance". |
| </p> |
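| <p>The four qualifications above can be checked mechanically for a concrete distance function. |
| The following is a minimal sketch using a hand-written Hamming distance on equal-length strings; |
| the class <code>MetricAxiomsSketch</code> and its methods are hypothetical and are not the |
| library's own classes. Checking one triple of strings illustrates the axioms but does not prove them.</p> |

```java
// Hypothetical helper for illustration; not the library's HammingDistance class.
class MetricAxiomsSketch {
    /** Counts the positions at which two equal-length strings differ. */
    static int hamming(String x, String y) {
        if (x.length() != y.length()) {
            throw new IllegalArgumentException("Strings must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != y.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Returns true if the four axioms hold for this particular triple of strings. */
    static boolean satisfiesAxioms(String x, String y, String z) {
        boolean nonNegative = hamming(x, y) >= 0;
        boolean identity = (hamming(x, y) == 0) == x.equals(y);
        boolean symmetric = hamming(x, y) == hamming(y, x);
        boolean triangle = hamming(x, z) <= hamming(x, y) + hamming(y, z);
        return nonNegative && identity && symmetric && triangle;
    }

    public static void main(String[] args) {
        System.out.println(hamming("karolin", "kathrin"));                    // 3
        System.out.println(satisfiesAxioms("karolin", "kathrin", "kerstin")); // true
    }
}
```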
| <p> |
| The list of "edit distances" that we currently support follows: |
| <ul> |
| <li>Cosine Distance,</li> |
| <li>Hamming Distance,</li> |
| <li>Jaccard Distance,</li> |
| <li>Jaro-Winkler Distance,</li> |
| <li>Levenshtein Distance, and</li> |
| <li>Longest Common Subsequence Distance.</li> |
| </ul> |
| The list of "similarity scores" that we support follows: |
| <ul> |
| <li>Cosine Similarity,</li> |
| <li>Fuzzy Score Similarity,</li> |
| <li>Jaccard Similarity,</li> |
| <li>Jaro-Winkler Similarity, and</li> |
| <li>Longest Common Subsequence Similarity.</li> |
| </ul> |
| </p> |
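| <p>Several entries appear in both lists because a score in the range 0 to 1 can be turned |
| into a distance by taking its complement. The following is a minimal sketch of that idea, assuming |
| a Jaccard similarity computed over the sets of characters in each string; the class |
| <code>JaccardSketch</code> is hypothetical and is not the library's implementation.</p> |

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Commons Text.
class JaccardSketch {
    /** Jaccard similarity over the sets of characters in each string, in [0, 1]. */
    static double similarity(String left, String right) {
        Set<Character> a = charSet(left);
        Set<Character> b = charSet(right);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // Assumption of this sketch: two empty strings count as identical.
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Complement of the similarity: 0 for equal character sets, 1 for disjoint ones. */
    static double distance(String left, String right) {
        return 1.0 - similarity(left, right);
    }

    private static Set<Character> charSet(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }
}
```

| <p>Note that such a complement does not automatically satisfy every qualification listed |
| earlier: in this sketch two different strings with the same characters, such as "abc" and "cab", |
| have a distance of 0.</p> |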
| </subsection> |
| |
| <subsection |
| name="Text diff'ing"> |
| <p>The <code>org.apache.commons.text.diff</code> package contains code for |
| computing the differences between strings. The initial implementation of the Myers algorithm was adapted from the |
| commons-collections sequence package. |
| </p> |
| </subsection> |
| |
| |
| </section> |
| |
| <section name="text.diff"> |
| <!-- |
| CommandVisitor |
| DeleteCommand |
| EditCommand |
| EditScript |
| InsertCommand |
| KeepCommand |
| ReplacementsFinder |
| ReplacementsHandler |
| StringsComparator |
| --> |
| <p>Provides algorithms for computing the differences between strings.</p> |
| <p>The initial implementation of the Myers algorithm was adapted from the |
| commons-collections sequence package. |
| </p> |
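| <p>A diff between two strings can be expressed as a script of keep, delete, and insert commands. |
| The following is a minimal sketch of that idea. It deliberately swaps in a simple quadratic |
| longest-common-subsequence table instead of the Myers algorithm mentioned above, and the class |
| <code>EditScriptSketch</code> and its string encoding of commands are hypothetical.</p> |

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; uses an LCS table, not the Myers algorithm.
class EditScriptSketch {
    /** Returns commands turning left into right: "=c" keep, "-c" delete, "+c" insert. */
    static List<String> editScript(String left, String right) {
        int n = left.length();
        int m = right.length();
        // lcs[i][j] = length of the longest common subsequence of
        // left.substring(i) and right.substring(j)
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = left.charAt(i) == right.charAt(j)
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk both strings, following the table towards the longest common subsequence.
        List<String> script = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (left.charAt(i) == right.charAt(j)) {
                script.add("=" + left.charAt(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                script.add("-" + left.charAt(i++));
            } else {
                script.add("+" + right.charAt(j++));
            }
        }
        while (i < n) {
            script.add("-" + left.charAt(i++));
        }
        while (j < m) {
            script.add("+" + right.charAt(j++));
        }
        return script;
    }

    public static void main(String[] args) {
        System.out.println(editScript("hose", "house")); // [=h, =o, +u, =s, =e]
    }
}
```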
| </section> |
| |
| <section name="text.lookup"> |
| <p>Provides algorithms for looking up strings used by a |
| <a href="http://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/StringSubstitutor.html">StringSubstitutor</a>. |
| You can select which lookups are used from |
| <a href="http://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/lookup/StringLookupFactory.html">StringLookupFactory</a>.</p> |
| </section> |
| |
| <section name="text.similarity"> |
| <p>Provides algorithms for string similarity.</p> |
| |
| <p>The algorithms that implement the EditDistance interface follow the |
| same simple principle: the more similar (closer) two strings are, the lower |
| the distance between them. |
| For example, the words house and hose are closer than house and |
| trousers. |
| </p> |
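| <p>The house, hose, and trousers example can be reproduced with a hand-written edit distance. |
| The following is a minimal sketch of the classic dynamic-programming Levenshtein distance; |
| the class <code>LevenshteinSketch</code> is hypothetical and is not the library's |
| <code>LevenshteinDistance</code> class.</p> |

```java
// Hypothetical helper for illustration; not the library's LevenshteinDistance class.
class LevenshteinSketch {
    /** Minimum number of single-character insertions, deletions, and substitutions. */
    static int distance(String left, String right) {
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j; // building the first j characters of right from "" takes j insertions
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i; // reducing the first i characters of left to "" takes i deletions
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("house", "hose"));     // 1
        System.out.println(distance("house", "trousers")); // 4
    }
}
```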
| |
| <p>The following algorithms are available at the moment:</p> |
| |
| <ul> |
| <li> |
| <code>CosineDistance</code> |
| </li> |
| <li> |
| <code>CosineSimilarity</code> |
| </li> |
| <li> |
| <code>FuzzyScore</code> |
| </li> |
| <li> |
| <code>HammingDistance</code> |
| </li> |
| <li> |
| <code>JaroWinklerDistance</code> |
| </li> |
| <li> |
| <code>JaroWinklerSimilarity</code> |
| </li> |
| <li> |
| <code>LevenshteinDistance</code> |
| </li> |
| <li> |
| <code>LongestCommonSubsequenceDistance</code> |
| </li> |
| </ul> |
| |
| <p>The <code>CosineDistance</code> utilises a |
| <code>RegexTokenizer</code>, a regular expression tokenizer (\w+). The |
| <code>LevenshteinDistance</code>'s |
| behaviour can be changed to take into consideration a maximum |
| threshold. |
| </p> |
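| <p>To show what tokenizing with \w+ means for a cosine measure, here is a minimal plain-JDK sketch. |
| It assumes the distance is the complement of the cosine similarity of the two term-frequency |
| vectors; the class <code>CosineSketch</code> and its methods are hypothetical and are not the |
| library's implementation.</p> |

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Commons Text.
class CosineSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Splits text into \w+ tokens and counts how often each token occurs. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    static double cosineSimilarity(String left, String right) {
        Map<String, Integer> a = termFrequencies(left);
        Map<String, Integer> b = termFrequencies(right);
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double norms = norm(a) * norm(b);
        // Assumption of this sketch: text without any tokens scores 0.
        return norms == 0 ? 0 : dot / norms;
    }

    /** Assumption of this sketch: distance is the complement of the similarity. */
    static double cosineDistance(String left, String right) {
        return 1.0 - cosineSimilarity(left, right);
    }

    private static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

| <p>Because only token counts are compared, word order does not matter in this sketch: |
| "the quick fox" and "fox quick the" are at a distance of approximately 0.</p> |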
| </section> |
| |
| <section name="text.translate.*"> |
| <p>An API for creating text translation routines from a set of smaller |
| building blocks. Initially created to make it possible for the user to |
| customize the rules in the StringEscapeUtils class. |
| </p> |
| <p>These classes are immutable, and therefore thread-safe.</p> |
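| <p>The idea of assembling a translation routine from smaller building blocks can be sketched |
| in plain Java as follows. The class <code>TranslatorSketch</code>, its methods, and the two small |
| XML tables are hypothetical and only illustrate the approach; they are not the classes in |
| <code>org.apache.commons.text.translate</code>.</p> |

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration; not part of Commons Text.
class TranslatorSketch {
    /** Combines one or more single-character lookup tables into one translator. */
    @SafeVarargs
    static UnaryOperator<String> lookup(Map<Character, String>... tables) {
        // Defensive copy, so the returned translator never changes after creation.
        Map<Character, String> merged = new HashMap<>();
        for (Map<Character, String> table : tables) {
            merged.putAll(table);
        }
        return input -> {
            StringBuilder out = new StringBuilder(input.length());
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                String replacement = merged.get(c);
                out.append(replacement != null ? replacement : String.valueOf(c));
            }
            return out.toString();
        };
    }

    /** Building block: the basic XML entities. */
    static Map<Character, String> basicXml() {
        Map<Character, String> table = new HashMap<>();
        table.put('&', "&amp;");
        table.put('<', "&lt;");
        table.put('>', "&gt;");
        table.put('"', "&quot;");
        return table;
    }

    /** Building block: the apostrophe entity. */
    static Map<Character, String> apostrophe() {
        Map<Character, String> table = new HashMap<>();
        table.put('\'', "&apos;");
        return table;
    }

    public static void main(String[] args) {
        UnaryOperator<String> escapeXml = lookup(basicXml(), apostrophe());
        System.out.println(escapeXml.apply("<a href='x'>R&D</a>"));
    }
}
```

| <p>Each input character is looked up exactly once, so an ampersand produced by one replacement |
| is never escaped a second time, and the translator holds no mutable state once it has been built.</p> |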
| </section> |
| |
| </body> |
| </document> |