<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<document>
<properties>
<title>Commons Text - User guide</title>
<author email="dev@commons.apache.org">Commons Documentation Team</author>
</properties>
<body>
<!-- $Id$ -->
<section name='User guide for Commons "Text"'>
<div align="center">
<h1>The Commons <em>Text</em> Package
</h1>
<h2>User Guide</h2>
<br/>
<a href="#Description">[Description]</a>
<a href="#text.">[text.*]</a>
<a href="#text.diff.">[text.diff.*]</a>
<a href="#text.similarity.">[text.similarity.*]</a>
<a href="#text.translate.">[text.translate.*]</a>
<br/>
<br/>
</div>
</section>
<section name="Description">
<p>The Commons Text library provides additions to the JDK's standard
text handling. Our goal is to provide a consistent set of tools for
processing text, from computing distances between Strings
to efficiently escaping Strings in various formats.
</p>
</section>
<section name="text.*">
<p>The text package was originally added in Commons Lang 2.2, but its
new home is here. It provides, amongst other
classes, a replacement for <code>StringBuffer</code> named
<code>StrBuilder</code>, a class for substituting variables within a String
named <code>StrSubstitutor</code>, and a replacement for
<code>StringTokenizer</code> named <code>StrTokenizer</code>. While somewhat
ungainly, the <code>Str</code> prefix has been used to ensure we don't clash
with any current or future standard Java classes.
</p>
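<p>To make the idea of variable substitution concrete, the following is a
minimal plain-Java sketch that does not depend on Commons Text. The class
<code>VariableSubstitutionSketch</code> and its <code>substitute</code>
method are hypothetical names used only for illustration; the sketch assumes
Java 9 or later and the <code>${name}</code> variable syntax, and it is not
how <code>StrSubstitutor</code> is implemented.
</p>
<source><![CDATA[
```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Commons Text.
final class VariableSubstitutionSketch {

    // Matches ${name} and captures the variable name between the braces.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} found in the template with its value.
    // Variables without a value are left in the output unchanged.
    static String substitute(String template, Map<String, String> values) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String replacement = values.getOrDefault(matcher.group(1), matcher.group());
            // quoteReplacement keeps '$' and '\' in values from being read as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Map.of("animal", "fox", "target", "dog");
        // prints "The fox jumps over the dog."
        System.out.println(substitute("The ${animal} jumps over the ${target}.", values));
    }
}
```
]]></source>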
<p>Beyond the text utilities ported over from Lang, we have also included various
string similarity and distance functions. Lastly, there are utilities for
computing the differences between bodies of text so that those differences
can be viewed.
</p>
<subsection name="StringEscapeUtils">
<p>As of Lang 3.5, <code>StringEscapeUtils</code> and <code>StrTokenizer</code>
have moved into Text. The library also provides ways to generate pieces of text,
such as might be used for default passwords. <code>StringEscapeUtils</code>
contains methods to escape and unescape Java, JavaScript, HTML and XML.
It is worth noting that the package <code>org.apache.commons.text.translate</code>
holds the functionality underpinning <code>StringEscapeUtils</code>: the mappings,
and the translations between such mappings, that perform the String escaping.
<code>StrTokenizer</code> is an improved alternative to
<code>java.util.StringTokenizer</code>.
</p>
</subsection>
<subsection name="Similarity and Distance">
<p>The <code>org.apache.commons.text.similarity</code> package contains various mechanisms for
calculating "similarity scores" as well as "edit distances" between Strings. Note that
the difference between a "similarity score" and a "distance function" is that
a distance function satisfies the following properties:
<ul>
<li><code>d(x,y) &gt;= 0</code>, non-negativity or separation axiom
</li>
<li><code>d(x,y) == 0</code>, if and only if,
<code>x == y</code>
</li>
<li><code>d(x,y) == d(y,x)</code>, symmetry, and
</li>
<li><code>d(x,z) &lt;= d(x,y) + d(y,z)</code>, the triangle inequality
</li>
</ul>
whereas a "similarity score" need not satisfy all of these properties. It is,
however, fairly easy to "normalize" a similarity score to produce an "edit distance."
</p>
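<p>As an illustration of the four properties above, the following plain-Java
sketch checks them on sample strings using a hand-written Hamming distance.
The class <code>MetricAxiomsSketch</code> and its methods are hypothetical
names, not Commons Text classes, and checking a single triple of strings
demonstrates the properties without proving them in general.
</p>
<source><![CDATA[
```java
// Hypothetical sketch for illustration only; not Commons Text code.
final class MetricAxiomsSketch {

    // Number of positions at which two equal-length strings differ.
    static int hamming(String left, String right) {
        if (left.length() != right.length()) {
            throw new IllegalArgumentException("Strings must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < left.length(); i++) {
            if (left.charAt(i) != right.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    // True when all four properties hold for this particular triple.
    static boolean satisfiesAxioms(String x, String y, String z) {
        boolean nonNegative = hamming(x, y) >= 0;
        boolean identity = (hamming(x, y) == 0) == x.equals(y);
        boolean symmetric = hamming(x, y) == hamming(y, x);
        boolean triangle = hamming(x, z) <= hamming(x, y) + hamming(y, z);
        return nonNegative && identity && symmetric && triangle;
    }

    public static void main(String[] args) {
        System.out.println(hamming("karolin", "kathrin"));                    // prints 3
        System.out.println(satisfiesAxioms("karolin", "kathrin", "kerstin")); // prints true
    }
}
```
]]></source>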
<p>
The list of "edit distances" that we currently support follows:
<ul>
<li>Cosine Distance,</li>
<li>Hamming Distance,</li>
<li>Jaccard Distance,</li>
<li>Jaro Winkler Distance,</li>
<li>Levenshtein Distance, and</li>
<li>Longest Common Subsequence Distance.</li>
</ul>
and the list of "similarity scores" that we support follows:
<ul>
<li>Cosine Similarity,</li>
<li>Fuzzy Score Similarity,</li>
<li>Jaccard Similarity, and</li>
<li>Longest Common Subsequence Similarity.</li>
</ul>
</p>
</subsection>
<subsection name="Text diff'ing">
<p>The <code>org.apache.commons.text.diff</code> package contains code for
computing the differences between strings. The initial implementation of the
Myers algorithm was adapted from the commons-collections sequence package.
</p>
</subsection>
</section>
<section name="text.diff.*">
<!--
CommandVisitor
DeleteCommand
EditCommand
EditScript
InsertCommand
KeepCommand
ReplacementsFinder
ReplacementsHandler
StringsComparator
-->
<p>Provides algorithms for computing the differences between strings.</p>
<p>The initial implementation of the Myers algorithm was adapted from the
commons-collections sequence package.
</p>
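<p>To show what a diff between two strings produces, the sketch below builds
a simple edit script of keep, delete and insert commands. It is a plain-Java
illustration that uses a longest-common-subsequence table rather than the
Myers algorithm, and the class <code>EditScriptSketch</code> with its
<code>"="</code>, <code>"-"</code> and <code>"+"</code> command notation is
hypothetical, not the API of this package.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only: an LCS-based edit script,
// not the Myers algorithm and not the Commons Text implementation.
final class EditScriptSketch {

    // Returns the commands that turn 'from' into 'to':
    // "=c" keeps a character, "-c" deletes it, "+c" inserts it.
    static List<String> editScript(String from, String to) {
        int n = from.length();
        int m = to.length();
        // lcs[i][j] is the length of the longest common subsequence
        // of the suffixes from.substring(i) and to.substring(j).
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = from.charAt(i) == to.charAt(j)
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        List<String> script = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (from.charAt(i) == to.charAt(j)) {
                script.add("=" + from.charAt(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                script.add("-" + from.charAt(i));
                i++;
            } else {
                script.add("+" + to.charAt(j));
                j++;
            }
        }
        // Whatever remains on either side is deleted or inserted as a whole.
        while (i < n) {
            script.add("-" + from.charAt(i++));
        }
        while (j < m) {
            script.add("+" + to.charAt(j++));
        }
        return script;
    }

    public static void main(String[] args) {
        System.out.println(editScript("house", "hose")); // prints [=h, =o, -u, =s, =e]
    }
}
```
]]></source>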
</section>
<section name="text.similarity.*">
<p>Provides algorithms for string similarity.</p>
<p>The algorithms that implement the <code>EditDistance</code> interface follow the
same simple principle: the more similar (closer) two strings are, the lower the
distance. For example, the words "house" and "hose" are closer than "house" and
"trousers".
</p>
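<p>The house, hose and trousers example can be reproduced with a short
plain-Java sketch of the classic dynamic-programming Levenshtein distance.
The class <code>LevenshteinSketch</code> is a hypothetical name used for
illustration; it is not the <code>LevenshteinDistance</code> class of this
package.
</p>
<source><![CDATA[
```java
// Hypothetical sketch for illustration only; not the Commons Text class.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn 'left' into 'right'.
    static int distance(String left, String right) {
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];
        // Turning the empty prefix into right[0..j) takes j insertions.
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i; // deleting i characters yields the empty prefix
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("house", "hose"));     // prints 1
        System.out.println(distance("house", "trousers")); // prints 4
    }
}
```
]]></source>
<p>The lower result for "hose" reflects that it is closer to "house" than
"trousers" is.
</p>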
<p>The following algorithms are available at the moment:</p>
<ul>
<li>
<code>CosineDistance</code>
</li>
<li>
<code>CosineSimilarity</code>
</li>
<li>
<code>FuzzyScore</code>
</li>
<li>
<code>HammingDistance</code>
</li>
<li>
<code>JaroWinklerDistance</code>
</li>
<li>
<code>LevenshteinDistance</code>
</li>
<li>
<code>LongestCommonSubsequenceDistance</code>
</li>
</ul>
<p><code>CosineDistance</code> uses a <code>RegexTokenizer</code>, a
regular expression tokenizer that splits text into words
(<code>\w+</code>). The behaviour of <code>LevenshteinDistance</code>
can be changed to take a maximum threshold into consideration.
</p>
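<p>The following plain-Java sketch shows how a <code>\w+</code> tokenizer can
feed a cosine computation over word counts. All names in it
(<code>CosineSketch</code>, <code>termFrequencies</code>,
<code>cosineDistance</code>) are hypothetical, and it assumes that the
distance is one minus the cosine similarity and that empty input has
similarity zero; it is not the implementation used by this package.
</p>
<source><![CDATA[
```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not Commons Text code.
final class CosineSketch {

    // A word is a run of one or more word characters.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Counts how often each word matched by \w+ occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine of the angle between two word-count vectors.
    // Assumption: an empty vector has similarity 0 with everything.
    static double cosineSimilarity(Map<String, Integer> left, Map<String, Integer> right) {
        if (left.isEmpty() || right.isEmpty()) {
            return 0.0;
        }
        long dotProduct = 0;
        for (Map.Entry<String, Integer> entry : left.entrySet()) {
            dotProduct += (long) entry.getValue() * right.getOrDefault(entry.getKey(), 0);
        }
        return dotProduct / (norm(left) * norm(right));
    }

    // Euclidean length of a word-count vector.
    private static double norm(Map<String, Integer> vector) {
        long sumOfSquares = 0;
        for (int count : vector.values()) {
            sumOfSquares += (long) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Assumption: distance is defined as one minus the similarity.
    static double cosineDistance(String left, String right) {
        return 1.0 - cosineSimilarity(termFrequencies(left), termFrequencies(right));
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("to be, or not to be"));
        System.out.println(cosineDistance("the quick fox", "the lazy dog"));
    }
}
```
]]></source>
<p>Texts that share no words end up at distance 1, while texts with the same
word counts end up at distance 0, up to floating-point rounding.
</p>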
</section>
<section name="text.translate.*">
<p>An API for creating text translation routines from a set of smaller
building blocks. It was initially created to make it possible for the user to
customize the rules in the <code>StringEscapeUtils</code> class.
</p>
<p>These classes are immutable, and therefore thread-safe.</p>
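<p>The building-block idea can be sketched in plain Java as follows. The
<code>Translator</code> interface and the <code>lookup</code>,
<code>aggregate</code> and <code>apply</code> helpers are hypothetical,
simplified stand-ins and not the API of this package; each block makes a
defensive copy of its input so that, like the real classes, it stays
immutable once built.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; not the Commons Text API.
final class TranslatorSketch {

    // A building block: tries to translate the character at 'index' and
    // returns how many input characters it consumed (0 if it does not apply).
    interface Translator {
        int translate(CharSequence input, int index, StringBuilder out);
    }

    // Block that replaces single characters using a fixed lookup table.
    static Translator lookup(Map<Character, String> table) {
        Map<Character, String> copy = new HashMap<>(table); // defensive copy
        return (input, index, out) -> {
            String replacement = copy.get(input.charAt(index));
            if (replacement == null) {
                return 0;
            }
            out.append(replacement);
            return 1;
        };
    }

    // Block that tries each of its parts in order and stops at the first match.
    static Translator aggregate(List<Translator> translators) {
        List<Translator> copy = new ArrayList<>(translators); // defensive copy
        return (input, index, out) -> {
            for (Translator translator : copy) {
                int consumed = translator.translate(input, index, out);
                if (consumed != 0) {
                    return consumed;
                }
            }
            return 0;
        };
    }

    // Runs a translator over the whole input, copying untranslated characters.
    static String apply(Translator translator, CharSequence input) {
        StringBuilder out = new StringBuilder();
        int index = 0;
        while (index < input.length()) {
            int consumed = translator.translate(input, index, out);
            if (consumed == 0) {
                out.append(input.charAt(index));
                index++;
            } else {
                index += consumed;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Translator markup = lookup(Map.of('&', "&amp;", '<', "&lt;", '>', "&gt;"));
        Translator quotes = lookup(Map.of('"', "&quot;"));
        Translator xml = aggregate(List.of(markup, quotes));
        System.out.println(apply(xml, "a < b & \"c\"")); // prints a &lt; b &amp; &quot;c&quot;
    }
}
```
]]></source>
<p>Customizing the rules then amounts to passing a different list of blocks
to <code>aggregate</code>, without touching the existing ones.
</p>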
</section>
</body>
</document>