| <!DOCTYPE html public "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> |
| <LINK rel="Stylesheet" type="text/css" media="all" href="../harmony.css"> |
| <title>Design of the regex processing framework</title> |
| </head> |
| <body> |
| |
| <h1 align=center><a name="Top"></a>Regex Processing Framework</h1> |
| <p class="TOCHeading"><A href="#Revision_History">Revision History</A></p> |
| <p class="TOCHeading"><A href="#Disclaimer">Disclaimer</A></p> |
| <p class="TOCHeading"><A href="#About_this_document">About this Document</A> |
| </p> |
| <p class="TOC"><A href="#Purpose">Purpose </A> </p> |
| <p class="TOC"><A href="#Intended_Audience">Intended Audience </A> </p> |
| <p class="TOC"><A href="#Documentation_Conventions">Documentation Conventions</A></p> |
| <p class="TOCHeading"><a href="#Introduction">Introduction to Pattern Matching</a></p> |
| <p class="TOCHeading"><a href="#AbstractSetIntro">AbstractSet Interface</a></p> |
| <p class="TOC"><a href="#Characteristics">Characteristics</a></p> |
| <p class="TOC"><a href="#MethodsUsed">Methods Used</a></p> |
| <p class="TOC"><a href="#ClassHierarchy">Class Hierarchy</a></p> |
| <p class="TOC"><a href="#Optimizations">Optimizations</a></p> |
| |
| <p class="TOCHeading"><a href="#UsageExamples">Usage Examples</a></p> |
| <p class="TOCHeading"><A href="#References">References</A> </p> |
| <H1><A name="Revision_History"></A>Revision History</H1> |
| <TABLE width="100%"> |
| <TR> |
| <TD class="TableHeading" width="25%"> Version </TD> |
| <TD class="TableHeading" width="50%"> Version Information </TD> |
| <TD class="TableHeading"> Date </TD> |
| </TR> |
| <TR> |
| <TD class="TableCell" width="25%"> Initial version </TD> |
| <TD class="TableCell" width="25%"> Nadya Morozova, Nikolay Kuznetsov: document |
| created.</TD> |
| <TD class="TableCell"> December 16, 2005</TD> |
| </TR> |
| </TABLE> |
| <H1><A name="Disclaimer"></A>Disclaimer and Legal Information</H1> |
| <P>Copyright 2005 The Apache Software Foundation or its licensors, as applicable.</P> |
| <P>Licensed under the Apache License, Version 2.0 (the "License"); you |
| may not use this file except in compliance with the License. You may obtain |
| a copy of the License at <A href="http://www.apache.org/licenses/LICENSE-2.0"> |
| http://www.apache.org/licenses/LICENSE-2.0</A>. </P> |
| <P>Unless required by applicable law or agreed to in writing, software distributed |
| under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES |
| OR CONDITIONS OF ANY KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations under the License.</P> |
| <h1><a name="About_this_document"></a>About This Document</h1> |
| <h2><a name="Purpose"></a>Purpose</h2> |
| <p>This document describes the <code>java.util.regex</code> package delivered as part of |
| the Harmony classlibrary project. This document provides an overview |
| of the overall architecture with focus on the aspects improving performance. |
| </p> |
| <h2><a name="Intended_Audience"></a>Intended Audience</h2> |
| |
| <p>The target audience for the document includes a wide community of engineers |
| interested in using regular expression package and in further work with the |
| product to contribute to its development. The document assumes that readers |
| are familiar with Regular expressions [<a href="#Fri02">1</a>, <a href="#MY60">2</a>, |
| <a href="#THO68">3</a>], finite automata theory [<a href="#ASU86">4</a>], basic |
| compiler techniques [<a href="#ASU86">4</a>] and the Java<a |
| href="#*">*</a> programming language [<a href="#JavaSite">5</a>]. </p> |
| <h2><a name="Documentation_Conventions"></a>Documentation Conventions</h2> |
| <p>This document uses the <a |
| href="../conventions.htm">unified conventions</a> for the Harmony documentation kit.</p> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| |
| <h1><a name="Introduction"></a>Introduction to Pattern Matching </h1> |
| |
| <p>To analyze text in search of sequences matching preset patterns, you can choose |
| from a variety of techniques, such as the wide-spread <i>regular expressions</i> |
| (RE) [<a href="#Fri02">1</a>], <em>exact string matching</em>, for example, the Boyer-Moore algorithm |
| (BM) [<a href="#BM77">6</a>], and others. However, the RE engine is rather complex, and significantly |
| impacts performance, whereas the exact string matching techniques have a limited |
| application. </p> |
| |
| <p>For example, the regular expression <code>.*world</code> is used to find the |
| last instance of <i>world</i> in a sentence. The BM string searching algorithm is more |
| appropriate for solving the task than the RE technique, and is more effective |
| from the performance perspective. To optimize using pattern matching techniques |
| with different tasks, Harmony provides a unified interface that applies to any part |
| of a regular expression, including the whole expression. </p> |
| |
| <p>In terms of the Finite Automata theory, which is the basis of regular expression |
| processing, a part of regular expression is a <i>node</i>. The Harmony regex framework |
| treats every distinctive part of a regular expression and the whole expression |
| as a node. Each node implements the unified interface adjusted for a specific |
| technique. For instance, for the regular expression in the example, the Harmony |
| framework includes a special class SequenceSet, which has a unified interface |
| called <code>AbstractSet</code>, and implements the Boyer-Moore algorithm in |
| searching for a word.</p> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| <h1><a name="AbstractSetIntro"></a>AbstractSet Interface</h1> |
| |
| <p>The key feature of the Harmony regex framework the single super interface, <code>AbstractSet</code>, |
| shared by all types of nodes. Individual nodes are independent, so that every |
| node of the automaton incorporates its own find-match algorithm optimized for |
| a specific type of nodes. You can use methods implemented in the <code>AbstractSet</code> |
| interface subclasses or override these methods to use your own more efficient |
| algorithm.</p> |
| |
| <h2><a name="Characteristics"></a>Characteristics</h2> |
| |
| <p>The <code>AbstractSet</code> interface has the following key characteristics: |
| </p> |
| |
| <ul> |
| <li> <b>Unified structure:</b> General contract for all nodes including node |
| chains. This enables the framework to support new features and optimizations |
| for a specific construction of RE without impacting other parts of the framework. |
| <li><b>Simplicity:</b> Every node of an automaton is simple and handles a specific |
| optimization or functionality. The resulting automaton is straight-forward |
| and includes no multi-purpose functionality. To enable support of new functionality, |
| create a new class. |
| <p class="note">NOTE: </p> |
| <p class="notetext">Because new features and optimizations are implemented |
| independently, expansion of the framework has no negative impact on the |
| overall performance. </p> |
| <li><b>Independence</b>:The matching and finding procedures are incorporated |
| into the interface and require no external logic. The interface allows using |
| the same find-match methods for all types of nodes, and at the same time enables |
| optimizing the find-match algorithm for individual node types to improve performance. |
| </ul> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| <h2><a name="MethodsUsed"></a>Methods Used </h2> |
| |
| <p>The <code>AbstractSet</code> interface defines the matching and finding methods, |
| and the service methods supporting them, as follows. </p> |
| <p class="class">Service Methods</p> |
| <dl> |
| <dt>accept()</dt> |
| <dd>Checks whether the current node matches the current string symbol(s) to |
| ensure that the automaton can proceed with the current state. The method returns |
| the number of accepted symbols. This method is designed for nodes representing |
| tokens, which mostly use the default match procedure. To optimize the match |
| procedure for a specific type of token, you only need to override the <code>accept()</code> |
| method, see <a href="#Example1">Example 1</a>.</dd> |
| <dt>first()</dt> |
| <dd>Checks whether the first symbol of the current node intersects with previous |
| one. If not, then the previous node is quantified possessively without <i>backtrack</i> |
| (the option of restarting the search from the previous position). </dd> |
| <dt>next()</dt> |
| <dd>Returns the next node of the automaton. </dd> |
| </dl> |
| <p class="class">Matching and Finding Methods</p> |
| <dl> |
| <dt>matches()</dt> |
| <dd>Runs the match procedure starting from the current state. This is the only |
| method you need to override in order to introduce new functionality. By default, |
| the <code>matches()</code> method does the following when working with terms (see <a href="#ClassHierarchy">Class |
| Hierarchy</a>): |
| <ol> |
| <li> Calls the <code>accept()</code> method and proceeds if the method returns |
| a positive value. </li> |
| <li> Calls the match of the next node with the shift obtained from the <code>accept()</code> |
| method.</li> |
| <li> Returns <code>TRUE</code> in case of a successful match. </li> |
| </ol> |
| </dd> |
| <dt>find()</dt> |
| <dd>Checks whether the pattern can match any part of the string in the following |
| way: |
| <ol> |
| <li>Finds the next position in the input string, for which the <code>accept()</code> |
| method returns a positive value. If no matches are found, the method terminates |
| and returns a negative value.</li> |
| <li>From this position, runs the <code>matches()</code> method.</li> |
| </ol> |
| </dd> |
| <dt>findBack()</dt> |
| <dd>Does the same as the <code>find()</code> method but<code></code> starts |
| from the end of the string or from the nearest new-line symbol. This method |
| optimizes the find procedure when the current node of the pattern fits the |
| rest of the string. In such cases, it is necessary to find the last occurrence |
| of the pattern represented by the next node, as in the case of the <code>.*world |
| </code> pattern. </dd> |
| </dl> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| <h2><a name="ClassHierarchy"></a>Class Hierarchy</h2> |
| |
| <p><a href="#figure1">Figure 1</a> shows the class hierarchy based on the <code>AbstractSet</code> |
| interface. As mentioned in its <a href="#AbstractSetIntro">description</a>, |
| the whole hierarchy is based on this interface. The other classes of the framework |
| are divided into the following categories: </p> |
| |
| <ul> |
| <li><b>Tokens or Leafs</b> (<code>LeafSet</code> in the figure): nodes representing |
| ordinal symbols, substrings, ranges, character classes, and other basic units |
| of regular expressions. |
| <li><b>Quantifiers</b> (<code>QuantifierSet</code> in the figure): set of nodes responsible |
| for quantification of terms. Terms are simple and usually represent only one |
| symbol. Therefore, terms are quantified without recursion, and backtracks are |
| processed on the basis of the underlying pattern length. |
| <li><b>Groups </b>(<code>JointSet</code> in the figure): sets derived from parenthetic constructions. |
| These nodes represent alternations of other sets. |
| <li><b>Group Quantifiers </b>(<code>GroupQuantifier</code> in the figure): special |
| quantifiers over Groups. Because the length of a groups can vary, backtracks |
| require recursion. |
| </ul> |
| <p align=center><img border=0 alt="Hierarchy of classes in regexp framework" src="images/picture1.gif"></p> |
| <p class="special"><a name="figure1"></a>Figure 1. Class Hierarchy</p> |
| |
| <p>Figure 1 displays only the basics of the regex framework. This framework also |
| includes other nodes responsible for different optimizations. For more details, |
| see the comments inlined in code. </p> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| <h2><a name="Optimizations"></a>Optimizations</h2> |
| |
| <p>In the current implementation, most optimizations are based on the node representation |
| to improve efficiency of the framework. It is noteworthy that an optimal finite |
| automaton is built during compile time, so that the constructed automaton does |
| not spend additional time on decision-making overhead during match time. The |
| regex framework optimizations improve different aspects, such as: </p> |
| <ul> |
| <li> <b>Character or String find methods: </b>The methods <code>find()</code> |
| and <code>findBack()</code> of certain nodes use exact pattern matching algorithms, |
| specifically, the Boyer-Moore fast string search algorithm. |
| <li> <b>Quantified dot term:</b> This term (<code>.*</code>) covers the rest |
| of the string and can therefore run <code>findBack()</code> from the end of |
| the string. If the <code>findBack()</code> method of the following node implements |
| a special algorithm, the pattern matching speed for this chain increases, |
| otherwise the operation runs at the same speed. |
| <li> <b>Quantified non-intersecting states:</b> If the quantified node does |
| not intersect with the first character of the next one, the framework treats |
| the quantifier as the possessive quantifier because no backtracks can occur. |
| Possessive quantification is based on loops instead of recursion and works |
| faster. |
| </ul> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| |
| <h1><a name="UsageExamples"></a>Usage Examples</h1> |
| <P>This part on the document illustrates usage of the Harmony regex framework. </P> |
| <h3><a name="Example1"></a>Example 1</h3> |
| <p>This example illustrates using the <code>CharSet</code> class, which includes all nodes accepting characters to create a new type of tokens. For that, the class uses the <code>LeafSet</code> class, which is the basic class for tokens. This example shows that the only method you need to override in order to present a new type of tokens is the <code>accept()</code> method. </p> |
| <pre>class CharSet extends LeafSet { |
| ... |
| public int accept(int strIndex, CharSequence testString) { |
| // checks that the current symbol of the input string is the one this |
| // instance represents and returns 1 (the number of |
| // accepted symbols) or -1 if accept fails: |
| return (this.ch == testString.charAt(strIndex)) ? 1 : -1; |
| } |
| ... |
| } |
| </pre> |
| <h3>Example 2</h3> |
| <p>The following example demonstrates independent implementation of the <code>find()</code> method for <code>SequenceSet</code>.</p> |
| <p class="note">Note: </p> |
| <p class="notetext">This changes the find procedure for nodes representing character |
| sequences and at the same time does not affect the find-match algorithms of |
| other types of nodes. </p> |
| <pre>class SequenceSet extends LeafSet { |
| ... |
| protected int indexOf(CharSequence str, int from) { |
| // Boyer-Moore algorithm |
| ... |
| } |
| |
| public int find(int strIndex, CharSequence testString, |
| MatchResultImpl matchResult) { |
| ... |
| while (strIndex <= strLength) { |
| // call the fast search method instead of default implementation |
| strIndex = indexOf(testStr, strIndex); |
| if (strIndex < 0) { |
| return -1; |
| } |
| if (next.matches(strIndex + charCount, testString, matchResult) >= 0) { |
| return strIndex; |
| } |
| strIndex++; |
| } |
| return -1; |
| } |
| ... |
| } |
| </pre> |
| |
| |
| <h3>Example 3 </h3> |
| <p>This example illustrates how to turn the match procedure of the <code>.*</code> |
| patterns into the find procedure, which is potentially faster. </p> |
| |
| <pre>class DotQuantifierSet extends LeafQuantifierSet { |
| ... |
| public int matches(int stringIndex, CharSequence testString, |
| MatchResultImpl matchResult) { |
| ... |
| // find line terminator, since .* represented by this node accepts all characters |
| // except line terminators |
| int startSearch = findLineTerminator(stringIndex, strLength, testString); |
| ... |
| // run findBack method of the next node, because the previous |
| // part of the string is accepted by .* and no matter where |
| // the next part of pattern is found, the procedure works OK. |
| return next.findBack(stringIndex, startSearch, testString, matchResult); |
| } |
| } |
| </pre> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| |
| <h1><a name="References"></a>References</h1> |
| <p>[<a name="Fri02"></a>1] Jeffrey E. F. Friedl., <em>Mastering regular expressions</em>, |
| 2nd Edition., July 2002, O'Reilly, ISBN 0-596-00289-0 </p> |
| <p>[<a name="MY60"></a>2] McNaughton, R. and Yamada, H. <em>Regular Expressions |
| and State Graphs for Automata</em>, IRA Trans. on Electronic Computers, Vol. |
| EC-9, No. 1, Mar. 1960, pp 39-47. </p> |
| |
| <p>[<a name="THO68"></a>3] Thompson, K., <em>Regular Expression search Algorithm</em>, |
| Communication ACM 11:6 (1968), pp. 419-422. </p> |
| |
| <p>[<a name="ASU86"></a>4] Alfred V. Aho, Ravi Sethi, Jeffrey |
| D. Ullman, <em>Compilers, Principles Techniques and Tools</em>, Addison-Wesley |
| Publishing Company, Inc., 1985, ISBN 0-201-10088-6 </p> |
| |
| <p>[<a name="JavaSite"></a>5] Java Technology site, <a href="http://java.sun.com" target="_blank">http://java.sun.com</a> |
| </p> |
| |
| <p>[<a name="BM77"></a>6] R. Boyer and S. Moore. A fast string searching algorithm. C. ACM, 20:762-772, 1977. </p> |
| |
| <p>[<a name="KMP77"></a>7] D.E. Knuth, .I. Morris, and V. Pratt. Fast pattern |
| matching in strings. SIAM J on Computing, 6:323-350, 1977. </p> |
| <P class="backtotop"><A href="#Top">Back to Top</A></P> |
| <P class="backtotop"> </P> |
| |
| |
| <P><A name="*">*</A> Other brands and names are the property of their respective |
| owners.</P> |
| </body> |
| </html> |