| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| |
| <!-- |
| Copyright 1999-2004 The Apache Software Foundation |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| |
| <!-- Content Stylesheet for Site --> |
| |
| |
| <!-- start the processing --> |
| <!-- ====================================================================== --> |
| <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! --> |
| <!-- Main Page Section --> |
| <!-- ====================================================================== --> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/> |
| |
| |
| |
| |
| |
| <title>Jakarta Lucene - Plan for enhancements to Lucene</title> |
| </head> |
| |
| <body bgcolor="#ffffff" text="#000000" link="#525D76"> |
| <table border="0" width="100%" cellspacing="0"> |
| <!-- TOP IMAGE --> |
| <tr> |
| <td align="left"> |
| <a href="http://jakarta.apache.org"><img src="http://jakarta.apache.org/images/jakarta-logo.gif" border="0"/></a> |
| </td> |
| <td align="right"> |
| <a href="http://jakarta.apache.org/lucene/"><img src="./images/lucene_green_300.gif" alt="Jakarta Lucene" border="0"/></a> |
| </td> |
| </tr> |
| </table> |
| <table border="0" width="100%" cellspacing="4"> |
| <tr><td colspan="2"> |
| <hr noshade="" size="1"/> |
| </td></tr> |
| |
| <tr> |
| <!-- LEFT SIDE NAVIGATION --> |
| <td width="20%" valign="top" nowrap="true"> |
| |
| <!-- ============================================================ --> |
| |
| <p><strong>About</strong></p> |
| <ul> |
| <li> <a href="./index.html">Overview</a> |
| </li> |
| <li> <a href="http://wiki.apache.org/jakarta-lucene/PoweredBy">Powered by Lucene</a> |
| </li> |
| <li> <a href="./whoweare.html">Who We Are</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/mail.html">Mailing Lists</a> |
| </li> |
| </ul> |
| <p><strong>Resources</strong></p> |
| <ul> |
| <li> <a href="http://wiki.apache.org/jakarta-lucene">Wiki</a> |
| </li> |
| <li> <a href="http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi">FAQ (Official)</a> |
| </li> |
| <li> <a href="http://www.jguru.com/faq/Lucene">jGuru FAQ</a> |
| </li> |
| <li> <a href="./gettingstarted.html">Getting Started</a> |
| </li> |
| <li> <a href="./queryparsersyntax.html">Query Syntax</a> |
| </li> |
| <li> <a href="./systemproperties.html">System Properties</a> |
| </li> |
| <li> <a href="./fileformats.html">File Formats</a> |
| </li> |
| <li> <a href="./api/index.html">Javadoc</a> |
| </li> |
| <li> <a href="./contributions.html">Contributions</a> |
| </li> |
| <li> <a href="./resources.html">Articles, etc.</a> |
| </li> |
| <li> <a href="./benchmarks.html">Benchmarks</a> |
| </li> |
| <li> <a href="http://issues.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=%5BPATCH%5D&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Patches</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/bugs.html">Bugs</a> |
| </li> |
| <li> <a href="http://issues.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Lucene Bugs</a> |
| </li> |
| <li> <a href="http://issues.apache.org/eyebrowse/SummarizeList?listId=30">Lucene-user</a> |
| </li> |
| <li> <a href="http://issues.apache.org/eyebrowse/SummarizeList?listId=29">Lucene-dev</a> |
| </li> |
| <li> <a href="./lucene-sandbox/">Lucene Sandbox</a> |
| </li> |
| </ul> |
| <p><strong>Download</strong></p> |
| <ul> |
| <li> <a href="http://jakarta.apache.org/site/binindex.html">Binaries</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/sourceindex.html">Source Code</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/cvsindex.html">CVS Repositories</a> |
| </li> |
| </ul> |
| <p><strong>Jakarta</strong></p> |
| <ul> |
| <li> <a href="http://jakarta.apache.org/site/getinvolved.html">Get Involved</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/acknowledgements.html">Acknowledgements</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/contact.html">Contact</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/legal.html">Legal</a> |
| </li> |
| </ul> |
| </td> |
| <td width="80%" align="left" valign="top"> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Purpose"><strong>Purpose</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The purpose of this document is to outline plans for |
| making <a href="http://jakarta.apache.org/lucene"> |
| Jakarta Lucene</a> work as a more general drop-in |
| component. It makes the assumption that this is an |
| objective for the Lucene user and development community. |
| </p> |
| <p> |
| The best reference is <a href="http://www.htdig.org"> |
| htDig</a>, though it is not quite as sophisticated as |
| Lucene, it has a number of features that make it |
| desirable. It however is a traditional c-compiled app |
| which makes it somewhat unpleasant to install on some |
| platforms (like Solaris!). |
| </p> |
| <p> |
| This plan is being submitted to the Lucene developer |
| community for an initial reaction, advice, feedback and |
| consent. Following this it will be submitted to the |
| Lucene user community for support. Although, I'm (Andy |
| Oliver) capable of providing these enhancements by |
| myself, I'd of course prefer to work on them in concert |
| with others. |
| </p> |
| <p> |
| While I'm outlaying a fairly large feature set, these can |
| be implemented incrementally of course (and are probably |
| best if done that way). |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Goal and Objectives"><strong>Goal and Objectives</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The goal is to provide features to Lucene that allow it |
| to be used as a drop-in search engine. It should provide |
| many of the features of projects like <a href="http://www.htdig.org">htDig</a> while surpassing |
| them with unique Lucene features and capabilities such as |
| easy installation on and java-supporting platform, |
| and support for document fields and field searches. And |
| of course, <a href="http://apache.org/LICENSE"> |
| a pragmatic software license</a>. |
| </p> |
| <p> |
| To reach this goal we'll implement code to support the |
| following objectives that augment but do not replace |
| the current Lucene feature set. |
| </p> |
| <ul> |
| <li> |
| Document Location Independence - meaning mapping |
| real contexts to runtime contexts. |
| Essentially, if the document is at |
| /var/www/htdocs/mydoc.html, I probably want it |
| indexed as |
| http://www.bigevilmegacorp.com/mydoc.html. |
| </li> |
| <li> |
| Standard methods of creating central indicies - |
| file system indexing is probably less useful in |
| many environments than is *remote* indexing (for |
| instance http). I would suggest that most folks |
| would prefer that general functionality be |
| supported by Lucene instead of having to write |
| code for every indexing project. Obviously, if |
| what they are doing is *special* they'll have to |
| code, but general document indexing across |
| web servers would not qualify. |
| </li> |
| <li> |
| Document interpretation abstraction - currently |
| one must handle document object construction via |
| custom code. A standard interface for plugging |
| in format handlers should be supported. |
| </li> |
| <li> |
| Mime and file-extension to document |
| interpretation mapping. |
| </li> |
| </ul> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Crawlers"><strong>Crawlers</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| Crawlers are data source executable code. They crawl a file |
| system, ftp site, web site, etc. to create the index. |
| These standard crawlers may not make ALL of Lucene's |
| functionality available, though they should be able to |
| make most of it available through configuration. |
| </p> |
| <p> |
| <b> Abstract Crawler </b> |
| </p> |
| <p> |
| The AbstractCrawler is basically the parent for all |
| Crawler classes. It provides implementation for the |
| following functions/properties: |
| </p> |
| <ul> |
| <li> |
| index path - where to write the index. |
| </li> |
| <li> |
| cui - create or update the index |
| </li> |
| <li> |
| root context - the start of the pathname |
| that should be replaced by the |
| replace with property or dropped |
| entirely. Example: /opt/tomcat/webapps |
| </li> |
| <li> |
| replace with - when specified replaces |
| the root context. Example: |
| http://jakarta.apache.org. |
| </li> |
| <li> |
| replacement type - the type of |
| replace with path: relative, URL or |
| path. |
| </li> |
| <li> |
| location - the location to start |
| indexing at. |
| </li> |
| <li> |
| doctypes - only index documents with |
| these doctypes. If not specified all |
| registered mime-types are used. |
| Example: "xml,doc,html" |
| </li> |
| <li> |
| recursive - if not specified is turned |
| off. |
| </li> |
| <li> |
| level - optional level of directory or |
| links to traverse. By default is |
| assumed to be infinite. Recursive must |
| be turned on or this is ignored. Range: |
| 0 - Long.MAX_VALUE. |
| </li> |
| <li> |
| SleeptimeBetweenCalls - can be used to |
| avoid flooding a machine with too many |
| requests |
| </li> |
| <li> |
| RequestTimeout - kill the crawler |
| request after the specified period of |
| inactivity. |
| </li> |
| <li> |
| IncludeFilter - include only items |
| matching filter. (can occur multiple |
| times) |
| </li> |
| <li> |
| ExcludeFilter - exclude only items |
| matching filter. (can occur multiple |
| times) |
| </li> |
| <li> |
| ExpandOnly - use but do not index items |
| that match this pattern (regex?) (can |
| occur multiple times) |
| </li> |
| <li> |
| NoExpand - Index but do not follow the |
| links in items that match this pattern |
| (regex?) (can occur multiple times) |
| </li> |
| <li> |
| MaxItems - stops indexing after x |
| documents have been indexed. |
| </li> |
| <li> |
| MaxMegs - stops indexing after x megs |
| have been indexed.. (should this be in |
| specific crawlers?) |
| </li> |
| <li> |
| properties - in addition to the settings |
| (probably from the command line) read |
| this properties file and get them from |
| it. Command line options override |
| the properties file in the case of |
| duplicates. There should also be an |
| environment variable or VM parameter to |
| set this. |
| </li> |
| </ul> |
| <p> |
| <b>FileSystemCrawler</b> |
| </p> |
| <p> |
| This should extend the AbstractCrawler and |
| support any additional options required for a |
| file system index. |
| </p> |
| <p> |
| <b>HTTP Crawler </b> |
| </p> |
| <p> |
| Supports the AbstractCrawler options as well as: |
| </p> |
| <ul> |
| <li> |
| span hosts - Whether to span hosts or not, |
| by default this should be no. |
| </li> |
| <li> |
| restrict domains - (ignored if span |
| hosts is not enabled). Whether all |
| spanned hosts must be in the same domain |
| (default is off). |
| </li> |
| <li> |
| try directories - Whether to attempt |
| directory listings or not (so if you |
| recurse and go to |
| /nextcontext/index.html this option says |
| to also try /nextcontext to get the dir |
| listing) |
| </li> |
| <li> |
| map extensions - |
| (always/default/never/fallback). Whether |
| to always use extension mapping, by |
| default (fallback to mime type), NEVER |
| or fallback if mime is not available |
| (default). |
| </li> |
| <li> |
| ignore robots - ignore robots.txt, on or |
| off (default - off) |
| </li> |
| </ul> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="MIMEMap"><strong>MIMEMap</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| A configurable registry of document types, their |
| description, an identifier, mime-type and file |
| extension. This should map both MIME -> factory |
| and extension -> factory. |
| </p> |
| <p> |
| This might be configured at compile time or by a |
| properties file, etc. For example: |
| </p> |
| <table> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| Description |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| Identifier |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| Extensions |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| MimeType |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| DocumentFactory |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| "Word Document" |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| "doc" |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| "doc" |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| "vnd.application/ms-word" |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| POIWordDocumentFactory |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| "HTML Document" |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| "html" |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| "html,htm" |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| HTMLDocumentFactory |
| </font> |
| </td> |
| </tr> |
| </table> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="DocumentFactory"><strong>DocumentFactory</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| An interface for classes which create document objects |
| for particular file types. Examples: |
| HTMLDocumentFactory, DOCDocumentFactory, |
| XLSDocumentFactory, XML DocumentFactory. |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="FieldMapping classes"><strong>FieldMapping classes</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| A class that maps standard fields from the |
| DocumentFactories into *fields* in the Document objects |
| they create. I suggest that a regular expression system |
| or xpath might be the most universal way to do this. |
| For instance if perhaps I had an XML factory that |
| represented XML elements as fields, I could map content |
| from particular fields to their fields or suppress them |
| entirely. We could even make this configurable. |
| </p> |
| <p> |
| |
| for example: |
| </p> |
| <ul> |
| <li> |
| htmldoc.properties |
| </li> |
| <li> |
| suppress=* |
| </li> |
| <li> |
| author=content:g/author\:\ ........................................./ |
| </li> |
| <li> |
| author.suppress=false |
| </li> |
| <li> |
| title=content:g/title\:\ ........................................./ |
| </li> |
| <li> |
| title.suppress=false |
| </li> |
| </ul> |
| <p> |
| In this example we map html documents such that all |
| fields are suppressed but author and title. We map |
| author and title to anything in the content matching |
| author: (and x characters). Okay my regular expresions |
| suck but hopefully you get the idea. |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Final Thoughts"><strong>Final Thoughts</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| We might also consider eliminating the DocumentFactory |
| entirely by making an AbstractDocument from which the |
| current document object would inherit from. I |
| experimented with this locally, and it was a relatively |
| minor code change and there was of course no difference |
| in performance. The Document Factory classes would |
| instead be instances of various subclasses of |
| AbstractDocument. |
| </p> |
| <p> |
| My inspiration for this is HTDig (http://www.htdig.org/). |
| While this goes slightly beyond what HTDig provides by |
| providing field mapping (where HTDIG is just interested |
| in Strings/numbers wherever they are found), it provides |
| at least what I would need to use this as a drop-in for |
| most places I contract at (with the obvious exception of |
| a default set of content handlers which would of course |
| develop naturally over time). |
| </p> |
| <p> |
| I am able to certainly contribute to this effort if the |
| development community is open to it. I'd suggest we do |
| it iteratively in stages and not aim for all of this at |
| once (for instance leave out the field mapping at first). |
| </p> |
| <p> |
| |
| Anyhow, please give me some feedback, counter |
| suggestions, let me know if I'm way off base or out of |
| line, etc. -Andy |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| </td> |
| </tr> |
| |
| <!-- FOOTER --> |
| <tr><td colspan="2"> |
| <hr noshade="" size="1"/> |
| </td></tr> |
| <tr><td colspan="2"> |
| <div align="center"><font color="#525D76" size="-1"><em> |
| Copyright © 1999-2004, The Apache Software Foundation |
| </em></font></div> |
| </td></tr> |
| </table> |
| </body> |
| </html> |
| <!-- end the processing --> |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |