| <?xml version="1.0" encoding="UTF-8"?> |
| |
| <document> |
| <properties> |
| <title>Plan for enhancements to Lucene</title> |
| <authors> |
| <person email="acoliver@apache.org" name="Andrew C. Oliver" id="AO"/> |
| </authors> |
| </properties> |
| <body> |
| |
| <section name="Purpose"> |
| <p> |
| The purpose of this document is to outline plans for |
| making <a href="http://jakarta.apache.org/lucene"> |
| Jakarta Lucene</a> work as a more general drop-in |
| component. It makes the assumption that this is an |
| objective for the Lucene user and development community. |
| </p> |
| <p> |
| The best reference is <a href="http://www.htdig.org"> |
| htDig</a>. Though it is not quite as sophisticated as |
| Lucene, it has a number of features that make it |
| desirable. It is, however, a traditional C-compiled |
| application, which makes it somewhat unpleasant to |
| install on some platforms (like Solaris!). |
| </p> |
| <p> |
| This plan is being submitted to the Lucene developer |
| community for an initial reaction, advice, feedback, and |
| consent. Following that, it will be submitted to the |
| Lucene user community for support. Although I (Andy |
| Oliver) am capable of providing these enhancements by |
| myself, I'd of course prefer to work on them in concert |
| with others. |
| </p> |
| <p> |
| While I'm outlining a fairly large feature set, these |
| features can of course be implemented incrementally |
| (and are probably best done that way). |
| </p> |
| </section> |
| |
| <section name="Goal and Objectives"> |
| <p> |
| The goal is to provide features for Lucene that allow |
| it to be used as a drop-in search engine. It should |
| provide many of the features of projects like <a |
| href="http://www.htdig.org">htDig</a> while surpassing |
| them with unique Lucene features and capabilities, such |
| as easy installation on any Java-supporting platform |
| and support for document fields and field searches. |
| And, of course, <a href="http://apache.org/LICENSE"> |
| a pragmatic software license</a>. |
| </p> |
| <p> |
| To reach this goal we'll implement code to support the |
| following objectives that augment but do not replace |
| the current Lucene feature set. |
| </p> |
| <ul> |
| <li> |
| Document Location Independence - meaning mapping |
| real contexts to runtime contexts. |
| Essentially, if the document is at |
| /var/www/htdocs/mydoc.html, I probably want it |
| indexed as |
| http://www.bigevilmegacorp.com/mydoc.html. |
| </li> |
| <li> |
| Standard methods of creating central indices - |
| file system indexing is probably less useful in |
| many environments than is *remote* indexing (for |
| instance http). I would suggest that most folks |
| would prefer that general functionality be |
| supported by Lucene instead of having to write |
| code for every indexing project. Obviously, if |
| what they are doing is *special* they'll have to |
| code, but general document indexing across |
| web servers would not qualify. |
| </li> |
| <li> |
| Document interpretation abstraction - currently |
| one must handle document object construction via |
| custom code. A standard interface for plugging |
| in format handlers should be supported. |
| </li> |
| <li> |
| Mime and file-extension to document |
| interpretation mapping. |
| </li> |
| </ul> |
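| <p> |
| The location-independence objective above amounts to a |
| simple prefix rewrite. The sketch below is illustrative |
| only; RootContextMapper and its API are hypothetical, |
| not existing Lucene code. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch: map a real path to its runtime context by
// replacing the configured root context with a "replace with" prefix.
class RootContextMapper {
    private final String rootContext;   // e.g. /var/www/htdocs
    private final String replaceWith;   // e.g. http://www.bigevilmegacorp.com

    RootContextMapper(String rootContext, String replaceWith) {
        this.rootContext = rootContext;
        this.replaceWith = replaceWith;
    }

    /** Rewrites realPath if it starts with the root context. */
    String map(String realPath) {
        if (realPath.startsWith(rootContext)) {
            return replaceWith + realPath.substring(rootContext.length());
        }
        return realPath;
    }
}
```

| ]]></source> |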
| </section> |
| <section name="Crawlers"> |
| <p> |
| Crawlers are executable code for data sources. They |
| crawl a file system, ftp site, web site, etc. to create |
| the index. |
| These standard crawlers may not make ALL of Lucene's |
| functionality available, though they should be able to |
| make most of it available through configuration. |
| </p> |
| <p> |
| <b>AbstractCrawler</b> |
| </p> |
| <p> |
| The AbstractCrawler is basically the parent for all |
| Crawler classes. It provides implementation for the |
| following functions/properties: |
| </p> |
| <ul> |
| <li> |
| index path - where to write the index. |
| </li> |
| <li> |
| cui - create or update the index |
| </li> |
| <li> |
| root context - the start of the pathname |
| that should be replaced by the |
| replace with property or dropped |
| entirely. Example: /opt/tomcat/webapps |
| </li> |
| <li> |
| replace with - when specified replaces |
| the root context. Example: |
| http://jakarta.apache.org. |
| </li> |
| <li> |
| replacement type - the type of |
| replace with path: relative, URL or |
| path. |
| </li> |
| <li> |
| location - the location to start |
| indexing at. |
| </li> |
| <li> |
| doctypes - only index documents with |
| these doctypes. If not specified all |
| registered mime-types are used. |
| Example: "xml,doc,html" |
| </li> |
| <li> |
| recursive - turned off unless |
| specified. |
| </li> |
| <li> |
| level - optional number of directory |
| levels or links to traverse. Assumed |
| infinite by default. Ignored unless |
| recursive is turned on. Range: |
| 0 - Long.MAX_VALUE. |
| </li> |
| <li> |
| SleeptimeBetweenCalls - can be used to |
| avoid flooding a machine with too many |
| requests. |
| </li> |
| <li> |
| RequestTimeout - kill the crawler |
| request after the specified period of |
| inactivity. |
| </li> |
| <li> |
| IncludeFilter - include only items |
| matching filter. (can occur multiple |
| times) |
| </li> |
| <li> |
| ExcludeFilter - exclude items |
| matching filter. (can occur multiple |
| times) |
| </li> |
| <li> |
| ExpandOnly - follow but do not index |
| items that match this pattern (regex?) |
| (can occur multiple times) |
| </li> |
| <li> |
| NoExpand - Index but do not follow the |
| links in items that match this pattern |
| (regex?) (can occur multiple times) |
| </li> |
| <li> |
| MaxItems - stops indexing after x |
| documents have been indexed. |
| </li> |
| <li> |
| MaxMegs - stops indexing after x megs |
| have been indexed. (Should this be in |
| specific crawlers?) |
| </li> |
| <li> |
| properties - in addition to the settings |
| (probably from the command line) read |
| this properties file and get them from |
| it. Command line options override |
| the properties file in the case of |
| duplicates. There should also be an |
| environment variable or VM parameter to |
| set this. |
| </li> |
| </ul> |
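| <p> |
| To make the shape of this concrete, here is a hedged |
| sketch of how the AbstractCrawler's configuration |
| surface might look; the field names and defaults below |
| are assumptions drawn from the list above, not existing |
| Lucene code. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch of the AbstractCrawler described above.
abstract class AbstractCrawler {
    protected String indexPath;               // where to write the index
    protected boolean createIndex = true;     // cui: create (true) or update (false)
    protected String rootContext;             // e.g. /opt/tomcat/webapps
    protected String replaceWith;             // e.g. http://jakarta.apache.org
    protected String location;                // where to start indexing
    protected boolean recursive = false;      // turned off unless specified
    protected long level = Long.MAX_VALUE;    // ignored unless recursive is on
    protected long sleeptimeBetweenCalls = 0; // millis to pause between requests
    protected long maxItems = Long.MAX_VALUE; // stop after this many documents

    /** Subclasses (file system, HTTP, ...) implement the traversal. */
    abstract void crawl() throws Exception;
}
```

| ]]></source> |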
| <p> |
| <b>FileSystemCrawler</b> |
| </p> |
| <p> |
| This should extend the AbstractCrawler and |
| support any additional options required for a |
| file system index. |
| </p> |
| <p> |
| <b>HTTP Crawler </b> |
| </p> |
| <p> |
| Supports the AbstractCrawler options as well as: |
| </p> |
| <ul> |
| <li> |
| span hosts - whether to span hosts or |
| not; by default this should be off. |
| </li> |
| <li> |
| restrict domains - (ignored if span |
| hosts is not enabled). Whether all |
| spanned hosts must be in the same domain |
| (default is off). |
| </li> |
| <li> |
| try directories - Whether to attempt |
| directory listings or not (so if you |
| recurse and go to |
| /nextcontext/index.html this option says |
| to also try /nextcontext to get the dir |
| listing) |
| </li> |
| <li> |
| map extensions - |
| (always/default/never/fallback). |
| Whether to use extension mapping |
| always, by default (falling back to |
| the mime type), never, or only as a |
| fallback when the mime type is not |
| available (the default). |
| </li> |
| <li> |
| ignore robots - ignore robots.txt, on or |
| off (default - off) |
| </li> |
| </ul> |
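| <p> |
| The span hosts / restrict domains interaction can be |
| sketched as a small follow-or-not decision. HostPolicy |
| and its naive domainOf helper below are illustrative |
| assumptions, not a real crawler API. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch of the "span hosts" / "restrict domains" options.
class HostPolicy {
    private final boolean spanHosts;       // default: off
    private final boolean restrictDomains; // ignored unless spanHosts is on

    HostPolicy(boolean spanHosts, boolean restrictDomains) {
        this.spanHosts = spanHosts;
        this.restrictDomains = restrictDomains;
    }

    /** May a link on startHost that points at linkHost be followed? */
    boolean mayFollow(String startHost, String linkHost) {
        if (startHost.equals(linkHost)) return true; // same host is always fine
        if (!spanHosts) return false;                // spanning disabled
        if (!restrictDomains) return true;           // spanning, no domain check
        return domainOf(startHost).equals(domainOf(linkHost));
    }

    /** Naive "domain" = everything after the first dot. */
    private static String domainOf(String host) {
        int dot = host.indexOf('.');
        return dot < 0 ? host : host.substring(dot + 1);
    }
}
```

| ]]></source> |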
| </section> |
| |
| <section name="MIMEMap"> |
| <p> |
| A configurable registry of document types, their |
| description, an identifier, mime-type and file |
| extension. This should map both MIME -> factory |
| and extension -> factory. |
| </p> |
| <p> |
| This might be configured at compile time or by a |
| properties file, etc. For example: |
| </p> |
| <table> |
| <tr> |
| <td>Description</td> |
| <td>Identifier</td> |
| <td>Extensions</td> |
| <td>MimeType</td> |
| <td>DocumentFactory</td> |
| </tr> |
| <tr> |
| <td>"Word Document"</td> |
| <td>"doc"</td> |
| <td>"doc"</td> |
| <td>"vnd.application/ms-word"</td> |
| <td>POIWordDocumentFactory</td> |
| </tr> |
| <tr> |
| <td>"HTML Document"</td> |
| <td>"html"</td> |
| <td>"html,htm"</td> |
| <td></td> |
| <td>HTMLDocumentFactory</td> |
| </tr> |
| </table> |
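| <p> |
| A minimal sketch of such a registry, assuming a simple |
| map from mime type and extension to a factory name; the |
| implementation below is an illustration only, not |
| proposed code. |
| </p> |
| <source><![CDATA[ |

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the MIMEMap registry described above: maps both
// mime type -> factory and file extension -> factory.
class MIMEMap {
    private final Map<String, String> byMime = new HashMap<>();
    private final Map<String, String> byExtension = new HashMap<>();

    /** Registers a factory for a comma-separated extension list and an
     *  optional mime type (may be empty, as in the HTML example above). */
    void register(String factory, String extensions, String mimeType) {
        if (mimeType != null && !mimeType.isEmpty()) {
            byMime.put(mimeType, factory);
        }
        for (String ext : extensions.split(",")) {
            byExtension.put(ext.trim(), factory);
        }
    }

    String factoryForMime(String mimeType) { return byMime.get(mimeType); }
    String factoryForExtension(String ext) { return byExtension.get(ext); }
}
```

| ]]></source> |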
| </section> |
| <section name="DocumentFactory"> |
| <p> |
| An interface for classes which create document objects |
| for particular file types. Examples: |
| HTMLDocumentFactory, DOCDocumentFactory, |
| XLSDocumentFactory, XMLDocumentFactory. |
| </p> |
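| <p> |
| One possible shape for this interface is sketched |
| below. Document here is a stand-in for Lucene's own |
| document class, and every name except DocumentFactory |
| is a hypothetical illustration. |
| </p> |
| <source><![CDATA[ |

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.lucene.document.Document, for illustration only.
class Document {
    private final Map<String, String> fields = new HashMap<>();
    void add(String field, String value) { fields.put(field, value); }
    String get(String field) { return fields.get(field); }
}

// Hypothetical sketch of the pluggable format-handler interface.
interface DocumentFactory {
    /** Builds a Document from one file's (or response's) raw content. */
    Document createDocument(String name, InputStream content);
}

// Trivial example factory: index the whole content as one field.
class PlainTextDocumentFactory implements DocumentFactory {
    public Document createDocument(String name, InputStream content) {
        Document doc = new Document();
        doc.add("name", name);
        try {
            doc.add("contents", new String(content.readAllBytes()));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return doc;
    }
}
```

| ]]></source> |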
| </section> |
| <section name="FieldMapping classes"> |
| <p> |
| A class that maps standard fields from the |
| DocumentFactories into *fields* in the Document objects |
| they create. I suggest that a regular expression system |
| or XPath might be the most universal way to do this. |
| For instance, if I had an XML factory that represented |
| XML elements as fields, I could map content from |
| particular elements to document fields or suppress them |
| entirely. We could even make this configurable. |
| </p> |
| <p> |
| For example: |
| </p> |
| <ul> |
| <li> |
| htmldoc.properties |
| </li> |
| <li> |
| suppress=* |
| </li> |
| <li> |
| author=content:g/author\:\ ........................................./ |
| </li> |
| <li> |
| author.suppress=false |
| </li> |
| <li> |
| title=content:g/title\:\ ........................................./ |
| </li> |
| <li> |
| title.suppress=false |
| </li> |
| </ul> |
| <p> |
| In this example we map html documents such that all |
| fields are suppressed but author and title. We map |
| author and title to anything in the content matching |
| author: (and x characters). Okay, my regular expressions |
| suck, but hopefully you get the idea. |
| </p> |
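| <p> |
| The regular-expression flavor of this mapping could be |
| as small as the sketch below; FieldMapper and its |
| single-capture-group convention are assumptions for |
| illustration. |
| </p> |
| <source><![CDATA[ |

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract one document field from raw content with a
// regular expression whose first capture group is the field value.
class FieldMapper {
    private final Pattern pattern;

    FieldMapper(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns the first capture group in content, or null if no match. */
    String extract(String content) {
        Matcher m = pattern.matcher(content);
        return m.find() ? m.group(1) : null;
    }
}
```

| ]]></source> |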
| </section> |
| <section name="Final Thoughts"> |
| <p> |
| We might also consider eliminating the DocumentFactory |
| entirely by making an AbstractDocument from which the |
| current document object would inherit. I experimented |
| with this locally; it was a relatively minor code |
| change, and there was of course no difference in |
| performance. The DocumentFactory classes would instead |
| be instances of various subclasses of AbstractDocument. |
| </p> |
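| <p> |
| The AbstractDocument alternative might look like the |
| sketch below; the class names and the toy title parsing |
| are illustrative assumptions, not a proposal for real |
| parsing code. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch of the AbstractDocument alternative: each format
// subclasses an abstract document instead of going through a factory.
abstract class AbstractDocument {
    /** Parses raw content and populates this document's fields. */
    abstract void parse(String rawContent);
}

// Toy HTML subclass that pulls out just the title.
class HTMLDocument extends AbstractDocument {
    String title;

    void parse(String rawContent) {
        int start = rawContent.indexOf("<title>");
        int end = rawContent.indexOf("</title>");
        title = (start >= 0 && end > start)
                ? rawContent.substring(start + "<title>".length(), end)
                : null;
    }
}
```

| ]]></source> |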
| <p> |
| My inspiration for this is HTDig (http://www.htdig.org/). |
| While this goes slightly beyond what HTDig provides by |
| providing field mapping (where HTDIG is just interested |
| in Strings/numbers wherever they are found), it provides |
| at least what I would need to use this as a drop-in for |
| most places I contract at (with the obvious exception of |
| a default set of content handlers which would of course |
| develop naturally over time). |
| </p> |
| <p> |
| I can certainly contribute to this effort if the |
| development community is open to it. I'd suggest we do |
| it iteratively in stages and not aim for all of this at |
| once (for instance leave out the field mapping at first). |
| </p> |
| <p> |
| Anyhow, please give me some feedback or counter |
| suggestions, and let me know if I'm way off base or out |
| of line, etc. -Andy |
| </p> |
| </section> |
| |
| </body> |
| </document> |