| <?xml version="1.0" encoding="UTF-8"?> |
| |
| <document> |
| <properties> |
| <title>Plan for enhancements to Lucene</title> |
| <authors> |
| <person email="acoliver@apache.org" name="Andrew C. Oliver" id="AO"/> |
| </authors> |
| </properties> |
| <body> |
| |
| <section name="Purpose"> |
| <p> |
| The purpose of this document is to outline plans for |
| making <a href="http://jakarta.apache.org/lucene"> |
| Jakarta Lucene</a> work as a more general drop-in |
| component. It makes the assumption that this is an |
| objective for the Lucene user and development community. |
| </p> |
| <p> |
| The best reference is <a href="http://www.htdig.org"> |
| htDig</a>. Though it is not quite as sophisticated as |
| Lucene, it has a number of features that make it |
| desirable. It is, however, a traditional C-compiled |
| application, which makes it somewhat unpleasant to |
| install on some platforms (like Solaris!). |
| </p> |
| <p> |
| This plan is being submitted to the Lucene developer |
| community for an initial reaction, advice, feedback, and |
| consent. Following that, it will be submitted to the |
| Lucene user community for support. Although I (Andy |
| Oliver) am capable of providing these enhancements by |
| myself, I'd of course prefer to work on them in concert |
| with others. |
| </p> |
| <p> |
| While I'm outlining a fairly large feature set, these |
| features can of course be implemented incrementally |
| (and are probably best done that way). |
| </p> |
| </section> |
| |
| <section name="Goal and Objectives"> |
| <p> |
| The goal is to provide features for Lucene that allow |
| it to be used as a drop-in search engine. It should |
| provide many of the features of projects like <a |
| href="http://www.htdig.org">htDig</a> while surpassing |
| them with unique Lucene features and capabilities, such |
| as easy installation on any Java-supporting platform |
| and support for document fields and field searches. |
| And, of course, <a href="http://apache.org/LICENSE"> |
| a pragmatic software license</a>. |
| </p> |
| <p> |
| To reach this goal we'll implement code to support the |
| following objectives that augment but do not replace |
| the current Lucene feature set. |
| </p> |
| <ul> |
| <li> |
| Document Location Independence - meaning mapping |
| real contexts to runtime contexts. |
| Essentially, if the document is at |
| /var/www/htdocs/mydoc.html, I probably want it |
| indexed as |
| http://www.bigevilmegacorp.com/mydoc.html. |
| </li> |
| <li> |
| Standard methods of creating central indices - |
| file system indexing is probably less useful in |
| many environments than is *remote* indexing (for |
| instance http). I would suggest that most folks |
| would prefer that general functionality be |
| supported by Lucene instead of having to write |
| code for every indexing project. Obviously, if |
| what they are doing is *special* they'll have to |
| code, but general document indexing across |
| web servers would not qualify. |
| </li> |
| <li> |
| Document interpretation abstraction - currently |
| one must handle document object construction via |
| custom code. A standard interface for plugging |
| in format handlers should be supported. |
| </li> |
| <li> |
| Mime and file-extension to document |
| interpretation mapping. |
| </li> |
| </ul> |
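| <p> |
| The location-independence objective above amounts to a |
| simple prefix rewrite. The sketch below is illustrative |
| only; RootContextMapper and its API are hypothetical, |
| not existing Lucene code. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch: map a real path to its runtime context by
// replacing the configured root context with a "replace with" prefix.
class RootContextMapper {
    private final String rootContext;   // e.g. /var/www/htdocs
    private final String replaceWith;   // e.g. http://www.bigevilmegacorp.com

    RootContextMapper(String rootContext, String replaceWith) {
        this.rootContext = rootContext;
        this.replaceWith = replaceWith;
    }

    /** Rewrites realPath if it starts with the root context. */
    String map(String realPath) {
        if (realPath.startsWith(rootContext)) {
            return replaceWith + realPath.substring(rootContext.length());
        }
        return realPath;
    }
}
```

| ]]></source> |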
| </section> |
| <section name="Crawlers"> |
| <p> |
| Crawlers are executable code for data sources. They |
| crawl a file system, ftp site, web site, etc. to create |
| the index. |
| These standard crawlers may not make ALL of Lucene's |
| functionality available, though they should be able to |
| make most of it available through configuration. |
| </p> |
| <p> |
| <b>AbstractCrawler</b> |
| </p> |
| <p> |
| The AbstractCrawler is basically the parent for all |
| Crawler classes. It provides implementation for the |
| following functions/properties: |
| </p> |
| <ul> |
| <li> |
| index path - where to write the index. |
| </li> |
| <li> |
| cui - create or update the index |
| </li> |
| <li> |
| root context - the start of the pathname |
| that should be replaced by the |
| replace with property or dropped |
| entirely. Example: /opt/tomcat/webapps |
| </li> |
| <li> |
| replace with - when specified replaces |
| the root context. Example: |
| http://jakarta.apache.org. |
| </li> |
| <li> |
| replacement type - the type of |
| replace with path: relative, URL or |
| path. |
| </li> |
| <li> |
| location - the location to start |
| indexing at. |
| </li> |
| <li> |
| doctypes - only index documents with |
| these doctypes. If not specified all |
| registered mime-types are used. |
| Example: "xml,doc,html" |
| </li> |
| <li> |
| recursive - turned off unless |
| specified. |
| </li> |
| <li> |
| level - optional number of directory |
| levels or links to traverse. Assumed |
| infinite by default. Ignored unless |
| recursive is turned on. Range: |
| 0 - Long.MAX_VALUE. |
| </li> |
| <li> |
| SleeptimeBetweenCalls - can be used to |
| avoid flooding a machine with too many |
| requests. |
| </li> |
| <li> |
| RequestTimeout - kill the crawler |
| request after the specified period of |
| inactivity. |
| </li> |
| <li> |
| IncludeFilter - include only items |
| matching filter. (can occur multiple |
| times) |
| </li> |
| <li> |
| ExcludeFilter - exclude items |
| matching filter. (can occur multiple |
| times) |
| </li> |
| <li> |
| ExpandOnly - follow but do not index |
| items that match this pattern (regex?) |
| (can occur multiple times) |
| </li> |
| <li> |
| NoExpand - Index but do not follow the |
| links in items that match this pattern |
| (regex?) (can occur multiple times) |
| </li> |
| <li> |
| MaxItems - stops indexing after x |
| documents have been indexed. |
| </li> |
| <li> |
| MaxMegs - stops indexing after x megs |
| have been indexed. (Should this be in |
| specific crawlers?) |
| </li> |
| <li> |
| properties - in addition to the settings |
| (probably from the command line) read |
| this properties file and get them from |
| it. Command line options override |
| the properties file in the case of |
| duplicates. There should also be an |
| environment variable or VM parameter to |
| set this. |
| </li> |
| </ul> |
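| <p> |
| To make the shape of this concrete, here is a hedged |
| sketch of how the AbstractCrawler's configuration |
| surface might look; the field names and defaults below |
| are assumptions drawn from the list above, not existing |
| Lucene code. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch of the AbstractCrawler described above.
abstract class AbstractCrawler {
    protected String indexPath;               // where to write the index
    protected boolean createIndex = true;     // cui: create (true) or update (false)
    protected String rootContext;             // e.g. /opt/tomcat/webapps
    protected String replaceWith;             // e.g. http://jakarta.apache.org
    protected String location;                // where to start indexing
    protected boolean recursive = false;      // turned off unless specified
    protected long level = Long.MAX_VALUE;    // ignored unless recursive is on
    protected long sleeptimeBetweenCalls = 0; // millis to pause between requests
    protected long maxItems = Long.MAX_VALUE; // stop after this many documents

    /** Subclasses (file system, HTTP, ...) implement the traversal. */
    abstract void crawl() throws Exception;
}
```

| ]]></source> |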
| <p> |
| <b>FileSystemCrawler</b> |
| </p> |
| <p> |
| This should extend the AbstractCrawler and |
| support any additional options required for a |
| file system index. |
| </p> |
| <p> |
| <b>HTTP Crawler </b> |
| </p> |
| <p> |
| Supports the AbstractCrawler options as well as: |
| </p> |
| <ul> |
| <li> |
| span hosts - whether to span hosts or |
| not; by default this should be off. |
| </li> |
| <li> |
| restrict domains - (ignored if span |
| hosts is not enabled). Whether all |
| spanned hosts must be in the same domain |
| (default is off). |
| </li> |
| <li> |
| try directories - Whether to attempt |
| directory listings or not (so if you |
| recurse and go to |
| /nextcontext/index.html this option says |
| to also try /nextcontext to get the dir |
| listing) |
| </li> |
| <li> |
| map extensions - |
| (always/default/never/fallback). |
| Whether to use extension mapping |
| always, by default (falling back to |
| the mime type), never, or only as a |
| fallback when the mime type is not |
| available (the default). |
| </li> |
| <li> |
| ignore robots - ignore robots.txt, on or |
| off (default - off) |
| </li> |
| </ul> |
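| <p> |
| The span hosts / restrict domains interaction can be |
| sketched as a small follow-or-not decision. HostPolicy |
| and its naive domainOf helper below are illustrative |
| assumptions, not a real crawler API. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch of the "span hosts" / "restrict domains" options.
class HostPolicy {
    private final boolean spanHosts;       // default: off
    private final boolean restrictDomains; // ignored unless spanHosts is on

    HostPolicy(boolean spanHosts, boolean restrictDomains) {
        this.spanHosts = spanHosts;
        this.restrictDomains = restrictDomains;
    }

    /** May a link on startHost that points at linkHost be followed? */
    boolean mayFollow(String startHost, String linkHost) {
        if (startHost.equals(linkHost)) return true; // same host is always fine
        if (!spanHosts) return false;                // spanning disabled
        if (!restrictDomains) return true;           // spanning, no domain check
        return domainOf(startHost).equals(domainOf(linkHost));
    }

    /** Naive "domain" = everything after the first dot. */
    private static String domainOf(String host) {
        int dot = host.indexOf('.');
        return dot < 0 ? host : host.substring(dot + 1);
    }
}
```

| ]]></source> |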
| </section> |
| |
| <section name="MIMEMap"> |
| <p> |
| A configurable registry of document types, their |
| description, an identifier, mime-type and file |
| extension. This should map both MIME -> factory |
| and extension -> factory. |
| </p> |
| <p> |
| This might be configured at compile time or by a |
| properties file, etc. For example: |
| </p> |
| <table> |
| <tr> |
| <td>Description</td> |
| <td>Identifier</td> |
| <td>Extensions</td> |
| <td>MimeType</td> |
| <td>DocumentFactory</td> |
| </tr> |
| <tr> |
| <td>"Word Document"</td> |
| <td>"doc"</td> |
| <td>"doc"</td> |
| <td>"vnd.application/ms-word"</td> |
| <td>POIWordDocumentFactory</td> |
| </tr> |
| <tr> |
| <td>"HTML Document"</td> |
| <td>"html"</td> |
| <td>"html,htm"</td> |
| <td></td> |
| <td>HTMLDocumentFactory</td> |
| </tr> |
| </table> |
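| <p> |
| A minimal sketch of such a registry, assuming a simple |
| map from mime type and extension to a factory name; the |
| implementation below is an illustration only, not |
| proposed code. |
| </p> |
| <source><![CDATA[ |

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the MIMEMap registry described above: maps both
// mime type -> factory and file extension -> factory.
class MIMEMap {
    private final Map<String, String> byMime = new HashMap<>();
    private final Map<String, String> byExtension = new HashMap<>();

    /** Registers a factory for a comma-separated extension list and an
     *  optional mime type (may be empty, as in the HTML example above). */
    void register(String factory, String extensions, String mimeType) {
        if (mimeType != null && !mimeType.isEmpty()) {
            byMime.put(mimeType, factory);
        }
        for (String ext : extensions.split(",")) {
            byExtension.put(ext.trim(), factory);
        }
    }

    String factoryForMime(String mimeType) { return byMime.get(mimeType); }
    String factoryForExtension(String ext) { return byExtension.get(ext); }
}
```

| ]]></source> |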
| </section> |
| <section name="DocumentFactory"> |
| <p> |
| An interface for classes which create document objects |
| for particular file types. Examples: |
| HTMLDocumentFactory, DOCDocumentFactory, |
| XLSDocumentFactory, XMLDocumentFactory. |
| </p> |
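| <p> |
| One possible shape for this interface is sketched |
| below. Document here is a stand-in for Lucene's own |
| document class, and every name except DocumentFactory |
| is a hypothetical illustration. |
| </p> |
| <source><![CDATA[ |

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.lucene.document.Document, for illustration only.
class Document {
    private final Map<String, String> fields = new HashMap<>();
    void add(String field, String value) { fields.put(field, value); }
    String get(String field) { return fields.get(field); }
}

// Hypothetical sketch of the pluggable format-handler interface.
interface DocumentFactory {
    /** Builds a Document from one file's (or response's) raw content. */
    Document createDocument(String name, InputStream content);
}

// Trivial example factory: index the whole content as one field.
class PlainTextDocumentFactory implements DocumentFactory {
    public Document createDocument(String name, InputStream content) {
        Document doc = new Document();
        doc.add("name", name);
        try {
            doc.add("contents", new String(content.readAllBytes()));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return doc;
    }
}
```

| ]]></source> |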
| </section> |
| <section name="FieldMapping classes"> |
| <p> |
| A class that maps standard fields from the |
| DocumentFactories into *fields* in the Document objects |
| they create. I suggest that a regular expression system |
| or XPath might be the most universal way to do this. |
| For instance, if I had an XML factory that represented |
| XML elements as fields, I could map content from |
| particular elements to document fields or suppress them |
| entirely. We could even make this configurable. |
| </p> |
| <p> |
| For example: |
| </p> |
| <ul> |
| <li> |
| htmldoc.properties |
| </li> |
| <li> |
| suppress=* |
| </li> |
| <li> |
| author=content:g/author\:\ ........................................./ |
| </li> |
| <li> |
| author.suppress=false |
| </li> |
| <li> |
| title=content:g/title\:\ ........................................./ |
| </li> |
| <li> |
| title.suppress=false |
| </li> |
| </ul> |
| <p> |
| In this example we map html documents such that all |
| fields are suppressed but author and title. We map |
| author and title to anything in the content matching |
| author: (and x characters). Okay, my regular expressions |
| suck, but hopefully you get the idea. |
| </p> |
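| <p> |
| The regular-expression flavor of this mapping could be |
| as small as the sketch below; FieldMapper and its |
| single-capture-group convention are assumptions for |
| illustration. |
| </p> |
| <source><![CDATA[ |

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract one document field from raw content with a
// regular expression whose first capture group is the field value.
class FieldMapper {
    private final Pattern pattern;

    FieldMapper(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns the first capture group in content, or null if no match. */
    String extract(String content) {
        Matcher m = pattern.matcher(content);
        return m.find() ? m.group(1) : null;
    }
}
```

| ]]></source> |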
| </section> |
| <section name="Final Thoughts"> |
| <p> |
| We might also consider eliminating the DocumentFactory |
| entirely by making an AbstractDocument from which the |
| current document object would inherit. I experimented |
| with this locally; it was a relatively minor code |
| change, and there was of course no difference in |
| performance. The DocumentFactory classes would instead |
| be instances of various subclasses of AbstractDocument. |
| </p> |
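| <p> |
| The AbstractDocument alternative might look like the |
| sketch below; the class names and the toy title parsing |
| are illustrative assumptions, not a proposal for real |
| parsing code. |
| </p> |
| <source><![CDATA[ |

```java
// Hypothetical sketch of the AbstractDocument alternative: each format
// subclasses an abstract document instead of going through a factory.
abstract class AbstractDocument {
    /** Parses raw content and populates this document's fields. */
    abstract void parse(String rawContent);
}

// Toy HTML subclass that pulls out just the title.
class HTMLDocument extends AbstractDocument {
    String title;

    void parse(String rawContent) {
        int start = rawContent.indexOf("<title>");
        int end = rawContent.indexOf("</title>");
        title = (start >= 0 && end > start)
                ? rawContent.substring(start + "<title>".length(), end)
                : null;
    }
}
```

| ]]></source> |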
| <p> |
| My inspiration for this is HTDig (http://www.htdig.org/). |
| While this goes slightly beyond what HTDig provides by |
| providing field mapping (where HTDIG is just interested |
| in Strings/numbers wherever they are found), it provides |
| at least what I would need to use this as a drop-in for |
| most places I contract at (with the obvious exception of |
| a default set of content handlers which would of course |
| develop naturally over time). |
| </p> |
| <p> |
| I can certainly contribute to this effort if the |
| development community is open to it. I'd suggest we do |
| it iteratively in stages and not aim for all of this at |
| once (for instance leave out the field mapping at first). |
| </p> |
| <p> |
| Anyhow, please give me some feedback or counter |
| suggestions, and let me know if I'm way off base or out |
| of line, etc. -Andy |
| </p> |
| </section> |
| |
| </body> |
| </document> |