docs/benchmarks.html - lucene-solr - Git at Google

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
 <html>
 <head>
 <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
 <meta content="Apache Forrest" name="Generator">
 <meta name="Forrest-version" content="0.8">
 <meta name="Forrest-skin-name" content="lucene">
 <title>Apache Lucene - Resources - Performance Benchmarks</title>
 <link type="text/css" href="skin/basic.css" rel="stylesheet">
 <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
 <link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
 <link type="text/css" href="skin/profile.css" rel="stylesheet">
 <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
 <link rel="shortcut icon" href="images/favicon.ico">
 </head>
 <body onload="init()">
 <script type="text/javascript">ndeSetTextSize();</script>
 <div id="top">
 <!--+
     |breadtrail
     +-->
 <div class="breadtrail">
 <a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
 </div>
 <!--+
     |header
     +-->
 <div class="header">
 <!--+
     |start group logo
     +-->
 <div class="grouplogo">
 <a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="http://www.apache.org/images/asf_logo_simple.png" title="Apache Lucene"></a>
 </div>
 <!--+
     |end group logo
     +-->
 <!--+
     |start Project Logo
     +-->
 <div class="projectlogo">
 <a href="http://lucene.apache.org/java/"><img class="logoImage" alt="Lucene" src="http://lucene.apache.org/images/lucene_green_300.gif" title="Apache Lucene is a high-performance, full-featured text search engine library written entirely in
       Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform."></a>
 </div>
 <!--+
     |end Project Logo
     +-->
 <!--+
     |start Search
     +-->
 <div class="searchbox">
 <form action="http://search.lucidimagination.com/p:lucene" method="get" class="roundtopsmall">
 <input onFocus="getBlank (this, 'Search the site with Lucene');" size="25" name="q" id="query" type="text" value="Search the site with Lucene">&nbsp;
                     <input name="Search" value="Search" type="submit">
 </form>
 <div style="position: relative; top: -5px; left: -10px">Powered by <a href="http://www.lucidimagination.com" style="color: #033268">Lucid Imagination</a>
 </div>
 </div>
 <!--+
     |end search
     +-->
 <!--+
     |start Tabs
     +-->
 <ul id="tabs">
 <li class="current">
 <a class="selected" href="http://lucene.apache.org/java/docs/">Main</a>
 </li>
 <li>
 <a class="unselected" href="http://wiki.apache.org/lucene-java">Wiki</a>
 </li>
 <li class="current">
 <a class="selected" href="index.html">Lucene 2.4.1 Documentation</a>
 </li>
 </ul>
 <!--+
     |end Tabs
     +-->
 </div>
 </div>
 <div id="main">
 <div id="publishedStrip">
 <!--+
     |start Subtabs
     +-->
 <div id="level2tabs"></div>
 <!--+
     |end Endtabs
     +-->
 <script type="text/javascript"><!--
 document.write("Last Published: " + document.lastModified);
 //  --></script>
 </div>
 <!--+
     |breadtrail
     +-->
 <div class="breadtrail">

              &nbsp;
            </div>
 <!--+
     |start Menu, mainarea
     +-->
 <!--+
     |start Menu
     +-->
 <div id="menu">
 <div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
 <div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
 <div class="menuitem">
 <a href="index.html">Overview</a>
 </div>
 <div onclick="SwitchMenu('menu_1.1.2', 'skin/')" id="menu_1.1.2Title" class="menutitle">Javadocs</div>
 <div id="menu_1.1.2" class="menuitemgroup">
 <div class="menuitem">
 <a href="api/index.html">All</a>
 </div>
 <div class="menuitem">
 <a href="api/core/index.html">Core</a>
 </div>
 <div class="menuitem">
 <a href="api/demo/index.html">Demo</a>
 </div>
 <div onclick="SwitchMenu('menu_1.1.2.4', 'skin/')" id="menu_1.1.2.4Title" class="menutitle">Contrib</div>
 <div id="menu_1.1.2.4" class="menuitemgroup">
 <div class="menuitem">
 <a href="api/contrib-analyzers/index.html">Analyzers</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-ant/index.html">Ant</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-bdb/index.html">Bdb</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-bdb-je/index.html">Bdb-je</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-benchmark/index.html">Benchmark</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-highlighter/index.html">Highlighter</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-instantiated/index.html">Instantiated</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-lucli/index.html">Lucli</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-memory/index.html">Memory</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-misc/index.html">Miscellaneous</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-queries/index.html">Queries</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-regex/index.html">Regex</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-snowball/index.html">Snowball</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-spellchecker/index.html">Spellchecker</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-surround/index.html">Surround</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-swing/index.html">Swing</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-wikipedia/index.html">Wikipedia</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-wordnet/index.html">Wordnet</a>
 </div>
 <div class="menuitem">
 <a href="api/contrib-xml-query-parser/index.html">XML Query Parser</a>
 </div>
 </div>
 </div>
 <div class="menupage">
 <div class="menupagetitle">Benchmarks</div>
 </div>
 <div class="menuitem">
 <a href="contributions.html">Contributions</a>
 </div>
 <div class="menuitem">
 <a href="http://wiki.apache.org/lucene-java/LuceneFAQ">FAQ</a>
 </div>
 <div class="menuitem">
 <a href="fileformats.html">File Formats</a>
 </div>
 <div class="menuitem">
 <a href="gettingstarted.html">Getting Started</a>
 </div>
 <div class="menuitem">
 <a href="lucene-sandbox/index.html">Lucene Sandbox</a>
 </div>
 <div class="menuitem">
 <a href="queryparsersyntax.html">Query Syntax</a>
 </div>
 <div class="menuitem">
 <a href="scoring.html">Scoring</a>
 </div>
 <div class="menuitem">
 <a href="http://wiki.apache.org/lucene-java">Wiki</a>
 </div>
 </div>
 <div id="credit"></div>
 <div id="roundbottom">
 <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
 <!--+
   |alternative credits
   +-->
 <div id="credit2"></div>
 </div>
 <!--+
     |end Menu
     +-->
 <!--+
     |start content
     +-->
 <div id="content">
 <div title="Portable Document Format" class="pdflink">
 <a class="dida" href="benchmarks.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
         PDF</a>
 </div>
 <h1>Apache Lucene - Resources - Performance Benchmarks</h1>
 <div id="minitoc-area">
 <ul class="minitoc">
 <li>
 <a href="#Performance Benchmarks">Performance Benchmarks</a>
 </li>
 <li>
 <a href="#Benchmark Variables">Benchmark Variables</a>
 </li>
 <li>
 <a href="#User-submitted Benchmarks">User-submitted Benchmarks</a>
 <ul class="minitoc">
 <li>
 <a href="#Hamish Carpenter's benchmarks">Hamish Carpenter's benchmarks</a>
 </li>
 <li>
 <a href="#Justin Greene's benchmarks">Justin Greene's benchmarks</a>
 </li>
 <li>
 <a href="#Daniel Armbrust's benchmarks">Daniel Armbrust's benchmarks</a>
 </li>
 <li>
 <a href="#Geoffrey Peddle's benchmarks">Geoffrey Peddle's benchmarks</a>
 </li>
 </ul>
 </li>
 </ul>
 </div>


 <a name="N10013"></a><a name="Performance Benchmarks"></a>
 <h2 class="boxed">Performance Benchmarks</h2>
 <div class="section">
 <p>
                 The purpose of these user-submitted performance figures is to
                 give current and potential users of Lucene a sense
                 of how well Lucene scales. If the requirements for an upcoming
                 project is similar to an existing benchmark, you
                 will also have something to work with when designing the system
                 architecture for the application.
             </p>
 <p>
                 If you've conducted performance tests with Lucene, we'd
                 appreciate if you can submit these figures for display
                 on this page. Post these figures to the lucene-user mailing list
                 using this
                 <a href="benchmarktemplate.xml">template</a>.
             </p>
 </div>


 <a name="N10023"></a><a name="Benchmark Variables"></a>
 <h2 class="boxed">Benchmark Variables</h2>
 <div class="section">
 <p>

 <ul>

 <p>

 <b>Hardware Environment</b>
 <br>

 <li>
 <i>Dedicated machine for indexing</i>: Self-explanatory
                             (yes/no)</li>

 <li>
 <i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>

 <li>
 <i>RAM</i>: Self-explanatory</li>

 <li>
 <i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
                             RAID-1, RAID-5)</li>

 </p>

 <p>

 <b>Software environment</b>
 <br>

 <li>
 <i>Lucene Version</i>: Self-explanatory</li>

 <li>
 <i>Java Version</i>: Version of Java SDK/JRE that is run
                         </li>

 <li>
 <i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>

 <li>
 <i>OS Version</i>: Self-explanatory</li>

 <li>
 <i>Location of index</i>: Is the index stored in filesystem
                             or database? Is it on the same server(local) or
                             over the network?</li>

 </p>

 <p>

 <b>Lucene indexing variables</b>
 <br>

 <li>
 <i>Number of source documents</i>: Number of documents being
                             indexed</li>

 <li>
 <i>Total filesize of source documents</i>:
                             Self-explanatory</li>

 <li>
 <i>Average filesize of source documents</i>:
                             Self-explanatory</li>

 <li>
 <i>Source documents storage location</i>: Where are the
                             documents being indexed located?
                             Filesystem, DB, http, etc.</li>

 <li>
 <i>File type of source documents</i>: Types of files being
                             indexed, e.g. HTML files, XML files, PDF files, etc.</li>

 <li>
 <i>Parser(s) used, if any</i>: Parsers used for parsing the
                             various files for indexing,
                             e.g. XML parser, HTML parser, etc.</li>

 <li>
 <i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>

 <li>
 <i>Number of fields per document</i>: Number of Fields each
                             Document contains</li>

 <li>
 <i>Type of fields</i>: Type of each field</li>

 <li>
 <i>Index persistence</i>: Where the index is stored, e.g.
                             FSDirectory, SqlDirectory, etc.</li>

 </p>

 <p>

 <b>Figures</b>
 <br>

 <li>
 <i>Time taken (in ms/s as an average of at least 3 indexing
                                 runs)</i>: Time taken to index all files</li>

 <li>
 <i>Time taken / 1000 docs indexed</i>: Time taken to index
                             1000 files</li>

 <li>
 <i>Memory consumption</i>: Self-explanatory</li>

 <li>
 <i>Query speed</i>: average time a query takes, type
                             of queries (e.g. simple one-term query, phrase query),
                             not measuring any overhead outside Lucene</li>

 </p>

 <p>

 <b>Notes</b>
 <br>

 <li>
 <i>Notes</i>: Any comments which don't belong in the above,
                             special tuning/strategies, etc.</li>

 </p>

 </ul>

 </p>
 </div>


 <a name="N100CA"></a><a name="User-submitted Benchmarks"></a>
 <h2 class="boxed">User-submitted Benchmarks</h2>
 <div class="section">
 <p>
                 These benchmarks have been kindly submitted by Lucene users for
                 reference purposes.
             </p>
 <p>
 <b>We make NO guarantees regarding their accuracy or
                     validity.</b>

 </p>
 <p>We strongly recommend you conduct your own
                 performance benchmarks before deciding on a particular
                 hardware/software setup (and hopefully submit
                 these figures to us).
             </p>
 <a name="N100DA"></a><a name="Hamish Carpenter's benchmarks"></a>
 <h3 class="boxed">Hamish Carpenter's benchmarks</h3>
 <ul>

 <p>

 <b>Hardware Environment</b>
 <br>

 <li>
 <i>Dedicated machine for indexing</i>: yes</li>

 <li>
 <i>CPU</i>: Intel x86 P4 1.5Ghz</li>

 <li>
 <i>RAM</i>: 512 DDR</li>

 <li>
 <i>Drive configuration</i>: IDE 7200rpm Raid-1</li>

 </p>

 <p>

 <b>Software environment</b>
 <br>

 <li>
 <i>Lucene Version</i>: 1.3</li>

 <li>
 <i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>

 <li>
 <i>Java VM</i>: </li>

 <li>
 <i>OS Version</i>: Debian Linux 2.4.18-686</li>

 <li>
 <i>Location of index</i>: local</li>

 </p>

 <p>

 <b>Lucene indexing variables</b>
 <br>

 <li>
 <i>Number of source documents</i>: Random generator. Set
                             to make 1M documents
                             in 2x500,000 batches.</li>

 <li>
 <i>Total filesize of source documents</i>: &gt; 1GB if
                             stored</li>

 <li>
 <i>Average filesize of source documents</i>: 1KB</li>

 <li>
 <i>Source documents storage location</i>: Filesystem</li>

 <li>
 <i>File type of source documents</i>: Generated</li>

 <li>
 <i>Parser(s) used, if any</i>: </li>

 <li>
 <i>Analyzer(s) used</i>: Default</li>

 <li>
 <i>Number of fields per document</i>: 11</li>

 <li>
 <i>Type of fields</i>: 1 date, 1 id, 9 text</li>

 <li>
 <i>Index persistence</i>: FSDirectory</li>

 </p>

 <p>

 <b>Figures</b>
 <br>

 <li>
 <i>Time taken (in ms/s as an average of at least 3
                                 indexing runs)</i>: </li>

 <li>
 <i>Time taken / 1000 docs indexed</i>: 49 seconds</li>

 <li>
 <i>Memory consumption</i>:</li>

 </p>

 <p>

 <b>Notes</b>
 <br>

 <p>
                                 A windows client ran a random document generator which
                                 created
                                 documents based on some arrays of values and an excerpt
                                 (approx 1kb)
                                 from a text file of the bible (King James version).<br>
                                 These were submitted via a socket connection (open throughout
                                 indexing process).<br>
                                 The index writer was not closed between index calls.<br>
                                 This created a 400Mb index in 23 files (after
                                 optimization).<br>

 </p>

 <p>

 <u>Query details</u>:<br>

 </p>

 <p>
                                 Set up a threaded class to start x number of simultaneous
                                 threads to
                                 search the above created index.
                             </p>

 <p>
                                 Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
                                 (Teaser:goo* Tea
                                 ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
                                 +DisplayStartDate:[mkwsw2jk0
                                 -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
                             </p>

 <p>
                                 This query counted 34000 documents and I limited the returned
                                 documents
                                 to 5.
                             </p>

 <p>
                                 This is using Peter Halacsy's IndexSearcherCache slightly
                                 modified to
                                 be a singleton returned cached searchers for a given
                                 directory. This
                                 solved an initial problem with too many files open and
                                 running out of
                                 linux handles for them.
                             </p>

 <pre>
                                 Threads|Avg Time per query (ms)
                                 1       1009ms
                                 2       2043ms
                                 3       3087ms
                                 4       4045ms
                                 ..        .
                                 ..        .
                                 10      10091ms
                             </pre>

 <p>
                                 I removed the two date range terms from the query and it made
                                 a HUGE
                                 difference in performance. With 4 threads the avg time
                                 dropped to 900ms!
                             </p>

 <p>Other query optimizations made little difference.</p>

 </p>

 </ul>
 <p>
                     Hamish can be contacted at hamish at catalyst.net.nz.
                 </p>
 <a name="N1019F"></a><a name="Justin Greene's benchmarks"></a>
 <h3 class="boxed">Justin Greene's benchmarks</h3>
 <ul>

 <p>

 <b>Hardware Environment</b>
 <br>

 <li>
 <i>Dedicated machine for indexing</i>: No, but nominal
                             usage at time of indexing.</li>

 <li>
 <i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>

 <li>
 <i>RAM</i>: 1GB, 256MB allocated to JVM.</li>

 <li>
 <i>Drive configuration</i>: RAID 5 on Fibre Channel
                             Array</li>

 </p>

 <p>

 <b>Software environment</b>
 <br>

 <li>
 <i>Java Version</i>: 1.3.1_06</li>

 <li>
 <i>Java VM</i>: </li>

 <li>
 <i>OS Version</i>: Winnt 4/Sp6</li>

 <li>
 <i>Location of index</i>: local</li>

 </p>

 <p>

 <b>Lucene indexing variables</b>
 <br>

 <li>
 <i>Number of source documents</i>: about 60K</li>

 <li>
 <i>Total filesize of source documents</i>: 6.5GB</li>

 <li>
 <i>Average filesize of source documents</i>: 100K
                             (6.5GB/60K documents)</li>

 <li>
 <i>Source documents storage location</i>: filesystem on
                             NTFS</li>

 <li>
 <i>File type of source documents</i>: </li>

 <li>
 <i>Parser(s) used, if any</i>: Currently the only parser
                             used is the Quiotix html
                             parser.</li>

 <li>
 <i>Analyzer(s) used</i>: SimpleAnalyzer</li>

 <li>
 <i>Number of fields per document</i>: 8</li>

 <li>
 <i>Type of fields</i>: All strings, and all are stored
                             and indexed.</li>

 <li>
 <i>Index persistence</i>: FSDirectory</li>

 </p>

 <p>

 <b>Figures</b>
 <br>

 <li>
 <i>Time taken (in ms/s as an average of at least 3
                                 indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
                             minutes.  Note that the #
                             and size of documents changes daily.</li>

 <li>
 <i>Time taken / 1000 docs indexed</i>: </li>

 <li>
 <i>Memory consumption</i>: JVM is given 256MB and uses it
                             all.</li>

 </p>

 <p>

 <b>Notes</b>
 <br>

 <p>
                                 We have 10 threads reading files from the filesystem and
                                 parsing and
                                 analyzing them and the pushing them onto a queue and a single
                                 thread poping
                                 them from the queue and indexing.  Note that we are indexing
                                 email messages
                                 and are storing the entire plaintext in of the message in the
                                 index.  If the
                                 message contains attachment and we do not have a filter for
                                 the attachment
                                 (ie. we do not do PDFs yet), we discard the data.
                             </p>

 </p>

 </ul>
 <p>
                     Justin can be contacted at tvxh-lw4x at spamex.com.
                 </p>
 <a name="N1023A"></a><a name="Daniel Armbrust's benchmarks"></a>
 <h3 class="boxed">Daniel Armbrust's benchmarks</h3>
 <p>
                     My disclaimer is that this is a very poor "Benchmark".  It was not done for raw speed,
                     nor was the total index built in one shot.  The index was created on several different
                     machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
                     1 million documents per batch.  Each of these small indexes was then moved to a
                     much larger drive, where they were all merged together into a big index.
                     This process was done manually, over the course of several months, as the sources became available.
                 </p>
 <ul>

 <p>

 <b>Hardware Environment</b>
 <br>

 <li>
 <i>Dedicated machine for indexing</i>: no - The machine had moderate to low load.  However, the indexing process was built single
                             threaded, so it only took advantage of 1 of the processors.  It usually got 100% of this processor.</li>

 <li>
 <i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>

 <li>
 <i>RAM</i>: 4 GB Memory</li>

 <li>
 <i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>

 </p>

 <p>

 <b>Software environment</b>
 <br>

 <li>
 <i>Lucene Version</i>: 1.2</li>

 <li>
 <i>Java Version</i>: 1.3.1</li>

 <li>
 <i>Java VM</i>: </li>

 <li>
 <i>OS Version</i>: Sun 5.8 (64 bit)</li>

 <li>
 <i>Location of index</i>: local</li>

 </p>

 <p>

 <b>Lucene indexing variables</b>
 <br>

 <li>
 <i>Number of source documents</i>: 13,820,517</li>

 <li>
 <i>Total filesize of source documents</i>: 87.3 GB</li>

 <li>
 <i>Average filesize of source documents</i>: 6.3 KB</li>

 <li>
 <i>Source documents storage location</i>: Filesystem</li>

 <li>
 <i>File type of source documents</i>: XML</li>

 <li>
 <i>Parser(s) used, if any</i>: </li>

 <li>
 <i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>

 <li>
 <i>Number of fields per document</i>: 1 - 31</li>

 <li>
 <i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>

 <li>
 <i>Index persistence</i>: FSDirectory</li>

 <li>
 <i>Index size</i>: 12.5 GB</li>

 </p>

 <p>

 <b>Figures</b>
 <br>

 <li>
 <i>Time taken (in ms/s as an average of at least 3
                                 indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>

 <li>
 <i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>

 <li>
 <i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
                             1 GB of memory was allotted to the indexer</li>

 </p>

 <p>

 <b>Notes</b>
 <br>

 <p>
                                 The source documents were XML.  The "indexer" opened each document one at a time, ran an
                                 XSL transformation on them, and then proceeded to index the stream.  The indexer optimized
                                 the index every 50,000 documents (on this run) though previously, we optimized every
                                 300,000 documents.  The performance didn't change much either way.  We did no other
                                 tuning (RAM Directories, separate process to pretransform the source material, etc.)
                                 to make it index faster.  When all of these individual indexes were built, they were
                                 merged together into the main index.  That process usually took ~ a day.
                             </p>

 </p>

 </ul>
 <p>
                     Daniel can be contacted at Armbrust.Daniel at mayo.edu.
                 </p>
 <a name="N102E2"></a><a name="Geoffrey Peddle's benchmarks"></a>
 <h3 class="boxed">Geoffrey Peddle's benchmarks</h3>
 <p>
                   I'm doing a technical evaluation of search engines
                   for Ariba, an enterprise application software company.
                    I compared Lucene to a commercial C language based
                   search engine which I'll refer to as vendor A.
                   Overall Lucene's performance was similar to vendor A
                   and met our application's requirements.  I've
                   summarized our results below.
                 </p>
 <p>
                   Search scalability:<br>
                   We ran a set of 16 queries in a single thread for 20
                   iterations.  We report below the times for the last 15
                   iterations (ie after the system was warmed up).   The
                   4 sets of results below are for indexes with between
                   50,000 documents to 600,000 documents.  Although the
                   times for Lucene grew faster with document count than
                   vendor A they were comparable.
                 </p>
 <pre>
 50K  documents
 Lucene   5.2   seconds
 A        7.2
 200K
 Lucene   15.3
 A        15.2
 400K
 Lucene    28.2
 A         25.5
 600K
 Lucene    41
 A         33
 </pre>
 <p>
                   Individual Query times:<br>
                   Total query times are very similar between the 2
                   systems but there were larger differences when you
                   looked at individual queries.
                 </p>
 <p>
                   For simple queries with small result sets Vendor A was
                   consistently faster than Lucene.   For example a
                   single query might take vendor A 32 thousands of a
                   second and Lucene 64 thousands of a second.    Both
                   times are however well within acceptable response
                   times for our application.
                 </p>
 <p>
                   For simple queries with large result sets Vendor A was
                   consistently slower than Lucene.   For example a
                   single query might take vendor A 300 thousands of a
                   second and Lucene 200 thousands of a second.
                   For more complex queries of the form   (term1 or term2
                   or term3)  AND (term4 or term5 or term6) AND (term7 or
                   term8)    the results were more divergent.  For
                   queries with small result sets Vendor A generally had
                   very short response times and sometimes Lucene had
                   significantly larger response times.  For example
                   Vendor A might take 16 thousands of a second and
                   Lucene might take 156.   I do not consider it to be
                   the case that Lucene's response time grew unexpectedly
                   but rather that Vendor A appeared to be taking
                   advantage of an optimization which Lucene didn't have.
                     (I believe there's been discussions on the dev
                   mailing list on complex queries of this sort.)
                 </p>
 <p>
                   Index Size:<br>
                   For our test data the size of both indexes grew
                   linearly with the number of documents.   Note that
                   these sizes are compact sizes, not maximum size during
                   index loading.   The numbers below are from running du
                   -k in the directory containing the index data.   The
                   larger number's below for Vendor A may be because it
                   supports additional functionality not available in
                   Lucene.   I think it's the constant rate of growth
                   rather than the absolute amount which is more
                   important.
                 </p>
 <pre>
 50K  documents
 Lucene      45516 K
 A           63921
 200K
 Lucene      171565
 A           228370
 400K
 Lucene      345717
 A           457843
 600K
 Lucene      511338
 A           684913
 </pre>
 <p>
                   Indexing Times:<br>
                   These times are for reading the documents from our
                   database, processing them, inserting them into the
                   document search product and index compacting.   Our
                   data has a large number of fields/attributes.   For
                   this test I restricted Lucene to 24 attributes to
                   reduce the number of files created.  Doing this I was
                   able to specify a merge width for Lucene of 60.   I
                   found in general that Lucene indexing performance to
                   be very sensitive to changes in the merge width.
                   Note also that our application does a full compaction
                   after inserting every 20,000 documents.   These times
                   are just within our acceptable limits but we are
                   interested in alternatives to increase Lucene's
                   performance in this area.
                 </p>
 <p>

 <pre>
 600K documents
 Lucene       81 minutes
 A            34 minutes
 </pre>

 </p>
 <p>
                   (I don't have accurate results for all sizes on this
                   measure but believe that the indexing time for both
                   solutions grew essentially linearly with size.   The
                   time to compact the index generally grew with index
                   size but it's a small percent of overall time at these
                   sizes.)
                 </p>
 <ul>

 <p>

 <b>Hardware Environment</b>
 <br>

 <li>
 <i>Dedicated machine for indexing</i>: yes</li>

 <li>
 <i>CPU</i>: Dell Pentium 4 CPU 2.00Ghz, 1cpu</li>

 <li>
 <i>RAM</i>: 1 GB Memory</li>

 <li>
 <i>Drive configuration</i>: Fujitsu MAM3367MP SCSI </li>

 </p>

 <p>

 <b>Software environment</b>
 <br>

 <li>
 <i>Java Version</i>: 1.4.2_02</li>

 <li>
 <i>Java VM</i>: JDK</li>

 <li>
 <i>OS Version</i>: Windows XP </li>

 <li>
 <i>Location of index</i>: local</li>

 </p>

 <p>

 <b>Lucene indexing variables</b>
 <br>

 <li>
 <i>Number of source documents</i>: 600,000</li>

 <li>
 <i>Total filesize of source documents</i>: from database</li>

 <li>
 <i>Average filesize of source documents</i>: from database</li>

 <li>
 <i>Source documents storage location</i>: from database</li>

 <li>
 <i>File type of source documents</i>: XML</li>

 <li>
 <i>Parser(s) used, if any</i>: </li>

 <li>
 <i>Analyzer(s) used</i>: small variation on WhitespaceAnalyzer</li>

 <li>
 <i>Number of fields per document</i>: 24</li>

 <li>
 <i>Type of fields</i>: A1 keyword, 1 big unindexed, rest are unstored and a mix of tokenized/untokenized</li>

 <li>
 <i>Index persistence</i>: FSDirectory</li>

 <li>
 <i>Index size</i>: 12.5 GB</li>

 </p>

 <p>

 <b>Figures</b>
 <br>

 <li>
 <i>Time taken (in ms/s as an average of at least 3
                                 indexing runs)</i>: 600,000 documents in 81 minutes   (du -k = 511338)</li>

 <li>
 <i>Time taken / 1000 docs indexed</i>: 123 documents/second</li>

 <li>
 <i>Memory consumption</i>: -ms256m -mx512m -Xss4m -XX:MaxPermSize=512M</li>

 </p>

 <p>

 <b>Notes</b>
 <br>

 <p>

 <li>merge width of 60</li>

 <li>did a compact every 20,000 documents</li>

 </p>

 </p>

 </ul>
 </div>


 </div>
 <!--+
     |end content
     +-->
 <div class="clearboth">&nbsp;</div>
 </div>
 <div id="footer">
 <!--+
     |start bottomstrip
     +-->
 <div class="lastmodified">
 <script type="text/javascript"><!--
 document.write("Last Published: " + document.lastModified);
 //  --></script>
 </div>
 <div class="copyright">
         Copyright &copy;
          2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
 </div>
 <!--+
     |end bottomstrip
     +-->
 </div>
 </body>
 </html>