| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <title>Apache Lucene - Building and Installing the Basic Demo</title> |
| </head> |
| <body> |
| <p>The demo module offers simple example code to show the features of Lucene.</p> |
| <h1>Apache Lucene - Building and Installing the Basic Demo</h1> |
| <div id="minitoc-area"> |
| <ul class="minitoc"> |
| <li><a href="#About_this_Document">About this Document</a></li> |
| <li><a href="#About_the_Demo">About the Demo</a></li> |
| <li><a href="#Setting_your_CLASSPATH">Setting your CLASSPATH</a></li> |
| <li><a href="#Indexing_Files">Indexing Files</a></li> |
| <li><a href="#About_the_code">About the code</a></li> |
| <li><a href="#Location_of_the_source">Location of the source</a></li> |
| <li><a href="#IndexFiles">IndexFiles</a></li> |
| <li><a href="#Searching_Files">Searching Files</a></li> |
| </ul> |
| </div> |
| <a name="About_this_Document"></a> |
| <h2 class="boxed">About this Document</h2> |
| <div class="section"> |
| <p>This document is intended as a "getting started" guide to using and running |
| the Lucene demos. It walks you through some basic installation and |
| configuration.</p> |
| </div> |
| <a name="About_the_Demo"></a> |
| <h2 class="boxed">About the Demo</h2> |
| <div class="section"> |
| <p>The Lucene command-line demo code consists of an application that |
| demonstrates various functionalities of Lucene and how you can add Lucene to |
| your applications.</p> |
| </div> |
| <a name="Setting_your_CLASSPATH"></a> |
| <h2 class="boxed">Setting your CLASSPATH</h2> |
| <div class="section"> |
| <p>First, you should <a href= |
| "http://www.apache.org/dyn/closer.cgi/lucene/java/">download</a> the latest |
| Lucene distribution and then extract it to a working directory.</p> |
| <p>You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene |
| demo JAR. You should see the Lucene JAR file in the core/ directory you created |
| when you extracted the archive -- it should be named something like |
| <span class="codefrag">lucene-core-{version}.jar</span>. You should also see |
| files called <span class="codefrag">lucene-queryparser-{version}.jar</span>, |
| <span class= |
| "codefrag">lucene-analyzers-common-{version}.jar</span> and <span class= |
| "codefrag">lucene-demo-{version}.jar</span> under queryparser/, analysis/common/ and demo/, |
| respectively.</p> |
| <p>Put all four of these files in your Java CLASSPATH.</p> |
| </div> |
| <a name="Indexing_Files"></a> |
| <h2 class="boxed">Indexing Files</h2> |
| <div class="section"> |
| <p>Once you've gotten this far you're probably itching to go. Let's <b>build an |
| index!</b> Assuming you've set your CLASSPATH correctly, just type:</p> |
| <pre> |
| java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene} |
| </pre> |
| <p>This will produce a subdirectory called <span class="codefrag">index</span> |
| which will contain an index of all of the Lucene source code.</p> |
| <p>To <b>search the index</b> type:</p> |
| <pre> |
| java org.apache.lucene.demo.SearchFiles |
| </pre> |
| <p>You'll be prompted for a query. Type in a gibberish or made-up word (for example: |
| "superca<!-- need to break up word in a way that is not visible so it doesn't cause this file to match a search on this word -->lifragilisticexpialidocious"). |
| You'll see that there are no matching results in the Lucene source code. |
| Now try entering the word "string". That should return a whole bunch |
| of documents. The results will page at every tenth result and ask you whether |
| you want more results.</p></div> |
| <a name="About_the_code"></a> |
| <h2 class="boxed">About the code</h2> |
| <div class="section"> |
| <p>In this section we walk through the sources behind the command-line Lucene |
| demo: where to find them, their parts and their function. This section is |
| intended for Java developers wishing to understand how to use Lucene in their |
| applications.</p> |
| </div> |
| <a name="Location_of_the_source"></a> |
| <h2 class="boxed">Location of the source</h2> |
| <div class="section"> |
| <p>The files discussed here are linked into this documentation directly:</p> |
| <ul> |
| <li><a href="src-html/org/apache/lucene/demo/IndexFiles.html">IndexFiles.java</a>: code to create a Lucene index.</li> |
| <li><a href="src-html/org/apache/lucene/demo/SearchFiles.html">SearchFiles.java</a>: code to search a Lucene index.</li> |
| </ul> |
| </div> |
| <a name="IndexFiles" id="IndexFiles"></a> |
| <h2 class="boxed">IndexFiles</h2> |
| <div class="section"> |
| <p>As noted above, the <a href= |
| "src-html/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class creates |
| a Lucene index. Let's take a look at how it does this.</p> |
| <p>The <span class="codefrag">main()</span> method parses the command-line |
| parameters, then in preparation for instantiating |
| {@link org.apache.lucene.index.IndexWriter IndexWriter}, opens a |
| {@link org.apache.lucene.store.Directory Directory}, and |
| instantiates {@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer} |
| and {@link org.apache.lucene.index.IndexWriterConfig IndexWriterConfig}.</p> |
| <p>The value of the <span class="codefrag">-index</span> command-line parameter |
| is the name of the filesystem directory where all index information should be |
| stored. If <span class="codefrag">IndexFiles</span> is invoked with a relative |
| path given in the <span class="codefrag">-index</span> command-line parameter, |
| or if the <span class="codefrag">-index</span> command-line parameter is not |
| given, causing the default relative index path "<span class= |
| "codefrag">index</span>" to be used, the index path will be created as a |
| subdirectory of the current working directory (if it does not already exist). |
| On some platforms, the index path may be created in a different directory (such |
| as the user's home directory).</p> |
| <p>The <span class="codefrag">-docs</span> command-line parameter value is the |
| location of the directory containing files to be indexed.</p> |
| <p>The <span class="codefrag">-update</span> command-line parameter tells |
| <span class="codefrag">IndexFiles</span> not to delete the index if it already |
| exists. When <span class="codefrag">-update</span> is not given, <span class= |
| "codefrag">IndexFiles</span> will first wipe the slate clean before indexing |
| any documents.</p> |
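| <p>The handling of these three parameters can be sketched in plain Java as follows. |
| This is a minimal illustration, not the demo's actual source: the class name |
| <span class="codefrag">IndexFilesArgs</span> and its helper methods are hypothetical, and only the |
| parameter names and the default "<span class="codefrag">index</span>" path are taken from the description above.</p> |

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical holder for the parsed command-line parameters (not part of Lucene). */
class IndexFilesArgs {
    final String indexPath; // where the index is stored
    final String docsPath;  // directory containing the files to index
    final boolean create;   // true: start from an empty index; false: keep the existing one

    IndexFilesArgs(String indexPath, String docsPath, boolean create) {
        this.indexPath = indexPath;
        this.docsPath = docsPath;
        this.create = create;
    }

    /** Parses -index, -docs and -update, applying the defaults described in the text. */
    static IndexFilesArgs parse(String[] args) {
        String indexPath = "index"; // default relative index path
        String docsPath = null;
        boolean create = true;      // without -update, the index is rebuilt from scratch
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-index":
                    indexPath = requireValue(args, ++i, "-index");
                    break;
                case "-docs":
                    docsPath = requireValue(args, ++i, "-docs");
                    break;
                case "-update":
                    create = false;
                    break;
                default:
                    throw new IllegalArgumentException("unknown parameter " + args[i]);
            }
        }
        return new IndexFilesArgs(indexPath, docsPath, create);
    }

    static String requireValue(String[] args, int i, String name) {
        if (i >= args.length) {
            throw new IllegalArgumentException(name + " requires a value");
        }
        return args[i];
    }

    /** Resolves a relative index path against the current working directory. */
    Path resolvedIndexPath() {
        return Paths.get(indexPath).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        IndexFilesArgs parsed = parse(args);
        System.out.println("index=" + parsed.resolvedIndexPath()
            + " docs=" + parsed.docsPath + " create=" + parsed.create);
    }
}
```

| <p>Running the sketch with only <span class="codefrag">-docs</span> leaves the index path at its default |
| and <span class="codefrag">create</span> set to true; adding <span class="codefrag">-update</span> flips |
| <span class="codefrag">create</span> to false.</p> |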
| <p>Lucene {@link org.apache.lucene.store.Directory Directory}s are used by |
| the <span class="codefrag">IndexWriter</span> to store information in the |
| index. In addition to the {@link org.apache.lucene.store.FSDirectory FSDirectory} |
| implementation we are using, there are several other <span class= |
| "codefrag">Directory</span> subclasses that can write to RAM, to databases, |
| etc.</p> |
| <p>Lucene {@link org.apache.lucene.analysis.Analyzer Analyzer}s are |
| processing pipelines that break up text into indexed tokens, a.k.a. terms, and |
| optionally perform other operations on these tokens, e.g. downcasing, synonym |
| insertion, filtering out unwanted tokens, etc. The <span class= |
| "codefrag">Analyzer</span> we are using is <span class= |
| "codefrag">StandardAnalyzer</span>, which creates tokens using the Word Break |
| rules from the Unicode Text Segmentation algorithm specified in <a href= |
| "http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>; converts |
| tokens to lowercase; and then filters out stopwords. Stopwords are common |
| language words such as articles (a, an, the, etc.) and other tokens that may |
| have less value for searching. Note that these rules differ from language |
| to language, so you should use the appropriate analyzer for each. |
| Lucene currently provides Analyzers for a number of different languages (see |
| the javadocs under <a href= |
| "../analyzers-common/overview-summary.html">lucene/analysis/common/src/java/org/apache/lucene/analysis</a>).</p> |
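| <p>To make the idea of an analysis pipeline concrete, here is a deliberately simplified sketch in plain Java. |
| It is not <span class="codefrag">StandardAnalyzer</span> and does not use the Lucene API: the class |
| <span class="codefrag">ToyAnalyzer</span> is hypothetical, its tokenizer splits on any run of characters that |
| are not letters or digits rather than following the Unicode word-break rules, and its stopword list is |
| an arbitrary small sample chosen for illustration.</p> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical toy pipeline: tokenize, lowercase, drop stopwords. Not a Lucene class. */
class ToyAnalyzer {
    // Assumption: a tiny sample stopword list, only for illustration.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "and", "of"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Step 1: crude tokenization on runs of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            // Step 2: convert the token to lowercase.
            String lower = token.toLowerCase(Locale.ROOT);
            // Step 3: filter out stopwords.
            if (!STOPWORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick brown Fox")); // prints [quick, brown, fox]
    }
}
```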
| <p>The <span class="codefrag">IndexWriterConfig</span> instance holds all |
| configuration for <span class="codefrag">IndexWriter</span>. For example, we |
| set the <span class="codefrag">OpenMode</span> to use here based on the value |
| of the <span class="codefrag">-update</span> command-line parameter.</p> |
| <p>Looking further down in the file, after <span class= |
| "codefrag">IndexWriter</span> is instantiated, you should see the <span class= |
| "codefrag">indexDocs()</span> code. This recursive function crawls the |
| directories and creates {@link org.apache.lucene.document.Document Document} objects. The |
| <span class="codefrag">Document</span> is simply a data object that represents the |
| text content of the file as well as its last-modified time and location. These |
| instances are added to the <span class="codefrag">IndexWriter</span>. If the |
| <span class="codefrag">-update</span> command-line parameter is given, the |
| <span class="codefrag">IndexWriterConfig</span> <span class= |
| "codefrag">OpenMode</span> will be set to {@link org.apache.lucene.index.IndexWriterConfig.OpenMode#CREATE_OR_APPEND |
| OpenMode.CREATE_OR_APPEND}, and rather than adding documents |
| to the index, the <span class="codefrag">IndexWriter</span> will |
| <strong>update</strong> them in the index by attempting to find an |
| already-indexed document with the same identifier (in our case, the file path |
| serves as the identifier); deleting it from the index if it exists; and then |
| adding the new document to the index.</p> |
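| <p>The crawl-and-update behaviour described above can be sketched without Lucene as follows. |
| This is an illustration under stated assumptions, not the demo's code: <span class="codefrag">ToyIndex</span> |
| and <span class="codefrag">Doc</span> are hypothetical names, and an in-memory map keyed by file path stands in |
| for the real index. The point is only to show the recursive walk and the "delete the old version, then add |
| the new one" step.</p> |

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for an index; not a Lucene class. */
class ToyIndex {
    /** Simple data object: file location, last-modified time and text content. */
    static final class Doc {
        final String path;
        final long modified;
        final String contents;

        Doc(String path, long modified, String contents) {
            this.path = path;
            this.modified = modified;
            this.contents = contents;
        }
    }

    private final Map<String, Doc> docsByPath = new LinkedHashMap<>();

    /** Deletes any document with the same path, then adds the new one. */
    boolean update(Doc doc) {
        boolean replaced = docsByPath.remove(doc.path) != null; // delete the old version, if any
        docsByPath.put(doc.path, doc);                          // then add the new version
        return replaced;
    }

    /** Recursively crawls a directory tree and indexes every regular file. */
    void indexDocs(Path path) throws IOException {
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    indexDocs(child);
                }
            }
        } else if (Files.isRegularFile(path)) {
            String contents = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            long modified = Files.getLastModifiedTime(path).toMillis();
            update(new Doc(path.toString(), modified, contents));
        }
    }

    int size() {
        return docsByPath.size();
    }

    Doc get(String path) {
        return docsByPath.get(path);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java ToyIndex.java <directory>");
            return;
        }
        ToyIndex index = new ToyIndex();
        index.indexDocs(Paths.get(args[0]));
        System.out.println("indexed " + index.size() + " files");
    }
}
```

| <p>Because the path is the identifier, indexing the same directory twice leaves one entry per file, |
| with each entry holding the most recently read content.</p> |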
| </div> |
| <a name="Searching_Files"></a> |
| <h2 class="boxed">Searching Files</h2> |
| <div class="section"> |
| <p>The <a href= |
| "src-html/org/apache/lucene/demo/SearchFiles.html">SearchFiles</a> class is |
| quite simple. It primarily collaborates with an |
| {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, |
| a {@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer} |
| (which is used in the <a href= |
| "src-html/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class as well), |
| and a {@link org.apache.lucene.queryparser.classic.QueryParser QueryParser}. The |
| query parser is constructed with an analyzer used to interpret your query text |
| in the same way the documents are interpreted: finding word boundaries, |
| downcasing, and removing useless words like 'a', 'an' and 'the'. The |
| {@link org.apache.lucene.search.Query} object holds the parsed query produced by the |
| {@link org.apache.lucene.queryparser.classic.QueryParser QueryParser}, and it |
| is passed to the searcher. Note that it's also possible to programmatically |
| construct a rich {@link org.apache.lucene.search.Query} object without using |
| the query parser. The query parser just enables decoding the <a href= |
| "../queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description"> |
| Lucene query syntax</a> into the corresponding |
| {@link org.apache.lucene.search.Query Query} object.</p> |
| <p><span class="codefrag">SearchFiles</span> uses the |
| {@link org.apache.lucene.search.IndexSearcher#search(org.apache.lucene.search.Query,int) |
| IndexSearcher.search(query,n)} method, which returns |
| {@link org.apache.lucene.search.TopDocs TopDocs} with at most |
| <span class="codefrag">n</span> hits. The results are printed in pages, sorted |
| by score (i.e. relevance).</p> |
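| <p>The "top <span class="codefrag">n</span> hits, shown a page at a time" behaviour can be illustrated with a small |
| plain-Java sketch. It does not call <span class="codefrag">IndexSearcher</span>: the class |
| <span class="codefrag">ToyPager</span>, its <span class="codefrag">Hit</span> type and the method names are |
| hypothetical, and the scores are supplied by the caller rather than computed by a search.</p> |

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper showing top-n selection and paging; not a Lucene class. */
class ToyPager {
    static final class Hit {
        final String path;
        final float score;

        Hit(String path, float score) {
            this.path = path;
            this.score = score;
        }
    }

    /** Returns at most n hits, sorted by descending score. */
    static List<Hit> topN(List<Hit> hits, int n) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** Returns the given zero-based page of the already sorted hits (possibly empty). */
    static List<Hit> page(List<Hit> top, int pageIndex, int hitsPerPage) {
        int start = Math.min(pageIndex * hitsPerPage, top.size());
        int end = Math.min(start + hitsPerPage, top.size());
        return top.subList(start, end);
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            hits.add(new Hit("doc" + i + ".txt", i));
        }
        List<Hit> top = topN(hits, 20);
        for (Hit hit : page(top, 0, 10)) {
            System.out.println(hit.path + " score=" + hit.score);
        }
    }
}
```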
| </div> |
| </body> |
| </html> |
| |