<!doctype html>
<html>
<head>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<link href="/theme/css/lucene/global.css?v=0e493d7a" rel="stylesheet" type="text/css">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Distribution" content="Global"/>
<meta name="Robots" content="index,follow"/>
<script type="text/javascript" src="/theme/javascript/lucene/prototype.js?v=0e493d7a"></script>
<script type="text/javascript" src="/theme/javascript/lucene/effects.js?v=0e493d7a"></script>
<script type="text/javascript" src="/theme/javascript/lucene/slides.js?v=0e493d7a"></script>
<script src="https://www.apachecon.com/event-images/snippet.js"></script> <title>Apache Lucene - Features</title>
<meta name="keywords"
content="apache, apache lucene, apache solr, solr, lucene
search, information retrieval, spell checking, faceting, inverted index,
open source"/> <meta property="og:type" content="website" />
<meta property="og:url" content="https://lucene.apache.org/pylucene/features.html"/>
<meta property="og:title" content="Features"/>
<meta property="og:description" content="Warning Before calling any PyLucene API that requires the Java VM, start it by calling initVM(classpath, ...). More about this..."/>
<meta property="og:image" content="https://lucene.apache.org/theme/images/lucene/lucene_og_image.png?v=0e493d7a"/>
<meta property="og:image:secure_url" content="https://lucene.apache.org/theme/images/lucene/lucene_og_image.png?v=0e493d7a"/>
<link rel="shortcut icon" type="image/png"
href="/theme/images/lucene/lucene-favicon.png?v=0e493d7a"/><link href="/theme/css/lucene/pylucene.css?v=0e493d7a" rel="stylesheet" type="text/css">
</head>
<body id="home">
<div id="wrap">
<div id="header">
<div id="logo" style="float:left">
<a href="/">
<img border="0" src="/theme/images/lucene/lucene_logo_green_300.png?v=0e493d7a" alt="Lucene Logo"/>
</a>
</div>
<!-- TODO: Search disabled as it does not work, 2021-02-21
<div id="search" style="float:right;zoom:1">
<form id="quick-search" method="GET" action="https://sematext.com/opensee/lucene" name="searchform">
<fieldset>
<input type="search" id="q" name="q" placeholder="Search with Apache Solr..." class="class1 class2 hint" accesskey="q">
</fieldset>
</form>
</div>-->
<div id="nav">
<ul>
<li><a href="/pylucene/index.html">PyLucene</a></li>
<li><a href="/pylucene/news.html">News</a></li>
<li><a href="/pylucene/jcc/index.html">JCC</a></li>
<li><a href="https://issues.apache.org/jira/browse/PYLUCENE">Issue Tracker</a></li>
<li><a href="/pylucene/mailing-lists.html">Mailing Lists</a></li>
<li><a class="last" href="/">Lucene TLP</a></li>
</ul>
</div>
</div> <!-- End #header -->
<div id="content-wrap" class="clearfix">
<div id="main">
<div>
<h1 class="title">Features</h1>
<h2 id="warning">Warning</h2>
<p>Before calling any PyLucene API that requires the Java VM, start it by
calling <em>initVM(classpath, ...)</em>. More about this function <a href="jcc/features.html">here</a>.</p>
<h2 id="installing-pylucene">Installing PyLucene</h2>
<p>PyLucene is a Python extension built with <a href="jcc/">JCC</a>.</p>
<p>To build PyLucene, JCC needs to be built first. Sources for JCC are
included with the PyLucene sources. Instructions for building and
installing JCC are <a href="jcc/install.html">here</a>.</p>
<p>Instructions for building PyLucene are <a href="install.html">here</a>.</p>
<h2 id="api-documentation">API documentation</h2>
<p>PyLucene closely tracks Java
Lucene™ releases.
It intends to support the entire Lucene API.</p>
<p>PyLucene also includes a number of Lucene contrib packages: the Snowball analyzer
and stemmers, the highlighter package, analyzers for languages other than English,
regular expression queries, specialized queries such as 'more like this', and more.</p>
<p>This document only covers the pythonic extensions to Lucene offered
by PyLucene as well as some differences between the Java and Python
APIs. For the documentation on Java Lucene APIs,
see <a href="https://lucene.apache.org/java/docs/api/index.html">here</a>.</p>
<p>To help with debugging and to support some Lucene APIs, PyLucene also
exposes some Java runtime APIs.</p>
<h2 id="samples">Samples</h2>
<p>The best way to learn PyLucene is to look at the samples and tests included with
the PyLucene source release or on the web at:</p>
<ul>
<li><a href="https://svn.apache.org/viewvc/lucene/pylucene/trunk/samples">https://svn.apache.org/viewvc/lucene/pylucene/trunk/samples</a></li>
<li><a href="https://svn.apache.org/viewvc/lucene/pylucene/trunk/test3">https://svn.apache.org/viewvc/lucene/pylucene/trunk/test3</a></li>
</ul>
<h2 id="threading-support-with-attachcurrentthread">Threading support with attachCurrentThread</h2>
<p>Before PyLucene APIs can be used from a thread other than the main thread that was
not created by the Java Runtime, the <em>attachCurrentThread()</em> method must be
called on the <em>JCCEnv</em> object returned by the <em>initVM()</em> or <em>getVMEnv()</em> functions.</p>
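<p>The pattern above can be sketched as follows. This is a minimal illustration, not a PyLucene API: the helper name <em>start_attached_thread</em> is hypothetical, and the environment object is passed in explicitly so the sketch does not depend on a running Java VM. In real code, <em>env</em> would be the object returned by <em>initVM()</em> or <em>getVMEnv()</em>.</p>

```python
import threading


def start_attached_thread(env, target, *args):
    """Start a thread that attaches itself to the Java VM before running target.

    env is expected to provide attachCurrentThread(), as the JCCEnv object
    returned by initVM() or getVMEnv() does. This helper is a hypothetical
    convenience wrapper, not part of PyLucene.
    """
    def run():
        # Attach first: no PyLucene call may precede this in a new thread.
        env.attachCurrentThread()
        target(*args)

    thread = threading.Thread(target=run)
    thread.start()
    return thread
```

<p>Attaching inside the thread body, rather than in the code that creates the thread, matters because the call applies to whichever thread executes it.</p>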
<h2 id="exception-handling-with-lucenejavaerror">Exception handling with lucene.JavaError</h2>
<p>Java exceptions are caught at the language barrier and reported to Python by raising
a JavaError instance whose args tuple contains the actual Java Exception instance.</p>
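<p>A minimal sketch of unwrapping the Java exception on the Python side. The helper name <em>call_catching_java_error</em> is hypothetical, and the error class is passed in as a parameter so the sketch runs without a Java VM; with PyLucene it would be <em>lucene.JavaError</em>. It assumes the Java exception is the first element of the args tuple.</p>

```python
def call_catching_java_error(func, java_error_type):
    """Call func() and return a (result, java_exception) pair.

    java_error_type is the exception class to catch; with PyLucene this
    would be lucene.JavaError. Assumption: the wrapped Java exception is
    the first element of the error's args tuple.
    """
    try:
        return func(), None
    except java_error_type as error:
        # args carries the exception instance raised on the Java side.
        java_exception = error.args[0] if error.args else None
        return None, java_exception
```

<p>Returning the Java exception instead of re-raising lets the caller inspect it like any other wrapped Java object.</p>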
<h2 id="handling-java-arrays">Handling Java arrays</h2>
<p>Java arrays are returned to Python in a <em>JArray</em> wrapper instance that
implements the Python sequence protocol. It is possible to change array elements
but not to change the array size.</p>
<p>A few Lucene APIs take array arguments and expect values to be returned in them.
To call such an API and be able to retrieve the array values after the call, a
Java array needs to be instantiated first.<br/> For example, accessing termDocs:</p>
<div class="highlight"><pre><span></span><code>termDocs = reader.termDocs(Term(&quot;isbn&quot;, isbn))
docs = JArray(&#39;int&#39;)(1) # allocate an int[1] array
freq = JArray(&#39;int&#39;)(1) # allocate an int[1] array
if termDocs.read(docs, freq) == 1:
    bits.set(docs[0]) # access the array&#39;s first element
</code></pre></div>
<p>In addition to <em>int</em>, the <em>JArray</em> function accepts <em>object</em>, <em>string</em>,
<em>bool</em>, <em>byte</em>, <em>char</em>, <em>double</em>, <em>float</em>, <em>long</em> and <em>short</em> to create an array
of the corresponding type. The <em>JArray('object')</em> constructor takes a second
argument denoting the class of the object elements. This argument is optional and
defaults to Object.</p>
<p>To convert a char array to a Python string use a <em>''.join(array)</em> construct.</p>
<p>Instead of an integer denoting the size of the desired Java array, a sequence of
objects of the expected element type may be passed in to the array constructor.<br/>
For example:</p>
<div class="highlight"><pre><span></span><code># creating a Java array of double from the [1.5, 2.5] list
JArray(&#39;double&#39;)([1.5, 2.5])
</code></pre></div>
<p>All methods that expect an array also accept a sequence of Python objects of the
expected element type. If no values are expected from the array arguments after
the call, it is hence not necessary to instantiate a Java array to make such calls.</p>
<p>See <a href="jcc/features.html">JCC</a> for more information about handling arrays.</p>
<h2 id="differences-between-the-java-lucene-and-pylucene-apis">Differences between the Java Lucene and PyLucene APIs</h2>
<ul>
<li>
<p>The PyLucene API exposes all Java Lucene classes in a flat namespace in the
PyLucene module. For example, the Java import statement
<code>import org.apache.lucene.index.IndexReader;</code> corresponds to the Python import
statement <code>from lucene import IndexReader</code>.</p>
</li>
<li>
<p>Downcasting is a common operation in Java but not a concept in Python. Because
wrapper objects implement exactly the APIs of the declared type of the
wrapped object, all classes implement two class methods, instance_ and
cast_, that verify and cast an instance respectively.</p>
</li>
</ul>
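<p>The two class methods described above can be combined into a guarded downcast. This is a sketch only: the function name <em>downcast</em> is hypothetical, and it relies on nothing beyond <em>instance_</em> and <em>cast_</em> as described in the list above.</p>

```python
def downcast(wrapper_class, obj):
    """Return obj cast to wrapper_class, or None if it is not an instance.

    Relies only on the instance_ and cast_ class methods that wrapper
    classes provide; the name downcast is hypothetical.
    """
    # instance_ verifies the runtime type, cast_ returns the narrower wrapper.
    if wrapper_class.instance_(obj):
        return wrapper_class.cast_(obj)
    return None
```

<p>Checking with <em>instance_</em> first is the Python counterpart of a Java <em>instanceof</em> test before a cast.</p>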
<h2 id="phythonic-extensions-to-the-java-lucene-apis">Pythonic extensions to the Java Lucene APIs</h2>
<p>Java is a very verbose language. Python, on the other hand, offers many
syntactically attractive constructs for iteration, property access, etc. As
the Java Lucene samples from the <em>Lucene in Action</em> book were ported to Python,
PyLucene received a number of pythonic extensions listed here:</p>
<ul>
<li>Iterating search hits is a very common operation. Hits instances are iterable
in Python. Two values are returned for each iteration, the zero-based number of
the document in the Hits instance and the document instance itself.<br/>
The Java loop:</li>
</ul>
<div class="highlight"><pre><span></span><code>for (int i = 0; i &lt; hits.length(); i++) {
  Document doc = hits.doc(i);
  System.out.println(hits.score(i) + &quot; : &quot; + doc.get(&quot;title&quot;));
}
</code></pre></div>
<p>can be written in Python:</p>
<div class="highlight"><pre><span></span><code>for hit in hits:
    hit = Hit.cast_(hit)
    print(hit.getScore(), &#39;:&#39;, hit.getDocument()[&#39;title&#39;])
</code></pre></div>
<p>If hits.iterator()'s next() method were declared to return <em>Hit</em> instead of
<em>Object</em>, the above cast_() call would be unnecessary.<br/> The same Java
loop can also be written:</p>
<div class="highlight"><pre><span></span><code>for i in range(len(hits)):
    print(hits.score(i), &#39;:&#39;, hits[i][&#39;title&#39;])
</code></pre></div>
<ul>
<li>Hits instances partially implement the Python 'sequence' protocol.<br/>
The Java expressions:</li>
</ul>
<div class="highlight"><pre><span></span><code>hits.length();
doc = hits.doc(i);
</code></pre></div>
<p>are better written in Python:</p>
<div class="highlight"><pre><span></span><code>len(hits)
doc = hits[i]
</code></pre></div>
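<p>To illustrate what partially implementing the sequence protocol means, here is a pure-Python sketch. It is not PyLucene's implementation, which is generated native code; both classes are hypothetical stand-ins, and only the method names <em>length()</em> and <em>doc(i)</em> are taken from the Java examples on this page.</p>

```python
class JavaStyleHits:
    """Hypothetical stand-in exposing only Java-style accessors."""

    def __init__(self, docs):
        self._docs = list(docs)

    def length(self):
        return len(self._docs)

    def doc(self, i):
        return self._docs[i]


class SequenceHits:
    """Adds the parts of the sequence protocol that len() and [] need."""

    def __init__(self, hits):
        self._hits = hits

    def __len__(self):
        return self._hits.length()

    def __getitem__(self, i):
        # Raising IndexError past the end is what lets a for loop stop.
        if i not in range(len(self)):
            raise IndexError(i)
        return self._hits.doc(i)
```

<p>With only <em>__len__</em> and <em>__getitem__</em> defined, <em>len(hits)</em>, <em>hits[i]</em> and plain iteration all work, while slicing and item assignment do not, which is why the support is described as partial.</p>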
<ul>
<li>Document instances have fields whose values can be accessed through the mapping
protocol.<br/> The Java expression:</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="err">doc.get(&quot;title&quot;)</span>
</code></pre></div>
<p>is better written in Python:</p>
<div class="highlight"><pre><span></span><code><span class="err">doc[&#39;title&#39;]</span>
</code></pre></div>
<ul>
<li>Document instances can be iterated over for their fields.<br/> The Java loop:</li>
</ul>
<div class="highlight"><pre><span></span><code>Enumeration fields = doc.getFields();
while (fields.hasMoreElements()) {
  Field field = (Field) fields.nextElement();
  ...
}
</code></pre></div>
<p>is better written in Python:</p>
<div class="highlight"><pre><span></span><code>for field in doc.getFields():
    field = Field.cast_(field)
    ...
</code></pre></div>
<p>Once JCC heeds Java 1.5 type parameters and once Java Lucene makes use of them,
such casting should become unnecessary.</p>
<h2 id="extending-java-lucene-classes-from-python">Extending Java Lucene classes from Python</h2>
<p>Many areas of the Lucene API expect the programmer to provide their own implementation
or specialization of a feature where the default is inappropriate. For example,
text analyzers and tokenizers are an area where many parameters and environmental
or cultural factors call for customization.</p>
<p>PyLucene enables this by providing the Java extension points listed below, which serve
as proxies for Java to call back into the Python implementations of these customizations.</p>
<p>These extension points are simple Java classes for which JCC generates the native C++
implementations. It is easy to add more such extension classes to the
'java' directory of the PyLucene source tree.</p>
<p>To learn more about this topic, please refer to the JCC <a href="jcc/features.html">documentation</a>.</p>
<p>Please refer to the classes in the 'java' tree for currently available extension
points. Examples of uses of these extension points are to be found in PyLucene's
unit tests.</p>
</div>
</div>
<div id="sidebar">
<div class="button-wrapper" style="margin-top: 40px;">
<div class="button-green">
<a href="https://www.apache.org/dyn/closer.lua/lucene/pylucene/">Download</a>
<div class="flap top">Click to begin</div>
<div class="flap bottom">of Apache PyLucene</div>
</div>
</div>
<h1 id="documentation">Documentation<a class="headerlink" href="#documentation" title="Permanent link"></a></h1>
<ul>
<li><a href="https://www.apache.org/licenses/">License</a></li>
<li><a href="/pylucene/features.html">Features</a></li>
<li><a href="/pylucene/install.html">Install</a></li>
</ul>
<h1 id="events">Events<a class="headerlink" href="#events" title="Permanent link"></a></h1>
<ul>
<a class="acevent" data-format="square" data-mode="light" data-width="160" data-style="border: 1px solid lightgrey"></a>
</ul>
<h1 id="asf-links">ASF links<a class="headerlink" href="#asf-links" title="Permanent link"></a></h1>
<ul>
<li><a href="https://www.apache.org">Apache Software Foundation</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html">Thanks</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
<li><a href="https://www.apache.org/security/">Security</a></li>
</ul>
<h1 id="related-projects">Related Projects<a class="headerlink" href="#related-projects" title="Permanent link"></a></h1>
<ul>
<li><a href="https://solr.apache.org">Apache Solr</a></li>
<li><a href="http://hadoop.apache.org">Apache Hadoop</a></li>
<li><a href="http://manifoldcf.apache.org/">Apache ManifoldCF</a></li>
<li><a href="http://lucenenet.apache.org/">Apache Lucene.Net</a></li>
<li><a href="http://mahout.apache.org">Apache Mahout</a></li>
<li><a href="http://nutch.apache.org">Apache Nutch</a></li>
<li><a href="http://opennlp.apache.org/">Apache OpenNLP</a></li>
<li><a href="http://tika.apache.org">Apache Tika</a></li>
<li><a href="http://zookeeper.apache.org">Apache Zookeeper</a></li>
</ul> </div>
</div> <!-- End #content-wrap -->
<div id="footer">
<div class="copyright">
<p>
Copyright &copy; 2011-2024 The Apache Software Foundation, Licensed under
the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. <a href="/privacy.html">Privacy Policy</a> <br/>
Apache and the Apache feather logo are trademarks of The Apache Software Foundation. Apache Lucene, Apache Solr and their
respective logos are trademarks of the Apache Software Foundation. Please see the <a href="https://www.apache.org/foundation/marks/">Apache Trademark Policy</a>
for more information.
</p>
</div>
</div> </div> <!-- End #wrap -->
</body>
</html>