| # Licensed to the Apache Software Foundation (ASF) under one or more |
| # contributor license agreements. See the NOTICE file distributed with |
| # this work for additional information regarding copyright ownership. |
| # The ASF licenses this file to You under the Apache License, Version 2.0 |
| # (the "License"); you may not use this file except in compliance with |
| # the License. You may obtain a copy of the License at |
| # |
| # http://www.apache.org/licenses/LICENSE-2.0 |
| # |
| # Unless required by applicable law or agreed to in writing, software |
| # distributed under the License is distributed on an "AS IS" BASIS, |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| # See the License for the specific language governing permissions and |
| # limitations under the License. |
| |
| =head1 NAME |
| |
| Lucy::Docs::IRTheory - Crash course in information retrieval. |
| |
| =head1 ABSTRACT |
| |
| Just enough Information Retrieval theory to find your way around Apache Lucy. |
| |
| =head1 Terminology |
| |
| Lucy uses some terminology from the field of information retrieval which |
| may be unfamiliar to many users. "Document" and "term" mean pretty much what |
| you'd expect them to, but others such as "posting" and "inverted index" need a |
| formal introduction: |
| |
| =over |
| |
| =item * |
| |
| I<document> - An atomic unit of retrieval. |
| |
| =item * |
| |
| I<term> - An attribute which describes a document. |
| |
| =item * |
| |
| I<posting> - One term indexing one document. |
| |
| =item * |
| |
| I<term list> - The complete list of terms which describe a document. |
| |
| =item * |
| |
| I<posting list> - The complete list of documents which a term indexes. |
| |
| =item * |
| |
| I<inverted index> - A data structure which maps from terms to documents. |
| |
| =back |
| |
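| The definitions above can be sketched in a few lines of plain Perl. This is |
| only an illustration, not how Lucy stores its index: the C<%docs> corpus, the |
| whitespace tokenizer, and the C<%term_list> and C<%inverted_index> hashes are |
| all made up for this example. |
| |

```perl
use strict;
use warnings;

# Hypothetical toy corpus: document id => text.
my %docs = (
    1 => 'skate park rules',
    2 => 'city park map',
    3 => 'park bench',
);

my %term_list;         # doc id => sorted unique terms describing that document
my %inverted_index;    # term   => posting list (doc ids in ascending order)

for my $doc_id ( sort { $a <=> $b } keys %docs ) {
    my %seen;

    # Naive analysis (an assumption): lowercase, split on whitespace, dedupe.
    my @terms = grep { !$seen{$_}++ } split ' ', lc $docs{$doc_id};
    $term_list{$doc_id} = [ sort @terms ];

    # Each (term, doc id) pair recorded here is one posting.
    push @{ $inverted_index{$_} }, $doc_id for @terms;
}

print "park  => @{ $inverted_index{park} }\n";
print "skate => @{ $inverted_index{skate} }\n";
print "doc 1 => @{ $term_list{1} }\n";
```

| |
| Here the posting list for C<park> holds all three documents, while the |
| posting list for C<skate> holds only document 1. |
| |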
| Since Lucy is a practical implementation of IR theory, it loads these |
| abstract, distilled definitions down with useful traits. For instance, a |
| "posting" in its most rarefied form is simply a term-document pairing; in |
| Lucy, the class L<Lucy::Index::Posting::MatchPosting> fills this |
| role. However, by associating additional information with a posting, such as |
| the number of times the term occurs in the document, we can turn it into a |
| L<ScorePosting|Lucy::Index::Posting::ScorePosting>, making it possible |
| to rank documents by relevance rather than merely list the matching documents |
| in no particular order. |
| |
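| To see why the extra information matters, here is a minimal sketch in plain |
| Perl. It does not use Lucy's posting classes; the C<%docs> corpus and the |
| C<%postings> hash are hypothetical, and a raw occurrence count stands in for |
| a real relevance score. |
| |

```perl
use strict;
use warnings;

# Hypothetical toy corpus: document id => text.
my %docs = (
    1 => 'park map',
    2 => 'park park skate',
);

# term => { doc id => number of times the term occurs in that document }
my %postings;
for my $doc_id ( keys %docs ) {
    $postings{$_}{$doc_id}++ for split ' ', lc $docs{$doc_id};
}

# Match-only view: which documents contain "park"? Sorting by id here is
# arbitrary; nothing says one match is better than another.
my @matches = sort { $a <=> $b } keys %{ $postings{park} };

# Count-aware view: order the same documents by occurrence count,
# highest first, breaking ties by doc id.
my @ranked =
    sort { $postings{park}{$b} <=> $postings{park}{$a} || $a <=> $b }
    keys %{ $postings{park} };

print "matches: @matches\n";
print "ranked:  @ranked\n";
```

| |
| With match-only postings, documents 1 and 2 are indistinguishable for |
| C<park>. With counts attached, document 2 sorts first because it contains the |
| term twice. |
| |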
| =head1 TF/IDF ranking algorithm |
| |
| Lucy uses a variant of the well-established "Term Frequency / Inverse |
| Document Frequency" weighting scheme. A thorough treatment of TF/IDF is too |
| ambitious for our present purposes, but in a nutshell, it means that... |
| |
| =over |
| |
| =item * |
| |
| in a search for C<skate park>, documents which score well for the |
| comparatively rare term C<skate> will rank higher than documents which score |
| well for the more common term C<park>. |
| |
| =item * |
| |
| a 10-word text which has one occurrence each of C<skate> and C<park> will |
| rank higher than a 1000-word text which also contains one occurrence of each, |
| because the query terms make up a larger share of the shorter text. |
| |
| =back |
| |
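| The two effects above can be reproduced with a toy scorer. The weighting |
| below is a common textbook form of TF/IDF with length normalization, assumed |
| here for illustration; it is not necessarily the formula Lucy uses. The |
| corpus and the C<score> function are hypothetical. |
| |

```perl
use strict;
use warnings;

# Hypothetical toy corpus: document id => text.
my %docs = (
    1 => 'skate shop',
    2 => 'park bench',
    3 => 'park map',
    4 => 'park rules',
    5 => 'skate park',
    6 => join( ' ', 'skate', 'park', ('filler') x 98 ),    # 100 words
);

my ( %freq, %doc_len, %doc_freq );
for my $doc_id ( keys %docs ) {
    my @terms = split ' ', lc $docs{$doc_id};
    $doc_len{$doc_id} = scalar @terms;
    $freq{$doc_id}{$_}++ for @terms;                 # term frequency
    $doc_freq{$_}++ for keys %{ $freq{$doc_id} };    # document frequency
}
my $num_docs = keys %docs;

# Assumed textbook weighting, for illustration only:
#   tf   = sqrt(occurrences of the term in the document)
#   idf  = 1 + ln(num_docs / (doc_freq + 1))
#   norm = 1 / sqrt(document length in terms)
sub score {
    my ( $doc_id, @query_terms ) = @_;
    my $score = 0;
    for my $term (@query_terms) {
        my $count = $freq{$doc_id}{$term} or next;    # term absent: skip
        my $idf = 1 + log( $num_docs / ( $doc_freq{$term} + 1 ) );
        $score += sqrt($count) * $idf;
    }
    return $score / sqrt( $doc_len{$doc_id} );
}

my %score = map { $_ => score( $_, qw(skate park) ) } keys %docs;
for my $doc_id ( sort { $score{$b} <=> $score{$a} || $a <=> $b } keys %docs ) {
    printf "doc %d: %.3f\n", $doc_id, $score{$doc_id};
}
```

| |
| For the query C<skate park>, document 1 (which matches only C<skate>) |
| outscores document 2 (which matches only C<park>), and the 2-word document 5 |
| outscores the 100-word document 6, even though both contain each query term |
| once. |
| |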
| A web search for "tf idf" will turn up many excellent explanations of the |
| algorithm. |
| |
| =cut |
| |