 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with
 # this work for additional information regarding copyright ownership.
 # The ASF licenses this file to You under the Apache License, Version 2.0
 # (the "License"); you may not use this file except in compliance with
 # the License.  You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.

 =head1 NAME

 Lucy::Docs::IRTheory - Crash course in information retrieval.

 =head1 ABSTRACT

 Just enough Information Retrieval theory to find your way around Apache Lucy.

 =head1 Terminology

 Lucy uses some terminology from the field of information retrieval which
 may be unfamiliar to many users.  "Document" and "term" mean pretty much what
 you'd expect them to, but others such as "posting" and "inverted index" need a
 formal introduction:

 =over

 =item *

 I<document> - An atomic unit of retrieval.

 =item *

 I<term> - An attribute which describes a document.

 =item *

 I<posting> - One term indexing one document.

 =item *

 I<term list> - The complete list of terms which describe a document.

 =item *

 I<posting list> - The complete list of documents which a term indexes.

 =item *

 I<inverted index> - A data structure which maps from terms to documents.

 =back
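
 The definitions above can be made concrete with a few lines of plain Python.
 This is only an illustrative sketch, not Lucy code: the toy corpus, the
 whitespace tokenization, and the names C<docs>, C<term_lists>, and
 C<inverted_index> are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
docs = {
    1: "skate park rules",
    2: "city park map",
    3: "skate shop",
}

# Term list: the complete list of terms which describe each document.
# (Assumption: terms are produced by naive whitespace splitting.)
term_lists = {doc_id: sorted(set(text.split())) for doc_id, text in docs.items()}

# Inverted index: maps each term to its posting list, i.e. the ids of
# all documents which that term indexes.
inverted_index = defaultdict(list)
for doc_id in sorted(docs):
    for term in term_lists[doc_id]:
        # Each (term, doc_id) pairing recorded here is one posting.
        inverted_index[term].append(doc_id)

print(term_lists[1])            # ['park', 'rules', 'skate']
print(inverted_index["park"])   # [1, 2]
print(inverted_index["skate"])  # [1, 3]
```

 Reading C<term_lists> answers "which terms describe this document?", while
 reading C<inverted_index> answers the question a search engine cares about:
 "which documents does this term index?"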

 Since Lucy is a practical implementation of IR theory, it adds useful traits
 to these abstract, distilled definitions.  For instance, a "posting" in its
 most rarefied form is simply a term-document pairing; in Lucy, the class
 L<Lucy::Index::Posting::MatchPosting> fills this role.  However, by
 associating additional information with a posting, such as the number of
 times the term occurs in the document, we can turn it into a
 L<ScorePosting|Lucy::Index::Posting::ScorePosting>, making it possible
 to rank documents by relevance rather than merely list the matching
 documents in no particular order.
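
 The difference between a bare posting and one that carries a term count can
 be sketched as follows.  The classes C<MatchOnlyPosting> and C<FreqPosting>
 and the helper C<postings_for> are hypothetical stand-ins written for this
 example; they do not reflect the actual interface of Lucy's posting classes,
 and ordering by raw count is only a crude proxy for relevance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchOnlyPosting:
    """Hypothetical stand-in: records only that a term occurs in a document."""
    term: str
    doc_id: int


@dataclass(frozen=True)
class FreqPosting(MatchOnlyPosting):
    """Hypothetical stand-in: also records how often the term occurs."""
    freq: int = 1


def postings_for(term, docs):
    """Build one frequency-carrying posting per document containing `term`."""
    postings = []
    for doc_id, text in sorted(docs.items()):
        freq = Counter(text.split())[term]
        if freq:
            postings.append(FreqPosting(term, doc_id, freq))
    return postings


# Hypothetical toy corpus: document id -> text.
docs = {
    1: "park bench",
    2: "park park park fountain",
    3: "skate shop",
}

postings = postings_for("park", docs)

# Using only the term-document pairing, documents 1 and 2 simply "match".
matches = [p.doc_id for p in postings]

# Using the stored frequency, the matches can be put in a meaningful order.
ranked = [p.doc_id for p in sorted(postings, key=lambda p: p.freq, reverse=True)]

print(matches)  # [1, 2]
print(ranked)   # [2, 1]
```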

 =head1 TF/IDF ranking algorithm

 Lucy uses a variant of the well-established "Term Frequency / Inverse
 Document Frequency" weighting scheme.  A thorough treatment of TF/IDF is too
 ambitious for our present purposes, but in a nutshell, it means that...

 =over

 =item *

 in a search for C<skate park>, documents which score well for the
 comparatively rare term C<skate> will rank higher than documents which score
 well for the more common term C<park>.

 =item *

 a 10-word text which has one occurrence each of C<skate> and C<park> will
 rank higher than a 1000-word text which also contains one occurrence of each.

 =back
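
 Both effects can be reproduced with a textbook TF/IDF formula.  The sketch
 below is an assumption-laden illustration, not the variant Lucy implements:
 it uses term frequency divided by document length, and
 C<log(N / doc_freq) + 1> as the inverse document frequency.  The functions
 C<idf> and C<tf_idf_score> and the toy corpus are invented for the example.

```python
import math
from collections import Counter


def idf(term, corpus):
    """Inverse document frequency: rarer terms receive a larger weight."""
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    if doc_freq == 0:
        return 0.0
    return math.log(len(corpus) / doc_freq) + 1.0


def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum tf * idf over the query terms, with tf normalized by length."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        tf = counts[term] / len(doc_tokens)
        score += tf * idf(term, corpus)
    return score


# Hypothetical corpus in which "skate" is rarer than "park".
short_doc = ["skate", "park"] + ["filler"] * 8     # 10 words
long_doc = ["skate", "park"] + ["filler"] * 998    # 1000 words
corpus = [
    short_doc,
    long_doc,
    ["park", "bench"],
    ["city", "park", "map"],
    ["town", "hall"],
]

query = ["skate", "park"]

# The rare term contributes more weight per occurrence than the common one.
print(idf("skate", corpus) > idf("park", corpus))

# One occurrence of each term counts for more in the shorter text.
print(tf_idf_score(query, short_doc, corpus) > tf_idf_score(query, long_doc, corpus))
```

 Both comparisons print C<True> under these assumptions: C<skate> appears in
 two of the five documents while C<park> appears in four, and dividing by
 document length makes a single occurrence one hundred times more significant
 in the 10-word text than in the 1000-word one.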

 A web search for "tf idf" will turn up many excellent explanations of the
 algorithm.

 =cut