# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

=head1 NAME

Lucy::Docs::Cookbook::FastUpdates - Near real-time index updates.

=head1 ABSTRACT

While index updates are fast on average, worst-case update performance may be
significantly slower. To make index updates consistently quick, we must
manually intervene to control the process of index segment consolidation.

=head1 The problem

Ordinarily, modifying an index is cheap. New data is added to new segments,
and the time to write a new segment scales more or less linearly with the
number of documents added during the indexing session.

Deletions are also cheap most of the time, because documents are not removed
immediately; they are simply marked as deleted, and setting that flag is
inexpensive.
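
For example, updating a document is just a quick delete-then-add; nothing is
rewritten in place. This minimal sketch assumes a hypothetical C<url> field
serving as a unique identifier:

    # Mark any existing version of the doc as deleted, then add the new
    # version to a fresh segment.  The deleted doc's data stays on disk
    # until its segment is eventually consolidated.
    my $indexer = Lucy::Index::Indexer->new( index => '/path/to/index' );
    $indexer->delete_by_term( field => 'url', term => $doc->{url} );
    $indexer->add_doc($doc);
    $indexer->commit;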

However, as new segments are added and the deletion rate for existing segments
increases, search-time performance slowly begins to degrade. At some point,
it becomes necessary to consolidate existing segments, rewriting their data
into a new segment.

If the recycled segments are small, the time it takes to rewrite them may not
be significant. Every once in a while, though, a large amount of data must be
rewritten.
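
The extreme case of such a rewrite is a full optimize, which consolidates the
entire index down to a single segment for maximum search-time performance. A
minimal sketch:

    # Worst case: rewrite the entire index.  On a large index this can
    # take a long time, during which the write lock is held while
    # commit() completes.
    my $indexer = Lucy::Index::Indexer->new( index => '/path/to/index' );
    $indexer->optimize;
    $indexer->commit;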

=head1 Procrastinating and playing catch-up

The simplest way to force fast index updates is to avoid rewriting anything.

Indexer relies upon L<IndexManager|Lucy::Index::IndexManager>'s
recycle() method to tell it which segments should be consolidated. If we
subclass IndexManager and override recycle() so that it always returns an
empty array, we get consistently quick performance:

    package NoMergeManager;
    use base qw( Lucy::Index::IndexManager );
    sub recycle { [] }

    package main;
    my $indexer = Lucy::Index::Indexer->new(
        index   => '/path/to/index',
        manager => NoMergeManager->new,
    );
    ...
    $indexer->commit;

However, we can't procrastinate forever. Eventually, we'll have to run an
ordinary, uncontrolled indexing session, potentially triggering a large
rewrite of lots of small and/or degraded segments:

    my $indexer = Lucy::Index::Indexer->new(
        index => '/path/to/index',
        # manager => NoMergeManager->new,
    );
    ...
    $indexer->commit;

=head1 Acceptable worst-case update time, slower degradation

Never merging anything at all in the main indexing process is probably
overkill. Small segments are relatively cheap to merge; we just need to guard
against the big rewrites.

Setting a ceiling on the number of documents in the segments to be recycled
allows us to avoid a mass proliferation of tiny, single-document segments,
while still offering decent worst-case update speed:

    package LightMergeManager;
    use base qw( Lucy::Index::IndexManager );

    sub recycle {
        my $self = shift;
        my $seg_readers = $self->SUPER::recycle(@_);
        # Only merge segments with fewer than 10 documents.
        @$seg_readers = grep { $_->doc_max < 10 } @$seg_readers;
        return $seg_readers;
    }

However, we still have to consolidate every once in a while, and while that
happens content updates will be locked out.

=head1 Background merging

If it's not acceptable to lock out updates while the index consolidation
process runs, the alternative is to move the consolidation process out of
band, using L<Lucy::Index::BackgroundMerger>.

It's never safe to have more than one Indexer attempting to modify the content
of an index at the same time, but a BackgroundMerger and an Indexer can
operate simultaneously:

    # Indexing process.
    use Scalar::Util qw( blessed );
    my $retries = 0;
    while (1) {
        eval {
            my $indexer = Lucy::Index::Indexer->new(
                index   => '/path/to/index',
                manager => LightMergeManager->new,
            );
            $indexer->add_doc($doc);
            $indexer->commit;
        };
        last unless $@;
        if ( blessed($@) and $@->isa("Lucy::Store::LockErr") ) {
            # Catch LockErr and retry; Indexer has already waited up to
            # a second for the write lock before throwing.
            warn "Couldn't get lock ($retries retries)";
            $retries++;
        }
        else {
            die "Write failed: $@";
        }
    }

    # Background merge process.
    my $manager = Lucy::Index::IndexManager->new;
    $manager->set_write_lock_timeout(60_000);
    my $bg_merger = Lucy::Index::BackgroundMerger->new(
        index   => '/path/to/index',
        manager => $manager,
    );
    $bg_merger->commit;

The exception handling code becomes useful once you have more than one index
modification process happening simultaneously. By default, Indexer tries
several times to acquire a write lock over the span of one second, then holds
it until commit() completes. BackgroundMerger handles most of its work
without the write lock, but it does need it briefly once at the beginning and
once again near the end. Under normal loads, the internal retry logic will
resolve conflicts, but if it's not acceptable to miss an insert, you probably
want to catch LockErr exceptions thrown by Indexer. In contrast, a LockErr
from BackgroundMerger probably just needs to be logged.
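
A minimal sketch of that distinction, reusing the blessed() check from the
indexing loop above: a LockErr from BackgroundMerger is treated as a
non-fatal event to log and retry later, not an error.

    # Background merge process, logging lock contention instead of dying.
    use Scalar::Util qw( blessed );
    eval {
        my $manager = Lucy::Index::IndexManager->new;
        $manager->set_write_lock_timeout(60_000);
        my $bg_merger = Lucy::Index::BackgroundMerger->new(
            index   => '/path/to/index',
            manager => $manager,
        );
        $bg_merger->commit;
    };
    if ( blessed($@) and $@->isa("Lucy::Store::LockErr") ) {
        warn "Merge deferred by lock contention: $@";   # try again later
    }
    elsif ($@) {
        die "Merge failed: $@";
    }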

=cut