 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with
 # this work for additional information regarding copyright ownership.
 # The ASF licenses this file to You under the Apache License, Version 2.0
 # (the "License"); you may not use this file except in compliance with
 # the License.  You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.

 =head1 NAME

 Lucy::Docs::FileFormat - Overview of index file format.

 =head1 OVERVIEW

 It is not necessary to understand the current implementation details of the
 index file format in order to use Apache Lucy effectively, but it may be
 helpful if you are interested in tweaking for high performance, exotic usage,
 or debugging and development.

 On a file system, an index is a directory.  The files inside have a
 hierarchical relationship: an index is made up of "segments", each of which is
 an independent inverted index with its own subdirectory; each segment is made
 up of several component parts.

     [index]--|
              |--snapshot_XXX.json
              |--schema_XXX.json
              |--write.lock
              |
              |--seg_1--|
              |         |--segmeta.json
              |         |--cfmeta.json
              |         |--cf.dat-------|
              |                         |--[lexicon]
              |                         |--[postings]
              |                         |--[documents]
              |                         |--[highlight]
              |                         |--[deletions]
              |
              |--seg_2--|
              |         |--segmeta.json
              |         |--cfmeta.json
              |         |--cf.dat-------|
              |                         |--[lexicon]
              |                         |--[postings]
              |                         |--[documents]
              |                         |--[highlight]
              |                         |--[deletions]
              |
              |--[...]--|

 =head1 Write-once philosophy

 All segment directory names consist of the string "seg_" followed by a number
 in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher numbers indicating
 more recent segments.  Once a segment is finished and committed, its name is
 never re-used and its files are never modified.

 Old segments become obsolete and can be removed when their data has been
 consolidated into new segments during the process of segment merging and
 optimization.  A fully-optimized index has only one segment.
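
 The naming scheme can be illustrated with a short sketch.  This is not
 Lucy's own code: the helper names C<segment_number> and C<newest_first> are
 hypothetical, and the only thing taken from the text above is that the suffix
 is a base-36 number.  It shows why segment names should be compared as
 numbers rather than as strings.

```python
import re

# Matches names such as "seg_1", "seg_5m", "seg_p9s2".
SEG_NAME = re.compile(r"^seg_([0-9a-z]+)$")

def segment_number(name):
    """Return the integer encoded in a segment directory name."""
    match = SEG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a segment directory name: {name!r}")
    return int(match.group(1), 36)

def newest_first(names):
    """Sort segment names from most recent to oldest by base-36 number."""
    return sorted(names, key=segment_number, reverse=True)

print(segment_number("seg_5m"))  # 5 * 36 + 22 = 202
print(newest_first(["seg_1", "seg_5m", "seg_p9s2", "seg_z"]))
# ['seg_p9s2', 'seg_5m', 'seg_z', 'seg_1']
```

 A plain string sort would place C<seg_z> (35) after C<seg_p9s2>, which is why
 the sketch decodes the number first.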

 =head1 Top-level entries

 There are a handful of "top-level" files and directories which belong to the
 entire index rather than to a particular segment.

 =head2 snapshot_XXX.json

 A "snapshot" file, e.g. C<snapshot_m7p.json>, is a list of index files and
 directories.  Because index files, once written, are never modified, the list
 of entries in a snapshot defines a point-in-time view of the data in an index.

 Like segment directories, snapshot files also utilize the
 unique-base-36-number naming convention; the higher the number, the more
 recent the file.  The appearance of a new snapshot file within the index
 directory constitutes an index update.  While a new segment is being written,
 new files may be added to the index directory, but until a new snapshot file
 gets written, a Searcher opening the index for reading won't know about them.
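
 As a rough sketch of how a reader might pick its point-in-time view, the
 following chooses the most recent snapshot from a directory listing.  The
 function name C<latest_snapshot> is hypothetical, and the sketch works on a
 list of filenames only; it makes no claim about what the snapshot file
 contains or how Lucy itself performs this step.

```python
import re

# Matches names such as "snapshot_m7p.json".
SNAPSHOT_NAME = re.compile(r"^snapshot_([0-9a-z]+)\.json$")

def latest_snapshot(filenames):
    """Return the snapshot filename with the highest base-36 number, or None."""
    best_name, best_number = None, -1
    for name in filenames:
        match = SNAPSHOT_NAME.match(name)
        if match is None:
            continue  # not a snapshot file: segment dir, schema, lock, ...
        number = int(match.group(1), 36)
        if number > best_number:
            best_name, best_number = name, number
    return best_name

listing = ["schema_1.json", "seg_a", "seg_b", "snapshot_9.json", "snapshot_a.json"]
print(latest_snapshot(listing))  # snapshot_a.json
```

 In this toy listing, C<seg_b> may already exist on disk, but a reader only
 learns about it once a snapshot that lists it has been written.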

 =head2 schema_XXX.json

 The schema file is a Schema object describing the index's format, serialized
 as JSON.  It, too, is versioned, and a given snapshot file will reference one
 and only one schema file.

 =head2 locks

 By default, only one indexing process may safely modify the index at any given
 time.  Processes reserve an index by laying claim to the C<write.lock> file
 within the C<locks/> directory.  A smattering of other lock files may be used
 from time to time, as well.

 =head1 A segment's component parts

 By default, each segment has up to five logical components: lexicon, postings,
 document storage, highlight data, and deletions.  Binary data from these
 components gets stored in virtual files within the "cf.dat" compound file;
 metadata is stored in a shared "segmeta.json" file.

 =head2 segmeta.json

 The segmeta.json file is a central repository for segment metadata.  In
 addition to information such as document counts and field numbers, it also
 warehouses arbitrary metadata on behalf of individual index components.

 =head2 Lexicon

 Each indexed field gets its own lexicon in each segment.  The exact files
 involved depend on the field's type, but generally speaking there will be two
 parts.  First, there's a primary C<lexicon-XXX.dat> file which houses a
 complete term list associating terms with corpus frequency statistics,
 postings file locations, etc.  Second, one or more "lexicon index" files may
 be present which contain periodic samples from the primary lexicon file to
 facilitate fast lookups.
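
 The idea of periodic samples can be sketched in memory.  Everything here is an
 assumption made for illustration: a sorted Python list stands in for the
 primary lexicon file, list positions stand in for byte offsets, and the names
 C<build_lexicon_index> and C<lookup> and the sampling interval are
 hypothetical.  The point is only that a small sampled index narrows the search
 to a short stretch of the full term list.

```python
import bisect

def build_lexicon_index(terms, interval):
    """Sample every `interval`-th term; return (sampled_terms, positions)."""
    positions = list(range(0, len(terms), interval))
    return [terms[i] for i in positions], positions

def lookup(terms, sampled_terms, positions, interval, target):
    """Return the position of `target` in the sorted `terms`, or None."""
    # Find the last sample that sorts at or before the target.
    slot = bisect.bisect_right(sampled_terms, target) - 1
    if slot < 0:
        return None  # target sorts before the first term
    start = positions[slot]
    # Scan forward only within the stretch that this sample covers.
    for i in range(start, min(start + interval, len(terms))):
        if terms[i] == target:
            return i
    return None

terms = ["apple", "banana", "cherry", "date", "elder",
         "fig", "freedom", "grape", "kiwi"]
samples, positions = build_lexicon_index(terms, 3)
print(samples)                                            # ['apple', 'date', 'freedom']
print(lookup(terms, samples, positions, 3, "fig"))        # 5
```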

 =head2 Postings

 "Posting" is a technical term from the field of
 L<information retrieval|Lucy::Docs::IRTheory>, defined as a single
 instance of one term indexing one document.  If you are looking at the index
 in the back of a book, and you see that "freedom" is referenced on pages 8,
 86, and 240, that would be three postings, which taken together form a
 "posting list".  The same terminology applies to an index in electronic form.

 Each segment has one postings file per indexed field.  When a search is
 performed for a single term, first that term is looked up in the lexicon.  If
 the term exists in the segment, the record in the lexicon will contain
 information about which postings file to look at and where to look.

 The first thing any posting record tells you is a document id.  By iterating
 over all the postings associated with a term, you can find all the documents
 that match that term, a process which is analogous to looking up page numbers
 in a book's index.  However, each posting record typically contains other
 information in addition to document id, e.g. the positions at which the term
 occurs within the field.
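
 A toy model may make the terminology concrete.  The sketch below builds
 posting lists in memory, with each posting holding a document id and the
 positions of the term.  The function name C<build_postings>, the
 whitespace tokenization, and the tuple layout are assumptions for
 illustration and say nothing about the on-disk encoding.

```python
from collections import defaultdict

def build_postings(docs):
    """Map each term to a posting list of (doc_id, [positions]), by doc id."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        positions_by_term = defaultdict(list)
        # Naive tokenization: lowercase and split on whitespace.
        for position, term in enumerate(docs[doc_id].lower().split()):
            positions_by_term[term].append(position)
        for term, positions in positions_by_term.items():
            postings[term].append((doc_id, positions))
    return dict(postings)

docs = {
    40: "freedom of speech",
    44: "freedom freedom and more freedom",
}
postings = build_postings(docs)
print(postings["freedom"])                           # [(40, [0]), (44, [0, 1, 4])]
print([doc_id for doc_id, _ in postings["freedom"]])  # [40, 44]
```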

 =head2 Documents

 The document storage section is a simple database, organized into two files:

 =over

 =item *

 B<documents.dat> - Serialized documents.

 =item *

 B<documents.ix> - Document storage index, a solid array of 64-bit integers
 where each integer location corresponds to a document id, and the value at
 that location points at a file position in the documents.dat file.

 =back
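
 The lookup described for C<documents.ix> amounts to reading one fixed-width
 slot.  The sketch below uses an in-memory buffer in place of the real file.
 The function name C<read_doc_offset> is hypothetical, and the byte order
 (big-endian) and signedness are assumptions that the text above does not
 state.

```python
import io
import struct

def read_doc_offset(ix_file, doc_id):
    """Return the 64-bit value stored in slot `doc_id` of an .ix-style array."""
    ix_file.seek(doc_id * 8)            # each slot is 8 bytes wide
    raw = ix_file.read(8)
    if len(raw) != 8:
        raise IndexError(f"no slot for document id {doc_id}")
    return struct.unpack(">q", raw)[0]  # assumed: big-endian, signed

# Toy index with four slots; the values are made-up file positions.
offsets = [0, 0, 17, 53]
ix = io.BytesIO(struct.pack(">4q", *offsets))
print(read_doc_offset(ix, 2))  # 17
```

 Because every slot has the same width, finding the record for a document id
 takes a single seek regardless of how many documents the segment holds.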

 =head2 Highlight data

 The files which store data used for excerpting and highlighting are organized
 similarly to the files used to store documents.

 =over

 =item *

 B<highlight.dat> - Chunks of serialized highlight data, one per doc id.

 =item *

 B<highlight.ix> - Highlight data index -- as with the C<documents.ix> file, a
 solid array of 64-bit file pointers.

 =back

 =head2 Deletions

 When a document is "deleted" from a segment, it is not actually purged right
 away; it is merely marked as "deleted" via a deletions file.  Deletions files
 contain bit vectors with one bit for each document in the segment; if bit
 #254 is set then document 254 is deleted, and if that document turns up in a
 search it will be masked out.

 It is only when a segment's contents are rewritten to a new segment during the
 segment-merging process that deleted documents truly go away.
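
 The masking step can be sketched with a byte string acting as the bit vector.
 The helper names C<mark_deleted> and C<is_deleted> are hypothetical, and the
 bit order within each byte (least significant bit first) is an assumption.

```python
def is_deleted(bits, doc_id):
    """Return True if the bit for `doc_id` is set in the byte string `bits`."""
    byte_index, bit_index = divmod(doc_id, 8)
    if byte_index >= len(bits):
        return False  # no bit recorded for this document
    return bool(bits[byte_index] & (1 << bit_index))  # assumed: LSB-first

def mark_deleted(bits, doc_id):
    """Set the bit for `doc_id` in a bytearray, growing it if necessary."""
    byte_index, bit_index = divmod(doc_id, 8)
    if byte_index >= len(bits):
        bits.extend(b"\x00" * (byte_index + 1 - len(bits)))
    bits[byte_index] |= 1 << bit_index

deletions = bytearray()
mark_deleted(deletions, 254)

hits = [40, 254, 300]
print([doc_id for doc_id in hits if not is_deleted(deletions, doc_id)])  # [40, 300]
```

 The document's data is untouched by C<mark_deleted>; only the search results
 are filtered, which matches the point that deleted documents persist until a
 merge.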

 =head1 Compound Files

 If you peer inside an index directory, you won't actually find any files named
 "documents.dat", "highlight.ix", etc. unless there is an indexing process
 underway.  What you will find instead is one "cf.dat" and one "cfmeta.json"
 file per segment.

 To minimize the need for file descriptors at search-time, all per-segment
 binary data files are concatenated together in "cf.dat" at the close of each
 indexing session.  Information about where each file begins and ends is stored
 in C<cfmeta.json>.  When the segment is opened for reading, a single file
 descriptor per "cf.dat" file can be shared among several readers.
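
 A minimal sketch of the compound-file idea follows.  The metadata layout used
 here, a C<files> mapping with C<offset> and C<length> per entry, is an
 assumption for illustration and may not match the real C<cfmeta.json>; the
 function names are hypothetical, and no padding or alignment is applied.

```python
import io

def build_compound(files):
    """Concatenate {name: bytes} into one blob plus a metadata dict."""
    blob, meta, offset = bytearray(), {"files": {}}, 0
    for name, data in files.items():
        meta["files"][name] = {"offset": offset, "length": len(data)}
        blob += data
        offset += len(data)
    return bytes(blob), meta

def read_virtual_file(cf_dat, cfmeta, name):
    """Return the bytes of virtual file `name` from an open compound file."""
    entry = cfmeta["files"][name]  # assumed layout, see the note above
    cf_dat.seek(entry["offset"])
    return cf_dat.read(entry["length"])

blob, meta = build_compound({
    "documents.dat": b"doc-bytes",
    "highlight.ix": b"\x00" * 8,
})
cf_dat = io.BytesIO(blob)  # one handle serves every virtual file
print(read_virtual_file(cf_dat, meta, "documents.dat"))  # b'doc-bytes'
print(meta["files"]["highlight.ix"])                     # {'offset': 9, 'length': 8}
```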

 =head1 A Typical Search

 Here's a simplified narrative, dramatizing how a search for "freedom" against
 a given segment plays out:

 =over

 =item 1

 The searcher asks the relevant Lexicon Index, "Do you know anything about
 'freedom'?"  Lexicon Index replies, "Can't say for sure, but if the main
 Lexicon file does, 'freedom' is probably somewhere around byte 21008."

 =item 2

 The main Lexicon tells the searcher "One moment, let me scan our records...
 Yes, we have 2 documents which contain 'freedom'.  You'll find them in
 seg_6/postings-4.dat starting at byte 66991."

 =item 3

 The Postings file says "Yep, we have 'freedom', all right!  Document id 40
 has 1 'freedom', and document 44 has 8.  If you need to know more, like if any
 'freedom' is part of the phrase 'freedom of speech', ask me about positions!"

 =item 4

 If the searcher is only looking for 'freedom' in isolation, that's where it
 stops.  It now knows enough to assign the documents scores against "freedom",
 with the 8-freedom document likely ranking higher than the single-freedom
 document.

 =back
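
 The last two steps of the narrative can be sketched over an in-memory posting
 structure.  This is an illustration under stated assumptions, not Lucy's
 matching or scoring code: the posting layout (term mapped to a list of
 document id and positions), the function names, and the use of a raw
 occurrence count as a stand-in for a real score are all hypothetical.

```python
def rank_by_term_count(postings, term):
    """Return doc ids for `term`, most occurrences first (a crude score)."""
    ranked = sorted(postings.get(term, []), key=lambda p: len(p[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked]

def phrase_docs(postings, phrase):
    """Return doc ids where the terms of `phrase` occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return []
    per_term = []
    for term in terms:
        if term not in postings:
            return []  # a missing term means the phrase cannot match
        per_term.append({doc_id: set(pos) for doc_id, pos in postings[term]})
    matches = []
    for doc_id in sorted(per_term[0]):
        if not all(doc_id in entry for entry in per_term):
            continue
        # The phrase starts at `start` if term i sits at position start + i.
        if any(
            all(start + i in per_term[i][doc_id] for i in range(len(terms)))
            for start in per_term[0][doc_id]
        ):
            matches.append(doc_id)
    return matches

postings = {
    "freedom": [(40, [0]), (44, [0, 3, 5, 9, 12, 15, 20, 22])],
    "of": [(40, [1]), (44, [1])],
    "speech": [(40, [2])],
}
print(rank_by_term_count(postings, "freedom"))   # [44, 40]
print(phrase_docs(postings, "freedom of speech"))  # [40]
```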