subversion/libsvn_fs_fs/structure-indexes - subversion - Git at Google

 This file describes the design, data model, and storage formats of FSFS
 index data.


 Design
 ======

 Each pack and each rev file using logical addressing contains exactly
 two index sections.  One, the log-to-phys index, maps the (rev, item_index)
 pairs to absolute file offsets.  The other, phys-to-log, is a reverse
 index that gives basic information on any file location.  This is enough
 to read and cache any data without traversing DAGs.

 Rev and pack files are immutable, so the same is true for index data.
 During a transaction or while packing a file, a proto index file gets
 written (actually, one log-to-phys and one phys-to-log).  They use a
 simpler, less compact format with fixed record lengths.  A proto index
 basically aggregates all the information that must later be transformed
 into the final index.


 General design concerns
 -----------------------

 In Subversion, there is no limit to the size of a revision; even practical
 limits are in the order of millions of changes at least.  Index data for
 these would be multiple megabytes in size with pack file indexes possibly
 approaching 1 GB.  To ensure we still get roughly O(1) access time, we
 need a hierarchical data structure.

 Therefore, the indexes will start with a header containing an array of
 references to sub-sections or pages.  The length of these pages varies
 but is limited to a size configurable in fsfs.conf.

 Finally, it is assumed that whole pages can be cached efficiently and
 with a high cache hit rate.  So, although a page may have a thousand or
 more entries, the access time is still amortized O(1) in many scenarios.


 Items and item types
 --------------------

 The index implementation treats item_index and item type as simple ints,
 except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED.
 Since they have been defined as 0, the code may test for "used" etc.
 by simply comparing with 0.

 See section "addressing modes" in structure to a list of item types
 and pre-defined item_index values.


 Encoding
 --------

 The final index data format is tuned for space and decoding efficiency.
 Indexes are stored as a sequence of variable integers.  The encoding is
 as follows:

 * Unsigned integers are stored in little endian order with a variable
   length 7b/8b encoding.  If most significant bit a byte has been set,
   the next byte has also belongs to the same value.

   0x00 .. 0x7f    -> 0x00 .. 0x7f               ( 7 bits stored in  8 bits)
   0x80 .. 0xff    -> 0x80 0x01 .. 0xff 0x01     (14 bits stored in 16 bits)
   0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f     (14 bits stored in 16 bits)
   0x100000000     -> 0x80 0x80 0x80 0x80 0x10   (35 bits stored in 40 bits)

   Technically, we can represent integers of arbitrary lengths.  Currently,
   we only generate and parse up to 64 bits.

 * Signed integers are mapped onto the unsigned value space as follows:

   x >= 0 ->  2 * x
   x < 0  -> -2 * x - 1

   Again, we can represent arbitrary length numbers that way but the code
   is currently restricted to 64 bits.

 Most data is unsigned by nature but will be stored differentially using
 signed integers.


 Encoding in proto-index files
 -----------------------------

 These have a much simpler encoding.  Throughout the files, all records have
 the same length (but different between L2P and P2L).  All records contain
 unsigned 64 bit integers only, stored in little endian byte order.


 Log-to-phys index
 =================

 This index has to map (rev, item_index) -> offset.  It assumes that the
 item_index values per revision are dense and start at 0.  There may be
 unused item_index values, though; the data structure simply gets less
 space-efficient when the more sparse the value space gets.


 Index data model
 ----------------

 hierarchy:

   header -> per-revision info -> page -> offset

   There is one entry per revision in the header.  Per revision there are
   one or more pages (exclusive to that revision) containing up to a known,
   fixed limit of entries (= page size).  The total access path is:

   pages = header->pages[revision];
   offsets = page = pages[item_index / page_size];
   offset = offsets[item_index % page_size];

   Different log-to-phys indexes in the same repository may have different
   page sizes but within any given index, the page size is the same and
   immutable.

 header:

   <first revision>   ... first revision covered by this index
   <revision count>   ... number of revision covered by this index
   <page size>        ... maximum number of entries per page
   <page table index> ... array, for each revision containing the index in
                          <page table> of the first page that belongs to
                          this revision.  This has <revision count>+1
                          entries to terminate the last revision.
   <page table>       ... array of page headers.  It has
                          <page table index>[<revision count>] entries.

 page table:

   <offset>           ... absolute position of the page contents within the
                          index
   <entry count>      ... number of offset entries in the page.
                          Must match <header>.<page size> unless this is
                          the last page for the respective revision.
   <size>             ... length in bytes of the on-disk page description.
                          Note that this is redundant with the <offset>.

 page:

   <entry count>      ... number of offset entries in the page.
                          Must match <header>.<page size> unless this is
                          the last page for the respective revision.
                          Redundant with <page table>.<entry count>
   <offsets>          ... array of absolute file positions within the rev /
                          pack file.  This has <entry count> entries.


 Index on-disk format
 --------------------

   index := "L2P-INDEX\n" header revisions pages offsets

   header := u(<header>.<first revision>) \
             u(<header>.<page size>) \
             u(<header>.<revision count>) \
             u(s(<header>.<page table>))

   revisions := u(  <header>.<page table index>[k+1]
                  - <header>.<page table index>[k]),
                for k in 0 .. <header>.<revision count>-1

   pages := u(<header>.<page table>[k].<size>) \
            u(<header>.<page table>[k].<entry count>),
            for k in 0 .. s(<header>.<page table>)-1

   offsets := page(k),
              for k in 0 .. s(<header>.<page table>)-1

   page(k) := i(<header>.<page table>[k].<offsets>[0]) \
              i(  <header>.<page table>[k].<offsets>[l] \
                - <header>.<page table>[k].<offsets>[l - 1]),
              for l in 1 .. s(<header>.<page table>[k].<entry count>)-1

   u(x) ... unsigned int x in 7b/8b encoding
   i(x) ... signed int x in 7b/8b encoding
   s(x) ... number of entries in array x


 Proto index file format
 -----------------------

 The index will be created from a "revision-less" proto index file
 containing (<offset><item_index>) pairs only.

 All mappings belonging to the same revision must be written in one go but
 there is no restriction on the order of those entries.  To signify that
 a new revision begins, a (0, 0) mapping must be written.  A (0, 0) entry
 at the beginning of the file is optional and will be ignored.

   <bof>         /* begin of proto index file for revision r and following */
   (0, 0)        /* mark start of revision r, optional for first rev */
   (off, item)*  /* zero or more mappings in random order */
   (0, 0)        /* mark start of revision r + 1 */
   (off, item)*  /* zero or more mappings in random order */
   (0, 0)        /* mark start of revision r + 2 */
   (off, item)*  /* zero or more mappings in random order */
   ...
   <eof>         /* end of file. */

 All entries are pairs of 64 bit unsigned integers in little endian order.


 Phys-to-log index
 =================

 This index has to map offset -> (rev, item_index, type, len, checksum).


 Index data model
 ----------------

 hierarchy:

   header -> page -> item info

   Logically, the index splits up index rev / pack file into pages of a
   fixed size.  That page size may differ from the FS's block size.  The
   index will describe each rev / pack file page with one index page.

   page = header->pages[offset % page_size];
   item info = binary_search(page.data, offset)

   Note that the current implementation will not return any data if the
   offset is does not match any actual item start.  To simplify the lookup,
   the last index page will have an "unused item" entry for the section
   behind EOF.  Holes aren't allowed as well, i.e. every byte of the rev /
   pack is expected to be covered by the index.

   Also, there may be items stretching across page borders or even over
   multiple pages.  The data model solves this issue by storing the item
   descriptions as a "primary array" and then representing the pages as
   ranges within that array.  Thus multiple pages may reference the same
   item description.

 header:

   <first revision>   ... first revision covered by this index
   <file size>        ... size of the rev / pack file in bytes
   <page size>        ... number of bytes in the rev / pack file covered by
                          each index page
   <page count>       ... number of pages
   <offsets>          ... array of page offsets, i.e. locations the page
                          data within the index.
                          This array has <page count> + 1 entries.

 page:

   <entries>          ... array of item descriptions, ordered by offset.
                          First and last entry may cross page boundaries.

 entry:

   <offset>           ... absolute position in the pack / rev file
   <size>             ... on-disk size of the item in the pack / rev file
   <type>             ... item type
   <FNV checksum>     ... modified 32 bit FNV-1a checksum of that section
                          of the pack / rev file (see below). 0 for empty
                          or zero-length items
   <revision>         ... revision that this item belongs to
   <item_index>       ... item_index within that revision


 Index on-disk format
 --------------------

   index := "P2L-INDEX\n" header pages items

   header := u(<header>.<first revision>) \
             u(<header>.<file size>) \
             u(<header>.<page size>) \
             u(<header>.<page count>)

   pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]),
            for k in 0 .. <header>.<page count> -1

   items := u(<items in page k>[0].<offset>) \
            u(<items in page k>[l].<size>) \
            i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \
            i(  <items in page k>[l].<revision>
              - <items in page k>[l-1].<revision>), \
            u(FNV checksum)
            for l in 0 .. s(<items in page k>)-1,
            for k in 0 .. <header>.<page count>-1

   u(x) ... unsigned int x in 7b/8b encoding
   i(x) ... signed int x in 7b/8b encoding
   s(x) ... number of entries in collection x
   c(x) ... compound function := x.<item_index> * 8 + x.<type>

   Access to negative indexes gives a 0 value.

   <Items in page k> are in strict ascending offset order.  Items that
   started after the begin of a given page and overlap with the next page
   will not be stored in the start page.  The runtime representation will
   duplicate items overlapping page boundaries; the on-disk representation
   will not.  Thus, pages inside large items will have zero entries on disk.


 Proto index file format
 -----------------------

 The index will be created from a proto index file containing simple
 instances of svn_fs_fs__p2l_entry_t with the following element order:

   item offset               as uint64
   item size                 as uint64
   item type                 as uint64
   modified FNV1a checksum   as uint64
   revision                  as uint64, with SVN_INVALID_REVNUM mapped to 0
                                        and revisions >= 0 stored as rev+1
   item index                as uint64

 All values are stored in little endian order.

 Page table and header information, except start revision and page size,
 can easily be derived from that information.

 All entries must be written in strict offset order.  Overlapping entries
 are not allowed; zero-length items are.

 In transactions, the final revision number may not be known when writing
 the proto index file (e.g. while still writing the proto rev file).  Items
 with revision set to SVN_INVALID_REVNUM will therefore be automatically
 updated when creating the final index.  This is possible in conjunction
 with rev files but not for pack files.


 FNV checksum
 ------------

 FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/
 For performance reasons we use a modified version:

 * split the input byte stream [b0 .. bN] into 4 sub-streams of equal
   length and up to 3 remnants:

   [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant]

 * calculate 32 bit FNV-1a checksums for the 4 substreams:

   h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..])

 * concatenate the big endian representation of these checksums (4 bytes
   each) plus the remnant of the original stream into a 16 to 19 byte long
   intermediate:

   [i0 .. iK] = [big-endian(h0) ... big-endian(h3) remnant ], 16 <= K+1 <= 19

 * fold the variable-length intermediate into a compact 32 bit checksum:

   FNV checksum = fnv_1a([i0 .. iK])
	This file describes the design, data model, and storage formats of FSFS
	index data.


	Design
	======

	Each pack and each rev file using logical addressing contains exactly
	two index sections. One, the log-to-phys index, maps the (rev, item_index)
	pairs to absolute file offsets. The other, phys-to-log, is a reverse
	index that gives basic information on any file location. This is enough
	to read and cache any data without traversing DAGs.

	Rev and pack files are immutable, so the same is true for index data.
	During a transaction or while packing a file, a proto index file gets
	written (actually, one log-to-phys and one phys-to-log). They use a
	simpler, less compact format with fixed record lengths. A proto index
	basically aggregates all the information that must later be transformed
	into the final index.


	General design concerns
	-----------------------

	In Subversion, there is no limit to the size of a revision; even practical
	limits are in the order of millions of changes at least. Index data for
	these would be multiple megabytes in size with pack file indexes possibly
	approaching 1 GB. To ensure we still get roughly O(1) access time, we
	need a hierarchical data structure.

	Therefore, the indexes will start with a header containing an array of
	references to sub-sections or pages. The length of these pages varies
	but is limited to a size configurable in fsfs.conf.

	Finally, it is assumed that whole pages can be cached efficiently and
	with a high cache hit rate. So, although a page may have a thousand or
	more entries, the access time is still amortized O(1) in many scenarios.


	Items and item types
	--------------------

	The index implementation treats item_index and item type as simple ints,
	except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED.
	Since they have been defined as 0, the code may test for "used" etc.
	by simply comparing with 0.

	See section "addressing modes" in structure to a list of item types
	and pre-defined item_index values.


	Encoding
	--------

	The final index data format is tuned for space and decoding efficiency.
	Indexes are stored as a sequence of variable integers. The encoding is
	as follows:

	* Unsigned integers are stored in little endian order with a variable
	length 7b/8b encoding. If most significant bit a byte has been set,
	the next byte has also belongs to the same value.

	0x00 .. 0x7f -> 0x00 .. 0x7f ( 7 bits stored in 8 bits)
	0x80 .. 0xff -> 0x80 0x01 .. 0xff 0x01 (14 bits stored in 16 bits)
	0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f (14 bits stored in 16 bits)
	0x100000000 -> 0x80 0x80 0x80 0x80 0x10 (35 bits stored in 40 bits)

	Technically, we can represent integers of arbitrary lengths. Currently,
	we only generate and parse up to 64 bits.

	* Signed integers are mapped onto the unsigned value space as follows:

	x >= 0 -> 2 * x
	x < 0 -> -2 * x - 1

	Again, we can represent arbitrary length numbers that way but the code
	is currently restricted to 64 bits.

	Most data is unsigned by nature but will be stored differentially using
	signed integers.


	Encoding in proto-index files
	-----------------------------

	These have a much simpler encoding. Throughout the files, all records have
	the same length (but different between L2P and P2L). All records contain
	unsigned 64 bit integers only, stored in little endian byte order.


	Log-to-phys index
	=================

	This index has to map (rev, item_index) -> offset. It assumes that the
	item_index values per revision are dense and start at 0. There may be
	unused item_index values, though; the data structure simply gets less
	space-efficient when the more sparse the value space gets.


	Index data model
	----------------

	hierarchy:

	header -> per-revision info -> page -> offset

	There is one entry per revision in the header. Per revision there are
	one or more pages (exclusive to that revision) containing up to a known,
	fixed limit of entries (= page size). The total access path is:

	pages = header->pages[revision];
	offsets = page = pages[item_index / page_size];
	offset = offsets[item_index % page_size];

	Different log-to-phys indexes in the same repository may have different
	page sizes but within any given index, the page size is the same and
	immutable.

	header:

	<first revision> ... first revision covered by this index
	<revision count> ... number of revision covered by this index
	<page size> ... maximum number of entries per page
	<page table index> ... array, for each revision containing the index in
	<page table> of the first page that belongs to
	this revision. This has <revision count>+1
	entries to terminate the last revision.
	<page table> ... array of page headers. It has
	<page table index>[<revision count>] entries.

	page table:

	<offset> ... absolute position of the page contents within the
	index
	<entry count> ... number of offset entries in the page.
	Must match <header>.<page size> unless this is
	the last page for the respective revision.
	<size> ... length in bytes of the on-disk page description.
	Note that this is redundant with the <offset>.

	page:

	<entry count> ... number of offset entries in the page.
	Must match <header>.<page size> unless this is
	the last page for the respective revision.
	Redundant with <page table>.<entry count>
	<offsets> ... array of absolute file positions within the rev /
	pack file. This has <entry count> entries.


	Index on-disk format
	--------------------

	index := "L2P-INDEX\n" header revisions pages offsets

	header := u(<header>.<first revision>) \
	u(<header>.<page size>) \
	u(<header>.<revision count>) \
	u(s(<header>.<page table>))

	revisions := u( <header>.<page table index>[k+1]
	- <header>.<page table index>[k]),
	for k in 0 .. <header>.<revision count>-1

	pages := u(<header>.<page table>[k].<size>) \
	u(<header>.<page table>[k].<entry count>),
	for k in 0 .. s(<header>.<page table>)-1

	offsets := page(k),
	for k in 0 .. s(<header>.<page table>)-1

	page(k) := i(<header>.<page table>[k].<offsets>[0]) \
	i( <header>.<page table>[k].<offsets>[l] \
	- <header>.<page table>[k].<offsets>[l - 1]),
	for l in 1 .. s(<header>.<page table>[k].<entry count>)-1

	u(x) ... unsigned int x in 7b/8b encoding
	i(x) ... signed int x in 7b/8b encoding
	s(x) ... number of entries in array x


	Proto index file format
	-----------------------

	The index will be created from a "revision-less" proto index file
	containing (<offset><item_index>) pairs only.

	All mappings belonging to the same revision must be written in one go but
	there is no restriction on the order of those entries. To signify that
	a new revision begins, a (0, 0) mapping must be written. A (0, 0) entry
	at the beginning of the file is optional and will be ignored.

	<bof> /* begin of proto index file for revision r and following */
	(0, 0) /* mark start of revision r, optional for first rev */
	(off, item)* /* zero or more mappings in random order */
	(0, 0) /* mark start of revision r + 1 */
	(off, item)* /* zero or more mappings in random order */
	(0, 0) /* mark start of revision r + 2 */
	(off, item)* /* zero or more mappings in random order */
	...
	<eof> /* end of file. */

	All entries are pairs of 64 bit unsigned integers in little endian order.


	Phys-to-log index
	=================

	This index has to map offset -> (rev, item_index, type, len, checksum).


	Index data model
	----------------

	hierarchy:

	header -> page -> item info

	Logically, the index splits up index rev / pack file into pages of a
	fixed size. That page size may differ from the FS's block size. The
	index will describe each rev / pack file page with one index page.

	page = header->pages[offset % page_size];
	item info = binary_search(page.data, offset)

	Note that the current implementation will not return any data if the
	offset is does not match any actual item start. To simplify the lookup,
	the last index page will have an "unused item" entry for the section
	behind EOF. Holes aren't allowed as well, i.e. every byte of the rev /
	pack is expected to be covered by the index.

	Also, there may be items stretching across page borders or even over
	multiple pages. The data model solves this issue by storing the item
	descriptions as a "primary array" and then representing the pages as
	ranges within that array. Thus multiple pages may reference the same
	item description.

	header:

	<first revision> ... first revision covered by this index
	<file size> ... size of the rev / pack file in bytes
	<page size> ... number of bytes in the rev / pack file covered by
	each index page
	<page count> ... number of pages
	<offsets> ... array of page offsets, i.e. locations the page
	data within the index.
	This array has <page count> + 1 entries.

	page:

	<entries> ... array of item descriptions, ordered by offset.
	First and last entry may cross page boundaries.

	entry:

	<offset> ... absolute position in the pack / rev file
	<size> ... on-disk size of the item in the pack / rev file
	<type> ... item type
	<FNV checksum> ... modified 32 bit FNV-1a checksum of that section
	of the pack / rev file (see below). 0 for empty
	or zero-length items
	<revision> ... revision that this item belongs to
	<item_index> ... item_index within that revision


	Index on-disk format
	--------------------

	index := "P2L-INDEX\n" header pages items

	header := u(<header>.<first revision>) \
	u(<header>.<file size>) \
	u(<header>.<page size>) \
	u(<header>.<page count>)

	pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]),
	for k in 0 .. <header>.<page count> -1

	items := u(<items in page k>[0].<offset>) \
	u(<items in page k>[l].<size>) \
	i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \
	i( <items in page k>[l].<revision>
	- <items in page k>[l-1].<revision>), \
	u(FNV checksum)
	for l in 0 .. s(<items in page k>)-1,
	for k in 0 .. <header>.<page count>-1

	u(x) ... unsigned int x in 7b/8b encoding
	i(x) ... signed int x in 7b/8b encoding
	s(x) ... number of entries in collection x
	c(x) ... compound function := x.<item_index> * 8 + x.<type>

	Access to negative indexes gives a 0 value.

	<Items in page k> are in strict ascending offset order. Items that
	started after the begin of a given page and overlap with the next page
	will not be stored in the start page. The runtime representation will
	duplicate items overlapping page boundaries; the on-disk representation
	will not. Thus, pages inside large items will have zero entries on disk.


	Proto index file format
	-----------------------

	The index will be created from a proto index file containing simple
	instances of svn_fs_fs__p2l_entry_t with the following element order:

	item offset as uint64
	item size as uint64
	item type as uint64
	modified FNV1a checksum as uint64
	revision as uint64, with SVN_INVALID_REVNUM mapped to 0
	and revisions >= 0 stored as rev+1
	item index as uint64

	All values are stored in little endian order.

	Page table and header information, except start revision and page size,
	can easily be derived from that information.

	All entries must be written in strict offset order. Overlapping entries
	are not allowed; zero-length items are.

	In transactions, the final revision number may not be known when writing
	the proto index file (e.g. while still writing the proto rev file). Items
	with revision set to SVN_INVALID_REVNUM will therefore be automatically
	updated when creating the final index. This is possible in conjunction
	with rev files but not for pack files.


	FNV checksum
	------------

	FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/
	For performance reasons we use a modified version:

	* split the input byte stream [b0 .. bN] into 4 sub-streams of equal
	length and up to 3 remnants:

	[b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant]

	* calculate 32 bit FNV-1a checksums for the 4 substreams:

	h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..])

	* concatenate the big endian representation of these checksums (4 bytes
	each) plus the remnant of the original stream into a 16 to 19 byte long
	intermediate:

	[i0 .. iK] = [big-endian(h0) ... big-endian(h3) remnant ], 16 <= K+1 <= 19

	* fold the variable-length intermediate into a compact 32 bit checksum:

	FNV checksum = fnv_1a([i0 .. iK])