notes/fsfs-improvements.txt - subversion - Git at Google

 [ At the time of writing (1.7-dev), FSFS format 5 was current.  Since then
 format 5 has been retracted; it was never released.  1.7 shipped with f4 and
 1.8 will most likely ship with f6.  Format 5 was current during 1.7-dev and
 packed revprops into an authoritative SQLite database. ]

 Introduction
 ------------

 The FSFS external data format can be changed to allow for
 significantly reduced I/O overhead without changing the
 fundamental ideas behind FSFS.

 The whole idea is to rearrange packed revision info in a
 new FSFS format "6".  A two-way conversion between format
 "5" and "6" should be possible as well as supporting both
 formats for different repositories on the same server.


 Revision Order
 --------------

 FSFS format "5" packs revisions by putting revision 0 at
 the beginning of the file and concatenating the others in
 ascending order.  This intuitive design is counter-productive
 because we always trace data HEAD -> r0 and scanning a file
 backwards is expensive.

 +-------+
 | rev 0 |
 +-------+
 | rev 1 |
 +-------+

 :  ...  :

 +-------+
 | rev N |
 +-------+
 <EOF>

 A counter-intuitive but more efficient reversed order (de-
 scending revision order in a packed file) results in always
 reading forward through a file.

 +-------+
 | rev N |
 +-------+

 :  ...  :

 +-------+
 | rev 1 |
 +-------+
 | rev 0 |
 +-------+
 <EOF>


 Don't Interleave Revision and Content Data
 ------------------------------------------

 Format "5" keeps each revision as a single block.  When
 re-constructing a node content from the diffs,  we jump
 through a number of distant revision blocks.  For each node.

 +--------------+
 | rev N Reps   |
 +--------------+
 | rev N Header |
 +--------------+

 :     ...      :

 +--------------+
 | rev 1 Reps   |
 +--------------+
 | rev 1 Header |
 +--------------+
 | rev 0 Reps   |
 +--------------+
 | rev 0 Header |
 +--------------+
 <EOF>

 Instead,  delta-info should be removed from the revision
 blocks and put at the end of the file.  This speeds up
 header-only operations like "log" without impairing node
 content lookup.  And it lays the foundation for further
 improvements.

 +--------------+
 | rev N Header |
 +--------------+

 :     ...      :

 +--------------+
 | rev 1 Header |
 +--------------+
 | rev 0 Header |
 +--------------+
 | rev N Reps   |
 +--------------+

 :     ...      :

 +--------------+
 | rev 1 Reps   |
 +--------------+
 | rev 0 Reps   |
 +--------------+
 <EOF>

 The only downside is that putting headers first is somewhat
 more computationally expensive since offsets needs to be
 calculated in advance.  This is made even more difficult by
 using a variable length encoding for those offsets.


 Group the Representations by Delta Tree
 ---------------------------------------

 All diffs that are based on the same node within the packed
 revision file are put in one place.  Re-constructing a node
 would then involve reading the revision headers (relatively
 close together) and the content deltas after that (again
 with high locality).

 +------------------+
 | rev N Header     |
 +------------------+

 :       ...        :

 +------------------+
 | rev 1 Header     |
 +------------------+
 | rev 0 Header     |
 +------------------+
 | node tree A Reps |
 +------------------+
 | node tree B Reps |
 +------------------+

 :       ...        :

 +------------------+
 | node tree Z Reps |
 +------------------+
 <EOF>

 We can optimize that even further by exploiting the skip-
 delta information: the "walks" through the delta tree for a
 given node will merge bit by bit into a main "trunk". The
 nodes on these paths can be re-ordered such that very few
 seek() operations are required on average to reconstruct
 the node content in this file.

 Example of 8 changes in revs r0 .. r7 (for simplicity),
 looking at the delta-info only ("->" means "stored as diff
 against"):

 	r0
 	r1->r0
 	r2->r0
 	r3->r2
 	r4->r0
 	r5->r4
 	r6->r4
 	r7->r6

 Default ordering r0,r1,r2,r3,r4,r5,r6,r7 requires
 1/1/2/2/2/2/3/3 seeks.  For 2^N changes (N integer > 0),
 we need (N+1)/2 seeks on average.

 A path-optimized ordering r0,r4,r6,r7,r5,r2,r3,r1
 requires 1/1/1/1/2/2/2/2 seeks.  It's (N+3)/4 on average.
 That optimized ordering can easily be constructed by

 	* select the highest rev not covered, yet
 	* add its diff path to the existing result
 	  (starting with the first revision not
           already is in the result)
 	* repeat until all revs are covered

 Note that revision paths start with the low end of
 the revision range because contents gets reconstructed
 from r0 upwards.


 Move the Manifest Info into the Packed File
 -------------------------------------------

 Once we mastered the offset pre-calculation issue (see
 above),  we may as well put the manifest info at the be-
 ginning of the file.  This will benefit "log" in particular
 as only one consecutive chunk of data needs to be read
 from only one file.

 +------------------+
 | rev 0 Offset     |
 +------------------+
 | rev 1 Offset     |
 +------------------+

 :       ...        :

 +------------------+
 | rev N Offset     |
 +------------------+
 | rev N Header     |
 +------------------+

 :       ...        :

 +------------------+
 | rev 1 Header     |
 +------------------+
 | rev 0 Header     |
 +------------------+
 | node tree A Reps |
 +------------------+
 | node tree B Reps |
 +------------------+

 :       ...        :

 +------------------+
 | node tree Z Reps |
 +------------------+
 <EOF>
	[ At the time of writing (1.7-dev), FSFS format 5 was current. Since then
	format 5 has been retracted; it was never released. 1.7 shipped with f4 and
	1.8 will most likely ship with f6. Format 5 was current during 1.7-dev and
	packed revprops into an authoritative SQLite database. ]

	Introduction
	------------

	The FSFS external data format can be changed to allow for
	significantly reduced I/O overhead without changing the
	fundamental ideas behind FSFS.

	The whole idea is to rearrange packed revision info in a
	new FSFS format "6". A two-way conversion between format
	"5" and "6" should be possible as well as supporting both
	formats for different repositories on the same server.


	Revision Order
	--------------

	FSFS format "5" packs revisions by putting revision 0 at
	the beginning of the file and concatenating the others in
	ascending order. This intuitive design is counter-productive
	because we always trace data HEAD -> r0 and scanning a file
	backwards is expensive.

	+-------+
	\| rev 0 \|
	+-------+
	\| rev 1 \|
	+-------+

	: ... :

	+-------+
	\| rev N \|
	+-------+
	<EOF>

	A counter-intuitive but more efficient reversed order (de-
	scending revision order in a packed file) results in always
	reading forward through a file.

	+-------+
	\| rev N \|
	+-------+

	: ... :

	+-------+
	\| rev 1 \|
	+-------+
	\| rev 0 \|
	+-------+
	<EOF>


	Don't Interleave Revision and Content Data
	------------------------------------------

	Format "5" keeps each revision as a single block. When
	re-constructing a node content from the diffs, we jump
	through a number of distant revision blocks. For each node.

	+--------------+
	\| rev N Reps \|
	+--------------+
	\| rev N Header \|
	+--------------+

	: ... :

	+--------------+
	\| rev 1 Reps \|
	+--------------+
	\| rev 1 Header \|
	+--------------+
	\| rev 0 Reps \|
	+--------------+
	\| rev 0 Header \|
	+--------------+
	<EOF>

	Instead, delta-info should be removed from the revision
	blocks and put at the end of the file. This speeds up
	header-only operations like "log" without impairing node
	content lookup. And it lays the foundation for further
	improvements.

	+--------------+
	\| rev N Header \|
	+--------------+

	: ... :

	+--------------+
	\| rev 1 Header \|
	+--------------+
	\| rev 0 Header \|
	+--------------+
	\| rev N Reps \|
	+--------------+

	: ... :

	+--------------+
	\| rev 1 Reps \|
	+--------------+
	\| rev 0 Reps \|
	+--------------+
	<EOF>

	The only downside is that putting headers first is somewhat
	more computationally expensive since offsets needs to be
	calculated in advance. This is made even more difficult by
	using a variable length encoding for those offsets.


	Group the Representations by Delta Tree
	---------------------------------------

	All diffs that are based on the same node within the packed
	revision file are put in one place. Re-constructing a node
	would then involve reading the revision headers (relatively
	close together) and the content deltas after that (again
	with high locality).

	+------------------+
	\| rev N Header \|
	+------------------+

	: ... :

	+------------------+
	\| rev 1 Header \|
	+------------------+
	\| rev 0 Header \|
	+------------------+
	\| node tree A Reps \|
	+------------------+
	\| node tree B Reps \|
	+------------------+

	: ... :

	+------------------+
	\| node tree Z Reps \|
	+------------------+
	<EOF>

	We can optimize that even further by exploiting the skip-
	delta information: the "walks" through the delta tree for a
	given node will merge bit by bit into a main "trunk". The
	nodes on these paths can be re-ordered such that very few
	seek() operations are required on average to reconstruct
	the node content in this file.

	Example of 8 changes in revs r0 .. r7 (for simplicity),
	looking at the delta-info only ("->" means "stored as diff
	against"):

	r0
	r1->r0
	r2->r0
	r3->r2
	r4->r0
	r5->r4
	r6->r4
	r7->r6

	Default ordering r0,r1,r2,r3,r4,r5,r6,r7 requires
	1/1/2/2/2/2/3/3 seeks. For 2^N changes (N integer > 0),
	we need (N+1)/2 seeks on average.

	A path-optimized ordering r0,r4,r6,r7,r5,r2,r3,r1
	requires 1/1/1/1/2/2/2/2 seeks. It's (N+3)/4 on average.
	That optimized ordering can easily be constructed by

	* select the highest rev not covered, yet
	* add its diff path to the existing result
	(starting with the first revision not
	already is in the result)
	* repeat until all revs are covered

	Note that revision paths start with the low end of
	the revision range because contents gets reconstructed
	from r0 upwards.


	Move the Manifest Info into the Packed File
	-------------------------------------------

	Once we mastered the offset pre-calculation issue (see
	above), we may as well put the manifest info at the be-
	ginning of the file. This will benefit "log" in particular
	as only one consecutive chunk of data needs to be read
	from only one file.

	+------------------+
	\| rev 0 Offset \|
	+------------------+
	\| rev 1 Offset \|
	+------------------+

	: ... :

	+------------------+
	\| rev N Offset \|
	+------------------+
	\| rev N Header \|
	+------------------+

	: ... :

	+------------------+
	\| rev 1 Header \|
	+------------------+
	\| rev 0 Header \|
	+------------------+
	\| node tree A Reps \|
	+------------------+
	\| node tree B Reps \|
	+------------------+

	: ... :

	+------------------+
	\| node tree Z Reps \|
	+------------------+
	<EOF>