 Description of the NODES table
 ==============================


  * Introduction
  * Inclusion of BASE nodes
  * Rows to store state
  * Ordering rows into layers
  * Visibility of multiple op_depth rows
  * Restructuring the tree means adding rows
  * Copies of mixed-revision subtrees become multiple layers
  * In a deleted subtree, all nodes get marked deleted explicitly
  * Nodes in a replaced subtree can have different presence values
  * Presence values of nodes in partially overlapping replacements
  * Status needs to consult the *two* topmost layers - sometimes


 Introduction
 ------------

 The entire original design of wc-ng revolves around the notion that
 there are a number of states in a working copy, each of which needs
 to be managed.  All operations - excluding merge - operate on three
 trees: BASE, WORKING and ACTUAL.

 For an in-depth description of what each means, the reader is referred
 to other documentation, also in the notes/ directory.  In short, BASE
 is what was checked out from the repository; WORKING includes
 tree restructuring; while ACTUAL also includes changes to properties
 and file contents.

 The idea that there are three trees works - mostly. There is no need
 for more trees outside the area of the metadata administration and even
 then three trees got us pretty far.  The problem starts when one realizes
 tree modifications can be overlapping or layered. Imagine a tree with
 a replaced subtree.  It's possible to replace a subtree within the
 replacement.  Imagine that happened and that the user wants to revert
 one of the replacements.  Given a 'flat' system, with just enough columns
 in the database to record the 'old' and 'new' information per node, a single
 revert can be supported.  However, in the example with the double
 replacement above, that would mean it's impossible to revert one of the
 two replacements: either there's not enough information in the deepest
 replacement to execute the highest level replacement or vice versa
 - depending on which information was selected to be stored in the "new"
 columns.

 The NODES table is the answer to this problem: instead of having a single
 row in a table with WORKING nodes with just enough columns to record
 (as per the example) a replacement, the solution is to record different
 layers of tree restructuring by having multiple rows.


 Inclusion of BASE nodes
 -----------------------

 The original technical design of wc-ng included a WORKING_NODE and a
 BASE_NODE table.  As described in the introduction, the WORKING_NODE
 table was replaced with NODES.  However, the BASE_NODE table stores
 roughly the same state information that WORKING_NODE did.  Additionally,
 in a number of situations, the system isn't interested in the type of
 state it gets returned (BASE or WORKING) - it just wants the latest.

 As a result the BASE_NODE table has been integrated into the NODES
 table.

 The main difference between the WORKING_NODE and BASE_NODE tables was
 that the BASE_NODE table contained a few caching fields which are
 not relevant to WORKING_NODE.  Moving those to a separate table was
 determined to be wasteful because the primary key of that table
 would be much larger than any information stored in it in the first
 place.


 Rows to store state
 -------------------

 Rows of the NODES table store state of nodes in the BASE tree
 and the layers in the WORKING tree.  Note that these nodes do not
 need to exist in the working copy presented to the user: they may
 be 'absent', 'not-present' or just removed (rm) without using
 Subversion commands.

 A row contains information linking to the repository, if the node
 was received from a repository.  For copied or moved nodes this
 reference may be a link to the original nodes, but for rows designating
 BASE state it refers to the repository location which was checked
 out.

 Additionally, the rows contain information about local modifications
 such as copy, move or delete operations.


 Ordering rows into layers
 -------------------------

 Since the table might contain more than one row per (wc_id, local_relpath)
 combination, an ordering mechanism needs to be added.  To that effect
 the 'op_depth' value has been devised.  The op_depth is an integer
 indicating the depth of the operation which modified the tree in order
 for the node to enter the state indicated in the row.

 Every row for the (wc_id, local_relpath) combination must have a unique
 op_depth associated with it.  The value of op_depth is related to the
 top-most node being modified in the given tree-restructuring
 operation (operation root or oproot).  E.g. upon deletion of a subtree,
 every node in the subtree will have a row in the table with the same
 op_depth, that being the depth of the top directory of the subtree.

 The op_depth is calculated by taking the number of path components in
 the local_relpath of the oproot. The unmodified tree (BASE) is identified
 by rows with an op_depth value 0.
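
 As an illustration of the rule above, the calculation could look like
 the following minimal sketch.  The function name op_depth_of is made up
 for this note, and the sketch assumes '/'-separated relpaths with the
 empty string denoting the working copy root.

```python
def op_depth_of(oproot_relpath: str) -> int:
    """Return the op_depth for an operation rooted at oproot_relpath.

    Assumption: relpaths use '/' as separator and the working copy
    root is the empty string, which has zero path components.
    """
    if oproot_relpath == "":
        return 0
    return len(oproot_relpath.split("/"))


# Every row written by a delete rooted at A/B would carry the op_depth
# of the oproot A/B, including the row for the child A/B/f.
for oproot in ("A", "A/B", "A/B/C"):
    print(oproot, op_depth_of(oproot))
```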

 By having multiple restructuring operations on the same path in a modified
 subtree (most notably replacements), the table may end up with multiple rows
 with an op_depth bigger than 0.


 Visibility of multiple op_depth rows
 ------------------------------------

 As stated in the introduction, there's no need to leak the concept of
 multiple op_depth rows out of the metadata store - apart from the BASE
 and WORKING trees.

 As described before, the BASE tree is defined by op_depth == 0. WORKING as
 visible outside the metadata store maps back to those rows where
 op_depth == MAX(op_depth) for each (wc_id, local_relpath) combination.
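
 The two mappings can be sketched as queries.  The table below is a
 heavily reduced, hypothetical stand-in for NODES (only four columns),
 built in an in-memory SQLite database through Python's sqlite3 module;
 the sample rows describe a BASE tree A, A/B, A/B/f in which A/B has
 been deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical schema: the real table has many more columns.
conn.execute(
    "CREATE TABLE nodes ("
    " wc_id INTEGER, local_relpath TEXT, op_depth INTEGER, presence TEXT,"
    " PRIMARY KEY (wc_id, local_relpath, op_depth))"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        (1, "A", 0, "normal"),
        (1, "A/B", 0, "normal"),
        (1, "A/B/f", 0, "normal"),
        (1, "A/B", 2, "deleted"),    # layer added by deleting A/B
        (1, "A/B/f", 2, "deleted"),
    ],
)

# BASE: the rows with op_depth == 0.
base = conn.execute(
    "SELECT local_relpath, presence FROM nodes"
    " WHERE wc_id = ? AND op_depth = 0 ORDER BY local_relpath",
    (1,),
).fetchall()

# WORKING as seen from outside: per path, the row with the highest op_depth.
working = conn.execute(
    "SELECT n.local_relpath, n.op_depth, n.presence FROM nodes n"
    " WHERE n.wc_id = ? AND n.op_depth ="
    "   (SELECT MAX(m.op_depth) FROM nodes m"
    "    WHERE m.wc_id = n.wc_id AND m.local_relpath = n.local_relpath)"
    " ORDER BY n.local_relpath",
    (1,),
).fetchall()

print(base)
print(working)
```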


 Restructuring the tree means adding rows
 ----------------------------------------

 The base idea behind the NODES table is that every tree restructuring
 operation causes nodes to be added to the table in order to best support
 the reversal process: in that case a revert simply means deletion of rows
 and bringing the subtree back into sync with the metadata.

 There's one exception: When a delete is followed by a copy or move to
 the deleted location - causing a replacement - a row with the right
 op_depth may already exist, due to the delete.  If so, it needs to be
 modified.  On revert, the modified nodes need to be restored to 'deleted'
 state, which itself can be reverted during the next revert.  (If the row did
 not exist with the right op_depth, then this copy or move is being performed
 at some greater depth than the delete, and then this copy or move will
 simply create rows at a new op_depth.)
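
 The common case, where an operation only adds rows and its revert only
 deletes them, might be sketched as follows.  The reduced schema and the
 helper names (delete_subtree, revert_operation, snapshot) are
 assumptions made for this note, and the sketch only layers on top of
 BASE rows; it does not cover the exception described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical schema: the real table has many more columns.
conn.execute(
    "CREATE TABLE nodes ("
    " wc_id INTEGER, local_relpath TEXT, op_depth INTEGER, presence TEXT,"
    " PRIMARY KEY (wc_id, local_relpath, op_depth))"
)
conn.executemany(
    "INSERT INTO nodes VALUES (1, ?, 0, 'normal')",
    [("A",), ("A/B",), ("A/B/f",)],
)

# Matches the oproot itself and everything below it.
IN_SUBTREE = (
    "(local_relpath = :root"
    " OR substr(local_relpath, 1, length(:root) + 1) = (:root || '/'))"
)


def delete_subtree(conn, wc_id, oproot):
    """Add one 'deleted' row per BASE node in the subtree at oproot."""
    conn.execute(
        "INSERT INTO nodes (wc_id, local_relpath, op_depth, presence)"
        " SELECT wc_id, local_relpath, :depth, 'deleted' FROM nodes"
        " WHERE wc_id = :wc AND op_depth = 0 AND " + IN_SUBTREE,
        {"wc": wc_id, "root": oproot, "depth": len(oproot.split("/"))},
    )


def revert_operation(conn, wc_id, oproot):
    """Undo the operation rooted at oproot by deleting its layer of rows."""
    conn.execute(
        "DELETE FROM nodes"
        " WHERE wc_id = :wc AND op_depth = :depth AND " + IN_SUBTREE,
        {"wc": wc_id, "root": oproot, "depth": len(oproot.split("/"))},
    )


def snapshot(conn):
    return conn.execute(
        "SELECT local_relpath, op_depth, presence FROM nodes"
        " ORDER BY local_relpath, op_depth"
    ).fetchall()


before = snapshot(conn)
delete_subtree(conn, 1, "A/B")
during = snapshot(conn)
revert_operation(conn, 1, "A/B")
after = snapshot(conn)
```

 After the revert the table holds exactly the rows it held before the
 delete; bringing the on-disk subtree back into sync is outside the
 scope of the sketch.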

 ### JAF: I don't think a replacement should be reverted in two stages, even
   though it was created in two stages.  I think 'revert' should restore the
   previous existing node, just like it does in WC-1.  A partial revert of
   this state is not a particularly helpful or frequent use case.

   GJS: I believe that wc_db should enable the individual reverts. The
   first revert will undo the add/copy, and the second revert will undo
   the delete. The UI (or the next level up in libsvn_wc) can collapse
   those into a single user action. This leaves us the future
   possibility of finer-grained reverts.

 ### EHU: The statement above probably means that *all* nodes in the subtree
   need to be rewritten: they all have a deleted state with the affected
   op_depth, meaning they probably need a 'replaced/copied-to' state with
   the same op_depth...

   GJS: not all nodes. The newly-arriving copy/move may have
   new/disjoint nodes that were not part of the deleted set. We will
   simply add new rows for these arriving nodes. Similarly, the
   arriving subtree may NOT have a similar node, so the deleted node
   remains untouched.


 Copies of mixed-revision subtrees become multiple layers
 --------------------------------------------------------

 In the design, every node which is not a direct child of its parent implies a
 tree restructuring operation having taken place.  When committing a
 mixed-revision subtree, the commit should mirror the actual mixed state
 of the tree.

 A mixed-revision tree which came about in the usual process of committing
 content changes - i.e. one without tree modifications - differs exactly in
 that respect:  the tree in the repository doesn't need to mirror the mixed
 revision state in the working copy.

 The idea is that every tree restructuring operation takes place on the
 oproot.  When a node or subtree within the copied tree isn't a direct
 child of its parent, most notably because it's at a different revision,
 that's a tree restructuring: a node of the same revision has been
 replaced by a node (of the same name) from another rev.

 By strict application of the design rule, all nodes and subtrees within
 the copied subtree that are at a different revision than their parents
 each become an op_depth layer of their own.

 ### JAF: We don't have the info in the WC to be able to fill in the lower
   layers of this tree for the copy, if we are copying from BASE, because
   BASE is stored flat.  Therefore we won't be able to revert/delete/replace
   the different-revision sub-trees of this copied tree.  Therefore I think
   we have to store the mixed-rev copy flat (single op_depth) and modify the
   commit rules to act on revision-number changes within this flat tree.

   GJS: correct. Consider a root of the subtree at r10, and a
   descendant is at r12. We cannot create one layer at r10, and another
   at r12 because we do not have the descendant@r10 to place into the
   first layer. Thus, we need to use a single op_depth layer for this
   operation. At commit time, one copy will be made for the subtree from
   r10, a deletion will be made for the descendant, and then another
   copy performed for the r12 descendant.

 ### GJS: in the above scenario, we do not know if the descendant
   existed in r10, so the deletion may not be necessary (and could even
   throw an error!). I do not recall if our copy's destination is
   allowed to exist (i.e. we have implied overwrite semantics in the
   repository).

   PM: Yes, we have overwrite semantics.  The FS layer on the server has
   magic that converts the copy of the r12 descendant into a replace if
   the descendant exists in r10.  The client does not send a delete.

   This magic applies to copies, not deletes, so there is a problem
   when the descendant is deleted in the mixed-revision copy in the
   working copy.  When faced with a copy of the subtree at r10 and a
   delete of a descendant at r12 the commit doesn't work at present.
   Deleting the descendant is wrong if it does not exist in r10, but
   not deleting it is wrong if it does exist.  I suppose the client
   could ask the server, or perhaps use multiple layers of BASE to
   track mixed-revisions (argh!).

 In a deleted subtree, all nodes get marked deleted explicitly
 -------------------------------------------------------------

 All nodes in a deleted subtree get marked 'deleted' explicitly.  This
 makes it possible to query a single node and find, in its topmost
 layer alone, that whatever node once existed at the given path does
 not exist there anymore.
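
 The benefit for single-node lookups can be illustrated with a small
 sketch.  The dictionary below is a hypothetical in-memory stand-in for
 the table, keyed on (local_relpath, op_depth), and the helper names are
 made up for this note.

```python
# Hypothetical stand-in for NODES: {(local_relpath, op_depth): presence}
nodes = {
    ("A", 0): "normal",
    ("A/B", 0): "normal",
    ("A/B/f", 0): "normal",
}


def delete_subtree(nodes, oproot):
    """Mark the oproot and every node below it 'deleted' in a new layer."""
    depth = len(oproot.split("/"))
    prefix = oproot + "/"
    # Collect the affected paths first, then add the new rows.
    affected = {p for (p, _d) in nodes if p == oproot or p.startswith(prefix)}
    for relpath in affected:
        nodes[(relpath, depth)] = "deleted"


def topmost_presence(nodes, relpath):
    """Look only at rows for relpath itself; no ancestor walk is needed."""
    depths = [d for (p, d) in nodes if p == relpath]
    if not depths:
        return None
    return nodes[(relpath, max(depths))]


delete_subtree(nodes, "A/B")
print(topmost_presence(nodes, "A/B/f"))
print(topmost_presence(nodes, "A"))
```

 Had only the row for A/B been marked, the lookup for A/B/f would have
 had to inspect its ancestors to learn that the node is gone.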


 Presence values of nodes in partially overlapping replacements
 --------------------------------------------------------------

 Replacement - being a two-step operation consisting of a delete and an
 add/copy - causes all rows of the deleted subtree to be added with a
 new op_depth and presence value 'deleted'.  So far so good.

 Adding a tree on top of the same oproot will cause the oproot -
 and all overlapping children! - to switch their presence value to
 'normal'.  When a node replaces a deleted node it hides any deleted
 children of the previously deleted node, and may come with children of
 its own.  Some of the new children may have the same names as some of
 the deleted children, but these overlapping children should not be
 considered restructuring replacements.  Only the parent, with op_depth
 equal to the tree depth, is a restructuring replacement.
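
 A sketch of how the rows of one layer might change during such a
 replacement is given below.  As before, the dictionary is a
 hypothetical stand-in for the table and the helper names are invented
 for this note; the incoming tree is passed in simply as a list of
 relpaths.

```python
# Hypothetical stand-in for NODES: {(local_relpath, op_depth): presence}
nodes = {
    ("A", 0): "normal",
    ("A/B", 0): "normal",
    ("A/B/f", 0): "normal",
    ("A/B/g", 0): "normal",
}


def delete_subtree(nodes, oproot):
    """Step one of a replacement: a 'deleted' row for every node."""
    depth = len(oproot.split("/"))
    prefix = oproot + "/"
    affected = {p for (p, _d) in nodes if p == oproot or p.startswith(prefix)}
    for relpath in affected:
        nodes[(relpath, depth)] = "deleted"


def add_on_top(nodes, oproot, incoming_relpaths):
    """Step two: the arriving tree lands in the same op_depth layer.

    Overlapping paths switch from 'deleted' to 'normal', disjoint
    arriving paths get new rows, and deleted paths without an arriving
    counterpart are left untouched.
    """
    depth = len(oproot.split("/"))
    for relpath in incoming_relpaths:
        nodes[(relpath, depth)] = "normal"


delete_subtree(nodes, "A/B")
add_on_top(nodes, "A/B", ["A/B", "A/B/f", "A/B/h"])

layer = {p: presence for (p, d), presence in nodes.items() if d == 2}
print(sorted(layer.items()))
```

 In the resulting layer A/B and A/B/f are 'normal', A/B/g stays
 'deleted' and A/B/h is a new row; all four share op_depth 2, and only
 A/B, whose own depth equals that op_depth, is the restructuring
 replacement.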


 Status needs to consult the *two* topmost layers - sometimes
 ------------------------------------------------------------

 As discussed before, every tree restructuring operation becomes an
 oproot, causing rows to be added with a new op_depth value.

 Status wants to report the oproots, making a clear distinction between
 adds and replacements.  However, both added and replaced nodes have
 the presence value 'normal'.  In order to make the distinction, status
 needs to determine if there was a node in the layer below the
 restructured layer.  In case there is, it must be a replacement,
 otherwise, it must be an addition.
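
 The distinction might be sketched as follows.  The function name
 classify_oproot and the (op_depth, presence) tuples are assumptions
 made for this note; the sketch applies the rule exactly as stated
 above and does not look at the presence value of the lower layer.

```python
def classify_oproot(relpath, layers):
    """Classify one path as 'added', 'replaced' or None.

    layers holds the (op_depth, presence) rows stored for relpath.
    A path is treated as an oproot when the op_depth of its topmost
    row equals the number of components in relpath.
    """
    ordered = sorted(layers)  # lowest op_depth first
    top_depth, top_presence = ordered[-1]
    is_oproot = top_depth > 0 and top_depth == len(relpath.split("/"))
    if not is_oproot or top_presence != "normal":
        return None
    # A second row means there was a node in the layer below.
    return "replaced" if len(ordered) >= 2 else "added"


print(classify_oproot("A/B", [(0, "normal"), (2, "normal")]))
print(classify_oproot("A/N", [(2, "normal")]))
print(classify_oproot("A/B/f", [(0, "normal"), (2, "normal")]))
```

 The last call returns None because A/B/f is only an overlapping child
 in the layer rooted at A/B, not an oproot itself.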


 TODO:

        GJS: yup. tho it will complicate a revert of a copy/move-here
        since we will need to perform a query to see whether we should
        convert the copy/move into a deleted node, or whether to simply
        remove the node entirely.

        GJS: and yes, if wc_db performed a double-operation revert,
        then we wouldn't have to do this. arguably, we could push the
        2-op revert to a future release when we also choose to alter
        the higher layers to bring in the finer-grained control. (or
        maybe we have to look at prior nodes regardless, so checking
        for "replace with a deleted node" comes for no additional
        cost).

  * Document states of the table and their meaning (including values
     of the relevant columns)
