-*- Text -*-
Content
=======
* Context
* Requirements
* Nice-to-have's
* Non-goals
* Open items / discussion points
* Problems in wc-1.0
* Possible solutions
- Developer sanity
- Speed
- Cross node type change representation
- Flexibility of metadata storage
- Transaction duration / memory management
- Working copy stability
- Transactional updates
- Work queue
* Prerequisites for a good wc implementation
* Modularization
* Implementation proposals for
- Mapping of svn_wc_entry_t fields to BASE/WORKING
- Basic storage mechanics
- Metadata schemas
- Commit process
- Random notes
- Code organization
- svn_wc.h API
* Upgrading old working copies
* Implementation plan
Context
=======
The working copy library has traditionally been a complex piece of
machinery and libsvn_wc-1.0 (wc-1.0 hereafter) was more a result of
evolution than of deliberate design. This is not so much anybody's
fault as a consequence of the developers at the time being unaware of
the problems inherent in versioning trees rather than individual files
(files being the usual context within CVS). As a result, the WC
has been one of the most fragile areas of the Subversion versioning
model.
The WC is where a large number of issues come together, issues which
can be treated separately in the remainder of the system or which
don't affect the rest of the system at all. The following
things come to mind:
* Different behaviours required by different use-cases (users)
For example: some users want the mtimes of files at checkout time
to be the checkout time, while others want them to be the historical
value at check-in time (and still others want different variants).
* Different filesystems behave differently, yet Subversion
is a cross platform tool and tries to behave the same on all
filesystems (timestamp resolution may be an example of this).
When considering the wc-1.0 design, one finds that there are a lot of
situations where the state of the versioned tree is poorly defined.
To clarify the tree state, the wc-ng design splits it into several
pieces. A versioned item in the working copy is described by one or
more nodes in the following schema.
* BASE: Nodes describing repository data in the WC.
Each BASE node corresponds to a particular repository URL and
revision. Mixed-revision working copies are still common.
Pristine file content is kept in the content store.
* WORKING: Nodes describing structural changes to BASE.
An item in the WC may have a BASE node, a WORKING node, or both.
The WORKING-only case arises when an item is added, copied-here,
or moved-here on a path that doesn't exist in BASE.
* NODE_DATA: Nodes describing layered structural changes to WORKING.
An example: suppose a directory is replaced, a directory is
copied into it, and a file is added to the subdirectory. BASE
and WORKING describe the replacement, NODE_DATA describes the
copied subdirectory, and NODE_DATA (again) describes the added
file.
* ACTUAL: Nodes describing content modifications and annotations.
An ACTUAL node contains one or more of the following:
- Text-modified flag (for a file)
- Properties
- Changelist name
- Conflict metadata
The text-modified flag may be out of sync with the file's real
status. The other data is set directly by Subversion operations.
Some tree conflicts lead to ACTUAL nodes that don't exist in the
working copy (e.g., when a merge tries to edit a file that
doesn't exist). These have no corresponding BASE, WORKING or
NODE_DATA nodes.
##BHB: For tree, text and property conflicts it would be nice to handle
MINE, THEIRS (and OLDER?) as semi-trees too. See this mail thread
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=984982
This would provide a clean way to access all information about the conflict
origins: $ svn info file@THEIRS
Requirements
============
* Developer sanity
From this requirement, a number of additional ones follow:
- Very explicit tree state management; clear difference between
each of the 5 states we may be looking at
- It must be "fun" to code wc-ng enhancements
* Speed
(Note: a trade-off may be required for 'checkout' vs 'status' speed)
* Cross-node-type working copy changes
* Flexibility
The model should make it easy to support
- central vs local metadata storage
- Last modified timestamp behaviours
- .svn-less working copy subtrees
- different file-changed detection schemes
(e.g. full tree scan as in wc-1.0 as well as 'p4 edit')
* Graceful (defined) fallback for non-supported operations
When a checkout tries to create a symlink on an OS which supports
them, on a filesystem which doesn't, we should cope without
canceling the complete checkout. Same for marking metadata read-only.
* Gracefully handle symlinks in relation to any special-handling of
files (don't special-handle symlinks!)
* Clear/reparable tree state
By this I mean, in contrast to our current loggy system: "there is a
command by which the user can restart the command he/she last issued
and Subversion will help complete that command". This differs from our
loggy system in that it will return the working copy to a
defined (though to the user unknown) state.
* Transactional / repairable tree state (by which I mean something
that achieves the same as our loggy system, but better).
* Case sensitive filesystem aware / resilient
* Working copy stability; a number of scenarios with switch and
update obstructions used to leave the working copy unrecoverable
* Client side 'true renames' support where one side can't be committed
without the other (relates to issue #876)
###JSS: Perhaps this is obvious... I think that requirement is fine for the
user doing the commit. We still need to remember that another user doing
the update may not have authz permission to the directory it was renamed
into or may have a checkout of a sub-tree and that target directory may
not exist. Likewise, the original location might be unavailable too.
* Change detection should become entirely internal to libsvn_wc (referring
to the fact that libsvn_client currently calls svn_wait_for_timestamps()),
even though under 'use-commit-times=yes', this waiting is
completely useless.
* Last-modified recording as a preparation for solving issue #1256 and
as defined in this mail, also linked from the issue:
http://svn.haxx.se/dev/archive-2006-10/0193.shtml
* Representing "this node is part of a replaced-with-history tree and
I'm *not* in the replacement tree" as well as "... and I'm deleted
from the replacement tree" [issues #1962 and #2690]
Would-be-very-nice-to-have's
============================
* Multiple users with a single working copy (aka shared working copy)
* Ending up with an implementation which can use current WCs
(without conversion)
* Working copies/ metadata storages without local storage of text-bases
(other than a few cached ones)
Non-goals
=========
* Off-line commits
* Distributed VC
Open items / discussion points
==============================
* Files changed during the window "sent as part of commit" to
"post commit wc processing"; these are currently explicitly
supported. Do we want to keep this support (at the cost of speed)?
* Single working copy lock. Should we have one lock which locks the
entire working copy, disabling any parallel actions on disjoint
parts of the working copy?
* Metadata physical read-only marking (as in wc-1.0). Is it still
required, or should it become advisory (i.e. ignore errors on failure)?
* Is issue #1599 a real use-case we need to address?
(Losing and regaining authz access with updates in between)
Problems in wc-1.0
==================
* There's no way to clear unused parts of the entries cache
* The code is littered with path calculations in order
to access different parts of the working copy (incl. admin areas)
* The code is littered with direct accesses to both wc files and
admin area files
* It's not always clear at which time log files are being processed
(i.e. transactions are being committed), meaning it's not always
clear which version of a tree one is looking at: the pre- or
post-transformation version...
* There's no support for nested transactions (even though some
functions want to start a new transaction, regardless of whether one
was already started)
* It's very hard to determine when an action needs to be written
to a transaction or needs to be executed directly
* All code assumes local access to admin (meta)data
* The transaction system contains non-runnable commands
* It's possible to generate combinations of commands, each of which
is runnable, but the series isn't
* Long if() blocks to sort through all possible states of
WORKING, ACTUAL and BASE, without calling it that.
* Large if() blocks dealing with the difference between file and
directory nodes
* Many special-handling if()s for svn:special files
* Manipulation of paths, URLs and base-text paths in a single function
* 'Switchedness' of subdirectories has to be derived from the
URLs of the parent and the child, but copied nodes also have
non-parent-child source URLs... (confusing)
* Duplication of data: a 'copied' boolean and a 'copy_source' URL field
* Checkouts fail when checking out files whose names differ only in
case onto a case-insensitive filesystem
* Checkouts fail when marking working copy admin data as read-only
is a non-supported FS operation (VFAT or Samba mounts on Linux have
this behaviour)
* Obstructed updates leave operations half done; in case of a switch,
it's not always possible to switch back (because the switch itself
may have left now-unversioned items behind)
* Directories which have their own children merged into them (which happens
when merging a directory-add) won't correctly fold the children into
schedule==normal, but instead leave them as schedule==add, resulting in
a double commit (through HTTP, other RA layers fold the double add, but
that's not the point) [see issue #1962]
* transaction files (i.e. log files) are XML files, requiring correct
encoding of characters and other values; given the short expected
lifetime of a log file and the fact that we're almost completely sure
the log file is going to be read by the WC library anyway (no interchange
problems), this is a waste of processing time
* No strict separation between public and internal APIs: many public
APIs also used internally, growing arguments which *should* only
matter for internal use
* The lock strategy requires writing a file in every directory of a working
copy, which severely reduces our performance in several environments
(Windows, NFS). Testing showed that in some cases we spent more than 50
seconds writing 8000 lock files before we even started looking at what to
update. A new lock strategy should reduce the number of writes necessary
for locking with depth infinity.
Possible solutions
==================
Developer sanity
----------------
Strict separation between modules should help keep code focused on a
single task. Probably some of the required user-specific behaviours can
(and should) be hidden behind vtables; for example: setting the file
timestamp to the commit time, the last recorded time, or leaving it at
the current time should be abstracted away from the rest of the code.
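As an illustration of the vtable idea above, here is a minimal sketch of
what such a per-working-copy policy could look like. Only svn_error_t,
apr_time_t, apr_pool_t and SVN_NO_ERROR are existing names; the
wcng_timestamp_policy_t type and its members are invented for this example.

  #include <apr_pools.h>
  #include <apr_time.h>
  #include "svn_error.h"

  /* Illustrative sketch only: a vtable hiding the user's choice of
     file timestamp behaviour.  None of these names exist in the
     Subversion source. */
  typedef struct wcng_timestamp_policy_t
  {
    /* Decide which timestamp to put on a newly installed file.
       COMMIT_TIME is the committed-date from the repository; set
       *TIMESTAMP_P to the value to apply, or to 0 to leave the file
       at the current time. */
    svn_error_t *(*get_install_time)(apr_time_t *timestamp_p,
                                     apr_time_t commit_time,
                                     void *baton,
                                     apr_pool_t *scratch_pool);
  } wcng_timestamp_policy_t;

  /* One possible implementation: always use the commit time
     (the 'use-commit-times = yes' behaviour). */
  static svn_error_t *
  use_commit_time(apr_time_t *timestamp_p,
                  apr_time_t commit_time,
                  void *baton,
                  apr_pool_t *scratch_pool)
  {
    *timestamp_p = commit_time;
    return SVN_NO_ERROR;
  }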
Access to 'text bases' is another one of these areas: most routines in
wc-1.0 don't actually need access to a file (a stream would be fine as
well), but since the files are there, availability is assumed.
When abstracting all access into streams, the actual administration of
the BASE tree can be abstracted from: for all we know the 'tree storage
module' may be reading the stream directly off the repository server.
[The only module in wc-1.0 which *requires* access to the files is
the diff/merge library, because it rewinds to the start of the file
during its processing; an operation not supported by streams... and even
then, if these routines are passed file handles, they'll be quite
happy, meaning they still don't need to know where the text base /
source file is...]
###GJS: the APIs should use streams so that we can decompress as the
stream is being read. the diff library will need a callback of some
kind to perform the rewind, which will effectively just close and
reopen the stream. if it rewinds *multiple* times, then we may want
to cache the decompressed version of the file. I'll
investigate. Given our metadata/base-text storage system, I suspect
it will be very easy to cache decompressed copies for a while.
###GJS: a very reasonable strategy is: non-binary files are compressed
by default. binaries are stored uncompressed.
future improvement: extension-based choices, or some other control
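As a sketch of the rewind callback idea above: svn_stream_t, svn_error_t
and apr_pool_t are existing types, while the callback name and signature
are assumptions made purely for illustration.

  #include <apr_pools.h>
  #include "svn_error.h"
  #include "svn_io.h"     /* svn_stream_t */

  /* Illustrative only.  A callback the diff/merge code could invoke to
     "rewind": it closes the current stream and hands back a freshly
     opened one positioned at the start of the same content. */
  typedef svn_error_t *(*wcng_reopen_stream_fn_t)(svn_stream_t **new_stream,
                                                  void *baton,
                                                  apr_pool_t *result_pool);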
In order to keep developers sane, it should be extremely clear at any
one time - when operating on a tree - which tree is being operated upon.
One way to prevent the lengthy 'if()' blocks currently in wc-1.0 would be
to design a dispatch mechanism based on the path-state in WORKING/BASE and
the required transformation, dispatching to (small) functions which each
perform solely that specific task.
#####XBC Do please note that this suggests yet another instance of
pure polymorphism coded in C. This runs contrary to the
developer sanity requirement.
###GJS: agreed with XBC.
Speed
-----
wc-1.0 assumes the WORKING tree and the ACTUAL tree match, but then
goes out of its way to verify that they actually do whenever that is
deemed important. The result is a library which calls stat() far more
often than it needs to.
One of the possible improvements would be to make wc-ng read all of
the ACTUAL state (concentrated in one place, using apr_stat()), keeping
it around as long as required, matching it with the WORKING state before
operating on either (not only when deemed important!).
###GJS: working copy file counts are unbounded, so we need to be
careful about keeping "all" stat results in memory. I'll certainly
keep this in mind, however.
Working from the ACTUAL tree will also prove to be a step toward clarity
regarding the exact tree which is being operated upon.
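The following is a rough sketch of gathering the on-disk state of one
directory in a single pass, as suggested above. The APR calls are real;
the function name, the hash layout and the omission of '.'/'..' filtering
are assumptions made for illustration only.

  #include <apr_pools.h>
  #include <apr_strings.h>
  #include <apr_hash.h>
  #include <apr_file_info.h>
  #include "svn_error.h"

  /* Collect one apr_finfo_t per entry of PATH into *STATS_P, keyed by
     entry name.  Sketch only; '.' and '..' are not skipped and error
     handling is simplified. */
  static svn_error_t *
  read_actual_dirents(apr_hash_t **stats_p,
                      const char *path,
                      apr_pool_t *result_pool,
                      apr_pool_t *scratch_pool)
  {
    apr_dir_t *dir;
    apr_finfo_t finfo;
    apr_hash_t *stats = apr_hash_make(result_pool);
    apr_status_t status;

    status = apr_dir_open(&dir, path, scratch_pool);
    if (status)
      return svn_error_wrap_apr(status, "Can't open directory '%s'", path);

    while (apr_dir_read(&finfo,
                        APR_FINFO_TYPE | APR_FINFO_SIZE
                          | APR_FINFO_MTIME | APR_FINFO_NAME,
                        dir) == APR_SUCCESS)
      {
        apr_finfo_t *copy = apr_pmemdup(result_pool, &finfo, sizeof(finfo));
        copy->name = apr_pstrdup(result_pool, finfo.name);
        apr_hash_set(stats, copy->name, APR_HASH_KEY_STRING, copy);
      }

    apr_dir_close(dir);
    *stats_p = stats;
    return SVN_NO_ERROR;
  }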
[This suggestion from wc-improvements also applies to wc-ng:]
Most operations are I/O bound and have CPU to spare. Consider the virtue
of compressed text bases in order to reduce the amount of I/O required.
Another idea to reduce I/O is to eliminate atomic-rename-into-place for
the metadata part of the working copy: if a file is completely written,
store the name of the base-text/prop-text in the entries file, which gets
rewritten on most wc-transformations anyway.
###GJS: actually, I believe we *rarely* do full walks over the
filesystem multiple times. I doubt we will need to cache stat()
information. I think performance will primarily derive from
omitting the .svn/ location/walks and opening of multiple files
therein, in favor of a single SQLite database open.
of course, we'll analyze the situation, but I suspect that we will
be in great shape as a natural fallout of our new storage system.
Cross node type change representation
-------------------------------------
###GJS: this is not allowed in wc-1, but should be easily possible in
wc-ng. the WORKING tree's node kind is different than the BASE
tree. no big deal.
Flexibility of metadata storage
-------------------------------
There are 3 known models for storing metadata as requested by different
groups of users:
- in-subtree metadata storage (.svn subdir model, as in wc-1.0)
###GJS: euh... aren't we axing this? who has *requested* this?
- in-'tree root' metadata storage (working copy central)
- detached metadata storage (user-central)
- in $HOME/.subversion/
- in arbitrary location (e.g. $HOME is a (slow) NFS mount, and we
want the metadata on a local drive, such as /var/...)
A solution for implementing each of these behaviours, in order to satisfy
the wide range of use-cases they address, would be to define a module
interface and implement that interface three times (possibly using vtables).
Note that using within-module vtables should be less problematic than our
post-1.0 experiences with public vtables (such as the ra-layer vtable):
implementation details are allowed to differ between releases (even patch
releases).
###GJS: note that we are talking about both metadata AND base-text
content. (and yeah, optional and compressed base-texts can be done
during this rewrite) Also note that we might be able to share
base-text content across working copies if they are all keyed by
the MD5 hash into storage directories (under the user-central model)
###GJS: I don't think vtables are needed here. This is simply altering
the base location, not a whole new implementation. My plan is to
default to the "tree root" model with a .svn subdirectory. If a
.svn subdir is not found, then we fall back to looking in the
$HOME/.subversion/ directory (some subdir under there). If we
*still* don't find it, then some config options will point us to
the metadata/base-text location.
###GJS: my plan is to upgrade the working copy if we find a pre-1.7
working copy. all the data will be lifted from the multiple .svn
subdirectories, and relocated to the "proper" storage location.
This will be a non-reversible upgrade, and will preclude pre-1.7
clients from using that working copy again. Note: because of the
"destructive" nature of this upgrade, and the expected duration, we
will require the user to perform an explicit action ('svn upgrade')
in order to complete the upgrade. However, 1.7 will not be able to
*modify* wc-1.0 metadata -- just read it in order to upgrade it to
the new storage system.
When svn detects an old working copy, then it will error out and
request that the user run "svn upgrade" to upgrade their working copy
to the new format.
The metadata location is determined at one of two points:
* checkout time
* upgrade time
According to the user's config, the metadata will be placed in one of
three areas:
wcroot: at the root of the working copy in a .svn subdirectory
home: in the .subversion/wc/ subdirectory
/some/path: stored in the given path
All wcroot directories will have a .svn subdirectory. In that
directory will be the datastore, or there will be a file that provides
two pieces of information:
* absolute path to the (centralized) metadata
* absolute path of where this wcroot was created
With this information, we can link a wcroot to its metadata in the
centralized store. If the user has moved the wcroot (the stored path
is different from the current/actual path), then Subversion will exit
with an error. The user must then ###somehow tell svn that the wc has
been copied (duplicate the metadata for the wcroot) or moved (tweak
the path stored in the metadata and in the linkage file). Subversion
is unable to programmatically determine which operation was used.
Note that we use "svn upgrade" as the trigger to *perform* the upgrade.
The number of file opens, parses, moves, deletes, etc. is expected
to consume significant amounts of I/O and (thus) cannot simply be done
on-the-fly without the user's knowledge and consent.
Transaction duration / memory management
----------------------------------------
The current pool-based memory management system is very good at managing
memory in a transaction-based processing model. In the wc library, a
'transaction' often spans more than one call into the library. We either
need a sane way to handle this kind of situation using pools, or we may
need a different memory management strategy in wc-ng.
Update (2009-05-10): pool-based management is still being used. we are
switching to a "dual pool" system that clarifies the intent of the
pools. the result_pool is used for return allocations, and
scratch_pool for any temporary allocations.
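As a concrete (made-up) example of the dual-pool convention: returned
values come from result_pool, temporaries from scratch_pool, which the
caller may clear immediately after the call returns.

  #include <apr_pools.h>
  #include <apr_strings.h>
  #include "svn_error.h"

  /* Made-up example of the dual-pool convention. */
  static svn_error_t *
  make_label(const char **label_p,
             const char *name,
             apr_pool_t *result_pool,
             apr_pool_t *scratch_pool)
  {
    /* Temporary working copy of the input: scratch_pool. */
    const char *tmp = apr_pstrdup(scratch_pool, name);

    /* The value handed back to the caller: result_pool. */
    *label_p = apr_psprintf(result_pool, "node '%s'", tmp);
    return SVN_NO_ERROR;
  }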
Working copy stability
----------------------
In light of obstructed updates it may not always be desirable to be able
to resume the current operation (as is currently the case): in some cases
the user may want to abort the operation, in other cases the user may
want to resolve the obstruction before re-executing the operation.
The solution to this problem could be 'atomic updates': receiving the
full working copy transformation, verifying prerequisites, creating
replacement files and directories and, when all that succeeds, updating
the working copy.
Full working copy unit tests:
Exactly because the working copy is such an important part of the
Subversion experience *and* because of the 'reputation' of wc-1.0,
we need a way to ensure wc-ng completely performs according to our
expectations. *The* way to ensure we're able to test the most contrived
edge-cases is to develop a full unit test suite while developing
wc-ng. This will be a measure to ensure both working copy stability
and developer sanity: in the early stages of the wc-ng development
process, we'll be able to assess how well the design holds up
under more difficult 'weather'.
###GJS: agreed. as much as possible, when I (re)implement the old APIs
in terms of the new APIs, then I'll write a whitebox test. we'll
see how long I keep that up :-P
Update (2009-05-10): wc-ng currently passes the entire test
suite. Additional tests have been implemented ('entries_tests.py'
and others) to try to ensure continued compatibility.
Transactional updates
---------------------
.. where 'update' is meant as 'user command', not 'svn update' per se.
When applied to files, this can be summarized as:
* Receive transformations (update, delete, add) from
the server,
Work Queue
----------
Certain operations that affect the filesystem require a stateful
marker that the operation needs to happen. The best example is when a
merge conflict occurs: several "pristine" files need to be placed into
the working copy (e.g. somefile.c.r34). Should processing fail after
the first file is placed, then we need to "remember" to resume the
operation and place the rest of the files.
The record of these needed operations will be placed into a "work
queue" which is a table recorded in the SQLite database. Much like the
original loggy, a working copy will be unusable until these actions
are run to completion.
Each work item must have the following properties:
* order-independent. the work items must be allowed to execute in any
sequence.
* idempotent. the work item must be able to run an arbitrary number of
times.
* resumable. whether a previous run completed, or was only partially
completed, the work item must be able to complete its operation.
* independent. each work item must affect only one node in the logical
trees. it can apply to any/all of BASE/WORKING/ACTUAL, but it must
apply to a single logical node.
* complete. a work item must represent a complete operation which
takes the WC from one stable state to another. thus, a work item
cannot be used to "return the wc to a stable state" (the operation
that made it unstable should be included in the work item).
The goal here is to reduce interactions across work items. Each must
be completely self-sufficient and resumable.
The wc_db API will provide a low-level framework for adding, fetching,
and completing these work items. Each work item will be described by a
skel, to be interpreted by higher levels.
### the "independent" requirement is subject to discussion. It may be
possible to have a work item that touches multiple nodes. As long
as it can definitely place those nodes into a specific state, then
it might be okay to operate on many.
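A sketch of what the low-level "add a work item" operation could look
like using the SQLite C API directly. The WORK_QUEUE layout shown (an
auto-increment id plus an opaque serialized skel blob) is an assumption
for illustration, not necessarily the actual wc_db schema.

  #include <sqlite3.h>
  #include "svn_error.h"

  /* Sketch only: append one serialized work item to a WORK_QUEUE table
     assumed to look like (id INTEGER PRIMARY KEY AUTOINCREMENT,
     work BLOB NOT NULL). */
  static svn_error_t *
  enqueue_work_item(sqlite3 *sdb,
                    const void *serialized_skel,
                    int len)
  {
    sqlite3_stmt *stmt;
    int rc;

    rc = sqlite3_prepare_v2(sdb,
                            "INSERT INTO WORK_QUEUE (work) VALUES (?1);",
                            -1, &stmt, NULL);
    if (rc != SQLITE_OK)
      return svn_error_createf(SVN_ERR_SQLITE_ERROR, NULL,
                               "sqlite: %s", sqlite3_errmsg(sdb));

    sqlite3_bind_blob(stmt, 1, serialized_skel, len, SQLITE_TRANSIENT);
    rc = sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    if (rc != SQLITE_DONE)
      return svn_error_createf(SVN_ERR_SQLITE_ERROR, NULL,
                               "sqlite: %s", sqlite3_errmsg(sdb));
    return SVN_NO_ERROR;
  }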
Prerequisites for a good wc implementation
==========================================
These prerequisites are to be addressed, either as definitions
in this document, or elsewhere in the subversion (source) tree:
* Well defined behaviour for cross-node type updates/merges/..
(tree conflicts in particular)
* Well defined behaviour for special file handling
* Well defined behaviour for operations on locally missing items
(see issue #1082)
* Well defined change detection scheme for each of the different
last-modified handling strategies
* No special handling of symlinks: they are first class versioned objects
* Well defined behaviour for property changes on updates/merges/...
(this is a problem which may resemble tree conflicts!),
including 'svn:' special properties
* File name manipulation routines (availability)
* File name comparison routines (!) (availability; which compensate
for the different ways Unicode characters can be represented
[re: NFC/NFD Unicode issue])
###JSS: Talking with ehu on IRC when I asked him about how to handle this
issue: "if we accept that some repositories will be unusable with wc-ng,
then we can standardize anything that comes in from the server as well as
the directory side into the same encoding. we'd be writing files with the
standardized encoding." The rest of this conversation centered around the
fact that either APR or the OS will convert the filename to the correct
form for the filesystem when doing the stat() call. Note, ehu says: "(we'll
need to retain the filename we got from the server though: we'll need it to
describe the file through the editor interface: the server still allows all
encodings.)"
* URL manipulation routines (availability)
* URL comparison routines (availability; which compensate for
different ways the same URL can be encoded; see issue #2490)
* Modularization
* Agree on a UI to pull in other parts of the same repository
(NOT svn:externals) [relates to issue #1167]
#####XBC I submit this is a server-side feature that the client
(i.e. the WC library) should not know about.
* Agree on behaviour for update on moved items (relates to issue #1736)
* Case-sensitivity detection code to probe working copy filesystem
Implementation proposals
========================
Classification of svn_wc_entry_t fields to BASE/WORKING
-------------------------------------------------------
[Note: This section is mainly to clarify the difference between the BASE
and WORKING trees; it does not mean that we actually need all these
fields in wc-ng!]
Here are the mappings of all fields from svn_wc_entry_t to the BASE and
WORKING trees:
+-------------------------------+------+---------+
| svn_wc_entry_t | BASE | WORKING |
+-------------------------------+------+---------+
| name | x | x (1)|
| revision | x | x (2)|
| url | x | x (2)|
| repos | x | x (3)|
| uuid | x | x (3)|
| kind | x | x |
| absent | x | |
| copyfrom_url | | x |
| copyfrom_rev | | x |
| conflict_old | | x |
| conflict_new | | x |
| conflict_wrk | | x |
| prejfile | | x |
| text_time | | = |
| prop_time | | = |
| checksum | x | x (2)|
| cmt_rev | x | x (2)|
| cmt_date | x | x (2)|
| cmt_author | x | x (2)|
| lock_token | x(6)| |
| lock_owner | x | |
| lock_comment | x | |
| lock_creation_date | x | |
| has_props | x | x (4)|
| has_prop_mods | | = |
| cachable_props | x(5)| x (4)|
| present_props | x | x (4)|
| changelist | | x |
| working_size | | = |
| keep_local | | = |
| depth | x | x |
| schedule | | |
| copied | | |
| deleted | | |
| incomplete | | |
+-------------------------------+------+---------+
(1) if this one differs from BASE, it must point to the source of a rename
(2) for an add-with-history
(3) or can we assume single-repository working copies?
(4) can differ from BASE for add-with-history
(5) why is this a field at all; can't the WC code know?
(6) locks apply to in-repository paths, hence BASE
The fields marked with '=' are implementation details of internal detection
mechanisms, which means they don't belong in the public interface.
Fields with no check mark are to become obsolete. 'schedule', 'copied' and
'deleted' can be deduced from the difference between the BASE and WORKING
or WORKING and ACTUAL trees. 'incomplete' should become obsolete when the
goal of 'atomic updates' is realised, in which case the tree can't be
in an incomplete yet locked state. This would also invalidate issue #1879.
Basic Storage Mechanics
-----------------------
All metadata will be stored into a single SQLite database. This
includes all of the "entry" fields *and* all of the properties
attached to the files/directories. SQLite transactions will be used
rather than the "loggy" mechanics of wc-1.0.
###GJS: note that atomicity across the sqlite database and the content
of the ACTUAL tree is freakin' difficult. idea to test: metadata
says "not sure of ACTUAL", and when ops complete successfully, then
we clear the flag. during any future operation, if the flag is
present, then we approach the ACTUAL with extreme prejudice. also
note that we can batch clearing of the flags as an optimistic
efficiency approach (since if we batch 100 and the last fails, then
the other 99 will be slower until the wc-ng determines the ACTUAL
is in fine shape and clears the flag for future operations).
###GJS: be wary of sqlite commit performance (based on some of my
prior experience with it). must have timing/debugging around the
commit operations. may need to use various transaction isolations
and/or batching of commits to get proper performance. thus, profile
output capability is mandatory to determine if we have issues, and
where they occur.
###JSS: I don't see how transactions by themselves can replace loggy.
Right now, if you abort something like 'svn update' or 'svn checkout',
loggy has recorded all the files to be downloaded, and will pick up
where it left off. We did this as an optimization to prevent
re-downloading a potentially large amount of data again. Seems like
we still need to provide that capability.
###GJS: sqlite transactions replace the atomicity that loggy was
originally designed for. it sounds like loggy is also being
used as a work queue, and that is easily handled in sqlite.
Base text data will be stored in a multi-level directory structure,
keyed/named by the checksum (MD5 or SHA1) of the file. The database
will record appropriate mappings, content/compression types, and
refcounts for the base text files (for the shared case). We will use a
single level of directories:
TEXT_BASE/7c/7ca344...
With 100k files spread across all of a user's working copies, that
will put 390 files into each subdirectory, which is quite fine. If the
user grows to a million files, then 3900 per subdir is still
reasonable. Two levels would effectively mean one file per subdir in
typical situations, which is a lot of disk overhead.
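A sketch of mapping a hex checksum onto the sharded path described
above. svn_dirent_join() and apr_pstrndup() exist; the helper itself and
the 'pristine_dir' naming are for illustration only.

  #include <apr_pools.h>
  #include <apr_strings.h>
  #include "svn_dirent_uri.h"   /* svn_dirent_join() */

  /* Sketch: map a hex checksum such as "7ca344..." onto
     PRISTINE_DIR/7c/7ca344..., using one level of sharding. */
  static const char *
  pristine_path(const char *pristine_dir,
                const char *hex_checksum,
                apr_pool_t *result_pool)
  {
    const char *subdir = apr_pstrndup(result_pool, hex_checksum, 2);

    return svn_dirent_join(svn_dirent_join(pristine_dir, subdir,
                                           result_pool),
                           hex_checksum, result_pool);
  }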
When the metadata is recorded in a central area (rather than the WC
root), then it is possible for the metadata and the base files to
become out of date with respect to all the working copies on the
system. We will revamp "svn cleanup" to re-tally the base text
reference counts, eliminate unreferenced bases, verify that the
working copies are still present, ensure the metadata <-> WC
integrity, deal with moves of metadata from central -> wc-root (can
happen if somebody rm -rf's the wc, then does a checkout and wants the
metadata at the wc-root (this time)), and other consistency checks.
Metadata Schemas
----------------
see libsvn_wc/wc-metadata.sql3
The table below describes, in English, the various combinations of
"presence" values as they occur in the BASE_NODE and WORKING_NODE
tables.
BASE_NODE WORKING_NODE DESCRIPTION
normal <none> Node has been checked out from the
repository normally.
absent <none> Server has marked the node as "absent",
meaning the user does not have
authorization to view the content.
excluded <none> The node has been (locally) marked as
excluded from the working copy.
not-present <none> The node is not present at its current
revision. The parent directory has a
different revision which states the node
*is* present. This state is usually
reached by locally deleting a file and
committing it. Later, when an update is
run, the directory will be bumped to a
revision that does not contain the file,
and this not-present node will be cleared.
incomplete <none> The node is known, but the node's
information has not been downloaded
(yet) from the server. This typically
occurs from an interrupted checkout. The
parent directory was added, specifying
all the children, but the checkout was
stopped before fetching the child.
base-deleted * Not allowed. This presence is only valid
for the WORKING_NODE table.
absent <any> Not allowed. The name exists on the
server and cannot be modified in any way.
incomplete <any> An update "under" some local changes was
terminated before fetching information
on this node.
<none> normal This node has been locally-added through
a simple add, a copy, or a move (other
data needs to be examined to determine
what operation brought the node here).
normal normal The underlying BASE node has been
deleted, and a new node has been added
in its place (this is a "Replace").
excluded normal A node was excluded from the checkout or
update, and we have locally-added a new
node to *replace* it.
not-present normal A node is no longer present in the BASE
tree due to mixed-revision working copy
concerns. This is an addition (not a
replace) of a new node, in the same
location as a node that was deleted (and
committed) at some point in history.
* absent Not allowed. This would imply that the
server has prevented our access, but
this is a local, uncommitted change. The
server cannot block the node.
(see Note 1)
* excluded An add-with-history or a move has been
performed, and this node has been
excluded from the working copy. Note:
plain adds cannot have an excluded
node -- we'd just not add the node.
Further note: the root of the copy/move
cannot be excluded since we need the
source information. The root may be
depth==empty, however.
* not-present This node has been locally-deleted. This
can only occur for a child of a copied
or moved subtree (for a plain add, we
simply revert the add; and must be a
child, or we'd just revert the whole
copy or move operation).
* incomplete This node is known, but the information
is missing. A copy, move, or deletion
has been interrupted, leaving a
directory with known children, but
lacking their state.
(see Note 2)
normal base-deleted The BASE node has been locally-deleted.
excluded base-deleted The node was excluded from the working
copy, but has been locally-deleted.
(see Note 3)
not-present base-deleted Not allowed. There is nothing to
delete. The not-present is a tool to
represent mixed-rev working copies;
there is (logically) no node to delete.
"<none>" means there is no row in the given table (so no presence value)
"<any>" means "any value"
"*" means "<none> or <any>"
Note 1: this implies you cannot copy/move a working copy tree that has
absent nodes in it. If that were made possible, then we may
(instead) want to model this as a copy/move followed by a
local-delete of the absent node(s).
Note 2: this will probably only apply to a repository-to-wc copy. For
wc-to-wc copies/moves, we will probably transact the entire operation
so that a child will never be incomplete.
Note 3: this may be possible, though we may need more state to pull
it off (e.g. which revision of the "not there" node will be
deleted? we need the rev for out-of-date checks)
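To illustrate how a reader resolves the layered presence for a single
node, here is a sketch that consults WORKING_NODE first and falls back
to BASE_NODE. The column names used (local_relpath, presence) are
assumptions for the purpose of this example.

  #include <sqlite3.h>
  #include <string.h>

  /* Sketch: determine the visible presence of one node by checking
     WORKING_NODE first, then BASE_NODE.  Copies the presence string
     into PRESENCE (of size PRESENCE_SIZE); returns 0 if neither table
     has a row for the path. */
  static int
  effective_presence(sqlite3 *sdb,
                     const char *local_relpath,
                     char *presence,
                     size_t presence_size)
  {
    static const char *const queries[] = {
      "SELECT presence FROM WORKING_NODE WHERE local_relpath = ?1;",
      "SELECT presence FROM BASE_NODE WHERE local_relpath = ?1;"
    };
    int i;

    for (i = 0; i < 2; i++)
      {
        sqlite3_stmt *stmt;
        int found = 0;

        if (sqlite3_prepare_v2(sdb, queries[i], -1, &stmt, NULL)
            != SQLITE_OK)
          return 0;
        sqlite3_bind_text(stmt, 1, local_relpath, -1, SQLITE_STATIC);
        if (sqlite3_step(stmt) == SQLITE_ROW)
          {
            strncpy(presence,
                    (const char *)sqlite3_column_text(stmt, 0),
                    presence_size - 1);
            presence[presence_size - 1] = '\0';
            found = 1;
          }
        sqlite3_finalize(stmt);
        if (found)
          return 1;
      }
    return 0;
  }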
Commit Process
--------------
Committing is essentially a review of all the rows in the WORKING_NODE
and ACTUAL_NODE tables, and sending appropriate instructions to the
server. After the commit, those rows are removed and the data
"collapsed" down into the BASE_NODE. For example, if a copy has been
performed, then WORKING_NODE contains data about the copy, and the new
BASE_NODE will "become" that WORKING_NODE. If the user additionally
modifies some properties (stored in ACTUAL_NODE), then those will also
fall down into the post-commit BASE_NODE row.
copy_tests 24 sets up a specific scenario that breaks a lot of the
code in libsvn_wc today (Sep 28, 2009). While the commit is being
processed, the database is temporarily placed into a state which does
not have proper integrity.
This section is an attempt to document rules for how nodes are to be
treated during the commit process. These rules focus on the effects
upon an individual node, independent of what happens to any child
nodes, on the theory that children can be configured as a mixed-rev
working copy with appropriate operations applied. Note that the
decision process is independent of the children, but the children
*will* be affected by the commit of the parent node. The commit MUST
operate in a top-down fashion, however, since (for example) it is
impossible to model a copied parent and an unmodified (committed) BASE
child.
There is a large operational difference between directories and
files/symlinks, so we'll divide the discussion along that line. Note
that we only consider the kind of the WORKING node; the kind of the
BASE will simply be replaced by the new node. If the BASE used to be a
directory, then its (obsolete) children must be removed from BASE_NODE
during this commit process.
NOTE: if either the BASE_NODE or the WORKING_NODE has an "incomplete"
presence, then it CAN NOT be committed. It means we are missing
information that may be required to properly commit a change to that
node.
### hmm. during the commit, we create incomplete BASE nodes (see
### below). so this is more of a statement of the starting condition.
FILES AND SYMLINKS
working-presence: normal
base-presence: *
The WORKING and ACTUAL data is collapsed down into BASE_NODE, with
the new revision.
working-presence: excluded
base-presence: *
A new, excluded BASE_NODE is constructed, and the WORKING_NODE is
removed. Any BASE_NODE rows which appear to be descendants of this
(used-to-be-directory) node are removed. There should be no
descendants in the WORKING_NODE table.
### what information do we keep for excluded nodes?
### note: at this point, there is no user command to exclude
### files/symlinks. but we will be able to at some point...
working-presence: not-present
base-presence: *
NOTE: this situation should never be seen, since the node's parent
should have been committed first, which handles this node as part of
the child processing (see <normal, *> resolution for directories).
### in short, this working-presence can only occur for a deleted
### *subroot*. and we cannot commit *just* this node. must commit
### the root of any copy/move operation.
working-presence: base-deleted
base-presence: *
The WORKING_NODE and BASE_NODE rows are removed.
### if base-presence == excluded, there are some concerns:
### -- what information do we keep for base=excluded nodes?
### -- the base-deleted node would (at least) need to retain the
### revision in order to mark it for deletion.
### -- note: at this point, there is no user command to exclude
### files/symlinks. but we will be able to at some point...
actual: row is available
working-presence: <none>
base-presence: normal
ACTUAL_NODE.properties are folded into the BASE_NODE and the
revision is bumped.
Note: there is no data in ACTUAL_NODE other than properties.
Conflict information exists, but that must be cleared before
committing is possible.
DIRECTORIES
Whenever a directory is bumped to a new revision, the new set of
children is provided. This is required in order to maintain
proper integrity.
Example: if a child is to be added in this new revision, but a failure
happened between the directory-commit and the add-child processes,
then there would be no record of the added child. The directory would
not know it was missing a child and would report "at revision R" to
the server, implying that the child is present.
working-presence: normal
base-presence: *
The WORKING and ACTUAL data is collapsed down into BASE_NODE, with
the new revision.
The depth status of the WORKING node is carried over to the BASE
node. Children in the WORKING_NODE table should align with that
depth value, and the commit will iterate over each available child
row. Its status will be examined, and specific action taken:
normal
An incomplete node should be added for this child. This node
will become its own add/copy/move root, and will be handled as a
separate action (via recursion over the children).
excluded
An incomplete node should be added for this child. No action
taken. This node will become an excluded BASE node when it is
handled as a separate action (via recursion).
not-present
This row in WORKING_NODE is removed, along with descendant
nodes. The directory will not list this node in its (new) set of
children. Any BASE_NODE row at this path is also removed, along
with any descendant nodes.
base-deleted
No action taken. This node will be removed when it is handled
as a separate action (via recursion).
working-presence: excluded
base-presence: *
A new, excluded BASE_NODE is constructed, and the WORKING_NODE is
removed. Any BASE_NODE rows which appear to be descendants of this
(used-to-be-directory) node are removed. There should be no
descendants in the WORKING_NODE table.
### what information do we keep for excluded nodes?
working-presence: not-present
base-presence: *
NOTE: this situation should never be seen, since the node's parent
should have been committed first, which handles this node as part of
the child processing (see <normal, *> resolution).
working-presence: base-deleted
base-presence: *
The WORKING_NODE and BASE_NODE rows are removed.
actual: row is available
working-presence: <none>
base-presence: normal
ACTUAL_NODE.properties are folded into the BASE_NODE and the
revision is bumped.
Note: there is no data in ACTUAL_NODE other than properties.
Conflict information exists, but that must be cleared before
committing is possible.
Random Notes
------------
### break down all modification operations to things that operate on a
small/fixed set of rows. if a large sequence of operations fails,
then it can leave the system in a reparable state, since most were
performed. note that ACTUAL can change at any time, thus all mods
should be able to compensate for ACTUAL being something
unexpected. thus, the transformative operations should be able to
fail in such a way as to leave ACTUAL pretty bunged up.
### probably want to special-case the checksum and BASETEXT entry for
the "empty file"
Code Organization
-----------------
libsvn_wc/wc_db.h (symbols: svn_wc__db_*)
Storage subsystem for the WC metadata/base-text information.
This is a private API, and the rest of the WC will be rebuilt
on top of this.
This code deals with storage, and transactional modifications
of the data.
Note: this is a random-access, low-level API. Editors will be
built on top of this layer.
libsvn_wc/workqueue.h (symbols: svn_wc__wq_*)
The "work queue" is a subsystem to replace the old "loggy"
subsystem. It will perform (primarily) filesystem operations
in a transactional way.
svn_wc.h API
------------
Note that we also have an opportunity to revamp the WC API. Things
like access batons will definitely disappear, but there will most
likely be great opportunities for other design changes.
Note that removing access batons (and other API changes) will ripple
up into libsvn_client, and may even have an effect on *its* API.
### the form of a new API is unknown/TBD.
We are going to add svn_wc_context_t to be created once per process,
and passed to all svn_wc functions. This will replace the (often
confusing) use of access batons.
Implementation note: this context will hold an svn_wc__db_t handle,
and a pointer to the process's svn_config_t object.
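The note above says the context will hold the db handle and the config;
here is a minimal sketch of what such a struct could look like. The
field names and the presence of a state pool are assumptions; the real
definition lives in libsvn_wc and may differ.

  #include <apr_pools.h>
  #include "svn_config.h"       /* svn_config_t */

  /* Forward declaration; the real type lives in libsvn_wc's private
     wc_db.h. */
  typedef struct svn_wc__db_t svn_wc__db_t;

  /* Sketch of the per-process working copy context described above. */
  struct svn_wc_context_t
  {
    /* The shared metadata-store handle. */
    svn_wc__db_t *db;

    /* The caller's configuration. */
    svn_config_t *config;

    /* Pool that owns this context and its members. */
    apr_pool_t *state_pool;
  };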
Upgrading old working copies
============================
When WC-NG finds a working copy which is pre-wc-ng, it will quit, prompting the
user to run 'svn upgrade' to upgrade the working copy to a wc-ng state. The
reason for not upgrading on-the-fly is two-fold:
* We anticipate this process to be irreversible, so we want to ensure the
user wants to upgrade (no silent upgrading/breakage).
* The upgrade may be I/O and computationally intensive, and, in keeping with
the principle of least surprise, we want to ensure that the upgrade is done
intentionally, when the user expects it.
Here's how we plan on implementing 'svn upgrade', so that it maintains
consistency across the working copy atomically. We have two requirements:
do the upgrade completely, and *don't* leave the working copy in an unusable
state if the upgrade fails. The steps for upgrading are:
1) Create the new wc.db as invisible.db in the working copy root
2) Upgrade the current directory into invisible.db
3) Drop a flag file into the directory to signal an "in process" upgrade
4) Recurse on step 2 for each of the subdirectories
5) Move invisible.db to wc.db in the working copy root
6) Recursively remove each of the .svn subdirs for each wc subdir
Note that nowhere do we attempt to run or upgrade old logs. This is
intentional. In order to simplify the development and maintenance burden,
we intend to bail when the upgrade process encounters a working copy with
un-run logs. In this state, it will be up to the user to run 'svn cleanup'
with the prior version of Subversion to ensure the working copy is in an
upgradable state. Failing that, the user can always do a fresh checkout.
The atomic step is step 5. Should the upgrade process be interrupted prior
to Step 5, the working copy will still be usable by a pre-wc-ng client, but
will just have extra stuff in the .svn directories, namely invisible.db in
the root, and the various flag files everywhere. Should the upgrade get
interrupted *after* Step 5 (but before all the .svn directories are removed),
the .svn directories will show up as unversioned directories. Not ideal,
but not terribly bad, either.
### perhaps we should have a way of cleaning up all the .svn dirs from an
interrupted upgrade?
### what if somebody attempts to use an old Subversion on a working copy
with a .svn which hasn't been harvested yet? it should succeed, but
may leave a discrepancy between the wc-ng database and the .svn
metadata, since old Subversions don't know to recurse up the tree.
could we do something to the working copy to render it unusable by older
subversions, but without too much pain? how about simply bumping the
format number in entries.c? a properly chosen format number would make
older Subversions complain, but also make Subversion 1.7 prompt for an
upgrade. hmm.....
When an upgrade is restarted, invisible.db will just be blown away and
recreated from scratch, since already-upgraded directories could have been
modified between invocations of 'svn upgrade'.
Implementation Plan
===================
The following are tasks which need to be accomplished for WC-NG. There
isn't a strict ordering here, but rather a possible plan. There may be
dependencies between some items, but that is left as an exercise for the
reader.
* Pristine file management
* Properties management
* Tree management (BASE v. WORKING v. ACTUAL for APIs and storage)
* Journaled actions
* Finding/using the correct admin area
* Upgrading
- Including multiple heterogeneous admin areas
* Move entries into SQLite
* Relocating datastore in useful ways
Afterwards, we'll need:
* A second pass at the WC code to find/fix patterns and solutions.
* Revamp of WC API, to propagate up into libsvn_client.
* Reexamine any client/wc interactions, and look for final cleanups.
Near-Term Plan
--------------
Note: we originally envisioned the "ordering" below. In practice,
however, we have been attacking the overall problem from multiple
angles. Typically, we are finding conceptual/API bottlenecks that
make it hard to accomplish a number of other tasks. We solve the
bottleneck, and move on to solving the higher-level problems. It
is a continual "evolutionary" process. Some temporary APIs are being
introduced to help bridge the conceptual gap between wc-1 and wc-ng.
These should disappear by release time, but serve to mitigate code
disruption and potential for error.
1. convert entries.c to use sqlite directly. migrate 'entries' file
during this step. the sqlite file will be in-memory if we are not
allowed to auto-upgrade the WC; otherwise, we'll write the sqlite
database into .svn/
note: the presence of 'wc.db' (or whatever its name) will indicate
a minimum format level. the user field in the database
contains the schema version which is our further format-level
descriptor value.
[ this has been largely done. ]
2. convert entries.c to use wc_db. shift the sqlite code into wc_db.
note: this is a separate step from 1. there is a paradigm shift
between how entries.c works and wc_db works. we want to
ignore that in Step 1, and then handle it in this Step.
note: put wc_db handle into lock->shared and share the handle
across all directories/batons.
[ because of the broken way the upper layers use the entries API, using
the wc_db APIs to write entries proves difficult, since we violate
all kinds of constraints. ]
3. convert props.c to use wc_db. migrate props to db simultaneously.
[ this is currently in-process. ]
4. implement 'svn cleanup' as an upgrade path from old-style working
copies to wc-ng
[ done, but work will continue as the wc-ng format continues to evolve. ]
5. incremental shift of pristines from N files into pristine db.
note: we could continue to leave .revert-base while we migrate the
primary base into the pristine dataset.
6. shift libsvn_wc from using entries.h to using wc_db.h.
note: since entries.h is "merely" a wrapper for wc_db.h, this will
allow the libsvn_wc to start using the new wc_db APIs
wherever it is easy/possible.
goal: all libsvn_wc code uses wc_db.h, and entries.h exists solely
to support old backwards-compat code.
[ this isn't quite a discrete task, but is happening gradually as we work
through the libsvn_wc API. ]
7. centralize the metadata and pristines
note: this will also involve merging datastores
8. replace loggy with sqlite-based work journal
Endgame
-------
As WC-NG development has progressed, and many of the above milestones have
been met, we've identified the following milestones leading to the completion
of wc-ng development (and hence 1.7). They are not all necessarily serially
dependent, but some dependencies do exist.
1. Move properties into wc.db.
2. Convert loggy actions into work queue actions.
3. Move pristines into a SHA-1 based store.
4. Consolidate metadata into the centralized system.
5. Test, tweak and release!
The above items are milestones. There are a number of work items that
need to be completed in/around the milestones. The progress of this
work can be roughly measured by the tools/dev/wc-ng/count-progress.py
script:
* remove use of svn_wc_adm_access_t and svn_wc_entry_t
* review/revamp the requirements, definitions, and use of the
svn_wc__node_* and svn_wc__db_temp_* functions