A Streamlined HTTP Protocol for Subversion
GOAL
====
Write a new HTTP protocol for svn -- one which is entirely proprietary
and designed for speed and comprehensibility.
PURPOSE / HISTORY
=================
Subversion standardized on Apache and the WebDAV/DeltaV protocol back
in the earliest days of development, based on some very strong
value propositions:
A. Able to go through corporate firewalls
B. Zillions of authn/authz options via Apache
C. Standardized encryption (SSL)
D. Excellent logging
E. Built-in repository browsing
F. Caching within intermediate proxies
G. Interoperability with other WebDAV clients
Unfortunately, DeltaV is an insanely complex and inefficient protocol,
and doesn't fit Subversion's model well at all. The result is that
Subversion speaks a "limited portion" of DeltaV, and pays a huge
performance price for this complexity.
A typical network trace involves dozens of unnecessary turnarounds
where the client keeps asking for the same information over and over
again, all for the sake of following DeltaV. And then once the client
has "discovered" the information it needs, it often ends up making a
custom REPORT request anyway. Most svn operations are at least twice
as slow over HTTP as over the custom svnserve protocol.
The existing HTTP protocol is also devilishly hard to comprehend or
extend, since it requires understanding the DeltaV spec and knowing
exactly which parts of that standard we support.
REQUIREMENTS
============
Write a new HTTP protocol for svn ("HTTP v2"). Map RA requests
directly to HTTP requests.
* svn over HTTP should be much faster (eliminate extra turnarounds)
* svn over HTTP should be almost as easy to extend as svnserve.
* svn over HTTP should be comprehensible to devs and users both
(require no knowledge of DeltaV concepts).
* svn over HTTP should be designed for optimum cacheability by web
proxies.
* svn over HTTP should make use of pipelined requests when possible.
MILE-HIGH DESIGN
================
* Write new mod_svn module. Design it to (optionally) run
side-by-side with mod_dav_svn on the same public URI.
* Extend libsvn_ra_serf to detect the new Apache protocol and if
present, use it.
* Client/server compatibility:
- newer clients can still operate against old servers: they look
for new protocol in OPTIONS response; if not available, fall
back to making DeltaV requests.
- older clients can still operate against new servers: mod_svn
DECLINEs any old-style DeltaV request, allowing mod_dav_svn to
handle it instead.
* To upgrade a service, admins simply install mod_svn next to
mod_dav_svn. They then ask their users to "upgrade the client to
get better HTTP speed".
In theory, mod_svn should operate completely standalone (and should be
tested this way.) If an admin wants to support older clients or add
webdav functionality (such as autoversioning), then mod_dav_svn can be
installed "behind" mod_svn at the same URI.
DESIGN
======
1. Client-Server Negotiation
----------------------------
The administrator makes an svn repository available via mod_svn at a
specific URI, which we'll refer to as the "repository root URI".
(This same URI might also be serviced by mod_dav_svn too.)
mod_svn then advertises the new protocol in an OPTIONS response
against the repository root URI. It specifically includes the minimum
and maximum version numbers of the protocol it understands.
ra_serf always starts an RA session with an OPTIONS request against
the repository root URI. If the new protocol isn't present (or is an
unsuitable version), it falls back to the DeltaV protocol.
TODO: like svnserve, mod_svn may also want to advertise specific
features in its OPTIONS response.
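The fallback decision described above can be sketched as follows. This is
only an illustration: the function name, the client's supported range, and
the assumption that the OPTIONS response has already been parsed into two
integers are all hypothetical, since this document does not yet say how the
version numbers are carried.

```python
from typing import Optional

# Hypothetical range of protocol versions this client can speak.
CLIENT_MIN_VERSION = 2
CLIENT_MAX_VERSION = 2


def choose_protocol_version(server_min: Optional[int],
                            server_max: Optional[int]) -> Optional[int]:
    """Return the highest version both sides understand, or None.

    None means "fall back to DeltaV": either the server did not advertise
    the new protocol at all, or its advertised range does not overlap ours.
    """
    if server_min is None or server_max is None:
        return None
    low = max(server_min, CLIENT_MIN_VERSION)
    high = min(server_max, CLIENT_MAX_VERSION)
    return high if low <= high else None
```

Picking the highest common version is a design choice of this sketch, not
something the protocol mandates.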
2. General Command Mechanism
----------------------------
From here, the client issues HTTP requests that match up with the
svn_ra.h interfaces. Each RA 'command' takes a set of parameters
and represents a single network turnaround.
The standard pattern is to follow the lead of the mercurial network
protocol and embed these commands in either HTTP/1.1 GET or POST
methods against the repository root URI. The command and parameters
are embedded into the request URI itself as standard query syntax.
For example, if the repository is available at the root URI
'/repos', then a client might send requests like these:
GET /repos?cmd=get-latest-rev
GET /repos?cmd=rev-proplist&r=23
GET /repos/trunk/foo.c?cmd=get-file&r=23
In general, we try to make these requests line up with the
corresponding RA APIs. One exception, however, is that we don't
split the RA_session URI and the 'path' parameter into two pieces.
Using 'path' as a query parameter is weird. So for an RA call like
this:
svn_ra_open(&ra_session, "/repos/trunk/src");
svn_ra_blort(ra_session, "foo.c", 23);
...we'd issue a command like this:
GET /repos/trunk/src/foo.c?cmd=blort&rev=23
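A minimal sketch of that mapping, assuming a client-side helper (the name
build_command_uri and its signature are hypothetical) that joins the RA
session URI with the relative path and appends the command as a query
string:

```python
from urllib.parse import quote, urlencode


def build_command_uri(session_uri: str, rel_path: str, cmd: str, **params) -> str:
    """Build the request URI for one RA command.

    The session URI and the relative path are merged into a single URI
    path; the command name and its parameters become the query string.
    """
    path = session_uri.rstrip("/")
    if rel_path:
        path += "/" + rel_path.strip("/")
    # 'cmd' always comes first; the other parameters keep the caller's order.
    query = urlencode([("cmd", cmd)] + list(params.items()))
    return quote(path) + "?" + query


# The svn_ra_blort() call above would become:
print(build_command_uri("/repos/trunk/src", "foo.c", "blort", rev=23))
```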
For requests which require real input data in the bodies, (such as
'update' or 'commit') we use a POST request. For example:
POST /repos?cmd=update&targetrev=100
[body contains complete 'update report' describing working copy's
revisions; response is a complete editor-drive.]
POST /repos?cmd=commit&keeplocks=true
[body contains complete editor-drive from client, including
possible revision-props that need changing (like svn:log), as
well as any necessary lock-tokens. response is the newly
committed revision number.]
3. Representation of structured data in request/response bodies
---------------------------------------------------------------
XML is out: there's a huge performance penalty for producing and
consuming it, which is why companies like Facebook and Google have
released 'fast wire serialization' libraries like Thrift and
Protocol Buffers. Unfortunately, these libraries require entire
structures to be held in memory in order to serialize/deserialize
them, and this isn't an option when dealing with something
(potentially) infinitely large like an editor-drive.
Luckily, svnserve already has a nice lisp-like representation of the
editor drive, and we can share its parsing/unparsing code.
We can also use this same representation for things like property
lists.
## TODO: flesh out examples here
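As one possible illustration, the sketch below emits svnserve-style items
(numbers, length-prefixed strings, bare words, and parenthesized lists) as
a stream of small chunks, so the whole structure never has to be held in
memory as a single buffer. The Word class and the encode function are
hypothetical names, and the proplist layout shown is an assumption; the
actual wire format is still to be decided.

```python
from collections.abc import Iterable
from typing import Iterator


class Word(str):
    """A bare protocol word, as opposed to a length-prefixed string."""


def encode(item) -> Iterator[bytes]:
    """Yield the wire encoding of one item, chunk by chunk."""
    if isinstance(item, Word):
        yield item.encode("ascii") + b" "
    elif isinstance(item, str):
        # Plain str is ambiguous; callers must pick Word or bytes.
        raise TypeError("use Word for words and bytes for strings")
    elif isinstance(item, int):
        yield b"%d " % item
    elif isinstance(item, bytes):
        yield b"%d:" % len(item)
        yield item
        yield b" "
    elif isinstance(item, Iterable):
        # Lists may be generators, so children are encoded lazily.
        yield b"( "
        for child in item:
            yield from encode(child)
        yield b") "
    else:
        raise TypeError("cannot encode %r" % (item,))


# A one-entry property list, joined here only for display.
proplist = [Word("proplist"), [[b"svn:log", b"fix"]]]
print(b"".join(encode(proplist)))
```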
4. Requests
-----------
The following request types generally correspond to the routines in
the svn_ra.h API; where they diverge, they do so in order to improve
performance, cacheability, or pipelining potential.
According to proper HTTP specs, GET requests do not contain request
bodies; if a request needs a body, we use POST instead.
* get-latest-rev
GET /repos[/path]/!get-latest-rev
response: revnum
cacheable: NO, this value changes all the time.
Note: The [/path] portion is tolerated but ignored by the server,
since the HEAD revision is a repository-wide attribute.
* get-dated-rev
GET /repos[/path]/!get-dated-rev/DATESTRING
DATESTRING must be in subversion standard format
(e.g. 2008-10-15T12:29:52.526295Z). See svn_time.h.
response: revnum
cacheable: SOMETIMES: assuming it's a non-HEAD rev
Note: The [/path] portion is tolerated but ignored by the server,
since the revision being fetched is part of repository's general
history.
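For illustration, a client could produce a DATESTRING in the format shown
above from a timezone-aware timestamp. The helper names here are
hypothetical, and the sketch assumes that UTC with microsecond precision
is what the server expects:

```python
from datetime import datetime, timezone


def format_datestring(when: datetime) -> str:
    """Render a timezone-aware datetime as a UTC timestamp ending in 'Z'."""
    return when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


def get_dated_rev_uri(repos_root: str, when: datetime) -> str:
    """Build the get-dated-rev request URI for the given moment."""
    return "%s/!get-dated-rev/%s" % (repos_root.rstrip("/"),
                                     format_datestring(when))


when = datetime(2008, 10, 15, 12, 29, 52, 526295, tzinfo=timezone.utc)
print(get_dated_rev_uri("/repos", when))
```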
* change-rev-prop
POST /repos[/path]/!change-rev-prop/REV/PROPNAME
REV is a revision number.
PROPNAME is the name of a revision property, properly URI-encoded.
The body of the request either contains a binary value or is empty
(in which case the revprop is deleted.)
response: no body response. (response code 200 implies success.)
cacheable: NO, this is a write request.
Note: The [/path] portion is tolerated but ignored by the server,
since the revprop being changed is a repository-wide feature.
* rev-proplist
GET /repos[/path]/!rev-proplist/REV
REV is a revision number.
response: a list of property/value pairs (format TBD)
cacheable: NO, because revprops are mutable.
Note: The [/path] portion is tolerated but ignored by the server,
since the revprops are a repository-wide feature.
* rev-prop
GET /repos[/path]/!rev-prop/REV/PROPNAME
REV is a revision number.
PROPNAME is the name of a revision property, properly URI-encoded.
response: a binary property value (200), or 404 if not found.
cacheable: NO, because revprops are mutable.
Note: The [/path] portion is tolerated but ignored by the server,
since the revprop is a repository-wide feature.
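Because property names such as svn:log contain characters that are special
in URIs, PROPNAME has to be percent-encoded as a single path component. A
minimal sketch, using hypothetical helper names, of building and parsing
the rev-prop URI:

```python
from typing import Tuple
from urllib.parse import quote, unquote


def rev_prop_uri(repos_root: str, rev: int, propname: str) -> str:
    """Build a rev-prop URI, encoding every reserved character in PROPNAME."""
    encoded = quote(propname, safe="")  # safe="" so that '/' is encoded too
    return "%s/!rev-prop/%d/%s" % (repos_root.rstrip("/"), rev, encoded)


def parse_rev_prop_uri(uri: str) -> Tuple[int, str]:
    """Recover (REV, PROPNAME) from a rev-prop URI."""
    _, sep, rest = uri.partition("/!rev-prop/")
    if not sep:
        raise ValueError("not a rev-prop request: %r" % uri)
    rev, _, encoded = rest.partition("/")
    return int(rev), unquote(encoded)


print(rev_prop_uri("/repos", 23, "svn:log"))
```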
* get-file
GET /repos/path/!get-file/REV/[tp]
REV is either a revision number or the string 'HEAD'.
Final path component is either 't', 'p', 'tp', or non-existent:
't' means the file's text is wanted.
'p' means the file's properties are wanted.
'tp' (or non-existence) means both text and props are wanted.
response: (structural encoding TBD:)
revnum
checksum of text (if text requested)
proplist (if props requested)
text (if text requested)
cacheable: YES, but only if a specific revnum was requested
(i.e. not HEAD)
Note: two simpler, alternate URI forms work for fetching *just*
the file's contents (no revnum, checksum, or props):
GET /repos/path
GET /repos/path?rev=number
cacheable: YES, but only if a specific revnum is given in the
query arg.
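The get-file URI rules above could be parsed on the server side roughly as
follows. This is a sketch only: GetFileRequest and parse_get_file are
hypothetical names, and the cacheable flag simply restates the rule that
only requests for a specific revnum may be cached.

```python
from typing import NamedTuple, Optional


class GetFileRequest(NamedTuple):
    path: str
    rev: Optional[int]  # None stands for HEAD
    want_text: bool
    want_props: bool
    cacheable: bool


def parse_get_file(uri: str) -> GetFileRequest:
    """Split a get-file URI into its path, REV, and [tp] selector."""
    path, sep, rest = uri.partition("/!get-file/")
    if not sep:
        raise ValueError("not a get-file request: %r" % uri)
    parts = [p for p in rest.split("/") if p]
    if not 1 <= len(parts) <= 2:
        raise ValueError("expected REV and an optional selector: %r" % uri)
    rev = None if parts[0] == "HEAD" else int(parts[0])
    # A missing final component means both text and props are wanted.
    flags = parts[1] if len(parts) == 2 else "tp"
    if flags not in ("t", "p", "tp"):
        raise ValueError("bad selector: %r" % flags)
    return GetFileRequest(path, rev, "t" in flags, "p" in flags,
                          cacheable=rev is not None)


print(parse_get_file("/repos/trunk/foo.c/!get-file/23/t"))
```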
* get-dir
GET /repos/path/!get-dir/REV/[kshctla]/[tp]
REV is either a revision number or the string 'HEAD'.
Penultimate path component is a set of characters indicating which
dirent fields to return:
k kind
s size
h has-props?
c created-rev
t time (created-date)
l last-author
a all fields
Final path component is either 't', 'p', 'tp', or non-existent:
't' means the directory's dirents are wanted.
'p' means the directory's properties are wanted.
'tp' (or non-existence) means both dirents and props are wanted.
response: (structural encoding TBD):
revnum
proplist (if props requested)
list of dirents: (name, kind, size, has-props,
created-rev, time, last-author)
cacheable: YES, but only if specific revnum is requested.
examples:
Get all dirents of path@38:
GET /repos/path/!get-dir/38/a
Get only dirent names and kinds for path@HEAD, no properties:
GET /repos/path/!get-dir/HEAD/k/t
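The dirent field selector can be expanded into field names as sketched
below. The function name is hypothetical, and rejecting unknown letters is
an assumption of this sketch rather than specified behavior.

```python
from typing import List

# Selector letters and the dirent fields they stand for, in table order.
DIRENT_FIELDS = {
    "k": "kind",
    "s": "size",
    "h": "has-props",
    "c": "created-rev",
    "t": "time",
    "l": "last-author",
}


def expand_dirent_fields(selector: str) -> List[str]:
    """Translate a selector such as 'kc' or 'a' into dirent field names."""
    unknown = set(selector) - set(DIRENT_FIELDS) - {"a"}
    if unknown:
        raise ValueError("unknown dirent field(s): %s" % "".join(sorted(unknown)))
    if "a" in selector:
        return list(DIRENT_FIELDS.values())
    # Keep table order regardless of the order used in the selector.
    return [name for letter, name in DIRENT_FIELDS.items() if letter in selector]


print(expand_dirent_fields("k"))
```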
.... NOT YET FINISHED ....
5. Implementation details
-------------------------
A. (new) mod_svn module
mod_svn requirements:
* operates completely standalone
* provides reasonable opportunity for proxy caching
* provides reasonable opportunity for pipelining clients
* must DECLINE DeltaV requests, so mod_dav_svn can be
installed 'behind it' on the same <Location>, for
compatibility with old clients.
B. (new) libsvn_ra_http.so library
Uses the serf library like libsvn_ra_serf, but speaks the new HTTP v2
protocol.
When libsvn_ra (the switching library) decides that some http RA
module is necessary, have it first call a utility function to do
an OPTIONS probe, then decide to use either ra_http or ra_serf.
This strategy aligns with the way in which libsvn_ra currently
"chooses" either ra_neon or ra_serf based on a runtime config
file.
6. Optimization Possibilities
-----------------------------
* Have the svn client stash some metadata which records whether a
working copy comes from a 'v2' HTTP server or not. This would
save us from doing an extra OPTIONS probe at the start of each RA
session.