                A Streamlined HTTP Protocol for Subversion

 GOAL
 ====

 Write a new HTTP protocol for svn, one which is entirely proprietary
 and designed for speed.


 PURPOSE / HISTORY
 =================

 Subversion standardized on Apache and the WebDAV/DeltaV protocol
 back in the earliest days of development, based on some very strong
 value propositions:

   A. Able to go through corporate firewalls
   B. Zillions of authn/authz options via Apache
   C. Standardized encryption (SSL)
   D. Excellent logging
   E. Built-in repository browsing
   F. Interoperability with other WebDAV clients
   G. Caching within intermediate proxies

 Unfortunately, DeltaV is an insanely complex and inefficient protocol,
 and doesn't fit Subversion's model well at all.  The result is that
 Subversion speaks a "limited portion" of DeltaV, and pays a huge price
 for this complexity:  speed.

 A typical network trace involves dozens of unnecessary turnarounds
 where the client keeps asking for the same information over and over
 again, all for the sake of following DeltaV.  And then once the client
 has "discovered" the information it needs, it often ends up making a
 custom REPORT request anyway.  Most svn operations are at least twice
 as slow over HTTP as over the custom svnserve protocol.


 PROPOSAL
 ========

 Write a new HTTP protocol for svn ("HTTP v2").  Map RA requests
 directly to HTTP requests.

   * svn over HTTP should be much faster (eliminate turnarounds)

   * svn over HTTP should be almost as easy to extend as svnserve.

   * svn over HTTP should be comprehensible to devs and users both
     (require no knowledge of DeltaV concepts).

   * svn over HTTP should be designed for optimum cacheability by
     proxy-servers.


 MILE-HIGH DESIGN
 ================

   * Write new mod_svn module.  Design it to run side-by-side with
     mod_dav_svn on the same public URI.

   * Extend libsvn_ra_serf to detect the Apache feature and if present,
     speak the new protocol.

   * Client/server compatibility:

       - newer clients can still operate against old servers: they look
         for new protocol in OPTIONS response; if not available, fall
         back to making DeltaV requests.

       - older clients can still operate against new servers: mod_svn
         DECLINEs any old-style DeltaV request, allowing mod_dav_svn to
         handle it instead.

   * To upgrade a service, admins simply install mod_svn next to
     mod_dav_svn.  They then ask their users to "upgrade the client to
     get better HTTP speed".

 In theory, mod_svn should operate completely standalone (and should be
 tested this way).  If an admin wants to support older clients, or add
 webdav functionality (such as autoversioning), then mod_dav_svn can be
 installed "behind" mod_svn.


 DESIGN
 ======


 1. Client-Server Negotiation
 ----------------------------

   The administrator makes an svn repository available via mod_svn at a
   specific URI, which we'll refer to as the "repository root URI".
   (This same URI might also be serviced by mod_dav_svn.)

   mod_svn then advertises the new protocol in an OPTIONS response
   against the repository root URI.  It specifically includes a minimum
   and maximum version number of the protocol it understands.

   ra_serf always starts an RA session with an OPTIONS request against
   the repository root URI.  If the new protocol isn't present (or is
   present only in an unsuitable version), it falls back to the DeltaV
   protocol.

   TODO: like svnserve, mod_svn may also want to advertise specific
   features in its OPTIONS response.
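
   The fallback rule above can be sketched as follows.  This is only an
   illustration in Python, not part of the design: the function name is
   hypothetical, and it assumes both sides express protocol versions as
   plain integers.  How the OPTIONS response actually encodes the
   minimum and maximum version is not specified in this document.

```python
from typing import Optional, Tuple


def choose_protocol_version(server_range: Optional[Tuple[int, int]],
                            client_range: Tuple[int, int]) -> Optional[int]:
    """Return the highest protocol version both sides understand.

    server_range is the (minimum, maximum) pair taken from the OPTIONS
    response, or None when the server did not advertise the new protocol.
    A result of None means "fall back to DeltaV".
    """
    if server_range is None:
        return None
    server_min, server_max = server_range
    client_min, client_max = client_range
    best = min(server_max, client_max)
    if best < max(server_min, client_min):
        return None  # the ranges do not overlap: unsuitable version
    return best
```

   Picking the highest common version is one reasonable policy; the
   design could just as well let the client prefer a specific version.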


 2. General Command Mechanism
 ----------------------------

   From here, the client initiates HTTP requests that match up with the
   svn_ra.h interfaces.  Each RA 'command' takes a set of parameters
   and represents a single network turnaround.

   The standard pattern is to follow the lead of the mercurial network
   protocol and embed these commands in either HTTP/1.1 GET or POST
   methods against the repository root URI.  The command and parameters
   are embedded into the request URI itself as standard query syntax.

   For example, if the repository is available at the root URI
   '/repos', then a client might send requests like these:

      GET /repos?cmd=get-latest-rev

      GET /repos?cmd=rev-proplist&rev=23

      GET /repos/trunk/foo.c?cmd=get-file&rev=23

   In general, we try to make these requests line up with the
   corresponding RA APIs.  One exception, however, is that we don't
   split the RA_session URI and the 'path' parameter into two pieces.
   Using 'path' as a query parameter is weird.  So for an RA call like
   this:

       svn_ra_open(&ra_session, "/repos/trunk/src");
       svn_ra_blort(ra_session, "foo.c", 23);

   ...we'd issue a command like this:

       GET /repos/trunk/src/foo.c?cmd=blort&rev=23
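
   A minimal sketch of that mapping, in Python.  The helper name
   build_request_uri is hypothetical, and the sketch assumes the
   session URL is given as a server-relative path (as in the example
   above) rather than as a full URL with scheme and host.

```python
from urllib.parse import quote, urlencode


def build_request_uri(session_url: str, rel_path: str, cmd: str, **params) -> str:
    """Join the RA session URL and the relative path, then append the command.

    The path is never sent as a query parameter; only the command and
    its parameters go into the query string.
    """
    path = session_url.rstrip("/")
    if rel_path:
        path += "/" + rel_path.strip("/")
    # 'cmd' always comes first; remaining parameters are sorted so the
    # same request always yields the same URI (helpful for caching).
    query = urlencode([("cmd", cmd)] + sorted(params.items()))
    return quote(path) + "?" + query
```

   With this helper, build_request_uri("/repos/trunk/src", "foo.c",
   "blort", rev=23) yields the request URI shown above.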

   For requests which require real input data in the bodies (such as
   'update' or 'commit'), we use a POST request.  For example:

      POST /repos?cmd=update&targetrev=100

      [body contains complete 'update report' describing working copy's
      revisions;  response is a complete editor-drive.]

      POST /repos?cmd=commit&keep-locks=true

      [body contains complete editor-drive from client, including
      possible revision-props that need changing (like svn:log), as
      well as any necessary lock-tokens.  response is the newly
      committed revision number.]


 3. Representation of structured data in request/response bodies
 ---------------------------------------------------------------

   XML is out: there's a huge performance penalty for producing and
   consuming it, which is why companies like Facebook and Google have
   released 'fast wire serialization' libraries like Thrift and
   Protocol Buffers.  Unfortunately, these libraries require entire
   structures to be held in memory in order to serialize/deserialize
   them, and this isn't an option when dealing with something
   (potentially) infinitely large like an editor-drive.

   Luckily, svnserve already has a nice lisp-like representation of the
   editor drive, and we can share its parsing/unparsing code.

   We can also use this same representation for things like property
   lists.
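
   As a rough illustration, here is how a property list might be
   written in that style, sketched in Python.  It follows the item
   syntax svnserve uses (length-prefixed strings, parenthesized lists,
   every token followed by a space); whether HTTP v2 would reuse that
   syntax byte for byte is an assumption, and the function names are
   hypothetical.  The encoder is a generator, so a caller can stream
   the output without holding the whole structure in memory.

```python
from typing import Iterable, Iterator, Tuple


def encode_string(data: bytes) -> bytes:
    """Encode a byte string as '<length>:<bytes> '."""
    return str(len(data)).encode("ascii") + b":" + data + b" "


def encode_proplist(props: Iterable[Tuple[str, bytes]]) -> Iterator[bytes]:
    """Yield '( ( name value ) ... ) ' one piece at a time."""
    yield b"( "
    for name, value in props:
        # Property values are raw bytes and may contain binary data.
        yield b"( " + encode_string(name.encode("utf-8")) + encode_string(value) + b") "
    yield b") "
```

   For a single property svn:log with the value "fix bug", joining the
   yielded pieces gives:  ( ( 7:svn:log 7:fix bug ) )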

   ## TODO:  flesh out examples here


 4. Commands
 -----------

 In the list of commands, all commands are assumed to be attached as
 ?cmd=command to the request URI.  Command parameters are all
 query-encoded (&parm=val), and optional parameters are listed in
 square brackets.  Server response values are assumed to be in response
 bodies.


   get-latest-rev

     GET /repos[/path]?cmd=get-latest-rev

     response: revnum

   get-dated-rev

     GET /repos[/path]?cmd=get-dated-rev&date=string

     response: revnum

   change-rev-prop

     POST /repos[/path]?cmd=change-rev-prop&rev=num&name=string

     [body contains binary value.  If the body is empty, the rev-prop is
     deleted.]

   rev-proplist

     GET /repos[/path]?cmd=rev-proplist&rev=num

     response:  proplist

   rev-prop

     GET /repos[/path]?cmd=rev-prop&rev=num&name=string

     response:  propval (may be binary data)

   commit

     POST /repos[/path]?cmd=commit[&keep-locks=bool]

     [body contains:
              optional list of revprops (including svn:log)
              optional list of lockpath:locktoken pairs
              editor-drive relative to /repos[/path] ]

     response:  new revnum OR commit-error-message

   get-file

     GET /repos/path?cmd=get-file[&rev=num&want-props=bool&want-contents=bool]

     If optional params aren't specified, rev defaults to HEAD, and
     want-props and want-contents default to 'true'.

     response: an s-expression containing:
                  revnum
                  checksum
                  props [if requested]
                  contents [if requested]

      *** Note that two simpler URI forms work for fetching *raw* file
          contents as well (no checksum, props, rev):

             GET /repos/path
             GET /repos/path?rev=number

     ## TODO: need to design cacheability of both the long-form and
        short-form of these requests.
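
     The defaulting rules for get-file could be handled on the server
     side roughly as below.  This is a Python sketch only: the function
     name is hypothetical, and it assumes booleans travel as the
     literal strings 'true' and 'false', which this document does not
     pin down.

```python
from urllib.parse import parse_qs, urlsplit


def parse_get_file(request_uri: str) -> dict:
    """Extract get-file parameters, applying the documented defaults."""
    parts = urlsplit(request_uri)
    query = parse_qs(parts.query)
    if query.get("cmd") != ["get-file"]:
        raise ValueError("not a get-file request")

    def flag(name: str) -> bool:
        # Absent flags default to true; only the literal 'false' disables one.
        return query.get(name, ["true"])[0] != "false"

    return {
        "path": parts.path,
        # A missing rev means HEAD; otherwise it must be a revision number.
        "rev": int(query["rev"][0]) if "rev" in query else "HEAD",
        "want_props": flag("want-props"),
        "want_contents": flag("want-contents"),
    }
```

     The short-form requests (no cmd at all) are deliberately rejected
     here; they would need their own handling.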


   .... NOT YET FINISHED ....


 5. Implementation details
 -------------------------


   A.  (new) mod_svn module

       mod_svn requirements:

          * operates completely standalone
          * provides reasonable opportunity for proxy caching
          * provides reasonable opportunity for pipelining clients
          * must DECLINE DeltaV requests, so mod_dav_svn can be
            installed 'behind it' on the same <Location>, for
            compatibility with old clients.
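
       The last requirement amounts to a small routing decision.  The
       Python sketch below only illustrates the logic; a real module
       would be written in C against the Apache handler API, and the
       names here are hypothetical.  It assumes a v2 request is exactly
       a GET or POST that carries a cmd query parameter, so the
       short-form raw-file GETs described in section 4 are not covered.

```python
from urllib.parse import parse_qs

OK = "OK"              # stand-in for "this module handles the request"
DECLINED = "DECLINED"  # stand-in for "let the next module handle it"


def route_request(method: str, query_string: str) -> str:
    """Decide whether the new module handles a request or declines it."""
    if method not in ("GET", "POST"):
        return DECLINED  # PROPFIND, REPORT, MERGE, ... belong to DeltaV
    if "cmd" not in parse_qs(query_string):
        return DECLINED  # no command: not a v2 command request
    return OK
```

       Keeping the test this cheap matters, since it runs on every
       request that reaches the shared <Location>.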


   B.  (new) libsvn_ra_http.so library

       Uses the serf library, like libsvn_ra_serf, but speaks the new
       HTTP v2 protocol.

       When libsvn_ra (the switching library) decides that some http RA
       module is necessary, have it first call a utility function to do
       an OPTIONS probe, then decide to use either ra_http or ra_serf.

       This strategy aligns with the way in which libsvn_ra currently
       "chooses" either ra_neon or ra_serf based on a runtime config
       file.


 6. Optimization Possibilities
 -----------------------------


    * Have the svn client stash some metadata which records whether a
      working copy comes from a 'v2' HTTP server or not.  This would
      save us from doing an extra OPTIONS probe at the start of each RA
      session.
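
    A sketch of the idea in Python.  The class and method names are
    hypothetical, and an in-memory dictionary stands in for whatever
    working-copy metadata the client would really use.  The probe
    callable represents the OPTIONS request and is supplied by the
    caller.

```python
from typing import Callable, Dict


class ProtocolCache:
    """Remember, per repository root URI, whether the server spoke v2."""

    def __init__(self, probe: Callable[[str], bool]) -> None:
        self._probe = probe  # performs the real OPTIONS request
        self._known: Dict[str, bool] = {}

    def supports_v2(self, repos_root: str) -> bool:
        # Only probe the first time a repository root is seen.
        if repos_root not in self._known:
            self._known[repos_root] = self._probe(repos_root)
        return self._known[repos_root]
```

    A stored answer can go stale if the server is later upgraded or
    downgraded, so a real implementation would need some way to
    invalidate it, for example when a v2 request fails.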