src/replication/conflicts.rst - couchdb-documentation - Git at Google

 .. Licensed under the Apache License, Version 2.0 (the "License"); you may not
 .. use this file except in compliance with the License. You may obtain a copy of
 .. the License at
 ..
 ..   http://www.apache.org/licenses/LICENSE-2.0
 ..
 .. Unless required by applicable law or agreed to in writing, software
 .. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 .. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 .. License for the specific language governing permissions and limitations under
 .. the License.

 .. _replication/conflicts:

 ==============================
 Replication and conflict model
 ==============================

 Let's take the following example to illustrate replication and conflict
 handling.

 - Alice has a document containing Bob's business card;
 - She synchronizes it between her desktop PC and her laptop;
 - On the desktop PC, she updates Bob's E-mail address;
   Without syncing again, she updates Bob's mobile number on the laptop;
 - Then she replicates the two to each other again.

 So on the desktop the document has Bob's new E-mail address and his old mobile
 number, and on the laptop it has his old E-mail address and his new mobile
 number.

 The question is, what happens to these conflicting updated documents?

 CouchDB replication
 ===================

 CouchDB works with JSON documents inside databases. Replication of databases
 takes place over HTTP, and can be either a "pull" or a "push", but is
 unidirectional. So the easiest way to perform a full sync is to do a "push"
 followed by a "pull" (or vice versa).

 So, Alice creates v1 and sync it. She updates to v2a on one side and v2b on the
 other, and then replicates. What happens?

 The answer is simple: both versions exist on both sides!

 .. code-block:: text

       DESKTOP                          LAPTOP
     +---------+
     | /db/bob |                                     INITIAL
     |   v1    |                                     CREATION
     +---------+

     +---------+                      +---------+
     | /db/bob |  ----------------->  | /db/bob |     PUSH
     |   v1    |                      |   v1    |
     +---------+                      +---------+

     +---------+                      +---------+  INDEPENDENT
     | /db/bob |                      | /db/bob |     LOCAL
     |   v2a   |                      |   v2b   |     EDITS
     +---------+                      +---------+

     +---------+                      +---------+
     | /db/bob |  ----------------->  | /db/bob |     PUSH
     |   v2a   |                      |   v2a   |
     +---------+                      |   v2b   |
                                      +---------+

     +---------+                      +---------+
     | /db/bob |  <-----------------  | /db/bob |     PULL
     |   v2a   |                      |   v2a   |
     |   v2b   |                      |   v2b   |
     +---------+                      +---------+

 After all, this is not a file system, so there's no restriction that only one
 document can exist with the name /db/bob. These are just "conflicting" revisions
 under the same name.

 Because the changes are always replicated, the data is safe. Both machines have
 identical copies of both documents, so failure of a hard drive on either side
 won't lose any of the changes.

 Another thing to notice is that peers do not have to be configured or tracked.
 You can do regular replications to peers, or you can do one-off, ad-hoc pushes
 or pulls. After the replication has taken place, there is no record kept of
 which peer any particular document or revision came from.

 So the question now is: what happens when you try to read /db/bob? By default,
 CouchDB picks one arbitrary revision as the "winner", using a deterministic
 algorithm so that the same choice will be made on all peers. The same happens
 with views: the deterministically-chosen winner is the only revision fed into
 your map function.

 Let's say that the winner is v2a. On the desktop, if Alice reads the document
 she'll see v2a, which is what she saved there. But on the laptop, after
 replication, she'll also see only v2a. It could look as if the changes she made
 there have been lost - but of course they have not, they have just been hidden
 away as a conflicting revision. But eventually she'll need these changes merged
 into Bob's business card, otherwise they will effectively have been lost.

 Any sensible business-card application will, at minimum, have to present the
 conflicting versions to Alice and allow her to create a new version
 incorporating information from them all. Ideally it would merge the updates
 itself.

 Conflict avoidance
 ==================

 When working on a single node, CouchDB will avoid creating conflicting revisions
 by returning a :statuscode:`409` error. This is because, when you
 PUT a new version of a document, you must give the ``_rev`` of the previous
 version. If that ``_rev`` has already been superseded, the update is rejected
 with a  :statuscode:`409` response.

 So imagine two users on the same node are fetching Bob's business card, updating
 it concurrently, and writing it back:

 .. code-block:: text

     USER1    ----------->  GET /db/bob
              <-----------  {"_rev":"1-aaa", ...}

     USER2    ----------->  GET /db/bob
              <-----------  {"_rev":"1-aaa", ...}

     USER1    ----------->  PUT /db/bob?rev=1-aaa
              <-----------  {"_rev":"2-bbb", ...}

     USER2    ----------->  PUT /db/bob?rev=1-aaa
              <-----------  409 Conflict  (not saved)

 User2's changes are rejected, so it's up to the app to fetch /db/bob again,
 and either:

 #. apply the same changes as were applied to the earlier revision, and submit
    a new PUT
 #. redisplay the document so the user has to edit it again
 #. just overwrite it with the document being saved before (which is not
    advisable, as user1's changes will be silently lost)

 So when working in this mode, your application still has to be able to handle
 these conflicts and have a suitable retry strategy, but these conflicts never
 end up inside the database itself.

 Revision tree
 =============

 When you update a document in CouchDB, it keeps a list of the previous
 revisions. In the case where conflicting updates are introduced, this history
 branches into a tree, where the current conflicting revisions for this document
 form the tips (leaf nodes) of this tree:

 .. code-block:: text

       ,--> r2a
     r1 --> r2b
       `--> r2c

 Each branch can then extend its history - for example if you read revision r2b
 and then PUT with ?rev=r2b then you will make a new revision along that
 particular branch.

 .. code-block:: text

       ,--> r2a -> r3a -> r4a
     r1 --> r2b -> r3b
       `--> r2c -> r3c

 Here, (r4a, r3b, r3c) are the set of conflicting revisions. The way you resolve
 a conflict is to delete the leaf nodes along the other branches. So when you
 combine (r4a+r3b+r3c) into a single merged document, you would replace r4a and
 delete r3b and r3c.

 .. code-block:: text

       ,--> r2a -> r3a -> r4a -> r5a
     r1 --> r2b -> r3b -> (r4b deleted)
       `--> r2c -> r3c -> (r4c deleted)

 Note that r4b and r4c still exist as leaf nodes in the history tree, but as
 deleted docs. You can retrieve them but they will be marked ``"_deleted":true``.

 When you compact a database, the bodies of all the non-leaf documents are
 discarded. However, the list of historical _revs is retained, for the benefit of
 later conflict resolution in case you meet any old replicas of the database at
 some time in future. There is "revision pruning" to stop this getting
 arbitrarily large.

 Working with conflicting documents
 ==================================

 The basic :get:`/{doc}/{docid}` operation will not show you any
 information about conflicts. You see only the deterministically-chosen winner,
 and get no indication as to whether other conflicting revisions exist or not:

 .. code-block:: javascript

     {
         "_id":"test",
         "_rev":"2-b91bb807b4685080c6a651115ff558f5",
         "hello":"bar"
     }

 If you do ``GET /db/test?conflicts=true``, and the document is in a conflict
 state, then you will get the winner plus a _conflicts member containing an array
 of the revs of the other, conflicting revision(s). You can then fetch them
 individually using subsequent ``GET /db/test?rev=xxxx`` operations:

 .. code-block:: javascript

     {
         "_id":"test",
         "_rev":"2-b91bb807b4685080c6a651115ff558f5",
         "hello":"bar",
         "_conflicts":[
             "2-65db2a11b5172bf928e3bcf59f728970",
             "2-5bc3c6319edf62d4c624277fdd0ae191"
         ]
     }

 If you do ``GET /db/test?open_revs=all`` then you will get all the leaf nodes of
 the revision tree. This will give you all the current conflicts, but will also
 give you leaf nodes which have been deleted (i.e. parts of the conflict history
 which have since been resolved). You can remove these by filtering out documents
 with ``"_deleted":true``:

 .. code-block:: javascript

     [
         {"ok":{"_id":"test","_rev":"2-5bc3c6319edf62d4c624277fdd0ae191","hello":"foo"}},
         {"ok":{"_id":"test","_rev":"2-65db2a11b5172bf928e3bcf59f728970","hello":"baz"}},
         {"ok":{"_id":"test","_rev":"2-b91bb807b4685080c6a651115ff558f5","hello":"bar"}}
     ]

 The ``"ok"`` tag is an artifact of ``open_revs``, which also lets you list
 explicit revisions as a JSON array, e.g. ``open_revs=[rev1,rev2,rev3]``. In this
 form, it would be possible to request a revision which is now missing, because
 the database has been compacted.

 .. note::
     The order of revisions returned by ``open_revs=all`` is **NOT** related to
     the deterministic "winning" algorithm. In the above example, the winning
     revision is 2-b91b... and happens to be returned last, but in other cases it
     can be returned in a different position.

 Once you have retrieved all the conflicting revisions, your application can then
 choose to display them all to the user. Or it could attempt to merge them, write
 back the merged version, and delete the conflicting versions - that is, to
 resolve the conflict permanently.

 As described above, you need to update one revision and delete all the
 conflicting revisions explicitly. This can be done using a single `POST` to
 ``_bulk_docs``, setting ``"_deleted":true`` on those revisions you wish to
 delete.

 Multiple document API
 =====================

 Finding conflicted documents with Mango
 ---------------------------------------

 .. versionadded:: 2.2.0

 CouchDB's :ref:`Mango system <api/db/_find>` allows easy querying of
 documents with conflicts, returning the full body of each document as well.

 Here's how to use it to find all conflicts in a database:

 .. code-block:: bash

     $ curl -X POST http://127.0.0.1/dbname/_find \
         -d '{"selector": {"_conflicts": { "$exists": true}}, "conflicts": true}' \
         -Hcontent-type:application/json

 .. code-block:: javascript

     {"docs": [
     {"_id":"doc","_rev":"1-3975759ccff3842adf690a5c10caee42","a":2,"_conflicts":["1-23202479633c2b380f79507a776743d5"]}
     ],
     "bookmark": "g1AAAABheJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYozA1kgKQ6YVA5QkBFMgKSVDHWNjI0MjEzMLc2MjZONkowtDNLMLU0NzBPNzc3MTYxTTLOysgCY2ReV"}

 The ``bookmark`` value can be used to navigate through additional pages of
 results if necessary. Mango by default only returns 25 results per request.

 If you expect to run this query often, be sure to create a Mango secondary
 index to speed the query:

 .. code-block:: bash

     $ curl -X POST http://127.0.0.1/dbname/_index \
         -d '{"index":{"fields": ["_conflicts"]}}' \
         -Hcontent-type:application/json

 Of course, the selector can be enhanced to filter documents on additional
 keys in the document. Be sure to add those keys to your secondary index as
 well, or a full database scan will be triggered.

 Finding conflicted documents using the ``_all_docs`` index
 ----------------------------------------------------------

 You can fetch multiple documents at once using ``include_docs=true`` on a view.
 However, a ``conflicts=true`` request is ignored; the "doc" part of the value
 never includes a ``_conflicts`` member. Hence you would need to do another query
 to determine for each document whether it is in a conflicting state:

 .. code-block:: bash

     $ curl 'http://127.0.0.1:5984/conflict_test/_all_docs?include_docs=true&conflicts=true'

 .. code-block:: javascript

     {
         "total_rows":1,
         "offset":0,
         "rows":[
             {
                 "id":"test",
                 "key":"test",
                 "value":{"rev":"2-b91bb807b4685080c6a651115ff558f5"},
                 "doc":{
                     "_id":"test",
                     "_rev":"2-b91bb807b4685080c6a651115ff558f5",
                     "hello":"bar"
                 }
             }
         ]
     }

 .. code-block:: bash

     $ curl 'http://127.0.0.1:5984/conflict_test/test?conflicts=true'

 .. code-block:: javascript

     {
         "_id":"test",
         "_rev":"2-b91bb807b4685080c6a651115ff558f5",
         "hello":"bar",
         "_conflicts":[
             "2-65db2a11b5172bf928e3bcf59f728970",
             "2-5bc3c6319edf62d4c624277fdd0ae191"
         ]
     }

 View map functions
 ==================

 Views only get the winning revision of a document. However they do also get a
 ``_conflicts`` member if there are any conflicting revisions. This means you can
 write a view whose job is specifically to locate documents with conflicts.
 Here is a simple map function which achieves this:

 .. code-block:: javascript

     function(doc) {
         if (doc._conflicts) {
             emit(null, [doc._rev].concat(doc._conflicts));
         }
     }

 which gives the following output:

 .. code-block:: javascript

     {
         "total_rows":1,
         "offset":0,
         "rows":[
             {
                 "id":"test",
                 "key":null,
                 "value":[
                     "2-b91bb807b4685080c6a651115ff558f5",
                     "2-65db2a11b5172bf928e3bcf59f728970",
                     "2-5bc3c6319edf62d4c624277fdd0ae191"
                 ]
             }
         ]
     }

 If you do this, you can have a separate "sweep" process which periodically scans
 your database, looks for documents which have conflicts, fetches the conflicting
 revisions, and resolves them.

 Whilst this keeps the main application simple, the problem with this approach is
 that there will be a window between a conflict being introduced and it being
 resolved. From a user's viewpoint, this may appear that the document they just
 saved successfully may suddenly lose their changes, only to be resurrected some
 time later. This may or may not be acceptable.

 Also, it's easy to forget to start the sweeper, or not to implement it properly,
 and this will introduce odd behaviour which will be hard to track down.

 CouchDB's "winning" revision algorithm may mean that information drops out of a
 view until a conflict has been resolved. Consider Bob's business card again;
 suppose Alice has a view which emits mobile numbers, so that her telephony
 application can display the caller's name based on caller ID. If there are
 conflicting documents with Bob's old and new mobile numbers, and they happen to
 be resolved in favour of Bob's old number, then the view won't be able to
 recognise his new one. In this particular case, the application might have
 preferred to put information from both the conflicting documents into the view,
 but this currently isn't possible.

 Suggested algorithm to fetch a document with conflict resolution:

 #. Get document via ``GET docid?conflicts=true`` request
 #. For each member in the ``_conflicts`` array call ``GET docid?rev=xxx``.
    If any errors occur at this stage, restart from step 1.
    (There could be a race where someone else has already resolved this conflict
    and deleted that rev)
 #. Perform application-specific merging
 #. Write ``_bulk_docs`` with an update to the first rev and deletes of the other
    revs.

 This could either be done on every read (in which case you could replace all
 calls to GET in your application with calls to a library which does the above),
 or as part of your sweeper code.

 And here is an example of this in Ruby using the low-level `RestClient`_:

 .. _RestClient: https://rubygems.org/gems/rest-client

 .. code-block:: ruby

     require 'rubygems'
     require 'rest_client'
     require 'json'
     DB="http://127.0.0.1:5984/conflict_test"

     # Write multiple documents
     def writem(docs)
         JSON.parse(RestClient.post("#{DB}/_bulk_docs", {
             "docs" => docs,
         }.to_json))
     end

     # Write one document, return the rev
     def write1(doc, id=nil, rev=nil)
         doc['_id'] = id if id
         doc['_rev'] = rev if rev
         writem([doc]).first['rev']
     end

     # Read a document, return *all* revs
     def read1(id)
         retries = 0
         loop do
             # FIXME: escape id
             res = [JSON.parse(RestClient.get("#{DB}/#{id}?conflicts=true"))]
             if revs = res.first.delete('_conflicts')
                 begin
                     revs.each do |rev|
                         res << JSON.parse(RestClient.get("#{DB}/#{id}?rev=#{rev}"))
                     end
                 rescue
                     retries += 1
                     raise if retries >= 5
                     next
                 end
             end
             return res
         end
     end

     # Create DB
     RestClient.delete DB rescue nil
     RestClient.put DB, {}.to_json

     # Write a document
     rev1 = write1({"hello"=>"xxx"},"test")
     p read1("test")

     # Make three conflicting versions
     write1({"hello"=>"foo"},"test",rev1)
     write1({"hello"=>"bar"},"test",rev1)
     write1({"hello"=>"baz"},"test",rev1)

     res = read1("test")
     p res

     # Now let's replace these three with one
     res.first['hello'] = "foo+bar+baz"
     res.each_with_index do |r,i|
         unless i == 0
             r.replace({'_id'=>r['_id'], '_rev'=>r['_rev'], '_deleted'=>true})
         end
     end
     writem(res)

     p read1("test")

 An application written this way never has to deal with a ``PUT 409``, and is
 automatically multi-master capable.

 You can see that it's straightforward enough when you know what you're doing.
 It's just that CouchDB doesn't currently provide a convenient HTTP API for
 "fetch all conflicting revisions", nor "PUT to supersede these N revisions", so
 you need to wrap these yourself. At the time of writing, there are no known
 client-side libraries which provide support for this.

 Merging and revision history
 ============================

 Actually performing the merge is an application-specific function. It depends
 on the structure of your data. Sometimes it will be easy: e.g. if a document
 contains a list which is only ever appended to, then you can perform a union of
 the two list versions.

 Some merge strategies look at the changes made to an object, compared to its
 previous version. This is how Git's merge function works.

 For example, to merge Bob's business card versions v2a and v2b, you could look
 at the differences between v1 and v2b, and then apply these changes to v2a as
 well.

 With CouchDB, you can sometimes get hold of old revisions of a document.
 For example, if you fetch ``/db/bob?rev=v2b&revs_info=true`` you'll get a list
 of the previous revision ids which ended up with revision v2b. Doing the same
 for v2a you can find their common ancestor revision. However if the database
 has been compacted, the content of that document revision will have been lost.
 ``revs_info`` will still show that v1 was an ancestor, but report it as
 "missing"::

     BEFORE COMPACTION           AFTER COMPACTION

          ,-> v2a                     v2a
        v1
          `-> v2b                     v2b

 So if you want to work with diffs, the recommended way is to store those diffs
 within the new revision itself. That is: when you replace v1 with v2a, include
 an extra field or attachment in v2a which says which fields were changed from
 v1 to v2a. This unfortunately does mean additional book-keeping for your
 application.

 Comparison with other replicating data stores
 =============================================

 The same issues arise with other replicating systems, so it can be instructive
 to look at these and see how they compare with CouchDB. Please feel free to add
 other examples.

 Unison
 ------

 `Unison`_ is a bi-directional file synchronisation tool. In this case, the
 business card would be a file, say `bob.vcf`.

 .. _Unison: http://www.cis.upenn.edu/~bcpierce/unison/

 When you run unison, changes propagate both ways. If a file has changed on one
 side but not the other, the new replaces the old. Unison maintains a local state
 file so that it knows whether a file has changed since the last successful
 replication.

 In our example it has changed on both sides. Only one file called `bob.vcf`
 can exist within the file system. Unison solves the problem by simply ducking
 out: the user can choose to replace the remote version with the local version,
 or vice versa (both of which would lose data), but the default action is to
 leave both sides unchanged.

 From Alice's point of view, at least this is a simple solution. Whenever she's
 on the desktop she'll see the version she last edited on the desktop, and
 whenever she's on the laptop she'll see the version she last edited there.

 But because no replication has actually taken place, the data is not protected.
 If her laptop hard drive dies, she'll lose all her changes made on the laptop;
 ditto if her desktop hard drive dies.

 It's up to her to copy across one of the versions manually (under a different
 filename), merge the two, and then finally push the merged version to the other
 side.

 Note also that the original file (version v1) has been lost at this point.
 So it's not going to be known from inspection alone whether v2a or v2b has the
 most up-to-date E-mail address for Bob, or which version has the most up-to-date
 mobile number. Alice has to remember which one she entered last.

 Git
 ---

 `Git`_ is a well-known distributed source control system. Like Unison, Git deals
 with files. However, Git considers the state of a whole set of files as a single
 object, the "tree". Whenever you save an update, you create a "commit" which
 points to both the updated tree and the previous commit(s), which in turn point
 to the previous tree(s). You therefore have a full history of all the states of
 the files. This history forms a branch, and a pointer is kept to the tip of the
 branch, from which you can work backwards to any previous state. The "pointer"
 is an SHA1 hash of the tip commit.

 .. _Git: http://git-scm.com/

 If you are replicating with one or more peers, a separate branch is made for
 each of those peers. For example, you might have::

     main               -- my local branch
     remotes/foo/main   -- branch on peer 'foo'
     remotes/bar/main   -- branch on peer 'bar'

 In the regular workflow, replication is a "pull", importing changes from
 a remote peer into the local repository. A "pull" does two things: first "fetch"
 the state of the peer into the remote tracking branch for that peer; and then
 attempt to "merge" those changes into the local branch.

 Now let's consider the business card. Alice has created a Git repo containing
 ``bob.vcf``, and cloned it across to the other machine. The branches look like
 this, where ``AAAAAAAA`` is the SHA1 of the commit::

     ---------- desktop ----------           ---------- laptop ----------
     main: AAAAAAAA                        main: AAAAAAAA
     remotes/laptop/main: AAAAAAAA         remotes/desktop/main: AAAAAAAA

 Now she makes a change on the desktop, and commits it into the desktop repo;
 then she makes a different change on the laptop, and commits it into the laptop
 repo::

     ---------- desktop ----------           ---------- laptop ----------
     main: BBBBBBBB                        main: CCCCCCCC
     remotes/laptop/main: AAAAAAAA         remotes/desktop/main: AAAAAAAA

 Now on the desktop she does ``git pull laptop``. First, the remote objects
 are copied across into the local repo and the remote tracking branch is
 updated::

     ---------- desktop ----------           ---------- laptop ----------
     main: BBBBBBBB                        main: CCCCCCCC
     remotes/laptop/main: CCCCCCCC         remotes/desktop/main: AAAAAAAA

 .. note::
     The repo still contains AAAAAAAA because commits BBBBBBBB and CCCCCCCC
     point to it.

 Then Git will attempt to merge the changes in. Knowing that
 the parent commit to ``CCCCCCCC`` is ``AAAAAAAA``, it takes a diff between
 ``AAAAAAAA`` and ``CCCCCCCC`` and tries to apply it to ``BBBBBBBB``.

 If this is successful, then you'll get a new version with a merge commit::

     ---------- desktop ----------           ---------- laptop ----------
     main: DDDDDDDD                        main: CCCCCCCC
     remotes/laptop/main: CCCCCCCC         remotes/desktop/main: AAAAAAAA

 Then Alice has to logon to the laptop and run ``git pull desktop``. A similar
 process occurs. The remote tracking branch is updated::

     ---------- desktop ----------           ---------- laptop ----------
     main: DDDDDDDD                        main: CCCCCCCC
     remotes/laptop/main: CCCCCCCC         remotes/desktop/main: DDDDDDDD

 Then a merge takes place. This is a special case: ``CCCCCCCC`` is one of the
 parent commits of ``DDDDDDDD``, so the laptop can `fast forward` update from
 ``CCCCCCCC`` to ``DDDDDDDD`` directly without having to do any complex merging.
 This leaves the final state as::

     ---------- desktop ----------           ---------- laptop ----------
     main: DDDDDDDD                        main: DDDDDDDD
     remotes/laptop/main: CCCCCCCC         remotes/desktop/main: DDDDDDDD

 Now this is all and good, but you may wonder how this is relevant when thinking
 about CouchDB.

 First, note what happens in the case when the merge algorithm fails.
 The changes are still propagated from the remote repo into the local one, and
 are available in the remote tracking branch. So, unlike Unison, you know the
 data is protected. It's just that the local working copy may fail to update, or
 may diverge from the remote version. It's up to you to create and commit the
 combined version yourself, but you are guaranteed to have all the history you
 might need to do this.

 Note that while it is possible to build new merge algorithms into Git,
 the standard ones are focused on line-based changes to source code. They don't
 work well for XML or JSON if it's presented without any line breaks.

 The other interesting consideration is multiple peers. In this case you have
 multiple remote tracking branches, some of which may match your local branch,
 some of which may be behind you, and some of which may be ahead of you
 (i.e. contain changes that you haven't yet merged)::

     main: AAAAAAAA
     remotes/foo/main: BBBBBBBB
     remotes/bar/main: CCCCCCCC
     remotes/baz/main: AAAAAAAA

 Note that each peer is explicitly tracked, and therefore has to be explicitly
 created. If a peer becomes stale or is no longer needed, it's up to you to
 remove it from your configuration and delete the remote tracking branch.
 This is different from CouchDB, which doesn't keep any peer state in the
 database.

 Another difference between CouchDB and Git is that it maintains all history
 back to time
 zero - Git compaction keeps diffs between all those versions in order to reduce
 size, but CouchDB discards them. If you are constantly updating a document,
 the size of a Git repo would grow forever. It is possible (with some effort)
 to use "history rewriting" to make Git forget commits earlier than a particular
 one.

 .. _replication/conflicts/git:

 What is the CouchDB replication protocol? Is it like Git?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 :Author: Jason Smith
 :Date: 2011-01-29
 :Source: `StackOverflow <http://stackoverflow.com/questions/4766391/what-is-the-couchdb-replication-protocol-is-it-like-git>`_

 **Key points**

 **If you know Git, then you know how Couch replication works.** Replicating is
 *very* similar to pushing or pulling with distributed source managers like Git.

 **CouchDB replication does not have its own protocol.** A replicator simply
 connects to two DBs as a client, then reads from one and writes to the other.
 Push replication is reading the local data and updating the remote DB;
 pull replication is vice versa.

 * **Fun fact 1**: The replicator is actually an independent Erlang application,
   in its own process. It connects to both couches, then reads records from one
   and writes them to the other.
 * **Fun fact 2**: CouchDB has no way of knowing who is a normal client and who
   is a replicator (let alone whether the replication is push or pull).
   It all looks like client connections. Some of them read records. Some of them
   write records.

 **Everything flows from the data model**

 The replication algorithm is trivial, uninteresting. A trained monkey could
 design it. It's simple because the cleverness is the data model, which has these
 useful characteristics:

 #. Every record in CouchDB is completely independent of all others. That sucks
    if you want to do a JOIN or a transaction, but it's awesome if you want to
    write a replicator. Just figure out how to replicate one record, and then
    repeat that for each record.
 #. Like Git, records have a linked-list revision history. A record's revision ID
    is the checksum of its own data. Subsequent revision IDs are checksums of:
    the new data, plus the revision ID of the previous.

 #. In addition to application data (``{"name": "Jason", "awesome": true}``),
    every record stores the evolutionary time line of all previous revision IDs
    leading up to itself.

    - Exercise: Take a moment of quiet reflection. Consider any two different
      records, A and B. If A's revision ID appears in B's time line, then B
      definitely evolved from A. Now consider Git's fast-forward merges.
      Do you hear that? That is the sound of your mind being blown.

 #. Git isn't really a linear list. It has forks, when one parent has multiple
    children. CouchDB has that too.

    - Exercise: Compare two different records, A and B. A's revision ID does not
      appear in B's time line; however, one revision ID, C, is in both A's and
      B's time line. Thus A didn't evolve from B. B didn't evolve from A. But
      rather, A and B have a common ancestor C. In Git, that is a "fork." In
      CouchDB, it's a "conflict."

    - In Git, if both children go on to develop their time lines independently,
      that's cool. Forks totally support that.
    - In CouchDB, if both children go on to develop their time lines
      independently, that cool too. Conflicts totally support that.
    - **Fun fact 3**: CouchDB "conflicts" do not correspond to Git "conflicts."
      A Couch conflict is a divergent revision history, what Git calls a "fork."
      For this reason the CouchDB community pronounces "conflict" with a silent
      `n`: "co-flicked."

 #. Git also has merges, when one child has multiple parents. CouchDB *sort* of
    has that too.

    - **In the data model, there is no merge.** The client simply marks one
      time line as deleted and continues to work with the only extant time line.
    - **In the application, it feels like a merge.** Typically, the client merges
      the *data* from each time line in an application-specific way.
      Then it writes the new data to the time line. In Git, this is like copying
      and pasting the changes from branch A into branch B, then committing to
      branch B and deleting branch A. The data was merged, but there was no
      `git merge`.
    - These behaviors are different because, in Git, the time line itself is
      important; but in CouchDB, the data is important and the time line is
      incidental—it's just there to support replication. That is one reason why
      CouchDB's built-in revisioning is inappropriate for storing revision data
      like a wiki page.

 **Final notes**

 At least one sentence in this writeup (possibly this one) is complete BS.