| .. Licensed under the Apache License, Version 2.0 (the "License"); you may not |
| .. use this file except in compliance with the License. You may obtain a copy of |
| .. the License at |
| .. |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| .. |
| .. Unless required by applicable law or agreed to in writing, software |
| .. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT |
| .. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the |
| .. License for the specific language governing permissions and limitations under |
| .. the License. |
| |
| .. _intro/consistency: |
| |
| ==================== |
| Eventual Consistency |
| ==================== |
| |
| In the previous document :ref:`intro/why`, we saw that CouchDB's flexibility |
| allows us to evolve our data as our applications grow and change. In this topic, |
| we'll explore how working "with the grain" of CouchDB promotes simplicity in |
| our applications and helps us naturally build scalable, distributed systems. |
| |
| Working with the Grain |
| ====================== |
| |
| A *distributed system* is a system that operates robustly over a wide network. |
| A particular feature of network computing is that network links can |
| potentially disappear, and there are plenty of strategies for managing this |
| type of network segmentation. CouchDB differs from others by accepting |
| eventual consistency, as opposed to putting absolute consistency ahead of raw |
| availability, like `RDBMS`_ or `Paxos`_. What these systems have in common is |
| an awareness that data acts differently when many people are accessing it |
| simultaneously. Their approaches differ when it comes to which aspects of |
| *consistency*, *availability*, or *partition* tolerance they prioritize. |
| |
| Engineering distributed systems is tricky. Many of the caveats and "gotchas" |
| you will face over time aren't immediately obvious. We don't have all the |
| solutions, and CouchDB isn't a panacea, but when you work with CouchDB's |
| grain rather than against it, the path of least resistance leads you to |
| naturally scalable applications. |
| |
| Of course, building a distributed system is only the beginning. A website |
| with a database that is available only half the time is next to worthless. |
| Unfortunately, the traditional relational database approach to consistency |
| makes it very easy for application programmers to rely on global state, |
| global clocks, and other high availability no-nos, without even realizing |
| that they're doing so. Before examining how CouchDB promotes scalability, |
| we'll look at the constraints faced by a distributed system. After we've seen |
| the problems that arise when parts of your application can't rely on being |
| in constant contact with each other, we'll see that CouchDB provides an |
| intuitive and useful way for modeling applications around high availability. |
| |
| .. _RDBMS: http://en.wikipedia.org/wiki/Relational_database_management_system |
| .. _Paxos: http://en.wikipedia.org/wiki/Paxos_%28computer_science%29 |
| |
| .. _cap: |
| |
| The CAP Theorem |
| =============== |
| |
| The CAP theorem describes a few different strategies for distributing |
| application logic across networks. CouchDB's solution uses replication to |
| propagate application changes across participating nodes. This is a |
| fundamentally different approach from consensus algorithms and relational |
| databases, which operate at different intersections of consistency, |
| availability, and partition tolerance. |
| |
| The CAP theorem, shown in :ref:`intro/consistency-01`, |
| identifies three distinct concerns: |
| |
| - **Consistency**: |
| All database clients see the same data, even with concurrent updates. |
| - **Availability**: |
| All database clients are able to access some version of the data. |
| - **Partition tolerance**: |
| The database can be split over multiple servers. |
| |
| Pick two. |
| |
| .. _intro/consistency-01: |
| |
| .. figure:: ../../images/intro-consistency-01.png |
| :align: center |
| :alt: The CAP theorem |
| |
| Figure 1. The CAP theorem |
| |
| When a system grows large enough that a single database node is unable to |
| handle the load placed on it, a sensible solution is to add more servers. |
| When we add nodes, we have to start thinking about how to partition data |
| between them. Do we have a few databases that share exactly the same data? |
| Do we put different sets of data on different database servers? |
| Do we let only certain database servers write data and let others handle |
| the reads? |
| |
| Regardless of which approach we take, the one problem we'll keep bumping into |
| is that of keeping all these database servers in sync. If you write some |
| information to one node, how are you going to make sure that a read request |
| to another database server reflects this newest information? These events |
| might be milliseconds apart. Even with a modest collection of database |
| servers, this problem can become extremely complex. |
| |
| When it's absolutely critical that all clients see a consistent view of the |
| database, the users of one node will have to wait for any other nodes to come |
| into agreement before being able to read or write to the database. |
| In this instance, we see that availability takes a backseat to consistency. |
| However, there are situations where availability trumps consistency: |
| |
| Each node in a system should be able to make decisions purely based on |
| local state. If you need to do something under high load with failures |
| occurring and you need to reach agreement, you're lost. If you're |
| concerned about scalability, any algorithm that forces you to run |
| agreement will eventually become your bottleneck. Take that as a given. |
| |
| -- Werner Vogels, Amazon CTO and Vice President |
| |
| If availability is a priority, we can let clients write data to one node of |
| the database without waiting for other nodes to come into agreement. |
| If the database knows how to take care of reconciling these operations between |
| nodes, we achieve a sort of "eventual consistency" in exchange for high |
| availability. This is a surprisingly applicable trade-off for many applications. |
| |
| Unlike traditional relational databases, where each action performed is |
| necessarily subject to database-wide consistency checks, |
| CouchDB makes it really simple to build applications that sacrifice immediate |
| consistency for the huge performance improvements that come with simple |
| distribution. |
| |
| Local Consistency |
| ================= |
| |
| Before we attempt to understand how CouchDB operates in a cluster, |
| it's important that we understand the inner workings of a single CouchDB node. |
| The CouchDB API is designed to provide a convenient but thin wrapper around |
| the database core. By taking a closer look at the structure of the database |
| core, we'll have a better understanding of the API that surrounds it. |
| |
| The Key to Your Data |
| -------------------- |
| |
| At the heart of CouchDB is a powerful *B-tree* storage engine. |
| A B-tree is a sorted data structure that allows for searches, insertions, |
| and deletions in logarithmic time. As :ref:`intro/consistency-02` |
| illustrates, CouchDB uses this B-tree storage engine for all internal data, |
| documents, and views. If we understand one, we will understand them all. |
| |
| .. _intro/consistency-02: |
| |
| .. figure:: ../../images/intro-consistency-02.png |
| :align: center |
| :alt: Anatomy of a view request |
| |
| Figure 2. Anatomy of a view request |
| |
| CouchDB uses MapReduce to compute the results of a view. MapReduce makes use |
| of two functions, "map" and "reduce", which are applied to each document in |
| isolation. Being able to isolate these operations means that view computation |
| lends itself to parallel and incremental computation. More important, |
| because these functions produce key/value pairs, CouchDB is able to insert |
| them into the B-tree storage engine, sorted by key. Lookups by key, |
| or key range, are extremely efficient operations with a B-tree, |
| described in `big O` notation as ``O(log N)`` and ``O(log N + K)``, |
| respectively. |
| |
| In CouchDB, we access documents and view results by key or key range. |
| This is a direct mapping to the underlying operations performed on CouchDB's |
| B-tree storage engine. Along with document inserts and updates, |
| this direct mapping is the reason we describe CouchDB's API as being a thin |
| wrapper around the database core. |
| |
| Being able to access results by key alone is a very important restriction |
| because it allows us to make huge performance gains. As well as the massive |
| speed improvements, we can partition our data over multiple nodes, |
| without affecting our ability to query each node in isolation. |
| `BigTable`_, `Hadoop`_, `SimpleDB`_, and `memcached`_ restrict object lookups |
| by key for exactly these reasons. |
| |
| .. _BigTable: http://en.wikipedia.org/wiki/BigTable |
| .. _Hadoop: http://hadoop.apache.org |
| .. _SimpleDB: http://aws.amazon.com/simpledb/ |
| .. _memcached: http://memcached.org |
| |
| No Locking |
| ---------- |
| |
| A table in a relational database is a single data structure. If you want to |
| modify a table -- say, update a row -- the database system must ensure |
| that nobody else is trying to update that row and that nobody can read from |
| that row while it is being updated. The common way to handle this uses what's |
| known as a lock. If multiple clients want to access a table, the first client |
| gets the lock, making everybody else wait. When the first client's request is |
| processed, the next client is given access while everybody else waits, |
| and so on. This serial execution of requests, even when they arrived in |
| parallel, wastes a significant amount of your server's processing power. |
| Under high load, a relational database can spend more time figuring out who |
| is allowed to do what, and in which order, than it does doing any actual work. |
| |
| .. note:: |
| Modern relational databases avoid locks by implementing MVCC under |
| the hood, but hide it from the end user, requiring them to coordinate |
| concurrent changes of single rows or fields. |
| |
| Instead of locks, CouchDB uses `Multi-Version Concurrency Control` (MVCC) to |
| manage concurrent access to the database. :ref:`intro/consistency-03` |
| illustrates the differences between MVCC and traditional locking mechanisms. |
| MVCC means that CouchDB can run at full speed, all the time, |
| even under high load. Requests are run in parallel, making excellent use of |
| every last drop of processing power your server has to offer. |
| |
| .. _intro/consistency-03: |
| |
| .. figure:: ../../images/intro-consistency-03.png |
| :align: center |
| :alt: MVCC means no locking |
| |
| Figure 3. MVCC means no locking |
| |
| Documents in CouchDB are versioned, much like they would be in a regular |
| version control system such as `Subversion`_. If you want to change |
| a value in a document, you create an entire new version of that document |
| and save it over the old one. After doing this, you end up with two versions |
| of the same document, one old and one new. |
| |
| How does this offer an improvement over locks? Consider a set of requests |
| wanting to access a document. The first request reads the document. |
| While this is being processed, a second request changes the document. |
| Since the second request includes a completely new version of the document, |
| CouchDB can simply append it to the database without having to wait for the |
| read request to finish. |
| |
| When a third request wants to read the same document, CouchDB will point it |
| to the new version that has just been written. During this whole process, |
| the first request could still be reading the original version. |
| |
| A read request will always see the most recent snapshot of your database at |
| the time of the beginning of the request. |
| |
| .. _Subversion: http://subversion.apache.org/ |
| |
| Validation |
| ========== |
| |
| As application developers, we have to think about what sort of input we |
| should accept and what we should reject. The expressive power to do this type |
| of validation over complex data within a traditional relational database |
| leaves a lot to be desired. Fortunately, CouchDB provides a powerful way to |
| perform per-document validation from within the database. |
| |
| CouchDB can validate documents using JavaScript functions similar to those |
| used for MapReduce. Each time you try to modify a document, |
| CouchDB will pass the validation function a copy of the existing document, |
| a copy of the new document, and a collection of additional information, |
| such as user authentication details. The validation function now has the |
| opportunity to approve or deny the update. |
| |
| By working with the grain and letting CouchDB do this for us, |
| we save ourselves a tremendous amount of CPU cycles that would otherwise have |
| been spent serializing object graphs from SQL, converting them into domain |
| objects, and using those objects to do application-level validation. |
| |
| Distributed Consistency |
| ======================= |
| |
| Maintaining consistency within a single database node is relatively easy for |
| most databases. The real problems start to surface when you try to maintain |
| consistency between multiple database servers. If a client makes a write |
| operation on server `A`, how do we make sure that this is consistent with |
| server `B`, or `C`, or `D`? For relational databases, this is a very complex |
| problem with entire books devoted to its solution. You could use |
| multi-master, single-master, partitioning, sharding, write-through caches, |
| and all sorts of other complex techniques. |
| |
| Incremental Replication |
| ======================= |
| |
| CouchDB's operations take place within the context of a single document. |
| As CouchDB achieves eventual consistency between multiple databases by using |
| incremental replication you no longer have to worry about your database |
| servers being able to stay in constant communication. Incremental replication |
| is a process where document changes are periodically copied between servers. |
| We are able to build what's known as a *shared nothing* cluster of databases |
| where each node is independent and self-sufficient, leaving no single point |
| of contention across the system. |
| |
| Need to scale out your CouchDB database cluster? Just throw in another server. |
| |
| As illustrated in :ref:`intro/consistency-04`, with CouchDB's incremental |
| replication, you can synchronize your data between any two databases however |
| you like and whenever you like. After replication, each database is able |
| to work independently. |
| |
| You could use this feature to synchronize database servers within a cluster |
| or between data centers using a job scheduler such as cron, |
| or you could use it to synchronize data with your laptop for offline work as |
| you travel. Each database can be used in the usual fashion, |
| and changes between databases can be synchronized later in both directions. |
| |
| .. _intro/consistency-04: |
| |
| .. figure:: ../../images/intro-consistency-04.png |
| :align: center |
| :alt: Incremental replication between CouchDB nodes |
| |
| Figure 4. Incremental replication between CouchDB nodes |
| |
| What happens when you change the same document in two different databases and |
| want to synchronize these with each other? CouchDB's replication system |
| comes with automatic conflict detection and resolution. When CouchDB detects |
| that a document has been changed in both databases, it flags this document |
| as being in conflict, much like they would be in a regular version control |
| system. |
| |
| This isn't as troublesome as it might first sound. When two versions of a |
| document conflict during replication, the winning version is saved as the |
| most recent version in the document's history. Instead of throwing the losing |
| version away, as you might expect, CouchDB saves this as a previous version |
| in the document's history, so that you can access it if you need to. This |
| happens automatically and consistently, so both databases will make exactly |
| the same choice. |
| |
| It is up to you to handle conflicts in a way that makes sense for your |
| application. You can leave the chosen document versions in place, |
| revert to the older version, or try to merge the two versions and save the |
| result. |
| |
| Case Study |
| ========== |
| |
| Greg Borenstein, a friend and coworker, built a small library for converting |
| Songbird playlists to JSON objects and decided to store these in CouchDB as |
| part of a backup application. The completed software uses CouchDB's MVCC and |
| document revisions to ensure that Songbird playlists are backed up robustly |
| between nodes. |
| |
| .. note:: |
| `Songbird`_ is a free software media player with an integrated web browser, |
| based on the Mozilla XULRunner platform. Songbird is available for Microsoft |
| Windows, Apple Mac OS X, Solaris, and Linux. |
| |
| .. _Songbird: http://en.wikipedia.org/wiki/Songbird_%28software%29 |
| |
| Let's examine the workflow of the Songbird backup application, |
| first as a user backing up from a single computer, and then using Songbird to |
| synchronize playlists between multiple computers. We'll see how document |
| revisions turn what could have been a hairy problem into something that *just |
| works*. |
| |
| The first time we use this backup application, we feed our playlists to the |
| application and initiate a backup. Each playlist is converted to a JSON |
| object and handed to a CouchDB database. As illustrated in |
| :ref:`intro/consistency-05`, CouchDB hands back the document ID and |
| revision of each playlist as it's saved to the database. |
| |
| .. _intro/consistency-05: |
| |
| .. figure:: ../../images/intro-consistency-05.png |
| :align: center |
| :alt: Backing up to a single database |
| |
| Figure 5. Backing up to a single database |
| |
| After a few days, we find that our playlists have been updated and we want to |
| back up our changes. After we have fed our playlists to the backup |
| application, it fetches the latest versions from CouchDB, |
| along with the corresponding document revisions. When the application hands |
| back the new playlist document, CouchDB requires that the document revision |
| is included in the request. |
| |
| CouchDB then makes sure that the document revision handed to it in the |
| request matches the current revision held in the database. Because CouchDB |
| updates the revision with every modification, if these two are out of sync it |
| suggests that someone else has made changes to the document between the time |
| we requested it from the database and the time we sent our updates. Making |
| changes to a document after someone else has modified it without first |
| inspecting those changes is usually a bad idea. |
| |
| Forcing clients to hand back the correct document revision is the heart of |
| CouchDB's optimistic concurrency. |
| |
| We have a laptop we want to keep synchronized with our desktop computer. |
| With all our playlists on our desktop, the first step is to |
| "restore from backup" onto our laptop. This is the first time we've done this, |
| so afterward our laptop should hold an exact replica of our desktop playlist |
| collection. |
| |
| After editing our Argentine Tango playlist on our laptop to add a few new |
| songs we've purchased, we want to save our changes. The backup application |
| replaces the playlist document in our laptop CouchDB database and a new |
| document revision is generated. A few days later, we remember our new songs |
| and want to copy the playlist across to our desktop computer. As illustrated |
| in :ref:`intro/consistency-06`, the backup application copies the new document |
| and the new revision to the desktop CouchDB database. Both CouchDB databases |
| now have the same document revision. |
| |
| .. _intro/consistency-06: |
| |
| .. figure:: ../../images/intro-consistency-06.png |
| :align: center |
| :alt: Synchronizing between two databases |
| |
| Figure 6. Synchronizing between two databases |
| |
| Because CouchDB tracks document revisions, it ensures that updates like these |
| will work only if they are based on current information. If we had made |
| modifications to the playlist backups between synchronization, |
| things wouldn't go as smoothly. |
| |
| We back up some changes on our laptop and forget to synchronize. A few days |
| later, we're editing playlists on our desktop computer, make a backup, |
| and want to synchronize this to our laptop. As illustrated in |
| :ref:`intro/consistency-07`, when our backup application tries to replicate |
| between the two databases, CouchDB sees that the changes being sent from our |
| desktop computer are modifications of out-of-date documents and helpfully |
| informs us that there has been a conflict. |
| |
| Recovering from this error is easy to accomplish from an application |
| perspective. Just download CouchDB's version of the playlist and provide an |
| opportunity to merge the changes or save local modifications into a new |
| playlist. |
| |
| .. _intro/consistency-07: |
| |
| .. figure:: ../../images/intro-consistency-07.png |
| :align: center |
| :alt: Synchronization conflicts between two databases |
| |
| Figure 7. Synchronization conflicts between two databases |
| |
| Wrapping Up |
| =========== |
| |
| CouchDB's design borrows heavily from web architecture and the lessons |
| learned deploying massively distributed systems on that architecture. |
| By understanding why this architecture works the way it does, |
| and by learning to spot which parts of your application can be easily |
| distributed and which parts cannot, you'll enhance your ability to design |
| distributed and scalable applications, with CouchDB or without it. |
| |
| We've covered the main issues surrounding CouchDB's consistency model and |
| hinted at some of the benefits to be had when you work *with* CouchDB and not |
| against it. But enough theory -- let's get up and running and see what all the |
| fuss is about! |