src/ddocs/views/joins.rst - couchdb-documentation - Git at Google

 .. Licensed under the Apache License, Version 2.0 (the "License"); you may not
 .. use this file except in compliance with the License. You may obtain a copy of
 .. the License at
 ..
 ..   http://www.apache.org/licenses/LICENSE-2.0
 ..
 .. Unless required by applicable law or agreed to in writing, software
 .. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 .. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 .. License for the specific language governing permissions and limitations under
 .. the License.

 .. _views/json:

 ================
 Joins With Views
 ================

 Linked Documents
 ================

 If your :ref:`map function <mapfun>` emits an object value which has
 ``{'_id': XXX}`` and you :ref:`query view <api/ddoc/view>` with
 ``include_docs=true`` parameter, then CouchDB will fetch the document with id
 ``XXX`` rather than the document which was processed to emit the key/value pair.

 This means that if one document contains the ids of other documents, it can
 cause those documents to be fetched in the view too, adjacent to the same key
 if required.

 For example, if you have the following hierarchically-linked documents:

 .. code-block:: javascript

     [
         { "_id": "11111" },
         { "_id": "22222", "ancestors": ["11111"], "value": "hello" },
         { "_id": "33333", "ancestors": ["22222","11111"], "value": "world" }
     ]

 You can emit the values with the ancestor documents adjacent to them in the view
 like this:

 .. code-block:: javascript

     function(doc) {
         if (doc.value) {
             emit([doc.value, 0], null);
             if (doc.ancestors) {
                 for (var i in doc.ancestors) {
                     emit([doc.value, Number(i)+1], {_id: doc.ancestors[i]});
                 }
             }
         }
     }

 The result you get is:

 .. code-block:: javascript

     {
         "total_rows": 5,
         "offset": 0,
         "rows": [
             {
                 "id": "22222",
                 "key": [
                     "hello",
                     0
                 ],
                 "value": null,
                 "doc": {
                     "_id": "22222",
                     "_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
                     "ancestors": [
                         "11111"
                     ],
                     "value": "hello"
                 }
             },
             {
                 "id": "22222",
                 "key": [
                     "hello",
                     1
                 ],
                 "value": {
                     "_id": "11111"
                 },
                 "doc": {
                     "_id": "11111",
                     "_rev": "1-967a00dff5e02add41819138abb3284d"
                 }
             },
             {
                 "id": "33333",
                 "key": [
                     "world",
                     0
                 ],
                 "value": null,
                 "doc": {
                     "_id": "33333",
                     "_rev": "1-11e42b44fdb3d3784602eca7c0332a43",
                     "ancestors": [
                         "22222",
                         "11111"
                     ],
                     "value": "world"
                 }
             },
             {
                 "id": "33333",
                 "key": [
                     "world",
                     1
                 ],
                 "value": {
                     "_id": "22222"
                 },
                 "doc": {
                     "_id": "22222",
                     "_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
                     "ancestors": [
                         "11111"
                     ],
                     "value": "hello"
                 }
             },
             {
                 "id": "33333",
                 "key": [
                     "world",
                     2
                 ],
                 "value": {
                     "_id": "11111"
                 },
                 "doc": {
                     "_id": "11111",
                     "_rev": "1-967a00dff5e02add41819138abb3284d"
                 }
             }
         ]
     }

 which makes it very cheap to fetch a document plus all its ancestors in one
 query.

 Note that the ``"id"`` in the row is still that of the originating document.
 The only difference is that ``include_docs`` fetches a different doc.

 The current revision of the document is resolved at query time, not at the time
 the view is generated. This means that if a new revision of the linked document
 is added later, it will appear in view queries even though the view itself
 hasn't changed. To force a specific revision of a linked document to be used,
 emit a ``"_rev"`` property as well as ``"_id"``.

 Using View Collation
 ====================

 :Author: Christopher Lenz
 :Date: 2007-10-05
 :Source: http://www.cmlenz.net/archives/2007/10/couchdb-joins

 Just today, there was a discussion on IRC on how you'd go about modeling a
 simple blogging system with “post” and “comment” entities, where any blog
 post might have N comments. If you'd be using an SQL database, you'd obviously
 have two tables with foreign keys and you'd be using joins. (At least until you
 needed to add some `denormalization`_).

 .. _denormalization: http://en.wikipedia.org/wiki/Denormalization

 But what would the “obvious” approach in CouchDB look like?

 Approach #1: Comments Inlined
 -----------------------------

 A simple approach would be to have one document per blog post, and store the
 comments inside that document:

 .. code-block:: javascript

     {
         "_id": "myslug",
         "_rev": "123456",
         "author": "john",
         "title": "My blog post",
         "content": "Bla bla bla …",
         "comments": [
             {"author": "jack", "content": "…"},
             {"author": "jane", "content": "…"}
         ]
     }

 .. note::
     Of course the model of an actual blogging system would be more extensive,
     you'd have tags, timestamps, etc, etc. This is just to demonstrate the basics.

 The obvious advantage of this approach is that the data that belongs together
 is stored in one place. Delete the post, and you automatically delete the
 corresponding comments, and so on.

 You may be thinking that putting the comments inside the blog post document
 would not allow us to query for the comments themselves, but you'd be wrong.
 You could trivially write a CouchDB view that would return all comments across
 all blog posts, keyed by author:

 .. code-block:: javascript

     function(doc) {
         for (var i in doc.comments) {
             emit(doc.comments[i].author, doc.comments[i].content);
         }
     }

 Now you could list all comments by a particular user by invoking the view and
 passing it a ``?key="username"`` query string parameter.

 However, this approach has a drawback that can be quite significant for many
 applications: To add a comment to a post, you need to:

 - Fetch the blog post document
 - Add the new comment to the JSON structure
 - Send the updated document to the server

 Now if you have multiple client processes adding comments at roughly the same
 time, some of them will get a `HTTP 409 Conflict` error on step 3 (that's
 optimistic concurrency in action). For some applications this makes sense, but
 in many other apps, you'd want to append new related data regardless of whether
 other data has been added in the meantime.

 The only way to allow non-conflicting addition of related data is by putting
 that related data into separate documents.

 Approach #2: Comments Separate
 ------------------------------

 Using this approach you'd have one document per blog post, and one document per
 comment. The comment documents would have a “backlink” to the post they belong
 to.

 The blog post document would look similar to the above, minus the comments
 property. Also, we'd now have a type property on all our documents so that we
 can tell the difference between posts and comments:

 .. code-block:: javascript

     {
         "_id": "myslug",
         "_rev": "123456",
         "type": "post",
         "author": "john",
         "title": "My blog post",
         "content": "Bla bla bla …"
     }

 The comments themselves are stored in separate documents, which also have a type
 property (this time with the value “comment”), and additionally feature a post
 property containing the ID of the post document they belong to:

 .. code-block:: javascript

     {
         "_id": "ABCDEF",
         "_rev": "123456",
         "type": "comment",
         "post": "myslug",
         "author": "jack",
         "content": "…"
     }

 .. code-block:: javascript

     {
         "_id": "DEFABC",
         "_rev": "123456",
         "type": "comment",
         "post": "myslug",
         "author": "jane",
         "content": "…"
     }

 To list all comments per blog post, you'd add a simple view, keyed by blog post
 ID:

 .. code-block:: javascript

     function(doc) {
         if (doc.type == "comment") {
             emit(doc.post, {author: doc.author, content: doc.content});
         }
     }

 And you'd invoke that view passing it a ``?key="post_id"`` query string
 parameter.

 Viewing all comments by author is just as easy as before:

 .. code-block:: javascript

     function(doc) {
         if (doc.type == "comment") {
             emit(doc.author, {post: doc.post, content: doc.content});
         }
     }

 So this is better in some ways, but it also has a disadvantage.
 Imagine you want to display a blog post with all the associated comments on the
 same web page. With our first approach, we needed just a single request to the
 CouchDB server, namely a ``GET`` request to the document. With this second
 approach, we need two requests: a ``GET`` request to the post document, and a
 ``GET`` request to the view that returns all comments for the post.

 That is okay, but not quite satisfactory. Just imagine you wanted to add
 threaded comments: you'd now need an additional fetch per comment. What we'd
 probably want then would be a way to join the blog post and the various comments
 together to be able to retrieve them with a single HTTP request.

 This was when Damien Katz, the author of CouchDB, chimed in to the discussion
 on IRC to show us the way.

 Optimization: Using the Power of View Collation
 -----------------------------------------------

 Obvious to Damien, but not at all obvious to the rest of us: it's fairly simple
 to make a view that includes both the content of the blog post document, and
 the content of all the comments associated with that post. The way you do that
 is by using `complex keys`. Until now we've been using simple string values for
 the view keys, but in fact they can be arbitrary JSON values, so let's make
 some use of that:

 .. code-block:: javascript

     function(doc) {
         if (doc.type == "post") {
             emit([doc._id, 0], null);
         } else if (doc.type == "comment") {
             emit([doc.post, 1], null);
         }
     }

 Okay, this may be confusing at first. Let's take a step back and look at what
 views in CouchDB are really about.

 CouchDB views are basically highly efficient on-disk dictionaries that map keys
 to values, where the key is automatically indexed and can be used to filter
 and/or sort the results you get back from your views. When you “invoke” a view,
 you can say that you're only interested in a subset of the view rows by
 specifying a ``?key=foo`` query string parameter. Or you can specify
 ``?startkey=foo`` and/or ``?endkey=bar`` query string parameters to fetch rows
 over a range of keys. Finally, by adding ``?include_docs=true`` to the query,
 the result will include the full body of each emitted document.

 It's also important to note that keys are always used for collating (i.e.
 sorting) the rows. CouchDB has well defined (but as of yet undocumented) rules
 for comparing arbitrary JSON objects for collation. For example, the JSON value
 ``["foo", 2]`` is sorted after (considered “greater than”) the values
 ``["foo"]`` or ``["foo", 1, "bar"]``, but before e.g. ``["foo", 2, "bar"]``.
 This feature enables a whole class of tricks that are rather non-obvious...

 .. seealso::

     :ref:`views/collation`

 With that in mind, let's return to the view function above. First note that,
 unlike the previous view functions we've used here, this view handles both
 "post" and "comment" documents, and both of them end up as rows in the same
 view. Also, the key in this view is not just a simple string, but an array.
 The first element in that array is always the ID of the post, regardless of
 whether we're processing an actual post document, or a comment associated with
 a post. The second element is 0 for post documents, and 1 for comment documents.

 Let's assume we have two blog posts in our database. Without limiting the view
 results via ``key``, ``startkey``, or ``endkey``, we'd get back something like
 the following:

 .. code-block:: javascript

     {
         "total_rows": 5, "offset": 0, "rows": [{
                 "id": "myslug",
                 "key": ["myslug", 0],
                 "value": null
             }, {
                 "id": "ABCDEF",
                 "key": ["myslug", 1],
                 "value": null
             }, {
                 "id": "DEFABC",
                 "key": ["myslug", 1],
                 "value": null
             }, {
                 "id": "other_slug",
                 "key": ["other_slug", 0],
                 "value": null
             }, {
                 "id": "CDEFAB",
                 "key": ["other_slug", 1],
                 "value": null
             },
         ]
     }

 .. note::
     The ``...`` placeholders here would contain the complete JSON encoding of the
     corresponding documents

 Now, to get a specific blog post and all associated comments, we'd invoke that
 view with the query string::

     ?startkey=["myslug"]&endkey=["myslug", 2]&include_docs=true

 We'd get back the first three rows, those that belong to the ``myslug`` post,
 but not the others, along with the full bodies of each document. Et voila, we
 now have the data we need to display a post with all associated comments,
 retrieved via a single ``GET`` request.

 You may be asking what the 0 and 1 parts of the keys are for. They're simply
 to ensure that the post document is always sorted before the the associated
 comment documents. So when you get back the results from this view for a
 specific post, you'll know that the first row contains the data for the blog
 post itself, and the remaining rows contain the comment data.

 One remaining problem with this model is that comments are not ordered, but
 that's simply because we don't have date/time information associated with them.
 If we had, we'd add the timestamp as third element of the key array, probably
 as ISO date/time strings. Now we would continue using the query string
 ``?startkey=["myslug"]&endkey=["myslug", 2]&include_docs=true`` to fetch the
 blog post and all associated comments, only now they'd be in chronological
 order.
	.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
	.. use this file except in compliance with the License. You may obtain a copy of
	.. the License at
	..
	.. http://www.apache.org/licenses/LICENSE-2.0
	..
	.. Unless required by applicable law or agreed to in writing, software
	.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
	.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
	.. License for the specific language governing permissions and limitations under
	.. the License.

	.. _views/json:

	================
	Joins With Views
	================

	Linked Documents
	================

	If your :ref:`map function <mapfun>` emits an object value which has
	``{'_id': XXX}`` and you :ref:`query view <api/ddoc/view>` with
	``include_docs=true`` parameter, then CouchDB will fetch the document with id
	``XXX`` rather than the document which was processed to emit the key/value pair.

	This means that if one document contains the ids of other documents, it can
	cause those documents to be fetched in the view too, adjacent to the same key
	if required.

	For example, if you have the following hierarchically-linked documents:

	.. code-block:: javascript

	[
	{ "_id": "11111" },
	{ "_id": "22222", "ancestors": ["11111"], "value": "hello" },
	{ "_id": "33333", "ancestors": ["22222","11111"], "value": "world" }
	]

	You can emit the values with the ancestor documents adjacent to them in the view
	like this:

	.. code-block:: javascript

	function(doc) {
	if (doc.value) {
	emit([doc.value, 0], null);
	if (doc.ancestors) {
	for (var i in doc.ancestors) {
	emit([doc.value, Number(i)+1], {_id: doc.ancestors[i]});
	}
	}
	}
	}

	The result you get is:

	.. code-block:: javascript

	{
	"total_rows": 5,
	"offset": 0,
	"rows": [
	{
	"id": "22222",
	"key": [
	"hello",
	0
	],
	"value": null,
	"doc": {
	"_id": "22222",
	"_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
	"ancestors": [
	"11111"
	],
	"value": "hello"
	}
	},
	{
	"id": "22222",
	"key": [
	"hello",
	1
	],
	"value": {
	"_id": "11111"
	},
	"doc": {
	"_id": "11111",
	"_rev": "1-967a00dff5e02add41819138abb3284d"
	}
	},
	{
	"id": "33333",
	"key": [
	"world",
	0
	],
	"value": null,
	"doc": {
	"_id": "33333",
	"_rev": "1-11e42b44fdb3d3784602eca7c0332a43",
	"ancestors": [
	"22222",
	"11111"
	],
	"value": "world"
	}
	},
	{
	"id": "33333",
	"key": [
	"world",
	1
	],
	"value": {
	"_id": "22222"
	},
	"doc": {
	"_id": "22222",
	"_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
	"ancestors": [
	"11111"
	],
	"value": "hello"
	}
	},
	{
	"id": "33333",
	"key": [
	"world",
	2
	],
	"value": {
	"_id": "11111"
	},
	"doc": {
	"_id": "11111",
	"_rev": "1-967a00dff5e02add41819138abb3284d"
	}
	}
	]
	}

	which makes it very cheap to fetch a document plus all its ancestors in one
	query.

	Note that the ``"id"`` in the row is still that of the originating document.
	The only difference is that ``include_docs`` fetches a different doc.

	The current revision of the document is resolved at query time, not at the time
	the view is generated. This means that if a new revision of the linked document
	is added later, it will appear in view queries even though the view itself
	hasn't changed. To force a specific revision of a linked document to be used,
	emit a ``"_rev"`` property as well as ``"_id"``.

	Using View Collation
	====================

	:Author: Christopher Lenz
	:Date: 2007-10-05
	:Source: http://www.cmlenz.net/archives/2007/10/couchdb-joins

	Just today, there was a discussion on IRC on how you'd go about modeling a
	simple blogging system with “post” and “comment” entities, where any blog
	post might have N comments. If you'd be using an SQL database, you'd obviously
	have two tables with foreign keys and you'd be using joins. (At least until you
	needed to add some `denormalization`_).

	.. _denormalization: http://en.wikipedia.org/wiki/Denormalization

	But what would the “obvious” approach in CouchDB look like?

	Approach #1: Comments Inlined
	-----------------------------

	A simple approach would be to have one document per blog post, and store the
	comments inside that document:

	.. code-block:: javascript

	{
	"_id": "myslug",
	"_rev": "123456",
	"author": "john",
	"title": "My blog post",
	"content": "Bla bla bla …",
	"comments": [
	{"author": "jack", "content": "…"},
	{"author": "jane", "content": "…"}
	]
	}

	.. note::
	Of course the model of an actual blogging system would be more extensive,
	you'd have tags, timestamps, etc, etc. This is just to demonstrate the basics.

	The obvious advantage of this approach is that the data that belongs together
	is stored in one place. Delete the post, and you automatically delete the
	corresponding comments, and so on.

	You may be thinking that putting the comments inside the blog post document
	would not allow us to query for the comments themselves, but you'd be wrong.
	You could trivially write a CouchDB view that would return all comments across
	all blog posts, keyed by author:

	.. code-block:: javascript

	function(doc) {
	for (var i in doc.comments) {
	emit(doc.comments[i].author, doc.comments[i].content);
	}
	}

	Now you could list all comments by a particular user by invoking the view and
	passing it a ``?key="username"`` query string parameter.

	However, this approach has a drawback that can be quite significant for many
	applications: To add a comment to a post, you need to:

	- Fetch the blog post document
	- Add the new comment to the JSON structure
	- Send the updated document to the server

	Now if you have multiple client processes adding comments at roughly the same
	time, some of them will get a `HTTP 409 Conflict` error on step 3 (that's
	optimistic concurrency in action). For some applications this makes sense, but
	in many other apps, you'd want to append new related data regardless of whether
	other data has been added in the meantime.

	The only way to allow non-conflicting addition of related data is by putting
	that related data into separate documents.

	Approach #2: Comments Separate
	------------------------------

	Using this approach you'd have one document per blog post, and one document per
	comment. The comment documents would have a “backlink” to the post they belong
	to.

	The blog post document would look similar to the above, minus the comments
	property. Also, we'd now have a type property on all our documents so that we
	can tell the difference between posts and comments:

	.. code-block:: javascript

	{
	"_id": "myslug",
	"_rev": "123456",
	"type": "post",
	"author": "john",
	"title": "My blog post",
	"content": "Bla bla bla …"
	}

	The comments themselves are stored in separate documents, which also have a type
	property (this time with the value “comment”), and additionally feature a post
	property containing the ID of the post document they belong to:

	.. code-block:: javascript

	{
	"_id": "ABCDEF",
	"_rev": "123456",
	"type": "comment",
	"post": "myslug",
	"author": "jack",
	"content": "…"
	}

	.. code-block:: javascript

	{
	"_id": "DEFABC",
	"_rev": "123456",
	"type": "comment",
	"post": "myslug",
	"author": "jane",
	"content": "…"
	}

	To list all comments per blog post, you'd add a simple view, keyed by blog post
	ID:

	.. code-block:: javascript

	function(doc) {
	if (doc.type == "comment") {
	emit(doc.post, {author: doc.author, content: doc.content});
	}
	}

	And you'd invoke that view passing it a ``?key="post_id"`` query string
	parameter.

	Viewing all comments by author is just as easy as before:

	.. code-block:: javascript

	function(doc) {
	if (doc.type == "comment") {
	emit(doc.author, {post: doc.post, content: doc.content});
	}
	}

	So this is better in some ways, but it also has a disadvantage.
	Imagine you want to display a blog post with all the associated comments on the
	same web page. With our first approach, we needed just a single request to the
	CouchDB server, namely a ``GET`` request to the document. With this second
	approach, we need two requests: a ``GET`` request to the post document, and a
	``GET`` request to the view that returns all comments for the post.

	That is okay, but not quite satisfactory. Just imagine you wanted to add
	threaded comments: you'd now need an additional fetch per comment. What we'd
	probably want then would be a way to join the blog post and the various comments
	together to be able to retrieve them with a single HTTP request.

	This was when Damien Katz, the author of CouchDB, chimed in to the discussion
	on IRC to show us the way.

	Optimization: Using the Power of View Collation
	-----------------------------------------------

	Obvious to Damien, but not at all obvious to the rest of us: it's fairly simple
	to make a view that includes both the content of the blog post document, and
	the content of all the comments associated with that post. The way you do that
	is by using `complex keys`. Until now we've been using simple string values for
	the view keys, but in fact they can be arbitrary JSON values, so let's make
	some use of that:

	.. code-block:: javascript

	function(doc) {
	if (doc.type == "post") {
	emit([doc._id, 0], null);
	} else if (doc.type == "comment") {
	emit([doc.post, 1], null);
	}
	}

	Okay, this may be confusing at first. Let's take a step back and look at what
	views in CouchDB are really about.

	CouchDB views are basically highly efficient on-disk dictionaries that map keys
	to values, where the key is automatically indexed and can be used to filter
	and/or sort the results you get back from your views. When you “invoke” a view,
	you can say that you're only interested in a subset of the view rows by
	specifying a ``?key=foo`` query string parameter. Or you can specify
	``?startkey=foo`` and/or ``?endkey=bar`` query string parameters to fetch rows
	over a range of keys. Finally, by adding ``?include_docs=true`` to the query,
	the result will include the full body of each emitted document.

	It's also important to note that keys are always used for collating (i.e.
	sorting) the rows. CouchDB has well defined (but as of yet undocumented) rules
	for comparing arbitrary JSON objects for collation. For example, the JSON value
	``["foo", 2]`` is sorted after (considered “greater than”) the values
	``["foo"]`` or ``["foo", 1, "bar"]``, but before e.g. ``["foo", 2, "bar"]``.
	This feature enables a whole class of tricks that are rather non-obvious...

	.. seealso::

	:ref:`views/collation`

	With that in mind, let's return to the view function above. First note that,
	unlike the previous view functions we've used here, this view handles both
	"post" and "comment" documents, and both of them end up as rows in the same
	view. Also, the key in this view is not just a simple string, but an array.
	The first element in that array is always the ID of the post, regardless of
	whether we're processing an actual post document, or a comment associated with
	a post. The second element is 0 for post documents, and 1 for comment documents.

	Let's assume we have two blog posts in our database. Without limiting the view
	results via ``key``, ``startkey``, or ``endkey``, we'd get back something like
	the following:

	.. code-block:: javascript

	{
	"total_rows": 5, "offset": 0, "rows": [{
	"id": "myslug",
	"key": ["myslug", 0],
	"value": null
	}, {
	"id": "ABCDEF",
	"key": ["myslug", 1],
	"value": null
	}, {
	"id": "DEFABC",
	"key": ["myslug", 1],
	"value": null
	}, {
	"id": "other_slug",
	"key": ["other_slug", 0],
	"value": null
	}, {
	"id": "CDEFAB",
	"key": ["other_slug", 1],
	"value": null
	},
	]
	}

	.. note::
	The ``...`` placeholders here would contain the complete JSON encoding of the
	corresponding documents

	Now, to get a specific blog post and all associated comments, we'd invoke that
	view with the query string::

	?startkey=["myslug"]&endkey=["myslug", 2]&include_docs=true

	We'd get back the first three rows, those that belong to the ``myslug`` post,
	but not the others, along with the full bodies of each document. Et voila, we
	now have the data we need to display a post with all associated comments,
	retrieved via a single ``GET`` request.

	You may be asking what the 0 and 1 parts of the keys are for. They're simply
	to ensure that the post document is always sorted before the the associated
	comment documents. So when you get back the results from this view for a
	specific post, you'll know that the first row contains the data for the blog
	post itself, and the remaining rows contain the comment data.

	One remaining problem with this model is that comments are not ordered, but
	that's simply because we don't have date/time information associated with them.
	If we had, we'd add the timestamp as third element of the key array, probably
	as ISO date/time strings. Now we would continue using the query string
	``?startkey=["myslug"]&endkey=["myslug", 2]&include_docs=true`` to fetch the
	blog post and all associated comments, only now they'd be in chronological
	order.