blob: 2b443746343b7e3f206f6fc09f40e1ec02cb39b4 [file] [log] [blame]
.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
.. use this file except in compliance with the License. You may obtain a copy of
.. the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.. License for the specific language governing permissions and limitations under
.. the License.
.. _views/intro:
=====================
Introduction to Views
=====================
Views are useful for many purposes:
- Filtering the documents in your database to find those relevant to a
particular process.
- Extracting data from your documents and presenting it in a specific order.
- Building efficient indexes to find documents by any value or structure that
resides in them.
- Use these indexes to represent relationships among documents.
- Finally, with views you can make all sorts of calculations on the data in your
documents. For example, if documents represent your company’s financial
transactions, a view can answer the question of what the spending was in the
last week, month, or year.
What Is a View?
===============
Let’s go through the different use cases. First is extracting data that you
might need for a special purpose in a specific order. For a front page, we want
a list of blog post titles sorted by date. We’ll work with a set of example
documents as we walk through how views work:
.. code-block:: javascript
{
"_id":"biking",
"_rev":"AE19EBC7654",
"title":"Biking",
"body":"My biggest hobby is mountainbiking. The other day...",
"date":"2009/01/30 18:04:11"
}
.. code-block:: javascript
{
"_id":"bought-a-cat",
"_rev":"4A3BBEE711",
"title":"Bought a Cat",
"body":"I went to the the pet store earlier and brought home a little kitty...",
"date":"2009/02/17 21:13:39"
}
.. code-block:: javascript
{
"_id":"hello-world",
"_rev":"43FBA4E7AB",
"title":"Hello World",
"body":"Well hello and welcome to my new blog...",
"date":"2009/01/15 15:52:20"
}
Three will do for the example. Note that the documents are sorted by "_id",
which is how they are stored in the database. Now we define a view.
Bear with us without an explanation while we show you some code:
.. code-block:: javascript
function(doc) {
if(doc.date && doc.title) {
emit(doc.date, doc.title);
}
}
This is a `map function`, and it is written in JavaScript. If you are not
familiar with JavaScript but have used C or any other C-like language such as
Java, PHP, or C#, this should look familiar. It is a simple function definition.
You provide CouchDB with view functions as strings stored inside the ``views``
field of a design document. To create this view you can use this command:
.. code-block:: console
curl -X PUT http://admin:password@127.0.0.1:5984/db/_design/my_ddoc
-d '{"views":{"my_filter":{"map":
"function(doc) { if(doc.date && doc.title) { emit(doc.date, doc.title); }}"}}}'
You don’t run the JavaScript function yourself. Instead, when you
`query your view`, CouchDB takes the source code and runs it for you on every
document in the database your view was defined in. You `query your view` to
retrieve the `view result` using the following command:
.. code-block:: console
curl -X GET http://admin:password@127.0.0.1:5984/db/_design/my_ddoc/_view/my_filter
All map functions have a single parameter doc. This is a single document in
the database. Our map function checks whether our document has a ``date`` and
a ``title`` attribute — luckily, all of our documents have them — and then calls
the built-in :js:func:`emit` function with these two attributes as arguments.
The :js:func:`emit` function always takes two arguments: the first is ``key``,
and the second is ``value``. The ``emit(key, value)`` function creates an entry
in our `view result`. One more thing: the ``emit()`` function can be called
multiple times in the map function to create multiple entries in the view
results from a single document, but we are not doing that yet.
CouchDB takes whatever you pass into the emit() function and puts it into a list
(see Table 1, “View results” below). Each row in that list includes the `key`
and `value`. More importantly, the list is sorted by key (by ``doc.date``
in our case). The most important feature of a view result is that it is sorted
by `key`. We will come back to that over and over again to do neat things. Stay
tuned.
Table 1. View results:
+-----------------------+------------------+
| Key | Value |
+=======================+==================+
| "2009/01/15 15:52:20" | "Hello World" |
+-----------------------+------------------+
| "2009/01/30 18:04:11" | "Biking" |
+-----------------------+------------------+
| "2009/02/17 21:13:39" | "Bought a Cat" |
+-----------------------+------------------+
When you query your view, CouchDB takes the source code and runs it for you on
every document in the database. If you have a lot of documents, that takes
quite a bit of time and you might wonder if it is not horribly inefficient
to do this. Yes, it would be, but CouchDB is designed to avoid any extra costs:
it only runs through all documents once, when you first query your view.
If a document is changed, the map function is only run once, to recompute
the keys and values for that single document.
The view result is stored in a B-tree, just like the structure that is
responsible for holding your documents. View B-trees are stored in their
own file, so that for high-performance CouchDB usage, you can keep views on
their own disk. The B-tree provides very fast lookups of rows by key, as well
as efficient streaming of rows in a key range. In our example, a single view
can answer all questions that involve time: “Give me all the blog posts from
last week” or “last month” or “this year.” Pretty neat.
When we query our view, we get back a list of all documents sorted by date.
Each row also includes the post title so we can construct links to posts.
Table 1 is just a graphical representation of the view result.
The actual result is JSON-encoded and contains a little more metadata:
.. code-block:: javascript
{
"total_rows": 3,
"offset": 0,
"rows": [
{
"key": "2009/01/15 15:52:20",
"id": "hello-world",
"value": "Hello World"
},
{
"key": "2009/01/30 18:04:11",
"id": "biking",
"value": "Biking"
},
{
"key": "2009/02/17 21:13:39",
"id": "bought-a-cat",
"value": "Bought a Cat"
}
]
}
Now, the actual result is not as nicely formatted and doesn’t include any
superfluous whitespace or newlines, but this is better for you (and us!)
to read and understand. Where does that "id" member in the result rows come
from? That wasn’t there before. That’s because we omitted it earlier to avoid
confusion. CouchDB automatically includes the document ID of the document that
created the entry in the view result. We’ll use this as well when constructing
links to the blog post pages.
.. warning::
Do not emit the entire document as the value of your ``emit(key, value)``
statement unless you're sure you know you want it. This stores an entire
additional copy of your document in the view's secondary index. Views with
``emit(key, doc)`` take longer to update, longer to write to disk, and
consume significantly more disk space. The only advantage is that they
are faster to query than using the ``?include_docs=true`` parameter when
querying a view.
Consider the trade-offs before emitting the entire document. Often it is
sufficient to emit only a portion of the document, or just a single key /
value pair, in your views.
Efficient Lookups
=================
Let’s move on to the second use case for views: “building efficient indexes to
find documents by any value or structure that resides in them.” We already
explained the efficient indexing, but we skipped a few details. This is a good
time to finish this discussion as we are looking at map functions that are a
little more complex.
First, back to the B-trees! We explained that the B-tree that backs the
key-sorted view result is built only once, when you first query a view,
and all subsequent queries will just read the B-tree instead of executing
the map function for all documents again. What happens, though, when you change
a document, add a new one, or delete one? Easy: CouchDB is smart enough
to find the rows in the view result that were created by a specific document.
It marks them invalid so that they no longer show up in view results.
If the document was deleted, we’re good — the resulting B-tree reflects the
state of the database. If a document got updated, the new document is run
through the map function and the resulting new lines are inserted into
the B-tree at the correct spots. New documents are handled in the same way.
The B-tree is a very efficient data structure for our needs, and the crash-only
design of CouchDB databases is carried over to the view indexes as well.
To add one more point to the efficiency discussion: usually multiple documents
are updated between view queries. The mechanism explained in the previous
paragraph gets applied to all changes in the database since the last time
the view was queried in a batch operation, which makes things even faster and
is generally a better use of your resources.
Find One
--------
On to more complex map functions. We said “find documents by any value or
structure that resides in them.” We already explained how to extract a value
by which to sort a list of views (our date field). The same mechanism is used
for fast lookups. The URI to query to get a view’s result is
``/database/_design/designdocname/_view/viewname``. This gives you a list of all
rows in the view. We have only three documents, so things are small, but with
thousands of documents, this can get long. You can add view parameters to the
URI to constrain the result set. Say we know the date of a blog post.
To find a single document, we would use
``/blog/_design/docs/_view/by_date?key="2009/01/30 18:04:11"``
to get the “Biking” blog post. Remember that you can place whatever you like
in the key parameter to the emit() function. Whatever you put in there, we can
now use to look up exactly — and fast.
Note that in the case where multiple rows have the same key (perhaps we design
a view where the key is the name of the post’s author), key queries can return
more than one row.
Find Many
---------
We talked about “getting all posts for last month.” If it’s February now,
this is as easy as::
/blog/_design/docs/_view/by_date?startkey="2010/01/01 00:00:00"&endkey="2010/02/00 00:00:00"
The ``startkey`` and ``endkey`` parameters specify an inclusive range on which
we can search.
To make things a little nicer and to prepare for a future example, we are going
to change the format of our date field. Instead of a string, we are going to use
an array, where individual members are part of a timestamp in decreasing
significance. This sounds fancy, but it is rather easy. Instead of::
{
"date": "2009/01/31 00:00:00"
}
we use::
{
"date": [2009, 1, 31, 0, 0, 0]
}
Our map function does not have to change for this, but our view result looks
a little different:
Table 2. New view results:
+---------------------------+------------------+
| Key | Value |
+===========================+==================+
| [2009, 1, 15, 15, 52, 20] | "Hello World" |
+---------------------------+------------------+
| [2009, 2, 17, 21, 13, 39] | "Biking" |
+---------------------------+------------------+
| [2009, 1, 30, 18, 4, 11] | "Bought a Cat" |
+---------------------------+------------------+
And our queries change to::
/blog/_design/docs/_view/by_date?startkey=[2010, 1, 1, 0, 0, 0]&endkey=[2010, 2, 1, 0, 0, 0]
For all you care, this is just a change in syntax, not meaning. But it shows
you the power of views. Not only can you construct an index with scalar values
like strings and integers, you can also use JSON structures as keys for your
views. Say we tag our documents with a list of tags and want to see all tags,
but we don’t care for documents that have not been tagged.
.. code-block:: javascript
{
...
tags: ["cool", "freak", "plankton"],
...
}
.. code-block:: javascript
{
...
tags: [],
...
}
.. code-block:: javascript
function(doc) {
if(doc.tags.length > 0) {
for(var idx in doc.tags) {
emit(doc.tags[idx], null);
}
}
}
This shows a few new things. You can have conditions on structure
(``if(doc.tags.length > 0)``) instead of just values. This is also an example of
how a map function calls :js:func:`emit` multiple times per document.
And finally, you can pass null instead of a value to the value parameter.
The same is true for the key parameter. We’ll see in a bit how that is useful.
Reversed Results
----------------
To retrieve view results in reverse order, use the ``descending=true`` query
parameter. If you are using a ``startkey`` parameter, you will find that CouchDB
returns different rows or no rows at all. What’s up with that?
It’s pretty easy to understand when you see how view query options work under
the hood. A view is stored in a tree structure for fast lookups. Whenever you
query a view, this is how CouchDB operates:
#. Starts reading at the top, or at the position that ``startkey`` specifies,
if present.
#. Returns one row at a time until the end or until it hits ``endkey``,
if present.
If you specify ``descending=true``, the reading direction is reversed,
not the sort order of the rows in the view. In addition, the same two-step
procedure is followed.
Say you have a view result that looks like this:
+-----+-------+
| Key | Value |
+=====+=======+
| 0 | "foo" |
+-----+-------+
| 1 | "bar" |
+-----+-------+
| 2 | "baz" |
+-----+-------+
Here are potential query options: ``?startkey=1&descending=true``. What will
CouchDB do? See #1 above: it jumps to ``startkey``, which is the row with the
key ``1``, and starts reading backward until it hits the end of the view.
So the particular result would be:
+-----+-------+
| Key | Value |
+=====+=======+
| 1 | "bar" |
+-----+-------+
| 0 | "foo" |
+-----+-------+
This is very likely not what you want. To get the rows with the indexes ``1``
and ``2`` in reverse order, you need to switch the ``startkey`` to ``endkey``:
``endkey=1&descending=true``:
+-----+-------+
| Key | Value |
+=====+=======+
| 2 | "baz" |
+-----+-------+
| 1 | "bar" |
+-----+-------+
Now that looks a lot better. CouchDB started reading at the bottom of the view
and went backward until it hit ``endkey``.
The View to Get Comments for Posts
==================================
We use an array key here to support the ``group_level`` reduce query parameter.
CouchDB’s views are stored in the B-tree file structure. Because of the way
B-trees are structured, we can cache the intermediate reduce results in the
non-leaf nodes of the tree, so reduce queries can be computed along arbitrary
key ranges in logarithmic time. See Figure 1, “Comments map function”.
In the blog app, we use ``group_level`` reduce queries to compute the count of
comments both on a per-post and total basis, achieved by querying the same view
index with different methods. With some array keys, and assuming each key has
the value ``1``:
.. code-block:: javascript
["a","b","c"]
["a","b","e"]
["a","c","m"]
["b","a","c"]
["b","a","g"]
the reduce view:
.. code-block:: javascript
function(keys, values, rereduce) {
return sum(values)
}
or:
.. code-block:: javascript
_sum
which is a built-in CouchDB reduce function (the others are ``_count`` and
``_stats``). ``_sum`` here returns the total number of rows between the start
and end key. So with ``startkey=["a","b"]&endkey=["b"]`` (which includes the
first three of the above keys) the result would equal ``3``. The effect is to
count rows. If you’d like to count rows without depending on the row value,
you can switch on the ``rereduce`` parameter:
.. code-block:: javascript
function(keys, values, rereduce) {
if (rereduce) {
return sum(values);
} else {
return values.length;
}
}
.. note::
The JavaScript function above could be effectively replaced by the built-in
``_count``.
.. figure:: ../../../images/views-intro-01.png
:align: center
:scale: 50 %
:alt: Comments map function
Figure 1. Comments map function
This is the reduce view used by the example app to count comments, while
utilizing the map to output the comments, which are more useful than just
``1`` over and over. It pays to spend some time playing around with map and
reduce functions. Fauxton is OK for this, but it doesn’t give full access to
all the query parameters. Writing your own test code for views in your language
of choice is a great way to explore the nuances and capabilities of CouchDB’s
incremental MapReduce system.
Anyway, with a ``group_level`` query, you’re basically running a series of
reduce range queries: one for each group that shows up at the level you query.
Let’s reprint the key list from earlier, grouped at level ``1``:
.. code-block:: javascript
["a"] 3
["b"] 2
And at ``group_level=2``:
.. code-block:: javascript
["a","b"] 2
["a","c"] 1
["b","a"] 2
Using the parameter ``group=true`` makes it behave as though it were
``group_level=999``, so in the case of our current example, it would give the
number ``1`` for each key, as there are no exactly duplicated keys.
Reduce/Rereduce
===============
We briefly talked about the ``rereduce`` parameter to the reduce function.
We’ll explain what’s up with it in this section. By now, you should have learned
that your view result is stored in B-tree index structure for efficiency.
The existence and use of the ``rereduce`` parameter is tightly coupled to how
the B-tree index works.
Consider the map result are:
.. code-block:: javascript
"afrikaans", 1
"afrikaans", 1
"chinese", 1
"chinese", 1
"chinese", 1
"chinese", 1
"french", 1
"italian", 1
"italian", 1
"spanish", 1
"vietnamese", 1
"vietnamese", 1
Example 1. Example view result (mmm, food)
When we want to find out how many dishes there are per origin, we can reuse
the simple reduce function shown earlier:
.. code-block:: javascript
function(keys, values, rereduce) {
return sum(values);
}
Figure 2, “The B-tree index” shows a simplified version of what the B-tree index
looks like. We abbreviated the key strings.
.. figure:: ../../../images/views-intro-02.png
:align: center
:alt: The B-tree index
Figure 2. The B-tree index
The view result is what computer science grads call a “pre-order” walk through
the tree. We look at each element in each node starting from the left. Whenever
we see that there is a subnode to descend into, we descend and start reading
the elements in that subnode. When we have walked through the entire tree,
we’re done.
You can see that CouchDB stores both keys and values inside each leaf node.
In our case, it is simply always ``1``, but you might have a value where you
count other results and then all rows have a different value. What’s important
is that CouchDB runs all elements that are within a node into the reduce
function (setting the ``rereduce`` parameter to false) and stores the result
inside the parent node along with the edge to the subnode. In our case, each
edge has a 3 representing the reduce value for the node it points to.
.. note::
In reality, nodes have more than 1,600 elements in them. CouchDB computes
the result for all the elements in multiple iterations over the elements in
a single node, not all at once (which would be disastrous for memory
consumption).
Now let’s see what happens when we run a query. We want to know how many
"chinese" entries we have. The query option is simple: ``?key="chinese"``.
See Figure 3, “The B-tree index reduce result”.
.. figure:: ../../../images/views-intro-03.png
:align: center
:alt: The B-tree index reduce result
Figure 3. The B-tree index reduce result
CouchDB detects that all values in the subnode include the "chinese" key.
It concludes that it can take just the 3 values associated with that node to
compute the final result. It then finds the node left to it and sees that it’s
a node with keys outside the requested range (``key=`` requests a range where
the beginning and the end are the same value). It concludes that it has to use
the "chinese" element’s value and the other node’s value and run them through
the reduce function with the ``rereduce`` parameter set to true.
The reduce function effectively calculates 3 + 1 at query time and returns the
desired result. The next example shows some pseudocode that shows the last
invocation of the reduce function with actual values:
.. code-block:: javascript
function(null, [3, 1], true) {
return sum([3, 1]);
}
Now, we said your reduce function must actually reduce your values. If you see
the B-tree, it should become obvious what happens when you don’t reduce your
values. Consider the following map result and reduce function. This time we
want to get a list of all the unique labels in our view:
.. code-block:: javascript
"abc", "afrikaans"
"cef", "afrikaans"
"fhi", "chinese"
"hkl", "chinese"
"ino", "chinese"
"lqr", "chinese"
"mtu", "french"
"owx", "italian"
"qza", "italian"
"tdx", "spanish"
"xfg", "vietnamese"
"zul", "vietnamese"
We don’t care for the key here and only list all the labels we have. Our reduce
function removes duplicates:
.. code-block:: javascript
function(keys, values, rereduce) {
var unique_labels = {};
values.forEach(function(label) {
if(!unique_labels[label]) {
unique_labels[label] = true;
}
});
return unique_labels;
}
This translates to Figure 4, “An overflowing reduce index”.
We hope you get the picture. The way the B-tree storage works means that if you
don’t actually reduce your data in the reduce function, you end up having
CouchDB copy huge amounts of data around that grow linearly, if not faster,
with the number of rows in your view.
CouchDB will be able to compute the final result, but only for views with a few
rows. Anything larger will experience a ridiculously slow view build time.
To help with that, CouchDB since version 0.10.0 will throw an error if your
reduce function does not reduce its input values.
.. figure:: ../../../images/views-intro-04.png
:align: center
:alt: An overflowing reduce index
Figure 4. An overflowing reduce index
One vs. Multiple Design Documents
=================================
A common question is: when should I split multiple views into multiple design
documents, or keep them together?
Each view you create corresponds to one B-tree. All views in a single design
document will live in the same set of index files on disk (one file per
database shard; in 2.0+ by default, 8 files per node).
The most practical consideration for separating views into separate documents
is how often you change those views. Views that change often, and are in the
same design document as other views, will invalidate those other views'
indexes when the design document is written, forcing them all to rebuild from
scratch. Obviously you will want to avoid this in production!
However, when you have multiple views with the same map function in the same
design document, CouchDB will optimize and only calculate that map function
once. This lets you have two views with different *reduce* functions (say,
one with ``_sum`` and one with ``_stats``) but build only a single copy
of the mapped index. It also saves disk space and the time to write multiple
copies to disk.
Another benefit of having multiple views in the same design document is that
the index files can keep a single index of backwards references from docids
to rows. CouchDB needs these "back refs" to invalidate rows in a view when a
document is deleted (otherwise, a delete would force a total rebuild!)
One other consideration is that each separate design document will spawn
another (set of) ``couchjs`` processes to generate the view, one per shard.
Depending on the number of cores on your server(s), this may be efficient
(using all of the idle cores you have) or inefficient (overloading the CPU on
your servers). The exact situation will depend on your deployment architecture.
So, should you use one or multiple design documents? The choice is yours.
Lessons Learned
===============
- If you don’t use the key field in the map function, you are probably doing it
wrong.
- If you are trying to make a list of values unique in the reduce functions,
you are probably doing it wrong.
- If you don’t reduce your values to a single scalar value or a small
fixed-sized object or array with a fixed number of scalar values of small
sizes, you are probably doing it wrong.
Wrapping Up
===========
Map functions are side effect–free functions that take a document as argument
and `emit` key/value pairs. CouchDB stores the emitted rows by constructing a
sorted B-tree index, so row lookups by key, as well as streaming operations
across a range of rows, can be accomplished in a small memory and processing
footprint, while writes avoid seeks. Generating a view takes ``O(N)``, where
``N`` is the total number of rows in the view. However, querying a view is very
quick, as the B-tree remains shallow even when it contains many, many keys.
Reduce functions operate on the sorted rows emitted by map view functions.
CouchDB’s reduce functionality takes advantage of one of the fundamental
properties of B-tree indexes: for every leaf node (a sorted row), there is a
chain of internal nodes reaching back to the root. Each leaf node in the B-tree
carries a few rows (on the order of tens, depending on row size), and each
internal node may link to a few leaf nodes or other internal nodes.
The reduce function is run on every node in the tree in order to calculate
the final reduce value. The end result is a reduce function that can be
incrementally updated upon changes to the map function, while recalculating
the reduction values for a minimum number of nodes. The initial reduction is
calculated once per each node (inner and leaf) in the tree.
When run on leaf nodes (which contain actual map rows), the reduce function’s
third parameter, ``rereduce``, is false. The arguments in this case are the keys
and values as output by the map function. The function has a single returned
reduction value, which is stored on the inner node that a working set of leaf
nodes have in common, and is used as a cache in future reduce calculations.
When the reduce function is run on inner nodes, the ``rereduce`` flag is
``true``. This allows the function to account for the fact that it will be
receiving its own prior output. When ``rereduce`` is true, the values passed to
the function are intermediate reduction values as cached from previous
calculations. When the tree is more than two levels deep, the `rereduce` phase
is repeated, consuming chunks of the previous level’s output until the final
reduce value is calculated at the root node.
A common mistake new CouchDB users make is attempting to construct complex
aggregate values with a reduce function. Full reductions should result in a
scalar value, like 5, and not, for instance, a JSON hash with a set of unique
keys and the count of each. The problem with this approach is that you’ll end
up with a very large final value. The number of unique keys can be nearly as
large as the number of total keys, even for a large set. It is fine to combine
a few scalar calculations into one reduce function; for instance, to find the
total, average, and standard deviation of a set of numbers in a single function.
If you’re interested in pushing the edge of CouchDB’s incremental reduce
functionality, have a look at `Google’s paper on Sawzall`_, which gives examples
of some of the more exotic reductions that can be accomplished in a system with
similar constraints.
.. _Google’s paper on Sawzall: http://research.google.com/archive/sawzall.html