| .. Licensed under the Apache License, Version 2.0 (the "License"); you may not |
| .. use this file except in compliance with the License. You may obtain a copy of |
| .. the License at |
| .. |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| .. |
| .. Unless required by applicable law or agreed to in writing, software |
| .. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT |
| .. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the |
| .. License for the specific language governing permissions and limitations under |
| .. the License. |
| |
| .. _cluster/sharding: |
| |
| ================ |
| Shard Management |
| ================ |
| |
| .. _cluster/sharding/scaling-out: |
| |
| Introduction |
| ------------ |
| |
| This document discusses how sharding works in CouchDB along with how to |
| safely add, move, remove, and create placement rules for shards and |
| shard replicas. |
| |
| A `shard |
| <https://en.wikipedia.org/wiki/Shard_(database_architecture)>`__ is a |
| horizontal partition of data in a database. Partitioning data into |
| shards and distributing copies of each shard (called "shard replicas" or |
| just "replicas") to different nodes in a cluster gives the data greater |
| durability against node loss. CouchDB clusters automatically shard |
| databases and distribute the subsets of documents that compose each |
| shard among nodes. Modifying cluster membership and sharding behavior |
| must be done manually. |
| |
| Shards and Replicas |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| How many shards and replicas each database has can be set at the global |
| level, or on a per-database basis. The relevant parameters are ``q`` and |
| ``n``. |
| |
| *q* is the number of database shards to maintain. *n* is the number of |
| copies of each document to distribute. The default value for ``n`` is ``3``, |
| and for ``q`` is ``8``. With ``q=8``, the database is split into 8 shards. With |
| ``n=3``, the cluster distributes three replicas of each shard. Altogether, |
| that's 24 shard replicas for a single database. In a default 3-node cluster, |
| each node would receive 8 shard replicas. In a 4-node cluster, each node |
| would receive 6 shard replicas. We recommend in the general case that the number of |
| nodes in your cluster should be a multiple of ``n``, so that shards are |
| distributed evenly. |
| |
| CouchDB nodes have an ``etc/local.ini`` file with a section named |
| `cluster <../config/cluster.html>`__ which looks like this: |
| |
| :: |
| |
| [cluster] |
| q=8 |
| n=3 |
| |
| These settings can be modified to set sharding defaults for all |
| databases, or they can be set on a per-database basis by specifying the |
| ``q`` and ``n`` query parameters when the database is created. For |
| example: |
| |
| .. code:: bash |
| |
| $ curl -X PUT "$COUCH_URL:5984/database-name?q=4&n=2" |
| |
| That creates a database split into 4 shards, with 2 replicas of each |
| shard, yielding 8 shard replicas distributed throughout the cluster. |
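| |
| To see where the replicas of each shard were placed, you can query the |
| database's shard map. This example assumes your CouchDB version exposes the |
| ``GET /{db}/_shards`` endpoint, which maps each shard range to the nodes |
| holding a replica of it: |
| |
| .. code:: bash |
| |
| $ curl "$COUCH_URL:5984/database-name/_shards" |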
| |
| Quorum |
| ~~~~~~ |
| |
| Depending on the size of the cluster, the number of shards per database, |
| and the number of shard replicas, not every node may have access to |
| every shard, but every node knows where all the replicas of each shard |
| can be found through CouchDB's internal shard map. |
| |
| Each request that comes in to a CouchDB cluster is handled by any one |
| random coordinating node. This coordinating node proxies the request to |
| the other nodes that have the relevant data, which may or may not |
| include itself. The coordinating node sends a response to the client |
| once a `quorum |
| <https://en.wikipedia.org/wiki/Quorum_(distributed_computing)>`__ of |
| database nodes have responded; 2, by default. The default required size |
| of a quorum is equal to ``r=w=((n+1)/2)`` where ``r`` refers to the size |
| of a read quorum, ``w`` refers to the size of a write quorum, and ``n`` |
| refers to the number of replicas of each shard. In a default cluster where |
| ``n`` is 3, ``((n+1)/2)`` would be 2. |
| |
| .. note:: |
| Each node in a cluster can be a coordinating node for any one |
| request. There are no special roles for nodes inside the cluster. |
| |
| The size of the required quorum can be configured at request time by |
| setting the ``r`` parameter for document and view reads, and the ``w`` |
| parameter for document writes. For example, here is a request that |
| directs the coordinating node to send a response once at least two nodes |
| have responded: |
| |
| .. code:: bash |
| |
| $ curl "$COUCH_URL:5984/<db>/<doc>?r=2" |
| |
| Here is a similar example for writing a document: |
| |
| .. code:: bash |
| |
| $ curl -X PUT "$COUCH_URL:5984/<db>/<doc>?w=2" -d '{...}' |
| |
| Setting ``r`` or ``w`` to be equal to ``n`` (the number of replicas) |
| means you will only receive a response once all nodes with relevant |
| shards have responded or timed out, and as such this approach does not |
| guarantee `ACIDic consistency |
| <https://en.wikipedia.org/wiki/ACID#Consistency>`__. Setting ``r`` or |
| ``w`` to 1 means you will receive a response after only one relevant |
| node has responded. |
| |
| .. _cluster/sharding/move: |
| |
| Moving a shard |
| -------------- |
| |
| This section describes how to manually place and replace shards. These |
| activities are critical steps when you determine that your cluster is too |
| big or too small and want to resize it successfully, or when server |
| metrics reveal that the database/shard layout is suboptimal and you have |
| "hot spots" that need resolving. |
| |
| Consider a three-node cluster with ``q=8`` and ``n=3``. Each database has 24 |
| shard replicas, distributed across the three nodes. If you :ref:`add a fourth |
| node <cluster/nodes/add>` to the cluster, CouchDB will not redistribute |
| existing database shards to it. This leads to unbalanced load, as the |
| new node will only host shards for databases created after it joined the |
| cluster. To balance the distribution of shards from existing databases, |
| they must be moved manually. |
| |
| Moving shards between nodes in a cluster involves the following steps: |
| |
| 0. :ref:`Ensure the target node has joined the cluster <cluster/nodes/add>`. |
| 1. Copy the shard(s) and any secondary |
| :ref:`index shard(s) onto the target node <cluster/sharding/copying>`. |
| 2. :ref:`Set the target node to maintenance mode <cluster/sharding/mm>`. |
| 3. Update cluster metadata |
| :ref:`to reflect the new target shard(s) <cluster/sharding/add-shard>`. |
| 4. Monitor internal replication |
| :ref:`to ensure up-to-date shard(s) <cluster/sharding/verify>`. |
| 5. :ref:`Clear the target node's maintenance mode <cluster/sharding/mm-2>`. |
| 6. Update cluster metadata again |
| :ref:`to remove the source shard(s) <cluster/sharding/remove-shard>`. |
| 7. Remove the shard file(s) and secondary index file(s) |
| :ref:`from the source node <cluster/sharding/remove-shard-files>`. |
| |
| .. _cluster/sharding/copying: |
| |
| Copying shard files |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| .. note:: |
| Technically, copying database and secondary index |
| shards is optional. If you proceed to the next step without |
| performing this data copy, CouchDB will use internal replication |
| to populate the newly added shard replicas. However, copying files |
| is faster than internal replication, especially on a busy cluster, |
| which is why we recommend performing this manual data copy first. |
| |
| Shard files live in the ``data/shards`` directory of your CouchDB |
| install, inside subdirectories named after each shard range. For |
| instance, for a ``q=8`` database called ``abc``, here are its database |
| shard files: |
| |
| :: |
| |
| data/shards/00000000-1fffffff/abc.1529362187.couch |
| data/shards/20000000-3fffffff/abc.1529362187.couch |
| data/shards/40000000-5fffffff/abc.1529362187.couch |
| data/shards/60000000-7fffffff/abc.1529362187.couch |
| data/shards/80000000-9fffffff/abc.1529362187.couch |
| data/shards/a0000000-bfffffff/abc.1529362187.couch |
| data/shards/c0000000-dfffffff/abc.1529362187.couch |
| data/shards/e0000000-ffffffff/abc.1529362187.couch |
| |
| Secondary indexes (including JavaScript views, Erlang views and Mango |
| indexes) are also sharded, and their shards should be moved to save the |
| new node the effort of rebuilding the view. View shards live in |
| ``data/.shards``. For example: |
| |
| :: |
| |
| data/.shards |
| data/.shards/e0000000-ffffffff/_replicator.1518451591_design |
| data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview |
| data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view |
| data/.shards/c0000000-dfffffff |
| data/.shards/c0000000-dfffffff/_replicator.1518451591_design |
| data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview |
| data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view |
| ... |
| |
| Since they are plain files, you can use ``cp``, ``rsync``, |
| ``scp`` or any other file-copying command to copy them from one node to |
| another. For example: |
| |
| .. code:: bash |
| |
| # on the target node, create the shard directories |
| $ mkdir -p data/.shards/<range> |
| $ mkdir -p data/shards/<range> |
| # on the source node, copy the files over |
| $ scp <couch-dir>/data/.shards/<range>/<database>.<datecode>* \ |
| <node>:<couch-dir>/data/.shards/<range>/ |
| $ scp <couch-dir>/data/shards/<range>/<database>.<datecode>.couch \ |
| <node>:<couch-dir>/data/shards/<range>/ |
| |
| .. note:: |
| Remember to move view files before database files! If a view index |
| is ahead of its database, the database will rebuild it from |
| scratch. |
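| |
| For example, a hedged ``rsync`` variant of the copy above that transfers |
| the view shards first and then the database shard for a single range |
| (paths, range and node name are placeholders, as before): |
| |
| .. code:: bash |
| |
| # on the source node: copy view shards first, then the database shard |
| $ rsync -av <couch-dir>/data/.shards/<range>/<database>.<datecode>* \ |
| <node>:<couch-dir>/data/.shards/<range>/ |
| $ rsync -av <couch-dir>/data/shards/<range>/<database>.<datecode>.couch \ |
| <node>:<couch-dir>/data/shards/<range>/ |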
| |
| .. _cluster/sharding/mm: |
| |
| Set the target node to ``true`` maintenance mode |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Before telling CouchDB about these new shards on the node, the node |
| must be put into maintenance mode. Maintenance mode instructs CouchDB to |
| return a ``404 Not Found`` response on the ``/_up`` endpoint, and |
| ensures it does not participate in normal interactive clustered requests |
| for its shards. A properly configured load balancer that uses ``GET |
| /_up`` to check the health of nodes will detect this 404 and remove the |
| node from circulation, preventing requests from being sent to that node. |
| For example, to configure HAProxy to use the ``/_up`` endpoint, use: |
| |
| :: |
| |
| http-check disable-on-404 |
| option httpchk GET /_up |
| |
| If you do not set maintenance mode, or the load balancer ignores this |
| maintenance mode status, after the next step is performed the cluster |
| may return incorrect responses when consulting the node in question. You |
| don't want this! In the next steps, we will ensure that this shard is |
| up-to-date before allowing it to participate in end-user requests. |
| |
| To enable maintenance mode: |
| |
| .. code:: bash |
| |
| $ curl -X PUT -H "Content-type: application/json" \ |
| $COUCH_URL:5984/_node/<nodename>/_config/couchdb/maintenance_mode \ |
| -d "\"true\"" |
| |
| Then, verify that the node is in maintenance mode by performing a ``GET |
| /_up`` on that node's individual endpoint: |
| |
| .. code:: bash |
| |
| $ curl -v $COUCH_URL/_up |
| … |
| < HTTP/1.1 404 Object Not Found |
| … |
| {"status":"maintenance_mode"} |
| |
| Finally, check that your load balancer has removed the node from the |
| pool of available backend nodes. |
| |
| .. _cluster/sharding/add-shard: |
| |
| Updating cluster metadata to reflect the new target shard(s) |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Now we need to tell CouchDB that the target node (which must already be |
| :ref:`joined to the cluster <cluster/nodes/add>`) should be hosting |
| shard replicas for a given database. |
| |
| To update the cluster metadata, use the special ``/_dbs`` database, |
| which is an internal CouchDB database that maps databases to shards and |
| nodes. This database is replicated between nodes. It is accessible only |
| via a node-local port, usually at port 5986. By default, this port is |
| only available on the localhost interface for security purposes. |
| |
| First, retrieve the database's current metadata: |
| |
| .. code:: bash |
| |
| $ curl http://localhost:5986/_dbs/{name} |
| { |
| "_id": "{name}", |
| "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a", |
| "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54], |
| "changelog": [ |
| ["add", "00000000-1fffffff", "node1@xxx.xxx.xxx.xxx"], |
| ["add", "00000000-1fffffff", "node2@xxx.xxx.xxx.xxx"], |
| ["add", "00000000-1fffffff", "node3@xxx.xxx.xxx.xxx"], |
| … |
| ], |
| "by_node": { |
| "node1@xxx.xxx.xxx.xxx": [ |
| "00000000-1fffffff", |
| … |
| ], |
| … |
| }, |
| "by_range": { |
| "00000000-1fffffff": [ |
| "node1@xxx.xxx.xxx.xxx", |
| "node2@xxx.xxx.xxx.xxx", |
| "node3@xxx.xxx.xxx.xxx" |
| ], |
| … |
| } |
| } |
| |
| Here is a brief anatomy of that document: |
| |
| - ``_id``: The name of the database. |
| - ``_rev``: The current revision of the metadata. |
| - ``shard_suffix``: The suffix appended to the shard file names: a period |
| followed by the database's creation time in seconds since the Unix |
| epoch, encoded as a list of ASCII codepoints. |
| - ``changelog``: History of the database's shards. |
| - ``by_node``: Which shard ranges each node holds. |
| - ``by_range``: Which nodes hold each shard range. |
| |
| To reflect the shard move in the metadata, there are three steps: |
| |
| 1. Add appropriate changelog entries. |
| 2. Update the ``by_node`` entries. |
| 3. Update the ``by_range`` entries. |
| |
| .. warning:: |
| Be very careful! Mistakes during this process can |
| irreparably corrupt the cluster! |
| |
| As of this writing, this process must be done manually. |
| |
| To add a shard to a node, add entries like this to the database |
| metadata's ``changelog`` attribute: |
| |
| .. code:: json |
| |
| ["add", "<range>", "<node-name>"] |
| |
| The ``<range>`` is the specific shard range for the shard. The |
| ``<node-name>`` should match the name and address of the node as |
| displayed in ``GET /_membership`` on the cluster. |
| |
| .. note:: |
| When removing a shard from a node, specify ``remove`` instead of ``add``. |
| |
| Once you have figured out the new changelog entries, you will need to |
| update the ``by_node`` and ``by_range`` to reflect who is storing what |
| shards. The data in the changelog entries and these attributes must |
| match. If they do not, the database may become corrupted. |
| |
| Continuing our example, here is an updated version of the metadata above |
| that adds shards to an additional node called ``node4``: |
| |
| .. code:: json |
| |
| { |
| "_id": "{name}", |
| "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a", |
| "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54], |
| "changelog": [ |
| ["add", "00000000-1fffffff", "node1@xxx.xxx.xxx.xxx"], |
| ["add", "00000000-1fffffff", "node2@xxx.xxx.xxx.xxx"], |
| ["add", "00000000-1fffffff", "node3@xxx.xxx.xxx.xxx"], |
| … |
| ["add", "00000000-1fffffff", "node4@xxx.xxx.xxx.xxx"] |
| ], |
| "by_node": { |
| "node1@xxx.xxx.xxx.xxx": [ |
| "00000000-1fffffff", |
| … |
| ], |
| … |
| "node4@xxx.xxx.xxx.xxx": [ |
| "00000000-1fffffff" |
| ] |
| }, |
| "by_range": { |
| "00000000-1fffffff": [ |
| "node1@xxx.xxx.xxx.xxx", |
| "node2@xxx.xxx.xxx.xxx", |
| "node3@xxx.xxx.xxx.xxx", |
| "node4@xxx.xxx.xxx.xxx" |
| ], |
| … |
| } |
| } |
| |
| Now you can ``PUT`` this new metadata: |
| |
| .. code:: bash |
| |
| $ curl -X PUT http://localhost:5986/_dbs/{name} -d '{...}' |
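| |
| Because this process is error-prone, you may prefer to build the updated |
| document with a script and review it before uploading. The following is |
| only a sketch, assuming the ``jq`` tool is available, and reuses the |
| hypothetical ``node4`` and shard range from the example above: |
| |
| .. code:: bash |
| |
| # fetch the current metadata, append a changelog entry for node4, and |
| # update the by_node and by_range sections for one range |
| $ curl -s http://localhost:5986/_dbs/{name} | \ |
| jq '.changelog += [["add", "00000000-1fffffff", "node4@xxx.xxx.xxx.xxx"]] |
| | .by_node."node4@xxx.xxx.xxx.xxx" += ["00000000-1fffffff"] |
| | .by_range."00000000-1fffffff" += ["node4@xxx.xxx.xxx.xxx"]' \ |
| > metadata.json |
| # inspect metadata.json by hand, then upload it |
| $ curl -X PUT http://localhost:5986/_dbs/{name} -d @metadata.json |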
| |
| .. _cluster/sharding/verify: |
| |
| Monitor internal replication to ensure up-to-date shard(s) |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| After you complete the previous step, CouchDB will check whether the |
| target node's shard(s) are up to date as soon as it receives a write |
| request for one of them. If they are not, it will trigger an internal |
| replication job to bring them up to date. |
| You can observe this happening by triggering a write to the database |
| (update a document, or create a new one), while monitoring the |
| ``/_node/<nodename>/_system`` endpoint, which includes the |
| ``internal_replication_jobs`` metric. |
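| |
| For example, one way to watch that metric from the command line, assuming |
| the ``jq`` tool is available: |
| |
| .. code:: bash |
| |
| $ curl -s $COUCH_URL:5984/_node/<nodename>/_system | \ |
| jq .internal_replication_jobs |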
| |
| Once this metric has returned to the baseline from before you wrote the |
| document, or is ``0``, the shard replica is ready to serve data and we |
| can bring the node out of maintenance mode. |
| |
| .. _cluster/sharding/mm-2: |
| |
| Clear the target node's maintenance mode |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| You can now let the node start servicing data requests by |
| putting ``"false"`` to the maintenance mode configuration endpoint, just |
| as in step 2. |
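| |
| For example, mirroring the command from step 2: |
| |
| .. code:: bash |
| |
| $ curl -X PUT -H "Content-type: application/json" \ |
| $COUCH_URL:5984/_node/<nodename>/_config/couchdb/maintenance_mode \ |
| -d "\"false\"" |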
| |
| Verify that the node is not in maintenance mode by performing a ``GET |
| /_up`` on that node's individual endpoint. |
| |
| Finally, check that your load balancer has returned the node to the pool |
| of available backend nodes. |
| |
| .. _cluster/sharding/remove-shard: |
| |
| Update cluster metadata again to remove the source shard |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Now, remove the source shard from the shard map the same way that you |
| added the new target shard to the shard map in step 3. Be sure to add |
| the ``["remove", <range>, <source-shard>]`` entry to the end of the |
| changelog as well as modifying both the ``by_node`` and ``by_range`` sections of |
| the database metadata document. |
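| |
| Continuing the hypothetical ``jq`` sketch from the earlier step, removing |
| a source replica (here ``node1@xxx.xxx.xxx.xxx``, purely as an |
| illustration) for the same range might look like this: |
| |
| .. code:: bash |
| |
| $ curl -s http://localhost:5986/_dbs/{name} | \ |
| jq '.changelog += [["remove", "00000000-1fffffff", "node1@xxx.xxx.xxx.xxx"]] |
| | .by_node."node1@xxx.xxx.xxx.xxx" -= ["00000000-1fffffff"] |
| | .by_range."00000000-1fffffff" -= ["node1@xxx.xxx.xxx.xxx"]' \ |
| > metadata.json |
| # review metadata.json, then PUT it back to /_dbs/{name} as before |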
| |
| .. _cluster/sharding/remove-shard-files: |
| |
| Remove the shard and secondary index files from the source node |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Finally, you can remove the source shard replica by deleting its file from the |
| command line on the source host, along with any view shard replicas: |
| |
| .. code:: bash |
| |
| $ rm <couch-dir>/data/shards/<range>/<dbname>.<datecode>.couch |
| $ rm -r <couch-dir>/data/.shards/<range>/<dbname>.<datecode>* |
| |
| Congratulations! You have moved a database shard replica. By adding and removing |
| database shard replicas in this way, you can change the cluster's shard layout, |
| also known as a shard map. |
| |
| Specifying database placement |
| ----------------------------- |
| |
| You can configure CouchDB to put shard replicas on certain nodes at |
| database creation time using placement rules. |
| |
| First, each node must be labeled with a zone attribute. This defines |
| which zone each node is in. You do this by editing the node’s document |
| in the ``/_nodes`` database, which is accessed through the node-local |
| port. Add a key value pair of the form: |
| |
| :: |
| |
| "zone": "{zone-name}" |
| |
| Do this for all of the nodes in your cluster. For example: |
| |
| .. code:: bash |
| |
| $ curl -X PUT http://localhost:5986/_nodes/<node-name> \ |
| -d '{ |
| "_id": "<node-name>", |
| "_rev": "<rev>", |
| "zone": "<zone-name>" |
| }' |
| |
| In the local config file (``local.ini``) of each node, define a |
| consistent cluster-wide setting like: |
| |
| :: |
| |
| [cluster] |
| placement = <zone-name-1>:2,<zone-name-2>:1 |
| |
| In this example, CouchDB will ensure that two replicas of each shard |
| will be hosted on nodes with the zone attribute set to ``<zone-name-1>`` |
| and one replica will be hosted on a node with the zone attribute set to |
| ``<zone-name-2>``. |
| |
| This approach is flexible, since you can also specify zones on a |
| per-database basis by specifying the placement setting as a query |
| parameter when the database is created, using the same syntax as the ini |
| file: |
| |
| .. code:: bash |
| |
| $ curl -X PUT "$COUCH_URL:5984/<dbname>?placement=<zone-name-1>:2,<zone-name-2>:1" |
| |
| Note that you can also use this system to ensure certain nodes in the |
| cluster do not host any replicas for newly created databases, by giving |
| them a zone attribute that does not appear in the ``[cluster]`` |
| placement string. |
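| |
| For example, assuming a hypothetical zone name ``<spare-zone>`` that does |
| not appear in the placement string, you could label such a node like |
| this: |
| |
| .. code:: bash |
| |
| $ curl -X PUT http://localhost:5986/_nodes/<node-name> \ |
| -d '{"_id": "<node-name>", "_rev": "<rev>", "zone": "<spare-zone>"}' |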
| |
| Resharding a database to a new q value |
| -------------------------------------- |
| |
| The ``q`` value for a database can only be set when the database is |
| created, precluding live resharding. Instead, to reshard a database, it |
| must be regenerated. Here are the steps: |
| |
| 1. Create a temporary database with the desired shard settings, by |
| specifying the q value as a query parameter during the PUT |
| operation. |
| 2. Stop clients accessing the database. |
| 3. Replicate the primary database to the temporary one. Multiple |
| replications may be required if the primary database is under |
| active use. |
| 4. Delete the primary database. **Make sure nobody is using it!** |
| 5. Recreate the primary database with the desired shard settings. |
| 6. Clients can now access the database again. |
| 7. Replicate the temporary database back to the primary. |
| 8. Delete the temporary database. |
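| |
| As a rough illustration, steps 1, 3 and 7 above might look like the |
| following, assuming a database named ``db``, a hypothetical temporary |
| database ``db-temp``, a new value of ``q=16``, and that your setup accepts |
| full URLs (with credentials if needed) in ``/_replicate`` requests: |
| |
| .. code:: bash |
| |
| # step 1: create the temporary database with the desired shard count |
| $ curl -X PUT "$COUCH_URL:5984/db-temp?q=16" |
| # step 3: replicate the primary database into the temporary one |
| $ curl -X POST "$COUCH_URL:5984/_replicate" \ |
| -H "Content-type: application/json" \ |
| -d '{"source": "http://127.0.0.1:5984/db", "target": "http://127.0.0.1:5984/db-temp"}' |
| # step 7: after recreating the primary database, replicate the data back |
| $ curl -X POST "$COUCH_URL:5984/_replicate" \ |
| -H "Content-type: application/json" \ |
| -d '{"source": "http://127.0.0.1:5984/db-temp", "target": "http://127.0.0.1:5984/db"}' |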
| |
| Once all steps have completed, the database can be used again. The |
| cluster will create and distribute its shards according to placement |
| rules automatically. |
| |
| Downtime can be avoided in production if the client application(s) can |
| be instructed to use the new database instead of the old one, and a |
| cut-over is performed during a very brief outage window. |