| <!-- |
| Copyright 2002-2004 The Apache Software Foundation |
| |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //--> |
| |
| # ZooKeeper Dynamic Reconfiguration |
| |
| * [Overview](#ch_reconfig_intro) |
| * [Changes to Configuration Format](#ch_reconfig_format) |
| * [Specifying the client port](#sc_reconfig_clientport) |
| * [Specifying multiple server addresses](#sc_multiaddress) |
| * [The standaloneEnabled flag](#sc_reconfig_standaloneEnabled) |
| * [The reconfigEnabled flag](#sc_reconfig_reconfigEnabled) |
| * [Dynamic configuration file](#sc_reconfig_file) |
| * [Backward compatibility](#sc_reconfig_backward) |
| * [Upgrading to 3.5.0](#ch_reconfig_upgrade) |
| * [Dynamic Reconfiguration of the ZooKeeper Ensemble](#ch_reconfig_dyn) |
| * [API](#ch_reconfig_api) |
| * [Security](#sc_reconfig_access_control) |
| * [Retrieving the current dynamic configuration](#sc_reconfig_retrieving) |
| * [Modifying the current dynamic configuration](#sc_reconfig_modifying) |
| * [General](#sc_reconfig_general) |
| * [Incremental mode](#sc_reconfig_incremental) |
| * [Non-incremental mode](#sc_reconfig_nonincremental) |
| * [Conditional reconfig](#sc_reconfig_conditional) |
| * [Error conditions](#sc_reconfig_errors) |
| * [Additional comments](#sc_reconfig_additional) |
| * [Rebalancing Client Connections](#ch_reconfig_rebalancing) |
| |
| <a name="ch_reconfig_intro"></a> |
| |
| ## Overview |
| |
Prior to the 3.5.0 release, the membership and all other configuration
parameters of ZooKeeper were static - loaded during boot and immutable at
runtime. Operators resorted to "rolling restarts" - a manual, error-prone
method of changing the configuration that has caused data loss and
inconsistency in production.
| |
Starting with 3.5.0, "rolling restarts" are no longer needed!
ZooKeeper comes with full support for automated configuration changes: the
set of ZooKeeper servers, their roles (participant / observer), all ports,
and even the quorum system can be changed dynamically, without service
interruption and while maintaining data consistency. Reconfigurations are
performed immediately, just like other operations in ZooKeeper. Multiple
changes can be done using a single reconfiguration command. The dynamic
reconfiguration functionality does not limit operation concurrency, does
not require client operations to be stopped during reconfigurations, has a
very simple interface for administrators, and adds no complexity to other
client operations.
| |
| New client-side features allow clients to find out about configuration |
| changes and to update the connection string (list of servers and their |
| client ports) stored in their ZooKeeper handle. A probabilistic algorithm |
| is used to rebalance clients across the new configuration servers while |
| keeping the extent of client migrations proportional to the change in |
| ensemble membership. |
| |
| This document provides the administrator manual for reconfiguration. |
| For a detailed description of the reconfiguration algorithms, performance |
| measurements, and more, please see our paper: |
| |
| * *Shraer, A., Reed, B., Malkhi, D., Junqueira, F. Dynamic |
| Reconfiguration of Primary/Backup Clusters. In _USENIX Annual |
| Technical Conference (ATC)_(2012), 425-437* : |
Links: [paper (pdf)](https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf), [slides (pdf)](https://www.usenix.org/sites/default/files/conference/protected-files/shraer_atc12_slides.pdf), [video](https://www.usenix.org/conference/atc12/technical-sessions/presentation/shraer), [hadoop summit slides](http://www.slideshare.net/Hadoop_Summit/dynamic-reconfiguration-of-zookeeper)
| |
**Note:** Starting with 3.5.3, the dynamic reconfiguration
feature is disabled by default, and has to be explicitly turned on via the
[reconfigEnabled](zookeeperAdmin.html#sc_advancedConfiguration) configuration option.
| |
| <a name="ch_reconfig_format"></a> |
| |
| ## Changes to Configuration Format |
| |
| <a name="sc_reconfig_clientport"></a> |
| |
| ### Specifying the client port |
| |
The client port of a server is the port on which the server accepts plaintext (non-TLS) client connection requests,
and the secure client port is the port on which the server accepts TLS client connection requests.
| |
| Starting with 3.5.0 the |
| _clientPort_ and _clientPortAddress_ configuration parameters should no longer be used in zoo.cfg. |
| |
| Starting with 3.10.0 the |
| _secureClientPort_ and _secureClientPortAddress_ configuration parameters should no longer be used in zoo.cfg. |
| |
| Instead, this information is now part of the server keyword specification, which |
| becomes as follows: |
| |
| server.<positive id> = <address1>:<quorum port>:<leader election port>[:role];[[<client port address>:]<client port>][;[<secure client port address>:]<secure client port>] |
| |
- [New in ZK 3.10.0] The client port specification is optional and is to the right of the
first semicolon. The secure client port specification is also optional and is to the right
of the second semicolon. However, the client port and secure client port specifications
cannot both be omitted; at least one of them must be present. If the user intends to omit the client
port specification and provide only a secure client port specification (TLS-only server), a second
semicolon should still be specified to indicate an empty client port specification (see the last
example below). In either spec, the port address is optional, and if not specified it defaults
to "0.0.0.0".
- As usual, the role is also optional; it can be _participant_ or _observer_ (_participant_ by default).
| |
| Examples of legal server statements: |
| |
| server.5 = 125.23.63.23:1234:1235;1236 (non-TLS server) |
| server.5 = 125.23.63.23:1234:1235;1236;1237 (non-TLS + TLS server) |
| server.5 = 125.23.63.23:1234:1235;;1237 (TLS-only server) |
| server.5 = 125.23.63.23:1234:1235:participant;1236 (non-TLS server) |
| server.5 = 125.23.63.23:1234:1235:observer;1236 (non-TLS server) |
| server.5 = 125.23.63.23:1234:1235;125.23.63.24:1236 (non-TLS server) |
| server.5 = 125.23.63.23:1234:1235:participant;125.23.63.23:1236 (non-TLS server) |
| server.5 = 125.23.63.23:1234:1235:participant;125.23.63.23:1236;125.23.63.23:1237 (non-TLS + TLS server) |
| server.5 = 125.23.63.23:1234:1235:participant;;125.23.63.23:1237 (TLS-only server) |
| |
| |
| <a name="sc_multiaddress"></a> |
| |
| ### Specifying multiple server addresses |
| |
| Since ZooKeeper 3.6.0 it is possible to specify multiple addresses for each |
| ZooKeeper server (see [ZOOKEEPER-3188](https://issues.apache.org/jira/projects/ZOOKEEPER/issues/ZOOKEEPER-3188)). |
This helps to increase availability and adds network-level
resiliency to ZooKeeper. When multiple physical network interfaces are used
for the servers, ZooKeeper is able to bind on all interfaces and to switch at runtime
to a working interface in case of a network error. The different addresses can be
specified in the config using a pipe ('|') character.
| |
Examples of valid configurations using multiple addresses:
| |
| server.2=zoo2-net1:2888:3888|zoo2-net2:2889:3889;2188 |
| server.2=zoo2-net1:2888:3888|zoo2-net2:2889:3889|zoo2-net3:2890:3890;2188 |
| server.2=zoo2-net1:2888:3888|zoo2-net2:2889:3889;zoo2-net1:2188 |
| server.2=zoo2-net1:2888:3888:observer|zoo2-net2:2889:3889:observer;2188 |
| |
| <a name="sc_reconfig_standaloneEnabled"></a> |
| |
| ### The _standaloneEnabled_ flag |
| |
| Prior to 3.5.0, one could run ZooKeeper in Standalone mode or in a |
| Distributed mode. These are separate implementation stacks, and |
| switching between them during run time is not possible. By default (for |
| backward compatibility) _standaloneEnabled_ is set to |
| _true_. The consequence of using this default is that |
| if started with a single server the ensemble will not be allowed to |
| grow, and if started with more than one server it will not be allowed to |
| shrink to contain fewer than two participants. |
| |
| Setting the flag to _false_ instructs the system |
| to run the Distributed software stack even if there is only a single |
| participant in the ensemble. To achieve this the (static) configuration |
| file should contain: |
| |
standaloneEnabled=false
| |
| With this setting it is possible to start a ZooKeeper ensemble |
| containing a single participant and to dynamically grow it by adding |
| more servers. Similarly, it is possible to shrink an ensemble so that |
| just a single participant remains, by removing servers. |
| |
| Since running the Distributed mode allows more flexibility, we |
| recommend setting the flag to _false_. We expect that |
| the legacy Standalone mode will be deprecated in the future. |
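For example, a minimal sketch (the address, ports, and path are illustrative, not prescriptive) of a single-file configuration that starts a one-participant ensemble which can later be grown through reconfiguration:

standaloneEnabled=false
tickTime=2000
dataDir=/zookeeper/data/zookeeper1
initLimit=5
syncLimit=2
server.1=125.23.63.23:2780:2783:participant;2791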
| |
| <a name="sc_reconfig_reconfigEnabled"></a> |
| |
| ### The _reconfigEnabled_ flag |
| |
Starting with 3.5.0 and prior to 3.5.3, there was no way to disable the
dynamic reconfiguration feature. We would like to offer the option of
disabling the reconfiguration feature because, with reconfiguration enabled,
a malicious actor can make arbitrary changes
to the configuration of a ZooKeeper ensemble, including adding a compromised
server to the ensemble. We prefer to leave it to the discretion of the user to
decide whether to enable it or not and to make sure that the appropriate security
measures are in place. So in 3.5.3 the [reconfigEnabled](zookeeperAdmin.html#sc_advancedConfiguration) configuration option was introduced
such that the reconfiguration feature can be completely disabled: any attempt
to reconfigure a cluster through the reconfig API, with or without authentication,
will fail by default, unless **reconfigEnabled** is set to
**true**.
| |
| To set the option to true, the configuration file (zoo.cfg) should contain: |
| |
| reconfigEnabled=true |
| |
| <a name="sc_reconfig_file"></a> |
| |
| ### Dynamic configuration file |
| |
| Starting with 3.5.0 we're distinguishing between dynamic |
| configuration parameters, which can be changed during runtime, and |
| static configuration parameters, which are read from a configuration |
| file when a server boots and don't change during its execution. For now, |
| the following configuration keywords are considered part of the dynamic |
| configuration: _server_, _group_ |
| and _weight_. |
| |
| Dynamic configuration parameters are stored in a separate file on |
| the server (which we call the dynamic configuration file). This file is |
| linked from the static config file using the new |
| _dynamicConfigFile_ keyword. |
| |
**Example 1**
| |
| #### zoo_replicated1.cfg |
| |
| |
| tickTime=2000 |
| dataDir=/zookeeper/data/zookeeper1 |
| initLimit=5 |
| syncLimit=2 |
| dynamicConfigFile=/zookeeper/conf/zoo_replicated1.cfg.dynamic |
| |
| |
| #### zoo_replicated1.cfg.dynamic |
| |
| |
| server.1=125.23.63.23:2780:2783:participant;2791 |
| server.2=125.23.63.24:2781:2784:participant;2792 |
| server.3=125.23.63.25:2782:2785:participant;2793 |
| |
| |
| When the ensemble configuration changes, the static configuration |
| parameters remain the same. The dynamic parameters are pushed by |
| ZooKeeper and overwrite the dynamic configuration files on all servers. |
| Thus, the dynamic configuration files on the different servers are |
| usually identical (they can only differ momentarily when a |
| reconfiguration is in progress, or if a new configuration hasn't |
| propagated yet to some of the servers). Once created, the dynamic |
configuration file should not be manually altered. Changes are only made
| through the new reconfiguration commands outlined below. Note that |
| changing the config of an offline cluster could result in an |
| inconsistency with respect to configuration information stored in the |
| ZooKeeper log (and the special configuration znode, populated from the |
| log) and is therefore highly discouraged. |
| |
| **Example 2** |
| |
| Users may prefer to initially specify a single configuration file. |
| The following is thus also legal: |
| |
| #### zoo_replicated1.cfg |
| |
| |
| tickTime=2000 |
| dataDir=/zookeeper/data/zookeeper1 |
| initLimit=5 |
| syncLimit=2 |
clientPort=2791
server.1=125.23.63.23:2780:2783:participant;2791
server.2=125.23.63.24:2781:2784:participant;2792
server.3=125.23.63.25:2782:2785:participant;2793
| |
| |
| The configuration files on each server will be automatically split |
| into dynamic and static files, if they are not already in this format. |
| So the configuration file above will be automatically transformed into |
| the two files in Example 1. Note that the clientPort and |
| clientPortAddress lines (if specified) will be automatically removed |
| during this process, if they are redundant (as in the example above). |
| The original static configuration file is backed up (in a .bak |
| file). |
| |
| <a name="sc_reconfig_backward"></a> |
| |
| ### Backward compatibility |
| |
| We still support the old configuration format. For example, the |
| following configuration file is acceptable (but not recommended): |
| |
| #### zoo_replicated1.cfg |
| |
| tickTime=2000 |
| dataDir=/zookeeper/data/zookeeper1 |
| initLimit=5 |
| syncLimit=2 |
| clientPort=2791 |
| server.1=125.23.63.23:2780:2783:participant |
| server.2=125.23.63.24:2781:2784:participant |
| server.3=125.23.63.25:2782:2785:participant |
| |
| |
| During boot, a dynamic configuration file is created and contains |
| the dynamic part of the configuration as explained earlier. In this |
| case, however, the line "clientPort=2791" will remain in the static |
| configuration file of server 1 since it is not redundant -- it was not |
| specified as part of the "server.1=..." using the format explained in |
| the section [Changes to Configuration Format](#ch_reconfig_format). If a reconfiguration |
| is invoked that sets the client port of server 1, we remove |
| "clientPort=2791" from the static configuration file (the dynamic file |
now contains this information as part of the specification of server
| 1). |
| |
| <a name="ch_reconfig_upgrade"></a> |
| |
| ## Upgrading to 3.5.0 |
| |
| Upgrading a running ZooKeeper ensemble to 3.5.0 should be done only |
| after upgrading your ensemble to the 3.4.6 release. Note that this is only |
| necessary for rolling upgrades (if you're fine with shutting down the |
| system completely, you don't have to go through 3.4.6). If you attempt a |
| rolling upgrade without going through 3.4.6 (for example from 3.4.5), you |
| may get the following error: |
| |
| 2013-01-30 11:32:10,663 [myid:2] - INFO [localhost/127.0.0.1:2784:QuorumCnxManager$Listener@498] - Received connection request /127.0.0.1:60876 |
| 2013-01-30 11:32:10,663 [myid:2] - WARN [localhost/127.0.0.1:2784:QuorumCnxManager@349] - Invalid server id: -65536 |
| |
| During a rolling upgrade, each server is taken down in turn and |
| rebooted with the new 3.5.0 binaries. Before starting the server with |
| 3.5.0 binaries, we highly recommend updating the configuration file so |
| that all server statements "server.x=..." contain client ports (see the |
| section [Specifying the client port](#sc_reconfig_clientport)). As explained earlier |
| you may leave the configuration in a single file, as well as leave the |
| clientPort/clientPortAddress statements (although if you specify client |
| ports in the new format, these statements are now redundant). |
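For example (address and ports are illustrative), a legacy statement pair such as

clientPort=2791
server.1=125.23.63.23:2780:2783

would be updated so that the server statement itself carries the client port:

server.1=125.23.63.23:2780:2783:participant;2791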
| |
| <a name="ch_reconfig_dyn"></a> |
| |
| ## Dynamic Reconfiguration of the ZooKeeper Ensemble |
| |
The ZooKeeper Java and C APIs were extended with getConfig and reconfig
| commands that facilitate reconfiguration. Both commands have a synchronous |
| (blocking) variant and an asynchronous one. We demonstrate these commands |
| here using the Java CLI, but note that you can similarly use the C CLI or |
| invoke the commands directly from a program just like any other ZooKeeper |
| command. |
| |
| <a name="ch_reconfig_api"></a> |
| |
| ### API |
| |
There are two sets of APIs for both the Java and C clients.

* ***Reconfiguration API*** :
The reconfiguration API is used to reconfigure the ZooKeeper cluster.
Starting with 3.5.3, the reconfiguration Java APIs were moved from the ZooKeeper class
into the ZooKeeperAdmin class, and use of this API requires ACL setup and user
authentication (see [Security](#sc_reconfig_access_control) for more information).

* ***Get Configuration API*** :
The get configuration APIs are used to retrieve ZooKeeper cluster configuration information
stored in the /zookeeper/config znode. Use of this API does not require specific setup or authentication,
because /zookeeper/config is readable by any user.
| |
| <a name="sc_reconfig_access_control"></a> |
| |
| ### Security |
| |
Prior to **3.5.3**, there was no enforced security mechanism
over reconfig, so any ZooKeeper client that could connect to a ZooKeeper server ensemble
had the ability to change the state of the ZooKeeper cluster via reconfig.
It was thus possible for a malicious client to make arbitrary changes to an ensemble,
e.g., add a compromised server, or remove legitimate servers.
Cases like these could constitute security vulnerabilities.
| |
To address this security concern, we introduced access control over reconfig
starting from **3.5.3** such that only a specific set of users
can use reconfig commands or APIs, and these users need to be configured explicitly. In addition,
the ZooKeeper cluster must have authentication enabled so that ZooKeeper clients can be authenticated.
| |
We also provide an escape hatch for users who operate and interact with a ZooKeeper ensemble in a secured
environment (i.e., behind a company firewall). Those users who want to use the reconfiguration feature but
don't want the overhead of configuring an explicit list of authorized users for reconfig access checks
can set ["skipACL"](zookeeperAdmin.html#sc_authOptions) to "yes", which will
skip ACL checks and allow any user to reconfigure the cluster.
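Note that skipACL is exposed as the Java system property zookeeper.skipACL rather than as a zoo.cfg key; a hedged sketch of enabling it through the environment variable used by the standard zkEnv/zkServer scripts:

export SERVER_JVMFLAGS="-Dzookeeper.skipACL=yes"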
| |
Overall, ZooKeeper provides flexible configuration options for the reconfiguration feature
that allow users to choose based on their security requirements.
We leave it to the discretion of the user to ensure that appropriate security measures are in place.
| |
* ***Access Control*** :
The dynamic configuration is stored in a special znode
ZooDefs.CONFIG_NODE = /zookeeper/config. By default this node is read-only
for all users, except the super user and users explicitly configured for write
access.
Clients that need to use reconfig commands or the reconfig API should be configured as users
that have write access to CONFIG_NODE. By default, only the super user has full control, including
write access to CONFIG_NODE. Additional users can be granted write access through the super user
by setting an ACL that has write permission associated with the specified user.
A few examples of how to set up ACLs and use the reconfiguration API with authentication can be found in
ReconfigExceptionTest.java and TestReconfigServer.cc; an illustrative CLI sketch follows this list.
| |
* ***Authentication*** :
Authentication of users is orthogonal to the access control and is delegated to
the existing authentication mechanisms supported by ZooKeeper's pluggable authentication schemes.
See [ZooKeeper and SASL](https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zookeeper+and+SASL) for more details on this topic.

* ***Disable ACL check*** :
ZooKeeper supports the ["skipACL"](zookeeperAdmin.html#sc_authOptions) option such that the ACL
check will be completely skipped if skipACL is set to "yes". In such cases any unauthenticated
user can use the reconfig API.
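The following CLI sketch illustrates the Access Control setup described above. It is illustrative only: the user name is hypothetical, <base64-digest> stands for the digest of user:password, and it assumes the digest super user has already been configured and authenticated:

[zk: localhost:2181(CONNECTED) 0] addauth digest super:<super password>
[zk: localhost:2181(CONNECTED) 1] setAcl /zookeeper/config world:anyone:r,digest:reconfig_user:<base64-digest>:crwda

The first ACL entry keeps the configuration znode world-readable; the second grants reconfig_user full permissions, including the write access required for reconfig.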
| |
| <a name="sc_reconfig_retrieving"></a> |
| |
| ### Retrieving the current dynamic configuration |
| |
| The dynamic configuration is stored in a special znode |
| ZooDefs.CONFIG_NODE = /zookeeper/config. The new |
| `config` CLI command reads this znode (currently it is |
| simply a wrapper to `get /zookeeper/config`). As with |
| normal reads, to retrieve the latest committed value you should do a |
| `sync` first. |
| |
[zk: 127.0.0.1:2791(CONNECTED) 3] config
server.1=localhost:2780:2783:participant;localhost:2791
server.2=localhost:2781:2784:participant;localhost:2792
server.3=localhost:2782:2785:participant;localhost:2793
version=400000003
| |
Notice the last line of the output. This is the configuration
version. The version equals the zxid of the reconfiguration command
which created this configuration. The version of the first established
configuration equals the zxid of the NEWLEADER message sent by the
first successfully established leader. When a configuration is written
| to a dynamic configuration file, the version automatically becomes part |
| of the filename and the static configuration file is updated with the |
| path to the new dynamic configuration file. Configuration files |
| corresponding to earlier versions are retained for backup |
| purposes. |
| |
| During boot time the version (if it exists) is extracted from the |
| filename. The version should never be altered manually by users or the |
| system administrator. It is used by the system to know which |
| configuration is most up-to-date. Manipulating it manually can result in |
| data loss and inconsistency. |
| |
| Just like a `get` command, the |
| `config` CLI command accepts the _-w_ |
flag for setting a watch on the znode, and the _-s_ flag for
displaying the Stat of the znode. It additionally accepts a new flag
| _-c_ which outputs only the version and the client |
| connection string corresponding to the current configuration. For |
| example, for the configuration above we would get: |
| |
| [zk: 127.0.0.1:2791(CONNECTED) 17] config -c |
| 400000003 localhost:2791,localhost:2793,localhost:2792 |
| |
| Note that when using the API directly, this command is called |
| `getConfig`. |
| |
As with any read command, it returns the configuration known to the
follower to which your client is connected, which may be slightly
out-of-date. One can use the `sync` command for
stronger guarantees. For example, using the Java API:
| |
| zk.sync(ZooDefs.CONFIG_NODE, void_callback, context); |
| zk.getConfig(watcher, callback, context); |
| |
| Note: in 3.5.0 it doesn't really matter which path is passed to the |
| `sync()` command as all the server's state is brought |
| up to date with the leader (so one could use a different path instead of |
| ZooDefs.CONFIG_NODE). However, this may change in the future. |
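For a one-shot blocking read there is also a synchronous variant; a minimal sketch (the connect string and timeout are illustrative):

// assumes org.apache.zookeeper.ZooKeeper and org.apache.zookeeper.data.Stat are imported
ZooKeeper zk = new ZooKeeper("127.0.0.1:2791", 30000, null);
Stat stat = new Stat();
byte[] config = zk.getConfig(false, stat); // watch=false: read once, set no watch
System.out.println(new String(config));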
| |
| <a name="sc_reconfig_modifying"></a> |
| |
| ### Modifying the current dynamic configuration |
| |
Modifying the configuration is done through the
`reconfig` command. There are two modes of
reconfiguration: incremental and non-incremental (bulk). The
non-incremental mode simply specifies the new dynamic configuration of the
system. The incremental mode specifies changes to the current configuration.
The `reconfig` command returns the new
configuration.
| |
| A few examples are in: *ReconfigTest.java*, |
| *ReconfigRecoveryTest.java* and |
| *TestReconfigServer.cc*. |
| |
| <a name="sc_reconfig_general"></a> |
| |
| #### General |
| |
| **Removing servers:** Any server can |
| be removed, including the leader (although removing the leader will |
result in a short unavailability, see Figures 6 and 8 in the [paper](https://www.usenix.org/conference/usenixfederatedconferencesweek/dynamic-recon%EF%AC%81guration-primarybackup-clusters)). The server will not be shut down automatically.
| Instead, it becomes a "non-voting follower". This is somewhat similar |
| to an observer in that its votes don't count towards the Quorum of |
| votes necessary to commit operations. However, unlike a non-voting |
| follower, an observer doesn't actually see any operation proposals and |
| does not ACK them. Thus a non-voting follower has a more significant |
| negative effect on system throughput compared to an observer. |
| Non-voting follower mode should only be used as a temporary mode, |
| before shutting the server down, or adding it as a follower or as an |
| observer to the ensemble. We do not shut the server down automatically |
| for two main reasons. The first reason is that we do not want all the |
| clients connected to this server to be immediately disconnected, |
| causing a flood of connection requests to other servers. Instead, it |
| is better if each client decides when to migrate independently. The |
| second reason is that removing a server may sometimes (rarely) be |
| necessary in order to change it from "observer" to "participant" (this |
| is explained in the section [Additional comments](#sc_reconfig_additional)). |
| |
| Note that the new configuration should have some minimal number of |
| participants in order to be considered legal. If the proposed change |
would leave the cluster with fewer than 2 participants and standalone
| mode is enabled (standaloneEnabled=true, see the section [The _standaloneEnabled_ flag](#sc_reconfig_standaloneEnabled)), the reconfig will not be |
| processed (BadArgumentsException). If standalone mode is disabled |
| (standaloneEnabled=false) then it's legal to remain with 1 or more |
| participants. |
| |
| **Adding servers:** Before a |
| reconfiguration is invoked, the administrator must make sure that a |
| quorum (majority) of participants from the new configuration are |
| already connected and synced with the current leader. To achieve this |
| we need to connect a new joining server to the leader before it is |
| officially part of the ensemble. This is done by starting the joining |
| server using an initial list of servers which is technically not a |
| legal configuration of the system but (a) contains the joiner, and (b) |
| gives sufficient information to the joiner in order for it to find and |
| connect to the current leader. We list a few different options of |
| doing this safely. |
| |
| 1. Initial configuration of joiners is comprised of servers in |
| the last committed configuration and one or more joiners, where |
**joiners are listed as observers** (a concrete sketch follows this list).
| For example, if servers D and E are added at the same time to (A, |
| B, C) and server C is being removed, the initial configuration of |
| D could be (A, B, C, D) or (A, B, C, D, E), where D and E are |
| listed as observers. Similarly, the configuration of E could be |
| (A, B, C, E) or (A, B, C, D, E), where D and E are listed as |
| observers. **Note that listing the joiners as |
| observers will not actually make them observers - it will only |
| prevent them from accidentally forming a quorum with other |
| joiners.** Instead, they will contact the servers in the |
| current configuration and adopt the last committed configuration |
| (A, B, C), where the joiners are absent. Configuration files of |
| joiners are backed up and replaced automatically as this happens. |
| After connecting to the current leader, joiners become non-voting |
| followers until the system is reconfigured and they are added to |
| the ensemble (as participant or observer, as appropriate). |
| 1. Initial configuration of each joiner is comprised of servers |
| in the last committed configuration + **the |
| joiner itself, listed as a participant.** For example, to |
| add a new server D to a configuration consisting of servers (A, B, |
| C), the administrator can start D using an initial configuration |
| file consisting of servers (A, B, C, D). If both D and E are added |
| at the same time to (A, B, C), the initial configuration of D |
| could be (A, B, C, D) and the configuration of E could be (A, B, |
| C, E). Similarly, if D is added and C is removed at the same time, |
| the initial configuration of D could be (A, B, C, D). Never list |
| more than one joiner as participant in the initial configuration |
| (see warning below). |
| 1. Whether listing the joiner as an observer or as participant, |
| it is also fine not to list all the current configuration servers, |
| as long as the current leader is in the list. For example, when |
| adding D we could start D with a configuration file consisting of |
just (A, D) if A is the current leader. However, this is more
| fragile since if A fails before D officially joins the ensemble, D |
| doesn’t know anyone else and therefore the administrator will have |
| to intervene and restart D with another server list. |
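To make option 1 concrete, here is a sketch of the initial dynamic configuration a joiner D might be started with when joining (A, B, C); the placeholder hosts and the ports are illustrative:

server.1=<host A>:2780:2783:participant;2791
server.2=<host B>:2781:2784:participant;2792
server.3=<host C>:2782:2785:participant;2793
server.4=<host D>:2786:2789:observer;2794

Once D has connected and synced with the leader, a reconfiguration (described below) can add it to the ensemble as a participant or an observer.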
| |
> ##### Warning
| |
>Never specify more than one joining server in the same initial
configuration as participants. Currently, the joining servers don’t
know that they are joining an existing ensemble; if multiple joiners
are listed as participants they may form an independent quorum,
creating a split-brain situation, e.g., processing operations
independently from your main ensemble. It is OK to list multiple
joiners as observers in an initial config.
| |
If the configuration of existing servers changes, or they become unavailable
before the joiner succeeds in connecting and learning about configuration changes, the
joiner may need to be restarted with an updated configuration file in order to be
able to connect.
| |
| Finally, note that once connected to the leader, a joiner adopts |
| the last committed configuration, in which it is absent (the initial |
| config of the joiner is backed up before being rewritten). If the |
| joiner restarts in this state, it will not be able to boot since it is |
| absent from its configuration file. In order to start it you’ll once |
| again have to specify an initial configuration. |
| |
| **Modifying server parameters:** One |
| can modify any of the ports of a server, or its role |
| (participant/observer) by adding it to the ensemble with different |
| parameters. This works in both the incremental and the bulk |
| reconfiguration modes. It is not necessary to remove the server and |
| then add it back; just specify the new parameters as if the server is |
| not yet in the system. The server will detect the configuration change |
| and perform the necessary adjustments. See an example in the section |
| [Incremental mode](#sc_reconfig_incremental) and an exception to this |
| rule in the section [Additional comments](#sc_reconfig_additional). |
| |
| It is also possible to change the Quorum System used by the |
| ensemble (for example, change the Majority Quorum System to a |
| Hierarchical Quorum System on the fly). This, however, is only allowed |
| using the bulk (non-incremental) reconfiguration mode. In general, |
| incremental reconfiguration only works with the Majority Quorum |
| System. Bulk reconfiguration works with both Hierarchical and Majority |
| Quorum Systems. |
| |
| **Performance Impact:** There is |
| practically no performance impact when removing a follower, since it |
| is not being automatically shut down (the effect of removal is that |
| the server's votes are no longer being counted). When adding a server, |
| there is no leader change and no noticeable performance disruption. |
| For details and graphs please see Figures 6, 7 and 8 in the [paper](https://www.usenix.org/conference/usenixfederatedconferencesweek/dynamic-recon%EF%AC%81guration-primarybackup-clusters). |
| |
| The most significant disruption will happen when a leader change |
| is caused, in one of the following cases: |
| |
| 1. Leader is removed from the ensemble. |
| 1. Leader's role is changed from participant to observer. |
| 1. The port used by the leader to send transactions to others |
| (quorum port) is modified. |
| |
| In these cases we perform a leader hand-off where the old leader |
| nominates a new leader. The resulting unavailability is usually |
| shorter than when a leader crashes since detecting leader failure is |
| unnecessary and electing a new leader can usually be avoided during a |
| hand-off (see Figures 6 and 8 in the [paper](https://www.usenix.org/conference/usenixfederatedconferencesweek/dynamic-recon%EF%AC%81guration-primarybackup-clusters)). |
| |
| When the client port of a server is modified, it does not drop |
| existing client connections. New connections to the server will have |
| to use the new client port. |
| |
| **Progress guarantees:** Up to the |
| invocation of the reconfig operation, a quorum of the old |
| configuration is required to be available and connected for ZooKeeper |
| to be able to make progress. Once reconfig is invoked, a quorum of |
| both the old and of the new configurations must be available. The |
| final transition happens once (a) the new configuration is activated, |
| and (b) all operations scheduled before the new configuration is |
| activated by the leader are committed. Once (a) and (b) happen, only a |
| quorum of the new configuration is required. Note, however, that |
| neither (a) nor (b) are visible to a client. Specifically, when a |
| reconfiguration operation commits, it only means that an activation |
| message was sent out by the leader. It does not necessarily mean that |
| a quorum of the new configuration got this message (which is required |
| in order to activate it) or that (b) has happened. If one wants to |
make sure that both (a) and (b) have already occurred (for example, in
| order to know that it is safe to shut down old servers that were |
| removed), one can simply invoke an update |
| (`set-data`, or some other quorum operation, but not |
| a `sync`) and wait for it to commit. An alternative |
way to achieve this would have been to introduce another round to the
| reconfiguration protocol (which, for simplicity and compatibility with |
| Zab, we decided to avoid). |
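A minimal sketch of this technique in Java, assuming the reconfig has already returned and using a hypothetical pre-existing scratch znode /reconfig-barrier (any quorum write will do):

// when the synchronous setData returns, the update has committed, which
// guarantees that (a) and (b) above have occurred; -1 ignores the znode version
zk.setData("/reconfig-barrier", new byte[0], -1);
// it is now safe to shut down the servers removed by the reconfiguration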
| |
| <a name="sc_reconfig_incremental"></a> |
| |
| #### Incremental mode |
| |
| The incremental mode allows adding and removing servers to the |
| current configuration. Multiple changes are allowed. For |
| example: |
| |
| > reconfig -remove 3 -add |
| server.5=125.23.63.23:1234:1235;1236 |
| |
Both the add and the remove options take a list of comma-separated
arguments (no spaces):
| |
| > reconfig -remove 3,4 -add |
| server.5=localhost:2111:2112;2113,6=localhost:2114:2115:observer;2116 |
| |
| The format of the server statement is exactly the same as |
| described in the section [Specifying the client port](#sc_reconfig_clientport) and |
| includes the client port. Notice that here instead of "server.5=" you |
| can just say "5=". In the example above, if server 5 is already in the |
| system, but has different ports or is not an observer, it is updated |
and, once the configuration commits, becomes an observer and starts
| using these new ports. This is an easy way to turn participants into |
| observers and vice versa or change any of their ports, without |
| rebooting the server. |
| |
| ZooKeeper supports two types of Quorum Systems – the simple |
| Majority system (where the leader commits operations after receiving |
| ACKs from a majority of voters) and a more complex Hierarchical |
| system, where votes of different servers have different weights and |
| servers are divided into voting groups. Currently, incremental |
| reconfiguration is allowed only if the last proposed configuration |
| known to the leader uses a Majority Quorum System |
| (BadArgumentsException is thrown otherwise). |
| |
| Incremental mode - examples using the Java API: |
| |
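// Example 1: remove servers 1 and 2 from the ensemble
// (zk is an established handle; since 3.5.3 the reconfig API lives in ZooKeeperAdmin)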
| List<String> leavingServers = new ArrayList<String>(); |
| leavingServers.add("1"); |
| leavingServers.add("2"); |
| byte[] config = zk.reconfig(null, leavingServers, null, -1, new Stat()); |
| |
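// Example 2: remove server 1 and add server 4 in a single reconfiguration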
| List<String> leavingServers = new ArrayList<String>(); |
| List<String> joiningServers = new ArrayList<String>(); |
| leavingServers.add("1"); |
| joiningServers.add("server.4=localhost:1234:1235;1236"); |
| byte[] config = zk.reconfig(joiningServers, leavingServers, null, -1, new Stat()); |
| |
| String configStr = new String(config); |
| System.out.println(configStr); |
| |
There is also an asynchronous API, and an API accepting comma-separated
Strings instead of List<String>. See
src/java/main/org/apache/zookeeper/ZooKeeper.java.
| |
| <a name="sc_reconfig_nonincremental"></a> |
| |
| #### Non-incremental mode |
| |
| The second mode of reconfiguration is non-incremental, whereby a |
| client gives a complete specification of the new dynamic system |
| configuration. The new configuration can either be given in place or |
| read from a file: |
| |
| > reconfig -file newconfig.cfg |
| |
| //newconfig.cfg is a dynamic config file, see [Dynamic configuration file](#sc_reconfig_file) |
| |
| > reconfig -members |
server.1=125.23.63.23:2780:2783:participant;2791,server.2=125.23.63.24:2781:2784:participant;2792,server.3=125.23.63.25:2782:2785:participant;2793
| |
| The new configuration may use a different Quorum System. For |
| example, you may specify a Hierarchical Quorum System even if the |
| current ensemble uses a Majority Quorum System. |
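For example, a hedged sketch of a dynamic configuration file (the hosts, ports, and grouping are illustrative) that moves the ensemble to a Hierarchical Quorum System using the _group_ and _weight_ keywords mentioned in [Dynamic configuration file](#sc_reconfig_file), to be applied with `reconfig -file <filename>`:

server.1=125.23.63.23:2780:2783:participant;2791
server.2=125.23.63.24:2781:2784:participant;2792
server.3=125.23.63.25:2782:2785:participant;2793
group.1=1:2
group.2=3
weight.1=1
weight.2=1
weight.3=1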
| |
| Bulk mode - example using the Java API: |
| |
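// replace the entire membership in one call; note that server.3 joins as an observer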
| List<String> newMembers = new ArrayList<String>(); |
| newMembers.add("server.1=1111:1234:1235;1236"); |
| newMembers.add("server.2=1112:1237:1238;1239"); |
| newMembers.add("server.3=1114:1240:1241:observer;1242"); |
| |
| byte[] config = zk.reconfig(null, null, newMembers, -1, new Stat()); |
| |
| String configStr = new String(config); |
| System.out.println(configStr); |
| |
There is also an asynchronous API, and an API accepting a comma-separated
String containing the new members instead of
List<String>. See
src/java/main/org/apache/zookeeper/ZooKeeper.java.
| |
| <a name="sc_reconfig_conditional"></a> |
| |
| #### Conditional reconfig |
| |
| Sometimes (especially in non-incremental mode) a new proposed |
| configuration depends on what the client "believes" to be the current |
| configuration, and should be applied only to that configuration. |
| Specifically, the `reconfig` succeeds only if the |
| last configuration at the leader has the specified version. |
| |
| > reconfig -file <filename> -v <version> |
| |
| In the previously listed Java examples, instead of -1 one could |
| specify a configuration version to condition the |
| reconfiguration. |
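A hedged Java sketch of a conditional removal (the server id is illustrative; zk can be a ZooKeeperAdmin handle, which extends ZooKeeper). The version is parsed from the configuration data the same way as in the rebalancing example at the end of this document:

// read the current configuration; the first token is the version in hex (as in "config -c")
Stat stat = new Stat();
byte[] data = zk.getConfig(false, stat);
String[] current = ConfigUtils.getClientConfigStr(new String(data)).split(" ");
long version = Long.parseLong(current[0], 16);

// the reconfig is applied only if the leader's configuration still has this version
List<String> leavingServers = new ArrayList<String>();
leavingServers.add("3");
byte[] config = zk.reconfig(null, leavingServers, null, version, new Stat());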
| |
| <a name="sc_reconfig_errors"></a> |
| |
| #### Error conditions |
| |
| In addition to normal ZooKeeper error conditions, a |
| reconfiguration may fail for the following reasons: |
| |
| 1. another reconfig is currently in progress |
| (ReconfigInProgress) |
1. the proposed change would leave the cluster with fewer than 2
participants while standalone mode is enabled (if standalone
mode is disabled it is legal to remain with 1 or more
participants) (BadArgumentsException)
| 1. no quorum of the new configuration was connected and |
| up-to-date with the leader when the reconfiguration processing |
| began (NewConfigNoQuorum) |
| 1. `-v x` was specified, but the version |
| `y` of the latest configuration is not |
| `x` (BadVersionException) |
| 1. an incremental reconfiguration was requested but the last |
| configuration at the leader uses a Quorum System which is |
| different from the Majority system (BadArgumentsException) |
| 1. syntax error (BadArgumentsException) |
| 1. I/O exception when reading the configuration from a file |
| (BadArgumentsException) |
| |
| Most of these are illustrated by test-cases in |
| *ReconfigFailureCases.java*. |
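A hedged sketch of handling these errors from the Java API (the reactions in the comments are illustrative, not prescriptive):

try {
    byte[] config = zk.reconfig(joiningServers, leavingServers, null, -1, new Stat());
} catch (KeeperException.NewConfigNoQuorum e) {
    // no quorum of the new configuration is connected and up-to-date with the leader
} catch (KeeperException.BadVersionException e) {
    // a conditional reconfig (-v) lost the race: re-read the configuration and retry
} catch (KeeperException.BadArgumentsException e) {
    // e.g., a syntax error, too few participants, or an incremental change
    // requested under a non-Majority quorum system
} catch (KeeperException e) {
    // other cases, e.g., another reconfiguration is already in progress
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}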
| |
| <a name="sc_reconfig_additional"></a> |
| |
| #### Additional comments |
| |
| **Liveness:** To better understand |
| the difference between incremental and non-incremental |
| reconfiguration, suppose that client C1 adds server D to the system |
| while a different client C2 adds server E. With the non-incremental |
| mode, each client would first invoke `config` to find |
| out the current configuration, and then locally create a new list of |
| servers by adding its own suggested server. The new configuration can |
| then be submitted using the non-incremental |
| `reconfig` command. After both reconfigurations |
| complete, only one of E or D will be added (not both), depending on |
| which client's request arrives second to the leader, overwriting the |
| previous configuration. The other client can repeat the process until |
| its change takes effect. This method guarantees system-wide progress |
| (i.e., for one of the clients), but does not ensure that every client |
succeeds. To have more control, C2 may request to execute the
reconfiguration only if the version of the current configuration
hasn't changed, as explained in the section [Conditional reconfig](#sc_reconfig_conditional). In this way it may avoid blindly
| overwriting the configuration of C1 if C1's configuration reached the |
| leader first. |
| |
| With incremental reconfiguration, both changes will take effect as |
| they are simply applied by the leader one after the other to the |
| current configuration, whatever that is (assuming that the second |
| reconfig request reaches the leader after it sends a commit message |
| for the first reconfig request -- currently the leader will refuse to |
| propose a reconfiguration if another one is already pending). Since |
| both clients are guaranteed to make progress, this method guarantees |
| stronger liveness. In practice, multiple concurrent reconfigurations |
| are probably rare. Non-incremental reconfiguration is currently the |
only way to dynamically change the Quorum System. Incremental
reconfiguration is currently only allowed with the Majority Quorum
System.
| |
| **Changing an observer into a |
| follower:** Clearly, changing a server that participates in |
| voting into an observer may fail if error (2) occurs, i.e., if fewer |
| than the minimal allowed number of participants would remain. However, |
| converting an observer into a participant may sometimes fail for a |
| more subtle reason: Suppose, for example, that the current |
| configuration is (A, B, C, D), where A is the leader, B and C are |
| followers and D is an observer. In addition, suppose that B has |
| crashed. If a reconfiguration is submitted where D is said to become a |
follower, it will fail with error (3) since, in this configuration, a
majority of voters in the new configuration (any 3 voters) must be
connected and up-to-date with the leader. An observer cannot
| acknowledge the history prefix sent during reconfiguration, and |
| therefore it does not count towards these 3 required servers and the |
| reconfiguration will be aborted. In case this happens, a client can |
| achieve the same task by two reconfig commands: first invoke a |
| reconfig to remove D from the configuration and then invoke a second |
| command to add it back as a participant (follower). During the |
| intermediate state D is a non-voting follower and can ACK the state |
| transfer performed during the second reconfig command. |
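For example (the server id, address, and ports are illustrative), the two-step conversion might look as follows in the CLI:

> reconfig -remove 4
> reconfig -add server.4=125.23.63.26:2786:2789:participant;2794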
| |
| <a name="ch_reconfig_rebalancing"></a> |
| |
| ## Rebalancing Client Connections |
| |
| When a ZooKeeper cluster is started, if each client is given the same |
| connection string (list of servers), the client will randomly choose a |
| server in the list to connect to, which makes the expected number of |
| client connections per server the same for each of the servers. We |
| implemented a method that preserves this property when the set of servers |
| changes through reconfiguration. See Sections 4 and 5.1 in the [paper](https://www.usenix.org/conference/usenixfederatedconferencesweek/dynamic-recon%EF%AC%81guration-primarybackup-clusters). |
| |
| In order for the method to work, all clients must subscribe to |
| configuration changes (by setting a watch on /zookeeper/config either |
| directly or through the `getConfig` API command). When |
| the watch is triggered, the client should read the new configuration by |
invoking `sync` and `getConfig` and, if
the configuration is indeed new, invoke the
`updateServerList` API command. To avoid mass client
| migration at the same time, it is better to have each client sleep a |
| random short period of time before invoking |
| `updateServerList`. |
| |
| A few examples can be found in: |
| *StaticHostProviderTest.java* and |
*TestReconfig.cc*.
| |
| Example (this is not a recipe, but a simplified example just to |
| explain the general idea): |
| |
public void process(WatchedEvent event) {
    synchronized (this) {
        if (event.getType() == EventType.None) {
            connected = (event.getState() == KeeperState.SyncConnected);
            notifyAll();
        } else if (event.getPath() != null && event.getPath().equals(ZooDefs.CONFIG_NODE)) {
            // in prod code never block the event thread!
            zk.sync(ZooDefs.CONFIG_NODE, this, null);
            zk.getConfig(this, this, null);
        }
    }
}

public void processResult(int rc, String path, Object ctx, byte[] data, Stat stat) {
    if (path != null && path.equals(ZooDefs.CONFIG_NODE)) {
        String config[] = ConfigUtils.getClientConfigStr(new String(data)).split(" "); // similar to config -c
        long version = Long.parseLong(config[0], 16);
        if (this.configVersion == null) {
            this.configVersion = version;
        } else if (version > this.configVersion) {
            hostList = config[1];
            try {
                // updateServerList is not blocking, but it may cause the client to close its socket and
                // migrate to a different server. In practice it's better to wait a short period of time,
                // chosen randomly, so that different clients migrate at different times
                zk.updateServerList(hostList);
            } catch (IOException e) {
                System.err.println("Error updating server list");
                e.printStackTrace();
            }
            this.configVersion = version;
        }
    }
}