qpid/cpp/design_docs/new-ha-design.txt - qpid - Git at Google

 -*-org-*-
 # Licensed to the Apache Software Foundation (ASF) under one
 # or more contributor license agreements.  See the NOTICE file
 # distributed with this work for additional information
 # regarding copyright ownership.  The ASF licenses this file
 # to you under the Apache License, Version 2.0 (the
 # "License"); you may not use this file except in compliance
 # with the License.  You may obtain a copy of the License at
 #
 #   http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing,
 # software distributed under the License is distributed on an
 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.

 * An active-passive, hot-standby design for Qpid clustering.

 This document describes an active-passive approach to HA based on
 queue browsing to replicate message data.

 See [[./old-cluster-issues.txt]] for issues with the old design.

 ** Active-active vs. active-passive (hot-standby)

 An active-active cluster allows clients to connect to any broker in
 the cluster. If a broker fails, clients can fail-over to any other
 live broker.

 A hot-standby cluster has only one active broker at a time (the
 "primary") and one or more brokers on standby (the "backups"). Clients
 are only served by the primary, clients that connect to a backup are
 redirected to the primary. The backups are kept up-to-date in real
 time by the primary, if the primary fails a backup is elected to be
 the new primary.

 The main problem with active-active is co-ordinating consumers of the
 same queue on multiple brokers such that there are no duplicates in
 normal operation. There are 2 approaches:

 Predictive: each broker predicts which messages others will take. This
 the main weakness of the old design so not appealing.

 Locking: brokers "lock" a queue in order to take messages. This is
 complex to implement and it is not straighforward to determine the
 best strategy for passing the lock. In tests to date it results in
 very high latencies (10x standalone broker).

 Hot-standby removes this problem. Only the primary can modify queues
 so it just has to tell the backups what it is doing, there's no
 locking.

 The primary can enqueue messages and replicate asynchronously -
 exactly like the store does, but it "writes" to the replicas over the
 network rather than writing to disk.

 ** Replicating browsers

 The unit of replication is a replicating browser. This is an AMQP
 consumer that browses a remote queue via a federation link and
 maintains a local replica of the queue. As well as browsing the remote
 messages as they are added the browser receives dequeue notifications
 when they are dequeued remotely.

 On the primary broker incoming mesage transfers are completed only when
 all of the replicating browsers have signaled completion. Thus a completed
 message is guaranteed to be on the backups.

 ** Failover and Cluster Resource Managers

 We want to delegate the failover management to an existing cluster
 resource manager. Initially this is rgmanager from Cluster Suite, but
 other managers (e.g. PaceMaker) could be supported in future.

 Rgmanager takes care of starting and stopping brokers and informing
 brokers of their roles as primary or backup. It ensures there's
 exactly one primary broker running at any time. It also tracks quorum
 and protects against split-brain.

 Rgmanger can also manage a virtual IP address so clients can just
 retry on a single address to fail over. Alternatively we will also
 support configuring a fixed list of broker addresses when qpid is run
 outside of a resource manager.

 Aside: Cold-standby is also possible using rgmanager with shared
 storage for the message store (e.g. GFS). If the broker fails, another
 broker is started on a different node and and recovers from the
 store. This bears investigation but the store recovery times are
 likely too long for failover.

 ** Replicating configuration

 New queues and exchanges and their bindings also need to be replicated.
 This is done by a QMF client that registers for configuration changes
 on the remote broker and mirrors them in the local broker.

 ** Use of CPG (openais/corosync)

 CPG is not required in this model, an external cluster resource
 manager takes care of membership and quorum.

 ** Selective replication

 In this model it's easy to support selective replication of individual queues via
 configuration.

 Explicit exchange/queue qpid.replicate argument:
 - none: the object is not replicated
 - configuration: queues, exchanges and bindings are replicated but messages are not.
 - messages: configuration and messages are replicated.

 TODO: provide configurable default for qpid.replicate

 [GRS: current prototype relies on queue sequence for message identity
 so selectively replicating certain messages on a given queue would be
 challenging. Selectively replicating certain queues however is trivial.]

 ** Inconsistent errors

 The new design eliminates most sources of inconsistent errors in the
 old design (connections, sessions, security, management etc.) and
 eliminates the need to stall the whole cluster till an error is
 resolved. We still have to handle inconsistent store errors when store
 and cluster are used together.

 We have 2 options (configurable) for handling inconsistent errors,
 on the backup that fails to store a message from primary we can:
 - Abort the backup broker allowing it to be re-started.
 - Raise a critical error on the backup broker but carry on with the message lost.
 We can configure the option to abort or carry on per-queue, we
 will also provide a broker-wide configurable default.

 ** New backups connecting to primary.

 When the primary fails, one of the backups becomes primary and the
 others connect to the new primary as backups.

 The backups can take advantage of the messages they already have
 backed up, the new primary only needs to replicate new messages.

 To keep the N-way guarantee the primary needs to delay completion on
 new messages until all the back-ups have caught up. However if a
 backup does not catch up within some timeout, it is considered dead
 and its messages are completed so the cluster can carry on with N-1
 members.


 ** Broker discovery and lifecycle.

 The cluster has a client URL that can contain a single virtual IP
 address or a list of real IP addresses for the cluster.

 In backup mode, brokers reject connections normal client connections
 so clients will fail over to the primary. HA admin tools mark their
 connections so they are allowed to connect to backup brokers.

 Clients discover the primary by re-trying connection to the client URL
 until the successfully connect to the primary. In the case of a
 virtual IP they re-try the same address until it is relocated to the
 primary. In the case of a list of IPs the client tries each in
 turn. Clients do multiple retries over a configured period of time
 before giving up.

 Backup brokers discover the primary in the same way as clients. There
 is a separate broker URL for brokers since they often will connect
 over a different network. The broker URL has to be a list of real
 addresses rather than a virtual address.

 Brokers have the following states:
 - connecting: Backup broker trying to connect to primary - loops retrying broker URL.
 - catchup: Backup connected to primary, catching up on pre-existing configuration & messages.
 - ready: Backup fully caught-up, ready to take over as primary.
 - primary: Acting as primary, serving clients.

 ** Interaction with rgmanager

 rgmanager interacts with qpid via 2 service scripts: backup &
 primary. These scripts interact with the underlying qpidd
 service. rgmanager picks the new primary when the old primary
 fails. In a partition it also takes care of killing inquorate brokers.

 *** Initial cluster start

 rgmanager starts the backup service on all nodes and the primary service on one node.

 On the backup nodes qpidd is in the connecting state. The primary node goes into
 the primary state. Backups discover the primary, connect and catch up.

 *** Failover

 primary broker or node fails. Backup brokers see disconnect and go
 back to connecting mode.

 rgmanager notices the failure and starts the primary service on a new node.
 This tells qpidd to go to primary mode. Backups re-connect and catch up.

 The primary can only be started on nodes where there is a ready backup service.
 If the backup is catching up, it's not eligible to take over as primary.

 *** Failback

 Cluster of N brokers has suffered a failure, only N-1 brokers
 remain. We want to start a new broker (possibly on a new node) to
 restore redundancy.

 If the new broker has a new IP address, the sysadmin pushes a new URL
 to all the existing brokers.

 The new broker starts in connecting mode. It discovers the primary,
 connects and catches up.

 *** Failure during failback

 A second failure occurs before the new backup B completes its catch
 up. The backup B refuses to become primary by failing the primary
 start script if rgmanager chooses it, so rgmanager will try another
 (hopefully caught-up) backup to become primary.

 *** Backup failure

 If a backup fails it is re-started. It connects and catches up from scratch
 to become a ready backup.

 ** Interaction with the store.

 Clean shutdown: entire cluster is shut down cleanly by an admin tool:
 - primary stops accepting client connections till shutdown is complete.
 - backups come fully up to speed with primary state.
 - all shut down marking stores as 'clean' with an identifying  UUID.

 After clean shutdown the cluster can re-start automatically as all nodes
 have equivalent stores. Stores starting up with the wrong UUID will fail.

 Stored status: clean(UUID)/dirty, primary/backup, generation number.
 - All stores are marked dirty except after a clean shutdown.
 - Generation number: passed to backups and incremented by new primary.

 After total crash must manually identify the "best" store, provide admin tool.
 Best = highest generation number among stores in primary state.

 Recovering from total crash: all brokers will refuse to start as all stores are dirty.
 Check the stores manually to find the best one, then either:
 1. Copy stores:
  - copy good store to all hosts
  - restart qpidd on all hosts.
 2. Erase stores:
  - Erase the store on all other hosts.
  - Restart qpidd on the good store and wait for it to become primary.
  - Restart qpidd on all other hosts.

 Broker startup with store:
 - Dirty: refuse to start
 - Clean:
   - Start and load from store.
   - When connecting as backup, check UUID matches primary, shut down if not.
 - Empty: start ok, no UUID check with primary.

 ** Current Limitations

 (In no particular order at present)

 For message replication:

 LM1 - The re-synchronisation does not handle the case where a newly elected
 primary is *behind* one of the other backups. To address this I propose
 a new event for restting the sequence that the new primary would send
 out on detecting that a replicating browser is ahead of it, requesting
 that the replica revert back to a particular sequence number. The
 replica on receiving this event would then discard (i.e. dequeue) all
 the messages ahead of that sequence number and reset the counter to
 correctly sequence any subsequently delivered messages.

 LM2 - There is a need to handle wrap-around of the message sequence to avoid
 confusing the resynchronisation where a replica has been disconnected
 for a long time, sufficient for the sequence numbering to wrap around.

 LM3 - Transactional changes to queue state are not replicated atomically.

 LM4 - Acknowledgements are confirmed to clients before the message has been
 dequeued from replicas or indeed from the local store if that is
 asynchronous.

 LM5 - During failover, messages (re)published to a queue before there are
 the requisite number of replication subscriptions established will be
 confirmed to the publisher before they are replicated, leaving them
 vulnerable to a loss of the new primary before they are replicated.

 LM6 - persistence: In the event of a total cluster failure there are
 no tools to automatically identify the "latest" store. Once this
 is manually identfied, all other stores belonging to cluster members
 must be erased and the latest node must started as primary.

 LM6 - persistence: In the event of a single node failure, that nodes
 store must be erased before it can re-join the cluster.

 For configuration propagation:

 LC2 - Queue and exchange propagation is entirely asynchronous. There
 are three cases to consider here for queue creation:

 (a) where queues are created through the addressing syntax supported
 the messaging API, they should be recreated if needed on failover and
 message replication if required is dealt with seperately;

 (b) where queues are created using configuration tools by an
 administrator or by a script they can query the backups to verify the
 config has propagated and commands can be re-run if there is a failure
 before that;

 (c) where applications have more complex programs on which
 queues/exchanges are created using QMF or directly via 0-10 APIs, the
 completion of the command will not guarantee that the command has been
 carried out on other nodes.

 I.e. case (a) doesn't require anything (apart from LM5 in some cases),
 case (b) can be addressed in a simple manner through tooling but case
 (c) would require changes to the broker to allow client to simply
 determine when the command has fully propagated.

 LC3 - Queues that are not in the query response received when a
 replica establishes a propagation subscription but exist locally are
 not deleted. I.e. Deletion of queues/exchanges while a replica is not
 connected will not be propagated. Solution is to delete any queues
 marked for propagation that exist locally but do not show up in the
 query response.

 LC4 - It is possible on failover that the new primary did not
 previously receive a given QMF event while a backup did (sort of an
 analogous situation to LM1 but without an easy way to detect or remedy
 it).

 LC5 - Need richer control over which queues/exchanges are propagated, and
 which are not.

 LC6 - The events and query responses are not fully synchronized.

       In particular it *is* possible to not receive a delete event but
       for the deleted object to still show up in the query response
       (meaning the deletion is 'lost' to the update).

       It is also possible for an create event to be received as well
       as the created object being in the query response. Likewise it
       is possible to receive a delete event and a query response in
       which the object no longer appears. In these cases the event is
       essentially redundant.

       It is not possible to miss a create event and yet not to have
       the object in question in the query response however.


 * Benefits compared to previous cluster implementation.

 - Does not depend on openais/corosync, does not require multicast.
 - Can be integrated with different resource managers: for example rgmanager, PaceMaker, Veritas.
 - Can be ported to/implemented in other environments: e.g. Java, Windows
 - Disaster Recovery is just another backup, no need for separate queue replication mechanism.
 - Can take advantage of resource manager features, e.g. virtual IP addresses.
 - Fewer inconsistent errors (store failures) that can be handled without killing brokers.
 - Improved performance
 * User Documentation Notes

 Notes to seed initial user documentation. Loosely tracking the implementation,
 some points mentioned in the doc may not be implemented yet.

 ** High Availability Overview

 HA is implemented using a 'hot standby' approach. Clients are directed
 to a single "primary" broker. The primary executes client requests and
 also replicates them to one or more "backup" brokers. If the primary
 fails, one of the backups takes over the role of primary carrying on
 from where the primary left off. Clients will fail over to the new
 primary automatically and continue their work.

 TODO: at least once, deduplication.

 ** Enabling replication on the client.

 To enable replication set the qpid.replicate argument when creating a
 queue or exchange.

 This can have one of 3 values
 - none: the object is not replicated
 - configuration: queues, exchanges and bindings are replicated but messages are not.
 - messages: configuration and messages are replicated.

 TODO: examples
 TODO: more options for default value of qpid.replicate

 A HA client connection has multiple addresses, one for each broker. If
 the it fails to connect to an address, or the connection breaks,
 it will automatically fail-over to another address.

 Only the primary broker accepts connections, the backup brokers
 redirect connection attempts to the primary. If the primary fails, one
 of the backups is promoted to primary and clients fail-over to the new
 primary.

 TODO: using multiple-address connections, examples c++, python, java.

 TODO: dynamic cluster addressing?

 TODO: need de-duplication.

 ** Enabling replication on the broker.

 Network topology: backup links, separate client/broker networks.
 Describe failover mechanisms.
 - Client view: URLs, failover, exclusion & discovery.
 - Broker view: similar.
 Role of rmganager

 ** Configuring rgmanager

 ** Configuring qpidd
	--org--
	# Licensed to the Apache Software Foundation (ASF) under one
	# or more contributor license agreements. See the NOTICE file
	# distributed with this work for additional information
	# regarding copyright ownership. The ASF licenses this file
	# to you under the Apache License, Version 2.0 (the
	# "License"); you may not use this file except in compliance
	# with the License. You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0
	#
	# Unless required by applicable law or agreed to in writing,
	# software distributed under the License is distributed on an
	# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	# KIND, either express or implied. See the License for the
	# specific language governing permissions and limitations
	# under the License.

	* An active-passive, hot-standby design for Qpid clustering.

	This document describes an active-passive approach to HA based on
	queue browsing to replicate message data.

	See [[./old-cluster-issues.txt]] for issues with the old design.

	** Active-active vs. active-passive (hot-standby)

	An active-active cluster allows clients to connect to any broker in
	the cluster. If a broker fails, clients can fail-over to any other
	live broker.

	A hot-standby cluster has only one active broker at a time (the
	"primary") and one or more brokers on standby (the "backups"). Clients
	are only served by the primary, clients that connect to a backup are
	redirected to the primary. The backups are kept up-to-date in real
	time by the primary, if the primary fails a backup is elected to be
	the new primary.

	The main problem with active-active is co-ordinating consumers of the
	same queue on multiple brokers such that there are no duplicates in
	normal operation. There are 2 approaches:

	Predictive: each broker predicts which messages others will take. This
	the main weakness of the old design so not appealing.

	Locking: brokers "lock" a queue in order to take messages. This is
	complex to implement and it is not straighforward to determine the
	best strategy for passing the lock. In tests to date it results in
	very high latencies (10x standalone broker).

	Hot-standby removes this problem. Only the primary can modify queues
	so it just has to tell the backups what it is doing, there's no
	locking.

	The primary can enqueue messages and replicate asynchronously -
	exactly like the store does, but it "writes" to the replicas over the
	network rather than writing to disk.

	** Replicating browsers

	The unit of replication is a replicating browser. This is an AMQP
	consumer that browses a remote queue via a federation link and
	maintains a local replica of the queue. As well as browsing the remote
	messages as they are added the browser receives dequeue notifications
	when they are dequeued remotely.

	On the primary broker incoming mesage transfers are completed only when
	all of the replicating browsers have signaled completion. Thus a completed
	message is guaranteed to be on the backups.

	** Failover and Cluster Resource Managers

	We want to delegate the failover management to an existing cluster
	resource manager. Initially this is rgmanager from Cluster Suite, but
	other managers (e.g. PaceMaker) could be supported in future.

	Rgmanager takes care of starting and stopping brokers and informing
	brokers of their roles as primary or backup. It ensures there's
	exactly one primary broker running at any time. It also tracks quorum
	and protects against split-brain.

	Rgmanger can also manage a virtual IP address so clients can just
	retry on a single address to fail over. Alternatively we will also
	support configuring a fixed list of broker addresses when qpid is run
	outside of a resource manager.

	Aside: Cold-standby is also possible using rgmanager with shared
	storage for the message store (e.g. GFS). If the broker fails, another
	broker is started on a different node and and recovers from the
	store. This bears investigation but the store recovery times are
	likely too long for failover.

	** Replicating configuration

	New queues and exchanges and their bindings also need to be replicated.
	This is done by a QMF client that registers for configuration changes
	on the remote broker and mirrors them in the local broker.

	** Use of CPG (openais/corosync)

	CPG is not required in this model, an external cluster resource
	manager takes care of membership and quorum.

	** Selective replication

	In this model it's easy to support selective replication of individual queues via
	configuration.

	Explicit exchange/queue qpid.replicate argument:
	- none: the object is not replicated
	- configuration: queues, exchanges and bindings are replicated but messages are not.
	- messages: configuration and messages are replicated.

	TODO: provide configurable default for qpid.replicate

	[GRS: current prototype relies on queue sequence for message identity
	so selectively replicating certain messages on a given queue would be
	challenging. Selectively replicating certain queues however is trivial.]

	** Inconsistent errors

	The new design eliminates most sources of inconsistent errors in the
	old design (connections, sessions, security, management etc.) and
	eliminates the need to stall the whole cluster till an error is
	resolved. We still have to handle inconsistent store errors when store
	and cluster are used together.

	We have 2 options (configurable) for handling inconsistent errors,
	on the backup that fails to store a message from primary we can:
	- Abort the backup broker allowing it to be re-started.
	- Raise a critical error on the backup broker but carry on with the message lost.
	We can configure the option to abort or carry on per-queue, we
	will also provide a broker-wide configurable default.

	** New backups connecting to primary.

	When the primary fails, one of the backups becomes primary and the
	others connect to the new primary as backups.

	The backups can take advantage of the messages they already have
	backed up, the new primary only needs to replicate new messages.

	To keep the N-way guarantee the primary needs to delay completion on
	new messages until all the back-ups have caught up. However if a
	backup does not catch up within some timeout, it is considered dead
	and its messages are completed so the cluster can carry on with N-1
	members.


	** Broker discovery and lifecycle.

	The cluster has a client URL that can contain a single virtual IP
	address or a list of real IP addresses for the cluster.

	In backup mode, brokers reject connections normal client connections
	so clients will fail over to the primary. HA admin tools mark their
	connections so they are allowed to connect to backup brokers.

	Clients discover the primary by re-trying connection to the client URL
	until the successfully connect to the primary. In the case of a
	virtual IP they re-try the same address until it is relocated to the
	primary. In the case of a list of IPs the client tries each in
	turn. Clients do multiple retries over a configured period of time
	before giving up.

	Backup brokers discover the primary in the same way as clients. There
	is a separate broker URL for brokers since they often will connect
	over a different network. The broker URL has to be a list of real
	addresses rather than a virtual address.

	Brokers have the following states:
	- connecting: Backup broker trying to connect to primary - loops retrying broker URL.
	- catchup: Backup connected to primary, catching up on pre-existing configuration & messages.
	- ready: Backup fully caught-up, ready to take over as primary.
	- primary: Acting as primary, serving clients.

	** Interaction with rgmanager

	rgmanager interacts with qpid via 2 service scripts: backup &
	primary. These scripts interact with the underlying qpidd
	service. rgmanager picks the new primary when the old primary
	fails. In a partition it also takes care of killing inquorate brokers.

	*** Initial cluster start

	rgmanager starts the backup service on all nodes and the primary service on one node.

	On the backup nodes qpidd is in the connecting state. The primary node goes into
	the primary state. Backups discover the primary, connect and catch up.

	*** Failover

	primary broker or node fails. Backup brokers see disconnect and go
	back to connecting mode.

	rgmanager notices the failure and starts the primary service on a new node.
	This tells qpidd to go to primary mode. Backups re-connect and catch up.

	The primary can only be started on nodes where there is a ready backup service.
	If the backup is catching up, it's not eligible to take over as primary.

	*** Failback

	Cluster of N brokers has suffered a failure, only N-1 brokers
	remain. We want to start a new broker (possibly on a new node) to
	restore redundancy.

	If the new broker has a new IP address, the sysadmin pushes a new URL
	to all the existing brokers.

	The new broker starts in connecting mode. It discovers the primary,
	connects and catches up.

	*** Failure during failback

	A second failure occurs before the new backup B completes its catch
	up. The backup B refuses to become primary by failing the primary
	start script if rgmanager chooses it, so rgmanager will try another
	(hopefully caught-up) backup to become primary.

	*** Backup failure

	If a backup fails it is re-started. It connects and catches up from scratch
	to become a ready backup.

	** Interaction with the store.

	Clean shutdown: entire cluster is shut down cleanly by an admin tool:
	- primary stops accepting client connections till shutdown is complete.
	- backups come fully up to speed with primary state.
	- all shut down marking stores as 'clean' with an identifying UUID.

	After clean shutdown the cluster can re-start automatically as all nodes
	have equivalent stores. Stores starting up with the wrong UUID will fail.

	Stored status: clean(UUID)/dirty, primary/backup, generation number.
	- All stores are marked dirty except after a clean shutdown.
	- Generation number: passed to backups and incremented by new primary.

	After total crash must manually identify the "best" store, provide admin tool.
	Best = highest generation number among stores in primary state.

	Recovering from total crash: all brokers will refuse to start as all stores are dirty.
	Check the stores manually to find the best one, then either:
	1. Copy stores:
	- copy good store to all hosts
	- restart qpidd on all hosts.
	2. Erase stores:
	- Erase the store on all other hosts.
	- Restart qpidd on the good store and wait for it to become primary.
	- Restart qpidd on all other hosts.

	Broker startup with store:
	- Dirty: refuse to start
	- Clean:
	- Start and load from store.
	- When connecting as backup, check UUID matches primary, shut down if not.
	- Empty: start ok, no UUID check with primary.

	** Current Limitations

	(In no particular order at present)

	For message replication:

	LM1 - The re-synchronisation does not handle the case where a newly elected
	primary is behind one of the other backups. To address this I propose
	a new event for restting the sequence that the new primary would send
	out on detecting that a replicating browser is ahead of it, requesting
	that the replica revert back to a particular sequence number. The
	replica on receiving this event would then discard (i.e. dequeue) all
	the messages ahead of that sequence number and reset the counter to
	correctly sequence any subsequently delivered messages.

	LM2 - There is a need to handle wrap-around of the message sequence to avoid
	confusing the resynchronisation where a replica has been disconnected
	for a long time, sufficient for the sequence numbering to wrap around.

	LM3 - Transactional changes to queue state are not replicated atomically.

	LM4 - Acknowledgements are confirmed to clients before the message has been
	dequeued from replicas or indeed from the local store if that is
	asynchronous.

	LM5 - During failover, messages (re)published to a queue before there are
	the requisite number of replication subscriptions established will be
	confirmed to the publisher before they are replicated, leaving them
	vulnerable to a loss of the new primary before they are replicated.

	LM6 - persistence: In the event of a total cluster failure there are
	no tools to automatically identify the "latest" store. Once this
	is manually identfied, all other stores belonging to cluster members
	must be erased and the latest node must started as primary.

	LM6 - persistence: In the event of a single node failure, that nodes
	store must be erased before it can re-join the cluster.

	For configuration propagation:

	LC2 - Queue and exchange propagation is entirely asynchronous. There
	are three cases to consider here for queue creation:

	(a) where queues are created through the addressing syntax supported
	the messaging API, they should be recreated if needed on failover and
	message replication if required is dealt with seperately;

	(b) where queues are created using configuration tools by an
	administrator or by a script they can query the backups to verify the
	config has propagated and commands can be re-run if there is a failure
	before that;

	(c) where applications have more complex programs on which
	queues/exchanges are created using QMF or directly via 0-10 APIs, the
	completion of the command will not guarantee that the command has been
	carried out on other nodes.

	I.e. case (a) doesn't require anything (apart from LM5 in some cases),
	case (b) can be addressed in a simple manner through tooling but case
	(c) would require changes to the broker to allow client to simply
	determine when the command has fully propagated.

	LC3 - Queues that are not in the query response received when a
	replica establishes a propagation subscription but exist locally are
	not deleted. I.e. Deletion of queues/exchanges while a replica is not
	connected will not be propagated. Solution is to delete any queues
	marked for propagation that exist locally but do not show up in the
	query response.

	LC4 - It is possible on failover that the new primary did not
	previously receive a given QMF event while a backup did (sort of an
	analogous situation to LM1 but without an easy way to detect or remedy
	it).

	LC5 - Need richer control over which queues/exchanges are propagated, and
	which are not.

	LC6 - The events and query responses are not fully synchronized.

	In particular it is possible to not receive a delete event but
	for the deleted object to still show up in the query response
	(meaning the deletion is 'lost' to the update).

	It is also possible for an create event to be received as well
	as the created object being in the query response. Likewise it
	is possible to receive a delete event and a query response in
	which the object no longer appears. In these cases the event is
	essentially redundant.

	It is not possible to miss a create event and yet not to have
	the object in question in the query response however.


	* Benefits compared to previous cluster implementation.

	- Does not depend on openais/corosync, does not require multicast.
	- Can be integrated with different resource managers: for example rgmanager, PaceMaker, Veritas.
	- Can be ported to/implemented in other environments: e.g. Java, Windows
	- Disaster Recovery is just another backup, no need for separate queue replication mechanism.
	- Can take advantage of resource manager features, e.g. virtual IP addresses.
	- Fewer inconsistent errors (store failures) that can be handled without killing brokers.
	- Improved performance
	* User Documentation Notes

	Notes to seed initial user documentation. Loosely tracking the implementation,
	some points mentioned in the doc may not be implemented yet.

	** High Availability Overview

	HA is implemented using a 'hot standby' approach. Clients are directed
	to a single "primary" broker. The primary executes client requests and
	also replicates them to one or more "backup" brokers. If the primary
	fails, one of the backups takes over the role of primary carrying on
	from where the primary left off. Clients will fail over to the new
	primary automatically and continue their work.

	TODO: at least once, deduplication.

	** Enabling replication on the client.

	To enable replication set the qpid.replicate argument when creating a
	queue or exchange.

	This can have one of 3 values
	- none: the object is not replicated
	- configuration: queues, exchanges and bindings are replicated but messages are not.
	- messages: configuration and messages are replicated.

	TODO: examples
	TODO: more options for default value of qpid.replicate

	A HA client connection has multiple addresses, one for each broker. If
	the it fails to connect to an address, or the connection breaks,
	it will automatically fail-over to another address.

	Only the primary broker accepts connections, the backup brokers
	redirect connection attempts to the primary. If the primary fails, one
	of the backups is promoted to primary and clients fail-over to the new
	primary.

	TODO: using multiple-address connections, examples c++, python, java.

	TODO: dynamic cluster addressing?

	TODO: need de-duplication.

	** Enabling replication on the broker.

	Network topology: backup links, separate client/broker networks.
	Describe failover mechanisms.
	- Client view: URLs, failover, exclusion & discovery.
	- Broker view: similar.
	Role of rmganager

	** Configuring rgmanager

	** Configuring qpidd