-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

* Another new design for Qpid clustering.

For background see [[./new-cluster-design.txt]], which describes the issues
with the old design and a new active-active design that could replace it.

This document describes an alternative hot-standby approach.

** Delivery guarantee

We guarantee N-way redundant, at-least-once delivery. Once a message
from a client has been acknowledged by the broker, it will be
delivered even if N-1 brokers subsequently fail. There may be
duplicates in the event of a failure, but we don't create duplicates
during normal operation (i.e. when no brokers have failed).

This is the same guarantee as the old cluster and the alternative
active-active design.

** Active-active vs. hot standby (aka primary-backup)

An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail over to any other
live broker.

A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups"). Clients
are served only by the primary; clients that connect to a backup are
redirected to the primary. The backups are kept up to date in real
time by the primary; if the primary fails, a backup is elected to be
the new primary.

Aside: A cold-standby cluster is possible using a standalone broker,
CMAN and shared storage. In this scenario only one broker runs at a
time, writing to a shared store. If it fails, another broker is started
(by CMAN) and recovers from the store. This bears investigation but
the store recovery time is probably too long for failover.

** Why hot standby?

Active-active has some advantages:
- Finding a broker on startup or failover is simple: just pick any live broker.
- All brokers are always running in active mode; there is no idle standby capacity.
- Distributing clients across brokers gives better performance, but see [1].
- A broker failure affects only the clients connected to that broker.

The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are two approaches:

Predictive: each broker predicts which messages others will take. This
was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is
complex to implement, and it is not straightforward to determine the
most performant strategy for passing the lock.

Hot-standby removes this problem. Only the primary can modify queues,
so it just has to tell the backups what it is doing; there is no
locking.

The primary can enqueue messages and replicate asynchronously -
exactly like the store does, but it "writes" to the replicas over the
network rather than writing to disk.
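
A minimal sketch of that analogy, with hypothetical types (Message,
Backup and ReplicatingStore are illustrative stand-ins, not the
broker's real MessageStore interfaces):

#+begin_src C++
#include <atomic>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

struct Message { /* body, destination queue, sequence number, ... */ };
typedef std::function<void()> Completion;

// Stand-in for one backup connection. A real backup would invoke
// done() only after it has applied the enqueue, possibly much later
// and on another thread.
class Backup {
  public:
    void replicateEnqueue(const Message&, Completion done) { done(); }
};

// Plays the role the store plays today: enqueue() returns at once and
// the completion fires when all backups have acknowledged, just as
// the store's completion fires when the message is on disk.
class ReplicatingStore {
    std::vector<std::shared_ptr<Backup> > backups;
  public:
    explicit ReplicatingStore(std::vector<std::shared_ptr<Backup> > b)
        : backups(std::move(b)) {}

    void enqueue(const Message& m, Completion complete) {
        if (backups.empty()) { complete(); return; }  // nothing to wait for
        auto pending =
            std::make_shared<std::atomic<std::size_t> >(backups.size());
        for (auto& b : backups)
            b->replicateEnqueue(m, [pending, complete] {
                if (--*pending == 0) complete();  // last ack completes
            });
    }
};
#+end_src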

** Failover in a hot-standby cluster.

Hot-standby has some potential performance issues around failover:

- Failover "spike": when the primary fails, every client will fail over
  at the same time, putting strain on the system.

- Until a new primary is elected, the cluster cannot serve any clients
  or redirect them to the new primary.

We want to minimize the number of re-connect attempts that clients
have to make. The cluster can use a well-known algorithm to choose the
new primary (e.g. round robin on a known sequence of brokers) so that
clients can guess the new primary correctly in most cases.
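
For example, given a fixed broker ordering known to all members and
clients, everyone can compute the same successor with no extra
communication (a sketch; the URLs and the failure counter are
assumptions):

#+begin_src C++
#include <cstddef>
#include <string>
#include <vector>

// All members (and clients) share this ordering, e.g. from cluster
// configuration. The URLs here are placeholders.
const std::vector<std::string> brokers = {
    "amqp:tcp:host1:5672", "amqp:tcp:host2:5672", "amqp:tcp:host3:5672"};

// Round robin on the known sequence: after the Nth primary failure,
// everyone independently arrives at the same new primary.
std::string guessPrimary(std::size_t failures) {
    return brokers[failures % brokers.size()];
}
#+end_src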

Even if clients do guess correctly, it may be that the new primary is
not yet aware of the death of the old primary, which may cause
multiple failed connection attempts before clients eventually get
connected. We will need to prototype to see how much this happens in
reality and how we can best get clients redirected.
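
Until we have prototype measurements, a plausible client-side shape is
a bounded retry loop with a short backoff (a sketch; tryConnect, the
retry count and the delay are all placeholder choices):

#+begin_src C++
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Placeholder for a real connection attempt; returns false if the
// broker is unreachable or is not (yet) the primary.
bool tryConnect(const std::string& /*url*/) { return false; }

// Try the guessed primary first, then the rest of the known brokers,
// backing off briefly between rounds: the new primary may not yet
// know the old one is dead, so early attempts can fail.
bool failover(const std::vector<std::string>& candidates) {
    for (int round = 0; round < 10; ++round) {        // bounded retries
        for (const std::string& url : candidates)
            if (tryConnect(url)) return true;
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }
    return false;  // give up; caller reports failure
}
#+end_src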

** Threading and performance.

The primary-backup cluster operates analogously to the way the disk store does now:
- use the same MessageStore interface as the store to interact with the broker.
- use the same asynchronous-completion model for replicating messages.
- use the same recovery interfaces (?) for new backups joining.

Re-using the well-established store design gives credibility to the new cluster design.

The single CPG dispatch thread was a severe performance bottleneck for the old cluster.

The primary has the same threading model as a standalone broker with
a store, which we know performs well.

If we use CPG for replication of messages, the backups will receive
messages in the CPG dispatch thread. To get more concurrency, the CPG
thread can dump work onto internal PollableQueues to be processed in
parallel.

Messages from the same broker queue need to go onto the same
PollableQueue. There could be a separate PollableQueue for each broker
queue. If that's too resource intensive we can use a fixed set of
PollableQueues and assign broker queues to PollableQueues via hashing
or round robin.
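
A sketch of the hashed variant (Worker and Dispatcher are illustrative
stand-ins for PollableQueue and the CPG dispatch path, not the real
classes):

#+begin_src C++
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a PollableQueue worker: here work runs inline; in the
// broker each worker would drain its queue on a Poller thread.
struct Worker {
    void post(std::function<void()> work) { work(); }
};

// Called from the single CPG dispatch thread. Hashing the broker
// queue name picks a worker, so all events for one broker queue are
// processed in order by the same worker, while different broker
// queues proceed in parallel.
class Dispatcher {
    std::vector<Worker> workers;
  public:
    explicit Dispatcher(std::size_t n) : workers(n) {}

    void dispatch(const std::string& queueName, std::function<void()> event) {
        std::size_t i = std::hash<std::string>()(queueName) % workers.size();
        workers[i].post(std::move(event));
    }
};
#+end_src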

Another possible optimization is to use multiple CPG groups: one per
queue or a hashed set, to get more concurrency in the CPG layer. The
old cluster is not able to keep CPG busy.

TODO: Transactions pose a challenge with these concurrent models: how
to co-ordinate multiple messages being added (commit a publish or roll
back an accept) to multiple queues so that all replicas end up with
the same message sequence while respecting atomicity.

** Use of CPG

CPG provides several benefits in the old cluster:
- tracking membership (essential for determining the primary)
- handling "split brain" (integrates with partition support from CMAN)
- reliable multicast protocol to distribute messages.

I believe we still need CPG for membership and split brain. We could
experiment with sending the bulk traffic over AMQP connections.

** Flow control

Need to ensure that:
1) In-memory internal queues used by the cluster don't overflow.
2) The backups don't fall too far behind on processing CPG messages.
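
One way to bound the in-memory queues is simple blocking backpressure
(a sketch; the real broker would more likely tie into its existing
flow-control machinery, and blocking the CPG dispatch thread itself
would need care):

#+begin_src C++
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Bounded queue: producers block when the consumer (e.g. a backup's
// worker thread) falls too far behind, instead of growing without limit.
template <class T> class BoundedQueue {
    std::mutex m;
    std::condition_variable notFull, notEmpty;
    std::deque<T> q;
    const std::size_t limit;
  public:
    explicit BoundedQueue(std::size_t limit) : limit(limit) {}

    void push(T item) {
        std::unique_lock<std::mutex> l(m);
        notFull.wait(l, [this] { return q.size() < limit; });
        q.push_back(std::move(item));
        notEmpty.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> l(m);
        notEmpty.wait(l, [this] { return !q.empty(); });
        T item = std::move(q.front());
        q.pop_front();
        notFull.notify_one();
        return item;
    }
};
#+end_src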

** Recovery
When a new backup joins an active cluster it must get a snapshot
from one of the other backups, or from the primary if there are no
other backups. In store terms this is "recovery" (the old cluster
called it an "update").

Compared to the old cluster we replicate only the well-defined data
set of the store. This was the crucial sore spot of the old cluster.

We can also replicate it more efficiently by recovering queues in
reverse (LIFO) order. That means that as clients actively consume
messages from the front of the queue, they are reducing the work we
have to do in recovering from the back. (NOTE: this may not be
compatible with using the same recovery interfaces as the store.)

** Selective replication
In this model it's easy to support selective replication of individual
queues via configuration:
- Explicit exchange/queue declare argument and message boolean:
  x-qpid-replicate. Treated analogously to the persistent/durable
  properties for the store.
- If not explicitly marked, provide a choice of default:
  - default is replicate (replicated message on replicated queue)
  - default is don't replicate
  - default is replicate persistent/durable messages.
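
A sketch of the decision logic, with the x-qpid-replicate marker
taking precedence over the configured default (the enum names and the
tri-state representation are illustrative assumptions):

#+begin_src C++
// Tri-state for the x-qpid-replicate argument/boolean.
enum class Marker { Unset, Replicate, DontReplicate };
// The three configurable defaults listed above.
enum class Default { ReplicateAll, ReplicateNone, ReplicateDurable };

struct Message {
    bool durable;
    Marker marker;  // explicit per-message/per-queue setting, if any
};

// The explicit marker wins; otherwise fall back to the configured
// default, treated analogously to the durable flag for the store.
bool shouldReplicate(const Message& m, Default def) {
    if (m.marker != Marker::Unset) return m.marker == Marker::Replicate;
    switch (def) {
      case Default::ReplicateAll:     return true;
      case Default::ReplicateNone:    return false;
      case Default::ReplicateDurable: return m.durable;
    }
    return true;  // unreachable; keeps compilers happy
}
#+end_src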

** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster until an error is
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.

We also have to include error handling in the async completion loop to
guarantee N-way, at-least-once delivery: we should only report success
to the client when we know the message was replicated and stored on
all N-1 backups.
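
A sketch of error handling in the completion: the client's command is
reported successful only when every participant has succeeded
(GatedCompletion is hypothetical; the broker's real async-completion
classes would play this role):

#+begin_src C++
#include <atomic>
#include <cstddef>
#include <functional>

// Reports success to the client only if the store write and every one
// of the N-1 backup replications succeeded; any single failure turns
// the final result into a failure.
class GatedCompletion {
    std::atomic<std::size_t> pending;
    std::atomic<bool> ok;
    std::function<void(bool)> report;  // completes or fails the client's command
  public:
    GatedCompletion(std::size_t participants, std::function<void(bool)> r)
        : pending(participants), ok(true), report(std::move(r)) {}

    // Each participant (store write, each backup ack) calls this once.
    void finish(bool success) {
        if (!success) ok = false;
        if (--pending == 0) report(ok.load());  // last participant reports
    }
};
#+end_src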

TODO: We have a lot more options than the old cluster and need to
figure out the best approach, or possibly allow multiple approaches.
Need to go through the various failure cases. We may be able to do
recovery on a per-queue basis rather than restarting an entire node.

** New members joining

We should be able to catch up much faster than the old design. A
new backup can catch up ("recover") the current cluster state on a
per-queue basis.
- queues can be updated in parallel
- "live" updates avoid the "endless chase"

During a "live" update several things are happening on a queue:
- clients are publishing messages to the back of the queue, replicated to the backup.
- clients are consuming messages from the front of the queue, replicated to the backup.
- the primary is sending pre-existing messages to the new backup.

The primary sends pre-existing messages in LIFO order, starting from
the back of the queue, while at the same time clients are consuming
from the front. The active consumers actually reduce the amount of
work to be done, as there's no need to replicate messages that are no
longer on the queue.
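
A sketch of why this converges, with an index standing in for the
queue head (the names and the integer messages are purely
illustrative; real snapshotting and locking are elided):

#+begin_src C++
#include <cstddef>
#include <functional>
#include <vector>

// 'currentHead' advances as live consumers take messages from the
// front (those dequeues are replicated to the new backup anyway),
// while the updater walks backwards from the tail. Recovery is done
// when the two meet, so every message consumed during the update is
// a message we never have to send.
void recoverLifo(const std::vector<int>& queue,
                 std::function<std::size_t()> currentHead,
                 std::function<void(int)> sendToBackup) {
    std::size_t tail = queue.size();
    while (tail > currentHead())
        sendToBackup(queue[--tail]);
}
#+end_src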

* Steps to get there

** Baseline replication
Validate the overall design and get an initial notion of performance.
Just message+wiring replication, no update/recovery for new members
joining, single CPG dispatch thread on backups, no failover, no
transactions.

** Failover
Electing the primary, backups redirecting to the primary. Measure
failover time for a large number of clients. Strategies to minimize
the number of retries after a failure.

** Flow Control
Keep internal queues from overflowing. Similar to internal flow
control in the old cluster. Needed for realistic performance/stress
tests.

** Concurrency
Experiment with multiple threads on backups, multiple CPG groups.

** Recovery/new member joining
Initial status handshake for new members. Recovering queues from the back.

** Transactions
TODO: How to implement transactions with concurrency. Worst solution:
a global --cluster-use-transactions flag that forces single-threaded
mode. Need to find a better solution.