-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
* Another new design for Qpid clustering.
For background see [[./new-cluster-design.txt]] which describes the issues
with the old design and a new active-active design that could replace it.
This document describes an alternative hot-standby approach.
** Delivery guarantee
We guarantee N-way redundant, at-least-once delivery. Once a message
from a client has been acknowledged by the broker, it will be
delivered even if N-1 brokers subsequently fail. There may be
duplicates in the event of a failure, but we don't create duplicates
during normal operation (i.e. when no brokers have failed).
This is the same guarantee as the old cluster and the alternative
active-active design.
** Active-active vs. hot standby (aka primary-backup)
An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail-over to any other
live broker.
A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups"). Clients
are only served by the primary; clients that connect to a backup are
redirected to the primary. The backups are kept up to date in real time
by the primary; if the primary fails, a backup is elected to be the new
primary.
Aside: A cold-standby cluster is possible using a standalone broker,
CMAN and shared storage. In this scenario only one broker runs at a
time, writing to a shared store. If it fails, another broker is started
(by CMAN) and recovers from the store. This bears investigation, but
the store recovery time is probably too long for failover.
** Why hot standby?
Active-active has some advantages:
- Finding a broker on startup or failover is simple, just pick any live broker.
- All brokers are always running in active mode; there is no idle standby capacity.
- Distributing clients across brokers gives better performance, but see [1].
- A broker failure affects only clients connected to that broker.
The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are two approaches:
Predictive: each broker predicts which messages others will take. This
was the main weakness of the old design, so it is not appealing.
Locking: brokers "lock" a queue in order to take messages. This is
complex to implement, and it's not straightforward to determine the
most performant strategy for passing the lock.
Hot-standby removes this problem. Only the primary can modify queues,
so it just has to tell the backups what it is doing; there's no
locking.
The primary can enqueue messages and replicate asynchronously -
exactly like the store does, but it "writes" to the replicas over the
network rather than writing to disk.
** Failover in a hot-standby cluster.
Hot-standby has some potential performance issues around failover:
- Failover "spike": when the primary fails every client will fail over
at the same time, putting strain on the system.
- Until a new primary is elected, the cluster cannot serve any clients
  or redirect clients to the new primary.
We want to minimize the number of re-connect attempts that clients
have to make. The cluster can use a well-known algorithm to choose the
new primary (e.g. round robin on a known sequence of brokers) so that
clients can guess the new primary correctly in most cases.
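For illustration, a minimal sketch of how a client could implement this
guess, assuming every client is configured with the same ordered broker
list (the names here are hypothetical, not actual Qpid client API):
#+BEGIN_SRC c++
// Sketch: guess the new primary by round robin over a known, ordered
// broker list. Hypothetical names, not the real Qpid client API.
#include <string>
#include <vector>

struct BrokerAddress { std::string host; int port; };

class PrimaryGuesser {
    std::vector<BrokerAddress> brokers; // same order on every client
    size_t current;                     // index of the last known primary
  public:
    PrimaryGuesser(const std::vector<BrokerAddress>& b, size_t initial)
        : brokers(b), current(initial) {}

    // On failover every client advances to the same next broker, so
    // most clients guess the new primary on their first attempt.
    const BrokerAddress& next() {
        current = (current + 1) % brokers.size();
        return brokers[current];
    }
};
#+END_SRC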
Even if clients do guess correctly, it may be that the new primary is
not yet aware of the death of the old primary, which may cause
multiple failed connect attempts before clients eventually get
connected. We will need to prototype to see how much this happens in
reality and how we can best get clients redirected.
** Threading and performance.
The primary-backup cluster operates analogously to the way the disk store does now:
- use the same MessageStore interface as the store to interact with the broker
- use the same asynchronous-completion model for replicating messages.
- use the same recovery interfaces (?) for new backups joining.
Re-using the well-established store design gives credibility to the new cluster design.
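As a rough sketch of that contract (this is not the real
qpid::broker::MessageStore signature, only an illustration of the
asynchronous-completion idea the cluster would reuse):
#+BEGIN_SRC c++
// Sketch of a store-style replication interface with asynchronous
// completion. Simplified; the real qpid::broker::MessageStore
// interface differs in detail.
#include <functional>
#include <memory>

struct PersistableMessage;  // broker message, defined elsewhere
struct PersistableQueue;    // broker queue, defined elsewhere

class ReplicatingStore {
  public:
    virtual ~ReplicatingStore() {}
    // Called by the broker on enqueue. The implementation multicasts
    // the message to the backups and returns immediately; when all
    // backups have acknowledged it calls done(), which lets the
    // broker complete the client's command, exactly as the disk
    // store completes a write.
    virtual void enqueue(std::shared_ptr<PersistableMessage> msg,
                         PersistableQueue& queue,
                         std::function<void()> done) = 0;
};
#+END_SRC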
The single CPG dispatch thread was a severe performance bottleneck for the old cluster.
The primary has the same threading model as a standalone broker with
a store, which we know performs well.
If we use CPG for replication of messages, the backups will receive
messages in the CPG dispatch thread. To get more concurrency, the CPG
thread can dump work onto internal PollableQueues to be processed in
parallel.
Messages from the same broker queue need to go onto the same
PollableQueue. There could be a separate PollableQueue for each broker
queue. If that's too resource intensive we can use a fixed set of
PollableQueues and assign broker queues to PollableQueues via hashing
or round robin.
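A sketch of the hashed assignment, with WorkerQueue standing in for
the broker's PollableQueue (names are illustrative):
#+BEGIN_SRC c++
// Sketch: map each broker queue to one of a fixed pool of worker
// queues by hashing its name, so all events for a given broker queue
// are processed in order while different queues run in parallel.
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

template <class Event> class WorkerQueue { /* ... */ };

template <class Event>
class Dispatcher {
    std::vector<WorkerQueue<Event>*> workers; // fixed-size pool
  public:
    explicit Dispatcher(std::vector<WorkerQueue<Event>*> w)
        : workers(std::move(w)) {}

    // The same queue name always hashes to the same worker,
    // preserving per-queue ordering.
    WorkerQueue<Event>& workerFor(const std::string& queueName) {
        size_t i = std::hash<std::string>()(queueName) % workers.size();
        return *workers[i];
    }
};
#+END_SRC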
Another possible optimization is to use multiple CPG groups: one per
broker queue, or a hashed set, to get more concurrency in the CPG
layer. The old cluster is not able to keep CPG busy.
TODO: Transactions pose a challenge for these concurrent models: how
to co-ordinate multiple messages being added to multiple queues
(committing a publish or rolling back an accept) so that all replicas
end up with the same message sequence while respecting atomicity.
** Use of CPG
CPG provides several benefits in the old cluster:
- tracking membership (essential for determining the primary)
- handling "spit brain" (integrates with partition support from CMAN)
- reliable multicast protocol to distribute messages.
I believe we still need CPG for membership and split brain. We could
experiment with sending the bulk traffic over AMQP connections.
** Flow control
Need to ensure that:
1) in-memory internal queues used by the cluster don't overflow;
2) the backups don't fall too far behind on processing CPG messages.
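One plausible mechanism for 1) is high/low watermark flow control on
each internal queue; a sketch, with thresholds and names invented for
illustration:
#+BEGIN_SRC c++
// Sketch: high/low watermark flow control for an internal queue.
// Producers are stopped at the high mark and resumed once the queue
// drains below the low mark, which avoids rapid stop/start cycling.
#include <cstddef>

class WatermarkFlowControl {
    size_t depth = 0;       // current queue depth
    const size_t high, low; // stop and resume thresholds
    bool stopped = false;
  public:
    WatermarkFlowControl(size_t highMark, size_t lowMark)
        : high(highMark), low(lowMark) {}

    // Called when a message is queued; returns false once the high
    // watermark is reached, telling the producer side to stop.
    bool onEnqueue() {
        if (++depth >= high) stopped = true;
        return !stopped;
    }

    // Called when a message is processed; producers may resume once
    // the depth drops back to the low watermark.
    bool onDequeue() {
        if (depth > 0) --depth;
        if (stopped && depth <= low) stopped = false;
        return !stopped;
    }
};
#+END_SRC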
** Recovery
When a new backup joins an active cluster it must get a snapshot
from one of the other backups, or from the primary if there are none.
In store terms this is "recovery" (the old cluster called it an
"update"). Compared to the old cluster, we only replicate the
well-defined data set of the store. The update was the crucial sore
spot of the old cluster.
We can also replicate it more efficiently by recovering queues in
reverse (LIFO) order. That means that as clients actively consume
messages from the front of the queue, they are reducing the work we
have to do in recovering from the back. (NOTE: this may not be
compatible with using the same recovery interfaces as the store.)
** Selective replication
In this model it's easy to support selective replication of individual queues via
configuration.
- Explicit exchange/queue declare argument and message boolean: x-qpid-replicate,
  treated analogously to the persistent/durable properties for the store
  (see the sketch after this list).
- If not explicitly marked, provide a choice of default:
  - default is replicate (replicate messages on replicated queues)
  - default is don't replicate
  - default is replicate persistent/durable messages only.
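A sketch of the resulting decision logic, assuming a tri-state
per-message marker and a configurable default (the enum and field
names are invented for illustration):
#+BEGIN_SRC c++
// Sketch: decide whether to replicate a message from an explicit
// x-qpid-replicate marker, falling back to a configured default.
enum class ReplicateDefault { All, None, Durable };

struct Message {
    bool durable;        // persistent/durable flag
    int replicateMarker; // -1 = unset, 0 = don't replicate, 1 = replicate
};

bool shouldReplicate(const Message& m, ReplicateDefault def) {
    if (m.replicateMarker != -1)          // explicit marker wins,
        return m.replicateMarker == 1;    // like durable for the store
    switch (def) {
      case ReplicateDefault::All:     return true;
      case ReplicateDefault::None:    return false;
      case ReplicateDefault::Durable: return m.durable;
    }
    return false;
}
#+END_SRC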
** Inconsistent errors
The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster till an error is
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.
We also have to include error handling in the async completion loop to
guarantee N-way at least once: we should only report success to the
client when we know the message was replicated and stored on all N-1
backups.
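A sketch of that completion bookkeeping, assuming one acknowledgement
arrives from each backup (names are illustrative):
#+BEGIN_SRC c++
// Sketch: complete the client's command only after all N-1 backups
// have acknowledged the message.
#include <atomic>
#include <functional>
#include <utility>

class ReplicationCompletion {
    std::atomic<int> pending;   // backup acks still outstanding
    std::function<void()> done; // completes the client command
  public:
    ReplicationCompletion(int backups, std::function<void()> onDone)
        : pending(backups), done(std::move(onDone)) {}

    // Called once per backup ack, possibly from different threads;
    // the last ack triggers completion back to the client.
    void backupAcked() {
        if (pending.fetch_sub(1) == 1) done();
    }
};
#+END_SRC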
TODO: We have many more options than the old cluster; we need to
figure out the best approach, or possibly allow multiple approaches.
We need to go through the various failure cases. We may be able to do
recovery on a per-queue basis rather than restarting an entire node.
** New members joining
We should be able to catch up much faster than the old design. A
new backup can catch up ("recover") the current cluster state on a
per-queue basis.
- queues can be updated in parallel
- "live" updates avoid the the "endless chase"
During a "live" update several things are happening on a queue:
- clients are publishing messages to the back of the queue, replicated to the backup
- clients are consuming messages from the front of the queue, replicated to the backup.
- the primary is sending pre-existing messages to the new backup.
The primary sends pre-existing messages in LIFO order - starting from
the back of the queue - while clients are consuming from the front.
The active consumers actually reduce the amount of work to be done, as
there's no need to replicate messages that are no longer on the queue.
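A sketch of the back-to-front update loop; the position bookkeeping is
invented for illustration, and the real broker tracks queue positions
differently:
#+BEGIN_SRC c++
// Sketch: send a queue's pre-existing messages to a new backup in
// LIFO order, stopping at messages the consumers have already taken
// (their dequeues are replicated separately), so live consumption
// shrinks the remaining update work.
#include <deque>
#include <functional>

struct QueuedMessage { long position; /* payload ... */ };

void recoverFromBack(const std::deque<QueuedMessage>& queue,
                     long frontPosition, // first not-yet-consumed position
                     std::function<void(const QueuedMessage&)> sendToBackup) {
    for (auto it = queue.rbegin(); it != queue.rend(); ++it) {
        if (it->position < frontPosition) break; // already consumed
        sendToBackup(*it);
    }
}
#+END_SRC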
* Steps to get there
** Baseline replication
Validate the overall design and get an initial notion of performance.
Just message and wiring replication: no update/recovery for new members
joining, a single CPG dispatch thread on backups, no failover, no
transactions.
** Failover
Electing the primary; backups redirect clients to the primary. Measure
failover time for a large number of clients. Strategies to minimize
the number of retries after a failure.
** Flow Control
Keep internal queues from overflowing. Similar to the internal flow
control in the old cluster. Needed for realistic performance/stress
tests.
** Concurrency
Experiment with multiple threads on backups, multiple CPG groups.
** Recovery/new member joining
Initial status handshake for new member. Recovering queues from the back.
** Transactions
TODO: How to implement transactions with concurrency. Worst solution:
a global --cluster-use-transactions flag that forces single-threaded
mode. We need to find a better solution.