| -*-org-*- |
| # Licensed to the Apache Software Foundation (ASF) under one |
| # or more contributor license agreements. See the NOTICE file |
| # distributed with this work for additional information |
| # regarding copyright ownership. The ASF licenses this file |
| # to you under the Apache License, Version 2.0 (the |
| # "License"); you may not use this file except in compliance |
| # with the License. You may obtain a copy of the License at |
| # |
| # http://www.apache.org/licenses/LICENSE-2.0 |
| # |
| # Unless required by applicable law or agreed to in writing, |
| # software distributed under the License is distributed on an |
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| # KIND, either express or implied. See the License for the |
| # specific language governing permissions and limitations |
| # under the License. |
| |
| * An active-passive, hot-standby design for Qpid clustering. |
| |
| This document describes an active-passive approach to HA based on |
| queue browsing to replicate message data. |
| |
| See [[./old-cluster-issues.txt]] for issues with the old design. |
| |
| ** Active-active vs. active-passive (hot-standby) |
| |
| An active-active cluster allows clients to connect to any broker in |
| the cluster. If a broker fails, clients can fail-over to any other |
| live broker. |
| |
| A hot-standby cluster has only one active broker at a time (the |
| "primary") and one or more brokers on standby (the "backups"). Clients |
| are only served by the primary, clients that connect to a backup are |
| redirected to the primary. The backups are kept up-to-date in real |
| time by the primary, if the primary fails a backup is elected to be |
| the new primary. |
| |
| The main problem with active-active is co-ordinating consumers of the |
| same queue on multiple brokers such that there are no duplicates in |
| normal operation. There are 2 approaches: |
| |
| Predictive: each broker predicts which messages others will take. This |
| the main weakness of the old design so not appealing. |
| |
| Locking: brokers "lock" a queue in order to take messages. This is |
| complex to implement and it is not straighforward to determine the |
| best strategy for passing the lock. In tests to date it results in |
| very high latencies (10x standalone broker). |
| |
| Hot-standby removes this problem. Only the primary can modify queues |
| so it just has to tell the backups what it is doing, there's no |
| locking. |
| |
| The primary can enqueue messages and replicate asynchronously - |
| exactly like the store does, but it "writes" to the replicas over the |
| network rather than writing to disk. |
| |
| ** Replicating browsers |
| |
| The unit of replication is a replicating browser. This is an AMQP |
| consumer that browses a remote queue via a federation link and |
| maintains a local replica of the queue. As well as browsing the remote |
| messages as they are added the browser receives dequeue notifications |
| when they are dequeued remotely. |
| |
| On the primary broker incoming mesage transfers are completed only when |
| all of the replicating browsers have signaled completion. Thus a completed |
| message is guaranteed to be on the backups. |
| |
| ** Failover and Cluster Resource Managers |
| |
| We want to delegate the failover management to an existing cluster |
| resource manager. Initially this is rgmanager from Cluster Suite, but |
| other managers (e.g. PaceMaker) could be supported in future. |
| |
| Rgmanager takes care of starting and stopping brokers and informing |
| brokers of their roles as primary or backup. It ensures there's |
| exactly one primary broker running at any time. It also tracks quorum |
| and protects against split-brain. |
| |
| Rgmanger can also manage a virtual IP address so clients can just |
| retry on a single address to fail over. Alternatively we will also |
| support configuring a fixed list of broker addresses when qpid is run |
| outside of a resource manager. |
| |
| ** Replicating configuration |
| |
| New queues and exchanges and their bindings also need to be replicated. |
| This is done by a QMF client that registers for configuration changes |
| on the remote broker and mirrors them in the local broker. |
| |
| ** Use of CPG (openais/corosync) |
| |
| CPG is not required in this model, an external cluster resource |
| manager takes care of membership and quorum. |
| |
| ** Selective replication |
| |
| In this model it's easy to support selective replication of individual queues via |
| configuration. |
| |
| Explicit exchange/queue qpid.replicate argument: |
| - none: the object is not replicated |
| - configuration: queues, exchanges and bindings are replicated but messages are not. |
| - all: configuration and messages are replicated. |
| |
| Set configurable default all/configuration/none |
| |
| ** Inconsistent errors |
| |
| The new design eliminates most sources of inconsistent errors in the |
| old design (connections, sessions, security, management etc.) and |
| eliminates the need to stall the whole cluster till an error is |
| resolved. We still have to handle inconsistent store errors when store |
| and cluster are used together. |
| |
| We have 3 options (configurable) for handling inconsistent errors, |
| on the backup that fails to store a message from primary we can: |
| - Abort the backup broker allowing it to be re-started. |
| - Raise a critical error on the backup broker but carry on with the message lost. |
| - Reset and re-try replication for just the affected queue. |
| |
| We will provide some configurable options in this regard. |
| |
| ** New backups connecting to primary. |
| |
| When the primary fails, one of the backups becomes primary and the |
| others connect to the new primary as backups. |
| |
| The backups can take advantage of the messages they already have |
| backed up, the new primary only needs to replicate new messages. |
| |
| To keep the N-way guarantee the primary needs to delay completion on |
| new messages until all the back-ups have caught up. However if a |
| backup does not catch up within some timeout, it is considered dead |
| and its messages are completed so the cluster can carry on with N-1 |
| members. |
| |
| |
| ** Broker discovery and lifecycle. |
| |
| The cluster has a client URL that can contain a single virtual IP |
| address or a list of real IP addresses for the cluster. |
| |
| In backup mode, brokers reject connections normal client connections |
| so clients will fail over to the primary. HA admin tools mark their |
| connections so they are allowed to connect to backup brokers. |
| |
| Clients discover the primary by re-trying connection to all addresses in the client URL |
| until they successfully connect to the primary. In the case of a |
| virtual IP they re-try the same address until it is relocated to the |
| primary. In the case of a list of IPs the client tries each in |
| turn. Clients do multiple retries over a configured period of time |
| before giving up. |
| |
| Backup brokers discover the primary in the same way as clients. There |
| is a separate broker URL for brokers since they often will connect |
| over a different network. The broker URL has to be a list of real |
| addresses rather than a virtual address. |
| |
| ** Interaction with rgmanager |
| |
| rgmanager interacts with qpid via 2 service scripts: backup & |
| primary. These scripts interact with the underlying qpidd |
| service. rgmanager picks the new primary when the old primary |
| fails. In a partition it also takes care of killing inquorate brokers. |
| |
| *** Initial cluster start |
| |
| rgmanager starts the backup service on all nodes and the primary service on one node. |
| |
| On the backup nodes qpidd is in the connecting state. The primary node goes into |
| the primary state. Backups discover the primary, connect and catch up. |
| |
| *** Failover |
| |
| primary broker or node fails. Backup brokers see the disconnect and |
| start trying to re-connect to the new primary. |
| |
| rgmanager notices the failure and starts the primary service on a new node. |
| This tells qpidd to go to primary mode. Backups re-connect and catch up. |
| |
| The primary can only be started on nodes where there is a ready backup service. |
| If the backup is catching up, it's not eligible to take over as primary. |
| |
| *** Failback |
| |
| Cluster of N brokers has suffered a failure, only N-1 brokers |
| remain. We want to start a new broker (possibly on a new node) to |
| restore redundancy. |
| |
| If the new broker has a new IP address, the sysadmin pushes a new URL |
| to all the existing brokers. |
| |
| The new broker starts in connecting mode. It discovers the primary, |
| connects and catches up. |
| |
| *** Failure during failback |
| |
| A second failure occurs before the new backup B completes its catch |
| up. The backup B refuses to become primary by failing the primary |
| start script if rgmanager chooses it, so rgmanager will try another |
| (hopefully caught-up) backup to become primary. |
| |
| *** Backup failure |
| |
| If a backup fails it is re-started. It connects and catches up from scratch |
| to become a ready backup. |
| |
| ** Interaction with the store. |
| |
| Needs more detail: |
| |
| We want backup brokers to be able to user their stored messages on restart |
| so they don't have to download everything from priamary. |
| This requires a HA sequence number to be stored with the message |
| so the backup can identify which messages are in common with the primary. |
| |
| This will work very similarly to the way live backups can use in-memory |
| messages to reduce the download. |
| |
| Need to determine which broker is chosen as initial primary based on currency of |
| stores. Probably using stored generation numbers and status flags. Ideally |
| automated with rgmanager, or some intervention might be reqiured. |
| |
| * Current Limitations |
| |
| (In no particular order at present) |
| |
| For message replication (missing items have been fixed) |
| |
| LM3 - Transactional changes to queue state are not replicated atomically. |
| |
| LM4 - (No worse than store) Acknowledgements are confirmed to clients before the message |
| has been dequeued from replicas or indeed from the local store if that is asynchronous. |
| |
| LM6 - persistence: In the event of a total cluster failure there are |
| no tools to automatically identify the "latest" store. |
| |
| LM7 - persistence: In the event of a persistent broker being |
| re-started (due to failure or admin) it should be able to use its |
| stored messages to reduce the download required from the |
| primary. This means storing message IDs persistently. |
| |
| For configuration propagation: |
| |
| LC2 - Queue and exchange propagation is entirely asynchronous. There |
| are three cases to consider here for queue creation: |
| |
| (a) where queues are created through the addressing syntax supported |
| the messaging API, they should be recreated if needed on failover and |
| message replication if required is dealt with seperately; |
| |
| (b) where queues are created using configuration tools by an |
| administrator or by a script they can query the backups to verify the |
| config has propagated and commands can be re-run if there is a failure |
| before that; |
| |
| (c) where applications have more complex programs on which |
| queues/exchanges are created using QMF or directly via 0-10 APIs, the |
| completion of the command will not guarantee that the command has been |
| carried out on other nodes. |
| |
| I.e. case (a) doesn't require anything (apart from LM5 in some cases), |
| case (b) can be addressed in a simple manner through tooling but case |
| (c) would require changes to the broker to allow client to simply |
| determine when the command has fully propagated. |
| |
| LC4 - It is possible on failover that the new primary did not |
| previously receive a given QMF event while a backup did (sort of an |
| analogous situation to LM1 but without an easy way to detect or remedy |
| it). |
| |
| LC6 - The events and query responses are not fully synchronized. |
| |
| In particular it *is* possible to not receive a delete event but |
| for the deleted object to still show up in the query response |
| (meaning the deletion is 'lost' to the update). |
| |
| It is also possible for an create event to be received as well |
| as the created object being in the query response. Likewise it |
| is possible to receive a delete event and a query response in |
| which the object no longer appears. In these cases the event is |
| essentially redundant. |
| |
| It is not possible to miss a create event and yet not to have |
| the object in question in the query response however. |
| |
| LC7 Federated links from the primary will be lost in failover, they will not be re-connected on |
| the new primary. Federation links to the primary can fail over. |
| |
| LC9 The "last man standing" feature of the old cluster is not available. |
| |
| * Benefits compared to previous cluster implementation. |
| |
| - Allows per queue/exchange control over what is replicated. |
| - Does not depend on openais/corosync, does not require multicast. |
| - Can be integrated with different resource managers: for example rgmanager, PaceMaker, Veritas. |
| - Can be ported to/implemented in other environments: e.g. Java, Windows |
| - Disaster Recovery is just another backup, no need for separate queue replication mechanism. |
| - Can take advantage of resource manager features, e.g. virtual IP addresses. |
| - Fewer inconsistent errors (store failures) that can be handled without killing brokers. |
| - Improved performance |