-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

* An active-passive, hot-standby design for Qpid clustering.

This document describes an active-passive approach to HA based on
queue browsing to replicate message data.

See [[./old-cluster-issues.txt]] for issues with the old design.

** Active-active vs. active-passive (hot-standby)

An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail over to any other
live broker.

A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups"). Clients
are served only by the primary; clients that connect to a backup are
redirected to the primary. The backups are kept up to date in real
time by the primary; if the primary fails, a backup is elected to be
the new primary.

The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are two approaches:

Predictive: each broker predicts which messages others will take. This
was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is
complex to implement and it is not straightforward to determine the
best strategy for passing the lock. In tests to date it results in
very high latencies (around 10x a standalone broker).

Hot-standby removes this problem. Only the primary can modify queues,
so it just has to tell the backups what it is doing; there is no
locking.

The primary can enqueue messages and replicate asynchronously -
exactly like the store does, but it "writes" to the replicas over the
network rather than writing to disk.

** Replicating browsers

The unit of replication is a replicating browser. This is an AMQP
consumer that browses a remote queue via a federation link and
maintains a local replica of the queue. As well as browsing the remote
messages as they are added, the browser receives dequeue notifications
when they are dequeued remotely.

On the primary broker, incoming message transfers are completed only when
all of the replicating browsers have signaled completion. Thus a completed
message is guaranteed to be on the backups.
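The completion rule above can be sketched as follows. This is an
illustrative simulation, not broker code; the names (Primary,
on_replica_ack, etc.) are made up for the example:

```python
# Sketch: a primary completes a client's message transfer only after
# every replicating browser (backup) has acknowledged the message.
class Primary:
    def __init__(self, replica_ids):
        self.replica_ids = set(replica_ids)
        self.pending = {}      # message id -> set of replicas still to ack
        self.completed = []    # messages confirmed back to the client

    def on_publish(self, msg_id):
        # Track which backups still need to confirm this message.
        self.pending[msg_id] = set(self.replica_ids)

    def on_replica_ack(self, replica_id, msg_id):
        waiting = self.pending.get(msg_id)
        if waiting is None:
            return
        waiting.discard(replica_id)
        if not waiting:
            # All backups have the message: safe to complete to the client.
            del self.pending[msg_id]
            self.completed.append(msg_id)
```

With two backups, a message published to the primary is only completed
once both have acknowledged it, which is exactly the N-way guarantee.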

** Failover and Cluster Resource Managers

We want to delegate the failover management to an existing cluster
resource manager. Initially this is rgmanager from Cluster Suite, but
other managers (e.g. PaceMaker) could be supported in future.

Rgmanager takes care of starting and stopping brokers and informing
brokers of their roles as primary or backup. It ensures there is
exactly one primary broker running at any time. It also tracks quorum
and protects against split-brain.

Rgmanager can also manage a virtual IP address, so clients can simply
retry on a single address to fail over. Alternatively, we will also
support configuring a fixed list of broker addresses when qpid is run
outside of a resource manager.

Aside: Cold-standby is also possible using rgmanager with shared
storage for the message store (e.g. GFS). If the broker fails, another
broker is started on a different node and recovers from the
store. This bears investigation, but the store recovery times are
likely too long for failover.

** Replicating configuration

New queues and exchanges and their bindings also need to be replicated.
This is done by a QMF client that registers for configuration changes
on the remote broker and mirrors them in the local broker.

** Use of CPG (openais/corosync)

CPG is not required in this model; an external cluster resource
manager takes care of membership and quorum.

** Selective replication

In this model it is easy to support selective replication of individual
queues via configuration.

Explicit exchange/queue qpid.replicate argument:
- none: the object is not replicated.
- configuration: queues, exchanges and bindings are replicated but messages are not.
- messages: configuration and messages are replicated.

TODO: provide configurable default for qpid.replicate

[GRS: current prototype relies on queue sequence for message identity
so selectively replicating certain messages on a given queue would be
challenging. Selectively replicating certain queues however is trivial.]

** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster until an error is
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.

We have two configurable options for handling inconsistent errors.
On a backup that fails to store a message from the primary we can:
- Abort the backup broker, allowing it to be re-started.
- Raise a critical error on the backup broker but carry on with the message lost.
The choice to abort or carry on can be configured per queue; we
will also provide a broker-wide configurable default.
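The two policies might look like this in outline. This is a sketch of
the proposed behaviour only; ABORT, CARRY_ON and handle_store_error are
illustrative names, not actual configuration keys or broker code:

```python
# Sketch: per-queue policy for an inconsistent store error on a backup,
# falling back to a broker-wide default when the queue has no setting.
ABORT, CARRY_ON = "abort", "carry-on"

def handle_store_error(queue_policies, default_policy, queue, msg_id, log):
    policy = queue_policies.get(queue, default_policy)
    if policy == ABORT:
        log.append("critical: store error on %s, aborting broker" % queue)
        return "shutdown"   # broker exits and is re-started by the manager
    log.append("critical: store error on %s, message %s lost" % (queue, msg_id))
    return "continue"       # broker keeps running with the message lost
```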

** New backups connecting to primary.

When the primary fails, one of the backups becomes primary and the
others connect to the new primary as backups.

The backups can take advantage of the messages they already have
backed up; the new primary only needs to replicate new messages.

To keep the N-way guarantee the primary needs to delay completion on
new messages until all the back-ups have caught up. However, if a
backup does not catch up within some timeout, it is considered dead
and its messages are completed so the cluster can carry on with N-1
members.
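The timeout rule can be sketched as below. A simulation with
illustrative names and a simple per-backup timer; the real broker
would track this against its own clock:

```python
# Sketch: completion waits for all backups, but a backup that lags
# beyond the timeout is dropped so the cluster continues with N-1.
def expire_lagging_backups(pending_acks, started_at, now, timeout):
    """pending_acks: backup id -> count of messages not yet acknowledged.
    started_at: backup id -> time at which it began catching up."""
    dead = [b for b, n in pending_acks.items()
            if n > 0 and now - started_at[b] > timeout]
    for b in dead:
        del pending_acks[b]       # stop waiting for this backup
    # Completion can proceed once no remaining backup has outstanding acks.
    all_caught_up = all(n == 0 for n in pending_acks.values())
    return dead, all_caught_up
```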


** Broker discovery and lifecycle.

The cluster has a client URL that can contain a single virtual IP
address or a list of real IP addresses for the cluster.

In backup mode, brokers reject normal client connections,
so clients will fail over to the primary. HA admin tools mark their
connections so they are allowed to connect to backup brokers.

Clients discover the primary by re-trying connection to the client URL
until they successfully connect to the primary. In the case of a
virtual IP they re-try the same address until it is relocated to the
primary. In the case of a list of IPs the client tries each in
turn. Clients do multiple retries over a configured period of time
before giving up.
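The retry loop is simple. A sketch, with try_connect standing in for a
real connection attempt (backups reject or redirect, the primary
accepts); the function and parameter names are illustrative:

```python
# Sketch: try each address in turn, repeating for a configured number
# of rounds, until one connection attempt succeeds.
def discover_primary(addresses, try_connect, max_rounds=3):
    for _ in range(max_rounds):
        for addr in addresses:
            if try_connect(addr):
                return addr     # connected to the primary
    return None                 # gave up after the configured retries
```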

Backup brokers discover the primary in the same way as clients. There
is a separate broker URL for brokers, since they will often connect
over a different network. The broker URL has to be a list of real
addresses rather than a virtual address.

Brokers have the following states:
- connecting: Backup broker trying to connect to primary - loops retrying broker URL.
- catchup: Backup connected to primary, catching up on pre-existing configuration & messages.
- ready: Backup fully caught-up, ready to take over as primary.
- primary: Acting as primary, serving clients.
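These states form a small state machine, sketched below as a transition
table. The event names (connected, caught_up, promote, disconnected)
are illustrative, not broker identifiers:

```python
# Sketch of the broker lifecycle. Unknown (state, event) pairs leave the
# state unchanged - notably there is no ("catchup", "promote") entry,
# matching the rule that only a ready backup may become primary.
TRANSITIONS = {
    ("connecting", "connected"): "catchup",     # found the primary
    ("catchup", "caught_up"): "ready",          # replica is up to date
    ("ready", "promote"): "primary",            # resource manager promotes us
    ("catchup", "disconnected"): "connecting",  # lost primary while catching up
    ("ready", "disconnected"): "connecting",    # lost primary; retry broker URL
}

def next_state(state, event):
    return TRANSITIONS.get((state, event), state)
```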

** Interaction with rgmanager

rgmanager interacts with qpid via two service scripts: backup &
primary. These scripts interact with the underlying qpidd
service. rgmanager picks the new primary when the old primary
fails. In a partition it also takes care of killing inquorate brokers.

*** Initial cluster start

rgmanager starts the backup service on all nodes and the primary service on one node.

On the backup nodes qpidd is in the connecting state. The primary node goes into
the primary state. Backups discover the primary, connect and catch up.

*** Failover

The primary broker or node fails. Backup brokers see the disconnect and go
back to the connecting state.

rgmanager notices the failure and starts the primary service on a new node.
This tells qpidd to go to the primary state. Backups re-connect and catch up.

The primary can only be started on nodes where there is a ready backup service.
If the backup is catching up, it is not eligible to take over as primary.

*** Failback

A cluster of N brokers has suffered a failure; only N-1 brokers
remain. We want to start a new broker (possibly on a new node) to
restore redundancy.

If the new broker has a new IP address, the sysadmin pushes a new URL
to all the existing brokers.

The new broker starts in the connecting state. It discovers the primary,
connects and catches up.

*** Failure during failback

A second failure occurs before the new backup B completes its
catch-up. Backup B refuses to become primary, failing the primary
start script if rgmanager chooses it, so rgmanager will try another
(hopefully caught-up) backup to become primary.

*** Backup failure

If a backup fails, it is re-started. It connects and catches up from scratch
to become a ready backup.

** Interaction with the store.

Clean shutdown: the entire cluster is shut down cleanly by an admin tool:
- primary stops accepting client connections till shutdown is complete.
- backups come fully up to speed with primary state.
- all shut down, marking stores as 'clean' with an identifying UUID.

After clean shutdown the cluster can re-start automatically, as all nodes
have equivalent stores. Stores starting up with the wrong UUID will fail.

Stored status: clean(UUID)/dirty, primary/backup, generation number.
- All stores are marked dirty except after a clean shutdown.
- Generation number: passed to backups and incremented by the new primary.

After a total crash the "best" store must be identified manually; an admin
tool should be provided.
Best = highest generation number among stores in primary state.
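The "best store" rule could be expressed as below. A sketch only; the
(host, role, generation) tuples are an illustrative encoding of the
stored status described above:

```python
# Sketch: among stores that were in the primary role, pick the one with
# the highest generation number.
def best_store(stores):
    """stores: list of (host, role, generation) from each node's status."""
    primaries = [s for s in stores if s[1] == "primary"]
    if not primaries:
        return None
    return max(primaries, key=lambda s: s[2])
```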

Recovering from a total crash: all brokers will refuse to start, as all
stores are dirty.
Check the stores manually to find the best one, then either:
1. Copy stores:
   - copy the good store to all hosts
   - restart qpidd on all hosts.
2. Erase stores:
   - Erase the store on all other hosts.
   - Restart qpidd on the good store and wait for it to become primary.
   - Restart qpidd on all other hosts.

Broker startup with store:
- Dirty: refuse to start.
- Clean:
  - Start and load from store.
  - When connecting as backup, check that the UUID matches the primary's; shut down if not.
- Empty: start ok, no UUID check with primary.
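The startup decision table can be sketched directly. Illustrative names
only; the real broker would read this status from the store files:

```python
# Sketch of the broker's startup decision based on its store status.
def startup_action(status, store_uuid=None, primary_uuid=None):
    if status == "dirty":
        return "refuse"            # needs manual recovery first
    if status == "empty":
        return "start"             # nothing to check against the primary
    if status == "clean":
        if primary_uuid is not None and store_uuid != primary_uuid:
            return "shutdown"      # UUID mismatch when joining as backup
        return "start"             # load from store and run
    raise ValueError("unknown store status: %r" % status)
```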

** Current Limitations

(In no particular order at present)

For message replication:

LM1 - The re-synchronisation does not handle the case where a newly elected
primary is *behind* one of the other backups. To address this I propose
a new event for resetting the sequence, which the new primary would send
out on detecting that a replicating browser is ahead of it, requesting
that the replica revert to a particular sequence number. The
replica, on receiving this event, would then discard (i.e. dequeue) all
the messages ahead of that sequence number and reset the counter to
correctly sequence any subsequently delivered messages.

LM2 - There is a need to handle wrap-around of the message sequence to avoid
confusing the resynchronisation where a replica has been disconnected
for a long time, sufficient for the sequence numbering to wrap around.
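One standard way to address LM2 is serial-number arithmetic in the
style of RFC 1982, which compares counters modulo their width. This is
a sketch of the technique, not the current implementation; 32-bit
sequences are an assumption:

```python
# Sketch: wrap-around-tolerant comparison of unsigned 32-bit sequence
# numbers, following RFC 1982 serial-number arithmetic.
BITS = 32
HALF = 1 << (BITS - 1)
MASK = (1 << BITS) - 1

def seq_lt(a, b):
    """True if sequence a precedes b, tolerating wrap-around."""
    diff = (b - a) & MASK
    return diff != 0 and diff < HALF
```

Under this rule the largest sequence value correctly precedes 0, so a
replica reconnecting across a wrap still orders messages consistently.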

LM3 - Transactional changes to queue state are not replicated atomically.

LM4 - Acknowledgements are confirmed to clients before the message has been
dequeued from replicas, or indeed from the local store if that is
asynchronous.

LM5 - During failover, messages (re)published to a queue before the
requisite number of replication subscriptions has been established will be
confirmed to the publisher before they are replicated, leaving them
vulnerable to a loss of the new primary before they are replicated.

LM6 - persistence: In the event of a total cluster failure there are
no tools to automatically identify the "latest" store. Once this
is manually identified, all other stores belonging to cluster members
must be erased and the latest node must be started as primary.

LM7 - persistence: In the event of a single node failure, that node's
store must be erased before it can re-join the cluster.

For configuration propagation:

LC2 - Queue and exchange propagation is entirely asynchronous. There
are three cases to consider here for queue creation:

(a) where queues are created through the addressing syntax supported by
the messaging API, they should be recreated if needed on failover, and
message replication, if required, is dealt with separately;

(b) where queues are created using configuration tools by an
administrator or by a script, they can query the backups to verify the
config has propagated, and commands can be re-run if there is a failure
before that;

(c) where applications use more complex programs in which
queues/exchanges are created using QMF or directly via the 0-10 APIs, the
completion of the command will not guarantee that the command has been
carried out on other nodes.

I.e. case (a) doesn't require anything (apart from LM5 in some cases),
case (b) can be addressed in a simple manner through tooling, but case
(c) would require changes to the broker to allow clients to simply
determine when the command has fully propagated.

LC3 - Queues that are not in the query response received when a
replica establishes a propagation subscription, but exist locally, are
not deleted. I.e. deletion of queues/exchanges while a replica is not
connected will not be propagated. The solution is to delete any queues
marked for propagation that exist locally but do not show up in the
query response.
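The proposed LC3 fix is a simple set reconciliation. A sketch with
illustrative names, not broker code:

```python
# Sketch: on (re)connecting, a replica compares its locally replicated
# queues against the primary's query response and reconciles the two.
def reconcile_queues(local_replicated, query_response):
    """Return (to_delete, to_create) sets of queue names."""
    local = set(local_replicated)
    remote = set(query_response)
    to_delete = local - remote   # deleted on primary while we were away
    to_create = remote - local   # created on primary while we were away
    return to_delete, to_create
```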

LC4 - It is possible on failover that the new primary did not
previously receive a given QMF event while a backup did (sort of an
analogous situation to LM1, but without an easy way to detect or remedy
it).

LC5 - Need richer control over which queues/exchanges are propagated, and
which are not.

LC6 - The events and query responses are not fully synchronized.

In particular it *is* possible to not receive a delete event but
for the deleted object to still show up in the query response
(meaning the deletion is 'lost' to the update).

It is also possible for a create event to be received as well
as the created object being in the query response. Likewise it
is possible to receive a delete event and a query response in
which the object no longer appears. In these cases the event is
essentially redundant.

However, it is not possible to miss a create event and also have
the object missing from the query response.


* Benefits compared to previous cluster implementation.

- Does not depend on openais/corosync; does not require multicast.
- Can be integrated with different resource managers: for example rgmanager, PaceMaker, Veritas.
- Can be ported to/implemented in other environments: e.g. Java, Windows.
- Disaster recovery is just another backup; no need for a separate queue replication mechanism.
- Can take advantage of resource manager features, e.g. virtual IP addresses.
- Fewer inconsistent errors; the remaining ones (store failures) can be handled without killing brokers.
- Improved performance.

* User Documentation Notes

Notes to seed initial user documentation. Loosely tracking the implementation,
some points mentioned in the doc may not be implemented yet.

** High Availability Overview

HA is implemented using a 'hot standby' approach. Clients are directed
to a single "primary" broker. The primary executes client requests and
also replicates them to one or more "backup" brokers. If the primary
fails, one of the backups takes over the role of primary, carrying on
from where the old primary left off. Clients will fail over to the new
primary automatically and continue their work.

TODO: at least once, deduplication.

** Enabling replication on the client.

To enable replication, set the qpid.replicate argument when creating a
queue or exchange.

This can have one of three values:
- none: the object is not replicated.
- configuration: queues, exchanges and bindings are replicated but messages are not.
- messages: configuration and messages are replicated.
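For example, with the Python messaging API the argument can be carried
in the addressing syntax. The helper below is a sketch that only builds
such an address string (replicated_queue_address is an illustrative
name, not part of any Qpid API):

```python
# Sketch: build a qpid.messaging address that creates a queue with the
# qpid.replicate argument set to the requested replication level.
def replicated_queue_address(name, level="messages"):
    assert level in ("none", "configuration", "messages")
    return ("%s; {create: always, node: {x-declare: "
            "{arguments: {'qpid.replicate': '%s'}}}}" % (name, level))

addr = replicated_queue_address("important-events")
# A client would then create a sender with, e.g., session.sender(addr).
```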

TODO: examples
TODO: more options for default value of qpid.replicate

An HA client connection has multiple addresses, one for each broker. If
it fails to connect to an address, or the connection breaks,
it will automatically fail over to another address.

Only the primary broker accepts connections; the backup brokers
redirect connection attempts to the primary. If the primary fails, one
of the backups is promoted to primary and clients fail over to the new
primary.

TODO: using multiple-address connections, examples c++, python, java.

TODO: dynamic cluster addressing?

TODO: need de-duplication.

** Enabling replication on the broker.

Network topology: backup links, separate client/broker networks.
Describe failover mechanisms.
- Client view: URLs, failover, exclusion & discovery.
- Broker view: similar.
Role of rgmanager.

** Configuring rgmanager

** Configuring qpidd
