| = Network Isolation (Split Brain) |
| :idprefix: |
| :idseparator: - |
| :docinfo: shared |
| |
| A _split brain_ is a condition that occurs when two different brokers are serving the same messages at the same time. |
When this happens, client applications which ought to all share the _same_ broker may instead become divided between the two split-brain brokers.
| This is problematic because it can lead to: |
| |
* *Duplicate messages* e.g. when multiple consumers on the same JMS queue are split between both brokers and receive the same message(s)
* *Missed messages* e.g. when multiple consumers on the same JMS topic are split between both brokers and producers are only sending messages to one broker
| |
Split brain most commonly happens when a pair of brokers in an HA *replication* configuration loses the replication connection linking them together.
When this connection is lost the backup assumes that the primary has died and therefore activates.
At this point there are two brokers on the network, isolated from each other, and since the backup has a copy of all the messages from the primary they are each serving the same messages.
| |
| [IMPORTANT] |
| .What about shared store configurations? |
| ==== |
While it is technically possible for split brain to happen with a pair of brokers in an HA _shared store_ configuration, it would require a failure in the file-locking mechanism of the storage device which the brokers are sharing.
| |
| One of the benefits of using a shared store is that the storage device itself acts as an arbiter to ensure consistency and mitigate split brain. |
| ==== |
| |
| Recovering from a split brain may be as simple as stopping the broker which activated by mistake. |
| However, this solution is only viable *if* no client application connected to it and performed messaging operations. |
The longer client applications are allowed to interact with split-brain brokers, the more difficult it will be to understand and remediate the resulting problems.
| |
| There are several different configurations you can choose from that will help mitigate split brain. |
| |
| == Pluggable Lock Manager |
| |
| A pluggable lock manager configuration requires a 3rd party to establish a shared lock between primary and backup brokers. |
| The shared lock ensures that either the primary or backup is active at any given point in time, similar to how the file lock functions in the shared storage use-case. |
| |
| The _plugin_ decides what 3rd party implementation is used. |
| It could be something as simple as a shared file on a network file system that supports locking (e.g. NFS) or it could be something more complex like https://etcd.io/[etcd]. |
| |
| The broker ships with a xref:ha.adoc#apache-zookeeper-integration[reference plugin implementation] based on https://zookeeper.apache.org/[Apache ZooKeeper] - a common implementation used for this kind of task. |
| |
The main benefit of a pluggable lock manager is that it releases the broker from the responsibility of establishing a reliable vote.
This means that a _single_ HA pair of brokers can be reliably protected against split brain.
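
As an illustrative sketch, a primary broker using the ZooKeeper-based reference plugin might be configured in `broker.xml` along these lines (the `connect-string` value is a placeholder for your own ZooKeeper ensemble, and the class name should be verified against the reference plugin documentation for your broker version):

[,xml]
----
<ha-policy>
   <replication>
      <primary>
         <manager>
            <!-- the ZooKeeper-based reference implementation shipped with the broker -->
            <class-name>org.apache.activemq.artemis.lockmanager.zookeeper.CuratorDistributedLockManager</class-name>
            <properties>
               <!-- placeholder addresses for your ZooKeeper ensemble -->
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
----

The backup would carry a matching `<manager>` block under its `<backup>` element so that both brokers coordinate through the same shared lock.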
| |
| == Quorum Voting |
| |
Quorum voting is a process by which one node in a cluster can determine whether another node in the cluster is active without directly communicating with that node.
The broker initiating the vote can then take action based on the result (e.g. shutting itself down to avoid split brain).
| |
| Quorum voting requires the participation of the other _active_ brokers in the cluster. |
| Of course this requires that there are, in fact, other active brokers in the cluster which means quorum voting won't work with a single HA pair of brokers. |
Furthermore, it won't work with just two HA pairs of brokers either because that's still not enough for a legitimate quorum.
| There must be at least three HA pairs to establish a proper quorum with quorum voting. |
| |
| === Voting Mechanics |
| |
| When the replication connection between an active broker and a passive broker is lost the passive and/or the active broker may initiate a vote. |
| |
| [IMPORTANT] |
| ==== |
For a vote to pass, a _majority_ of affirmative responses is required.
| For example, in a 3 node cluster a vote will pass with 2 affirmatives. |
| For a 4 node cluster this would be 3 affirmatives and so on. |
| ==== |
| |
| ==== Passive Voting |
| |
| If a passive broker loses its replication connection to the active broker it will initiate a quorum vote in order to decide whether to activate or not. |
| It will keep voting until it either receives a vote allowing it to start or it detects that the previously connected broker is still active. |
| In the latter case it will then restart as passive. |
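
As an illustrative sketch, this retry behavior can be tuned in the backup's `ha-policy` in `broker.xml` (the values shown are placeholders, not recommendations):

[,xml]
----
<ha-policy>
   <replication>
      <backup>
         <!-- how many quorum votes to attempt before giving up (placeholder value) -->
         <vote-retries>12</vote-retries>
         <!-- how long to wait, in milliseconds, between vote attempts (placeholder value) -->
         <vote-retry-wait>5000</vote-retry-wait>
      </backup>
   </replication>
</ha-policy>
----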
| |
| See the section on xref:ha.adoc#replication-configuration[Replication Configuration] for more details on configuration. |
| |
| ==== Active Voting |
| |
By default, if the active broker loses its replication connection to the passive broker then it will just carry on and wait for a passive broker to reconnect and start replicating again.
However, this may mean that it remains active even though the passive broker has activated, so this behavior is configurable via the `vote-on-replication-failure` property.
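
As a minimal sketch, this could be enabled on the primary like so (the surrounding `ha-policy` structure is assumed to match the replication configuration in the linked section):

[,xml]
----
<ha-policy>
   <replication>
      <primary>
         <!-- initiate a quorum vote if the replication connection is lost;
              a failed vote causes the broker to shut itself down -->
         <vote-on-replication-failure>true</vote-on-replication-failure>
      </primary>
   </replication>
</ha-policy>
----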
| |
| See the section on xref:ha.adoc#replication-configuration[Replication Configuration] for more details on configuration. |
| |
== Pinging the Network
| |
You may configure one or more addresses in `broker.xml` that will be pinged throughout the life of the server.
The server will stop itself if it can't reach any of the addresses in the list.
| |
If you execute the `create` command with the `--ping` argument, it will generate default XML that is ready to be used with network checks:
| |
| [,console] |
| ---- |
| $ ./artemis create /myDir/myServer --ping 10.0.0.1 |
| ---- |
| |
| This XML will be added to your `broker.xml`: |
| |
| [,xml] |
| ---- |
| <!-- |
| You can verify the network health of a particular NIC by specifying the <network-check-NIC> element. |
| <network-check-NIC>theNicName</network-check-NIC> |
| --> |
| |
| <!-- |
| Use this to use an HTTP server to validate the network |
| <network-check-URL-list>http://www.apache.org</network-check-URL-list> --> |
| |
| <network-check-period>10000</network-check-period> |
| <network-check-timeout>1000</network-check-timeout> |
| |
<!-- this is a comma-separated list, no spaces, just DNS names or IPs
     (IPv6 is also accepted)

     Warning: Make sure you understand your network topology as this is meant to check if your network is up.
     Using IPs that could eventually disappear or be partially visible may defeat the purpose.
     You can use a list of multiple IPs; any successful ping will make the server OK to continue running -->
| <network-check-list>10.0.0.1</network-check-list> |
| |
| <!-- use this to customize the ping used for ipv4 addresses --> |
| <network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command> |
| |
| <!-- use this to customize the ping used for ipv6 addresses --> |
| <network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command> |
| ---- |
In the example above, once you lose connectivity to `10.0.0.1` the broker will log something like this:

| ---- |
| 09:49:24,562 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable |
| 09:49:36,577 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0 |
| 09:49:36,625 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds |
| 09:50:00,653 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host |
| 09:50:10,656 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down |
| at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73] |
| at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73] |
| at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73] |
| at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT] |
| at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT] |
| at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT] |
| at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT] |
| at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT] |
| at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73] |
| at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73] |
| at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73] |
| at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73] |
| at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73] |
| at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73] |
| at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73] |
| ---- |
| |
Once you reestablish network connectivity to the addresses in the configured check list, the broker will restart itself and log something like this:
| |
| ---- |
| 09:53:23,461 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl:: |
| 09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221000: primary Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging) |
| 09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal |
| 09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE |
| 09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP |
| 09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ |
| 09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT |
| 09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE |
| 09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP |
| 09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ |
| 09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue |
| 09:53:23,549 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE] |
| 09:53:23,550 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP] |
| 09:53:23,554 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP] |
| 09:53:23,555 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT] |
| 09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP] |
| 09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now active |
| 09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221001: Apache ActiveMQ Artemis Message Broker version 1.6.0 [0.0.0.0, nodeID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0] |
| ---- |
| |
| [IMPORTANT] |
| ==== |
Make sure you understand your network topology since this check is meant to validate your network.
Using IPs that could eventually disappear or be partially visible may defeat the purpose.
You can use a list of multiple IPs.
Any successful ping will make the server OK to continue running.
| ==== |