<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="generator" content="Asciidoctor 2.0.18">
<link rel="icon" type="image/png" href="images/favicon.png">
<title>Network Isolation (Split Brain)</title>
<link rel="stylesheet" href="css/asciidoctor.css">
<link rel="stylesheet" href="css/font-awesome.css">
<link rel="stylesheet" href="css/rouge-github.css">
</head>
<body class="book toc2 toc-left">
<div id="header">
<h1>Network Isolation (Split Brain)</h1>
<div id="toc" class="toc2">
<div id="toctitle"><a href="index.html">User Manual for 2.32.0</a></div>
<ul class="sectlevel1">
<li><a href="#quorum-voting">1. Quorum Voting</a>
<ul class="sectlevel2">
<li><a href="#backup-voting">1.1. Backup Voting</a></li>
<li><a href="#primary-voting">1.2. Primary Voting</a></li>
</ul>
</li>
<li><a href="#pinging-the-network">2. Pinging the network</a></li>
</ul>
</div>
</div>
<div id="content">
<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>If a primary or backup broker configured for replication becomes isolated on the network, failover can occur and you can end up with two active brokers serving messages. This is called <em>split brain</em>.
There are different configurations you can choose from to help mitigate this problem.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="quorum-voting"><a class="anchor" href="#quorum-voting"></a><a class="link" href="#quorum-voting">1. Quorum Voting</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Quorum voting is used by both the primary and the backup to decide what to do if the replication connection is lost.
The server asks each active server in the cluster to vote on whether it thinks the server it is replicating to or from is still alive.
You can also configure how long the quorum manager will wait for the quorum vote responses.
The default is 30 seconds, and it can be configured like this for the primary, and likewise for the backup:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="rouge highlight nowrap"><code data-lang="xml"><span class="nt">&lt;ha-policy&gt;</span>
<span class="nt">&lt;replication&gt;</span>
<span class="nt">&lt;primary&gt;</span>
<span class="nt">&lt;quorum-vote-wait&gt;</span>12<span class="nt">&lt;/quorum-vote-wait&gt;</span>
<span class="nt">&lt;/primary&gt;</span>
<span class="nt">&lt;/replication&gt;</span>
<span class="nt">&lt;/ha-policy&gt;</span></code></pre>
</div>
</div>
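<div class="paragraph">
<p>The same element can be set on the backup as well. The following is a minimal sketch, assuming you want the same 12 second wait on the backup side:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="nowrap">&lt;ha-policy&gt;
   &lt;replication&gt;
      &lt;backup&gt;
         &lt;quorum-vote-wait&gt;12&lt;/quorum-vote-wait&gt;
      &lt;/backup&gt;
   &lt;/replication&gt;
&lt;/ha-policy&gt;</pre>
</div>
</div>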
<div class="paragraph">
<p>For quorum voting to work, the minimum number of primary/backup pairs needed is 3.
If fewer than 3 pairs are used then the only options are to use a network pinger, which is explained later in this chapter, or to choose how you want each server to react, as the following sections describe:</p>
</div>
<div class="sect2">
<h3 id="backup-voting"><a class="anchor" href="#backup-voting"></a><a class="link" href="#backup-voting">1.1. Backup Voting</a></h3>
<div class="paragraph">
<p>By default, if a backup loses its replication connection to its primary it uses a quorum vote to decide whether or not to start.
This requires at least 3 primary/backup pairs in the cluster.
For a 3 node cluster the backup will start if it gets 2 votes back saying that its primary is no longer available; for 4 nodes this would be 3 votes, and so on.
When a backup loses its connection to the primary it will keep initiating quorum votes until it either receives a vote allowing it to start or it detects that the primary is still active.
In the latter case it will then restart as a backup.
How many votes to request, and how long to wait between each vote, is configured like so:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="rouge highlight nowrap"><code data-lang="xml"><span class="nt">&lt;ha-policy&gt;</span>
<span class="nt">&lt;replication&gt;</span>
<span class="nt">&lt;backup&gt;</span>
<span class="nt">&lt;vote-retries&gt;</span>12<span class="nt">&lt;/vote-retries&gt;</span>
<span class="nt">&lt;vote-retry-wait&gt;</span>5000<span class="nt">&lt;/vote-retry-wait&gt;</span>
<span class="nt">&lt;/backup&gt;</span>
<span class="nt">&lt;/replication&gt;</span>
<span class="nt">&lt;/ha-policy&gt;</span></code></pre>
</div>
</div>
<div class="paragraph">
<p>It&#8217;s also possible to statically set the quorum size when the cluster size is known up front. This is done on the Replica Policy like so:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="rouge highlight nowrap"><code data-lang="xml"><span class="nt">&lt;ha-policy&gt;</span>
<span class="nt">&lt;replication&gt;</span>
<span class="nt">&lt;backup&gt;</span>
<span class="nt">&lt;quorum-size&gt;</span>2<span class="nt">&lt;/quorum-size&gt;</span>
<span class="nt">&lt;/backup&gt;</span>
<span class="nt">&lt;/replication&gt;</span>
<span class="nt">&lt;/ha-policy&gt;</span></code></pre>
</div>
</div>
<div class="paragraph">
<p>In this example the quorum size is set to 2, so if you were using a single pair and the backup lost connectivity it would never start.</p>
</div>
</div>
<div class="sect2">
<h3 id="primary-voting"><a class="anchor" href="#primary-voting"></a><a class="link" href="#primary-voting">1.2. Primary Voting</a></h3>
<div class="paragraph">
<p>By default, if the primary server loses its replication connection it will just carry on and wait for a backup to reconnect and start replicating again.
In a split brain scenario this may mean that the primary stays active even though the backup has been activated.
It is possible to configure the primary server to initiate a quorum vote when this happens; if the primary does not receive a majority of the votes it will shut down.
This is done by setting <em>vote-on-replication-failure</em> to <code>true</code>:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="rouge highlight nowrap"><code data-lang="xml"><span class="nt">&lt;ha-policy&gt;</span>
<span class="nt">&lt;replication&gt;</span>
<span class="nt">&lt;primary&gt;</span>
<span class="nt">&lt;vote-on-replication-failure&gt;</span>true<span class="nt">&lt;/vote-on-replication-failure&gt;</span>
<span class="nt">&lt;quorum-size&gt;</span>2<span class="nt">&lt;/quorum-size&gt;</span>
<span class="nt">&lt;/primary&gt;</span>
<span class="nt">&lt;/replication&gt;</span>
<span class="nt">&lt;/ha-policy&gt;</span></code></pre>
</div>
</div>
<div class="paragraph">
<p>As with the backup policy, it is also possible to statically configure the quorum size, as shown in the example above.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="pinging-the-network"><a class="anchor" href="#pinging-the-network"></a><a class="link" href="#pinging-the-network">2. Pinging the network</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>You may configure one or more addresses in broker.xml that are part of your network topology and that will be pinged throughout the life cycle of the server.</p>
</div>
<div class="paragraph">
<p>If none of the configured addresses can be reached, the server will stop itself until the network is back.</p>
</div>
<div class="paragraph">
<p>If you execute the <code>create</code> command passing a <code>--ping</code> argument, you will create a default broker.xml that is ready to be used with network checks:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="nowrap">./artemis create /myDir/myServer --ping 10.0.0.1</pre>
</div>
</div>
<div class="paragraph">
<p>This XML part will be added to your broker.xml:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="rouge highlight nowrap"><code data-lang="xml"><span class="c">&lt;!--
You can verify the network health of a particular NIC by specifying the &lt;network-check-NIC&gt; element.
&lt;network-check-NIC&gt;theNicName&lt;/network-check-NIC&gt;
--&gt;</span>
<span class="c">&lt;!--
Use this to use an HTTP server to validate the network
&lt;network-check-URL-list&gt;http://www.apache.org&lt;/network-check-URL-list&gt; --&gt;</span>
<span class="nt">&lt;network-check-period&gt;</span>10000<span class="nt">&lt;/network-check-period&gt;</span>
<span class="nt">&lt;network-check-timeout&gt;</span>1000<span class="nt">&lt;/network-check-timeout&gt;</span>
<span class="c">&lt;!-- this is a comma separated list, no spaces, just DNS or IPs
it should accept IPV6
Warning: Make sure you understand your network topology as this is meant to check if your network is up.
Using IPs that could eventually disappear or be partially visible may defeat the purpose.
You can use a list of multiple IPs, any successful ping will make the server OK to continue running --&gt;</span>
<span class="nt">&lt;network-check-list&gt;</span>10.0.0.1<span class="nt">&lt;/network-check-list&gt;</span>
<span class="c">&lt;!-- use this to customize the ping used for ipv4 addresses --&gt;</span>
<span class="nt">&lt;network-check-ping-command&gt;</span>ping -c 1 -t %d %s<span class="nt">&lt;/network-check-ping-command&gt;</span>
<span class="c">&lt;!-- use this to customize the ping used for ipv addresses --&gt;</span>
<span class="nt">&lt;network-check-ping6-command&gt;</span>ping6 -c 1 %2$s<span class="nt">&lt;/network-check-ping6-command&gt;</span></code></pre>
</div>
</div>
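<div class="paragraph">
<p>If you prefer to validate the network against an HTTP server instead of pinging an address, the commented-out <code>network-check-URL-list</code> element shown above can be enabled. A minimal sketch, assuming <code>http://www.apache.org</code> is reachable from the broker host:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="nowrap">&lt;network-check-period&gt;10000&lt;/network-check-period&gt;
&lt;network-check-timeout&gt;1000&lt;/network-check-timeout&gt;
&lt;!-- the server keeps running as long as this URL can be reached --&gt;
&lt;network-check-URL-list&gt;http://www.apache.org&lt;/network-check-URL-list&gt;</pre>
</div>
</div>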
<div class="paragraph">
<p>Once you lose connectivity towards 10.0.0.1 in the given example, you will see this output at the server:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="nowrap">09:49:24,562 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable
09:49:36,577 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0
09:49:36,625 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds
09:50:00,653 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host
09:50:10,656 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down
at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73]
at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73]
at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73]</pre>
</div>
</div>
<div class="paragraph">
<p>Once you re-establish your network connection towards the configured check list, the service is started again:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="nowrap">09:53:23,461 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl::
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging)
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue
09:53:23,549 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
09:53:23,550 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
09:53:23,554 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP]
09:53:23,555 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221001: Apache ActiveMQ Artemis Message Broker version 1.6.0 [0.0.0.0, nodeID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0]</pre>
</div>
</div>
<div class="admonitionblock warning">
<table>
<tr>
<td class="icon">
<i class="fa icon-warning" title="Warning"></i>
</td>
<td class="content">
<div class="paragraph">
<p>Make sure you understand your network topology, as this check is meant to validate your network.
Using IPs that could eventually disappear or be only partially visible may defeat the purpose.
You can use a list of multiple IPs, as in the sketch below; any successful ping will allow the server to continue running.</p>
</div>
</td>
</tr>
</table>
</div>
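<div class="paragraph">
<p>A minimal sketch of a check list with more than one address; <code>10.0.0.254</code> is a placeholder here, so substitute hosts from your own topology:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="nowrap">&lt;!-- comma separated, no spaces; any single successful ping keeps the server running --&gt;
&lt;network-check-list&gt;10.0.0.1,10.0.0.254&lt;/network-check-list&gt;</pre>
</div>
</div>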
</div>
</div>
</div>
</body>
</html>