Configuring for HA

The Myriad high availability (HA) feature is designed to prevent job failure and downtime when a failure occurs. After a failure, Myriad also recovers on its own and returns to a highly available state.

A Myriad HA environment allows the Node Managers to reconnect to the new Resource Manager instance upon failover.

On failover, the following occurs:

  • Marathon re-launches the Resource Manager as a new task.
  • Mesos-DNS updates the IP address for the Resource Manager Mesos task to the new IP address.

Note: All clients that are connected to the Resource Manager continue to work as long as they use the FQDN (for example, rm.marathon.mesos) to connect to it.
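
As a rough illustration of why the FQDN matters, the following Python sketch looks up the Resource Manager name every time it is called instead of caching an IP address, so a client would pick up the new address once Mesos-DNS has updated it. The function name resolve_rm_addresses, the port 8088, and the injectable resolver argument are assumptions made for this example; they are not part of Myriad.

```python
import socket


def resolve_rm_addresses(fqdn="rm.marathon.mesos", port=8088,
                         resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses the FQDN resolves to now.

    The port (8088) is only an assumed example value. The resolver argument
    exists so the lookup can be replaced with a stub when no DNS is available.
    """
    infos = resolver(fqdn, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})
```

A client that calls resolve_rm_addresses() before each reconnect attempt, rather than storing the first answer, does not depend on the address of the failed Resource Manager task.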

Prerequisites

  • Deploy mesos-master, mesos-slave (per node), zookeeper, marathon, and mesos-dns on your cluster.

Setting Up Mesos-DNS

Step 1: Create a directory for Mesos-DNS. For example, /etc/mesos-dns.

Step 2: Install Mesos-DNS on one node in your cluster.

Step 3: Configure Mesos-DNS by providing the required parameters in the /etc/mesos-dns/config.json file. See the Mesos-DNS configuration documentation for more information. The following example parameters represent a minimum configuration.

{
    "zk": "zk://10.10.100.19:2181/mesos",
    "refreshSeconds": 60,
    "ttl": 60,
    "domain": "mesos",
    "port": 53,
    "resolvers": ["10.10.1.10"],
    "timeout": 5
}
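
Because a malformed config.json is easy to overlook, a small check along the following lines may help before starting Mesos-DNS. This is only a sketch: the function name check_mesos_dns_config is hypothetical, and the list of keys is taken from the example above rather than from any authoritative list of required Mesos-DNS parameters.

```python
import json

# Keys that appear in the example configuration above (an assumption,
# not an official list of required Mesos-DNS parameters).
EXAMPLE_KEYS = ("zk", "refreshSeconds", "ttl", "domain",
                "port", "resolvers", "timeout")


def check_mesos_dns_config(text):
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        # Strict JSON rejects, for example, a trailing comma after the last entry.
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    problems = [f"missing key: {key}" for key in EXAMPLE_KEYS if key not in config]
    if "zk" in config and not str(config["zk"]).startswith("zk://"):
        problems.append("zk should be a zk:// URL")
    return problems
```

Reading /etc/mesos-dns/config.json and passing its contents to check_mesos_dns_config() returns an empty list when the file parses and contains the keys from the example.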

Step 4: On Linux, add the Mesos-DNS name server at the top of the /etc/resolv.conf file on all cluster nodes and on all clients (for example, clients running the RM UI, the Myriad UI, and so on).

 nameserver <Mesos-DNS IP address>

Note: The entry must appear first in the /etc/resolv.conf file. If it is not at the top, Mesos-DNS may not work correctly.
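
One way to confirm the ordering on a node is to read back the first nameserver entry. The sketch below assumes the usual resolv.conf layout (one "nameserver <address>" entry per line, with comment lines starting with "#" or ";"); the helper name first_nameserver and the address 10.10.100.20 are hypothetical.

```python
def first_nameserver(resolv_conf_text):
    """Return the address of the first nameserver entry, or None if there is none."""
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith(("#", ";")):
            continue
        fields = stripped.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            return fields[1]
    return None


def mesos_dns_is_first(resolv_conf_text, mesos_dns_ip):
    """True when the Mesos-DNS address is the first nameserver listed."""
    return first_nameserver(resolv_conf_text) == mesos_dns_ip
```

For example, passing the contents of /etc/resolv.conf together with the Mesos-DNS address (10.10.100.20 is used here only as a placeholder) to mesos_dns_is_first() shows whether the entry was added in the right position.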

Configuring HA

Configuring Myriad for HA involves adding HA configuration properties to the $YARN_HOME/etc/hadoop/yarn-site.xml file and the $YARN_HOME/etc/hadoop/myriad-config-default.yml file.

Modify yarn-site.xml

Add the following properties to the $YARN_HOME/etc/hadoop/yarn-site.xml file:

Modify myriad-config-default.yml

In the $YARN_HOME/etc/hadoop/myriad-config-default.yml file, set the following values:

frameworkFailoverTimeout: <non-zero value>
haEnabled: true

Note: The Myriad frameworkFailoverTimeout parameter is specified in milliseconds. This parameter indicates to Mesos that Myriad will fail over within this time interval.
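
Since the value is given in milliseconds, it can be easy to be off by a factor of 1000. The following sketch converts a readable interval into the number to enter; the helper names failover_timeout_ms and ha_settings are hypothetical, and the intervals used are illustrations only, not recommended settings.

```python
from datetime import timedelta


def failover_timeout_ms(interval):
    """Convert a timedelta to whole milliseconds for frameworkFailoverTimeout."""
    millis = int(interval.total_seconds() * 1000)
    # The configuration calls for a non-zero value; this sketch also
    # rejects negative intervals (an assumption of the example).
    if millis <= 0:
        raise ValueError("frameworkFailoverTimeout must be a non-zero value")
    return millis


def ha_settings(interval):
    """Render the two HA lines for myriad-config-default.yml as text."""
    return (f"frameworkFailoverTimeout: {failover_timeout_ms(interval)}\n"
            "haEnabled: true\n")
```

For instance, failover_timeout_ms(timedelta(days=1)) yields 86400000, and ha_settings(timedelta(minutes=30)) produces a frameworkFailoverTimeout of 1800000 followed by haEnabled: true.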