---+ Fault Tolerance and High Availability Options
---++ Introduction
Apache Atlas uses and interacts with a variety of systems to provide metadata management and data lineage to data
administrators. By choosing and configuring these dependencies appropriately, it is possible to achieve a good degree
of service availability with Atlas. This document describes the state of high availability support in Atlas,
including its capabilities and current limitations, and also the configuration required to achieve this level of
high availability.
[[Architecture][The architecture page]] in the wiki gives an overview of the various components that make up Atlas.
The options mentioned below for the various components build on the context given in that page, so it is worth
reviewing before proceeding to read this page.
---++ Atlas Web Service
Currently, the Atlas Web service has a limitation that it can only have one active instance at a time. Therefore, if
the host running the service fails, a new Atlas Web service instance should be brought up and clients should be
pointed to it. In future versions of the system, we plan to provide full High Availability of the service, thereby
enabling hot failover. To minimize service loss, we recommend the following:
* An extra physical host with the Atlas system software and configuration is available to be brought up on demand.
* It would be convenient to have the web service fronted by a proxy solution like [[https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#5.2][HAProxy]], which can provide both monitoring and transparent switching of the backend instance that clients talk to (a sketch of the corresponding client-side setting follows this list).
* The following example HAProxy configuration allows transparent failover to a backup server:
<verbatim>
listen atlas
bind <proxy hostname>:<proxy port>
balance roundrobin
server inst1 <atlas server hostname>:<port> check
server inst2 <atlas backup server hostname>:<port> check backup
</verbatim>
* The stores that hold Atlas data can be configured to be highly available as described below.
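To keep failover transparent to hooks and other clients, they can be pointed at the proxy endpoint rather than at a
specific Atlas server. The following is a minimal sketch only, assuming the HAProxy setup above and that the client
reads the standard Atlas REST endpoint property:
<verbatim>
# Atlas REST endpoint used by hooks and clients; points at the proxy, not an individual server (illustrative)
atlas.rest.address=http://<proxy hostname>:<proxy port>
</verbatim>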
---++ Metadata Store
As described above, Atlas uses Titan to store the metadata it manages. By default, Titan uses BerkeleyDB as an embedded
backing store. However, this option would result in loss of data if the node running the Atlas server fails. In order
to provide HA for the metadata store, we recommend that Atlas be configured to use HBase as the backing store for Titan.
Doing so allows Atlas to benefit from the HA guarantees HBase provides. In order to configure Atlas to use
HBase in HA mode, do the following:
* Choose an existing HBase cluster that is set up in HA mode to configure in Atlas (OR) Set up a new HBase cluster in [[http://hbase.apache.org/book.html#quickstart_fully_distributed][HA mode]].
* If setting up HBase for Atlas, please follow the instructions for setting up HBase listed in the [[InstallationSteps][Installation Steps]].
* We recommend running more than one HBase master (at least 2) in the cluster, on different physical hosts, using Zookeeper for coordination, to provide redundancy and high availability of HBase.
* Refer to the [[Configuration][Configuration page]] for the options to configure in atlas.properties to set up Atlas with HBase; an illustrative sketch of these options follows this list.
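As an illustration only (the [[Configuration][Configuration page]] is the authoritative reference), and assuming the
HBase cluster is coordinated by a Zookeeper quorum on hosts zk1, zk2 and zk3, the relevant atlas.properties entries
might look like the following:
<verbatim>
# Use HBase as the backing store for Titan
atlas.graph.storage.backend=hbase
# Zookeeper quorum used by the HBase cluster (hostnames here are illustrative)
atlas.graph.storage.hostname=zk1.example.com,zk2.example.com,zk3.example.com
</verbatim>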
---++ Index Store
As described above, Atlas indexes metadata through Titan to support full text search queries. In order to provide HA
for the index store, we recommend that Atlas be configured to use Solr as the backing index store for Titan. In order
to configure Atlas to use Solr in HA mode, do the following:
* Choose an existing !SolrCloud cluster setup in HA mode to configure in Atlas (OR) Set up a new [[https://cwiki.apache.org/confluence/display/solr/SolrCloud][SolrCloud cluster]].
* Ensure Solr is brought up on at least 2 physical hosts for redundancy, and each host runs a Solr node.
* We recommend the number of replicas to be set to at least 2 for redundancy.
* Create the !SolrCloud collections required by Atlas, as described in [[InstallationSteps][Installation Steps]].
* Refer to the [[Configuration][Configuration page]] for the options to configure in atlas.properties to set up Atlas with Solr; an illustrative sketch of these options follows this list.
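As an illustration only (again, the [[Configuration][Configuration page]] is authoritative, and the exact value of the
backend option depends on the Atlas and Solr versions in use), the atlas.properties entries for a !SolrCloud cluster
coordinated by a Zookeeper ensemble might look like the following:
<verbatim>
# Use Solr as the Titan index backend (the exact value, e.g. solr or solr5, depends on the version in use)
atlas.graph.index.search.backend=solr5
# Run Solr in SolrCloud mode, coordinated via Zookeeper (hostnames here are illustrative)
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
</verbatim>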
---++ Notification Server
Metadata notification events from Hooks are sent to Atlas by writing them to a Kafka topic called *ATLAS_HOOK*. Similarly, events from
Atlas to other integrating components, like Ranger, are written to a Kafka topic called *ATLAS_ENTITIES*. Since Kafka
persists these messages, the events will not be lost even if the consumers are down when the events are sent. In
addition, we recommend that Kafka also be set up for fault tolerance so that it has higher availability guarantees. In order
to configure Atlas to use Kafka in HA mode, do the following:
* Choose an existing Kafka cluster set up in HA mode to configure in Atlas (OR) Set up a new Kafka cluster.
* We recommend running more than one Kafka broker (at least 2) in the cluster, on different physical hosts, using Zookeeper for coordination, to provide redundancy and high availability of Kafka.
* Set up at least 2 physical hosts for redundancy, each hosting a Kafka broker.
* Set up Kafka topics for Atlas usage:
* The number of partitions for the ATLAS topics should be set to 1 (numPartitions)
* Decide the number of replicas for the Kafka topics (numReplicas): set this to at least 2 for redundancy.
* Run the following commands:
<verbatim>
$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper <list of zookeeper host:port entries> --topic ATLAS_HOOK --replication-factor <numReplicas> --partitions 1
$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper <list of zookeeper host:port entries> --topic ATLAS_ENTITIES --replication-factor <numReplicas> --partitions 1
</verbatim>
Here KAFKA_HOME points to the Kafka installation directory.
* In application.properties, set the following configuration:
<verbatim>
atlas.notification.embedded=false
atlas.kafka.zookeeper.connect=<comma separated list of servers forming Zookeeper quorum used by Kafka>
atlas.kafka.bootstrap.servers=<comma separated list of Kafka broker endpoints in host:port form> - Give at least 2 for redundancy.
</verbatim>
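Once the topics are created and the configuration above is in place, it can be worth confirming that the intended
replication factor is actually in effect. A minimal check, assuming the same Zookeeper quorum used when creating the
topics, might look like:
<verbatim>
# Describe the Atlas topics; the output lists each partition, its replicas and the current in-sync replicas
$KAFKA_HOME/bin/kafka-topics.sh --describe --zookeeper <list of zookeeper host:port entries> --topic ATLAS_HOOK
$KAFKA_HOME/bin/kafka-topics.sh --describe --zookeeper <list of zookeeper host:port entries> --topic ATLAS_ENTITIES
</verbatim>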
---++ Known Issues
* [[https://issues.apache.org/jira/browse/ATLAS-338][ATLAS-338]]: Metadata events generated from the Hive CLI (as opposed to Beeline or any client going through HiveServer2) would be lost if the Atlas server is down.
* If the HBase region servers hosting the Atlas Titan HBase table are down, Atlas will not be able to store or retrieve metadata from HBase until they are brought back online.