name: Installation Instruction
route: /Installation
menu: Documentation
submenu: Setup

Installing & Running Apache Atlas

Installing Apache Atlas

From the directory where you would like Apache Atlas to be installed, run the following commands:
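As a minimal sketch, assuming the server package is a gzipped tarball produced by the build, installation amounts to unpacking the archive and changing into the extracted directory. The version string, archive name, and directory name below are hypothetical placeholders; substitute the ones from your own build.

```shell
# Hypothetical names: adjust to the package you actually built or downloaded.
ATLAS_VERSION="${ATLAS_VERSION:-x.y.z}"
ATLAS_TARBALL="apache-atlas-${ATLAS_VERSION}-server.tar.gz"

# Unpack the server package into the current directory and enter it.
install_atlas() {
  tar -xzvf "$ATLAS_TARBALL" && cd "apache-atlas-${ATLAS_VERSION}"
}
```

Call `install_atlas` from the directory that contains the tarball.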

Running Apache Atlas with Local Apache HBase & Apache Solr

To run Apache Atlas with local Apache HBase and Apache Solr instances that are started and stopped along with Atlas, run the following commands:
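A minimal sketch of this setup follows. It assumes the start script honours the `MANAGE_LOCAL_HBASE` and `MANAGE_LOCAL_SOLR` environment variables (names not stated in this document; verify them against your distribution) and that `ATLAS_HOME` points at the package directory.

```shell
# Assumption: the Atlas start/stop scripts read these two variables.
export MANAGE_LOCAL_HBASE=true
export MANAGE_LOCAL_SOLR=true

# Start Atlas; ATLAS_HOME is a placeholder for the package directory.
start_atlas() {
  "${ATLAS_HOME:-.}/bin/atlas_start.py"
}
```

Running `start_atlas` then brings up Atlas together with the local HBase and Solr instances.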

Running Apache Atlas with BerkeleyDB & Apache Solr

To run Apache Atlas with BerkeleyDB and local instances of Apache Solr and Apache ZooKeeper, run the following commands:

Using Apache Atlas

  • To verify that the Apache Atlas server is up and running, run the curl command as shown below:
  • Run quick start to load sample model and data
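The two steps above can be sketched as follows. The host, port 21000, the credentials, the `/api/atlas/admin/version` endpoint, and the `bin/quick_start.py` script name are assumptions not stated in this document; adjust them to your deployment.

```shell
# Placeholders: adjust host, port, and credentials to your deployment.
ATLAS_URL="${ATLAS_URL:-http://localhost:21000}"
ATLAS_USER="${ATLAS_USER:-username}"
ATLAS_PASS="${ATLAS_PASS:-password}"

# Ask the server for its version; a JSON reply indicates the server is up.
atlas_version() {
  curl -s -u "${ATLAS_USER}:${ATLAS_PASS}" "${ATLAS_URL}/api/atlas/admin/version"
}

# Load the sample model and data (script name assumed).
atlas_quick_start() {
  "${ATLAS_HOME:-.}/bin/quick_start.py"
}
```

Call `atlas_version` first, then `atlas_quick_start` once the server responds.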

Stopping Apache Atlas Server

To stop Apache Atlas, run the following command:
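A minimal sketch, assuming the stop script is named `bin/atlas_stop.py` by analogy with `bin/atlas_start.py` (the name is not given in this document) and that `ATLAS_HOME` points at the package directory:

```shell
# Stop the Atlas server; the script name is an assumption.
stop_atlas() {
  "${ATLAS_HOME:-.}/bin/atlas_stop.py"
}
```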

Configuring Apache Atlas

By default, the config directory used by Apache Atlas is {package dir}/conf. To override this, set the environment variable ATLAS_CONF to the path of the conf directory.

Environment variables needed to run Apache Atlas can be set in the atlas-env.sh file in the conf directory. This file is sourced by the Apache Atlas scripts before any commands are executed. The following environment variables are available:

# Any additional Java opts you want to set. These apply to both client and server operations.
#export ATLAS_OPTS=

# Any additional Java opts that you want to set for the client only.
#export ATLAS_CLIENT_OPTS=

# Java heap size to set for the client. Default is 1024MB.
#export ATLAS_CLIENT_HEAP=

# Any additional opts you want to set for the Atlas service.
#export ATLAS_SERVER_OPTS=

# Java heap size to set for the Atlas server. Default is 1024MB.
#export ATLAS_SERVER_HEAP=

# What is considered the Atlas home dir. Default is the base location of the installed software.
#export ATLAS_HOME_DIR=

# Where log files are stored. Default is the logs directory under the base install location.
#export ATLAS_LOG_DIR=

# Where pid files are stored. Default is the logs directory under the base install location.
#export ATLAS_PID_DIR=

# Where to expand the war file. By default it is the /server/webapp dir under the base install dir.
#export ATLAS_EXPANDED_WEBAPP_DIR=

Settings to support large number of metadata objects

If you plan to store a large number of metadata objects, it is recommended that you use values tuned for better GC performance of the JVM.

The following values are common server-side options:

The -XX:SoftRefLRUPolicyMSPerMB option was found to be particularly helpful in regulating GC performance for query-heavy workloads with many concurrent users.

The following values are recommended for JDK 8:
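The option values are not reproduced here. As a sketch, the following lines for atlas-env.sh combine `-XX:SoftRefLRUPolicyMSPerMB` (the only flag named in the text) with other standard HotSpot JDK 8 flags. The choice of flags, the file paths, and the heap sizes are assumptions for illustration; size the heap to your own host.

```shell
# Only -XX:SoftRefLRUPolicyMSPerMB is named in the text; the other flags are assumed.
GC_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0"
GC_OPTS="$GC_OPTS -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled"
GC_OPTS="$GC_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof"
GC_OPTS="$GC_OPTS -verbose:gc -Xloggc:logs/gc-worker.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
export ATLAS_SERVER_OPTS="$GC_OPTS"

# JDK 8 sizing (illustrative values): fixed heap, bounded young generation and metaspace.
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"
```

Setting `-Xms` equal to `-Xmx` avoids heap resizing pauses, at the cost of reserving the full heap up front.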

NOTE for Mac OS users: If you are using Mac OS, you will need to configure ATLAS_SERVER_OPTS (explained above).

In {package dir}/conf/atlas-env.sh, uncomment the line #export ATLAS_SERVER_OPTS=

and change it to look as below:
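The replacement line is not reproduced here. One plausible form, given as an assumption rather than the documented value, sets standard JVM system properties that run Java headless and leave the Kerberos realm and KDC empty:

```shell
# Assumed value: standard JVM system properties often used on Mac OS.
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
```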

Configuring Apache HBase as the storage backend for the Graph Repository

By default, Apache Atlas uses JanusGraph as the graph repository; it is currently the only graph repository implementation available. The Apache HBase versions currently supported are 1.1.x. For details on configuring Apache Atlas graph persistence on Apache HBase, see “Graph persistence engine - HBase” in the Configuration section.

Apache HBase tables used by Apache Atlas can be set using the following configurations:
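As a sketch, the table names can be appended to atlas-application.properties as shown below. The property keys `atlas.graph.storage.hbase.table` and `atlas.audit.hbase.tablename` and the table names are assumptions not stated in this document; check them against the Configuration section. `ATLAS_PROPS` is a placeholder for the properties file path.

```shell
# Placeholder path to the Atlas properties file.
ATLAS_PROPS="${ATLAS_PROPS:-${ATLAS_HOME:-.}/conf/atlas-application.properties}"

# Append the (assumed) HBase table settings; edit in place if the keys already exist.
write_hbase_table_config() {
  cat >> "$ATLAS_PROPS" <<EOF
atlas.graph.storage.hbase.table=apache_atlas_janus
atlas.audit.hbase.tablename=apache_atlas_entity_audit
EOF
}
```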

Configuring Apache Solr as the indexing backend for the Graph Repository

By default, Apache Atlas uses JanusGraph as the graph repository; it is currently the only graph repository implementation available. To configure JanusGraph to work with Apache Solr, follow the instructions below.

SolrCloud mode uses a ZooKeeper service as a highly available, central location for cluster management. For a small cluster, running with an existing ZooKeeper quorum should be fine. For larger clusters, you would want to run a separate ZooKeeper quorum with at least 3 servers. For more information, refer to the Apache Solr documentation - https://cwiki.apache.org/confluence/display/solr/SolrCloud

  • For example, to bring up an Apache Solr node listening on port 8983 on a machine, you can use the command:
  • Run the following commands from the SOLR_BIN (e.g. $SOLR_HOME/bin) directory to create collections in Apache Solr corresponding to the indexes that Apache Atlas uses. If the Apache Atlas and Apache Solr instances are on 2 different hosts, first copy the required configuration files from ATLAS_HOME/conf/solr on the Apache Atlas host to the Apache Solr host. SOLR_CONF in the commands below refers to the directory on the Apache Solr host to which the configuration files have been copied:
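The two steps above can be sketched as follows. The collection names (`vertex_index`, `edge_index`, `fulltext_index`), the ZooKeeper address, and the default paths are assumptions not stated in this document; `solr start` and `solr create` are the standard Solr CLI subcommands. The function prints the commands instead of running them, so they can be reviewed first.

```shell
# Placeholders: Solr bin dir, copied Atlas Solr config dir, ZooKeeper address.
SOLR_BIN="${SOLR_BIN:-/opt/solr/bin}"
SOLR_CONF="${SOLR_CONF:-/opt/solr/atlas-conf}"
ZK_HOST="${ZK_HOST:-localhost:2181}"
NUM_SHARDS="${NUM_SHARDS:-1}"
REPLICATION_FACTOR="${REPLICATION_FACTOR:-1}"

# Print (not run) the Solr commands for a node on port 8983 and the Atlas collections.
solr_commands() {
  # Start one SolrCloud node registered with ZooKeeper.
  echo "$SOLR_BIN/solr start -c -z $ZK_HOST -p 8983"
  # One collection per index; the index names are assumed.
  for index in vertex_index edge_index fulltext_index; do
    echo "$SOLR_BIN/solr create -c $index -d $SOLR_CONF -shards $NUM_SHARDS -replicationFactor $REPLICATION_FACTOR"
  done
}

solr_commands
```

Once the printed commands look right, they can be executed with `solr_commands | sh`.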

Note: If numShards and replicationFactor are not specified, they default to 1, which suffices if you are trying out Solr with Atlas on a single-node instance. Otherwise, specify numShards according to the number of hosts in the Solr cluster and the maxShardsPerNode configuration. The number of shards cannot exceed the total number of Solr nodes in your SolrCloud cluster.

The number of replicas (replicationFactor) can be set according to the redundancy required.

Also note that Apache Solr will automatically be called to create the indexes when Apache Atlas server is started if the SOLR_BIN and SOLR_CONF environment variables are set and the search indexing backend is set to ‘solr5’.

  • Change the Atlas configuration to point to the Apache Solr instance setup. Make sure the following configurations are set to the values below in ATLAS_HOME/conf/atlas-application.properties
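A sketch of these settings, written as a shell function that appends them to the properties file, is shown below. The property keys and the ZooKeeper addresses are assumptions not stated in this document. The backend value is shown as `solr`, the name JanusGraph uses, whereas the note above mentions `solr5`, so check which value your version expects.

```shell
# Placeholders: properties file path and the ZooKeeper quorum used by SolrCloud.
ATLAS_PROPS="${ATLAS_PROPS:-${ATLAS_HOME:-.}/conf/atlas-application.properties}"
SOLR_ZK_URL="${SOLR_ZK_URL:-zk1.example.com:2181,zk2.example.com:2181}"

# Append the (assumed) SolrCloud settings; edit in place if the keys already exist.
write_solr_cloud_config() {
  cat >> "$ATLAS_PROPS" <<EOF
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=$SOLR_ZK_URL
EOF
}
```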

For more information on JanusGraph Solr configuration, please refer to http://docs.janusgraph.org/0.2.0/solr.html

Pre-requisites for running Apache Solr in cloud mode

  • Memory - Apache Solr is both memory and CPU intensive. Make sure the server running Apache Solr has adequate memory, CPU and disk. Apache Solr works well with 32GB RAM. Plan to provide as much memory as possible to Apache Solr process

  • Disk - If the number of entities that need to be stored is large, plan to have at least 500 GB of free space in the volume where Apache Solr is going to store the index data

  • SolrCloud has support for replication and sharding. It is highly recommended to use SolrCloud with at least two Apache Solr nodes running on different servers with replication enabled. If using SolrCloud, then you also need ZooKeeper installed and configured with 3 or 5 ZooKeeper nodes

  • Start Apache Solr in http mode - alternative setup to Solr in cloud mode.

Solr Standalone is used for a single instance, and it keeps configuration information on the file system. It does not require ZooKeeper and provides high performance for a medium-sized index. It can be considered a good option for fast prototyping as well as a valid configuration for development environments. In some cases it demonstrates better performance than Solr cloud mode in a production-grade setup of Atlas.

  • Change ATLAS configuration to point to Standalone Apache Solr instance setup. Please make sure the following configurations are set to the below values in ATLAS_HOME/conf/atlas-application.properties
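A sketch of the standalone settings follows, again as a function that appends to the properties file. The property keys and the Solr URL are assumptions not stated in this document; verify them against the Configuration section.

```shell
# Placeholders: properties file path and the standalone Solr base URL.
ATLAS_PROPS="${ATLAS_PROPS:-${ATLAS_HOME:-.}/conf/atlas-application.properties}"
SOLR_HTTP_URL="${SOLR_HTTP_URL:-http://localhost:8983/solr}"

# Append the (assumed) standalone Solr settings; edit in place if the keys already exist.
write_solr_http_config() {
  cat >> "$ATLAS_PROPS" <<EOF
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=http
atlas.graph.index.search.solr.http-urls=$SOLR_HTTP_URL
EOF
}
```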

Note: Solr standalone can be run in embedded mode using the embedded-hbase-solr profile.

Configuring Elasticsearch as the indexing backend for the Graph Repository (Tech Preview)

By default, Apache Atlas uses JanusGraph as the graph repository; it is currently the only graph repository implementation available. To configure JanusGraph to work with Elasticsearch, follow the instructions below.

  • Install an Elasticsearch cluster. The version currently supported is 5.6.4, and can be acquired from: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.4.tar.gz

  • For simple testing a single Elasticsearch node can be started by using the ‘elasticsearch’ command in the bin directory of the Elasticsearch distribution.

  • Change Apache Atlas configuration to point to the Elasticsearch instance setup. Please make sure the following configurations are set to the below values in ATLAS_HOME/conf/atlas-application.properties
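A sketch of the Elasticsearch settings is given below. The property keys and the hostnames are assumptions not stated in this document; check them against the Configuration section and the JanusGraph documentation.

```shell
# Placeholders: properties file path and comma-separated Elasticsearch hosts.
ATLAS_PROPS="${ATLAS_PROPS:-${ATLAS_HOME:-.}/conf/atlas-application.properties}"
ES_HOSTS="${ES_HOSTS:-es1.example.com,es2.example.com}"

# Append the (assumed) Elasticsearch settings; edit in place if the keys already exist.
write_elasticsearch_config() {
  cat >> "$ATLAS_PROPS" <<EOF
atlas.graph.index.search.backend=elasticsearch
atlas.graph.index.search.hostname=$ES_HOSTS
atlas.graph.index.search.elasticsearch.client-only=true
EOF
}
```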

For more information on JanusGraph configuration for Elasticsearch, please refer to http://docs.janusgraph.org/0.2.0/elasticsearch.html

Configuring Kafka Topics

Apache Atlas uses Apache Kafka to ingest metadata from other components at runtime. This is described in more detail in the Architecture documentation. Depending on the configuration of Apache Kafka, you might need to set up the topics explicitly before using Apache Atlas. To do so, Apache Atlas provides the script bin/atlas_kafka_setup.py, which can be run from the Apache Atlas server. In some environments, the hooks might start being used before the Apache Atlas server itself is set up. In such cases, the topics can be created on the hosts where the hooks are installed, using a similar script, hook-bin/atlas_kafka_setup_hook.py. Both scripts use the configuration in atlas-application.properties to set up the topics. Please refer to the Configuration documentation for these details.

Setting up Apache Atlas

There are a few steps that set up dependencies of Apache Atlas. One example is setting up the JanusGraph schema in the storage backend of choice. In a simple single-server setup, these are set up automatically with the default configuration when the server first accesses these dependencies.

However, there are scenarios in which we may want to run the setup steps explicitly as one-time operations. For example, in a multiple-server scenario using High Availability, it is preferable to run the setup steps from one of the server instances the first time, and then start the services.

To run these steps one time, execute the command bin/atlas_start.py -setup from a single Apache Atlas server instance.

However, the Apache Atlas server does handle parallel executions of the setup steps, and running the setup steps multiple times is idempotent. Therefore, if you choose to run the setup steps as part of server startup for convenience, enable the configuration option atlas.server.run.setup.on.start by setting it to true in the atlas-application.properties file.
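The two approaches can be sketched as shell functions. The command and the property name come from the text above; `ATLAS_HOME` and `ATLAS_PROPS` are placeholder paths introduced for illustration.

```shell
# Placeholder path to the Atlas properties file.
ATLAS_PROPS="${ATLAS_PROPS:-${ATLAS_HOME:-.}/conf/atlas-application.properties}"

# Option 1: run the setup steps once, explicitly, from a single server instance.
run_setup_once() {
  "${ATLAS_HOME:-.}/bin/atlas_start.py" -setup
}

# Option 2: have the server run the setup steps on every start.
enable_setup_on_start() {
  echo "atlas.server.run.setup.on.start=true" >> "$ATLAS_PROPS"
}
```

Use one option or the other: the first suits High Availability deployments, the second trades an explicit step for convenience.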

Examples: calling Apache Atlas REST APIs

Here are a few examples of calling Apache Atlas REST APIs via the curl command.

  • List the types in the repository
  • List the instances for a given type
  • Search for entities
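The three calls listed above can be sketched as follows. The base URL, credentials, and the v2 endpoint paths are assumptions not stated in this document, and `hive_db` in the usage note is only an example type name; check the REST API reference for your version.

```shell
# Placeholders: adjust host, port, and credentials to your deployment.
ATLAS_URL="${ATLAS_URL:-http://localhost:21000}"
ATLAS_USER="${ATLAS_USER:-username}"
ATLAS_PASS="${ATLAS_PASS:-password}"

# List the types in the repository.
list_types() {
  curl -s -u "${ATLAS_USER}:${ATLAS_PASS}" "${ATLAS_URL}/api/atlas/v2/types/typedefs/headers"
}

# List the instances of the type given as $1.
list_instances() {
  curl -s -G -u "${ATLAS_USER}:${ATLAS_PASS}" \
    --data-urlencode "typeName=$1" "${ATLAS_URL}/api/atlas/v2/search/basic"
}

# Search for entities with the DSL query given as $1; curl URL-encodes the query.
search_entities() {
  curl -s -G -u "${ATLAS_USER}:${ATLAS_PASS}" \
    --data-urlencode "query=$1" "${ATLAS_URL}/api/atlas/v2/search/dsl"
}
```

For example, `list_instances hive_db` or `search_entities "hive_db where name='default'"`.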

Troubleshooting

Setup issues

If the setup of the Apache Atlas service fails for any reason, the next run of setup (either by an explicit invocation of atlas_start.py -setup or by enabling the configuration option atlas.server.run.setup.on.start) will fail with a message such as "A previous setup run may not have completed cleanly." In such cases, you need to manually ensure that setup can run, and delete the ZooKeeper node at /apache_atlas/setup_in_progress before attempting to run setup again.
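Deleting the node can be sketched with the ZooKeeper command-line client. This assumes `zkCli.sh` is on the PATH and that `ZK_HOST` (a placeholder) points at the ZooKeeper ensemble Atlas uses; the node path is the one given above.

```shell
# Placeholder: address of the ZooKeeper ensemble used by Atlas.
ZK_HOST="${ZK_HOST:-localhost:2181}"

# Remove the marker node left behind by an incomplete setup run.
clear_setup_marker() {
  zkCli.sh -server "$ZK_HOST" delete /apache_atlas/setup_in_progress
}
```

Run `clear_setup_marker` only after confirming that no setup is actually in progress on another server instance.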

If the setup failed due to Apache HBase schema setup errors, it may be necessary to repair Apache HBase schema. If no data has been stored, one can also disable and drop the Apache HBase tables used by Apache Atlas and run setup again.