---++ Building & Installing Apache Atlas
---+++ Building Atlas
<verbatim>
git clone https://gitbox.apache.org/repos/asf/atlas.git atlas
cd atlas
git checkout branch-0.8
export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean -DskipTests install</verbatim>
---+++ Packaging Atlas
To create an Apache Atlas package for deployment in an environment having functional HBase and Solr instances, build with the following command:
<verbatim>
mvn clean -DskipTests package -Pdist</verbatim>
* NOTES:
* Remove the option '-DskipTests' to run unit and integration tests
* To build a distribution without minified js/css files, build with the skipMinify profile (see the example below). By default js and css files are minified.
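For illustration, the skipMinify profile can be combined with the dist profile mentioned above; treat this as a sketch and adjust the profiles to your build:
<verbatim>
mvn clean -DskipTests package -Pdist,skipMinify</verbatim>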
The above will build Atlas for an environment having functional HBase and Solr instances. Atlas needs to be set up with the following to run in this environment (an illustrative example follows the list below):
* Configure atlas.graph.storage.hostname (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
* Configure atlas.graph.index.search.solr.zookeeper-url (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
* Set HBASE_CONF_DIR to point to a valid HBase config directory (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
* Create the SOLR indices (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
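As a sketch, these settings might look like the following; the ZooKeeper hosts and the HBase config path are illustrative placeholders, and the [[Configuration][Configuration]] section remains the authoritative reference:
<verbatim>
# conf/atlas-application.properties (illustrative values)
atlas.graph.storage.hostname=zk-host1:2181,zk-host2:2181,zk-host3:2181
atlas.graph.index.search.solr.zookeeper-url=zk-host1:2181,zk-host2:2181,zk-host3:2181

# conf/atlas-env.sh (illustrative path)
export HBASE_CONF_DIR=/etc/hbase/conf</verbatim>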
---+++ Packaging Atlas with Embedded HBase & Solr
To create Apache Atlas package that includes HBase and Solr, build with the embedded-hbase-solr profile as shown below:
<verbatim>
mvn clean -DskipTests package -Pdist,embedded-hbase-solr</verbatim>
Using the embedded-hbase-solr profile will configure Atlas so that an HBase instance and a Solr instance will be started and stopped along with the Atlas server by default.
---+++ Apache Atlas Package
The build will create the following files, which are used to install Apache Atlas.
<verbatim>
distro/target/apache-atlas-${project.version}-bin.tar.gz
distro/target/apache-atlas-${project.version}-sources.tar.gz</verbatim>
---+++ Installing & Running Atlas
---++++ Installing Atlas
<verbatim>
tar -xzvf apache-atlas-${project.version}-bin.tar.gz
cd atlas-${project.version}</verbatim>
---++++ Configuring Atlas
By default, the config directory used by Atlas is {package dir}/conf. To override this, set the environment variable ATLAS_CONF to the path of the conf dir.
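For example, to point Atlas at a configuration directory outside the package (the path below is purely illustrative):
<verbatim>
export ATLAS_CONF=/etc/atlas/conf</verbatim>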
Environment variables needed to run Atlas can be set in the atlas-env.sh file in the conf directory. This file will be sourced by Atlas scripts before any commands are executed. The following environment variables are available to set.
<verbatim>
# The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path
#export JAVA_HOME=
# any additional java opts you want to set. This will apply to both client and server operations
#export ATLAS_OPTS=
# any additional java opts that you want to set for client only
#export ATLAS_CLIENT_OPTS=
# java heap size we want to set for the client. Default is 1024MB
#export ATLAS_CLIENT_HEAP=
# any additional opts you want to set for atlas service.
#export ATLAS_SERVER_OPTS=
# java heap size we want to set for the atlas server. Default is 1024MB
#export ATLAS_SERVER_HEAP=
# What is considered as the atlas home dir. Default is the base location of the installed software
#export ATLAS_HOME_DIR=
# Where log files are stored. Default is the logs directory under the base install location
#export ATLAS_LOG_DIR=
# Where pid files are stored. Default is the logs directory under the base install location
#export ATLAS_PID_DIR=
# Where the atlas titan db data is stored. Default is the logs/data directory under the base install location
#export ATLAS_DATA_DIR=
# Where do you want to expand the war file. By default it is in the /server/webapp dir under the base install dir.
#export ATLAS_EXPANDED_WEBAPP_DIR=</verbatim>
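As a minimal sketch, an =atlas-env.sh= that only sets the Java implementation and the server heap might look like this (both values are illustrative; everything else keeps its default):
<verbatim>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export ATLAS_SERVER_HEAP="-Xms2048m -Xmx2048m"</verbatim>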
*Settings to support a large number of metadata objects*
If you plan to store several tens of thousands of metadata objects, it is recommended that you use values tuned for better GC performance of the JVM.
The following values are common server side options:
<verbatim>
export ATLAS_SERVER_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps"</verbatim>
The =-XX:SoftRefLRUPolicyMSPerMB= option was found to be particularly helpful to regulate GC performance for query heavy workloads with many concurrent users.
The following values are recommended for JDK 7:
<verbatim>
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=3072m -XX:PermSize=100M -XX:MaxPermSize=512m"</verbatim>
The following values are recommended for JDK 8:
<verbatim>
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"</verbatim>
*NOTE for Mac OS users*
If you are using Mac OS, you will need to configure ATLAS_SERVER_OPTS (explained above).
In {package dir}/conf/atlas-env.sh uncomment the following line
<verbatim>
#export ATLAS_SERVER_OPTS=</verbatim>
and change it to look as below
<verbatim>
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="</verbatim>
*HBase as the Storage Backend for the Graph Repository*
By default, Atlas uses Titan as the graph repository; it is currently the only graph repository implementation available. The HBase versions currently supported are 1.1.x. For configuring Atlas graph persistence on HBase, please see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section for more details.
HBase tables used by Atlas can be set using the following configurations:
<verbatim>
atlas.graph.storage.hbase.table=apache_atlas_titan
atlas.audit.hbase.tablename=apache_atlas_entity_audit</verbatim>
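Once Atlas has been started against HBase, one optional way to confirm that the tables above were created is the HBase shell (assuming =hbase shell= is available on the host):
<verbatim>
hbase shell
list 'apache_atlas.*'</verbatim>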
*Configuring SOLR as the Indexing Backend for the Graph Repository*
By default, Atlas uses Titan as the graph repository; it is currently the only graph repository implementation available. For configuring Titan to work with Solr, please follow the instructions below:
* Install Solr if not already running. The supported Solr version is 5.2.1. It can be downloaded from http://archive.apache.org/dist/lucene/solr/5.2.1/solr-5.2.1.tgz
* Start solr in cloud mode.
!SolrCloud mode uses a !ZooKeeper Service as a highly available, central location for cluster management.
For a small cluster, running with an existing !ZooKeeper quorum should be fine. For larger clusters, you would want to run a separate !ZooKeeper quorum with at least 3 servers.
Note: Atlas currently supports Solr in "cloud" mode only. "http" mode is not supported. For more information, refer to the Solr documentation - https://cwiki.apache.org/confluence/display/solr/SolrCloud
* For example, to bring up a Solr node listening on port 8983 on a machine, you can use the command:
<verbatim>
$SOLR_HOME/bin/solr start -c -z <zookeeper_host:port> -p 8983</verbatim>
* Run the following commands from the SOLR_BIN (e.g. $SOLR_HOME/bin) directory to create collections in Solr corresponding to the indexes that Atlas uses. If the Atlas and Solr instances are on two different hosts, first copy the required configuration files from ATLAS_HOME/conf/solr on the Atlas host to the Solr host. SOLR_CONF in the commands below refers to the directory where the Solr configuration files have been copied to on the Solr host:
<verbatim>
$SOLR_BIN/solr create -c vertex_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
$SOLR_BIN/solr create -c edge_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
$SOLR_BIN/solr create -c fulltext_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor</verbatim>
Note: If numShards and replicationFactor are not specified, they default to 1, which suffices if you are trying out Solr with Atlas on a single node instance.
Otherwise specify numShards according to the number of hosts that are in the Solr cluster and the maxShardsPerNode configuration.
The number of shards cannot exceed the total number of Solr nodes in your !SolrCloud cluster.
The number of replicas (replicationFactor) can be set according to the redundancy required.
Also note that Solr will automatically be called to create the indexes when the Atlas server is started if the
SOLR_BIN and SOLR_CONF environment variables are set and the search indexing backend is set to 'solr5' (see the illustration below).
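As an illustration, for a hypothetical two-node !SolrCloud cluster, the collection creation commands above might be run as follows; the paths, shard count and replication factor are examples only:
<verbatim>
export SOLR_BIN=/opt/solr/bin
export SOLR_CONF=/opt/atlas/conf/solr

$SOLR_BIN/solr create -c vertex_index -d $SOLR_CONF -shards 2 -replicationFactor 2
$SOLR_BIN/solr create -c edge_index -d $SOLR_CONF -shards 2 -replicationFactor 2
$SOLR_BIN/solr create -c fulltext_index -d $SOLR_CONF -shards 2 -replicationFactor 2</verbatim>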
* Change the Atlas configuration to point to the Solr instance setup. Please make sure the following configurations are set to the values below in ATLAS_HOME/conf/atlas-application.properties:
<verbatim>
atlas.graph.index.search.backend=solr5
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=<the ZK quorum setup for solr as comma separated value> eg: 10.1.6.4:2181,10.1.6.5:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=<SolrCloud Zookeeper Connection Timeout>. Default value is 60000 ms
atlas.graph.index.search.solr.zookeeper-session-timeout=<SolrCloud Zookeeper Session Timeout>. Default value is 60000 ms</verbatim>
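For example, with a hypothetical three-node ZooKeeper ensemble and the default timeouts, the filled-in properties could look like:
<verbatim>
atlas.graph.index.search.backend=solr5
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=10.1.6.4:2181,10.1.6.5:2181,10.1.6.6:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000</verbatim>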
* Restart Atlas
For more information on Titan Solr configuration, please refer to http://s3.thinkaurelius.com/docs/titan/0.5.4/solr.htm
*Pre-requisites for running Solr in cloud mode*
* Memory - Solr is both memory and CPU intensive. Make sure the server running Solr has adequate memory, CPU and disk.
Solr works well with 32GB RAM. Plan to provide as much memory as possible to the Solr process.
* Disk - If the number of entities that need to be stored are large, plan to have at least 500 GB free space in the volume where Solr is going to store the index data
* !SolrCloud has support for replication and sharding. It is highly recommended to use !SolrCloud with at least two Solr nodes running on different servers with replication enabled.
If using !SolrCloud, then you also need !ZooKeeper installed and configured with 3 or 5 !ZooKeeper nodes
*Configuring Kafka Topics*
Atlas uses Kafka to ingest metadata from other components at runtime. This is described in the [[Architecture][Architecture page]]
in more detail. Depending on the configuration of Kafka, you might sometimes need to set up the topics explicitly before
using Atlas. To do so, Atlas provides a script =bin/atlas_kafka_setup.py= which can be run from the Atlas server. In some
environments, the hooks might start getting used before the Atlas server itself is set up. In such cases, the topics
can be created on the hosts where hooks are installed using a similar script =hook-bin/atlas_kafka_setup_hook.py=. Both of these
use the configuration in =atlas-application.properties= for setting up the topics. Please refer to the [[Configuration][Configuration page]]
for these details.
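As a sketch, topic setup might look like the following; the Kafka connection property names are assumptions based on the [[Configuration][Configuration page]], and the host names are placeholders:
<verbatim>
# in conf/atlas-application.properties (illustrative values)
atlas.kafka.zookeeper.connect=kafka-zk-host:2181
atlas.kafka.bootstrap.servers=kafka-broker-host:9092

# create the topics from the Atlas server host
bin/atlas_kafka_setup.py

# or, from a host where only a hook is installed
hook-bin/atlas_kafka_setup_hook.py</verbatim>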
---++++ Setting up Atlas
There are a few steps that set up dependencies of Atlas. One such example is setting up the Titan schema
in the storage backend of choice. In a simple single server setup, these are automatically set up with default
configuration when the server first accesses these dependencies.
However, there are scenarios when we may want to run setup steps explicitly as one time operations. For example, in a
multiple server scenario using [[HighAvailability][High Availability]], it is preferable to run setup steps from one
of the server instances the first time, and then start the services.
To run these steps one time, execute the command =bin/atlas_start.py -setup= from a single Atlas server instance.
However, the Atlas server does take care of parallel executions of the setup steps, and running the setup steps
multiple times is idempotent. Therefore, if you choose to run the setup steps as part of server startup for
convenience, enable the configuration option =atlas.server.run.setup.on.start= by defining it with the value =true=
in the =atlas-application.properties= file.
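To recap the two approaches described above:
<verbatim>
# one-time setup, run from a single Atlas server instance
bin/atlas_start.py -setup

# or, to run setup on every server start, add to conf/atlas-application.properties:
atlas.server.run.setup.on.start=true</verbatim>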
---++++ Starting Atlas Server
<verbatim>
bin/atlas_start.py [-port <port>]</verbatim>
---+++ Using Atlas
* Verify that the server is up and running
<verbatim>
curl -v -u username:password http://localhost:21000/api/atlas/admin/version
{"Version":"v0.1"}</verbatim>
* Access Atlas UI using a browser: http://localhost:21000
* Quick start model - sample model and data
<verbatim>
bin/quick_start.py [<atlas endpoint>]</verbatim>
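For example, to load the sample model and data into a server running locally on the default port:
<verbatim>
bin/quick_start.py http://localhost:21000</verbatim>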
* List the types in the repository
<verbatim>
curl -v -u username:password http://localhost:21000/api/atlas/v2/types/typedefs/headers
[ {"guid":"fa421be8-c21b-4cf8-a226-fdde559ad598","name":"Referenceable","category":"ENTITY"},
{"guid":"7f3f5712-521d-450d-9bb2-ba996b6f2a4e","name":"Asset","category":"ENTITY"},
{"guid":"84b02fa0-e2f4-4cc4-8b24-d2371cd00375","name":"DataSet","category":"ENTITY"},
{"guid":"f93975d5-5a5c-41da-ad9d-eb7c4f91a093","name":"Process","category":"ENTITY"},
{"guid":"79dcd1f9-f350-4f7b-b706-5bab416f8206","name":"Infrastructure","category":"ENTITY"}
]</verbatim>
* List the instances for a given type
<verbatim>
curl -v -u username:password "http://localhost:21000/api/atlas/v2/search/basic?typeName=hive_db"
{
  "queryType":"BASIC",
  "searchParameters":{
    "typeName":"hive_db",
    "excludeDeletedEntities":false,
    "includeClassificationAttributes":false,
    "includeSubTypes":true,
    "includeSubClassifications":true,
    "limit":100,
    "offset":0
  },
  "entities":[
    {
      "typeName":"hive_db",
      "guid":"5d900c19-094d-4681-8a86-4eb1d6ffbe89",
      "status":"ACTIVE",
      "displayText":"default",
      "classificationNames":[],
      "attributes":{
        "owner":"public",
        "createTime":null,
        "qualifiedName":"default@cl1",
        "name":"default",
        "description":"Default Hive database"
      }
    },
    {
      "typeName":"hive_db",
      "guid":"3a0b14b0-ab85-4b65-89f2-e418f3f7f77c",
      "status":"ACTIVE",
      "displayText":"finance",
      "classificationNames":[],
      "attributes":{
        "owner":"hive",
        "createTime":null,
        "qualifiedName":"finance@cl1",
        "name":"finance",
        "description":null
      }
    }
  ]
}</verbatim>
* Search for entities
<verbatim>
curl -v -u username:password "http://localhost:21000/api/atlas/v2/search/dsl?query=hive_db%20where%20name='default'"
{
  "queryType":"DSL",
  "queryText":"hive_db where name='default'",
  "entities":[
    {
      "typeName":"hive_db",
      "guid":"5d900c19-094d-4681-8a86-4eb1d6ffbe89",
      "status":"ACTIVE",
      "displayText":"default",
      "classificationNames":[],
      "attributes":{
        "owner":"public",
        "createTime":null,
        "qualifiedName":"default@cl1",
        "name":"default",
        "description":"Default Hive database"
      }
    }
  ]
}</verbatim>
---+++ Stopping Atlas Server
<verbatim>
bin/atlas_stop.py</verbatim>
---+++ Troubleshooting
---++++ Setup issues
If the setup of the Atlas service fails for any reason, the next run of setup (either by an explicit invocation of
=atlas_start.py -setup= or by enabling the configuration option =atlas.server.run.setup.on.start=) will fail with
a message such as =A previous setup run may not have completed cleanly.=. In such cases, you would need to manually
ensure the setup can run, and delete the ZooKeeper node at =/apache_atlas/setup_in_progress= before attempting to
run setup again.
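One way to delete the node, assuming the standard ZooKeeper CLI is available and the ZooKeeper address is adjusted to your environment:
<verbatim>
$ZOOKEEPER_HOME/bin/zkCli.sh -server <zookeeper_host:port>
# at the zkCli prompt:
delete /apache_atlas/setup_in_progress</verbatim>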
If the setup failed due to HBase Titan schema setup errors, it may be necessary to repair the HBase schema. If no
data has been stored, one can also disable and drop the HBase tables used by Atlas and run setup again.
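If you do choose to drop the tables (only when no data has been stored), one possible sequence in the HBase shell, using the default table names from the HBase section above (adjust if you changed them), is:
<verbatim>
hbase shell
disable 'apache_atlas_titan'
drop 'apache_atlas_titan'
disable 'apache_atlas_entity_audit'
drop 'apache_atlas_entity_audit'</verbatim>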