| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[olap-spark-yarn]] |
| == OLAP traversals with Spark on YARN |
| |
| TinkerPop's combination of link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[SparkGraphComputer] |
| and link:https://tinkerpop.apache.org/docs/x.y.z/reference/#_properties_files[HadoopGraph] allows for running |
| distributed, analytical graph queries (OLAP) on a computer cluster. The |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[reference documentation] covers the cases |
| where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs |
| via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (YARN), which requires `SparkGraphComputer` to be |
| configured differently. This recipe describes this configuration. |
| |
| === Approach |
| |
| Most configuration problems of TinkerPop with Spark on YARN stem from three reasons: |
| |
| 1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command. |
| 2. The TinkerPop Spark plugin did not include Spark on YARN runtime dependencies until version 3.2.7/3.3.1. |
| 3. Resolving reason 2 by adding the cluster's Spark jars to the classpath may create all kinds of version |
| conflicts with the TinkerPop dependencies. |
| |
| The current recipe follows a minimalist approach in which no dependencies are added to the dependencies |
| included in the TinkerPop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This |
| approach minimizes the chance of dependency version conflicts. |
| |
| === Prerequisites |
| |
| This recipe is suitable for both a real external and a local pseudo Hadoop cluster. While the recipe is maintained |
| for the vanilla Hadoop pseudo-cluster, it has been reported to work on real clusters with Hadoop distributions |
| from various vendors. |
| |
| If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way to install |
| it is to look at the install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh |
| and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh. |
| |
| This recipe assumes that you installed the Gremlin Console with the |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[Spark plugin] (the |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[Hadoop plugin] is optional). Your Hadoop cluster |
| may have been configured to use file compression, e.g. LZO compression. If so, you need to copy the relevant |
| jar (e.g. `hadoop-lzo-*.jar`) to Gremlin Console's `ext/spark-gremlin/lib` folder. |
| |
| For starting the Gremlin Console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the |
| contents below. Of course, actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be adapted to |
| your particular environment. |
| |
| [source] |
| ---- |
| #!/bin/bash |
| # Variables to be adapted to the actual environment |
| GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone |
| export HADOOP_HOME=/usr/local/lib/hadoop-2.7.7 |
| export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.7/etc/hadoop |
| |
| # Have TinkerPop find the hadoop cluster configs and hadoop native libraries |
| export CLASSPATH=$HADOOP_CONF_DIR |
| export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64" |
| |
| # Start gremlin-console without getting the HADOOP_GREMLIN_LIBS warning |
| cd $GREMLIN_HOME |
| [ ! -e empty ] && mkdir empty |
| export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty |
| bin/gremlin.sh |
| ---- |
| |
| === Running the job |
| |
| You can now run a gremlin OLAP query with Spark on YARN: |
| |
| [source] |
| ---- |
| $ hdfs dfs -put data/tinkerpop-modern.kryo . |
| $ . bin/spark-yarn.sh |
| ---- |
| |
| [gremlin-groovy] |
| ---- |
| hadoop = System.getenv('HADOOP_HOME') |
| hadoopConfDir = System.getenv('HADOOP_CONF_DIR') |
| archive = 'spark-gremlin.zip' |
| archivePath = "/tmp/$archive" |
| ['bash', '-c', "rm -f $archivePath; cd ext/spark-gremlin/lib && zip $archivePath *.jar"].execute().waitFor() |
| conf = new Configurations().properties(new File('conf/hadoop/hadoop-gryo.properties')) |
| conf.setProperty('spark.master', 'yarn') |
| conf.setProperty('spark.submit.deployMode', 'client') |
| conf.setProperty('spark.yarn.archive', "$archivePath") |
| conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./__spark_libs__/*:$hadoopConfDir") |
| conf.setProperty('spark.executor.extraClassPath', "./__spark_libs__/*:$hadoopConfDir") |
| conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64") |
| conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64") |
| conf.setProperty('gremlin.spark.persistContext', 'true') |
| hdfs.copyFromLocal('data/tinkerpop-modern.kryo', 'tinkerpop-modern.kryo') |
| graph = GraphFactory.open(conf) |
| g = traversal().withEmbedded(graph).withComputer(SparkGraphComputer) |
| g.V().group().by(values('name')).by(both().count()) |
| ---- |
| |
| If you run into exceptions, you will have to dig into the logs. You can do this from the command line with |
| `yarn application -list -appStates ALL` to find the `applicationId`, while the logs are available with |
| `yarn logs -applicationId application_1498627870374_0008`. Alternatively, you can inspect the logs via |
| the YARN Resource Manager UI (e.g. \http://rm.your.domain:8088/cluster), provided that YARN was configured with the |
| `yarn.log-aggregation-enable` property set to `true`. See the Spark documentation for |
| https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application[additional hints]. |
| |
| === Explanation |
| |
| This recipe does not require running the `bin/hadoop/init-tp-spark.sh` script described in the |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[reference documentation] and thus is also |
| valid for cluster users without access permissions to do so. |
| |
| Rather, it exploits the `spark.yarn.archive` property, which points to an archive with jars on the local file |
| system and is loaded into the various YARN containers. As a result the `spark-gremlin.zip` archive becomes available |
| as the directory named `+__spark_libs__+` in the YARN containers. The `spark.executor.extraClassPath` and |
| `spark.yarn.appMasterEnv.CLASSPATH` properties point to the jars inside this directory. |
| This is why they contain the `+./__spark_lib__/*+` item. Just because a Spark executor got the archive with |
| jars loaded into its container, does not mean it knows how to access them. |
| |
| Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on YARN as implemented (jars |
| added to the `SparkContext` are not available to the YARN application master). |
| |
| The `gremlin.spark.persistContext` property is explained in the reference documentation of |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting |
| follow-up OLAP queries answered faster, because you skip the overhead for getting resources from YARN. |
| |
| === Additional configuration options |
| |
| This recipe does most of the graph configuration in the Gremlin Console so that environment variables can be used and |
| the chance of configuration mistakes is minimal. Once you have your setup working, it is probably easier to make a copy |
| of the `conf/hadoop/hadoop-gryo.properties` file and put the property values specific to your environment there. This is |
| also the right moment to take a look at the `spark-defaults.xml` file of your cluster, in particular the settings for |
| the https://spark.apache.org/docs/latest/monitoring.html[Spark History Service], which allows you to access logs of |
| finished applications via the YARN resource manager UI. |
| |
| This recipe uses the Gremlin Console, but things should not be very different for your own JVM-based application, |
| as long as you do not use the `spark-submit` or `spark-shell` commands. You will also want to check the additional |
| runtime dependencies listed in the `Gremlin-Plugin-Dependencies` section of the manifest file in the `spark-gremlin` |
| jar. |
| |
| You may not like the idea that the Hadoop and Spark jars from the TinkerPop distribution differ from the versions in |
| your cluster. If so, just build TinkerPop from source with the corresponding dependencies changed in the various `pom.xml` |
| files (e.g. `spark-core_2.11-2.2.0-some-vendor.jar` instead of `spark-core_2.11-2.2.0.jar`). Of course, TinkerPop will |
| only build for exactly matching or slightly differing artifact versions. |