| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[giraphgraphcomputer]] |
| ==== GiraphGraphComputer |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>giraph-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:giraph-logo.png[width=100,float=left] link:http://giraph.apache.org[Giraph] is an Apache Software Foundation |
| project focused on OLAP-based graph processing. Giraph makes use of the distributed graph computing paradigm made |
| popular by Google's Pregel. In Giraph, developers write "vertex programs" that get executed at each vertex in |
| parallel. These programs communicate with one another in a bulk synchronous parallel (BSP) manner. This model aligns |
| with TinkerPop3's `GraphComputer` API. TinkerPop3 provides an implementation of `GraphComputer` that works for Giraph |
| called `GiraphGraphComputer`. Moreover, with TinkerPop3's <<mapreduce,MapReduce>>-framework, the standard |
| Giraph/Pregel model is extended to support an arbitrary number of MapReduce phases to aggregate and yield results |
| from the graph. Below are examples using `GiraphGraphComputer` from the <<gremlin-console,Gremlin-Console>>. |
| |
| WARNING: Giraph uses a large number of Hadoop counters. The default for Hadoop is 120. The limit can be increased |
| via the `mapreduce.job.counters.max` property in `mapred-site.xml`. A good value to use is 1000. This is a |
| cluster-wide property, so be sure to restart the cluster after updating. |
| |
| WARNING: The maximum number of workers can be no larger than the number of map slots in the Hadoop cluster minus 1. |
| For example, if the Hadoop cluster has 4 map slots, then `giraph.maxWorkers` cannot be larger than 3. One map slot |
| is reserved for the master compute node, and all other slots can be allocated as workers to execute the |
| VertexPrograms on the vertices of the graph. |
| |
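| The slot arithmetic in the warning above can be sketched as a small shell helper. The function name |
| `max_giraph_workers` and the `MAP_SLOTS` variable are hypothetical and exist only for illustration; the sketch |
| simply subtracts the one slot reserved for the master and prints a `giraph.maxWorkers` line. |
| |
```shell
# Hypothetical helper (not part of Giraph or TinkerPop): given the number of
# map slots in the cluster, print the largest permissible giraph.maxWorkers.
max_giraph_workers() {
  map_slots=$1
  # One slot is reserved for the master, so at least two slots are needed.
  if [ "$map_slots" -lt 2 ]; then
    echo "need at least 2 map slots (1 master + 1 worker)" >&2
    return 1
  fi
  echo $((map_slots - 1))
}

# MAP_SLOTS is an assumed value; substitute the figure for your own cluster.
MAP_SLOTS=4
echo "giraph.maxWorkers=$(max_giraph_workers "$MAP_SLOTS")"
```
| |
| With 4 map slots this prints `giraph.maxWorkers=3`, matching the example in the warning. |
| |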
| If `GiraphGraphComputer` is to be used as the `GraphComputer` for `HadoopGraph`, then its `lib` directory should be |
| specified in `HADOOP_GREMLIN_LIBS`. |
| |
| [source,shell] |
| export HADOOP_GREMLIN_LIBS=$HADOOP_GREMLIN_LIBS:/usr/local/gremlin-console/ext/giraph-gremlin/lib |
| |
| Alternatively, the directory can be specified from within the Gremlin Console. |
| |
| [source,groovy] |
| System.setProperty('HADOOP_GREMLIN_LIBS', [System.getProperty('HADOOP_GREMLIN_LIBS'), '/usr/local/gremlin-console/ext/giraph-gremlin/lib'].findAll().join(':')) |
| |
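| The shell `export` shown earlier prepends the current value of `HADOOP_GREMLIN_LIBS`, so when the variable is unset |
| or empty the result begins with a stray `:`. The following is a minimal sketch of a guard against that, using a |
| hypothetical helper function `append_gremlin_lib` that is not part of TinkerPop. |
| |
```shell
# Hypothetical helper: append a directory to HADOOP_GREMLIN_LIBS without
# leaving a leading ':' when the variable is unset or empty.
append_gremlin_lib() {
  # ${VAR:+word} expands to word only if VAR is set and non-empty.
  HADOOP_GREMLIN_LIBS="${HADOOP_GREMLIN_LIBS:+${HADOOP_GREMLIN_LIBS}:}$1"
  export HADOOP_GREMLIN_LIBS
}

append_gremlin_lib /usr/local/gremlin-console/ext/giraph-gremlin/lib
echo "$HADOOP_GREMLIN_LIBS"
```
| |
| The helper can be called once per directory; each call appends one more `:`-separated entry. |
| |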
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal().withComputer(GiraphGraphComputer) |
| g.V().count() |
| g.V().out().out().values('name') |
| ---- |
| |
| IMPORTANT: The examples above do not use lambdas (i.e. closures in Gremlin-Groovy). This makes the traversal |
| serializable and thus able to be distributed to all machines in the Hadoop cluster. If a lambda is required in a |
| traversal, then the traversal must be sent as a `String` and compiled locally at each machine in the cluster. The |
| following example demonstrates the `:remote` command, which allows Gremlin traversals to be submitted as a `String`. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal().withComputer(GiraphGraphComputer) |
| :remote connect tinkerpop.hadoop graph g |
| :> g.V().group().by{it.value('name')[1]}.by('name') |
| result |
| result.memory.runtime |
| ---- |
| |
| NOTE: If the user explicitly specifies `giraph.maxWorkers` and/or `giraph.numComputeThreads` in the configuration, |
| then these values will be used by Giraph. However, if these are not specified and the user never calls |
| `GraphComputer.workers()`, then `GiraphGraphComputer` will try to compute the number of workers/threads to use |
| based on the cluster's profile. |