blob: 5f5804c54114ba1b5d32345eb1b86dd4e5b10429 [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[giraphgraphcomputer]]
==== GiraphGraphComputer
[source,xml]
----
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>giraph-gremlin</artifactId>
<version>x.y.z</version>
</dependency>
----
image:giraph-logo.png[width=100,float=left] link:http://giraph.apache.org[Giraph] is an Apache Software Foundation
project focused on OLAP-based graph processing. Giraph makes use of the distributed graph computing paradigm made
popular by Google's Pregel. In Giraph, developers write "vertex programs" that get executed at each vertex in
parallel. These programs communicate with one another in a bulk synchronous parallel (BSP) manner. This model aligns
with TinkerPop3's `GraphComputer` API. TinkerPop3 provides an implementation of `GraphComputer` that works for Giraph
called `GiraphGraphComputer`. Moreover, with TinkerPop3's <<mapreduce,MapReduce>>-framework, the standard
Giraph/Pregel model is extended to support an arbitrary number of MapReduce phases to aggregate and yield results
from the graph. Below are examples using `GiraphGraphComputer` from the <<gremlin-console,Gremlin-Console>>.
WARNING: Giraph uses a large number of Hadoop counters. The default for Hadoop is 120. In `mapred-site.xml` it is
possible to increase the limit it via the `mapreduce.job.counters.max` property. A good value to use is 1000. This
is a cluster-wide property so be sure to restart the cluster after updating.
WARNING: The maximum number of workers can be no larger than the number of map-slots in the Hadoop cluster minus 1.
For example, if the Hadoop cluster has 4 map slots, then `giraph.maxWorkers` can not be larger than 3. One map-slot
is reserved for the master compute node and all other slots can be allocated as workers to execute the VertexPrograms
on the vertices of the graph.
If `GiraphGraphComputer` will be used as the `GraphComputer` for `HadoopGraph` then its `lib` directory should be
specified in `HADOOP_GREMLIN_LIBS`.
[source,shell]
export HADOOP_GREMLIN_LIBS=$HADOOP_GREMLIN_LIBS:/usr/local/gremlin-console/ext/giraph-gremlin/lib
Or, the user can specify the directory in the Gremlin Console.
[source,groovy]
System.setProperty('HADOOP_GREMLIN_LIBS',System.getProperty('HADOOP_GREMLIN_LIBS') + ':' + '/usr/local/gremlin-console/ext/giraph-gremlin/lib')
[gremlin-groovy]
----
graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
g = graph.traversal().withComputer(GiraphGraphComputer)
g.V().count()
g.V().out().out().values('name')
----
IMPORTANT: The examples above do not use lambdas (i.e. closures in Gremlin-Groovy). This makes the traversal
serializable and thus, able to be distributed to all machines in the Hadoop cluster. If a lambda is required in a
traversal, then the traversal must be sent as a `String` and compiled locally at each machine in the cluster. The
following example demonstrates the `:remote` command which allows for submitting Gremlin traversals as a `String`.
[gremlin-groovy]
----
graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
g = graph.traversal().withComputer(GiraphGraphComputer)
:remote connect tinkerpop.hadoop graph g
:> g.V().group().by{it.value('name')[1]}.by('name')
result
result.memory.runtime
----
NOTE: If the user explicitly specifies `giraph.maxWorkers` and/or `giraph.numComputeThreads` in the configuration,
then these values will be used by Giraph. However, if these are not specified and the user never calls
`GraphComputer.workers()` then `GiraphGraphComputer` will try to compute the number of workers/threads to use based
on the cluster's profile.