| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[hadoop-gremlin]] |
| Hadoop-Gremlin |
| -------------- |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>hadoop-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:hadoop-logo-notext.png[width=100,float=left] link:http://hadoop.apache.org/[Hadoop] is a distributed |
| computing framework that is used to process data represented across a multi-machine compute cluster. When the |
| data in the Hadoop cluster represents a TinkerPop3 graph, then Hadoop-Gremlin can be used to process the graph |
| using both TinkerPop3's OLTP and OLAP graph computing models. |
| |
| IMPORTANT: This section assumes that the user has a functioning Hadoop 2.x cluster. For more information on getting |
| started with Hadoop, please see the |
| link:http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html[Single Node Setup] |
| tutorial. Moreover, if using `GiraphGraphComputer` or `SparkGraphComputer`, it is advisable that readers also |
| familiarize themselves with Giraph (link:http://giraph.apache.org/quick_start.html[Getting Started]) and Spark |
| (link:http://spark.apache.org/docs/latest/quick-start.html[Quick Start]). |
| |
| Installing Hadoop-Gremlin |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| If using <<gremlin-console,Gremlin Console>>, it is important to install the Hadoop-Gremlin plugin. Note that |
| Hadoop-Gremlin requires a Gremlin Console restart after installing. |
| |
| [source,text] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| gremlin> :install org.apache.tinkerpop hadoop-gremlin x.y.z |
| ==>loaded: [org.apache.tinkerpop, hadoop-gremlin, x.y.z] - restart the console to use [tinkerpop.hadoop] |
| gremlin> :q |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| gremlin> :plugin use tinkerpop.hadoop |
| ==>tinkerpop.hadoop activated |
| gremlin> |
| ---- |
| |
| It is important that the `CLASSPATH` environment variable references `HADOOP_CONF_DIR` and that the configuration |
| files in `HADOOP_CONF_DIR` contain references to a live Hadoop cluster. It is easy to verify a proper configuration |
| from within the Gremlin Console. If `hdfs` references the local file system, then there is a configuration issue. |
| |
| [source,text] |
| ---- |
| gremlin> hdfs |
| ==>storage[org.apache.hadoop.fs.LocalFileSystem@65bb9029] // BAD |
| |
| gremlin> hdfs |
| ==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1229457199_1, ugi=user (auth:SIMPLE)]]] // GOOD |
| ---- |
| |
| The `HADOOP_GREMLIN_LIBS` variable references locations that contain jars that should be uploaded to the respective |
| distributed cache (link:http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html[YARN] or SparkServer). |
| Note that `HADOOP_GREMLIN_LIBS` can hold several colon-separated (`:`) locations, and all jars from all locations will |
| be loaded into the cluster. Typically, only the jars of the respective `GraphComputer` are required to be loaded (e.g. |
| the `GiraphGraphComputer` plugin lib directory). |
| |
| [source,shell] |
| export HADOOP_GREMLIN_LIBS=/usr/local/gremlin-console/ext/giraph-gremlin/lib |
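
As a rough illustration of how a colon-separated value resolves to a set of jars, the following Python sketch splits the value and collects the `*.jar` files found in each location. The helper name `jars_in_locations` is hypothetical, and the sketch only mirrors the behavior described above; it is not how Hadoop-Gremlin itself is implemented.

```python
from pathlib import Path


def jars_in_locations(hadoop_gremlin_libs):
    """Collect the *.jar files found in each colon-separated location.

    Hypothetical helper for illustration only; it is not part of TinkerPop.
    """
    jars = []
    for location in hadoop_gremlin_libs.split(":"):
        if not location:
            continue  # tolerate empty entries such as a trailing ':'
        directory = Path(location)
        if directory.is_dir():
            # Sort so the result is stable regardless of file system order.
            jars.extend(sorted(directory.glob("*.jar")))
    return jars
```

Passing the current value of `HADOOP_GREMLIN_LIBS` to such a helper is a quick way to check which jars a given setting would pick up before submitting a job.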
| |
| Properties Files |
| ~~~~~~~~~~~~~~~~ |
| |
| `HadoopGraph` makes use of properties files which ultimately get turned into Apache configurations and/or |
| Hadoop configurations. The example properties file presented below is located at `conf/hadoop/hadoop-gryo.properties`. |
| |
| [source,text] |
| gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph |
| gremlin.hadoop.inputLocation=tinkerpop-modern.kryo |
| gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat |
| gremlin.hadoop.outputLocation=output |
| gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat |
| gremlin.hadoop.jarsInDistributedCache=true |
| #################################### |
| # Spark Configuration # |
| #################################### |
| spark.master=local[4] |
| spark.executor.memory=1g |
| spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer |
| #################################### |
| # SparkGraphComputer Configuration # |
| #################################### |
| gremlin.spark.graphInputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.InputRDDFormat |
| gremlin.spark.graphOutputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.OutputRDDFormat |
| gremlin.spark.persistContext=true |
| ##################################### |
| # GiraphGraphComputer Configuration # |
| ##################################### |
| giraph.minWorkers=2 |
| giraph.maxWorkers=2 |
| giraph.useOutOfCoreGraph=true |
| giraph.useOutOfCoreMessages=true |
| mapreduce.map.java.opts=-Xmx1024m |
| mapreduce.reduce.java.opts=-Xmx1024m |
| giraph.numInputThreads=2 |
| giraph.numComputeThreads=2 |
| |
| A review of the Hadoop-Gremlin specific properties is provided in the table below. For the respective OLAP |
| engines (<<sparkgraphcomputer,`SparkGraphComputer`>> or <<giraphgraphcomputer,`GiraphGraphComputer`>>) refer |
| to their respective documentation for configuration options. |
| |
| [width="100%",cols="2,10",options="header"] |
| |========================================================= |
| |Property |Description |
| |gremlin.graph |The class of the graph to construct using GraphFactory. |
| |gremlin.hadoop.inputLocation |The location of the input file(s) for Hadoop-Gremlin to read the graph from. |
| |gremlin.hadoop.graphInputFormat |The format that the graph input file(s) are represented in. |
| |gremlin.hadoop.outputLocation |The location to write the computed HadoopGraph to. |
| |gremlin.hadoop.graphOutputFormat |The format that the output file(s) should be represented in. |
| |gremlin.hadoop.jarsInDistributedCache |Whether to upload the Hadoop-Gremlin jars to a distributed cache (necessary if jars are not on the machines' classpaths). |
| |========================================================= |
| |
| Along with the properties above, the numerous link:http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml[Hadoop specific properties] |
| can be added as needed to tune and parameterize the executed Hadoop-Gremlin job on the respective Hadoop cluster. |
| |
| IMPORTANT: As the graphs being processed grow large, it is important to fully understand how the |
| underlying OLAP engine (e.g. Spark, Giraph, etc.) works and to understand the numerous parameterizations offered by |
| these systems. Such knowledge can help alleviate out-of-memory exceptions, slow load times, slow processing times, |
| garbage collection issues, etc. |
| |
| OLTP Hadoop-Gremlin |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| image:hadoop-pipes.png[width=180,float=left] It is possible to execute OLTP operations over a `HadoopGraph`. |
| However, the underlying HDFS files are not random access, and thus retrieving a vertex requires a linear scan. |
| OLTP operations are useful for peeking into the graph prior to executing a long-running OLAP job |
| (e.g. `g.V().valueMap().limit(10)`). |
| |
| WARNING: OLTP operations on `HadoopGraph` are not efficient. They require linear scans to execute and are unreasonable |
| for large graphs. In such large graph situations, make use of <<traversalvertexprogram,TraversalVertexProgram>> |
| which is the OLAP Gremlin machine. |
| |
| [gremlin-groovy] |
| ---- |
| hdfs.copyFromLocal('data/tinkerpop-modern.kryo', 'tinkerpop-modern.kryo') |
| hdfs.ls() |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal() |
| g.V().count() |
| g.V().out().out().values('name') |
| g.V().group().by{it.value('name')[1]}.by('name').next() |
| ---- |
| |
| OLAP Hadoop-Gremlin |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| image:hadoop-furnace.png[width=180,float=left] Hadoop-Gremlin was designed to execute OLAP operations via |
| `GraphComputer`. The OLTP examples presented previously are reproduced below, but using `TraversalVertexProgram` |
| for the execution of the Gremlin traversal. |
| |
| A `Graph` in TinkerPop3 can support any number of `GraphComputer` implementations. Out of the box, Hadoop-Gremlin |
| supports the following three implementations. |
| |
| * <<mapreducegraphcomputer,`MapReduceGraphComputer`>>: Leverages Hadoop's MapReduce engine to execute TinkerPop3 OLAP |
| computations. (*coming soon*) |
| ** The graph must fit within the total disk space of the Hadoop cluster (supports massive graphs). Message passing is |
| coordinated via MapReduce jobs over the on-disk graph (slow traversals). |
| * <<sparkgraphcomputer,`SparkGraphComputer`>>: Leverages Apache Spark to execute TinkerPop3 OLAP computations. |
| ** The graph may fit within the total RAM of the cluster (supports larger graphs). Message passing is coordinated via |
| Spark map/reduce/join operations on in-memory and disk-cached data (average speed traversals). |
| * <<giraphgraphcomputer,`GiraphGraphComputer`>>: Leverages Apache Giraph to execute TinkerPop3 OLAP computations. |
| ** The graph should fit within the total RAM of the Hadoop cluster (graph size restriction), though "out-of-core" |
| processing is possible. Message passing is coordinated via ZooKeeper for the in-memory graph (speedy traversals). |
| |
| TIP: image:gremlin-sugar.png[width=50,float=left] For those wanting to use the <<sugar-plugin,SugarPlugin>> with |
| their submitted traversal, do `:remote config useSugar true` as well as `:plugin use tinkerpop.sugar` at the start of |
| the Gremlin Console session if it is not already activated. |
| |
| Note that `SparkGraphComputer` and `GiraphGraphComputer` are loaded via their respective plugins. Typically only |
| one plugin or the other is loaded depending on the desired `GraphComputer` to use. |
| |
| [source,text] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| plugin activated: tinkerpop.hadoop |
| gremlin> :install org.apache.tinkerpop giraph-gremlin x.y.z |
| ==>loaded: [org.apache.tinkerpop, giraph-gremlin, x.y.z] - restart the console to use [tinkerpop.giraph] |
| gremlin> :install org.apache.tinkerpop spark-gremlin x.y.z |
| ==>loaded: [org.apache.tinkerpop, spark-gremlin, x.y.z] - restart the console to use [tinkerpop.spark] |
| gremlin> :q |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| plugin activated: tinkerpop.hadoop |
| gremlin> :plugin use tinkerpop.giraph |
| ==>tinkerpop.giraph activated |
| gremlin> :plugin use tinkerpop.spark |
| ==>tinkerpop.spark activated |
| ---- |
| |
| WARNING: Hadoop, Spark, and Giraph all depend on many of the same libraries (e.g. ZooKeeper, Snappy, Netty, Guava, |
| etc.). Unfortunately, they typically depend on different versions of those libraries. As such, |
| it is best to *not* have both the Spark and Giraph plugins loaded in the same console session or in the same Java |
| project (though intelligent `<exclusion>`-usage can help alleviate conflicts in a Java project). |
| |
| WARNING: It is important to note that when doing an OLAP traversal, any resulting vertices, edges, or properties will be |
| attached to the source graph. For Hadoop-based graphs, this may lead to linear search times on massive graphs. Thus, |
| if vertex, edge, or property objects are to be returned (as a final result), it is best to call `.id()` to get the id |
| of the object and not the actual attached object. |
| |
| [[mapreducegraphcomputer]] |
| MapReduceGraphComputer |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| *COMING SOON* |
| |
| [[sparkgraphcomputer]] |
| SparkGraphComputer |
| ^^^^^^^^^^^^^^^^^^ |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>spark-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:spark-logo.png[width=175,float=left] link:http://spark.apache.org[Spark] is an Apache Software Foundation |
| project focused on general-purpose OLAP data processing. Spark provides a hybrid in-memory/disk-based distributed |
| computing model that is similar to Hadoop's MapReduce model. Spark maintains a fluent function chaining DSL that is |
| arguably easier for developers to work with than native Hadoop MapReduce. Spark-Gremlin provides an implementation of |
| the bulk-synchronous parallel, distributed message passing algorithm within Spark and thus, any `VertexProgram` can be |
| executed over `SparkGraphComputer`. |
| |
| If `SparkGraphComputer` will be used as the `GraphComputer` for `HadoopGraph` then its `lib` directory should be |
| specified in `HADOOP_GREMLIN_LIBS`. |
| |
| [source,shell] |
| export HADOOP_GREMLIN_LIBS=$HADOOP_GREMLIN_LIBS:/usr/local/gremlin-console/ext/spark-gremlin/lib |
| |
| Furthermore, the `lib/` directory should be distributed across all machines in the SparkServer cluster. For this purpose TinkerPop |
| provides a helper script, which takes the Spark installation directory and the Spark machines as input: |
| |
| [source,shell] |
| bin/hadoop/init-tp-spark.sh /usr/local/spark spark@10.0.0.1 spark@10.0.0.2 spark@10.0.0.3 |
| |
| Once the `lib/` directory is distributed, `SparkGraphComputer` can be used as follows. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(SparkGraphComputer)) |
| g.V().count() |
| g.V().out().out().values('name') |
| ---- |
| |
| To use lambdas in Gremlin-Groovy, simply provide `:remote connect` with a `TraversalSource` that leverages `SparkGraphComputer`. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(SparkGraphComputer)) |
| :remote connect tinkerpop.hadoop graph g |
| :> g.V().group().by{it.value('name')[1]}.by('name') |
| ---- |
| |
| The `SparkGraphComputer` algorithm leverages Spark's caching abilities to reduce the amount of data shuffled across |
| the wire on each iteration of the <<vertexprogram,`VertexProgram`>>. When the graph is loaded as a Spark RDD |
| (Resilient Distributed Dataset) it is immediately cached as `graphRDD`. The `graphRDD` is a distributed adjacency |
| list which encodes the vertex, its properties, and all its incident edges. On the first iteration, each vertex |
| (in parallel) is passed through `VertexProgram.execute()`. This yields an output of the vertex's mutated state |
| (i.e. updated compute keys -- `propertyX`) and its outgoing messages. This `viewOutgoingRDD` is then reduced to |
| `viewIncomingRDD` where the outgoing messages are sent to their respective vertices. If a `MessageCombiner` exists |
| for the vertex program, then messages are aggregated locally and globally to ultimately yield one incoming message |
| for the vertex. This reduce sequence is the "message pass." If the vertex program does not terminate on this |
| iteration, then the `viewIncomingRDD` is joined with the cached `graphRDD` and the process continues. When there |
| are no more iterations, there is a final join and the resultant RDD is stripped of its edges and messages. This |
| `mapReduceRDD` is cached and is processed by each <<mapreduce,`MapReduce`>> job in the |
| <<graphcomputer,`GraphComputer`>> computation. |
| |
| image::spark-algorithm.png[width=775] |
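
The iteration cycle described above can be mimicked on a single machine with plain Python dictionaries standing in for the RDDs. This is only a toy sketch of the bulk-synchronous message pass, not Spark code and not the `SparkGraphComputer` implementation; the names `execute`, `combine`, and `run`, as well as the counting vertex program itself, are invented for illustration. The edges are those of the modern toy graph used elsewhere in this section.

```python
# graphRDD stand-in: vertex id -> ids of the vertices its outgoing edges point to
# (the edges of the TinkerPop "modern" toy graph).
GRAPH = {1: [2, 4, 3], 2: [], 3: [], 4: [3, 5], 5: [], 6: [3]}


def execute(out_edges, incoming, iteration):
    """Toy vertex program: return (new vertex state, outgoing messages)."""
    # Every vertex starts with a count of 1; afterwards its state is the
    # combined message it received in the previous message pass.
    count = 1 if iteration == 0 else incoming
    messages = [(target, count) for target in out_edges] if count else []
    return count, messages


def combine(a, b):
    """MessageCombiner stand-in: sum the messages bound for one vertex."""
    return a + b


def run(graph, iterations):
    incoming = {}  # viewIncomingRDD stand-in: vertex id -> combined message
    state = {}
    for iteration in range(iterations):
        state, outgoing = {}, []  # outgoing plays the role of viewOutgoingRDD
        for vertex, out_edges in graph.items():
            # "Join" the graph with the incoming view, then run the program.
            state[vertex], messages = execute(
                out_edges, incoming.get(vertex, 0), iteration)
            outgoing.extend(messages)
        # The "message pass": reduce the outgoing messages by target vertex.
        incoming = {}
        for target, message in outgoing:
            if target in incoming:
                incoming[target] = combine(incoming[target], message)
            else:
                incoming[target] = message
    return state
```

After `k + 1` iterations each vertex holds the number of `k`-step paths that end at it. For example, `run(GRAPH, 3)` leaves a count of 1 on vertices 3 and 5 and 0 everywhere else, which corresponds to the two results of `g.V().out().out()` on the same graph.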
| |
| [width="100%",cols="2,10",options="header"] |
| |======================================================== |
| |Property |Description |
|gremlin.spark.graphInputRDD |A class for creating RDDs from underlying graph data, defaults to Hadoop `InputFormat`. |
|gremlin.spark.graphOutputRDD |A class for output RDDs, defaults to Hadoop `OutputFormat`. |
|gremlin.spark.graphStorageLevel |What `StorageLevel` to use for the cached graph during job execution (default `MEMORY_ONLY`). |
|gremlin.spark.persistContext |Whether to create a new `SparkContext` for every `SparkGraphComputer` or to reuse an existing one. |
|gremlin.spark.persistStorageLevel |What `StorageLevel` to use when persisting RDDs via `PersistedOutputRDD` (default `MEMORY_ONLY`). |
| |======================================================== |
| |
| InputRDD and OutputRDD |
| ++++++++++++++++++++++ |
| |
| If the provider/user does not want to use Hadoop `InputFormats`, it is possible to leverage Spark's RDD |
| constructs directly. There is a `gremlin.spark.graphInputRDD` configuration that references a `Class<? extends |
| InputRDD>`. An `InputRDD` provides a read method that takes a `SparkContext` and returns a graphRDD. Likewise, use |
| `gremlin.spark.graphOutputRDD` and the respective `OutputRDD`. |
| |
| If the graph system provider uses an `InputRDD`, the RDD should maintain an associated `org.apache.spark.Partitioner`. By doing so, |
| `SparkGraphComputer` will not partition the loaded graph across the cluster as it has already been partitioned by the graph system provider. |
| This can save a significant amount of time and space resources. |
| If the `InputRDD` does not have a registered partitioner, `SparkGraphComputer` will partition the graph using |
| an `org.apache.spark.HashPartitioner` with the number of partitions being either the number of existing partitions in the input (e.g. input splits) |
| or the user-specified number of `GraphComputer.workers()`. |
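
To give a feel for what hash partitioning does to a set of vertices, here is a small Python sketch. It does not call Spark; both function names are hypothetical, and the precedence used in `choose_partition_count` (an explicit worker count wins over the input's partition count) is an assumption made for this example rather than something the text above specifies.

```python
def choose_partition_count(input_partitions, workers=None):
    """Pick the partition count: explicit workers if given, else the input's.

    The precedence (workers first) is an assumption made for this sketch.
    """
    return workers if workers is not None else input_partitions


def hash_partition(vertex_ids, num_partitions):
    """Assign each vertex id to partition hash(id) mod num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for vertex_id in vertex_ids:
        partitions[hash(vertex_id) % num_partitions].append(vertex_id)
    return partitions
```

With the six vertex ids of the toy graph and two partitions, even ids land in one partition and odd ids in the other. A provider-supplied partitioner could instead keep neighboring vertices together, which is why an already partitioned `InputRDD` is left untouched.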
| |
| Storage Levels |
| ++++++++++++++ |
| |
| The `SparkGraphComputer` uses `MEMORY_ONLY` to cache the input graph and the output graph by default. Users should be aware of the impact of |
| different storage levels, since the default settings can quickly lead to memory issues on larger graphs. An overview of Spark's persistence |
| settings is provided in link:http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence[Spark's programming guide]. |
| |
| |
| Using a Persisted Context |
| +++++++++++++++++++++++++ |
| |
| It is possible to persist the graph RDD between jobs within the `SparkContext` (e.g. SparkServer) by leveraging `PersistedOutputRDD`. |
| Note that `gremlin.spark.persistContext` should be set to `true` or else the persisted RDD will be destroyed when the `SparkContext` closes. |
| The persisted RDD is named by the `gremlin.hadoop.outputLocation` configuration. Similarly, `PersistedInputRDD` is used with respective |
| `gremlin.hadoop.inputLocation` to retrieve the persisted RDD from the `SparkContext`. |
| |
| When using a persistent `SparkContext`, the configuration used by the original Spark configuration will be inherited by all threaded |
| references to that `SparkContext`. The exceptions to this rule are those properties which have a specific thread-local effect. |
| |
| .Thread Local Properties |
| . spark.jobGroup.id |
| . spark.job.description |
| . spark.job.interruptOnCancel |
| . spark.scheduler.pool |
| |
| Finally, there is a `spark` object that can be used to manage persisted RDDs (see <<interacting-with-spark, Interacting with Spark>>). |
| |
| [[bulkdumpervertexprogramusingspark]] |
| Exporting with BulkDumperVertexProgram |
| ++++++++++++++++++++++++++++++++++++++ |
| |
| The <<bulkdumpervertexprogram, BulkDumperVertexProgram>> exports a whole graph in any of the supported Hadoop GraphOutputFormats (`GraphSONOutputFormat`, |
| `GryoOutputFormat` or `ScriptOutputFormat`). The example below takes a Hadoop graph as the input (in `GryoInputFormat`) and exports it as a GraphSON file |
| (`GraphSONOutputFormat`). |
| |
| [gremlin-groovy] |
| ---- |
| hdfs.copyFromLocal('data/tinkerpop-modern.kryo', 'tinkerpop-modern.kryo') |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| graph.configuration().setProperty('gremlin.hadoop.graphOutputFormat', 'org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat') |
| graph.compute(SparkGraphComputer).program(BulkDumperVertexProgram.build().create()).submit().get() |
| hdfs.ls('output') |
| hdfs.head('output/~g') |
| ---- |
| |
| Loading with BulkLoaderVertexProgram |
| ++++++++++++++++++++++++++++++++++++ |
| |
| The <<bulkloadervertexprogram, BulkLoaderVertexProgram>> is a generalized bulk loader that can be used to load large |
| amounts of data to and from different `Graph` implementations. The following code demonstrates how to load the |
| Grateful Dead graph from HadoopGraph into TinkerGraph over Spark: |
| |
| [gremlin-groovy] |
| ---- |
| hdfs.copyFromLocal('data/grateful-dead.kryo', 'grateful-dead.kryo') |
| readGraph = GraphFactory.open('conf/hadoop/hadoop-grateful-gryo.properties') |
| writeGraph = 'conf/tinkergraph-gryo.properties' |
| blvp = BulkLoaderVertexProgram.build(). |
| keepOriginalIds(false). |
| writeGraph(writeGraph).create(readGraph) |
| readGraph.compute(SparkGraphComputer).workers(1).program(blvp).submit().get() |
| :set max-iteration 10 |
| graph = GraphFactory.open(writeGraph) |
| g = graph.traversal() |
| g.V().valueMap() |
| graph.close() |
| ---- |
| |
| [source,properties] |
| ---- |
| # hadoop-grateful-gryo.properties |
| |
| # |
| # Hadoop Graph Configuration |
| # |
| gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph |
| gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat |
| gremlin.hadoop.inputLocation=grateful-dead.kryo |
| gremlin.hadoop.outputLocation=output |
| gremlin.hadoop.jarsInDistributedCache=true |
| |
| # |
| # SparkGraphComputer Configuration |
| # |
| spark.master=local[1] |
| spark.executor.memory=1g |
| spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer |
| ---- |
| |
| [source,properties] |
| ---- |
| # tinkergraph-gryo.properties |
| |
| gremlin.graph=org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph |
| gremlin.tinkergraph.graphFormat=gryo |
| gremlin.tinkergraph.graphLocation=/tmp/tinkergraph.kryo |
| ---- |
| |
| IMPORTANT: The path to TinkerGraph jars needs to be included in the `HADOOP_GREMLIN_LIBS` for the above example to work. |
| |
| [[giraphgraphcomputer]] |
| GiraphGraphComputer |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>giraph-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:giraph-logo.png[width=100,float=left] link:http://giraph.apache.org[Giraph] is an Apache Software Foundation |
| project focused on OLAP-based graph processing. Giraph makes use of the distributed graph computing paradigm made |
| popular by Google's Pregel. In Giraph, developers write "vertex programs" that get executed at each vertex in |
| parallel. These programs communicate with one another in a bulk synchronous parallel (BSP) manner. This model aligns |
| with TinkerPop3's `GraphComputer` API. TinkerPop3 provides an implementation of `GraphComputer` that works for Giraph |
| called `GiraphGraphComputer`. Moreover, with TinkerPop3's <<mapreduce,MapReduce>>-framework, the standard |
| Giraph/Pregel model is extended to support an arbitrary number of MapReduce phases to aggregate and yield results |
| from the graph. Below are examples using `GiraphGraphComputer` from the <<gremlin-console,Gremlin-Console>>. |
| |
| WARNING: Giraph uses a large number of Hadoop counters. The default for Hadoop is 120. In `mapred-site.xml` it is |
| possible to increase the limit via the `mapreduce.job.counters.max` property. A good value to use is 1000. This |
| is a cluster-wide property, so be sure to restart the cluster after updating. |
| |
| WARNING: The maximum number of workers can be no larger than the number of map slots in the Hadoop cluster minus 1. |
| For example, if the Hadoop cluster has 4 map slots, then `giraph.maxWorkers` cannot be larger than 3. One map slot |
| is reserved for the master compute node and all other slots can be allocated as workers to execute the VertexPrograms |
| on the vertices of the graph. |
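
The arithmetic behind this limit is simple enough to check before submitting a job. The following Python sketch encodes the rule stated above; the function names are hypothetical, and the number of map slots has to be supplied by the user because nothing here queries the cluster.

```python
def max_giraph_workers(map_slots):
    """Largest worker count: one map slot is reserved for the master."""
    if map_slots < 2:
        raise ValueError("need at least 2 map slots: 1 master + 1 worker")
    return map_slots - 1


def check_max_workers(map_slots, requested_max_workers):
    """Return True if the requested giraph.maxWorkers value fits the cluster."""
    return 1 <= requested_max_workers <= max_giraph_workers(map_slots)
```

For the 4-slot cluster in the example, `max_giraph_workers(4)` is 3, so a `giraph.maxWorkers` value of 4 would be rejected by `check_max_workers`.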
| |
| If `GiraphGraphComputer` will be used as the `GraphComputer` for `HadoopGraph` then its `lib` directory should be |
| specified in `HADOOP_GREMLIN_LIBS`. |
| |
| [source,shell] |
| export HADOOP_GREMLIN_LIBS=$HADOOP_GREMLIN_LIBS:/usr/local/gremlin-console/ext/giraph-gremlin/lib |
| |
| Or, the user can specify the directory in the Gremlin Console. |
| |
| [source,groovy] |
| System.setProperty('HADOOP_GREMLIN_LIBS',System.getProperty('HADOOP_GREMLIN_LIBS') + ':' + '/usr/local/gremlin-console/ext/giraph-gremlin/lib') |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(GiraphGraphComputer)) |
| g.V().count() |
| g.V().out().out().values('name') |
| ---- |
| |
| IMPORTANT: The examples above do not use lambdas (i.e. closures in Gremlin-Groovy). This makes the traversal |
| serializable and thus, able to be distributed to all machines in the Hadoop cluster. If a lambda is required in a |
| traversal, then the traversal must be sent as a `String` and compiled locally at each machine in the cluster. The |
| following example demonstrates the `:remote` command which allows for submitting Gremlin traversals as a `String`. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(GiraphGraphComputer)) |
| :remote connect tinkerpop.hadoop graph g |
| :> g.V().group().by{it.value('name')[1]}.by('name') |
| result |
| result.memory.runtime |
| result.memory.keys() |
| result.memory.get('~reducing') |
| ---- |
| |
| NOTE: If the user explicitly specifies `giraph.maxWorkers` and/or `giraph.numComputeThreads` in the configuration, |
| then these values will be used by Giraph. However, if these are not specified and the user never calls |
| `GraphComputer.workers()` then `GiraphGraphComputer` will try to compute the number of workers/threads to use based |
| on the cluster's profile. |
| |
| Loading with BulkLoaderVertexProgram |
| ++++++++++++++++++++++++++++++++++++ |
| |
| The <<bulkloadervertexprogram, BulkLoaderVertexProgram>> is a generalized bulk loader that can be used to load |
| large amounts of data to and from different `Graph` implementations. The following code demonstrates how to load |
| the Grateful Dead graph from HadoopGraph into TinkerGraph over Giraph: |
| |
| [gremlin-groovy] |
| ---- |
| hdfs.copyFromLocal('data/grateful-dead.kryo', 'grateful-dead.kryo') |
| readGraph = GraphFactory.open('conf/hadoop/hadoop-grateful-gryo.properties') |
| writeGraph = 'conf/tinkergraph-gryo.properties' |
| blvp = BulkLoaderVertexProgram.build(). |
| keepOriginalIds(false). |
| writeGraph(writeGraph).create(readGraph) |
| readGraph.compute(GiraphGraphComputer).workers(1).program(blvp).submit().get() |
| :set max-iteration 10 |
| graph = GraphFactory.open(writeGraph) |
| g = graph.traversal() |
| g.V().valueMap() |
| graph.close() |
| ---- |
| |
| [source,properties] |
| ---- |
| # hadoop-grateful-gryo.properties |
| |
| # |
| # Hadoop Graph Configuration |
| # |
| gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph |
| gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat |
| gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat |
| gremlin.hadoop.inputLocation=grateful-dead.kryo |
| gremlin.hadoop.outputLocation=output |
| gremlin.hadoop.jarsInDistributedCache=true |
| |
| # |
| # GiraphGraphComputer Configuration |
| # |
| giraph.minWorkers=1 |
| giraph.maxWorkers=1 |
| giraph.useOutOfCoreGraph=true |
| giraph.useOutOfCoreMessages=true |
| mapred.map.child.java.opts=-Xmx1024m |
| mapred.reduce.child.java.opts=-Xmx1024m |
| giraph.numInputThreads=4 |
| giraph.numComputeThreads=4 |
| giraph.maxMessagesInMemory=100000 |
| ---- |
| |
| [source,properties] |
| ---- |
| # tinkergraph-gryo.properties |
| |
| gremlin.graph=org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph |
| gremlin.tinkergraph.graphFormat=gryo |
| gremlin.tinkergraph.graphLocation=/tmp/tinkergraph.kryo |
| ---- |
| |
| NOTE: The path to TinkerGraph jars needs to be included in the `HADOOP_GREMLIN_LIBS` for the above example to work. |
| |
| Input/Output Formats |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| image:adjacency-list.png[width=300,float=right] Hadoop-Gremlin provides various I/O formats -- i.e. Hadoop |
| `InputFormat` and `OutputFormat`. All of the formats make use of an link:http://en.wikipedia.org/wiki/Adjacency_list[adjacency list] |
| representation of the graph where each "row" represents a single vertex, its properties, and its incoming and |
| outgoing edges. |
| |
| {empty} + |
| |
| [[gryo-io-format]] |
| Gryo I/O Format |
| ^^^^^^^^^^^^^^^ |
| |
| * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat` |
| * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat` |
| |
| <<gryo-reader-writer,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo] |
| to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time |
| savings over text-based representations. |
| |
| NOTE: The `GryoInputFormat` is splittable. |
| |
| [[graphson-io-format]] |
| GraphSON I/O Format |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat` |
| * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat` |
| |
| <<graphson-reader-writer,GraphSON>> is a JSON-based graph format. GraphSON is a space-expensive graph format because |
| it is text-based. However, it is convenient for many developers to work with as its structure is |
| simple (easy to create and parse). |
| |
| The data below represents an adjacency list representation of the classic TinkerGraph toy graph in GraphSON format. |
| |
| [source,json] |
| ---- |
| {"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}} |
| {"id":2,"label":"person","inE":{"knows":[{"id":7,"outV":1,"properties":{"weight":0.5}}]},"properties":{"name":[{"id":2,"value":"vadas"}],"age":[{"id":3,"value":27}]}} |
| {"id":3,"label":"software","inE":{"created":[{"id":9,"outV":1,"properties":{"weight":0.4}},{"id":11,"outV":4,"properties":{"weight":0.4}},{"id":12,"outV":6,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":4,"value":"lop"}],"lang":[{"id":5,"value":"java"}]}} |
| {"id":4,"label":"person","inE":{"knows":[{"id":8,"outV":1,"properties":{"weight":1.0}}]},"outE":{"created":[{"id":10,"inV":5,"properties":{"weight":1.0}},{"id":11,"inV":3,"properties":{"weight":0.4}}]},"properties":{"name":[{"id":6,"value":"josh"}],"age":[{"id":7,"value":32}]}} |
| {"id":5,"label":"software","inE":{"created":[{"id":10,"outV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":8,"value":"ripple"}],"lang":[{"id":9,"value":"java"}]}} |
| {"id":6,"label":"person","outE":{"created":[{"id":12,"inV":3,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":10,"value":"peter"}],"age":[{"id":11,"value":35}]}} |
| ---- |
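| |
| Because each line is a self-contained JSON object, a single vertex can be inspected with any JSON parser. The |
| following sketch uses Groovy's `JsonSlurper` on the first line of the sample above and walks its `outE` map. It does |
| not involve `GraphSONInputFormat` and is only meant to make the layout concrete. |
| |
| [source,groovy] |
| ---- |
| import groovy.json.JsonSlurper |
| |
| // First line of the sample above: vertex 1 ("marko") with its outgoing edges. |
| def line = '{"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],' + |
|            '"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},' + |
|            '"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}}' |
| |
| def vertex = new JsonSlurper().parseText(line) |
| |
| // Each property key maps to a list of {id, value} entries. |
| def name = vertex['properties']['name'][0]['value'] |
| |
| // Outgoing edges are grouped by edge label; every entry names the adjacent vertex in "inV". |
| def edges = [] |
| vertex['outE'].each { label, list -> |
|     list.each { e -> edges << [label: label, inV: e['inV'], weight: e['properties']['weight']] } |
| } |
| |
| println "${name} (${vertex['label']}) has ${edges.size()} outgoing edges" |
| edges.each { println "  ${it.label} -> ${it.inV} (weight ${it.weight})" } |
| ---- |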
| |
| [[script-io-format]] |
| Script I/O Format |
| ^^^^^^^^^^^^^^^^^ |
| |
| * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat` |
| * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptOutputFormat` |
| |
| `ScriptInputFormat` and `ScriptOutputFormat` take an arbitrary script and use that script to either read or write |
| `Vertex` objects, respectively. This can be considered the most general `InputFormat`/`OutputFormat` possible in that |
| Hadoop-Gremlin uses the user provided script for all reading/writing. |
| |
| ScriptInputFormat |
| +++++++++++++++++ |
| |
| The data below is an adjacency list representation of the classic TinkerGraph toy graph. The first line reads, |
| "vertex `1`, labeled `person` having 2 property values (`marko` and `29`) has 3 outgoing edges; the first edge is |
| labeled `knows`, connects the current vertex `1` with vertex `2` and has a property value `0.5`, and so on." |
| |
| [source] |
| 1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4 |
| 2:person:vadas:27 |
| 3:project:lop:java |
| 4:person:josh:32 created:3:0.4,created:5:1.0 |
| 5:project:ripple:java |
| 6:person:peter:35 created:3:0.2 |
| |
| There is no corresponding `InputFormat` that can parse this particular file (or some adjacency list variant of it). |
| As such, `ScriptInputFormat` can be used. With `ScriptInputFormat` a script is stored in HDFS and leveraged by each |
| mapper in the Hadoop job. The script must have the following method defined: |
| |
| [source,groovy] |
| def parse(String line, ScriptElementFactory factory) { ... } |
| |
| `ScriptElementFactory` is a legacy from previous versions and, although it's still functional, it should no longer be used. |
| In order to create vertices and edges, the `parse()` method gets access to a global variable named `graph`, which holds |
| the local `StarGraph` for the current line/vertex. |
| |
| An appropriate `parse()` for the above adjacency list file is: |
| |
| [source,groovy] |
| def parse(line, factory) { |
|     def parts = line.split(/ /) |
|     def (id, label, name, x) = parts[0].split(/:/).toList() |
|     def v1 = graph.addVertex(T.id, id, T.label, label) |
|     if (name != null) v1.property('name', name) // first value is always the name |
|     if (x != null) { |
|         // second value depends on the vertex label; it's either |
|         // the age of a person or the language of a project |
|         if (label.equals('project')) v1.property('lang', x) |
|         else v1.property('age', Integer.valueOf(x)) |
|     } |
|     if (parts.length == 2) { |
|         parts[1].split(/,/).grep { !it.isEmpty() }.each { |
|             def (eLabel, refId, weight) = it.split(/:/).toList() |
|             def v2 = graph.addVertex(T.id, refId) |
|             v1.addOutEdge(eLabel, v2, 'weight', Double.valueOf(weight)) |
|         } |
|     } |
|     return v1 |
| } |
| |
| The `Vertex` returned by `parse()` signals whether the line yielded a valid vertex. If the line is not valid |
| (e.g. a comment line, a skip line, etc.), then simply return `null`. |
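| |
| The line layout and the `null` convention can be tried out without a cluster. The sketch below is plain Groovy: the |
| hypothetical `tokenize()` helper follows the same splitting steps as `parse()` but returns a map instead of a |
| `Vertex`. It assumes that blank lines and lines starting with `#` are the ones to skip; the adjacency list format |
| itself defines no comment syntax, so that choice is purely illustrative. |
| |
| [source,groovy] |
| ---- |
| // Hypothetical helper: splits one adjacency list line into plain Groovy maps. |
| // It mirrors the steps of parse() but creates no Vertex objects. |
| def tokenize(String line) { |
|     // Assumption: blank lines and '#' lines carry no vertex, so null is returned for them. |
|     if (line == null || line.trim().isEmpty() || line.startsWith('#')) return null |
|     def parts = line.split(/ /) |
|     def (id, label, name, x) = parts[0].split(/:/).toList() |
|     def edges = [] |
|     if (parts.length == 2) { |
|         parts[1].split(/,/).grep { !it.isEmpty() }.each { |
|             def (eLabel, refId, weight) = it.split(/:/).toList() |
|             edges << [label: eLabel, inV: refId, weight: Double.valueOf(weight)] |
|         } |
|     } |
|     return [id: id, label: label, name: name, x: x, outE: edges] |
| } |
| |
| def marko = tokenize('1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4') |
| def vadas = tokenize('2:person:vadas:27') |
| def skipped = tokenize('# not a vertex') |
| |
| println marko |
| println vadas |
| println skipped |
| ---- |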
| |
| ScriptOutputFormat Support |
| ++++++++++++++++++++++++++ |
| |
| The principle above can also be used to convert a vertex to an arbitrary `String` representation that is ultimately |
| streamed back to a file in HDFS. This is the role of `ScriptOutputFormat`. `ScriptOutputFormat` requires that the |
| provided script defines a method with the following signature: |
| |
| [source,groovy] |
| def stringify(Vertex vertex) { ... } |
| |
| An appropriate `stringify()` to produce output in the same format that was shown in the `ScriptInputFormat` sample is: |
| |
| [source,groovy] |
| def stringify(vertex) { |
|     def v = vertex.values('name', 'age', 'lang').inject(vertex.id(), vertex.label()).join(':') |
|     def outE = vertex.outE().map { |
|         def e = it.get() |
|         e.values('weight').inject(e.label(), e.inV().next().id()).join(':') |
|     }.join(',') |
|     return [v, outE].join('\t') |
| } |
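| |
| The joining scheme used by `stringify()` can likewise be exercised on plain data. In this sketch the vertex is a |
| hypothetical map (`props` and `outE` are illustrative names, not TinkerPop API) and `formatVertex()` is a hypothetical |
| helper, so the code runs without Hadoop-Gremlin. |
| |
| [source,groovy] |
| ---- |
| // Hypothetical plain-map stand-in for vertex 1; no TinkerPop classes are involved. |
| def vertex = [id: '1', label: 'person', props: ['marko', 29], |
|               outE: [[label: 'knows', inV: '2', weight: 0.5d], |
|                      [label: 'knows', inV: '4', weight: 1.0d], |
|                      [label: 'created', inV: '3', weight: 0.4d]]] |
| |
| // Same joining scheme as stringify(): ':' inside a token, ',' between edges, tab between the two columns. |
| def formatVertex(Map v) { |
|     def head = ([v.id, v.label] + v.props).join(':') |
|     def edges = v.outE.collect { [it.label, it.inV, it.weight].join(':') }.join(',') |
|     return [head, edges].join('\t') |
| } |
| |
| println formatVertex(vertex) |
| ---- |
| |
| Note that the two columns are joined with a tab, as in `stringify()`, so a script that reads such output back has to |
| split on that same character. |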
| |
| |
| |
| Storage Systems |
| ~~~~~~~~~~~~~~~ |
| |
| Hadoop-Gremlin provides two implementations of the `Storage` API: |
| |
| * `FileSystemStorage`: Access HDFS and local file system data. |
| * `SparkContextStorage`: Access Spark persisted RDD data. |
| |
| [[interacting-with-hdfs]] |
| Interacting with HDFS |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| The distributed file system of Hadoop is called link:http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system[HDFS]. |
| The results of any OLAP operation are stored in HDFS, which is accessible via `hdfs`. The local file system is accessible via `local`. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get(); |
| hdfs.ls() |
| hdfs.ls('output') |
| hdfs.head('output', GryoInputFormat) |
| hdfs.head('output', 'clusterCount', SequenceFileInputFormat) |
| hdfs.rm('output') |
| hdfs.ls() |
| ---- |
| |
| [[interacting-with-spark]] |
| Interacting with Spark |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| If a Spark context is persisted, then Spark RDDs will remain in the Spark cache and be accessible to subsequent jobs. |
| RDDs are retrieved from and saved to the `SparkContext` via `PersistedInputRDD` and `PersistedOutputRDD` respectively. |
| Persisted RDDs can be accessed using `spark`. |
| |
| [gremlin-groovy] |
| ---- |
| Spark.create('local[4]') |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| graph.configuration().setProperty('gremlin.spark.graphOutputRDD', PersistedOutputRDD.class.getCanonicalName()) |
| graph.configuration().clearProperty('gremlin.hadoop.graphOutputFormat') |
| graph.configuration().setProperty('gremlin.spark.persistContext',true) |
| graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get(); |
| spark.ls() |
| spark.ls('output') |
| spark.head('output', PersistedInputRDD) |
| spark.head('output', 'clusterCount', PersistedInputRDD) |
| spark.rm('output') |
| spark.ls() |
| ---- |
| |
| A Command Line Example |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| image::pagerank-logo.png[width=300] |
| |
| The classic link:http://en.wikipedia.org/wiki/PageRank[PageRank] centrality algorithm can be executed over the |
| TinkerPop graph from the command line using `GiraphGraphComputer`. |
| |
| WARNING: Be sure that `HADOOP_GREMLIN_LIBS` references the location of the `lib` directory of the respective |
| `GraphComputer` engine being used or else the requisite dependencies will not be uploaded to the Hadoop cluster. |
| |
| [source,text] |
| ---- |
| $ hdfs dfs -copyFromLocal data/tinkerpop-modern.json tinkerpop-modern.json |
| $ hdfs dfs -ls |
| Found 2 items |
| -rw-r--r-- 1 marko supergroup 2356 2014-07-28 13:00 /user/marko/tinkerpop-modern.json |
| $ hadoop jar target/giraph-gremlin-x.y.z-job.jar org.apache.tinkerpop.gremlin.giraph.process.computer.GiraphGraphComputer ../hadoop-gremlin/conf/hadoop-graphson.properties |
| 15/09/11 08:02:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable |
| 15/09/11 08:02:11 INFO computer.GiraphGraphComputer: HadoopGremlin(Giraph): PageRankVertexProgram[alpha=0.85,iterations=30] |
| 15/09/11 08:02:12 INFO mapreduce.JobSubmitter: number of splits:3 |
| 15/09/11 08:02:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441915907347_0028 |
| 15/09/11 08:02:12 INFO impl.YarnClientImpl: Submitted application application_1441915907347_0028 |
| 15/09/11 08:02:12 INFO job.GiraphJob: Tracking URL: http://markos-macbook:8088/proxy/application_1441915907347_0028/ |
| 15/09/11 08:02:12 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 3 mappers |
| 15/09/11 08:03:54 INFO mapreduce.Job: Running job: job_1441915907347_0028 |
| 15/09/11 08:03:55 INFO mapreduce.Job: Job job_1441915907347_0028 running in uber mode : false |
| 15/09/11 08:03:55 INFO mapreduce.Job: map 33% reduce 0% |
| 15/09/11 08:03:57 INFO mapreduce.Job: map 67% reduce 0% |
| 15/09/11 08:04:01 INFO mapreduce.Job: map 100% reduce 0% |
| 15/09/11 08:06:17 INFO mapreduce.Job: Job job_1441915907347_0028 completed successfully |
| 15/09/11 08:06:17 INFO mapreduce.Job: Counters: 80 |
| File System Counters |
| FILE: Number of bytes read=0 |
| FILE: Number of bytes written=483918 |
| FILE: Number of read operations=0 |
| FILE: Number of large read operations=0 |
| FILE: Number of write operations=0 |
| HDFS: Number of bytes read=1465 |
| HDFS: Number of bytes written=1760 |
| HDFS: Number of read operations=39 |
| HDFS: Number of large read operations=0 |
| HDFS: Number of write operations=20 |
| Job Counters |
| Launched map tasks=3 |
| Other local map tasks=3 |
| Total time spent by all maps in occupied slots (ms)=458105 |
| Total time spent by all reduces in occupied slots (ms)=0 |
| Total time spent by all map tasks (ms)=458105 |
| Total vcore-seconds taken by all map tasks=458105 |
| Total megabyte-seconds taken by all map tasks=469099520 |
| Map-Reduce Framework |
| Map input records=3 |
| Map output records=0 |
| Input split bytes=132 |
| Spilled Records=0 |
| Failed Shuffles=0 |
| Merged Map outputs=0 |
| GC time elapsed (ms)=1594 |
| CPU time spent (ms)=0 |
| Physical memory (bytes) snapshot=0 |
| Virtual memory (bytes) snapshot=0 |
| Total committed heap usage (bytes)=527958016 |
| Giraph Stats |
| Aggregate edges=0 |
| Aggregate finished vertices=0 |
| Aggregate sent message message bytes=13535 |
| Aggregate sent messages=186 |
| Aggregate vertices=6 |
| Current master task partition=0 |
| Current workers=2 |
| Last checkpointed superstep=0 |
| Sent message bytes=438 |
| Sent messages=6 |
| Superstep=31 |
| Giraph Timers |
| Initialize (ms)=2996 |
| Input superstep (ms)=5209 |
| Setup (ms)=59 |
| Shutdown (ms)=9324 |
| Superstep 0 GiraphComputation (ms)=3861 |
| Superstep 1 GiraphComputation (ms)=4027 |
| Superstep 10 GiraphComputation (ms)=4000 |
| Superstep 11 GiraphComputation (ms)=4004 |
| Superstep 12 GiraphComputation (ms)=3999 |
| Superstep 13 GiraphComputation (ms)=4000 |
| Superstep 14 GiraphComputation (ms)=4005 |
| Superstep 15 GiraphComputation (ms)=4003 |
| Superstep 16 GiraphComputation (ms)=4001 |
| Superstep 17 GiraphComputation (ms)=4007 |
| Superstep 18 GiraphComputation (ms)=3998 |
| Superstep 19 GiraphComputation (ms)=4006 |
| Superstep 2 GiraphComputation (ms)=4007 |
| Superstep 20 GiraphComputation (ms)=3996 |
| Superstep 21 GiraphComputation (ms)=4006 |
| Superstep 22 GiraphComputation (ms)=4002 |
| Superstep 23 GiraphComputation (ms)=3998 |
| Superstep 24 GiraphComputation (ms)=4003 |
| Superstep 25 GiraphComputation (ms)=4001 |
| Superstep 26 GiraphComputation (ms)=4003 |
| Superstep 27 GiraphComputation (ms)=4005 |
| Superstep 28 GiraphComputation (ms)=4002 |
| Superstep 29 GiraphComputation (ms)=4001 |
| Superstep 3 GiraphComputation (ms)=3988 |
| Superstep 30 GiraphComputation (ms)=4248 |
| Superstep 4 GiraphComputation (ms)=4010 |
| Superstep 5 GiraphComputation (ms)=3998 |
| Superstep 6 GiraphComputation (ms)=3996 |
| Superstep 7 GiraphComputation (ms)=4005 |
| Superstep 8 GiraphComputation (ms)=4009 |
| Superstep 9 GiraphComputation (ms)=3994 |
| Total (ms)=138788 |
| File Input Format Counters |
| Bytes Read=0 |
| File Output Format Counters |
| Bytes Written=0 |
| $ hdfs dfs -cat output/~g/* |
| {"id":1,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.15000000000000002}],"name":[{"id":0,"value":"marko"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":3.0}],"age":[{"id":1,"value":29}]}} |
| {"id":5,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.23181250000000003}],"name":[{"id":8,"value":"ripple"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"lang":[{"id":9,"value":"java"}]}} |
| {"id":3,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.4018125}],"name":[{"id":4,"value":"lop"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":0.0}],"lang":[{"id":5,"value":"java"}]}} |
| {"id":4,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],"name":[{"id":6,"value":"josh"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],"age":[{"id":7,"value":32}]}} |
| {"id":2,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.19250000000000003}],"name":[{"id":2,"value":"vadas"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"age":[{"id":3,"value":27}]}} |
| {"id":6,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.15000000000000002}],"name":[{"id":10,"value":"peter"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":1.0}],"age":[{"id":11,"value":35}]}} |
| ---- |
| |
| Vertex 4 ("josh") is isolated below: |
| |
| [source,js] |
| ---- |
| { |
|   "id":4, |
|   "label":"person", |
|   "properties": { |
|     "gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}], |
|     "name":[{"id":6,"value":"josh"}], |
|     "gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}], |
|     "age":[{"id":7,"value":32}] |
|   } |
| } |
| ---- |
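| |
| The `pageRank` values in the output can be checked by hand. The job log reports `alpha=0.85` and 30 iterations. |
| Assuming the unnormalized recurrence `rank(v) = (1 - alpha) + alpha * sum(rank(u) / outDegree(u))` over the incoming |
| neighbors `u` (an assumption inferred from the numbers, not a description of how `PageRankVertexProgram` is |
| implemented), the plain Groovy sketch below recomputes the ranks for the toy graph. |
| |
| [source,groovy] |
| ---- |
| // Outgoing adjacency of the toy graph (source id -> target ids), read off the GraphSON sample. |
| def outE = [1: [2, 3, 4], 2: [], 3: [], 4: [3, 5], 5: [], 6: [3]] |
| def alpha = 0.85d   // from the job log: PageRankVertexProgram[alpha=0.85,iterations=30] |
| |
| // Every vertex starts at (1 - alpha). |
| def rank = outE.keySet().collectEntries { [(it): 1 - alpha] } |
| |
| 30.times { |
|     def updated = outE.keySet().collectEntries { [(it): 1 - alpha] } |
|     // Each vertex spreads alpha * rank evenly across its outgoing edges. |
|     outE.each { src, targets -> |
|         targets.each { dst -> updated[dst] += alpha * rank[src] / targets.size() } |
|     } |
|     rank = updated |
| } |
| |
| rank.each { id, r -> println "vertex ${id}: ${r}" } |
| ---- |
| |
| For vertex 4 this gives `0.15 + 0.85 * (0.15 / 3)`, which is approximately `0.1925` and agrees with the value shown |
| for "josh" above. |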