| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[implementations]] |
| Implementations |
| =============== |
| |
| image::gremlin-racecar.png[width=325] |
| |
| [[graph-system-provider-requirements]] |
| Graph System Provider Requirements |
| ---------------------------------- |
| |
| image:tinkerpop-enabled.png[width=140,float=left] At the core of TinkerPop3 is a Java8 API. The implementation of this |
| core API and its validation via the `gremlin-test` suite is all that is required of a graph system provider wishing to |
| provide a TinkerPop3-enabled graph engine. Once a graph system has a valid implementation, then all the applications |
| provided by TinkerPop (e.g. Gremlin Console, Gremlin Server, etc.) and 3rd-party developers (e.g. Gremlin-Scala, |
| Gremlin-JS, etc.) will integrate properly. Finally, please feel free to use the logo on the left to promote your |
| TinkerPop3 implementation. |
| |
| Implementing Gremlin-Core |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The classes that a graph system provider should focus on implementing are itemized below. It is a good idea to study |
| the <<tinkergraph-gremlin,TinkerGraph>> (in-memory OLTP and OLAP in `tinkergraph-gremlin`), <<neo4j-gremlin,Neo4jGraph>> |
| (OTLP w/ transactions in `neo4j-gremlin`) and/or <<hadoop-gremlin,HadoopGraph>> (OLAP in `hadoop-gremlin`) |
| implementations for ideas and patterns. |
| |
| . Online Transactional Processing Graph Systems (*OLTP*) |
| .. Structure API: `Graph`, `Element`, `Vertex`, `Edge`, `Property` and `Transaction` (if transactions are supported). |
| .. Process API: `TraversalStrategy` instances for optimizing Gremlin traversals to the provider's graph system (i.e. `TinkerGraphStepStrategy`). |
| . Online Analytics Processing Graph Systems (*OLAP*) |
| .. Everything required of OTLP is required of OLAP (but not vice versa). |
| .. GraphComputer API: `GraphComputer`, `Messenger`, `Memory`. |
| |
| Please consider the following implementation notes: |
| |
| * Be sure your `Graph` implementation is named as `XXXGraph` (e.g. TinkerGraph, Neo4jGraph, HadoopGraph, etc.). |
| * Use `StringHelper` to ensuring that the `toString()` representation of classes are consistent with other implementations. |
| * Ensure that your implementation's `Features` (Graph, Vertex, etc.) are correct so that test cases handle particulars accordingly. |
| * Use the numerous static method helper classes such as `ElementHelper`, `GraphComputerHelper`, `VertexProgramHelper`, etc. |
| * There are a number of default methods on the provided interfaces that are semantically correct. However, if they are |
| not efficient for the implementation, override them. |
| * Implement the `structure/` package interfaces first and then, if desired, interfaces in the `process/` package interfaces. |
| * `ComputerGraph` is a `Wrapper` system that ensure proper semantics during a GraphComputer computation. |
| |
| [[oltp-implementations]] |
| OLTP Implementations |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| image:pipes-character-1.png[width=110,float=right] The most important interfaces to implement are in the `structure/` |
| package. These include interfaces like Graph, Vertex, Edge, Property, Transaction, etc. The `StructureStandardSuite` |
| will ensure that the semantics of the methods implemented are correct. Moreover, there are numerous `Exceptions` |
| classes with static exceptions that should be thrown by the graph system so that all the exceptions and their |
| messages are consistent amongst all TinkerPop3 implementations. |
| |
| [[olap-implementations]] |
| OLAP Implementations |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| image:furnace-character-1.png[width=110,float=right] Implementing the OLAP interfaces may be a bit more complicated. |
| Note that before OLAP interfaces are implemented, it is necessary for the OLTP interfaces to be, at minimal, |
| implemented as specified in <<oltp-implementations,OLTP Implementations>>. A summary of each required interface |
| implementation is presented below: |
| |
| . `GraphComputer`: A fluent builder for specifying an isolation level, a VertexProgram, and any number of MapReduce jobs to be submitted. |
| . `Memory`: A global blackboard for ANDing, ORing, INCRing, and SETing values for specified keys. |
| . `Messenger`: The system that collects and distributes messages being propagated by vertices executing the VertexProgram application. |
| . `MapReduce.MapEmitter`: The system that collects key/value pairs being emitted by the MapReduce applications map-phase. |
| . `MapReduce.ReduceEmitter`: The system that collects key/value pairs being emitted by the MapReduce applications combine- and reduce-phases. |
| |
| NOTE: The VertexProgram and MapReduce interfaces in the `process/computer/` package are not required by the graph |
| system. Instead, these are interfaces to be implemented by application developers writing VertexPrograms and MapReduce jobs. |
| |
| IMPORTANT: TinkerPop3 provides three OLAP implementations: <<tinkergraph-gremlin,TinkerGraphComputer>> (TinkerGraph), |
| <<giraphgraphcomputer,GiraphGraphComputer>> (HadoopGraph), and <<sparkgraphcomputer,`SparkGraphComputer`>> (Hadoop). |
| Given the complexity of the OLAP system, it is good to study and copy many of the patterns used in these reference |
| implementations. |
| |
| Implementing GraphComputer |
| ++++++++++++++++++++++++++ |
| |
| image:furnace-character-3.png[width=150,float=right] The most complex method in GraphComputer is the `submit()`-method. The method must do the following: |
| |
| . Ensure the the GraphComputer has not already been executed. |
| . Ensure that at least there is a VertexProgram or 1 MapReduce job. |
| . If there is a VertexProgram, validate that it can execute on the GraphComputer given the respectively defined features. |
| . Create the Memory to be used for the computation. |
| . Execute the VertexProgram.setup() method once and only once. |
| . Execute the VertexProgram.execute() method for each vertex. |
| . Execute the VertexProgram.terminate() method once and if true, repeat VertexProgram.execute(). |
| . When VertexProgram.terminate() returns true, move to MapReduce job execution. |
| . MapReduce jobs are not required to be executed in any specified order. |
| . For each Vertex, execute MapReduce.map(). Then (if defined) execute MapReduce.combine() and MapReduce.reduce(). |
| . Update Memory with runtime information. |
| . Construct a new `ComputerResult` containing the compute Graph and Memory. |
| |
| Implementing Memory |
| +++++++++++++++++++ |
| |
| image:gremlin-brain.png[width=175,float=left] The Memory object is initially defined by `VertexProgram.setup()`. |
| The memory data is available in the first round of the `VertexProgram.execute()` method. Each Vertex, when executing |
| the VertexProgram, can update the Memory in its round. However, the update is not seen by the other vertices until |
| the next round. At the end of the first round, all the updates are aggregated and the new memory data is available |
| on the second round. This process repeats until the VertexProgram terminates. |
| |
| Implementing Messenger |
| ++++++++++++++++++++++ |
| |
| The Messenger object is similar to the Memory object in that a vertex can read and write to the Messenger. However, |
| the data it reads are the messages sent to the vertex in the previous step and the data it writes are the messages |
| that will be readable by the receiving vertices in the subsequent round. |
| |
| Implementing MapReduce Emitters |
| +++++++++++++++++++++++++++++++ |
| |
| image:hadoop-logo-notext.png[width=150,float=left] The MapReduce framework in TinkerPop3 is similar to the model |
| popularized by link:http://apache.hadoop.org[Hadoop]. The primary difference is that all Mappers process the vertices |
| of the graph, not an arbitrary key/value pair. However, the vertices' edges can not be accessed -- only their |
| properties. This greatly reduces the amount of data needed to be pushed through the MapReduce engine as any edge |
| information required, can be computed in the VertexProgram.execute() method. Moreover, at this stage, vertices can |
| not be mutated, only their token and property data read. A Gremlin OLAP system needs to provide implementations for |
| to particular classes: `MapReduce.MapEmitter` and `MapReduce.ReduceEmitter`. TinkerGraph's implementation is provided |
| below which demonstrates the simplicity of the algorithm (especially when the data is all within the same JVM). |
| |
| [source,java] |
| ---- |
| public class TinkerMapEmitter<K, V> implements MapReduce.MapEmitter<K, V> { |
| |
| public Map<K, Queue<V>> reduceMap; |
| public Queue<KeyValue<K, V>> mapQueue; |
| private final boolean doReduce; |
| |
| public TinkerMapEmitter(final boolean doReduce) { <1> |
| this.doReduce = doReduce; |
| if (this.doReduce) |
| this.reduceMap = new ConcurrentHashMap<>(); |
| else |
| this.mapQueue = new ConcurrentLinkedQueue<>(); |
| } |
| |
| @Override |
| public void emit(K key, V value) { |
| if (this.doReduce) |
| this.reduceMap.computeIfAbsent(key, k -> new ConcurrentLinkedQueue<>()).add(value); <2> |
| else |
| this.mapQueue.add(new KeyValue<>(key, value)); <3> |
| } |
| |
| protected void complete(final MapReduce<K, V, ?, ?, ?> mapReduce) { |
| if (!this.doReduce && mapReduce.getMapKeySort().isPresent()) { <4> |
| final Comparator<K> comparator = mapReduce.getMapKeySort().get(); |
| final List<KeyValue<K, V>> list = new ArrayList<>(this.mapQueue); |
| Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator)); |
| this.mapQueue.clear(); |
| this.mapQueue.addAll(list); |
| } else if (mapReduce.getMapKeySort().isPresent()) { |
| final Comparator<K> comparator = mapReduce.getMapKeySort().get(); |
| final List<Map.Entry<K, Queue<V>>> list = new ArrayList<>(); |
| list.addAll(this.reduceMap.entrySet()); |
| Collections.sort(list, Comparator.comparing(Map.Entry::getKey, comparator)); |
| this.reduceMap = new LinkedHashMap<>(); |
| list.forEach(entry -> this.reduceMap.put(entry.getKey(), entry.getValue())); |
| } |
| } |
| } |
| ---- |
| |
| <1> If the MapReduce job has a reduce, then use one data structure (`reduceMap`), else use another (`mapList`). The |
| difference being that a reduction requires a grouping by key and therefore, the `Map<K,Queue<V>>` definition. If no |
| reduction/grouping is required, then a simple `Queue<KeyValue<K,V>>` can be leveraged. |
| <2> If reduce is to follow, then increment the Map with a new value for the key. `MapHelper` is a TinkerPop3 class |
| with static methods for adding data to a Map. |
| <3> If no reduce is to follow, then simply append a KeyValue to the queue. |
| <4> When the map phase is complete, any map-result sorting required can be executed at this point. |
| |
| [source,java] |
| ---- |
| public class TinkerReduceEmitter<OK, OV> implements MapReduce.ReduceEmitter<OK, OV> { |
| |
| protected Queue<KeyValue<OK, OV>> reduceQueue = new ConcurrentLinkedQueue<>(); |
| |
| @Override |
| public void emit(final OK key, final OV value) { |
| this.reduceQueue.add(new KeyValue<>(key, value)); |
| } |
| |
| protected void complete(final MapReduce<?, ?, OK, OV, ?> mapReduce) { |
| if (mapReduce.getReduceKeySort().isPresent()) { |
| final Comparator<OK> comparator = mapReduce.getReduceKeySort().get(); |
| final List<KeyValue<OK, OV>> list = new ArrayList<>(this.reduceQueue); |
| Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator)); |
| this.reduceQueue.clear(); |
| this.reduceQueue.addAll(list); |
| } |
| } |
| } |
| ---- |
| |
| The method `MapReduce.reduce()` is defined as: |
| |
| [source,java] |
| public void reduce(final OK key, final Iterator<OV> values, final ReduceEmitter<OK, OV> emitter) { ... } |
| |
| In other words, for the TinkerGraph implementation, iterate through the entrySet of the `reduceMap` and call the |
| `reduce()` method on each entry. The `reduce()` method can emit key/value pairs which are simply aggregated into a |
| `Queue<KeyValue<OK,OV>>` in an analogous fashion to `TinkerMapEmitter` when no reduce is to follow. These two emitters |
| are tied together in `TinkerGraphComputer.submit()`. |
| |
| [source,java] |
| ---- |
| ... |
| for (final MapReduce mapReduce : mapReducers) { |
| if (mapReduce.doStage(MapReduce.Stage.MAP)) { |
| final TinkerMapEmitter<?, ?> mapEmitter = new TinkerMapEmitter<>(mapReduce.doStage(MapReduce.Stage.REDUCE)); |
| final SynchronizedIterator<Vertex> vertices = new SynchronizedIterator<>(this.graph.vertices()); |
| workers.setMapReduce(mapReduce); |
| workers.mapReduceWorkerStart(MapReduce.Stage.MAP); |
| workers.executeMapReduce(workerMapReduce -> { |
| while (true) { |
| final Vertex vertex = vertices.next(); |
| if (null == vertex) return; |
| workerMapReduce.map(ComputerGraph.mapReduce(vertex), mapEmitter); |
| } |
| }); |
| workers.mapReduceWorkerEnd(MapReduce.Stage.MAP); |
| |
| // sort results if a map output sort is defined |
| mapEmitter.complete(mapReduce); |
| |
| // no need to run combiners as this is single machine |
| if (mapReduce.doStage(MapReduce.Stage.REDUCE)) { |
| final TinkerReduceEmitter<?, ?> reduceEmitter = new TinkerReduceEmitter<>(); |
| final SynchronizedIterator<Map.Entry<?, Queue<?>>> keyValues = new SynchronizedIterator((Iterator) mapEmitter.reduceMap.entrySet().iterator()); |
| workers.mapReduceWorkerStart(MapReduce.Stage.REDUCE); |
| workers.executeMapReduce(workerMapReduce -> { |
| while (true) { |
| final Map.Entry<?, Queue<?>> entry = keyValues.next(); |
| if (null == entry) return; |
| workerMapReduce.reduce(entry.getKey(), entry.getValue().iterator(), reduceEmitter); |
| } |
| }); |
| workers.mapReduceWorkerEnd(MapReduce.Stage.REDUCE); |
| reduceEmitter.complete(mapReduce); // sort results if a reduce output sort is defined |
| mapReduce.addResultToMemory(this.memory, reduceEmitter.reduceQueue.iterator()); <1> |
| } else { |
| mapReduce.addResultToMemory(this.memory, mapEmitter.mapQueue.iterator()); <2> |
| } |
| } |
| } |
| ... |
| ---- |
| |
| <1> Note that the final results of the reducer are provided to the Memory as specified by the application developer's |
| `MapReduce.addResultToMemory()` implementation. |
| <2> If there is no reduce stage, the the map-stage results are inserted into Memory as specified by the application |
| developer's `MapReduce.addResultToMemory()` implementation. |
| |
| [[io-implementations]] |
| IO Implementations |
| ^^^^^^^^^^^^^^^^^^ |
| |
| If a `Graph` requires custom serializers for IO to work properly, implement the `Graph.io` method. A typical example |
| of where a `Graph` would require such a custom serializers is if their identifier system uses non-primitive values, |
| such as OrientDB's `Rid` class. From basic serialization of a single `Vertex` all the way up the stack to Gremlin |
| Server, the need to know how to handle these complex identifiers is an important requirement. |
| |
| The first step to implementing custom serializers is to first implement the `IoRegistry` interface and register the |
| custom classes and serializers to it. Each `Io` implementation has different requirements for what it expects from the |
| `IoRegistry`: |
| |
| * *GraphML* - No custom serializers expected/allowed. |
| * *GraphSON* - Register a Jackson `SimpleModule`. The `SimpleModule` encapsulates specific classes to be serialized, |
| so it does not need to be registered to a specific class in the `IoRegistry` (use `null`). |
| * *Gryo* - Expects registration of one of three objects: |
| ** Register just the custom class with a `null` Kryo `Serializer` implementation - this class will use default "field-level" Kryo serialization. |
| ** Register the custom class with a specific Kryo `Serializer' implementation. |
| ** Register the custom class with a `Function<Kryo, Serializer>` for those cases where the Kryo `Serializer` requires the `Kryo` instance to get constructed. |
| |
| This implementation should provide a zero-arg constructor as the stack may require instantiation via reflection. |
| Consider extending `AbstractIoRegistry` for convenience as follows: |
| |
| [source,java] |
| ---- |
| public class MyGraphIoRegistry extends AbstractIoRegistry { |
| public MyGraphIoRegistry() { |
| register(GraphSONIo.class, null, new MyGraphSimpleModule()); |
| register(GryoIo.class, MyGraphIdClass.class, new MyGraphIdSerializer()); |
| } |
| } |
| ---- |
| |
| In the `Graph.io` method, provide the `IoRegistry` object to the supplied `Builder` and call the `create` method to |
| return that `Io` instance as follows: |
| |
| [source,java] |
| ---- |
| public <I extends Io> I io(final Io.Builder<I> builder) { |
| return (I) builder.graph(this).registry(myGraphIoRegistry).create(); |
| }} |
| ---- |
| |
| In this way, `Graph` implementations can pre-configure custom serializers for IO interactions and users will not need |
| to know about those details. Following this pattern will ensure proper execution of the test suite as well as |
| simplified usage for end-users. |
| |
| IMPORTANT: Proper implementation of IO is critical to successful `Graph` operations in Gremlin Server. The Test Suite |
| does have "serialization" tests that provide some assurance that an implementation is working properly, but those |
| tests cannot make assertions against any specifics of a custom serializer. It is the responsibility of the |
| implementer to test the specifics of their custom serializers. |
| |
| TIP: Consider separating serializer code into its own module, if possible, so that clients that use the `Graph` |
| implementation remotely don't need a full dependency on the entire `Graph` - just the IO components and related |
| classes being serialized. |
| |
| [[validating-with-gremlin-test]] |
| Validating with Gremlin-Test |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| image:gremlin-edumacated.png[width=225] |
| |
| [source,xml] |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>gremlin-test</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>gremlin-groovy-test</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| |
| The operational semantics of any OLTP or OLAP implementation are validated by `gremlin-test` and functional |
| interoperability with the Groovy environment is ensured by `gremlin-groovy-test`. To implement these tests, provide |
| test case implementations as shown below, where `XXX` below denotes the name of the graph implementation (e.g. |
| TinkerGraph, Neo4jGraph, HadoopGraph, etc.). |
| |
| [source,java] |
| ---- |
| // Structure API tests |
| @RunWith(StructureStandardSuite.class) |
| @GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class) |
| public class XXXStructureStandardTest {} |
| |
| // Process API tests |
| @RunWith(ProcessComputerSuite.class) |
| @GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class) |
| public class XXXProcessComputerTest {} |
| |
| @RunWith(ProcessStandardSuite.class) |
| @GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class) |
| public class XXXProcessStandardTest {} |
| |
| @RunWith(GroovyEnvironmentSuite.class) |
| @GraphProviderClass(provider = XXXProvider.class, graph = TinkerGraph.class) |
| public class XXXGroovyEnvironmentTest {} |
| |
| @RunWith(GroovyProcessStandardSuite.class) |
| @GraphProviderClass(provider = XXXGraphProvider.class, graph = TinkerGraph.class) |
| public class XXXGroovyProcessStandardTest {} |
| |
| @RunWith(GroovyProcessComputerSuite.class) |
| @GraphProviderClass(provider = XXXGraphComputerProvider.class, graph = TinkerGraph.class) |
| public class XXXGroovyProcessComputerTest {} |
| ---- |
| |
| The above set of tests represent the minimum test suite set to implement. There are other "integration" and |
| "performance" tests that should be considered optional. Implementing those tests requires the same pattern as shown above. |
| |
| IMPORTANT: It is as important to look at "ignored" tests as it is to look at ones that fail. The `gremlin-test` |
| suite utilizes the `Feature` implementation exposed by the `Graph` to determine which tests to execute. If a test |
| utilizes features that are not supported by the graph, it will ignore them. While that may be fine, implementers |
| should validate that the ignored tests are appropriately bypassed and that there are no mistakes in their feature |
| definitions. Moreover, implementers should consider filling gaps in their own test suites, especially when |
| IO-related tests are being ignored. |
| |
| The only test-class that requires any code investment is the `GraphProvider` implementation class. This class is a |
| used by the test suite to construct `Graph` configurations and instances and provides information about the |
| implementation itself. In most cases, it is best to simply extend `AbstractGraphProvider` as it provides many |
| default implementations of the `GraphProvider` interface. |
| |
| Finally, specify the test suites that will be supported by the `Graph` implementation using the `@Graph.OptIn` |
| annotation. See the `TinkerGraph` implementation below as an example: |
| |
| [source,java] |
| ---- |
| @Graph.OptIn(Graph.OptIn.SUITE_STRUCTURE_STANDARD) |
| @Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD) |
| @Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER) |
| @Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_STANDARD) |
| @Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_COMPUTER) |
| @Graph.OptIn(Graph.OptIn.SUITE_GROOVY_ENVIRONMENT) |
| public class TinkerGraph implements Graph { |
| ---- |
| |
| Only include annotations for the suites the implementation will support. Note that implementing the suite, but |
| not specifying the appropriate annotation will prevent the suite from running (an obvious error message will appear |
| in this case when running the mis-configured suite). |
| |
| There are times when there may be a specific test in the suite that the implementation cannot support (despite the |
| features it implements) or should not otherwise be executed. It is possible for implementers to "opt-out" of a test |
| by using the `@Graph.OptOut` annotation. The following is an example of this annotation usage as taken from |
| `HadoopGraph`: |
| |
| [source, java] |
| ---- |
| @Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD) |
| @Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER) |
| @Graph.OptOut( |
| test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals", |
| method = "g_V_matchXa_hasXname_GarciaX__a_inXwrittenByX_b__a_inXsungByX_bX", |
| reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.") |
| @Graph.OptOut( |
| test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals", |
| method = "g_V_matchXa_inXsungByX_b__a_inXsungByX_c__b_outXwrittenByX_d__c_outXwrittenByX_e__d_hasXname_George_HarisonX__e_hasXname_Bob_MarleyXX", |
| reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.") |
| @Graph.OptOut( |
| test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest", |
| method = "shouldNotAllowBadMemoryKeys", |
| reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.") |
| @Graph.OptOut( |
| test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest", |
| method = "shouldRequireRegisteringMemoryKeys", |
| reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.") |
| public class HadoopGraph implements Graph { |
| ---- |
| |
| The above examples show how to ignore individual tests. It is also possible to: |
| |
| * Ignore an entire test case (i.e. all the methods within the test) by setting the `method` to "*". |
| * Ignore a "base" test class such that test that extend from those classes will all be ignored. This style of |
| ignoring is useful for Gremlin "process" tests that have bases classes that are extended by various Gremlin flavors (e.g. groovy). |
| * Ignore a `GraphComputer` test based on the type of `GraphComputer` being used. Specify the "computer" attribute on |
| the `OptOut` (which is an array specification) which should have a value of the `GraphComputer` implementation class |
| that should ignore that test. This attribute should be left empty for "standard" execution and by default all |
| `GraphComputer` implementations will be included in the `OptOut` so if there are multiple implementations, explicitly |
| specify the ones that should be excluded. |
| |
| Also note that some of the tests in the Gremlin Test Suite are parameterized tests and require an additional level of |
| specificity to be properly ignored. To ignore these types of tests, examine the name template of the parameterized |
| tests. It is defined by a Java annotation that looks like this: |
| |
| [source, java] |
| @Parameterized.Parameters(name = "expect({0})") |
| |
| The annotation above shows that the name of each parameterized test will be prefixed with "expect" and have |
| parentheses wrapped around the first parameter (at index 0) value supplied to each test. This information can |
| only be garnered by studying the test set up itself. Once the pattern is determined and the specific unique name of |
| the parameterized test is identified, add it to the `specific` property on the `OptOut` annotation in addition to |
| the other arguments. |
| |
| These annotations help provide users a level of transparency into test suite compliance (via the |
| xref:describe-graph[describeGraph()] utility function). It also allows implementers to have a lot of flexibility in |
| terms of how they wish to support TinkerPop. For example, maybe there is a single test case that prevents an |
| implementer from claiming support of a `Feature`. The implementer could choose to either not support the `Feature` |
| or to support it but "opt-out" of the test with a "reason" as to why so that users understand the limitation. |
| |
| IMPORTANT: Before using `OptOut` be sure that the reason for using it is sound and it is more of a last resort. |
| It is possible that a test from the suite doesn't properly represent the expectations of a feature, is too broad or |
| narrow for the semantics it is trying to enforce or simply contains a bug. Please consider raising issues in the |
| developer mailing list with such concerns before assuming `OptOut` is the only answer. |
| |
| IMPORTANT: There are no tests that specifically validate complete compliance with Gremlin Server. Generally speaking, |
| a `Graph` that passes the full Test Suite, should be compliant with Gremlin Server. The one area where problems can |
| occur is in serialization. Always ensure that IO is properly implemented, that custom serializers are tested fully |
| and ultimately integration test the `Graph` with an actual Gremlin Server instance. |
| |
| CAUTION: Configuring tests to run in parallel might result in errors that are difficult to debug as there is some |
| shared state in test execution around graph configuration. It is therefore recommended that parallelism be turned |
| off for the test suite (the Maven SureFire Plugin is configured this way by default). It may also be important to |
| include this setting, `<reuseForks>false</reuseForks>`, in the SureFire configuration if tests are failing in an |
| unexplainable way. |
| |
| Accessibility via GremlinPlugin |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| image:gremlin-plugin.png[width=100,float=left] The applications distributed with TinkerPop3 do not distribute with |
| any graph system implementations besides TinkerGraph. If your implementation is stored in a Maven repository (e.g. |
| Maven Central Repository), then it is best to provide a `GremlinPlugin` implementation so the respective jars can be |
| downloaded according and when required by the user. Neo4j's GremlinPlugin is provided below for reference. |
| |
| [source,java] |
| ---- |
| public class Neo4jGremlinPlugin implements GremlinPlugin { |
| |
| private static final String IMPORT = "import "; |
| private static final String DOT_STAR = ".*"; |
| |
| private static final Set<String> IMPORTS = new HashSet<String>() {{ |
| add(IMPORT + Neo4jGraph.class.getPackage().getName() + DOT_STAR); |
| }}; |
| |
| @Override |
| public String getName() { |
| return "neo4j"; |
| } |
| |
| @Override |
| public void pluginTo(final PluginAcceptor pluginAcceptor) { |
| pluginAcceptor.addImports(IMPORTS); |
| } |
| } |
| ---- |
| |
| With the above plugin implementations, users can now download respective binaries for Gremlin Console, Gremlin Server, etc. |
| |
| [source,groovy] |
| gremlin> g = Neo4jGraph.open('/tmp/neo4j') |
| No such property: Neo4jGraph for class: groovysh_evaluate |
| Display stack trace? [yN] |
| gremlin> :install org.apache.tinkerpop neo4j-gremlin x.y.z |
| ==>loaded: [org.apache.tinkerpop, neo4j-gremlin, …] |
| gremlin> :plugin use tinkerpop.neo4j |
| ==>tinkerpop.neo4j activated |
| gremlin> g = Neo4jGraph.open('/tmp/neo4j') |
| ==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]] |
| |
| In-Depth Implementations |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| image:gremlin-painting.png[width=200,float=right] The graph system implementation details presented thus far are |
| minimum requirements necessary to yield a valid TinkerPop3 implementation. However, there are other areas that a |
| graph system provider can tweak to provide an implementation more optimized for their underlying graph engine. Typical |
| areas of focus include: |
| |
| * Traversal Strategies: A <<traversalstrategy,TraversalStrategy>> can be used to alter a traversal prior to its |
| execution. A typical example is converting a pattern of `g.V().has('name','marko')` into a global index lookup for |
| all vertices with name "marko". In this way, a `O(|V|)` lookup becomes an `O(log(|V|))`. Please review |
| `TinkerGraphStepStrategy` for ideas. |
| * Step Implementations: Every <<graph-traversal-steps,step>> is ultimately referenced by the `GraphTraversal` |
| interface. It is possible to extend `GraphTraversal` to use a graph system specific step implementation. |
| |
| |
| [[tinkergraph-gremlin]] |
| TinkerGraph-Gremlin |
| ------------------- |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>tinkergraph-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:tinkerpop-character.png[width=100,float=left] TinkerGraph is a single machine, in-memory (with optional |
| persistence), non-transactional graph engine that provides both OLTP and OLAP functionality. It is deployed with |
| TinkerPop3 and serves as the reference implementation for other providers to study in order to understand the |
| semantics of the various methods of the TinkerPop3 API. Constructing a simple graph in Java8 is presented below. |
| |
| [source,java] |
| Graph g = TinkerGraph.open(); |
| Vertex marko = g.addVertex("name","marko","age",29); |
| Vertex lop = g.addVertex("name","lop","lang","java"); |
| marko.addEdge("created",lop,"weight",0.6d); |
| |
| The above graph creates two vertices named "marko" and "lop" and connects them via a created-edge with a weight=0.6 |
| property. Next, the graph can be queried as such. |
| |
| [source,java] |
| g.V().has("name","marko").out("created").values("name") |
| |
| The `g.V().has("name","marko")` part of the query can be executed in two ways. |
| |
| * A linear scan of all vertices filtering out those vertices that don't have the name "marko" |
| * A `O(log(|V|))` index lookup for all vertices with the name "marko" |
| |
| Given the initial graph construction in the first code block, no index was defined and thus, a linear scan is executed. |
| However, if the graph was constructed as such, then an index lookup would be used. |
| |
| [source,java] |
| Graph g = TinkerGraph.open(); |
| g.createIndex("name",Vertex.class) |
| |
| The execution times for a vertex lookup by property is provided below for both no-index and indexed version of |
| TinkerGraph over the Grateful Dead graph. |
| |
| [gremlin-groovy] |
| ---- |
| graph = TinkerGraph.open() |
| g = graph.traversal() |
| graph.io(graphml()).readGraph('data/grateful-dead.xml') |
| clock(1000) {g.V().has('name','Garcia').iterate()} <1> |
| graph = TinkerGraph.open() |
| g = graph.traversal() |
| graph.createIndex('name',Vertex.class) |
| graph.io(graphml()).readGraph('data/grateful-dead.xml') |
| clock(1000){g.V().has('name','Garcia').iterate()} <2> |
| ---- |
| |
| <1> Determine the average runtime of 1000 vertex lookups when no `name`-index is defined. |
| <2> Determine the average runtime of 1000 vertex lookups when a `name`-index is defined. |
| |
| IMPORTANT: Each graph system will have different mechanism by which indices and schemas are defined. TinkerPop3 |
| does not require any conformance in this area. In TinkerGraph, the only definitions are around indices. With other |
| graph systems, property value types, indices, edge labels, etc. may be required to be defined _a priori_ to adding |
| data to the graph. |
| |
| NOTE: TinkerGraph is distributed with Gremlin Server and is therefore automatically available to it for configuration. |
| |
| Configuration |
| ~~~~~~~~~~~~~ |
| |
| TinkerGraph has several settings that can be provided on creation via `Configuration` object: |
| |
| [width="100%",cols="2,10",options="header"] |
| |========================================================= |
| |Property |Description |
| |gremlin.graph |`org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph` |
| |gremlin.tinkergraph.vertexIdManager |The `IdManager` implementation to use for vertices. |
| |gremlin.tinkergraph.edgeIdManager |The `IdManager` implementation to use for edges. |
| |gremlin.tinkergraph.vertexPropertyIdManager |The `IdManager` implementation to use for vertex properties. |
| |gremlin.tinkergraph.defaultVertexPropertyCardinality |The default `VertexProperty.Cardinality` to use when `Vertex.property(k,v)` is called. |
| |gremlin.tinkergraph.graphLocation |The path and file name for where TinkerGraph should persist the graph data. If a |
| value is specified here, the the `gremlin.tinkergraph.graphFormat` should also be specified. If this value is not |
| included (default), then the graph will stay in-memory and not be loaded/persisted to disk. |
| |gremlin.tinkergraph.graphFormat |The format to use to serialize the graph which may be one of the following: |
| `graphml`, `graphson`, or `gryo`. If a value is specified here, the the `gremlin.tinkergraph.graphLocation` should |
| also be specified. If this value is not included (default), then the graph will stay in-memory and not be |
| loaded/persisted to disk. |
| |========================================================= |
| |
| The `IdManager` settings above refer to how TinkerGraph will control identifiers for vertices, edges and vertex |
| properties. There are several options for each of these settings: `ANY`, `LONG`, `INTEGER`, `UUID`, or the fully |
| qualified class name of an `IdManager` implementation on the classpath. When not specified, the default values |
| for all settings is `ANY`, meaning that the graph will work with any object on the JVM as the identifier and will |
| generate new identifiers from `Long` when the identifier is not user supplied. TinkerGraph will also expect the |
| user to understand the types used for identifiers when querying, meaning that `g.V(1)` and `g.V(1L)` could return |
| two different vertices. `LONG`, `INTEGER` and `UUID` settings will try to coerce identifier values to the expected |
| type as well as generate new identifiers with that specified type. |
| |
| If the TinkerGraph is configured for persistence with `gremlin.tinkergraph.graphLocation` and |
| `gremlin.tinkergraph.graphFormat`, then the graph will be written to the specified location with the specified |
| format when `Graph.close()` is called. In addition, if these settings are present, TinkerGraph will attempt to |
| load the graph from the specified location. |
| |
| It is important to consider the data being imported to TinkerGraph with respect to `defaultVertexPropertyCardinality` |
| setting. For example, if a `.gryo` file is known to contain multi-property data, be sure to set the default |
| cardinality to `list` or else the data will import as `single`. Consider the following: |
| |
| [gremlin-groovy] |
| ---- |
| graph = TinkerGraph.open() |
| graph.io(gryo()).readGraph("data/tinkerpop-crew.kryo") |
| g = graph.traversal() |
| g.V().properties() |
| conf = new BaseConfiguration() |
| conf.setProperty("gremlin.tinkergraph.defaultVertexPropertyCardinality","list") |
| graph = TinkerGraph.open(conf) |
| graph.io(gryo()).readGraph("data/tinkerpop-crew.kryo") |
| g = graph.traversal() |
| g.V().properties() |
| ---- |
| |
| [[neo4j-gremlin]] |
| Neo4j-Gremlin |
| ------------- |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>neo4j-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| <!-- neo4j-tinkerpop-api-impl is NOT Apache 2 licensed - more information below --> |
| <dependency> |
| <groupId>org.neo4j</groupId> |
| <artifactId>neo4j-tinkerpop-api-impl</artifactId> |
| <version>0.1-2.2</version> |
| </dependency> |
| ---- |
| |
| link:http://neotechnology.com[Neo Technology] are the developers of the OLTP-based link:http://neo4j.org[Neo4j graph database]. |
| |
| CAUTION: Unless under a commercial agreement with Neo Technology, Neo4j is licensed |
| link:http://en.wikipedia.org/wiki/Affero_General_Public_License[AGPL]. The `neo4j-gremlin` module is licensed Apache2 |
| because it only references the Apache2-licensed Neo4j API (not its implementation). Note that neither the |
| <<gremlin-console,Gremlin Console>> nor <<gremlin-server,Gremlin Server>> distribute with the Neo4j implementation |
| binaries. To access the binaries, use the `:install` command to download binaries from |
| link:http://search.maven.org/[Maven Central Repository]. |
| |
| [source,groovy] |
| ---- |
| gremlin> :install org.apache.tinkerpop neo4j-gremlin x.y.z |
| ==>Loaded: [org.apache.tinkerpop, neo4j-gremlin, x.y.z] - restart the console to use [tinkerpop.neo4j] |
| gremlin> :q |
| ... |
| gremlin> :plugin use tinkerpop.neo4j |
| ==>tinkerpop.neo4j activated |
| gremlin> graph = Neo4jGraph.open('/tmp/neo4j') |
| ==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]] |
| ---- |
| |
| NOTE: Neo4j link:http://docs.neo4j.org/chunked/stable/ha.html[High Availability] is currently not supported by |
| Neo4j-Gremlin. |
| |
| TIP: To host Neo4j in <<gremlin-server,Gremlin Server>>, the dependencies must first be "installed" or otherwise |
| copied to the Gremlin Server path. The automated method for doing this would be to execute |
| `bin/gremlin-server.sh -i org.apache.tinkerpop neo4j-gremlin x.y.z`. |
| |
| Indices |
| ~~~~~~~ |
| |
| Neo4j 2.x indices leverage vertex labels to partition the index space. TinkerPop3 does not provide method interfaces |
| for defining schemas/indices for the underlying graph system. Thus, in order to create indices, it is important to |
| call the Neo4j API directly. |
| |
| NOTE: `Neo4jGraphStep` will attempt to discern which indices to use when executing a traversal of the form `g.V().has()`. |
| |
| The Gremlin-Console session below demonstrates Neo4j indices. For more information, please refer to the Neo4j documentation: |
| |
| * Manipulating indices with link:http://docs.neo4j.org/chunked/stable/query-schema-index.html[Cypher]. |
| * Manipulating indices with the Neo4j link:http://docs.neo4j.org/chunked/stable/tutorials-java-embedded-new-index.html[Java API]. |
| |
| [gremlin-groovy] |
| ---- |
| graph = Neo4jGraph.open('/tmp/neo4j') |
| graph.cypher("CREATE INDEX ON :person(name)") |
| graph.tx().commit() <1> |
| graph.addVertex(label,'person','name','marko') |
| graph.addVertex(label,'dog','name','puppy') |
| g = graph.traversal() |
| g.V().hasLabel('person').has('name','marko').values('name') |
| graph.close() |
| ---- |
| |
| <1> Schema mutations must happen in a different transaction than graph mutations |
| |
| Below demonstrates the runtime benefits of indices and demonstrates how if there is no defined index (only vertex |
| labels), a linear scan of the vertex-label partition is still faster than a linear scan of all vertices. |
| |
| [gremlin-groovy] |
| ---- |
| graph = Neo4jGraph.open('/tmp/neo4j') |
| graph.io(graphml()).readGraph('data/grateful-dead.xml') |
| g = graph.traversal() |
| g.tx().commit() |
| clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()} <1> |
| graph.cypher("CREATE INDEX ON :artist(name)") <2> |
| g.tx().commit() |
| Thread.sleep(5000) <3> |
| clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()} <4> |
| clock(1000) {g.V().has('name','Garcia').iterate()} <5> |
| graph.cypher("DROP INDEX ON :artist(name)") <6> |
| g.tx().commit() |
| graph.close() |
| ---- |
| |
| <1> Find all artists whose name is Garcia which does a linear scan of the artist vertex-label partition. |
| <2> Create an index for all artist vertices on their name property. |
| <3> Neo4j indices are eventually consistent so this stalls to give the index time to populate itself. |
| <4> Find all artists whose name is Garcia which uses the pre-defined schema index. |
| <5> Find all vertices whose name is Garcia which requires a linear scan of all the data in the graph. |
| <6> Drop the created index. |
| |
| Multi/Meta-Properties |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| `Neo4jGraph` supports both multi- and meta-properties (see <<_vertex_properties,vertex properties>>). These features |
| are not native to Neo4j and are implemented using "hidden" Neo4j nodes. For example, when a vertex has multiple |
| "name" properties, each property is a new node (multi-properties) which can have properties attached to it |
| (meta-properties). As such, the native, underlying representation may become difficult to query directly using |
| another graph language such as <<_cypher,Cypher>>. The default setting is to disable multi- and meta-properties. |
| However, if this feature is desired, then it can be activated via `gremlin.neo4j.metaProperties` and |
| `gremlin.neo4j.multiProperties` configurations being set to `true`. Once the configuration is set, it can not be |
| changed for the lifetime of the graph. |
| |
| [gremlin-groovy] |
| ---- |
| conf = new BaseConfiguration() |
| conf.setProperty('gremlin.neo4j.directory','/tmp/neo4j') |
| conf.setProperty('gremlin.neo4j.multiProperties',true) |
| conf.setProperty('gremlin.neo4j.metaProperties',true) |
| graph = Neo4jGraph.open(conf) |
| g = graph.traversal() |
| g.addV('name','michael','name','michael hunger','name','mhunger') |
| g.V().properties('name').property('acl', 'public') |
| g.V(0).valueMap() |
| g.V(0).properties() |
| g.V(0).properties().valueMap() |
| graph.close() |
| ---- |
| |
| WARNING: `Neo4jGraph` without multi- and meta-properties is in 1-to-1 correspondence with the native, underlying Neo4j |
| representation. It is recommended that if the user does not require multi/meta-properties, then they should not |
| enable them. Without multi- and meta-properties enabled, Neo4j can be interacted with with other tools and technologies |
| that do not leverage TinkerPop. |
| |
| IMPORTANT: When using a multi-property enabled `Neo4jGraph`, vertices may represent their properties on "hidden |
| nodes" adjacent to the vertex. If a vertex property key/value is required for indexing, then two indices are |
| required -- e.g. `CREATE INDEX ON :person(name)` and `CREATE INDEX ON :vertexProperty(name)` |
| (see <<_indices,Neo4j indices>>). |
| |
| Cypher |
| ~~~~~~ |
| |
| image::gremlin-loves-cypher.png[width=400] |
| |
| NeoTechnology are the creators of the graph pattern-match query language link:http://www.neo4j.org/learn/cypher[Cypher]. |
| It is possible to leverage Cypher from within Gremlin by using the `Neo4jGraph.cypher()` graph traversal method. |
| |
| [gremlin-groovy] |
| ---- |
| graph = Neo4jGraph.open('/tmp/neo4j') |
| graph.io(gryo()).readGraph('data/tinkerpop-modern.kryo') |
| graph.cypher('MATCH (a {name:"marko"}) RETURN a') |
| graph.cypher('MATCH (a {name:"marko"}) RETURN a').select('a').out('knows').values('name') |
| graph.close() |
| ---- |
| |
| Thus, like <<match-step,`match()`>>-step in Gremlin, it is possible to do a declarative pattern match and then move |
| back into imperative Gremlin. |
| |
| TIP: For those developers using <<gremlin-server,Gremlin Server>> against Neo4j, it is possible to do Cypher queries |
| by simply placing the Cypher string in `graph.cypher(...)` before submission to the server. |
| |
| Multi-Label |
| ~~~~~~~~~~~ |
| |
| TinkerPop3 requires every `Element` to have a single, immutable string label (i.e. a `Vertex`, `Edge`, and |
| `VertexProperty`). In Neo4j, a `Node` (vertex) can have an |
| link:http://neo4j.com/docs/stable/graphdb-neo4j-labels.html[arbitrary number of labels] while a `Relationship` |
| (edge) can have one and only one. Furthermore, in Neo4j, `Node` labels are mutable while `Relationship` labels are |
| not. In order to handle this mismatch, three `Neo4jVertex` specific methods exist in Neo4j-Gremlin. |
| |
| [source,java] |
| public Set<String> labels() // get all the labels of the vertex |
| public void addLabel(String label) // add a label to the vertex |
| public void removeLabel(String label) // remove a label from the vertex |
| |
| An example use case is presented below. |
| |
| [gremlin-groovy] |
| ---- |
| graph = Neo4jGraph.open('/tmp/neo4j') |
| vertex = (Neo4jVertex) graph.addVertex('human::animal') <1> |
| vertex.label() <2> |
| vertex.labels() <3> |
| vertex.addLabel('organism') <4> |
| vertex.label() |
| vertex.removeLabel('human') <5> |
| vertex.labels() |
| vertex.addLabel('organism') <6> |
| vertex.labels() |
| vertex.removeLabel('human') <7> |
| vertex.label() |
| g = graph.traversal() |
| g.V().has(label,'organism') <8> |
| g.V().has(label,of('organism')) <9> |
| g.V().has(label,of('organism')).has(label,of('animal')) |
| g.V().has(label,of('organism').and(of('animal'))) |
| graph.close() |
| ---- |
| |
| <1> Typecasting to a `Neo4jVertex` is only required in Java. |
| <2> The standard `Vertex.label()` method returns all the labels in alphabetical order concatenated using `::`. |
| <3> `Neo4jVertex.labels()` method returns the individual labels as a set. |
| <4> `Neo4jVertex.addLabel()` method adds a single label. |
| <5> `Neo4jVertex.removeLabel()` method removes a single label. |
| <6> Labels are unique and thus duplicate labels don't exist. |
| <7> If a label that does not exist is removed, nothing happens. |
| <8> `P.eq()` does a full string match and should only be used if multi-labels are not leveraged. |
| <9> `LabelP.of()` is specific to `Neo4jGraph` and used for multi-label matching. |
| |
| IMPORTANT: `LabelP.of()` is only required if multi-labels are leveraged. `LabelP.of()` is used when |
| filtering/looking-up vertices by their label(s) as the standard `P.eq()` does a direct match on the `::`-representation |
| of `vertex.label()` |
| |
| Loading with BulkLoaderVertexProgram |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The <<bulkloadervertexprogram, BulkLoaderVertexProgram>> is a generalized bulk loader that can be used to load |
| large amounts of data to and from Neo4j. The following code demonstrates how to load the modern graph from TinkerGraph |
| into Neo4j: |
| |
| [gremlin-groovy] |
| ---- |
| wgConf = 'conf/neo4j-standalone.properties' |
| modern = TinkerFactory.createModern() |
| blvp = BulkLoaderVertexProgram.build(). |
| keepOriginalIds(false). |
| writeGraph(wgConf).create(modern) |
| modern.compute().workers(1).program(blvp).submit().get() |
| graph = GraphFactory.open(wgConf) |
| g = graph.traversal() |
| g.V().valueMap() |
| graph.close() |
| ---- |
| |
| [source,properties] |
| ---- |
| # neo4j-standalone.properties |
| |
| gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph |
| gremlin.neo4j.directory=/tmp/neo4j |
| gremlin.neo4j.conf.node_auto_indexing=true |
| gremlin.neo4j.conf.relationship_auto_indexing=true |
| ---- |
| |
| [[hadoop-gremlin]] |
| Hadoop-Gremlin |
| -------------- |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>hadoop-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:hadoop-logo-notext.png[width=100,float=left] link:http://hadoop.apache.org/[Hadoop] is a distributed |
| computing framework that is used to process data represented across a multi-machine compute cluster. When the |
| data in the Hadoop cluster represents a TinkerPop3 graph, then Hadoop-Gremlin can be used to process the graph |
| using both TinkerPop3's OLTP and OLAP graph computing models. |
| |
| IMPORTANT: This section assumes that the user has a Hadoop 2.x cluster functioning. For more information on getting |
| started with Hadoop, please see the |
| link:http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html[Single Node Setup] |
| tutorial. Moreover, if using `GiraphGraphComputer` or `SparkGraphComputer` it is advisable that the reader also |
| familiarize their self with Giraph (link:http://giraph.apache.org/quick_start.html[Getting Started]) and Spark |
| (link:http://spark.apache.org/docs/latest/quick-start.html[Quick Start]). |
| |
| Installing Hadoop-Gremlin |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The `HADOOP_GREMLIN_LIBS` references locations that contains jars that should be uploaded to a respective |
| distributed cache (link:http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html[YARN] or SparkServer). |
| Note that the locations in `HADOOP_GREMLIN_LIBS` can be a colon-separated (`:`) and all jars from all locations will |
| be loaded into the cluster. Typically, only the jars of the respective GraphComputer are required to be loaded (e.g. |
| `GiraphGraphComputer` plugin lib directory). |
| |
| [source,shell] |
| export HADOOP_GREMLIN_LIBS=/usr/local/gremlin-console/ext/giraph-gremlin/lib |
| |
| If using <<gremlin-console,Gremlin Console>>, it is important to install the Hadoop-Gremlin plugin. Note that |
| Hadoop-Gremlin requires a Gremlin Console restart after installing. |
| |
| [source,text] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| gremlin> :install org.apache.tinkerpop hadoop-gremlin x.y.z |
| ==>loaded: [org.apache.tinkerpop, hadoop-gremlin, x.y.z] - restart the console to use [tinkerpop.hadoop] |
| gremlin> :q |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| gremlin> :plugin use tinkerpop.hadoop |
| ==>tinkerpop.hadoop activated |
| gremlin> |
| ---- |
| |
| Properties Files |
| ~~~~~~~~~~~~~~~~ |
| |
| `HadoopGraph` makes use of properties files which ultimately get turned into Apache configurations and/or |
| Hadoop configurations. The example properties file presented below is located at `conf/hadoop/hadoop-gryo.properties`. |
| |
| [source,text] |
| gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph |
| gremlin.hadoop.inputLocation=tinkerpop-modern.kryo |
| gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat |
| gremlin.hadoop.outputLocation=output |
| gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat |
| gremlin.hadoop.jarsInDistributedCache=true |
| #################################### |
| # Spark Configuration # |
| #################################### |
| spark.master=local[4] |
| spark.executor.memory=1g |
| spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer |
| #################################### |
| # SparkGraphComputer Configuration # |
| #################################### |
| gremlin.spark.graphInputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.InputRDDFormat |
| gremlin.spark.graphOutputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.OutputRDDFormat |
| gremlin.spark.persistContext=true |
| ##################################### |
| # GiraphGraphComputer Configuration # |
| ##################################### |
| giraph.minWorkers=2 |
| giraph.maxWorkers=2 |
| giraph.useOutOfCoreGraph=true |
| giraph.useOutOfCoreMessages=true |
| mapreduce.map.java.opts=-Xmx1024m |
| mapreduce.reduce.java.opts=-Xmx1024m |
| giraph.numInputThreads=2 |
| giraph.numComputeThreads=2 |
| |
| A review of the Hadoop-Gremlin specific properties are provided in the table below. For the respective OLAP |
| engines (<<sparkgraphcomputer,`SparkGraphComputer`>> or <<giraphgraphcomputer,`GiraphGraphComputer`>>) refer |
| to their respective documentation for configuration options. |
| |
| [width="100%",cols="2,10",options="header"] |
| |========================================================= |
| |Property |Description |
| |gremlin.graph |The class of the graph to construct using GraphFactory. |
| |gremlin.hadoop.inputLocation |The location of the input file(s) for Hadoop-Gremlin to read the graph from. |
| |gremlin.hadoop.graphInputFormat |The format that the graph input file(s) are represented in. |
| |gremlin.hadoop.outputLocation |The location to write the computed HadoopGraph to. |
| |gremlin.hadoop.graphOutputFormat |The format that the output file(s) should be represented in. |
| |gremlin.hadoop.jarsInDistributedCache |Whether to upload the Hadoop-Gremlin jars to a distributed cache (necessary if jars are not on the machines' classpaths). |
| |========================================================= |
| |
| |
| |
| Along with the properties above, the numerous link:http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml[Hadoop specific properties] |
| can be added as needed to tune and parameterize the executed Hadoop-Gremlin job on the respective Hadoop cluster. |
| |
| IMPORTANT: As the size of the graphs being processed becomes large, it is important to fully understand how the |
| underlying OLAP engine (e.g. Spark, Giraph, etc.) works and understand the numerous parameterizations offered by |
| these systems. Such knowledge can help alleviate out of memory exceptions, slow load times, slow processing times, |
| garbage collection issues, etc. |
| |
| OLTP Hadoop-Gremlin |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| image:hadoop-pipes.png[width=180,float=left] It is possible to execute OLTP operations over a `HadoopGraph`. |
| However, realize that the underlying HDFS files are not random access and thus, to retrieve a vertex, a linear scan |
| is required. OLTP operations are useful for peeking into the graph prior to executing a long running OLAP job -- e.g. |
| `g.V().valueMap().limit(10)`. |
| |
| CAUTION: OLTP operations on `HadoopGraph` are not efficient. They require linear scans to execute and are unreasonable |
| for large graphs. In such large graph situations, make use of <<traversalvertexprogram,TraversalVertexProgram>> |
| which is the OLAP Gremlin machine. |
| |
| [gremlin-groovy] |
| ---- |
| hdfs.copyFromLocal('data/tinkerpop-modern.kryo', 'tinkerpop-modern.kryo') |
| hdfs.ls() |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal() |
| g.V().count() |
| g.V().out().out().values('name') |
| g.V().group().by{it.value('name')[1]}.by('name').next() |
| ---- |
| |
| OLAP Hadoop-Gremlin |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| image:hadoop-furnace.png[width=180,float=left] Hadoop-Gremlin was designed to execute OLAP operations via |
| `GraphComputer`. The OLTP examples presented previously are reproduced below, but using `TraversalVertexProgram` |
| for the execution of the Gremlin traversal. |
| |
| A `Graph` in TinkerPop3 can support any number of `GraphComputer` implementations. Out of the box, Hadoop-Gremlin |
| supports the following three implementations. |
| |
| * <<mapreducegraphcomputer,`MapReduceGraphComputer`>>: Leverages Hadoop's MapReduce engine to execute TinkerPop3 OLAP |
| computations. (*coming soon*) |
| ** The graph must fit within the total disk space of the Hadoop cluster (supports massive graphs). Message passing is |
| coordinated via MapReduce jobs over the on-disk graph (slow traversals). |
| * <<sparkgraphcomputer,`SparkGraphComputer`>>: Leverages Apache Spark to execute TinkerPop3 OLAP computations. |
| ** The graph may fit within the total RAM of the cluster (supports larger graphs). Message passing is coordinated via |
| Spark map/reduce/join operations on in-memory and disk-cached data (average speed traversals). |
| * <<giraphgraphcomputer,`GiraphGraphComputer`>>: Leverages Apache Giraph to execute TinkerPop3 OLAP computations. |
| ** The graph should fit within the total RAM of the Hadoop cluster (graph size restriction), though "out-of-core" |
| processing is possible. Message passing is coordinated via ZooKeeper for the in-memory graph (speedy traversals). |
| |
| TIP: image:gremlin-sugar.png[width=50,float=left] For those wanting to use the <<sugar-plugin,SugarPlugin>> with |
| their submitted traversal, do `:remote config useSugar true` as well as `:plugin use tinkerpop.sugar` at the start of |
| the Gremlin Console session if it is not already activated. |
| |
| Note that `SparkGraphComputer` and `GiraphGraphComputer` are loaded via their respective plugins. Typically only |
| one plugin or the other is loaded depending on the desired `GraphComputer` to use. |
| |
| [source,text] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| plugin activated: tinkerpop.hadoop |
| gremlin> :install org.apache.tinkerpop giraph-gremlin x.y.z |
| ==>loaded: [org.apache.tinkerpop, giraph-gremlin, x.y.z] - restart the console to use [tinkerpop.giraph] |
| gremlin> :install org.apache.tinkerpop spark-gremlin x.y.z |
| ==>loaded: [org.apache.tinkerpop, spark-gremlin, x.y.z] - restart the console to use [tinkerpop.spark] |
| gremlin> :q |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| plugin activated: tinkerpop.server |
| plugin activated: tinkerpop.utilities |
| plugin activated: tinkerpop.tinkergraph |
| plugin activated: tinkerpop.hadoop |
| gremlin> :plugin use tinkerpop.giraph |
| ==>tinkerpop.giraph activated |
| gremlin> :plugin use tinkerpop.spark |
| ==>tinkerpop.spark activated |
| ---- |
| |
| WARNING: Hadoop, Spark, and Giraph all depend on many of the same libraries (e.g. ZooKeeper, Snappy, Netty, Guava, |
| etc.). Unfortunately, typically these dependencies are not to the same versions of the respective libraries. As such, |
| it is best to *not* have both Spark and Giraph plugins loaded in the same console session nor in the same Java |
| project (though intelligent `<exclusion>`-usage can help alleviate conflicts in a Java project). |
| |
| [[mapreducegraphcomputer]] |
| MapReduceGraphComputer |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| *COMING SOON* |
| |
| [[sparkgraphcomputer]] |
| SparkGraphComputer |
| ^^^^^^^^^^^^^^^^^^ |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>spark-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:spark-logo.png[width=175,float=left] link:http://spark.apache.org[Spark] is an Apache Software Foundation |
| project focused on general-purpose OLAP data processing. Spark provides a hybrid in-memory/disk-based distributed |
| computing model that is similar to Hadoop's MapReduce model. Spark maintains a fluent function chaining DSL that is |
| arguably easier for developers to work with than native Hadoop MapReduce. Spark-Gremlin provides an implementation of |
| the bulk-synchronous parallel, distributed message passing algorithm within Spark and thus, any `VertexProgram` can be |
| executed over `SparkGraphComputer`. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(SparkGraphComputer)) |
| g.V().count() |
| g.V().out().out().values('name') |
| ---- |
| |
| For using lambdas in Gremlin-Groovy, simply provide `:remote connect` a `TraversalSource` which leverages SparkGraphComputer. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(SparkGraphComputer)) |
| :remote connect tinkerpop.hadoop graph g |
| :> g.V().group().by{it.value('name')[1]}.by('name') |
| ---- |
| |
| The `SparkGraphComputer` algorithm leverages Spark's caching abilities to reduce the amount of data shuffled across |
| the wire on each iteration of the <<vertexprogram,`VertexProgram`>>. When the graph is loaded as a Spark RDD |
| (Resilient Distributed Dataset) it is immediately cached as `graphRDD`. The `graphRDD` is a distributed adjacency |
| list which encodes the vertex, its properties, and all its incident edges. On the first iteration, each vertex |
| (in parallel) is passed through `VertexProgram.execute()`. This yields an output of the vertex's mutated state |
| (i.e. updated compute keys -- `propertyX`) and its outgoing messages. This `viewOutgoingRDD` is then reduced to |
| `viewIncomingRDD` where the outgoing messages are sent to their respective vertices. If a `MessageCombiner` exists |
| for the vertex program, then messages are aggregated locally and globally to ultimately yield one incoming message |
| for the vertex. This reduce sequence is the "message pass." If the vertex program does not terminate on this |
| iteration, then the `viewIncomingRDD` is joined with the cached `graphRDD` and the process continues. When there |
| are no more iterations, there is a final join and the resultant RDD is stripped of its edges and messages. This |
| `mapReduceRDD` is cached and is processed by each <<mapreduce,`MapReduce`>> job in the |
| <<graphcomputer,`GraphComputer`>> computation. |
| |
| image::spark-algorithm.png[width=775] |
| |
| [width="100%",cols="2,10",options="header"] |
| |======================================================== |
| |Property |Description |
| |gremlin.spark.graphInputRDD |A class for creating RDD's from underlying graph data, defaults to Hadoop `InputFormat`. |
| |gremlin.spark.graphOutputRDD |A class for output RDD's, defaults to Hadoop `OutputFormat`. |
| |gremlin.spark.persistContext |Whether to create a new `SparkContext` for every `SparkGraphComputer` or to reuse an existing one. |
| |======================================================== |
| |
| If the provider/user wishes to not use Hadoop `InputFormats`, it is possible to leverage Spark's RDD |
| constructs directly. There is a `gremlin.spark.graphInputRDD` configuration that references a `Class<? extends |
| InputRDD>`. An `InputRDD` provides a read method that takes a `SparkContext` and returns a graphRDD. Likewise, use |
| `gremlin.spark.graphOutputRDD` and the respective `OutputRDD`. |
| |
| It is possible to persist the graph RDD between jobs within the `SparkContext` (e.g. SparkServer) by leveraging `PersistedOutputRDD`. |
| Note that `gremlin.spark.persistContext` should be set to `true` or else the persisted RDD will be destroyed when the `SparkContext` closes. |
| The persisted RDD is named by the `gremlin.hadoop.outputLocation` configuration (i.e. named in `SparkContext.getPersistedRDDs()`). |
| Finally, `PersistedInputRDD` is used with respective `gremlin.hadoop.inputLocation` to retrieve the persisted RDD from the `SparkContext`. |
| |
| When using a persistent `Spark Context` the configuration used by the original Spark Configuration will be inherited by all threaded |
| references to that Spark Context. The exception to this rule are those properties which have a specific thread local effect. |
| |
| .Thread Local Properties |
| . spark.jobGroup.id |
| . spark.job.description |
| . spark.job.interruptOnCancel |
| . spark.scheduler.pool |
| |
| Loading with BulkLoaderVertexProgram |
| ++++++++++++++++++++++++++++++++++++ |
| |
| The <<bulkloadervertexprogram, BulkLoaderVertexProgram>> is a generalized bulk loader that can be used to load large |
| amounts of data to and from different `Graph` implementations. The following code demonstrates how to load the |
| Grateful Dead graph from HadoopGraph into TinkerGraph over Spark: |
| |
| [gremlin-groovy] |
| ---- |
| hdfs.copyFromLocal('data/grateful-dead.kryo', 'data/grateful-dead.kryo') |
| readGraph = GraphFactory.open('conf/hadoop/hadoop-grateful-gryo.properties') |
| writeGraph = 'conf/tinkergraph-gryo.properties' |
| blvp = BulkLoaderVertexProgram.build(). |
| keepOriginalIds(false). |
| writeGraph(writeGraph).create(readGraph) |
| readGraph.compute(SparkGraphComputer).workers(1).program(blvp).submit().get() |
| :set max-iteration 10 |
| graph = GraphFactory.open(writeGraph) |
| g = graph.traversal() |
| g.V().valueMap() |
| graph.close() |
| ---- |
| |
| [source,properties] |
| ---- |
| # hadoop-grateful-gryo.properties |
| |
| # |
| # Hadoop Graph Configuration |
| # |
| gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph |
| gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat |
| gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat |
| gremlin.hadoop.inputLocation=data/grateful-dead.kryo |
| gremlin.hadoop.outputLocation=output |
| gremlin.hadoop.deriveMemory=false |
| gremlin.hadoop.jarsInDistributedCache=true |
| |
| # |
| # SparkGraphComputer Configuration |
| # |
| spark.master=local[1] |
| spark.executor.memory=1g |
| spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer |
| ---- |
| |
| [source,properties] |
| ---- |
| # tinkergraph-gryo.properties |
| |
| gremlin.graph=org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph |
| gremlin.tinkergraph.graphFormat=gryo |
| gremlin.tinkergraph.graphLocation=/tmp/tinkergraph.kryo |
| ---- |
| |
| IMPORTANT: The path to TinkerGraph jars needs to be included in the `HADOOP_GREMLIN_LIBS` for the above example to work. |
| |
| [[giraphgraphcomputer]] |
| GiraphGraphComputer |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>giraph-gremlin</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| ---- |
| |
| image:giraph-logo.png[width=100,float=left] link:http://giraph.apache.org[Giraph] is an Apache Software Foundation |
| project focused on OLAP-based graph processing. Giraph makes use of the distributed graph computing paradigm made |
| popular by Google's Pregel. In Giraph, developers write "vertex programs" that get executed at each vertex in |
| parallel. These programs communicate with one another in a bulk synchronous parallel (BSP) manner. This model aligns |
| with TinkerPop3's `GraphComputer` API. TinkerPop3 provides an implementation of `GraphComputer` that works for Giraph |
| called `GiraphGraphComputer`. Moreover, with TinkerPop3's <<mapreduce,MapReduce>>-framework, the standard |
| Giraph/Pregel model is extended to support an arbitrary number of MapReduce phases to aggregate and yield results |
| from the graph. Below are examples using `GiraphGraphComputer` from the <<gremlin-console,Gremlin-Console>>. |
| |
| WARNING: Giraph uses a large number of Hadoop counters. The default for Hadoop is 120. In `mapred-site.xml` it is |
| possible to increase the limit it via the `mapreduce.job.counters.max` property. A good value to use is 1000. This |
| is a cluster-wide property so be sure to restart the cluster after updating. |
| |
| WARNING: The maximum number of workers can be no larger than the number of map-slots in the Hadoop cluster minus 1. |
| For example, if the Hadoop cluster has 4 map slots, then `giraph.maxWorkers` can not be larger than 3. One map-slot |
| is reserved for the master compute node and all other slots can be allocated as workers to execute the VertexPrograms |
| on the vertices of the graph. |
| |
| If `GiraphGraphComputer` will be used as the `GraphComputer` for `HadoopGraph` then its `lib` directory should be |
| specified in `HADOOP_GREMLIN_LIBS`. |
| |
| [source,shell] |
| export HADOOP_GREMLIN_LIBS=$HADOOP_GREMLIN_LIBS:/usr/local/gremlin-console/ext/giraph-gremlin/lib |
| |
| Or, the user can specify the directory in the Gremlin Console. |
| |
| [source,groovy] |
| System.setProperty('HADOOP_GREMLIN_LIBS',System.getProperty('HADOOP_GREMLIN_LIBS') + ':' + '/usr/local/gremlin-console/ext/giraph-gremlin/lib') |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(GiraphGraphComputer)) |
| g.V().count() |
| g.V().out().out().values('name') |
| ---- |
| |
| IMPORTANT: The examples above do not use lambdas (i.e. closures in Gremlin-Groovy). This makes the traversal |
| serializable and thus, able to be distributed to all machines in the Hadoop cluster. If a lambda is required in a |
| traversal, then the traversal must be sent as a `String` and compiled locally at each machine in the cluster. The |
| following example demonstrates the `:remote` command which allows for submitting Gremlin traversals as a `String`. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(GiraphGraphComputer)) |
| :remote connect tinkerpop.hadoop graph g |
| :> g.V().group().by{it.value('name')[1]}.by('name') |
| result |
| result.memory.runtime |
| result.memory.keys() |
| result.memory.get('~reducing') |
| ---- |
| |
| NOTE: If the user explicitly specifies `giraph.maxWorkers` and/or `giraph.numComputeThreads` in the configuration, |
| then these values will be used by Giraph. However, if these are not specified and the user never calls |
| `GraphComputer.workers()` then `GiraphGraphComputer` will try to compute the number of workers/threads to use based |
| on the cluster's profile. |
| |
| Loading with BulkLoaderVertexProgram |
| ++++++++++++++++++++++++++++++++++++ |
| |
| The <<bulkloadervertexprogram, BulkLoaderVertexProgram>> is a generalized bulk loader that can be used to load |
| large amounts of data to and from different `Graph` implementations. The following code demonstrates how to load |
| the Grateful Dead graph from HadoopGraph into TinkerGraph over Giraph: |
| |
| [gremlin-groovy] |
| ---- |
| hdfs.copyFromLocal('data/grateful-dead.kryo', 'data/grateful-dead.kryo') |
| readGraph = GraphFactory.open('conf/hadoop/hadoop-grateful-gryo.properties') |
| writeGraph = 'conf/tinkergraph-gryo.properties' |
| blvp = BulkLoaderVertexProgram.build(). |
| keepOriginalIds(false). |
| writeGraph(writeGraph).create(readGraph) |
| readGraph.compute(GiraphGraphComputer).workers(1).program(blvp).submit().get() |
| :set max-iteration 10 |
| graph = GraphFactory.open(writeGraph) |
| g = graph.traversal() |
| g.V().valueMap() |
| graph.close() |
| ---- |
| |
| [source,properties] |
| ---- |
| # hadoop-grateful-gryo.properties |
| |
| # |
| # Hadoop Graph Configuration |
| # |
| gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph |
| gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat |
| gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat |
| gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat |
| gremlin.hadoop.inputLocation=data/grateful-dead.kryo |
| gremlin.hadoop.outputLocation=output |
| gremlin.hadoop.deriveMemory=false |
| gremlin.hadoop.jarsInDistributedCache=true |
| |
| # |
| # GiraphGraphComputer Configuration |
| # |
| giraph.minWorkers=1 |
| giraph.maxWorkers=1 |
| giraph.useOutOfCoreGraph=true |
| giraph.useOutOfCoreMessages=true |
| mapred.map.child.java.opts=-Xmx1024m |
| mapred.reduce.child.java.opts=-Xmx1024m |
| giraph.numInputThreads=4 |
| giraph.numComputeThreads=4 |
| giraph.maxMessagesInMemory=100000 |
| ---- |
| |
| [source,properties] |
| ---- |
| # tinkergraph-gryo.properties |
| |
| gremlin.graph=org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph |
| gremlin.tinkergraph.graphFormat=gryo |
| gremlin.tinkergraph.graphLocation=/tmp/tinkergraph.kryo |
| ---- |
| |
| NOTE: The path to TinkerGraph needs to be included in the `HADOOP_GREMLIN_LIBS` for the above example to work. |
| |
| Input/Output Formats |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| image:adjacency-list.png[width=300,float=right] Hadoop-Gremlin provides various I/O formats -- i.e. Hadoop |
| `InputFormat` and `OutputFormat`. All of the formats make use of an link:http://en.wikipedia.org/wiki/Adjacency_list[adjacency list] |
| representation of the graph where each "row" represents a single vertex, its properties, and its incoming and |
| outgoing edges. |
| |
| {empty} + |
| |
| [[gryo-io-format]] |
| Gryo I/O Format |
| ^^^^^^^^^^^^^^^ |
| |
| * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat` |
| * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat` |
| |
| <<gryo-reader-writer,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo] |
| to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time |
| savings over text-based representations. |
| |
| NOTE: The `GryoInputFormat` is splittable. |
| |
| [[graphson-io-format]] |
| GraphSON I/O Format |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat` |
| * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat` |
| |
| <<graphson-reader-writer,GraphSON>> is a JSON based graph format. GraphSON is a space-expensive graph format in that |
| it is a text-based markup language. However, it is convenient for many developers to work with as its structure is |
| simple (easy to create and parse). |
| |
| The data below represents an adjacency list representation of the classic TinkerGraph toy graph in GraphSON format. |
| |
| [source,json] |
| {"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}} |
| {"id":2,"label":"person","inE":{"knows":[{"id":7,"outV":1,"properties":{"weight":0.5}}]},"properties":{"name":[{"id":2,"value":"vadas"}],"age":[{"id":3,"value":27}]}} |
| {"id":3,"label":"software","inE":{"created":[{"id":9,"outV":1,"properties":{"weight":0.4}},{"id":11,"outV":4,"properties":{"weight":0.4}},{"id":12,"outV":6,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":4,"value":"lop"}],"lang":[{"id":5,"value":"java"}]}} |
| {"id":4,"label":"person","inE":{"knows":[{"id":8,"outV":1,"properties":{"weight":1.0}}]},"outE":{"created":[{"id":10,"inV":5,"properties":{"weight":1.0}},{"id":11,"inV":3,"properties":{"weight":0.4}}]},"properties":{"name":[{"id":6,"value":"josh"}],"age":[{"id":7,"value":32}]}} |
| {"id":5,"label":"software","inE":{"created":[{"id":10,"outV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":8,"value":"ripple"}],"lang":[{"id":9,"value":"java"}]}} |
| {"id":6,"label":"person","outE":{"created":[{"id":12,"inV":3,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":10,"value":"peter"}],"age":[{"id":11,"value":35}]}} |
| |
| [[script-io-format]] |
| Script I/O Format |
| ^^^^^^^^^^^^^^^^^ |
| |
| * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat` |
| * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptOutputFormat` |
| |
| `ScriptInputFormat` and `ScriptOutputFormat` take an arbitrary script and use that script to either read or write |
| `Vertex` objects, respectively. This can be considered the most general `InputFormat`/`OutputFormat` possible in that |
| Hadoop-Gremlin uses the user provided script for all reading/writing. |
| |
| ScriptInputFormat |
| +++++++++++++++++ |
| |
| The data below represents an adjacency list representation of the classic TinkerGraph toy graph. First line reads, |
| "vertex `1`, labeled `person` having 2 property values (`marko` and `29`) has 3 outgoing edges; the first edge is |
| labeled `knows`, connects the current vertex `1` with vertex `2` and has a property value `0.4`, and so on." |
| |
| [source] |
| 1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4 |
| 2:person:vadas:27 |
| 3:project:lop:java |
| 4:person:josh:32 created:3:0.4,created:5:1.0 |
| 5:project:ripple:java |
| 6:person:peter:35 created:3:0.2 |
| |
| There is no corresponding `InputFormat` that can parse this particular file (or some adjacency list variant of it). |
| As such, `ScriptInputFormat` can be used. With `ScriptInputFormat` a script is stored in HDFS and leveraged by each |
| mapper in the Hadoop job. The script must have the following method defined: |
| |
| [source,groovy] |
| def parse(String line, ScriptElementFactory factory) { ... } |
| |
| `ScriptElementFactory` provides the following 4 methods: |
| |
| [source,java] |
| Vertex vertex(Object id); // get or create the vertex with the given id |
| Vertex vertex(Object id, String label); // get or create the vertex with the given id and label |
| Edge edge(Vertex out, Vertex in); // create an edge between the two given vertices |
| Edge edge(Vertex out, Vertex in, String label); // create an edge between the two given vertices using the given label |
| |
| An appropriate `parse()` for the above adjacency list file is: |
| |
| [source,groovy] |
| def parse(line, factory) { |
| def parts = line.split(/ /) |
| def (id, label, name, x) = parts[0].split(/:/).toList() |
| def v1 = factory.vertex(id, label) |
| if (name != null) v1.property('name', name) // first value is always the name |
| if (x != null) { |
| // second value depends on the vertex label; it's either |
| // the age of a person or the language of a project |
| if (label.equals('project')) v1.property('lang', x) |
| else v1.property('age', Integer.valueOf(x)) |
| } |
| if (parts.length == 2) { |
| parts[1].split(/,/).grep { !it.isEmpty() }.each { |
| def (eLabel, refId, weight) = it.split(/:/).toList() |
| def v2 = factory.vertex(refId) |
| def edge = factory.edge(v1, v2, eLabel) |
| edge.property('weight', Double.valueOf(weight)) |
| } |
| } |
| return v1 |
| } |
| |
| The resultant `Vertex` denotes whether the line parsed yielded a valid Vertex. As such, if the line is not valid |
| (e.g. a comment line, a skip line, etc.), then simply return `null`. |
| |
| ScriptOutputFormat Support |
| ++++++++++++++++++++++++++ |
| |
| The principle above can also be used to convert a vertex to an arbitrary `String` representation that is ultimately |
| streamed back to a file in HDFS. This is the role of `ScriptOutputFormat`. `ScriptOutputFormat` requires that the |
| provided script maintains a method with the following signature: |
| |
| [source,groovy] |
| def stringify(Vertex vertex) { ... } |
| |
| An appropriate `stringify()` to produce output in the same format that was shown in the `ScriptInputFormat` sample is: |
| |
| [source,groovy] |
| def stringify(vertex) { |
| def v = vertex.values('name', 'age', 'lang').inject(vertex.id(), vertex.label()).join(':') |
| def outE = vertex.outE().map { |
| def e = it.get() |
| e.values('weight').inject(e.label(), e.inV().next().id()).join(':') |
| }.join(',') |
| return [v, outE].join('\t') |
| } |
| |
| Interacting with HDFS |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| The distributed file system of Hadoop is called link:http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system[HDFS]. |
| The results of any OLAP operation are stored in HDFS accessible via `hdfs`. |
| |
| [gremlin-groovy] |
| ---- |
| graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties') |
| g = graph.traversal(computer(SparkGraphComputer)) |
| :remote connect tinkerpop.hadoop graph g |
| :> g.V().group().by{it.value('name')[1]}.by('name') |
| hdfs.ls() |
| hdfs.ls('output') |
| hdfs.ls('output/~reducing') |
| hdfs.head('output/~reducing', ObjectWritable) |
| ---- |
| |
| A list of the HDFS methods available are itemized below. Note that these methods are also available for the 'local' variable: |
| |
| [width="100%",cols="13,10",options="header"] |
| |========================================================= |
| | Method| Description |
| |hdfs.ls(String path)| List the contents of the supplied directory. |
| |hdfs.cp(String from, String to)| Copy the specified path to the specified path. |
| |hdfs.exists(String path)| Whether the specified path exists. |
| |hdfs.rm(String path)| Remove the specified path. |
| |hdfs.rmr(String path)| Remove the specified path and its contents recurssively. |
| |hdfs.copyToLocal(String from, String to)| Copy the specified HDFS path to the specified local path. |
| |hdfs.copyFromLocal(String from, String to)| Copy the specified local path to the specified HDFS path. |
| |hdfs.mergeToLocal(String from, String to)| Merge the files in path to the specified local path. |
| |hdfs.head(String path)| Display the data in the path as text. |
| |hdfs.head(String path, int lineCount)| Text display only the first `lineCount`-number of lines in the path. |
| |hdfs.head(String path, int totalKeyValues, Class<Writable> writableClass)| Display the path interpreting the key values as respective writable. |
| |========================================================= |
| |
| A Command Line Example |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| image::pagerank-logo.png[width=300] |
| |
| The classic link:http://en.wikipedia.org/wiki/PageRank[PageRank] centrality algorithm can be executed over the |
| TinkerPop graph from the command line using `GiraphGraphComputer`. |
| |
| WARNING: Be sure that the `HADOOP_GREMLIN_LIBS` references the location `lib` directory of the respective |
| `GraphComputer` engine being used or else the requisite dependencies will not be uploaded to the Hadoop cluster. |
| |
| [source,text] |
| ---- |
| $ hdfs dfs -copyFromLocal data/tinkerpop-modern.json tinkerpop-modern.json |
| $ hdfs dfs -ls |
| Found 2 items |
| -rw-r--r-- 1 marko supergroup 2356 2014-07-28 13:00 /user/marko/tinkerpop-modern.json |
| $ hadoop jar target/giraph-gremlin-x.y.z-job.jar org.apache.tinkerpop.gremlin.giraph.process.computer.GiraphGraphComputer ../hadoop-gremlin/conf/hadoop-graphson.properties |
| 15/09/11 08:02:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable |
| 15/09/11 08:02:11 INFO computer.GiraphGraphComputer: HadoopGremlin(Giraph): PageRankVertexProgram[alpha=0.85,iterations=30] |
| 15/09/11 08:02:12 INFO mapreduce.JobSubmitter: number of splits:3 |
| 15/09/11 08:02:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441915907347_0028 |
| 15/09/11 08:02:12 INFO impl.YarnClientImpl: Submitted application application_1441915907347_0028 |
| 15/09/11 08:02:12 INFO job.GiraphJob: Tracking URL: http://markos-macbook:8088/proxy/application_1441915907347_0028/ |
| 15/09/11 08:02:12 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 3 mappers |
| 15/09/11 08:03:54 INFO mapreduce.Job: Running job: job_1441915907347_0028 |
| 15/09/11 08:03:55 INFO mapreduce.Job: Job job_1441915907347_0028 running in uber mode : false |
| 15/09/11 08:03:55 INFO mapreduce.Job: map 33% reduce 0% |
| 15/09/11 08:03:57 INFO mapreduce.Job: map 67% reduce 0% |
| 15/09/11 08:04:01 INFO mapreduce.Job: map 100% reduce 0% |
| 15/09/11 08:06:17 INFO mapreduce.Job: Job job_1441915907347_0028 completed successfully |
| 15/09/11 08:06:17 INFO mapreduce.Job: Counters: 80 |
| File System Counters |
| FILE: Number of bytes read=0 |
| FILE: Number of bytes written=483918 |
| FILE: Number of read operations=0 |
| FILE: Number of large read operations=0 |
| FILE: Number of write operations=0 |
| HDFS: Number of bytes read=1465 |
| HDFS: Number of bytes written=1760 |
| HDFS: Number of read operations=39 |
| HDFS: Number of large read operations=0 |
| HDFS: Number of write operations=20 |
| Job Counters |
| Launched map tasks=3 |
| Other local map tasks=3 |
| Total time spent by all maps in occupied slots (ms)=458105 |
| Total time spent by all reduces in occupied slots (ms)=0 |
| Total time spent by all map tasks (ms)=458105 |
| Total vcore-seconds taken by all map tasks=458105 |
| Total megabyte-seconds taken by all map tasks=469099520 |
| Map-Reduce Framework |
| Map input records=3 |
| Map output records=0 |
| Input split bytes=132 |
| Spilled Records=0 |
| Failed Shuffles=0 |
| Merged Map outputs=0 |
| GC time elapsed (ms)=1594 |
| CPU time spent (ms)=0 |
| Physical memory (bytes) snapshot=0 |
| Virtual memory (bytes) snapshot=0 |
| Total committed heap usage (bytes)=527958016 |
| Giraph Stats |
| Aggregate edges=0 |
| Aggregate finished vertices=0 |
| Aggregate sent message message bytes=13535 |
| Aggregate sent messages=186 |
| Aggregate vertices=6 |
| Current master task partition=0 |
| Current workers=2 |
| Last checkpointed superstep=0 |
| Sent message bytes=438 |
| Sent messages=6 |
| Superstep=31 |
| Giraph Timers |
| Initialize (ms)=2996 |
| Input superstep (ms)=5209 |
| Setup (ms)=59 |
| Shutdown (ms)=9324 |
| Superstep 0 GiraphComputation (ms)=3861 |
| Superstep 1 GiraphComputation (ms)=4027 |
| Superstep 10 GiraphComputation (ms)=4000 |
| Superstep 11 GiraphComputation (ms)=4004 |
| Superstep 12 GiraphComputation (ms)=3999 |
| Superstep 13 GiraphComputation (ms)=4000 |
| Superstep 14 GiraphComputation (ms)=4005 |
| Superstep 15 GiraphComputation (ms)=4003 |
| Superstep 16 GiraphComputation (ms)=4001 |
| Superstep 17 GiraphComputation (ms)=4007 |
| Superstep 18 GiraphComputation (ms)=3998 |
| Superstep 19 GiraphComputation (ms)=4006 |
| Superstep 2 GiraphComputation (ms)=4007 |
| Superstep 20 GiraphComputation (ms)=3996 |
| Superstep 21 GiraphComputation (ms)=4006 |
| Superstep 22 GiraphComputation (ms)=4002 |
| Superstep 23 GiraphComputation (ms)=3998 |
| Superstep 24 GiraphComputation (ms)=4003 |
| Superstep 25 GiraphComputation (ms)=4001 |
| Superstep 26 GiraphComputation (ms)=4003 |
| Superstep 27 GiraphComputation (ms)=4005 |
| Superstep 28 GiraphComputation (ms)=4002 |
| Superstep 29 GiraphComputation (ms)=4001 |
| Superstep 3 GiraphComputation (ms)=3988 |
| Superstep 30 GiraphComputation (ms)=4248 |
| Superstep 4 GiraphComputation (ms)=4010 |
| Superstep 5 GiraphComputation (ms)=3998 |
| Superstep 6 GiraphComputation (ms)=3996 |
| Superstep 7 GiraphComputation (ms)=4005 |
| Superstep 8 GiraphComputation (ms)=4009 |
| Superstep 9 GiraphComputation (ms)=3994 |
| Total (ms)=138788 |
| File Input Format Counters |
| Bytes Read=0 |
| File Output Format Counters |
| Bytes Written=0 |
| $ hdfs dfs -cat output/~g/* |
| {"id":1,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.15000000000000002}],"name":[{"id":0,"value":"marko"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":3.0}],"age":[{"id":1,"value":29}]}} |
| {"id":5,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.23181250000000003}],"name":[{"id":8,"value":"ripple"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"lang":[{"id":9,"value":"java"}]}} |
| {"id":3,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.4018125}],"name":[{"id":4,"value":"lop"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":0.0}],"lang":[{"id":5,"value":"java"}]}} |
| {"id":4,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],"name":[{"id":6,"value":"josh"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],"age":[{"id":7,"value":32}]}} |
| {"id":2,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.19250000000000003}],"name":[{"id":2,"value":"vadas"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"age":[{"id":3,"value":27}]}} |
| {"id":6,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.15000000000000002}],"name":[{"id":10,"value":"peter"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":1.0}],"age":[{"id":11,"value":35}]}} |
| ---- |
| |
| Vertex 4 ("josh") is isolated below: |
| |
| [source,js] |
| ---- |
| { |
| "id":4, |
| "label":"person", |
| "properties": { |
| "gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}], |
| "name":[{"id":6,"value":"josh"}], |
| "gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}], |
| "age":[{"id":7,"value":32}]} |
| } |
| } |
| ---- |
| |
| Hadoop-Gremlin for Graph System Providers |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Hadoop-Gremlin is centered around `InputFormats` and `OutputFormats`. If a 3rd-party graph system provider wishes to |
| leverage Hadoop-Gremlin (and its respective `GraphComputer` engines), then they need to provide, at minimum, a |
| Hadoop2 `InputFormat<NullWritable,VertexWritable>` for their graph system. If the provider wishes to persist computed |
| results back to their graph system (and not just to HDFS via a `FileOutputFormat`), then a graph system specific |
| `OutputFormat<NullWritable,VertexWritable>` must be developed as well. |
| |
| Conceptually, `HadoopGraph` is a wrapper around a `Configuration` object. There is no "data" in the `HadoopGraph` as |
| the `InputFormat` specifies where and how to get the graph data at OLAP (and OLTP) runtime. Thus, `HadoopGraph` is a |
| small object with little overhead. Graph system providers should realize `HadoopGraph` as the gateway to the OLAP |
| features offered by Hadoop-Gremlin. For example, a graph system specific `Graph.compute(Class<? extends GraphComputer> |
| graphComputerClass)`-method may look as follows: |
| |
| [source,java] |
| ---- |
| public <C extends GraphComputer> C compute(final Class<C> graphComputerClass) throws IllegalArgumentException { |
| try { |
| if (AbstractHadoopGraphComputer.class.isAssignableFrom(graphComputerClass)) |
| return graphComputerClass.getConstructor(HadoopGraph.class).newInstance(this); |
| else |
| throw Graph.Exceptions.graphDoesNotSupportProvidedGraphComputer(graphComputerClass); |
| } catch (final Exception e) { |
| throw new IllegalArgumentException(e.getMessage(),e); |
| } |
| } |
| ---- |
| |
| Note that the configurations for Hadoop are assumed to be in the `Graph.configuration()` object. If this is not the |
| case, then the `Configuration` provided to `HadoopGraph.open()` should be dynamically created within the |
| `compute()`-method. It is in the provided configuration that `HadoopGraph` gets the various properties which |
| determine how to read and write data to and from Hadoop. For instance, `gremlin.hadoop.graphInputFormat` and |
| `gremlin.hadoop.graphOutputFormat`. |
| |
| IMPORTANT: A graph system provider's `OutputFormat` should implement the `PersistResultGraphAware` interface which |
| determines which persistence options are available to the user. For the standard file-based `OutputFormats` provided |
| by Hadoop-Gremlin (e.g. <<gryo-io-format,`GryoOutputFormat`>>, <<graphson-io-format,`GraphSONOutputFormat`>>, |
| and <<script-io-format,`ScriptInputOutputFormat`>>) `ResultGraph.ORIGINAL` is not supported as the original graph |
| data files are not random access and are, in essence, immutable. Thus, these file-based `OutputFormats` only support |
| `ResultGraph.NEW` which creates a copy of the data specified by the `Persist` enum. |
| |