| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[intro]] |
| Introduction to Graph Computing |
| =============================== |
| |
| image::graph-computing.png[width=350] |
| |
| [source,xml] |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>gremlin-core</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| |
| A link:http://en.wikipedia.org/wiki/Graph_(data_structure)[graph] is a data structure composed of vertices (nodes, |
| dots) and edges (arcs, lines). When modeling a graph in a computer and applying it to modern data sets and practices, |
| the generic mathematically-oriented, binary graph is extended to support both labels and key/value properties. This |
| structure is known as a property graph. More formally, it is a directed, binary, attributed multi-graph. An example |
| property graph is diagrammed below. This graph example will be used extensively throughout the documentation and is |
| called "TinkerPop Classic" as it is the original demo graph distributed with TinkerPop0 back in 2009 (i.e. the good |
| ol' days -- it was the best of times and it was the worst of times). |
| |
| TIP: The TinkerPop graph is available with <<tinkergraph-gremlin,TinkerGraph>> via `TinkerFactory.createModern()`. |
| TinkerGraph is the reference implementation of TinkerPop3 and is used in nearly all the examples in this documentation. |
| Note that there also exists the classic `TinkerFactory.createClassic()` which is the graph used in TinkerPop2 and does |
| not include vertex labels. |
| |
| [[tinkerpop-modern]] |
| .TinkerPop Modern |
| image::tinkerpop-modern.png[width=500] |
| |
| TinkerPop3 is the third incarnation of the TinkerPop graph computing framework. Similar to computing in general, graph |
| computing makes a distinction between *structure* (graph) and *process* (traversal). The structure of the graph is the |
| data model defined by a vertex/edge/property link:http://en.wikipedia.org/wiki/Network_topology[topology]. The process |
| of the graph is the means by which the structure is analyzed. The typical form of graph processing is called a |
| link:http://en.wikipedia.org/wiki/Graph_traversal[traversal]. |
| |
| .Primary components of the TinkerPop3 *structure* API |
| * `Graph`: maintains a set of vertices and edges, and access to database functions such as transactions. |
| * `Element`: maintains a collection of properties and a string label denoting the element type. |
| ** `Vertex`: extends Element and maintains a set of incoming and outgoing edges. |
| ** `Edge`: extends Element and maintains an incoming and outgoing vertex. |
| * `Property<V>`: a string key associated with a `V` value. |
| ** `VertexProperty<V>`: a string key associated with a `V` value as well as a collection of `Property<U>` properties (*vertices only*) |
| |
| .Primary components of the TinkerPop3 *process* API |
| * `TraversalSource`: a generator of traversals for a particular graph, link:http://en.wikipedia.org/wiki/Domain-specific_language[domain specific language] (DSL), and execution engine. |
| ** `Traversal<S,E>`: a functional data flow process transforming objects of type `S` into object of type `E`. |
| *** `GraphTraversal`: a traversal DSL that is oriented towards the semantics of the raw graph (i.e. vertices, edges, etc.). |
| * `GraphComputer`: a system that processes the graph in parallel and potentially, distributed over a multi-machine cluster. |
| ** `VertexProgram`: code executed at all vertices in a logically parallel manner with intercommunication via message passing. |
| ** `MapReduce`: a computations that analyzes all vertices in the graph in parallel and yields a single reduced result. |
| |
| IMPORTANT: TinkerPop3 is licensed under the popular link:http://www.apache.org/licenses/LICENSE-2.0.html[Apache2] |
| free software license. However, note that the underlying graph engine used with TinkerPop3 may have a different |
| license. Thus, be sure to respect the license caveats of the graph system product. |
| |
| image:tinkerpop-enabled.png[width=135,float=left] When a graph system implements the TinkerPop3 structure and process |
| link:http://en.wikipedia.org/wiki/Application_programming_interface[APIs], their technology is considered |
| _TinkerPop3-enabled_ and becomes nearly indistinguishable from any other TinkerPop-enabled graph system save for |
| their respective time and space complexity. The purpose of this documentation is to describe the structure/process |
| dichotomy at length and in doing so, explain how to leverage TinkerPop3 for the sole purpose of graph system-agnostic |
| graph computing. Before deep-diving into the various structure/process APIs, a short introductory review of both APIs |
| is provided. |
| |
| NOTE: The TinkerPop3 API rides a fine line between providing concise "query language" method names and respecting |
| Java method naming standards. The general convention used throughout TinkerPop3 is that if a method is "user exposed," |
| then a concise name is provided (e.g. `out()`, `path()`, `repeat()`). If the method is primarily for graph systems |
| providers, then the standard Java naming convention is followed (e.g. `getNextStep()`, `getSteps()`, |
| `getElementComputeKeys()`). |
| |
| The Graph Structure |
| ------------------- |
| |
| image:gremlin-standing.png[width=125,float=left] A graph's structure is the topology formed by the explicit references |
| between its vertices, edges, and properties. A vertex has incident edges. A vertex is adjacent to another vertex if |
| they share an incident edge. A property is attached to an element and an element has a set of properties. A property |
| is a key/value pair, where the key is always a character `String`. The graph structure API of TinkerPop3 provides the |
| methods necessary to create such a structure. The TinkerPop graph previously diagrammed can be created with the |
| following Java 8 code. Note that this graph is available as an in-memory TinkerGraph using |
| `TinkerFactory.createClassic()`. |
| |
| [source,java] |
| Graph graph = TinkerGraph.open(); <1> |
| Vertex marko = graph.addVertex(T.label, "person", T.id, 1, "name", "marko", "age", 29); <2> |
| Vertex vadas = graph.addVertex(T.label, "person", T.id, 2, "name", "vadas", "age", 27); |
| Vertex lop = graph.addVertex(T.label, "software", T.id, 3, "name", "lop", "lang", "java"); |
| Vertex josh = graph.addVertex(T.label, "person", T.id, 4, "name", "josh", "age", 32); |
| Vertex ripple = graph.addVertex(T.label, "software", T.id, 5, "name", "ripple", "lang", "java"); |
| Vertex peter = graph.addVertex(T.label, "person", T.id, 6, "name", "peter", "age", 35); |
| marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f); <3> |
| marko.addEdge("knows", josh, T.id, 8, "weight", 1.0f); |
| marko.addEdge("created", lop, T.id, 9, "weight", 0.4f); |
| josh.addEdge("created", ripple, T.id, 10, "weight", 1.0f); |
| josh.addEdge("created", lop, T.id, 11, "weight", 0.4f); |
| peter.addEdge("created", lop, T.id, 12, "weight", 0.2f); |
| |
| <1> Create a new in-memory `TinkerGraph` and assign it to the variable `graph`. |
| <2> Create a vertex along with a set of key/value pairs with `T.label` being the vertex label and `T.id` being the vertex id. |
| <3> Create an edge along with a set of key/value pairs with the edge label being specified as the first argument. |
| |
| In the above code all the vertices are created first and then their respective edges. There are two "accessor tokens": |
| `T.id` and `T.label`. When any of these, along with a set of other key value pairs is provided to |
| `Graph.addVertex(Object...)` or `Vertex.addEdge(String,Vertex,Object...)`, the respective element is created along |
| with the provided key/value pair properties appended to it. |
| |
| WARNING: Many graph systems do not allow the user to specify an element ID and in such cases, an exception is thrown. |
| |
| NOTE: In TinkerPop3, vertices are allowed a single immutable string label (similar to an edge label). This |
| functionality did not exist in TinkerPop2. Element ids are still immutable in TinkerPop3 as they were in TinkerPop2. |
| |
| Mutating the Graph |
| ~~~~~~~~~~~~~~~~~~ |
| |
| Below is a sequence of basic graph mutation operations represented in Java 8. One of the major differences between |
| TinkerPop2 and TinkerPop3 is that in TinkerPop3, the Java convention of using setters and getters has been abandoned |
| in favor of a syntax that is more aligned with the syntax of Gremlin-Groovy in TinkerPop2. Given that Gremlin-Java8 |
| and Gremlin-Groovy are nearly identical due to the inclusion of Java 8 lambdas, a big effort was made to ensure that |
| both languages are as similar as possible. |
| |
| WARNING: In the code examples presented throughout this documentation, either Gremlin-Java8 or Gremlin-Groovy is used. |
| It is possible to determine which derivative of Gremlin is being used by mousing over the code block. The word "JAVA" |
| or "GROOVY" will appear in the top right corner of the code block. |
| |
| image:basic-mutation.png[width=240,float=right] |
| [source,java] |
| // create a new graph |
| Graph graph = TinkerGraph.open(); |
| // add a software vertex with a name property |
| Vertex gremlin = graph.addVertex(T.label, "software", |
| "name", "gremlin"); <1> |
| // only one vertex should exist |
| assert(IteratorUtils.count(graph.vertices()) == 1) |
| // no edges should exist as none have been created |
| assert(IteratorUtils.count(graph.edges()) == 0) |
| // add a new property |
| gremlin.property("created",2009) <2> |
| // add a new software vertex to the graph |
| Vertex blueprints = graph.addVertex(T.label, "software", |
| "name", "blueprints"); <3> |
| // connect gremlin to blueprints via a dependsOn-edge |
| gremlin.addEdge("dependsOn",blueprints); <4> |
| // now there are two vertices and one edge |
| assert(IteratorUtils.count(graph.vertices()) == 2) |
| assert(IteratorUtils.count(graph.edges()) == 1) |
| // add a property to blueprints |
| blueprints.property("created",2010) <5> |
| // remove that property |
| blueprints.property("created").remove() <6> |
| // connect gremlin to blueprints via encapsulates |
| gremlin.addEdge("encapsulates",blueprints) <7> |
| assert(IteratorUtils.count(graph.vertices()) == 2) |
| assert(IteratorUtils.count(graph.edges()) == 2) |
| // removing a vertex removes all its incident edges as well |
| blueprints.remove() <8> |
| gremlin.remove() <9> |
| // the graph is now empty |
| assert(IteratorUtils.count(graph.vertices()) == 0) |
| assert(IteratorUtils.count(graph.edges()) == 0) |
| // tada! |
| |
| IMPORTANT: image:groovy-logo.png[width=175,float=left] Gremlin-Groovy leverages the |
| link:http://groovy.codehaus.org/[Groovy 2.x language] to express Gremlin traversals. One of the major benefits of |
| Groovy is the inclusion of a runtime console that makes it easy for developers to practice with the Gremlin language |
| and for production users to connect to their graph and execute traversals in an interactive manner. Moreover, |
| Gremlin-Groovy provides various syntax simplifications. |
| |
| TIP: image:gremlin-sugar.png[width=100,float=left] For those wishing to use the Gremlin2 syntax, please see |
| <<sugar-plugin,SugarPlugin>>. This plugin provides syntactic sugar at, typically, a runtime cost. It can be loaded |
| programmatically via `SugarLoader.load()`. Once loaded, it is possible to do `g.V.out.name` instead of |
| `g.V().out().values('name')` as well as a host of other conveniences. |
| |
| Here is the same code, but using Gremlin-Groovy in the <<gremlin-console,Gremlin Console>>. |
| |
| [source,groovy] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| gremlin> graph = TinkerGraph.open() |
| ==>tinkergraph[vertices:0 edges:0] |
| gremlin> gremlin = graph.addVertex(label,'software','name','gremlin') |
| ==>v[0] |
| gremlin> gremlin.property('created',2009) |
| ==>vp[created->2009] |
| gremlin> blueprints = graph.addVertex(label,'software','name','blueprints') |
| ==>v[3] |
| gremlin> gremlin.addEdge('dependsOn',blueprints) |
| ==>e[5][0-dependsOn->3] |
| gremlin> blueprints.property('created',2010) |
| ==>vp[created->2010] |
| gremlin> blueprints.property('created').remove() |
| ==>null <1> |
| gremlin> gremlin.addEdge('encapsulates',blueprints) |
| ==>e[7][0-encapsulates->3] |
| gremlin> blueprints.remove() |
| ==>null |
| gremlin> gremlin.remove() |
| ==>null |
| ---- |
| |
| <1> A `==>null` output is usually from a `void` method call and simply indicates that there was no problem with the |
| invocation. If there were a problem, an error would be output or an exception would be thrown. |
| |
| IMPORTANT: TinkerGraph is not a transactional graph. For more information on transaction handling (for those graph |
| systems that support them) see the section dedicated to <<transactions,transactions>>. |
| |
| The Graph Process |
| ----------------- |
| |
| image:gremlin-running.png[width=125,float=left] The primary way in which graphs are processed are via graph |
| traversals. The TinkerPop3 process API is focused on allowing users to create graph traversals in a |
| syntactically-friendly way over the structures defined in the previous section. A traversal is an algorithmic walk |
| across the elements of a graph according to the referential structure explicit within the graph data structure. |
| For example: _"What software does vertex 1's friends work on?"_ This English-statement can be represented in the |
| following algorithmic/traversal fashion: |
| |
| . Start at vertex 1. |
| . Walk the incident knows-edges to the respective adjacent friend vertices of 1. |
| . Move from those friend-vertices to software-vertices via created-edges. |
| . Finally, select the name-property value of the current software-vertices. |
| |
| Traversals in Gremlin are spawned from a `TraversalSource`. The `GraphTraversalSource` is the typical "graph-oriented" |
| DSL used throughout the documentation and will most likely be the most used DSL in a TinkerPop application. |
| `GraphTraversalSource` provides two traversal methods. |
| |
| . `GraphTraversalSource.V(Object... ids)`: generates a traversal starting at vertices in the graph (if no ids are provided, all vertices). |
| . `GraphTraversalSource.E(Object... ids)`: generates a traversal starting at edges in the graph (if no ids are provided, all edges). |
| |
| The return type of `V()` and `E()` is a `GraphTraversal`. A GraphTraversal maintains numerous methods that return |
| `GraphTraversal`. In this way, a `GraphTraversal` supports function composition. Each method of `GraphTraversal` is |
| called a step and each step modulates the results of the previous step in one of five general ways. |
| |
| . `map`: transform the incoming traverser's object to another object (S → E). |
| . `flatMap`: transform the incoming traverser's object to an iterator of other objects (S → E*). |
| . `filter`: allow or disallow the traverser from proceeding to the next step (S → S ∪ ∅). |
| . `sideEffect`: allow the traverser to proceed unchanged, but yield some computational sideEffect in the process (S ↬ S). |
| . `branch`: split the traverser and send each to an arbitrary location in the traversal (S → { S~1~ → E*, ..., S~n~ → E* } → E*). |
| |
| Nearly every step in GraphTraversal either extends `MapStep`, `FlatMapStep`, `FilterStep`, `SideEffectStep`, or `BranchStep`. |
| |
| TIP: `GraphTraversal` is a link:http://en.wikipedia.org/wiki/Monoid[monoid] in that it is an algebraic structure |
| that has a single binary operation that is associative. The binary operation is function composition (i.e. method |
| chaining) and its identity is the step `identity()`. This is related to a |
| link:http://en.wikipedia.org/wiki/Monad_(functional_programming)[monad] as popularized by the functional programming |
| community. |
| |
| Given the TinkerPop graph, the following query will return the names of all the people that the marko-vertex knows. |
| The following query is demonstrated using Gremlin-Groovy. |
| |
| [source,groovy] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| gremlin> graph = TinkerFactory.createModern() // <1> |
| ==>tinkergraph[vertices:6 edges:6] |
| gremlin> g = graph.traversal(standard()) // <2> |
| ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard] |
| gremlin> g.V().has('name','marko').out('knows').values('name') // <3> |
| ==>vadas |
| ==>josh |
| ---- |
| |
| <1> Open the toy graph and reference it by the variable `graph`. |
| <2> Create a graph traversal source from the graph using the standard, OLTP traversal engine. |
| <3> Spawn a traversal off the traversal source that determines the names of the people that the marko-vertex knows. |
| |
| .The Name of The People That Marko Knows |
| image::tinkerpop-classic-ex1.png[width=500] |
| |
| Or, if the marko-vertex is already realized with a direct reference pointer (i.e. a variable), then the traversal can |
| be spawned off that vertex. |
| |
| [gremlin-groovy,modern] |
| ---- |
| marko = g.V().has('name','marko').next() <1> |
| g.V(marko).out('knows') <2> |
| g.V(marko).out('knows').values('name') <3> |
| ---- |
| |
| <1> Set the variable `marko` to the the vertex in the graph `g` named "marko". |
| <2> Get the vertices that are outgoing adjacent to the marko-vertex via knows-edges. |
| <3> Get the names of the marko-vertex's friends. |
| |
| The Traverser |
| ~~~~~~~~~~~~~ |
| |
| When a traversal is executed, the source of the traversal is on the left of the expression (e.g. vertex 1), the steps |
| are the middle of the traversal (e.g. `out('knows')` and `values('name')`), and the results are "traversal.next()'d" |
| out of the right of the traversal (e.g. "vadas" and "josh"). |
| |
| image::traversal-mechanics.png[width=500] |
| |
| In TinkerPop3, the objects propagating through the traversal are wrapped in a `Traverser<T>`. The traverser concept |
| is new to TinkerPop3 and provides the means by which steps remain stateless. A traverser maintains all the metadata |
| about the traversal -- e.g., how many times the traverser has gone through a loop, the path history of the traverser, |
| the current object being traversed, etc. Traverser metadata may be accessed by a step. A classic example is the |
| <<path-step,`path()`>>-step. |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(marko).out('knows').values('name').path() |
| ---- |
| |
| WARNING: Path calculation is costly in terms of space as an array of previously seen objects is stored in each path |
| of the respective traverser. Thus, a traversal strategy analyzes the traversal to determine if path metadata is |
| required. If not, then path calculations are turned off. |
| |
| Another example is the <<repeat-step,`repeat()`>>-step which takes into account the number of times the traverser |
| has gone through a particular section of the traversal expression (i.e. a loop). |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(marko).repeat(out()).times(2).values('name') |
| ---- |
| |
| WARNING: A Traversal's result are never ordered unless explicitly by means of <<order-step,`order()`>>-step. Thus, |
| never rely on the iteration order between TinkerPop3 releases and even within a release (as traversal optimizations |
| may alter the flow). |
| |
| On Gremlin Language Variants |
| ---------------------------- |
| |
| Gremlin is written in Java 8. There are various language variants of Gremlin such as Gremlin-Groovy (packaged with |
| TinkerPop3), link:https://github.com/mpollmeier/gremlin-scala[Gremlin-Scala], Gremlin-JavaScript, Gremlin-Clojure |
| (known as link:https://github.com/clojurewerkz/ogre[Ogre]), etc. It is best to think of Gremlin as a style of graph |
| traversing that is not bound to a particular programming language per se. Within a programming language familiar to |
| the developer, there is a Gremlin variant that they can use that leverages the idioms of that language. At minimum, |
| a programming language providing a Gremlin implementation must support |
| link:http://en.wikipedia.org/wiki/Method_chaining[function chaining] (with |
| link:http://en.wikipedia.org/wiki/Anonymous_function[lambdas/anonymous functions] being a "nice to have" if the |
| variant wishes to offer arbitrary computations beyond the provided Gremlin steps). |
| |
| Throughout the documentation, the examples provided are primarily written in Gremlin-Groovy. The reason for this is |
| the <<gremlin-console,Gremlin Console>> whereby an interactive programming environment exists that does not require |
| code compilation. For learning TinkerPop3 and interacting with a live graph system in an ad hoc manner, the Gremlin |
| Console is invaluable. However, for developers interested in working with Gremlin-Java, a few Groovy-to-Java patterns |
| are presented below. |
| |
| [source,groovy] |
| // Gremlin-Groovy |
| g.V().out('knows').values('name') <1> |
| g.V().out('knows').map{it.get().value('name') + ' is the friend name'} <2> |
| g.V().out('knows').sideEffect(System.out.&println) <3> |
| g.V().as('person').out('knows').as('friend').select().by{it.value('name').length()} <4> |
| |
| [source,java] |
| // Gremlin-Java |
| g.V().out("knows").values("name") <1> |
| g.V().out("knows").map(t -> t.get().value("name") + " is the friend name") <2> |
| g.V().out("knows").sideEffect(System.out::println) <3> |
| g.V().as("person").out("knows").as("friend").select().by((Function<Vertex, Integer>) v -> v.<String>value("name").length()) <4> |
| |
| <1> All the non-lambda step chaining is identical in Gremlin-Groovy and Gremlin-Java. However, note that Groovy |
| supports `'` strings as well as `"` strings. |
| <2> In Groovy, lambdas are called closures and have a different syntax, where Groovy supports the `it` keyword and |
| Java doesn't with all parameters requiring naming. |
| <3> The syntax for method references differs slightly between link:https://docs.oracle.com/javase/tutorial/java/javaOO/methodreferences.html[Java] |
| and link:http://mrhaki.blogspot.de/2009/08/groovy-goodness-turn-methods-into.html[Gremlin-Groovy]. |
| <4> Groovy is lenient on object typing and Java is not. When the parameter type of the lambda is not known, |
| typecasting is required. |
| |
| Graph System Integration |
| ------------------------ |
| |
| image:provider-integration.png[width=395,float=right] TinkerPop is a framework composed of various interoperable |
| components. At the foundation there is the <<graph,core TinkerPop3 API>> which defines what a `Graph`, `Vertex`, |
| `Edge`, etc. are. At minimum a graph system provider must implement the core API. Once implemented, the Gremlin |
| <<traversal,traversal language>> is available to the graph system's users. However, the provider can go further and |
| develop specific <<traversalstrategy,`TraversalStrategy`>> optimizations that allow the graph system to inspect a |
| Gremlin query at runtime and optimize it for its particular implementation (e.g. index lookups, step reordering). If |
| the graph system is a graph processor (i.e. provides OLAP capabilities), the system should implement the |
| <<graphcomputer,`GraphComputer`>> API. This API defines how messages/traversers are passed between communicating |
| workers (i.e. threads and/or machines). Once implemented, the same Gremlin traversals execute against both the graph |
| database (OLTP) and the graph processor (OLAP). Note that the Gremlin language interprets the graph in terms of |
| vertices and edges -- i.e. Gremlin is a graph-based domain specific language. Users can create their own domain |
| specific languages to process the graph in terms of higher-order constructs such as people, companies, and their |
| various relationships. Finally, <<gremlin-server,Gremlin Server>> can be leveraged to allow over the wire |
| communication with the TinkerPop-enabled graph system. Gremlin Server provides a configurable communication interface |
| along with metrics and monitoring capabilities. In total, this is The TinkerPop. |