| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[intro]] |
| Introduction to Graph Computing |
| =============================== |
| |
| image::graph-computing.png[width=350] |
| |
| [source,xml] |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>gremlin-core</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| |
| A link:http://en.wikipedia.org/wiki/Graph_(data_structure)[graph] is a data structure composed of vertices (nodes, dots) and edges (arcs, lines). When modeling a graph in a computer and applying it to modern data sets and practices, the generic mathematically-oriented, binary graph is extended to support both labels and key/value properties. This structure is known as a property graph. More formally, it is a directed, binary, attributed multi-graph. An example property graph is diagrammed below. This graph example will be used extensively throughout the documentation and is called "TinkerPop Classic" as it is the original demo graph distributed with TinkerPop0 back in 2009 (i.e. the good ol' days -- it was the best of times and it was the worst of times). |
| |
| TIP: The TinkerPop graph is available with <<tinkergraph-gremlin,TinkerGraph>> via `TinkerFactory.createModern()`. TinkerGraph is the reference implementation of TinkerPop3 and is used in nearly all the examples in this documentation. Note that there also exists the classic `TinkerFactory.createClassic()` which is the graph used in TinkerPop2 and does not include vertex labels. |
| |
| [[tinkerpop-modern]] |
| .TinkerPop Modern |
| image::tinkerpop-modern.png[width=500] |
| |
| TinkerPop3 is the third incarnation of the TinkerPop graph computing framework. Similar to computing in general, graph computing makes a distinction between *structure* (graph) and *process* (traversal). The structure of the graph is the data model defined by a vertex/edge/property link:http://en.wikipedia.org/wiki/Network_topology[topology]. The process of the graph is the means by which the structure is analyzed. The typical form of graph processing is called a link:http://en.wikipedia.org/wiki/Graph_traversal[traversal]. |
| |
| .Primary components of the TinkerPop3 *structure* API |
| * `Graph`: maintains a set of vertices and edges, and access to database functions such as transactions. |
| * `Element`: maintains a collection of properties and a string label denoting the element type. |
| ** `Vertex`: extends Element and maintains a set of incoming and outgoing edges. |
| ** `Edge`: extends Element and maintains an incoming and outgoing vertex. |
| * `Property<V>`: a string key associated with a `V` value. |
| ** `VertexProperty<V>`: a string key associated with a `V` value as well as a collection of `Property<U>` properties (*vertices only*) |
| |
| .Primary components of the TinkerPop3 *process* API |
| * `TraversalSource`: a generator of traversals for a particular graph, link:http://en.wikipedia.org/wiki/Domain-specific_language[domain specific language] (DSL), and execution engine. |
| ** `Traversal<S,E>`: a functional data flow process transforming objects of type `S` into object of type `E`. |
| *** `GraphTraversal`: a traversal DSL that is oriented towards the semantics of the raw graph (i.e. vertices, edges, etc.). |
| * `GraphComputer`: a system that processes the graph in parallel and potentially, distributed over a multi-machine cluster. |
| ** `VertexProgram`: code executed at all vertices in a logically parallel manner with intercommunication via message passing. |
| ** `MapReduce`: a computations that analyzes all vertices in the graph in parallel and yields a single reduced result. |
| |
| IMPORTANT: TinkerPop3 is licensed under the popular link:http://www.apache.org/licenses/LICENSE-2.0.html[Apache2] free software license. However, note that the underlying graph engine used with TinkerPop3 may have a difference license. Thus, be sure to respect the license caveats of the vendor product. |
| |
| image:tinkerpop-enabled.png[width=135,float=left] When a graph vendor implements the TinkerPop3 structure and process link:http://en.wikipedia.org/wiki/Application_programming_interface[APIs], their technology is considered _TinkerPop3-enabled_ and becomes nearly indistinguishable from any other TinkerPop-enabled graph system save for their respective time and space complexity. The purpose of this documentation is to describe the structure/process dichotomy at length and in doing so, explain how to leverage TinkerPop3 for the sole purpose of vendor-agnostic graph computing. Before deep-diving into the various structure/process APIs, a short introductory review of both APIs is provided. |
| |
| NOTE: The TinkerPop3 API rides a fine line between providing concise "query language" method names and respecting Java method naming standards. The general convention used throughout TinkerPop3 is that if a method is "user exposed," then a concise name is provided (e.g. `out()`, `path()`, `repeat()`). If the method is primarily for vendors, then the standard Java naming convention is followed (e.g. `getNextStep()`, `getSteps()`, `getElementComputeKeys()`). |
| |
| The Graph Structure |
| ------------------- |
| |
| image:gremlin-standing.png[width=125,float=left] A graph's structure is the topology formed by the explicit references between its vertices, edges, and properties. A vertex has incident edges. A vertex is adjacent to another vertex if they share an incident edge. A property is attached to an element and an element has a set of properties. A property is a key/value pair, where the key is always a character `String`. The graph structure API of TinkerPop3 provides the methods necessary to create such a structure. The TinkerPop graph previously diagrammed can be created with the following Java8 code. Note that this graph is available as an in-memory TinkerGraph using `TinkerFactory.createClassic()`. |
| |
| [source,java] |
| Graph graph = TinkerGraph.open(); <1> |
| Vertex marko = graph.addVertex(T.label, "person", T.id, 1, "name", "marko", "age", 29); <2> |
| Vertex vadas = graph.addVertex(T.label, "person", T.id, 2, "name", "vadas", "age", 27); |
| Vertex lop = graph.addVertex(T.label, "software", T.id, 3, "name", "lop", "lang", "java"); |
| Vertex josh = graph.addVertex(T.label, "person", T.id, 4, "name", "josh", "age", 32); |
| Vertex ripple = graph.addVertex(T.label, "software", T.id, 5, "name", "ripple", "lang", "java"); |
| Vertex peter = graph.addVertex(T.label, "person", T.id, 6, "name", "peter", "age", 35); |
| marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f); <3> |
| marko.addEdge("knows", josh, T.id, 8, "weight", 1.0f); |
| marko.addEdge("created", lop, T.id, 9, "weight", 0.4f); |
| josh.addEdge("created", ripple, T.id, 10, "weight", 1.0f); |
| josh.addEdge("created", lop, T.id, 11, "weight", 0.4f); |
| peter.addEdge("created", lop, T.id, 12, "weight", 0.2f); |
| |
| <1> Create a new in-memory `TinkerGraph` and assign it to the variable `graph`. |
| <2> Create a vertex along with a set of key/value pairs with `T.label` being the vertex label and `T.id` being the vertex id. |
| <3> Create an edge along with a set of key/value pairs with the edge label being specified as the first argument. |
| |
| In the above code all the vertices are created first and then their respective edges. There are two "accessor tokens": `T.id` and `T.label`. When any of these, along with a set of other key value pairs is provided to `Graph.addVertex(Object...)` or `Vertex.addEdge(String,Vertex,Object...)`, the respective element is created along with the provided key/value pair properties appended to it. |
| |
| CAUTION: Many graph vendors do not allow the user to specify an element ID and in such cases, an exception is thrown. |
| |
| NOTE: In TinkerPop3, vertices are allowed a single immutable string label (similar to an edge label). This functionality did not exist in TinkerPop2. Likewise, element id's are immutable as they were in TinkerPop2. |
| |
| Mutating the Graph |
| ~~~~~~~~~~~~~~~~~~ |
| |
| Below is a sequence of basic graph mutation operations represented in Java8. One of the major differences between TinkerPop2 and TinkerPop3 is that in TinkerPop3, the Java convention of using setters and getters has been abandoned in favor of a syntax that is more aligned with the syntax of Gremlin-Groovy in TinkerPop2. Given that Gremlin-Java8 and Gremlin-Groovy are nearly identical due to the inclusion of Java8 lambdas, a big efforts was made to ensure that both languages are as similar as possible. |
| |
| CAUTION: In the code examples presented throughout this documentation, either Gremlin-Java8 or Gremlin-Groovy is used. It is possible to determine which derivative of Gremlin is being used by "mousing over" on the code block and see either "JAVA" or "GROOVY" pop up in the top right corner of the code block. |
| |
| image:basic-mutation.png[width=240,float=right] |
| [source,java] |
| // create a new graph |
| Graph graph = TinkerGraph.open(); |
| // add a software vertex with a name property |
| Vertex gremlin = graph.addVertex(T.label, "software", |
| "name", "gremlin"); <1> |
| // only one vertex should exist |
| assert(IteratorUtils.count(graph.vertices()) == 1) |
| // no edges should exist as none have been created |
| assert(IteratorUtils.count(graph.edges()) == 0) |
| // add a new property |
| gremlin.property("created",2009) <2> |
| // add a new software vertex to the graph |
| Vertex blueprints = graph.addVertex(T.label, "software", |
| "name", "blueprints"); <3> |
| // connect gremlin to blueprints via a dependsOn-edge |
| gremlin.addEdge("dependsOn",blueprints); <4> |
| // now there are two vertices and one edge |
| assert(IteratorUtils.count(graph.vertices()) == 2) |
| assert(IteratorUtils.count(graph.edges()) == 1) |
| // add a property to blueprints |
| blueprints.property("created",2010) <5> |
| // remove that property |
| blueprints.property("created").remove() <6> |
| // connect gremlin to blueprints via encapsulates |
| gremlin.addEdge("encapsulates",blueprints) <7> |
| assert(IteratorUtils.count(graph.vertices()) == 2) |
| assert(IteratorUtils.count(graph.edges()) == 2) |
| // removing a vertex removes all its incident edges as well |
| blueprints.remove() <8> |
| gremlin.remove() <9> |
| // the graph is now empty |
| assert(IteratorUtils.count(graph.vertices()) == 0) |
| assert(IteratorUtils.count(graph.edges()) == 0) |
| // tada! |
| |
| IMPORTANT: image:groovy-logo.png[width=175,float=left] Gremlin-Groovy leverages the link:http://groovy.codehaus.org/[Groovy 2.x language] to express Gremlin traversals. One of the major benefits of Groovy is the inclusion of a runtime console that makes it easy for developers to practice with the Gremlin language and for production users to connect to their graph and execute traversals in an interactive manner. Moreover, Gremlin-Groovy provides various syntax simplifications. |
| |
| TIP: image:gremlin-sugar.png[width=100,float=left] For those wishing to use the Gremlin2 syntax, please see <<sugar-plugin,SugarPlugin>>. This plugin provides syntactic sugar at, typically, a runtime cost. It can be loaded programmaticaly via `SugarLoader.load()`. Once loaded, it is possible to do `g.V.out.name` instead of `g.V().out().values('name')` as well as a host of other conveniences. |
| |
| Here is the same code, but using Gremlin-Groovy in the <<gremlin-console,Gremlin Console>>. |
| |
| [source,groovy] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| gremlin> graph = TinkerGraph.open() |
| ==>tinkergraph[vertices:0 edges:0] |
| gremlin> gremlin = graph.addVertex(label,'software','name','gremlin') |
| ==>v[0] |
| gremlin> gremlin.property('created',2009) |
| ==>vp[created->2009] |
| gremlin> blueprints = graph.addVertex(label,'software','name','blueprints') |
| ==>v[3] |
| gremlin> gremlin.addEdge('dependsOn',blueprints) |
| ==>e[5][0-dependsOn->3] |
| gremlin> blueprints.property('created',2010) |
| ==>vp[created->2010] |
| gremlin> blueprints.property('created').remove() |
| ==>null |
| gremlin> gremlin.addEdge('encapsulates',blueprints) |
| ==>e[7][0-encapsulates->3] |
| gremlin> blueprints.remove() |
| ==>null |
| gremlin> gremlin.remove() |
| ==>null |
| ---- |
| |
| IMPORTANT: TinkerGraph is not a transactional graph. For more information on transaction handling (for those graph systems that support them) see the section dedicated to <<transactions,transactions>>. |
| |
| The Graph Process |
| ----------------- |
| |
| image:gremlin-running.png[width=125,float=left] The primary way in which graphs are processed are via graph traversals. The TinkerPop3 process API is focused on allowing users to create graph traversals in a syntactically-friendly way over the structures defined in the previous section. A traversal is an algorithmic walk across the elements of a graph according to the referential structure explicit within the graph data structure. For example: _"What software does vertex 1's friends work on?"_ This English-statement can be represented in the following algorithmic/traversal fashion: |
| |
| . Start at vertex 1. |
| . Walk the incident knows-edges to the respective adjacent friend vertices of 1. |
| . Move from those friend-vertices to software-vertices via created-edges. |
| . Finally, select the name-property value of the current software-vertices. |
| |
| Traversals in Gremlin are spawned from a `TraversalSource`. The `GraphTraversalSource` is the typical "graph-oriented" DSL used throughout the documentation and will most likely be the most used DSL in a TinkerPop application. `GraphTraversalSource` provides two traversal methods. |
| |
| . `GraphTraversalSource.V(Object... ids)`: generates a traversal starting at vertices in the graph (if no ids are provided, all vertices). |
| . `GraphTraversalSource.E(Object... ids)`: generates a traversal starting at edges in the graph (if no ids are provided, all edges). |
| |
| The return type of `V()` and `E()` is a `GraphTraversal`. A GraphTraversal maintains numerous methods that return GraphTraversal. In this way, a GraphTraversal supports function composition. Each method of GraphTraversal is called a step and each step modulates the results of the previous step in one of five general ways. |
| |
| . `map`: transform the incoming traverser's object to another object (S → E). |
| . `flatMap`: transform the incoming traverser's object to an iterator of other objects (S → E*). |
| . `filter`: allow or disallow the traverser from proceeding to the next step (S → S ∪ ∅). |
| . `sideEffect`: allow the traverser to proceed unchanged, but yield some computational sideEffect in the process (S ↬ S). |
| . `branch`: split the traverser and send each to an arbitrary location in the traversal (S → { S~1~ → E*, ..., S~n~ → E* } → E*). |
| |
| Nearly every step in GraphTraversal either extends `MapStep`, `FlatMapStep`, `FilterStep`, `SideEffectStep`, or `BranchStep`. |
| |
| TIP: `GraphTraversal` is a link:http://en.wikipedia.org/wiki/Monoid[monoid] in that it is an algebraic structure that has a single binary operation that is associative. The binary operation is function composition (i.e. method chaining) and its identity is the step `identity()`. This is related to a link:http://en.wikipedia.org/wiki/Monad_(functional_programming)[monad] as popularized by the functional programming community. |
| |
| Given the TinkerPop graph, the following query will return the names of all the people that the marko-vertex knows. The following query is demonstrated using Gremlin-Groovy. |
| |
| [source,groovy] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| gremlin> graph = TinkerFactory.createModern() // <1> |
| ==>tinkergraph[vertices:6 edges:6] |
| gremlin> g = graph.traversal(standard()) // <2> |
| ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard] |
| gremlin> g.V().has('name','marko').out('knows').values('name') // <3> |
| ==>vadas |
| ==>josh |
| ---- |
| |
| <1> Open the toy graph and reference it by the variable `graph`. |
| <2> Create a graph traversal source from the graph using the standard, OLTP traversal engine. |
| <3> Spawn a traversal off the traversal source that determines the names of the people that the marko-vertex knows. |
| |
| .The Name of The People That Marko Knows |
| image::tinkerpop-classic-ex1.png[width=500] |
| |
| Or, if the marko-vertex is already realized with a direct reference pointer (i.e. a variable), then the traversal can be spawned off that vertex. |
| |
| [gremlin-groovy,modern] |
| ---- |
| marko = g.V().has('name','marko').next() <1> |
| g.V(marko).out('knows') <2> |
| g.V(marko).out('knows').values('name') <3> |
| ---- |
| |
| <1> Set the variable `marko` to the the vertex in the graph `g` named "marko". |
| <2> Get the vertices that are outgoing adjacent to the marko-vertex via knows-edges. |
| <3> Get the names of the marko-vertex's friends. |
| |
| The Traverser |
| ~~~~~~~~~~~~~ |
| |
| When a traversal is executed, the source of the traversal is on the left of the expression (e.g. vertex 1), the steps are the middle of the traversal (e.g. `out('knows')` and `values('name')`), and the results are "traversal.next()'d" out of the right of the traversal (e.g. "vadas" and "josh"). |
| |
| image::traversal-mechanics.png[width=500] |
| |
| In TinkerPop3, the objects propagating through the traversal are wrapped in a `Traverser<T>`. The traverser concept is new to TinkerPop3 and provides the means by which steps remain stateless. A traverser maintains all the metadata about the traversal -- e.g., how many times the traverser has gone through a loop, the path history of the traverser, the current object being traversed, etc. Traverser metadata may be accessed by a step. A classic example is the <<path-step,`path()`>>-step. |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(marko).out('knows').values('name').path() |
| ---- |
| |
| CAUTION: Path calculation is costly in terms of space as an array of previously seen objects is stored in each path of the respective traverser. Thus, a traversal strategy analyzes the traversal to determine if path metadata is required. If not, then path calculations are turned off. |
| |
| Another example is the <<repeat-step,`repeat()`>>-step which takes into account the number of times the traverser has gone through a particular section of the traversal expression (i.e. a loop). |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(marko).repeat(out()).times(2).values('name') |
| ---- |
| |
| CAUTION: A Traversal's result are never ordered unless explicitly by means of <<order-step,`order()`>>-step. Thus, never rely on the iteration order between TinkerPop3 releases and even within a release (as traversal optimizations may alter the flow). |
| |
| On Gremlin Language Variants |
| ---------------------------- |
| |
| Gremlin is written in Java8. There are various language variants of Gremlin such as Gremlin-Groovy (packaged with TinkerPop3), link:https://github.com/mpollmeier/gremlin-scala[Gremlin-Scala], Gremlin-JavaScript, Gremlin-Clojure (known as link:https://github.com/clojurewerkz/ogre[Ogre]), etc. It is best to think of Gremlin as a style of graph traversing that is not bound to a particular programming language per se. Within a programming language familiar to the developer, there is a Gremlin variant that they can use that leverages the idioms of that language. At minimum, a programming language providing a Gremlin implementation must support link:http://en.wikipedia.org/wiki/Method_chaining[function chaining] (with link:http://en.wikipedia.org/wiki/Anonymous_function[lambdas/anonymous functions] being a "nice to have" if the variants wishes to offer arbitrary computations beyond the provided Gremlin steps). |
| |
| Throughout the documentation, the examples provided are primarily written in Gremlin-Groovy. The reason for this is the <<gremlin-console,Gremlin Console>> whereby an interactive programming environment exists that does not require code compilation. For learning TinkerPop3 and interacting with a live graph system in an ad hoc manner, the Gremlin Console is invaluable. However, for developers interested in working with Gremlin-Java, a few Groovy-to-Java patterns are presented below. |
| |
| [source,groovy] |
| // Gremlin-Groovy |
| g.V().out('knows').values('name') <1> |
| g.V().out('knows').map{it.get().value('name') + ' is the friend name'} <2> |
| g.V().out('knows').sideEffect(System.out.&println) <3> |
| g.V().as('person').out('knows').as('friend').select().by{it.value('name').length()} <4> |
| |
| [source,java] |
| // Gremlin-Java |
| g.V().out("knows").values("name") <1> |
| g.V().out("knows").map(t -> t.get().value("name") + " is the friend name") <2> |
| g.V().out("knows").sideEffect(System.out::println) <3> |
| g.V().as("person").out("knows").as("friend").select().by((Function<Vertex, Integer>) v -> v.<String>value("name").length()) <4> |
| |
| <1> All the non-lambda step chaining is identical in Gremlin-Groovy and Gremlin-Java. However, note that Groovy supports `'` strings as well as `"` strings. |
| <2> In Groovy, lambdas are called closures and have a different syntax, where Groovy supports the `it` keyword and Java doesn't with all parameters requiring naming. |
| <3> The syntax for method references differs slightly between link:https://docs.oracle.com/javase/tutorial/java/javaOO/methodreferences.html[Java] and link:http://mrhaki.blogspot.de/2009/08/groovy-goodness-turn-methods-into.html[Gremlin-Groovy]. |
| <4> Groovy is lenient on object typing and Java is not. When the parameter type of the lambda is not known, typecasting is required. |
| |
| Vendor Integration |
| ------------------ |
| |
| image:vendor-integration.png[width=395,float=right] TinkerPop is a framework composed of various interoperable components. At the foundation there is the <<graph,core TinkerPop3 API>> which defines what a `Graph`, `Vertex`, `Edge`, etc. are. At minimum a vendor must implement the core API. Once implemented, the Gremlin <<traversal,traversal language>> is available to the vendor's users. However, the vendor can go further and provide specific <<traversalstrategy,`TraversalStrategy`>> optimizations that allow the vendor to inspect a Gremlin query at runtime and optimize it for their particular implementation (e.g. index lookups, step reordering). If the vendor's graph system is a graph processor (i.e. provides OLAP capabilities), the vendor can implement the <<graphcomputer,`GraphComputer`>> API. This API defines how messages/traversers are passed between communicating workers (i.e. threads and/or machines). Once implemented, the same Gremlin traversals execute against both the graph database (OLTP) and the graph processor (OLAP). Note that the Gremlin language interprets the graph in terms of vertices and edges -- i.e. Gremlin is a graph-based domain specific language. Users can create their own domain specific languages to process the graph in terms of higher-order constructs such as people, companies, and their various relationships. Finally, <<gremlin-server,Gremlin Server>> can be leveraged to allow over the wire communication with the TinkerPop-enabled graph system. Gremlin Server provides a configurable communication interface along with metrics and monitoring capabilities. In total, this is The TinkerPop. |