| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[intro]] |
| = Introduction to Graph Computing |
| |
| image::graph-computing.png[width=350] |
| |
| [source,xml] |
| <dependency> |
| <groupId>org.apache.tinkerpop</groupId> |
| <artifactId>gremlin-core</artifactId> |
| <version>x.y.z</version> |
| </dependency> |
| |
| A link:http://en.wikipedia.org/wiki/Graph_(data_structure)[graph] is a data structure composed of vertices (nodes, |
| dots) and edges (arcs, lines). When modeling a graph in a computer and applying it to modern data sets and practices, |
| the generic mathematically-oriented, binary graph is extended to support both labels and key/value properties. This |
| structure is known as a property graph. More formally, it is a directed, binary, attributed multi-graph. An example |
| property graph is diagrammed below. This graph example will be used extensively throughout the documentation and is |
| called "TinkerPop Modern" as it is a modern variation of the original demo graph distributed with TinkerPop0 back |
| in 2009 (i.e. the good ol' days -- it was the best of times and it was the worst of times). |
| |
| TIP: The TinkerPop graph is available with <<tinkergraph-gremlin,TinkerGraph>> via `TinkerFactory.createModern()`. |
| TinkerGraph is the reference implementation of TinkerPop3 and is used in nearly all the examples in this documentation. |
| Note that there also exists the classic `TinkerFactory.createClassic()` which is the graph used in TinkerPop2 and does |
| not include vertex labels. |
| |
| TIP: All of the toy graphs available in TinkerPop are described in |
| link:https://tinkerpop.apache.org/docs/x.y.z/tutorials/the-gremlin-console/#toy-graphs[The Gremlin Console] tutorial. |
| |
| [[tinkerpop-modern]] |
| .TinkerPop Modern |
| image::tinkerpop-modern.png[width=500] |
| |
| TinkerPop3 is the third incarnation of the Apache TinkerPop™ graph computing framework. Similar to computing in |
| general, graph computing makes a distinction between *structure* (graph) and *process* (traversal). The structure of |
| the graph is the data model defined by a vertex/edge/property link:http://en.wikipedia.org/wiki/Network_topology[topology]. |
| The process of the graph is the means by which the structure is analyzed. The typical form of graph processing is |
| called a link:http://en.wikipedia.org/wiki/Graph_traversal[traversal]. |
| |
| Generally speaking, the structure or "graph" API is meant for link:https://tinkerpop.apache.org/providers.html[graph providers] |
| who are implementing the TinkerPop interfaces and the process or "traversal" API (i.e. Gremlin) is meant for end-users |
| who are utilizing a graph system from a graph provider. While the components of the process API are itemized below, |
| they are described in greater detail in the link:https://tinkerpop.apache.org/docs/x.y.z/tutorials/gremlins-anatomy/[Gremlin's Anatomy] |
| tutorial. |
| |
| .Primary components of the TinkerPop3 *structure* API |
| * `Graph`: maintains a set of vertices and edges, and access to database functions such as transactions. |
| * `Element`: maintains a collection of properties and a string label denoting the element type. |
| ** `Vertex`: extends Element and maintains a set of incoming and outgoing edges. |
| ** `Edge`: extends Element and maintains an incoming and outgoing vertex. |
| * `Property<V>`: a string key associated with a `V` value. |
| ** `VertexProperty<V>`: a string key associated with a `V` value as well as a collection of `Property<U>` properties (*vertices only*) |
| |
| .Primary components of the TinkerPop3 *process* API |
| * `TraversalSource`: a generator of traversals for a particular graph, link:http://en.wikipedia.org/wiki/Domain-specific_language[domain specific language] (DSL), and execution engine. |
| ** `Traversal<S,E>`: a functional data flow process transforming objects of type `S` into object of type `E`. |
| *** `GraphTraversal`: a traversal DSL that is oriented towards the semantics of the raw graph (i.e. vertices, edges, etc.). |
| * `GraphComputer`: a system that processes the graph in parallel and potentially, distributed over a multi-machine cluster. |
| ** `VertexProgram`: code executed at all vertices in a logically parallel manner with intercommunication via message passing. |
| ** `MapReduce`: a computation that analyzes all vertices in the graph in parallel and yields a single reduced result. |
| |
| IMPORTANT: TinkerPop3 is licensed under the popular link:http://www.apache.org/licenses/LICENSE-2.0.html[Apache2] |
| free software license. However, note that the underlying graph engine used with TinkerPop3 may have a different |
| license. Thus, be sure to respect the license caveats of the graph system product. |
| |
| image:tinkerpop-enabled.png[width=135,float=left] When a graph system implements the TinkerPop3 structure and process |
| link:http://en.wikipedia.org/wiki/Application_programming_interface[APIs], their technology is considered |
| _TinkerPop3-enabled_ and becomes nearly indistinguishable from any other TinkerPop-enabled graph system save for |
| their respective time and space complexity. The purpose of this documentation is to describe the structure/process |
| dichotomy at length and in doing so, explain how to leverage TinkerPop3 for the sole purpose of graph system-agnostic |
| graph computing. Before deep-diving into the various structure/process APIs, a short introductory review of both APIs |
| is provided. |
| |
| NOTE: The TinkerPop3 API rides a fine line between providing concise "query language" method names and respecting |
| Java method naming standards. The general convention used throughout TinkerPop3 is that if a method is "user exposed," |
| then a concise name is provided (e.g. `out()`, `path()`, `repeat()`). If the method is primarily for graph systems |
| providers, then the standard Java naming convention is followed (e.g. `getNextStep()`, `getSteps()`, |
| `getElementComputeKeys()`). |
| |
| == The Graph Structure |
| |
| image:gremlin-standing.png[width=125,float=left] A graph's structure is the topology formed by the explicit references |
| between its vertices, edges, and properties. A vertex has incident edges. A vertex is adjacent to another vertex if |
| they share an incident edge. A property is attached to an element and an element has a set of properties. A property |
| is a key/value pair, where the key is always a character `String`. Conceptual knowledge of how a graph is composed is |
| essential to end-users working with graphs, however, as mentioned earlier, the structure API is not the appropriate |
| way for users to think when building applications with TinkerPop. The structure API is reserved for usage by graph |
| providers. Those interested in implementing the structure API to make their graph system TinkerPop enabled can learn |
| more about it in the link:https://tinkerpop.apache.org/docs/x.y.z/dev/provider/[Graph Provider] documentation. |
| |
| [[the-graph-process]] |
| == The Graph Process |
| |
| image:gremlin-running.png[width=125,float=left] The primary way in which graphs are processed are via graph |
| traversals. The TinkerPop3 process API is focused on allowing users to create graph traversals in a |
| syntactically-friendly way over the structures defined in the previous section. A traversal is an algorithmic walk |
| across the elements of a graph according to the referential structure explicit within the graph data structure. |
| For example: _"What software does vertex 1's friends work on?"_ This English-statement can be represented in the |
| following algorithmic/traversal fashion: |
| |
| . Start at vertex 1. |
| . Walk the incident knows-edges to the respective adjacent friend vertices of 1. |
| . Move from those friend-vertices to software-vertices via created-edges. |
| . Finally, select the name-property value of the current software-vertices. |
| |
| Traversals in Gremlin are spawned from a `TraversalSource`. The `GraphTraversalSource` is the typical "graph-oriented" |
| DSL used throughout the documentation and will most likely be the most used DSL in a TinkerPop application. |
| `GraphTraversalSource` provides two traversal methods. |
| |
| . `GraphTraversalSource.V(Object... ids)`: generates a traversal starting at vertices in the graph (if no ids are provided, all vertices). |
| . `GraphTraversalSource.E(Object... ids)`: generates a traversal starting at edges in the graph (if no ids are provided, all edges). |
| |
| The return type of `V()` and `E()` is a `GraphTraversal`. A GraphTraversal maintains numerous methods that return |
| `GraphTraversal`. In this way, a `GraphTraversal` supports function composition. Each method of `GraphTraversal` is |
| called a step and each step modulates the results of the previous step in one of five general ways. |
| |
| . `map`: transform the incoming traverser's object to another object (S → E). |
| . `flatMap`: transform the incoming traverser's object to an iterator of other objects (S → E*). |
| . `filter`: allow or disallow the traverser from proceeding to the next step (S → E ⊆ S). |
| . `sideEffect`: allow the traverser to proceed unchanged, but yield some computational sideEffect in the process (S ↬ S). |
| . `branch`: split the traverser and send each to an arbitrary location in the traversal (S → { S~1~ → E*, ..., S~n~ → E* } → E*). |
| |
| Nearly every step in GraphTraversal either extends `MapStep`, `FlatMapStep`, `FilterStep`, `SideEffectStep`, or `BranchStep`. |
| |
| TIP: `GraphTraversal` is a link:http://en.wikipedia.org/wiki/Monoid[monoid] in that it is an algebraic structure |
| that has a single binary operation that is associative. The binary operation is function composition (i.e. method |
| chaining) and its identity is the step `identity()`. This is related to a |
| link:http://en.wikipedia.org/wiki/Monad_(functional_programming)[monad] as popularized by the functional programming |
| community. |
| |
| Given the TinkerPop graph, the following query will return the names of all the people that the marko-vertex knows. |
| The following query is demonstrated using Gremlin-Groovy. |
| |
| [source,groovy] |
| ---- |
| $ bin/gremlin.sh |
| |
| \,,,/ |
| (o o) |
| -----oOOo-(3)-oOOo----- |
| gremlin> graph = TinkerFactory.createModern() // <1> |
| ==>tinkergraph[vertices:6 edges:6] |
| gremlin> g = graph.traversal() // <2> |
| ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard] |
| gremlin> g.V().has('name','marko').out('knows').values('name') // <3> |
| ==>vadas |
| ==>josh |
| ---- |
| |
| <1> Open the toy graph and reference it by the variable `graph`. |
| <2> Create a graph traversal source from the graph using the standard, OLTP traversal engine. This object should be created once and then re-used. |
| <3> Spawn a traversal off the traversal source that determines the names of the people that the marko-vertex knows. |
| |
| .The Name of The People That Marko Knows |
| image::tinkerpop-classic-ex1.png[width=500] |
| |
| Or, if the marko-vertex is already realized with a direct reference pointer (i.e. a variable), then the traversal can |
| be spawned off that vertex. |
| |
| [gremlin-groovy,modern] |
| ---- |
| marko = g.V().has('name','marko').next() <1> |
| g.V(marko).out('knows') <2> |
| g.V(marko).out('knows').values('name') <3> |
| ---- |
| |
| <1> Set the variable `marko` to the vertex in the graph `g` named "marko". |
| <2> Get the vertices that are outgoing adjacent to the marko-vertex via knows-edges. |
| <3> Get the names of the marko-vertex's friends. |
| |
| === The Traverser |
| |
| When a traversal is executed, the source of the traversal is on the left of the expression (e.g. vertex 1), the steps |
| are the middle of the traversal (e.g. `out('knows')` and `values('name')`), and the results are "traversal.next()'d" |
| out of the right of the traversal (e.g. "vadas" and "josh"). |
| |
| image::traversal-mechanics.png[width=500] |
| |
| The objects propagating through the traversal are wrapped in a `Traverser<T>`. The traverser provides the means by |
| which steps remain stateless. A traverser maintains all the metadata about the traversal -- e.g., how many times the |
| traverser has gone through a loop, the path history of the traverser, the current object being traversed, etc. |
| Traverser metadata may be accessed by a step. A classic example is the <<path-step,`path()`>>-step. |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(marko).out('knows').values('name').path() |
| ---- |
| |
| WARNING: Path calculation is costly in terms of space as an array of previously seen objects is stored in each path |
| of the respective traverser. Thus, a traversal strategy analyzes the traversal to determine if path metadata is |
| required. If not, then path calculations are turned off. |
| |
| Another example is the <<repeat-step,`repeat()`>>-step which takes into account the number of times the traverser |
| has gone through a particular section of the traversal expression (i.e. a loop). |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(marko).repeat(out()).times(2).values('name') |
| ---- |
| |
| WARNING: TinkerPop does not guarantee the order of results returned from a traversal. It only guarantees not to modify |
| the iteration order provided by the underlying graph. Therefore it is important to understand the order guarantees of |
| the graph database being used. A traversal's result is never ordered by TinkerPop unless performed explicitly by means |
| of <<order-step,`order()`>>-step. |
| |
| == On Gremlin Language Variants |
| |
| Gremlin is written in Java 8. There are various language variants of Gremlin such as Gremlin-Groovy (packaged with |
| TinkerPop3), Gremlin-Python (packaged with TinkerPop3), link:https://github.com/mpollmeier/gremlin-scala[Gremlin-Scala], |
| Gremlin-JavaScript, Gremlin-Clojure (known as link:https://github.com/clojurewerkz/ogre[Ogre]), etc. |
| It is best to think of Gremlin as a style of graph traversing that is not bound to a particular programming language per se. |
| Within a programming language familiar to the developer, there is a Gremlin variant that they can use that leverages |
| the idioms of that language. At minimum, a programming language providing a Gremlin implementation must support |
| link:http://en.wikipedia.org/wiki/Method_chaining[function chaining] (with |
| link:http://en.wikipedia.org/wiki/Anonymous_function[lambdas/anonymous functions] being a "nice to have" if the |
| variants wishes to offer arbitrary computations beyond the provided Gremlin steps). |
| |
| Throughout the documentation, the examples provided are primarily written in Gremlin-Groovy. The reason for this is |
| the <<gremlin-console,Gremlin Console>> -- an interactive programming environment exists that does not require |
| code compilation. For learning TinkerPop3 and interacting with a live graph system in an ad hoc manner, the Gremlin |
| Console is invaluable. However, for developers interested in working with Gremlin-Java, a few Groovy-to-Java patterns |
| are presented below. |
| |
| [source,groovy] |
| // Gremlin-Groovy |
| g.V().out('knows').values('name') <1> |
| g.V().out('knows').map{it.get().value('name') + ' is the friend name'} <2> |
| g.V().out('knows').sideEffect(System.out.&println) <3> |
| g.V().as('person').out('knows').as('friend').select('person','friend').by{it.value('name').length()} <4> |
| |
| [source,java] |
| // Gremlin-Java |
| g.V().out("knows").values("name") <1> |
| g.V().out("knows").map(t -> t.get().value("name") + " is the friend name") <2> |
| g.V().out("knows").sideEffect(System.out::println) <3> |
| g.V().as("person").out("knows").as("friend").select("person","friend").by((Function<Vertex, Integer>) v -> v.<String>value("name").length()) <4> |
| |
| <1> All the non-lambda step chaining is identical in Gremlin-Groovy and Gremlin-Java. However, note that Groovy |
| supports `'` strings as well as `"` strings. |
| <2> In Groovy, lambdas are called closures and have a different syntax, where Groovy supports the `it` keyword and |
| Java doesn't with all parameters requiring naming. |
| <3> The syntax for method references differs slightly between link:https://docs.oracle.com/javase/tutorial/java/javaOO/methodreferences.html[Java] |
| and link:http://mrhaki.blogspot.de/2009/08/groovy-goodness-turn-methods-into.html[Gremlin-Groovy]. |
| <4> Groovy is lenient on object typing and Java is not. When the parameter type of the lambda is not known, |
| typecasting is required. |
| |
| Please see the <<gremlin-variants, Gremlin Variants>> section for more information on this topic. |
| |
| == Graph System Integration |
| |
| image:provider-integration.png[width=395,float=right] TinkerPop is a framework composed of various interoperable |
| components. At the foundation there is the <<graph,core TinkerPop3 API>> which defines what a `Graph`, `Vertex`, |
| `Edge`, etc. are. At minimum a graph system provider must implement the core API. Once implemented, the Gremlin |
| <<traversal,traversal language>> is available to the graph system's users. However, the provider can go further and |
| develop specific <<traversalstrategy,`TraversalStrategy`>> optimizations that allow the graph system to inspect a |
| Gremlin query at runtime and optimize it for its particular implementation (e.g. index lookups, step reordering). If |
| the graph system is a graph processor (i.e. provides OLAP capabilities), the system should implement the |
| <<graphcomputer,`GraphComputer`>> API. This API defines how messages/traversers are passed between communicating |
| workers (i.e. threads and/or machines). Once implemented, the same Gremlin traversals execute against both the graph |
| database (OLTP) and the graph processor (OLAP). Note that the Gremlin language interprets the graph in terms of |
| vertices and edges -- i.e. Gremlin is a graph-based domain specific language. Users can create their own domain |
| specific languages to process the graph in terms of higher-order constructs such as people, companies, and their |
| various relationships. Finally, <<gremlin-server,Gremlin Server>> can be leveraged to allow over the wire |
| communication with the TinkerPop-enabled graph system. Gremlin Server provides a configurable communication interface |
| along with metrics and monitoring capabilities. In total, this is The TinkerPop. |