blob: 1d0c19c10a1c943183b15e6ab6de8fe8be47009c [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[intro]]
= Introduction to Graph Computing
image::graph-computing.png[width=350]
[source,xml]
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-core</artifactId>
<version>x.y.z</version>
</dependency>
A link:http://en.wikipedia.org/wiki/Graph_(data_structure)[graph] is a data structure composed of vertices (nodes,
dots) and edges (arcs, lines). When modeling a graph in a computer and applying it to modern data sets and practices,
the generic mathematically-oriented, binary graph is extended to support both labels and key/value properties. This
structure is known as a property graph. More formally, it is a directed, binary, attributed multi-graph. An example
property graph is diagrammed below. This graph example will be used extensively throughout the documentation and is
called "TinkerPop Modern" as it is a modern variation of the original demo graph distributed with TinkerPop0 back
in 2009 (i.e. the good ol' days -- it was the best of times and it was the worst of times).
TIP: The TinkerPop graph is available with <<tinkergraph-gremlin,TinkerGraph>> via `TinkerFactory.createModern()`.
TinkerGraph is the reference implementation of TinkerPop3 and is used in nearly all the examples in this documentation.
Note that there also exists the classic `TinkerFactory.createClassic()` which is the graph used in TinkerPop2 and does
not include vertex labels.
TIP: All of the toy graphs available in TinkerPop are described in
link:https://tinkerpop.apache.org/docs/x.y.z/tutorials/the-gremlin-console/#toy-graphs[The Gremlin Console] tutorial.
[[tinkerpop-modern]]
.TinkerPop Modern
image::tinkerpop-modern.png[width=500]
TinkerPop3 is the third incarnation of the Apache TinkerPop™ graph computing framework. Similar to computing in
general, graph computing makes a distinction between *structure* (graph) and *process* (traversal). The structure of
the graph is the data model defined by a vertex/edge/property link:http://en.wikipedia.org/wiki/Network_topology[topology].
The process of the graph is the means by which the structure is analyzed. The typical form of graph processing is
called a link:http://en.wikipedia.org/wiki/Graph_traversal[traversal].
Generally speaking, the structure or "graph" API is meant for link:https://tinkerpop.apache.org/providers.html[graph providers]
who are implementing the TinkerPop interfaces and the process or "traversal" API (i.e. Gremlin) is meant for end-users
who are utilizing a graph system from a graph provider. While the components of the process API are itemized below,
they are described in greater detail in the link:https://tinkerpop.apache.org/docs/x.y.z/tutorials/gremlins-anatomy/[Gremlin's Anatomy]
tutorial.
.Primary components of the TinkerPop3 *structure* API
* `Graph`: maintains a set of vertices and edges, and access to database functions such as transactions.
* `Element`: maintains a collection of properties and a string label denoting the element type.
** `Vertex`: extends Element and maintains a set of incoming and outgoing edges.
** `Edge`: extends Element and maintains an incoming and outgoing vertex.
* `Property<V>`: a string key associated with a `V` value.
** `VertexProperty<V>`: a string key associated with a `V` value as well as a collection of `Property<U>` properties (*vertices only*)
.Primary components of the TinkerPop3 *process* API
* `TraversalSource`: a generator of traversals for a particular graph, link:http://en.wikipedia.org/wiki/Domain-specific_language[domain specific language] (DSL), and execution engine.
** `Traversal<S,E>`: a functional data flow process transforming objects of type `S` into object of type `E`.
*** `GraphTraversal`: a traversal DSL that is oriented towards the semantics of the raw graph (i.e. vertices, edges, etc.).
* `GraphComputer`: a system that processes the graph in parallel and potentially, distributed over a multi-machine cluster.
** `VertexProgram`: code executed at all vertices in a logically parallel manner with intercommunication via message passing.
** `MapReduce`: a computation that analyzes all vertices in the graph in parallel and yields a single reduced result.
IMPORTANT: TinkerPop3 is licensed under the popular link:http://www.apache.org/licenses/LICENSE-2.0.html[Apache2]
free software license. However, note that the underlying graph engine used with TinkerPop3 may have a different
license. Thus, be sure to respect the license caveats of the graph system product.
image:tinkerpop-enabled.png[width=135,float=left] When a graph system implements the TinkerPop3 structure and process
link:http://en.wikipedia.org/wiki/Application_programming_interface[APIs], their technology is considered
_TinkerPop3-enabled_ and becomes nearly indistinguishable from any other TinkerPop-enabled graph system save for
their respective time and space complexity. The purpose of this documentation is to describe the structure/process
dichotomy at length and in doing so, explain how to leverage TinkerPop3 for the sole purpose of graph system-agnostic
graph computing. Before deep-diving into the various structure/process APIs, a short introductory review of both APIs
is provided.
NOTE: The TinkerPop3 API rides a fine line between providing concise "query language" method names and respecting
Java method naming standards. The general convention used throughout TinkerPop3 is that if a method is "user exposed,"
then a concise name is provided (e.g. `out()`, `path()`, `repeat()`). If the method is primarily for graph systems
providers, then the standard Java naming convention is followed (e.g. `getNextStep()`, `getSteps()`,
`getElementComputeKeys()`).
== The Graph Structure
image:gremlin-standing.png[width=125,float=left] A graph's structure is the topology formed by the explicit references
between its vertices, edges, and properties. A vertex has incident edges. A vertex is adjacent to another vertex if
they share an incident edge. A property is attached to an element and an element has a set of properties. A property
is a key/value pair, where the key is always a character `String`. Conceptual knowledge of how a graph is composed is
essential to end-users working with graphs, however, as mentioned earlier, the structure API is not the appropriate
way for users to think when building applications with TinkerPop. The structure API is reserved for usage by graph
providers. Those interested in implementing the structure API to make their graph system TinkerPop enabled can learn
more about it in the link:https://tinkerpop.apache.org/docs/x.y.z/dev/provider/[Graph Provider] documentation.
[[the-graph-process]]
== The Graph Process
image:gremlin-running.png[width=125,float=left] The primary way in which graphs are processed are via graph
traversals. The TinkerPop3 process API is focused on allowing users to create graph traversals in a
syntactically-friendly way over the structures defined in the previous section. A traversal is an algorithmic walk
across the elements of a graph according to the referential structure explicit within the graph data structure.
For example: _"What software does vertex 1's friends work on?"_ This English-statement can be represented in the
following algorithmic/traversal fashion:
. Start at vertex 1.
. Walk the incident knows-edges to the respective adjacent friend vertices of 1.
. Move from those friend-vertices to software-vertices via created-edges.
. Finally, select the name-property value of the current software-vertices.
Traversals in Gremlin are spawned from a `TraversalSource`. The `GraphTraversalSource` is the typical "graph-oriented"
DSL used throughout the documentation and will most likely be the most used DSL in a TinkerPop application.
`GraphTraversalSource` provides two traversal methods.
. `GraphTraversalSource.V(Object... ids)`: generates a traversal starting at vertices in the graph (if no ids are provided, all vertices).
. `GraphTraversalSource.E(Object... ids)`: generates a traversal starting at edges in the graph (if no ids are provided, all edges).
The return type of `V()` and `E()` is a `GraphTraversal`. A GraphTraversal maintains numerous methods that return
`GraphTraversal`. In this way, a `GraphTraversal` supports function composition. Each method of `GraphTraversal` is
called a step and each step modulates the results of the previous step in one of five general ways.
. `map`: transform the incoming traverser's object to another object (S &rarr; E).
. `flatMap`: transform the incoming traverser's object to an iterator of other objects (S &rarr; E*).
. `filter`: allow or disallow the traverser from proceeding to the next step (S &rarr; E &sube; S).
. `sideEffect`: allow the traverser to proceed unchanged, but yield some computational sideEffect in the process (S &rarrlp; S).
. `branch`: split the traverser and send each to an arbitrary location in the traversal (S &rarr; { S~1~ &rarr; E*, ..., S~n~ &rarr; E* } &rarr; E*).
Nearly every step in GraphTraversal either extends `MapStep`, `FlatMapStep`, `FilterStep`, `SideEffectStep`, or `BranchStep`.
TIP: `GraphTraversal` is a link:http://en.wikipedia.org/wiki/Monoid[monoid] in that it is an algebraic structure
that has a single binary operation that is associative. The binary operation is function composition (i.e. method
chaining) and its identity is the step `identity()`. This is related to a
link:http://en.wikipedia.org/wiki/Monad_(functional_programming)[monad] as popularized by the functional programming
community.
Given the TinkerPop graph, the following query will return the names of all the people that the marko-vertex knows.
The following query is demonstrated using Gremlin-Groovy.
[source,groovy]
----
$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
gremlin> graph = TinkerFactory.createModern() // <1>
==>tinkergraph[vertices:6 edges:6]
gremlin> g = graph.traversal() // <2>
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().has('name','marko').out('knows').values('name') // <3>
==>vadas
==>josh
----
<1> Open the toy graph and reference it by the variable `graph`.
<2> Create a graph traversal source from the graph using the standard, OLTP traversal engine. This object should be created once and then re-used.
<3> Spawn a traversal off the traversal source that determines the names of the people that the marko-vertex knows.
.The Name of The People That Marko Knows
image::tinkerpop-classic-ex1.png[width=500]
Or, if the marko-vertex is already realized with a direct reference pointer (i.e. a variable), then the traversal can
be spawned off that vertex.
[gremlin-groovy,modern]
----
marko = g.V().has('name','marko').next() <1>
g.V(marko).out('knows') <2>
g.V(marko).out('knows').values('name') <3>
----
<1> Set the variable `marko` to the vertex in the graph `g` named "marko".
<2> Get the vertices that are outgoing adjacent to the marko-vertex via knows-edges.
<3> Get the names of the marko-vertex's friends.
=== The Traverser
When a traversal is executed, the source of the traversal is on the left of the expression (e.g. vertex 1), the steps
are the middle of the traversal (e.g. `out('knows')` and `values('name')`), and the results are "traversal.next()'d"
out of the right of the traversal (e.g. "vadas" and "josh").
image::traversal-mechanics.png[width=500]
The objects propagating through the traversal are wrapped in a `Traverser<T>`. The traverser provides the means by
which steps remain stateless. A traverser maintains all the metadata about the traversal -- e.g., how many times the
traverser has gone through a loop, the path history of the traverser, the current object being traversed, etc.
Traverser metadata may be accessed by a step. A classic example is the <<path-step,`path()`>>-step.
[gremlin-groovy,modern]
----
g.V(marko).out('knows').values('name').path()
----
WARNING: Path calculation is costly in terms of space as an array of previously seen objects is stored in each path
of the respective traverser. Thus, a traversal strategy analyzes the traversal to determine if path metadata is
required. If not, then path calculations are turned off.
Another example is the <<repeat-step,`repeat()`>>-step which takes into account the number of times the traverser
has gone through a particular section of the traversal expression (i.e. a loop).
[gremlin-groovy,modern]
----
g.V(marko).repeat(out()).times(2).values('name')
----
WARNING: TinkerPop does not guarantee the order of results returned from a traversal. It only guarantees not to modify
the iteration order provided by the underlying graph. Therefore it is important to understand the order guarantees of
the graph database being used. A traversal's result is never ordered by TinkerPop unless performed explicitly by means
of <<order-step,`order()`>>-step.
== On Gremlin Language Variants
Gremlin is written in Java 8. There are various language variants of Gremlin such as Gremlin-Groovy (packaged with
TinkerPop3), Gremlin-Python (packaged with TinkerPop3), link:https://github.com/mpollmeier/gremlin-scala[Gremlin-Scala],
Gremlin-JavaScript, Gremlin-Clojure (known as link:https://github.com/clojurewerkz/ogre[Ogre]), etc.
It is best to think of Gremlin as a style of graph traversing that is not bound to a particular programming language per se.
Within a programming language familiar to the developer, there is a Gremlin variant that they can use that leverages
the idioms of that language. At minimum, a programming language providing a Gremlin implementation must support
link:http://en.wikipedia.org/wiki/Method_chaining[function chaining] (with
link:http://en.wikipedia.org/wiki/Anonymous_function[lambdas/anonymous functions] being a "nice to have" if the
variants wishes to offer arbitrary computations beyond the provided Gremlin steps).
Throughout the documentation, the examples provided are primarily written in Gremlin-Groovy. The reason for this is
the <<gremlin-console,Gremlin Console>> -- an interactive programming environment exists that does not require
code compilation. For learning TinkerPop3 and interacting with a live graph system in an ad hoc manner, the Gremlin
Console is invaluable. However, for developers interested in working with Gremlin-Java, a few Groovy-to-Java patterns
are presented below.
[source,groovy]
// Gremlin-Groovy
g.V().out('knows').values('name') <1>
g.V().out('knows').map{it.get().value('name') + ' is the friend name'} <2>
g.V().out('knows').sideEffect(System.out.&println) <3>
g.V().as('person').out('knows').as('friend').select('person','friend').by{it.value('name').length()} <4>
[source,java]
// Gremlin-Java
g.V().out("knows").values("name") <1>
g.V().out("knows").map(t -> t.get().value("name") + " is the friend name") <2>
g.V().out("knows").sideEffect(System.out::println) <3>
g.V().as("person").out("knows").as("friend").select("person","friend").by((Function<Vertex, Integer>) v -> v.<String>value("name").length()) <4>
<1> All the non-lambda step chaining is identical in Gremlin-Groovy and Gremlin-Java. However, note that Groovy
supports `'` strings as well as `"` strings.
<2> In Groovy, lambdas are called closures and have a different syntax, where Groovy supports the `it` keyword and
Java doesn't with all parameters requiring naming.
<3> The syntax for method references differs slightly between link:https://docs.oracle.com/javase/tutorial/java/javaOO/methodreferences.html[Java]
and link:http://mrhaki.blogspot.de/2009/08/groovy-goodness-turn-methods-into.html[Gremlin-Groovy].
<4> Groovy is lenient on object typing and Java is not. When the parameter type of the lambda is not known,
typecasting is required.
Please see the <<gremlin-variants, Gremlin Variants>> section for more information on this topic.
== Graph System Integration
image:provider-integration.png[width=395,float=right] TinkerPop is a framework composed of various interoperable
components. At the foundation there is the <<graph,core TinkerPop3 API>> which defines what a `Graph`, `Vertex`,
`Edge`, etc. are. At minimum a graph system provider must implement the core API. Once implemented, the Gremlin
<<traversal,traversal language>> is available to the graph system's users. However, the provider can go further and
develop specific <<traversalstrategy,`TraversalStrategy`>> optimizations that allow the graph system to inspect a
Gremlin query at runtime and optimize it for its particular implementation (e.g. index lookups, step reordering). If
the graph system is a graph processor (i.e. provides OLAP capabilities), the system should implement the
<<graphcomputer,`GraphComputer`>> API. This API defines how messages/traversers are passed between communicating
workers (i.e. threads and/or machines). Once implemented, the same Gremlin traversals execute against both the graph
database (OLTP) and the graph processor (OLAP). Note that the Gremlin language interprets the graph in terms of
vertices and edges -- i.e. Gremlin is a graph-based domain specific language. Users can create their own domain
specific languages to process the graph in terms of higher-order constructs such as people, companies, and their
various relationships. Finally, <<gremlin-server,Gremlin Server>> can be leveraged to allow over the wire
communication with the TinkerPop-enabled graph system. Gremlin Server provides a configurable communication interface
along with metrics and monitoring capabilities. In total, this is The TinkerPop.