<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<img src="img/tinkerpop-cityscape.png" class="img-responsive" />
<div class="container">
<div class="hero-unit" style="padding:10px">
<b><font size="5" face="american typewriter">Apache TinkerPop&trade;</font></b>
<p><font size="5">The Gremlin Graph Traversal Machine and Language</font></p>
</div>
</div>
<br/>
<div class="container-fluid">
<div class="container">
<div class="row">
<div class="col-sm-10 col-md-10">
<a href="http://arxiv.org/abs/1508.03843">Gremlin</a> is the graph traversal language of <a href="http://tinkerpop.apache.org/">Apache TinkerPop</a>.
Gremlin is a <a href="https://en.wikipedia.org/wiki/Functional_programming">functional</a>, <a href="https://en.wikipedia.org/wiki/Dataflow_programming">data-flow</a>
language that enables users to succinctly express complex traversals on (or queries of) their application's property graph. Every Gremlin traversal is composed of a sequence of (potentially nested) steps. A step
performs an atomic operation on the data stream. Every step is either a <em>map</em>-step (transforming the objects in the stream), a <em>filter</em>-step (removing objects
from the stream), or a <em>sideEffect</em>-step (computing statistics about the stream). The Gremlin step library extends these three fundamental operations to provide
users with a rich collection of steps that they can compose to ask any conceivable question of their data, for Gremlin is <a href="http://arxiv.org/abs/1508.03843">Turing complete</a>.
</div>
<div class="col-sm-2 col-md-2">
<img src="img/gremlin-head.png" width="100%">
</div>
</div>
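<p>The three step kinds can be illustrated without TinkerPop at all. The following is a minimal sketch using plain <code>java.util.stream</code>, not Gremlin's own machinery: <code>filter</code> stands in for a filter-step, <code>map</code> for a map-step, and <code>peek</code> for a sideEffect-step that counts what flows past. The class <code>StepKinds</code> and the sample names are hypothetical.</p>

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Hypothetical illustration only: a plain Java stream standing in for a stream of traversers.
public class StepKinds {

    // Number of objects observed by the side-effect step during the last run.
    static final AtomicInteger seen = new AtomicInteger();

    static List<Integer> nameLengths(List<String> names) {
        seen.set(0);
        return names.stream()
                .filter(n -> !n.equals("gremlin"))   // filter-step: removes objects from the stream
                .peek(n -> seen.incrementAndGet())   // sideEffect-step: records a statistic, passes objects through
                .map(String::length)                 // map-step: transforms each object
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> lengths = nameLengths(List.of("gremlin", "marko", "josh"));
        System.out.println(lengths + " after seeing " + seen.get() + " names");
    }
}
```

<p>Each call in the chain is one step, and the steps compose in the same left-to-right order in which the data flows.</p>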
<br/>
<div style="border-radius:3px;border:1px solid black;padding:10px;padding-left:10px;height:170px" id="gremlinCarousel" class="carousel slide" data-ride="carousel" data-interval="30000">
<!-- Indicators -->
<ol class="carousel-indicators carousel-indicators-numbers">
<li data-target="#gremlinCarousel" data-slide-to="0" class="active">1</li>
<li data-target="#gremlinCarousel" data-slide-to="1">2</li>
<li data-target="#gremlinCarousel" data-slide-to="2">3</li>
<li data-target="#gremlinCarousel" data-slide-to="3">4</li>
<li data-target="#gremlinCarousel" data-slide-to="4">5</li>
<li data-target="#gremlinCarousel" data-slide-to="5">6</li>
</ol>
<div class="carousel-inner" role="listbox">
<div class="item active">
<div class="row">
<div class="col-xs-5">
<pre style="padding-left:10px;height:148px;overflow:hidden;"><code class="language-gremlin">
g.V().has("name","gremlin").
out("knows").
out("knows").
values("name")
</code></pre>
</div>
<div class="col-xs-7" style="border-left: thin solid #000000;height:148px">
<b>What are the names of Gremlin's friends' friends?</b>
<p/>
<ol style="padding-left:20px">
<li>Get the vertex with name "gremlin."</li>
<li>Traverse to the people that Gremlin knows.</li>
<li>Traverse to the people those people know.</li>
<li>Get those people's names.</li>
</ol>
<br/>
</div>
</div>
</div>
<div class="item">
<div class="row">
<div class="col-xs-5">
<pre style="padding-left:10px;height:148px;overflow:hidden;"><code class="language-gremlin">
g.V().match(
as("a").out("knows").as("b"),
as("a").out("created").as("c"),
as("b").out("created").as("c"),
as("c").in("created").count().is(2)).
select("c").by("name")</code></pre>
</div>
<div class="col-xs-7" style="border-left: thin solid #000000;height:148px">
<b>What are the names of the projects created by two friends?</b>
<p/>
<ol style="padding-left:20px">
<li>...there exists some "a" who knows "b".</li>
<li>...there exists some "a" who created "c".</li>
<li>...there exists some "b" who created "c".</li>
<li>...there exists some "c" created by 2 people.</li>
<li>Get the name of all matching "c" projects.</li>
</ol>
</div>
</div>
</div>
<div class="item">
<div class="row">
<div class="col-xs-5">
<pre style="padding-left:10px;height:148px;overflow:hidden;"><code class="language-gremlin">
g.V().has("name","gremlin").
repeat(in("manages")).
until(has("title","ceo")).
path().by("name")
</code></pre>
</div>
<div class="col-xs-7" style="border-left: thin solid #000000;height:148px">
<b>Get the managers from Gremlin to the CEO in the hierarchy.</b>
<p/>
<ol style="padding-left:20px">
<li>Get the vertex with the name "gremlin."</li>
<li>Traverse up the management chain...</li>
<li>...until a person with the title of CEO is reached.</li>
<li>Get the names of the managers in the path traversed.</li>
</ol>
<br/>
</div>
</div>
</div>
<div class="item">
<div class="row">
<div class="col-xs-5">
<pre style="padding-left:10px;height:148px;overflow:hidden;"><code class="language-gremlin">
g.V().has("name","gremlin").as("a").
out("created").in("created").
where(neq("a")).
groupCount().by("title")
</code></pre>
</div>
<div class="col-xs-7" style="border-left: thin solid #000000;height:148px">
<b>Get the distribution of titles amongst Gremlin's collaborators.</b>
<p/>
<ol style="padding-left:20px">
<li>Get the vertex with the name "gremlin" and label it "a."</li>
<li>Get Gremlin's created projects and then who created them...</li>
<li>...that are not Gremlin.</li>
<li>Group count those collaborators by their titles.</li>
</ol>
<br/>
</div>
</div>
</div>
<div class="item">
<div class="row">
<div class="col-xs-5">
<pre style="padding-left:10px;height:148px;overflow:hidden;"><code class="language-gremlin">
g.V().has("name","gremlin").
out("bought").aggregate("stash").
in("bought").out("bought").
where(not(within("stash"))).
groupCount().order(local).by(values,desc)
</code></pre>
</div>
<div class="col-xs-7" style="border-left: thin solid #000000;height:148px">
<b>Get a ranked list of relevant products for Gremlin to purchase.</b>
<p/>
<ol style="padding-left:20px">
<li>Get the vertex with the name "gremlin."</li>
<li>Get the products Gremlin has purchased and save as "stash."</li>
<li>Who else bought those products and what else did they buy...</li>
<li>...that Gremlin has not already purchased.</li>
<li>Group count the products and order by their relevance.</li>
</ol>
</div>
</div>
</div>
<div class="item">
<div class="row">
<div class="col-xs-5">
<pre style="padding-left:10px;height:148px;overflow:hidden;"><code class="language-gremlin">
g.V().hasLabel("person").
pageRank().
by("friendRank").
by(outE("knows")).
order().by("friendRank",desc).
limit(10)</code></pre>
</div>
<div class="col-xs-7" style="border-left: thin solid #000000;height:148px">
<b>Get the 10 most central people in the knows-graph.</b>
<p/>
<ol style="padding-left:20px">
<li>Get all people vertices.</li>
<li>Calculate their PageRank using knows-edges.</li>
<li>Order the people by their friendRank score.</li>
<li>Get the top 10 ranked people.</li>
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
<br/>
<div class="container">
<a name="oltp-and-olap-traversals"></a>
<h3>OLTP and OLAP Traversals</h3>
<br/>
Gremlin was designed according to the "write once, run anywhere" philosophy. This means that not only can all TinkerPop-enabled
graph systems execute Gremlin traversals, but every Gremlin traversal can also be evaluated as either a real-time database query
or a batch analytics query. The former is known as <em>online transaction processing</em> (<a href="https://en.wikipedia.org/wiki/Online_transaction_processing">OLTP</a>) and the latter as <em>online analytical
processing</em> (<a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>). This universality is made possible by the Gremlin traversal machine. This distributed, graph-based <a href="https://en.wikipedia.org/wiki/Virtual_machine#Abstract_virtual_machine_techniques">virtual machine</a>
understands how to coordinate the execution of a multi-machine graph traversal. Moreover, execution need not be exclusively OLTP or
OLAP: certain subsets of a traversal can execute via OLTP while others execute via OLAP. The benefit is that the user does
not need to learn both a database query language and a domain-specific big data analytics language (e.g. Spark DSL, MapReduce, etc.).
Gremlin is all that is required to build a graph-based application because the Gremlin traversal machine will handle the rest.
<br/><br/>
<center><img src="img/oltp-and-olap.png" style="width:80%;" class="img-responsive"></center>
</div>
<br/>
<div class="container">
<a name="imperative-and-declarative-traversals"></a>
<h3>Imperative and Declarative Traversals</h3>
<br/>
<div class="row">
<div class="col-sm-7 col-md-8">
A Gremlin traversal can be written in either an <em>imperative</em> (<a href="https://en.wikipedia.org/wiki/Imperative_programming">procedural</a>) manner, a <em>declarative</em> (<a href="https://en.wikipedia.org/wiki/Declarative_programming">descriptive</a>) manner,
or in a hybrid manner containing both imperative and declarative aspects. An imperative Gremlin traversal tells the traversers how to proceed at each step in the traversal. For instance,
the imperative traversal on the right first places a traverser at the vertex denoting Gremlin. That traverser then splits itself across all of Gremlin's collaborators that are not Gremlin
himself. Next, the traversers walk to the managers of those collaborators to ultimately be grouped into a manager name count distribution. This traversal is imperative in that it tells the
traversers to "go here and then go there" in an explicit, procedural manner.
</div>
<div class="col-sm-5 col-md-4">
<pre style="padding:10px;">
<code class="language-gremlin">g.V().has("name","gremlin").as("a").
out("created").in("created").
where(neq("a")).
in("manages").
groupCount().by("name")</code>
</pre>
</div>
</div>
<p/>
<div class="row">
<div class="col-sm-5 col-md-4">
<pre style="padding:10px;">
<code class="language-gremlin">g.V().match(
as("a").has("name","gremlin"),
as("a").out("created").as("b"),
as("b").in("created").as("c"),
as("c").in("manages").as("d"),
where("a",neq("c"))).
select("d").
groupCount().by("name")</code>
</pre>
</div>
<div class="col-sm-7 col-md-8">
A declarative Gremlin traversal does not tell the traversers the order in which to execute their walk, but instead, allows each traverser to select a pattern to execute from a collection
of (potentially nested) patterns. The <a href="http://tinkerpop.apache.org/docs/current/reference/#match-step">declarative traversal</a> on the left yields the same result as the imperative traversal above. However, the declarative traversal has the added benefit
that it leverages not only a compile-time query planner (like imperative traversals), but also a runtime query planner that chooses which traversal pattern to execute next based on the
historic statistics of each pattern -- favoring those patterns which tend to reduce/filter the most data.
</div>
</div>
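<p>The idea of a runtime planner that favors the patterns which filter the most data can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions, not the algorithm used by Gremlin's <code>match()</code>-step: patterns are modelled as integer predicates, and the class <code>AdaptivePlanner</code>, its <code>Pattern</code> type, and the two sample patterns are all hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch: not TinkerPop's match() implementation.
public class AdaptivePlanner {

    // A pattern plus the statistics gathered while executing it.
    static final class Pattern {
        final String name;
        final IntPredicate test;
        long tried = 0;   // how many objects this pattern has been evaluated on
        long passed = 0;  // how many of those it let through

        Pattern(String name, IntPredicate test) {
            this.name = name;
            this.test = test;
        }

        // Observed fraction of objects that survive; untried patterns count as 1.0.
        double passRate() {
            return tried == 0 ? 1.0 : (double) passed / tried;
        }

        boolean evaluate(int x) {
            tried++;
            boolean ok = test.test(x);
            if (ok) passed++;
            return ok;
        }
    }

    final List<Pattern> patterns = new ArrayList<>();

    AdaptivePlanner add(String name, IntPredicate test) {
        patterns.add(new Pattern(name, test));
        return this;
    }

    // Evaluates all patterns on x, trying the historically most selective one first.
    boolean matches(int x) {
        patterns.sort(Comparator.comparingDouble(Pattern::passRate));
        for (Pattern p : patterns) {
            if (!p.evaluate(x)) return false; // stop early: the remaining patterns need not run
        }
        return true;
    }

    public static void main(String[] args) {
        AdaptivePlanner planner = new AdaptivePlanner()
                .add("positive", x -> x > 0)               // filters almost nothing
                .add("multipleOf100", x -> x % 100 == 0);  // filters almost everything
        int hits = 0;
        for (int i = 1; i <= 1000; i++) {
            if (planner.matches(i)) hits++;
        }
        System.out.println(hits + " matches; first pattern is now " + planner.patterns.get(0).name);
    }
}
```

<p>Although the patterns are declared with the unselective one first, the planner reorders them after the first observation, so the cheap-to-fail pattern shields the other from most of the input.</p>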
<br/>
The user can write their traversals in any way they choose. Ultimately, however, when a traversal is compiled, and depending on the underlying execution engine
(i.e. an OLTP graph database or an OLAP graph processor), it is rewritten by a set of <em><a href="http://tinkerpop.apache.org/docs/current/reference/#traversalstrategy">traversal strategies</a></em> which do their best to determine an optimal execution
plan based on an understanding of graph data access costs as well as the underlying data system's unique capabilities (e.g. fetch the Gremlin vertex from the graph database's "name"-index).
Gremlin has been designed to give users flexibility in how they express their queries and graph system providers flexibility in how to efficiently evaluate traversals against their TinkerPop-enabled data system.
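<p>As a rough illustration of what such a rewrite might look like, the sketch below folds a full vertex scan followed by a "name" filter into a single index lookup. It is a toy model under stated assumptions: steps are represented as plain strings, and <code>StrategySketch</code>, <code>INDEX_STRATEGY</code>, and the <code>indexLookup</code> step name are hypothetical and are not part of the TinkerPop API.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: steps are modelled as plain strings, which real strategies do not do.
public class StrategySketch {

    // Folds a full vertex scan followed by a name filter into one index lookup.
    static final UnaryOperator<List<String>> INDEX_STRATEGY = steps -> {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < steps.size(); i++) {
            String step = steps.get(i);
            boolean scanThenFilter = step.equals("V()")
                    && i + 1 < steps.size()
                    && steps.get(i + 1).startsWith("has(\"name\",");
            if (scanThenFilter) {
                String filter = steps.get(i + 1);
                // Extract the value argument, dropping the prefix and the closing parenthesis.
                String value = filter.substring("has(\"name\",".length(), filter.length() - 1);
                out.add("indexLookup(\"name\"," + value + ")");
                i++; // skip the has() step that was folded in
            } else {
                out.add(step);
            }
        }
        return out;
    };

    // Applies each strategy in order to produce the final execution plan.
    static List<String> compile(List<String> steps, List<UnaryOperator<List<String>>> strategies) {
        List<String> plan = steps;
        for (UnaryOperator<List<String>> strategy : strategies) {
            plan = strategy.apply(plan);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> written = List.of("V()", "has(\"name\",\"gremlin\")", "out(\"knows\")", "values(\"name\")");
        System.out.println(compile(written, List.of(INDEX_STRATEGY)));
    }
}
```

<p>The user's traversal is left as written; only the compiled plan changes, which is why a provider can add such optimizations without affecting application code.</p>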
</div>
<br/>
<div class="container">
<a name="host-language-embedding"></a>
<h3>Host Language Embedding</h3>
<br/>
<div class="row">
<div class="col-sm-5 col-md-4">
<img src="img/gremlin-language-variants.png" class="img-responsive">
</div>
<div class="col-sm-7 col-md-8">
Classic database query languages, like <a href="https://en.wikipedia.org/wiki/SQL">SQL</a>, were conceived as being fundamentally different from the programming languages that would
ultimately use them in a production setting. For this reason, classical databases require the developer to code both in their native programming
language as well as in the database's respective query language. An argument can be made that the difference between "query languages" and
"programming languages" is not as great as we are taught to believe. Gremlin bridges this divide because traversals can be written in any
programming language that supports function <a href="https://en.wikipedia.org/wiki/Function_composition">composition</a> and <a href="https://en.wikipedia.org/wiki/Nested_function">nesting</a> (which every major programming language supports). In this way, the user's
Gremlin traversals are written alongside their application code and benefit from the advantages afforded by the host language and its tooling
(e.g. type checking, syntax highlighting, dot completion, etc.). Various <a href="http://tinkerpop.apache.org/docs/current/tutorials/gremlin-language-variants/">Gremlin language variants</a> exist including: Gremlin-Java, Gremlin-Groovy, <a href="http://tinkerpop.apache.org/docs/current/reference/#gremlin-python">Gremlin-Python</a>,
<a href="https://github.com/mpollmeier/gremlin-scala">Gremlin-Scala</a>, etc.
</div>
<div class="col-md-12">
<p><br/>The first example below shows a simple Java class. Note that the Gremlin traversal is expressed in Gremlin-Java and thus, is part of the user's application code. There is no need for the
developer to create a <code>String</code> representation of their query in (yet) another language to ultimately pass that <code>String</code> to the graph computing system and be returned a result set. Instead,
traversals are embedded in the user's host programming language and are on equal footing with all other application code. With Gremlin, users <strong>do not</strong> have to deal with the awkwardness exemplified
in the second example below which is a common anti-pattern found throughout the industry.
</p>
</div>
<br/><br/>
<div class="col-md-5">
<pre style="padding:10px;"><code class="language-gremlin">public class GremlinTinkerPopExample {
public void run(String name, String property) {
Graph graph = GraphFactory.open(...);
GraphTraversalSource g = graph.traversal();
double avg = g.V().has("name",name).
out("knows").out("created").
values(property).mean().next();
System.out.println("Average rating: " + avg);
}
}
</code>
</pre>
</div>
<div class="col-md-7">
<pre style="padding:10px;"><code class="language-gremlin">public class SqlJdbcExample {
  public void run(String name, String property) throws SQLException {
    Connection connection = DriverManager.getConnection(...);
    Statement statement = connection.createStatement();
    // Anti-pattern: the query is assembled by concatenating strings.
    ResultSet result = statement.executeQuery(
      "SELECT AVG(pr." + property + ") as AVERAGE FROM PERSONS p1 " +
        "INNER JOIN KNOWS k ON k.person1 = p1.id " +
        "INNER JOIN PERSONS p2 ON p2.id = k.person2 " +
        "INNER JOIN CREATED c ON c.person = p2.id " +
        "INNER JOIN PROJECTS pr ON pr.id = c.project " +
          "WHERE p1.name = '" + name + "'");
    result.next(); // move the cursor to the single aggregate row
    System.out.println("Average rating: " + result.getDouble("AVERAGE"));
  }
}</code>
</pre>
</div>
<div class="col-md-12">
<p><br/>Behind the scenes, a Gremlin traversal will evaluate locally against an embedded graph database, serialize itself across the network to a remote
graph database, or send itself to an OLAP processor for cluster-wide distributed execution. The traversal source definition determines where the traversal executes. Once a traversal source is
defined it can be used over and over again in a manner analogous to a database connection. The ultimate effect is that the user "feels" that their data and their traversals are all
co-located in their application and accessible via their application's native programming language. The "query language/programming language"-divide is bridged by Gremlin.
</p>
<br/>
</div>
<div class="col-md-12">
<pre style="padding:10px;"><code class="language-gremlin">Graph graph = GraphFactory.open(...);
GraphTraversalSource g;
g = graph.traversal(); // local OLTP
g = traversal().withRemote(DriverRemoteConnection.using("localhost", 8182)); // remote
g = graph.traversal().withComputer(SparkGraphComputer.class); // distributed OLAP</code>
</pre>
</div>
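<div class="col-md-12">
<p><br/>To make the mechanics concrete, here is a minimal sketch of how chained method calls in the host language can be recorded as steps and then handed to whichever executor the source was configured with. It is a toy stand-in and not the TinkerPop API: <code>EmbeddingSketch</code>, its <code>Source</code> and <code>Traversal</code> types, <code>submit()</code>, and the string payload format are all hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: a toy stand-in for a traversal source, not the TinkerPop API.
public class EmbeddingSketch {

    // Records each chained method call as a step instead of building a query string by hand.
    static final class Traversal {
        private final List<String> steps = new ArrayList<>();
        private final Function<List<String>, String> executor;

        Traversal(Function<List<String>, String> executor) {
            this.executor = executor;
        }

        Traversal V() { steps.add("V"); return this; }
        Traversal has(String key, String value) { steps.add("has:" + key + "=" + value); return this; }
        Traversal out(String label) { steps.add("out:" + label); return this; }

        // Hands the recorded steps to whatever executor the source was configured with.
        String submit() { return executor.apply(steps); }
    }

    // The source fixes where traversals run; the traversal itself is written the same way.
    static final class Source {
        private final Function<List<String>, String> executor;

        Source(Function<List<String>, String> executor) { this.executor = executor; }

        Traversal V() { return new Traversal(executor).V(); }
    }

    static Source local() {
        return new Source(steps -> "evaluated in-process: " + steps);
    }

    static Source remote(String host, int port) {
        // A real driver would send bytes over the network; here the steps are only joined into a payload.
        return new Source(steps -> "sent to " + host + ":" + port + " -> " + String.join("|", steps));
    }

    public static void main(String[] args) {
        for (Source g : List.of(local(), remote("localhost", 8182))) {
            System.out.println(g.V().has("name", "gremlin").out("knows").submit());
        }
    }
}
```

<p>The same chained expression runs against both sources; only the source definition decides whether the recorded steps are evaluated in-process or shipped elsewhere.</p>
</div>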
<br/>
</div>
<div class="container">
<hr/>
<h4>Related Resources</h4>
<br/>
<div class="carousel slide" data-ride="carousel" data-type="multi" data-interval="7000" id="relatedResources">
<div class="carouselGrid-inner">
<div class="item active">
<div class="col-lg-3 col-md-3 col-sm-4 col-xs-6"><a href="https://academy.datastax.com/resources/getting-started-graph-databases"><img src="img/resources/graph-databases-101-resource.png" width="100%" /></a></div>
</div>
<div class="item">
<div class="col-lg-3 col-md-3 col-sm-4 col-xs-6"><a href="http://datastax.com/dev/blog/the-benefits-of-the-gremlin-graph-traversal-machine"><img src="img/resources/benefits-gremlin-machine-resource.png" width="100%" /></a></div>
</div>
<div class="item">
<div class="col-lg-3 col-md-3 col-sm-4 col-xs-6"><a href="http://arxiv.org/abs/1508.03843"><img src="img/resources/arxiv-article-resource.png" width="100%" /></a></div>
</div>
<div class="item">
<div class="col-lg-3 col-md-3 col-sm-4 col-xs-6"><a href="http://sql2gremlin.com/"><img src="img/resources/sql-2-gremlin-resource.png" width="100%" /></a></div>
</div>
</div>
<a class="left carouselGrid-control" href="#relatedResources" data-slide="prev">
<span class="icon-prev" aria-hidden="true"></span>
<span class="sr-only">Previous</span>
</a>
<a class="right carouselGrid-control" href="#relatedResources" data-slide="next">
<span class="icon-next" aria-hidden="true"></span>
<span class="sr-only">Next</span>
</a>
</div>
<script>
$('.carousel[data-type="multi"] .item').each(function(){
var next = $(this).next();
if (!next.length) { // if there isn't a next
next = $(this).siblings(':first'); // this is the first
}
next.children(':first-child').clone().appendTo($(this)); // put the next ones on the array
for (var i=0;i<2;i++) { // THIS LOOP SPITS OUT EXTRA ITEMS TO THE CAROUSEL
next=next.next();
if (!next.length) {
next = $(this).siblings(':first');
}
next.children(':first-child').clone().appendTo($(this));
}
});
</script>
</div>
<br/>
<br/>
</div>
</div>