blob: 1ddbef05fbbb4da66978e0d37ad15dc6256fbf42 [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[centrality]]
== Centrality
There are many measures of link:https://en.wikipedia.org/wiki/Centrality[centrality] which are meant to help identify
the most important vertices in a graph. As these measures are common in graph theory, this section attempts to
demonstrate how some of these different indicators can be calculated using Gremlin.
[[degree-centrality]]
=== Degree Centrality
link:https://en.wikipedia.org/wiki/Centrality#Degree_centrality[Degree centrality] is a measure of the number of
edges associated to each vertex. The following examples use the modern toy graph:
[gremlin-groovy,modern]
----
g.V().group().by().by(bothE().count()) <1>
g.V().group().by().by(inE().count()) <2>
g.V().group().by().by(outE().count()) <3>
g.V().project("v","degree").by().by(bothE().count()) <4>
g.V().project("v","degree").by().by(bothE().count()). <5>
order().by("degree", desc).
limit(4)
----
<1> Calculation of degree centrality which counts all incident edges on each vertex to include those that are both
incoming and outgoing.
<2> Calculation of in-degree centrality which only counts incoming edges to a vertex.
<3> Calculation of out-degree centrality which only counts outgoing edges from a vertex.
<4> The previous examples all produce a single `Map` as their output. While that is a desirable output, producing a
stream of `Map` objects can allow some greater flexibility.
<5> For example, use of a stream enables use of an ordered limit that can be executed in a distributed fashion in
OLAP traversals.
NOTE: The link:https://tinkerpop.apache.org/docs/x.y.z/reference/#group-step[group] step takes up to two separate
link:https://tinkerpop.apache.org/docs/x.y.z/reference/#by-step[by] modulators. The first `by()` tells `group()`
what the key in the resulting `Map` will be (i.e. the value to group on). In the above examples, the `by()` is empty
and as a result, the grouping will be on the incoming `Vertex` object itself. The second `by()` is the value to be
stored in the `Map` for each key.
[[betweeness-centrality]]
=== Betweeness Centrality
link:https://en.wikipedia.org/wiki/Betweenness_centrality[Betweeness centrality] is a measure of the number of times
a vertex is found between the <<shortest-path,shortest path>> of each vertex pair in a graph. Consider the following
graph for demonstration purposes:
image:betweeness-example.png[width=600]
[gremlin-groovy ]
----
g.addV().property(id,'A').as('a').
addV().property(id,'B').as('b').
addV().property(id,'C').as('c').
addV().property(id,'D').as('d').
addV().property(id,'E').as('e').
addV().property(id,'F').as('f').
addE('next').from('a').to('b').
addE('next').from('b').to('c').
addE('next').from('b').to('d').
addE('next').from('c').to('e').
addE('next').from('d').to('e').
addE('next').from('e').to('f').iterate()
g.V().as("v"). <1>
repeat(both().simplePath().as("v")).emit(). <2>
filter(project("x","y","z").by(select(first, "v")). <3>
by(select(last, "v")).
by(select(all, "v").count(local)).as("triple").
coalesce(select("x","y").as("a"). <4>
select("triples").unfold().as("t").
select("x","y").where(eq("a")).
select("t"),
store("triples")). <5>
select("z").as("length").
select("triple").select("z").where(eq("length"))). <6>
select(all, "v").unfold(). <7>
groupCount().next() <8>
----
<1> Starting from each vertex in the graph...
<2> ...traverse on both - incoming and outgoing - edges, avoiding <<cycle-detection, cyclic paths>>.
<3> Create a triple consisting of the first vertex, the last vertex and the length of the path between them.
<4> Determine whether a path between those two vertices was already found.
<5> If this is the first path between the two vertices, store the triple in an internal collection named "triples".
<6> Keep only those paths between a pair of vertices that have the same length as the first path that was found between them.
<7> Select all shortest paths and unfold them.
<8> Count the number of occurrences of each vertex, which is ultimately its betweeness score.
WARNING: Since the betweeness centrality algorithm requires the shortest path between any pair of vertices in the graph,
its practical applications are very limited. It's recommended to use this algorithm only on small subgraphs (graphs like
the link:https://tinkerpop.apache.org/docs/current/reference/#grateful-dead[Grateful Dead graph] with only 808 vertices
and 8049 edges already require a massive amount of compute resources to determine the shortest paths between all vertex
pairs).
[[closeness-centrality]]
=== Closeness Centrality
link:https://en.wikipedia.org/wiki/Centrality[Closeness centrality] is a measure of the distance of one vertex to all
other reachable vertices in the graph. The following examples use the modern toy graph:
[gremlin-groovy,modern]
----
g = TinkerFactory.createModern().traversal()
g.withSack(1f).V().as("v"). <1>
repeat(both().simplePath().as("v")).emit(). <2>
filter(project("x","y","z").by(select(first, "v")). <3>
by(select(last, "v")).
by(select(all, "v").count(local)).as("triple").
coalesce(select("x","y").as("a"). <4>
select("triples").unfold().as("t").
select("x","y").where(eq("a")).
select("t"),
store("triples")). <5>
select("z").as("length").
select("triple").select("z").where(eq("length"))). <6>
group().by(select(first, "v")). <7>
by(select(all, "v").count(local).sack(div).sack().sum()).next()
----
<1> Defines a Gremlin link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sack-step[sack] with a value of one.
<2> Traverses on both - incoming and outgoing - edges, avoiding <<cycle-detection, cyclic paths>>.
<3> Create a triple consisting of the first vertex, the last vertex and the length of the path between them.
<4> Determine whether a path between those two vertices was already found.
<5> If this is the first path between the two vertices, store the triple in an internal collection named "triples".
<6> Keep only those paths between a pair of vertices that have the same length as the first path that was found between them.
<7> For each vertex divide 1 by the product of the lengths of all shortest paths that start with this particular vertex.
WARNING: Since the closeness centrality algorithm requires the shortest path between any pair of vertices in the graph,
its practical applications are very limited. It's recommended to use this algorithm only on small subgraphs (graphs like
the link:https://tinkerpop.apache.org/docs/current/reference/#grateful-dead[Grateful Dead graph] with only 808 vertices
and 8049 edges already require a massive amount of compute resources to determine the shortest paths between all vertex
pairs).
[[eigenvector-centrality]]
=== Eigenvector Centrality
A calculation of link:https://en.wikipedia.org/wiki/Centrality#Eigenvector_centrality[eigenvector centrality] uses the
relative importance of adjacent vertices to help determine their centrality. In other words, unlike
<<degree-centrality, degree centrality>> the vertex with the greatest number of incident edges does not necessarily
give it the highest rank. Consider the following example using the Grateful Dead graph:
[gremlin-groovy]
----
g.io('data/grateful-dead.xml').read().iterate()
g.V().repeat(groupCount('m').by('name').out()).times(5).cap('m'). <1>
order(local).by(values, desc).limit(local, 10).next() <2>
g.V().repeat(groupCount('m').by('name').out().timeLimit(100)).times(5).cap('m'). <3>
order(local).by(values, desc).limit(local, 10).next()
----
<1> The traversal iterates through each vertex in the graph and for each one repeatedly group counts each vertex that
passes through using the vertex as the key. The `Map` of this group count is stored in a variable named "m". The
`out()` traversal is repeated thirty times or until the paths are exhausted. Five iterations should provide enough
time to converge on a solution. Calling `cap('m')` at the end simply extracts the `Map` side-effect stored in "m".
<2> The entries in the `Map` are then iterated and sorted with the top ten most central vertices presented as output.
<3> The previous examples can be expanded on a little bit by including a
link:https://tinkerpop.apache.org/docs/current/reference/#timelimit-step[time limit]. The `timeLimit()` prevents the
traversal from taking longer than one hundred milliseconds to execute (the previous example takes considerably longer
than that). While the answer provided with the `timeLimit()` is not the absolute ranking, it does provide a relative
ranking that closely matches the absolute one. The use of `timeLimit()` in certain algorithms (e.g. recommendations)
can shorten the time required to get a reasonable and usable result.
[[pagerank-centrality]]
=== PageRank Centrality
While not technically a recipe, it's worth noting here in the "Centrality Section" that
link:https://en.wikipedia.org/wiki/PageRank[PageRank] centrality can be calculated with Gremlin with the
link:https://tinkerpop.apache.org/docs/x.y.z/reference/#pagerank-step[pageRank()]-step which is designed to work with
`GraphComputer` (OLAP) based traversals.
[gremlin-groovy,modern]
----
g = graph.traversal().withComputer()
g.V().pageRank().by('pageRank').values('pageRank')
----