| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[centrality]] |
| == Centrality |
| |
| There are many measures of link:https://en.wikipedia.org/wiki/Centrality[centrality] which are meant to help identify |
| the most important vertices in a graph. As these measures are common in graph theory, this section attempts to |
| demonstrate how some of these different indicators can be calculated using Gremlin. |
| |
| [[degree-centrality]] |
| === Degree Centrality |
| |
| link:https://en.wikipedia.org/wiki/Centrality#Degree_centrality[Degree centrality] is a measure of the number of |
| edges associated to each vertex. The following examples use the modern toy graph: |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V().group().by().by(bothE().count()) <1> |
| g.V().group().by().by(inE().count()) <2> |
| g.V().group().by().by(outE().count()) <3> |
| g.V().project("v","degree").by().by(bothE().count()) <4> |
| g.V().project("v","degree").by().by(bothE().count()). <5> |
| order().by("degree", desc). |
| limit(4) |
| ---- |
| |
| <1> Calculation of degree centrality which counts all incident edges on each vertex to include those that are both |
| incoming and outgoing. |
| <2> Calculation of in-degree centrality which only counts incoming edges to a vertex. |
| <3> Calculation of out-degree centrality which only counts outgoing edges from a vertex. |
| <4> The previous examples all produce a single `Map` as their output. While that is a desirable output, producing a |
| stream of `Map` objects can allow some greater flexibility. |
| <5> For example, use of a stream enables use of an ordered limit that can be executed in a distributed fashion in |
| OLAP traversals. |
| |
| NOTE: The link:https://tinkerpop.apache.org/docs/x.y.z/reference/#group-step[group] step takes up to two separate |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#by-step[by] modulators. The first `by()` tells `group()` |
| what the key in the resulting `Map` will be (i.e. the value to group on). In the above examples, the `by()` is empty |
| and as a result, the grouping will be on the incoming `Vertex` object itself. The second `by()` is the value to be |
| stored in the `Map` for each key. |
| |
| [[betweeness-centrality]] |
| === Betweeness Centrality |
| |
| link:https://en.wikipedia.org/wiki/Betweenness_centrality[Betweeness centrality] is a measure of the number of times |
| a vertex is found between the <<shortest-path,shortest path>> of each vertex pair in a graph. Consider the following |
| graph for demonstration purposes: |
| |
| image:betweeness-example.png[width=600] |
| |
| [gremlin-groovy ] |
| ---- |
| g.addV().property(id,'A').as('a'). |
| addV().property(id,'B').as('b'). |
| addV().property(id,'C').as('c'). |
| addV().property(id,'D').as('d'). |
| addV().property(id,'E').as('e'). |
| addV().property(id,'F').as('f'). |
| addE('next').from('a').to('b'). |
| addE('next').from('b').to('c'). |
| addE('next').from('b').to('d'). |
| addE('next').from('c').to('e'). |
| addE('next').from('d').to('e'). |
| addE('next').from('e').to('f').iterate() |
| g.V().as("v"). <1> |
| repeat(both().simplePath().as("v")).emit(). <2> |
| filter(project("x","y","z").by(select(first, "v")). <3> |
| by(select(last, "v")). |
| by(select(all, "v").count(local)).as("triple"). |
| coalesce(select("x","y").as("a"). <4> |
| select("triples").unfold().as("t"). |
| select("x","y").where(eq("a")). |
| select("t"), |
| store("triples")). <5> |
| select("z").as("length"). |
| select("triple").select("z").where(eq("length"))). <6> |
| select(all, "v").unfold(). <7> |
| groupCount().next() <8> |
| ---- |
| |
| <1> Starting from each vertex in the graph... |
| <2> ...traverse on both - incoming and outgoing - edges, avoiding <<cycle-detection, cyclic paths>>. |
| <3> Create a triple consisting of the first vertex, the last vertex and the length of the path between them. |
| <4> Determine whether a path between those two vertices was already found. |
| <5> If this is the first path between the two vertices, store the triple in an internal collection named "triples". |
| <6> Keep only those paths between a pair of vertices that have the same length as the first path that was found between them. |
| <7> Select all shortest paths and unfold them. |
| <8> Count the number of occurrences of each vertex, which is ultimately its betweeness score. |
| |
| WARNING: Since the betweeness centrality algorithm requires the shortest path between any pair of vertices in the graph, |
| its practical applications are very limited. It's recommended to use this algorithm only on small subgraphs (graphs like |
| the link:https://tinkerpop.apache.org/docs/current/reference/#grateful-dead[Grateful Dead graph] with only 808 vertices |
| and 8049 edges already require a massive amount of compute resources to determine the shortest paths between all vertex |
| pairs). |
| |
| [[closeness-centrality]] |
| === Closeness Centrality |
| |
| link:https://en.wikipedia.org/wiki/Centrality[Closeness centrality] is a measure of the distance of one vertex to all |
| other reachable vertices in the graph. The following examples use the modern toy graph: |
| |
| [gremlin-groovy,modern] |
| ---- |
| g = TinkerFactory.createModern().traversal() |
| g.withSack(1f).V().as("v"). <1> |
| repeat(both().simplePath().as("v")).emit(). <2> |
| filter(project("x","y","z").by(select(first, "v")). <3> |
| by(select(last, "v")). |
| by(select(all, "v").count(local)).as("triple"). |
| coalesce(select("x","y").as("a"). <4> |
| select("triples").unfold().as("t"). |
| select("x","y").where(eq("a")). |
| select("t"), |
| store("triples")). <5> |
| select("z").as("length"). |
| select("triple").select("z").where(eq("length"))). <6> |
| group().by(select(first, "v")). <7> |
| by(select(all, "v").count(local).sack(div).sack().sum()).next() |
| ---- |
| |
| <1> Defines a Gremlin link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sack-step[sack] with a value of one. |
| <2> Traverses on both - incoming and outgoing - edges, avoiding <<cycle-detection, cyclic paths>>. |
| <3> Create a triple consisting of the first vertex, the last vertex and the length of the path between them. |
| <4> Determine whether a path between those two vertices was already found. |
| <5> If this is the first path between the two vertices, store the triple in an internal collection named "triples". |
| <6> Keep only those paths between a pair of vertices that have the same length as the first path that was found between them. |
| <7> For each vertex divide 1 by the product of the lengths of all shortest paths that start with this particular vertex. |
| |
| WARNING: Since the closeness centrality algorithm requires the shortest path between any pair of vertices in the graph, |
| its practical applications are very limited. It's recommended to use this algorithm only on small subgraphs (graphs like |
| the link:https://tinkerpop.apache.org/docs/current/reference/#grateful-dead[Grateful Dead graph] with only 808 vertices |
| and 8049 edges already require a massive amount of compute resources to determine the shortest paths between all vertex |
| pairs). |
| |
| [[eigenvector-centrality]] |
| === Eigenvector Centrality |
| |
| A calculation of link:https://en.wikipedia.org/wiki/Centrality#Eigenvector_centrality[eigenvector centrality] uses the |
| relative importance of adjacent vertices to help determine their centrality. In other words, unlike |
| <<degree-centrality, degree centrality>> the vertex with the greatest number of incident edges does not necessarily |
| give it the highest rank. Consider the following example using the Grateful Dead graph: |
| |
| [gremlin-groovy] |
| ---- |
| g.io('data/grateful-dead.xml').read().iterate() |
| g.V().repeat(groupCount('m').by('name').out()).times(5).cap('m'). <1> |
| order(local).by(values, desc).limit(local, 10).next() <2> |
| g.V().repeat(groupCount('m').by('name').out().timeLimit(100)).times(5).cap('m'). <3> |
| order(local).by(values, desc).limit(local, 10).next() |
| ---- |
| |
| <1> The traversal iterates through each vertex in the graph and for each one repeatedly group counts each vertex that |
| passes through using the vertex as the key. The `Map` of this group count is stored in a variable named "m". The |
| `out()` traversal is repeated thirty times or until the paths are exhausted. Five iterations should provide enough |
| time to converge on a solution. Calling `cap('m')` at the end simply extracts the `Map` side-effect stored in "m". |
| <2> The entries in the `Map` are then iterated and sorted with the top ten most central vertices presented as output. |
| <3> The previous examples can be expanded on a little bit by including a |
| link:https://tinkerpop.apache.org/docs/current/reference/#timelimit-step[time limit]. The `timeLimit()` prevents the |
| traversal from taking longer than one hundred milliseconds to execute (the previous example takes considerably longer |
| than that). While the answer provided with the `timeLimit()` is not the absolute ranking, it does provide a relative |
| ranking that closely matches the absolute one. The use of `timeLimit()` in certain algorithms (e.g. recommendations) |
| can shorten the time required to get a reasonable and usable result. |
| |
| [[pagerank-centrality]] |
| === PageRank Centrality |
| |
| While not technically a recipe, it's worth noting here in the "Centrality Section" that |
| link:https://en.wikipedia.org/wiki/PageRank[PageRank] centrality can be calculated with Gremlin with the |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#pagerank-step[pageRank()]-step which is designed to work with |
| `GraphComputer` (OLAP) based traversals. |
| |
| [gremlin-groovy,modern] |
| ---- |
| g = graph.traversal().withComputer() |
| g.V().pageRank().with(PageRank.propertyName,'pageRank').values('pageRank') |
| ---- |