blob: 04c542d38ae435af2d971ff528a7d815f4656a99 [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[shortest-path]]
== Shortest Path
image:shortest-path.png[width=300]
When working with a graph, it is often necessary to identify the
link:https://en.wikipedia.org/wiki/Shortest_path_problem[shortest path] between two identified vertices. The following
is a simple example that identifies the shortest path between vertex "1" and vertex "5" while traversing over out edges:
[gremlin-groovy]
----
g.addV().property(id, 1).as('1').
addV().property(id, 2).as('2').
addV().property(id, 3).as('3').
addV().property(id, 4).as('4').
addV().property(id, 5).as('5').
addE('knows').from('1').to('2').
addE('knows').from('2').to('4').
addE('knows').from('4').to('5').
addE('knows').from('2').to('3').
addE('knows').from('3').to('4').iterate()
g.V(1).repeat(out().simplePath()).until(hasId(5)).path().limit(1) <1>
g.V(1).repeat(out().simplePath()).until(hasId(5)).path().count(local) <2>
g.V(1).repeat(out().simplePath()).until(hasId(5)).path().
group().by(count(local)).next() <3>
----
<1> The traversal starts at vertex with the identifier of "1" and repeatedly traverses on out edges "until" it finds a
vertex with an identifier of "5". The inclusion of `simplePath` within the `repeat` is present to filter out repeated
paths. The traversal terminates with `limit` in this case as the first path returned will be the shortest one. Of
course, it is possible for there to be more than one path in the graph of the same length (i.e. two or more paths of
length three), but this example is not considering that.
<2> It might be interesting to know the path lengths for all paths between vertex "1" and "5".
<3> Alternatively, one might wish to do a path length distribution over all the paths.
The following code block demonstrates how the shortest path from `v[1]` to `v[5]` can be queried in OLAP, using the `shortestPath()` step.
[gremlin-groovy,existing]
----
g = g.withComputer()
g.V(1).shortestPath().
with(ShortestPath.edges, Direction.OUT).
with(ShortestPath.target, hasId(5))
----
The previous example defines the length of the path by the number of vertices in the path, but the "path" might also
be measured by data within the graph itself. The following example use the same graph structure as the previous example,
but includes a "weight" on the edges, that will be used to help determine the "cost" of a particular path:
[gremlin-groovy]
----
g.addV().property(id, 1).as('1').
addV().property(id, 2).as('2').
addV().property(id, 3).as('3').
addV().property(id, 4).as('4').
addV().property(id, 5).as('5').
addE('knows').from('1').to('2').property('weight', 1.25).
addE('knows').from('2').to('4').property('weight', 1.5).
addE('knows').from('4').to('5').property('weight', 0.25).
addE('knows').from('2').to('3').property('weight', 0.25).
addE('knows').from('3').to('4').property('weight', 0.25).iterate()
g.V(1).repeat(out().simplePath()).until(hasId(5)).path().
group().by(count(local)).next() <1>
g.V(1).repeat(outE().inV().simplePath()).until(hasId(5)).
path().by(coalesce(values('weight'),
constant(0.0))).
map(unfold().sum()) <2>
g.V(1).repeat(outE().inV().simplePath()).until(hasId(5)).
path().by(constant(0.0)).by('weight').map(unfold().sum()) <3>
g.V(1).repeat(outE().inV().simplePath()).until(hasId(5)).
path().as('p').
map(unfold().coalesce(values('weight'),
constant(0.0)).sum()).as('cost').
select('cost','p') <4>
----
<1> Note that the shortest path as determined by the structure of the graph is the same.
<2> Calculate the "cost" of the path as determined by the weight on the edges. As the "weight" data is on the edges
between the vertices, it is necessary to change the contents of the `repeat` step to use `outE().inV()` so that the
edge is included in the path. The path is then post-processed with a `by` modulator that extracts the "weight" value.
The traversal uses `coalesce` as there is a mixture of vertices and edges in the path and the traversal is only
interested in edge elements that can return a "weight" property. The final part of the traversal executes a map
function over each path, unfolding it and summing the weights.
<3> The same traversal as the one above it, but avoids the use of `coalesce` with the use of two `by` modulators. The
`by` modulator is applied in a round-robin fashion, so the first `by` will always apply to a vertex (as it is the first
item in every path) and the second `by` will always apply to an edge (as it always follows the vertex in the path).
<4> The output of the previous examples of the "cost" wasn't terribly useful as it didn't include which path had the
calculated cost. With some slight modifications given the use of `select` it becomes possible to include the path in
the output. Note that the path with the lowest "cost" actually has a longer path length as determined by the graph
structure.
The next code block demonstrates how the `shortestPath()` step can be used in OLAP to determine the shortest weighted path.
[gremlin-groovy,existing]
----
g = g.withComputer()
g.V(1).shortestPath().
with(ShortestPath.edges, Direction.OUT).
with(ShortestPath.distance, 'weight').
with(ShortestPath.target, hasId(5))
----
The following query illustrates how `select(<traversal>)` can be used to find all shortest weighted undirected paths
in the modern toy graph.
[gremlin-groovy,modern]
----
g.withSack(0.0).V().as("from"). <1>
repeat(bothE().
sack(sum).
by("weight").
otherV(). <2>
where(neq("from")).as("to"). <3>
group("m"). <4>
by(select("from","to")).
by(sack().min()).
filter(project("s","x"). <5>
by(sack()).
by(select("m").select(select("from","to"))).
where("s", eq("x"))).
group("p"). <6>
by(select("from", "to")).
by(map(union(path().by("name").by("weight"),
sack()).fold())).
barrier()).
cap("p").unfold().
order(). <7>
by(select(keys).select("from").id()).
by(select(keys).select("to").id()).
barrier().
dedup(). <8>
by(select(keys).select(values).order(local).by(id))
----
<1> Start the traversal from all vertices with an initial sack value of 0.
<2> Traverse into all directions and sum up the edge weight values.
<3> Filter out the initial start vertex.
<4> For the current start and end vertex, update the minimum sack value (weighted length of the path).
<5> Compare the current weighted path length to the current minimum weighted path length between the 2 vertices. Eliminate traversers that found a path that is longer than the current shortest path.
<6> Update the path and weighted path length for the current start and end vertex pair.
<7> Order the output by the start vertex id and then the end vertex id (for better readability).
<8> Deduplicate vertex pairs (the shortest path from `v[1]` to `v[6]` is the same as the path from `v[6]` to `v[1]`).
Again, this can be translated into an OLAP query using the `shortestPath()` step.
[gremlin-groovy,existing]
----
result = g.withComputer().V().
shortestPath().
with(ShortestPath.distance, 'weight').
with(ShortestPath.includeEdges, true).
filter(count(local).is(gt(1))).
group().
by(project('from','to').
by(limit(local, 1)).
by(tail(local, 1))).
unfold().
order().
by(select(keys).select('from').id()).
by(select(keys).select('to').id()).toList()
----
The obvious difference in the result is the absence of property values in the OLAP result. Since OLAP traversers are not
allowed to leave the local star graph, it's not possible to have the exact same result in an OLAP query. However, the determined
shortest paths can be passed back into the OLTP `GraphTraversalSource`, which can then be used to query the values.
[gremlin-groovy,existing]
----
g.withSideEffect('v', []). <1>
inject(result.toArray()).as('kv').select(values).
unfold().
map(unfold().as('v_or_e').
coalesce(V().where(eq('v_or_e')).store('v'),
select('v').tail(local, 1).bothE().where(eq('v_or_e'))).
values('name','weight').
fold()).
group().
by(select('kv').select(keys)).unfold().
order().
by(select(keys).select('from').id()).
by(select(keys).select('to').id()).toList()
----
<1> The side-effect `v` is used to keep track of the last processed vertex, hence it needs to be an order-preserving list. Without this explicit definition `v` would become a `BulkSet` which doesn't preserve the insert order.