blob: 46d61eb7fe30cc71c08a830981c1353ae6f78602 [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[connected-components]]
== Connected Components
Gremlin can be used to find link:https://en.wikipedia.org/wiki/Connected_component_(graph_theory)[connected components]
in a graph. Consider the following graph which has three connected components:
image:connected-components.png[width=600]
[gremlin-groovy]
----
g.addV().property(id, "A").as("a").
addV().property(id, "B").as("b").
addV().property(id, "C").as("c").
addV().property(id, "D").as("d").
addV().property(id, "E").as("e").
addV().property(id, "F").
addE("link").from("a").to("b").
addE("link").from("b").to("c").
addE("link").from("d").to("e").iterate()
----
One way to detect the various subgraphs would be to do something like this:
[gremlin-groovy,existing]
----
g.V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()). <1>
path().aggregate("p"). <2>
unfold().dedup(). <3>
map(__.as("v").select("p").unfold(). <4>
filter(unfold().where(eq("v"))).
unfold().dedup().order().by(id).fold()).
dedup() <5>
----
<1> Iterate all vertices and repeatedly traverse over both incoming and outgoing edges (TinkerPop doesn't support
unidirectional graphs directly so it must be simulated by ignoring the direction with `both`). Note the use of `emit`
prior to `repeat` as this allows for return of a single length path.
<2> Aggregate the `path()` of the emitted vertices to "p". It is within these paths that the list of connected
components will be identified. Obviously the paths list are duplicative in the sense that they contains different
paths traveled over the same vertices.
<3> Unroll the elements in the path list with `unfold` and `dedup`.
<4> Use the first vertex in each path to filter against the paths stored in "p". When a path is found that has the
vertex in it, dedup the vertices in the path, order it by the identifier. Each path output from this `map` step
represents a connected component.
<5> The connected component list is duplicative given the nature of the paths in "p", but now that the vertices within
the paths are ordered, a final `dedup` will make the list of connective components unique.
NOTE: This is a nice example of where running smaller pieces of a large Gremlin statement make it easier to see what
is happening at each step. Consider running this example one line at a time (or perhaps even in a step at a time) to
see the output at each point.
While the above approach returns results nicely, the traversal doesn't appear to work with OLAP. A less efficient
approach, but one more suited for OLAP execution looks quite similar but does not use `dedup` as heavily (thus
`GraphComputer` is forced to analyze far more paths):
[gremlin-groovy,existing]
----
g.withComputer().V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
aggregate("p").by(path()).cap("p").unfold().limit(local, 1).
map(__.as("v").select("p").unfold().
filter(unfold().where(eq("v"))).
unfold().dedup().order().by(id).fold()
).toSet()
----