////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
// @author Daniel Kuppitz (answer on gremlin user list)
// @author Robert Dale (answer on gremlin user list)
// @author Marc de Lignie
[[connected-components]]
== Connected Components
Gremlin can be used to find link:https://en.wikipedia.org/wiki/Connected_component_(graph_theory)[connected components]
in a graph. In a directed graph, such as the graphs in TinkerPop, components can be weakly or strongly connected. This recipe is
restricted to finding link:https://en.wikipedia.org/wiki/Directed_graph#Directed_graph_connectivity[weakly
connected components], in which the direction of edges is not taken into account.
Depending on the size of the graph, three solution regimes can be distinguished:
1. Small graphs that fit in the memory of a single machine
2. Medium-sized graphs backed by storage for which an OLTP linear scan is still feasible. This regime is left to third party
TinkerPop implementations, since TinkerPop itself has no storage-backed reference implementations. The idea is that
component membership is stored in the graph, rather than in memory.
3. Large graphs requiring an approach with `HadoopGraph` and `SparkGraphComputer` to yield results in a reasonable time.
These regimes are discussed separately using the following graph with three weakly connected components:
image:connected-components.png[width=600]
[gremlin-groovy]
----
g.addV().property(id, "A").as("a").
  addV().property(id, "B").as("b").
  addV().property(id, "C").as("c").
  addV().property(id, "D").as("d").
  addV().property(id, "E").as("e").
  addV().property(id, "F").
  addE("link").from("a").to("b").
  addE("link").from("b").to("c").
  addE("link").from("d").to("e").iterate()
----
=== Small graph traversals
Connected components in a small graph can be determined with either an OLTP traversal or the OLAP
`connectedComponent()`-step. The `connectedComponent()`-step is available as of TinkerPop 3.4.0 and is
described in more detail in the
link:https://tinkerpop.apache.org/docs/x.y.z/reference/#connectedcomponent-step[Reference Documentation].
The traversal looks like:
[gremlin-groovy,existing]
----
g.withComputer().V().connectedComponent().
  group().by(ConnectedComponent.component).
  select(values).unfold()
----
NOTE: The `component` option passed to `by()` is statically imported from `ConnectedComponent` and refers to the
default property key within which the result of the algorithm is stored.
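Since the vertex program stores its result as a vertex property, the component assigned to each vertex can also be
inspected directly. A small variation on the traversal above, using the same `ConnectedComponent.component` key, could
look like this:

[gremlin-groovy,existing]
----
g.withComputer().V().connectedComponent().
  project('vertex', 'component').
    by(id).
    by(ConnectedComponent.component)
----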
A straightforward way to detect the various subgraphs with an OLTP traversal is to do this:
[gremlin-groovy,existing]
----
g.V().emit(cyclicPath().or().not(both())). <1>
  repeat(__.where(without('a')).store('a').both()).until(cyclicPath()). <2>
  group().by(path().unfold().limit(1)). <3>
    by(path().unfold().dedup().fold()). <4>
  select(values).unfold() <5>
----
<1> The initial `emit()` step allows for the output of isolated vertices, in addition to the discovery of
components as described in (2).
<2> The entire component to which the first returned vertex belongs is visited. To allow for components of any
structure, a repeat loop is applied that only stops for a particular branch of the component when it detects a cyclic
path. Collection `'a'` is used to keep track of visited vertices, both for subtraversals within a component
and for new traversals resulting from the `g.V()` linear scan.
<3> While `'a'` nicely keeps track of vertices already visited, the actual components need to be extracted from the
path information. The `path().unfold().limit(1)` sub-traversal provides the starting vertex
of each surviving traverser, which can be used to group the components.
<4> This clause collects the unique vertices from all paths with the same starting vertex, thus from the same
weakly connected component.
<5> The values of the resulting group map contain the lists of vertices that make up the requested components.
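The same grouping pattern can also be used to report component sizes instead of component members. A sketch that
replaces the final `fold()` of the traversal above with a `count()`:

[gremlin-groovy,existing]
----
g.V().emit(cyclicPath().or().not(both())).
  repeat(__.where(without('a')).store('a').both()).until(cyclicPath()).
  group().by(path().unfold().limit(1)).
    by(path().unfold().dedup().count()).
  select(values).unfold()
----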
=== Small graph scalability
The scalability of the OLTP traversal and the `connectedComponent()`-step for in-memory graphs is shown in the figures
below.
[[cc-scale-size]]
.Run times for finding connected components in a randomly generated graph with 10 components of equal size and with an edge/vertex ratio of 6
image::cc-scale-size.png[width=600, side=bottom]
In general, the `connectedComponent()`-step is almost a factor of two faster than the OLTP traversal. Only for very
small graphs is the overhead of running the `ConnectedComponentVertexProgram` larger than that of the OLTP traversal.
The vertex program works by having interconnected vertices exchange ids and store the lowest id received, until no
vertex receives a lower id. This algorithm is commonly applied in
link:https://en.wikipedia.org/wiki/Bulk_synchronous_parallel[bulk synchronous parallel] systems, e.g. in
link:https://spark.apache.org/graphx[Apache Spark GraphX]. Overhead for the vertex program arises because it has to run
as many cycles as the longest shortest path between any two vertices in a component of the graph. In
every cycle, each vertex has to be checked for being
"halted". Overhead of the OLTP traversal consists of each traverser having to carry complete path information. For
pure depth-first-search or breadth-first-search implementations, connected-component algorithms should scale
as [.big]##O##(V+E). For the traversals in the figure above this is almost the case.
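The "lowest id wins" idea behind the vertex program can be illustrated with a small, self-contained Groovy sketch. The
adjacency map below is hypothetical and the code only illustrates the algorithm; it is not the actual
`ConnectedComponentVertexProgram` implementation, which exchanges the ids synchronously per BSP cycle:

[source,groovy]
----
// every vertex starts with its own id as component label
adjacency = ['A': ['B'], 'B': ['A', 'C'], 'C': ['B'], 'D': ['E'], 'E': ['D'], 'F': []]
labels = adjacency.keySet().collectEntries { [(it): it] }
changed = true
while (changed) {                          // roughly one BSP cycle per iteration
    changed = false
    adjacency.each { v, neighbours ->
        neighbours.each { n ->
            if (labels[n] < labels[v]) {   // adopt the lowest id seen from a neighbour
                labels[v] = labels[n]
                changed = true
            }
        }
    }
}
labels                                     // vertices with equal labels share a component
----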
[[cc-scale-ratio]]
.Run times for finding connected components in a randomly generated graph with 10 components, each consisting of 6400 vertices
image::cc-scale-ratio.png[width=600]
The random graphs used for the scalability tests can be modulated with the edge/vertex ratio. For small ratios the
generated components are more elongated and chain-like, and therefore harder to process by the
`connectedComponent()`-step. For high ratios the components are more mesh-like and the
`ConnectedComponentVertexProgram` needs only a few cycles to process the graph. These characteristics show clearly in
the figure above. Indeed, for a given number of vertices, the run time of the `connectedComponent()`-step does not
depend on the number of edges, but rather on the maximum shortest path length in the graph.
=== Large graphs
Large graphs in TinkerPop require distributed processing by `SparkGraphComputer` to get results in a reasonable time (OLAP
approach). This means that the graph must be available as a `HadoopGraph` (third party TinkerPop implementations often
allow a graph to be made available as a `HadoopGraph` by providing a Hadoop `InputFormat`). Running the
`connectedComponent()`-step on
a `HadoopGraph` works the same as for a small graph, provided that `SparkGraphComputer` is specified as the graph computer,
either with the `gremlin.hadoop.defaultGraphComputer` property or as part of the `withComputer()`-step.
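A run over a `HadoopGraph` could look like the following sketch, in which the properties file is an assumption that
has to point at your own Hadoop and Spark configuration:

[source,groovy]
----
// assumption: a HadoopGraph/Spark configuration in the referenced properties file
graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().connectedComponent().
  group().by(ConnectedComponent.component).
  select(values).unfold()
----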
Scalability of the `connectedComponent()`-step with `SparkGraphComputer` is high, but note that:
* The graph should fit in the memory of the Spark cluster to allow the vertex program to run its cycles without spilling
intermediate results to disk and losing most of the gains from the distributed processing.
* As discussed for small graphs, the BSP algorithm does not play well with graphs that have a large maximum shortest
path length between pairs of vertices. Overcoming this limitation is still a
link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[subject of academic research].