////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
// @author Daniel Kuppitz (answer on gremlin user list)
// @author Robert Dale (answer on gremlin user list)
// @author Marc de Lignie
[[connected-components]]
== Connected Components
Gremlin can be used to find link:https://en.wikipedia.org/wiki/Connected_component_(graph_theory)[connected components]
in a graph. In a directed graph, such as the graphs in TinkerPop, components can be weakly or strongly connected. This recipe is
restricted to finding link:https://en.wikipedia.org/wiki/Directed_graph#Directed_graph_connectivity[weakly
connected components], in which the direction of edges is not taken into account.
Depending on the size of the graph, three solution regimes can be distinguished:
1. Small graphs that fit in the memory of a single machine
2. Medium-sized graphs backed by storage for which an OLTP linear scan is still feasible. This regime is left to third party
TinkerPop implementations, since TinkerPop itself has no storage-backed reference implementations. The idea is that
component membership is stored in the graph, rather than in memory.
3. Large graphs requiring an approach with `HadoopGraph` and `SparkGraphComputer` to yield results in a reasonable time.
These regimes are discussed separately using the following graph with three weakly connected components:
image:connected-components.png[width=600]
[gremlin-groovy]
----
g.addV().property(id, "A").as("a").
  addV().property(id, "B").as("b").
  addV().property(id, "C").as("c").
  addV().property(id, "D").as("d").
  addV().property(id, "E").as("e").
  addV().property(id, "F").
  addE("link").from("a").to("b").
  addE("link").from("b").to("c").
  addE("link").from("d").to("e").iterate()
----
=== Small graph traversals
Connected components in a small graph can be determined with either an OLTP traversal or the OLAP
`connectedComponent()`-step. The `connectedComponent()`-step is available as of TinkerPop 3.4.0 and is
described in more detail in the
link:https://tinkerpop.apache.org/docs/x.y.z/reference/#connectedcomponent-step[Reference Documentation].
The traversal looks like:
[gremlin-groovy,existing]
----
g.withComputer().V().connectedComponent().
  group().by(ConnectedComponent.component).
  select(values).unfold()
----
NOTE: The `component` option passed to `by()` is statically imported from `ConnectedComponent` and refers to the
default property key within which the result of the algorithm is stored.
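Since the vertex program stores its result as a vertex property, the component assigned to each vertex can also be
inspected directly. A small variation on the traversal above, using the same `ConnectedComponent.component` key, could
look like this:

[gremlin-groovy,existing]
----
g.withComputer().V().connectedComponent().
  project('vertex', 'component').
    by(id).
    by(ConnectedComponent.component)
----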
A straightforward way to detect the various subgraphs with an OLTP traversal is to do this:
[gremlin-groovy,existing]
----
g.V().emit(cyclicPath().or().not(both())). <1>
  repeat(__.where(without('a')).store('a').both()).until(cyclicPath()). <2>
  group().by(path().unfold().limit(1)). <3>
    by(path().unfold().dedup().fold()). <4>
  select(values).unfold() <5>
----
<1> The initial `emit()` step allows for the output of isolated vertices, in addition to the discovery of
components as described in (2).
<2> The entire component to which the first returned vertex belongs is visited. To allow for components of any
structure, a repeat loop is applied that only stops for a particular branch of the component when it detects a cyclic
path. Collection `'a'` is used to keep track of visited vertices, both for subtraversals within a component
and for new traversals resulting from the `g.V()` linear scan.
<3> While `'a'` nicely keeps track of vertices already visited, the actual components need to be extracted from the
path information. The `path().unfold().limit(1)` sub-traversal provides the starting vertex
of each surviving traverser, which can be used to group the components.
<4> This clause collects the unique vertices from all paths with the same starting vertex, thus from the same
weakly connected component.
<5> The values of the resulting group map contain the lists of vertices that make up the requested components.
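The same grouping pattern can also be used to report component sizes instead of component members. A sketch that
replaces the final `fold()` of the traversal above with a `count()`:

[gremlin-groovy,existing]
----
g.V().emit(cyclicPath().or().not(both())).
  repeat(__.where(without('a')).store('a').both()).until(cyclicPath()).
  group().by(path().unfold().limit(1)).
    by(path().unfold().dedup().count()).
  select(values).unfold()
----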
=== Small graph scalability
The scalability of the OLTP traversal and the `connectedComponent()`-step for in-memory graphs is shown in the figures
below.
[[cc-scale-size]]
.Run times for finding connected components in a randomly generated graph with 10 components of equal size and with an edge/vertex ratio of 6
image::cc-scale-size.png[width=600, side=bottom]
In general, the `connectedComponent()`-step is almost a factor of two faster than the OLTP traversal. Only for very
small graphs is the overhead of running the `ConnectedComponentVertexProgram` larger than that of the OLTP traversal.
The vertex program works by having interconnected vertices exchange ids and store the lowest id received, until no
vertex receives a lower id. This algorithm is commonly applied in
link:https://en.wikipedia.org/wiki/Bulk_synchronous_parallel[bulk synchronous parallel] systems, e.g. in
link:https://spark.apache.org/graphx[Apache Spark GraphX]. Overhead for the vertex program arises because it has to run
as many cycles as the longest shortest path between any two vertices in a component of the graph. In
every cycle, each vertex has to be checked for being
"halted". Overhead of the OLTP traversal consists of each traverser having to carry complete path information. For
pure depth-first-search or breadth-first-search implementations, connected-component algorithms should scale
as [.big]##O##(V+E). For the traversals in the figure above this is almost the case.
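The "lowest id wins" idea behind the vertex program can be illustrated with a small, self-contained Groovy sketch. The
adjacency map below is hypothetical and the code only illustrates the algorithm; it is not the actual
`ConnectedComponentVertexProgram` implementation, which exchanges the ids synchronously per BSP cycle:

[source,groovy]
----
// every vertex starts with its own id as component label
adjacency = ['A': ['B'], 'B': ['A', 'C'], 'C': ['B'], 'D': ['E'], 'E': ['D'], 'F': []]
labels = adjacency.keySet().collectEntries { [(it): it] }
changed = true
while (changed) {                          // roughly one BSP cycle per iteration
    changed = false
    adjacency.each { v, neighbours ->
        neighbours.each { n ->
            if (labels[n] < labels[v]) {   // adopt the lowest id seen from a neighbour
                labels[v] = labels[n]
                changed = true
            }
        }
    }
}
labels                                     // vertices with equal labels share a component
----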
[[cc-scale-ratio]]
.Run times for finding connected components in a randomly generated graph with 10 components, each consisting of 6400 vertices
image::cc-scale-ratio.png[width=600]
The random graphs used for the scalability tests can be modulated with the edge/vertex ratio. For small ratios the
generated components are more elongated and chain-like, and therefore harder to process by the
`connectedComponent()`-step. For high ratios the components are more mesh-like and the
`ConnectedComponentVertexProgram` needs only a few cycles to process the graph. These characteristics show clearly in
the figure above. Indeed, for a given number of vertices, the run time of the `connectedComponent()`-step does not
depend on the number of edges, but rather on the maximum shortest path length in the graph.
=== Large graphs
Large graphs in TinkerPop require distributed processing by `SparkGraphComputer` to get results in a reasonable time (OLAP
approach). This means that the graph must be available as a `HadoopGraph` (third party TinkerPop implementations often
allow a graph to be made available as a `HadoopGraph` by providing a Hadoop `InputFormat`). Running the
`connectedComponent()`-step on
a `HadoopGraph` works the same as for a small graph, provided that `SparkGraphComputer` is specified as the graph computer,
either with the `gremlin.hadoop.defaultGraphComputer` property or as part of the `withComputer()`-step.
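A run over a `HadoopGraph` could look like the following sketch, in which the properties file is an assumption that
has to point at your own Hadoop and Spark configuration:

[source,groovy]
----
// assumption: a HadoopGraph/Spark configuration in the referenced properties file
graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().connectedComponent().
  group().by(ConnectedComponent.component).
  select(values).unfold()
----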
Scalability of the `connectedComponent()`-step with `SparkGraphComputer` is high, but note that:
* The graph should fit in the memory of the Spark cluster to allow the vertex program to run its cycles without spilling
intermediate results to disk and losing most of the gains from the distributed processing.
* As discussed for small graphs, the BSP algorithm does not play well with graphs that have a large maximum shortest
path length between pairs of vertices. Overcoming this limitation is still a
link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[subject of academic research].