| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| |
| // @author Daniel Kuppitz (anwer on gremlin user list) |
| // @author Robert Dale (answer on gremlin user list) |
| // @author Marc de Lignie |
| |
| [[connected-components]] |
| == Connected Components |
| |
| Gremlin can be used to find link:https://en.wikipedia.org/wiki/Connected_component_(graph_theory)[connected components] |
| in a graph. In a directed graph like in TinkerPop, components can be weakly or strongly connected. This recipe is |
| restricted to finding link:https://en.wikipedia.org/wiki/Directed_graph#Directed_graph_connectivity[weakly |
| connected components], in which the direction of edges is not taken into account. |
| |
| Depending on the size of the graph, three solution regimes can be discriminated: |
| |
| 1. Small graphs that fit in the memory of a single machine |
| |
| 2. Medium-sized graphs backed by storage for which an OLTP linear scan is still feasible. This regime is left to third party |
| TinkerPop implementations, since TinkerPop itself has no storage-backed reference implementations. The idea is that |
| component membership is stored in the graph, rather than in memory. |
| |
| 3. Large graphs requiring an approach with `HadoopGraph` and `SparkGraphComputer` to yield results in a reasonable time. |
| |
| These regimes are discussed separately using the following graph with three weakly connected components: |
| |
| image:connected-components.png[width=600] |
| |
| [gremlin-groovy] |
| ---- |
| g.addV().property(id, "A").as("a"). |
| addV().property(id, "B").as("b"). |
| addV().property(id, "C").as("c"). |
| addV().property(id, "D").as("d"). |
| addV().property(id, "E").as("e"). |
| addV().property(id, "F"). |
| addE("link").from("a").to("b"). |
| addE("link").from("b").to("c"). |
| addE("link").from("d").to("e").iterate() |
| ---- |
| |
| === Small graph traversals |
| |
| Connected components in a small graph can be determined with either an OLTP traversal or the OLAP |
| `connectedComponent()`-step. The `connectedComponent()`-step is available as of TinkerPop 3.4.0 and is |
| described in more detail in the |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#connectedcomponent-step[Reference Documentation]. |
| The traversal looks like: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.withComputer().V().connectedComponent(). |
| group().by(ConnectedComponent.component). |
| select(values).unfold() |
| ---- |
| |
| NOTE: The `component` option passed to `by()` is statically imported from `ConnectedComponent` and refers to the |
| default property key within which the result of the algorithm is stored. |
| |
| A straightforward way to detect the various subgraphs with an OLTP traversal is to do this: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().emit(cyclicPath().or().not(both())). <1> |
| repeat(__.where(without('a')).store('a').both()).until(cyclicPath()). <2> |
| group().by(path().unfold().limit(1)). <3> |
| by(path().unfold().dedup().fold()). <4> |
| select(values).unfold() <5> |
| ---- |
| |
| <1> The initial emit() step allows for output of isolated vertices, in addition to the discovery of |
| components as described in (2). |
| |
| <2> The entire component to which the first returned vertex belongs, is visited. To allow for components of any |
| structure, a repeat loop is applied that only stops for a particular branch of the component when it detects a cyclic |
| path. Collection `'a'` is used to keep track of visited vertices, for both subtraversals within a component |
| and new traversals resulting from the `g.V()` linear scan. |
| |
| <3> While `'a'` nicely keeps track of vertices already visited, the actual components need to be extracted from the |
| path information. The `path().unfold().limit(1)` closure provides the starting vertex |
| of surviving traversers, which can be used to group the components. |
| |
| <4> This clause collects the unique vertices from all paths with the same starting vertex, thus from the same |
| weak component. |
| |
| <5> The values of the groupby map contain the lists of vertices making up the requested components. |
| |
| === Small graph scalability |
| |
| The scalability of the OLTP traversal and the `connectedComponent()`-step for in-memory graphs is shown in the figures |
| below. |
| |
| [[cc-scale-size]] |
| .Run times for finding connected components in a randomly generated graph with 10 components of equal size and with an edge/vertex ratio of 6 |
| image::cc-scale-size.png[width=600, side=bottom] |
| |
| In general, the `connectedComponent()`-step is almost a factor two faster than the OLTP traversal. Only, for very |
| small graphs the overhead of running the ConnectedComponentVertexProgram is larger than that of the OLTP traversal. |
| The vertex program works by having interconnected vertices exchange id's and store the lowest id until no vertex |
| receives a lower id. This algorithm is commonly applied in |
| link:https://en.wikipedia.org/wiki/Bulk_synchronous_parallel[bulk synchronous parallel] systems, e.g. in |
| link:https://spark.apache.org/graphx[Apache Spark GraphX]. Overhead for the vertex program arises because it has to run |
| as many cycles as the largest length of the shortest paths between any two vertices in a component of the graph. In |
| every cycle each vertex has to be checked for being |
| "halted". Overhead of the OLTP traversal consists of each traverser having to carry complete path information. For |
| pure depth-first-search or breadth-first-search implementations, connected-component algotithms should scale |
| as [.big]##O##(V+E). For the traversals in the figure above this is almost the case. |
| |
| [[cc-scale-ratio]] |
| .Run times for finding connected components in a randomly generated graph with 10 components, each consisting of 6400 vertices |
| image::cc-scale-ratio.png[width=600] |
| |
| The random graphs used for the scalability tests can be modulated with the edge/vertex ratio. For small ratios the |
| components generated are more lint-like and harder to process by the `connectedComponent()`-step. For high ratios |
| the components are more mesh-like and the ConnectedComponentVertexProgram needs few cycles to process the graph. These |
| characteristics show clearly from the graph. Indeed, for a given number of vertices, the run time of the |
| `connectedComponent()`-step does not depend on the number of edges, but rather on the maximum shortest path length in |
| the graph. |
| |
| === Large graphs |
| |
| Large graphs in TinkerPop require distributed processing by `SparkGraphComputer` to get results in a reasonable time (OLAP |
| approach). This means that the graph must be available as `HadoopGraph` (third party TinkerPop implementations often |
| allow to make a graph available as an `HadoopGraph` by providing an Hadoop `InputFormat`). Running the |
| `connectedComponent()`-step on |
| an `HadoopGraph` works the same as for a small graph, provided that `SparkGraphComputer` is specified as the graph computer, |
| either with the `gremlin.hadoop.defaultGraphComputer` property or as part of the `withComputer()`-step. |
| |
| Scalability of the the `connectedComponent()`-step with `SparkGraphComputer` is high, but note that: |
| |
| * The graph should fit in the memory of the Spark cluster to allow the VertexProgram to run its cycles without spilling |
| intermediate results to disk and loosing most of the gains from the distributed processing. |
| * As discussed for small graphs, the BSP algorithm does not play well with graphs having a large shortest path between |
| any pair of vertices. Overcoming this limitation is still a |
| link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[subject of academic research]. |