| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[duplicate-edge]] |
| == Duplicate Edge Detection |
| |
| Whether part of a graph maintenance process or for some other analysis need, it is sometimes necessary to detect |
| if there is more than one edge between two vertices. The following examples will assume that an edge with the same |
| label and direction will be considered "duplicate". |
| |
| The "modern" graph does not have any duplicate edges that fit that definition, so the following example adds one |
| that is duplicative of the "created" edge between vertex "1" and "3". |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(1).as("a").V(3).addE("created").from("a").iterate() |
| g.V(1).outE("created") |
| ---- |
| |
| One way to find the duplicate edges would be to do something like this: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().outE(). |
| project("a","b"). <1> |
| by().by(inV().path().by().by(label)). |
| group(). <2> |
| by("b"). |
| by(select("a").fold()). |
| unfold(). <3> |
| select(values). <4> |
| where(count(local).is(gt(1))) |
| ---- |
| |
| <1> The "a" and "b" from the `project` contain the edge and the path respectively. The path consists of a the outgoing |
| vertex, an edge, and the incoming vertex. The use of `by().by(label))` converts the edge to its label (recall that `by` |
| are applied in round-robin fashion), so the path will look something like: `[v[1],created,v[3]]`. |
| <2> Group by the path from "b" and construct a list of edges from "a". Any value in this `Map` that has a list of edges |
| greater than one means that there is more than one edge for that edge label between those two vertices (i.e. the `Map` |
| key). |
| <3> Unroll the key-value pairs in the `Map` of paths-edges. |
| <4> Only the values from the `Map` are needed and as mentioned earlier, those lists with more than one edge would |
| contain duplicate. |
| |
| This method find the duplicates, but does require more memory than other approaches. A slightly more complex approach |
| that uses less memory might look like this: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().as("ov"). |
| outE().as("e"). |
| inV().as("iv"). |
| inE(). <1> |
| where(neq("e")). <2> |
| where(eq("e")).by(label). |
| where(outV().as("ov")). |
| group(). |
| by(select("ov","e","iv").by().by(label)). <3> |
| unfold(). <4> |
| select(values). |
| where(count(local).is(gt(1))) |
| ---- |
| |
| <1> To this point in the traversal, the outgoing edges of a vertex are being iterated with the current edge labeled |
| as "e". For "e", Gremlin traverses to the incoming vertex and back on in edges of that vertex. |
| <2> Those incoming edges are filtered with the following `where` steps. The first ensures that it does not traverse |
| back over "e" (i.e. the current edge). The second determines if the edge label is equivalent (i.e. the test for the |
| working definition of "duplicate"). The third determines if the outgoing vertex matches the one that started the |
| path labeled as "ov". |
| <3> This line is quite similar to the output achieved in the previous example at step 2. A `Map` is produced that uses |
| the outgoing vertex, the edge label, and the incoming vertex as the key, with the list of edges for that path as the |
| value. |
| <4> The rest of the traversal is the same as the previous one. |
| |
| Note that the above traversal could also be written using `match` step: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().match( |
| __.as("ov").outE().as("e"), |
| __.as("e").inV().as("iv"), |
| __.as("iv").inE().as("ie"), |
| __.as("ie").outV().as("ov")). |
| where("ie",neq("e")). |
| where("ie",eq("e")).by(label). |
| select("ie"). |
| group(). |
| by(select("ov","e","iv").by().by(label)). |
| unfold().select(values). |
| where(count(local).is(gt(1))) |
| ---- |
| |
| A third way to approach this problem would be to force a link:https://en.wikipedia.org/wiki/Depth-first_search[depth-first search]. |
| The previous examples invoke traversal strategies that force a link:https://en.wikipedia.org/wiki/Breadth-first_search[breadth first search] |
| as a performance optimization. |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov"). <1> |
| outE().as("e1"). |
| inV().as("iv"). |
| inE(). |
| where(neq("e1")). |
| where(outV().as("ov")).as("e2"). <2> |
| where("e1", eq("e2")).by(label) <3> |
| ---- |
| |
| <1> Remove strategies that will optimize for breadth first searches and thus allow Gremlin to go depth first. |
| <2> To this point, the traversal is very much like the previous one. Review step 2 in the previous example to see the |
| parallels here. |
| <3> The final `where` simply looks for edges that match on label, which would then meet the working definition of |
| "duplicate". |
| |
| The basic pattern at play here is to compare the path of the outgoing vertex, its outgoing edge label and the incoming |
| vertex. This model can obviously be contracted or expanded as needed to fit different definitions of "duplicate". For |
| example, a "duplicate" definition could extended to the label and properties of the edge. For purposes of |
| demonstration, an additional edge is added to the "modern" graph: |
| |
| [gremlin-groovy,modern] |
| ---- |
| g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate() |
| g.V(1).as("a").V(3).addE("created").property("weight",0.5d).from("a").iterate() |
| g.V(1).outE("created").elementMap() |
| ---- |
| |
| To identify the duplicate with this revised definition, the previous traversal can be modified to: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov"). |
| outE().as("e1"). |
| inV().as("iv"). |
| inE(). |
| where(neq("e1")). |
| where(outV().as("ov")).as("e2"). |
| where("e1", eq("e2")).by(label). |
| where("e1", eq("e2")).by("weight").elementMap() |
| ---- |