docs/src/recipes/duplicate-edge.asciidoc - tinkerpop - Git at Google

 ////
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ////
 [[duplicate-edge]]
 == Duplicate Edge Detection

 Whether part of a graph maintenance process or for some other analysis need, it is sometimes necessary to detect
 if there is more than one edge between two vertices. The following examples will assume that an edge with the same
 label and direction will be considered "duplicate".

 The "modern" graph does not have any duplicate edges that fit that definition, so the following example adds one
 that is duplicative of the "created" edge between vertex "1" and "3".

 [gremlin-groovy,modern]
 ----
 g.V(1).as("a").V(3).addE("created").from("a").iterate()
 g.V(1).outE("created")
 ----

 One way to find the duplicate edges would be to do something like this:

 [gremlin-groovy,existing]
 ----
 g.V().outE().
   project("a","b").                         <1>
     by().by(inV().path().by().by(label)).
   group().                                  <2>
     by(select("b")).
     by(select("a").fold()).
   unfold().                                 <3>
   select(values).                           <4>
   where(count(local).is(gt(1)))
 ----

 <1> The "a" and "b" from the `project` contain the edge and the path respectively. The path consists of a the outgoing
 vertex, an edge, and the incoming vertex. The use of `by().by(label))` converts the edge to its label (recall that `by`
 are applied in round-robin fashion), so the path will look something like: `[v[1],created,v[3]]`.
 <2> Group by the path from "b" and construct a list of edges from "a". Any value in this `Map` that has a list of edges
 greater than one means that there is more than one edge for that edge label between those two vertices (i.e. the `Map`
 key).
 <3> Unroll the key-value pairs in the `Map` of paths-edges.
 <4> Only the values from the `Map` are needed and as mentioned earlier, those lists with more than one edge would
 contain  duplicate.

 This method find the duplicates, but does require more memory than other approaches. A slightly more complex approach
 that uses less memory might look like this:

 [gremlin-groovy,existing]
 ----
 g.V().as("ov").
   outE().as("e").
   inV().as("iv").
   inE().                                          <1>
   where(neq("e")).                                <2>
   where(eq("e")).by(label).
   where(outV().as("ov")).
   group().
     by(select("ov","e","iv").by().by(label)).     <3>
   unfold().                                       <4>
   select(values).
   where(count(local).is(gt(1)))
 ----

 <1> To this point in the traversal, the outgoing edges of a vertex are being iterated with the current edge labeled
 as "e". For "e", Gremlin traverses to the incoming vertex and back on in edges of that vertex.
 <2> Those incoming edges are filtered with the following `where` steps. The first ensures that it does not traverse
 back over "e" (i.e. the current edge). The second determines if the edge label is equivalent (i.e. the test for the
 working definition of "duplicate"). The third determines if the outgoing vertex matches the one that started the
 path labeled as "ov".
 <3> This line is quite similar to the output achieved in the previous example at step 2. A `Map` is produced that uses
 the outgoing vertex, the edge label, and the incoming vertex as the key, with the list of edges for that path as the
 value.
 <4> The rest of the traversal is the same as the previous one.

 Note that the above traversal could also be written using `match` step:

 [gremlin-groovy,existing]
 ----
 g.V().match(
     __.as("ov").outE().as("e"),
     __.as("e").inV().as("iv"),
     __.as("iv").inE().as("ie"),
     __.as("ie").outV().as("ov")).
       where("ie",neq("e")).
       where("ie",eq("e")).by(label).
     select("ie").
     group().
       by(select("ov","e","iv").by().by(label)).
     unfold().select(values).
       where(count(local).is(gt(1)))
 ----

 A third way to approach this problem would be to force a link:https://en.wikipedia.org/wiki/Depth-first_search[depth-first search].
 The previous examples invoke traversal strategies that force a link:https://en.wikipedia.org/wiki/Breadth-first_search[breadth first search]
 as a performance optimization.

 [gremlin-groovy,existing]
 ----
 g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov").   <1>
   outE().as("e1").
   inV().as("iv").
   inE().
   where(neq("e1")).
   where(outV().as("ov")).as("e2").                                               <2>
   where("e1", eq("e2")).by(label)                                                <3>
 ----

 <1> Remove strategies that will optimize for breadth first searches and thus allow Gremlin to go depth first.
 <2> To this point, the traversal is very much like the previous one. Review step 2 in the previous example to see the
 parallels here.
 <3> The final `where` simply looks for edges that match on label, which would then meet the working definition of
 "duplicate".

 The basic pattern at play here is to compare the path of the outgoing vertex, its outgoing edge label and the incoming
 vertex. This model can obviously be contracted or expanded as needed to fit different definitions of "duplicate". For
 example, a "duplicate" definition could extended to the label and properties of the edge. For purposes of
 demonstration, an additional edge is added to the "modern" graph:

 [gremlin-groovy,modern]
 ----
 g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate()
 g.V(1).as("a").V(3).addE("created").property("weight",0.5d).from("a").iterate()
 g.V(1).outE("created").valueMap(true)
 ----

 To identify the duplicate with this revised definition, the previous traversal can be modified to:

 [gremlin-groovy,existing]
 ----
 g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov").
   outE().as("e1").
   inV().as("iv").
   inE().
   where(neq("e1")).
   where(outV().as("ov")).as("e2").
   where("e1", eq("e2")).by(label).
   where("e1", eq("e2")).by("weight").valueMap(true)
 ----
	////
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	////
	[[duplicate-edge]]
	== Duplicate Edge Detection

	Whether part of a graph maintenance process or for some other analysis need, it is sometimes necessary to detect
	if there is more than one edge between two vertices. The following examples will assume that an edge with the same
	label and direction will be considered "duplicate".

	The "modern" graph does not have any duplicate edges that fit that definition, so the following example adds one
	that is duplicative of the "created" edge between vertex "1" and "3".

	[gremlin-groovy,modern]
	----
	g.V(1).as("a").V(3).addE("created").from("a").iterate()
	g.V(1).outE("created")
	----

	One way to find the duplicate edges would be to do something like this:

	[gremlin-groovy,existing]
	----
	g.V().outE().
	project("a","b"). <1>
	by().by(inV().path().by().by(label)).
	group(). <2>
	by(select("b")).
	by(select("a").fold()).
	unfold(). <3>
	select(values). <4>
	where(count(local).is(gt(1)))
	----

	<1> The "a" and "b" from the `project` contain the edge and the path respectively. The path consists of a the outgoing
	vertex, an edge, and the incoming vertex. The use of `by().by(label))` converts the edge to its label (recall that `by`
	are applied in round-robin fashion), so the path will look something like: `[v[1],created,v[3]]`.
	<2> Group by the path from "b" and construct a list of edges from "a". Any value in this `Map` that has a list of edges
	greater than one means that there is more than one edge for that edge label between those two vertices (i.e. the `Map`
	key).
	<3> Unroll the key-value pairs in the `Map` of paths-edges.
	<4> Only the values from the `Map` are needed and as mentioned earlier, those lists with more than one edge would
	contain duplicate.

	This method find the duplicates, but does require more memory than other approaches. A slightly more complex approach
	that uses less memory might look like this:

	[gremlin-groovy,existing]
	----
	g.V().as("ov").
	outE().as("e").
	inV().as("iv").
	inE(). <1>
	where(neq("e")). <2>
	where(eq("e")).by(label).
	where(outV().as("ov")).
	group().
	by(select("ov","e","iv").by().by(label)). <3>
	unfold(). <4>
	select(values).
	where(count(local).is(gt(1)))
	----

	<1> To this point in the traversal, the outgoing edges of a vertex are being iterated with the current edge labeled
	as "e". For "e", Gremlin traverses to the incoming vertex and back on in edges of that vertex.
	<2> Those incoming edges are filtered with the following `where` steps. The first ensures that it does not traverse
	back over "e" (i.e. the current edge). The second determines if the edge label is equivalent (i.e. the test for the
	working definition of "duplicate"). The third determines if the outgoing vertex matches the one that started the
	path labeled as "ov".
	<3> This line is quite similar to the output achieved in the previous example at step 2. A `Map` is produced that uses
	the outgoing vertex, the edge label, and the incoming vertex as the key, with the list of edges for that path as the
	value.
	<4> The rest of the traversal is the same as the previous one.

	Note that the above traversal could also be written using `match` step:

	[gremlin-groovy,existing]
	----
	g.V().match(
	__.as("ov").outE().as("e"),
	__.as("e").inV().as("iv"),
	__.as("iv").inE().as("ie"),
	__.as("ie").outV().as("ov")).
	where("ie",neq("e")).
	where("ie",eq("e")).by(label).
	select("ie").
	group().
	by(select("ov","e","iv").by().by(label)).
	unfold().select(values).
	where(count(local).is(gt(1)))
	----

	A third way to approach this problem would be to force a link:https://en.wikipedia.org/wiki/Depth-first_search[depth-first search].
	The previous examples invoke traversal strategies that force a link:https://en.wikipedia.org/wiki/Breadth-first_search[breadth first search]
	as a performance optimization.

	[gremlin-groovy,existing]
	----
	g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov"). <1>
	outE().as("e1").
	inV().as("iv").
	inE().
	where(neq("e1")).
	where(outV().as("ov")).as("e2"). <2>
	where("e1", eq("e2")).by(label) <3>
	----

	<1> Remove strategies that will optimize for breadth first searches and thus allow Gremlin to go depth first.
	<2> To this point, the traversal is very much like the previous one. Review step 2 in the previous example to see the
	parallels here.
	<3> The final `where` simply looks for edges that match on label, which would then meet the working definition of
	"duplicate".

	The basic pattern at play here is to compare the path of the outgoing vertex, its outgoing edge label and the incoming
	vertex. This model can obviously be contracted or expanded as needed to fit different definitions of "duplicate". For
	example, a "duplicate" definition could extended to the label and properties of the edge. For purposes of
	demonstration, an additional edge is added to the "modern" graph:

	[gremlin-groovy,modern]
	----
	g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate()
	g.V(1).as("a").V(3).addE("created").property("weight",0.5d).from("a").iterate()
	g.V(1).outE("created").valueMap(true)
	----

	To identify the duplicate with this revised definition, the previous traversal can be modified to:

	[gremlin-groovy,existing]
	----
	g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov").
	outE().as("e1").
	inV().as("iv").
	inE().
	where(neq("e1")).
	where(outV().as("ov")).as("e2").
	where("e1", eq("e2")).by(label).
	where("e1", eq("e2")).by("weight").valueMap(true)
	----