| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[tree]] |
| == Tree |
| |
| image:gremlin-tree.png[width=280] |
| |
| === Lowest Common Ancestor |
| |
| image:tree-lca.png[width=230,float=right] Given a tree, the link:https://en.wikipedia.org/wiki/Lowest_common_ancestor[lowest common ancestor] |
| is the deepest vertex that is common to two or more other vertices. The diagram to the right depicts the common |
| ancestor tree for vertices A and D in the various green shades. The C vertex, the vertex with the darkest green |
| shading, is the lowest common ancestor. |
| |
| The following code simply sets up the graph depicted above using "hasParent" for the edge label: |
| |
| [gremlin-groovy] |
| ---- |
| g.addV().property(id, 'A').as('a'). |
| addV().property(id, 'B').as('b'). |
| addV().property(id, 'C').as('c'). |
| addV().property(id, 'D').as('d'). |
| addV().property(id, 'E').as('e'). |
| addV().property(id, 'F').as('f'). |
| addV().property(id, 'G').as('g'). |
| addE('hasParent').from('a').to('b'). |
| addE('hasParent').from('b').to('c'). |
| addE('hasParent').from('d').to('c'). |
| addE('hasParent').from('c').to('e'). |
| addE('hasParent').from('e').to('f'). |
| addE('hasParent').from('g').to('f').iterate() |
| ---- |
| |
| Given that graph, the following traversal will get the lowest common ancestor for two vertices, A and D: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V('A'). |
| repeat(out('hasParent')).emit().as('x'). |
| repeat(__.in('hasParent')).emit(hasId('D')). |
| select('x').limit(1) |
| ---- |
| |
| The above traversal is reasonably straightforward to follow in that it simply traverses up the tree from the A vertex |
| and then traverses down from each ancestor until it finds the "D" vertex. The first path that uncovers that match is |
| the lowest common ancestor. |
| |
| The complexity of finding the lowest common ancestor increases when trying to find the ancestors of three or more |
| vertices. |
| |
| [gremlin-groovy,existing] |
| ---- |
| input = ['A','B','D'] |
| g.V(input.head()). |
| repeat(out('hasParent')).emit().as('x'). <1> |
| V().has(id, within(input.tail())). <2> |
| repeat(out('hasParent')).emit(where(eq('x'))). <3> |
| group(). |
| by(select('x')). |
| by(path().count(local).fold()). <4> |
| unfold().filter(select(values).count(local).is(input.tail().size())). <5> |
| order().by(select(values). |
| unfold().sum()). <6> |
| select(keys).limit(1) <7> |
| ---- |
| |
| <1> The start of the traversal is not so different than the previous one and starts with vertex A. |
| <2> Use a mid-traversal `V()` to find the child vertices B and D. |
| <3> Traverse up the tree for B and D and find common ancestors that were labeled with "x". |
| <4> Group on the common ancestors where the value of the grouping is the length of the path. |
| <5> The result of the previous step is a `Map` with a vertex (i.e. common ancestor) for the key and a list of path |
| lengths. Unroll the `Map` and ensure that the number of path lengths are equivalent to the number of children that |
| were given to the mid-traversal `V()`. |
| <6> Order the results based on the sum of the path lengths. |
| <7> Since the results were placed in ascending order, the first result must be the lowest common ancestor. |
| |
| As the above traversal utilizes a mid-traversal `V()`, it cannot be used for OLAP. In OLAP, the pattern changes a bit: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.withComputer(). |
| V().has(id, within(input)). |
| aggregate('input').hasId(input.head()). <1> |
| repeat(out('hasParent')).emit().as('x'). |
| select('input').unfold().has(id, within(input.tail())). |
| repeat(out('hasParent')).emit(where(eq('x'))). |
| group(). |
| by(select('x')). |
| by(path().count(local).fold()). |
| unfold().filter(select(values).count(local).is(input.tail().size())). |
| order(). |
| by(select(values).unfold().sum()). |
| select(keys).limit(1) |
| ---- |
| |
| <1> The main difference for OLAP is the use of `aggregate()` over the mid-traversal `V()`. |
| |
| === Maximum Depth |
| |
| Finding the maximum depth of a tree starting from a specified root vertex can be determined as follows: |
| |
| [gremlin-groovy] |
| ---- |
| g.addV().property('name', 'A').as('a'). |
| addV().property('name', 'B').as('b'). |
| addV().property('name', 'C').as('c'). |
| addV().property('name', 'D').as('d'). |
| addV().property('name', 'E').as('e'). |
| addV().property('name', 'F').as('f'). |
| addV().property('name', 'G').as('g'). |
| addE('hasParent').from('a').to('b'). |
| addE('hasParent').from('b').to('c'). |
| addE('hasParent').from('d').to('c'). |
| addE('hasParent').from('c').to('e'). |
| addE('hasParent').from('e').to('f'). |
| addE('hasParent').from('g').to('f').iterate() |
| g.V().has('name','F').repeat(__.in()).emit().path().count(local).max() |
| g.V().has('name','C').repeat(__.in()).emit().path().count(local).max() |
| ---- |
| |
| image:gremlin-max-depth.png[float=right,width=350]The traversals shown above are fairly straightforward. The traversal |
| beings at a particular starting vertex, traverse in on the "hasParent" edges emitting all vertices as it goes. It |
| calculates the path length and then selects the longest one. While this approach is quite direct, there is room for |
| improvement: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('name','F'). |
| repeat(__.in()).emit(__.not(inE())).tail(1). |
| path().count(local) |
| g.V().has('name','C'). |
| repeat(__.in()).emit(__.not(inE())).tail(1). |
| path().count(local) |
| ---- |
| |
| There are two optimizations at play. First, there is no need to emit all the vertices, only the "leaf" vertices (i.e. |
| those without incoming edges). Second, all results save the last one can be ignored to that point (i.e. the last one is |
| the one at the deepest point in the tree). In this way, the path and path length only need to be calculated for a |
| single result. |
| |
| The previous approaches to calculating the maximum depth use path calculations to achieve the answer. Path calculations |
| can be expensive and if possible avoided if they are not needed. Another way to express a traversal that calculates |
| the maximum depth is to use the `sack()`-step: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.withSack(1).V().has('name','F'). |
| repeat(__.in().sack(sum).by(constant(1))).emit(). |
| sack().max() |
| g.withSack(1).V().has('name','C'). |
| repeat(__.in().sack(sum).by(constant(1))).emit(). |
| sack().max() |
| ---- |
| |
| === Time-based Indexing |
| |
| Trees can be used for modeling time-oriented data in a graph. Modeling time where there are "year", "month" and "day" |
| vertices (or lower granularity as needed) allows the structure of the graph to inherently index data tied to them. |
| |
| image:gremlin-index-time.png[width=800] |
| |
| NOTE: This model is discussed further in this Neo4j link:https://neo4j.com/blog/modeling-a-multilevel-index-in-neoj4/[blog post]. |
| Also, there can be other versions of this model that utilize different edge/vertex labeling and property naming |
| strategies. The schema depicted here is designed for simplicity. |
| |
| The Gremlin script below creates the graph depicted in the graph above: |
| |
| [gremlin-groovy] |
| ---- |
| g.addV('year').property('name', '2016').as('y2016'). |
| addV('month').property('name', 'may').as('m05'). |
| addV('month').property('name', 'june').as('m06'). |
| addV('day').property('name', '30').as('d30'). |
| addV('day').property('name', '31').as('d31'). |
| addV('day').property('name', '01').as('d01'). |
| addV('event').property('name', 'A').as('eA'). |
| addV('event').property('name', 'B').as('eB'). |
| addV('event').property('name', 'C').as('eC'). |
| addV('event').property('name', 'D').as('eD'). |
| addV('event').property('name', 'E').as('eE'). |
| addE('may').from('y2016').to('m05'). |
| addE('june').from('y2016').to('m06'). |
| addE('day30').from('m05').to('d30'). |
| addE('day31').from('m05').to('d31'). |
| addE('day01').from('m06').to('d01'). |
| addE('has').from('d30').to('eA'). |
| addE('has').from('d30').to('eB'). |
| addE('has').from('d31').to('eC'). |
| addE('has').from('d31').to('eD'). |
| addE('has').from('d01').to('eE'). |
| addE('next').from('d30').to('d31'). |
| addE('next').from('d31').to('d01'). |
| addE('next').from('m05').to('m06').iterate() |
| ---- |
| |
| IMPORTANT: The code example above does not create any indices. Proper index creation, which is specific to the |
| graph implementation used, will be critical to the performance of traversals over this structure. |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('name','2016').out().out().out('has').values() <1> |
| g.V().has('name','2016').out('may').out().out('has').values() <2> |
| g.V().has('name','2016').out('may').out('day31').out('has').values() <3> |
| g.V().has('name','2016').out('may').out('day31').as('start'). |
| V().has('name','2016').out('june').out('day01').as('end'). |
| emit().repeat(__.in('next')).until(where(eq('start'))). |
| out('has'). |
| order().by('name').values('name') <4> |
| ---- |
| |
| <1> Find all the events in 2016. |
| <2> Find all the events in May of 2016. |
| <3> Find all the events on May 31, 2016. |
| <4> Find all the events between May 31, 2016 and June 1, 2016. |