blob: 5ace52870cf8c773b82765bb8e855042669a2802 [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[collections]]
== Collections
image:gremlin-collections.png[width=400]
Lists and maps form the basis for much of the processing in Gremlin traversals. They are core to how side-effects
are typically held and how results are generally produced. Being able to pick them apart and reformat them is sometimes
required. This need to shape the data within a traversal may arise both at the
link:https://tinkerpop.apache.org/docs/current/reference/#terminal-steps[terminal step] of the traversal (technically
just prior to the terminal step) or in the middle of a traversal. Considering the former, a transformation just prior
to iteration will get the result into the form required by the application which would remove the need for additional
application level manipulation. Moreover, a transformation at this stage may reduce the size of the payload being
returned which could be useful in remote applications. Examining the latter, there may be times where a `List` or `Map`
requires some mid-traversal transformation so as to continue with the general logic of the traversal itself. For
example, a traversal at some point might produce a `Map` of `List` objects where the lists contain vertices, where each
`List` might need to be sorted by some criteria and then the top item for each extracted to become the basis for the
continued traversal. Executing transformations for either of these types of situations can be made possible with the
patterns described in this section.
The appearance of a `List` as a traverser in Gremlin usually arises as a result of a `fold()` operation, but may also
appear by way of some side-effect steps like `store()`:
[gremlin-groovy,modern]
----
g.V().fold()
g.V().store('a').cap('a')
----
It is worth noting that while a `Path` is not technically a `List` it does present like one and can be manipulated in
similar fashion to lists:
[gremlin-groovy,modern]
----
g.V().out().out().path()
----
These examples are obviously trivial and there are other ways that a traverser might end up in a `List` form, but, at
this moment, the point here is to focus less on how to get a `List` and more on how to manipulate one within the
Gremlin language. The examples going forward will also be similarly contrived insofar as producing a usable `List` to
manipulate. Bear in mind that it may be quite possible to get the same end results of these examples using more direct
means than what is demonstrated.
It may seem simple, but the most obvious choice to modifying what is in a list is to simply `unfold()` the `List`:
[gremlin-groovy,modern]
----
g.V().fold().unfold().values('name')
g.V().store('a').cap('a').unfold().values('name')
----
The above examples show that `unfold()` works quite well when you don't want to preserve the `List` structure of the
traverser as it just flattens `List` traversers to the traversal stream. The above examples only have one `List` as a
result, but consider what happens when there is more than one:
[gremlin-groovy,modern]
----
g.V().union(fold(),fold())
g.V().union(fold(),fold()).unfold().values('name')
----
The two separate `List` traversers are flattened to a single traversal stream and all the results are mixed together.
While this approach may be acceptable, there are many cases where it might not be so. To preserve the individual
structure of the `List` traversers "locally" `unfold()` the lists to transform them:
[gremlin-groovy,modern]
----
g.V().
union(fold(),fold()).
local(unfold().values('name').fold())
----
The call to `local()` executes its anonymous sub-traversal over each individual `List` iterator and as the
sub-traversal ends with a `fold()`-step, the results are reduced back into a `List` to preserve the original structure,
thus maintaining two traverser results.
This pattern for unfolding and folding `List` traversers ends up having other applications:
[gremlin-groovy,modern]
----
g.V().union(limit(3).fold(),tail(3).fold()) <1>
g.V().union(limit(3).fold(),tail(3).fold()).
local(unfold(). <2>
order().
by(bothE().count(),desc).
limit(1).
fold())
g.V().union(limit(3).fold(),tail(3).fold()). <3>
local(unfold().
has('age',gte(29)).
values('age').
mean())
----
<1> The output consists of two `List` traversers.
<2> For each `List` of vertices, order them by their number of edges, and choose the first one which will be the one
with the highest degree (i.e. number of edges). By ending with `fold()` the `List` traverser structure is preserved
thus returning two `List` objects. Consider this a method for choosing a "max" or a highly ranked vertex. In this case
the rank was determined by the number of edges, but it could have just as easily been determined by a vertex property,
edge property, a calculated value, etc. - one simply needs to alter the `by()`-step modulator to `order()`.
<3> For each `List` of vertices, filter that `List` to only include vertices that have an "age" property with a
value greater than or equal to "29" and then average the results of each list. More generally, consider how this
approach performs some kind of reducing calculation on each `List` traverser. In this case, an average was calculated,
but it might also have been a `sum()`, `count()` or similar operation that reduced the list to a single calculated
value.
So far, this section has focused on what to do with a `List` traverser once there is one present and there have been
fairly contrived examples for how to produce one in the first place. The use of `fold()` has been used most frequently
at this point to achieve list creation and that step should be recalled whenever there is a need to reduce some
traversal stream to an actual `List`. Of course, it may become necessary to more manually construct a `List`,
especially in cases where the expected output of the traversal is composed of one or more ordered results in the
form of a `List`. For example, consider the following three traversals:
[gremlin-groovy,modern]
----
g.V().has('name','marko').values('age') <1>
g.V().has('name','marko'). <2>
repeat(out()).
until(has('lang','java')).
path().
by('name')
g.V().has('name','marko'). <3>
repeat(outE().inV()).
until(has('lang','java')).
path().
local(unfold().
has('weight').
values('weight').
mean())
----
<1> Get the age of "marko"
<2> get the "name" values of the vertices in the collected paths that traverse out from "marko" to any vertex with
the "lang" of "java".
<3> Get the average of the "weight" values of edges in the collected paths that traverse out from "marko" to any vertex
with the "lang" of "java". Note the use of the earlier defined pattern that used `local()` in conjunction with
`unfold()`. In this case it filters out vertices from the `Path` as they are not relevant as the concern is only with
the "weight" property on the edges.
For purposes of this example, the three traversals above happen to represent three pieces of data that are required by
an application. It is plain to note that all of the above traversals hold a similar pattern that starts with
"getting 'marko'" and, in the case of the latter two, traversing on outgoing edges away from him and collecting data
from that path. Ideally, all three of these traversals should execute as one to prevent having to submit three separate
traversals, thus incurring additional query execution costs for what amounts to be largely the same underlying data but
with different transformations applied. The goal here would be to return the results of this data as a `List` with
three results (i.e. triple) that could then be submitted once by the application. The following example demonstrates
the use of `store()` to aid in construction of this `List`:
[gremlin-groovy,modern]
----
g.V().has('name','marko').as('v'). <1>
store('a'). <2>
by('age').
repeat(outE().as('e').inV().as('v')). <3>
until(has('lang','java')).
aggregate('b'). <4>
by(select(all,'v').unfold().values('name').fold()).
aggregate('c'). <5>
by(select(all,'e').unfold().values('weight').mean()).
fold(). <6>
store('a'). <7>
by(cap('b')).
store('a'). <8>
by(cap('c')).
cap('a')
----
<1> Get the "marko" vertex and label that step as "v".
<2> Store the first "age" of "marko" as the first item in the `List` called "a", which will ultimately be the result.
<3> Execute the traversal away from "marko" and continue to traverse on outgoing edges until the vertex has the value
of "java" for the "lang" property. Note the labels of "e" and "v". Note that "e" will contain a `List` of all of the
edges that have been traversed and "v" will contain a `List` of all the vertices that have been traversed.
<4> The incoming traverser to `aggregate('b')` are vertices that terminate the `repeat()` (i.e. those with the "lang"
of "java"). Note however that the `by()` modulator overrides that traverser completely by starting a fresh stream of
the list of vertices in "v". Those vertices are unfolded to retrieve the name property from each and then are reduced
with `fold()` back into a list to be stored in the side-effected named "b".
<5> A similar use of `aggregate()` as the previous step, though this one turns "e" into a stream of edges to calculate
the `mean()` to store in a `List` called "c". Note that `aggregate()` was used here instead of `store()`, as
the former is an eager collection of the elements in the stream (`store()` is lazy) and will force the traversal to be
iterated up to that point before moving forward. Without that eager collection, "v" and "e" would not contain the
complete information required for the production of "b" and "c".
<6> Adding `fold()`-step here is a bit of a trick. To see the trick, copy and paste all lines of Gremlin up to but
not including this `fold()`-step and run them against the "modern" graph. The output is three vertices and if the
`profile()`-step was added one would also see that the traversal contained three traversers. These three traversers
with a vertex in each one were produced from the `repeat()`-step (i.e. those vertices that had the "lang" of "java"
when traversing away from "marko"). The `aggregate()`-steps are side-effects and just allow the traversers to pass
through them unchanged. The `fold()` obviously converts those three traversers to a single `List` to make one
traverser with a `List` inside. That means that the remaining steps following the `fold()` will only be executed one
time each instead of three, which, as will be shown, is critical to the proper result.
<7> The single traverser with the `List` of three vertices in it passes to `store()`. The `by()` modulator presents an
override telling Gremlin to ignore the `List` of three vertices and simply grab the "b" side effect created earlier and
stick that into "a" as part of the result. The `List` with three vertices passes out unchanged as `store()` is a
side-effect step.
<8> Again, the single traverser with the `List` of three vertices passes to `store()` and again, the `by()` modulator
presents an override to include "c" into the result.
All of the above code and explanation show that `store()` can be used to construct `List` objects as side-effects
which can then be used as a result. Note that `aggregate()` can be used to similar effect, should it make sense that
lazy `List` creation is not acceptable with respect to the nature of the traversal. An interesting sub-pattern that
emerges here is that the `by()`-step can modulate its step to completely override the current traverser and ignore its
contents for purpose of that step. This ability to override a traverser acts as a powerful and flexible tool as it
means that each traverser can effectively become a completely different object as determined by a sub-traversal.
Another interesting method for `List` creation was demonstrated a bit earlier but not examined in detail - the use of
`union()`. It was shown earlier in the following context where it helped create a `List` of two lists of three
vertices each:
[gremlin-groovy,modern]
----
g.V().union(limit(3).fold(),tail(3).fold())
----
By folding the results of `union()`, it becomes possible to essentially construct lists with arbitrary traversal
results.
[gremlin-groovy,modern]
----
g.V().
local(union(identity(), <1>
bothE().count()).
fold())
g.V().
store('x').
by(union(select('x').count(local), <2>
identity(),
bothE().count()).
fold()).
cap('x')
----
<1> For each vertex, create a "pair" (i.e. a `List` of two objects) of the vertex itself and its edge count.
<2> For each vertex, create a "triple" (i.e. a `List`of three objects) of the index of the vertex (starting at zero),
the vertex itself and its edge count.
The pattern here is to use `union()` in conjunction with `fold()`. As explained earlier, the `fold()` operation reduces
the stream from `union()` to a single `List` that is then fed forward to the next step in the traversal.
Now that `List` patterns have been explained, there can now be some attention on `Map`. One of the most common ways
to end up with a `Map` is with `valueMap()`:
[gremlin-groovy,modern]
----
g.V().has('name','marko').valueMap('name','age')
----
The problem is that unless the graph is making use of multi-properties, there is little need to have the value of each
property stored as a `List`. One way to unwrap this value from the list is to avoid having it there in the first place
by avoiding use of `valueMap()`:
[gremlin-groovy,modern]
----
g.V().has('name','marko').
local(properties('name','age').
group().by(key()).by(value()))
----
Interestingly, it's worth looking at how to process the output of `valueMap()` to attain this output as the approach is
generally applicable to processing any `Map` instances with any sorts of values:
[gremlin-groovy,modern]
----
g.V().has('name','marko').
valueMap('name','age').
unfold().
group().
by(keys).
by(select(values).unfold())
----
The code above, basically deconstructs then reconstructs the `Map`. The key to the pattern is to first `unfold()` the
`Map` into its key and value entries. Then for each key and value produce a new `Map` using `group()` where the key
for that map is the key of the entry (those are obviously unique as you picked them out of the `valueMap()`) and the
value is simply the `unfold()` of the list of values in each entry. Recall that the `select(values).unfold()` only
returns one value (i.e. the first) not only because there is only one, but also because `by()` will only call `next()`
on that sub-traversal (it does not iterate the entire thing).
Generally speaking, a `Map` constructed as part of `group()` or `project()` will already be in the form required as
the `by()` modulators would be written in such a fashion as to produce that final output. It would be unnecessary to
deconstruct/reconstruct it. Be certain that there isn't a way to re-write the `group()` or `project()` to get the
desired output before taking this approach.
In the following case, `project()` is used to create a `Map` that does not meet this requirement as it contains some
unavoidable extraneous keys in the output `Map`:
[gremlin-groovy,modern]
----
g.V().
project('name','age','lang').
by('name').
by(coalesce(values('age'),constant('n/a'))).
by(coalesce(values('lang'),constant('n/a')))
----
The use of `coalesce()` works around the problem where "age" and "lang" are not necessarily property keys present on
every single vertex in the traversal stream. When the "age" or "lang" are not present, the constant of "n/a" is
supplied. While this may be an acceptable output, it is possible to shape the `Map` to be "nicer":
[gremlin-groovy,modern]
----
g.V().
project('name','age','lang').
by('name').
by(coalesce(values('age'),constant('n/a'))).
by(coalesce(values('lang'),constant('n/a'))).
local(unfold().
filter(select(values).is(P.neq('n/a'))).
group().
by(keys).
by(values))
----
The additional steps above `unfold()` the `Map` to key-value entries and filter the values for "n/a" and remove them
prior to reconstructing the `Map` with the method shown earlier. To go a step further, apply the pattern presented
earlier to flatten `List` values within a `Map`:
[gremlin-groovy,modern]
----
g.V().
project('name','age','lang').
by('name').
by(coalesce(values('age'),constant('n/a'))).
by(coalesce(values('lang'),constant('n/a'))).
local(unfold().
filter(select(values).is(P.neq('n/a'))).
group().
by(keys).
by(select(values).unfold()))
----
As there may be a desire to remove entries from a `Map`, there may also be the need to add keys to a `Map`. The pattern
here involves the use of a `union()` that returns the `Map` instances which can be flattened to entries and then
reconstructed as a new `Map` that has been merged together:
[gremlin-groovy,modern]
----
g.V().
has('name','marko').
union(project('degree'). <1>
by(bothE().count()),
valueMap(true)).
unfold(). <2>
group().
by(keys).
by(select(values).unfold())
----
<1> The `valueMap(true)` of a `Vertex` can be extended with the "degree" of the `Vertex` by performing a `union()` of
the two traversals that produce that output (both produce `Map` objects).
<2> The `unfold()`-step is used to decompose the `Map` objects into key/value entries that are then constructed back
into a single new `Map` given the patterns shown earlier. The `Map` objects of both traversals in the `union()` are
essentially merged together.
When using this pattern, it is important to recognize that if there are non-unique keys produced by the traversals
supplied to `union()`, they will overwrite one another given the final `by()` modulator above. If changed to
`by(select(values).unfold().fold())` they will merge to produce a `List` of values. Of course, that change will bring
a `List` back for all the values of the new `Map`. With some added logic the `Map` values can be flattened out of
`List` instances when necessary:
[gremlin-groovy,modern]
----
g.V().
has('name','marko').
union(valueMap(true),
project('age').
by(constant(100))).
unfold().
group().
by(keys).
by(select(values).
unfold().
fold().
choose(count(local).is(eq(1)), unfold()))
----