| //// |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| [[recommendation]] |
| == Recommendation |
| |
| image:gremlin-recommendation.png[float=left,width=180]One of the more common use cases for a graph database is the |
| development of link:https://en.wikipedia.org/wiki/Recommender_system[recommendation systems] and a simple approach to |
| doing that is through link:https://en.wikipedia.org/wiki/Collaborative_filtering[collaborative filtering]. |
| Collaborative filtering assumes that if a person shares one set of opinions with a different person, they are likely to |
| have similar taste with respect to other issues. With that basis in mind, it is then possible to make predictions for a |
| specific person as to what their opinions might be. |
| |
| As a simple example, consider a graph that contains "person" and "product" vertices connected by "bought" edges. The |
| following script generates some data for the graph using that basic schema: |
| |
| [gremlin-groovy] |
| ---- |
| g.addV("person").property("name","alice"). |
| addV("person").property("name","bob"). |
| addV("person").property("name","jon"). |
| addV("person").property("name","jack"). |
| addV("person").property("name","jill")iterate() |
| (1..10).each { |
| g.addV("product").property("name","product #${it}").iterate() |
| }; [] |
| (3..7).each { |
| g.V().has("person","name","alice").as("p"). |
| V().has("product","name","product #${it}").addE("bought").from("p").iterate() |
| }; [] |
| (1..5).each { |
| g.V().has("person","name","bob").as("p"). |
| V().has("product","name","product #${it}").addE("bought").from("p").iterate() |
| }; [] |
| (6..10).each { |
| g.V().has("person","name","jon").as("p"). |
| V().has("product","name","product #${it}").addE("bought").from("p").iterate() |
| }; [] |
| 1.step(10, 2) { |
| g.V().has("person","name","jack").as("p"). |
| V().has("product","name","product #${it}").addE("bought").from("p").iterate() |
| }; [] |
| 2.step(10, 2) { |
| g.V().has("person","name","jill").as("p"). |
| V().has("product","name","product #${it}").addE("bought").from("p").iterate() |
| }; [] |
| ---- |
| |
| The first step to making a recommendation to "alice" using collaborative filtering is to understand what she bought: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('name','alice').out('bought').values('name') |
| ---- |
| |
| The following diagram depicts one of the edges traversed in the above example between "alice" and "product #5". |
| Obviously, the other products "alice" bought would have similar relations, but this diagram and those to follow will |
| focus on the neighborhood around that product. |
| |
| image:recommendation-alice-1.png[width=500] |
| |
| The next step is to determine who else purchased those products: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('name','alice').out('bought').in('bought').dedup().values('name') |
| ---- |
| |
| It is worth noting that "alice" is in the results above. She should really be excluded from the list as the |
| interest is in what individuals other than herself purchased: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('name','alice').as('her'). |
| out('bought'). |
| in('bought').where(neq('her')). |
| dedup().values('name') |
| ---- |
| |
| The following diagram shows "alice" and those others who purchased "product #5". |
| |
| image:recommendation-alice-2.png[width=600] |
| |
| The knowledge of the people who bought the same things as "alice" can then be used to find the set of products that |
| they bought: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('name','alice').as('her'). |
| out('bought'). |
| in('bought').where(neq('her')). |
| out('bought'). |
| dedup().values('name') |
| ---- |
| |
| image:recommendation-alice-3.png[width=800] |
| |
| This set of products could be the basis for recommendation, but it is important to remember that "alice" may have |
| already purchased some of these products and it would be better to not pester her with recommendations for products |
| that she already owns. Those products she already purchased can be excluded as follows: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('name','alice').as('her'). |
| out('bought').aggregate('self'). |
| in('bought').where(neq('her')). |
| out('bought').where(without('self')). |
| dedup().values('name') |
| ---- |
| |
| image:recommendation-alice-4.png[width=800] |
| |
| The final step would be to group the remaining products (instead of `dedup()` which was mostly done for demonstration |
| purposes) to form a ranking: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('person','name','alice').as('her'). <1> |
| out('bought').aggregate('self'). <2> |
| in('bought').where(neq('her')). <3> |
| out('bought').where(without('self')). <4> |
| groupCount(). |
| order(local). |
| by(values, desc) <5> |
| ---- |
| |
| <1> Find "alice" who is the person for whom the product recommendation is being made. |
| <2> Traverse to the products that "alice" bought and gather them for later use in the traversal. |
| <3> Traverse to the "person" vertices who bought the products that "alice" bought and exclude "alice" herself from that list. |
| <4> Given those people who bought similar products to "alice", find the products that they bought and exclude those that she already bought. |
| <5> Group the products and count the number of times they were purchased by others to come up with a ranking of products to recommend to "alice". |
| |
| The previous example was already described as "basic" and obviously could take into account whatever data is available |
| to further improve the quality of the recommendation (e.g. product ratings, times of purchase, etc.). One option to |
| improve the quality of what is recommended (without expanding the previous dataset) might be to choose the person |
| vertices that make up the recommendation to "alice" who have the largest common set of purchases. |
| |
| Looking back to the previous code example, consider its more strip down representation that shows those individuals |
| who have at least one product in common: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has("person","name","alice").as("alice"). |
| out("bought").aggregate("self"). |
| in("bought").where(neq("alice")).dedup() |
| ---- |
| |
| Next, do some grouping to find count how many products they have in common: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has("person","name","alice").as("alice"). |
| out("bought").aggregate("self"). |
| in("bought").where(neq("alice")).dedup(). |
| group(). |
| by().by(out("bought"). |
| where(within("self")).count()) |
| ---- |
| |
| The above output shows that the best that can be expected is three common products. The traversal needs to be aware of |
| that maximum: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has("person","name","alice").as("alice"). |
| out("bought").aggregate("self"). |
| in("bought").where(neq("alice")).dedup(). |
| group(). |
| by().by(out("bought"). |
| where(within("self")).count()). |
| select(values). |
| order(local). |
| by(desc).limit(local, 1) |
| ---- |
| |
| With the maximum value available, it can be used to chose those "person" vertices that have the three products in |
| common: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has("person","name","alice").as("alice"). |
| out("bought").aggregate("self"). |
| in("bought").where(neq("alice")).dedup(). |
| group(). |
| by().by(out("bought"). |
| where(within("self")).count()).as("g"). |
| select(values). |
| order(local). |
| by(desc).limit(local, 1).as("m"). |
| select("g").unfold(). |
| where(select(values).as("m")).select(keys) |
| ---- |
| |
| Now that there is a list of "person" vertices to base the recommendation on, traverse to the products that they |
| purchased: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has("person","name","alice").as("alice"). |
| out("bought").aggregate("self"). |
| in("bought").where(neq("alice")).dedup(). |
| group(). |
| by().by(out("bought"). |
| where(within("self")).count()).as("g"). |
| select(values). |
| order(local). |
| by(desc).limit(local, 1).as("m"). |
| select("g").unfold(). |
| where(select(values).as("m")).select(keys). |
| out("bought").where(without("self")) |
| ---- |
| |
| The above output shows that one product is held in common making it the top recommendation: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has("person","name","alice").as("alice"). |
| out("bought").aggregate("self"). |
| in("bought").where(neq("alice")).dedup(). |
| group(). |
| by().by(out("bought"). |
| where(within("self")).count()).as("g"). |
| select(values). |
| order(local). |
| by(desc).limit(local, 1).as("m"). |
| select("g").unfold(). |
| where(select(values).as("m")).select(keys). |
| out("bought").where(without("self")). |
| groupCount(). |
| order(local). |
| by(values, desc). |
| by(select(keys).values("name")). |
| unfold().select(keys).values("name") |
| ---- |
| |
| In considering the practical applications of this recipe, it is worth revisiting the earlier "basic" version of the |
| recommendation algorithm: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('person','name','alice').as('her'). |
| out('bought').aggregate('self'). |
| in('bought').where(neq('her')). |
| out('bought').where(without('self')). |
| groupCount(). |
| order(local). |
| by(values, desc) |
| ---- |
| |
| The above traversal performs a full ranking of items based on all the connected data. That could be a time consuming |
| operation depending on the number of paths being traversed. As it turns out, recommendations don't need to have perfect |
| knowledge of all data to provide a "pretty good" approximation of a recommendation. It can therefore make sense to |
| place additional limits on the traversal to have it better return more quickly at the expense of examining less data. |
| |
| |
| Gremlin provides a number of steps that can help with these limits like: |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#coin-step[coin()], |
| link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sample-step[sample()], and |
| link:https://tinkerpop.apache.org/docs/current/reference/#timelimit-step[timeLimit()]. For example, to have the |
| traversal sample the data for no longer than one second, the previous "basic" recommendation could be changed to: |
| |
| [gremlin-groovy,existing] |
| ---- |
| g.V().has('person','name','alice').as('her'). |
| out('bought').aggregate('self'). |
| in('bought').where(neq('her')). |
| out('bought').where(without('self')).timeLimit(1000). |
| groupCount(). |
| order(local). |
| by(values, desc) |
| ---- |
| |
| In using sampling methods, it is important to consider that the natural ordering of edges in the graph may not produce |
| an ideal sample for the recommendation. For example, if the edges end up being returned oldest first, then the |
| recommendation will be based on the oldest data, which would not be ideal. As with any traversal, it is important to |
| understand the nature of the graph being traversed and the behavior of the underlying graph database to properly |
| achieve the desired outcome. |