docs/src/recipes/recommendation.asciidoc - tinkerpop - Git at Google

 ////
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ////
 [[recommendation]]
 == Recommendation

 image:gremlin-recommendation.png[float=left,width=180]One of the more common use cases for a graph database is the
 development of link:https://en.wikipedia.org/wiki/Recommender_system[recommendation systems] and a simple approach to
 doing that is through link:https://en.wikipedia.org/wiki/Collaborative_filtering[collaborative filtering].
 Collaborative filtering assumes that if a person shares one set of opinions with a different person, they are likely to
 have similar taste with respect to other issues. With that basis in mind, it is then possible to make predictions for a
 specific person as to what their opinions might be.

 As a simple example, consider a graph that contains "person" and "product" vertices connected by "bought" edges. The
 following script generates some data for the graph using that basic schema:

 [gremlin-groovy]
 ----
 g.addV("person").property("name","alice").
   addV("person").property("name","bob").
   addV("person").property("name","jon").
   addV("person").property("name","jack").
   addV("person").property("name","jill")iterate()
 (1..10).each {
   g.addV("product").property("name","product #${it}").iterate()
 }; []
 (3..7).each {
   g.V().has("person","name","alice").as("p").
     V().has("product","name","product #${it}").addE("bought").from("p").iterate()
 }; []
 (1..5).each {
   g.V().has("person","name","bob").as("p").
     V().has("product","name","product #${it}").addE("bought").from("p").iterate()
 }; []
 (6..10).each {
   g.V().has("person","name","jon").as("p").
     V().has("product","name","product #${it}").addE("bought").from("p").iterate()
 }; []
 1.step(10, 2) {
   g.V().has("person","name","jack").as("p").
     V().has("product","name","product #${it}").addE("bought").from("p").iterate()
 }; []
 2.step(10, 2) {
   g.V().has("person","name","jill").as("p").
     V().has("product","name","product #${it}").addE("bought").from("p").iterate()
 }; []
 ----

 The first step to making a recommendation to "alice" using collaborative filtering is to understand what she bought:

 [gremlin-groovy,existing]
 ----
 g.V().has('name','alice').out('bought').values('name')
 ----

 The following diagram depicts one of the edges traversed in the above example between "alice" and "product #5".
 Obviously, the other products "alice" bought would have similar relations, but this diagram and those to follow will
 focus on the neighborhood around that product.

 image:recommendation-alice-1.png[width=500]

 The next step is to determine who else purchased those products:

 [gremlin-groovy,existing]
 ----
 g.V().has('name','alice').out('bought').in('bought').dedup().values('name')
 ----

 It is worth noting that "alice" is in the results above. She should really be excluded from the list as the
 interest is in what individuals other than herself purchased:

 [gremlin-groovy,existing]
 ----
 g.V().has('name','alice').as('her').
       out('bought').
       in('bought').where(neq('her')).
       dedup().values('name')
 ----

 The following diagram shows "alice" and those others who purchased "product #5".

 image:recommendation-alice-2.png[width=600]

 The knowledge of the people who bought the same things as "alice" can then be used to find the set of products that
 they bought:

 [gremlin-groovy,existing]
 ----
 g.V().has('name','alice').as('her').
       out('bought').
       in('bought').where(neq('her')).
       out('bought').
       dedup().values('name')
 ----

 image:recommendation-alice-3.png[width=800]

 This set of products could be the basis for recommendation, but it is important to remember that "alice" may have
 already purchased some of these products and it would be better to not pester her with recommendations for products
 that she already owns. Those products she already purchased can be excluded as follows:

 [gremlin-groovy,existing]
 ----
 g.V().has('name','alice').as('her').
       out('bought').aggregate('self').
       in('bought').where(neq('her')).
       out('bought').where(without('self')).
       dedup().values('name')
 ----

 image:recommendation-alice-4.png[width=800]

 The final step would be to group the remaining products (instead of `dedup()` which was mostly done for demonstration
 purposes) to form a ranking:

 [gremlin-groovy,existing]
 ----
 g.V().has('person','name','alice').as('her').     <1>
       out('bought').aggregate('self').            <2>
       in('bought').where(neq('her')).             <3>
       out('bought').where(without('self')).       <4>
       groupCount().
       order(local).
         by(values, desc)                          <5>
 ----

 <1> Find "alice" who is the person for whom the product recommendation is being made.
 <2> Traverse to the products that "alice" bought and gather them for later use in the traversal.
 <3> Traverse to the "person" vertices who bought the products that "alice" bought and exclude "alice" herself from that list.
 <4> Given those people who bought similar products to "alice", find the products that they bought and exclude those that she already bought.
 <5> Group the products and count the number of times they were purchased by others to come up with a ranking of products to recommend to "alice".

 The previous example was already described as "basic" and obviously could take into account whatever data is available
 to further improve the quality of the recommendation (e.g. product ratings, times of purchase, etc.).  One option to
 improve the quality of what is recommended (without expanding the previous dataset) might be to choose the person
 vertices that make up the recommendation to "alice" who have the largest common set of purchases.

 Looking back to the previous code example, consider its more strip down representation that shows those individuals
 who have at least one product in common:

 [gremlin-groovy,existing]
 ----
 g.V().has("person","name","alice").as("alice").
       out("bought").aggregate("self").
       in("bought").where(neq("alice")).dedup()
 ----

 Next, do some grouping to find count how many products they have in common:

 [gremlin-groovy,existing]
 ----
 g.V().has("person","name","alice").as("alice").
       out("bought").aggregate("self").
       in("bought").where(neq("alice")).dedup().
       group().
         by().by(out("bought").
       where(within("self")).count())
 ----

 The above output shows that the best that can be expected is three common products. The traversal needs to be aware of
 that maximum:

 [gremlin-groovy,existing]
 ----
 g.V().has("person","name","alice").as("alice").
       out("bought").aggregate("self").
       in("bought").where(neq("alice")).dedup().
       group().
         by().by(out("bought").
       where(within("self")).count()).
       select(values).
       order(local).
         by(desc).limit(local, 1)
 ----

 With the maximum value available, it can be used to chose those "person" vertices that have the three products in
 common:

 [gremlin-groovy,existing]
 ----
 g.V().has("person","name","alice").as("alice").
       out("bought").aggregate("self").
       in("bought").where(neq("alice")).dedup().
       group().
         by().by(out("bought").
       where(within("self")).count()).as("g").
       select(values).
       order(local).
         by(desc).limit(local, 1).as("m").
       select("g").unfold().
       where(select(values).as("m")).select(keys)
 ----

 Now that there is a list of "person" vertices to base the recommendation on, traverse to the products that they
 purchased:

 [gremlin-groovy,existing]
 ----
 g.V().has("person","name","alice").as("alice").
       out("bought").aggregate("self").
       in("bought").where(neq("alice")).dedup().
       group().
         by().by(out("bought").
       where(within("self")).count()).as("g").
       select(values).
       order(local).
         by(desc).limit(local, 1).as("m").
       select("g").unfold().
       where(select(values).as("m")).select(keys).
       out("bought").where(without("self"))
 ----

 The above output shows that one product is held in common making it the top recommendation:

 [gremlin-groovy,existing]
 ----
 g.V().has("person","name","alice").as("alice").
       out("bought").aggregate("self").
       in("bought").where(neq("alice")).dedup().
       group().
         by().by(out("bought").
       where(within("self")).count()).as("g").
       select(values).
       order(local).
         by(desc).limit(local, 1).as("m").
       select("g").unfold().
       where(select(values).as("m")).select(keys).
       out("bought").where(without("self")).
       groupCount().
       order(local).
         by(values, desc).
         by(select(keys).values("name")).
       unfold().select(keys).values("name")
 ----

 In considering the practical applications of this recipe, it is worth revisiting the earlier "basic" version of the
 recommendation algorithm:

 [gremlin-groovy,existing]
 ----
 g.V().has('person','name','alice').as('her').
       out('bought').aggregate('self').
       in('bought').where(neq('her')).
       out('bought').where(without('self')).
       groupCount().
       order(local).
         by(values, desc)
 ----

 The above traversal performs a full ranking of items based on all the connected data. That could be a time consuming
 operation depending on the number of paths being traversed. As it turns out, recommendations don't need to have perfect
 knowledge of all data to provide a "pretty good" approximation of a recommendation. It can therefore make sense to
 place additional limits on the traversal to have it better return more quickly at the expense of examining less data.


 Gremlin provides a number of steps that can help with these limits like:
 link:https://tinkerpop.apache.org/docs/x.y.z/reference/#coin-step[coin()],
 link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sample-step[sample()], and
 link:https://tinkerpop.apache.org/docs/current/reference/#timelimit-step[timeLimit()]. For example, to have the
 traversal sample the data for no longer than one second, the previous "basic" recommendation could be changed to:

 [gremlin-groovy,existing]
 ----
 g.V().has('person','name','alice').as('her').
       out('bought').aggregate('self').
       in('bought').where(neq('her')).
       out('bought').where(without('self')).timeLimit(1000).
       groupCount().
       order(local).
         by(values, desc)
 ----

 In using sampling methods, it is important to consider that the natural ordering of edges in the graph may not produce
 an ideal sample for the recommendation. For example, if the edges end up being returned oldest first, then the
 recommendation will be based on the oldest data, which would not be ideal. As with any traversal, it is important to
 understand the nature of the graph being traversed and the behavior of the underlying graph database to properly
 achieve the desired outcome.
	////
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	////
	[[recommendation]]
	== Recommendation

	image:gremlin-recommendation.png[float=left,width=180]One of the more common use cases for a graph database is the
	development of link:https://en.wikipedia.org/wiki/Recommender_system[recommendation systems] and a simple approach to
	doing that is through link:https://en.wikipedia.org/wiki/Collaborative_filtering[collaborative filtering].
	Collaborative filtering assumes that if a person shares one set of opinions with a different person, they are likely to
	have similar taste with respect to other issues. With that basis in mind, it is then possible to make predictions for a
	specific person as to what their opinions might be.

	As a simple example, consider a graph that contains "person" and "product" vertices connected by "bought" edges. The
	following script generates some data for the graph using that basic schema:

	[gremlin-groovy]
	----
	g.addV("person").property("name","alice").
	addV("person").property("name","bob").
	addV("person").property("name","jon").
	addV("person").property("name","jack").
	addV("person").property("name","jill")iterate()
	(1..10).each {
	g.addV("product").property("name","product #${it}").iterate()
	}; []
	(3..7).each {
	g.V().has("person","name","alice").as("p").
	V().has("product","name","product #${it}").addE("bought").from("p").iterate()
	}; []
	(1..5).each {
	g.V().has("person","name","bob").as("p").
	V().has("product","name","product #${it}").addE("bought").from("p").iterate()
	}; []
	(6..10).each {
	g.V().has("person","name","jon").as("p").
	V().has("product","name","product #${it}").addE("bought").from("p").iterate()
	}; []
	1.step(10, 2) {
	g.V().has("person","name","jack").as("p").
	V().has("product","name","product #${it}").addE("bought").from("p").iterate()
	}; []
	2.step(10, 2) {
	g.V().has("person","name","jill").as("p").
	V().has("product","name","product #${it}").addE("bought").from("p").iterate()
	}; []
	----

	The first step to making a recommendation to "alice" using collaborative filtering is to understand what she bought:

	[gremlin-groovy,existing]
	----
	g.V().has('name','alice').out('bought').values('name')
	----

	The following diagram depicts one of the edges traversed in the above example between "alice" and "product #5".
	Obviously, the other products "alice" bought would have similar relations, but this diagram and those to follow will
	focus on the neighborhood around that product.

	image:recommendation-alice-1.png[width=500]

	The next step is to determine who else purchased those products:

	[gremlin-groovy,existing]
	----
	g.V().has('name','alice').out('bought').in('bought').dedup().values('name')
	----

	It is worth noting that "alice" is in the results above. She should really be excluded from the list as the
	interest is in what individuals other than herself purchased:

	[gremlin-groovy,existing]
	----
	g.V().has('name','alice').as('her').
	out('bought').
	in('bought').where(neq('her')).
	dedup().values('name')
	----

	The following diagram shows "alice" and those others who purchased "product #5".

	image:recommendation-alice-2.png[width=600]

	The knowledge of the people who bought the same things as "alice" can then be used to find the set of products that
	they bought:

	[gremlin-groovy,existing]
	----
	g.V().has('name','alice').as('her').
	out('bought').
	in('bought').where(neq('her')).
	out('bought').
	dedup().values('name')
	----

	image:recommendation-alice-3.png[width=800]

	This set of products could be the basis for recommendation, but it is important to remember that "alice" may have
	already purchased some of these products and it would be better to not pester her with recommendations for products
	that she already owns. Those products she already purchased can be excluded as follows:

	[gremlin-groovy,existing]
	----
	g.V().has('name','alice').as('her').
	out('bought').aggregate('self').
	in('bought').where(neq('her')).
	out('bought').where(without('self')).
	dedup().values('name')
	----

	image:recommendation-alice-4.png[width=800]

	The final step would be to group the remaining products (instead of `dedup()` which was mostly done for demonstration
	purposes) to form a ranking:

	[gremlin-groovy,existing]
	----
	g.V().has('person','name','alice').as('her'). <1>
	out('bought').aggregate('self'). <2>
	in('bought').where(neq('her')). <3>
	out('bought').where(without('self')). <4>
	groupCount().
	order(local).
	by(values, desc) <5>
	----

	<1> Find "alice" who is the person for whom the product recommendation is being made.
	<2> Traverse to the products that "alice" bought and gather them for later use in the traversal.
	<3> Traverse to the "person" vertices who bought the products that "alice" bought and exclude "alice" herself from that list.
	<4> Given those people who bought similar products to "alice", find the products that they bought and exclude those that she already bought.
	<5> Group the products and count the number of times they were purchased by others to come up with a ranking of products to recommend to "alice".

	The previous example was already described as "basic" and obviously could take into account whatever data is available
	to further improve the quality of the recommendation (e.g. product ratings, times of purchase, etc.). One option to
	improve the quality of what is recommended (without expanding the previous dataset) might be to choose the person
	vertices that make up the recommendation to "alice" who have the largest common set of purchases.

	Looking back to the previous code example, consider its more strip down representation that shows those individuals
	who have at least one product in common:

	[gremlin-groovy,existing]
	----
	g.V().has("person","name","alice").as("alice").
	out("bought").aggregate("self").
	in("bought").where(neq("alice")).dedup()
	----

	Next, do some grouping to find count how many products they have in common:

	[gremlin-groovy,existing]
	----
	g.V().has("person","name","alice").as("alice").
	out("bought").aggregate("self").
	in("bought").where(neq("alice")).dedup().
	group().
	by().by(out("bought").
	where(within("self")).count())
	----

	The above output shows that the best that can be expected is three common products. The traversal needs to be aware of
	that maximum:

	[gremlin-groovy,existing]
	----
	g.V().has("person","name","alice").as("alice").
	out("bought").aggregate("self").
	in("bought").where(neq("alice")).dedup().
	group().
	by().by(out("bought").
	where(within("self")).count()).
	select(values).
	order(local).
	by(desc).limit(local, 1)
	----

	With the maximum value available, it can be used to chose those "person" vertices that have the three products in
	common:

	[gremlin-groovy,existing]
	----
	g.V().has("person","name","alice").as("alice").
	out("bought").aggregate("self").
	in("bought").where(neq("alice")).dedup().
	group().
	by().by(out("bought").
	where(within("self")).count()).as("g").
	select(values).
	order(local).
	by(desc).limit(local, 1).as("m").
	select("g").unfold().
	where(select(values).as("m")).select(keys)
	----

	Now that there is a list of "person" vertices to base the recommendation on, traverse to the products that they
	purchased:

	[gremlin-groovy,existing]
	----
	g.V().has("person","name","alice").as("alice").
	out("bought").aggregate("self").
	in("bought").where(neq("alice")).dedup().
	group().
	by().by(out("bought").
	where(within("self")).count()).as("g").
	select(values).
	order(local).
	by(desc).limit(local, 1).as("m").
	select("g").unfold().
	where(select(values).as("m")).select(keys).
	out("bought").where(without("self"))
	----

	The above output shows that one product is held in common making it the top recommendation:

	[gremlin-groovy,existing]
	----
	g.V().has("person","name","alice").as("alice").
	out("bought").aggregate("self").
	in("bought").where(neq("alice")).dedup().
	group().
	by().by(out("bought").
	where(within("self")).count()).as("g").
	select(values).
	order(local).
	by(desc).limit(local, 1).as("m").
	select("g").unfold().
	where(select(values).as("m")).select(keys).
	out("bought").where(without("self")).
	groupCount().
	order(local).
	by(values, desc).
	by(select(keys).values("name")).
	unfold().select(keys).values("name")
	----

	In considering the practical applications of this recipe, it is worth revisiting the earlier "basic" version of the
	recommendation algorithm:

	[gremlin-groovy,existing]
	----
	g.V().has('person','name','alice').as('her').
	out('bought').aggregate('self').
	in('bought').where(neq('her')).
	out('bought').where(without('self')).
	groupCount().
	order(local).
	by(values, desc)
	----

	The above traversal performs a full ranking of items based on all the connected data. That could be a time consuming
	operation depending on the number of paths being traversed. As it turns out, recommendations don't need to have perfect
	knowledge of all data to provide a "pretty good" approximation of a recommendation. It can therefore make sense to
	place additional limits on the traversal to have it better return more quickly at the expense of examining less data.


	Gremlin provides a number of steps that can help with these limits like:
	link:https://tinkerpop.apache.org/docs/x.y.z/reference/#coin-step[coin()],
	link:https://tinkerpop.apache.org/docs/x.y.z/reference/#sample-step[sample()], and
	link:https://tinkerpop.apache.org/docs/current/reference/#timelimit-step[timeLimit()]. For example, to have the
	traversal sample the data for no longer than one second, the previous "basic" recommendation could be changed to:

	[gremlin-groovy,existing]
	----
	g.V().has('person','name','alice').as('her').
	out('bought').aggregate('self').
	in('bought').where(neq('her')).
	out('bought').where(without('self')).timeLimit(1000).
	groupCount().
	order(local).
	by(values, desc)
	----

	In using sampling methods, it is important to consider that the natural ordering of edges in the graph may not produce
	an ideal sample for the recommendation. For example, if the edges end up being returned oldest first, then the
	recommendation will be based on the oldest data, which would not be ideal. As with any traversal, it is important to
	understand the nature of the graph being traversed and the behavior of the underlying graph database to properly
	achieve the desired outcome.