Our user guide for item-based collaborative filtering (CF) introduced how to make recommendation based on item-item similarities. Here, we particularly focus on DIMSUM, an efficient and approximated similarity computation scheme, and try to make recommendation from the MovieLens data.

Caution

In order to obtain ranked list of items, this page introduces queries using to_ordered_map() such as map_values(to_ordered_map(rating, movieid, true)). However, this kind of usage has a potential issue that multiple movieid-s (i.e., values) which have the exactly same rating (i.e., key) will be aggregated to single arbitrary movieid, because to_ordered_map() creates a key-value map which uses duplicated rating as key.

Hence, if map key could duplicate on more then one map values, we recommend you to use to_ordered_list(value, key, '-reverse') instead of map_values(to_ordered_map(key, value, true)). The alternative approach is available from Hivemall v0.5-rc.1 or later.

Compute movie-movie similarity

As we explained in the general introduction of item-based CF, following query finds top-$$k$$ nearest-neighborhood movies for each movie:

drop table if exists dimsum_movie_similarity;
create table dimsum_movie_similarity 
as
with movie_magnitude as ( -- compute magnitude of each movie vector
  select
    to_map(j, mag) as mags
  from (
    select 
      movieid as j,
      l2_norm(rating) as mag
    from 
      training
    group by
      movieid
  ) t0
),
movie_features as (
  select
    userid as i,
    collect_list(
      feature(movieid, rating)
    ) as feature_vector
  from
    training
  group by
    userid
),
partial_result as ( -- launch DIMSUM in a MapReduce fashion
  select
    dimsum_mapper(f.feature_vector, m.mags, '-threshold 0.1 -disable_symmetric_output')
      as (movieid, other, s)
  from
    movie_features f
  left outer join movie_magnitude m
),
similarity as (
    -- reduce (i.e., sum up) mappers' partial results 
	select
	  movieid, 
	  other,
	  sum(s) as similarity
	from 
	  partial_result
	group by
	  movieid, other
),
topk as (
  select
    each_top_k( -- get top-10 nearest neighbors based on similarity score
      10, movieid, similarity,
      movieid, other -- output items
    ) as (rank, similarity, movieid, other)
  from (
    select * from similarity
    CLUSTER BY movieid
  ) t
)
select 
  movieid, other, similarity
from 
  topk
;
movieidothersimilarity
120950.9377422722094696
12310.9316530366756418
114070.9194745656079863
134420.9133853300741587
117920.9072960945403309
.........

Since we set k=10, output has 10 most-similar movies per movieid.

Note

Since we specified an option -disable_symmetric_output, output table does not contain inverted similarities such as <2095, 1>, <231, 1>, <1407, 1>, ...

Prediction

Next, we predict rating for unforeseen user-movie pairs based on the top-$$k$$ similarities:

drop table if exists dimsum_prediction;
create table dimsum_prediction
as
with similarity_all as (
  -- copy (i1, i2)'s similarity as (i2, i1)'s one
  select movieid, other, similarity from dimsum_movie_similarity
  union all
  select other as movieid, movieid as other, similarity from dimsum_movie_similarity
)
select 
	-- target user
	t1.userid,
	
	-- recommendation candidate
	t2.movieid,
	
	-- predicted rating: r_{u,i} = sum(s_{i,:} * r_{u,:}) / sum(s_{i,:})
	sum(t1.rating * t2.similarity) / sum(t2.similarity) as rating
from
	training t1 -- r_{u,<movieid>}
left join -- s_{i,<other>}
	similarity_all t2 
	on t1.movieid = t2.other
where
	-- do not include movies that user already rated
	NOT EXISTS (
		SELECT a.movieid FROM training a
		WHERE a.userid = t1.userid AND a.movieid = t2.movieid
	)
group by
	t1.userid, t2.movieid
;

This query computes estimated rating as follows:

useridmovieidrating
110005.0
110105.0
110124.246349332667371
110135.0
110145.0
.........

Theoretically, for the $$t$$-th nearest-neighborhood item $$\tau(t)$$, prediction can be done by top-$$k$$ weighted sum of target user's historical ratings: $$ r_{u,i} = \frac{\sum^k_{t=1} s_{i,\tau(t)} \cdot r_{u,\tau(t)} }{ \sum^k_{t=1} s_{i,\tau(t)} }, $$ where $$r_{u,i}$$ is user $$u$$'s rating for item (movie) $$i$$, and $$s_{i,j}$$ is similarity of $$i$$-$$j$$ (movieid-other) pair.

Caution

Since the number of similarities and users' past ratings are limited, we cannot say this output always contains prediction for every unforeseen user-item pairs; sometimes prediction for a specific user-item pair might be missing (i.e., NULL).

In fact, our goal is to make recommendation, but we can evaluate the intermediate result as a rating prediction problem:

select
	mae(t1.rating, t2.rating) as mae,
	rmse(t1.rating, t2.rating) as rmse
from
	testing t1 
left join
	dimsum_prediction t2
	on t1.movieid = t2.movieid
where
	t1.userid = t2.userid
;
maermse
0.73083658212302560.9594799959251938

Rating of the MovieLens data is in [1, 5] range, so this average errors are reasonable as a predictor.

Recommendation

By using the prediction table, making recommendation for each user is straightforward:

drop table if exists dimsum_recommendation;
create table dimsum_recommendation
as
select
  userid,
  map_values(to_ordered_map(rating, movieid, true)) as rec_movies
from
  dimsum_prediction
group by
  userid
;
useridrec_movies
1[“2590”,“999”,“372”,“1380”,“2078”,...]
2[“580”,“945”,“43”,“36”,“1704”,...]
3[“3744”,“852”,“1610”,“3740”,“2915”,...]
4[“3379”,“923”,“1997”,“2194”,“2944”,...]
5[“998”,“101”,“2696”,“2968”,“2275”,...]
......

Note

Size of rec_movies varies depending on each user's training samples and what movies he/she already rated.

Evaluation

Eventually, you can measure the quality of recommendation by using ranking measures:

with truth as (
  select 
    userid, 
    map_values(to_ordered_map(rating, cast(movieid as string), true)) as truth
  from
    testing
  group by
    userid
)
select 
  recall_at(t1.rec_movies, t2.truth, 10) as recall,
  precision_at(t1.rec_movies, t2.truth, 10) as precision,
  average_precision(t1.rec_movies, t2.truth) as average_precision,
  auc(t1.rec_movies, t2.truth) as auc,
  mrr(t1.rec_movies, t2.truth) as mrr,
  ndcg(t1.rec_movies, t2.truth) as ndcg
from
  dimsum_recommendation t1
join
  truth t2 on t1.userid = t2.userid
where -- at least 10 recommended items are necessary to compute recall@10 and precision@10
  size(t1.rec_movies) >= 10
;
measureaccuracy
Recall@100.027033598585322713
Precision@100.009001989389920506
Average Precision0.017363681149831108
AUC0.5264553136097863
MRR0.03507380742291146
NDCG0.15787655209987522

If you set larger value to the DIMSUM's -threshold option, similarity will be more aggressively approximated. Consequently, while efficiency is improved, the accuracy is likely to be decreased.

Caution

Before Hivemall v0.5-rc.1, recall_at() and precision_at() are respectively registered as recall() and precision(). However, since precision is a reserved keyword from Hive v2.2.0, we renamed the function names. If you are still using recall() and/or precision(), we strongly recommend you to use the latest version of Hivemall and replace them with the newer function names.