Hivemall supports a neighborhood-learning recommendation scheme using SLIM. SLIM is a representative neighborhood-learning recommendation algorithm introduced in the following paper:

  • Xia Ning and George Karypis. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In Proc. IEEE ICDM, 2011.

Caution: SLIM is supported from Hivemall v0.5-rc.1 or later.

SLIM optimization objective

The optimization objective of SLIM is similar to Elastic Net (L1+L2 regularization) with additional constraints as follows:

$$
\begin{aligned}
& \underset{w_{j}}{\text{minimize}} && \frac{1}{2}\Vert r_{j} - Rw_{j} \Vert_2^2 + \frac{\beta}{2} \Vert w_{j} \Vert_2^2 + \lambda \Vert w_{j} \Vert_1 \\
& \text{subject to} && w_{j} \geq 0 \\
&&& \mathrm{diag}(W) = 0
\end{aligned}
$$
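To make the objective concrete, here is a minimal pure-Python sketch that evaluates the per-column objective on a toy rating matrix. The matrix, weights, and hyperparameter values are hypothetical illustrations, not Hivemall code:

```python
def slim_column_objective(R, j, w_j, beta, lam):
    """Evaluate the SLIM objective for column j of rating matrix R.

    R: list of user rows (list of lists); w_j: weight vector with
    w_j[j] == 0 (the diag(W) = 0 constraint) and non-negative entries.
    """
    r_j = [row[j] for row in R]                                  # target column r_j
    pred = [sum(r * w for r, w in zip(row, w_j)) for row in R]   # R w_j
    loss = 0.5 * sum((a - b) ** 2 for a, b in zip(r_j, pred))    # squared loss
    l2 = 0.5 * beta * sum(w * w for w in w_j)                    # ridge term
    l1 = lam * sum(abs(w) for w in w_j)                          # lasso term
    return loss + l2 + l1

# Toy 3-user x 2-item binarized rating matrix
R = [[1.0, 1.0],
     [1.0, 0.0],
     [0.0, 1.0]]
w = [0.5, 0.0]  # candidate weights for item j=1 (diagonal entry forced to 0)
print(slim_column_objective(R, 1, w, beta=0.5, lam=0.001))
```

Coordinate descent (as used by train_slim later in this article) repeatedly minimizes this objective over one weight at a time while keeping the constraints satisfied.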

Data preparation

Rating binarization

In this article, each element of the user-movie matrix is binarized to reduce the number of training samples and to consider only highly rated movies (rating 4 or 5). Every matrix element with a rating lower than 4 is excluded from training.

SET hivevar:seed=31;

DROP TABLE ratings2;
CREATE TABLE ratings2 as
select
  rand(${seed}) as rnd,
  userid,
  movieid as itemid,
  cast(1.0 as float) as rating -- double is also accepted
from
  ratings
where rating >= 4.
;

The rnd field is appended to each record so that ratings2 can later be split into training and testing data.

Binarization is an optional step, and you can use raw rating values to train a SLIM model.

Splitting dataset

To evaluate a recommendation model, this tutorial uses two types of cross validation:

  • Leave-one-out cross validation
  • $$K$$-fold cross validation

The former is used in the SLIM paper, and the latter is used in Mendeley's slides.

Leave-one-out cross validation

For leave-one-out cross validation, the dataset is split into a training set and a testing set by randomly selecting one of the non-zero entries of each user and placing it into the testing set. In the following query, the movie with the smallest rnd value per user is used as test data (the testing table), and the remaining movies are used as training data (the training table).

When selecting SLIM's best hyperparameters, different test data is used several times in the evaluation section.

DROP TABLE testing;
CREATE TABLE testing
as
WITH top_k as (
  select
     each_top_k(1, userid, rnd, userid, itemid, rating)
      as (rank, rnd, userid, itemid, rating)
  from (
    select * from ratings2
    CLUSTER BY userid
  ) t
)
select
  userid, itemid, rating
from
  top_k
;

DROP TABLE training;
CREATE TABLE training as
select
  l.*
from
  ratings2 l
  LEFT OUTER JOIN testing r ON (l.userid=r.userid and l.itemid=r.itemid)
where
  r.itemid IS NULL -- anti join
;

$$K$$-fold cross validation

When $$K=2$$, the dataset is divided into a training set and a testing set of roughly equal size.

To select SLIM's best hyperparameters, first train a SLIM prediction model on the training data and then evaluate the model on the testing data.

Optionally, you can switch training data with testing data and evaluate again.

DROP TABLE testing;
CREATE TABLE testing
as
select * from ratings2
where rnd >= 0.5
;

DROP TABLE training;
CREATE TABLE training
as
select * from ratings2
where rnd < 0.5
;

Note

In the following sections (excluding the evaluation section), the example queries and their results are based on the $$K$$-fold cross validation case. However, the queries are also valid for leave-one-out cross validation.

Pre-compute item-item similarity

SLIM needs the top-$$k$$ most similar movies for each movie to approximate the user-item matrix. Here, we particularly focus on DIMSUM, an efficient approximate similarity computation scheme.

Because we set k=20, the output contains the 20 most similar movies per itemid. By varying k, we can adjust the trade-off between training/prediction time and the precision of the matrix approximation: a larger k gives a better approximation of the raw user-item matrix, but training time and memory usage tend to increase.
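Ignoring DIMSUM's sampling, the similarity it estimates is the cosine between item columns of the user-item matrix. The following dense plain-Python sketch with toy data illustrates the quantity being computed; it is not the actual DIMSUM algorithm, which samples entries to stay scalable:

```python
from math import sqrt

def cosine(col_a, col_b):
    """Exact cosine similarity between two item rating columns.
    DIMSUM approximates this value at scale via probabilistic sampling."""
    dot = sum(a * b for a, b in zip(col_a, col_b))
    norm_a = sqrt(sum(a * a for a in col_a))
    norm_b = sqrt(sum(b * b for b in col_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy binarized ratings: index = user, value = rating for the item
item_a = [1.0, 1.0, 0.0]
item_b = [1.0, 0.0, 0.0]
print(cosine(item_a, item_b))  # approximately 0.707
```

In the query below, l2_norm plays the role of the column norms and dimsum_mapper emits sampled partial dot products, which are summed per item pair.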

As explained in the general introduction of item-based CF, the following query finds the top-$$k$$ nearest-neighbor movies for each movie:

set hivevar:k=20;

DROP TABLE knn_train;
CREATE TABLE knn_train
as
with item_magnitude as (
  select
    to_map(j, mag) as mags
  from (
    select
      itemid as j,
      l2_norm(rating) as mag
    from
      training
    group by
      itemid
  ) t0
),
item_features as (
  select
    userid as i,
    collect_list(
      feature(itemid, rating)
    ) as feature_vector
  from
    training
  group by
    userid
),
partial_result as (
  select
    dimsum_mapper(f.feature_vector, m.mags, '-threshold 0.1 -int_feature')
      as (itemid, other, s)
  from
    item_features f
    CROSS JOIN item_magnitude m
),
similarity as (
  select
    itemid,
    other,
    sum(s) as similarity
  from
    partial_result
  group by
    itemid, other
),
topk as (
  select
    each_top_k(
      ${k}, itemid, similarity, -- use top k items
      itemid, other
    ) as (rank, similarity, itemid, other)
  from (
    select * from similarity
    CLUSTER BY itemid
  ) t
)
select
  itemid, other, similarity
from
  topk
;
itemid  other  similarity
1       3114   0.28432244
1       1265   0.25180137
1       2355   0.24781825
1       2396   0.24435896
1       588    0.24359442
...     ...    ...

Caution

To run the query above, you may need to run the following statements:

set hive.strict.checks.cartesian.product=false;
set hive.mapred.mode=nonstrict;

Create training input tables

Here, we prepare input tables for SLIM training.

SLIM input consists of the following columns in slim_training_item:

  • i: the axis item id
  • R_i: the user-rating vector of the axis item $$i$$, expressed as map<userid, rating>.
  • knn_i: the top-$$K$$ similar-item matrix of item $$i$$; the user-item rating matrix is expressed as map<userid, map<itemid, rating>>.
  • j: an item id in knn_i.
  • R_j: the user-rating vector of the item $$j$$, expressed as map<userid, rating>.

DROP TABLE item_matrix;
CREATE table item_matrix as
select
  itemid as i,
  to_map(userid, rating) as R_i
from
  training
group by
  itemid;

-- Temporarily disable map join because the following query does not work well with map join
set hive.auto.convert.join=false;
-- set mapred.reduce.tasks=64;

-- Create SLIM input features
DROP TABLE slim_training_item;
CREATE TABLE slim_training_item as
WITH knn_item_user_matrix as (
  select
    l.itemid,
    r.userid,
    to_map(l.other, r.rating) ratings
  from
    knn_train l
    JOIN training r ON (l.other = r.itemid)
  group by
    l.itemid, r.userid
),
knn_item_matrix as (
  select
    itemid as i,
    to_map(userid, ratings) as KNN_i -- map<userid, map<itemid, rating>>
  from
    knn_item_user_matrix
  group by
    itemid
)
select
  l.itemid as i,
  r1.R_i,
  r2.knn_i,
  l.other as j,
  r3.R_i as R_j
from
  knn_train l
  JOIN item_matrix r1 ON (l.itemid = r1.i)
  JOIN knn_item_matrix r2 ON (l.itemid = r2.i)
  JOIN item_matrix r3 ON (l.other = r3.i)
;

-- set to the default value
set hive.auto.convert.join=true;

Training

Build a prediction model by SLIM

The train_slim function outputs the nonzero elements of an item-item weight matrix. For item recommendation and rating prediction, this matrix is stored in a table named slim_model.

DROP TABLE slim_model;
CREATE TABLE slim_model as
select
  i, nn, avg(w) as w
from (
  select
    train_slim(i, r_i, knn_i, j, r_j) as (i, nn, w)
  from (
    select * from slim_training_item
    CLUSTER BY i
  ) t1
) t2
group by i, nn
;

Usage of train_slim

You can obtain information about the train_slim function and its arguments by giving the -help option as follows:

select train_slim("-help");
usage: train_slim( int i, map<int, double> r_i, map<int, map<int, double>> topKRatesOfI,
       int j, map<int, double> r_j [, constant string options])
       - Returns row index, column index and non-zero weight value of prediction model
       [-cv_rate <arg>] [-disable_cv] [-help] [-iters <arg>] [-l1 <arg>] [-l2 <arg>]
 -cv_rate,--convergence_rate <arg>   Threshold to determine convergence
                                     [default: 0.005]
 -disable_cv,--disable_cvtest        Whether to disable convergence check
                                     [default: enabled]
 -help                               Show function help
 -iters,--iterations <arg>           The number of iterations for
                                     coordinate descent [default: 30]
 -l1,--l1coefficient <arg>           Coefficient for l1 regularizer
                                     [default: 0.001]
 -l2,--l2coefficient <arg>           Coefficient for l2 regularizer
                                     [default: 0.0005]

Prediction and recommendation

Here, we predict the rating values of the binarized user-item rating matrix of the testing dataset, based on the ratings in the training dataset.

Based on the predicted rating scores, we can recommend the top-k items that each user is likely to rate highly.

Predict unknown ratings of a user-item matrix

Based on the known ratings and the SLIM weight matrix, we predict unknown ratings in the user-item matrix. SLIM predicts the rating of each user-item pair based on the top-$$K$$ items similar to the target item.
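The prediction rule reduces to a weighted sum over item $$i$$'s top-$$K$$ neighbors: $$\hat{r}_{ui} = \sum_{k \in \mathrm{knn}(i)} r_{uk} w_{ki}$$. Here is a toy dictionary-based sketch of that rule; the function and variable names are hypothetical, not Hivemall APIs:

```python
def predict_rating(user_ratings, knn_i, weights_i, mu=0.0):
    """Predict a user's rating for item i from its top-K similar items.

    user_ratings: {itemid: rating} known ratings of the user (training data)
    knn_i: itemids of the top-K items similar to i
    weights_i: {itemid: w} SLIM weights w_ki for target item i
    Falls back to mu when the user has rated none of the neighbors.
    """
    terms = [user_ratings[k] * weights_i.get(k, 0.0)
             for k in knn_i if k in user_ratings]
    return sum(terms) if terms else mu

print(predict_rating({10: 1.0, 20: 1.0}, [10, 20, 30], {10: 0.4, 20: 0.3}))
```

The SQL below implements the same rule: knn_exploded pairs each candidate (user, item) with the user's ratings of the item's neighbors, and the join against slim_model supplies the weights for the weighted sum.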

The predict_pair table represents candidates for recommended user-movie pairs, excluding known ratings in the training dataset.

CREATE OR REPLACE VIEW predict_pair 
as
WITH testing_users as (
  select DISTINCT(userid) as userid from testing
),
training_items as (
  select DISTINCT(itemid) as itemid from training
),
user_items as (
  select
    l.userid,
    r.itemid
  from
    testing_users l
    CROSS JOIN training_items r
)
select
  l.userid,
  l.itemid
from
  user_items l
  LEFT OUTER JOIN training r ON (l.userid=r.userid and l.itemid=r.itemid)
where
  r.itemid IS NULL -- anti join
;
-- optionally set the mean/default value of prediction
set hivevar:mu=0.0;

DROP TABLE predicted;
CREATE TABLE predicted 
as
WITH knn_exploded as (
  select
    l.userid  as u,
    l.itemid as i, -- axis
    r1.other  as k, -- other
    r2.rating as r_uk
  from
    predict_pair l
    LEFT OUTER JOIN knn_train r1
      ON (r1.itemid = l.itemid)
    JOIN training r2
      ON (r2.userid = l.userid and r2.itemid = r1.other)
)
select
  l.u as userid,
  l.i as itemid,
  coalesce(sum(l.r_uk * r.w), ${mu}) as predicted
  -- coalesce(sum(l.r_uk * r.w)) as predicted
from
  knn_exploded l
  LEFT OUTER JOIN slim_model r ON (l.i = r.i and l.k = r.nn)
group by
  l.u, l.i
;

Caution

When $$k$$ is small, the SLIM predicted value may be null. In that case, ${mu} replaces the null value. The mean value of item ratings is a good choice for ${mu}.

Top-$$K$$ item recommendation for each user

Here, we recommend top-3 items for each user based on predicted values.

SET hivevar:k=3;

DROP TABLE IF EXISTS recommend;
CREATE TABLE recommend
as
WITH top_n as (
  select
     each_top_k(${k}, userid, predicted, userid, itemid)
      as (rank, predicted, userid, itemid)
  from (
    select * from predicted
    CLUSTER BY userid
  ) t
)
select
  userid,
  collect_list(itemid) as items
from
  top_n
group by
  userid
;

select * from recommend limit 5;
userid  items
1       [364,594,2081]
2       [2028,3256,589]
3       [260,1291,2791]
4       [1196,1200,1210]
5       [3813,1366,89]
...     ...

Evaluation

Top-$$K$$ ranking measures: Hit-Rate@K, MRR@K, and Precision@K

In this section, Hit-Rate@k, MRR@k, and Precision@k are computed based on recommended items.

Precision@K is a good evaluation measure for $$K$$-fold cross validation.

On the other hand, Hit-Rate and Mean Reciprocal Rank (i.e., Average Reciprocal Hit-Rate) are good evaluation measures for leave-one-out cross validation.

SET hivevar:n=10;

WITH top_k as (
  select
    each_top_k(${n}, userid, predicted, userid, itemid)
      as (rank, predicted, userid, itemid)
  from (
    select * from predicted
    CLUSTER BY userid
  ) t
),
rec_items as (
  select
    userid,
    collect_list(itemid) as items
  from
    top_k
  group by
    userid
),
ground_truth as (
  select
    userid,
    collect_list(itemid) as truth
  from
    testing
  group by
    userid
)
select
  hitrate(l.items, r.truth) as hitrate,
  mrr(l.items, r.truth) as mrr,
  precision_at(l.items, r.truth) as prec
from
  rec_items l
  join ground_truth r on (l.userid=r.userid)
;

Leave-one-out result

hitrate              mrr                  prec
0.21517309922146763  0.09377752536606271  0.021517309922146725

Hit Rate and MRR are similar to the results reported in Table II of the SLIM paper.

$$K$$-fold result

hitrate             mrr                 prec
0.8952775476387739  1.1751514972186057  0.3564871582435789

The precision value is similar to the result in Mendeley's slides.

Ranking measures: MRR

In this example, all recommended items are evaluated using MRR.

WITH rec_items as (
  select
    userid,
    to_ordered_list(itemid, predicted, '-reverse') as items
  from
    predicted
  group by
    userid
),
ground_truth as (
  select
    userid,
    collect_list(itemid) as truth
  from
    testing
  group by
    userid
)
select
  mrr(l.items, r.truth) as mrr
from
  rec_items l
  join ground_truth r on (l.userid=r.userid)
;

Leave-one-out result

mrr
0.10782647321821472

$$K$$-fold result

mrr
0.6179983058881773

This MRR value is similar to the one in Mendeley's slides.