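Hivemall supports a neighborhood-learning scheme using SLIM. SLIM is a representative neighborhood-learning recommendation algorithm introduced in the following paper:

- Xia Ning and George Karypis, SLIM: Sparse Linear Methods for Top-N Recommender Systems, Proc. ICDM, 2011.
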
Caution: SLIM is supported from Hivemall v0.5-rc.1 or later.
The optimization objective of SLIM is similar to Elastic Net (L1+L2 regularization) with additional constraints as follows:
$$
\begin{aligned}
& \underset{w_{j}}{\text{minimize}} && \frac{1}{2}\Vert r_{j} - Rw_{j} \Vert_2^2 + \frac{\beta}{2} \Vert w_{j} \Vert_2^2 + \lambda \Vert w_{j} \Vert_1 \\
& \text{subject to} && w_{j} \geq 0 \\
&&& \mathrm{diag}(W) = 0
\end{aligned}
$$
In this article, each user-movie matrix element is binarized to reduce the number of training samples: only highly rated movies, with a rating of 4 or 5, are kept, and every matrix element with a rating lower than 4 is excluded from training.

    SET hivevar:seed=31;

    DROP TABLE ratings2;
    CREATE TABLE ratings2
    as
    select
      rand(${seed}) as rnd,
      userid,
      movieid as itemid,
      cast(1.0 as float) as rating -- double is also accepted
    from
      ratings
    where
      rating >= 4.
    ;

The `rnd` field is appended to each record so that `ratings2` can be split into training and testing data later.
Binarization is an optional step, and you can use raw rating values to train a SLIM model.
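
As a minimal sketch of the non-binarized variant, the query below keeps the raw rating values. The table name `ratings2_raw` is hypothetical and is used only so that `ratings2` is not overwritten. Whether to keep the `rating >= 4` filter is a modeling choice, and this sketch assumes all ratings are kept.

```sql
-- Sketch: keep the raw rating values instead of binarizing them.
-- `ratings2_raw` is a hypothetical table name used only for illustration.
SET hivevar:seed=31;

DROP TABLE IF EXISTS ratings2_raw;
CREATE TABLE ratings2_raw
as
select
  rand(${seed}) as rnd,
  userid,
  movieid as itemid,
  cast(rating as float) as rating -- raw rating instead of the constant 1.0
from
  ratings
;
```

To use it, the later queries would read from `ratings2_raw` wherever they read from `ratings2`.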
To evaluate a recommendation model, this tutorial uses two types of cross validation: leave-one-out cross validation and $$K$$-hold cross validation.
The former is used in the SLIM paper and the latter is used in Mendeley's slides.
For leave-one-out cross validation, the dataset is split into a training set and a testing set by randomly selecting one of the non-zero entries of each user and placing it into the testing set. In the following query, for each user the movie with the largest `rnd` value (the top-1 row returned by `each_top_k`) is used as test data (the `testing` table), and the remaining movies are used as training data (the `training` table).
When selecting the best SLIM hyperparameters, the evaluation section is run several times with different test data.

    DROP TABLE testing;
    CREATE TABLE testing
    as
    WITH top_k as (
      select
        each_top_k(1, userid, rnd, userid, itemid, rating)
          as (rank, rnd, userid, itemid, rating)
      from (
        select * from ratings2
        CLUSTER BY userid
      ) t
    )
    select
      userid, itemid, rating
    from
      top_k
    ;

    DROP TABLE training;
    CREATE TABLE training
    as
    select
      l.*
    from
      ratings2 l
      LEFT OUTER JOIN testing r ON (l.userid=r.userid and l.itemid=r.itemid)
    where
      r.itemid IS NULL -- anti join
    ;

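
A quick way to check the leave-one-out split is to confirm that the `testing` table holds exactly one row per user. The following is only a sanity-check sketch and is not part of the original workflow. The two counts are expected to be equal.

```sql
-- Sanity check (illustrative): leave-one-out should keep one test row per user
select
  count(1) as num_rows,
  count(distinct userid) as num_users
from
  testing
;
```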
When $$K=2$$, the dataset is divided into a training dataset and a testing dataset of roughly equal size.
To select the best SLIM hyperparameters, first train a SLIM prediction model on the training data and then evaluate it on the testing data.
Optionally, you can swap the training data with the testing data and evaluate again.

    DROP TABLE testing;
    CREATE TABLE testing
    as
    select * from ratings2
    where rnd >= 0.5
    ;

    DROP TABLE training;
    CREATE TABLE training
    as
    select * from ratings2
    where rnd < 0.5
    ;

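
The optional second fold can be sketched by swapping the two predicates. This assumes the same `ratings2` table and the same 0.5 threshold as the split described here. Every downstream table (`knn_train`, `slim_training_item`, `slim_model`, `predicted`) would then need to be rebuilt before evaluating again.

```sql
-- Sketch of the swapped fold: the former testing half becomes the training data
DROP TABLE testing;
CREATE TABLE testing
as
select * from ratings2
where rnd < 0.5
;

DROP TABLE training;
CREATE TABLE training
as
select * from ratings2
where rnd >= 0.5
;
```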
Note
Except for the evaluation section, the following sections show example queries and their results for the $$K$$-hold cross validation case. The queries are also valid for leave-one-out cross validation.
SLIM needs the top-$$k$$ most similar movies for each movie to approximate the user-item matrix. Here, we focus on DIMSUM, an efficient approximate similarity computation scheme.
Because we set `k=20`, the output has up to 20 most similar movies per `itemid`. Varying `k` trades off training and prediction time against the precision of the matrix approximation: a larger `k` gives a better approximation of the raw user-item matrix, but training time and memory usage tend to increase.
As explained in the general introduction of item-based CF, the following query finds the top-$$k$$ nearest-neighbor movies for each movie:

    set hivevar:k=20;

    DROP TABLE knn_train;
    CREATE TABLE knn_train
    as
    with item_magnitude as (
      select
        to_map(j, mag) as mags
      from (
        select
          itemid as j,
          l2_norm(rating) as mag
        from
          training
        group by
          itemid
      ) t0
    ),
    item_features as (
      select
        userid as i,
        collect_list(
          feature(itemid, rating)
        ) as feature_vector
      from
        training
      group by
        userid
    ),
    partial_result as (
      select
        dimsum_mapper(f.feature_vector, m.mags, '-threshold 0.1 -int_feature')
          as (itemid, other, s)
      from
        item_features f
        CROSS JOIN item_magnitude m
    ),
    similarity as (
      select
        itemid,
        other,
        sum(s) as similarity
      from
        partial_result
      group by
        itemid, other
    ),
    topk as (
      select
        each_top_k(
          ${k}, itemid, similarity, -- use top k items
          itemid, other
        ) as (rank, similarity, itemid, other)
      from (
        select * from similarity
        CLUSTER BY itemid
      ) t
    )
    select
      itemid, other, similarity
    from
      topk
    ;

itemid | other | similarity |
---|---|---|
1 | 3114 | 0.28432244 |
1 | 1265 | 0.25180137 |
1 | 2355 | 0.24781825 |
1 | 2396 | 0.24435896 |
1 | 588 | 0.24359442 |
... | ... | ... |
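
To see how the choice of `k` plays out in practice, one can inspect how many neighbors each item actually received in `knn_train`. This is an illustrative check rather than part of the original workflow. The maximum should not exceed `k`, and items with few similar movies may have fewer neighbors.

```sql
-- Illustrative check: distribution of neighbor counts per item in knn_train
select
  min(num_neighbors) as min_neighbors,
  max(num_neighbors) as max_neighbors,
  avg(num_neighbors) as avg_neighbors
from (
  select
    itemid,
    count(1) as num_neighbors
  from
    knn_train
  group by
    itemid
) t
;
```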
Caution
To run the query above, you may need to run the following statements:

    set hive.strict.checks.cartesian.product=false;
    set hive.mapred.mode=nonstrict;

Here, we prepare input tables for SLIM training.
SLIM input consists of the following columns in `slim_training_item`:

- `i`: axis item id.
- `R_i`: the user-rating vector of the axis item $$i$$, expressed as `map<userid, rating>`.
- `knn_i`: top-$$K$$ similar item matrix of item $$i$$; the user-item rating matrix is expressed as `map<userid, map<itemid, rating>>`.
- `j`: an item id in `knn_i`.
- `R_j`: the user-rating vector of the item $$j$$, expressed as `map<userid, rating>`.

The following queries build `item_matrix` and `slim_training_item`:

    DROP TABLE item_matrix;
    CREATE table item_matrix
    as
    select
      itemid as i,
      to_map(userid, rating) as R_i
    from
      training
    group by
      itemid;

    -- Temporarily disable map join because the following query does not work well with map join
    set hive.auto.convert.join=false;
    -- set mapred.reduce.tasks=64;

    -- Create SLIM input features
    DROP TABLE slim_training_item;
    CREATE TABLE slim_training_item
    as
    WITH knn_item_user_matrix as (
      select
        l.itemid,
        r.userid,
        to_map(l.other, r.rating) ratings
      from
        knn_train l
        JOIN training r ON (l.other = r.itemid)
      group by
        l.itemid, r.userid
    ),
    knn_item_matrix as (
      select
        itemid as i,
        to_map(userid, ratings) as KNN_i -- map<userid, map<itemid, rating>>
      from
        knn_item_user_matrix
      group by
        itemid
    )
    select
      l.itemid as i,
      r1.R_i,
      r2.knn_i,
      l.other as j,
      r3.R_i as R_j
    from
      knn_train l
      JOIN item_matrix r1 ON (l.itemid = r1.i)
      JOIN knn_item_matrix r2 ON (l.itemid = r2.i)
      JOIN item_matrix r3 ON (l.other = r3.i)
    ;

    -- set to the default value
    set hive.auto.convert.join=true;

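
Before training, it can help to peek at a few rows of `slim_training_item` and confirm that the map columns are populated. The sketch below uses only the built-in `size` function on the map columns. The output column aliases are made up for illustration.

```sql
-- Illustrative inspection of the SLIM input rows
select
  i,
  j,
  size(R_i) as num_users_i,     -- users who rated item i
  size(knn_i) as num_users_knn, -- users who rated at least one neighbor of i
  size(R_j) as num_users_j      -- users who rated item j
from
  slim_training_item
limit 5
;
```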
The `train_slim` function outputs the nonzero elements of an item-item matrix. For item recommendation or prediction, this matrix is stored in a table named `slim_model`.

    DROP TABLE slim_model;
    CREATE TABLE slim_model
    as
    select
      i, nn, avg(w) as w
    from (
      select
        train_slim(i, r_i, knn_i, j, r_j) as (i, nn, w)
      from (
        select * from slim_training_item
        CLUSTER BY i
      ) t1
    ) t2
    group by
      i, nn
    ;

You can obtain information about the `train_slim` function and its arguments by passing the `-help` option as follows:

    select train_slim("-help");

    usage: train_slim( int i, map<int, double> r_i, map<int, map<int,
           double>> topKRatesOfI, int j, map<int, double> r_j [, constant
           string options]) - Returns row index, column index and non-zero
           weight value of prediction model [-cv_rate <arg>] [-disable_cv]
           [-help] [-iters <arg>] [-l1 <arg>] [-l2 <arg>]
     -cv_rate,--convergence_rate <arg>   Threshold to determine convergence
                                         [default: 0.005]
     -disable_cv,--disable_cvtest        Whether to disable convergence check
                                         [default: enabled]
     -help                               Show function help
     -iters,--iterations <arg>           The number of iterations for
                                         coordinate descent [default: 30]
     -l1,--l1coefficient <arg>           Coefficient for l1 regularizer
                                         [default: 0.001]
     -l2,--l2coefficient <arg>           Coefficient for l2 regularizer
                                         [default: 0.0005]

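
Based on the usage text, options are passed as a constant string in the last argument. The sketch below retrains the model with explicit regularization and iteration settings. The table name `slim_model_tuned` is hypothetical, and the option values are illustrative rather than recommended.

```sql
-- Sketch: pass hyperparameters through the optional options string.
-- `slim_model_tuned` is a hypothetical table name; the values are illustrative only.
DROP TABLE IF EXISTS slim_model_tuned;
CREATE TABLE slim_model_tuned
as
select
  i, nn, avg(w) as w
from (
  select
    train_slim(i, r_i, knn_i, j, r_j, '-l1 0.01 -l2 0.01 -iters 50') as (i, nn, w)
  from (
    select * from slim_training_item
    CLUSTER BY i
  ) t1
) t2
group by
  i, nn
;
```

Keeping each trained model in its own table makes it possible to compare several hyperparameter settings against the same testing data.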
Here, we predict the rating values of the binarized user-item rating matrix of the testing dataset based on the ratings in the training dataset.
Based on the predicted rating scores, we can recommend to each user the top-k items that he or she is likely to rate highly.
Based on the known ratings and the SLIM weight matrix, we predict unknown ratings in the user-item matrix. SLIM predicts the rating of a user-item pair from the user's ratings of the top-$$K$$ similar items.
The `predict_pair` view represents candidate user-movie pairs for recommendation, excluding known ratings in the training dataset.

    CREATE OR REPLACE VIEW predict_pair
    as
    WITH testing_users as (
      select DISTINCT(userid) as userid from testing
    ),
    training_items as (
      select DISTINCT(itemid) as itemid from training
    ),
    user_items as (
      select
        l.userid,
        r.itemid
      from
        testing_users l
        CROSS JOIN training_items r
    )
    select
      l.userid,
      l.itemid
    from
      user_items l
      LEFT OUTER JOIN training r
        ON (l.userid=r.userid and l.itemid=r.itemid)
    where
      r.itemid IS NULL -- anti join
    ;

    -- optionally set the mean/default value of prediction
    set hivevar:mu=0.0;

    DROP TABLE predicted;
    CREATE TABLE predicted
    as
    WITH knn_exploded as (
      select
        l.userid as u,
        l.itemid as i, -- axis
        r1.other as k, -- other
        r2.rating as r_uk
      from
        predict_pair l
        LEFT OUTER JOIN knn_train r1
          ON (r1.itemid = l.itemid)
        JOIN training r2
          ON (r2.userid = l.userid and r2.itemid = r1.other)
    )
    select
      l.u as userid,
      l.i as itemid,
      coalesce(sum(l.r_uk * r.w), ${mu}) as predicted
      -- coalesce(sum(l.r_uk * r.w)) as predicted
    from
      knn_exploded l
      LEFT OUTER JOIN slim_model r ON (l.i = r.i and l.k = r.nn)
    group by
      l.u, l.i
    ;

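
Read as a formula, the prediction query computes roughly the following. The notation is introduced here for illustration only: $$N_i$$ is the set of neighbors of item $$i$$ in `knn_train`, $$I_u$$ is the set of items rated by user $$u$$ in `training`, and $$w_{ik}$$ is the weight stored in the `slim_model` row with `i` $$= i$$ and `nn` $$= k$$.

$$
\hat{r}_{ui} = \sum_{k \in N_i \cap I_u} r_{uk} \, w_{ik}
$$

Neighbors without a matching weight in `slim_model` contribute nothing to the sum. If none of the candidate neighbors has a weight, the sum is `null` and the query falls back to the default value `${mu}`. Pairs for which the user has rated no neighbor of the item do not appear in `predicted` at all, because of the inner join with `training`.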
Caution
When $$k$$ is small, the SLIM predicted value may be `null`. In that case, `$mu` replaces the `null` value. The mean value of item ratings is a good choice for `$mu`.
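
One possible way to obtain such a mean is shown below. It assumes that a single global mean over the training ratings is acceptable, whereas a per-item mean would need a `group by itemid`. With the binarized ratings used in this article every training rating is 1.0, so the result is simply 1.0. The query is more informative when raw rating values are used.

```sql
-- Sketch: global mean of the training ratings as a candidate for mu
select
  avg(rating) as mu
from
  training
;
```

The resulting value can then be assigned with `set hivevar:mu=...;` before creating the `predicted` table.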
Here, we recommend top-3 items for each user based on predicted values.

    SET hivevar:k=3;

    DROP TABLE IF EXISTS recommend;
    CREATE TABLE recommend
    as
    WITH top_n as (
      select
        each_top_k(${k}, userid, predicted, userid, itemid)
          as (rank, predicted, userid, itemid)
      from (
        select * from predicted
        CLUSTER BY userid
      ) t
    )
    select
      userid,
      collect_list(itemid) as items
    from
      top_n
    group by
      userid
    ;

    select * from recommend limit 5;

userid | items |
---|---|
1 | [364,594,2081] |
2 | [2028,3256,589] |
3 | [260,1291,2791] |
4 | [1196,1200,1210] |
5 | [3813,1366,89] |
... | ... |
In this section, `Hit-Rate@k`, `MRR@k`, and `Precision@k` are computed based on the recommended items.
`Precision@K` is a good evaluation measure for $$K$$-hold cross validation.
On the other hand, `Hit-Rate` and `Mean Reciprocal Rank` (i.e., Average Reciprocal Hit-Rate) are good evaluation measures for leave-one-out cross validation.
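
For reference, the commonly used definitions of these measures in the leave-one-out setting can be written as follows. This is a sketch under stated assumptions: $$U$$ is the set of evaluated users, $$\mathrm{rank}_u$$ is the position of the single held-out item of user $$u$$ in that user's top-$$k$$ list (treated as infinite when the item is absent), and $$\mathrm{top}_k(u)$$ and $$T_u$$ are the recommended and held-out item sets. Hivemall's `hitrate`, `mrr`, and `precision_at` functions may differ from these textbook forms in their details.

$$
\begin{aligned}
\text{Hit-Rate@}k &= \frac{1}{|U|} \sum_{u \in U} \mathbb{1}[\mathrm{rank}_u \leq k] \\
\text{MRR@}k &= \frac{1}{|U|} \sum_{u \in U} \frac{\mathbb{1}[\mathrm{rank}_u \leq k]}{\mathrm{rank}_u} \\
\text{Precision@}k &= \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{top}_k(u) \cap T_u|}{k}
\end{aligned}
$$

With one held-out item per user, Precision@$$k$$ under these definitions equals Hit-Rate@$$k$$ divided by $$k$$.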

    SET hivevar:n=10;

    WITH top_k as (
      select
        each_top_k(${n}, userid, predicted, userid, itemid)
          as (rank, predicted, userid, itemid)
      from (
        select * from predicted
        CLUSTER BY userid
      ) t
    ),
    rec_items as (
      select
        userid,
        collect_list(itemid) as items
      from
        top_k
      group by
        userid
    ),
    ground_truth as (
      select
        userid,
        collect_list(itemid) as truth
      from
        testing
      group by
        userid
    )
    select
      hitrate(l.items, r.truth) as hitrate,
      mrr(l.items, r.truth) as mrr,
      precision_at(l.items, r.truth) as prec
    from
      rec_items l
      join ground_truth r on (l.userid=r.userid)
    ;

hitrate | mrr | prec |
---|---|---|
0.21517309922146763 | 0.09377752536606271 | 0.021517309922146725 |

With leave-one-out cross validation, the Hit Rate and MRR values are similar to those in Table II of the SLIM paper.

hitrate | mrr | prec |
---|---|---|
0.8952775476387739 | 1.1751514972186057 | 0.3564871582435789 |

With $$K$$-hold cross validation, the precision value is similar to the result in Mendeley's slides.
In this example, the whole list of recommended items, rather than only the top-k, is evaluated using MRR.

    WITH rec_items as (
      select
        userid,
        to_ordered_list(itemid, predicted, '-reverse') as items
      from
        predicted
      group by
        userid
    ),
    ground_truth as (
      select
        userid,
        collect_list(itemid) as truth
      from
        testing
      group by
        userid
    )
    select
      mrr(l.items, r.truth) as mrr
    from
      rec_items l
      join ground_truth r on (l.userid=r.userid)
    ;

Leave-one-out result:

mrr |
---|
0.10782647321821472 |

$$K$$-hold result:

mrr |
---|
0.6179983058881773 |

The $$K$$-hold MRR value is similar to the one in Mendeley's slides.