Cross-validation is a model validation technique for assessing how well a prediction model will generalize to an independent data set. This example shows one way to perform k-fold cross-validation to evaluate the prediction performance of matrix factorization.

Caution: Matrix factorization is supported in Hivemall v0.3 or later.

Creating the data set for 10-fold cross-validation

use movielens;

set hivevar:kfold=10;
set hivevar:seed=31;

-- Adding group id (gid) to each training instance
drop table ratings_groupded;
create table ratings_groupded
as
select
  rand_gid2(${kfold}, ${seed}) gid, -- generates group id ranging from 1 to 10
  userid, 
  movieid, 
  rating
from
  ratings
cluster by gid, rand(${seed});
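
The grouping step above can be illustrated outside Hive. The following is a minimal Python sketch, not Hivemall code: `assign_folds` and `split_fold` are hypothetical helper names, the sample rows are made up, and Python's random generator will not reproduce the group ids that `rand_gid2` produces for the same seed. It only shows the idea of tagging each row with a group id from 1 to k and holding out one group at a time.

```python
import random


def assign_folds(rows, kfold=10, seed=31):
    """Attach a pseudo-random group id in 1..kfold to every row."""
    rng = random.Random(seed)
    return [(rng.randint(1, kfold), row) for row in rows]


def split_fold(grouped, test_gid):
    """Hold out one group for testing and train on the remaining groups."""
    train = [row for gid, row in grouped if gid != test_gid]
    test = [row for gid, row in grouped if gid == test_gid]
    return train, test


# Hypothetical (userid, movieid, rating) rows standing in for the ratings table.
ratings = [(u, m, float((u + m) % 5 + 1)) for u in range(1, 21) for m in range(1, 11)]

grouped = assign_folds(ratings)
train, test = split_fold(grouped, test_gid=1)
```

Repeating `split_fold` for each group id from 1 to 10 yields the ten train/test pairs; every row is used for testing exactly once.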

Set training hyperparameters

-- latent factors
set hivevar:factor=10;
-- maximum number of iterations
set hivevar:iters=50;
-- regularization parameter
set hivevar:lambda=0.05;
-- learning rate
set hivevar:eta=0.005;
-- convergence rate (training stops when the change between iterations becomes less than or equal to ${cv_rate})
set hivevar:cv_rate=0.001;

Due to a bug in Hive, do not include comments when issuing these statements in the Hive CLI.

select avg(rating) from ratings;

3.581564453029317

-- mean rating value (setting ${mu} is optional but recommended)
set hivevar:mu=3.581564453029317;

Note that it is not necessary to set an exact value for ${mu}.
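
To make the role of each hyperparameter concrete, here is a minimal Python sketch of matrix factorization trained with plain SGD. It is an illustration only and is not Hivemall's implementation: the function name `train_mf_sketch`, the Gaussian initialization, and the relative-change stopping rule are all assumptions chosen for the example. The parameters mirror `factor`, `iters`, `lambda`, `eta`, `cv_rate`, and `mu` from the settings above.

```python
import random


def train_mf_sketch(train, factor=10, iters=50, lam=0.05, eta=0.005,
                    cv_rate=0.001, mu=3.581564453029317, seed=31):
    """Plain SGD matrix factorization (illustrative, not Hivemall's code)."""
    rng = random.Random(seed)
    P, Q = {}, {}  # user and item latent vectors, created on first use

    def vec(table, key):
        if key not in table:
            table[key] = [rng.gauss(0.0, 0.1) for _ in range(factor)]
        return table[key]

    prev_loss = None
    for _ in range(iters):
        loss = 0.0
        for user, item, rating in train:
            p, q = vec(P, user), vec(Q, item)
            # Prediction is the mean rating plus the latent-factor dot product.
            err = rating - (mu + sum(a * b for a, b in zip(p, q)))
            loss += err * err
            for f in range(factor):
                pf, qf = p[f], q[f]
                p[f] += eta * (err * qf - lam * pf)
                q[f] += eta * (err * pf - lam * qf)
        # Assumed stopping rule: relative change in squared error <= cv_rate.
        if prev_loss is not None and abs(prev_loss - loss) <= cv_rate * prev_loss:
            break
        prev_loss = loss

    def predict(user, item):
        p, q = P.get(user), Q.get(item)
        if p is None or q is None:
            return mu  # unseen user or item: fall back to the mean rating
        return mu + sum(a * b for a, b in zip(p, q))

    return predict


# Hypothetical (userid, movieid, rating) training rows.
sample = [(1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0), (2, 12, 2.0)]
predict = train_mf_sketch(sample)
```

Because `mu` only serves as the baseline that the latent factors correct, a rounded value such as 3.58 would work in this sketch as well.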

SQL generation for 10-fold cross-validation

Run generate_cv.sh to create generate_cv.sql.

Then, issue the SQL queries in generate_cv.sql to get the MAE and RMSE.

0.6695442192077673 (MAE)

0.8502739040257945 (RMSE)
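
For reference, the two metrics can be sketched in a few lines of Python. This is not the content of generate_cv.sql: the per-fold (predicted, actual) pairs are made-up values, and averaging the per-fold metrics (rather than pooling all predictions) is an assumption made for this example.

```python
import math


def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


# Hypothetical (predicted, actual) ratings for two held-out folds.
folds = [
    [(3.5, 4.0), (2.0, 1.0)],
    [(4.5, 5.0), (3.0, 3.0)],
]

# Assumed aggregation: average each metric across the folds.
avg_mae = sum(mae(f) for f in folds) / len(folds)
avg_rmse = sum(rmse(f) for f in folds) / len(folds)
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions, which matches the values reported above.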

We recommend using Tez to run queries that have many stages.