[Cross-validation](http://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation) is a model validation technique for assessing how a prediction model will generalize to an independent data set. This example shows one way to perform k-fold cross-validation to evaluate prediction performance.
Caution: Matrix factorization is supported in Hivemall v0.3 or later.

    use movielens;

    set hivevar:kfold=10;
    set hivevar:seed=31;

    -- Adding group id (gid) to each training instance
    drop table ratings_groupded;
    create table ratings_groupded
    as
    select
      rand_gid2(${kfold}, ${seed}) gid, -- generates group id ranging from 1 to 10
      userid,
      movieid,
      rating
    from
      ratings
    cluster by gid, rand(${seed});

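
The group id is what makes the k-fold split possible: for fold k, rows with `gid = k` are held out for testing and all other rows are used for training. The following is a minimal sketch of that idea for fold 1, using only standard Hive statements. The view names `training_fold1` and `testing_fold1` are hypothetical and chosen for illustration; the queries produced by generate_cv.sh may be organized differently.

```sql
-- Sanity check: each of the ${kfold} groups should hold roughly the same number of rows
select gid, count(1) as cnt
from ratings_groupded
group by gid;

-- Sketch only: training_fold1 / testing_fold1 are hypothetical names.
-- Fold 1 is held out for testing; the remaining folds are used for training.
drop view if exists training_fold1;
create view training_fold1 as
select userid, movieid, rating
from ratings_groupded
where gid != 1;

drop view if exists testing_fold1;
create view testing_fold1 as
select userid, movieid, rating
from ratings_groupded
where gid = 1;
```

Repeating the same pattern for each gid from 1 to `${kfold}` yields one train/test pair per fold.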

    -- latent factors
    set hivevar:factor=10;
    -- maximum number of iterations
    set hivevar:iters=50;
    -- regularization parameter
    set hivevar:lambda=0.05;
    -- learning rate
    set hivevar:eta=0.005;
    -- convergence rate (if the change between iterations is less than or equal to ${cv_rate}, training will stop)
    set hivevar:cv_rate=0.001;

Due to a bug in Hive, do not include comments when issuing statements in the Hive CLI.
select avg(rating) from ratings;
3.581564453029317

    -- mean rating value (Optional but recommended to set ${mu})
    set hivevar:mu=3.581564453029317;

Note that it is not necessary to set an exact value for ${mu}.
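
One way to see why an approximate value suffices, assuming the model follows the common biased matrix factorization form (an assumption; the prediction rule is not spelled out here, and the symbols $b_u$, $b_i$, $p_u$, $q_i$ are introduced only for illustration):

```latex
\hat{r}_{ui} = \mu + b_u + b_i + p_u^{\top} q_i
```

Under that form, $\mu$ acts as a global offset, while the user bias $b_u$ and item bias $b_i$ are learned during training. A small error in $\mu$ can largely be absorbed by the learned biases, so a rounded mean such as 3.58 would be expected to behave much like the exact average.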
Run generate_cv.sh to create generate_cv.sql.
Then, issue the SQL queries in generate_cv.sql to get MAE/RMSE.
0.6695442192077673 (MAE)
0.8502739040257945 (RMSE)
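
For reference, MAE is the mean of the absolute prediction errors and RMSE is the square root of the mean squared error. A minimal sketch of computing both with standard Hive functions is shown here. It assumes a hypothetical table `predictions` with one row per held-out rating and columns `predicted` and `actual`; these names are not taken from generate_cv.sql.

```sql
-- Sketch only: `predictions`, `predicted`, and `actual` are hypothetical names.
select
  avg(abs(predicted - actual)) as mae,
  sqrt(avg(pow(predicted - actual, 2))) as rmse
from predictions;
```

RMSE penalizes large errors more heavily than MAE, so for the same predictions RMSE is never smaller than MAE, as in the figures above.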
We recommend using Tez to run queries that have many stages.