The task is to predict the click-through rate (CTR) of advertisements, i.e., the probability that each ad is clicked.
https://www.kaggle.com/c/kddcup2012-track2
Caution: This example only shows a baseline result. Use token tables and the amplifier to get a better AUC score.
```sql
use kdd12track2;

-- set mapred.max.split.size=134217728; -- [optional] set if OOM caused at mappers on training
-- set mapred.max.split.size=67108864;

select count(1) from training_orcfile;
```
235582879
235582879 rows / 56 mappers ≈ 4206837 rows per mapper
```sql
set hivevar:total_steps=5000000;
-- set mapred.reduce.tasks=64; -- [optional] set the explicit number of reducers to make group-by aggregation faster

drop table lr_model;
create table lr_model as
select
  feature,
  cast(avg(weight) as float) as weight
from (
  select
    logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
    -- logress(features, label) as (feature, weight)
  from training_orcfile
) t
group by feature;

-- set mapred.max.split.size=-1; -- reset to the default value
```
Note: Setting the `-total_steps` option is optional.
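The `avg(weight) ... group by feature` in the training query implements simple model averaging: each mapper trains a model on its own split, and the per-feature weights are then averaged across mappers. A minimal Python sketch of that aggregation, with made-up `(feature, weight)` pairs standing in for the mappers' output:

```python
from collections import defaultdict

# Hypothetical (feature, weight) pairs emitted by the per-mapper models
mapper_outputs = [
    ("ad:123", 0.6), ("ad:123", 0.8),
    ("pos:1", -0.2), ("pos:1", 0.0), ("pos:1", 0.2),
]

# Equivalent of: select feature, avg(weight) ... group by feature
sums = defaultdict(lambda: [0.0, 0])
for feature, weight in mapper_outputs:
    sums[feature][0] += weight  # running sum of weights
    sums[feature][1] += 1       # count of models that saw this feature
model = {f: s / n for f, (s, n) in sums.items()}
```

Features that appear in only some splits are averaged over the models that actually emitted them, exactly as the group-by does.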
```sql
drop table lr_predict;
create table lr_predict
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
as
select
  t.rowid,
  sigmoid(sum(m.weight)) as prob
from
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
group by t.rowid
order by rowid ASC;
```
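The join-and-sum above computes, per row, the dot product of the row's (binary) features with the model weights and squashes it through the sigmoid. A minimal Python sketch of the same computation, with hypothetical feature names and weights:

```python
import math

# Hypothetical learned weights (feature -> weight), as in lr_model
model = {"ad:123": 0.8, "user:45": -0.3, "pos:1": 0.1}

def predict_ctr(features, model):
    """Sum the weights of the row's features (a feature missing from the
    model contributes 0, like the LEFT OUTER JOIN), then apply sigmoid."""
    z = sum(model.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))

prob = predict_ctr(["ad:123", "pos:1", "unknown:9"], model)
```

Since each test-set feature occurs at most once per row with an implicit value of 1, summing matched weights is exactly the linear score of logistic regression.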
You can download scoreKDD.py from the KDD Cup 2012, Track 2 site after logging in to Kaggle.
```sh
hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/lr_predict lr_predict.tbl
gawk -F "\t" '{print $2;}' lr_predict.tbl > lr_predict.submit
pypy scoreKDD.py KDD_Track2_solution.csv lr_predict.submit
```
Note: You can use python instead of pypy.
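The scoring script reports AUC among other metrics. To illustrate what AUC measures (this is not the internals of scoreKDD.py), a minimal pairwise-AUC sketch: the fraction of positive/negative pairs where the positive is scored higher, with ties counted as half.

```python
def auc(labels, scores):
    """Pairwise AUC: probability that a randomly chosen positive example
    is ranked above a randomly chosen negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

score = auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.4])  # -> 1.0 (perfect ranking)
```

Because only the ordering of scores matters, AUC is unchanged by any monotonic transformation of the predictions.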
| Measure | Score |
|---|---|
| AUC | 0.741111 |
| NWMAE | 0.045493 |
| WRMSE | 0.142395 |
```sql
drop table pa_model;
create table pa_model as
select
  feature,
  cast(avg(weight) as float) as weight
from (
  select
    train_pa1a_regr(features, label) as (feature, weight)
  from training_orcfile
) t
group by feature;
```
PA1a is recommended when using PA for regression.
```sql
drop table pa_predict;
create table pa_predict
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
as
select
  t.rowid,
  sum(m.weight) as prob
from
  testing_exploded t
  LEFT OUTER JOIN pa_model m ON (t.feature = m.feature)
group by t.rowid
order by rowid ASC;
```
The "prob" output of PA is not a probability: it can be negative and can be used only for ranking. A higher value means the ad is more likely to be clicked. This is sufficient here because AUC is a measure of ranking accuracy.
```sh
hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/pa_predict pa_predict.tbl
gawk -F "\t" '{print $2;}' pa_predict.tbl > pa_predict.submit
pypy scoreKDD.py KDD_Track2_solution.csv pa_predict.submit
```
| Measure | Score |
|---|---|
| AUC | 0.739722 |
| NWMAE | 0.049582 |
| WRMSE | 0.143698 |