<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
[Field-aware factorization machines](https://dl.acm.org/citation.cfm?id=2959134) (FFM) is a factorization model that was used by the [#1 solution](https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10555) of the Criteo competition.
This page walks you through applying the technique with Hivemall's `train_ffm` and `ffm_predict` UDFs.
<!-- toc -->
> #### Note
> This feature is available in Hivemall v0.5.1 or later.
# Preprocess data and convert into LIBFFM format
Since FFM is a relatively complex factorization model that demands a significant amount of feature engineering, preprocessing the data outside of Hive is a reasonable option.
You can again use the repository **[takuti/criteo-ffm](https://github.com/takuti/criteo-ffm)** cloned in the [data preparation guide](criteo_dataset.md) to preprocess the data as the winning solution did:
```sh
cd criteo-ffm
# create the CSV files `tr.csv` and `te.csv`
make preprocess
```
The `make preprocess` target runs Python scripts originally taken from [guestwalk/kaggle-2014-criteo](https://github.com/guestwalk/kaggle-2014-criteo) and [chenhuang-learn/ffm](https://github.com/chenhuang-learn/ffm).
Eventually, you will obtain the following files in so-called LIBFFM format:
- `tr.ffm` - Labeled training samples
- `tr.sp` - 80% of the labeled training samples randomly picked from `tr.ffm`
- `va.sp` - Remaining 20% of samples for evaluation
- `te.ffm` - Unlabeled test samples
```
<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
.
.
.
```
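As a concrete illustration, a labeled sample with three made-up field/feature indices would be a single line such as:
```
1 0:23:1 1:107:1 2:4098:1
```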
See the [LIBFFM official README](https://github.com/guestwalk/libffm) for details.
To evaluate prediction accuracy at the end of this tutorial, the following sections use `tr.sp` and `va.sp`.
# Insert preprocessed data into tables
Create new tables used by the FFM UDFs:
```sh
# create the target directories first; otherwise `-put` would create plain files
hadoop fs -mkdir -p /criteo/ffm/train /criteo/ffm/test

hadoop fs -put tr.sp /criteo/ffm/train
hadoop fs -put va.sp /criteo/ffm/test
```
```sql
use criteo;
```
```sql
DROP TABLE IF EXISTS train_ffm;
CREATE EXTERNAL TABLE train_ffm (
label int,
-- quantitative features
i1 string,i2 string,i3 string,i4 string,i5 string,i6 string,i7 string,i8 string,i9 string,i10 string,i11 string,i12 string,i13 string,
-- categorical features
c1 string,c2 string,c3 string,c4 string,c5 string,c6 string,c7 string,c8 string,c9 string,c10 string,c11 string,c12 string,c13 string,c14 string,c15 string,c16 string,c17 string,c18 string,c19 string,c20 string,c21 string,c22 string,c23 string,c24 string,c25 string,c26 string
) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/criteo/ffm/train';
```
```sql
DROP TABLE IF EXISTS test_ffm;
CREATE EXTERNAL TABLE test_ffm (
label int,
-- quantitative features
i1 string,i2 string,i3 string,i4 string,i5 string,i6 string,i7 string,i8 string,i9 string,i10 string,i11 string,i12 string,i13 string,
-- categorical features
c1 string,c2 string,c3 string,c4 string,c5 string,c6 string,c7 string,c8 string,c9 string,c10 string,c11 string,c12 string,c13 string,c14 string,c15 string,c16 string,c17 string,c18 string,c19 string,c20 string,c21 string,c22 string,c23 string,c24 string,c25 string,c26 string
) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/criteo/ffm/test';
```
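Before moving on, it can be worth confirming that Hive parses the space-delimited files as expected; the following sanity checks are optional:
```sql
-- each column should hold a single `field:feature:value` token
SELECT label, i1, c1 FROM train_ffm LIMIT 5;

-- row counts should match `wc -l tr.sp` and `wc -l va.sp`
SELECT COUNT(1) FROM train_ffm;
SELECT COUNT(1) FROM test_ffm;
```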
Vectorize the LIBFFM-formatted features with `rowid`:
```sql
DROP TABLE IF EXISTS train_vectorized;
CREATE TABLE train_vectorized AS
SELECT
row_number() OVER () AS rowid,
array(
i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13,
c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26
) AS features,
label
FROM
train_ffm
;
```
```sql
DROP TABLE IF EXISTS test_vectorized;
CREATE TABLE test_vectorized AS
SELECT
row_number() OVER () AS rowid,
array(
i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13,
c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26
) AS features,
label
FROM
test_ffm
;
```
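Each row now holds a 39-element array of `field:feature:value` strings (13 quantitative plus 26 categorical features). An optional peek to confirm the layout; note that Hive arrays are zero-indexed, so `features[13]` corresponds to `c1`:
```sql
SELECT rowid, label, features[0] AS i1, features[13] AS c1
FROM train_vectorized
LIMIT 3;
```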
# Training
```sql
DROP TABLE IF EXISTS criteo.ffm_model;
CREATE TABLE criteo.ffm_model (
model_id int,
i int,
Wi float,
Vi array<float>
);
```
```sql
INSERT OVERWRITE TABLE criteo.ffm_model
SELECT
train_ffm(
features,
label,
'-init_v random -max_init_value 0.5 -classification -iterations 15 -factors 4 -eta 0.2 -optimizer adagrad -lambda 0.00002'
)
FROM (
SELECT
features, label
FROM
criteo.train_vectorized
CLUSTER BY rand(1)
) t
;
```
The third argument of `train_ffm` accepts a variety of options:
```
hive> SELECT train_ffm(array(), 0, '-help');
usage: train_ffm(array<string> x, double y [, const string options]) -
Returns a prediction model [-alpha <arg>] [-auto_stop] [-beta
<arg>] [-c] [-cv_rate <arg>] [-disable_cv] [-enable_norm]
[-enable_wi] [-eps <arg>] [-eta <arg>] [-eta0 <arg>] [-f <arg>]
[-feature_hashing <arg>] [-help] [-init_v <arg>] [-int_feature]
[-iters <arg>] [-l1 <arg>] [-l2 <arg>] [-lambda0 <arg>] [-lambdaV
<arg>] [-lambdaW0 <arg>] [-lambdaWi <arg>] [-max <arg>] [-maxval
<arg>] [-min <arg>] [-min_init_stddev <arg>] [-no_norm]
[-num_fields <arg>] [-opt <arg>] [-p <arg>] [-power_t <arg>] [-seed
<arg>] [-sigma <arg>] [-t <arg>] [-va_ratio <arg>] [-va_threshold
<arg>] [-w0]
-alpha,--alphaFTRL <arg> Alpha value (learning rate)
of
Follow-The-Regularized-Reade
r [default: 0.2]
-auto_stop,--early_stopping Stop at the iteration that
achieves the best validation
on partial samples [default:
OFF]
-beta,--betaFTRL <arg> Beta value (a learning
smoothing parameter) of
Follow-The-Regularized-Reade
r [default: 1.0]
-c,--classification Act as classification
-cv_rate,--convergence_rate <arg> Threshold to determine
convergence [default: 0.005]
-disable_cv,--disable_cvtest Whether to disable
convergence check [default:
OFF]
-enable_norm,--l2norm Enable instance-wise L2
normalization
-enable_wi,--linear_term Include linear term
[default: OFF]
-eps <arg> A constant used in the
denominator of AdaGrad
[default: 1.0]
-eta <arg> The initial learning rate
-eta0 <arg> The initial learning rate
[default 0.1]
-f,--factors <arg> The number of the latent
variables [default: 5]
-feature_hashing <arg> The number of bits for
feature hashing in range
[18,31] [default: -1]. No
feature hashing for -1.
-help Show function help
-init_v <arg> Initialization strategy of
matrix V [random,
gaussian](default: 'random'
for regression / 'gaussian'
for classification)
-int_feature,--feature_as_integer Parse a feature as integer
[default: OFF]
-iters,--iterations <arg> The number of iterations
[default: 10]
-l1,--lambda1 <arg> L1 regularization value of
Follow-The-Regularized-Reade
r that controls model
Sparseness [default: 0.001]
-l2,--lambda2 <arg> L2 regularization value of
Follow-The-Regularized-Reade
r [default: 0.0001]
-lambda0,--lambda <arg> The initial lambda value for
regularization [default:
0.0001]
-lambdaV,--lambda_v <arg> The initial lambda value for
V regularization [default:
0.0001]
-lambdaW0,--lambda_w0 <arg> The initial lambda value for
W0 regularization [default:
0.0001]
-lambdaWi,--lambda_wi <arg> The initial lambda value for
Wi regularization [default:
0.0001]
-max,--max_target <arg> The maximum value of target
variable
-maxval,--max_init_value <arg> The maximum initial value in
the matrix V [default: 0.5]
-min,--min_target <arg> The minimum value of target
variable
-min_init_stddev <arg> The minimum standard
deviation of initial matrix
V [default: 0.1]
-no_norm,--disable_norm Disable instance-wise L2
normalization
-num_fields <arg> The number of fields
[default: 256]
-opt,--optimizer <arg> Gradient Descent optimizer
[default: ftrl, adagrad,
sgd]
-p,--num_features <arg> The size of feature
dimensions [default: -1]
-power_t <arg> The exponent for inverse
scaling learning rate
[default 0.1]
-seed <arg> Seed value [default: -1
(random)]
-sigma <arg> The standard deviation for
initializing V [default:
0.1]
-t,--total_steps <arg> The total number of training
examples
-va_ratio,--validation_ratio <arg> Ratio of training data used
for validation [default:
0.05f]
-va_threshold,--validation_threshold <arg> Threshold to start
validation. At least N
training examples are used
before validation [default:
1000]
-w0,--global_bias Whether to include global
bias term w0 [default: OFF]
```
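As an illustration of these options, the model could instead be trained with the default FTRL optimizer, early stopping, and feature hashing; the values below are untuned and for demonstration only:
```sql
INSERT OVERWRITE TABLE criteo.ffm_model
SELECT
  train_ffm(
    features,
    label,
    '-classification -iterations 15 -factors 4 -optimizer ftrl -auto_stop -feature_hashing 20'
  )
FROM (
  SELECT features, label
  FROM criteo.train_vectorized
  CLUSTER BY rand(1)
) t
;
```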
Note that the debug log shows how the cumulative loss changes over iterations:
```
Iteration #2 | average loss=0.5407147187026483, current cumulative loss=858.114258581103, previous cumulative loss=1682.1101438997914, change rate=0.48985846040280256, #trainingExamples=1587
Iteration #3 | average loss=0.5105058761578417, current cumulative loss=810.1728254624949, previous cumulative loss=858.114258581103, change rate=0.05586835626980435, #trainingExamples=1587
Iteration #4 | average loss=0.49045915570992393, current cumulative loss=778.3586801116493, previous cumulative loss=810.1728254624949, change rate=0.039268344174200345, #trainingExamples=1587
Iteration #5 | average loss=0.4752751205770395, current cumulative loss=754.2616163557617, previous cumulative loss=778.3586801116493, change rate=0.030958816766109738, #trainingExamples=1587
Iteration #6 | average loss=0.46308523885164105, current cumulative loss=734.9162740575543, previous cumulative loss=754.2616163557617, change rate=0.02564805351182389, #trainingExamples=1587
Iteration #7 | average loss=0.4529012395753083, current cumulative loss=718.7542672060143, previous cumulative loss=734.9162740575543, change rate=0.02199163009727323, #trainingExamples=1587
Iteration #8 | average loss=0.44411358945347845, current cumulative loss=704.8082664626703, previous cumulative loss=718.7542672060143, change rate=0.019403016273636577, #trainingExamples=1587
Iteration #9 | average loss=0.4363264696377158, current cumulative loss=692.450107315055, previous cumulative loss=704.8082664626703, change rate=0.017534072365012268, #trainingExamples=1587
Iteration #10 | average loss=0.4292753045556725, current cumulative loss=681.2599083298522, previous cumulative loss=692.450107315055, change rate=0.01616029641267912, #trainingExamples=1587
Iteration #11 | average loss=0.42277515600757143, current cumulative loss=670.9441725840159, previous cumulative loss=681.2599083298522, change rate=0.015142144165104322, #trainingExamples=1587
Iteration #12 | average loss=0.416689617663307, current cumulative loss=661.2864232316682, previous cumulative loss=670.9441725840159, change rate=0.014394266687126348, #trainingExamples=1587
Iteration #13 | average loss=0.4109140194740033, current cumulative loss=652.1205489052433, previous cumulative loss=661.2864232316682, change rate=0.013860672175351585, #trainingExamples=1587
Iteration #14 | average loss=0.4053667348634373, current cumulative loss=643.317008228275, previous cumulative loss=652.1205489052433, change rate=0.013499866998129951, #trainingExamples=1587
Iteration #15 | average loss=0.3999840450561501, current cumulative loss=634.7746795041102, previous cumulative loss=643.317008228275, change rate=0.013278568131893133, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
# Prediction and evaluation
```sql
DROP TABLE IF EXISTS criteo.test_exploded;
CREATE TABLE criteo.test_exploded AS
SELECT
t1.rowid,
t2.i,
t2.j,
t2.Xi,
t2.Xj
FROM
criteo.test_vectorized t1
LATERAL VIEW feature_pairs(t1.features, '-ffm') t2 AS i, j, Xi, Xj
;
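`feature_pairs` with the `-ffm` option expands each test row into the feature pairs `(i, j)` with values `(Xi, Xj)` that `ffm_predict` consumes. Optionally, take a look at a few of the generated pairs:
```sql
SELECT * FROM criteo.test_exploded LIMIT 10;
```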
```
```sql
WITH predicted AS (
SELECT
rowid,
avg(score) AS predicted
FROM (
SELECT
t1.rowid,
p1.model_id,
sigmoid(ffm_predict(p1.Wi, p1.Vi, p2.Vi, t1.Xi, t1.Xj)) AS score
FROM
criteo.test_exploded t1
JOIN criteo.ffm_model p1 ON (p1.i = t1.i) -- at least p1.i = 0 and t1.i = 0 exists
LEFT OUTER JOIN criteo.ffm_model p2 ON (p2.model_id = p1.model_id and p2.i = t1.j)
WHERE
p1.Wi is not null OR p2.Vi is not null
GROUP BY
t1.rowid, p1.model_id
) t
GROUP BY
rowid
)
SELECT
logloss(t1.predicted, t2.label)
FROM
predicted t1
JOIN
criteo.test_vectorized t2
ON t1.rowid = t2.rowid
;
```
> 0.47276208106423234
<br />
> #### Note
> The accuracy varies depending on the random split between `tr.sp` and `va.sp`.

Notice that a logloss of roughly 0.45 to 0.47 is reasonable accuracy compared to the [competition leaderboard](https://www.kaggle.com/c/criteo-display-ad-challenge/leaderboard) and the output of [LIBFFM](https://github.com/guestwalk/libffm).
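Besides logloss, Hivemall's `auc` UDAF offers another view of accuracy. Below is a minimal sketch, assuming the per-row scores from the `predicted` CTE above have been materialized into a hypothetical table `test_predicted(rowid, predicted)`; note that `auc` expects its input sorted by score in descending order:
```sql
SELECT auc(predicted, label) AS auc
FROM (
  SELECT t1.predicted, t2.label
  FROM test_predicted t1
  JOIN criteo.test_vectorized t2 ON t1.rowid = t2.rowid
  ORDER BY t1.predicted DESC -- required by auc()
) t
;
```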