Field-aware factorization machines (FFM) are a factorization model used by the #1 solution to the Criteo competition.
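For reference, the core of FFM scores every pair of features with field-specific latent vectors. A minimal statement of the interaction term, following the LIBFFM formulation (Hivemall can additionally include a linear term and a global bias via the `-enable_wi` and `-w0` options shown later):

$$
\phi(\mathbf{w}, \mathbf{x}) = \sum_{j_1 < j_2} \langle \mathbf{w}_{j_1, f_{j_2}}, \mathbf{w}_{j_2, f_{j_1}} \rangle \, x_{j_1} x_{j_2}
$$

where $f_j$ denotes the field of feature $j$, so each feature $j_1$ interacts with $j_2$ through the latent vector it keeps for $j_2$'s field.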
This page guides you through trying the factorization technique with Hivemall's `train_ffm` and `ffm_predict` UDFs.
Note
This feature is available in Hivemall v0.5.1 or later.
Since FFM is a relatively complex factorization model that requires a significant amount of feature engineering, preprocessing the data outside of Hive is a reasonable option.
You can again use the takuti/criteo-ffm repository, cloned in the data preparation guide, to preprocess the data as the winning solution did:
```sh
cd criteo-ffm

# create the CSV files `tr.csv` and `te.csv`
make preprocess
```
The `make preprocess` task executes Python scripts originally taken from guestwalk/kaggle-2014-criteo and chenhuang-learn/ffm.
Eventually, you will obtain the following files in so-called LIBFFM format:
- `tr.ffm` - Labeled training samples
- `tr.sp` - 80% of the labeled training samples, randomly picked from `tr.ffm`
- `va.sp` - The remaining 20% of the samples, used for evaluation
- `te.ffm` - Unlabeled test samples

```
<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
.
.
.
```
See the LIBFFM official README for details.
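For instance, a labeled positive sample with three active features could look like the following (the field and feature indices and the values here are hypothetical, purely to illustrate the format):

```
1 0:154:1 5:1063:1 21:30671:1
```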
In order to evaluate the accuracy of prediction at the end of this tutorial, later sections use `tr.sp` and `va.sp`.
Upload the preprocessed files to HDFS, and create new tables used by the FFM UDFs:
```sh
hadoop fs -put tr.sp /criteo/ffm/train
hadoop fs -put va.sp /criteo/ffm/test
```
```sql
use criteo;

DROP TABLE IF EXISTS train_ffm;
CREATE EXTERNAL TABLE train_ffm (
  label int,
  -- quantitative features
  i1 string, i2 string, i3 string, i4 string, i5 string, i6 string, i7 string,
  i8 string, i9 string, i10 string, i11 string, i12 string, i13 string,
  -- categorical features
  c1 string, c2 string, c3 string, c4 string, c5 string, c6 string, c7 string,
  c8 string, c9 string, c10 string, c11 string, c12 string, c13 string,
  c14 string, c15 string, c16 string, c17 string, c18 string, c19 string,
  c20 string, c21 string, c22 string, c23 string, c24 string, c25 string,
  c26 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/criteo/ffm/train';

DROP TABLE IF EXISTS test_ffm;
CREATE EXTERNAL TABLE test_ffm (
  label int,
  -- quantitative features
  i1 string, i2 string, i3 string, i4 string, i5 string, i6 string, i7 string,
  i8 string, i9 string, i10 string, i11 string, i12 string, i13 string,
  -- categorical features
  c1 string, c2 string, c3 string, c4 string, c5 string, c6 string, c7 string,
  c8 string, c9 string, c10 string, c11 string, c12 string, c13 string,
  c14 string, c15 string, c16 string, c17 string, c18 string, c19 string,
  c20 string, c21 string, c22 string, c23 string, c24 string, c25 string,
  c26 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/criteo/ffm/test';
```
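Before moving on, you may want to confirm that the space-delimited files are parsed as expected. A minimal sanity check (an ad-hoc query added here for illustration, not part of the original steps):

```sql
-- Peek at a few parsed rows; each column should hold one
-- `<field>:<feature>:<value>` triple in LIBFFM format.
SELECT label, i1, i2, c1, c2
FROM train_ffm
LIMIT 5;
```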
Vectorize the LIBFFM-formatted features with a `rowid`:
```sql
DROP TABLE IF EXISTS train_vectorized;
CREATE TABLE train_vectorized AS
SELECT
  row_number() OVER () AS rowid,
  array(
    i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13,
    c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13,
    c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26
  ) AS features,
  label
FROM
  train_ffm
;

DROP TABLE IF EXISTS test_vectorized;
CREATE TABLE test_vectorized AS
SELECT
  row_number() OVER () AS rowid,
  array(
    i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13,
    c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13,
    c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26
  ) AS features,
  label
FROM
  test_ffm
;
```
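Each `features` array should now contain 39 elements (13 quantitative + 26 categorical). A quick check with Hive's built-in `size` function (again an ad-hoc query for illustration):

```sql
-- Every row should report 39 features if vectorization succeeded.
SELECT rowid, size(features) AS num_features, label
FROM train_vectorized
LIMIT 5;
```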
```sql
DROP TABLE IF EXISTS criteo.ffm_model;
CREATE TABLE criteo.ffm_model (
  model_id int,
  i int,
  Wi float,
  Vi array<float>
);
```
```sql
INSERT OVERWRITE TABLE criteo.ffm_model
SELECT train_ffm(
  features, label,
  '-init_v random -max_init_value 0.5 -classification -iterations 15 -factors 4 -eta 0.2 -optimizer adagrad -lambda 0.00002'
)
FROM (
  SELECT features, label
  FROM criteo.train_vectorized
  CLUSTER BY rand(1)
) t
;
```
The third argument of `train_ffm` accepts a variety of options:
```
hive> SELECT train_ffm(array(), 0, '-help');

usage: train_ffm(array<string> x, double y [, const string options]) -
       Returns a prediction model
       [-alpha <arg>] [-auto_stop] [-beta <arg>] [-c] [-cv_rate <arg>]
       [-disable_cv] [-enable_norm] [-enable_wi] [-eps <arg>] [-eta <arg>]
       [-eta0 <arg>] [-f <arg>] [-feature_hashing <arg>] [-help]
       [-init_v <arg>] [-int_feature] [-iters <arg>] [-l1 <arg>] [-l2 <arg>]
       [-lambda0 <arg>] [-lambdaV <arg>] [-lambdaW0 <arg>] [-lambdaWi <arg>]
       [-max <arg>] [-maxval <arg>] [-min <arg>] [-min_init_stddev <arg>]
       [-no_norm] [-num_fields <arg>] [-opt <arg>] [-p <arg>] [-power_t <arg>]
       [-seed <arg>] [-sigma <arg>] [-t <arg>] [-va_ratio <arg>]
       [-va_threshold <arg>] [-w0]
 -alpha,--alphaFTRL <arg>                     Alpha value (learning rate) of Follow-The-Regularized-Leader [default: 0.2]
 -auto_stop,--early_stopping                  Stop at the iteration that achieves the best validation on partial samples [default: OFF]
 -beta,--betaFTRL <arg>                       Beta value (a learning smoothing parameter) of Follow-The-Regularized-Leader [default: 1.0]
 -c,--classification                          Act as classification
 -cv_rate,--convergence_rate <arg>            Threshold to determine convergence [default: 0.005]
 -disable_cv,--disable_cvtest                 Whether to disable convergence check [default: OFF]
 -enable_norm,--l2norm                        Enable instance-wise L2 normalization
 -enable_wi,--linear_term                     Include linear term [default: OFF]
 -eps <arg>                                   A constant used in the denominator of AdaGrad [default: 1.0]
 -eta <arg>                                   The initial learning rate
 -eta0 <arg>                                  The initial learning rate [default 0.1]
 -f,--factors <arg>                           The number of the latent variables [default: 5]
 -feature_hashing <arg>                       The number of bits for feature hashing in range [18,31] [default: -1]. No feature hashing for -1.
 -help                                        Show function help
 -init_v <arg>                                Initialization strategy of matrix V [random, gaussian] (default: 'random' for regression / 'gaussian' for classification)
 -int_feature,--feature_as_integer            Parse a feature as integer [default: OFF]
 -iters,--iterations <arg>                    The number of iterations [default: 10]
 -l1,--lambda1 <arg>                          L1 regularization value of Follow-The-Regularized-Leader that controls model sparseness [default: 0.001]
 -l2,--lambda2 <arg>                          L2 regularization value of Follow-The-Regularized-Leader [default: 0.0001]
 -lambda0,--lambda <arg>                      The initial lambda value for regularization [default: 0.0001]
 -lambdaV,--lambda_v <arg>                    The initial lambda value for V regularization [default: 0.0001]
 -lambdaW0,--lambda_w0 <arg>                  The initial lambda value for W0 regularization [default: 0.0001]
 -lambdaWi,--lambda_wi <arg>                  The initial lambda value for Wi regularization [default: 0.0001]
 -max,--max_target <arg>                      The maximum value of target variable
 -maxval,--max_init_value <arg>               The maximum initial value in the matrix V [default: 0.5]
 -min,--min_target <arg>                      The minimum value of target variable
 -min_init_stddev <arg>                       The minimum standard deviation of initial matrix V [default: 0.1]
 -no_norm,--disable_norm                      Disable instance-wise L2 normalization
 -num_fields <arg>                            The number of fields [default: 256]
 -opt,--optimizer <arg>                       Gradient Descent optimizer [default: ftrl, adagrad, sgd]
 -p,--num_features <arg>                      The size of feature dimensions [default: -1]
 -power_t <arg>                               The exponent for inverse scaling learning rate [default 0.1]
 -seed <arg>                                  Seed value [default: -1 (random)]
 -sigma <arg>                                 The standard deviation for initializing V [default: 0.1]
 -t,--total_steps <arg>                       The total number of training examples
 -va_ratio,--validation_ratio <arg>           Ratio of training data used for validation [default: 0.05f]
 -va_threshold,--validation_threshold <arg>   Threshold to start validation. At least N training examples are used before validation [default: 1000]
 -w0,--global_bias                            Whether to include global bias term w0 [default: OFF]
```
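Since `ftrl` is the default optimizer, you could equally train with FTRL and early stopping instead of AdaGrad. The following is an illustrative sketch only; the `-alpha`, `-beta`, `-l1`, and `-l2` values are untuned assumptions, not recommended settings:

```sql
-- Alternative training run with the default FTRL optimizer and early
-- stopping; the hyperparameter values below are illustrative, not tuned.
INSERT OVERWRITE TABLE criteo.ffm_model
SELECT train_ffm(
  features, label,
  '-classification -iterations 15 -factors 4 -optimizer ftrl -alpha 0.2 -beta 1.0 -l1 0.001 -l2 0.0001 -auto_stop'
)
FROM (
  SELECT features, label
  FROM criteo.train_vectorized
  CLUSTER BY rand(1)
) t
;
```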
Note that the debug log shows how the cumulative loss changes over the iterations, as follows:
```
Iteration #2 | average loss=0.5407147187026483, current cumulative loss=858.114258581103, previous cumulative loss=1682.1101438997914, change rate=0.48985846040280256, #trainingExamples=1587
Iteration #3 | average loss=0.5105058761578417, current cumulative loss=810.1728254624949, previous cumulative loss=858.114258581103, change rate=0.05586835626980435, #trainingExamples=1587
Iteration #4 | average loss=0.49045915570992393, current cumulative loss=778.3586801116493, previous cumulative loss=810.1728254624949, change rate=0.039268344174200345, #trainingExamples=1587
Iteration #5 | average loss=0.4752751205770395, current cumulative loss=754.2616163557617, previous cumulative loss=778.3586801116493, change rate=0.030958816766109738, #trainingExamples=1587
Iteration #6 | average loss=0.46308523885164105, current cumulative loss=734.9162740575543, previous cumulative loss=754.2616163557617, change rate=0.02564805351182389, #trainingExamples=1587
Iteration #7 | average loss=0.4529012395753083, current cumulative loss=718.7542672060143, previous cumulative loss=734.9162740575543, change rate=0.02199163009727323, #trainingExamples=1587
Iteration #8 | average loss=0.44411358945347845, current cumulative loss=704.8082664626703, previous cumulative loss=718.7542672060143, change rate=0.019403016273636577, #trainingExamples=1587
Iteration #9 | average loss=0.4363264696377158, current cumulative loss=692.450107315055, previous cumulative loss=704.8082664626703, change rate=0.017534072365012268, #trainingExamples=1587
Iteration #10 | average loss=0.4292753045556725, current cumulative loss=681.2599083298522, previous cumulative loss=692.450107315055, change rate=0.01616029641267912, #trainingExamples=1587
Iteration #11 | average loss=0.42277515600757143, current cumulative loss=670.9441725840159, previous cumulative loss=681.2599083298522, change rate=0.015142144165104322, #trainingExamples=1587
Iteration #12 | average loss=0.416689617663307, current cumulative loss=661.2864232316682, previous cumulative loss=670.9441725840159, change rate=0.014394266687126348, #trainingExamples=1587
Iteration #13 | average loss=0.4109140194740033, current cumulative loss=652.1205489052433, previous cumulative loss=661.2864232316682, change rate=0.013860672175351585, #trainingExamples=1587
Iteration #14 | average loss=0.4053667348634373, current cumulative loss=643.317008228275, previous cumulative loss=652.1205489052433, change rate=0.013499866998129951, #trainingExamples=1587
Iteration #15 | average loss=0.3999840450561501, current cumulative loss=634.7746795041102, previous cumulative loss=643.317008228275, change rate=0.013278568131893133, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
Next, explode the test features into pairwise feature combinations with the `feature_pairs` UDF:

```sql
DROP TABLE IF EXISTS criteo.test_exploded;
CREATE TABLE criteo.test_exploded AS
SELECT
  t1.rowid,
  t2.i, t2.j, t2.Xi, t2.Xj
FROM
  criteo.test_vectorized t1
  LATERAL VIEW feature_pairs(t1.features, '-ffm') t2 AS i, j, Xi, Xj
;
```
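Optionally, peek at a few exploded rows to see the `(i, j, Xi, Xj)` feature pairs that `ffm_predict` will consume (an ad-hoc query for illustration):

```sql
SELECT * FROM criteo.test_exploded LIMIT 5;
```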
Finally, aggregate `ffm_predict` scores per row and evaluate the predictions with `logloss`:

```sql
WITH predicted AS (
  SELECT
    rowid,
    avg(score) AS predicted
  FROM (
    SELECT
      t1.rowid,
      p1.model_id,
      sigmoid(ffm_predict(p1.Wi, p1.Vi, p2.Vi, t1.Xi, t1.Xj)) AS score
    FROM
      criteo.test_exploded t1
      JOIN criteo.ffm_model p1 ON (p1.i = t1.i) -- at least p1.i = 0 and t1.i = 0 exists
      LEFT OUTER JOIN criteo.ffm_model p2
        ON (p2.model_id = p1.model_id AND p2.i = t1.j)
    WHERE
      p1.Wi IS NOT NULL OR p2.Vi IS NOT NULL
    GROUP BY
      t1.rowid, p1.model_id
  ) t
  GROUP BY rowid
)
SELECT logloss(t1.predicted, t2.label)
FROM predicted t1
JOIN criteo.test_vectorized t2 ON t1.rowid = t2.rowid
;
```
```
0.47276208106423234
```
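If you also want to keep the per-row predictions for inspection, a sketch that materializes the same scores into a table before evaluation could look as follows; the table name `test_predicted` is arbitrary, and the query simply reuses the join logic of the CTE above:

```sql
-- Materialize per-row predicted probabilities (same logic as the CTE above).
DROP TABLE IF EXISTS criteo.test_predicted;
CREATE TABLE criteo.test_predicted AS
SELECT
  rowid,
  avg(score) AS predicted
FROM (
  SELECT
    t1.rowid,
    p1.model_id,
    sigmoid(ffm_predict(p1.Wi, p1.Vi, p2.Vi, t1.Xi, t1.Xj)) AS score
  FROM
    criteo.test_exploded t1
    JOIN criteo.ffm_model p1 ON (p1.i = t1.i)
    LEFT OUTER JOIN criteo.ffm_model p2
      ON (p2.model_id = p1.model_id AND p2.i = t1.j)
  WHERE
    p1.Wi IS NOT NULL OR p2.Vi IS NOT NULL
  GROUP BY
    t1.rowid, p1.model_id
) t
GROUP BY rowid
;
```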
Note
The accuracy varies depending on the random split of `tr.sp` and `va.sp`.
Notice that a LogLoss around 0.45 is a reasonable accuracy compared to the competition leaderboard and the output from LIBFFM.