In this tutorial, we build a binary classification model using XGBoost.
For the feature vector, train_xgboost takes a sparse vector format (array<string>) or a dense vector format (array<double>). In a sparse feature vector, each feature takes the LIBSVM format:
feature ::= <index>:<weight>
index   ::= <non-negative INT> (e.g., 0, 1, 2, ...)
weight  ::= <DOUBLE>
Note
Unlike the original LIBSVM format, a feature vector does not need to be sorted in ascending order of feature index.
The target label format for binary classification follows the rule below. Please also refer to the XGBoost documentation.
The label must be an INT-typed column whose values are positive (+1) or negative (-1) as follows:
<label> ::= 1 | -1
Alternatively, you can use the following format that represents 1 for a positive example and 0 for a negative example:
<label> ::= 0 | 1
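For illustration, a minimal sketch of a training table in the sparse format is shown below; the table name toy_train and its rows are hypothetical, while the news20b_train table used later in this tutorial stores its features in the same way:

-- hypothetical training table in the sparse format:
-- each feature is "<index>:<weight>" and the label is +1 or -1
create table toy_train as
select features, label
from (
  select array('0:1.0', '3:0.5', '7:2.0') as features, 1 as label
  union all
  select array('1:0.3', '3:1.5') as features, -1 as label
) t;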
You can find the hyperparameters and their default settings by running the following query:
select train_xgboost();
usage: train_xgboost(array<string|double> features, int|double target [,
string options]) - Returns a relation consists of <string model_id,
array<string> pred_model> [-alpha <arg>] [-base_score <arg>]
[-booster <arg>] [-colsample_bylevel <arg>] [-colsample_bynode
<arg>] [-colsample_bytree <arg>] [-disable_default_eval_metric
<arg>] [-eta <arg>] [-eval_metric <arg>] [-feature_selector <arg>]
[-gamma <arg>] [-grow_policy <arg>] [-lambda <arg>] [-lambda_bias
<arg>] [-max_bin <arg>] [-max_delta_step <arg>] [-max_depth <arg>]
[-max_leaves <arg>] [-maximize_evaluation_metrics <arg>]
[-min_child_weight <arg>] [-normalize_type <arg>] [-num_class
<arg>] [-num_early_stopping_rounds <arg>] [-num_feature <arg>]
[-num_parallel_tree <arg>] [-num_pbuffer <arg>] [-num_round <arg>]
[-objective <arg>] [-one_drop <arg>] [-process_type <arg>]
[-rate_drop <arg>] [-refresh_leaf <arg>] [-sample_type <arg>]
[-scale_pos_weight <arg>] [-seed <arg>] [-silent <arg>]
[-sketch_eps <arg>] [-skip_drop <arg>] [-subsample <arg>] [-top_k
<arg>] [-tree_method <arg>] [-tweedie_variance_power <arg>]
[-updater <arg>] [-validation_ratio <arg>] [-verbosity <arg>]
-alpha,--reg_alpha <arg> L1 regularization term on weights.
Increasing this value will make
model more conservative. [default:
0.0]
-base_score <arg> Initial prediction score of all
instances, global bias [default:
0.5]
-booster <arg> Set a booster to use, gbtree or
gblinear or dart. [default: gbtree]
-colsample_bylevel <arg> Subsample ratio of columns for each
level [default: 1.0]
-colsample_bynode <arg> Subsample ratio of columns for each
node [default: 1.0]
-colsample_bytree <arg> Subsample ratio of columns when
constructing each tree [default:
1.0]
-disable_default_eval_metric <arg> Flag to disable default metric. Set
to >0 to disable. [default: 0]
-eta,--learning_rate <arg> Step size shrinkage used in update
to prevent overfitting [default:
0.3]
-eval_metric <arg> Evaluation metrics for validation
data. A default metric is assigned
according to the objective:
- rmse: for regression
- error: for classification
- map: for ranking
For a list of valid inputs, see
XGBoost Parameters.
-feature_selector <arg> Feature selection and ordering
method. [Choices: cyclic (default),
shuffle, random, greedy, thrifty]
-gamma,--min_split_loss <arg> Minimum loss reduction required to
make a further partition on a leaf
node of the tree. [default: 0.0]
-grow_policy <arg> Controls a way new nodes are added
to the tree. Currently supported
only if tree_method is set to hist.
[default: depthwise, Choices:
depthwise, lossguide]
-lambda,--reg_lambda <arg> L2 regularization term on weights.
Increasing this value will make
model more conservative. [default:
1.0 for gbtree, 0.0 for gblinear]
-lambda_bias <arg> L2 regularization term on bias
[default: 0.0]
-max_bin <arg> Maximum number of discrete bins to
bucket continuous features. Only
used if tree_method is set to hist.
[default: 256]
-max_delta_step <arg> Maximum delta step we allow each
tree's weight estimation to be
[default: 0]
-max_depth <arg> Max depth of decision tree [default:
6]
-max_leaves <arg> Maximum number of nodes to be added.
Only relevant when
grow_policy=lossguide is set.
[default: 0]
-maximize_evaluation_metrics <arg> Maximize evaluation metrics
[default: false]
-min_child_weight <arg> Minimum sum of instance weight
(hessian) needed in a child
[default: 1.0]
-normalize_type <arg> Type of normalization algorithm.
[Choices: tree (default), forest]
-num_class <arg> Number of classes to classify
-num_early_stopping_rounds <arg> Minimum rounds required for early
stopping [default: 0]
-num_feature <arg> Feature dimension used in boosting
[default: set automatically by
xgboost]
-num_parallel_tree <arg> Number of parallel trees constructed
during each iteration. This option
is used to support boosted random
forest. [default: 1]
-num_pbuffer <arg> Size of prediction buffer [default:
set automatically by xgboost]
-num_round,--iters <arg> Number of boosting iterations
[default: 10]
-objective <arg> Specifies the learning task and the
corresponding learning objective.
Examples: reg:linear, reg:logistic,
multi:softmax. For a full list of
valid inputs, refer to XGBoost
Parameters. [default: reg:linear]
-one_drop <arg> When this flag is enabled, at least
one tree is always dropped during
the dropout. 0 or 1. [default: 0]
-process_type <arg> A type of boosting process to run.
[Choices: default, update]
-rate_drop <arg> Dropout rate in range [0.0, 1.0].
[default: 0.0]
-refresh_leaf <arg> This is a parameter of the refresh
updater plugin. When this flag is 1,
tree leafs as well as tree nodes’
stats are updated. When it is 0,
only node stats are updated.
[default: 1]
-sample_type <arg> Type of sampling algorithm.
[Choices: uniform (default),
weighted]
-scale_pos_weight <arg> Control the balance of positive and
negative weights, useful for
unbalanced classes. A typical value
to consider: sum(negative instances)
/ sum(positive instances) [default:
1.0]
-seed <arg> Random number seed. [default: 43]
-silent <arg> Deprecated. Please use verbosity
instead. 0 means printing running
messages, 1 means silent mode
[default: 1]
-sketch_eps <arg> This roughly translates into O(1 /
sketch_eps) number of bins.
Compared to directly select number
of bins, this comes with theoretical
guarantee with sketch accuracy.
Only used for tree_method=approx.
Usually user does not have to tune
this. [default: 0.03]
-skip_drop <arg> Probability of skipping the dropout
procedure during a boosting
iteration in range [0.0, 1.0].
[default: 0.0]
-subsample <arg> Subsample ratio of the training
instance in range (0.0,1.0]
[default: 1.0]
-top_k <arg> The number of top features to select
in greedy and thrifty feature
selector. The value of 0 means using
all the features. [default: 0]
-tree_method <arg> The tree construction algorithm used
in XGBoost. [default: auto, Choices:
auto, exact, approx, hist]
-tweedie_variance_power <arg> Parameter that controls the variance
of the Tweedie distribution in range
[1.0, 2.0]. [default: 1.5]
-updater <arg> A comma-separated string that
defines the sequence of tree
updaters to run. For a full list of
valid inputs, please refer to
XGBoost Parameters. [default:
'grow_colmaker,prune' for gbtree,
'shotgun' for gblinear]
-validation_ratio <arg> Validation ratio in range [0.0,1.0]
[default: 0.2]
-verbosity <arg> Verbosity of printing messages.
Choices: 0 (silent), 1 (warning), 2
(info), 3 (debug). [default: 0]
Objective function
-objective SHOULD be specified, though -objective reg:linear is used as the objective function by default. For the full list of objective functions, please refer to the XGBoost v0.90 documentation.
The following objectives are widely used for regression, binary classification, and multiclass classification, respectively.
- reg:squarederror: regression with squared loss.
- binary:logistic: logistic regression for binary classification; outputs a probability.
- binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
- multi:softmax: sets XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes).
- multi:softprob: same as softmax, but outputs a vector of ndata * nclass, which can be further reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.
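As a brief illustration of a multiclass objective, the following sketch trains with multi:softmax; the 3-class table iris_train with (features, label) columns is a hypothetical example, not part of this tutorial:

-- hypothetical 3-class training (label takes values 0, 1, or 2)
select train_xgboost(features, label, '-objective multi:softmax -num_class 3 -num_round 10')
  as (model_id, model)
from iris_train;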
Other hyperparameters worth tuning are:

- -booster gbtree: which booster to use. The default gbtree (Gradient Boosting Trees) would be fine for most cases. Can be gbtree, gblinear, or dart; gbtree and dart use tree-based models while gblinear uses linear functions.
- -eta 0.1: the learning rate, 0.3 by default. 0.05, 0.1, and 0.3 are worth trying.
- -max_depth 6: the maximum depth of the tree. The default value 6 would be fine for most cases. The recommended value range is 5-10.
- -num_class 3: the number of classes MUST be specified for multiclass classification (i.e., -objective multi:softmax or -objective multi:softprob).
- -num_round 10: the number of rounds for boosting. 10 or more would be preferred.
- -num_early_stopping_rounds 3: the number of rounds required for early stopping. Without specifying -num_early_stopping_rounds, early stopping is not carried out. When -num_round=100 and -num_early_stopping_rounds=5, training could be stopped early at the 15th iteration if there is no evaluation result greater than the 10th iteration's (the best one). A value of around 3 would be preferred.
- -validation_ratio 0.2: the ratio of data used for validation (early stopping). 0.2 would be enough for most cases. Note that 80% of the data is used for training when validation_ratio 0.2 is set.

You can find the underlying XGBoost version by:
select xgboost_version();
> 0.90
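Putting the hyperparameters above together, a typical options string for binary classification might look like the following sketch; the values are illustrative rather than tuned, and training_data is a hypothetical table with (features, label) columns:

-- illustrative combination of the hyperparameters discussed above
select train_xgboost(
  features, label,
  '-objective binary:logistic -eta 0.1 -max_depth 6 -num_round 100 -num_early_stopping_rounds 3 -validation_ratio 0.2'
) as (model_id, model)
from training_data;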
The train_xgboost UDTF is used for training.
The function signature is train_xgboost(array<string|double> features, double target [, string options]) and it returns a prediction model as a relation consisting of <string model_id, array<string> pred_model>.
-- explicitly use 3 reducers
-- set mapred.reduce.tasks=3;

drop table xgb_lr_model;
create table xgb_lr_model as
select
  train_xgboost(features, label, '-objective binary:logistic -num_round 10 -num_early_stopping_rounds 3')
    as (model_id, model)
from (
  select features, label
  from news20b_train
  cluster by rand(43) -- shuffle data to reducers
) shuffled;

drop table xgb_hinge_model;
create table xgb_hinge_model as
select
  train_xgboost(features, label, '-objective binary:hinge -num_round 10 -num_early_stopping_rounds 3')
    as (model_id, model)
from (
  select features, label
  from news20b_train
  cluster by rand(43) -- shuffle data to reducers
) shuffled;
Caution
cluster by rand() is NOT required when the training data is small and a single task is launched for XGBoost training. cluster by rand() shuffles data at random and divides it across multiple XGBoost instances.
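For a small dataset that fits in a single task, a minimal sketch without the shuffle subquery would look like this (assuming the same news20b_train table):

-- single-task training: no shuffling subquery needed
drop table if exists xgb_single_model;
create table xgb_single_model as
select
  train_xgboost(features, label, '-objective binary:logistic -num_round 10')
    as (model_id, model)
from
  news20b_train;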
drop table xgb_lr_predicted;
create table xgb_lr_predicted as
select
  rowid,
  array_avg(predicted) as predicted,
  avg(predicted[0]) as prob
from (
  select
    -- fast prediction by xgboost-predictor-java (https://github.com/komiya-atsushi/xgboost-predictor-java/)
    xgboost_predict(rowid, features, model_id, model) as (rowid, predicted)
    -- prediction by xgboost4j (https://xgboost.readthedocs.io/en/stable/jvm/)
    -- xgboost_batch_predict(rowid, features, model_id, model) as (rowid, predicted)
  from
    xgb_lr_model l                 -- for each model l
    LEFT OUTER JOIN news20b_test r -- for each test r, predict
) t
group by rowid;

drop table xgb_hinge_predicted;
create table xgb_hinge_predicted as
select
  rowid,
  -- voting
  -- if(sum(if(predicted[0]=1,1,0)) > sum(if(predicted[0]=0,1,0)),1,-1) as predicted
  majority_vote(if(predicted[0]=1, 1, -1)) as predicted
from (
  select
    -- binary:hinge is not supported in xgboost_predict;
    -- binary:hinge returns [1.0] or [0.0] for predicted
    xgboost_batch_predict(rowid, features, model_id, model) as (rowid, predicted)
  from
    xgb_hinge_model l              -- for each model l
    LEFT OUTER JOIN news20b_test r -- for each test r, predict
) t
group by rowid;
You can find the function signatures of xgboost_predict and xgboost_batch_predict by:
select xgboost_predict();

usage: xgboost_predict(PRIMITIVE rowid, array<string|double> features,
       string model_id, array<string> pred_model [, string options])
       - Returns a prediction result as (string rowid, array<double> predicted)

select xgboost_batch_predict();

usage: xgboost_batch_predict(PRIMITIVE rowid, array<string|double> features,
       string model_id, array<string> pred_model [, string options])
       - Returns a prediction result as (string rowid, array<double> predicted)
       [-batch_size <arg>]
 -batch_size <arg>   Number of rows to predict together [default: 128]
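For example, the optional -batch_size argument of xgboost_batch_predict can be passed as follows; this is a sketch reusing the tables created above, and 256 is an arbitrary value:

-- predict 256 rows at a time instead of the default 128
select
  xgboost_batch_predict(rowid, features, model_id, model, '-batch_size 256')
    as (rowid, predicted)
from
  xgb_lr_model l
  LEFT OUTER JOIN news20b_test r;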
Caution
xgboost_predict outputs a probability for -objective binary:logistic, while 0/1 is returned for -objective binary:hinge.
xgboost_predict only supports the following models and objectives because it uses xgboost-predictor-java:

Models: {gblinear, gbtree, dart}
Objective functions: {binary:logistic, binary:logitraw, multi:softmax, multi:softprob, reg:linear, reg:squarederror, rank:pairwise}

For other models and objectives, please use xgboost_batch_predict, which uses xgboost4j instead.
WITH submit as (
  select
    t.label as actual,
    -- probability thresholding by 0.5
    if(p.prob > 0.5, 1, -1) as predicted
  from
    news20b_test t
    JOIN xgb_lr_predicted p on (t.rowid = p.rowid)
)
select
  sum(if(actual = predicted, 1, 0)) / count(1) as accuracy
from
  submit;
0.8372698158526821 (logistic loss)
WITH submit as (
  select
    t.label as actual,
    p.predicted
  from
    news20b_test t
    JOIN xgb_hinge_predicted p on (t.rowid = p.rowid)
)
select
  sum(if(actual = predicted, 1, 0)) / count(1) as accuracy
from
  submit;
0.7752201761409128 (hinge loss)