In this tutorial, we build a binary classification model using XGBoost.
For the feature vector, train_xgboost takes a sparse vector format (array<string>) or a dense vector format (array<double>). In a sparse feature vector, each feature takes the LIBSVM format:
feature ::= <index>:<weight>
index   ::= <non-negative INT> (e.g., 0, 1, 2, ...)
weight  ::= <DOUBLE>
Note
Unlike the original LIBSVM format, a feature vector does not need to be sorted in ascending order of feature index.
The target label format for binary classification follows the rule below. Please also refer to the XGBoost documentation.
The label must be an INT-typed column whose values are positive (+1) or negative (-1) as follows:
<label> ::= 1 | -1
Alternatively, you can use the following format that represents 1 for a positive example and 0 for a negative example:
<label> ::= 0 | 1
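For illustration, a minimal sketch of a training table in the sparse format is shown below; the table name toy_train and its rows are hypothetical, while the news20b_train table used later in this tutorial stores its features in the same way:

-- hypothetical training table in the sparse format:
-- each feature is "<index>:<weight>" and the label is +1 or -1
create table toy_train as
select features, label
from (
  select array('0:1.0', '3:0.5', '7:2.0') as features, 1 as label
  union all
  select array('1:0.3', '3:1.5') as features, -1 as label
) t;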
You can find the hyperparameters and their default settings by running the following query:
select train_xgboost();
usage: train_xgboost(array<string|double> features, int|double target [,
string options]) - Returns a relation consists of <string model_id,
array<string> pred_model> [-alpha <arg>] [-base_score <arg>]
[-booster <arg>] [-colsample_bylevel <arg>] [-colsample_bynode
<arg>] [-colsample_bytree <arg>] [-disable_default_eval_metric
<arg>] [-eta <arg>] [-eval_metric <arg>] [-feature_selector <arg>]
[-gamma <arg>] [-grow_policy <arg>] [-lambda <arg>] [-lambda_bias
<arg>] [-max_bin <arg>] [-max_delta_step <arg>] [-max_depth <arg>]
[-max_leaves <arg>] [-maximize_evaluation_metrics <arg>]
[-min_child_weight <arg>] [-normalize_type <arg>] [-num_class
<arg>] [-num_early_stopping_rounds <arg>] [-num_feature <arg>]
[-num_parallel_tree <arg>] [-num_pbuffer <arg>] [-num_round <arg>]
[-objective <arg>] [-one_drop <arg>] [-process_type <arg>]
[-rate_drop <arg>] [-refresh_leaf <arg>] [-sample_type <arg>]
[-scale_pos_weight <arg>] [-seed <arg>] [-silent <arg>]
[-sketch_eps <arg>] [-skip_drop <arg>] [-subsample <arg>] [-top_k
<arg>] [-tree_method <arg>] [-tweedie_variance_power <arg>]
[-updater <arg>] [-validation_ratio <arg>] [-verbosity <arg>]
-alpha,--reg_alpha <arg> L1 regularization term on weights.
Increasing this value will make
model more conservative. [default:
0.0]
-base_score <arg> Initial prediction score of all
instances, global bias [default:
0.5]
-booster <arg> Set a booster to use, gbtree or
gblinear or dart. [default: gbtree]
-colsample_bylevel <arg> Subsample ratio of columns for each
level [default: 1.0]
-colsample_bynode <arg> Subsample ratio of columns for each
node [default: 1.0]
-colsample_bytree <arg> Subsample ratio of columns when
constructing each tree [default:
1.0]
-disable_default_eval_metric <arg> Flag to disable default metric. Set
to >0 to disable. [default: 0]
-eta,--learning_rate <arg> Step size shrinkage used in update
to prevent overfitting [default:
0.3]
-eval_metric <arg> Evaluation metrics for validation
data. A default metric is assigned
according to the objective:
- rmse: for regression
- error: for classification
- map: for ranking
For a list of valid inputs, see
XGBoost Parameters.
-feature_selector <arg> Feature selection and ordering
method. [Choices: cyclic (default),
shuffle, random, greedy, thrifty]
-gamma,--min_split_loss <arg> Minimum loss reduction required to
make a further partition on a leaf
node of the tree. [default: 0.0]
-grow_policy <arg> Controls a way new nodes are added
to the tree. Currently supported
only if tree_method is set to hist.
[default: depthwise, Choices:
depthwise, lossguide]
-lambda,--reg_lambda <arg> L2 regularization term on weights.
Increasing this value will make
model more conservative. [default:
1.0 for gbtree, 0.0 for gblinear]
-lambda_bias <arg> L2 regularization term on bias
[default: 0.0]
-max_bin <arg> Maximum number of discrete bins to
bucket continuous features. Only
used if tree_method is set to hist.
[default: 256]
-max_delta_step <arg> Maximum delta step we allow each
tree's weight estimation to be
[default: 0]
-max_depth <arg> Max depth of decision tree [default:
6]
-max_leaves <arg> Maximum number of nodes to be added.
Only relevant when
grow_policy=lossguide is set.
[default: 0]
-maximize_evaluation_metrics <arg> Maximize evaluation metrics
[default: false]
-min_child_weight <arg> Minimum sum of instance weight
(hessian) needed in a child
[default: 1.0]
-normalize_type <arg> Type of normalization algorithm.
[Choices: tree (default), forest]
-num_class <arg> Number of classes to classify
-num_early_stopping_rounds <arg> Minimum rounds required for early
stopping [default: 0]
-num_feature <arg> Feature dimension used in boosting
[default: set automatically by
xgboost]
-num_parallel_tree <arg> Number of parallel trees constructed
during each iteration. This option
is used to support boosted random
forest. [default: 1]
-num_pbuffer <arg> Size of prediction buffer [default:
set automatically by xgboost]
-num_round,--iters <arg> Number of boosting iterations
[default: 10]
-objective <arg> Specifies the learning task and the
corresponding learning objective.
Examples: reg:linear, reg:logistic,
multi:softmax. For a full list of
valid inputs, refer to XGBoost
Parameters. [default: reg:linear]
-one_drop <arg> When this flag is enabled, at least
one tree is always dropped during
the dropout. 0 or 1. [default: 0]
-process_type <arg> A type of boosting process to run.
[Choices: default, update]
-rate_drop <arg> Dropout rate in range [0.0, 1.0].
[default: 0.0]
-refresh_leaf <arg> This is a parameter of the refresh
updater plugin. When this flag is 1,
tree leafs as well as tree nodes’
stats are updated. When it is 0,
only node stats are updated.
[default: 1]
-sample_type <arg> Type of sampling algorithm.
[Choices: uniform (default),
weighted]
-scale_pos_weight <arg> Control the balance of positive and
negative weights, useful for
unbalanced classes. A typical value
to consider: sum(negative instances)
/ sum(positive instances) [default:
1.0]
-seed <arg> Random number seed. [default: 43]
-silent <arg> Deprecated. Please use verbosity
instead. 0 means printing running
messages, 1 means silent mode
[default: 1]
-sketch_eps <arg> This roughly translates into O(1 /
sketch_eps) number of bins.
Compared to directly select number
of bins, this comes with theoretical
guarantee with sketch accuracy.
Only used for tree_method=approx.
Usually user does not have to tune
this. [default: 0.03]
-skip_drop <arg> Probability of skipping the dropout
procedure during a boosting
iteration in range [0.0, 1.0].
[default: 0.0]
-subsample <arg> Subsample ratio of the training
instance in range (0.0,1.0]
[default: 1.0]
-top_k <arg> The number of top features to select
in greedy and thrifty feature
selector. The value of 0 means using
all the features. [default: 0]
-tree_method <arg> The tree construction algorithm used
in XGBoost. [default: auto, Choices:
auto, exact, approx, hist]
-tweedie_variance_power <arg> Parameter that controls the variance
of the Tweedie distribution in range
[1.0, 2.0]. [default: 1.5]
-updater <arg> A comma-separated string that
defines the sequence of tree
updaters to run. For a full list of
valid inputs, please refer to
XGBoost Parameters. [default:
'grow_colmaker,prune' for gbtree,
'shotgun' for gblinear]
-validation_ratio <arg> Validation ratio in range [0.0,1.0]
[default: 0.2]
-verbosity <arg> Verbosity of printing messages.
Choices: 0 (silent), 1 (warning), 2
(info), 3 (debug). [default: 0]
Objective function
-objective SHOULD be specified, though -objective reg:linear is used as the objective function by default. For the full list of objective functions, please refer to the XGBoost v0.90 documentation.
The following objectives are widely used for regression, binary classification, and multiclass classification, respectively.
- reg:squarederror: regression with squared loss.
- binary:logistic: logistic regression for binary classification; outputs a probability.
- binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
- multi:softmax: sets XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes).
- multi:softprob: same as softmax, but outputs a vector of ndata * nclass, which can be further reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.
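As a brief illustration of a multiclass objective, the following sketch trains with multi:softmax; the 3-class table iris_train with (features, label) columns is a hypothetical example, not part of this tutorial:

-- hypothetical 3-class training (label takes values 0, 1, or 2)
select train_xgboost(features, label, '-objective multi:softmax -num_class 3 -num_round 10')
  as (model_id, model)
from iris_train;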
Other hyperparameters worth tuning are:

- -booster gbtree: which booster to use. The default gbtree (Gradient Boosting Trees) would be fine for most cases. Can be gbtree, gblinear, or dart; gbtree and dart use tree-based models while gblinear uses linear functions.
- -eta 0.1: the learning rate, 0.3 by default. 0.05, 0.1, and 0.3 are worth trying.
- -max_depth 6: the maximum depth of the tree. The default value 6 would be fine for most cases. The recommended value range is 5-10.
- -num_class 3: the number of classes MUST be specified for multiclass classification (i.e., -objective multi:softmax or -objective multi:softprob).
- -num_round 10: the number of rounds for boosting. 10 or more would be preferred.
- -num_early_stopping_rounds 3: the number of rounds required for early stopping. Without specifying -num_early_stopping_rounds, early stopping is not carried out. When -num_round=100 and -num_early_stopping_rounds=5, training could be stopped early at the 15th iteration if there is no evaluation result greater than the 10th iteration's (the best one). A value of around 3 would be preferred.
- -validation_ratio 0.2: the ratio of data used for validation (early stopping). 0.2 would be enough for most cases. Note that 80% of the data is used for training when validation_ratio 0.2 is set.

You can find the underlying XGBoost version by:
select xgboost_version();
> 0.90
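Putting the hyperparameters above together, a typical options string for binary classification might look like the following sketch; the values are illustrative rather than tuned, and training_data is a hypothetical table with (features, label) columns:

-- illustrative combination of the hyperparameters discussed above
select train_xgboost(
  features, label,
  '-objective binary:logistic -eta 0.1 -max_depth 6 -num_round 100 -num_early_stopping_rounds 3 -validation_ratio 0.2'
) as (model_id, model)
from training_data;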
The train_xgboost UDTF is used for training.
The function signature is train_xgboost(array<string|double> features, double target [, string options]) and it returns a prediction model as a relation consisting of <string model_id, array<string> pred_model>.
-- explicitly use 3 reducers
-- set mapred.reduce.tasks=3;

drop table xgb_lr_model;
create table xgb_lr_model as
select
  train_xgboost(features, label, '-objective binary:logistic -num_round 10 -num_early_stopping_rounds 3')
    as (model_id, model)
from (
  select features, label
  from news20b_train
  cluster by rand(43) -- shuffle data to reducers
) shuffled;

drop table xgb_hinge_model;
create table xgb_hinge_model as
select
  train_xgboost(features, label, '-objective binary:hinge -num_round 10 -num_early_stopping_rounds 3')
    as (model_id, model)
from (
  select features, label
  from news20b_train
  cluster by rand(43) -- shuffle data to reducers
) shuffled;
Caution
cluster by rand() is NOT required when the training data is small and a single task is launched for XGBoost training. cluster by rand() shuffles data at random and divides it across multiple XGBoost instances.
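For a small dataset that fits in a single task, a minimal sketch without the shuffle subquery would look like this (assuming the same news20b_train table):

-- single-task training: no shuffling subquery needed
drop table if exists xgb_single_model;
create table xgb_single_model as
select
  train_xgboost(features, label, '-objective binary:logistic -num_round 10')
    as (model_id, model)
from
  news20b_train;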
drop table xgb_lr_predicted;
create table xgb_lr_predicted as
select
  rowid,
  array_avg(predicted) as predicted,
  avg(predicted[0]) as prob
from (
  select
    -- fast prediction by xgboost-predictor-java (https://github.com/komiya-atsushi/xgboost-predictor-java/)
    xgboost_predict(rowid, features, model_id, model) as (rowid, predicted)
    -- prediction by xgboost4j (https://xgboost.readthedocs.io/en/stable/jvm/)
    -- xgboost_batch_predict(rowid, features, model_id, model) as (rowid, predicted)
  from
    xgb_lr_model l                 -- for each model l
    LEFT OUTER JOIN news20b_test r -- for each test r, predict
) t
group by rowid;

drop table xgb_hinge_predicted;
create table xgb_hinge_predicted as
select
  rowid,
  -- voting
  -- if(sum(if(predicted[0]=1,1,0)) > sum(if(predicted[0]=0,1,0)),1,-1) as predicted
  majority_vote(if(predicted[0]=1, 1, -1)) as predicted
from (
  select
    -- binary:hinge is not supported in xgboost_predict;
    -- binary:hinge returns [1.0] or [0.0] for predicted
    xgboost_batch_predict(rowid, features, model_id, model) as (rowid, predicted)
  from
    xgb_hinge_model l              -- for each model l
    LEFT OUTER JOIN news20b_test r -- for each test r, predict
) t
group by rowid;
You can find the function signatures of xgboost_predict and xgboost_batch_predict by:
select xgboost_predict();

usage: xgboost_predict(PRIMITIVE rowid, array<string|double> features,
       string model_id, array<string> pred_model [, string options])
       - Returns a prediction result as (string rowid, array<double> predicted)

select xgboost_batch_predict();

usage: xgboost_batch_predict(PRIMITIVE rowid, array<string|double> features,
       string model_id, array<string> pred_model [, string options])
       - Returns a prediction result as (string rowid, array<double> predicted)
       [-batch_size <arg>]
 -batch_size <arg>   Number of rows to predict together [default: 128]
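For example, the optional -batch_size argument of xgboost_batch_predict can be passed as follows; this is a sketch reusing the tables created above, and 256 is an arbitrary value:

-- predict 256 rows at a time instead of the default 128
select
  xgboost_batch_predict(rowid, features, model_id, model, '-batch_size 256')
    as (rowid, predicted)
from
  xgb_lr_model l
  LEFT OUTER JOIN news20b_test r;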
Caution
xgboost_predict outputs a probability for -objective binary:logistic, while 0/1 is returned for -objective binary:hinge.
xgboost_predict only supports the following models and objectives because it uses xgboost-predictor-java:

Models: {gblinear, gbtree, dart}
Objective functions: {binary:logistic, binary:logitraw, multi:softmax, multi:softprob, reg:linear, reg:squarederror, rank:pairwise}

For other models and objectives, please use xgboost_batch_predict, which uses xgboost4j instead.
WITH submit as (
  select
    t.label as actual,
    -- probability thresholding by 0.5
    if(p.prob > 0.5, 1, -1) as predicted
  from
    news20b_test t
    JOIN xgb_lr_predicted p on (t.rowid = p.rowid)
)
select
  sum(if(actual = predicted, 1, 0)) / count(1) as accuracy
from
  submit;
0.8372698158526821 (logistic loss)
WITH submit as (
  select
    t.label as actual,
    p.predicted
  from
    news20b_test t
    JOIN xgb_hinge_predicted p on (t.rowid = p.rowid)
)
select
  sum(if(actual = predicted, 1, 0)) / count(1) as accuracy
from
  submit;
0.7752201761409128 (hinge loss)