This page explains the input format of training data in Hivemall. Here, we use EBNF-like notation for describing the format.
The classifiers of Hivemall takes 2 (or 3) arguments: features, label, and options (a.k.a. hyperparameters). The first two arguments of training functions represents training examples.
In Statistics, features and label are called Explanatory variable and Response Variable, respectively.
The format of features is common between (binary and multi-class) classification and regression. Hivemall accepts ARRAY<INT|BIGINT|TEXT>
for the type of features column.
Hivemall uses a sparse data format (cf. Compressed Row Storage) which is similar to LIBSVM and Vowpal Wabbit.
The format of each feature in an array is as follows:
feature ::= <index>:<weight> or <index>
Each element of index or weight then accepts the following format:
index ::= <INT | BIGINT | TEXT> weight ::= <FLOAT>
The index are usually a number (INT or BIGINT) starting from 1. Here is an instance of a features.
10:3.4 123:0.5 34567:0.231
Note: As mentioned later, index “0” is reserved for a Bias/Dummy variable.
In addition to numbers, you can use a TEXT value for an index. For example, you can use array(“height:1.5”, “length:2.0”) for the features.
"height:1.5" "length:2.0"
A quantitative variable must have an index entry.
Hivemall (v0.3.1 or later) provides add_feature_index function which is useful for adding indexes to quantitative variables.
select add_feature_index(array(3,4.0,5)) from dual;
[“1:3.0”,“2:4.0”,“3:5.0”]
You can omit specifying weight for each feature e.g. for Categorical variables as follows:
feature ::= <index>
Note 1.0 is used for the weight when omitting weight.
Note that “0” is reserved for a Bias variable (called dummy variable in Statistics).
The add_bias function is Hivemall appends “0:1.0” as an element of array in features.
Hivemall supports feature hashing/hashing trick through mhash function.
The mhash function takes a feature (i.e., index) of TEXT format and generates a hash number of a range from 1 to 2^24 (=16777216) by the default setting.
Feature hashing is useful where the dimension of feature vector (i.e., the number of elements in features) is so large. Consider applying mhash function when a prediction model does not fit in memory and OutOfMemory exception happens.
In general, you don't need to use mhash when the dimension of feature vector is less than 16777216. If feature index is very long TEXT (e.g., “xxxxxxx-yyyyyy-weight:55.3”) and uses huge memory spaces, consider using mhash as follows:
-- feature is v0.3.2 or before concat(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), ":", extract_weight("xxxxxxx-yyyyyy-weight:55.3")) -- feature is v0.3.2-1 or later feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_weight("xxxxxxx-yyyyyy-weight:55.3"))
43352:55.3
Feature (weight) normalization is important in machine learning. Please refer this article for detail.
The label must be an INT typed column and the values are positive (+1) or negative (-1) as follows:
<label> ::= 1 | -1
Alternatively, you can use the following format that represents 1 for a positive example and 0 for a negative example:
<label> ::= 0 | 1
You can used any PRIMITIVE type in the multi-class label.
<label> ::= <primitive type>
Typically, the type of label column will be INT, BIGINT, or TEXT.
In regression, response/predictor variable (we denote it as target) is a real number.
Before Hivemall v0.3, we accepts only FLOAT type for target.
<target> ::= <FLOAT>
You need to explicitly cast a double value of target to float as follows:
CAST(target as FLOAT)
On the other hand, Hivemall v0.3 or later accepts double compatible numbers in target.
<target> ::= <FLOAT | DOUBLE | INT | TINYINT | SMALLINT| BIGINT >
Logistic regression is actually a binary classification scheme while it can produce probabilities of positive of a training example.
A target value of a training input must be in range 0.0 to 1.0, specifically 0.0 or 1.0.
-- hivemall v0.3.2 and before select concat("weight",":",55.0); -- hivemall v0.3.2-1 and later select feature("weight", 55.0);
weight:55.0
select extract_feature("weight:55.0"), extract_weight("weight:55.0");
weight | 55.0
-- hivemall v0.4.0 and later select feature_index(array("10:0.2","7:0.3","9"));
[10,7,9]
select convert_label(-1), convert_label(1), convert_label(0.0f), convert_label(1.0f) from dual;
0.0f | 1.0f | -1 | 1
array<string> quantitative_features(array<string> featureNames, feature1, feature2, .. [, const string options])
is a helper function to create sparse quantitative features from a table.
select quantitative_features( array("apple","height","weight"), 1,180.3,70.2 -- ,"-emit_null" );
[“apple:1.0”,“height:180.3”,“weight:70.2”]
select quantitative_features( array("apple","height","weight"), 1,cast(null as double),70.2 ,"-emit_null" );
[“apple:1.0”,null,“weight:70.2”]
array<string> categorical_features(array<string> featureNames, feature1, feature2, .. [, const string options])
is a helper function to create sparse categorical features from a table.
select categorical_features( array("is_cat","is_dog","is_lion","is_pengin","species"), 1, 0, 1.0, true, "dog" -- ,"-emit_null" );
[“is_cat#1”,“is_dog#0”,“is_lion#1.0”,“is_pengin#true”,“species#dog”]
select categorical_features( array("is_cat","is_dog","is_lion","is_pengin","species"), 1, 0, 1.0, true, null ,"-emit_null" );
[“is_cat#1”,“is_dog#0”,“is_lion#1.0”,“is_pengin#true”,null]
You can create a training data table as follows:
select rowid() as rowid, concat_array( array("bias:1.0"), categorical_features( array("id", "name"), id, name ), quantitative_features( array("height", "weight"), height, weight ) ) as features, click_or_not as label from table;