Step-by-Step Tutorial on Supervised Learning

What is Hivemall?

Apache Hivemall is a collection of user-defined functions (UDFs) for HiveQL that is strongly optimized for machine learning (ML) and data science. For example, you can efficiently build a logistic regression model with stochastic gradient descent (SGD) optimization by issuing the following query of roughly ten lines:

SELECT
  train_classifier(
    features,
    label,
    '-loss_function logloss -optimizer SGD'
  ) as (feature, weight)
FROM
  training
;

The Hivemall function hivemall_version() shows the current Hivemall version, for example:

select hivemall_version();

“0.5.1-incubating-SNAPSHOT”

Hivemall can solve a wide variety of ML and related problems; see the documentation for the full list of supported tasks.

Our YouTube demo video gives a helpful overview of Hivemall.

This tutorial explains the basic usage of Hivemall with examples of supervised learning: a simple binary classifier and a simple regressor.

Binary classification

Imagine a scenario in which we want to build a binary classifier from the mock purchase_history data and predict unforeseen purchases so that we can run a new campaign effectively:

day_of_week | gender | price | category | label
Saturday | male | 600 | book | 1
Friday | female | 4800 | sports | 0
Friday | other | 18000 | entertainment | 0
Thursday | male | 200 | food | 0
Wednesday | female | 1000 | electronics | 1

You can create this table as follows:

create table if not exists purchase_history as
select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
union all
select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
union all
select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
union all
select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
union all
select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
;

Use the Hivemall train_classifier() UDF to tackle the problem as follows.

Step 1. Feature representation

First of all, we have to convert the records into pairs of a feature vector and the corresponding target value. Here, Hivemall requires you to represent input features in a specific format.

To be more precise, Hivemall represents a single feature as a concatenation of an index (i.e., name) and its value:

  • Quantitative feature: <index>:<value>
    • e.g., price:600.0
  • Categorical feature: <index>#<value>
    • e.g., gender#male

The feature index and the feature value are separated by a colon. When the value part is omitted, the value is considered to be 1.0. So, the categorical feature gender#male is a one-hot representation with index := gender#male and value := 1.0. Note that # is not a special character; for a categorical feature, the whole string is simply treated as the feature index.

Each of those features is a string value in Hive, and “feature vector” means an array of string values like:

["price:600.0", "day of week#Saturday", "gender#male", "category#book"]

See the detailed documentation on the input format for more information.
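For illustration only, the same feature strings could be assembled by hand with plain HiveQL string functions. The following sketch assumes the purchase_history columns shown above; in practice, the Hivemall helper functions introduced next are more convenient:

select
  id,
  array(
    concat("price:", cast(price as string)), -- quantitative feature: <index>:<value>
    concat("day of week#", day_of_week), -- categorical feature: implicit value 1.0
    concat("gender#", gender),
    concat("category#", category)
  ) as features,
  label
from
  purchase_history
;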

Therefore, the first thing we need to do is to convert the records into arrays of feature strings, and the Hivemall functions quantitative_features(), categorical_features() and array_concat() provide a simple way to create the pairs of feature vector and target value:

create table if not exists training as
select
  id,
  array_concat( -- concatenate two arrays of quantitative and categorical features into single array
    quantitative_features(
      array("price"), -- quantitative feature names
      price -- corresponding column names
    ),
    categorical_features(
      array("day of week", "gender", "category"), -- categorical feature names
      day_of_week, gender, category -- corresponding column names
    )
  ) as features,
  label
from
  purchase_history
;

The training table is as follows:

id | features | label
1 | ["price:600.0", "day of week#Saturday", "gender#male", "category#book"] | 1
2 | ["price:4800.0", "day of week#Friday", "gender#female", "category#sports"] | 0
3 | ["price:18000.0", "day of week#Friday", "gender#other", "category#entertainment"] | 0
4 | ["price:200.0", "day of week#Thursday", "gender#male", "category#food"] | 0
5 | ["price:1000.0", "day of week#Wednesday", "gender#female", "category#electronics"] | 1

The output table training will be directly used as an input to Hivemall's ML functions in the next step.

Note

You can apply extra Hivemall functions (e.g., rescale(), feature_hashing(), l1_normalize()) for the features in this step to make your prediction model more accurate and stable; it is known as feature engineering in the context of ML. See our documentation for more information.
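As a small illustration (not required for this tutorial), feature_hashing() can be applied directly to the features array to hash feature indices into a fixed-size space, which helps when the number of distinct categorical values is large; see the feature_hashing() documentation for its options:

select
  id,
  feature_hashing(features) as features, -- hash feature indices into a bounded index space
  label
from
  training
;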

Step 2. Training

Once the original table purchase_history has been converted into pairs of features and label, you can build a binary classifier by running the following query:

create table if not exists classifier as
select
  train_classifier(
    features, -- feature vector
    label, -- target value
    '-loss_function logloss -optimizer SGD -regularization l1' -- hyper-parameters
  ) as (feature, weight)
from
  training
;

The above query builds a binary classifier with:

  • -loss_function logloss
    • Use the logistic loss, i.e., logistic regression
  • -optimizer SGD
    • Learn model parameters with the SGD optimization
  • -regularization l1
    • Apply L1 regularization

Eventually, the output table classifier stores model parameters as:

feature | weight
day of week#Wednesday | 0.7443372011184692
day of week#Thursday | 1.415687620465178e-07
day of week#Friday | -0.2697019577026367
day of week#Saturday | 0.7337419390678406
category#book | 0.7337419390678406
category#electronics | 0.7443372011184692
category#entertainment | 5.039264578954317e-07
category#food | 1.415687620465178e-07
category#sports | -0.2697771489620209
gender#male | 0.7336684465408325
gender#female | 0.47442761063575745
gender#other | 5.039264578954317e-07
price | -110.62307739257812

Notice that a weight is learned for each possible value of a categorical feature, and a single weight is learned for each quantitative feature.
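To see which features dominate the model, you can sort the learned weights by magnitude. This is a small sketch using standard HiveQL only:

select
  feature,
  weight,
  abs(weight) as magnitude
from
  classifier
order by
  magnitude desc
limit 5
;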

Of course, you can tune the hyper-parameters to build a more accurate prediction model. Check the output of the following query to see all available options, including the learning rate, the number of iterations and regularization parameters, together with their default values:

select train_classifier(array(), 0, '-help');

Step 3. Prediction

Now, the table classifier holds linear coefficients for the given features, and we can predict unforeseen samples by computing a weighted sum of their features.

How about the probability of a purchase by a male customer who sees a food product priced at 120 on a Friday? Which product is that customer more likely to purchase on Friday?

To differentiate these potential purchases, create an unforeseen_samples table with the unknown combinations of features:

create table if not exists unforeseen_samples as
select 1 as id, array("gender#male", "category#food", "day of week#Friday", "price:120") as features
union all
select 2 as id, array("gender#male", "category#sports", "day of week#Friday", "price:1000") as features
union all
select 3 as id, array("gender#male", "category#electronics", "day of week#Friday", "price:540") as features
;

Prediction for the feature vectors is made by a join operation between unforeseen_samples and classifier on each feature:

with features_exploded as (
  select
    id,
    -- split feature string into its name and value
    -- to join with a model table
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from unforeseen_samples t1 LATERAL VIEW explode(features) t2 as fv
)
select
  t1.id,
  sigmoid( sum(p1.weight * t1.value) ) as probability
from
  features_exploded t1
  LEFT OUTER JOIN classifier p1 ON (t1.feature = p1.feature)
group by
  t1.id
;

Note

sigmoid() should be applied only when the model was trained with the logistic loss; other loss functions do not yield a probability for classification. See also this video.
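For reference, the logistic sigmoid maps the weighted sum onto the (0, 1) range, which is why the result can be read as a probability. For instance, a weighted sum of zero corresponds to a 50% probability:

select sigmoid(0.0); -- 0.5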

Output for a single sample looks like:

id | probability
1 | 1.0261879540562902e-10
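If you need a hard 0/1 decision rather than a probability, you can threshold the predicted probability. The sketch below reuses the prediction query from above; the 0.5 cutoff is an arbitrary choice, not something mandated by Hivemall:

with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from unforeseen_samples t1 LATERAL VIEW explode(features) t2 as fv
),
predictions as (
  select
    t1.id,
    sigmoid( sum(p1.weight * t1.value) ) as probability
  from
    features_exploded t1
    LEFT OUTER JOIN classifier p1 ON (t1.feature = p1.feature)
  group by
    t1.id
)
select
  id,
  probability,
  if(probability >= 0.5, 1, 0) as predicted_label -- arbitrary 0.5 cutoff
from
  predictions
;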

Evaluation

If you have test samples for evaluation, use Hivemall's evaluation UDFs to measure the accuracy of prediction.

For instance, prediction accuracy over the training samples can be measured as:

with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from training t1 LATERAL VIEW explode(features) t2 as fv
),
predictions as (
  select
    t1.id,
    sigmoid( sum(p1.weight * t1.value) ) as probability
  from
    features_exploded t1
    LEFT OUTER JOIN classifier p1 ON (t1.feature = p1.feature)
  group by
    t1.id
)
select
  auc(probability, label) as auc,
  logloss(probability, label) as logloss
from (
  select t1.probability, t2.label
  from predictions t1
  join training t2 on (t1.id = t2.id)
  ORDER BY probability DESC
) t
;
auc | logloss
0.5 | 9.200000003614099

Since we are solving a binary classification problem, accuracy is measured by the Area Under the ROC Curve auc() and/or the Logarithmic Loss logloss(). Note that auc() expects its input to be sorted by the predicted probability in descending order, which is why the subquery above includes ORDER BY probability DESC.

Regression

If you use train_regressor() instead of train_classifier(), you can solve a regression problem with almost the same queries.

Imagine the following customers table:

create table if not exists customers as
select 1 as id, "male" as gender, 23 as age, "Japan" as country, 12 as num_purchases
union all
select 2 as id, "female" as gender, 43 as age, "US" as country, 4 as num_purchases
union all
select 3 as id, "other" as gender, 19 as age, "UK" as country, 2 as num_purchases
union all
select 4 as id, "male" as gender, 31 as age, "US" as country, 20 as num_purchases
union all
select 5 as id, "female" as gender, 37 as age, "Australia" as country, 9 as num_purchases
;
gender | age | country | num_purchases
male | 23 | Japan | 12
female | 43 | US | 4
other | 19 | UK | 2
male | 31 | US | 20
female | 37 | Australia | 9

Now, our goal is to build a regression model to predict the number of purchases potentially done by new customers.

Step 1. Feature representation

Same as the classification example:

insert overwrite table training
select
  id,
  array_concat(
    quantitative_features(
      array("age"),
      age
    ),
    categorical_features(
      array("country", "gender"),
      country, gender
    )
  ) as features,
  num_purchases -- mapped by position into the label column of the training table
from
  customers
;

Step 2. Training

train_regressor() requires you to specify an appropriate loss function. One option is to replace the classifier-specific loss function logloss with squared as:

create table if not exists regressor as
select
  train_regressor(
    features, -- feature vector
    label, -- target value
    '-loss_function squared -optimizer AdaGrad -regularization l2' -- hyper-parameters
  ) as (feature, weight)
from
  training
;

-loss_function squared means that this query builds a simple linear regressor with the squared loss. Meanwhile, this example optimizes the parameters with the AdaGrad optimization scheme and L2 regularization.

Run the function with -help option to list available options:

select train_regressor(array(), 0, '-help');

Step 3. Prediction

Prepare dummy new customers:

create table if not exists new_customers as
select 1 as id, array("gender#male", "age:10", "country#Japan") as features
union all
select 2 as id, array("gender#female", "age:60", "country#US") as features
union all
select 3 as id, array("gender#other", "age:50", "country#UK") as features
;

Prediction works almost the same way as in classification, except that there is no need to pass the weighted sum through the sigmoid() function:

with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from new_customers t1 LATERAL VIEW explode(features) t2 as fv
)
select
  t1.id,
  sum(p1.weight * t1.value) as predicted_num_purchases
from
  features_exploded t1
  LEFT OUTER JOIN regressor p1 ON (t1.feature = p1.feature)
group by
  t1.id
;

Output is like:

id | predicted_num_purchases
1 | 3.645142912864685

Evaluation

Use the Root Mean Squared Error rmse() or Mean Absolute Error mae() UDFs to evaluate regressors:

with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from training t1 LATERAL VIEW explode(features) t2 as fv
),
predictions as (
  select
    t1.id,
    sum(p1.weight * t1.value) as predicted_num_purchases
  from
    features_exploded t1
    LEFT OUTER JOIN regressor p1 ON (t1.feature = p1.feature)
  group by
    t1.id
)
select
  rmse(t1.predicted_num_purchases, t2.label) as rmse,
  mae(t1.predicted_num_purchases, t2.label) as mae
from
  predictions t1
join
  training t2 on (t1.id = t2.id)
;

Output is like:

rmse | mae
10.665060285725504 | 8.341085218265652

Next steps

See the following resources for further information:

  • Detailed documentation of train_classifier and train_regressor
    • Query examples for several public datasets are also included there.