Feature binning is a method of dividing quantitative variables into categorical values. It groups quantitative values into a pre-defined number of bins.

Note: This feature is supported from Hivemall v0.5-rc.1 or later.

Usage

Prepare sample data (users table) first as follows:

CREATE TABLE users (
  name string, age int, gender string
);

INSERT INTO users VALUES
  ('Jacob', 20, 'Male'),
  ('Mason', 22, 'Male'),
  ('Sophia', 35, 'Female'),
  ('Ethan', 55, 'Male'),
  ('Emma', 15, 'Female'),
  ('Noah', 46, 'Male'),
  ('Isabella', 20, 'Female');

A. Feature Vector trasformation by applying Feature Binning

WITH t AS (
  SELECT
    array_concat(
      categorical_features(
        array('name', 'gender'),
	name, gender
      ),
      quantitative_features(
	array('age'),
	age
      )
    ) AS features
  FROM
    users
),
bins AS (
  SELECT
    map('age', build_bins(age, 3)) AS quantiles_map
  FROM
    users
)
SELECT
  feature_binning(features, quantiles_map) AS features
FROM
  t CROSS JOIN bins;

Result

features: array<features::string>
[“name#Jacob”,“gender#Male”,“age:1”]
[“name#Mason”,“gender#Male”,“age:1”]
[“name#Sophia”,“gender#Female”,“age:2”]
[“name#Ethan”,“gender#Male”,“age:2”]
[“name#Emma”,“gender#Female”,“age:0”]
[“name#Noah”,“gender#Male”,“age:2”]
[“name#Isabella”,“gender#Female”,“age:1”]

B. Get a mapping table by Feature Binning

WITH bins AS (
  SELECT build_bins(age, 3) AS quantiles
  FROM users
)
SELECT
  age, feature_binning(age, quantiles) AS bin
FROM
  users CROSS JOIN bins;

Result

age: intbin: int
201
221
352
552
150
462
201

Function Signature

[UDAF] build_bins(weight, num_of_bins[, auto_shrink])

Input

weight: int|bigint|float|doublenum_of_bins: int[auto_shrink: boolean = false]
weight2 <=behavior when separations are repeated: T=>skip, F=>exception

Output

quantiles: array<double>
array of separation value

Note

There is the possibility quantiles are repeated because of too many num_of_bins or too few data. If auto_shrink is true, skip duplicated quantiles. If not, throw an exception.

[UDF] feature_binning(features, quantiles_map)/(weight, quantiles)

Variation: A

Input

features: array<features::string>quantiles_map: map<string, array<double>>
serialized featureentry:: key: col name, val: quantiles

Output

features: array<feature::string>
serialized and binned features

Variation: B

Input

weight: int|bigint|float|doublequantiles: array<double>
weightarray of separation value

Output

bin: int
categorical value (bin ID)