Feature binning is a method of dividing quantitative variables into categorical values. It groups quantitative values into a pre-defined number of bins.
Note: This feature is supported from Hivemall v0.5-rc.1 or later.
Usage
Prepare sample data (users table) first as follows:
CREATE TABLE users (
name string, age int, gender string
);
INSERT INTO users VALUES
('Jacob', 20, 'Male'),
('Mason', 22, 'Male'),
('Sophia', 35, 'Female'),
('Ethan', 55, 'Male'),
('Emma', 15, 'Female'),
('Noah', 46, 'Male'),
('Isabella', 20, 'Female');
A. Feature Vector trasformation by applying Feature Binning
WITH t AS (
SELECT
array_concat(
categorical_features(
array('name', 'gender'),
name, gender
),
quantitative_features(
array('age'),
age
)
) AS features
FROM
users
),
bins AS (
SELECT
map('age', build_bins(age, 3)) AS quantiles_map
FROM
users
)
SELECT
feature_binning(features, quantiles_map) AS features
FROM
t CROSS JOIN bins;
Result
features: array<features::string> |
---|
[“name#Jacob”,“gender#Male”,“age:1”] |
[“name#Mason”,“gender#Male”,“age:1”] |
[“name#Sophia”,“gender#Female”,“age:2”] |
[“name#Ethan”,“gender#Male”,“age:2”] |
[“name#Emma”,“gender#Female”,“age:0”] |
[“name#Noah”,“gender#Male”,“age:2”] |
[“name#Isabella”,“gender#Female”,“age:1”] |
B. Get a mapping table by Feature Binning
WITH bins AS (
SELECT build_bins(age, 3) AS quantiles
FROM users
)
SELECT
age, feature_binning(age, quantiles) AS bin
FROM
users CROSS JOIN bins;
Result
age: int | bin: int |
---|
20 | 1 |
22 | 1 |
35 | 2 |
55 | 2 |
15 | 0 |
46 | 2 |
20 | 1 |
Function Signature
[UDAF] build_bins(weight, num_of_bins[, auto_shrink])
Input
weight: int|bigint|float|double | num_of_bins: int | [auto_shrink: boolean = false] |
---|
weight | 2 <= | behavior when separations are repeated: T=>skip, F=>exception |
Output
quantiles: array<double> |
---|
array of separation value |
Note
There is the possibility quantiles are repeated because of too many num_of_bins
or too few data. If auto_shrink
is true, skip duplicated quantiles. If not, throw an exception.
[UDF] feature_binning(features, quantiles_map)/(weight, quantiles)
Variation: A
Input
features: array<features::string> | quantiles_map: map<string, array<double>> |
---|
serialized feature | entry:: key: col name, val: quantiles |
Output
features: array<feature::string> |
---|
serialized and binned features |
Variation: B
Input
weight: int|bigint|float|double | quantiles: array<double> |
---|
weight | array of separation value |
Output
bin: int |
---|
categorical value (bin ID) |