Feature Selection is the process of selecting a subset of relevant features for use in model construction.
It is a useful technique for 1) improving prediction results by omitting redundant features, 2) shortening training time, and 3) identifying which features matter for prediction.
Note: Feature selection is supported in Hivemall v0.5-rc.1 or later.
Supported Feature Selection algorithms
- Chi-square (Chi2)
  - In statistics, the $$\chi^2$$ test is applied to test the independence of two events. The chi-square statistic between each feature variable and the target variable can be used as a feature selection criterion. Refer to this article for mathematical details.
- Signal Noise Ratio (SNR)
  - The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in class $$k$$ and $$\sigma_{k}$$ is its standard deviation in class $$k$$. Features with a larger SNR separate the two classes better and are therefore more useful for classification.
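The SNR definition above is easy to verify numerically. The following NumPy sketch (my own illustration, not the Hivemall implementation) scores each feature of a toy binary dataset:

```python
import numpy as np

def snr_scores(X, y):
    """Per-feature SNR: |mu_1 - mu_2| / (sigma_1 + sigma_2).

    X: (n_samples, n_features), y: binary labels in {0, 1}.
    """
    X0, X1 = X[y == 0], X[y == 1]
    mu_diff = np.abs(X0.mean(axis=0) - X1.mean(axis=0))
    sigma_sum = X0.std(axis=0) + X1.std(axis=0)
    return mu_diff / sigma_sum

X = np.array([[1.0, 5.0],
              [1.1, 3.0],
              [9.0, 4.0],
              [9.2, 6.0]])
y = np.array([0, 0, 1, 1])

scores = snr_scores(X, y)
# the first feature separates the two classes far better than the second,
# so it receives a much larger SNR
```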
Usage
Feature Selection based on Chi-square test
```sql
CREATE TABLE input (
  X array<double>, -- features
  Y array<int>     -- binarized label
);

set hivevar:k=2;

WITH stats AS (
  SELECT
    transpose_and_dot(Y, X) AS observed,  -- array<array<double>>, shape = (n_classes, n_features)
    array_sum(X) AS feature_count,        -- array<double>, shape = (1, n_features)
    array_avg(Y) AS class_prob            -- array<double>, shape = (1, n_classes)
  FROM
    input
),
test AS (
  SELECT
    transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_classes, n_features)
  FROM
    stats
),
chi2 AS (
  SELECT
    chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features)
  FROM
    test l
    CROSS JOIN stats r
)
SELECT
  select_k_best(l.X, r.v.chi2, ${k}) AS features -- top-k feature selection based on chi2 score
FROM
  input l
  CROSS JOIN chi2 r;
```
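The CTEs above correspond one-to-one to the steps of the textbook chi-square computation. Here is a NumPy sketch of the same pipeline on toy data (illustrative only; the variable names mirror the query, not Hivemall internals):

```python
import numpy as np

# toy input table: 4 rows, 3 features, 2 one-hot classes
X = np.array([[1.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 1.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# stats CTE
observed = Y.T @ X             # transpose_and_dot(Y, X), (n_classes, n_features)
feature_count = X.sum(axis=0)  # array_sum(X)
class_prob = Y.mean(axis=0)    # array_avg(Y)

# test CTE: expected counts under the independence assumption
expected = np.outer(class_prob, feature_count)

# chi2 CTE: one statistic per feature, summed over classes
chi2 = ((observed - expected) ** 2 / expected).sum(axis=0)

# select_k_best: indices of the k highest-scoring features
k = 2
top_k = np.argsort(chi2)[::-1][:k]
```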
Feature Selection based on Signal Noise Ratio (SNR)
```sql
CREATE TABLE input (
  X array<double>, -- features
  Y array<int>     -- binarized label
);

set hivevar:k=2;

WITH snr AS (
  SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, n_features)
  FROM
    input
)
SELECT
  select_k_best(X, snr, ${k}) AS features
FROM
  input
  CROSS JOIN snr;
```
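As with the chi-square pipeline, the whole query can be mirrored in NumPy to see what it computes on one-hot labels (a sketch under the function signatures below, not Hivemall code):

```python
import numpy as np

X = np.array([[1.0, 5.0, 2.0],
              [1.2, 3.0, 2.1],
              [9.0, 4.0, 2.0],
              [9.1, 6.0, 1.9]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # one-hot labels

# snr(X, Y): per-feature |mu_1 - mu_2| / (sigma_1 + sigma_2)
in_c1 = Y[:, 0] == 1
mu_diff = np.abs(X[in_c1].mean(axis=0) - X[~in_c1].mean(axis=0))
sigma_sum = X[in_c1].std(axis=0) + X[~in_c1].std(axis=0)
snr = mu_diff / sigma_sum

# select_k_best(X, snr, k): keep the k highest-scoring columns
k = 2
kept = np.sort(np.argsort(snr)[::-1][:k])
features = X[:, kept]
```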
Function signatures
[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
Input
| array<number> X | array<number> Y |
| --- | --- |
| a row of matrix | a row of matrix |
Output
| array<array<double>> dot product |
| --- |
| dot(X.T, Y) of shape = (X.#cols, Y.#cols) |
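Because it is a UDAF, transpose_and_dot sees one row pair at a time; conceptually it accumulates the outer product of each pair, and the total equals dot(X.T, Y). A quick NumPy check of that equivalence (my own illustration):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # two rows, 2 columns
Y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])  # two rows, 3 columns

# per-row accumulation, as the aggregate would do
acc = sum(np.outer(x, y) for x, y in zip(X, Y))
# equals the one-shot matrix product dot(X.T, Y), shape (X.#cols, Y.#cols)
```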
[UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>
Input
| array<number> X | array<number> importance_list | int k |
| --- | --- | --- |
| feature vector | importance of each feature | the number of features to be selected |
Output
| array<double> k-best features |
| --- |
| top-k elements from feature vector X based on the importance list |
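A sketch of the UDF's contract (I assume the selected elements keep their original order within the vector; consult the Hivemall source for tie-breaking details):

```python
import numpy as np

def select_k_best(x, importance, k):
    """Keep the k elements of x whose importance scores are largest."""
    top = np.argsort(importance)[::-1][:k]
    return [x[i] for i in np.sort(top)]  # preserve the original feature order

features = select_k_best([10.0, 20.0, 30.0], [0.1, 0.9, 0.5], 2)
```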
[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
Input
| array<array<number>> observed | array<array<number>> expected |
| --- | --- |
| observed features | expected features: dot(class_prob.T, feature_count) |

Both observed and expected have shape (#classes, #features).
Output
| struct<array<double>, array<double>> importance_list |
| --- |
| chi2-value and p-value for each feature |
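The statistic is the usual sum over classes of (observed - expected)^2 / expected, and the p-value comes from the chi-square distribution with (#classes - 1) degrees of freedom (my assumption, matching the standard test). For two classes (one degree of freedom) the survival function reduces to erfc(sqrt(x/2)), so both outputs can be sketched without SciPy (an illustration, not the Hivemall implementation):

```python
import math
import numpy as np

def chi2_stat_pvalue(observed, expected):
    """Per-feature chi-square statistic and p-value (2-class case)."""
    obs, exp = np.asarray(observed), np.asarray(expected)
    stat = ((obs - exp) ** 2 / exp).sum(axis=0)
    # with 1 degree of freedom, P(Chi2 >= x) = erfc(sqrt(x / 2))
    pval = np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])
    return stat, pval

stat, pval = chi2_stat_pvalue([[3.0, 2.0], [1.0, 2.0]],
                              [[2.0, 2.0], [2.0, 2.0]])
```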
[UDAF] snr(X::array<number>, Y::array<int>)::array<double>
Input
| array<number> X | array<int> Y |
| --- | --- |
| feature vector | one-hot label |
Output
| array<double> importance_list |
| --- |
| Signal Noise Ratio for each feature |