Feature Selection is the process of selecting a subset of relevant features for use in model construction.

It is a useful technique to 1) improve prediction results by omitting redundant features, 2) shorten training time, and 3) identify which features are important for prediction.

Note: This feature is supported in Hivemall v0.5-rc.1 or later.

Supported Feature Selection algorithms

  • Chi-square (Chi2)
    • In statistics, the $$\chi^2$$ test is applied to test the independence of two events. The chi-square statistic between each feature variable and the target variable can be used for feature selection. Refer to this article for mathematical details.
  • Signal Noise Ratio (SNR)
    • The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in class $$k$$ and $$\sigma_{k}$$ is its standard deviation in class $$k$$. Features with a larger SNR tend to be more useful for classification.
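For instance, a minimal sketch of the SNR definition above, using NumPy and made-up toy values for a single feature variable:

```python
import numpy as np

# Toy values of one feature variable in a binary problem (illustrative data).
class1 = np.array([1.0, 2.0, 3.0])  # samples belonging to class 1
class2 = np.array([7.0, 8.0, 9.0])  # samples belonging to class 2

mu1, mu2 = class1.mean(), class2.mean()      # per-class means
sigma1, sigma2 = class1.std(), class2.std()  # per-class standard deviations

# SNR = |mu1 - mu2| / (sigma1 + sigma2)
snr = abs(mu1 - mu2) / (sigma1 + sigma2)
```

The two classes are well separated relative to their spread, so this feature gets a high score.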

Usage

Feature Selection based on Chi-square test

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);
 
set hivevar:k=2;

WITH stats AS (
  SELECT
    transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
    array_sum(X) AS feature_count, -- array<double>, shape = (1, n_features)
    array_avg(Y) AS class_prob -- array<double>, shape = (1, n_classes)
  FROM
    input
),
test AS (
  SELECT
    transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_classes, n_features)
  FROM
    stats
),
chi2 AS (
  SELECT
    chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features)
  FROM
    test l
    CROSS JOIN stats r
)
SELECT
  select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score
FROM
  input l
  CROSS JOIN chi2 r;
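The intermediate shapes in this query can be made concrete with a NumPy mirror of the same pipeline (a hedged sketch on toy data, not Hivemall's implementation; the p-value half of chi2's output is omitted here):

```python
import numpy as np

# Toy input: Y holds one-hot labels (n_samples, n_classes),
# X holds non-negative feature values (n_samples, n_features).
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
X = np.array([[4.0, 1.0], [3.0, 2.0], [1.0, 4.0], [2.0, 5.0]])

observed = Y.T @ X             # transpose_and_dot(Y, X): (n_classes, n_features)
feature_count = X.sum(axis=0)  # array_sum(X): (n_features,)
class_prob = Y.mean(axis=0)    # array_avg(Y): (n_classes,)

# transpose_and_dot(class_prob, feature_count) on single-row inputs
# reduces to an outer product of shape (n_classes, n_features).
expected = np.outer(class_prob, feature_count)

# Chi-square score per feature: sum over classes of (O - E)^2 / E.
chi2_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

# select_k_best keeps the k features with the highest scores.
k = 1
best = np.argsort(chi2_scores)[::-1][:k]
```

With this toy data the second feature has the larger chi-square score, so it is the one selected for k = 1.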

Feature Selection based on Signal Noise Ratio (SNR)

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

set hivevar:k=2;

WITH snr AS (
  SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features)
  FROM input
)
SELECT 
  select_k_best(X, snr, ${k}) as features
FROM
  input
  CROSS JOIN snr;
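A NumPy mirror of this query may help clarify the shapes involved (a hedged sketch on toy data: the snr aggregate reduces all rows to one score per feature, and the top-k pick emulates select_k_best):

```python
import numpy as np

# Toy input matching the table schema: X = features, Y = one-hot binary labels.
X = np.array([[1.0, 5.0], [2.0, 6.0], [8.0, 5.5], [9.0, 6.5]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

c1, c2 = Y[:, 0] == 1, Y[:, 1] == 1
mu1, mu2 = X[c1].mean(axis=0), X[c2].mean(axis=0)
sd1, sd2 = X[c1].std(axis=0), X[c2].std(axis=0)

# Aggregated SNR: one score per feature, shape (n_features,).
snr = np.abs(mu1 - mu2) / (sd1 + sd2)

# Emulate select_k_best: indices of the k highest-SNR features.
k = 1
top_k = np.argsort(snr)[::-1][:k]
```

Here the first feature separates the two classes far better than the second, so it is the one kept for k = 1.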

Function signatures

[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>

Input

| array<number> X | array<number> Y |
| --- | --- |
| a row of matrix | a row of matrix |

Output

| array<array<double>> dot product |
| --- |
| dot(X.T, Y) of shape = (X.#cols, Y.#cols) |
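As an aggregate, transpose_and_dot accumulates the outer product of each row pair, which equals the matrix product of the stacked inputs. A small NumPy check of that identity (illustrative, not the UDAF source):

```python
import numpy as np

# Two rows fed to the aggregate: (X_i, Y_i) pairs.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]])

# Row-wise accumulation, as an aggregate function would see it...
acc = np.zeros((X.shape[1], Y.shape[1]))
for x_row, y_row in zip(X, Y):
    acc += np.outer(x_row, y_row)

# ...equals dot(X.T, Y), with shape (X.#cols, Y.#cols).
assert np.allclose(acc, X.T @ Y)
```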

[UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>

Input

| array<number> X | array<number> importance_list | int k |
| --- | --- | --- |
| feature vector | importance of each feature | the number of features to be selected |

Output

| array<double> k-best features |
| --- |
| top-k elements from feature vector X based on importance_list |
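The selection can be sketched as follows (a hypothetical Python stand-in: tie-breaking and whether the selected elements keep their original order in X are assumptions here, not guaranteed to match the actual UDF):

```python
import numpy as np

def select_k_best(x, importance, k):
    """Keep the k elements of x whose importance scores are highest.

    Illustrative stand-in for the UDF; ordering of ties follows
    NumPy's argsort, and the output preserves the original order
    of the kept elements (assumptions, not Hivemall's spec).
    """
    top = np.argsort(importance)[::-1][:k]  # indices of the k largest scores
    return [x[i] for i in sorted(top)]

features = select_k_best([10.0, 20.0, 30.0], [0.1, 0.9, 0.5], k=2)
# keeps the elements with importance 0.9 and 0.5
```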

[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>

Input

| array<array<number>> observed | array<array<number>> expected |
| --- | --- |
| observed features | expected features, e.g. dot(class_prob.T, feature_count) |

Both observed and expected have shape (#classes, #features).

Output

| struct<array<double>, array<double>> importance_list |
| --- |
| chi2-value and p-value for each feature |

[UDAF] snr(X::array<number>, Y::array<int>)::array<double>

Input

| array<number> X | array<int> Y |
| --- | --- |
| feature vector | one-hot label |

Output

| array<double> importance_list |
| --- |
| Signal Noise Ratio for each feature |