Introduction

Expanding aggregated label counts into the actual number of samples can improve accuracy in some cases. binarize_label explodes a record that holds the counts of positive and negative samples into that many individual rows: one row with label 1 per positive sample and one row with label 0 per negative sample. For example,

| positive | negative | features |
|----------|----------|----------|
| 2 | 3 | "[a:1, b:2]" |

is converted into

| features | label |
|----------|-------|
| "[a:1, b:2]" | 1 |
| "[a:1, b:2]" | 1 |
| "[a:1, b:2]" | 0 |
| "[a:1, b:2]" | 0 |
| "[a:1, b:2]" | 0 |

Function signature

binarize_label(int/long positive, int/long negative, ANY arg1, ANY arg2, ..., ANY argN) returns (ANY arg1, ANY arg2, ..., ANY argN, int label) where label is 0 or 1.
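
As a minimal sketch of how the trailing arguments pass through, the query below carries two columns instead of one. The clicks table and its user_id column are hypothetical names used only for illustration, and the AS (...) clause is standard Hive syntax for naming the columns a table-generating function returns.

```sql
-- Hypothetical input: `clicks` and `user_id` are illustrative names only.
WITH clicks AS (
  SELECT 1 AS positive, 2 AS negative, 'u1' AS user_id, array('a:1','b:2') AS features
)
SELECT
  -- arg1 = user_id, arg2 = features; label is appended as the last column
  binarize_label(positive, negative, user_id, features) AS (user_id, features, label)
FROM
  clicks;
```

Following the signature above, this should yield one row with label 1 and two rows with label 0, each repeating the same user_id and features values.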

Usage

-- Two records: (2 positive, 3 negative) and (2 positive, 1 negative)
WITH input AS (
  SELECT 2 AS positive, 3 AS negative, array('a:1','b:2') AS features
  UNION ALL
  SELECT 2 AS positive, 1 AS negative, array('c:3','d:4') AS features
)
SELECT
  binarize_label(positive, negative, features)
FROM
  input;

| features | label |
|----------|-------|
| ["a:1","b:2"] | 1 |
| ["a:1","b:2"] | 1 |
| ["a:1","b:2"] | 0 |
| ["a:1","b:2"] | 0 |
| ["a:1","b:2"] | 0 |
| ["c:3","d:4"] | 1 |
| ["c:3","d:4"] | 1 |
| ["c:3","d:4"] | 0 |

Caution

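The rows generated from a single record are emitted consecutively, grouped by label. Don't forget to shuffle the converted training instances into random order before training, e.g., with CLUSTER BY rand().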
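
One way to combine the expansion and the shuffle is sketched below. It assumes a source table named training_raw with positive, negative, and features columns and writes to training_shuffled; both table names are hypothetical.

```sql
-- `training_raw` and `training_shuffled` are hypothetical table names.
CREATE TABLE training_shuffled AS
SELECT
  features, label
FROM (
  -- Expand each aggregated record into one row per sample
  SELECT binarize_label(positive, negative, features) AS (features, label)
  FROM training_raw
) t
-- Distribute and sort rows by a random key so labels are interleaved
CLUSTER BY rand();
```

Wrapping the function call in a subquery keeps the expansion and the shuffle as separate steps, so the CLUSTER BY applies to the already expanded rows.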