What's One-hot encoding?

Ont-hot encoding is a method to encode categorical features by a 1-of-K (thus called 1-hot) encoding scheme.

Suppose the following table:

CompanyPrice
VW290
Toyota300
Honda190
Honda250

A one-hot encoding output is expected as follows:

Company_VWCompany_ToyotaCompany_HondaPrice
100290
010300
001190
001250

The above one-hot table is a dense feature format and it can be expressed as follows by a sparse format:

CompanyPrice
{1}290
{2}300
{3}190
{3}250

The mapping for company name is {VW->1, Toyota->2, Honda->3}.

Now, suppose encoding two categorical variables as follows into a sparse vector.

category1category2
catmammal
dogmammal
humanmammal
seahawkbird
waspinsect
waspinsect
catmammal
dogmammal
humanmammal

The one-hot encoded feature vector could be as follows:

category1category2encoded_features
catmammal{1,6}
dogmammal{2,6}
humanmammal{3,6}
seahawkbird{4,7}
waspinsect{5,8}

We use this test table for explaration.

drop table test;
create table test (species string, category string, count int);

truncate table test;
insert into table test values
  ('cat','mammal',9), 
  ('dog','mammal',10),
  ('human','mammal',10),
  ('seahawk','bird',101),
  ('wasp','insect',3),
  ('wasp','insect',9),
  ('cat','mammal',101),
  ('dog','mammal',1),
  ('human','mammal',9);

One-hot encoding table

You can get one-hot encoding table for spieces as follows:

WITH t as (
  select onehot_encoding(species) m
  from test
)
select m.f1 from t;
f1
{“seahawk”:1,“cat”:2,“human”:3,“wasp”:4,“dog”:5}
WITH t as (
  select onehot_encoding(species, category) m
  from test
)
select m.f1, m.f2 from t;

| f1 | f2 | | {“seahawk”:1,“cat”:2,“human”:3,“wasp”:4,“dog”:5} | {“bird”:6,“insect”:7,“mammal”:8} |

You can create a mapping table as follows:

create table mapping as
WITH t as (
  select onehot_encoding(species, category) m
  from test
)
select m.f1, m.f2 from t;

desc mapping;

col_name    | data_type
------------|----------------
f1          | map<string,int>                             
f2          | map<string,int>   

How to use One-hot encoding

The following query applies one-hot encoding using the mapping table.

select
  t.species, m.f1[t.species],
  t.category, m.f2[t.category]
from
  test t
  CROSS JOIN mapping m;

cat 2   mammal  8 
dog 5   mammal  8 
human   3   mammal  8 
seahawk 1   bird    6 
wasp    4   insect  7 
wasp    4   insect  7 
cat 2   mammal  8 
dog 5   mammal  8
human   3   mammal  8

You can create a sparse feature vector as follows:

select
  array(m.f1[t.species],m.f2[t.category],feature('count',count)) as sparse_feature 
from
  test t
  CROSS JOIN mapping m;

sparse_feature
["2","8","count:9"]
["5","8","count:10"]
["3","8","count:10"]
["1","6","count:101"]
["4","7","count:3"]
["4","7","count:9"]
["2","8","count:101"]
["5","8","count:1"]
["3","8","count:9"]

It also can be achieved by a single query as follows:

WITH mapping as (
  select 
    m.f1, m.f2 
  from (
    select onehot_encoding(species, category) m
    from test
  ) tmp
)
select
  array(m.f1[t.species],m.f2[t.category],feature('count',count)) as sparse_features
from
  test t
  CROSS JOIN mapping m;

Note that one-hot encoding is required only for categorical variables. Feature hasing is another scalable way to encode categorical variables to numerical index.