Ont-hot encoding is a method to encode categorical features by a 1-of-K (thus called 1-hot) encoding scheme.
Suppose the following table:
Company | Price |
---|---|
VW | 290 |
Toyota | 300 |
Honda | 190 |
Honda | 250 |
A one-hot encoding output is expected as follows:
Company_VW | Company_Toyota | Company_Honda | Price |
---|---|---|---|
1 | 0 | 0 | 290 |
0 | 1 | 0 | 300 |
0 | 0 | 1 | 190 |
0 | 0 | 1 | 250 |
The above one-hot table is a dense feature format and it can be expressed as follows by a sparse format:
Company | Price |
---|---|
{1} | 290 |
{2} | 300 |
{3} | 190 |
{3} | 250 |
The mapping for company name is {VW->1, Toyota->2, Honda->3}.
Now, suppose encoding two categorical variables as follows into a sparse vector.
category1 | category2 |
---|---|
cat | mammal |
dog | mammal |
human | mammal |
seahawk | bird |
wasp | insect |
wasp | insect |
cat | mammal |
dog | mammal |
human | mammal |
The one-hot encoded feature vector could be as follows:
category1 | category2 | encoded_features |
---|---|---|
cat | mammal | {1,6} |
dog | mammal | {2,6} |
human | mammal | {3,6} |
seahawk | bird | {4,7} |
wasp | insect | {5,8} |
We use this test
table for explaration.
drop table test; create table test (species string, category string, count int); truncate table test; insert into table test values ('cat','mammal',9), ('dog','mammal',10), ('human','mammal',10), ('seahawk','bird',101), ('wasp','insect',3), ('wasp','insect',9), ('cat','mammal',101), ('dog','mammal',1), ('human','mammal',9);
You can get one-hot encoding table for spieces as follows:
WITH t as ( select onehot_encoding(species) m from test ) select m.f1 from t;
f1 |
---|
{“seahawk”:1,“cat”:2,“human”:3,“wasp”:4,“dog”:5} |
WITH t as ( select onehot_encoding(species, category) m from test ) select m.f1, m.f2 from t;
| f1 | f2 | | {“seahawk”:1,“cat”:2,“human”:3,“wasp”:4,“dog”:5} | {“bird”:6,“insect”:7,“mammal”:8} |
You can create a mapping table as follows:
create table mapping as WITH t as ( select onehot_encoding(species, category) m from test ) select m.f1, m.f2 from t; desc mapping; col_name | data_type ------------|---------------- f1 | map<string,int> f2 | map<string,int>
The following query applies one-hot encoding using the mapping table.
select t.species, m.f1[t.species], t.category, m.f2[t.category] from test t CROSS JOIN mapping m; cat 2 mammal 8 dog 5 mammal 8 human 3 mammal 8 seahawk 1 bird 6 wasp 4 insect 7 wasp 4 insect 7 cat 2 mammal 8 dog 5 mammal 8 human 3 mammal 8
You can create a sparse feature vector as follows:
select array(m.f1[t.species],m.f2[t.category],feature('count',count)) as sparse_feature from test t CROSS JOIN mapping m; sparse_feature ["2","8","count:9"] ["5","8","count:10"] ["3","8","count:10"] ["1","6","count:101"] ["4","7","count:3"] ["4","7","count:9"] ["2","8","count:101"] ["5","8","count:1"] ["3","8","count:9"]
It also can be achieved by a single query as follows:
WITH mapping as ( select m.f1, m.f2 from ( select onehot_encoding(species, category) m from test ) tmp ) select array(m.f1[t.species],m.f2[t.category],feature('count',count)) as sparse_features from test t CROSS JOIN mapping m;
Note that one-hot encoding is required only for categorical variables. Feature hasing is another scalable way to encode categorical variables to numerical index.