| |
| Sparse Vector Datatype
| |
| ==================== |
| Background |
| ==================== |
| |
| When you use arrays of floating point numbers for various calculations, you will
| often have long runs of zeros. This is common in many kinds of applications,
| such as scientific computing, retail optimization and text processing. Each
| floating point number takes 8 bytes of storage in memory and/or on disk, so
| explicitly storing all of those zeros is often wasteful. There are also many
| computations that benefit from skipping over the zeros.
| |
| Say for example that we have the following array of doubles in Postgres stored |
| as a "float8[]": |
| '{0, 33,...40,000 zeros..., 12, 22 }'::float8[] |
| |
| This array would occupy slightly more than 320KB of memory/disk, most of it
| zeros. Also, if we perform various operations on it, we'll be doing work
| on 40,001 fields that aren't important. This kind of array arises often in
| text processing, where the dictionary may have 40-100K terms and each vector
| stores the count of each dictionary word found in a particular document.
| |
| We could use the built-in "Array" datatype in Postgres/GP, which has a bitmap
| for null values, but it is a poor choice for this use-case: it is not optimized
| for float8[] or for long runs of zeros instead of nulls, and its null bitmap is
| not RLE compressed. If we stored the zeros as NULL in the Postgres/GP array,
| the null bitmap would still need 5KB (one bit per element, 40,000+ bits) just
| to mark the nulls, which is not nearly as efficient as we'd like.
| |
| By contrast, a simple RLE scheme biased toward efficient encoding of runs of
| zeros needs only about 6 bytes of bitmap storage for this array.
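| |
| As a rough, illustrative way to see the difference for yourself (a sketch only:
| pg_column_size() and array_fill() are standard Postgres functions, but the
| exact byte counts will vary with server version and compression settings), you
| can compare the stored size of a dense float8[] of 40,000 zeros with its svec
| equivalent once the svec type introduced in the next section is installed:
| |
| select pg_column_size(array_fill(0::float8, array[40000])) as dense_bytes,
|        pg_column_size(array_fill(0::float8, array[40000])::svec) as sparse_bytes;
| |
| The dense array should come in around the 320KB figure above, while the svec
| should need only a handful of bytes.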
| |
| We also want to implement vector operators for our type that take advantage of |
| an efficient sparse storage format to make computations faster. |
| |
| ==================== |
| What is it? |
| ==================== |
| |
| The "Sparse Vector Datatype" is named "svec" and implements a vector with |
| compressed storage of zeros. You can take the above vector and cast it into an |
| svec like this: |
| ('{0, 33,...40,000 zeros..., 12, 22 }'::float8[])::svec |
| |
| This represents the same vector we put in, now stored in compressed form.
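| |
| As a small, concrete sketch of what the compressed form looks like, we can use
| a version of the vector with only 5 zeros so that it fits on one line. The
| display shown is the run-length format that svec uses (the same format appears
| in the tf_idf output later in this document), so the output below is what we
| would expect rather than a captured session:
| |
| lukelonergan=# select '{0,33,0,0,0,0,0,12,22}'::float8[]::svec;
|             svec
| -----------------------------
|  {1,1,5,1,1}:{0,33,0,12,22}
| |
| The first array lists run lengths and the second lists the corresponding
| values, so the five consecutive zeros are stored once with a count of 5.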
| |
| We can use operations with the svec type like <, >, *, %*%, /, =, +, SUM and
| COUNT_VEC, and they have the meanings associated with typical vector
| operations. The plus (+) operator adds the corresponding terms of two vectors
| of the same dimension together.
| For instance, if we have a = {0,1,5} and b = {4,3,2}, then a+b = {4,4,7}. We |
| can see this using the svec type like this: |
| lukelonergan=# select |
| ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[]; |
| float8 |
| --------- |
| {4,4,7} |
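| |
| If we leave off the cast back to float8[], we see the result in svec's
| compressed run-length form instead (again, the output below is what we would
| expect from the display format described above, not a captured session):
| |
| lukelonergan=# select '{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec;
|   ?column?
| -------------
|  {2,1}:{4,7}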
| |
| A vector dot product (%*%) between these two produces a scalar result of
| type float8. The dot product should be (0*4+1*3+5*2)=13, like this:
| lukelonergan=# select '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
| ?column? |
| ---------- |
| 13 |
| |
| A scalar, or element-wise, product (*) between these two multiplies the
| individual terms. The result should be {0*4,1*3,5*2}={0,3,10}, like
| this:
| lukelonergan=# select
| ('{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec)::float8[];
| float8 |
| ---------- |
| {0,3,10} |
| |
| Special vector aggregate functions are also useful. SUM is self-explanatory:
| it adds a set of svecs together element by element. COUNT_VEC (as opposed to
| COUNT) counts, for each position, how many non-zero values occur in that
| position across a set of svecs, and returns an svec of those counts. For
| instance, if we have {0,1,5},{10,0,3},{0,0,3},{0,1,0}, then COUNT_VEC() would
| return {1,2,3}:
| lukelonergan=# create table list (a svec); |
| CREATE TABLE |
| lukelonergan=# insert into list values |
| ('{0,1,5}'::float8[]), |
| ('{10,0,3}'::float8[]), |
| ('{0,0,3}'::float8[]), |
| ('{0,1,0}'::float8[]); |
| INSERT 0 4 |
| |
| lukelonergan=# select COUNT_VEC(a)::float8[] from list; |
| count_vec |
| ----------- |
| {1,2,3} |
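| |
| For completeness, here is what we would expect SUM to return over the same
| table, assuming it behaves as the element-wise vector sum described above
| ({0,1,5}+{10,0,3}+{0,0,3}+{0,1,0} = {10,2,11}):
| |
| lukelonergan=# select SUM(a)::float8[] from list;
|     sum
| -----------
|  {10,2,11}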
| |
| ==================== |
| Example |
| ==================== |
| For a text classification example, let's assume we have a dictionary composed
| of words stored in a text array:
| |
| create table features (a text[][]) distributed randomly; |
| insert into features values
| ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}');
| |
| We have a set of documents, also defined as arrays of words: |
| |
| lukelonergan=# create table documents(a int,b text[]); |
| CREATE TABLE |
| lukelonergan=# insert into documents values |
| (1,'{this,is,one,document,in,the,corpus}'), |
| (2,'{i,am,the,second,document,in,the,corpus}'), |
| (3,'{being,third,never,really,bothered,me,until,now}'),
| (4,'{the,document,before,me,is,the,third,document}'); |
| |
| Now that we have a dictionary and some documents, we would like to do some
| document classification using vector arithmetic on word counts and proportions
| of dictionary words in each document.
| |
| To start this process, we'll need to find the dictionary words in each
| document. We'll prepare what is called a Sparse Feature Vector or SFV for each
| document. An SFV is a vector of dimension N, where N is the number of
| dictionary words, and each element of the SFV holds the count of the
| corresponding dictionary word in the document.
| |
| Inside the sparse vector toolset, we have a function that will create an SFV |
| from a document, so we can just do this: |
| |
| lukelonergan=# select gp_extract_feature_histogram( |
| (select a from features limit 1),b)::float8[] from documents; |
| gp_extract_feature_histogram |
| ----------------------------------------- |
| {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} |
| {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} |
| {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} |
| {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} |
| |
| Note that the output of gp_extract_feature_histogram is an svec for each |
| document containing the count of each of the dictionary words in the ordinal |
| positions of the dictionary. This can more easily be understood by doing this: |
| |
| lukelonergan=# select gp_extract_feature_histogram( |
| (select a from features limit 1),b)::float8[],b from documents; |
| gp_extract_feature_histogram | b |
| -----------------------------------------+-------------------------------------------------- |
| {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus} |
| {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document} |
| {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus} |
| {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now} |
| |
| lukelonergan=# select * from features; |
| a |
| -------------------------------------------------------------------------------------------------------- |
| {am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until} |
| |
| Now when we look at the document "i am the second document in the corpus" (the
| first row of the output above), its SFV is {1,3*0,1,1,1,1,6*0,1,2,3*0}. The
| word "am" is the first ordinate in the dictionary and there is 1 instance of it
| in the SFV. "before" has no instances in the document, so its value is "0",
| and so on.
| |
| gp_extract_feature_histogram is heavily optimized for speed - in essence it is
| a single-routine version of a hash join - and should be able to process large
| numbers of documents into their SFVs in parallel at high speed.
| |
| The rest of the classification process is all vector math. As Brian Dolan |
| describes the process: |
| The actual count is hardly ever used. Instead, it's turned into a weight. |
| The most common weight is called tf/idf for "Term Frequency / Inverse |
| Document Frequency". |
| The calculation for a given term in a given document is |
| {#Times in this document} * log {#Documents / #Documents the term appears in} |
| For instance, the term "document" in document A (docnum 1 above) would have weight
| |
| 1 * log (4/3). |
| |
| In document D (docnum 4) it would have weight
| |
| 2 * log (4/3). |
| |
| Terms that appear in every document would have tf/idf weight 0, since |
| log (4/4) = log(1) = 0. |
| (Our example has no term like that.) That usually sends a lot of values to |
| 0. |
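| |
| As a quick numeric check of the formula (ln() is the standard Postgres natural
| log function; the values shown are simply ln(4/3) and ln(4), which are among
| the weights that show up in the tf_idf output below):
| |
| lukelonergan=# select ln(4::float8/3) as idf_common_term, ln(4::float8/1) as idf_rare_term;
|   idf_common_term  |  idf_rare_term
| --------------------+------------------
|   0.287682072451781 | 1.38629436111989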
| |
| For this part of the processing, we'll need a sparse vector of the dictionary
| dimension (19) whose values are log(#documents / #documents each term appears
| in). There will be one such vector for the whole list of documents (aka the
| "corpus"). The #documents is just a count of all of the documents, in this
| case 4, but there is one divisor for each dictionary word, and its value is the
| number of documents in which that word appears. This single vector for the
| whole corpus can then be multiplied element-wise (*) with each document's SFV
| to produce the Term Frequency / Inverse Document Frequency weights.
| |
| This can be done as follows: |
| lukelonergan=# create table corpus as
| (select a docnum,
| gp_extract_feature_histogram(
| (select a from features limit 1),
| b) a from documents);
| lukelonergan=# SELECT docnum, (a*logidf) tf_idf |
| FROM (SELECT log(count(a)/count_vec(a)) logidf FROM corpus) foo, |
| corpus |
| ORDER BY docnum; |
| docnum | tf_idf |
| --------+---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1 | {4,1,1,1,2,3,1,2,1,1,1,1}:{0,0.693147180559945,0.287682072451781,0,0.693147180559945,0,1.38629436111989,0,0.287682072451781,0,1.38629436111989,0} |
| 2 | {1,3,1,1,1,1,6,1,1,3}:{1.38629436111989,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,1.38629436111989,0.575364144903562,0} |
| 3 | {2,2,5,1,2,1,1,2,1,1,1}:{0,1.38629436111989,0,0.693147180559945,1.38629436111989,0,1.38629436111989,0,0.693147180559945,0,1.38629436111989} |
| 4 | {1,1,3,1,2,2,5,1,1,2}:{0,1.38629436111989,0,0.575364144903562,0,0.693147180559945,0,0.575364144903562,0.693147180559945,0} |
| |
| We can now get the "angular distance" between one document and the rest of the
| documents using the ACOS of the dot product of the document vectors divided by
| the product of their norms (the cosine of the angle between them):
| |
| lukelonergan=# CREATE TABLE weights AS |
| (SELECT docnum, (a*logidf) tf_idf |
| FROM (SELECT log(count(a)/count_vec(a)) logidf FROM corpus) foo, |
| corpus |
| ORDER BY docnum) |
| DISTRIBUTED RANDOMLY; |
| |
| Now we calculate the angular distance between the first document and each other
| document:
| |
| lukelonergan=# SELECT docnum, |
| 180.*(ACOS(dmin(1., |
| (tf_idf%*%testdoc)/(svec_l2norm(tf_idf) |
| *svec_l2norm(testdoc))))/3.141592654) angular_distance |
| FROM weights, |
| (SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo |
| ORDER BY 1; |
| docnum | angular_distance |
| --------+------------------ |
| 1 | 0 |
| 2 | 78.8235846096986 |
| 3 | 89.9999999882484 |
| 4 | 80.0232034288617 |
| |
| We can see that the angular distance between document 1 and itself is 0 degrees,
| and between document 1 and document 3 it is 90 degrees (up to floating point
| rounding) because they share no dictionary features at all.
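| |
| As a final usage sketch (nothing new here - it simply reuses the weights table
| and the expression above with a filter, sort and limit), the same calculation
| can be used to find the closest document to document 1; from the results above
| we would expect it to be document 2:
| |
| lukelonergan=# SELECT docnum,
|                  180.*(ACOS(dmin(1.,
|                      (tf_idf%*%testdoc)/(svec_l2norm(tf_idf)
|                       *svec_l2norm(testdoc))))/3.141592654) angular_distance
|                FROM weights,
|                     (SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
|                WHERE docnum <> 1
|                ORDER BY angular_distance
|                LIMIT 1;
|  docnum | angular_distance
| --------+------------------
|       2 | 78.8235846096986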