| |
| Sparse Vector Datatype
| |
| ==================== |
| Background |
| ==================== |
| |
| When you use arrays of floating point numbers for various calculations, you will
| often have long runs of zeros. This is common in many kinds of applications,
| such as scientific computing, retail optimization and text processing. Each
| floating point number takes 8 bytes of storage in memory and/or on disk, so
| explicitly storing all of those zeros is often wasteful. There are also many
| computations that benefit from skipping over the zeros.
| |
| Say for example that we have the following array of doubles in Postgres stored |
| as a "float8[]": |
| '{0, 33,...40,000 zeros..., 12, 22 }'::float8[] |
| |
| This array would occupy slightly more than 320KB of memory/disk, most of it
| zeros. Also, if we perform various operations on it, we'll be doing work
| on 40,001 fields that aren't important. This kind of array arises often in
| text processing, where the dictionary may have 40-100K terms and each vector
| stores the count of each dictionary word found in a particular document.
| |
| We could use the built-in "Array" datatype in Postgres/GP, which has a bitmap
| for null values, but it is a poor choice for this use-case: it is not optimized
| for float8[] or for long runs of zeros instead of nulls, and its null bitmap is
| not RLE compressed. If we stored the zeros as NULL in the Postgres/GP array,
| the null bitmap would still need 5KB (one bit per element, 40,000+ bits) just
| to mark the nulls, which is not nearly as efficient as we'd like.
| |
| By contrast, a simple RLE scheme biased toward efficient encoding of runs of
| zeros needs only about 6 bytes of bitmap storage for this array.
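| |
| As a rough, illustrative way to see the difference for yourself (a sketch only:
| pg_column_size() and array_fill() are standard Postgres functions, but the
| exact byte counts will vary with server version and compression settings), you
| can compare the stored size of a dense float8[] of 40,000 zeros with its svec
| equivalent once the svec type introduced in the next section is installed:
| |
| select pg_column_size(array_fill(0::float8, array[40000])) as dense_bytes,
|        pg_column_size(array_fill(0::float8, array[40000])::svec) as sparse_bytes;
| |
| The dense array should come in around the 320KB figure above, while the svec
| should need only a handful of bytes.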
| |
| We also want to implement vector operators for our type that take advantage of |
| an efficient sparse storage format to make computations faster. |
| |
| ==================== |
| What is it? |
| ==================== |
| |
| The "Sparse Vector Datatype" is named "svec" and implements a vector with |
| compressed storage of zeros. You can take the above vector and cast it into an |
| svec like this: |
| ('{0, 33,...40,000 zeros..., 12, 22 }'::float8[])::svec |
| |
| This represents the same vector we put in, now stored in compressed form.
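| |
| As a small, concrete sketch of what the compressed form looks like, we can use
| a version of the vector with only 5 zeros so that it fits on one line. The
| display shown is the run-length format that svec uses (the same format appears
| in the tf_idf output later in this document), so the output below is what we
| would expect rather than a captured session:
| |
| lukelonergan=# select '{0,33,0,0,0,0,0,12,22}'::float8[]::svec;
|             svec
| -----------------------------
|  {1,1,5,1,1}:{0,33,0,12,22}
| |
| The first array lists run lengths and the second lists the corresponding
| values, so the five consecutive zeros are stored once with a count of 5.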
| |
| We can use operations with the svec type like <, >, *, %*%, /, =, +, SUM and
| COUNT_VEC, and they have the meanings associated with typical vector
| operations. The plus (+) operator adds the corresponding terms of two vectors
| of the same dimension together.
| For instance, if we have a = {0,1,5} and b = {4,3,2}, then a+b = {4,4,7}. We |
| can see this using the svec type like this: |
| lukelonergan=# select |
| ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[]; |
| float8 |
| --------- |
| {4,4,7} |
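| |
| If we leave off the cast back to float8[], we see the result in svec's
| compressed run-length form instead (again, the output below is what we would
| expect from the display format described above, not a captured session):
| |
| lukelonergan=# select '{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec;
|   ?column?
| -------------
|  {2,1}:{4,7}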
| |
| A vector dot product (%*%) between these two produces a scalar result of
| type float8. The dot product should be (0*4+1*3+5*2)=13, like this:
| lukelonergan=# select '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
| ?column? |
| ---------- |
| 13 |
| |
| A scalar, or element-wise, product (*) between these two multiplies the
| individual terms. The result should be {0*4,1*3,5*2}={0,3,10}, like
| this:
| lukelonergan=# select
| ('{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec)::float8[];
| float8 |
| ---------- |
| {0,3,10} |
| |
| Special vector aggregate functions are also useful. SUM is self-explanatory:
| it adds a set of svecs together element by element. COUNT_VEC (as opposed to
| COUNT) counts, for each position, how many non-zero values occur in that
| position across a set of svecs, and returns an svec of those counts. For
| instance, if we have {0,1,5},{10,0,3},{0,0,3},{0,1,0}, then COUNT_VEC() would
| return {1,2,3}:
| lukelonergan=# create table list (a svec); |
| CREATE TABLE |
| lukelonergan=# insert into list values |
| ('{0,1,5}'::float8[]), |
| ('{10,0,3}'::float8[]), |
| ('{0,0,3}'::float8[]), |
| ('{0,1,0}'::float8[]); |
| INSERT 0 4 |
| |
| lukelonergan=# select COUNT_VEC(a)::float8[] from list; |
| count_vec |
| ----------- |
| {1,2,3} |
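| |
| For completeness, here is what we would expect SUM to return over the same
| table, assuming it behaves as the element-wise vector sum described above
| ({0,1,5}+{10,0,3}+{0,0,3}+{0,1,0} = {10,2,11}):
| |
| lukelonergan=# select SUM(a)::float8[] from list;
|     sum
| -----------
|  {10,2,11}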
| |
| ==================== |
| Example |
| ==================== |
| For a text classification example, let's assume we have a dictionary composed
| of words stored in a text array:
| |
| create table features (a text[][]) distributed randomly; |
| insert into features values
| ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}');
| |
| We have a set of documents, also defined as arrays of words: |
| |
| lukelonergan=# create table documents(a int,b text[]); |
| CREATE TABLE |
| lukelonergan=# insert into documents values |
| (1,'{this,is,one,document,in,the,corpus}'), |
| (2,'{i,am,the,second,document,in,the,corpus}'), |
| (3,'{being,third,never,really,bothered,me,until,now}'),
| (4,'{the,document,before,me,is,the,third,document}'); |
| |
| Now that we have a dictionary and some documents, we would like to do some
| document classification using vector arithmetic on word counts and proportions
| of dictionary words in each document.
| |
| To start this process, we'll need to find the dictionary words in each
| document. We'll prepare what is called a Sparse Feature Vector or SFV for each
| document. An SFV is a vector of dimension N, where N is the number of
| dictionary words, and each element of the SFV holds the count of the
| corresponding dictionary word in the document.
| |
| Inside the sparse vector toolset, we have a function that will create an SFV |
| from a document, so we can just do this: |
| |
| lukelonergan=# select gp_extract_feature_histogram( |
| (select a from features limit 1),b)::float8[] from documents; |
| gp_extract_feature_histogram |
| ----------------------------------------- |
| {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} |
| {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} |
| {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} |
| {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} |
| |
| Note that the output of gp_extract_feature_histogram is an svec for each |
| document containing the count of each of the dictionary words in the ordinal |
| positions of the dictionary. This can more easily be understood by doing this: |
| |
| lukelonergan=# select gp_extract_feature_histogram( |
| (select a from features limit 1),b)::float8[],b from documents; |
| gp_extract_feature_histogram | b |
| -----------------------------------------+-------------------------------------------------- |
| {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus} |
| {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document} |
| {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus} |
| {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now} |
| |
| lukelonergan=# select * from features; |
| a |
| -------------------------------------------------------------------------------------------------------- |
| {am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until} |
| |
| Now when we look at the document "i am the second document in the corpus" (the
| first row of the output above), its SFV is {1,3*0,1,1,1,1,6*0,1,2,3*0}. The
| word "am" is the first ordinate in the dictionary and there is 1 instance of it
| in the SFV. "before" has no instances in the document, so its value is "0",
| and so on.
| |
| gp_extract_feature_histogram is heavily optimized for speed - in essence it is
| a single-routine version of a hash join - and should be able to process large
| numbers of documents into their SFVs in parallel at high speed.
| |
| The rest of the classification process is all vector math. As Brian Dolan |
| describes the process: |
| The actual count is hardly ever used. Instead, it's turned into a weight. |
| The most common weight is called tf/idf for "Term Frequency / Inverse |
| Document Frequency". |
| The calculation for a given term in a given document is |
| {#Times in this document} * log {#Documents / #Documents the term appears in} |
| For instance, the term "document" in document A (docnum 1 above) would have weight
| |
| 1 * log (4/3). |
| |
| In document D (docnum 4) it would have weight
| |
| 2 * log (4/3). |
| |
| Terms that appear in every document would have tf/idf weight 0, since |
| log (4/4) = log(1) = 0. |
| (Our example has no term like that.) That usually sends a lot of values to |
| 0. |
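| |
| As a quick numeric check of the formula (ln() is the standard Postgres natural
| log function; the values shown are simply ln(4/3) and ln(4), which are among
| the weights that show up in the tf_idf output below):
| |
| lukelonergan=# select ln(4::float8/3) as idf_common_term, ln(4::float8/1) as idf_rare_term;
|   idf_common_term  |  idf_rare_term
| --------------------+------------------
|   0.287682072451781 | 1.38629436111989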
| |
| For this part of the processing, we'll need a sparse vector of the dictionary
| dimension (19) whose values are log(#documents / #documents each term appears
| in). There will be one such vector for the whole list of documents (aka the
| "corpus"). The #documents is just a count of all of the documents, in this
| case 4, but there is one divisor for each dictionary word, and its value is the
| number of documents in which that word appears. This single vector for the
| whole corpus can then be multiplied element-wise (*) with each document's SFV
| to produce the Term Frequency / Inverse Document Frequency weights.
| |
| This can be done as follows: |
| lukelonergan=# create table corpus as
| (select a docnum,
| gp_extract_feature_histogram(
| (select a from features limit 1),
| b) a from documents);
| lukelonergan=# SELECT docnum, (a*logidf) tf_idf |
| FROM (SELECT log(count(a)/count_vec(a)) logidf FROM corpus) foo, |
| corpus |
| ORDER BY docnum; |
| docnum | tf_idf |
| --------+---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1 | {4,1,1,1,2,3,1,2,1,1,1,1}:{0,0.693147180559945,0.287682072451781,0,0.693147180559945,0,1.38629436111989,0,0.287682072451781,0,1.38629436111989,0} |
| 2 | {1,3,1,1,1,1,6,1,1,3}:{1.38629436111989,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,1.38629436111989,0.575364144903562,0} |
| 3 | {2,2,5,1,2,1,1,2,1,1,1}:{0,1.38629436111989,0,0.693147180559945,1.38629436111989,0,1.38629436111989,0,0.693147180559945,0,1.38629436111989} |
| 4 | {1,1,3,1,2,2,5,1,1,2}:{0,1.38629436111989,0,0.575364144903562,0,0.693147180559945,0,0.575364144903562,0.693147180559945,0} |
| |
| We can now get the "angular distance" between one document and the rest of the
| documents using the ACOS of the dot product of the document vectors divided by
| the product of their norms (the cosine of the angle between them):
| |
| lukelonergan=# CREATE TABLE weights AS |
| (SELECT docnum, (a*logidf) tf_idf |
| FROM (SELECT log(count(a)/count_vec(a)) logidf FROM corpus) foo, |
| corpus |
| ORDER BY docnum) |
| DISTRIBUTED RANDOMLY; |
| |
| Now we calculate the angular distance between the first document and each other
| document:
| |
| lukelonergan=# SELECT docnum, |
| 180.*(ACOS(dmin(1., |
| (tf_idf%*%testdoc)/(svec_l2norm(tf_idf) |
| *svec_l2norm(testdoc))))/3.141592654) angular_distance |
| FROM weights, |
| (SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo |
| ORDER BY 1; |
| docnum | angular_distance |
| --------+------------------ |
| 1 | 0 |
| 2 | 78.8235846096986 |
| 3 | 89.9999999882484 |
| 4 | 80.0232034288617 |
| |
| We can see that the angular distance between document 1 and itself is 0 degrees,
| and between document 1 and document 3 it is 90 degrees (up to floating point
| rounding) because they share no dictionary features at all.
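| |
| As a final usage sketch (nothing new here - it simply reuses the weights table
| and the expression above with a filter, sort and limit), the same calculation
| can be used to find the closest document to document 1; from the results above
| we would expect it to be document 2:
| |
| lukelonergan=# SELECT docnum,
|                  180.*(ACOS(dmin(1.,
|                      (tf_idf%*%testdoc)/(svec_l2norm(tf_idf)
|                       *svec_l2norm(testdoc))))/3.141592654) angular_distance
|                FROM weights,
|                     (SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
|                WHERE docnum <> 1
|                ORDER BY angular_distance
|                LIMIT 1;
|  docnum | angular_distance
| --------+------------------
|       2 | 78.8235846096986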