On Dynamic Word Size for the RLE index
==========================================================================
I thought we'd use a method where we identify the size of the next word in
the index by leveraging the top two bits of the word to describe how large
the following word is.
This idea works if the word is laid out with the "top bits" inside the
first byte. However, little endian systems (Intel, AMD) store the topmost
bits of their words in the *last* byte of the word, which means that we
can't tell what size the word is without reading the whole word. We can't
read the word without knowing its size, so this is a non-starter.
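To make the problem concrete, here is a small hypothetical illustration in
C (not taken from the index code): it places a 2-bit size tag in the top
bits of a 16-bit word and then inspects the first byte of the word's
in-memory representation. On a little endian machine that first byte is
the least significant byte, so the tag is invisible until the whole word
has been read.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Pretend the top two bits of this 16-bit word are a size tag. */
        uint16_t word = (uint16_t)((1u << 14) | 2000u);
        unsigned char bytes[sizeof(word)];

        memcpy(bytes, &word, sizeof(word));
        /* On Intel/AMD (little endian) bytes[0] is the low byte (0xd0);
         * the tag bits sit in bytes[1], the *last* byte of the word, so
         * reading the first byte tells us nothing about the word's size. */
        printf("first byte in memory: 0x%02x\n", bytes[0]);
        return 0;
    }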
We could encode the size information into the bottom bits of the word,
but then we end up with different code paths for big endian systems and
little endian systems. This is dangerous, and means we would not have
a portable binary format between systems with different endian-ness.
An alternative strategy would use the entire first byte of an RLE word
to encode the following word size. A modification of this scheme would
encode the word size into the top two bits of the first byte, and if
the word size is a single byte, then the remainder of the byte would
represent the encoded length. This preserves the worst-case compression
of the previous scheme and adds an overhead of one byte per RLE word in
the middle compression cases.
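As an illustration of that modified scheme (the tag-to-width mapping here
is only assumed for exposition, say tag values 00/01/10/11 selecting 1-,
2-, 4- and 8-byte words): a length of 13 could be stored as the single
byte 0x0d (tag 00, the length itself in the low six bits), while a length
of 2000 would become a tag byte of 0x40 (tag 01, a 2-byte word follows)
followed by the two bytes holding 2000, three bytes in total.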
==========================================================================
A multi-platform friendly dynamic word size RLE scheme
==========================================================================
We have a sequence of numbers and we want a scheme that will store them in
the most compact format possible using binary types of 8, 16, 32 and 64 bit
widths.
Let's say we have the following sequence: {13, 2000, 2*10^9, 3*10^12}. We
can see that it would require one byte to minimally store the first number,
two to store the second, four to store the third and eight for the fourth.
We can store the sequence using these minimum widths and we will have an
array with sizes (in bytes): {1,2,4,8}. However, to read this array, we
need to be able to determine the width of each word.
Here we will use a "length word" to describe the length of the storage
type for each entry. The size of the length word will always be 1 byte,
though in the case where the stored number is less than 127, we will not
use a length word at all; we'll just store the negated value (e.g. 13 is
stored as -13) in a single byte. The value of the length word will be one
of 1, 2, 4 or 8 to indicate the width, in bytes, of the word that follows.
So our example would be stored in an array with the following word widths
and contents (in parentheses):
{1(-13), 1(2), 2(2000), 1(4), 4(2*10^9), 1(8), 8(3*10^12)}
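A minimal sketch of a writer for this scheme follows. The function name
(rle_write_len) is made up for illustration, run lengths are assumed to be
unsigned and at least 1, and the multi-byte words are written in host byte
order here; a real on-disk format would have to pin down one byte order,
which the description above leaves open.

    #include <stdint.h>
    #include <string.h>

    /* Encode one run length into 'out' (room for at least 9 bytes is
     * assumed) and return the number of bytes written. */
    static size_t rle_write_len(uint8_t *out, uint64_t len)
    {
        if (len < 127) {
            /* Small value: store its negation in a single signed byte,
             * no length word needed (13 is stored as -13). */
            out[0] = (uint8_t)(-(int8_t)len);
            return 1;
        }
        if (len <= UINT8_MAX) {
            out[0] = 1;                     /* length word: 1-byte value */
            out[1] = (uint8_t)len;
            return 2;
        }
        if (len <= UINT16_MAX) {
            uint16_t v = (uint16_t)len;
            out[0] = 2;                     /* length word: 2-byte value */
            memcpy(out + 1, &v, sizeof(v));
            return 1 + sizeof(v);
        }
        if (len <= UINT32_MAX) {
            uint32_t v = (uint32_t)len;
            out[0] = 4;                     /* length word: 4-byte value */
            memcpy(out + 1, &v, sizeof(v));
            return 1 + sizeof(v);
        }
        out[0] = 8;                         /* length word: 8-byte value */
        memcpy(out + 1, &len, sizeof(len));
        return 1 + sizeof(len);
    }

Run over the example sequence, this produces exactly the 18-byte layout
shown above (modulo the byte order chosen for the multi-byte words).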
The total size of this array is 18 bytes, which should be compared to the
size required if an 8 byte width was used to store all words, in which
case we would need 32 bytes to store the array. This is a savings of
nearly half in this simple case. In the worst case where we are storing
numbers that all require 8 byte widths, we would be adding one byte per
word, or a 12.5% overhead. If we have lots of 1 byte words interspersed
with an occasional 8 byte word, the savings move toward an 8:1 compression
ratio.
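Spelling the byte counts out for the example above: a fixed 8-byte width
costs 4 * 8 = 32 bytes, while the dynamic encoding costs
1 + (1 + 2) + (1 + 4) + (1 + 8) = 18 bytes. In the all-8-byte worst case
each word costs 9 bytes instead of 8, i.e. 1/8 = 12.5% overhead, and a
long run of values below 127 costs 1 byte per word against 8, which is
where the 8:1 figure comes from.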
The algorithm for reading this array is as follows:
    Read first byte "B"
    if (B < 0)
        N = -(B)
    else
        Copy the following "B"-byte word into N
We have to be careful about alignment when reading the dynamic sized word
from the array - we can't just cast the next "B" bytes into the appropriate
sized type and dereference it. That would be an unaligned access, which
will cause an abort on some architectures and poor performance on others.
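Here is a minimal sketch of the corresponding read path, again with
made-up names and the same host-byte-order assumption as the writer sketch
above. It copies the stored word into an aligned local with memcpy instead
of casting and dereferencing, which sidesteps the unaligned access
problem; a real reader would additionally check that it does not run off
the end of the array.

    #include <stdint.h>
    #include <string.h>

    /* Decode one run length starting at 'in', store it in *len and
     * return the number of bytes consumed (0 on a bad length word). */
    static size_t rle_read_len(const uint8_t *in, uint64_t *len)
    {
        int8_t b = (int8_t)in[0];

        if (b < 0) {
            *len = (uint64_t)(-b);          /* small value stored negated */
            return 1;
        }

        /* 'b' is the width (1, 2, 4 or 8) of the word that follows; copy
         * it into a properly aligned local rather than casting, so no
         * unaligned access can occur. */
        switch (b) {
        case 1: {
            uint8_t v;
            memcpy(&v, in + 1, sizeof(v));
            *len = v;
            return 1 + sizeof(v);
        }
        case 2: {
            uint16_t v;
            memcpy(&v, in + 1, sizeof(v));
            *len = v;
            return 1 + sizeof(v);
        }
        case 4: {
            uint32_t v;
            memcpy(&v, in + 1, sizeof(v));
            *len = v;
            return 1 + sizeof(v);
        }
        case 8: {
            uint64_t v;
            memcpy(&v, in + 1, sizeof(v));
            *len = v;
            return 1 + sizeof(v);
        }
        }
        return 0;
    }

A reader loop would simply keep adding the returned byte count to its
offset until it reaches the end of the index.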