gpcontrib/gp_sparse_vector/README.bitmap - cloudberry - Git at Google

 Revision: 02/03/10

 I’m spending most of my part time on working out the actual mechanics (with
 examples) of the matrix/vector implementation.

 What will help is to get some input on likely use-cases that can drive the work
 – though I’ve got a good number of examples from scientific and econometric
 modeling, I could always benefit from something concrete and immediate.

 Following is more on mechanics.

 Another finding on indexes, RLE compression and matrices:
 input matrices come in various flavors, many of them in non-sorted “not ready
 for sparse compression” formats
 in many circumstances there is no a-priori choice of whether there are zeros or
 NVP
 Because of (1), in the input processing and sorting, it is necessary to create
 an uncompressed index that is identical to the single RLE index I’ve described
 previously, sans the special values.

 I’ve described the input processing here in a program I wrote to process
 matrices from the Matrix Mart at NIST:

     /* The Matrix Market format does not specify the order in which the
      * successive elements are stored.  We need to compress the matrix into
      * one of CRS or CCS format (row or column compression), so we need to
      * sort the matrix into one of those forms.
      *
      * Here we'll use quicksort to sort the matrix elements into row
      * contiguous storage.
      *
      * If we desire row contiguous storage, we'll also need the column order
      * within the row preserved.  We'll do this by computing the index into
      * the matrix using a "columns progressive" ordering, which is computed
      * as NJ * I + J. Then we'll sort based on that index.  The index value
      * will range from 0 to NJ*NI-1. For any index value, you can compute I
      * and J using:
      *      I = index/NJ
      *      J = index - NJ * I
      *
      * If we sort the matrix by this index, we'll have things in row/column
      * order, suitable for CRS compression.
      *
      * Note that if we want CCS, we can construct the index using:
      *      index = NI * J + I
      */

 In the Yale/MKL formats, we deconstruct the single index into a set of two
 values for each run and that is stored as the compression index.

 I’m thinking now that for the RLE index we can do something along these lines:
 Treat all values symmetrically, that is: all runs of the various types we’re
 discussing (inf,-inf,0,NVP,value) are described by the same kind of word
 Each RLE word is composed of a leading set of descriptor bits, followed by a
 payload describing how many of the value are found.

 So if we pick the descriptor bits = 3, we can represent 8 “things”, of which 5
 things must be “what kind of value is this?”, leaving 3 things.  A good use of
 the 3 things can be word length, one of 8, 16 or 32.

 So if we go down this path, we’re left with 5, 13 and 29 bits to represent a run
 and can dynamically choose which one to use based on the data content.
 That’s 32, 8192 and 536,870,912 values in each run, which is perfect for
 capturing various kinds of sparsity.

 If we expand to one more descriptor bit, then we can represent 8 more things,
 which could be used to recognize additional types of values to compress out.
 That would leave us with 16, 4096 and 268,435,456 values in each run, which also
 works pretty well and leaves us room for the future in case we find there are
 more values than {inf,-inf,0,NVP,value}.

 Some additional results to share (see table below) – I did a computation of the
 various storage and compute requirements for the MKL/Yale index and the RLE
 index.  I used 5 matrix examples from NIST’s “Matrix Market”, read them in and
 translated them from Matrix Market format to MKL/RLE which required sorting them
 into CRS.  From this, I calculated the relevant stats on each in terms of
 storage and number of words in each of the RLE and MKL indices.

 Bottom line is that the results corroborated my earlier calculations. The RLE
 index is a lot smaller than MKL always, and is more computationally efficient
 for linear scans through the compressed matrix.  This means we should use only
 RLE for most use-cases and calculate MKL only when we need to use the
 algorithms in the matrix solver libraries.  We can optionally store the MKL
 index if we calculate it – we can provide a marker to indicate the
 presence/absence of it in the type.

 I’ve also thought through the compatibility with all the sparse solver libs –
 seems like we have easy compatibility with a CRS-based approach, so I think that
 should be the baseline.  We can provide another marker to indicate what is being
 stored – CRS, CCS, CDS, etc.

 I believe I have enough information to produce a draft design of the matrix type
 that we can review and get some other feedback about.  Here is a sketch of how I
 think it should work:
 Two separate types, one for vectors and one for matrices – but they share the
 same format and support routines.  The separate types allows definition of
 vector and matrix specific operators.
 The format stores the following metadata about the datum:
     nD, (n(i),i=1,nD): dimension and rank – for a vector, this is {1,N}; for a
         matrix it’s {2,Nrows,Ncolumns}
     nZ: number of values (non-special values)
     valType: type of value, one of {bit, byte, int4, int8, real4, complex4,
         real8, complex8}
     storType: type of storage, one of {CRS, CCS, CDS, ... Up to maybe 32 total
         with some reserved}
     specValList: a list of special values to be used in the compression, up to
         8, e.g. {inf,-inf,NVP,0}
     hasMKL: if true, the MKL index is stored
 The format stores the following data:
     values: a dense array of values, size is (nZ*sizeof(valType))
     rleIndex: an RLE index containing location of all values (including specials)
     sizeofRleIndex: the size of the RLE index for allocation purposes
     if (hasMKL) mklIndex: the MKL/Yale index, size is ((nZ+NI)*sizeof(int))

 01234567890123456789012345678901234567890123456789012345678901234567890123456789

 In order to efficiently compress the storage of multiple data types, we are
 using a bitmap that describes the positions of four special 64-bit floating
 point values: zero, negative infinity, positive infinity, and "no value present"
 or NVP.  Of these, the first three are all standard floating point values that
 can be the natural result of performing calculations. NVP is a special case
 often used in statistics and other analytical processing but is not
 representable as a standard floating point number.

 Here we're using two patterns for sparse storage and computation:
 1) recognition of compressible types and subsequent bitmap compression
 2) optimized computational usage of previously compressed values

 (1) recognizes that we're going to have many situations where special values
 arise as a natural consequence of computation.  In these cases, we will
 automatically locate and compress special values into a special bitmap storage
 format.

 (2) leverages the pre-existing compressed special value storage for optimized
 computation.  For example, if we have a vector of values V and a large fraction
 of the values in V are zero, then computing "W = 1/V" will produce a
 correspondingly large number of positive infinity values in W.  With our
 compressed bitmap storage of zeros, we can loop over the compressed zero
 elements and change their storage type from "zero" to "infinity", thereby
 saving the computation, memory and re-compression overheads.

 It's important to have a bitmap format that is both storage efficient and
 optimal for computation.  There are many existing formats for sparse vector and
 matrix storage, most of them developed for scientific computation.  The Intel
 Math Kernel Library (MKL) specifies the most common sparse matrix storage
 format, derived from the Yale Sparse Matrix format:

  "The Intel MKL direct sparse solvers use a row-major upper triangular storage
 format: the matrix is compressed row-by-row and for symmetric matrices only non-
 zero elements in the upper triangular half of the matrix are stored."

   The Intel MKL sparse matrix storage format for direct sparse solvers is
 specified by three arrays: values, columns, and rowIndex. The following table
 describes the arrays in terms of the values, row, and column positions of the
 non-zero elements in a sparse matrix.
 	values 	 - A real or complex array that contains the non-zero elements
 		   of a sparse matrix. The non-zero elements are mapped into the
 		   values array using the row-major upper triangular storage
 		   mapping described above.
 	columns  - Element I of the integer array columns is the number of the
 		   column that contains the I-th element in the values array.
 	rowIndex - Element j of the integer array rowIndex gives the index of
 		   the element in the values array that is first non-zero
 		   element in a row j.

   The length of the values and columns arrays is equal to the number of non-zero
 elements in the matrix."

 - Intel Math Kernel Library Reference Manual, March 2009, Document Number:
   630813-031US

 The MKL format stores the non-zero values compactly in a sequential array of
 either complex or double precision entries and maintains two index arrays that
 identify the location of the nonzero values in a two dimensional matrix.  The
 advantages of the MKL format are it's efficiency for computation and the fact
 it's in widespread usage in numerical libraries.

 The uncompressed size of a real matrix with NI rows and NJ columns would be:
 	Raw size = (8 bytes) * ( NI*NJ )

 The storage required by the MKL format for a real matrix with N nonzero values
 and NI number of rows is:
 	Comp Size = (8 bytes) * ( 2N  + NI )

 For a complex matrix it's:
 	Comp Size = (8 bytes) * ( 3N  + NI )

 Another way of representing the storage is to assess the "overhead" associated
 with the representation of the zero entries, which is the same for real and
 complex:
 	Overhead = (8 bytes) * ( N + NI )

 Now we can calculate the "savings" associated with using the MLK compressed
 format for a real matrix:
 	Savings = (Raw Size) - (Comp Size)
 		  (8 bytes) ( NI*NJ - 2N - NI )

 If we have a dense matrix, then N will be NI*NJ and the savings is equal to the
 negative overhead, which means we'll have a worst case savings of:
 	Dense (worst case) savings = -(8 bytes)(NI*NJ + NI)
 	                           = -(8 bytes)*NI*(NJ+1)

 For a dense matrix, this is a 100%+ storage penalty.

 There is one important shortcoming of the MKL format: it can only represent one
 type of special value - zero.  This is a consequence of the implicit nature of
 the approach, the location of the zero values is not stored, it is implied by
 the lack of a non-zero value in that matrix position.  There is no place to
 provide an attribute of the special value so there can be only one type of
 special value.

 We need an alternative format that can represent different types of special
 values. It will need to refer to the special values explicitly in an efficient
 manner. We can take advantage of the nature of most all sparse storage
 circumstances: the special values occur in relatively large runs compared to
 the non-zero values.  We can use a "run length encoding" or RLE approach to
 compose a compressed special value array, in addition to storing a regular
 packed array of the non-special values.  The index location of the non-special
 values can be inferred from the RLE array of special values during computation.

 Let's represent the array of special values using a 16-bit word to represent
 each run of special values, and an 8 bit word for each run of non-special
 values. Each of the two word types uses the first bit (the sign bit) to
 represent the run as consisting of special values (positive) or non-special
 values (negative). Each special value word uses the next 2 bits to describe what
 kind of special values are in the run, one of zero, inf, -inf and NVP.

 A complete array will be represented as a collection of these two types of
 words, terminated by a special "end of array" word consisting of 16 bits of
 zeros (0x0000). For example:
 	<special values><non special values><special values><0x0000>

 This can be seen in bytes form:
 	<2 bytes><1 byte><2 bytes><2 bytes>

 And if the special values are zeros (type 0), with runs of 8192 in each of the
 two runs with a run of 128 non-special values, the bits in this array would be:
 	<0001111111111111><11111111><0001111111111111><0000000000000000>

 Note that if there are typically a large number of zeros followed by only one
 non-zero value, then this representation will waste a lot of space in the
 non-special value words.  We could shorten the representation of non-special
 values to say 5 bits, which would allow for runs of 16 to be represented by
 each word and we would save three bits per run.  However, the alignment of the
 index computations would be less than 1 byte, which would likely reduce the
 computational efficiency.

 If a given run has more than the maximum representable by the word, words are
 concatenated as needed to represent the run.

 For matrices we can use the compressed row format, or the compressed column
 format - for purpose of the efficiency discussion let's assumed compressed rows.

 The storage efficiency of this approach is more complicated to compute, as the
 storage will depend on the dispersion of values.  We can compute best case and
 worst case scenarios however.

 A reasonable "best case" scenario will have one non-special value per row. In
 this case the storage required by the RLE format for a real matrix with N
 nonzero values, NI rows and NJ columns is:
 	Comp Size = (9 bytes) * N + (2 bytes) *(NJ*(NI/8192) +1)
 		  = (9 bytes) *NI + (2 bytes) *(NJ*(NI/8192) +1)
 			for the common case of NI < 8192:
 		  = (9 bytes) *NI + (2 bytes) *(NJ +1)

 In this case, the storage is O(NI+NJ).  For instance, if we have a matrix of
 size 2,000 by 2,000, the storage is (8 bytes) * 4,000 = 32KB.

 For a worst case scenario, we can assume alternating non-special and special
 values, in which case the storage is:
 	Comp Size = (9 bytes) * N + (2 bytes) * (NJ*(NI/2) +1)
 		  = (9 bytes) *NJ*(NI/2) + (2 bytes) * (NJ*(NI/2) +1)
 		  = (11 bytes) *NJ*(NI/2) + (2 bytes)

 What's interesting is that this worst case size is actually smaller (by almost
 half) than storing the full dense matrix.  This is due to the more efficient
 representation of the special values in 2 byte format. The overhead is nearly
 equivalent to storing the special values in addition to the non-special values.

 If we have a dense matrix, there is almost zero overhead because the special
 value array is blank and we only store the non-special data values.

 If we compare the MKL approach savings to the RLE approach in the best case
 scenario, we have an RLE savings of:
 	RLE
 	Savings = (Raw Size) - (Comp Size)
 		= (8 bytes) *NI*NJ - (9 bytes)*NI - (2 bytes)*(NJ +1)
 			for NI ~ NJ:
 		~ (8 bytes) *NI*NJ - (11 bytes)*NI
 		~ (8 bytes) *NI*(NJ-1)

 	MKL
 	Savings = (8 bytes) ( NI*NJ - 3NI )
 	        = (8 bytes) *NI*(NJ-3)

 The savings in the best case is nearly the same.  In the worst case scenario of
 a dense matrix, the savings of the RLE approach is nearly 0, which is the best
 result possible.  For MKL however, the savings in the worst case is more than
 a 100% storage penalty.

 Most cases are somewhere between best case and worst case.  We can conclude
 that the storage of the RLE scheme is going to be more efficient than MKL in
 all cases.
	Revision: 02/03/10

	I’m spending most of my part time on working out the actual mechanics (with
	examples) of the matrix/vector implementation.

	What will help is to get some input on likely use-cases that can drive the work
	– though I’ve got a good number of examples from scientific and econometric
	modeling, I could always benefit from something concrete and immediate.

	Following is more on mechanics.

	Another finding on indexes, RLE compression and matrices:
	input matrices come in various flavors, many of them in non-sorted “not ready
	for sparse compression” formats
	in many circumstances there is no a-priori choice of whether there are zeros or
	NVP
	Because of (1), in the input processing and sorting, it is necessary to create
	an uncompressed index that is identical to the single RLE index I’ve described
	previously, sans the special values.

	I’ve described the input processing here in a program I wrote to process
	matrices from the Matrix Mart at NIST:

	/* The Matrix Market format does not specify the order in which the
	* successive elements are stored. We need to compress the matrix into
	* one of CRS or CCS format (row or column compression), so we need to
	* sort the matrix into one of those forms.
	*
	* Here we'll use quicksort to sort the matrix elements into row
	* contiguous storage.
	*
	* If we desire row contiguous storage, we'll also need the column order
	* within the row preserved. We'll do this by computing the index into
	* the matrix using a "columns progressive" ordering, which is computed
	* as NJ * I + J. Then we'll sort based on that index. The index value
	* will range from 0 to NJ*NI-1. For any index value, you can compute I
	* and J using:
	* I = index/NJ
	* J = index - NJ * I
	*
	* If we sort the matrix by this index, we'll have things in row/column
	* order, suitable for CRS compression.
	*
	* Note that if we want CCS, we can construct the index using:
	* index = NI * J + I
	*/

	In the Yale/MKL formats, we deconstruct the single index into a set of two
	values for each run and that is stored as the compression index.

	I’m thinking now that for the RLE index we can do something along these lines:
	Treat all values symmetrically, that is: all runs of the various types we’re
	discussing (inf,-inf,0,NVP,value) are described by the same kind of word
	Each RLE word is composed of a leading set of descriptor bits, followed by a
	payload describing how many of the value are found.

	So if we pick the descriptor bits = 3, we can represent 8 “things”, of which 5
	things must be “what kind of value is this?”, leaving 3 things. A good use of
	the 3 things can be word length, one of 8, 16 or 32.

	So if we go down this path, we’re left with 5, 13 and 29 bits to represent a run
	and can dynamically choose which one to use based on the data content.
	That’s 32, 8192 and 536,870,912 values in each run, which is perfect for
	capturing various kinds of sparsity.

	If we expand to one more descriptor bit, then we can represent 8 more things,
	which could be used to recognize additional types of values to compress out.
	That would leave us with 16, 4096 and 268,435,456 values in each run, which also
	works pretty well and leaves us room for the future in case we find there are
	more values than {inf,-inf,0,NVP,value}.

	Some additional results to share (see table below) – I did a computation of the
	various storage and compute requirements for the MKL/Yale index and the RLE
	index. I used 5 matrix examples from NIST’s “Matrix Market”, read them in and
	translated them from Matrix Market format to MKL/RLE which required sorting them
	into CRS. From this, I calculated the relevant stats on each in terms of
	storage and number of words in each of the RLE and MKL indices.

	Bottom line is that the results corroborated my earlier calculations. The RLE
	index is a lot smaller than MKL always, and is more computationally efficient
	for linear scans through the compressed matrix. This means we should use only
	RLE for most use-cases and calculate MKL only when we need to use the
	algorithms in the matrix solver libraries. We can optionally store the MKL
	index if we calculate it – we can provide a marker to indicate the
	presence/absence of it in the type.

	I’ve also thought through the compatibility with all the sparse solver libs –
	seems like we have easy compatibility with a CRS-based approach, so I think that
	should be the baseline. We can provide another marker to indicate what is being
	stored – CRS, CCS, CDS, etc.

	I believe I have enough information to produce a draft design of the matrix type
	that we can review and get some other feedback about. Here is a sketch of how I
	think it should work:
	Two separate types, one for vectors and one for matrices – but they share the
	same format and support routines. The separate types allows definition of
	vector and matrix specific operators.
	The format stores the following metadata about the datum:
	nD, (n(i),i=1,nD): dimension and rank – for a vector, this is {1,N}; for a
	matrix it’s {2,Nrows,Ncolumns}
	nZ: number of values (non-special values)
	valType: type of value, one of {bit, byte, int4, int8, real4, complex4,
	real8, complex8}
	storType: type of storage, one of {CRS, CCS, CDS, ... Up to maybe 32 total
	with some reserved}
	specValList: a list of special values to be used in the compression, up to
	8, e.g. {inf,-inf,NVP,0}
	hasMKL: if true, the MKL index is stored
	The format stores the following data:
	values: a dense array of values, size is (nZ*sizeof(valType))
	rleIndex: an RLE index containing location of all values (including specials)
	sizeofRleIndex: the size of the RLE index for allocation purposes
	if (hasMKL) mklIndex: the MKL/Yale index, size is ((nZ+NI)*sizeof(int))

	01234567890123456789012345678901234567890123456789012345678901234567890123456789

	In order to efficiently compress the storage of multiple data types, we are
	using a bitmap that describes the positions of four special 64-bit floating
	point values: zero, negative infinity, positive infinity, and "no value present"
	or NVP. Of these, the first three are all standard floating point values that
	can be the natural result of performing calculations. NVP is a special case
	often used in statistics and other analytical processing but is not
	representable as a standard floating point number.

	Here we're using two patterns for sparse storage and computation:
	1) recognition of compressible types and subsequent bitmap compression
	2) optimized computational usage of previously compressed values

	(1) recognizes that we're going to have many situations where special values
	arise as a natural consequence of computation. In these cases, we will
	automatically locate and compress special values into a special bitmap storage
	format.

	(2) leverages the pre-existing compressed special value storage for optimized
	computation. For example, if we have a vector of values V and a large fraction
	of the values in V are zero, then computing "W = 1/V" will produce a
	correspondingly large number of positive infinity values in W. With our
	compressed bitmap storage of zeros, we can loop over the compressed zero
	elements and change their storage type from "zero" to "infinity", thereby
	saving the computation, memory and re-compression overheads.

	It's important to have a bitmap format that is both storage efficient and
	optimal for computation. There are many existing formats for sparse vector and
	matrix storage, most of them developed for scientific computation. The Intel
	Math Kernel Library (MKL) specifies the most common sparse matrix storage
	format, derived from the Yale Sparse Matrix format:

	"The Intel MKL direct sparse solvers use a row-major upper triangular storage
	format: the matrix is compressed row-by-row and for symmetric matrices only non-
	zero elements in the upper triangular half of the matrix are stored."

	The Intel MKL sparse matrix storage format for direct sparse solvers is
	specified by three arrays: values, columns, and rowIndex. The following table
	describes the arrays in terms of the values, row, and column positions of the
	non-zero elements in a sparse matrix.
	values - A real or complex array that contains the non-zero elements
	of a sparse matrix. The non-zero elements are mapped into the
	values array using the row-major upper triangular storage
	mapping described above.
	columns - Element I of the integer array columns is the number of the
	column that contains the I-th element in the values array.
	rowIndex - Element j of the integer array rowIndex gives the index of
	the element in the values array that is first non-zero
	element in a row j.

	The length of the values and columns arrays is equal to the number of non-zero
	elements in the matrix."

	- Intel Math Kernel Library Reference Manual, March 2009, Document Number:
	630813-031US

	The MKL format stores the non-zero values compactly in a sequential array of
	either complex or double precision entries and maintains two index arrays that
	identify the location of the nonzero values in a two dimensional matrix. The
	advantages of the MKL format are it's efficiency for computation and the fact
	it's in widespread usage in numerical libraries.

	The uncompressed size of a real matrix with NI rows and NJ columns would be:
	Raw size = (8 bytes) * ( NI*NJ )

	The storage required by the MKL format for a real matrix with N nonzero values
	and NI number of rows is:
	Comp Size = (8 bytes) * ( 2N + NI )

	For a complex matrix it's:
	Comp Size = (8 bytes) * ( 3N + NI )

	Another way of representing the storage is to assess the "overhead" associated
	with the representation of the zero entries, which is the same for real and
	complex:
	Overhead = (8 bytes) * ( N + NI )

	Now we can calculate the "savings" associated with using the MLK compressed
	format for a real matrix:
	Savings = (Raw Size) - (Comp Size)
	(8 bytes) ( NI*NJ - 2N - NI )

	If we have a dense matrix, then N will be NI*NJ and the savings is equal to the
	negative overhead, which means we'll have a worst case savings of:
	Dense (worst case) savings = -(8 bytes)(NI*NJ + NI)
	= -(8 bytes)NI(NJ+1)

	For a dense matrix, this is a 100%+ storage penalty.

	There is one important shortcoming of the MKL format: it can only represent one
	type of special value - zero. This is a consequence of the implicit nature of
	the approach, the location of the zero values is not stored, it is implied by
	the lack of a non-zero value in that matrix position. There is no place to
	provide an attribute of the special value so there can be only one type of
	special value.

	We need an alternative format that can represent different types of special
	values. It will need to refer to the special values explicitly in an efficient
	manner. We can take advantage of the nature of most all sparse storage
	circumstances: the special values occur in relatively large runs compared to
	the non-zero values. We can use a "run length encoding" or RLE approach to
	compose a compressed special value array, in addition to storing a regular
	packed array of the non-special values. The index location of the non-special
	values can be inferred from the RLE array of special values during computation.

	Let's represent the array of special values using a 16-bit word to represent
	each run of special values, and an 8 bit word for each run of non-special
	values. Each of the two word types uses the first bit (the sign bit) to
	represent the run as consisting of special values (positive) or non-special
	values (negative). Each special value word uses the next 2 bits to describe what
	kind of special values are in the run, one of zero, inf, -inf and NVP.

	A complete array will be represented as a collection of these two types of
	words, terminated by a special "end of array" word consisting of 16 bits of
	zeros (0x0000). For example:
	<special values><non special values><special values><0x0000>

	This can be seen in bytes form:
	<2 bytes><1 byte><2 bytes><2 bytes>

	And if the special values are zeros (type 0), with runs of 8192 in each of the
	two runs with a run of 128 non-special values, the bits in this array would be:
	<0001111111111111><11111111><0001111111111111><0000000000000000>

	Note that if there are typically a large number of zeros followed by only one
	non-zero value, then this representation will waste a lot of space in the
	non-special value words. We could shorten the representation of non-special
	values to say 5 bits, which would allow for runs of 16 to be represented by
	each word and we would save three bits per run. However, the alignment of the
	index computations would be less than 1 byte, which would likely reduce the
	computational efficiency.

	If a given run has more than the maximum representable by the word, words are
	concatenated as needed to represent the run.

	For matrices we can use the compressed row format, or the compressed column
	format - for purpose of the efficiency discussion let's assumed compressed rows.

	The storage efficiency of this approach is more complicated to compute, as the
	storage will depend on the dispersion of values. We can compute best case and
	worst case scenarios however.

	A reasonable "best case" scenario will have one non-special value per row. In
	this case the storage required by the RLE format for a real matrix with N
	nonzero values, NI rows and NJ columns is:
	Comp Size = (9 bytes) * N + (2 bytes) (NJ(NI/8192) +1)
	= (9 bytes) NI + (2 bytes) (NJ*(NI/8192) +1)
	for the common case of NI < 8192:
	= (9 bytes) NI + (2 bytes) (NJ +1)

	In this case, the storage is O(NI+NJ). For instance, if we have a matrix of
	size 2,000 by 2,000, the storage is (8 bytes) * 4,000 = 32KB.

	For a worst case scenario, we can assume alternating non-special and special
	values, in which case the storage is:
	Comp Size = (9 bytes) * N + (2 bytes) * (NJ*(NI/2) +1)
	= (9 bytes) NJ(NI/2) + (2 bytes) * (NJ*(NI/2) +1)
	= (11 bytes) NJ(NI/2) + (2 bytes)

	What's interesting is that this worst case size is actually smaller (by almost
	half) than storing the full dense matrix. This is due to the more efficient
	representation of the special values in 2 byte format. The overhead is nearly
	equivalent to storing the special values in addition to the non-special values.

	If we have a dense matrix, there is almost zero overhead because the special
	value array is blank and we only store the non-special data values.

	If we compare the MKL approach savings to the RLE approach in the best case
	scenario, we have an RLE savings of:
	RLE
	Savings = (Raw Size) - (Comp Size)
	= (8 bytes) NINJ - (9 bytes)NI - (2 bytes)(NJ +1)
	for NI ~ NJ:
	~ (8 bytes) NINJ - (11 bytes)*NI
	~ (8 bytes) NI(NJ-1)

	MKL
	Savings = (8 bytes) ( NI*NJ - 3NI )
	= (8 bytes) NI(NJ-3)

	The savings in the best case is nearly the same. In the worst case scenario of
	a dense matrix, the savings of the RLE approach is nearly 0, which is the best
	result possible. For MKL however, the savings in the worst case is more than
	a 100% storage penalty.

	Most cases are somewhere between best case and worst case. We can conclude
	that the storage of the RLE scheme is going to be more efficient than MKL in
	all cases.