All the sketches in the theta package share a common byte array data structure and similar strategies for managing use of memory when the sketches are instantiated on the Java heap or when instantiated in off-heap native memory.
The byte array structure has two forms Hash Table and Compact.
The Hash Table form is similar to how the sketch is instantiated on the Java heap. Hash tables consume more space depending on how full the table is. However, updating the sketch is much faster in this form and is the default for all the Update Sketches.
There are two Update Sketch families, QuickSelect, and Alpha. The QuickSelect family has both a Java heap and a direct, off-heap variants while the Alpha sketch is designed only for operation on the Java heap.
The Update Sketches consist of an internal hash table cache and a few other primitives. How the cache is sized and/or resized can be controlled by the user specified ResizeFactor enum that has the values X1, X2, X4, and X8 (default). The meaning of these values is as follows:
Resizing a cache is costly as a new array needs to be allocated and the hash table rebuilt. Depending on the application, the user can select the Resize Factor to trade-off memory size and overall update speed performance.
When an Update Sketch is converted to a byte array using toByteArray(), the internal structure is a 24-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the sketch to be quickly reconstructed from the byte array so that updating of the sketch can be continued.
Once the updating of a sketch is completed the HT is no longer needed, so the sketch can be stored in a compact form. The size of this compact form is a simple function of the number of retained hash values (8 bytes) and a small preamble that varies from 8 to 24 bytes depending on the internal state of the sketch. An empty sketch is represented by only 8 bytes. The upper limit of the size of the sketch varies by the type of sketch but is in the range of 8k to 16k. k is the configured size of the sketch in nominal entries, and also determines the accuracy of the sketch.
Compact Sketches optimize memory usage and sketch merge performance, are inherently read only and can only be created from an existing Update Sketch or Set Operation (Union, Intersection, or AnotB) object. The internal structure of Compact Sketches is an 8, 16, or 24 byte preamble followed by a contiguous data array of 8-byte, long data entries. This data array can be either ordered or unordered. This data structure is the same whether the Compact Sketch is instantiated on the Java heap or in off-heap direct memory. An empty Compact Sketch consumes only 8 bytes.
The ordered form is ideal for systems environments where the building of the sketches from data occur in an offline system (like a Hadoop grid), then compacted, ordered and uploaded to a real-time query engine (like Druid) where the compact sketches can be quickly merged to satisfy end-user queries. The ordered form enables early stop enhancements to the merge algorithm that makes the merging extreemly fast (~ 10's of millions of sketches per second).
The unordered form is more desirable for systems environments where the building of the sketches from data occur in real-time and queried in real-time. In this environment there is no need to pay the cost of the sort.
Thus, the choice of ordered or unordered is a tradeoff between real-time sketch build & getEstimate() performance and offline sketch-build and real-time merge performance.
The Compact Sketch is created four different ways selected by the ordering preference (dstOrdered) and whether the Compact Sketch will reside on-heap or off-heap (dstMemory).
These tables compute the size of a sketch after it has been converted into Compact Form. The size of a sketch during the build phase is explained above as the sketch starts small and resizes by the configurable Resize Factor up to the in-memory size of 2k*8 bytes plus a few primitives.
Note: a sketch entry = 8 bytes.
The number of valid entries in the Quick Select Sketch after it enters estimation mode statistically varies from k to 15k/8 with an average of about 3k/2. It is a user option to force a rebuild() prior to compacting the sketch in which case the number of valid entries is never larger than k.
| Empty | After Rebuild() | Estimating Avg | Estimating Max Nominal Entries (k) : Formula -> | 8 | k8 +24 | k12 + 24 | k*15 + 24 ----------------|-------------|-------------|------------|-------------- 16 | 8 | 152 | 216 | 264 32 | 8 | 280 | 408 | 504 64 | 8 | 536 | 792 | 984 128 | 8 | 1,048 | 1,560 | 1,944 256 | 8 | 2,072 | 3,096 | 3,864 512 | 8 | 4,120 | 6,168 | 7,704 1,024 | 8 | 8,216 | 12,312 | 15,384 2,048 | 8 | 16,408 | 24,600 | 30,744 4,096 | 8 | 32,792 | 49,176 | 61,464 8,192 | 8 | 65,560 | 98,328 | 122,904 16,384 | 8 | 131,096 | 196,632 | 245,784 32,768 | 8 | 262,168 | 393,240 | 491,544 65,536 | 8 | 524,312 | 786,456 | 983,064 131,072 | 8 | 1,048,600 | 1,572,888 | 1,966,104 262,144 | 8 | 2,097,176 | 3,145,752 | 3,932,184 524,288 | 8 | 4,194,328 | 6,291,480 | 7,864,344 1,048,576 | 8 | 8,388,632 | 12,582,936 | 15,728,664
The number of valid entries in the Alpha Sketch after it enters estimation mode is a random variable that has a probability distribution with a mean of k and a standard deviation of sqrt(k). The last column computes the maximum size with a confidence of 99.997% representing plus 4 standard deviations.
| Empty | Estimating Avg | Std Dev | Max @ 99.997% Nominal Entries (k) : Formula -> | 8 | k*8 + 24 | sqrt(k) | (k+4SD)*8 +24 ----------------|-------------|-------------|------------|---------- 512 | 8 | 4,120 | 23 | 4,844 1,024 | 9 | 8,216 | 32 | 9,240 2,048 | 10 | 16,408 | 45 | 17,856 4,096 | 11 | 32,792 | 64 | 34,840 8,192 | 12 | 65,560 | 91 | 68,456 16,384 | 13 | 131,096 | 128 | 135,192 32,768 | 14 | 262,168 | 181 | 267,961 65,536 | 15 | 524,312 | 256 | 532,504 131,072 | 16 | 1,048,600 | 362 | 1,060,185 262,144 | 17 | 2,097,176 | 512 | 2,113,560 524,288 | 18 | 4,194,328 | 724 | 4,217,498 1,048,576 | 19 | 8,388,632 | 1,024 | 8,421,400
The Union operation has both a Java heap and a direct, off-heap variant. When a Union operation is converted to a byte array using toByteArray(), the internal structure is a 32-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the Union to be quickly reconstructed from the byte array so that updating of the Union can be continued.
The Intersection operation has both a Java heap and a direct, off-heap variant. When an Intersection operation is converted to a byte array using toByteArray(), the internal structure is a 24-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the Intersection to be quickly reconstructed from the byte array so that updating of the Intersection can be continued.
The A not B operation is asymmetric and stateless. Both the A and B arguments are presented and the difference is computed and returned. There is no need for a byte array form.