| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "https://www.w3.org/TR/html4/frameset.dtd"> |
| <html> |
| <head> |
| <title>Sketch Library Dictionary</title> |
| </head> |
| <body> |
| <h2>Sketch Library Dictionary</h2> |
| |
| <h3><a name="accuracy">Sketch Accuracy</a></h3> |
| Refers to sketch accuracy... |
| |
| <h3><a name="alphaTCF">Alpha TCF</a></h3> |
| The Alpha Theta Choosing Function (TCF) and the theory behind it is fully described in the |
| <a href="https://github.com/DataSketches/DataSketches.github.io/blob/master/docs/pdf/ThetaSketchFramework.pdf">Theta Sketch Framework</a> paper. |
| The alpha algorithm is optimized for speed and accuracy in a real-time sketch |
| building / estimating environment. |
| <p>One of the properties of the Alpha Algorithm used for cache management within a sketch is that |
| the value of <a href="#thetaLong">Theta Long</a> is always up-to-date. There may be |
| <a href="#dirtyHash">dirty hash</a> |
| values in the cache, but the Alpha Algorithm estimation function ignores them.</p> |
| |
| <h3><a name="defaultNomEntries">Default Nominal Entries</a></h3> |
| In Theta Sketches, the default nominal entries of 4096 is provided as a convenience for those cases where the |
| number of entries is not provided. |
| A sketch of 4096 entries has a Relative Standard Error (RSE) of +/- 1.56% at a confidence of 68%; |
| or equivalently, a Relative Error of +/- 3.1% at a confidence of 95.4%. |
| |
| <h3><a name="defaultUpdateSeed">Default Update Seed</a></h3> |
| <p>In Theta Sketches, the default <a href="#seed">Update Hash Seed</a> 9001 is a prime number that was chosen |
| very early on in experimental testing. |
| Choosing a seed is somewhat arbitrary, and this particular seed is not superior to other seeds. |
| <p>In performing set operations on two sketches it is critical that the same hash |
| function and seed was used for both sketches, otherwise the assumed 1:1 relationship |
| between the original source key value and the hashed bit string would be violated. |
| Once you have developed a history of stored sketches you are stuck with your chosen seed. |
| So don't change it! |
| <a href="#seed">Update Hash Seed</a> |
| |
| <h3><a name="degenerateSketch">Degenerate Sketch</a></h3> |
| A sketch is considered <i>degenerate</i> if it has never reached reached |
| <a href="#estMode">Estimation Mode</a> where the count of its retained hash values is |
| exact and less than <a href="#nomEntries">Nominal Entries</a> (or <i>k</i> entries) |
| and no estimation is needed. This is also referred to as <i>Exact Mode</i>. |
| |
| <h3><a name="dstMem">Destination Memory</a></h3> |
| The destination Memory for this object. If not required it may be null. If not null and large |
| enough the returned object will be backed by this Memory. |
| If null, the returned object will be on the Java heap. |
| |
| <h3><a name="dstOrdered">Destination Ordered</a></h3> |
| In Theta Sketches, this refers to the ordering of hash values in a returned CompactSketch. |
| If true, the internal hash values will be ordered. This will enable much higher |
| performance during union (merge) operations but at the cost of a sort, which may not be |
| desirable in some applications, especially real-time. |
| |
| <h3><a name="dirtyHash">Dirty Hash</a></h3> |
| For the Theta Alpha Sketch, a retained hash value is considered <i>dirty</i> if it is ≥ <a href="#thetaLong">Theta Long</a> or < 0. |
| See <a href="#validHash">Valid Hash</a>. |
| |
| <h3><a name="empty">Empty</a></h3> |
| In Theta Sketches, the state <i>Empty</i> for a sketch means that the sketch cache has zero hash values and that none of the |
| update methods have been called with valid data. In other words, the sketch has never seen any data. |
| However, if the P sampling mode is used, a virgin sketch will have a theta < 1.0, and a set operation could be |
| (and should be) impacted by this lower theta. This is by design. |
| It is only safe to exclude empty sketches from set operations if isEmpty() == true and getTheta() == 1.0. |
| |
| |
| <p>Note that <i>Empty</i> does not mean that theta is 1.0 because if <i>p</i> < 1.0, theta will be set |
| equal to <i>p</i> during construction. |
| Also, a cache of zero values does not mean that the sketch is <i>Empty</i> since set intersection or difference |
| operations can result in a sketch with zero values but with a valid theta and upper and lower bounds. |
| |
| <p>These are subtle distinctions and exist for mathematical correctness. Excluding sketches that are <i>not Empty</i> |
| from set operations that would otherwise include them could result in impacting the accuracy of results. |
| |
| <h3><a name="estMode">Estimation Mode</a></h3> |
| Once a Theta Sketch exceeds the configured <a href="#nomEntries">Nominal Entries</a>, or <i>k</i>, number of retained hash values, |
| the sketch transitions into <i>estimation mode</i> where it must now estimate the population of uniques that |
| the retained entries represent. A sketch can also be in estimation mode if the sketch was configured with a |
| <a href="#p">Sampling Probability</a> < 1.0. |
| <i>Estimation Mode</i> = (<a href="#theta">θ</a> < 1.0) & NOT <i>empty</i>. |
| |
| <h3><a name="lgNomLongs">lgNomLongs</a></h3> |
| For Theta Sketches, the Log base 2 of the number of <a href="#nomEntries">Nominal Entries</a>. |
| |
| <h3><a name="lgArrLongs">lgArrLongs</a></h3> |
| For Theta Sketches, the Log base 2 of the size of the internal hash table in 64-bit longs. |
| |
| <h3><a name="mem">Memory</a></h3> |
| The backing Memory object which may be a source or destination. |
| |
| <h3><a name="nomEntries">Nominal Entries</a></h3> |
| For Theta Sketches and depending on the specific sketch, the constructor data type is <i>int</i> or <i>String</i>. |
| Acceptable values for constructing new sketches are positive integers that are powers of two. |
| This parameter specifies the target nominal number of entries (a.k.a. <i>k</i>) that will be retained in the internal cache |
| and ultimately determines the accuracy of the sketch (see <a href="#accuracy">Sketch Accuracy</a>). |
| Internally each entry is retained as a long of 8 bytes. |
| |
| <p>The reason it is called "nominal" is that depending on the specific algorithm of the sketch, the actual |
| number of entries retained by the sketch after the sketch goes into <i>estimation mode</i> |
| may differ from the configured value. For the <i>AlphaSketch</i> the number of retained entries statistically varies |
| but has a mean of <i>k</i>. For the <i>QuickSelect</i> sketches, the number of retained entries will statistically |
| vary from <i>k</i> to almost <i>2*k</i>. |
| |
| <p>Each sketch type also has a minimum acceptable value for this value. For QuickSelect Sketches this value is 16 and for |
| Alpha Sketches this value is 512. Specifying a value less than this minimum value just results in the minimum value being used. |
| |
| <h3><a name="numStdDev">Number of Standard Deviations</a></h3> |
| This is a positive number, which may be either an integer (1, 2, or 3) or a double ≤ 3.0. |
| This value is used in the getUpperBounds(int numStdDev) and |
| getLowerBounds(int numStdDev) methods and represents (theoretically) the +/- standard deviation from the center of the |
| Standard Normal Gaussian Distribution. For example: |
| |
| <p>getUpperBound(1) returns the estimated quantile(0.841) of the distribution.<br> |
| getLowerBound(1) returns the estimated quantile(0.158) of the distribution.<br> |
| getUpperBound(2) returns the estimated quantile(0.977) of the distribution.<br> |
| getLowerBound(2) returns the estimated quantile(0.023) of the distribution.<br> |
| getUpperBound(3) returns the estimated quantile(0.9986) of the distribution.<br> |
| getLowerBound(3) returns the estimated quantile(0.0013) of the distribution.<br> |
| </p> |
| |
| <p>However, for sketches with small configured values of <i>Nominal Entries < 4096</i> for Theta or <i>lgConfigK < 12</i> for HLL, |
| the error distribution of the sketch becomes quite asymmetric and cannot be approximated with a Gaussian. In these cases the interpretation of |
| <i>numStdDev</i> is that of an index that returns the quantile of the sketch error distribution that corresponds to fractional normalized rank |
| of the standard normal distribution at the specified <i>numStdDev</i>. |
| |
| <p>Thus, getUpperBound(1) and getLowerBound(2) represent the 68.3% confidence bounds, |
| getUpperBound(2) and getLowerBound(2) represent the 95.4% confidence bounds, and |
| getUpperBound(3) and getLowerBound(3) represent the 99.7% confidence bounds. |
| <br> |
| |
| <p>For some sketches where the error distribution is not Gaussian, special mathematical approximation methods are used. |
| See <a href="#accuracy">Sketch Accuracy</a>.</p> |
| |
| |
| |
| <h3><a name="quickSelectTCF">Quick Select TCF</a></h3> |
| The fundamental Theta Sketch QuickSelect algorithm is described in classic algorithm texts by Sedgewick and |
| is the Theta Choosing Function (<a href="#tcf">TCF</a>) for the QuickSelect Sketches. |
| When the internal hash table of the sketch reaches its internal |
| <i>refresh threshold</i>, |
| the quick select algorithm is used to select the <code>(k+1)th order statistic</code> |
| from the hash table with a complexity of <i>O(n)</i>. |
| The value of the selected hash becomes the new |
| <a href="#thetaLong">Theta Long</a> |
| and immediately makes some number of entries in the table |
| <a href="#dirtyHash">dirty</a>. |
| The <i>rebuild()</i> method is called that rebuilds the hash table removing the |
| <a href="#dirtyHash">dirty</a> values. |
| Since the value of <a href="#thetaLong">Theta Long</a> |
| is only changed when the hash table needs to be rebuilt, |
| the values in the hash table are only ever <a href="#dirtyHash">dirty</a> |
| briefly during the rebuild process. |
| Thus, all the values in the hash table are always |
| <a href="#validHash">valid</a> during normal updating of the sketch. |
| <p>One of the benefits of using the QuickSelect algorithm for the cache management of the sketch is |
| that the number of <a href="#validHash">valid</a> hashes ranges from |
| <a href="#nomEntries">nominal entries</a> |
| to the current <i>REBUILD_THRESHOLD</i></a>, which is nominally 15/16 * <i>cacheSize</i>. |
| This means that without the user forcing |
| a <i>rebuild()</i>, the sketch, on average, may be about 50% larger than |
| <a href="#nomEntries">nominal entries</a>, about 19% more accurate, and faster.</p> |
| |
| <h3><a name="resizeFactor">Resize Factor</a></h3> |
| For Theta Sketches, the Resize Factor is a dynamic, speed performance vs. memory size tradeoff. |
| The sketches created on-heap and configured with a Resize Factor of > X1 start out with |
| an internal hash table size that is the smallest submultiple of the the target |
| <a href="#nomEntries">Nominal Entries</a> |
| and larger than the minimum required hash table size for that sketch. |
| When the sketch needs to be resized larger, then the Resize Factor is used as a multiplier of |
| the current sketch cache array size. <br> |
| "X1" means no resizing is allowed and the sketch will be intialized at full size.<br> |
| "X2" means the internal cache will start very small and double in size until the target size is reached.<br> |
| Similarly, "X4" is a factor of 4 and "X8 is a factor of 8. |
| |
| <h3><a name="p">Sampling Probability <i>p</i></a></h3> |
| For Theta Sketches, the uniform random pre-sketching sampling probability. |
| Depending on the specific sketch, the constructor data type is <i>float</i> or <i>String</i>. |
| Incoming hashed data values are sampled by this probability factor before being submitted to |
| the sketching algorithm. For example, if <i>p</i> were set to 0.25, then on average, only one |
| forth of the incoming values, selected uniformly and at random, would be evaluated by the |
| sketching algorithm to be retained by the sketch. |
| Its default value is 1.0 (no sampling). |
| Its value must be in the range: 0 < p ≤ 1.0. |
| <p>This mode is particularly useful when merging large numbers of |
| <a href="#degenerateSketch">degenerate sketches</a>. |
| |
| <h3><a name="seed">Seed</a></h3> |
| For Theta Sketches, the long (64-bit) seed is required by the Update Hash Function. |
| This seed value is intentionally not serialized along with this sketch in order to provide |
| some security and protection against "dictionary attacks". |
| <p>In order to provide some protection against accidental mixing |
| of sketches that were generated with different seeds a short, 16-bit, |
| <a href="#seedHash">Seed Hash</a> is stored with the sketch image. |
| When heapifying or wrapping an UpdateSketch image, which can be either a byte array or a Memory object, |
| the user must provide the original seed either directly or indirectly by assuming the <i>DEFAULT_UPDATE_SEED</i>. |
| The provided seed will be hashed and validated against the internal short Seed Hash and an error will be thrown if the seed hashes do not match. |
| The Set Operations classes, Union, Intersection and AnotB also require the user to provide the seed either directly or indirectly. |
| <p>An internal check will be made to make sure that the provided seed does not hash to a 16-bit value of zero. |
| If it does produce zero, an error will be thrown and the user must provide a different seed value. |
| <br> |
| See also <a href="#defaultUpdateSeed">Default Update Seed</a>. |
| |
| <h3><a name="seedHash">Seed Hash</a></h3> |
| For Theta Sketches, a 16-bit hash of the <a href="#seed">Update Hash Seed</a> used internally to validate |
| (1) that two sketches undergoing set operations were, in fact, created using matching <a href="#seed">Update Hash Seeds</a>; |
| or (2) that when deserializing or wrapping a sketch image that the caller has the correct seed. |
| |
| <h3><a name="SnowPlow">Snow Plow Effect</a></h3> |
| When coordinated hash tables are merged and if the merging process does not update the target sketch with sufficient randomness, clustering |
| in the target hash table can be greatly exaggerated causing poor speed performance for both updates and searches. This is called the |
| "snowplow" effect because of the analogy of visualizing the clusters in a hash table as piles of snow that grow larger and larger. Since the |
| size of the clusters are only represented by their width (not height like piles of snow), the clusters push themselves out horizontally and |
| merge together as if they were pushed together with a snowplow. |
| |
| <h3><a name="tcf">Theta Choosing Function (TCF)</a></h3> |
| For Theta Sketches, the Theta Choosing Function (TCF) and the theory behind it is fully described in the |
| <a href="https://github.com/DataSketches/DataSketches.github.io/blob/master/docs/pdf/ThetaSketchFramework.pdf">Theta Sketch Framework</a> paper. |
| |
| <h3><a name="theta">Theta, θ</a></h3> |
| For Theta Sketches, refers to the mathematical random variable θ that represents the current probability |
| that the next, non-duplicate unique input value presented to the sketch will change the state of the sketch. |
| Given <i>N</i> uniquified inputs to a sketch configured to retain at most <i>k</i> hashes of the inputs, |
| <code>θ ≈ <i>k/N</i>, 0 < θ ≤ 1.0</code>, and θ = (double) thetaLong / Long.MAX_VALUE;. |
| See <a href="#thetaLong">thetaLong</a>. |
| |
| <h3><a name="thetaLong">thetaLong</a></h3> |
| For Theta Sketches, the 64-bit, positive <i>long</i> equivalent of <a href="#theta">theta</a> where<br> |
| 0 < <i>thetaLong</i> ≤ <i>Long.MAX_VALUE</i>, and <i>thetaLong</i> = θ * Long.MAX_VALUE. |
| |
| <h3><a name="thetaSketch">Theta Sketch Framework</a></h3> |
| This framework enables sketches with different algorithms and Theta Choosing Functions |
| to be arguments to the Union, Intersection and AnotB Set Operations. |
| This framework also enables the sketches to share estimation, upper and lower bounds algorithms and |
| a common serialization data structure. |
| The Theta Sketch Framework, Theta Choosing Functions and the theory behind them is fully described |
| in the <a href="https://github.com/DataSketches/DataSketches.github.io/blob/master/docs/pdf/ThetaSketchFramework.pdf">Theta Sketch Framework</a> paper. |
| |
| <h3><a name="updateReturnState">Update Return State</a></h3> |
| For Theta Sketches, this provides useful detail for sketch characterization and debugging. It is not required that any of |
| these values be monitored during normal operation. |
| The UpdateReturnState is defined as follows: |
| <ul> |
| <li> InsertedCountIncremented: Inserted, not full, retained_entries_count incremented.</li> |
| <li> InsertedCountNotIncremented: Inserted, not full, retained-entries-count not incremented |
| because a dirty value was overridden.</li> |
| <li> RejectedDuplicate: Rejected as duplicate.</li> |
| <li> RejectedNullOrEmpty: Rejected because input was null or empty. Only for update objects.</li> |
| <li> RejectedOverTheta: Rejected because the computed hash was over the current value of thetaLong.</li> |
| </ul> |
| |
| |
| <h3><a name="validHash">Valid Hash</a></h3> |
| For Theta Sketches, a retained hash value is considered <i>valid</i> if it is greater than zero and less than |
| <a href="#thetaLong">thetaLong</a>. See <a href="#dirtyHash">Dirty Hash</a>. |
| |
| </body> |
| </html> |