Moved the sketches into independent subdirectories
diff --git a/docs/source/distinct_counting/cpc.rst b/docs/source/distinct_counting/cpc.rst
new file mode 100644
index 0000000..2959b02
--- /dev/null
+++ b/docs/source/distinct_counting/cpc.rst
@@ -0,0 +1,24 @@
+Compressed Probabilistic Counting (CPC)
+---------------------------------------
+A high-performance C++ implementation of the Compressed Probabilistic Counting (CPC) sketch.
+This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a. FM85) algorithm developed by Kevin Lang in his paper
+`Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm <https://arxiv.org/abs/1708.06839>`_.
+This sketch is extremely space-efficient when serialized.
+In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this algorithm simultaneously wins on both dimensions of the space/accuracy tradeoff and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy.
+As described in the paper, this sketch implements a newly developed ICON estimator algorithm that survives unioning operations, which another well-known estimator, the Historical Inverse Probability (HIP) estimator, does not.
+The update speed of this sketch is quite fast and comparable to that of HLL.
+The unioning (merging) capability of this sketch also allows for merging of sketches with different configurations of K.
+For additional security this sketch can be configured with a user-specified hash seed.
+
+
+.. autoclass:: _datasketches.cpc_sketch
+ :members:
+ :undoc-members:
+ :exclude-members: deserialize
+ :member-order: groupwise
+
+ .. rubric:: Static Methods:
+
+ .. automethod:: deserialize
+
+ .. rubric:: Non-static Methods:
+
diff --git a/docs/source/distinct_counting/hyper_log_log.rst b/docs/source/distinct_counting/hyper_log_log.rst
new file mode 100644
index 0000000..ec79d93
--- /dev/null
+++ b/docs/source/distinct_counting/hyper_log_log.rst
@@ -0,0 +1,32 @@
+HyperLogLog (HLL)
+-----------------
+This is a high performance implementation of Philippe Flajolet's HLL sketch but with significantly improved error behavior.
+
+If the ONLY use case for sketching is counting uniques and merging, the HLL sketch is a reasonable choice, although the highest performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the Theta sketch family for the same accuracy.
+
+This implementation offers three different types of HLL sketch, each with a different trade-off among accuracy, space, and performance.
+These types are specified with the `target_hll_type` parameter.
+
+In terms of accuracy, all three types, for the same lg_config_k, have the same error distribution as a function of n, the number of unique values fed to the sketch.
+The configuration parameter `lg_config_k` is the log-base-2 of `K`, where `K` is the number of buckets or slots for the sketch.
+
+During warmup, when the sketch has only received a small number of unique items (up to about 10% of `K`), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
+
+This sketch also offers the capability of operating off-heap.
+In the Java implementation, given a WritableMemory object created by the user, the sketch will perform all of its updates and internal phase transitions in that object, which can reside either on-heap or off-heap depending on how it is configured.
+In large systems that must update and merge many millions of sketches, having the sketch operate off-heap avoids the serialization and deserialization costs of moving sketches to and from memory-mapped files, for example, and eliminates long garbage collection delays.
+
+.. autoclass:: _datasketches.hll_sketch
+ :members:
+ :undoc-members:
+ :exclude-members: deserialize, get_max_updatable_serialization_bytes, get_rel_err
+ :member-order: groupwise
+
+ .. rubric:: Static Methods:
+
+ .. automethod:: deserialize
+ .. automethod:: get_max_updatable_serialization_bytes
+ .. automethod:: get_rel_err
+
+ .. rubric:: Non-static Methods:
diff --git a/docs/source/distinct_counting/index.rst b/docs/source/distinct_counting/index.rst
new file mode 100644
index 0000000..c9c4d4a
--- /dev/null
+++ b/docs/source/distinct_counting/index.rst
@@ -0,0 +1,11 @@
+Distinct Counting
+=================
+These are all of the sketches for distinct counting.
+
+.. toctree::
+ :maxdepth: 1
+
+ hyper_log_log
+ cpc
+ theta
+ tuple
\ No newline at end of file
diff --git a/docs/source/distinct_counting/theta.rst b/docs/source/distinct_counting/theta.rst
new file mode 100644
index 0000000..4593f37
--- /dev/null
+++ b/docs/source/distinct_counting/theta.rst
@@ -0,0 +1,14 @@
+Theta Sketch
+------------
+The theta package contains the basic sketch classes that are members of the `Theta Sketch Framework <https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html>`_.
+There is a separate Tuple package for many of the sketches that are derived from the same algorithms defined in the Theta Sketch Framework paper.
+
+The *Theta Sketch* is a space-efficient method for estimating cardinalities of sets.
+It can also easily handle set operations (such as union, intersection, difference) while maintaining good accuracy.
+The Theta sketch is a practical variant of the K-Minimum Values sketch, which avoids the need to sort the stored
+hash values on every insertion to the sketch.
+It has better error properties than the HyperLogLog sketch for set operations beyond the simple union.
+
+.. autoclass:: _datasketches.theta_sketch
+ :members:
+ :undoc-members:
\ No newline at end of file
diff --git a/docs/source/distinct_counting/tuple.rst b/docs/source/distinct_counting/tuple.rst
new file mode 100644
index 0000000..4bca37f
--- /dev/null
+++ b/docs/source/distinct_counting/tuple.rst
@@ -0,0 +1,18 @@
+Tuple Sketch
+------------
+
+Tuple sketches are an extension of the Theta sketch that allow keeping an arbitrary summary associated with each retained key
+(for example, a count for every key).
+
+.. autoclass:: datasketches.tuple_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: datasketches.AccumulatorPolicy
+ :members:
+
+.. autoclass:: datasketches.MaxIntPolicy
+ :members:
+
+.. autoclass:: datasketches.MinIntPolicy
+ :members:
\ No newline at end of file
diff --git a/docs/source/frequency/count_min_sketch.rst b/docs/source/frequency/count_min_sketch.rst
new file mode 100644
index 0000000..2f66c8d
--- /dev/null
+++ b/docs/source/frequency/count_min_sketch.rst
@@ -0,0 +1,28 @@
+CountMin Sketch
+---------------
+
+The CountMin sketch, as described by Cormode and Muthukrishnan in
+http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf,
+is used for approximate frequency estimation.
+For an item :math:`x` with frequency :math:`f_x`, the sketch provides an estimate, :math:`\hat{f_x}`,
+such that :math:`f_x \approx \hat{f_x}`.
+The sketch guarantees that :math:`f_x \le \hat{f_x}` and provides a probabilistic upper bound that depends on the size parameters.
+The sketch provides an estimate of the occurrence frequency for any queried item but, in contrast
+to the Frequent Items Sketch, this sketch does not provide a list of
+heavy hitters.
+
+.. currentmodule:: _datasketches
+
+.. autoclass:: count_min_sketch
+ :members:
+ :undoc-members:
+ :exclude-members: deserialize, suggest_num_buckets, suggest_num_hashes
+ :member-order: groupwise
+
+ .. rubric:: Static Methods:
+
+ .. automethod:: deserialize
+ .. automethod:: suggest_num_buckets
+ .. automethod:: suggest_num_hashes
+
+ .. rubric:: Non-static Methods:
diff --git a/docs/source/frequency/frequent_items.rst b/docs/source/frequency/frequent_items.rst
new file mode 100644
index 0000000..bb580e5
--- /dev/null
+++ b/docs/source/frequency/frequent_items.rst
@@ -0,0 +1,83 @@
+Frequent Items
+--------------
+
+This sketch is useful for tracking approximate frequencies of items of type `<T>` with optional associated counts `(<T> item, int count)`
+that are members of a multiset of such items.
+The true frequency of an item is defined to be the sum of associated counts.
+
+This implementation provides the following capabilities:
+
+* Estimate the *frequency* of an item.
+* Return *upper* and *lower bounds* of any item, such that the true frequency is always between the upper and lower bounds.
+* Return a global *maximum error* that holds for all items in the stream.
+* Return an array of frequent items that qualify either a *NO_FALSE_POSITIVES* or a *NO_FALSE_NEGATIVES* error type.
+* *Merge* itself with another sketch object created from this class.
+* *Serialize/Deserialize* to/from a byte array.
+
+**Space Usage**
+
+The sketch is initialized with a maximum map size, `maxMapSize`, that specifies the maximum physical length of the internal hash map of the form `(<T> item, int count)`.
+The maximum map size is always a power of 2, defined through the variable `lg_max_map_size`.
+
+The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.
+
+Excluding external space required for the item objects, the internal memory space usage of this sketch is `18 * mapSize bytes` (assuming 8 bytes for each reference),
+plus a small constant number of additional bytes.
+The internal memory space usage of this sketch will never exceed `18 * maxMapSize` bytes, plus a small constant number of additional bytes.
+
+**Maximum Capacity of the Sketch**
+
+The `LOAD_FACTOR` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of `(item, count)` pairs is `mapCap = 0.75 * mapSize`.
+The maximum capacity of `(item, count)` pairs of the sketch is `maxMapCap = 0.75 * maxMapSize`.
+
+**Updating the sketch with `(item, count)` pairs**
+
+If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new `(item, count)` pair is created.
+If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
+and removes any non-positive counters.
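The purge step can be illustrated with a toy, pure-Python version of this update rule (a simplified sketch of the idea, not the library's actual implementation):

```python
# Toy Misra-Gries-style update: a plain dict stands in for the hash map.
def mg_update(counters, item, count, max_counters):
    counters[item] = counters.get(item, 0) + count
    if len(counters) > max_counters:
        # decrement every counter by the median and drop non-positive ones
        median = sorted(counters.values())[len(counters) // 2]
        for key in list(counters):
            counters[key] -= median
            if counters[key] <= 0:
                del counters[key]

counters = {}
for item in ["a"] * 9 + ["b"] * 5 + ["c", "d", "e"]:
    mg_update(counters, item, 1, max_counters=3)
print(counters)   # the heavy item "a" survives, with an undercounted value
```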
+
+**Accuracy**
+
+If fewer than `0.75 * maxMapSize` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
+
+The logic of the frequent items sketch is such that the stored counts and true counts are never too different.
+More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).
+
+For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
+Specifically, :math:`(UB - LB) \le W \cdot \epsilon`, where :math:`W` denotes the sum of all item counts and :math:`\epsilon = 3.5/M`, where :math:`M` is the `maxMapSize`.
+
+This is a worst-case guarantee that applies to arbitrary inputs.
+For inputs typically seen in practice, `(UB-LB)` is usually much smaller.
+
+**Background**
+
+This code implements a variant of what is commonly known as the "Misra-Gries algorithm".
+Variants of it were discovered and rediscovered and redesigned several times over the years:
+
+* *Finding repeated elements*, Misra, Gries, 1982
+* *Frequency estimation of Internet packet streams with limited space* Demaine, Lopez-Ortiz, Munro, 2002
+* *A simple algorithm for finding frequent elements in streams and bags* Karp, Shenker, Papadimitriou, 2003
+* *Efficient Computation of Frequent and Top-k Elements in Data Streams* Metwally, Agrawal, Abbadi, 2006
+
+
+For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
+However, we have ensured that this probability is extremely small.
+For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least `1 - 1E-14`.
+If the stream causes `1E9` purges, our proof applies with a probability of at least `1 - 1E-5`.
+
+Parameter: <T> The type of item to be tracked by this sketch
+
+
+.. autoclass:: _datasketches.frequent_items_sketch
+ :members:
+ :undoc-members:
+ :exclude-members: deserialize, get_epsilon_for_lg_size, get_apriori_error
+ :member-order: groupwise
+
+ .. rubric:: Static Methods:
+
+ .. automethod:: deserialize
+ .. automethod:: get_epsilon_for_lg_size
+ .. automethod:: get_apriori_error
+
+ .. rubric:: Non-static Methods:
diff --git a/docs/source/frequency/index.rst b/docs/source/frequency/index.rst
new file mode 100644
index 0000000..cf08180
--- /dev/null
+++ b/docs/source/frequency/index.rst
@@ -0,0 +1,9 @@
+Frequency Sketches
+==================
+These are all of the sketches for frequency estimation.
+
+.. toctree::
+ :maxdepth: 1
+
+ frequent_items
+ count_min_sketch
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index b6998c2..3ba92db 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -7,44 +7,16 @@
=================================================
**DataSketches** are highly-efficient algorithms to analyze big data quickly.
-
-
+
+
Counting Distincts
##################
..
maxdepth: 1 means only the heading is printed in the contents
.. toctree::
- :maxdepth: 1
+ :maxdepth: 1
- hyper_log_log
- cpc
- theta
- tuple
-
-Density Sketch
-##############
-.. toctree::
- :maxdepth: 1
-
- density_sketch
-
-Frequency Estimation
-##########################
-
-.. toctree::
- :maxdepth: 1
-
- count_min_sketch
-
-
-Frequent Items
-##########################
-This problem may also be known as **heavy hitters** or **TopK**
-
-.. toctree::
- :maxdepth: 1
-
- frequent_items
+ distinct_counting/index
Quantile Estimation
###################
@@ -52,30 +24,32 @@
.. toctree::
:maxdepth: 1
- kll
- req
- quantiles_depr
+ quantiles/index
+
+Frequency Sketches
+##################
+This problem may also be known as **heavy hitters** or **TopK**.
+
+.. toctree::
+ :maxdepth: 1
+
+ frequency/index
+
+Vector Sketches
+###############
+
+.. toctree::
+ :maxdepth: 1
+
+ vector/index
.. note::
This project is under active development.
-
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
-* :ref:`search`
-
-
-.. .. automodule:: datasketches
-.. :members:
-
-.. .. automodule:: _datasketches
-.. :members:
-
-..
-..
-
-.. distinct_count
+* :ref:`search`
\ No newline at end of file
diff --git a/docs/source/quantiles/index.rst b/docs/source/quantiles/index.rst
new file mode 100644
index 0000000..a12225f
--- /dev/null
+++ b/docs/source/quantiles/index.rst
@@ -0,0 +1,10 @@
+Quantiles Sketches
+==================
+These are all of the sketches for quantile estimation.
+
+.. toctree::
+ :maxdepth: 1
+
+ kll
+ req
+ quantiles_depr
\ No newline at end of file
diff --git a/docs/source/quantiles/kll.rst b/docs/source/quantiles/kll.rst
new file mode 100644
index 0000000..1be3dfa
--- /dev/null
+++ b/docs/source/quantiles/kll.rst
@@ -0,0 +1,124 @@
+KLL Sketch
+----------
+Implementation of a very compact quantiles sketch with a lazy compaction scheme
+and nearly optimal accuracy per retained item.
+See `Optimal Quantile Approximation in Streams <https://arxiv.org/abs/1603.05346>`_.
+
+This is a stochastic streaming sketch that enables near real-time analysis of the
+approximate distribution of items from a very large stream in a single pass, requiring only
+that the items are comparable.
+The analysis is obtained using the `get_quantile()` function or the
+inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()`
+(Cumulative Distribution Function).
+
+As of May 2020, this implementation produces serialized sketches which are binary-compatible
+with the equivalent Java implementation only when template parameter `T = float`
+(32-bit single precision values).
+
+Given an input stream of `N` items, the `natural rank` of any specific
+item is defined as its index `(1 to N)` in inclusive mode
+or `(0 to N-1)` in exclusive mode
+in the hypothetical sorted stream of all `N` input items.
+
+The `normalized rank` (`rank`) of any specific item is defined as its
+`natural rank` divided by `N`.
+Thus, the `normalized rank` is between zero and one.
+In the documentation for this sketch `natural rank` is never used so any
+reference to just `rank` should be interpreted to mean `normalized rank`.
+
+This sketch is configured with a parameter `k`, which affects the size of the sketch
+and its estimation error.
+
+The estimation error is commonly called `epsilon` (or `eps`) and is a fraction
+between zero and one. Larger values of `k` result in smaller values of `epsilon`.
+Epsilon is always with respect to the rank and cannot be applied to the
+corresponding items.
+
+The relationship between the `normalized rank` and the corresponding items can be viewed
+as a two-dimensional monotonic plot with the `normalized rank` on one axis and the
+corresponding items on the other axis. If the y-axis is specified as the item-axis and
+the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically
+increasing function.
+
+The function `get_quantile(rank)` translates ranks into
+corresponding quantiles. The functions `get_rank(item)`,
+`get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)`
+(Probability Mass Function) perform the opposite operation and translate items into ranks.
+
+The `get_pmf(...)` function has about 13 to 47% worse rank error (depending
+on `k`) than the other queries because the mass of each "bin" of the PMF has
+"double-sided" error from the upper and lower edges of the bin as a result of a subtraction,
+as the errors from the two edges can sometimes add.
+
+The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a
+"double-sided" (PMF) `epsilon` of about 1.65%.
+
+A `get_quantile(rank)` query has the following guarantees:
+
+- Let `q = get_quantile(r)` where `r` is the rank between zero and one.
+- The quantile `q` will be an item from the input stream.
+- Let `true_rank` be the true rank of `q` derived from the hypothetical sorted
+  stream of all `N` items.
+- Let `eps = get_normalized_rank_error(false)`.
+- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the
+  error is on the rank, not the quantile.
+
+A `get_rank(item)` query has the following guarantees:
+
+- Let `r = get_rank(i)` where `i` is an item between the min and max items of
+  the input stream.
+- Let `true_rank` be the true rank of `i` derived from the hypothetical sorted
+  stream of all `N` items.
+- Let `eps = get_normalized_rank_error(false)`.
+- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%.
+
+A `get_pmf(...)` query has the following guarantees:
+
+- Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2, ..., sm` are
+  split points (items from the input domain) between the min and max items of
+  the input stream.
+- Let `mass_i` be the estimated mass between `s_i` and `s_(i+1)`.
+- Let `true_mass` be the true mass between the items `s_i` and
+  `s_(i+1)` derived from the hypothetical sorted stream of all `N` items.
+- Let `eps = get_normalized_rank_error(true)`.
+- Then `mass_i - eps ≤ true_mass ≤ mass_i + eps` with a confidence of 99%.
+- `r(m+1)` includes the mass of all points larger than `s_m`.
+
+A `get_cdf(...)` query has the following guarantees:
+
+- Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ..., sm` are
+  split points (items from the input domain) between the min and max items of
+  the input stream.
+- Let `mass_i = r_(i+1) - r_i`.
+- Let `true_mass` be the true mass between the true ranks of `s_i` and
+  `s_(i+1)` derived from the hypothetical sorted stream of all `N` items.
+- Let `eps = get_normalized_rank_error(true)`.
+- Then `mass_i - eps ≤ true_mass ≤ mass_i + eps` with a confidence of 99%.
+- `1 - r(m+1)` includes the mass of all points larger than `s_m`.
+
+From the above, it might seem like we could make some estimates to bound the
+item returned from a call to `get_quantile()`. The sketch, however, does not
+let us derive error bounds or confidences around items. Because errors are independent, we
+can approximately bracket a value as shown below, but there are no error estimates available.
+Additionally, the interval may be quite large for certain distributions.
+
+- Let `q = get_quantile(r)`, the estimated quantile of rank `r`.
+- Let `eps = get_normalized_rank_error(false)`.
+- Let `q_lo` be the estimated quantile of rank `(r - eps)`.
+- Let `q_hi` be the estimated quantile of rank `(r + eps)`.
+- Then `q_lo ≤ q ≤ q_hi`, with 99% confidence.
+
+
+.. autoclass:: _datasketches.kll_ints_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.kll_floats_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.kll_doubles_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.kll_items_sketch
+ :members:
+ :undoc-members:
+
diff --git a/docs/source/quantiles/quantiles_depr.rst b/docs/source/quantiles/quantiles_depr.rst
new file mode 100644
index 0000000..a2d12ea
--- /dev/null
+++ b/docs/source/quantiles/quantiles_depr.rst
@@ -0,0 +1,55 @@
+Quantiles Sketch (Deprecated)
+-----------------------------
+This is a deprecated quantiles sketch that is included for cross-language compatibility.
+
+This is a stochastic streaming sketch that enables near-real time analysis of the
+approximate distribution from a very large stream in a single pass.
+The analysis is obtained using the `get_rank()` and `get_quantile()` functions,
+the Probability Mass Function from `get_pmf()`, and the Cumulative Distribution Function from `get_cdf()`.
+
+Consider a large stream of one million values such as packet sizes coming into a network node.
+The natural rank of any specific size value is its index in the hypothetical sorted
+array of values.
+The normalized rank is the natural rank divided by the stream size,
+in this case one million.
+The value corresponding to the normalized rank of `0.5` represents the 50th percentile or median
+value of the distribution, or `get_quantile(0.5)`.
+Similarly, the 95th percentile is obtained from `get_quantile(0.95)`.
+
+Assuming the min and max values are, for example, 1 and 1000 bytes,
+you can obtain the PMF from `get_pmf(100, 500, 900)`, which will result in an array of
+4 fractional values such as {.4, .3, .2, .1}, meaning that
+40% of the values were < 100,
+30% of the values were ≥ 100 and < 500,
+20% of the values were ≥ 500 and < 900, and
+10% of the values were ≥ 900.
+A frequency histogram can be obtained by multiplying these fractions by `get_n()`,
+which is the total count of values received.
+The `get_cdf()` function works similarly, but produces the cumulative distribution instead.
+
+As of November 2021, this implementation produces serialized sketches which are binary-compatible
+with the equivalent Java implementation only when template parameter T = double
+(64-bit double precision values).
+
+The accuracy of this sketch is a function of the configured value `k`, which also affects
+the overall size of the sketch. Accuracy of this quantile sketch is always with respect to
+the normalized rank. A `k` of 128 produces a normalized rank error of about 1.7%.
+For example, the median item returned from `get_quantile(0.5)` will be between the actual items
+from the hypothetically sorted array of input items at normalized ranks of 0.483 and 0.517, with
+a confidence of about 99%.
+
+.. autoclass:: _datasketches.quantiles_ints_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.quantiles_floats_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.quantiles_doubles_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.quantiles_items_sketch
+ :members:
+ :undoc-members:
diff --git a/docs/source/quantiles/req.rst b/docs/source/quantiles/req.rst
new file mode 100644
index 0000000..1df4d08
--- /dev/null
+++ b/docs/source/quantiles/req.rst
@@ -0,0 +1,40 @@
+Relative Error Quantiles (REQ) Sketch
+-------------------------------------
+This is an implementation based on the `paper <https://arxiv.org/abs/2004.01668>`_ "Relative Error Streaming Quantiles" by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý, and loosely derived from a Python prototype written by Pavel Veselý.
+
+This implementation differs from the algorithm described in the paper in the following:
+
+The algorithm requires no upper bound on the stream length.
+Instead, each relative-compactor counts the number of compaction operations performed so far (via the variable `state`).
+Initially, the relative-compactor starts with `INIT_NUMBER_OF_SECTIONS`.
+Each time the number of compactions (the variable `state`) exceeds `2^{numSections - 1}`, we double `numSections`.
+Note that after merging the sketch with another one, the variable `state` may not correspond to the number of compactions performed at a particular level. However,
+since the `state` variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
+
+The size of each section (variable `k` and `section_size` in the code and parameter `k` in the paper) is
+initialized with a number set by the user via variable `k`.
+When the number of sections doubles, we decrease `section_size` by a factor of `sqrt(2)`.
+This is applied at each level separately.
+Thus, when we double the number of sections, the nominal compactor size increases by a factor of approximately `sqrt(2)` (+/- rounding).
+
+The merge operation here does not perform "special compactions", which are used in the paper to allow for a tight mathematical analysis of the sketch.
+This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.
+
+* The Python prototype only implemented high accuracy for low ranks. This implementation lets the user
+  choose either high rank accuracy or low rank accuracy at the time of sketch construction.
+* The Python prototype only implemented the `INCLUSIVE` comparison criterion.
+  This implementation allows the user to use both the `INCLUSIVE` and the `EXCLUSIVE` criteria.
+* This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output.
+  This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.
+
+.. autoclass:: _datasketches.req_ints_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.req_floats_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: _datasketches.req_items_sketch
+ :members:
+ :undoc-members:
\ No newline at end of file
diff --git a/docs/source/vector/density_sketch.rst b/docs/source/vector/density_sketch.rst
new file mode 100644
index 0000000..8326ebf
--- /dev/null
+++ b/docs/source/vector/density_sketch.rst
@@ -0,0 +1,17 @@
+Density Sketch
+--------------
+Builds a coreset from the given set of input points and
+provides a density estimate at a given point.
+
+Based on the following paper: Zohar Karnin, Edo Liberty,
+`Discrepancy, Coresets, and Sketches in Machine Learning <https://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf>`_.
+
+Inspired by the following implementation: https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde.py
+
+.. autoclass:: datasketches.density_sketch
+ :members:
+ :undoc-members:
+
+.. autoclass:: datasketches.GaussianKernel
+ :members:
\ No newline at end of file
diff --git a/docs/source/vector/index.rst b/docs/source/vector/index.rst
new file mode 100644
index 0000000..f7121ea
--- /dev/null
+++ b/docs/source/vector/index.rst
@@ -0,0 +1,8 @@
+Vector Sketches
+===============
+These sketches are designed to accept vector inputs.
+
+.. toctree::
+ :maxdepth: 1
+
+ density_sketch
\ No newline at end of file