layout: doc_page

Frequent Distinct Tuples Sketch

Given a multiset of tuples with dimensions {d1,d2, d3, ..., dN}, and a primary subset of dimensions M < N, the task is to identify the combinations of M subset dimensions that have the most frequent number of distinct combinations of the N-M non-primary dimensions.

We define a specific combination of the M primary dimensions as a Primary Key and all combinations of the M primary dimensions as the set of Primary Keys.

We define the set of all combinations of N-M non-primary dimensions associated with a single primary key as a Group.

For example, assume N=3, M=2, where the set of Primary Keys are defined by {d1, d2}. After populating the sketch with a stream of tuples all of size N, we wish to identify the Primary Keys that have the most frequent number of distinct occurrences of {d3}. Equivalently, we want to identify the Primary Keys with the largest Groups.

Alternatively, if we choose the Primary Key as {d1}, then we can identify the {d1}s that have the largest groups of {d2, d3}. The choice of which dimensions to choose for the Primary Keys is performed in a post-processing phase after the sketch has been populated. Thus, multiple queries can be performed against the populated sketch with different selections of Primary Keys.

As a simple concrete example, let‘s assume N = 2 and let d1 := IP address, and d2 := User ID. Let’s choose {d1} as the Primary Keys, then the sketch allows the identification of the IP addresses that have the largest populations of distinct User IDs. Conversely, if we choose {d2} as the Primary Keys, the sketch allows the identification of the User IDs with the largest populations of distinct IP addresses.

An important caveat is that if the distribution is too flat, there may not be any “most frequent” combinations of the primary keys above the threshold. Also, if one primary key is too dominant, the sketch may be able to only report on the single most frequent primary key combination, which means the possible existance of false negatives.

In this implementation the input tuples presented to the sketch are string arrays.