layout: doc_page

Frequent Distinct Tuples Sketch

The Task

Given a multiset of tuples with N dimensions {d1,d2, d3, ..., dN}, and a primary subset of dimensions M < N, the task is to identify the combinations of M subset dimensions that have the most frequent number of distinct combinations of the N - M non-primary dimensions.

Definitions

  • Equal Distribution Threshold

Suppose we have a stream of 160 items where the stream consists of four item types: A, B, C, and D. If the distribution of occurances was shared equally across the four items each would occur exactly 40 times or 25% of the total distribution of 160 items. Thus the equally distributed (or fair share) threshold would be 25% or as a fraction 0.25.

  • Most Frequent

We define Most Frequent items as those that consume more than the fair share threshold of the total occurances (also called the weight) of the entire stream.

Suppose we have a stream of 160 items where the stream consists of four item types: A, B, C, and D, which have the following frequency distribution:

  • A occurs 70 times
  • B occurs 50 times
  • C occurs 30 times
  • D occurs 10 times

We would declare that A is the most frequent and B is the next most frequent. We would not declare C and D in a list of most frequent items since their respective frequencies are below the threshold of 40 or 25%.

If all items occured with a frequency of 40, we could not declare any item as most frequent. Requesting a list of the “Top 4” items could be a list of the 4 items in any random order, or a list of zero items, depending on policy.

  • Relative Standard Error or RSE

When a stream becomes too large with too many different item types to efficiently process exacty we must turn to statistical methods to estimate the most frequent items. The returned list of “most frequent items” will have an associated “estimated frequency” that has an error component that is a random variable. The RSE is a measure of the width of the error distribution of this component and computed similarly to how a Standard Deviation is computed. It is relative to the estimated frequency in that the standard width of the error distribution can be computed as estimate * (1 + RSE) and estimate * (1 - RSE).

  • Primary Key

We define a specific combination of the M primary dimensions as a Primary Key and all combinations of the M primary dimensions as the set of Primary Keys.

  • Group

We define the set of all combinations of N-M non-primary dimensions associated with a single primary key as a Group.

For example, assume N = 3, M = 2, where the set of Primary Keys are defined by {d1, d2}. After populating the sketch with a stream of tuples all of size N = 3, we wish to identify the Primary Keys that have the most frequent number of distinct occurrences of {d3}. Equivalently, we want to identify the Primary Keys with the largest Groups.

Alternatively, if we choose the Primary Key as {d1}, then we can identify the {d1}s that have the largest groups of {d2, d3}. The choice of which dimensions to choose for the Primary Keys is performed in a post-processing phase after the sketch has been populated. Thus, multiple queries can be performed against the populated sketch with different selections of Primary Keys.

An important caveat is that if the distribution is too flat, there may not be any “most frequent” combinations of the primary keys above the threshold. Also, if one primary key is too dominant, the sketch may be able to only report on the single most frequent primary key combination, which means the possible existance of false negatives.

In this implementation the input tuples presented to the sketch are string arrays.

Using the FdtSketch

As a simple concrete example, let's assume N = 2 and let d1 := IP address, and d2 := User ID.

Let's choose {d1} as the Primary Keys, then the sketch will allow us to identify the IP addresses that have the largest populations of distinct User IDs.

Conversely, if we choose {d2} as the Primary Keys, the sketch will allow us to identify the User IDs with the largest populations of distinct IP addresses.

Let's set the threshold to 0.1 (10%) and the RSE to 0.05 (5%). This is telling the sketch to size itself such that items with distinct frequency counts greater than 10% of the total distinct population will be detectable and have a distinct frequency estimation error of less than or equal to 5% with 86% confidence.

//Construct the sketch
FdtSketch sketch = new FdtSketch(0.1, 0.05);

Assume the incoming data is a stream of {IP address, User ID} pairs:

//Populate
while (inputStream.hasRemainingItems()) {
  String[] in = new String[] {Pair.IPaddress, Pair.userID};
  sketch(in);
}

We are done populating the sketch, now we post process the data in the sketch:

int[] priKeyIndices = new int[] {0}; //identifies the IP address as the primary key
int numStdDev = 2; //for 95% confidence intervals
int limit = 20; //list only the top 20 groups
char sep = '|'; //the separator charactor for the group dimensions as strings
List<Group> list = sketch.getResult(priKeyIndices, limit, numStdDev, sep);
System.out.println(Group.getHeader())
Iterator<Group> itr = list.iterator()
while (itr.hasNext()) {
  System.out.println(itr.next())
}

If we want the converse relation we assign the UserID as the primary key. Note that we do not have to repopulate the sketch!

int[] priKeyIndices = new int[] {1}; //identifies the User ID as the primary key
...

Understanding the Group output columns

When the Group is printed as a string, it will output seven columns as follows:

  • Count: This is the number of retained occurrences in the sketch that belong to this group.

  • Est This is the sketch‘s estimate of the distinct cardinality of this group based on the Count and the sketche’s internal state of Theta.

  • UB: The Upper Bound of the confidence interval based on the given numStdDev.

  • LB: The Lower Bound of the confidence interval based on the given numStdDev.

  • Fraction: The fractional proportion of the cardinality of the entire sketch that this group represents.

  • RSE: The estimated RSE of this group based on the properties of this group and the internal properties of this sketch.

  • PriKey: The string identifying this group.