layout: doc_page

Note that these instructions work on Unix-based systems, including macOS. Windows systems will need an equivalent script.

Place the following in an empty text file called “sketch” and update the version numbers and the path to your local .m2/repository directory:

#!/bin/bash
# Update version numbers and the path to your local .m2/repository as necessary

COREVER="0.5.2"
MISCVER="0.1.0"
M2PATH="/path/to/.m2/repository"

COREPATH="$M2PATH/com/yahoo/datasketches/sketches-core/$COREVER/sketches-core-$COREVER.jar"
MISCPATH="$M2PATH/com/yahoo/datasketches/sketches-misc/$MISCVER/sketches-misc-$MISCVER.jar"
CLSPATH="$COREPATH:$MISCPATH"

# Quote the expansions so that paths and arguments containing spaces survive intact
java -cp "$CLSPATH" com.yahoo.sketches.cmd.CommandLine "$@"

Copy this sketch file to a directory on your PATH, such as /usr/local/bin, and make it executable. Depending on permissions, these commands may need sudo.

cp sketch /usr/local/bin/sketch
chmod +x /usr/local/bin/sketch

Test your executable. You should see something like the following:

sketch

NAME
    sketch - sketch Uniques, Quantiles, Histograms, or Frequent Items.
SYNOPSIS
    sketch (this help)
    sketch TYPE help
    sketch TYPE [SIZE] [FILE]
DESCRIPTION
    Write a sketch(TYPE, SIZE) of FILE to standard output.
    TYPE is required.
    If SIZE is omitted, internal defaults are used.
    If FILE is omitted, Standard In is assumed.
TYPE DESCRIPTION
    sketch uniq    : Sketch the unique string items of a stream.
    sketch rank    : Sketch the rank-value distribution of a numeric value stream.
    sketch hist    : Sketch the linear-axis value-frequency distribution of numeric value stream.
    sketch loghist : Sketch the log-axis value-frequency distribution of numeric value stream.
    sketch freq    : Sketch the Heavy Hitters of a string item stream.

UNIQ SYNOPSIS
    sketch uniq help
    sketch uniq [SIZE] [FILE]

RANK SYNOPSIS
    sketch rank help
    sketch rank [SIZE] [FILE]

HIST SYNOPSIS
    sketch hist help
    sketch hist [SIZE] [FILE]

LOGHIST SYNOPSIS
    sketch loghist help
    sketch loghist [SIZE] [FILE]

FREQ SYNOPSIS
    sketch freq help
    sketch freq [SIZE] [FILE]

You can create a test data file, with duplicate values, like this:

$ python3 -c "exec(\"import random\\nfor _ in range(10000000): print(random.randint(1,10000000))\")" > manyNumbers.txt
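
The one-liner above is compact but hard to tweak. A minimal sketch of the same idea as a standalone Python 3 script follows. The function name write_random_numbers, its parameters, and the optional seed are illustrative assumptions, not part of the sketches tooling.

```python
import random
import sys


def write_random_numbers(path, count, max_value, seed=None):
    """Write `count` random integers in [1, max_value], one per line.

    Drawing `count` values from a range of similar size yields many duplicates.
    `seed` is a hypothetical convenience for producing repeatable test files.
    """
    rng = random.Random(seed)
    with open(path, "w") as out:
        for _ in range(count):
            out.write(f"{rng.randint(1, max_value)}\n")


if __name__ == "__main__" and len(sys.argv) == 2:
    # Example: python3 make_numbers.py manyNumbers.txt
    write_random_numbers(sys.argv[1], 10_000_000, 10_000_000)
```

Lowering count while testing keeps the file small; the sketch commands work the same way on any line-oriented input.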

Now you can either pipe the data in on standard input:

$ cat manyNumbers.txt | sketch uniq
or
$ cat manyNumbers.txt | sketch uniq 16000

or pass the file name as an argument:

$ sketch uniq manyNumbers.txt
or
$ sketch uniq 16000 manyNumbers.txt

Providing SIZE lets you tune the accuracy: a larger sketch gives a tighter estimate at the cost of more memory.
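
To get a feel for how size trades against accuracy, the following Python 3 sketch tabulates a rule of thumb often quoted for theta-style sketches: relative standard error of roughly 1/sqrt(k). It assumes that uniq is backed by such a sketch and that SIZE plays the role of k; neither is stated on this page, so treat the numbers as orders of magnitude only. The function names are hypothetical.

```python
import math


def approx_rse(k):
    """Approximate relative standard error, ASSUMING RSE ~ 1/sqrt(k)."""
    return 1.0 / math.sqrt(k)


def approx_bounds_95(k):
    """Approximate +/- relative error at about 95% confidence (two RSEs)."""
    return 2.0 * approx_rse(k)


if __name__ == "__main__":
    for k in (4096, 16384, 65536):
        print(f"k={k:>6}  RSE ~ {approx_rse(k):.3%}  "
              f"95% bounds ~ +/-{approx_bounds_95(k):.3%}")
```

Under this approximation, quadrupling the size halves the expected error.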

Be sure to compare the speed of the above to the conventional method:

$ cat manyNumbers.txt | sort | uniq | wc -l
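
For a ground truth to check the estimate against, the exact count can also be computed in a few lines of Python 3. This is only a sketch: count_distinct and relative_error are hypothetical helper names, and the set it builds holds every distinct line in memory, which is exactly the cost a sketch avoids.

```python
import sys


def count_distinct(lines):
    """Return the exact number of distinct lines (trailing newlines ignored)."""
    seen = set()
    for line in lines:
        seen.add(line.rstrip("\n"))
    return len(seen)


def relative_error(estimate, exact):
    """Signed relative error of an estimate against the exact count."""
    return (estimate - exact) / exact


if __name__ == "__main__" and len(sys.argv) == 2:
    # Example: python3 exact_count.py manyNumbers.txt
    with open(sys.argv[1]) as f:
        print(count_distinct(f))
```

For a file in which every line ends with a newline, the printed count should match the sort | uniq | wc -l pipeline. Passing the sketch estimate and that count to relative_error shows how close the estimate came.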

If you haven't already, clone and install sketches-core and sketches-misc as in the previous example.

Create the demo executable with the same content as the sketch executable except for the last line:

java -cp "$CLSPATH" com.yahoo.sketches.demo.ExactVsSketchDemo "$@"

Copy this demo file to a directory on your PATH, such as /usr/local/bin, and make it executable.

cp demo /usr/local/bin/demo
chmod +x /usr/local/bin/demo

When run, the output should look something like this:

demo

# COMPUTE DISTINCT COUNT EXACTLY:
## BUILD FILE:
Time Min:Sec.mSec = 0:17.569
Total Values: 100,000,000
Build Rate: 175 nSec/Value
Exact Uniques: 50,002,776
File Size Bytes: 1,693,331,301

## SORT & REMOVE DUPLICATES
Unix cmd: sort -u -o tmp/sorted.txt tmp/test.txt
Time Min:Sec.mSec = 1:49.571

## LINE COUNT
Unix cmd: wc -l tmp/sorted.txt
Time Min:Sec.mSec = 0:00.900
Output from wc command:
 50002776 tmp/sorted.txt

Total Exact Time Min:Sec.mSec = 2:08.040


# COMPUTE DISTINCT COUNT USING SKETCHES
## USING THETA SKETCH
Time Min:Sec.mSec = 0:00.614
Total Values: 100,000,000
Build Rate: 6 nSec/Value
Exact Uniques: 50,002,776
## SKETCH STATS
Sketch Estimate of Uniques: 50,098,990
Sketch Actual Relative Error: 0.192%
Sketch 95%ile Error Bounds  : +/- 1.563%
Max Sketch Size Bytes: 262,144
Speedup Factor 208.5

## USING HLL SKETCH
Time Min:Sec.mSec = 0:02.212
Total Values: 100,000,000
Build Rate: 22 nSec/Value
Exact Uniques: 50,002,776
## SKETCH STATS
Sketch Estimate of Uniques: 49,784,556
Sketch Actual Relative Error: -0.436%
Sketch 95%ile Error Bounds  : +/- 1.306%
Max Sketch Size Bytes: 8,192
Speedup Factor 57.9

The first part builds a file, separately timed, of 100M numbers with roughly 50% duplicates. The second part sorts the file and removes duplicates using the Unix sort -u command; this may take several minutes, so be patient. The third part counts the remaining lines using the Unix wc -l command.

After that, two different sketch trials are run, one with a Theta Sketch and the other with a compact implementation of Flajolet's HLL Sketch. Sketches do not require a pre-built file. They run in true streaming mode with the random values generated on the fly.

Check out the statistics!
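
The derived figures in the sample output above can be reproduced from its raw timings and counts. The Python 3 sketch below does that arithmetic. The helper names are illustrative, and the formulas (speedup as total exact time divided by sketch time, build rate truncated to whole nanoseconds) are inferred from the printed numbers rather than taken from the demo source.

```python
def to_seconds(min_sec):
    """Convert a 'Min:Sec.mSec' string such as '2:08.040' to seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + float(seconds)


def relative_error_pct(estimate, exact):
    """Signed relative error as a percentage."""
    return 100.0 * (estimate - exact) / exact


def speedup(exact_time, sketch_time):
    """Ratio of total exact time to sketch time."""
    return to_seconds(exact_time) / to_seconds(sketch_time)


def build_rate_ns(elapsed, values):
    """Whole nanoseconds per value (truncated, as the demo appears to do)."""
    return int(to_seconds(elapsed) * 1e9 / values)


if __name__ == "__main__":
    exact = 50_002_776  # "Exact Uniques" from the sample output
    print(f"Theta error:   {relative_error_pct(50_098_990, exact):.3f}%")
    print(f"HLL error:     {relative_error_pct(49_784_556, exact):.3f}%")
    print(f"Theta speedup: {speedup('2:08.040', '0:00.614'):.1f}")
    print(f"HLL speedup:   {speedup('2:08.040', '0:02.212'):.1f}")
    print(f"Theta rate:    {build_rate_ns('0:00.614', 100_000_000)} nSec/Value")
```

The total exact time itself is the sum of the three exact-count steps: 0:17.569 + 1:49.571 + 0:00.900 = 2:08.040. Your own timings will differ from these sample figures, but the same relationships should hold.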

Enjoy!