These instructions work on Unix-based systems, including macOS. Windows systems will need an equivalent script.
Place the following in an empty text file called “sketch” and update the version numbers and the path to your local .m2/repository directory:
#!/bin/bash
# Update version numbers and the path to your local .m2/repository as necessary
COREVER="0.5.2"
MISCVER="0.1.0"
M2PATH="/path/to/.m2/repository"
COREPATH="$M2PATH/com/yahoo/datasketches/sketches-core/$COREVER/sketches-core-$COREVER.jar"
MISCPATH="$M2PATH/com/yahoo/datasketches/sketches-misc/$MISCVER/sketches-misc-$MISCVER.jar"
CLSPATH="$COREPATH:$MISCPATH"
# Quote the expansions so that paths and arguments containing spaces survive intact
java -cp "$CLSPATH" com.yahoo.sketches.cmd.CommandLine "$@"
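If the version numbers or the repository path are wrong, java fails with a class-loading error that does not name the missing file. A minimal sketch of a pre-flight check follows; check_jars is a hypothetical helper, not part of the sketches libraries, and it only tests that each path is a readable file.

```bash
# Hypothetical helper: report every argument that is not a readable file.
# Returns 0 when all paths are readable, 1 otherwise.
check_jars() {
  local missing=0
  local jar
  for jar in "$@"; do
    if [ ! -r "$jar" ]; then
      echo "Missing jar: $jar" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible use inside the script, just before the java line:
# check_jars "$COREPATH" "$MISCPATH" || exit 1
```

Placing the call before the java line makes the script stop early with the offending path printed to standard error.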
Copy this sketch file to a directory on your PATH, such as /usr/local/bin, so that it can be run from anywhere, and make it executable.
cp sketch /usr/local/bin/sketch
chmod +x /usr/local/bin/sketch
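Before testing, it can help to confirm that the shell actually resolves the new command. The following is a small sketch; is_on_path is a hypothetical helper name, and it relies only on the standard command -v builtin.

```bash
# is_on_path NAME: succeed if the shell can resolve NAME to something it can run.
is_on_path() {
  command -v "$1" >/dev/null 2>&1
}

if is_on_path sketch; then
  echo "sketch found at: $(command -v sketch)"
else
  echo "sketch is not on your PATH; check the copy destination and permissions" >&2
fi
```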
Test your executable. You should see something like the following:
sketch

NAME
    sketch - sketch Uniques, Quantiles, Histograms, or Frequent Items.
SYNOPSIS
    sketch (this help)
    sketch TYPE help
    sketch TYPE [SIZE] [FILE]
DESCRIPTION
    Write a sketch(TYPE, SIZE) of FILE to standard output.
    TYPE is required.
    If SIZE is omitted, internal defaults are used.
    If FILE is omitted, Standard In is assumed.
TYPE DESCRIPTION
    sketch uniq    : Sketch the unique string items of a stream.
    sketch rank    : Sketch the rank-value distribution of a numeric value stream.
    sketch hist    : Sketch the linear-axis value-frequency distribution of numeric value stream.
    sketch loghist : Sketch the log-axis value-frequency distribution of numeric value stream.
    sketch freq    : Sketch the Heavy Hitters of a string item stream.
UNIQ SYNOPSIS
    sketch uniq help
    sketch uniq [SIZE] [FILE]
RANK SYNOPSIS
    sketch rank help
    sketch rank [SIZE] [FILE]
HIST SYNOPSIS
    sketch hist help
    sketch hist [SIZE] [FILE]
LOGHIST SYNOPSIS
    sketch loghist help
    sketch loghist [SIZE] [FILE]
FREQ SYNOPSIS
    sketch freq help
    sketch freq [SIZE] [FILE]
You can create a test data file, with duplicate values, like this:
$ python -c "exec(\"import random\\nfor _ in range(10000000): print(random.randint(1,10000000))\")" > manyNumbers.txt
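If Python is not available, a similar file can be produced with awk alone. This is a sketch under stated assumptions: gen_numbers is a hypothetical helper, the seed argument is an addition for repeatability, and the quality of awk's rand() varies between implementations, so the duplicate rate may differ slightly from the Python version.

```bash
# gen_numbers COUNT MAX [SEED]: print COUNT random integers in the range 1..MAX.
gen_numbers() {
  awk -v n="$1" -v max="$2" -v seed="${3:-1}" 'BEGIN {
    srand(seed)
    # rand() returns a value in [0, 1), so int(rand() * max) + 1 lies in 1..max
    for (i = 0; i < n; i++) print int(rand() * max) + 1
  }'
}

# Small sample printed to the terminal:
gen_numbers 5 100 42

# To mirror the Python one-liner, scale up and redirect to a file:
# gen_numbers 10000000 10000000 > manyNumbers.txt
```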
Now you can either pipe the file into sketch:
$ cat manyNumbers.txt | sketch uniq
or
$ cat manyNumbers.txt | sketch uniq 16000
or pass the file name as an argument:
$ sketch uniq manyNumbers.txt
or
$ sketch uniq 16000 manyNumbers.txt
Providing SIZE lets you tune the accuracy: a larger sketch generally gives a smaller estimation error at the cost of more memory.
Be sure to compare the speed of the above to the conventional method:
$ cat manyNumbers.txt | sort | uniq | wc -l
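To make the comparison repeatable, the conventional pipeline can be wrapped in a function and timed alongside the sketch command. The following is a minimal sketch; exact_uniques is a hypothetical helper, and it uses sort -u, which is equivalent to sort | uniq for counting distinct lines.

```bash
# exact_uniques FILE: count distinct lines exactly by sorting and de-duplicating.
exact_uniques() {
  sort -u "$1" | wc -l | awk '{ print $1 }'
}

# Self-contained check on a tiny file with three distinct values.
sample="$(mktemp)"
printf '3\n1\n3\n2\n1\n' > "$sample"
exact_uniques "$sample"   # prints 3
rm -f "$sample"

# On the real data, compare wall-clock times (requires the sketch command):
# time exact_uniques manyNumbers.txt
# time sketch uniq manyNumbers.txt
```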
If you haven't already, clone and install sketches-core and sketches-misc as in the previous example.
Create a file called demo with the same content as the sketch file, except for the last line, which becomes:
java -cp "$CLSPATH" com.yahoo.sketches.demo.ExactVsSketchDemo "$@"
Copy this demo file to a directory on your PATH, such as /usr/local/bin, and make it executable.
cp demo /usr/local/bin/demo
chmod +x /usr/local/bin/demo
When run, the output should look something like this:
demo

# COMPUTE DISTINCT COUNT EXACTLY:
## BUILD FILE:
Time Min:Sec.mSec = 0:17.569
Total Values: 100,000,000
Build Rate: 175 nSec/Value
Exact Uniques: 50,002,776
File Size Bytes: 1,693,331,301
## SORT & REMOVE DUPLICATES
Unix cmd: sort -u -o tmp/sorted.txt tmp/test.txt
Time Min:Sec.mSec = 1:49.571
## LINE COUNT
Unix cmd: wc -l tmp/sorted.txt
Time Min:Sec.mSec = 0:00.900
Output from wc command:
50002776 tmp/sorted.txt
Total Exact Time Min:Sec.mSec = 2:08.040

# COMPUTE DISTINCT COUNT USING SKETCHES
## USING THETA SKETCH
Time Min:Sec.mSec = 0:00.614
Total Values: 100,000,000
Build Rate: 6 nSec/Value
Exact Uniques: 50,002,776
## SKETCH STATS
Sketch Estimate of Uniques: 50,098,990
Sketch Actual Relative Error: 0.192%
Sketch 95%ile Error Bounds : +/- 1.563%
Max Sketch Size Bytes: 262,144
Speedup Factor 208.5

## USING HLL SKETCH
Time Min:Sec.mSec = 0:02.212
Total Values: 100,000,000
Build Rate: 22 nSec/Value
Exact Uniques: 50,002,776
## SKETCH STATS
Sketch Estimate of Uniques: 49,784,556
Sketch Actual Relative Error: -0.436%
Sketch 95%ile Error Bounds : +/- 1.306%
Max Sketch Size Bytes: 8,192
Speedup Factor 57.9
The first part builds a file, separately timed, of 100M numbers with roughly 50% duplicates. The second part sorts the file and removes duplicates using the Unix sort -u command; this may take several minutes, so be patient. The third part counts the remaining lines using the Unix wc -l command.
After that, two different sketch trials are run, one with a Theta Sketch and the other with a compact implementation of Flajolet's HLL Sketch. Sketches do not require a pre-built file. They run in true streaming mode with the random values generated on the fly.
Check out the statistics!
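The reported statistics can be cross-checked by hand from the other figures in the sample output. The sketch below recomputes the relative errors and speedup factors; the helper names are hypothetical. The last function assumes, without confirmation from the output, that the Theta sketch was configured with 16,384 nominal entries and that its 95% bound is roughly two standard errors, that is 2/sqrt(k).

```bash
# rel_error_pct ESTIMATE EXACT: signed relative error in percent.
rel_error_pct() {
  awk -v est="$1" -v exact="$2" 'BEGIN { printf "%.3f\n", 100 * (est - exact) / exact }'
}

# speedup EXACT_SECONDS SKETCH_SECONDS: ratio of the two run times.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# theta_bound_pct K: approximate 95% bound, 2/sqrt(K), in percent (assumed model).
theta_bound_pct() {
  awk -v k="$1" 'BEGIN { printf "%.2f\n", 200 / sqrt(k) }'
}

rel_error_pct 50098990 50002776   # Theta: prints 0.192
rel_error_pct 49784556 50002776   # HLL:   prints -0.436
speedup 128.040 0.614             # Theta: prints 208.5 (2:08.040 is 128.040 s)
speedup 128.040 2.212             # HLL:   prints 57.9
theta_bound_pct 16384             # prints 1.56, close to the reported 1.563%
```

Both actual errors fall well inside the reported 95% bounds, which is the expected outcome for most runs.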
Enjoy!