register datasketches-memory-1.2.0-incubating.jar; register datasketches-java-1.2.0-incubating.jar; register datasketches-pig-1.0.0-incubating.jar; -- very small sketch just for the purpose of this tiny example define dataToSketch org.apache.datasketches.pig.frequencies.DataToFrequentStringsSketch('8'); define unionSketch org.apache.datasketches.pig.frequencies.UnionFrequentStringsSketch('8'); define getEstimates org.apache.datasketches.pig.frequencies.FrequentStringsSketchToEstimates(); a = load 'data.txt' as (item:chararray, category); b = group a by category; c = foreach b generate flatten(group) as (category), flatten(dataToSketch(a.item)) as (sketch); -- Sketches can be stored at this point in binary format to be used later: -- store c into 'intermediate/$date' using BinStorage(); -- The next two lines print the results in human readable form for the purpose of this example d = foreach c generate category, getEstimates(sketch); dump d; -- This can be a separate query. -- For example, the first part can produce a daily intermediate feed and store it. -- This part can load several instances of this daily intermediate feed and merge them -- c = load 'intermediate/$date1,intermediate/$date2' using BinStorage() as (category, sketch); e = group c all; f = foreach e generate flatten(unionSketch(c.sketch)) as (sketch); g = foreach f generate getEstimates(sketch); describe g; dump g;
The example input data has 2 fields: item and category. In the first part of the query the data is grouped by category with one FrequentItemsSketch<String> per category. In the second part of the query this intermediate result is merged across categories to produce one sketch. This way the usage of all 3 UDFs is demonstrated: DataToFrequentStringsSketch, UnionFrequentStringsSketch and FrequentStringsSketchToEstimates.
Results:
From ‘dump d’ (one sketch per category):
(c1,{(a,7,7,7),(d,2,2,2),(b,1,1,1)}) (c2,{(a,5,5,5),(d,2,2,2),(e,1,1,1),(c,1,1,1)})
From ‘dump g’ (merged across categories):
({(a,12,12,12),(d,4,4,4),(b,1,1,1),(e,1,1,1),(c,1,1,1)})
From ‘describe g’:
g: {bag_of_item_tuples: {item_tuple: (item: chararray,estimate: long,lower_bound: long,upper_bound: long)}}
In this example the results are exact due to small input.
a c1 a c1 a c1 a c2 a c1 b c1 c c2 d c1 e c2 a c1 a c2 a c2 a c2 d c1 d c2 a c1 a c2 a c1 d c2