register datasketches-memory-2.0.0.jar;
register datasketches-java-3.1.0.jar;
register datasketches-pig-1.1.0.jar;

define dataToSketch org.apache.datasketches.pig.quantiles.DataToDoublesSketch();
define unionSketch org.apache.datasketches.pig.quantiles.UnionDoublesSketch();
define getQuantile org.apache.datasketches.pig.quantiles.GetQuantileFromDoublesSketch();

a = load 'data.txt' as (value:double, category);
b = group a by category;
c = foreach b generate flatten(group) as (category), flatten(dataToSketch(a.value)) as sketch;

-- Sketches can be stored at this point in binary format to be used later:
-- store c into 'intermediate/$date' using BinStorage();

-- The next two lines print the results in human readable form for the purpose of this example
d = foreach c generate category, getQuantile(sketch, 0.5); -- median value from the sketch
dump d;

-- This can be a separate query
-- For example, the first part can produce a daily intermediate feed and store it,
-- and this part can load several instances of this daily intermediate feed and union them
-- c = load 'intermediate/$date1,intermediate/$date2' using BinStorage() as (category, sketch);
e = group c all;
f = foreach e generate flatten(unionSketch(c.sketch)) as (sketch);
g = foreach f generate getQuantile(sketch, 0.5); -- median value from the sketch
dump g;
The example input data has two fields: value and category. The first part of the query builds one quantiles sketch per category (relation c). The second part unions those per-category sketches into a single sketch covering all categories and reads the median from it.
From ‘dump d’:
(a,6.0)
(b,16.0)
From ‘dump g’ (merged across categories):
(11.0)
Input file data.txt (tab-separated, one value and category per line):
1	a
2	a
3	a
4	a
5	a
6	a
7	a
8	a
9	a
10	a
11	b
12	b
13	b
14	b
15	b
16	b
17	b
18	b
19	b
20	b
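As a sanity check, the medians reported above can be reproduced without any sketch library. The Python sketch below is only an illustration, not the DataSketches implementation: it assumes the input is small enough for the sketch to answer exactly, and it assumes the quantile at normalized rank r is the sorted item at index floor(r * n). That indexing rule is a guess chosen because it matches the results shown for 'dump d' and 'dump g'; the helper name exact_quantile is hypothetical.

```python
import math
from collections import defaultdict

# Same 20 rows as data.txt: values 1-10 in category "a", 11-20 in category "b".
rows = [(float(v), "a" if v <= 10 else "b") for v in range(1, 21)]


def exact_quantile(values, rank):
    """Return the sorted item at normalized rank `rank`.

    Assumption (not taken from the library): index = floor(rank * n),
    clamped to the last item. It reproduces the outputs in this example.
    """
    ordered = sorted(values)
    index = min(int(math.floor(rank * len(ordered))), len(ordered) - 1)
    return ordered[index]


# First part of the query: one median per category (like relation d).
by_category = defaultdict(list)
for value, category in rows:
    by_category[category].append(value)
per_category = {cat: exact_quantile(vals, 0.5) for cat, vals in by_category.items()}

# Second part: median over all categories together (like relation g).
merged = exact_quantile([value for value, _ in rows], 0.5)

print(per_category)
print(merged)
```

On large inputs a real sketch returns an approximate quantile rather than an exact one, so this comparison is only meaningful for small examples like this one.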