Add a simple VarOpt Pig UDF example and data file
diff --git a/docs/Sampling/VarOptPigUDFs.md b/docs/Sampling/VarOptPigUDFs.md
new file mode 100644
index 0000000..ed35e46
--- /dev/null
+++ b/docs/Sampling/VarOptPigUDFs.md
@@ -0,0 +1,82 @@
+---
+layout: doc_page
+---
+
+## VarOpt Sampling Sketch Pig UDFs
+
+### Instructions
+
+* Get the sketches-core and sketches-pig jars.
+* Save the following script as varopt_example.pig.
+* Adjust the jar versions and paths in the script as necessary.
+* Save the data below into a file called data.txt.
+* Copy the data to HDFS: "hadoop fs -copyFromLocal data.txt"
+* Run the Pig script: "pig varopt_example.pig"
+
+### varopt_example.pig script
+
+    register sketches-core-0.10.0.jar;
+    register sketches-pig-0.10.0.jar;
+
+    -- very small sketch just for the purpose of this tiny example
+    DEFINE VarOpt com.yahoo.sketches.pig.sampling.DataToVarOptSketch('4', '0');
+    DEFINE VarOptUnion com.yahoo.sketches.pig.sampling.VarOptUnion('4');
+    DEFINE GetSamples com.yahoo.sketches.pig.sampling.GetVarOptSamples();
+
+    raw_data = LOAD 'data.txt' USING PigStorage('\t') AS
+	(weight: double, id: chararray);
+
+    -- make a few independent sketches from the input data
+    bytes = FOREACH
+	(GROUP raw_data ALL)
+    GENERATE
+	VarOpt(raw_data) AS sketch0,
+	VarOpt(raw_data) AS sketch1
+	;
+
+    sketchBag = FOREACH
+	bytes
+    GENERATE
+	FLATTEN(TOBAG(sketch0,
+	      sketch1)) AS sketches
+	;
+
+    unioned = FOREACH
+	(GROUP sketchBag ALL)
+    GENERATE
+	VarOptUnion(sketchBag.sketches) AS binSketch
+	;
+
+    result = FOREACH
+	unioned
+    GENERATE
+        FLATTEN(GetSamples(binSketch)) AS (vo_weight, record:(weight, id))
+	;
+
+    DUMP result;
+    DESCRIBE result;
+
+The test data has two fields: weight and id. The first step of the query creates two independent VarOpt sketches from the input data. The next step collects those sketches into a single bag, which is then passed to the union. Finally, the last step extracts the resulting samples from the unioned sketch.
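+
+The script above uses GROUP ... ALL, so each sketch summarizes the entire relation. In many jobs one would instead build one sketch per key with GROUP ... BY; the variant below is only an illustrative sketch of that pattern and is not part of the example above (the keyed input file and alias names are hypothetical):
+
+    -- hypothetical per-key variant: one VarOpt sketch per key
+    -- keyed_data.txt is assumed to hold (key, weight, id) rows, tab separated
+    keyed = LOAD 'keyed_data.txt' USING PigStorage('\t') AS
+        (key: chararray, weight: double, id: chararray);
+
+    -- the UDF still receives (weight, id) tuples, as in the script above
+    sketch_per_key = FOREACH (GROUP keyed BY key) GENERATE
+        group AS key,
+        VarOpt(keyed.(weight, id)) AS sketch
+        ;
+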
+### Results
+
+From 'DUMP result':
+
+    (30.0,(30.0,heavy))
+    (30.0,(30.0,heavy))
+    (28.0,(4.0,d))
+    (28.0,(7.0,g))
+
+By running this script repeatedly, we can observe that the heavy item (present in both input sketches) is always included, while the remaining two items differ across runs. We can also see that the VarOpt weight is an adjusted weight; because the entire input tuple is kept, the original weight value is still available inside each record. A short example of using these adjusted weights appears after the schema below.
+
+From 'DESCRIBE result':
+
+    result: {vo_weight: double,record: (weight: bytearray,id: bytearray)}
+
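+Because the goal of VarOpt is accurate subset-sum estimates, a natural follow-on query sums the adjusted weights in the sample. The query below is only an illustrative sketch, not part of the example above; it reuses the result alias from the script:
+
+    -- hypothetical follow-on: the adjusted weights sum to the total input weight,
+    -- and summing vo_weight over a filtered subset of the sample estimates that subset's weight
+    total = FOREACH (GROUP result ALL) GENERATE SUM(result.vo_weight) AS estimated_total;
+    DUMP total;
+
+For the sample shown above this yields 116.0 (30 + 30 + 28 + 28), which is twice the input total of 58 because the data was sketched twice before the union. A subset-sum estimate, such as the one discussed in the VarOpt Sampling documentation, would FILTER the records on an item attribute before the SUM.
+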
+### [data.txt]({{site.docs_dir}}/Sampling/data.txt) (tab separated)
+    1.0	a
+    2.0	b
+    3.0	c
+    4.0	d
+    5.0	e
+    6.0	f
+    7.0	g
+    30.0	heavy
diff --git a/docs/Sampling/VarOptSampling.md b/docs/Sampling/VarOptSampling.md
index e24dbb1..eaefab7 100644
--- a/docs/Sampling/VarOptSampling.md
+++ b/docs/Sampling/VarOptSampling.md
@@ -13,7 +13,7 @@
 * VarOptItemsSketch<T>
 
     This sketch provides a random sample of items of type <T> from the stream of weighted items.
-    An item's inclusion probability will be roughly proportional to its weight, with some
+    An item's inclusion probability will usually be proportional to its weight, with some
     important technical caveats to ensure the optimal variance property.
 
     If the user needs to serialize and deserialize the resulting sketch for storage or transport, 
@@ -29,8 +29,8 @@
 that can be stored in the reservoir. In contrast to some other sketches in this library, the size does
 not need to be a power of 2.
 
-When serialized, these sketches use 16 bytes of header data in addition to the serialized size of the
-items in the sketch.
+When serialized, these sketches use 32 bytes of header data in addition to the serialized size of the
+items in the sketch. VarOpt unions may require some extra metadata beyond the regular header.
 
 
 ### Updating the sketch with new items
@@ -46,7 +46,8 @@
 
 The underlying goal of VarOpt sampling is to provide the best possible estimate of subset sums of items in the sample. As an example, we might select a sample of size <tt>k</tt> from the ~3200 counties (a political administrative region below the level of a state) in the United States, using the county population as the weight. We could then apply a predicate to our sample -- for instance, counties in the state of California -- and sum the resulting weights. That sum is our estimate of the total population of the state. The weights used when computing subset sums will, in general, be adjusted values rather than the original input weights.
 
-Unlike standard reservoir sampling, where each sample is considered independently, VarOpt attempts to minimize the total variance in the sample by selecting samples in a way such that they may be <em>negatively</em> correlated. This produces better estimates for subset sums, but does mean that our random sample is not necessarily uniform.
+Unlike standard reservoir sampling, where each sample is considered independently, VarOpt attempts to minimize the total variance in the sample by selecting samples in a way such that they may be <em>negatively</em> correlated. This produces better estimates for subset sums, but does mean that our random sample is not necessarily uniform and that an item's inclusion probability is not a simple function of the item weight.
+
 
 #### VarOpt Intuition
 
diff --git a/docs/Sampling/VarOptSamplingJava.md b/docs/Sampling/VarOptSamplingJava.md
index 2fc5965..f0349af 100644
--- a/docs/Sampling/VarOptSamplingJava.md
+++ b/docs/Sampling/VarOptSamplingJava.md
@@ -11,7 +11,7 @@
 
     perl -lane 's/^\s+//; s/[;\.,!?:\x27\[\]&]//g; s/--//g; s/\s+/\n/g; print lc if length > 0' input.txt | sort | uniq -c | awk '{print $1 "\t" $2}' > output.txt
 
-These were then used in the following example, slightly modified to remove error handling for clarity.
+These were then used in the following example, slightly modified to remove error handling for clarity. Serialization and deserialization are completely parallel to the Reservoir Sampling sketch, and example code can be found in those Java examples.
 
 
     import java.io.BufferedReader;
@@ -19,7 +19,6 @@
     import java.io.FileInputStream;
     import java.io.FileOutputStream;
     import java.io.FileReader;
-    import java.io.IOException;
 
     import com.yahoo.memory.Memory;
     import com.yahoo.sketches.ArrayOfLongsSerDe;
diff --git a/docs/Sampling/data.txt b/docs/Sampling/data.txt
new file mode 100644
index 0000000..c6a2d50
--- /dev/null
+++ b/docs/Sampling/data.txt
@@ -0,0 +1,8 @@
+1.0	a
+2.0	b
+3.0	c
+4.0	d
+5.0	e
+6.0	f
+7.0	g
+30.0	heavy