| --- |
| title: Statistics - Guide - Apache DataFu Pig |
| version: 1.4.0 |
| section_name: Apache DataFu Pig - Guide |
| license: > |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
| ## Statistics |
| |
| ### Median |
| |
| Apache DataFu has two UDFs that can be used to compute the [median](http://en.wikipedia.org/wiki/Median) of a bag. |
| [Median](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/stats/Median.html) computes the median exactly, but |
| requires that the input bag be sorted. [StreamingMedian](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/stats/StreamingMedian.html), |
| on the other hand, does not require that the bag be sorted, however it computes only an estimate of the median. But, because it does not require |
| the input bag to be sorted, it is more efficient. |
| |
| Let's take a look at computing the median using `StreamingMedian`: |
| |
| ```pig |
| define Median datafu.pig.stats.StreamingMedian(); |
| |
| -- input: 3,5,4,1,2 |
| input = LOAD 'input' AS (val:int); |
| |
| -- produces: 3 |
| medians = FOREACH (GROUP input ALL) GENERATE Median(input.val); |
| ``` |
| |
| ### Quantiles |
| |
| [Quantiles](http://en.wikipedia.org/wiki/Quantile) are points at regular intervals within an ordered data set. Essentially |
| we divide an ordered data set into segments, and the quantiles are the values between the segments. The quantiles people are probably |
| most familiar with are those for median and percentiles. |
| |
| Similar to median, DataFu has two UDFs that can compute quantiles. The median UDFs are in fact just wrappers around the quantile UDFs. |
| [Quantile](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/stats/Quantile.html) computes the quantiles of a sorted bag exactly, |
| and [StreamingQuantile](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/stats/StreamingQuantile.html) computes an estimate of |
| the quantiles of a bag that does not need to be sorted. |
| |
| Let's take a look at computing the median using `StreamingQuantile`: |
| |
| ```pig |
| define Quantile datafu.pig.stats.StreamingQuantile('0.0','0.5','1.0'); |
| |
| -- input: 9,10,2,3,5,8,1,4,6,7 |
| input = LOAD 'input' AS (val:int); |
| |
| -- produces: (1,5.5,10) |
| quantiles = FOREACH (GROUP input ALL) GENERATE Quantile(input.val); |
| ``` |
| |
| ### Variance |
| |
| [Variance](http://en.wikipedia.org/wiki/Variance) can be computed using the [VAR](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/stats/VAR.html) |
| UDF: |
| |
| ```pig |
| define VAR datafu.pig.stats.VAR(); |
| |
| -- input: 1,2,3,4,5,6,7,8,9 |
| input = LOAD 'input' AS (val:int); |
| |
| -- produces: 6.666666666666668 |
| variance = FOREACH (GROUP input ALL) GENERATE VAR(input.val); |
| ``` |