 ---
 title: Apache DataFu Pig - Getting Started
 version: 1.6.1
 section_name: Getting Started
 license: >
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
 ---

 # DataFu Pig

 Apache DataFu Pig is a collection of user-defined functions for working with large-scale data in [Apache Pig](http://pig.apache.org/).  It provides functions in the following areas:

 <div class="row">
   <div class="col-lg-6">
     <h4>Statistics</h4>
     <p>
       Compute quantiles, median, variance, Wilson binary confidence intervals, etc.
     </p>

     <h4>Set Operations</h4>
     <p>
       Perform set intersection, union, or difference of bags.
     </p>

     <h4>Bags</h4>
     <p>
       Convenient functions for working with bags, such as enumerating items,
       append, prepend, concat, group, distinct, etc.
     </p>

     <h4>Sessions</h4>
     <p>
       Sessionize events from a stream of data.
     </p>
   </div>

   <div class="col-lg-6">
     <h4>Estimation</h4>
     <p>
       Streaming implementations that can estimate
       quantiles and median.
     </p>

     <h4>Sampling</h4>
     <p>
       Simple random sampling with or without replacement,
       weighted sampling.
     </p>

     <h4>Link Analysis</h4>
     <p>
       Run PageRank on a graph represented by a bag of
       nodes and edges.
     </p>

     <h4>More</h4>
     <p>
       Other useful methods like Assert and Coalesce.
     </p>
   </div>
 </div>

 If you'd like to read more about these functions, check out the [Guide](/docs/datafu/guide.html).  Otherwise, if you are
 ready to start using DataFu Pig, keep reading.

 The rest of this page assumes you already have a built JAR available.  If this is not the case, please see the [Download](/docs/download.html) page.

 ## Basic Example: Computing Median

 Let's use DataFu Pig to perform a very basic task: computing the median of some data.
 Suppose we have a file `input` in Hadoop with the following content:

     1
     2
     3
     2
     2
     2
     3
     2
     2
     1

 Sorting these ten values shows that both middle values are 2, so the median is 2.  First, start Pig's grunt shell by running `pig`,
 then register the DataFu JAR:

 ```pig
 register datafu-pig-<%= current_page.data.version %>.jar
 ```

 To compute the median, we'll use DataFu's `StreamingMedian`, which returns an estimate of the median rather than an exact value but has the benefit
 of not requiring the data to be sorted:

 ```pig
 DEFINE Median datafu.pig.stats.StreamingMedian();
 ```

 Next we can load the data and pass it into the function to compute the median:

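 ```pig
 -- Load one integer per line from the file 'input'
 data = LOAD 'input' USING PigStorage() AS (val:int);
 -- Group all rows into a single bag and estimate its median
 data = FOREACH (GROUP data ALL) GENERATE Median(data);
 DUMP data;
 ```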

 This produces the expected output:

     ((2.0))
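
 For comparison, here is a minimal sketch of computing an exact median with DataFu's `datafu.pig.stats.Median`, which, unlike `StreamingMedian`, expects its input bag to be sorted.  The alias names `ExactMedian`, `grouped`, `sorted`, and `medians` are chosen here for illustration only, and the sketch assumes the same `input` file and registered JAR as above:

 ```pig
 -- Exact median; the input bag must be sorted before it is passed in
 DEFINE ExactMedian datafu.pig.stats.Median();

 data = LOAD 'input' USING PigStorage() AS (val:int);
 grouped = GROUP data ALL;

 -- Sort the bag inside a nested FOREACH, then compute the median
 medians = FOREACH grouped {
   sorted = ORDER data BY val;
   GENERATE ExactMedian(sorted.val);
 }
 DUMP medians;
 ```

 The sort makes this approach more expensive on large inputs, which is the trade-off the streaming estimate avoids.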

 ## Next Steps

 Check out the [Guide](/docs/datafu/guide.html) for more information on what you can do with DataFu Pig.