---
title: Apache DataFu Pig - Getting Started
version: 1.6.1
section_name: Getting Started
license: >
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---
# DataFu Pig
Apache DataFu Pig is a collection of user-defined functions for working with large-scale data in [Apache Pig](http://pig.apache.org/). It provides a number of useful functions:
<div class="row">
<div class="col-lg-6">
<h4>Statistics</h4>
<p>
Compute quantiles, median, variance, Wilson binary confidence intervals, etc.
</p>
<h4>Set Operations</h4>
<p>
Perform set intersection, union, or difference of bags.
</p>
<h4>Bags</h4>
<p>
Convenience functions for working with bags, such as enumerating items,
appending, prepending, concatenating, grouping, and taking distinct items.
</p>
<h4>Sessions</h4>
<p>
Sessionize events from a stream of data.
</p>
</div>
<div class="col-lg-6">
<h4>Estimation</h4>
<p>
Streaming implementations that can estimate
quantiles and median.
</p>
<h4>Sampling</h4>
<p>
Simple random sampling with or without replacement,
weighted sampling.
</p>
<h4>Link Analysis</h4>
<p>
Run PageRank on a graph represented by a bag of
nodes and edges.
</p>
<h4>More</h4>
<p>
Other useful methods like Assert and Coalesce.
</p>
</div>
</div>
If you'd like to read more about these functions, check out the [Guide](/docs/datafu/guide.html). Otherwise, if you are
ready to start using DataFu Pig, keep reading.
The rest of this page assumes you already have a built JAR available. If this is not the case, please see the [Download](/docs/download.html) page.
## Basic Example: Computing Median
Let's use DataFu Pig to perform a very basic task: computing the median of some data.
Suppose we have a file `input` in Hadoop with the following content:

    1
    2
    3
    2
    2
    2
    3
    2
    2
    1

Sorting these ten values shows that the median of this data set is 2. First we'll start Pig's Grunt shell by running `pig`, and
then register the DataFu JAR:
```pig
register datafu-pig-<%= current_page.data.version %>.jar
```
To compute the median we'll use DataFu's `StreamingMedian`, which computes an estimate of the median but has the benefit
of not requiring the data to be sorted:
```pig
DEFINE Median datafu.pig.stats.StreamingMedian();
```
Next we can load the data and pass it into the function to compute the median:
```pig
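-- Load one integer per line from the file 'input'.
data = LOAD 'input' USING PigStorage() AS (val:int);
-- Group every record into a single bag and pass that bag to the UDF.
data = FOREACH (GROUP data ALL) GENERATE Median(data);
DUMP data;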
```
This produces the expected output:

    ((2.0))

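
For comparison, an exact median could be computed along the following lines. This is only a sketch: it assumes the DataFu JAR also provides `datafu.pig.stats.Median` and that this function expects its input bag to be sorted, neither of which is stated on this page, so check the [Guide](/docs/datafu/guide.html) before relying on it. The aliases `ExactMedian`, `grouped`, `sorted`, and `exact` are illustrative names.

```pig
-- Assumption: datafu.pig.stats.Median is the exact counterpart of StreamingMedian
-- and requires a sorted input bag.
DEFINE ExactMedian datafu.pig.stats.Median();

data = LOAD 'input' USING PigStorage() AS (val:int);
grouped = GROUP data ALL;

-- Sort the bag inside the nested FOREACH before passing it to the UDF.
exact = FOREACH grouped {
  sorted = ORDER data BY val;
  GENERATE ExactMedian(sorted);
};

DUMP exact;
```

The extra `ORDER` step is the cost that `StreamingMedian` avoids. On large inputs the streaming estimate is usually the cheaper choice, while the sorted variant is worth considering when an exact value is required.
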
## Next Steps
Check out the [Guide](/docs/datafu/guide.html) for more information on what you can do with DataFu Pig.