| --- |
| title: Getting Started - Apache DataFu Pig |
| version: 1.3.1 |
| section_name: Apache DataFu Pig |
| license: > |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
| # Getting Started |
| |
| Apache DataFu Pig is a collection of user-defined functions for working with large-scale data in [Apache Pig](http://pig.apache.org/). Its functions cover the following areas: |
| |
| <div class="row"> |
| <div class="col-lg-6"> |
| <h4>Statistics</h4> |
| <p> |
| Compute quantiles, median, variance, Wilson binary confidence intervals, etc. |
| </p> |
| |
| <h4>Set Operations</h4> |
| <p> |
| Perform set intersection, union, or difference of bags. |
| </p> |
| |
| <h4>Bags</h4> |
| <p> |
| Convenient functions for working with bags, such as enumerating items, |
| append, prepend, concat, group, and distinct. |
| </p> |
| |
| <h4>Sessions</h4> |
| <p> |
| Sessionize events from a stream of data. |
| </p> |
| </div> |
| |
| <div class="col-lg-6"> |
| <h4>Estimation</h4> |
| <p> |
| Streaming implementations that can estimate |
| quantiles, median, and cardinality. |
| </p> |
| |
| <h4>Sampling</h4> |
| <p> |
| Simple random sampling with or without replacement, |
| and weighted sampling. |
| </p> |
| |
| <h4>Link Analysis</h4> |
| <p> |
| Run PageRank on a graph represented by a bag of |
| nodes and edges. |
| </p> |
| |
| <h4>More</h4> |
| <p> |
| Other useful functions such as Assert and Coalesce. |
| </p> |
| </div> |
| </div> |
| |
| If you'd like to read more about these functions, check out the [Guide](/docs/datafu/guide.html). Otherwise, if you are |
| ready to start using DataFu Pig, keep reading. |
| |
| The rest of this page assumes you already have a built JAR available. If this is not the case, please see [Quick Start](/docs/quick-start.html). |
| |
| ## Basic Example: Computing Median |
| |
| Let's use DataFu Pig to perform a very basic task: computing the median of some data. |
| Suppose we have a file `input` in Hadoop with the following content: |
| |
| 1 |
| 2 |
| 3 |
| 2 |
| 2 |
| 2 |
| 3 |
| 2 |
| 2 |
| 1 |
| |
| Sorted, the ten values are 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, so the median of this data set is 2. First, we'll start Pig's Grunt shell by running `pig` and |
| then register the DataFu JAR: |
| |
| ```pig |
| register datafu-pig-incubating-<%= current_page.data.version %>.jar |
| ``` |
| |
| To compute the median we'll use DataFu's `StreamingMedian`, which computes an estimate of the median but has the benefit |
| of not requiring the data to be sorted: |
| |
| ```pig |
| DEFINE Median datafu.pig.stats.StreamingMedian(); |
| ``` |
| |
| Next we can load the data and pass it into the function to compute the median: |
| |
| ```pig |
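| -- Load one integer per line from the 'input' file. |
| data = LOAD 'input' using PigStorage() as (val:int); |
| -- Group every row into a single bag and estimate its median. |
| data = FOREACH (GROUP data ALL) GENERATE Median(data); |
| DUMP data; |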
| ``` |
| |
| This produces the expected output: |
| |
| ((2.0)) |
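| |
| For comparison, when an exact median is needed rather than an estimate, DataFu also provides `datafu.pig.stats.Median`, which expects its input bag to be sorted. The following is a minimal sketch and not part of the original walkthrough. It assumes the same `input` file and an already registered DataFu JAR, and the aliases `ExactMedian`, `vals`, `grouped`, `sorted`, and `exact` are illustrative names: |
| |
| ```pig |
| -- Sketch: exact median via datafu.pig.stats.Median (input bag must be sorted). |
| DEFINE ExactMedian datafu.pig.stats.Median(); |
| |
| vals = LOAD 'input' using PigStorage() as (val:int); |
| grouped = GROUP vals ALL; |
| |
| -- Sort the bag inside the nested FOREACH before passing it to the UDF. |
| exact = FOREACH grouped { |
| sorted = ORDER vals BY val; |
| GENERATE ExactMedian(sorted.val); |
| } |
| DUMP exact; |
| ``` |
| |
| The trade-off is that sorting every value within the group can be costly on large inputs, which is the cost `StreamingMedian` avoids by returning an estimate. |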
| |
| ## Next Steps |
| |
| Check out the [Guide](/docs/datafu/guide.html) for more information on what you can do with DataFu Pig. |