---
title: Guide - Apache DataFu Pig
version: 1.4.0
section_name: Apache DataFu Pig
license: >
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at
  http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---
# Guide
Apache DataFu Pig is a collection of user-defined functions (UDFs) for working with large-scale data in [Apache Pig](https://pig.apache.org/).
This guide provides examples of how to use these functions and serves as an overview for working with the library.
* [Statistics](/docs/datafu/guide/statistics.html): median, quantiles, variance
* [Bag Operations](/docs/datafu/guide/bag-operations.html): join, prepend, append, count items, concat
* [Set Operations](/docs/datafu/guide/set-operations.html): set intersection, union, difference
* [Sessions](/docs/datafu/guide/sessions.html): sessionize streams of data
* [Sampling](/docs/datafu/guide/sampling.html): simple random sample with/without replacement, weighted sample
* [Hashing](/docs/datafu/guide/hashing.html): SHA and MD5
* [Link Analysis](/docs/datafu/guide/link-analysis.html): PageRank
* [More Tips and Tricks](/docs/datafu/guide/more-tips-and-tricks.html)
There are also [Javadocs](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/) available for all UDFs in the library. We continue to add
UDFs to the library. If you are interested in helping out, please follow the [Contributing](/community/contributing.html)
guide.
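
As a quick illustration of the kind of usage the pages above cover, here is a minimal sketch that computes a median with the `StreamingMedian` UDF from the `datafu.pig.stats` package. The jar file name, the input path `input`, and the field name `val` are assumptions made for this example; adjust them to match your build and data.

```pig
-- Assumption: the DataFu jar is in the working directory under this name.
REGISTER datafu-pig-1.4.0.jar;

-- Bind a short alias to the UDF class.
DEFINE Median datafu.pig.stats.StreamingMedian();

-- Hypothetical input: a file named 'input' with one integer per line.
data = LOAD 'input' USING PigStorage() AS (val:int);

-- Group all rows into a single bag and estimate the median over it.
medians = FOREACH (GROUP data ALL) GENERATE Median(data.val);

DUMP medians;
```

`StreamingMedian` estimates the median without requiring sorted input. The [Statistics](/docs/datafu/guide/statistics.html) page covers the median and quantile functions in more detail.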
## Pig Compatibility
The current version of DataFu has been tested against Pig 0.14.0. DataFu should be compatible with some older versions of Pig; however, we do not test against prior versions and do not guarantee compatibility.
Our policy is to test each release against the most recent version of Pig and to make sure DataFu works with that version.
## Blog Posts
* [Introducing DataFu](/blog/2012/01/10/introducing-datafu.html)
* [DataFu: The WD-40 of Big Data](/blog/2013/01/24/datafu-the-wd-40-of-big-data.html)
* [DataFu 1.0](/blog/2013/09/04/datafu-1-0.html)
## Slides
* [A Brief Tour of DataFu](http://www.slideshare.net/matthewterencehayes/datafu)
* [Building Data Products at LinkedIn with DataFu](http://www.slideshare.net/matthewterencehayes/building-data-products-at-linkedin-with-datafu)
* [DataFu @ ApacheCon 2014](http://www.slideshare.net/williamgvaughan/datafu-apachecon-33420740)
## Videos
* [Introduction to Apache DataFu @ ApacheCon North America 2014](http://www.youtube.com/watch?v=JWI9tVsQ1cY)