blob: 2355fbf1fe102d78c8d33f36e7e9a6e59230d709 [file] [log] [blame]
---
title: More Tips and Tricks - Guide - Apache DataFu Pig
version: 1.4.0
section_name: Apache DataFu Pig - Guide
license: >
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---
## More Tips and Tricks
### Coalesce
Using ternary operators is fairly common in Pig. For example, often you want to replace null
values with zero:
```pig
data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;
```
Or, sometimes you want to return the first non-null value among several fields:
```pig
data = FOREACH data GENERATE (val1 IS NOT NULL ? val1 :
(val2 IS NOT NULL ? val2 :
(val3 IS NOT NULL ? val3 :
NULL))) as result;
```
The above code is very hard to follow, and it is very cumersome to write. To solve this problem,
Apache DataFu provides the useful
[Coalesce](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/util/Coalesce.html). This is
very similar to the [COALESCE](http://msdn.microsoft.com/en-us/library/ms190349.aspx) available
in some SQL implementations. It simply takes the first non-null value from the arguments passed in.
With `Coalesce` we can clean up the code above.
To replace any null value with 0:
```pig
DEFINE Coalesce datafu.pig.util.Coalesce();
data = FOREACH data GENERATE Coalesce(val,0) as result;
```
To return the first non-null value:
```pig
data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;
```
### Left Joining Multiple Relations
Suppose we have three data sets:
```pig
input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);
input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);
input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);
```
Let's say we want to left join `input1` with `input2` and `input3`. You can do this in
SQL. Unfortunately Pig does not support outer joins on more than two relations.
```pig
-- DOES NOT WORK
joined = JOIN input1 BY key LEFT,
input2 BY key, input3 BY key;
```
Instead, you have to join twice, which means two MapReduce jobs:
```pig
data1 = JOIN input1 BY key LEFT, input2 BY key;
data2 = JOIN data1 BY input1::key LEFT, input3 BY key;
```
This is unfortunate, as left joins are very common, and for some applications it is common
to need to left join multiple relations. Take a recommendation system for example: you
start with a set of candidates to score and you join in multiple sets of features. Each
set of features requires another join.
You can, however, perform a left join effectively by using `COGROUP` with multiple relations
and applying clever use of `FLATTEN`. Then only a single MapReduce job is required.
But this gets pretty messy:
```pig
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
FLATTEN(input1), -- left join on this
FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int),null)) : input2))
AS (input2::key,input2::val),
FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int),null)) : input3))
AS (input3::key,input3::val);
```
As messy as this looks, it does work. It relies on the fact that flattening an empty bag produces
no output. Therefore any records not appearing in `input` will be removed. Since we don't want
the lack of records in `input2` or `input3` to cause records to be removed, we replace the empty
bag with a bag having just a single tuple with null values. Using these tricks we are able to
simulate a left join on multiple relations.
To clean up this code, DataFu provides the
[EmptyBagToNullFields](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/bags/EmptyBagToNullFields.html)
UDF. This performs the same logic above and makes the code much easier to write and understand:
```pig
DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();
data = FOREACH (COGROUP input1 BY key, input2 BY key, input3 BY key) GENERATE
FLATTEN(input1), -- left join on this
FLATTEN(EmptyBagToNullFields(input2)),
FLATTEN(EmptyBagToNullFields(input3));
```
While you're at it, why not create a macro:
```pig
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined {
cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
$joined = FOREACH cogrouped GENERATE
FLATTEN($relation1),
FLATTEN(EmptyBagToNullFields($relation2)),
FLATTEN(EmptyBagToNullFields($relation3));
}
```
Now you can simply apply your left join macro:
```pig
features = left_outer_join(input1, val1, input2, val2, input3, val3);
```