| --- |
| title: Set Operations - Guide - Apache DataFu Pig |
| version: 1.3.1 |
| section_name: Apache DataFu Pig - Guide |
| license: > |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
| ## Set Operations |
| |
| Apache DataFu has several methods for performing [set operations](http://en.wikipedia.org/wiki/Set_%28mathematics%29) on bags. |
| |
| ### Set Intersection |
| |
| Compute the [set intersection](http://en.wikipedia.org/wiki/Set_%28mathematics%29#Intersections) with [SetIntersect](/docs/datafu/<%= current_page.data.version %>/datafu/pig/sets/SetIntersect.html): |
| |
| ```pig |
| define SetIntersect datafu.pig.sets.SetIntersect(); |
| |
| -- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)}) |
| input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)}); |
| |
| intersected = FOREACH input { |
| sorted_b1 = ORDER B1 by val; |
| sorted_b2 = ORDER B2 by val; |
| GENERATE SetIntersect(sorted_b1,sorted_b2); |
| } |
| |
| -- produces: ({(1),(4),(5)}) |
| DUMP intersected; |
| ``` |
| |
| ### Set Union |
| |
| Compute the [set union](http://en.wikipedia.org/wiki/Set_%28mathematics%29#Unions) with [SetUnion](/docs/datafu/<%= current_page.data.version %>/datafu/pig/sets/SetUnion.html): |
| |
| ```pig |
| define SetUnion datafu.pig.sets.SetUnion(); |
| |
| -- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)}) |
| input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)}); |
| |
| unioned = FOREACH input GENERATE SetUnion(B1,B2); |
| |
| -- produces: ({(3),(4),(1),(2),(7),(5),(6),(0),(10)}) |
| DUMP unioned; |
| ``` |
| |
| This can also operate on multiple bags: |
| |
| ```pig |
| intersected = FOREACH input GENERATE SetUnion(B1,B2,B3); |
| ``` |
| |
| ### Set Difference |
| |
| Compute the [set difference](http://en.wikipedia.org/wiki/Set_%28mathematics%29#Complements) with [SetDifference](/docs/datafu/<%= current_page.data.version %>/datafu/pig/sets/SetDifference.html): |
| |
| ```pig |
| define SetDifference datafu.pig.sets.SetDifference(); |
| |
| -- ({(3),(4),(1),(2),(7),(5),(6)},{(1),(3),(5),(12)}) |
| input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)}); |
| |
| differenced = FOREACH input { |
| -- input bags must be sorted |
| sorted_b1 = ORDER B1 by val; |
| sorted_b2 = ORDER B2 by val; |
| GENERATE SetDifference(sorted_b1,sorted_b2); |
| } |
| |
| -- produces: ({(2),(4),(6),(7)}) |
| DUMP differenced; |
| ``` |