blob: 6b29d4fd550794ec7fcb66490440801c2ecf45f4 [file] [log] [blame] [view]
---
title: "Apache Hive : OuterJoinBehavior"
date: 2024-12-12
---
# Apache Hive : OuterJoinBehavior
# Hive Outer Join Behavior
{{< toc >}}
This document is based on a writeup of [DB2 Outer Join Behavior](http://www.ibm.com/developerworks/data/library/techarticle/purcell/0112purcell.html). The original HTML can be found [here](/attachments/OuterJoinBehavior.html).
## Definitions
| | |
| --- | --- |
| **Preserved Row table** | The table in an Outer Join that must return all rows. For left outer joins this is the *Left* table, for right outer joins it is the *Right* table, and for full outer joins both tables are *Preserved Row* tables. |
| **Null Supplying table** | This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also *Null Supplying* tables. |
| **During Join predicate** | A predicate that is in the JOIN **ON** clause. For example, in '`R1 join R2 on R1.x = 5`' the predicate '`R1.x = 5`' is a *During Join predicate*. |
| **After Join predicate** | A predicate that is in the WHERE clause. |
## Predicate Pushdown Rules
The logic can be summarized by these two rules:
1. **During Join predicates** cannot be pushed past **Preserved Row tables**.
2. **After Join predicates** cannot be pushed past **Null Supplying tables**.
This captured in the following table:
|   | Preserved Row Table | Null Supplying Table |
| --- | --- | --- |
| Join Predicate | Case J1: Not Pushed | Case J2: Pushed |
| Where Predicate | Case W1: Pushed | Case W2: Not Pushed |
See [Examples]({{< ref "#examples" >}}) below for illustrations of cases J1, J2, W1, and W2.
### Hive Implementation
Hive enforces the rules by these methods in the SemanticAnalyzer and JoinPPD classes:
**Rule 1:** During **QBJoinTree** construction in Plan Gen, the `parseJoinCondition()` logic applies this rule.
**Rule 2:** During **JoinPPD** (Join Predicate PushDown) the `getQualifiedAliases()` logic applies this rule.
## Examples
Given Src(Key String, Value String) the following Left Outer Join examples show that Hive has the correct behavior.
### Case J1: Join Predicate on Preserved Row Table
```
explain
select s1.key, s2.key
from src s1 left join src s2 on s1.key > '2';
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
filter predicates:
0 {(VALUE._col0 > '2')}
1
handleSkewJoin: false
outputColumnNames: _col0, _col4
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
```
### Case J2: Join Predicate on Null Supplying Table
```
explain
select s1.key, s2.key
from src s1 left join src s2 on s2.key > '2';
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Filter Operator
predicate:
expr: (key > '2')
type: boolean
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col4
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
```
### Case W1: Where Predicate on Preserved Row Table
```
explain
select s1.key, s2.key
from src s1 left join src s2
where s1.key > '2';
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Filter Operator
predicate:
expr: (key > '2')
type: boolean
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col4
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
```
### Case W2: Where Predicate on Null Supplying Table
```
explain
select s1.key, s2.key
from src s1 left join src s2
where s2.key > '2';
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col4
Filter Operator
predicate:
expr: (_col4 > '2')
type: boolean
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
```