<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- toc -->
# Area Under the ROC Curve
[ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and Area Under the ROC Curve (AUC) are widely-used metrics for binary (i.e., positive or negative) classification problems such as [Logistic Regression](../binaryclass/a9a_lr.html).

Binary classifiers generally predict how likely a sample is to be positive by computing a probability. We can then evaluate a classifier by comparing its predicted probabilities with the true positive/negative labels.
Now, assume that we have the following table of predicted scores (i.e., probabilities) and true labels:

| probability<br/>(predicted score) | true label |
|:---:|:---:|
| 0.5 | 0 |
| 0.3 | 1 |
| 0.2 | 0 |
| 0.8 | 1 |
| 0.7 | 1 |

Once the rows are sorted by probability in descending order, AUC measures how many positive (`label=1`) samples are ranked higher than negative (`label=0`) samples. If many positive rows receive larger scores than negative rows, the AUC is large and, hence, the classifier performs well.
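Equivalently, AUC is the fraction of all (positive, negative) sample pairs in which the positive sample receives the higher score. The table above has 3 positive and 2 negative samples, i.e., 3 × 2 = 6 pairs. Five pairs are ranked correctly (`0.8 > 0.5`, `0.8 > 0.2`, `0.7 > 0.5`, `0.7 > 0.2` and `0.3 > 0.2`) while one is not (`0.3 < 0.5`), so the AUC is 5/6 ≈ `0.83333`.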
# Compute AUC on Hivemall
In Hivemall, the function `auc(double score, int label)` computes AUC for pairs of predicted score and true label.
## Sequential AUC computation on a single node
For instance, the following query computes the AUC of the table shown above:
```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  ORDER BY prob DESC
) t;
```
This query returns `0.83333` as the AUC.

Since AUC is computed from ranked probability-label pairs as mentioned above, the input rows need to be ordered by score in descending order.
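The same pattern applies to real prediction results. As a minimal sketch, assuming a hypothetical table `prediction` that stores one predicted score and true label per sample:

```sql
-- `prediction` is a hypothetical table holding one predicted
-- probability (prob) and one true label (label) per sample
select
  auc(prob, label) as auc
from (
  select prob, label
  from prediction
  ORDER BY prob DESC -- auc requires score-descending input
) t;
```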
## Parallel approximate AUC computation
Meanwhile, Hive's `DISTRIBUTE BY` clause allows you to compute an approximate AUC in parallel:
```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  DISTRIBUTE BY floor(prob / 0.2)
  SORT BY prob DESC
) t;
```
Note that `DISTRIBUTE BY floor(prob / 0.2)` splits the rows into five bins for the AUC computation, because the column `prob` takes values in the [0, 1] range.
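The bin width is a tunable knob. As a hypothetical variant of the query above, halving the width to `floor(prob / 0.1)` would split the [0, 1] range into roughly ten bins, allowing more workers to sort in parallel:

```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  -- hypothetical variant: split the [0, 1] range into
  -- roughly ten bins by halving the bin width
  DISTRIBUTE BY floor(prob / 0.1)
  SORT BY prob DESC
) t;
```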
# Difference between AUC and Logarithmic Loss
Hivemall provides another metric for binary classification called [Logarithmic Loss](regression.html#logarithmic-loss). Both AUC and Logarithmic Loss compute scores for probability-label pairs.

The score produced by AUC is a relative metric based on the sorted pairs. Logarithmic Loss, by contrast, compares each probability with its true label one by one.

For example, `auc(prob, label)` and `logloss(prob, label)` return `0.83333` and `0.54001` respectively for the data above. Note that a larger AUC and a smaller Logarithmic Loss indicate a better classifier.
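Both metrics can be computed in a single query over the same input. A minimal sketch, assuming `logloss` accepts the same `(score, label)` arguments as in the call above (its result does not depend on row order, so the `ORDER BY` required by `auc` is harmless):

```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc,        -- expected: 0.83333
  logloss(prob, label) as logloss -- expected: 0.54001
from (
  select prob, label
  from data
  ORDER BY prob DESC
) t;
```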