<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- toc -->
# Area Under the ROC Curve
[ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and Area Under the ROC Curve (AUC) are widely-used metrics for binary (i.e., positive or negative) classification problems such as [Logistic Regression](../binaryclass/a9a_lr.html).

Binary classifiers generally predict how likely a sample is to be positive by computing a probability. We can then evaluate a classifier by comparing its predicted probabilities with the true positive/negative labels.
Now, assume that we have the following table of predicted scores (i.e., probabilities) and true labels:

| probability<br/>(predicted score) | true label |
|:---:|:---:|
| 0.5 | 0 |
| 0.3 | 1 |
| 0.2 | 0 |
| 0.8 | 1 |
| 0.7 | 1 |

Once the rows are sorted by probability in descending order, AUC measures how many positive (`label=1`) samples are ranked higher than negative (`label=0`) samples. If many positive rows receive larger scores than negative rows, the AUC is large and, hence, the classifier performs well.
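Equivalently, AUC is the fraction of all (positive, negative) sample pairs in which the positive sample receives the higher score. The table above has 3 positive and 2 negative samples, i.e., 3 × 2 = 6 pairs. Five pairs are ranked correctly (`0.8 > 0.5`, `0.8 > 0.2`, `0.7 > 0.5`, `0.7 > 0.2` and `0.3 > 0.2`) while one is not (`0.3 < 0.5`), so the AUC is 5/6 ≈ `0.83333`.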
# Compute AUC on Hivemall
In Hivemall, the function `auc(double score, int label)` computes AUC for pairs of predicted score and true label.
## Sequential AUC computation on a single node
For instance, the following query computes the AUC of the table shown above:
```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  ORDER BY prob DESC
) t;
```
This query returns `0.83333` as the AUC.

Since AUC is computed from ranked probability-label pairs as mentioned above, the input rows need to be ordered by score in descending order.
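The same pattern applies to real prediction results. As a minimal sketch, assuming a hypothetical table `prediction` that stores one predicted score and true label per sample:

```sql
-- `prediction` is a hypothetical table holding one predicted
-- probability (prob) and one true label (label) per sample
select
  auc(prob, label) as auc
from (
  select prob, label
  from prediction
  ORDER BY prob DESC -- auc requires score-descending input
) t;
```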
## Parallel approximate AUC computation
Meanwhile, Hive's `DISTRIBUTE BY` clause allows you to compute an approximate AUC in parallel:
```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  DISTRIBUTE BY floor(prob / 0.2)
  SORT BY prob DESC
) t;
```
Note that `DISTRIBUTE BY floor(prob / 0.2)` splits the rows into five bins for the AUC computation, because the column `prob` takes values in the [0, 1] range.
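The bin width is a tunable knob. As a hypothetical variant of the query above, halving the width to `floor(prob / 0.1)` would split the [0, 1] range into roughly ten bins, allowing more workers to sort in parallel:

```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  -- hypothetical variant: split the [0, 1] range into
  -- roughly ten bins by halving the bin width
  DISTRIBUTE BY floor(prob / 0.1)
  SORT BY prob DESC
) t;
```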
# Difference between AUC and Logarithmic Loss
Hivemall provides another metric for binary classification called [Logarithmic Loss](regression.html#logarithmic-loss). Both AUC and Logarithmic Loss compute scores for probability-label pairs.

The score produced by AUC is a relative metric based on the sorted pairs. Logarithmic Loss, by contrast, compares each probability with its true label one by one.

For example, `auc(prob, label)` and `logloss(prob, label)` return `0.83333` and `0.54001` respectively for the data above. Note that a larger AUC and a smaller Logarithmic Loss indicate a better classifier.
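Both metrics can be computed in a single query over the same input. A minimal sketch, assuming `logloss` accepts the same `(score, label)` arguments as in the call above (its result does not depend on row order, so the `ORDER BY` required by `auc` is harmless):

```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc,        -- expected: 0.83333
  logloss(prob, label) as logloss -- expected: 0.54001
from (
  select prob, label
  from data
  ORDER BY prob DESC
) t;
```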