---
title: Evaluation Explained (Recommendation)
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
A PredictionIO engine is instantiated by a set of parameters. These parameters
determine which algorithm is used as well as the parameters for that algorithm.
This naturally raises the question of how to choose the best set of parameters. The
evaluation module streamlines the process of tuning the engine to the best
parameter set and deploying it.
## Evaluation Quick Start
We assume you have run the [Recommendation Quick Start](/templates/recommendation/quickstart/)
and will skip the data collection / import instructions.
### Edit the AppName
Edit MyRecommendation/src/main/scala/***Evaluation.scala*** to specify the
*appName* you used to import the data.
```scala
object EngineParamsList extends EngineParamsGenerator {
  private[this] val baseEP = EngineParams(
    dataSourceParams = DataSourceParams(
      appName = "MyApp1",
      ...
    )
  ...
}
```
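For context, the full generator then enumerates the engine params to be evaluated, typically by copying the base params with different algorithm parameters. The following is only a hedged sketch of that pattern; the exact `ALSAlgorithmParams` values in your template's ***Evaluation.scala*** may differ.
```scala
object EngineParamsList extends EngineParamsGenerator {
  // Base engine params: point the data source at your app and enable evaluation params.
  private[this] val baseEP = EngineParams(
    dataSourceParams = DataSourceParams(
      appName = "MyApp1",
      evalParams = Some(DataSourceEvalParams(kFold = 5, queryNum = 10))))

  // Candidate engine params: copies of the base params that differ only in the
  // ALS algorithm parameters (rank, numIterations, lambda, seed). Values are illustrative.
  engineParamsList = Seq(
    baseEP.copy(algorithmParamsList = Seq(("als", ALSAlgorithmParams(10, 20, 0.01, Some(3))))),
    baseEP.copy(algorithmParamsList = Seq(("als", ALSAlgorithmParams(10, 40, 0.01, Some(3))))),
    baseEP.copy(algorithmParamsList = Seq(("als", ALSAlgorithmParams(20, 40, 0.01, Some(3))))))
}
```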
### Build and run the evaluation
To run evaluation, the command `pio eval` is used. It takes two
mandatory parameters:

1. the `Evaluation` object, which tells PredictionIO the engine and metric we use
for the evaluation; and
2. the `EngineParamsGenerator`, which contains the list of engine params to test
against.

The following command kickstarts the evaluation
workflow for the recommendation template (replace "org.template" with your package).
```
$ pio build
...
$ pio eval org.template.RecommendationEvaluation \
org.template.EngineParamsList
```
You will see the following output:
```
...
[INFO 2015-03-31 00:31:53,934] [CoreWorkflow$] runEvaluation started
...
[INFO 2015-03-31 00:35:56,782] [CoreWorkflow$] Updating evaluation instance with result: MetricEvaluatorResult:
# engine params evaluated: 3
Optimal Engine Params:
{
  "dataSourceParams":{
    "":{
      "appName":"MyApp1",
      "evalParams":{
        "kFold":5,
        "queryNum":10
      }
    }
  },
  "preparatorParams":{
    "":{
    }
  },
  "algorithmParamsList":[
    {
      "als":{
        "rank":10,
        "numIterations":40,
        "lambda":0.01,
        "seed":3
      }
    }
  ],
  "servingParams":{
    "":{
    }
  }
}
Metrics:
Precision@K (k=10, threshold=4.0): 0.15205820105820103
PositiveCount (threshold=4.0): 5.753333333333333
Precision@K (k=10, threshold=2.0): 0.1542777777777778
PositiveCount (threshold=2.0): 6.833333333333333
Precision@K (k=10, threshold=1.0): 0.15068518518518517
PositiveCount (threshold=1.0): 10.006666666666666
[INFO 2015-03-31 00:36:01,516] [CoreWorkflow$] runEvaluation completed
```
The console prints out the evaluation metric score of each set of engine params, and finally
pretty prints the optimal engine params. Amongst the 3 sets of engine params we evaluated,
the best Precision@K has a score of ~0.1521.
## The Evaluation Design
We assume you have read the [Tuning and Evaluation](/evaluation) section. We
will cover the evaluation aspects which are specific to the recommendation
engine.
In recommendation evaluation, the raw data is a sequence of known ratings. A
rating has 3 components: user, item, and score. We use the $k$-fold method for
evaluation: the raw data is sliced into a sequence of (training, validation)
data tuples.
In the validation data, we construct a query for *each user*, and get a list of
recommended items from the engine. This is vastly different from the
classification tutorial, where there is a one-to-one correspondence between
training data points and validation data points. In this evaluation,
our unit of evaluation is the *user*: we evaluate the quality of an engine
using the known ratings of a user.
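To make this concrete, each unit of evaluation pairs a query for one user with that user's known ratings. Below is a minimal sketch, using the template's `Query`, `Rating`, and `ActualResult` case classes with hypothetical users, items, and ratings.
```scala
// One evaluation unit for user "u1" (hypothetical data):
// ask the engine for 10 recommendations, then compare them against
// the ratings we already know for "u1" in the validation fold.
val query = Query(user = "u1", num = 10)

val actual = ActualResult(ratings = Array(
  Rating("u1", "i3", 4.0),  // "u1" gave (or is treated as giving) item "i3" a 4.0
  Rating("u1", "i7", 2.0))) // "u1" rated item "i7" with 2.0
```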
### Key assumptions
There are multiple assumptions we have to make when we evaluate a
recommendation engine:

- Definition of 'good'. To quantify whether the engine is able to recommend
  items which the user likes, we need to define what is meant by 'good'. In this
  example, we have two kinds of events: 'rate' and 'buy'. A 'rate' event is
  associated with a rating value which ranges from 1 to 4, and a 'buy'
  event is mapped to a rating of 4. When we
  implement the metric, we have to specify a rating threshold: only ratings
  at or above the threshold are considered 'good' (see the sketch after this list).
- The absence of complete ratings. It is extremely unlikely that the training
  data contains ratings for all user-item tuples. For example, in a system containing
  1000 items, a user may have rated only 20 of them, leaving 980 items unrated. There
  is no way for us to tell with certainty whether the user likes an unrated product.
  When we examine the evaluation result, it is important to keep in mind
  that the final metric is only an approximation of the actual result.
- Recommendation affects user behavior. Suppose you are an e-commerce company and
  would like to use the recommendation engine to personalize the landing page;
  the items you show on the landing page directly impact what the user is going to
  purchase. This is different from weather prediction: whatever the weather
  forecast engine predicts, tomorrow's weather won't be affected. Therefore, when
  we conduct offline evaluation for recommendation engines, it is possible that
  the actual user behavior is dramatically different from the evaluation result.
  However, for simplicity, in the evaluation we have to assume that user
  behavior is homogeneous.
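The rating convention described in the first assumption can be sketched as follows; this is a minimal, hypothetical illustration, not the template's code (the actual event-to-rating conversion lives in `DataSource.scala`, shown later).
```scala
// Hypothetical helpers illustrating the convention described above:
// 'rate' events keep their 1-4 rating, 'buy' events count as a rating of 4.0,
// and a rating is 'good' only if it clears the metric's rating threshold.
def toRating(eventName: String, rateValue: Double): Double = eventName match {
  case "rate" => rateValue   // use the rating attached to the 'rate' event (1 to 4)
  case "buy"  => 4.0         // treat a purchase as the highest rating
  case other  => sys.error(s"Unexpected event $other")
}

def isGood(rating: Double, ratingThreshold: Double): Boolean = rating >= ratingThreshold
```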
## Evaluation Data Generation
### Actual Result
In MyRecommendation/src/main/scala/***Engine.scala***,
we define `ActualResult`, which represents the user's ratings used for validation.
It stores the list of ratings in the validation set for a user.
```scala
case class ActualResult(
  ratings: Array[Rating]
)
```
### Implement Data Generation Method in DataSource
In MyRecommendation/src/main/scala/***DataSource.scala***,
the `readEval` method reads, and selects, data from the datastore
and returns a sequence of (training, validation) data.
```scala
case class DataSourceEvalParams(kFold: Int, queryNum: Int)

case class DataSourceParams(
  appName: String,
  evalParams: Option[DataSourceEvalParams]) extends Params

class DataSource(val dsp: DataSourceParams)
  extends PDataSource[TrainingData,
      EmptyEvaluationInfo, Query, ActualResult] {

  @transient lazy val logger = Logger[this.type]

  def getRatings(sc: SparkContext): RDD[Rating] = {

    val eventsRDD: RDD[Event] = PEventStore.find(
      appName = dsp.appName,
      entityType = Some("user"),
      eventNames = Some(List("rate", "buy")), // read "rate" and "buy" event
      // targetEntityType is optional field of an event.
      targetEntityType = Some(Some("item")))(sc)

    val ratingsRDD: RDD[Rating] = eventsRDD.map { event =>
      val rating = try {
        val ratingValue: Double = event.event match {
          case "rate" => event.properties.get[Double]("rating")
          case "buy" => 4.0 // map buy event to rating value of 4
          case _ => throw new Exception(s"Unexpected event ${event} is read.")
        }
        // entityId and targetEntityId is String
        Rating(event.entityId,
          event.targetEntityId.get,
          ratingValue)
      } catch {
        case e: Exception => {
          logger.error(s"Cannot convert ${event} to Rating. Exception: ${e}.")
          throw e
        }
      }
      rating
    }.cache()

    ratingsRDD
  }

  ...

  override
  def readEval(sc: SparkContext)
  : Seq[(TrainingData, EmptyEvaluationInfo, RDD[(Query, ActualResult)])] = {
    require(!dsp.evalParams.isEmpty, "Must specify evalParams")
    val evalParams = dsp.evalParams.get

    val kFold = evalParams.kFold
    val ratings: RDD[(Rating, Long)] = getRatings(sc).zipWithUniqueId

    (0 until kFold).map { idx => {
      val trainingRatings = ratings.filter(_._2 % kFold != idx).map(_._1)
      val testingRatings = ratings.filter(_._2 % kFold == idx).map(_._1)

      val testingUsers: RDD[(String, Iterable[Rating])] = testingRatings.groupBy(_.user)

      (new TrainingData(trainingRatings),
        new EmptyEvaluationInfo(),
        testingUsers.map {
          case (user, ratings) => (Query(user, evalParams.queryNum), ActualResult(ratings.toArray))
        }
      )
    }}
  }
}
```
The evaluation data generation is controlled by two parameters
in `DataSourceEvalParams`. The first parameter `kFold` is the number of
folds we use for evaluation; the second parameter `queryNum` is used for
query construction.
The `getRatings` method is factored out of the `readTraining` method, as
both serve the same function of reading from the data source and
converting a user-item action into a rating.
The `readEval` method is a k-fold evaluation implementation.
We annotate each rating in the raw data with a unique index (via `zipWithUniqueId`); then,
in each fold, a rating goes to either the training or the testing set
based on the modulus of that index.
We group the testing ratings by user, and one query is constructed *for each user*.
## Evaluation Metrics
In the [evaluation and tuning tutorial](/evaluation/), we use ***Metric*** to
compute the quality of an engine variant.
However, in actual use cases like recommendation, as we have made many
assumptions in our model, using a single metric may lead to a biased evaluation.
We will discuss using multiple
***Metrics*** to generate a comprehensive evaluation, to generate a more global view
of the engine.
### Precision@K
Precision@K measures the proportion of *relevant* items amongst the first *k* items.
A recommendation engine usually wants to make sure the top few items recommended
are appealing to the user. Think of a Google search: we usually give up after
looking at the first and second result pages.
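As a rough formulation (assuming the engine returns at least $k$ items), Precision@K for a single user is

$$\text{Precision@K} = \frac{|\{\text{relevant items}\} \cap \{\text{top } k \text{ recommended items}\}|}{k}$$

and the reported score is, roughly, this value averaged over all evaluated users (users with no relevant items are excluded, since precision is undefined for them).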
### Precision@K Parameters
There are two questions associated with it.

1. How do we define *relevant*?
2. What is a good value of *k*?

Before we answer these questions, we need to understand what constitutes a good metric.
It is like an exam: if everyone gets full marks, the exam fails in its goal to
determine what the candidates don't know; if everyone fails, the exam fails in its goal
to determine what the candidates know.
A good metric should be able to distinguish the good from the bad.
One way to define relevance is to use the notion of a rating threshold: if the user
rating for an item is higher than a certain threshold, we say it is relevant.
However, without looking at the data, it is hard to pick a reasonable threshold.
We could set the threshold as high as the maximum rating of 4.0, but that may
severely limit the size of the relevant set, and the precision scores will be close to
zero or undefined (precision is undefined if there is no relevant data).
On the other hand, we could set the threshold as low as the minimum rating, but
that makes the precision metric equally uninformative, since all scores will be close
to 1.
A similar argument applies to picking a good value of *k*.
A method for choosing good parameters is *not* to choose one, but instead to test
out *a whole spectrum of parameters*. If an engine variant is good, it should
perform robustly well across different metric parameters.
The evaluation module supports multiple metrics. The following code
snippet demonstrates a sample usage.
```scala
object ComprehensiveRecommendationEvaluation extends Evaluation {
  val ratingThresholds = Seq(0.0, 2.0, 4.0)
  val ks = Seq(1, 3, 10)

  engineEvaluator = (
    RecommendationEngine(),
    MetricEvaluator(
      metric = PrecisionAtK(k = 3, ratingThreshold = 2.0),
      otherMetrics = (
        (for (r <- ratingThresholds) yield PositiveCount(ratingThreshold = r)) ++
        (for (r <- ratingThresholds; k <- ks) yield PrecisionAtK(k = k, ratingThreshold = r))
      )))
}
```
We have two types of `Metric`s:

- `PositiveCount` is a helper metric that returns the average
  number of positive samples for a specific rating threshold, so we get some
  idea about the *demographic* of the data. If `PositiveCount` is too low or too
  high for a certain threshold, we know that threshold should not be used.
  We have three thresholds (`ratingThresholds`), and one
  `PositiveCount` metric is instantiated for each threshold.
- `PrecisionAtK` is the actual metric we use.
  We have two lists of parameters: `ratingThresholds` defines what rating is good,
  and `ks` defines how many items we evaluate in the `PredictedResult`.
  We generate a `PrecisionAtK` metric for every combination of the two.
  These metrics are specified as `otherMetrics`; they
  will be calculated and displayed on the evaluation UI.
To run this evaluation, you can:
```
$ pio eval org.template.ComprehensiveRecommendationEvaluation \
org.template.EngineParamsList
```