This page highlights major changes in each version and upgrade tools.
This release adds Elasticsearch 6 support. See pull request for details. Consequently, you must reindex your data.
    $ curl -XGET 'http://localhost:9200/_cat/indices?v'
    health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   pio_event 6BAPz-DfQ2e9bICdVRr03g   5   1       1501            0    321.3kb        321.3kb
    yellow open   pio_meta  oxDMU1mGRn-vnXtAjmifSw   5   1          4            0     32.4kb         32.4kb
    $ curl -XGET "http://localhost:9200/pio_meta/_search" -d'
    {
      "aggs": {
        "typesAgg": {
          "terms": {
            "field": "_type",
            "size": 200
          }
        }
      },
      "size": 0
    }'
    {"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":4,"max_score":0.0,"hits":[]},"aggregations":{"typesAgg":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"accesskeys","doc_count":1},{"key":"apps","doc_count":1},{"key":"engine_instances","doc_count":1},{"key":"sequences","doc_count":1}]}}}
    $ curl -XGET "http://localhost:9200/pio_event/_search" -d'
    {
      "aggs": {
        "typesAgg": {
          "terms": {
            "field": "_type",
            "size": 200
          }
        }
      },
      "size": 0
    }'
    {"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":1501,"max_score":0.0,"hits":[]},"aggregations":{"typesAgg":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"1","doc_count":1501}]}}}
If you want to apply specific settings to each index, we recommend defining Index Templates.
For example,
    $ curl -H "Content-Type: application/json" -XPUT "http://localhost:9600/_template/pio_meta" -d'
    {
      "index_patterns": ["pio_meta_*"],
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }'
    $ curl -H "Content-Type: application/json" -XPUT "http://localhost:9600/_template/pio_event" -d'
    {
      "index_patterns": ["pio_event_*"],
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }'
Using the following conversion table, run a reindex for every index that you need to migrate to your new cluster.
Old Cluster | New Cluster |
---|---|
index: pio_meta type: accesskeys | index: pio_meta_accesskeys |
index: pio_meta type: apps | index: pio_meta_apps |
index: pio_meta type: channels | index: pio_meta_channels |
index: pio_meta type: engine_instances | index: pio_meta_engine_instances |
index: pio_meta type: evaluation_instances | index: pio_meta_evaluation_instances |
index: pio_meta type: sequences | index: pio_meta_sequences |
index: pio_event type: It depends on your use case. (e.g. 1 ) | index: pio_event_<old_type> (e.g. pio_event_1 ) |
For example,
    $ curl -H "Content-Type: application/json" -XPOST "http://localhost:9600/_reindex" -d'
    {
      "source": {
        "remote": {
          "host": "http://localhost:9200"
        },
        "index": "pio_meta",
        "type": "accesskeys"
      },
      "dest": {
        "index": "pio_meta_accesskeys"
      }
    }'
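The conversion table maps each old index/type pair to a new index named `<old index>_<old type>`. The following is a minimal sketch of generating one reindex request per `pio_meta` type, assuming the same hosts as the example above (old cluster on port 9200, new cluster on port 9600). The function name `reindex_body` is a hypothetical helper, not part of PredictionIO or Elasticsearch. The loop only prints the commands (a dry run) so they can be reviewed before running them.

```shell
#!/bin/sh
# Assumed hosts, taken from the example above; adjust to your clusters.
OLD_HOST="http://localhost:9200"
NEW_HOST="http://localhost:9600"

# Hypothetical helper: print the _reindex request body for one old index/type pair.
# $1 = old index (e.g. pio_meta), $2 = old type (e.g. apps)
reindex_body() {
  printf '{"source":{"remote":{"host":"%s"},"index":"%s","type":"%s"},"dest":{"index":"%s_%s"}}\n' \
    "$OLD_HOST" "$1" "$2" "$1" "$2"
}

# Dry run: print one curl command per pio_meta type listed in the conversion table.
for type in accesskeys apps channels engine_instances evaluation_instances sequences; do
  printf '%s\n' "curl -H 'Content-Type: application/json' -XPOST '$NEW_HOST/_reindex' -d '$(reindex_body pio_meta "$type")'"
done
```

Only the types that actually exist in your old cluster need to be migrated. The `pio_event` types depend on your data, so add them to the loop by hand, for example `reindex_body pio_event 1`.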
In 0.12.0, the Elasticsearch 5.x client has been reimplemented as a singleton. Engine templates that use the Elasticsearch 5.x StorageClient directly must be updated for compatibility. See the [pull request](https://github.com/apache/predictionio/pull/421) for details.
Starting from 0.11.0, PredictionIO no longer bundles any JDBC drivers in the binary assembly. If your setup is using a JDBC backend and you run into storage connection errors after an upgrade, please manually install the JDBC driver. If you use PostgreSQL, you can find instructions here.
The Spark dependency has been upgraded to version 1.3.0. All engines must be rebuilt against it in order to work.
Open and edit build.sbt of your engine, and look for these two lines:

    "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
    "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"

Change 1.2.0 to 1.3.0, and do a clean rebuild with pio build --clean. Your engine should now work with the latest Apache Spark.
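If you maintain several engines, the version change can be scripted. The following is a minimal sketch that assumes the two Spark dependency lines appear in build.sbt exactly as shown above. It works on a scratch copy created with `mktemp` (the `workdir` variable and the sample file are illustrative), and it rewrites the version only on lines that mention `org.apache.spark`, so other dependencies that happen to be at 1.2.0 are left alone.

```shell
#!/bin/sh
# Scratch copy standing in for your engine's build.sbt.
workdir="$(mktemp -d)"
cat > "$workdir/build.sbt" <<'EOF'
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided")
EOF

# Replace "1.2.0" with "1.3.0" only on the org.apache.spark lines.
# Writing to a new file and moving it avoids the non-portable `sed -i`.
sed '/"org\.apache\.spark"/ s/"1\.2\.0"/"1.3.0"/' "$workdir/build.sbt" > "$workdir/build.sbt.new"
mv "$workdir/build.sbt.new" "$workdir/build.sbt"

cat "$workdir/build.sbt"
```

After applying the same edit to your real build.sbt, run pio build --clean as described above.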
In addition, new PEventStore and LEventStore APIs are introduced so that appName can be used as a parameter in engine.json to access the Event Store.
NOTE: The following changes are not required for using 0.9.2 but it's recommended to upgrade your engine code as described below because the old API will be deprecated.
remove this line of code:
import org.apache.predictionio.data.storage.Storage
and replace it by
import org.apache.predictionio.data.store.PEventStore
Change appId: Int to appName: String in DataSourceParams
For example,
case class DataSourceParams(appName: String) extends Params
remove this line of code: val eventsDb = Storage.getPEvents()
locate where eventsDb.aggregateProperties() is used, change it to PEventStore.aggregateProperties():
For example,
    val usersRDD: RDD[(String, User)] = PEventStore.aggregateProperties( // CHANGED
      appName = dsp.appName, // CHANGED: use appName
      entityType = "user"
    )(sc).map { ... }
locate where eventsDb.find() is used, change it to PEventStore.find()
For example,
    val viewEventsRDD: RDD[ViewEvent] = PEventStore.find( // CHANGED
      appName = dsp.appName, // CHANGED: use appName
      entityType = Some("user"),
      ...
If Storage.getLEvents() is also used in Algorithm (such as ALSAlgorithm of the E-Commerce Recommendation template), you also need to do the following:
NOTE: If org.apache.predictionio.data.storage.Storage is not used at all (such as in the Recommendation, Similar Product, Classification, Lead Scoring, and Product Ranking templates), there is no need to change Algorithm and you can go to the later engine.json section.
remove import org.apache.predictionio.data.storage.Storage and replace it by import org.apache.predictionio.data.store.LEventStore
change appId to appName in the XXXAlgorithmParams class.
remove this line of code: @transient lazy val lEventsDb = Storage.getLEvents()
locate where lEventsDb.findSingleEntity() is used, change it to LEventStore.findByEntity():
For example, change the following code
    ...
    val seenEvents: Iterator[Event] = lEventsDb.findSingleEntity(
      appId = ap.appId,
      entityType = "user",
      entityId = query.user,
      eventNames = Some(ap.seenEvents),
      targetEntityType = Some(Some("item")),
      // set time limit to avoid super long DB access
      timeout = Duration(200, "millis")
    ) match {
      case Right(x) => x
      case Left(e) => {
        logger.error(s"Error when read seen events: ${e}")
        Iterator[Event]()
      }
    }
to
    val seenEvents: Iterator[Event] = try { // CHANGED: try catch block is used
      LEventStore.findByEntity( // CHANGED: new API
        appName = ap.appName, // CHANGED: use appName
        entityType = "user",
        entityId = query.user,
        eventNames = Some(ap.seenEvents),
        targetEntityType = Some(Some("item")),
        // set time limit to avoid super long DB access
        timeout = Duration(200, "millis")
      )
    } catch { // CHANGED: try catch block is used
      case e: scala.concurrent.TimeoutException =>
        logger.error(s"Timeout when read seen events." +
          s" Empty list is used. ${e}")
        Iterator[Event]()
      case e: Exception =>
        logger.error(s"Error when read seen events: ${e}")
        throw e
    }
If you are using E-Commerce Recommendation template, please refer to the latest version for other updates related to LEventStore.findByEntity()
locate where appId is used, change it to appName and specify the name of the app instead.
For example:
    ...
    "datasource": {
      "params" : {
        "appName": "MyAppName"
      }
    },
Note that other components such as algorithms may also have an appId param (e.g. E-Commerce Recommendation template). Remember to change it to appName as well.
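To check that no component still uses appId, you can search engine.json for the old key. The following is a minimal sketch that runs against a scratch file. The sample engine.json content, the algorithm name `ecomm`, and the variable names are illustrative assumptions, not taken from a real template.

```shell
#!/bin/sh
# Scratch engine.json: datasource already migrated, one algorithm still on appId.
scratch="$(mktemp -d)"
cat > "$scratch/engine.json" <<'EOF'
{
  "datasource": { "params": { "appName": "MyAppName" } },
  "algorithms": [ { "name": "ecomm", "params": { "appId": 1 } } ]
}
EOF

# List every line (with its line number) that still contains the old key.
# `|| true` keeps the script going when grep finds nothing.
leftover="$(grep -n '"appId"' "$scratch/engine.json" || true)"

if [ -n "$leftover" ]; then
  echo "engine.json still uses appId on these lines:"
  echo "$leftover"
else
  echo "no appId params found"
fi
```

Run against your own engine.json, an empty result suggests that every component has been switched to appName.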
That's it! You can rebuild your engine to try it out!
0.9.0 has the following new changes:
The signature of P2LAlgorithm and PAlgorithm's train() method is changed from

    def train(pd: PD): M

to

    def train(sc: SparkContext, pd: PD): M

which allows you to access SparkContext inside train() with this new parameter sc.
A new SBT build plugin (pio-build) is added for engine templates.
WARNING: If you have existing engine templates running with a previous version of PredictionIO, you need to either download the latest templates, which are compatible with 0.9.0, or follow the instructions below to modify them.
Follow the instructions below to modify existing engine templates to be compatible with PredictionIO 0.9.0:
Add a new parameter sc: SparkContext in the signature of the train() method of the algorithm in the templates.
For example, in the Recommendation engine template, you will find the following train() function in ALSAlgorithm.scala:

    class ALSAlgorithm(val ap: ALSAlgorithmParams)
      extends P2LAlgorithm[PreparedData, ALSModel, Query, PredictedResult] {
      ...
      def train(data: PreparedData): ALSModel = ...
      ...
    }

Simply add the new parameter sc: SparkContext, to the train() function signature:

    class ALSAlgorithm(val ap: ALSAlgorithmParams)
      extends P2LAlgorithm[PreparedData, ALSModel, Query, PredictedResult] {
      ...
      def train(sc: SparkContext, data: PreparedData): ALSModel = ...
      ...
    }

You need to add the following import for your algorithm as well if it is not there:
import org.apache.spark.SparkContext
Modify the file build.sbt in your template directory to use pioVersion.value as the version of the org.apache.predictionio.core dependency:
Under your template's root directory, you should see a file build.sbt which has the following content:

    libraryDependencies ++= Seq(
      "org.apache.predictionio" %% "core" % "0.8.6" % "provided",
      "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided")

Change the version of "org.apache.predictionio" %% "core" to pioVersion.value:

    libraryDependencies ++= Seq(
      "org.apache.predictionio" %% "core" % pioVersion.value % "provided",
      "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided")

Create a new file pio-build.sbt in the template's project/ directory with the following content:

    addSbtPlugin("org.apache.predictionio" % "pio-build" % "0.9.0")

Then, you should see the following two files in the project/ directory:

    your_template_directory$ ls project/
    assembly.sbt pio-build.sbt

Create a new file template.json in the engine template's root directory with the following content:

    {"pio": {"version": { "min": "0.9.0" }}}

This specifies the minimum PredictionIO version that the engine can run with.
Lastly, you can add /pio.sbt into your engine template's .gitignore. pio.sbt is automatically generated by pio build.
That's it! Now you can run pio build, pio train and pio deploy with PredictionIO 0.9.0 in the same way as before!
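The file-creation steps above (pio-build.sbt, template.json, and the .gitignore entry) can be scripted. The following is a minimal sketch that runs in a scratch directory created with `mktemp`. The `template_dir` variable is a stand-in for your template's root directory, and the code changes to train() and build.sbt still have to be made by hand.

```shell
#!/bin/sh
# Scratch directory standing in for your engine template's root directory.
template_dir="$(mktemp -d)"
mkdir -p "$template_dir/project"

# project/pio-build.sbt: adds the pio-build SBT plugin.
cat > "$template_dir/project/pio-build.sbt" <<'EOF'
addSbtPlugin("org.apache.predictionio" % "pio-build" % "0.9.0")
EOF

# template.json: declares the minimum PredictionIO version.
cat > "$template_dir/template.json" <<'EOF'
{"pio": {"version": { "min": "0.9.0" }}}
EOF

# .gitignore: ignore the generated pio.sbt, without adding a duplicate entry.
touch "$template_dir/.gitignore"
grep -qxF '/pio.sbt' "$template_dir/.gitignore" || echo '/pio.sbt' >> "$template_dir/.gitignore"

ls "$template_dir/project"
```

The `grep -qxF` guard makes the .gitignore step safe to run more than once.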
## Upgrade to 0.8.4
The format of engine.json has changed slightly in 0.8.4 to make engines more flexible. If you are upgrading to 0.8.4, engine.json needs to have the params field for datasource, preparator, and serving. Here is the sample engine.json from templates/scala-parallel-recommendation-custom-preparator that demonstrates the change for datasource (line 7).
In 0.8.3

    {
      "id": "default",
      "description": "Default settings",
      "engineFactory": "org.template.recommendation.RecommendationEngine",
      "datasource": {
        "appId": 1
      },
      "algorithms": [
        {
          "name": "als",
          "params": {
            "rank": 10,
            "numIterations": 20,
            "lambda": 0.01
          }
        }
      ]
    }

In 0.8.4

    {
      "id": "default",
      "description": "Default settings",
      "engineFactory": "org.template.recommendation.RecommendationEngine",
      "datasource": {
        "params" : {
          "appId": 1
        }
      },
      "algorithms": [
        {
          "name": "als",
          "params": {
            "rank": 10,
            "numIterations": 20,
            "lambda": 0.01
          }
        }
      ]
    }

## Upgrade from 0.8.2 to 0.8.3
0.8.3 disallows the entity types pio_user and pio_item. These types are used by default by most SDKs. They are deprecated in 0.8.3, and the SDK helper functions have been updated to use user and item instead.
If you are upgrading to 0.8.3, you can follow these steps to migrate your data.
$ pio app new <my app name>
Please take note of the app ID generated for the new app.
$ pio upgrade 0.8.2 0.8.3 <old app id> <new app id>
It will run a script that creates a new app with the new app id and migrates the data to the new app.
"datasource": { "appId": <new app id> },
0.8.2 contains HBase and Elasticsearch schema changes from previous versions. If you are upgrading from a pre-0.8.2 version, you need to first clear HBase and Elasticsearch. This will clear out all data in Elasticsearch and HBase. Please be extra cautious.
DANGER: ALL EXISTING DATA WILL BE LOST!
With Elasticsearch running, do
$ curl -X DELETE http://localhost:9200/_all
For details see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-delete-index.html.
    $ $HBASE_HOME/bin/hbase shell
    ...
    > disable_all 'predictionio.*'
    ...
    > drop_all 'predictionio.*'
    ...
For details see http://wiki.apache.org/hadoop/Hbase/Shell.
Create an app to store the data
$ bin/pio app new <my app>
Replace <to app ID> by the returned app ID (<from app ID> is the original app ID used in 0.8.0/0.8.2):
    $ set -a
    $ source conf/pio-env.sh
    $ set +a
    $ sbt/sbt "data/run-main org.apache.predictionio.data.storage.hbase.upgrade.Upgrade <from app ID>" "<to app ID>"