This document describes a Scala-based single-machine linear regression engine.
Make sure you have built PredictionIO and setup storage described here.
This engine demonstrates how one can simply wrap around the Nak library to train a linear regression model and serve real-time predictions.
All code definition can be found here.
Training data is located at /examples/data/lr_data.txt
. The first column are values of the dependent variable, and the rest are values of explanatory variables. In this example, they are represented by the TrainingData
case class as a vector of double (all rows of the first column), and a vector of vector of double (all rows of the remaining columns) respectively.
The preparator in this example accepts two parameters: n
and k
. Each row of data is indexed by index
starting from 0. When n > 0
, rows matching index mod n = k
will be dropped.
This example engine contains one single algorithm that wraps around the Nak library's linear regression routine. The train()
method simply massage the TrainingData
into a form that can be used by Nak.
This example engine uses FirstServing
, which serves only predictions from the first algorithm. Since there is only one algorithm in this engine, predictions from the linear regression algorithm will be served.
This example provides a set of ready-to-use parameters for each component mentioned in the previous section. They are located inside the params
subdirectory.
Before training, you must let PredictionIO know about the engine. Run the following command to build and register the engine.
$ cd $PIO_HOME/examples/scala-local-regression $ ../../bin/pio register
where $PIO_HOME
is the root directory of the PredictionIO code tree.
To start training, use the following command.
$ cd $PIO_HOME/examples/scala-local-regression $ ../../bin/pio train
This will train a model and save it in PredictionIO's metadata storage. Notice that when the run is completed, it will display a run ID, like below.
2014-08-08 17:18:09,399 INFO SparkContext - Job finished: collect at DebugWorkflow.scala:571, took 0.046796 s 2014-08-08 17:18:09,399 INFO APIDebugWorkflow$ - Metrics is null. Stop here 2014-08-08 17:18:09,498 INFO APIDebugWorkflow$ - Saved engine instance with ID: CHURP-cvQta5VKxorx_9Aw
To run evaluation metrics, use the following command.
$ cd $PIO_HOME/examples/scala-local-regression $ ../../bin/pio eval --metrics-class io.prediction.controller.MeanSquareError
Notice the extra required argument --metrics-class io.prediction.controller.MeanSquareError
for the eval
command. This instructs PredictionIO to run the specified metrics during evaluation. When you look at the console output again, you should be able to see a mean square error computed, like the following.
2014-08-08 17:21:01,042 INFO APIDebugWorkflow$ - Set: The One Size: 1000 MSE: 0.092519 2014-08-08 17:21:01,042 INFO APIDebugWorkflow$ - APIDebugWorkflow.run completed. 2014-08-08 17:21:01,140 INFO APIDebugWorkflow$ - Saved engine instance with ID: icfEp9njR76NQOrvowC-dQ
Following from instructions above, you should have trained a model. Use the following command to start a server.
$ cd $PIO_HOME/examples/scala-local-regression $ ../../bin/pio deploy
This will create a server that by default binds to http://localhost:8000. You can visit that page in your web browser to check its status.
To perform real-time predictions, try the following.
$ curl -H "Content-Type: application/json" -d '[2.1419053154730548, 1.919407948982788, 0.0501333631091041, -0.10699028639933772, 1.2809776380727795, 1.6846227956326554, 0.18277859260127316, -0.39664340267804343, 0.8090554869291249, 2.48621339239065]' http://localhost:8000/queries.json $ curl -H "Content-Type: application/json" -d '[-0.8600615539670898, -1.0084357652346345, -1.3088407119560064, -1.9340485539299312, -0.6246990990796732, -2.325746651211032, -0.28429904752434976, -0.1272785164794058, -1.3787859877532718, -0.24374419289538318]' http://localhost:8000/queries.json
Congratulations! You have just trained a linear regression model and is able to perform real time prediction.
Prediction servers support reloading models on the fly with the latest completed run.
Assuming you already have a running prediction server from the previous section, go to http://localhost:8000 to check its status. Take note of the Run ID at the top.
Run training and deploy again.
$ cd $PIO_HOME/examples/scala-local-regression $ ../../bin/pio train $ ../../bin/pio deploy
Refresh the page at http://localhost:8000, you should see the prediction server status page with a new Run ID at the top.
Congratulations! You have just experienced a production-ready setup that can reload itself automatically after every training! Simply add the training or evaluation command to your crontab, and your setup will be able to re-deploy itself automatically in a regular interval.