| Distributed Recommendation Engine with RDD-based Model using MLlib's ALS |
| ======================================================================== |
| |
| This document describes a recommendation engine that is based on Apache Spark's |
| MLlib collaborative filtering algorithm. |
| |
| |
| Prerequisite |
| ------------ |
| |
Make sure you have built PredictionIO and set up storage as described
[here](/README.md).
| |
| |
| High Level Description |
| ---------------------- |
| |
This engine demonstrates how to integrate an MLlib algorithm that produces an
RDD-based model, deploy it in production, and serve real-time queries.
| |
| For details about MLlib's collaborative filtering algorithms, please refer to |
| https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. |
| |
All code definitions can be found [here](src/main/scala/Run.scala).
| |
| |
| ### Data Source |
| |
Training data is located at `/examples/data/movielens.txt`. Values are delimited
by double colons (`::`). The first column contains user IDs, the second item
IDs, and the third ratings. In this example, the data is represented as an
`RDD[Rating]`, as described in the official MLlib guide.
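As a sketch of what that loading amounts to (the function name is illustrative,
and the example line shows hypothetical values; see `Run.scala` for the actual
implementation):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.Rating

// A minimal loading sketch, assuming an existing SparkContext `sc`.
// Each line has the form  userID::itemID::rating, e.g. "1::4::3.5"
// (hypothetical values).
def loadRatings(sc: SparkContext, path: String) =
  sc.textFile(path).map { line =>
    val fields = line.split("::")
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
  }
```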
| |
| |
| ### Preparator |
| |
| The preparator in this example is an identity function, i.e. no further |
| preparation is done on the training data. |
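Conceptually, an identity preparator just passes the training data through
unchanged; a one-line sketch (the generic signature here is illustrative, not
PredictionIO's actual API):

```scala
// Identity preparation: return the training data as-is.
def prepare[TD](trainingData: TD): TD = trainingData
```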
| |
| |
| ### Algorithm |
| |
This example engine contains a single algorithm that wraps MLlib. Its
`train()` method simply calls MLlib's `ALS.train()` method.
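A sketch of what that wrapping amounts to (the function and parameter names here
are illustrative; the actual parameter values come from the `params` files
described below):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Train a MatrixFactorizationModel with MLlib's ALS.
// `rank` is the number of latent factors, `numIterations` the number of
// ALS iterations, and `lambda` the regularization parameter.
def trainModel(ratings: RDD[Rating], rank: Int, numIterations: Int, lambda: Double) =
  ALS.train(ratings, rank, numIterations, lambda)
```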
| |
| |
| ### Serving |
| |
| This example engine uses `FirstServing`, which serves only predictions from the |
| first algorithm. Since there is only one algorithm in this engine, predictions |
| from MLlib's ALS algorithm will be served. |
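First-serving boils down to returning the head of the per-algorithm prediction
sequence; a conceptual sketch (the generic signature is illustrative, not
PredictionIO's actual `FirstServing` source):

```scala
// Given one prediction per algorithm, serve the first algorithm's prediction.
def serveFirst[P](predictions: Seq[P]): P = predictions.head
```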
| |
| |
| Training a Model |
| ---------------- |
| |
| This example provides a set of ready-to-use parameters for each component |
| mentioned in the previous section. They are located inside the `params` |
| subdirectory. |
| |
| Before training, you must let PredictionIO know about the engine. Run the |
| following command to build and register the engine. |
| ``` |
| $ cd $PIO_HOME/examples/scala-recommendations |
| $ ../../bin/pio build |
| ``` |
| where `$PIO_HOME` is the root directory of the PredictionIO code tree. |
| |
| To start training, use the following command. You need to install the |
| [`gfortran`](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries) |
| runtime library if it is not already present on your nodes. For Debian and |
| Ubuntu systems this would be "`sudo apt-get install libgfortran3`". |
| ``` |
| $ cd $PIO_HOME/examples/scala-recommendations |
| $ ../../bin/pio train |
| ``` |
This will train a model and save it in PredictionIO's metadata storage. Notice
that when the run is completed, it will display a run ID, as shown below.
| ``` |
| 2014-08-27 23:13:54,596 INFO SparkContext - Job finished: saveAsObjectFile at Run.scala:68, took 0.299989372 s |
| 2014-08-27 23:13:54,736 INFO APIDebugWorkflow$ - Saved engine instance with ID: txHBY2XRQTKFnxC-lYoVgA |
| ``` |
| |
| |
| Deploying a Real-time Prediction Server |
| --------------------------------------- |
| |
Following the instructions above, you should have a trained model. Use the
following command to start a server.
| ``` |
| $ cd $PIO_HOME/examples/scala-recommendations |
| $ ../../bin/pio deploy |
| ``` |
| This will create a server that by default binds to http://localhost:8000. You |
| can visit that page in your web browser to check its status. |
| |
To perform real-time predictions, try the following. This queries the engine for
how user 1 would rate item (movie) 4. As with all collaborative filtering
algorithms, it cannot handle a cold-start user (one who has not rated any
movies).
| ``` |
| $ curl -H "Content-Type: application/json" -d '[1,4]' http://localhost:8000/queries.json |
| ``` |
Congratulations! You have just trained an ALS model and are now able to perform
real-time predictions distributed across an Apache Spark cluster!
| |
| |
| Production Prediction Server Deployment |
| --------------------------------------- |
| |
| Prediction servers support reloading models on the fly with the latest completed |
| run. |
| |
| 1. Assuming you already have a running prediction server from the previous |
| section, go to http://localhost:8000 to check its status. Take note of the |
| **Run ID** at the top. |
| |
| 2. Run training and deploy again. There is no need to manually terminate the previous deploy instance. |
| |
| ``` |
| $ cd $PIO_HOME/examples/scala-recommendations |
| $ ../../bin/pio train |
| $ ../../bin/pio deploy |
| ``` |
| |
3. Refresh the page at http://localhost:8000; you should see the prediction
   server status page with a new **Run ID** at the top.
| |
Congratulations! You have just experienced a production-ready setup that can
reload itself automatically after every training! Simply add the training and
deployment commands to your *crontab*, and your setup will re-deploy itself
automatically at regular intervals.
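
For example, a crontab entry along these lines could retrain and redeploy
nightly (the schedule and path are hypothetical; note that cron does not expand
shell variables like `$PIO_HOME` unless you define them in the crontab itself):

```
# Retrain and redeploy every day at 3:00 am. Adjust the path to your
# PredictionIO installation.
0 3 * * *  cd /path/to/pio/examples/scala-recommendations && ../../bin/pio train && ../../bin/pio deploy
```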