During development, you may want to run each component step by step to test out the data pipeline. In this tutorial, we will demonstrate how to do it easily.
In `src/main/java/recommendations/tutorial2`, you can find `Runner1.java`. It is a small program that uses `JavaSimpleEngineBuilder` to build an engine and `JavaWorkflow` to run the workflow.
To test the DataSource component, we can simply create an Engine with only the DataSource component and leave the other components empty:
```java
private static class HalfBakedEngineFactory implements IJavaEngineFactory {
  public JavaSimpleEngine<TrainingData, Object, Query, Float, Object> apply() {
    return new JavaSimpleEngineBuilder<
      TrainingData, Object, Query, Float, Object>()
      .dataSourceClass(DataSource.class)
      .build();
  }
}
```
Similarly, we only need to add the `DataSourceParams` to the `JavaEngineParamsBuilder`:
```java
JavaEngineParams engineParams = new JavaEngineParamsBuilder()
  .dataSourceParams(new DataSourceParams(filePath))
  .build();
```
Then, you can run this Engine by using `JavaWorkflow`:
```java
JavaWorkflow.runEngine(
  (new HalfBakedEngineFactory()).apply(),
  engineParams,
  null,
  new EmptyParams(),
  new WorkflowParamsBuilder().batch("MyEngine").verbose(3).build()
);
```
For quick testing purposes, a very simple test data set is provided in `data/test/ratings.csv`. Each row of the file represents a user ID, an item ID, and the rating value:
```
1,1,2
1,2,3
1,3,4
...
```
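Conceptually, the DataSource turns each of these rows into a (user, item, rating) tuple, with the rating read as a floating-point value. The following standalone sketch (the `RatingParser` and `Rating` names are illustrative, not part of PredictionIO) shows one way a single line could be parsed in plain Java:

```java
// Illustrative sketch only: parses one "user,item,rating" CSV line into a
// simple tuple, mirroring what a DataSource might do internally.
class RatingParser {
  static class Rating {
    final int user;
    final int item;
    final float rating;
    Rating(int user, int item, float rating) {
      this.user = user;
      this.item = item;
      this.rating = rating;
    }
  }

  // Splits a CSV line on commas and converts each field to its numeric type.
  static Rating parse(String line) {
    String[] fields = line.split(",");
    return new Rating(
      Integer.parseInt(fields[0].trim()),
      Integer.parseInt(fields[1].trim()),
      Float.parseFloat(fields[2].trim()));
  }
}
```

Parsing `"1,1,2"` with this sketch yields user 1, item 1, and rating 2.0, matching the `(1,1,2.0)` tuples printed in the console output below.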
`Runner1.java` takes the path of the rating file as an argument. Execute the following command to run it (the `../bin/pio run` command will automatically compile and package the JARs):
```
$ cd $PIO_HOME/examples
$ ../bin/pio run io.prediction.examples.java.recommendations.tutorial2.Runner1 -- -- data/test/ratings.csv
```
where `$PIO_HOME` is the root directory of the PredictionIO code tree. The `--` separators divide the parameters passed to `pio run` (the `Runner1` class in this case), the parameters passed to Apache Spark (no special parameters in this case), and the parameters passed to the main class (the CSV file in this case).
If it runs successfully, you should see the following console output at the end. It prints out the `TrainingData` generated by `DataSource`:
```
2014-08-05 15:24:40,140 INFO SparkContext - Job finished: collect at DebugWorkflow.scala:411, took 0.022947 s
2014-08-05 15:24:40,141 INFO APIDebugWorkflow$ - Data Set 0
2014-08-05 15:24:40,142 INFO APIDebugWorkflow$ - Params: null
2014-08-05 15:24:40,142 INFO APIDebugWorkflow$ - TrainingData:
2014-08-05 15:24:40,142 INFO APIDebugWorkflow$ - [[(1,1,2.0), (1,2,3.0), (1,3,4.0), (2,3,4.0), (2,4,1.0), (3,2,2.0), (3,3,1.0), (3,4,3.0), (4,1,5.0), (4,2,3.0), (4,4,2.0)]]
2014-08-05 15:24:40,143 INFO APIDebugWorkflow$ - TestingData: (count=0)
2014-08-05 15:24:40,143 INFO APIDebugWorkflow$ - Data source complete
2014-08-05 15:24:40,143 INFO APIDebugWorkflow$ - Preparator is null. Stop here
```
As you can see, the workflow stops after running the DataSource component and prints out the Training Data for debugging.
By simply adding `addAlgorithmClass()` and `addAlgorithmParams()` to the `JavaSimpleEngineBuilder` and `JavaEngineParamsBuilder` respectively, you can test the `Algorithm` class in the workflow as well, as shown in `Runner2.java`:
```java
private static class HalfBakedEngineFactory implements IJavaEngineFactory {
  public JavaSimpleEngine<TrainingData, Object, Query, Float, Object> apply() {
    return new JavaSimpleEngineBuilder<
      TrainingData, Object, Query, Float, Object>()
      .dataSourceClass(DataSource.class)
      .preparatorClass() // Use default Preparator
      .addAlgorithmClass("MyRecommendationAlgo", Algorithm.class) // Add Algorithm
      .build();
  }
}
```
```java
JavaEngineParams engineParams = new JavaEngineParamsBuilder()
  .dataSourceParams(new DataSourceParams(filePath))
  .addAlgorithmParams("MyRecommendationAlgo", new AlgoParams(0.2)) // Add Algorithm Params
  .build();
```
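The builder pattern behind these classes is what makes this incremental, half-baked approach possible: any component left unset simply stays empty, and the workflow stops at the first missing stage. A minimal standalone sketch of the same idea (the `ToyEngineBuilder` class and its method names are invented for illustration; this is not the PredictionIO API):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of an incremental engine builder: any stage left unset stays
// null, and build() stops at the first missing stage, the way the tutorial's
// half-baked engines stop after DataSource or after the Algorithm.
class ToyEngineBuilder {
  private String dataSource;
  private String preparator;
  private final List<String> algorithms = new ArrayList<>();

  ToyEngineBuilder dataSource(String name) {
    this.dataSource = name;
    return this; // return this so calls can be chained fluently
  }

  ToyEngineBuilder preparator(String name) {
    this.preparator = name;
    return this;
  }

  ToyEngineBuilder addAlgorithm(String name) {
    this.algorithms.add(name);
    return this;
  }

  // Returns the configured stages in pipeline order, stopping at the
  // first stage that was never set.
  List<String> build() {
    List<String> stages = new ArrayList<>();
    if (dataSource == null) return stages;
    stages.add(dataSource);
    if (preparator == null) return stages;
    stages.add(preparator);
    stages.addAll(algorithms);
    return stages;
  }
}
```

With only `dataSource(...)` set, `build()` returns a single-stage pipeline, analogous to the first `HalfBakedEngineFactory` above; adding a preparator and an algorithm extends the pipeline, analogous to `Runner2.java`.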
Execute the following command to run:
```
$ cd $PIO_HOME/examples
$ ../bin/pio run io.prediction.examples.java.recommendations.tutorial2.Runner2 -- -- data/test/ratings.csv
```
You should see the `Model` generated by the `Algorithm` at the end of the console output:
```
2014-08-26 21:17:28,174 INFO SparkContext - Job finished: collect at DebugWorkflow.scala:71, took 0.051342917 s
2014-08-26 21:17:28,174 INFO APIDebugWorkflow$ - [Model: [itemSimilarity: {1=org.apache.commons.math3.linear.OpenMapRealVector@65fa6c0, 2=org.apache.commons.math3.linear.OpenMapRealVector@c2eb7f66, 3=org.apache.commons.math3.linear.OpenMapRealVector@2302395e, 4=org.apache.commons.math3.linear.OpenMapRealVector@d2fb7858}] [userHistory: {1=org.apache.commons.math3.linear.OpenMapRealVector@5a1123a3, 2=org.apache.commons.math3.linear.OpenMapRealVector@d1225bfd, 3=org.apache.commons.math3.linear.OpenMapRealVector@572123a3, 4=org.apache.commons.math3.linear.OpenMapRealVector@a51523a3}]]
2014-08-26 21:17:28,175 INFO APIDebugWorkflow$ - Serving is null. Stop here
```
By adding each component step by step, we can easily test and debug the data pipeline.
Next: Evaluation