tree: 7137d168766f39172f06c6902906d3e67f7d5ab0 [path history] [tgz]
  1. data/
  2. project/
  3. src/
  4. .gitignore
  5. build.sbt
  6. engine.json
  7. README.md
  8. template.json
examples/scala-parallel-similarproduct/no-set-user/README.md

Similar Product Template without sending ‘$set’ events for users

In some cases, if you don't need to keep track the user ID being created/deleted or user properties changes with events, then you can simplify the template as described in this example to get rid of sending ‘$set’ events for users. The user Id can be extracted from the user-to-item events (eg. view events). You can find the complete source code in src/ directory.

This example engine is based on Similar Product Template version v0.1.3.

Documentation

Please refer to http://docs.prediction.io/templates/similarproduct/quickstart/

Development Notes

There you can find several steps that we have done to modify original Similar Product template

Changes to import_eventserver.py

For importing sample data, you can remove the following code that generates ‘$set’ events for entities with type ‘user’

  for user_id in user_ids:
    print "Set user", user_id
    client.create_event(
      event="$set",
      entity_type="user",
      entity_id=user_id
    )
    count += 1

Changes to DataSource.scala

Remove the following code

    // create a RDD of (entityID, User)
    val usersRDD: RDD[(String, User)] = eventsDb.aggregateProperties(
      appId = dsp.appId,
      entityType = "user"
    )(sc).map { case (entityId, properties) =>
      val user = try {
        User()
      } catch {
        case e: Exception => {
          logger.error(s"Failed to get properties ${properties} of" +
            s" user ${entityId}. Exception: ${e}.")
          throw e
        }
      }
      (entityId, user)
    }.cache()

Modify class ‘TrainingData’ that should looks like the following code:

class TrainingData(
  val items: RDD[(String, Item)],
  val viewEvents: RDD[ViewEvent]
) extends Serializable {
  override def toString = {
    s"items: [${items.count()} (${items.take(2).toList}...)]" +
    s"viewEvents: [${viewEvents.count()}] (${viewEvents.take(2).toList}...)"
  }
}

Then fix compilation issue in invocation ‘TrainingData’ constructor that should looks like following:

    new TrainingData(
      items = itemsRDD,
      viewEvents = viewEventsRDD
    )

Changes to Preparator.scala

We removed RDD with users from class ‘TrainingData’ at the previous step. So, we should apply the same changes for class ‘PreparedData’.

The file Preparator.scala should looks like the following code:

class Preparator
  extends PPreparator[TrainingData, PreparedData] {
  def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
    new PreparedData(
      items = trainingData.items,
      viewEvents = trainingData.viewEvents)
  }
}
class PreparedData(
  val items: RDD[(String, Item)],
  val viewEvents: RDD[ViewEvent]
) extends Serializable

Changes to ALSAlgorithm.scala

Remove the following code:

    require(!data.users.take(1).isEmpty,
      s"users in PreparedData cannot be empty." +
      " Please check if DataSource generates TrainingData" +
      " and Preprator generates PreparedData correctly.")

and modify the line where ‘userStringIntMap’ value is defined to extract the user ID from the viewEvents and create user String ID to Int Index BiMap. It should look like the following code:

val userStringIntMap = BiMap.stringInt(data.viewEvents.map(_.user))

Test the Changes

$ pio build
$ pio train
$ pio deploy

After that you can try to make the same queries that were listed in the quickstart. The result will be the same.