| --- |
| mathjax: include |
| htmlTitle: FlinkML - Alternating Least Squares |
| title: <a href="../ml">FlinkML</a> - Alternating Least Squares |
| --- |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| * This will be replaced by the TOC |
| {:toc} |
| |
| ## Description |
| |
| The alternating least squares (ALS) algorithm factorizes a given matrix $R$ into two factors $U$ and $V$ such that $R \approx U^TV$. |
| The unknown row dimension is given as a parameter to the algorithm and is called latent factors. |
| Since matrix factorization can be used in the context of recommendation, the matrices $U$ and $V$ can be called user and item matrix, respectively. |
| The $i$th column of the user matrix is denoted by $u_i$ and the $i$th column of the item matrix is $v_i$. |
| The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$. |
| |
| In order to find the user and item matrix, the following problem is solved: |
| |
| $$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2 + |
| \lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$ |
| |
| with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated. |
| This regularization scheme to avoid overfitting is called weighted-$\lambda$-regularization. |
| Details can be found in the work of [Zhou et al.](http://dx.doi.org/10.1007/978-3-540-68880-8_32). |
| |
| By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which can be solved directly. |
| The solution of the modified problem is guaranteed to monotonically decrease the overall cost function. |
| By applying this step alternately to the matrices $U$ and $V$, we can iteratively improve the matrix factorization. |
| |
| The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$. |
| |
| ## Operations |
| |
| `ALS` is a `Predictor`. |
| As such, it supports the `fit` and `predict` operation. |
| |
| ### Fit |
| |
| ALS is trained on the sparse representation of the rating matrix: |
| |
| * `fit: DataSet[(Int, Int, Double)] => Unit` |
| |
| ### Predict |
| |
| ALS predicts for each tuple of row and column index the rating: |
| |
| * `predict: DataSet[(Int, Int)] => DataSet[(Int, Int, Double)]` |
| |
| ## Parameters |
| |
| The alternating least squares implementation can be controlled by the following parameters: |
| |
| <table class="table table-bordered"> |
| <thead> |
| <tr> |
| <th class="text-left" style="width: 20%">Parameters</th> |
| <th class="text-center">Description</th> |
| </tr> |
| </thead> |
| |
| <tbody> |
| <tr> |
| <td><strong>NumFactors</strong></td> |
| <td> |
| <p> |
| The number of latent factors to use for the underlying model. |
| It is equivalent to the dimension of the calculated user and item vectors. |
| (Default value: <strong>10</strong>) |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td><strong>Lambda</strong></td> |
| <td> |
| <p> |
| Regularization factor. Tune this value in order to avoid overfitting or poor performance due to strong generalization. |
| (Default value: <strong>1</strong>) |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td><strong>Iterations</strong></td> |
| <td> |
| <p> |
| The maximum number of iterations. |
| (Default value: <strong>10</strong>) |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td><strong>Blocks</strong></td> |
| <td> |
| <p> |
| The number of blocks into which the user and item matrix are grouped. |
| The fewer blocks one uses, the less data is sent redundantly. |
| However, bigger blocks entail bigger update messages which have to be stored on the heap. |
| If the algorithm fails because of an OutOfMemoryException, then try to increase the number of blocks. |
| (Default value: <strong>None</strong>) |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td><strong>Seed</strong></td> |
| <td> |
| <p> |
| Random seed used to generate the initial item matrix for the algorithm. |
| (Default value: <strong>0</strong>) |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td><strong>TemporaryPath</strong></td> |
| <td> |
| <p> |
| Path to a temporary directory into which intermediate results are stored. |
| If this value is set, then the algorithm is split into two preprocessing steps, the ALS iteration and a post-processing step which calculates a last ALS half-step. |
| The preprocessing steps calculate the <code>OutBlockInformation</code> and <code>InBlockInformation</code> for the given rating matrix. |
| The results of the individual steps are stored in the specified directory. |
| By splitting the algorithm into multiple smaller steps, Flink does not have to split the available memory amongst too many operators. |
| This allows the system to process bigger individual messages and improves the overall performance. |
| (Default value: <strong>None</strong>) |
| </p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| |
| ## Examples |
| |
| {% highlight scala %} |
| // Read input data set from a csv file |
| val inputDS: DataSet[(Int, Int, Double)] = env.readCsvFile[(Int, Int, Double)]( |
| pathToTrainingFile) |
| |
| // Setup the ALS learner |
| val als = ALS() |
| .setIterations(10) |
| .setNumFactors(10) |
| .setBlocks(100) |
| .setTemporaryPath("hdfs://tempPath") |
| |
| // Set the other parameters via a parameter map |
| val parameters = ParameterMap() |
| .add(ALS.Lambda, 0.9) |
| .add(ALS.Seed, 42L) |
| |
| // Calculate the factorization |
| als.fit(inputDS, parameters) |
| |
| // Read the testing data set from a csv file |
| val testingDS: DataSet[(Int, Int)] = env.readCsvFile[(Int, Int)](pathToData) |
| |
| // Calculate the ratings according to the matrix factorization |
| val predictedRatings = als.predict(testingDS) |
| {% endhighlight %} |