---
mathjax: include
htmlTitle: FlinkML - Multiple linear regression
title: <a href="../ml">FlinkML</a> - Multiple linear regression
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

## Description

 Multiple linear regression tries to find a linear function which best fits the provided input data.
 Given a set of input data points with their target values $(\mathbf{x}_i, y_i)$, multiple linear
 regression finds a weight vector $\mathbf{w}$ such that the sum of the squared residuals is minimized:

 $$ S(\mathbf{w}) = \sum_{i=1}^n \left(y_i - \mathbf{w}^T\mathbf{x}_i \right)^2$$

 Written in matrix notation, we obtain the following formulation:

 $$\mathbf{w}^* = \arg \min_{\mathbf{w}} \left\| \mathbf{y} - X\mathbf{w} \right\|^2$$

 This problem has a closed-form solution which is given by:

  $$\mathbf{w}^* = \left(X^TX\right)^{-1}X^T\mathbf{y}$$
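
  For small data sets the closed-form solution can be computed directly. The following plain-Scala
  sketch is illustrative only and not part of FlinkML's API: it solves the normal equations
  $\left(X^TX\right)\mathbf{w} = X^T\mathbf{y}$ for a tiny two-feature example (a bias column plus
  one feature, data made up for the example) using an explicit 2x2 matrix inverse.

{% highlight scala %}
// Illustrative only: solve (X^T X) w = X^T y directly for two features.
object ClosedFormExample {
  // Design matrix X (first column is the bias) and targets y,
  // generated without noise from the true weights w = (2.0, 3.0).
  val X: Array[Array[Double]] = Array(
    Array(1.0, 1.0),
    Array(1.0, 2.0),
    Array(1.0, 3.0))
  val y: Array[Double] = Array(5.0, 8.0, 11.0) // y = 2 + 3 * x

  def solve(): Array[Double] = {
    // Compute A = X^T X and b = X^T y.
    val a00 = X.map(r => r(0) * r(0)).sum
    val a01 = X.map(r => r(0) * r(1)).sum
    val a11 = X.map(r => r(1) * r(1)).sum
    val b0 = X.zip(y).map { case (r, t) => r(0) * t }.sum
    val b1 = X.zip(y).map { case (r, t) => r(1) * t }.sum
    // Invert the symmetric 2x2 matrix A and apply it to b: w = A^{-1} b.
    val det = a00 * a11 - a01 * a01
    Array((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)
  }
}
{% endhighlight %}

  On this noise-free data the recovered weights are exactly the generating weights $(2, 3)$.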

  However, in cases where the input data set is so large that a complete pass over the whole data
  set is prohibitive, one can apply stochastic gradient descent (SGD) to approximate the solution.
  SGD first calculates the gradients for a random subset of the input data set. The gradient
  for a given point $\mathbf{x}_i$ is given by:

  $$\nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x}_i) = 2\left(\mathbf{w}^T\mathbf{x}_i -
    y_i\right)\mathbf{x}_i$$

  The gradients are averaged and scaled. The scaling factor is defined by $\gamma = \frac{s}{\sqrt{j}}$
  with $s$ being the initial step size and $j$ being the current iteration number. The resulting
  averaged gradient is subtracted from the current weight vector, giving the new weight vector for
  the next iteration:

  $$\mathbf{w}_{t+1} = \mathbf{w}_t - \gamma \frac{1}{n}\sum_{i=1}^n \nabla_{\mathbf{w}} S(\mathbf{w}_t, \mathbf{x}_i)$$

  The multiple linear regression algorithm either runs a fixed number of SGD iterations or
  terminates early based on a dynamic convergence criterion.
  The convergence criterion is the relative change in the sum of squared residuals:

  $$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
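
  The update rule and convergence criterion can be sketched in plain Scala. This is illustrative
  only: FlinkML's actual implementation runs on distributed `DataSet`s, and the data, initial
  weights, and hyperparameter values below are made up for the example. For simplicity the sketch
  uses the full (tiny) data set in every iteration instead of a random subset.

{% highlight scala %}
// Illustrative only: gradient descent with decreasing step size
// gamma = s / sqrt(j) and a relative-change stopping criterion.
object SgdExample {
  // (features, target) pairs generated from the weights w = (2, 3).
  val data: Array[(Array[Double], Double)] = Array(
    (Array(1.0, 1.0), 5.0),
    (Array(1.0, 2.0), 8.0),
    (Array(1.0, 3.0), 11.0))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Sum of squared residuals S(w).
  def residualSum(w: Array[Double]): Double =
    data.map { case (x, y) => val r = y - dot(w, x); r * r }.sum

  def train(stepsize: Double, iterations: Int, rho: Double): Array[Double] = {
    var w = Array(0.0, 0.0)
    var prev = residualSum(w)
    var j = 1
    var converged = false
    while (j <= iterations && !converged) {
      val gamma = stepsize / math.sqrt(j)
      // Averaged gradient: (1/n) * sum_i 2 * (w^T x_i - y_i) * x_i
      val grad = Array(0.0, 0.0)
      for ((x, y) <- data) {
        val err = 2.0 * (dot(w, x) - y)
        for (k <- grad.indices) grad(k) += err * x(k) / data.length
      }
      // Update step: w_{t+1} = w_t - gamma * averaged gradient
      w = w.zip(grad).map { case (wk, gk) => wk - gamma * gk }
      // Stop when the relative change of S drops below rho.
      val cur = residualSum(w)
      converged = (prev - cur) / prev < rho
      prev = cur
      j += 1
    }
    w
  }
}
{% endhighlight %}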
## Operations

`MultipleLinearRegression` is a `Predictor`.
As such, it supports the `fit` and `predict` operations.

### Fit

MultipleLinearRegression is trained on a set of `LabeledVector`s:

* `fit: DataSet[LabeledVector] => Unit`

### Predict

MultipleLinearRegression predicts the corresponding regression value for all subtypes of `Vector`:

* `predict[T <: Vector]: DataSet[T] => DataSet[LabeledVector]`

If we call predict with a `DataSet[LabeledVector]`, we make a prediction of the regression value
for each example, and return a `DataSet[(Double, Double)]`. In each tuple the first element
is the true value, as provided by the input `DataSet[LabeledVector]`, and the second element
is the predicted value. You can then use these `(truth, prediction)` tuples to evaluate
the algorithm's performance.

* `predict: DataSet[LabeledVector] => DataSet[(Double, Double)]`

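For instance, a common way to evaluate the `(truth, prediction)` tuples is the mean squared error.
The helper below is an illustrative plain-Scala sketch over an in-memory collection, not part of
FlinkML's API; on an actual Flink `DataSet` the same computation would be expressed with `DataSet`
transformations.

{% highlight scala %}
// Illustrative only: mean squared error over (truth, prediction) tuples.
def meanSquaredError(evaluation: Seq[(Double, Double)]): Double = {
  val squaredErrors = evaluation.map { case (truth, prediction) =>
    val residual = truth - prediction
    residual * residual
  }
  squaredErrors.sum / evaluation.length
}
{% endhighlight %}
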
## Parameters

  The multiple linear regression implementation can be controlled by the following parameters:

   <table class="table table-bordered">
    <thead>
      <tr>
        <th class="text-left" style="width: 20%">Parameters</th>
        <th class="text-center">Description</th>
      </tr>
    </thead>

    <tbody>
      <tr>
        <td><strong>Iterations</strong></td>
        <td>
          <p>
            The maximum number of iterations. (Default value: <strong>10</strong>)
          </p>
        </td>
      </tr>
      <tr>
        <td><strong>Stepsize</strong></td>
        <td>
          <p>
            Initial step size for the gradient descent method.
            This value controls how far the gradient descent method moves in the opposite direction of the gradient.
            Tuning this parameter might be crucial for the algorithm to remain stable and to achieve good performance.
            (Default value: <strong>0.1</strong>)
          </p>
        </td>
      </tr>
      <tr>
        <td><strong>ConvergenceThreshold</strong></td>
        <td>
          <p>
            Threshold for the relative change of the sum of squared residuals below which the iteration is stopped.
            (Default value: <strong>None</strong>)
          </p>
        </td>
      </tr>
    </tbody>
  </table>

## Examples

{% highlight scala %}
// Create multiple linear regression learner
val mlr = MultipleLinearRegression()
  .setIterations(10)
  .setStepsize(0.5)
  .setConvergenceThreshold(0.001)

// Obtain training and testing data set
val trainingDS: DataSet[LabeledVector] = ...
val testingDS: DataSet[Vector] = ...

// Fit the linear model to the provided data
mlr.fit(trainingDS)

// Calculate the predictions for the test data
val predictions = mlr.predict(testingDS)
{% endhighlight %}