| --- |
| mathjax: include |
| htmlTitle: FlinkML - Multiple linear regression |
| title: <a href="../ml">FlinkML</a> - Multiple linear regression |
| --- |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| * This will be replaced by the TOC |
| {:toc} |
| |
| ## Description |
| |
Multiple linear regression tries to find a linear function which best fits the provided input data.
Given a set of input data points with their respective values $(\mathbf{x}_i, y_i)$, multiple linear regression finds
a weight vector $\mathbf{w}$ such that the sum of the squared residuals is minimized:

$$ S(\mathbf{w}) = \sum_{i=1}^{n} \left(y_i - \mathbf{w}^T\mathbf{x}_i \right)^2$$
| |
In matrix notation, with $X$ denoting the matrix whose rows are the input vectors $\mathbf{x}_i$ and $\mathbf{y}$ the vector of their values, we obtain the following minimization problem:

$$\mathbf{w}^* = \arg \min_{\mathbf{w}} \left\| \mathbf{y} - X\mathbf{w} \right\|_2^2$$
| |
| This problem has a closed form solution which is given by: |
| |
| $$\mathbf{w}^* = \left(X^TX\right)^{-1}X^T\mathbf{y}$$ |
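
For moderately sized data that fits into memory, the closed form solution can be computed directly with any linear algebra library. The following is a minimal sketch using Breeze (which FlinkML uses internally for linear algebra); the toy matrix `x` and vector `y` are illustrative placeholders, not part of the FlinkML API:

{% highlight scala %}
import breeze.linalg.{DenseMatrix, DenseVector}

// Toy data: the rows of x are the input vectors, y holds their values (here y = x1 + 2 * x2)
val x = DenseMatrix((1.0, 2.0), (2.0, 1.0), (3.0, 3.0))
val y = DenseVector(5.0, 4.0, 9.0)

// Solve the normal equations (X^T X) w = X^T y rather than inverting X^T X explicitly
val w: DenseVector[Double] = (x.t * x) \ (x.t * y)
{% endhighlight %}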
| |
However, in cases where the input data set is so huge that a complete pass over the whole data
set is prohibitive, one can apply stochastic gradient descent (SGD) to approximate the solution.
SGD first calculates the gradients for a random subset of the input data set. The gradient
for a given point $\mathbf{x}_i$ is given by:
| |
$$\nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x}_i) = 2\left(\mathbf{w}^T\mathbf{x}_i -
y_i\right)\mathbf{x}_i$$
| |
The gradients are averaged over the subset and scaled. The scaling factor is $\gamma = \frac{s}{\sqrt{j}}$,
with $s$ being the initial step size and $j$ the current iteration number. The scaled average gradient is
subtracted from the current weight vector, giving the new weight vector for the next iteration:
| |
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \gamma \frac{1}{n}\sum_{i=1}^n \nabla_{\mathbf{w}} S(\mathbf{w}_t, \mathbf{x}_i)$$
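
To make the update rule concrete, here is an illustrative single SGD step on plain Scala collections. This is a sketch of the formulas above, not FlinkML's internal implementation; the function `sgdStep` and its parameter names are our own:

{% highlight scala %}
def sgdStep(
    points: Seq[(Array[Double], Double)], // sampled (x_i, y_i) pairs
    w: Array[Double],                     // current weight vector w_t
    s: Double,                            // initial step size
    j: Int                                // current iteration number
  ): Array[Double] = {
  val gamma = s / math.sqrt(j)            // decaying step size
  // Per-point gradient 2 * (w^T x_i - y_i) * x_i, averaged over the subset
  val avgGradient = points
    .map { case (x, y) =>
      val residual = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
      x.map(xi => 2.0 * residual * xi)
    }
    .reduce { (a, b) => a.zip(b).map { case (ai, bi) => ai + bi } }
    .map(_ / points.size)
  // w_{t+1} = w_t - gamma * averaged gradient
  w.zip(avgGradient).map { case (wi, gi) => wi - gamma * gi }
}
{% endhighlight %}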
| |
The multiple linear regression algorithm either runs a fixed number of SGD iterations or terminates once a dynamic convergence criterion is met.
The convergence criterion is the relative change in the sum of squared residuals, with $S_k$ denoting the sum of squared residuals after the $k$-th iteration and $\rho$ the convergence threshold:

$$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
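
As a sketch, the check amounts to a single comparison (the helper name `hasConverged` is hypothetical):

{% highlight scala %}
// Stop iterating once the relative decrease of the sum of squared
// residuals falls below the convergence threshold rho
def hasConverged(sPrev: Double, sCurr: Double, rho: Double): Boolean =
  (sPrev - sCurr) / sPrev < rho
{% endhighlight %}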
| |
| ## Operations |
| |
| `MultipleLinearRegression` is a `Predictor`. |
As such, it supports the `fit` and `predict` operations.
| |
| ### Fit |
| |
`MultipleLinearRegression` is trained on a set of `LabeledVector`:
| |
| * `fit: DataSet[LabeledVector] => Unit` |
| |
| ### Predict |
| |
`MultipleLinearRegression` predicts the corresponding regression value for all subtypes of `Vector`:
| |
| * `predict[T <: Vector]: DataSet[T] => DataSet[LabeledVector]` |
| |
If we call predict with a `DataSet[LabeledVector]`, we make a prediction of the regression value
for each example, and return a `DataSet[(Double, Double)]`. In each tuple the first element
is the true value, as provided by the input `DataSet[LabeledVector]`, and the second element
is the predicted value. You can then use these `(truth, prediction)` tuples to evaluate
the algorithm's performance, as sketched below.
| |
| * `predict: DataSet[LabeledVector] => DataSet[(Double, Double)]` |
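
For example, the mean squared error can be computed from these tuples with the plain DataSet API. This is a minimal sketch assuming `mlr` has already been fit and `evaluationDS` holds labeled test data:

{% highlight scala %}
// Labeled test data; predict on LabeledVector yields (truth, prediction) tuples
val evaluationDS: DataSet[LabeledVector] = ...
val truthAndPredictions: DataSet[(Double, Double)] = mlr.predict(evaluationDS)

// Mean squared error over all (truth, prediction) pairs
val mse: DataSet[Double] = truthAndPredictions
  .map { case (truth, prediction) => (math.pow(truth - prediction, 2), 1) }
  .reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }
  .map { case (sumSquaredError, count) => sumSquaredError / count }
{% endhighlight %}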
| |
| ## Parameters |
| |
| The multiple linear regression implementation can be controlled by the following parameters: |
| |
| <table class="table table-bordered"> |
| <thead> |
| <tr> |
| <th class="text-left" style="width: 20%">Parameters</th> |
| <th class="text-center">Description</th> |
| </tr> |
| </thead> |
| |
| <tbody> |
| <tr> |
| <td><strong>Iterations</strong></td> |
| <td> |
| <p> |
| The maximum number of iterations. (Default value: <strong>10</strong>) |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td><strong>Stepsize</strong></td> |
| <td> |
| <p> |
| Initial step size for the gradient descent method. |
| This value controls how far the gradient descent method moves in the opposite direction of the gradient. |
        Tuning this parameter might be crucial to make the optimization stable and to obtain better performance.
| (Default value: <strong>0.1</strong>) |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td><strong>ConvergenceThreshold</strong></td> |
| <td> |
| <p> |
        Threshold for the relative change of the sum of squared residuals below which the iteration is stopped.
| (Default value: <strong>None</strong>) |
| </p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| |
| ## Examples |
| |
| {% highlight scala %} |
| // Create multiple linear regression learner |
| val mlr = MultipleLinearRegression() |
| .setIterations(10) |
| .setStepsize(0.5) |
| .setConvergenceThreshold(0.001) |
| |
| // Obtain training and testing data set |
| val trainingDS: DataSet[LabeledVector] = ... |
| val testingDS: DataSet[Vector] = ... |
| |
| // Fit the linear model to the provided data |
| mlr.fit(trainingDS) |
| |
| // Calculate the predictions for the test data |
| val predictions = mlr.predict(testingDS) |
| {% endhighlight %} |
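
FlinkML predictors can also be chained with transformers to form pipelines. As a sketch, assuming the `PolynomialFeatures` transformer from FlinkML's preprocessing package, the regression can be fit on a polynomial feature space of the input:

{% highlight scala %}
// Map the input vectors into a polynomial feature space before the regression
val polyFeatures = PolynomialFeatures()
  .setDegree(3)

// Chaining a transformer with a predictor yields a new predictor
val pipeline = polyFeatures.chainPredictor(mlr)

// Fitting the pipeline trains the transformer and the regression in sequence
pipeline.fit(trainingDS)
{% endhighlight %}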