| --- |
| layout: global |
| title: SystemML Algorithms Reference - Regression |
| displayTitle: <a href="algorithms-reference.html">SystemML Algorithms Reference</a> |
| --- |
| <!-- |
| {% comment %} |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to you under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| {% endcomment %} |
| --> |
| |
| # 4. Regression |
| |
| |
| ## 4.1. Linear Regression |
| |
| ### Description |
| |
| Linear Regression scripts are used to model the relationship between one |
| numerical response variable and one or more explanatory (feature) |
| variables. The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ |
| where $x_i$ is a numerical vector of feature variables and $y_i$ is a |
| numerical response value for each training data record. The feature |
| vectors are provided as a matrix $X$ of size $n\,{\times}\,m$, where $n$ |
| is the number of records and $m$ is the number of features. The observed |
| response values are provided as a 1-column matrix $Y$, with a numerical |
| value $y_i$ for each $x_i$ in the corresponding row of matrix $X$. |
| |
| In linear regression, we predict the distribution of the response $y_i$ |
| based on a fixed linear combination of the features in $x_i$. We assume |
| that there exist constant regression coefficients |
| $\beta_0, \beta_1, \ldots, \beta_m$ and a constant residual |
| variance $\sigma^2$ such that |
| |
| $$ |
| \begin{equation} |
| y_i \sim Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\, |
| \mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} |
| \end{equation} |
| $$ |
| |
| Distribution |
| $y_i \sim Normal(\mu_i, \sigma^2)$ |
| models the "unexplained" residual noise and is assumed independent |
| across different records. |
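
For concreteness, here is what a dataset drawn from model (1) looks like in NumPy
(the coefficient values below are arbitrary, chosen only for illustration):

{% highlight python %}
import numpy as np

# Simulate model (1): mu_i = beta_0 + x_i . beta,  y_i ~ Normal(mu_i, sigma^2)
rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))                    # feature matrix, n records x m features
beta, beta0, sigma = np.array([1.5, -2.0, 0.5]), 0.7, 0.3
y = beta0 + X @ beta + rng.normal(scale=sigma, size=n)   # 1-column response
{% endhighlight %}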
| |
| The goal is to estimate the regression coefficients and the residual |
| variance. Once they are accurately estimated, we can make predictions |
| about $y_i$ given $x_i$ in new records. We can also use the $\beta_j$’s |
| to analyze the influence of individual features on the response value, |
| and assess the quality of this model by comparing residual variance in |
| the response, left after prediction, with its total variance. |
| |
| There are two scripts in our library, both doing the same estimation, |
| but using different computational methods. Depending on the size and the |
| sparsity of the feature matrix $X$, one or the other script may be more |
| efficient. The "direct solve" script `LinearRegDS.dml` is more |
| efficient when the number of features $m$ is relatively small |
| ($m \sim 1000$ or less) and matrix $X$ is either tall or fairly dense |
| (has ${\gg}\:m^2$ nonzeros); otherwise, the "conjugate gradient" script |
| `LinearRegCG.dml` is more efficient. If $m > 50000$, use only |
| `LinearRegCG.dml`. |
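
This rule of thumb is easy to encode when driving the scripts programmatically. The
sketch below (not part of the library) picks an mllearn solver name from the shape and
sparsity of the feature matrix; the `factor` parameter is a hypothetical stand-in for
the informal "much greater than":

{% highlight python %}
import numpy as np
import scipy.sparse as sp

def choose_solver(X, factor=10):
    """Heuristic solver choice per the rule of thumb above."""
    n, m = X.shape
    nnz = X.nnz if sp.issparse(X) else np.count_nonzero(X)
    if m > 50000:
        return 'newton-cg'       # direct solve infeasible at this width
    if m <= 1000 and (n > factor * m or nnz > factor * m * m):
        return 'direct-solve'    # X is tall or fairly dense
    return 'newton-cg'
{% endhighlight %}

With such a helper, the mllearn call becomes
`LinearRegression(sqlCtx, solver=choose_solver(X_train))`.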
| |
| |
| ### Usage |
| |
| **Linear Regression - Direct Solve**: |
| |
| <div class="codetabs"> |
| <div data-lang="Python" markdown="1"> |
| {% highlight python %} |
| from systemml.mllearn import LinearRegression |
| # C = 1/reg |
| lr = LinearRegression(sqlCtx, fit_intercept=True, C=1.0, solver='direct-solve') |
# X_train, y_train and X_test can be NumPy matrices, Pandas DataFrames or SciPy sparse matrices
y_test = lr.fit(X_train, y_train).predict(X_test)
# df_train is a DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"
y_test = lr.fit(df_train).transform(df_test)
| {% endhighlight %} |
| </div> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f LinearRegDS.dml |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| O=[file] |
| icpt=[int] |
| reg=[double] |
| fmt=[format] |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f LinearRegDS.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| O=[file] |
| icpt=[int] |
| reg=[double] |
| fmt=[format] |
| </div> |
| </div> |
| |
| **Linear Regression - Conjugate Gradient**: |
| |
| <div class="codetabs"> |
| <div data-lang="Python" markdown="1"> |
| {% highlight python %} |
| from systemml.mllearn import LinearRegression |
| # C = 1/reg |
| lr = LinearRegression(sqlCtx, fit_intercept=True, max_iter=100, tol=0.000001, C=1.0, solver='newton-cg') |
# X_train, y_train and X_test can be NumPy matrices, Pandas DataFrames or SciPy sparse matrices
y_test = lr.fit(X_train, y_train).predict(X_test)
# df_train is a DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"
y_test = lr.fit(df_train).transform(df_test)
| {% endhighlight %} |
| </div> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f LinearRegCG.dml |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| O=[file] |
| Log=[file] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| maxi=[int] |
| fmt=[format] |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f LinearRegCG.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| O=[file] |
| Log=[file] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| maxi=[int] |
| fmt=[format] |
| </div> |
| </div> |
| |
| |
| ### Arguments |
| |
| **X**: Location (on HDFS) to read the matrix of feature vectors, each row |
| constitutes one feature vector |
| |
| **Y**: Location to read the 1-column matrix of response values |
| |
| **B**: Location to store the estimated regression parameters (the $\beta_j$’s), |
| with the intercept parameter $\beta_0$ at position |
| B\[$m\,{+}\,1$, 1\] if available |
| |
**O**: (default: `" "`) Location to store the CSV-file of summary statistics defined
in [**Table 7**](algorithms-regression.html#table7); by default the statistics are
printed to standard output
| |
| **Log**: (default: `" "`, `LinearRegCG.dml` only) Location to store |
| iteration-specific variables for monitoring and debugging purposes, see |
| [**Table 8**](algorithms-regression.html#table8) |
| for details. |
| |
| **icpt**: (default: `0`) Intercept presence and shifting/rescaling the features |
| in $X$: |
| |
| * 0 = no intercept (hence no $\beta_0$), no shifting or |
| rescaling of the features |
| * 1 = add intercept, but do not shift/rescale the features |
| in $X$ |
| * 2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1 |
| |
| **reg**: (default: `0.000001`) L2-regularization parameter $\lambda\geq 0$; set to nonzero for highly |
| dependent, sparse, or numerous ($m \gtrsim n/10$) features |
| |
| **tol**: (default: `0.000001`, `LinearRegCG.dml` only) Tolerance $\varepsilon\geq 0$ used in the |
| convergence criterion: we terminate conjugate gradient iterations when |
| the $\beta$-residual reduces in L2-norm by this factor |
| |
| **maxi**: (default: `0`, `LinearRegCG.dml` only) Maximum number of conjugate |
| gradient iterations, or `0` if no maximum limit provided |
| |
| **fmt**: (default: `"text"`) Matrix file output format, such as `text`, |
| `mm`, or `csv`; see read/write functions in |
| SystemML Language Reference for details. |
| |
| |
| ### Examples |
| |
| **Linear Regression - Direct Solve**: |
| |
| <div class="codetabs"> |
| <div data-lang="Python" markdown="1"> |
| {% highlight python %} |
| import numpy as np |
| from sklearn import datasets |
| from systemml.mllearn import LinearRegression |
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)  # 'sc' is the SparkContext provided by the pyspark shell
| # Load the diabetes dataset |
| diabetes = datasets.load_diabetes() |
| # Use only one feature |
| diabetes_X = diabetes.data[:, np.newaxis, 2] |
| # Split the data into training/testing sets |
| diabetes_X_train = diabetes_X[:-20] |
| diabetes_X_test = diabetes_X[-20:] |
| # Split the targets into training/testing sets |
| diabetes_y_train = diabetes.target[:-20] |
| diabetes_y_test = diabetes.target[-20:] |
| # Create linear regression object |
| regr = LinearRegression(sqlCtx, solver='direct-solve') |
| # Train the model using the training sets |
| regr.fit(diabetes_X_train, diabetes_y_train) |
# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
| {% endhighlight %} |
| </div> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f LinearRegDS.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f LinearRegDS.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| </div> |
| </div> |
| |
| **Linear Regression - Conjugate Gradient**: |
| |
| <div class="codetabs"> |
| <div data-lang="Python" markdown="1"> |
| {% highlight python %} |
| import numpy as np |
| from sklearn import datasets |
| from systemml.mllearn import LinearRegression |
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)  # 'sc' is the SparkContext provided by the pyspark shell
| # Load the diabetes dataset |
| diabetes = datasets.load_diabetes() |
| # Use only one feature |
| diabetes_X = diabetes.data[:, np.newaxis, 2] |
| # Split the data into training/testing sets |
| diabetes_X_train = diabetes_X[:-20] |
| diabetes_X_test = diabetes_X[-20:] |
| # Split the targets into training/testing sets |
| diabetes_y_train = diabetes.target[:-20] |
| diabetes_y_test = diabetes.target[-20:] |
| # Create linear regression object |
| regr = LinearRegression(sqlCtx, solver='newton-cg') |
| # Train the model using the training sets |
| regr.fit(diabetes_X_train, diabetes_y_train) |
# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
| {% endhighlight %} |
| </div> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f LinearRegCG.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| tol=0.00000001 |
| maxi=100 |
| Log=/user/ml/log.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f LinearRegCG.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| tol=0.00000001 |
| maxi=100 |
| Log=/user/ml/log.csv |
| </div> |
| </div> |
| |
| |
| * * * |
| |
| <a name="table7" /> |
**Table 7**: Besides $\beta$, the linear regression scripts compute a few summary statistics
listed below. The statistics are provided in CSV format, one comma-separated name-value
pair per line.
| |
| | Name | Meaning | |
| | --------------------- | ------- | |
| | AVG\_TOT\_Y | Average of the response value $Y$ |
| | STDEV\_TOT\_Y | Standard Deviation of the response value $Y$ |
| | AVG\_RES\_Y | Average of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$, i.e. residual bias |
| | STDEV\_RES\_Y | Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$ |
| | DISPERSION | GLM-style dispersion, i.e. residual sum of squares / \#deg. fr. |
| | PLAIN\_R2 | Plain $R^2$ of residual with bias included vs. total average |
| | ADJUSTED\_R2 | Adjusted $R^2$ of residual with bias included vs. total average |
| | PLAIN\_R2\_NOBIAS | Plain $R^2$ of residual with bias subtracted vs. total average |
| | ADJUSTED\_R2\_NOBIAS | Adjusted $R^2$ of residual with bias subtracted vs. total average |
| | PLAIN\_R2\_VS\_0 | * Plain $R^2$ of residual with bias included vs. zero constant |
| | ADJUSTED\_R2\_VS\_0 | * Adjusted $R^2$ of residual with bias included vs. zero constant |
| |
| \* The last two statistics are only printed if there is no intercept (`icpt=0`) |
| |
| * * * |
| |
| <a name="table8" /> |
**Table 8**: The `Log` file for the `LinearRegCG.dml` script
contains the following iteration variables in CSV format, each line
containing a triple (Name, Iteration\#, Value) with Iteration\# being 0
for initial values.
| |
| |
| | Name | Meaning | |
| | --------------------- | ------- | |
| | CG\_RESIDUAL\_NORM | L2-norm of conjug. grad. residual, which is $A$ %\*% $\beta - t(X)$ %\*% $y$ where $A = t(X)$ %\*% $X + diag(\lambda)$, or a similar quantity |
| | CG\_RESIDUAL\_RATIO | Ratio of current L2-norm of conjug. grad. residual over the initial |
| |
| * * * |
| |
| |
| ### Details |
| |
| To solve a linear regression problem over feature matrix $X$ and |
| response vector $Y$, we can find coefficients |
| $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize the |
| joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed |
| statistical model (1). Since the joint likelihood of the |
| independent |
| $y_i \sim Normal(\mu_i, \sigma^2)$ |
| is proportional to the product of |
| $\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the |
| logarithm of this product, then multiply by $-2\sigma^2 < 0$ to obtain a |
| least squares problem: |
| |
| $$ |
| \begin{equation} |
| \sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, |
| \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 |
| \,\,\to\,\,\min |
| \end{equation} |
| $$ |
| |
This may not be enough, however. The minimum may
sometimes be attained over infinitely many $\beta$-vectors, for example
if $X$ has an all-0 column, or has linearly dependent columns, or has
fewer rows than columns. Even if (2) has a unique
| solution, other $\beta$-vectors may be just a little suboptimal[^1], yet |
| give significantly different predictions for new feature vectors. This |
| results in *overfitting*: prediction error for the training data ($X$ |
| and $Y$) is much smaller than for the test data (new records). |
| |
| Overfitting and degeneracy in the data is commonly mitigated by adding a |
| regularization penalty term to the least squares function: |
| |
| $$ |
| \begin{equation} |
| \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 |
| \,+\,\, \lambda \sum_{j=1}^m \beta_j^2 |
| \,\,\to\,\,\min |
| \end{equation} |
| $$ |
| |
| The choice of $\lambda>0$, the regularization |
| constant, typically involves cross-validation where the dataset is |
| repeatedly split into a training part (to estimate the $\beta_j$’s) and |
| a test part (to evaluate prediction accuracy), with the goal of |
| maximizing the test accuracy. In our scripts, $\lambda$ is provided as |
| input parameter `reg`. |
| |
| The solution to the least squares problem (3), through |
| taking the derivative and setting it to 0, has the matrix linear |
| equation form |
| |
| $$ |
| \begin{equation} |
| A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\, |
| \textrm{where}\,\,\, |
| A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} diag(\hspace{0.5pt} |
| \underbrace{\lambda,\ldots, \lambda}_{\scriptstyle m}, 0) |
| \end{equation} |
| $$ |
| |
| where $[X,\,1]$ is $X$ with an extra column of 1s |
| appended on the right, and the diagonal matrix of $\lambda$’s has a zero |
to keep the intercept $\beta_0$ unregularized. If the intercept is
disabled by setting `icpt=0`, the equation is simply $X^T X \beta = X^T Y$.
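
To make equation (4) concrete, here is a minimal NumPy sketch of the direct-solve
computation (an illustration of the math, not the `LinearRegDS.dml` script itself):

{% highlight python %}
import numpy as np

def linreg_direct_solve(X, y, reg=1e-6):
    """Solve the regularized normal equations (4)."""
    n, m = X.shape
    Xi = np.hstack([X, np.ones((n, 1))])        # [X, 1]
    penalty = np.append(reg * np.ones(m), 0.0)  # zero keeps beta_0 unregularized
    A = Xi.T @ Xi + np.diag(penalty)
    return np.linalg.solve(A, Xi.T @ y)         # [beta_1, ..., beta_m, beta_0]
{% endhighlight %}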
| |
| We implemented two scripts for solving equation (4): one |
| is a "direct solver" that computes $A$ and then solves |
| $A\beta = [X,\,1]^T Y$ by calling an external package, the other |
| performs linear conjugate gradient (CG) iterations without ever |
| materializing $A$. The CG algorithm closely follows Algorithm 5.2 in |
| Chapter 5 of [[Nocedal2006]](algorithms-bibliography.html). Each step in the CG algorithm |
| computes a matrix-vector multiplication $q = Ap$ by first computing |
| $[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$. Usually the number of |
| such multiplications, one per CG iteration, is much smaller than $m$. |
| The user can put a hard bound on it with input |
| parameter `maxi`, or use the default maximum of $m+1$ (or $m$ |
if no intercept) by setting `maxi=0`. The CG iterations
| terminate when the L2-norm of vector $r = A\beta - [X,\,1]^T Y$ |
| decreases from its initial value (for $\beta=0$) by the tolerance factor |
| specified in input parameter `tol`. |
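
The sketch below illustrates these CG iterations in NumPy: each product $Ap$ is computed
as $[X,\,1]^T([X,\,1]\,p)$ plus the regularization term, so $A$ is never materialized.
It is a simplified rendition of the idea, not the `LinearRegCG.dml` script itself:

{% highlight python %}
import numpy as np

def linreg_cg(X, y, reg=1e-6, tol=1e-6, maxi=0):
    """Linear conjugate gradient for equation (4) without forming A."""
    n, m = X.shape
    Xi = np.hstack([X, np.ones((n, 1))])           # [X, 1]
    penalty = np.append(reg * np.ones(m), 0.0)     # beta_0 unregularized
    matvec = lambda p: Xi.T @ (Xi @ p) + penalty * p
    b = Xi.T @ y
    beta = np.zeros(m + 1)
    r = b.copy()                                   # residual for beta = 0
    p = r.copy()
    norm0 = np.linalg.norm(r)
    for _ in range(maxi if maxi > 0 else m + 1):
        q = matvec(p)
        alpha = (r @ r) / (p @ q)
        beta = beta + alpha * p
        r_new = r - alpha * q
        if np.linalg.norm(r_new) < tol * norm0:    # residual shrank by factor tol
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
{% endhighlight %}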
| |
| The CG algorithm is more efficient if computing |
| $[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$, |
| an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix. The Direct Solver (DS) is more |
| efficient if $X$ takes up a lot more memory than $A$ (i.e. $X$ has a lot |
| more nonzeros than $m^2$) and if $m^2$ is small enough for the external |
| solver ($m \lesssim 50000$). A more precise determination between CG |
| and DS is subject to further research. |
| |
| In addition to the $\beta$-vector, the scripts estimate the residual |
| standard deviation $\sigma$ and the $R^2$, the ratio of "explained" |
| variance to the total variance of the response variable. These |
| statistics only make sense if the number of degrees of freedom |
| $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant |
| $\lambda$ is negligible or zero. The formulas for $\sigma$ and |
| $R^2$ are: |
| |
| $$R^2_{\textrm{plain}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad |
| \sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad |
| R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}$$ |
| |
| where |
| |
| $$\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - |
| \frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad |
| \mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2$$ |
| |
| Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the |
| estimated regression coefficients and the feature vectors. They may be |
| biased when no intercept is present, hence the RSS formula subtracts the |
| bias. |
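
As a worked illustration, the sketch below computes these statistics from observed
responses and predicted means (an intercept is assumed, so the degrees of freedom
are $n - m - 1$):

{% highlight python %}
import numpy as np

def summary_stats(y, mu_hat, m):
    """sigma, plain R^2 and adjusted R^2 per the formulas above."""
    n = y.shape[0]
    resid = y - mu_hat
    rss = np.sum((resid - resid.mean()) ** 2)   # bias subtracted, as in RSS above
    tss = np.sum((y - y.mean()) ** 2)
    sigma = np.sqrt(rss / (n - m - 1))
    r2_plain = 1.0 - rss / tss
    r2_adj = 1.0 - sigma ** 2 * (n - 1) / tss
    return sigma, r2_plain, r2_adj
{% endhighlight %}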
| |
Lastly, note that by choosing the input option `icpt=2` the
user can shift and rescale the columns of $X$ to have mean 0 and
variance 1. This is particularly important when using
regularization over features of widely different scales, because regularization
| tends to penalize small-variance columns (which need large $\beta_j$’s) |
| more than large-variance columns (with small $\beta_j$’s). At the end, |
| the estimated regression coefficients are shifted and rescaled to apply |
| to the original features. |
| |
| |
| ### Returns |
| |
| The estimated regression coefficients (the $\hat{\beta}_j$’s) are |
| populated into a matrix and written to an HDFS file whose path/name was |
| provided as the `B` input argument. What this matrix |
| contains, and its size, depends on the input argument `icpt`, |
| which specifies the user’s intercept and rescaling choice: |
| |
| * **icpt=0**: No intercept, matrix $B$ has size $m\,{\times}\,1$, with |
| $B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$. |
| * **icpt=1**: There is intercept, but no shifting/rescaling of $X$; matrix $B$ has |
| size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from |
| 1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept |
| coefficient. |
* **icpt=2**: There is intercept, and the features in $X$ are shifted to mean${} = 0$
and rescaled to variance${} = 1$; then there are two versions of
| the $\hat{\beta}_j$’s, one for the original features and another for the |
| shifted/rescaled features. Now matrix $B$ has size |
| $(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and |
| $B[\cdot, 2]$ for the shifted/rescaled features, in the above format. |
| Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are |
| obtained from $B[\cdot, 2]$ by complementary shifting and rescaling. |
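
For illustration, here is a minimal sketch of turning a $B$ matrix (loaded into a
NumPy array) back into predicted means, following the layout above:

{% highlight python %}
import numpy as np

def apply_coefficients(X, B, icpt):
    """Predict means from a B matrix written by the scripts; for icpt=2,
    column 1 of B already applies to the original (unscaled) features."""
    b = B[:, 0]
    if icpt == 0:
        return X @ b                  # no intercept row
    return X @ b[:-1] + b[-1]         # intercept is stored in the last row
{% endhighlight %}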
| |
| The estimated summary statistics, including residual standard |
| deviation $\sigma$ and the $R^2$, are printed out or sent into a file |
| (if specified) in CSV format as defined in [**Table 7**](algorithms-regression.html#table7). |
| For conjugate gradient iterations, a log file with monitoring variables |
| can also be made available, see [**Table 8**](algorithms-regression.html#table8). |
| |
| |
| * * * |
| |
| ## 4.2. Stepwise Linear Regression |
| |
| ### Description |
| |
| Our stepwise linear regression script selects a linear model based on |
| the Akaike information criterion (AIC): the model that gives rise to the |
| lowest AIC is computed. |
| |
| |
| ### Usage |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f StepLinearRegDS.dml |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| S=[file] |
| O=[file] |
| icpt=[int] |
| thr=[double] |
| fmt=[format] |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f StepLinearRegDS.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| S=[file] |
| O=[file] |
| icpt=[int] |
| thr=[double] |
| fmt=[format] |
| </div> |
| </div> |
| |
| ### Arguments |
| |
| **X**: Location (on HDFS) to read the matrix of feature vectors, each row |
| contains one feature vector. |
| |
| **Y**: Location (on HDFS) to read the 1-column matrix of response values |
| |
| **B**: Location (on HDFS) to store the estimated regression parameters (the |
| $\beta_j$’s), with the intercept parameter $\beta_0$ at position |
| B\[$m\,{+}\,1$, 1\] if available |
| |
| **S**: (default: `" "`) Location (on HDFS) to store the selected feature-ids in the |
| order as computed by the algorithm; by default the selected feature-ids |
| are forwarded to the standard output. |
| |
| **O**: (default: `" "`) Location (on HDFS) to store the CSV-file of summary |
| statistics defined in [**Table 7**](algorithms-regression.html#table7); by default the |
| summary statistics are forwarded to the standard output. |
| |
| **icpt**: (default: `0`) Intercept presence and shifting/rescaling the features |
| in $X$: |
| |
| * 0 = no intercept (hence no $\beta_0$), no shifting or |
| rescaling of the features; |
| * 1 = add intercept, but do not shift/rescale the features |
| in $X$; |
| * 2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1 |
| |
**thr**: (default: `0.01`) Threshold to stop the algorithm: if the decrease in the value
of the AIC falls below `thr`, no further features are
checked and the algorithm stops.
| |
| **fmt**: (default: `"text"`) Matrix file output format, such as `text`, |
| `mm`, or `csv`; see read/write functions in |
| SystemML Language Reference for details. |
| |
| |
| ### Examples |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f StepLinearRegDS.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| thr=0.05 |
| fmt=csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f StepLinearRegDS.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| thr=0.05 |
| fmt=csv |
| </div> |
| </div> |
| |
| |
| ### Details |
| |
| Stepwise linear regression iteratively selects predictive variables in |
| an automated procedure. Currently, our implementation supports forward |
| selection: starting from an empty model (without any variable) the |
| algorithm examines the addition of each variable based on the AIC as a |
| model comparison criterion. The AIC is defined as |
| |
| $$ |
| \begin{equation} |
| AIC = -2 \log{L} + 2 edf,\label{eq:AIC} |
| \end{equation} |
| $$ |
| |
| where $L$ denotes the |
| likelihood of the fitted model and $edf$ is the equivalent degrees of |
| freedom, i.e., the number of estimated parameters. This procedure is |
| repeated until including no additional variable improves the model by a |
| certain threshold specified in the input parameter `thr`. |
| |
| For fitting a model in each iteration we use the `direct solve` method |
| as in the script `LinearRegDS.dml` discussed in |
| [Linear Regression](algorithms-regression.html#linear-regression). |
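
For intuition, here is a minimal NumPy sketch of this forward-selection loop for the
Gaussian (least squares) case. Constant terms are dropped from the AIC since only
differences matter, the baseline is an intercept-only model, and NumPy's `lstsq`
stands in for the direct-solve method used by the actual script:

{% highlight python %}
import numpy as np

def aic_gaussian(X, y):
    """AIC = -2 log L + 2 edf for a Gaussian linear model, up to constants."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n = y.shape[0]
    return n * np.log(rss / n) + 2 * X.shape[1]

def forward_select(X, y, thr=0.01):
    """Add the feature that lowers the AIC most; stop when the gain < thr."""
    n, m = X.shape
    selected = []
    best_aic = aic_gaussian(np.ones((n, 1)), y)        # intercept-only baseline
    while len(selected) < m:
        cand = [(aic_gaussian(np.column_stack([X[:, selected + [j]],
                                               np.ones(n)]), y), j)
                for j in range(m) if j not in selected]
        aic, j = min(cand)
        if best_aic - aic < thr:
            break
        selected.append(j)
        best_aic = aic
    return selected
{% endhighlight %}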
| |
| |
| ### Returns |
| |
Similar to the outputs from `LinearRegDS.dml`, the stepwise
| linear regression script computes the estimated regression coefficients |
| and stores them in matrix $B$ on HDFS. The format of matrix $B$ is |
| identical to the one produced by the scripts for linear regression (see |
| [Linear Regression](algorithms-regression.html#linear-regression)). Additionally, `StepLinearRegDS.dml` |
| outputs the variable indices (stored in the 1-column matrix $S$) in the |
order they have been selected by the algorithm, i.e., the $i$th entry in
matrix $S$ corresponds to the variable which improves the AIC the most
in the $i$th iteration. If the model with the lowest AIC includes no
variables, matrix $S$ will be empty (contains one 0). Moreover, the
| estimated summary statistics as defined in [**Table 7**](algorithms-regression.html#table7) |
| are printed out or stored in a file (if requested). In the case where an |
| empty model achieves the best AIC these statistics will not be produced. |
| |
| |
| * * * |
| |
| ## 4.3. Generalized Linear Models |
| |
| ### Description |
| |
| Generalized Linear Models |
| [[Gill2000](algorithms-bibliography.html), |
| [McCullagh1989](algorithms-bibliography.html), |
| [Nelder1972](algorithms-bibliography.html)] |
| extend the methodology of linear |
| and logistic regression to a variety of distributions commonly assumed |
| as noise effects in the response variable. As before, we are given a |
| collection of records $(x_1, y_1)$, …, $(x_n, y_n)$ where $x_i$ is a |
| numerical vector of explanatory (feature) variables of size $\dim x_i = m$, and $y_i$ |
| is the response (dependent) variable observed for this vector. GLMs |
| assume that some linear combination of the features in $x_i$ determines |
| the *mean* $\mu_i$ of $y_i$, while the observed $y_i$ is a random |
| outcome of a noise distribution |
| $Prob[y\mid \mu_i]\,$[^2] |
| with that mean $\mu_i$: |
| |
| $$x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} |
| \,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]$$ |
| |
| In linear regression the response mean $\mu_i$ *equals* some linear |
| combination over $x_i$, denoted above by $\eta_i$. In logistic |
| regression with $$y\in\{0, 1\}$$ (Bernoulli) the mean of $y$ is the same |
| as $Prob[y=1]$ |
| and equals $1/(1+e^{-\eta_i})$, the logistic function of $\eta_i$. In |
| GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone |
| function called the *link function*: $\eta_i = g(\mu_i)$. The unknown |
| linear combination parameters $\beta_j$ are assumed to be the same for |
| all records. |
| |
| The goal of the regression is to estimate the parameters $\beta_j$ from |
| the observed data. Once the $\beta_j$’s are accurately estimated, we can |
| make predictions about $y$ for a new feature vector $x$. To do so, |
| compute $\eta$ from $x$ and use the inverted link function |
| $\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of $y$; then use the |
| distribution $Prob[y\mid \mu]$ |
| to make predictions about $y$. Both $g(\mu)$ and |
| $Prob[y\mid \mu]$ |
| are user-provided. Our GLM script supports a standard set of |
| distributions and link functions, see below for details. |
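
To make the chain $x \mapsto \eta \mapsto \mu$ concrete, here is a small sketch of the
prediction step for a few common links (the string names are for illustration only;
the script selects links via the integer codes listed under Arguments):

{% highlight python %}
import numpy as np

# Inverse link functions mu = g^{-1}(eta) for a few common choices
INV_LINK = {
    'identity': lambda eta: eta,
    'log':      lambda eta: np.exp(eta),
    'logit':    lambda eta: 1.0 / (1.0 + np.exp(-eta)),
}

def glm_mean(X, beta, beta0, link):
    """Compose x -> eta -> mu as in the display above."""
    return INV_LINK[link](X @ beta + beta0)
{% endhighlight %}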
| |
| |
| ### Usage |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM.dml |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| fmt=[format] |
| O=[file] |
| Log=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| yneg=[double] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| fmt=[format] |
| O=[file] |
| Log=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| yneg=[double] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| </div> |
| </div> |
| |
| ### Arguments |
| |
| **X**: Location (on HDFS) to read the matrix of feature vectors; each row |
| constitutes an example. |
| |
| **Y**: Location to read the response matrix, which may have 1 or 2 columns |
| |
| **B**: Location to store the estimated regression parameters (the $\beta_j$’s), |
| with the intercept parameter $\beta_0$ at position |
| B\[$m\,{+}\,1$, 1\] if available |
| |
| **fmt**: (default: `"text"`) Matrix file output format, such as `text`, |
| `mm`, or `csv`; see read/write functions in |
| SystemML Language Reference for details. |
| |
**O**: (default: `" "`) Location to write certain summary statistics described
in [**Table 9**](algorithms-regression.html#table9); by default they are printed
to standard output.
| |
| **Log**: (default: `" "`) Location to store iteration-specific variables for monitoring |
| and debugging purposes, see [**Table 10**](algorithms-regression.html#table10) for details. |
| |
| **dfam**: (default: `1`) Distribution family code to specify |
| $Prob[y\mid \mu]$, |
| see [**Table 11**](algorithms-regression.html#table11): |
| |
| * 1 = power distributions with $Var(y) = \mu^{\alpha}$ |
| * 2 = binomial or Bernoulli |
| |
| **vpow**: (default: `0.0`) When `dfam=1`, this provides the $q$ in |
| $Var(y) = a\mu^q$, the power |
| dependence of the variance of $y$ on its mean. In particular, use: |
| |
| * 0.0 = Gaussian |
| * 1.0 = Poisson |
| * 2.0 = Gamma |
| * 3.0 = inverse Gaussian |
| |
| **link**: (default: `0`) Link function code to determine the link |
| function $\eta = g(\mu)$: |
| |
| * 0 = canonical link (depends on the distribution family), |
| see [**Table 11**](algorithms-regression.html#table11) |
| * 1 = power functions |
| * 2 = logit |
| * 3 = probit |
| * 4 = cloglog |
| * 5 = cauchit |
| |
| **lpow**: (default: `1.0`) When link=1, this provides the $s$ in |
| $\eta = \mu^s$, the power link function; `lpow=0.0` gives the |
| log link $\eta = \log\mu$. Common power links: |
| |
| * -2.0 = $1/\mu^2$ |
| * -1.0 = reciprocal |
| * 0.0 = log |
| * 0.5 = sqrt |
| * 1.0 = identity |
| |
| **yneg**: (default: `0.0`) When `dfam=2` and the response matrix $Y$ has |
| 1 column, this specifies the $y$-value used for Bernoulli "No" label |
| ("Yes" is $1$): |
| 0.0 when $y\in\\{0, 1\\}$; -1.0 when |
| $y\in\\{-1, 1\\}$ |
| |
| **icpt**: (default: `0`) Intercept and shifting/rescaling of the features in $X$: |
| |
| * 0 = no intercept (hence no $\beta_0$), no |
| shifting/rescaling of the features |
| * 1 = add intercept, but do not shift/rescale the features |
| in $X$ |
| * 2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1 |
| |
| **reg**: (default: `0.0`) L2-regularization parameter ($\lambda$) |
| |
| **tol**: (default: `0.000001`) Tolerance ($\varepsilon$) used in the convergence criterion: we |
| terminate the outer iterations when the deviance changes by less than |
| this factor; see below for details |
| |
| **disp**: (default: `0.0`) Dispersion parameter, or 0.0 to estimate it from |
| data |
| |
| **moi**: (default: `200`) Maximum number of outer (Fisher scoring) iterations |
| |
| **mii**: (default: `0`) Maximum number of inner (conjugate gradient) iterations, or 0 |
| if no maximum limit provided |
| |
| |
| ### Examples |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| dfam=2 |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| reg=0.01 |
| tol=0.00000001 |
| disp=1.0 |
| moi=100 |
| mii=10 |
| O=/user/ml/stats.csv |
| Log=/user/ml/log.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| dfam=2 |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| reg=0.01 |
| tol=0.00000001 |
| disp=1.0 |
| moi=100 |
| mii=10 |
| O=/user/ml/stats.csv |
| Log=/user/ml/log.csv |
| </div> |
| </div> |
| |
| * * * |
| |
| <a name="table9" /> |
**Table 9**: Besides $\beta$, the GLM regression script computes a few summary
statistics listed below. They are provided in CSV format, one
comma-separated name-value pair per line.
| |
| | Name | Meaning | |
| | --------------------- | ------- | |
| | TERMINATION\_CODE | A positive integer indicating success/failure as follows: <br />1 = Converged successfully; 2 = Maximum \# of iterations reached; 3 = Input (X, Y) out of range; 4 = Distribution/link not supported |
| | BETA\_MIN | Smallest beta value (regression coefficient), excluding the intercept |
| | BETA\_MIN\_INDEX | Column index for the smallest beta value |
| | BETA\_MAX | Largest beta value (regression coefficient), excluding the intercept |
| | BETA\_MAX\_INDEX | Column index for the largest beta value |
| | INTERCEPT | Intercept value, or NaN if there is no intercept (if `icpt=0`) |
| | DISPERSION | Dispersion used to scale deviance, provided in disp input argument or estimated (same as DISPERSION\_EST) if disp argument is $\leq 0$ |
| | DISPERSION\_EST | Dispersion estimated from the dataset |
| | DEVIANCE\_UNSCALED | Deviance from the saturated model, assuming dispersion $= 1.0$ |
| | DEVIANCE\_SCALED | Deviance from the saturated model, scaled by DISPERSION value |
| |
| |
| * * * |
| |
| <a name="table10" /> |
**Table 10**: The `Log` file for GLM regression contains the following
iteration variables in CSV format, each line containing a triple (Name,
Iteration\#, Value) with Iteration\# being 0 for initial values.
| |
| | Name | Meaning | |
| | --------------------- | ------- | |
| | NUM\_CG\_ITERS | Number of inner (Conj. Gradient) iterations in this outer iteration |
| | IS\_TRUST\_REACHED | 1 = trust region boundary was reached, 0 = otherwise |
| | POINT\_STEP\_NORM | L2-norm of iteration step from old point ($\beta$-vector) to new point |
| | OBJECTIVE | The loss function we minimize (negative partial log-likelihood) |
| | OBJ\_DROP\_REAL | Reduction in the objective during this iteration, actual value |
| | OBJ\_DROP\_PRED | Reduction in the objective predicted by a quadratic approximation |
| | OBJ\_DROP\_RATIO | Actual-to-predicted reduction ratio, used to update the trust region |
| | GRADIENT\_NORM | L2-norm of the loss function gradient (omitted if point is rejected) |
| | LINEAR\_TERM\_MIN | The minimum value of $X$ %\*% $\beta$, used to check for overflows |
| | LINEAR\_TERM\_MAX | The maximum value of $X$ %\*% $\beta$, used to check for overflows |
| | IS\_POINT\_UPDATED | 1 = new point accepted; 0 = new point rejected, old point restored |
| | TRUST\_DELTA | Updated trust region size, the "delta" |
| |
| |
| * * * |
| |
| <a name="table11" /> |
| **Table 11**: Common GLM distribution families and link functions. (Here "\*" stands |
| for "any value.") |
| |
| | dfam | vpow | link | lpow | Distribution<br />Family | Link<br /> Function | Canonical | |
| | :--: | :--: | :--: | :--: | :------------: | :--------: | :----: | |
| | 1 | 0.0 | 1 | -1.0 | Gaussian | inverse | |
| | 1 | 0.0 | 1 | 0.0 | Gaussian | log | |
| | 1 | 0.0 | 1 | 1.0 | Gaussian | identity | Yes |
| | 1 | 1.0 | 1 | 0.0 | Poisson | log | Yes |
| | 1 | 1.0 | 1 | 0.5 | Poisson | sq.root | |
| | 1 | 1.0 | 1 | 1.0 | Poisson | identity | |
| | 1 | 2.0 | 1 | -1.0 | Gamma | inverse | Yes |
| | 1 | 2.0 | 1 | 0.0 | Gamma | log | |
| | 1 | 2.0 | 1 | 1.0 | Gamma | identity | |
| | 1 | 3.0 | 1 | -2.0 | Inverse Gauss | $1/\mu^2$ | Yes |
| | 1 | 3.0 | 1 | -1.0 | Inverse Gauss | inverse | |
| | 1 | 3.0 | 1 | 0.0 | Inverse Gauss | log | |
| | 1 | 3.0 | 1 | 1.0 | Inverse Gauss | identity | |
| | 2 | \* | 1 | 0.0 | Binomial | log | |
| | 2 | \* | 1 | 0.5 | Binomial | sq.root | |
| | 2 | \* | 2 | \* | Binomial | logit | Yes |
| | 2 | \* | 3 | \* | Binomial | probit | |
| | 2 | \* | 4 | \* | Binomial | cloglog | |
| | 2 | \* | 5 | \* | Binomial | cauchit | |
| |
| |
| * * * |
| |
| <a name="table12" /> |
| **Table 12**: The supported non-power link functions for the Bernoulli and the |
| binomial distributions. Here $\mu$ is the Bernoulli mean. |
| |
| | Name | Link Function | |
| | --------------------- | ------------- | |
| Logit                 | $\displaystyle \eta = \log\frac{\mu}{1 - \mu}^{\mathstrut}$
| | Probit | $$\displaystyle \mu = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} e^{-\frac{t^2}{2}} dt$$ |
| | Cloglog | $\displaystyle \eta = \log \big(- \log(1 - \mu)\big)^{\mathstrut}$ |
| Cauchit               | $\displaystyle \eta = \tan\big(\pi(\mu - 1/2)\big)$
| |
| |
| * * * |
| |
| |
| ### Details |
| |
| In GLM, the noise distribution |
| $Prob[y\mid \mu]$ |
| of the response variable $y$ given its mean $\mu$ is restricted to have |
| the *exponential family* form |
| |
| $$ |
| \begin{equation} |
| Y \sim\, Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a} |
| + c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = E(Y) = b'(\theta). |
| \end{equation} |
| $$ |
| |
| Changing the mean in such a distribution simply |
| multiplies all |
| $Prob[y\mid \mu]$ |
| by $e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so |
| that they again integrate to 1. Parameter $\theta$ is called |
| *canonical*, and the function $\theta = b'^{\,-1}(\mu)$ that relates it |
| to the mean is called the *canonical link*; constant $a$ is called |
| *dispersion* and rescales the variance of $y$. Many common distributions |
| can be put into this form, see [**Table 11**](algorithms-regression.html#table11). The canonical |
| parameter $\theta$ is often chosen to coincide with $\eta$, the linear |
| combination of the regression features; other choices for $\eta$ are |
| possible too. |
| |
| Rather than specifying the canonical link, GLM distributions are |
| commonly defined by their variance |
| $Var(y)$ as the function of |
the mean $\mu$. It can be shown from (6) that
| $Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$. |
| For example, for the Bernoulli distribution |
| $Var(y) = \mu(1-\mu)$, for the |
| Poisson distribution |
| $Var(y) = \mu$, and for the Gaussian distribution |
| $Var(y) = a\cdot 1 = \sigma^2$. |
| It turns out that for many common distributions |
| $Var(y) = a\mu^q$, a power |
| function. We support all distributions where |
| $Var(y) = a\mu^q$, as well as |
| the Bernoulli and the binomial distributions. |
| |
| For distributions with |
| $Var(y) = a\mu^q$ the |
| canonical link is also a power function, namely |
| $\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose |
| canonical link is $\theta = \log\mu$. We support all power link |
| functions in the form $\eta = \mu^s$, dropping any constant factor, with |
| $\eta = \log\mu$ for $s=0$. The binomial distribution has its own family |
| of link functions, which includes logit (the canonical link), probit, |
| cloglog, and cauchit (see [**Table 12**](algorithms-regression.html#table12)); we support |
| these only for the binomial and Bernoulli distributions. Links and |
| distributions are specified via four input parameters: |
| `dfam`, `vpow`, `link`, and |
| `lpow` (see [**Table 11**](algorithms-regression.html#table11)). |
| |
| The observed response values are provided to the regression script as a |
| matrix $Y$ having 1 or 2 columns. If a power distribution family is |
| selected (`dfam=1`), matrix $Y$ must have 1 column that |
| provides $y_i$ for each $x_i$ in the corresponding row of matrix $X$. |
When `dfam=2` and $Y$ has 1 column, we assume the Bernoulli
| distribution for $$y_i\in\{y_{\mathrm{neg}}, 1\}$$ with $y_{\mathrm{neg}}$ |
| from the input parameter `yneg`. When `dfam=2` and |
| $Y$ has 2 columns, we assume the binomial distribution; for each row $i$ |
| in $X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide the positive and the |
| negative binomial counts respectively. Internally we convert the |
| 1-column Bernoulli into the 2-column binomial with 0-versus-1 counts. |
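
For illustration, this recoding amounts to the following sketch (with `yneg` as
described above):

{% highlight python %}
import numpy as np

def bernoulli_to_binomial(y, yneg=0.0):
    """Recode a 1-column Bernoulli response into (positive, negative) counts."""
    pos = (y == 1.0).astype(float)
    neg = (y == yneg).astype(float)
    return np.column_stack([pos, neg])
{% endhighlight %}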
| |
| We estimate the regression parameters via L2-regularized negative |
| log-likelihood minimization: |
| |
| $$f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big) |
| \,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min$$ |
| |
| where |
| $\theta_i$ and $b(\theta_i)$ are from (6); note that $a$ and |
| $c(y, a)$ are constant w.r.t. $\beta$ and can be ignored here. The |
| canonical parameter $\theta_i$ depends on both $\beta$ and $x_i$: |
| |
| $$\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\, |
| \big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)$$ |
| |
| The user-provided (via `reg`) regularization coefficient |
| $\lambda\geq 0$ can be used to mitigate overfitting and degeneracy in |
| the data. Note that the intercept is never regularized. |
| |
| Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring |
| approximation to the difference |
| $\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$, |
| recomputed at each iteration: |
| |
| $$\begin{gathered} |
| \varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z, |
| \,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!diag(w) X \,+\, \lambda I\\ |
| \textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta, |
| \,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\ |
| \forall\,i = 1\ldots n: \,\,\,\, |
| w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1} |
| \!\!\!\!\!\!,\,\,\,\,\,\,\,\,\, |
| u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1} |
| \!\!\!\!\!\!.\,\,\,\,\end{gathered}$$ |
| |
| Here |
| $v(\mu_i)=Var(y_i)/a$, the |
| variance of $y_i$ as the function of the mean, and |
| $g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative. The |
| Fisher scoring approximation is minimized by trust-region conjugate |
| gradient iterations (called the *inner* iterations, with the Fisher |
| scoring iterations as the *outer* iterations), which approximately solve |
| the following problem: |
| |
| $$1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\, |
| \|z\|_2 \leq \delta$$ |
| |
| The conjugate gradient algorithm closely follows |
| Algorithm 7.2 on page 171 of [[Nocedal2006]](algorithms-bibliography.html). The trust region |
| size $\delta$ is initialized as |
| $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$ and updated as described |
| in [[Nocedal2006]](algorithms-bibliography.html). |
| The user can specify the maximum number of |
| the outer and the inner iterations with input parameters |
| `moi` and `mii`, respectively. The Fisher scoring |
| algorithm terminates successfully if |
| $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}{\varepsilon}$ |
| where ${\varepsilon}> 0$ is a tolerance supplied by the user via |
| `tol`, and $D_1(\beta)$ is the unit-dispersion deviance |
| estimated as |
| |
$$D_1(\beta) \,\,=\,\, 2 \cdot \big(\log Prob[Y \mid \!
\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
\,\,-\,\,\log Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)$$
| |
| The deviance estimate is also produced as part of the output. Once the |
| Fisher scoring algorithm terminates, if requested by the user, we |
| estimate the dispersion $a$ from (6) using Pearson residuals |
| |
| $$ |
| \begin{equation} |
| \hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)} |
| \end{equation} |
| $$ |
| |
| and use it to adjust our deviance estimate: |
$D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$. If input argument
`disp` is `0.0` we estimate $\hat{a}$, otherwise
we use its value as $a$. Note that in (7) $m$ counts
| the intercept ($m \leftarrow m+1$) if it is present. |
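
For intuition, the sketch below performs one such Fisher scoring step for the
Bernoulli/logit case, where the weights simplify nicely: $v(\mu) = \mu(1-\mu)$ and
$g'(\mu) = 1/(\mu(1-\mu))$, hence $w_i = \mu_i(1-\mu_i)$ and $u_i = y_i - \mu_i$. The
trust-region CG of `GLM.dml` is replaced by a plain dense solve, and the intercept's
zero penalty is omitted for brevity:

{% highlight python %}
import numpy as np

def fisher_step_logit(X, y, beta, reg=0.0):
    """One (untruncated) Fisher scoring step for the Bernoulli/logit GLM."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))     # current predicted means
    w = mu * (1.0 - mu)                        # w_i = [v(mu) g'(mu)^2]^{-1}
    u = y - mu                                 # u_i = (y_i - mu_i)[v(mu) g'(mu)]^{-1}
    A = X.T @ (w[:, None] * X) + reg * np.eye(X.shape[1])
    G = -X.T @ u + reg * beta
    return beta - np.linalg.solve(A, G)        # step z = -A^{-1} G added to beta
{% endhighlight %}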
| |
| |
| ### Returns |
| |
| The estimated regression parameters (the $\hat{\beta}_j$’s) are |
| populated into a matrix and written to an HDFS file whose path/name was |
| provided as the `B` input argument. What this matrix |
| contains, and its size, depends on the input argument `icpt`, |
| which specifies the user’s intercept and rescaling choice: |
| |
| * **icpt=0**: No intercept, matrix $B$ has size $m\,{\times}\,1$, with |
| $B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$. |
| * **icpt=1**: There is intercept, but no shifting/rescaling of $X$; matrix $B$ has |
| size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from |
| 1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept |
| coefficient. |
| * **icpt=2**: There is intercept, and the features in $X$ are shifted to mean${} = 0$ |
| and rescaled to variance${} = 1$; then there are two versions of |
| the $\hat{\beta}_j$’s, one for the original features and another for the |
| shifted/rescaled features. Now matrix $B$ has size |
| $(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and |
| $B[\cdot, 2]$ for the shifted/rescaled features, in the above format. |
| Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are |
| obtained from $B[\cdot, 2]$ by complementary shifting and rescaling. |
| |
| Our script also estimates the dispersion $\hat{a}$ (or takes it from the |
| user’s input) and the deviances $D_1(\hat{\beta})$ and |
| $D_{\hat{a}}(\hat{\beta})$, see [**Table 9**](algorithms-regression.html#table9) for details. A |
| log file with variables monitoring progress through the iterations can |
| also be made available, see [**Table 10**](algorithms-regression.html#table10). |
| |
| |
| ### See Also |
| |
| In case of binary classification problems, consider using L2-SVM or |
| binary logistic regression; for multiclass classification, use |
| multiclass SVM or multinomial logistic regression. For the special cases |
| of linear regression and logistic regression, it may be more efficient |
| to use the corresponding specialized scripts instead of GLM. |
| |
| |
| * * * |
| |
| ## 4.4. Stepwise Generalized Linear Regression |
| |
| ### Description |
| |
| Our stepwise generalized linear regression script selects a model based |
| on the Akaike information criterion (AIC): the model that gives rise to |
| the lowest AIC is provided. Note that currently only the Bernoulli |
| distribution family is supported (see below for details). |
| |
| |
| ### Usage |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f StepGLM.dml |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| S=[file] |
| O=[file] |
| link=[int] |
| yneg=[double] |
| icpt=[int] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| thr=[double] |
| fmt=[format] |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f StepGLM.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=<file> |
| Y=<file> |
| B=<file> |
| S=[file] |
| O=[file] |
| link=[int] |
| yneg=[double] |
| icpt=[int] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| thr=[double] |
| fmt=[format] |
| </div> |
| </div> |
| |
| |
| ### Arguments |
| |
| **X**: Location (on HDFS) to read the matrix of feature vectors; each row is an |
| example. |
| |
| **Y**: Location (on HDFS) to read the response matrix, which may have 1 or 2 |
| columns |
| |
| **B**: Location (on HDFS) to store the estimated regression parameters (the |
| $\beta_j$’s), with the intercept parameter $\beta_0$ at position |
| B\[$m\,{+}\,1$, 1\] if available |
| |
**S**: (default: `" "`) Location (on HDFS) to store the selected feature-ids in the
order as computed by the algorithm; by default they are forwarded to the standard output.
| |
**O**: (default: `" "`) Location (on HDFS) to write certain summary statistics
described in [**Table 9**](algorithms-regression.html#table9); by default they are
forwarded to the standard output.
| |
| **link**: (default: `2`) Link function code to determine the link |
| function $\eta = g(\mu)$, see [**Table 11**](algorithms-regression.html#table11); currently the |
| following link functions are supported: |
| |
| * 1 = log |
| * 2 = logit |
| * 3 = probit |
| * 4 = cloglog |
| |
| **yneg**: (default: `0.0`) Response value for Bernoulli "No" label, usually 0.0 or -1.0 |
| |
| **icpt**: (default: `0`) Intercept and shifting/rescaling of the features in $X$: |
| |
| * 0 = no intercept (hence no $\beta_0$), no |
| shifting/rescaling of the features |
| * 1 = add intercept, but do not shift/rescale the features |
| in $X$ |
| * 2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1 |
| |
**tol**: (default: `0.000001`) Tolerance ($\varepsilon$) used in the convergence criterion: we
| terminate the outer iterations when the deviance changes by less than |
| this factor; see below for details. |
| |
| **disp**: (default: `0.0`) Dispersion parameter, or `0.0` to estimate it from |
| data |
| |
| **moi**: (default: `200`) Maximum number of outer (Fisher scoring) iterations |
| |
| **mii**: (default: `0`) Maximum number of inner (conjugate gradient) iterations, or 0 |
| if no maximum limit provided |
| |
**thr**: (default: `0.01`) Threshold to stop the algorithm: if the decrease in the value
of the AIC falls below `thr`, no further features are
checked and the algorithm stops.
| |
| **fmt**: (default: `"text"`) Matrix file output format, such as `text`, |
| `mm`, or `csv`; see read/write functions in |
| SystemML Language Reference for details. |
| |
| |
| ### Examples |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f StepGLM.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| tol=0.000001 |
| moi=100 |
| mii=10 |
| thr=0.05 |
| fmt=csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f StepGLM.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| tol=0.000001 |
| moi=100 |
| mii=10 |
| thr=0.05 |
| fmt=csv |
| </div> |
| </div> |
| |
| |
| ### Details |
| |
Similar to `StepLinearRegDS.dml`, our stepwise GLM script
builds a model by iteratively selecting predictive variables using a
forward selection strategy based on the AIC (5). Note that
currently only the Bernoulli distribution family (`dfam=2` in
| [**Table 11**](algorithms-regression.html#table11)) together with the following link functions |
| are supported: `log`, `logit`, `probit`, and `cloglog` (link |
| $$\in\{1,2,3,4\}$$ in [**Table 11**](algorithms-regression.html#table11)). |
| |
| |
| ### Returns |
| |
Similar to the outputs from `GLM.dml`, the stepwise GLM script
computes the estimated regression coefficients and stores them in matrix
| $B$ on HDFS; matrix $B$ follows the same format as the one produced by |
| `GLM.dml` (see [Generalized Linear Models](algorithms-regression.html#generalized-linear-models)). Additionally, |
| `StepGLM.dml` outputs the variable indices (stored in the |
| 1-column matrix $S$) in the order they have been selected by the |
algorithm, i.e., the $i$th entry in matrix $S$ stores the variable which
improves the AIC the most in the $i$th iteration. If the model with the
lowest AIC includes no variables, matrix $S$ will be empty. Moreover, the
| estimated summary statistics as defined in [**Table 9**](algorithms-regression.html#table9) are |
| printed out or stored in a file on HDFS (if requested); these statistics |
| will be provided only if the selected model is nonempty, i.e., contains |
| at least one variable. |
| |
| |
| * * * |
| |
| ## 4.5. Regression Scoring and Prediction |
| |
| ### Description |
| |
| Script `GLM-predict.dml` is intended to cover all linear |
| model based regressions, including linear regression, binomial and |
| multinomial logistic regression, and GLM regressions (Poisson, gamma, |
| binomial with probit link etc.). Having just one scoring script for all |
| these regressions simplifies maintenance and enhancement while ensuring |
| compatible interpretations for output statistics. |
| |
| The script performs two functions, prediction and scoring. To perform |
| prediction, the script takes two matrix inputs: a collection of records |
| $X$ (without the response attribute) and the estimated regression |
| parameters $B$, also known as $\beta$. To perform scoring, in addition |
| to $X$ and $B$, the script takes the matrix of actual response |
| values $Y$ that are compared to the predictions made with $X$ and $B$. |
| Of course there are other, non-matrix, input arguments that specify the |
| model and the output format, see below for the full list. |
| |
| We assume that our test/scoring dataset is given by |
| $n\,{\times}\,m$-matrix $X$ of numerical feature vectors, where each |
| row $x_i$ represents one feature vector of one record; we have $\dim x_i = m$. Each |
| record also includes the response variable $y_i$ that may be numerical, |
| single-label categorical, or multi-label categorical. A single-label |
| categorical $y_i$ is an integer category label, one label per record; a |
| multi-label $y_i$ is a vector of integer counts, one count for each |
| possible label, which represents multiple single-label events |
| (observations) for the same $x_i$. Internally we convert single-label |
| categoricals into multi-label categoricals by replacing each label $l$ |
| with an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$. In |
prediction-only tasks the actual $y_i$’s are not needed by the script,
| but they are needed for scoring. |
| |
| To perform prediction, the script matrix-multiplies $X$ and $B$, adding |
| the intercept if available, then applies the inverse of the model’s link |
| function. All GLMs assume that the linear combination of the features |
| in $x_i$ and the betas in $B$ determines the means $\mu_i$ of |
| the $y_i$’s (in numerical or multi-label categorical form) with |
| $\dim \mu_i = \dim y_i$. The observed $y_i$ is assumed to follow a |
| specified GLM family distribution |
| $Prob[y\mid \mu_i]$ |
| with mean(s) $\mu_i$: |
| |
| $$x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} |
| \,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]$$ |
| |
| If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number. Then |
| our script’s output matrix $M$ is the $n\,{\times}\,1$-vector of these |
| means $\mu_i$. Note that $\mu_i$ predicts the mean of $y_i$, not the |
| actual $y_i$. For example, in the Poisson distribution the mean is usually
| fractional, but the actual $y_i$ is always an integer.
| |
| If $y_i$ is categorical, i.e. a vector of label counts for record $i$, |
| then $\mu_i$ is a vector of non-negative real numbers, one number
| $$\mu_{i,l}$$ per label $l$. In this case we divide the $$\mu_{i,l}$$
| by their sum $\sum_l \mu_{i,l}$ to obtain predicted label |
| probabilities $$p_{i,l}\in [0, 1]$$. The output matrix $M$ is the |
| $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the |
| number of records and $k\,{+}\,1$ is the number of categories[^3]. Note |
| again that we do not predict the labels themselves, nor their actual |
| counts per record, but we predict the labels’ probabilities. |
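| 
| To make this mapping concrete, here is a minimal NumPy sketch of the same computation (an illustrative reimplementation, not the DML code itself; the function names and the choice of the log link for the numerical case are assumptions):
| 
|     import numpy as np
| 
|     def predict_means(X, B, inv_link=np.exp, intercept=True):
|         """Numerical GLM prediction: eta = X beta (+ intercept), mu = g^{-1}(eta)."""
|         n = X.shape[0]
|         if intercept:                       # intercept stored as the last row of B
|             X = np.hstack([X, np.ones((n, 1))])
|         eta = X @ B[:, 0]                   # linear predictor from the first B-column
|         return inv_link(eta)                # e.g. np.exp inverts the log link
| 
|     def predict_multinomial_probs(X, B, intercept=True):
|         """Multinomial logit: B has k non-baseline columns; the (k+1)-st
|         category is the baseline with linear predictor fixed at zero."""
|         n = X.shape[0]
|         if intercept:
|             X = np.hstack([X, np.ones((n, 1))])
|         eta = X @ B                                     # n x k linear predictors
|         mu = np.hstack([np.exp(eta), np.ones((n, 1))])  # baseline column of ones
|         return mu / mu.sum(axis=1, keepdims=True)       # normalize to probabilities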
| |
| Going from predicted probabilities to predicted labels, in the |
| single-label categorical case, requires extra information such as the |
| cost of false positive versus false negative errors. For example, if |
| there are 5 categories and we *accurately* predicted their probabilities |
| as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the highest-probability |
| label would be wrong 70% of the time, whereas picking the |
| lowest-probability label might be right if, say, it represents a |
| diagnosis of cancer or another rare and serious outcome. Hence, we keep |
| this step outside the scope of `GLM-predict.dml` for now. |
| |
| |
| ### Usage |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs X=<file> |
| Y=[file] |
| B=<file> |
| M=[file] |
| O=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| disp=[double] |
| fmt=[format] |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=<file> |
| Y=[file] |
| B=<file> |
| M=[file] |
| O=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| disp=[double] |
| fmt=[format] |
| </div> |
| </div> |
| |
| |
| ### Arguments |
| |
| **X**: Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature
| vectors; each row constitutes one feature vector (one record)
| |
| **Y**: (default: `" "`) Location to read the response matrix $Y$ needed for scoring |
| (but optional for prediction), with the following dimensions: |
| |
| * $n {\times} 1$: acceptable for all distributions |
| (`dfam=1` or `2` or `3`) |
| * $n {\times} 2$: for binomial (`dfam=2`) if given by |
| (\#pos, \#neg) counts |
| * $n {\times} (k\,{+}\,1)$: for multinomial (`dfam=3`) if
| given by category counts
| |
| **M**: (default: `" "`) Location to write, if requested, the matrix of predicted |
| response means (for `dfam=1`) or probabilities (for |
| `dfam=2` or `3`): |
| |
| * $n {\times} 1$: for power-type distributions (`dfam=1`) |
| * $n {\times} 2$: for binomial distribution (`dfam=2`), |
| col\# 2 is the "No" probability |
| * $n {\times} (k\,{+}\,1)$: for multinomial logit (`dfam=3`),
| col\# $k\,{+}\,1$ is for the baseline
| |
| **B**: Location to read matrix $B$ of the betas, i.e. estimated GLM regression |
| parameters, with the intercept at row\# $m\,{+}\,1$ if available: |
| |
| * $\dim(B) \,=\, m {\times} k$: do not add intercept |
| * $\dim(B) \,=\, (m\,{+}\,1) {\times} k$: add intercept as given by the
| last $B$-row
| * if $k > 1$, use only $B[, 1]$ unless the model is multinomial logit
| (`dfam=3`)
| |
| **O**: (default: `" "`) Location to store the CSV file with goodness-of-fit
| statistics defined in [**Table 13**](algorithms-regression.html#table13);
| by default they are printed to the standard output
| |
| **dfam**: (default: `1`) GLM distribution family code to specify the type of |
| distribution |
| $Prob[y\,|\,\mu]$ |
| that we assume: |
| |
| * 1 = power distributions with |
| $Var(y) = \mu^{\alpha}$, see |
| [**Table 11**](algorithms-regression.html#table11) |
| * 2 = binomial |
| * 3 = multinomial logit |
| |
| **vpow**: (default: `0.0`) Power for variance defined as (mean)$^{\textrm{power}}$ |
| (ignored if `dfam`$\,{\neq}\,1$): when `dfam=1`, |
| this provides the $q$ in |
| $Var(y) = a\mu^q$, the power |
| dependence of the variance of $y$ on its mean. In particular, use: |
| |
| * 0.0 = Gaussian |
| * 1.0 = Poisson |
| * 2.0 = Gamma |
| * 3.0 = inverse Gaussian |
| |
| **link**: (default: `0`) Link function code to determine the link |
| function $\eta = g(\mu)$, ignored for multinomial logit |
| (`dfam=3`): |
| |
| * 0 = canonical link (depends on the distribution family), |
| see [**Table 11**](algorithms-regression.html#table11) |
| * 1 = power functions |
| * 2 = logit |
| * 3 = probit |
| * 4 = cloglog |
| * 5 = cauchit |
| |
| **lpow**: (default: `1.0`) Power for link function defined as |
| (mean)$^{\textrm{power}}$ (ignored if `link`$\,{\neq}\,1$): |
| when `link=1`, this provides the $s$ in $\eta = \mu^s$, the |
| power link function; `lpow=0.0` gives the log link |
| $\eta = \log\mu$. Common power links: |
| |
| * -2.0 = $1/\mu^2$ |
| * -1.0 = reciprocal |
| * 0.0 = log |
| * 0.5 = sqrt |
| * 1.0 = identity |
| |
| **disp**: (default: `1.0`) Dispersion value, when available; must be positive |
| |
| **fmt**: (default: `"text"`) Matrix M file output format, such as |
| `text`, `mm`, or `csv`; see read/write |
| functions in SystemML Language Reference for details. |
| |
| |
| ### Examples |
| |
| Note that in the examples below the value for the `disp` input |
| argument is set arbitrarily. The correct dispersion value should be |
| computed from the training data during model estimation, or omitted if |
| unknown (which sets it to `1.0`). |
| |
| **Linear regression example**: |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| disp=5.67 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| disp=5.67 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| </div> |
| |
| **Linear regression example, prediction only (no Y given)**: |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| </div> |
| </div> |
| |
| **Binomial logistic regression example**: |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=2 |
| link=2 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=2 |
| link=2 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| </div> |
| |
| **Binomial probit regression example**: |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=2 |
| link=3 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=2 |
| link=3 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| </div> |
| |
| **Multinomial logistic regression example**: |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=3 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=3 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| </div> |
| |
| **Poisson regression with the log link example**: |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=1.0 |
| link=1 |
| lpow=0.0 |
| disp=3.45 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=1.0 |
| link=1 |
| lpow=0.0 |
| disp=3.45 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| </div> |
| |
| **Gamma regression with the inverse (reciprocal) link example**: |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop" markdown="1"> |
| hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=2.0 |
| link=1 |
| lpow=-1.0 |
| disp=1.99118 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| <div data-lang="Spark" markdown="1"> |
| $SPARK_HOME/bin/spark-submit --master yarn-cluster |
| --conf spark.driver.maxResultSize=0 |
| --conf spark.akka.frameSize=128 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config=SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=2.0 |
| link=1 |
| lpow=-1.0 |
| disp=1.99118 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </div> |
| </div> |
| |
| * * * |
| |
| <a name="table13" /> |
| **Table 13**: The goodness-of-fit statistics are provided in CSV format, one per line, with four
| columns: (Name, CID, Disp?, Value). The columns are:
| "Name" is the string identifier for the statistic; |
| "CID" is an optional integer value that specifies the $Y$-column index for per-column statistics |
| (note that a bi/multinomial one-column Y-input is converted into multi-column); |
| "Disp?" is an optional Boolean value ($TRUE$ or $FALSE$) that tells us |
| whether or not scaling by the input dispersion parameter `disp` has been applied to this |
| statistic; |
| "Value" is the value of the statistic. |
| |
| | Name | CID | Disp? | Meaning | |
| | -------------------- | :-: | :---: | ------- | |
| | LOGLHOOD\_Z | | + | Log-likelihood $Z$-score (in st. dev.'s from the mean) | |
| | LOGLHOOD\_Z\_PVAL | | + | Log-likelihood $Z$-score p-value, two-sided | |
| | PEARSON\_X2 | | + | Pearson residual $X^2$-statistic | |
| | PEARSON\_X2\_BY\_DF | | + | Pearson $X^2$ divided by degrees of freedom | |
| | PEARSON\_X2\_PVAL | | + | Pearson $X^2$ p-value | |
| | DEVIANCE\_G2 | | + | Deviance from the saturated model $G^2$-statistic | |
| | DEVIANCE\_G2\_BY\_DF | | + | Deviance $G^2$ divided by degrees of freedom | |
| | DEVIANCE\_G2\_PVAL | | + | Deviance $G^2$ p-value | |
| | AVG\_TOT\_Y | + | | $Y$-column average for an individual response value | |
| | STDEV\_TOT\_Y | + | | $Y$-column st. dev. for an individual response value | |
| | AVG\_RES\_Y | + | | $Y$-column residual average of $Y - pred. mean(Y\\|X)$ | |
| | STDEV\_RES\_Y | + | | $Y$-column residual st. dev. of $Y - pred. mean(Y\\|X)$ | |
| | PRED\_STDEV\_RES | + | + | Model-predicted $Y$-column residual st. deviation| |
| | PLAIN\_R2 | + | | Plain $R^2$ of $Y$-column residual with bias included | |
| | ADJUSTED\_R2 | + | | Adjusted $R^2$ of $Y$-column residual w. bias included | |
| | PLAIN\_R2\_NOBIAS | + | | Plain $R^2$ of $Y$-column residual, bias subtracted | |
| | ADJUSTED\_R2\_NOBIAS | + | | Adjusted $R^2$ of $Y$-column residual, bias subtracted | |
| |
| * * * |
| |
| |
| ### Details |
| |
| The output matrix $M$ of predicted means (or probabilities) is computed |
| by matrix-multiplying $X$ with the first column of $B$ or with the |
| whole $B$ in the multinomial case, adding the intercept if available |
| (conceptually, appending an extra column of ones to $X$); then applying |
| the inverse of the model’s link function. The difference between "means" |
| and "probabilities" in the categorical case becomes significant when |
| there are ${\geq}\,2$ observations per record (with the multi-label |
| records) or when the labels such as $-1$ and $1$ are viewed and averaged |
| as numerical response values (with the single-label records). To avoid
| any information loss, we separately return the predicted probability
| of each category label for each record.
| |
| When the "actual" response values $Y$ are available, the summary |
| statistics are computed and written out as described in |
| [**Table 13**](algorithms-regression.html#table13). Below we discuss each of these statistics |
| in detail. Note that in the categorical case (binomial and multinomial) |
| $Y$ is internally represented as the matrix of observation counts for |
| each label in each record, rather than just the label ID for each |
| record. The input $Y$ may already be a matrix of counts, in which case |
| it is used as-is. But if $Y$ is given as a vector of response labels, |
| each response label is converted into an indicator vector |
| $(0,\ldots,0,1_l,0,\ldots,0)$ where $l$ is the label ID for this record. |
| All negative (e.g. $-1$) or zero label IDs are converted to the
| maximum label ID plus one. The largest label ID is viewed as the
| "baseline" as explained in the section on Multinomial Logistic |
| Regression. We assume that there are $k\geq 1$ non-baseline categories |
| and one (last) baseline category. |
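| 
| A minimal sketch of this conversion (illustrative NumPy, assuming the baseline is the largest label ID):
| 
|     import numpy as np
| 
|     def labels_to_counts(y, k_plus_1):
|         """Convert a vector of single-label category IDs into an
|         n x (k+1) indicator-count matrix; non-positive label IDs
|         are mapped to the baseline (largest) label ID."""
|         y = np.asarray(y, dtype=int).copy()
|         y[y <= 0] = k_plus_1                 # e.g. -1 or 0 becomes the baseline
|         Y = np.zeros((len(y), k_plus_1))
|         Y[np.arange(len(y)), y - 1] = 1.0    # one observation per record
|         return Y
| 
|     # labels 1..3, with -1 mapped to the baseline category 3:
|     print(labels_to_counts([1, 2, -1, 3], k_plus_1=3))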
| |
| We also estimate residual variances for each response value; these are
| not part of the output, but are used inside the summary statistics, both
| scaled and unscaled by the input dispersion parameter `disp`,
| as described below.
| |
| `LOGLHOOD_Z` and `LOGLHOOD_Z_PVAL` statistics measure how far the |
| log-likelihood of $Y$ deviates from its expected value according to the |
| model. The script implements them only for the binomial and the |
| multinomial distributions, returning `NaN` for all other distributions. |
| Pearson’s $X^2$ and deviance $G^2$ often perform poorly with bi- and |
| multinomial distributions due to low cell counts; hence we need this
| extra goodness-of-fit measure. To compute these statistics, we use:
| |
| * the $n\times (k\,{+}\,1)$-matrix $Y$ of multi-label response counts, in |
| which $y_{i,j}$ is the number of times label $j$ was observed in |
| record $i$ |
| * the model-estimated probability matrix $P$ of the same dimensions that |
| satisfies $$\sum_{j=1}^{k+1} p_{i,j} = 1$$ for all $i=1,\ldots,n$ and |
| where $p_{i,j}$ is the model probability of observing label $j$ in |
| record $i$ |
| * the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of |
| observations in record $i$ (all $N_i = 1$ if each record has only one |
| response label) |
| |
| We start by computing the multinomial log-likelihood of $Y$ given $P$ |
| and $N$, as well as the expected log-likelihood given a random $Y$ and |
| the variance of this log-likelihood if $Y$ indeed follows the proposed |
| distribution: |
| |
| $$ |
| \begin{aligned} |
| \ell (Y) \,\,&=\,\, \log Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1} \,y_{i,j}\log p_{i,j} \\ |
| E_Y \ell (Y) \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j} |
| \,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\ |
| Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2 |
| - \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right) |
| \end{aligned} |
| $$ |
| |
| |
| Then we compute the $Z$-score as the difference between the actual and |
| the expected log-likelihood $\ell(Y)$ divided by its expected standard |
| deviation, and its two-sided p-value in the Normal distribution |
| assumption ($\ell(Y)$ should approach normality due to the Central Limit |
| Theorem): |
| |
| $$ |
| Z \,=\, \frac {\ell(Y) - E_Y \ell(Y)}{\sqrt{Var_Y \ell(Y)}};\quad |
| \mathop{\textrm{p-value}}(Z) \,=\, Prob\Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big] |
| $$ |
| |
| A low p-value would indicate "underfitting" if $Z\ll 0$ or "overfitting" |
| if $Z\gg 0$. Here "overfitting" means that higher-probability labels |
| occur more often than their probabilities suggest. |
| |
| We also apply the dispersion input (`disp`) to compute the |
| "scaled" version of the $Z$-score and its p-value. Since $\ell(Y)$ is a |
| linear function of $Y$, multiplying the GLM-predicted variance of $Y$ by |
| `disp` results in multiplying |
| $Var_Y \ell(Y)$ by the same |
| `disp`. This, in turn, translates into dividing the $Z$-score |
| by the square root of the dispersion: |
| |
| $$Z_{\texttt{disp}} \,=\, \big(\ell(Y) \,-\, E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot Var_Y \ell(Y)} |
| \,=\, Z / \sqrt{\texttt{disp}}$$ |
| |
| Finally, we recalculate the p-value |
| with this new $Z$-score. |
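| 
| The whole computation is a few lines of NumPy/SciPy (an illustrative transcription of the formulas above, not the DML code itself):
| 
|     import numpy as np
|     from scipy.stats import norm
| 
|     def loglhood_z(Y, P, N, disp=1.0):
|         """Z-score and two-sided p-value of the multinomial log-likelihood of
|         count matrix Y given probability matrix P (both n x (k+1)) and the
|         n-vector N of per-record observation counts."""
|         logP = np.log(P)
|         ll     = np.sum(Y * logP)                        # actual log-likelihood
|         e_ll   = np.sum(N * np.sum(P * logP, axis=1))    # its expected value
|         var_ll = np.sum(N * (np.sum(P * logP**2, axis=1)
|                              - np.sum(P * logP, axis=1)**2))
|         z = (ll - e_ll) / np.sqrt(disp * var_ll)         # disp = 1.0 gives plain Z
|         return z, 2.0 * norm.sf(abs(z))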
| |
| `PEARSON_X2`, `PEARSON_X2_BY_DF`, and `PEARSON_X2_PVAL`: |
| Pearson’s residual $X^2$-statistic is a commonly used goodness-of-fit |
| measure for linear models [[McCullagh1989]](algorithms-bibliography.html). |
| The idea is to measure how |
| well the model-predicted means and variances match the actual behavior |
| of response values. For each record $i$, we estimate the mean $\mu_i$ |
| and the variance $v_i$ (or `disp` $\cdot v_i$) and use them to |
| normalize the residual: $r_i = (y_i - \mu_i) / \sqrt{v_i}$. These |
| normalized residuals are then squared, aggregated by summation, and |
| tested against an appropriate $\chi^2$ distribution. The computation |
| of $X^2$ is slightly different for categorical data (bi- and |
| multinomial) than it is for numerical data, since $y_i$ has multiple |
| correlated dimensions [[McCullagh1989]](algorithms-bibliography.html): |
| |
| $$X^2\,\textrm{(numer.)} \,=\, \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad |
| X^2\,\textrm{(categ.)} \,=\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i |
| \hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}$$ |
| |
| The number of |
| degrees of freedom \#d.f. for the $\chi^2$ distribution is $n - m$ for |
| numerical data and $(n - m)k$ for categorical data, where |
| $k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion parameter |
| `disp`, the $X^2$ statistic is scaled by division: $$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$$. If the
| dispersion is accurate, $X^2 / \texttt{disp}$ should be close to \#d.f. |
| In fact, $X^2 / \textrm{\#d.f.}$ over the *training* data is the |
| dispersion estimator used in our `GLM.dml` script, |
| see (7). Here we provide $X^2 / \textrm{\#d.f.}$ and |
| $X^2_{\texttt{disp}} / \textrm{\#d.f.}$ as |
| `PEARSON_X2_BY_DF` to enable dispersion comparison between |
| the training data and the test data. |
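| 
| For the categorical case this statistic can be sketched as follows (illustrative NumPy/SciPy under the definitions above):
| 
|     import numpy as np
|     from scipy.stats import chi2
| 
|     def pearson_x2_categorical(Y, P, N, m, disp=1.0):
|         """Pearson X^2 for the n x (k+1) count matrix Y, probability matrix P,
|         per-record totals N, and m features; returns the disp-scaled X^2,
|         X^2 per degree of freedom, and the chi-square p-value."""
|         n, k_plus_1 = Y.shape
|         mu = N[:, None] * P                    # predicted counts N_i * p_{i,j}
|         x2 = np.sum((Y - mu)**2 / mu) / disp   # scaled by the dispersion
|         df = (n - m) * (k_plus_1 - 1)          # (n - m) * k degrees of freedom
|         return x2, x2 / df, chi2.sf(x2, df)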
| |
| NOTE: For categorical data, both Pearson’s $X^2$ and the deviance $G^2$ |
| are unreliable (i.e. do not approach the $\chi^2$ distribution) unless |
| the predicted means of multi-label counts |
| $$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$$ are fairly large: all |
| ${\geq}\,1$ and 80% are at least $5$ [[Cochran1954]](algorithms-bibliography.html). They should not |
| be used for "one label per record" categoricals. |
| |
| `DEVIANCE_G2`, `DEVIANCE_G2_BY_DF`, and |
| `DEVIANCE_G2_PVAL`: Deviance $G^2$ is the log of the |
| likelihood ratio between the "saturated" model and the linear model |
| being tested for the given dataset, multiplied by two: |
| |
| $$ |
| \begin{equation} |
| G^2 \,=\, 2 \,\log \frac{Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}{Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]} |
| \end{equation} |
| $$ |
| |
| The "saturated" model sets the mean |
| $\mu_i^{\mathrm{sat}}$ to equal $y_i$ for every record (for categorical |
| data, $$p_{i,j}^{sat} = y_{i,j} / N_i$$), which represents the |
| "perfect fit." For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at |
| a boundary, by continuity we set $0 \log 0 = 0$. The GLM likelihood |
| functions defined in (6) become simplified in |
| ratio (8) due to canceling out the term $c(y, a)$ |
| since it is the same in both models. |
| |
| The log of a likelihood ratio between two nested models, times two, is |
| known to approach a $\chi^2$ distribution as $n\to\infty$ if both models |
| have fixed parameter spaces. But this is not the case for the |
| "saturated" model: it adds more parameters with each record. In |
| practice, however, $\chi^2$ distributions are used to compute the |
| p-value of $G^2$ [[McCullagh1989]](algorithms-bibliography.html). The number of degrees of |
| freedom \#d.f. and the treatment of dispersion are the same as for |
| Pearson’s $X^2$, see above. |
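| 
| Again in the categorical case, a short illustrative sketch that applies the $0 \log 0 = 0$ convention:
| 
|     import numpy as np
|     from scipy.stats import chi2
| 
|     def deviance_g2_categorical(Y, P, N, m, disp=1.0):
|         """Deviance G^2 = 2 log [Prob(Y | saturated) / Prob(Y | model)],
|         where the saturated model sets p_sat = y / N for every record."""
|         p_sat = Y / N[:, None]
|         with np.errstate(divide="ignore", invalid="ignore"):
|             terms = np.where(Y > 0, Y * np.log(p_sat / P), 0.0)  # 0 log 0 = 0
|         g2 = 2.0 * np.sum(terms) / disp
|         df = (Y.shape[0] - m) * (Y.shape[1] - 1)   # same #d.f. as Pearson's X^2
|         return g2, g2 / df, chi2.sf(g2, df)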
| |
| |
| ### Column-Wise Statistics |
| |
| The rest of the statistics are computed separately for each column |
| of $Y$. As explained above, $Y$ has two or more columns in the bi- and
| multinomial case, either at input or after conversion. Moreover, each
| $$y_{i,j}$$ in record $i$ with $N_i \geq 2$ is counted as $N_i$ separate |
| observations $$y_{i,j,l}$$ of 0 or 1 (where $l=1,\ldots,N_i$) with |
| $$y_{i,j}$$ ones and $$N_i-y_{i,j}$$ zeros. For power distributions, |
| including linear regression, $Y$ has only one column and all $N_i = 1$, |
| so the statistics are computed for all $Y$ with each record counted |
| once. Below we denote $$N = \sum_{i=1}^n N_i \,\geq\, n$$. Here are the total
| average and the residual average (residual bias) of $y_{i,j,l}$ for each |
| $Y$-column: |
| |
| $$\texttt{AVG_TOT_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n y_{i,j}; \quad |
| \texttt{AVG_RES_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})$$ |
| |
| Dividing by $N$ (rather than $n$) gives the averages for $$y_{i,j,l}$$ |
| (rather than $$y_{i,j}$$). The total variance and standard deviation
| for individual observations $$y_{i,j,l}$$ are estimated from the total
| variance of the response values $$y_{i,j}$$ using the independence assumption:
| $$Var \,y_{i,j} = Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} Var \,y_{i,j,l}$$.
| This allows us to estimate the sum of squares for $y_{i,j,l}$ via the |
| sum of squares for $$y_{i,j}$$: |
| |
| $$\texttt{STDEV_TOT_Y}_j \,=\, |
| \Bigg[\frac{1}{N-1} \sum_{i=1}^n \Big( y_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n y_{i'\!,j}\Big)^2\Bigg]^{1/2}$$ |
| |
| Analogously, we estimate the standard deviation of the residual |
| $$y_{i,j,l} - \mu_{i,j,l}$$: |
| |
| $$\texttt{STDEV_RES_Y}_j \,=\, |
| \Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n \Big( y_{i,j} - \mu_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}$$ |
| |
| Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if |
| it does not. The estimated standard deviations can be compared to the |
| model-predicted residual standard deviation computed from the predicted |
| means by the GLM variance formula and scaled by the dispersion: |
| |
| $$\texttt{PRED_STDEV_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}$$ |
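| 
| The four column-wise averages and standard deviations above reduce to a few NumPy lines (illustrative only; `Mu` denotes the matrix of predicted means and `m_prime` stands for $m'$):
| 
|     import numpy as np
| 
|     def column_stats(Y, Mu, N, m_prime):
|         """AVG_TOT_Y, AVG_RES_Y, STDEV_TOT_Y, STDEV_RES_Y per Y-column,
|         treating record i as N_i individual observations y_{i,j,l}."""
|         N_tot = N.sum()
|         res = Y - Mu
|         avg_tot = Y.sum(axis=0) / N_tot
|         avg_res = res.sum(axis=0) / N_tot
|         # center each column by (N_i / N) times its column sum
|         ctr_tot = Y   - np.outer(N, Y.sum(axis=0))   / N_tot
|         ctr_res = res - np.outer(N, res.sum(axis=0)) / N_tot
|         sd_tot = np.sqrt((ctr_tot**2).sum(axis=0) / (N_tot - 1))
|         sd_res = np.sqrt((ctr_res**2).sum(axis=0) / (N_tot - m_prime))
|         return avg_tot, avg_res, sd_tot, sd_res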
| |
| We also compute the $R^2$ statistics for each column of $Y$, see |
| [**Table 14**](algorithms-regression.html#table14) and [**Table 15**](algorithms-regression.html#table15) for details. We compute two versions |
| of $R^2$: in one version the residual sum-of-squares (RSS) includes any |
| bias in the residual that might be present (due to the lack of, or |
| inaccuracy in, the intercept); in the other version of RSS the bias is |
| subtracted by "centering" the residual. In both cases we subtract the |
| bias in the total sum-of-squares (in the denominator), and $m'$ equals |
| $m$ with the intercept or $m+1$ without the intercept. |
| |
| |
| * * * |
| |
| <a name="table14" /> |
| **Table 14**: $R^2$ where the residual sum-of-squares includes the bias contribution. |
| |
| | Statistic | Formula | |
| | --------------------- | ------------- | |
| | $\texttt{PLAIN_R2}_j$ | $$ \displaystyle 1 - \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $$ |
| | $\texttt{ADJUSTED_R2}_j$ | $$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}} \, \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $$ |
| |
| |
| * * * |
| |
| <a name="table15" /> |
| **Table 15**: $R^2$ where the residual sum-of-squares is centered so that the bias is subtracted. |
| |
| | Statistic | Formula | |
| | --------------------- | ------------- | |
| | $\texttt{PLAIN_R2_NOBIAS}_j$ | $$ \displaystyle 1 - \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $$ |
| | $\texttt{ADJUSTED_R2_NOBIAS}_j$ | $$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $$ |
| |
| |
| * * * |
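| 
| A compact sketch of all four $R^2$ statistics from the two tables above (illustrative NumPy for a single $Y$-column; `m` and `m_prime` follow the tables' $m$ and $m'$):
| 
|     import numpy as np
| 
|     def r2_stats(y, mu, N, m, m_prime):
|         """PLAIN_R2 / ADJUSTED_R2 (bias included in the residual sum of
|         squares) and the _NOBIAS variants (bias subtracted)."""
|         N_tot = N.sum()
|         tss = np.sum((y - N * y.sum() / N_tot)**2)     # total sum of squares
|         res = y - mu
|         rss_bias   = np.sum(res**2)                    # bias included
|         rss_nobias = np.sum((res - N * res.sum() / N_tot)**2)
|         return {
|             "PLAIN_R2":           1 - rss_bias / tss,
|             "ADJUSTED_R2":        1 - (N_tot - 1) / (N_tot - m) * rss_bias / tss,
|             "PLAIN_R2_NOBIAS":    1 - rss_nobias / tss,
|             "ADJUSTED_R2_NOBIAS": 1 - (N_tot - 1) / (N_tot - m_prime) * rss_nobias / tss,
|         }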
| |
| |
| ### Returns |
| |
| The matrix of predicted means (if the response is numerical) or |
| probabilities (if the response is categorical); see the Description
| subsection above for more information. If the response matrix `Y` is
| provided, goodness-of-fit statistics are returned in CSV format as described
| in [**Table 13**](algorithms-regression.html#table13) and in the above text.
| |
| |
| * * * |
| |
| |
| [^1]: Smaller likelihood difference between two models suggests less |
| statistical evidence to pick one model over the other. |
| |
| [^2]: $Prob[y\mid \mu_i]$ |
| is given by a density function if $y$ is continuous. |
| |
| [^3]: We use $k+1$ because there are $k$ non-baseline categories and one |
| baseline category, with regression parameters $B$ having |
| $k$ columns. |