| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \subsection{Multinomial Logistic Regression} |
| |
| \noindent{\bf Description} |
| \smallskip |
| |
| Our logistic regression script performs both binomial and multinomial logistic regression. |
| The script is given a dataset $(X, Y)$ where matrix $X$ has $m$~columns and matrix $Y$ has |
| one column; both $X$ and~$Y$ have $n$~rows. The rows of $X$ and~$Y$ are viewed as a collection |
| of records: $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of explanatory |
| (feature) variables and $y_i$ is a categorical response variable. |
| Each row~$x_i$ in~$X$ has size~\mbox{$\dim x_i = m$}, while its corresponding $y_i$ is an |
| integer that represents the observed response value for record~$i$. |
| |
The goal of logistic regression is to learn a linear model over the feature vector
$x_i$ that predicts how likely each categorical label is to be observed as the
actual~$y_i$.
Note that logistic regression predicts more than a single label: it predicts the
probability of every possible label. The binomial case allows only two possible
labels; the multinomial case has no such restriction.
| |
| Just as linear regression estimates the mean value $\mu_i$ of a numerical response |
| variable, logistic regression does the same for category label probabilities. |
| In linear regression, the mean of $y_i$ is estimated as a linear combination of the features: |
| $\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$. |
| In logistic regression, the |
| label probability has to lie between 0 and~1, so a link function is applied to connect |
| it to $\beta_0 + x_i\beta_{1:m}$. If there are just two possible category labels, for example |
| 0~and~1, the logistic link looks as follows: |
| \begin{equation*} |
| \Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\, |
| \frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}; |
| \quad |
| \Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\, |
| \frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}} |
| \end{equation*} |
Here category label~0 serves as the \emph{baseline}, and the function
$\exp(\beta_0 + x_i\beta_{1:m})$
measures how likely ``$y_i = 1$'' is relative to the baseline.
As with a loaded coin, the predicted odds of seeing 1~versus~0 are
$\exp(\beta_0 + x_i\beta_{1:m})$ to~1,
with each feature $x_{i,j}$ contributing its own factor $\exp(\beta_j x_{i,j})$ to the odds.
| Given a large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic regression seeks |
| to find the $\beta_j$'s that maximize the product of probabilities |
| \hbox{$\Prob[y_i\mid x_i; \beta]$} |
| for actually observed $y_i$-labels (assuming no regularization). |
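
For a concrete, purely hypothetical illustration, suppose $m = 1$, $\beta_0 = -2$,
$\beta_1 = 1$, and $x_i = 3$. Then $\beta_0 + x_i\beta_1 = 1$: the predicted odds of
``$y_i = 1$'' versus ``$y_i = 0$'' are $e^1 \approx 2.72$ to~1, and
$\Prob[y_i\,{=}\,1\mid x_i; \beta] = e\,/\,(1 + e) \approx 0.73$.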
| |
| Multinomial logistic regression~\cite{Agresti2002:CDA} extends this link to $k \geq 3$ possible |
| categories. Again we identify one category as the baseline, for example the $k$-th category. |
| Instead of a coin, here we have a loaded multisided die, one side per category. Each non-baseline |
| category $l = 1\ldots k\,{-}\,1$ has its own vector $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$ |
| of regression parameters with the intercept, making up a matrix $B$ of size |
| $(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing non-baseline category~$l$ versus |
| the baseline~$k$ are $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$ |
| to~1, and the predicted probabilities are: |
| \begin{align} |
| l < k:\quad\Prob[y_i\,{=}\,\makebox[0.5em][c]{$l$}\mid x_i; B] \,\,\,{=}\,\,\,& |
| \frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}% |
| {1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}; |
| \label{eqn:mlogreg:nonbaseprob}\\ |
| \Prob[y_i\,{=}\,\makebox[0.5em][c]{$k$}\mid x_i; B] \,\,\,{=}\,\,\,& \frac{1}% |
| {1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}. |
| \label{eqn:mlogreg:baseprob} |
| \end{align} |
| The goal of the regression is to estimate the parameter matrix~$B$ from the provided dataset |
| $(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of \hbox{$\Prob[y_i\mid x_i; B]$} |
| over the observed labels~$y_i$. Taking its logarithm, negating, and adding a regularization term |
| gives us a minimization objective: |
| \begin{equation} |
| f(B; X, Y) \,\,=\,\, |
| -\sum_{i=1}^n \,\log \Prob[y_i\mid x_i; B] \,+\, |
| \frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2 |
| \,\,\to\,\,\min |
| \label{eqn:mlogreg:loss} |
| \end{equation} |
| The optional regularization term is added to mitigate overfitting and degeneracy in the data; |
| to reduce bias, the intercepts $\beta_{0,l}$ are not regularized. Once the~$\beta_{j,l}$'s |
| are accurately estimated, we can make predictions about the category label~$y$ for a new |
| feature vector~$x$ using Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}). |
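
For concreteness, the following minimal numpy sketch (our own naming; not part of
the {\tt MultiLogReg.dml} script) evaluates the predicted
probabilities~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob})
and the objective~(\ref{eqn:mlogreg:loss}) for a given parameter matrix:
\begin{verbatim}
import numpy as np

def predicted_probs(X, B):
    # X is n-by-m; B is (m+1)-by-(k-1) with the intercepts in its
    # last row; returns the n-by-k probability matrix, with the
    # baseline category in the last column.
    n, m = X.shape
    eta = X @ B[:m, :] + B[m, :]               # linear terms
    E = np.exp(eta)
    denom = 1.0 + E.sum(axis=1, keepdims=True)
    return np.hstack([E, np.ones((n, 1))]) / denom

def objective(X, Y, B, lam):
    # Negative log-likelihood plus the L2 penalty; Y holds integer
    # labels 1..k; the intercept row is not regularized.
    P = predicted_probs(X, B)
    nll = -np.log(P[np.arange(X.shape[0]), Y - 1]).sum()
    return nll + 0.5 * lam * np.sum(B[:-1, :] ** 2)
\end{verbatim}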
| |
| \smallskip |
| \noindent{\bf Usage} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\it% |
| {\tt{}-f }path/\/{\tt{}MultiLogReg.dml} |
| {\tt{} -nvargs} |
| {\tt{} X=}path/file |
| {\tt{} Y=}path/file |
| {\tt{} B=}path/file |
| {\tt{} Log=}path/file |
| {\tt{} icpt=}int |
| {\tt{} reg=}double |
| {\tt{} tol=}double |
| {\tt{} moi=}int |
| {\tt{} mii=}int |
| {\tt{} fmt=}format |
| |
| } |
| |
| |
| \smallskip |
| \noindent{\bf Arguments} |
| \begin{Description} |
| \item[{\tt X}:] |
| Location (on HDFS) to read the input matrix of feature vectors; each row constitutes |
| one feature vector. |
\item[{\tt Y}:]
Location to read the input one-column matrix of category labels that correspond to
feature vectors in~{\tt X}. Note the following (a short recoding sketch follows this
argument list):\\
-- Each non-baseline category label must be a positive integer.\\
-- If all labels are positive, the largest label represents the baseline category.\\
-- If non-positive labels such as $-1$ or~$0$ are present, they all represent the (same)
baseline category and are converted to label $\max(\texttt{Y})\,{+}\,1$.
| \item[{\tt B}:] |
| Location to store the matrix of estimated regression parameters (the $\beta_{j, l}$'s), |
| with the intercept parameters~$\beta_{0, l}$ at position {\tt B[}$m\,{+}\,1$, $l${\tt ]} |
| if available. The size of {\tt B} is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the intercepts |
| or $m \times (k\,{-}\,1)$ without the intercepts, one column per non-baseline category |
| and one row per feature. |
| \item[{\tt Log}:] (default:\mbox{ }{\tt " "}) |
| Location to store iteration-specific variables for monitoring and debugging purposes, |
| see Table~\ref{table:mlogreg:log} for details. |
| \item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) |
| Intercept and shifting/rescaling of the features in~$X$:\\ |
| {\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\ |
| {\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ |
| {\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 |
| \item[{\tt reg}:] (default:\mbox{ }{\tt 0.0}) |
| L2-regularization parameter (lambda) |
| \item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}) |
| Tolerance (epsilon) used in the convergence criterion |
| \item[{\tt moi}:] (default:\mbox{ }{\tt 100}) |
| Maximum number of outer (Fisher scoring) iterations |
| \item[{\tt mii}:] (default:\mbox{ }{\tt 0}) |
| Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum |
| limit provided |
| \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) |
| Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; |
| see read/write functions in SystemML Language Reference for details. |
| \end{Description} |
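
As a small illustration of the label convention for {\tt Y} (hypothetical values,
recoded the way described above):
\begin{verbatim}
import numpy as np

Y = np.array([1, 2, -1, 0, 2])        # -1 and 0 both mark the baseline
Y = np.where(Y <= 0, Y.max() + 1, Y)  # -> [1, 2, 3, 3, 2]; baseline = 3
\end{verbatim}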
| |
| |
| \begin{table}[t]\small\centerline{% |
| \begin{tabular}{|ll|} |
| \hline |
| Name & Meaning \\ |
| \hline |
| {\tt LINEAR\_TERM\_MIN} & The minimum value of $X \pxp B$, used to check for overflows \\ |
| {\tt LINEAR\_TERM\_MAX} & The maximum value of $X \pxp B$, used to check for overflows \\ |
| {\tt NUM\_CG\_ITERS} & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\ |
| {\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\ |
| {\tt POINT\_STEP\_NORM} & L2-norm of iteration step from old point (matrix $B$) to new point \\ |
| {\tt OBJECTIVE} & The loss function we minimize (negative regularized log-likelihood) \\ |
| {\tt OBJ\_DROP\_REAL} & Reduction in the objective during this iteration, actual value \\ |
| {\tt OBJ\_DROP\_PRED} & Reduction in the objective predicted by a quadratic approximation \\ |
| {\tt OBJ\_DROP\_RATIO} & Actual-to-predicted reduction ratio, used to update the trust region \\ |
| {\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\ |
| {\tt GRADIENT\_NORM} & L2-norm of the loss function gradient (omitted if point is rejected) \\ |
| {\tt TRUST\_DELTA} & Updated trust region size, the ``delta'' \\ |
| \hline |
| \end{tabular}} |
\caption{
The {\tt Log} file for multinomial logistic regression contains the above \mbox{per-}iteration
variables in CSV format, each line containing a (Name, Iteration\#, Value) triple, with
Iteration\# being~0 for the initial values.}
| \label{table:mlogreg:log} |
| \end{table} |
| |
| |
| \noindent{\bf Details} |
| \smallskip |
| |
| We estimate the logistic regression parameters via L2-regularized negative |
| log-likelihood minimization~(\ref{eqn:mlogreg:loss}). |
| The optimization method used in the script closely follows the trust region |
| Newton method for logistic regression described in~\cite{Lin2008:logistic}. |
| For convenience, let us make some changes in notation: |
| \begin{Itemize} |
| \item Convert the input vector of observed category labels into an indicator matrix $Y$ |
| of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th category label is~$l$ and |
| $Y_{i, l} = 0$ otherwise; |
| \item Append an extra column of all ones, i.e.\ $(1, 1, \ldots, 1)^T$, as the |
| $m\,{+}\,1$-st column to the feature matrix $X$ to represent the intercept; |
| \item Append an all-zero column as the $k$-th column to $B$, the matrix of regression |
| parameters, to represent the baseline category; |
| \item Convert the regularization constant $\lambda$ into matrix $\Lambda$ of the same |
| size as $B$, placing 0's into the $m\,{+}\,1$-st row to disable intercept regularization, |
| and placing $\lambda$'s everywhere else. |
| \end{Itemize} |
| Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given |
| by (\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}) |
| and the objective function $f$ in~(\ref{eqn:mlogreg:loss}) have the matrix form |
| \begin{align*} |
| P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\ |
| f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \, |
| {\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \, |
| (1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B |
| \end{align*} |
| where operations $\cdot\,$, $/$, $\exp$, and $\log$ are applied cellwise, |
| and $\textstyle\sum$ denotes the sum of all cells in a matrix. |
| The gradient of~$f$ with respect to~$B$ can be represented as a matrix too: |
| \begin{equation*} |
| \nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B |
| \end{equation*} |
| The Hessian $\mathcal{H}$ of~$f$ is a tensor, but, fortunately, the conjugate |
| gradient inner loop of the trust region algorithm in~\cite{Lin2008:logistic} |
| does not need to instantiate it. We only need to multiply $\mathcal{H}$ by |
| ordinary matrices of the same size as $B$ and $\nabla f$, and this can be done |
| in matrix form: |
| \begin{equation*} |
| \mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\, |
| \Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV) |
| \end{equation*} |
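
As a minimal illustration (our own naming; a real implementation would also guard
the cellwise $\exp$ against overflow, cf.\ {\tt LINEAR\_TERM\_MIN} and
{\tt LINEAR\_TERM\_MAX} in Table~\ref{table:mlogreg:log}), the four matrix formulas
above translate into numpy as follows, assuming $X$ already carries the all-ones
intercept column, $Y$ is the $n\times k$ indicator matrix, $B$ includes the all-zero
$k$-th column, and {\tt Lam} is the matrix~$\Lambda$:
\begin{verbatim}
import numpy as np

def probs(X, B):                      # P = exp(XB) / (exp(XB) 1)
    E = np.exp(X @ B)
    return E / E.sum(axis=1, keepdims=True)

def f(X, Y, B, Lam):                  # objective in matrix form
    XB = X @ B
    return (-np.sum(Y * XB)
            + np.sum(np.log(np.exp(XB).sum(axis=1)))
            + 0.5 * np.sum(Lam * B * B))

def grad(X, Y, B, Lam):               # nabla f = X'(P - Y) + Lam . B
    return X.T @ (probs(X, B) - Y) + Lam * B

def hess_vec(X, B, Lam, V):           # H V, with Q = P . (X V)
    P = probs(X, B)
    Q = P * (X @ V)
    return X.T @ (Q - P * Q.sum(axis=1, keepdims=True)) + Lam * V
\end{verbatim}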
| At each Newton iteration (the \emph{outer} iteration) the minimization algorithm |
| approximates the difference $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$ |
| attained in the objective function after a step $B \mapsto B\,{+}\,S$ by a |
| second-degree formula |
| \begin{equation*} |
| \varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S |
| \,+\, {\textstyle\sum}\,\,S\cdot \nabla f |
| \end{equation*} |
| This approximation is then minimized by trust-region conjugate gradient iterations |
| (the \emph{inner} iterations) subject to the constraint $\|S\|_2 \leq \delta$. |
| The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$ |
| and updated as described in~\cite{Lin2008:logistic}. |
| Users can specify the maximum number of the outer and the inner iterations with |
| input parameters {\tt moi} and {\tt mii}, respectively. The iterative minimizer |
| terminates successfully if $\|\nabla f\|_2 < \eps\,\|\nabla f_{B=0}\|_2$, |
| where $\eps > 0$ is a tolerance supplied by the user via input parameter~{\tt tol}. |
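
Reusing the helpers from the sketch above (again purely illustrative), the quadratic
model of the objective drop and the convergence test read:
\begin{verbatim}
def model_drop(X, B, Lam, G, S):
    # Approximates f(B + S) - f(B); G is grad(X, Y, B, Lam).
    return 0.5 * np.sum(S * hess_vec(X, B, Lam, S)) + np.sum(S * G)

def converged(G, G0, eps):
    # Stop once ||grad f||_2 < eps * ||grad f at B = 0||_2.
    return np.linalg.norm(G) < eps * np.linalg.norm(G0)
\end{verbatim}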
| |
| \smallskip |
| \noindent{\bf Returns} |
| \smallskip |
| |
| The estimated regression parameters (the $\hat{\beta}_{j, l}$) are populated into |
| a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}'' |
| input argument. Only the non-baseline categories ($1\leq l \leq k\,{-}\,1$) have |
| their $\hat{\beta}_{j, l}$ in the output; to add the baseline category, just append |
| a column of zeros. If {\tt icpt=0} in the input command line, no intercepts are used |
| and {\tt B} has size $m\times (k\,{-}\,1)$; otherwise {\tt B} has size |
| $(m\,{+}\,1)\times (k\,{-}\,1)$ |
| and the intercepts are in the $m\,{+}\,1$-st row. If {\tt icpt=2}, then initially |
| the feature columns in~$X$ are shifted to mean${} = 0$ and rescaled to variance${} = 1$. |
| After the iterations converge, the $\hat{\beta}_{j, l}$'s are rescaled and shifted |
| to work with the original features. |
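
To illustrate how the returned matrix can be used for scoring new records (a sketch
assuming the {\tt icpt=1} or {\tt icpt=2} layout, with a hypothetical helper name):
\begin{verbatim}
import numpy as np

def predict(X, B):
    # B is (m+1)-by-(k-1) with the intercepts in its last row;
    # append the baseline zero column, then pick the most likely
    # category label in 1..k for each row of X.
    n, m = X.shape
    eta = np.hstack([X @ B[:m, :] + B[m, :], np.zeros((n, 1))])
    P = np.exp(eta)
    P /= P.sum(axis=1, keepdims=True)
    return P.argmax(axis=1) + 1
\end{verbatim}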
| |
| |
| \smallskip |
| \noindent{\bf Examples} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\tt |
| \hml -f MultiLogReg.dml -nvargs X=/user/biadmin/X.mtx |
| Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx fmt=csv |
| icpt=2 reg=1.0 tol=0.0001 moi=100 mii=10 Log=/user/biadmin/log.csv |
| |
| } |
| |
| |
| \smallskip |
| \noindent{\bf References} |
\begin{itemize}
\item A.~Agresti.
\newblock {\em Categorical Data Analysis}.
\newblock Wiley Series in Probability and Statistics. Wiley-Interscience, second edition, 2002.
\item C.-J.~Lin, R.~C.~Weng, and S.~S.~Keerthi.
\newblock Trust region {N}ewton method for large-scale logistic regression.
\newblock {\em Journal of Machine Learning Research}, 9:627--650, 2008.
\end{itemize}