\begin{comment}
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements. See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership. The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied. See the License for the
 specific language governing permissions and limitations
 under the License.
\end{comment}

\subsection{Multinomial Logistic Regression}

\noindent{\bf Description}
\smallskip

Our logistic regression script performs both binomial and multinomial logistic regression.
The script is given a dataset $(X, Y)$ where matrix $X$ has $m$~columns and matrix $Y$ has
one column; both $X$ and~$Y$ have $n$~rows. The rows of $X$ and~$Y$ are viewed as a collection
of records: $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of explanatory
(feature) variables and $y_i$ is a categorical response variable.
Each row~$x_i$ in~$X$ has size~\mbox{$\dim x_i = m$}, while its corresponding $y_i$ is an
integer that represents the observed response value for record~$i$.
The goal of logistic regression is to learn a linear model over the feature vector
$x_i$ that can be used to predict how likely each categorical label is to be
the actual~$y_i$.
Note that logistic regression predicts more than a label: it predicts the probability
for every possible label. The binomial case allows only two possible labels; the
multinomial case has no such restriction.

Just as linear regression estimates the mean value $\mu_i$ of a numerical response
variable, logistic regression does the same for category label probabilities.
In linear regression, the mean of $y_i$ is estimated as a linear combination of the features:
$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
In logistic regression, the
label probability has to lie between 0 and~1, so a link function is applied to connect
it to $\beta_0 + x_i\beta_{1:m}$. If there are just two possible category labels, for example
0~and~1, the logistic link looks as follows:
\begin{equation*}
\Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\,
\frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}};
\quad
\Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\,
\frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}
\end{equation*}

Here category label~0 serves as the \emph{baseline}, and the function
$\exp(\beta_0 + x_i\beta_{1:m})$
expresses how likely we expect to see ``$y_i = 1$'' relative to the baseline.
As with a loaded coin, the predicted odds of seeing 1~versus~0 are
$\exp(\beta_0 + x_i\beta_{1:m})$ to~1,
with each feature $x_{i,j}$ contributing its own factor $\exp(\beta_j x_{i,j})$ to the odds.

Given a large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic regression seeks
to find the $\beta_j$'s that maximize the product of probabilities
\hbox{$\Prob[y_i\mid x_i; \beta]$}
for the actually observed $y_i$-labels (assuming no regularization).
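
To make the link concrete, the following Python sketch (not part of the DML script;
all values are made up) evaluates the two binomial probabilities for a single record:
\begin{verbatim}
import numpy as np

# Hypothetical record with m = 3 features and made-up fitted parameters.
x     = np.array([0.5, -1.2, 2.0])    # feature vector x_i
beta0 = 0.1                           # intercept beta_0
beta  = np.array([0.8, -0.3, 0.05])   # coefficients beta_{1:m}

t  = beta0 + x @ beta                 # linear term beta_0 + x_i beta_{1:m}
p1 = np.exp(t) / (1.0 + np.exp(t))    # Prob[y_i = 1 | x_i; beta]
p0 = 1.0 / (1.0 + np.exp(t))          # Prob[y_i = 0 | x_i; beta]
print(p1, p0, p1 + p0)                # the two probabilities sum to 1
\end{verbatim}
The predicted odds of ``$y_i = 1$'' versus ``$y_i = 0$'' for this record are
{\tt p1/p0}${} = \exp(t)$ to~1.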

Multinomial logistic regression~\cite{Agresti2002:CDA} extends this link to $k \geq 3$ possible
categories. Again we identify one category as the baseline, for example the $k$-th category.
Instead of a coin, here we have a loaded multisided die, one side per category. Each non-baseline
category $l = 1\ldots k\,{-}\,1$ has its own vector $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$
of regression parameters with the intercept, making up a matrix $B$ of size
$(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing non-baseline category~$l$ versus
the baseline~$k$ are $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
to~1, and the predicted probabilities are:

\begin{align}
l < k:\quad\Prob[y_i\,{=}\,\makebox[0.5em][c]{$l$}\mid x_i; B] \,\,\,{=}\,\,\,&
\frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}%
{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)};
\label{eqn:mlogreg:nonbaseprob}\\
\Prob[y_i\,{=}\,\makebox[0.5em][c]{$k$}\mid x_i; B] \,\,\,{=}\,\,\,& \frac{1}%
{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}.
\label{eqn:mlogreg:baseprob}
\end{align}

The goal of the regression is to estimate the parameter matrix~$B$ from the provided dataset
$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of \hbox{$\Prob[y_i\mid x_i; B]$}
over the observed labels~$y_i$. Taking its logarithm, negating, and adding a regularization term
gives us a minimization objective:
\begin{equation}
f(B; X, Y) \,\,=\,\,
-\sum_{i=1}^n \,\log \Prob[y_i\mid x_i; B] \,+\,
\frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2
\,\,\to\,\,\min
\label{eqn:mlogreg:loss}
\end{equation}
The optional regularization term is added to mitigate overfitting and degeneracy in the data;
to reduce bias, the intercepts $\beta_{0,l}$ are not regularized. Once the~$\beta_{j,l}$'s
are accurately estimated, we can make predictions about the category label~$y$ for a new
feature vector~$x$ using Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}).
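
As an illustration of Eqs.~(\ref{eqn:mlogreg:nonbaseprob}), (\ref{eqn:mlogreg:baseprob}),
and~(\ref{eqn:mlogreg:loss}), here is a minimal Python sketch that evaluates the predicted
probabilities and the objective on made-up toy data (the script itself is written in DML;
this is only a numerical restatement of the formulas):
\begin{verbatim}
import numpy as np

# Toy data: n = 4 records, m = 2 features, k = 3 categories (made-up values).
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3], [0.0, 1.5]])
y = np.array([1, 3, 2, 1])           # labels in 1..k; k = 3 is the baseline
B = np.array([[ 0.5,  0.3],          # rows 1..m: beta_{j,l}, one column per l < k
              [-0.4,  0.6],
              [ 0.2, -0.1]])         # last row: intercepts beta_{0,l}
lam = 1.0                            # regularization constant lambda

n, m = X.shape
k = B.shape[1] + 1
T = X @ B[:m] + B[m]                 # linear terms, one column per category l < k
E = np.hstack([np.exp(T), np.ones((n, 1))])   # baseline contributes exp(0) = 1
P = E / E.sum(axis=1, keepdims=True)          # predicted probabilities, rows sum to 1

# Objective: negative log-likelihood plus the L2 term (intercepts not regularized).
f = -np.log(P[np.arange(n), y - 1]).sum() + 0.5 * lam * (B[:m] ** 2).sum()
print(P)
print(f)
\end{verbatim}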

\smallskip

\noindent{\bf Usage}
\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}MultiLogReg.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} Log=}path/file
{\tt{} icpt=}int
{\tt{} reg=}double
{\tt{} tol=}double
{\tt{} moi=}int
{\tt{} mii=}int
{\tt{} fmt=}format
}

\smallskip

\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the input matrix of feature vectors; each row constitutes
one feature vector.
\item[{\tt Y}:]
Location to read the input one-column matrix of category labels that correspond to
feature vectors in~{\tt X}. Note the following:\\
-- Each non-baseline category label must be a positive integer.\\
-- If all labels are positive, the largest represents the baseline category.\\
-- If non-positive labels such as $-1$ or~$0$ are present, then they represent the (same)
baseline category and are converted to label $\max(\texttt{Y})\,{+}\,1$.\\
(See the illustrative sketch after this list.)
\item[{\tt B}:]
Location to store the matrix of estimated regression parameters (the $\beta_{j, l}$'s),
with the intercept parameters~$\beta_{0, l}$ at position {\tt B[}$m\,{+}\,1$, $l${\tt ]}
if available. The size of {\tt B} is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the intercepts
or $m \times (k\,{-}\,1)$ without the intercepts, one column per non-baseline category
and one row per feature.
\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
Location to store iteration-specific variables for monitoring and debugging purposes;
see Table~\ref{table:mlogreg:log} for details.
\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
Intercept and shifting/rescaling of the features in~$X$:\\
{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
L2-regularization parameter (lambda)
\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
Tolerance (epsilon) used in the convergence criterion
\item[{\tt moi}:] (default:\mbox{ }{\tt 100})
Maximum number of outer (Fisher scoring) iterations
\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
limit is provided
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in the SystemML Language Reference for details.
\end{Description}
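
The label conventions for {\tt Y} amount to the following recoding rule, sketched
here in Python for illustration (an informal restatement of the rules above, not
the script's actual code):
\begin{verbatim}
import numpy as np

def recode_labels(y):
    # Non-positive labels all denote the baseline category; map them to
    # max(Y) + 1 so that the baseline becomes the largest label.
    y = np.asarray(y).copy()
    if (y <= 0).any():
        y[y <= 0] = y.max() + 1
    return y

print(recode_labels([1, 2, -1, 0, 2]))   # -> [1 2 3 3 2]
\end{verbatim}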

\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt LINEAR\_TERM\_MIN} & The minimum value of $X \pxp B$, used to check for overflows \\
{\tt LINEAR\_TERM\_MAX} & The maximum value of $X \pxp B$, used to check for overflows \\
{\tt NUM\_CG\_ITERS} & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
{\tt POINT\_STEP\_NORM} & L2-norm of iteration step from old point (matrix $B$) to new point \\
{\tt OBJECTIVE} & The loss function we minimize (negative regularized log-likelihood) \\
{\tt OBJ\_DROP\_REAL} & Reduction in the objective during this iteration, actual value \\
{\tt OBJ\_DROP\_PRED} & Reduction in the objective predicted by a quadratic approximation \\
{\tt OBJ\_DROP\_RATIO} & Actual-to-predicted reduction ratio, used to update the trust region \\
{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
{\tt GRADIENT\_NORM} & L2-norm of the loss function gradient (omitted if the point is rejected) \\
{\tt TRUST\_DELTA} & Updated trust region size, the ``delta'' \\
\hline
\end{tabular}}
\caption{
The {\tt Log} file for multinomial logistic regression contains the above \mbox{per-}iteration
variables in CSV format, each line containing a (Name, Iteration\#, Value) triple, with
Iteration\# being~0 for initial values.}
\label{table:mlogreg:log}
\end{table}

\noindent{\bf Details}
\smallskip

We estimate the logistic regression parameters via L2-regularized negative
log-likelihood minimization~(\ref{eqn:mlogreg:loss}).
The optimization method used in the script closely follows the trust region
Newton method for logistic regression described in~\cite{Lin2008:logistic}.
For convenience, let us make some changes in notation:
\begin{Itemize}
\item Convert the input vector of observed category labels into an indicator matrix $Y$
of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th category label is~$l$ and
$Y_{i, l} = 0$ otherwise;
\item Append an extra column of all ones, i.e.\ $(1, 1, \ldots, 1)^T$, as the
$m\,{+}\,1$-st column to the feature matrix $X$ to represent the intercept;
\item Append an all-zero column as the $k$-th column to $B$, the matrix of regression
parameters, to represent the baseline category;
\item Convert the regularization constant $\lambda$ into matrix $\Lambda$ of the same
size as $B$, placing 0's into the $m\,{+}\,1$-st row to disable intercept regularization,
and placing $\lambda$'s everywhere else.
\end{Itemize}
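
These notation changes are easy to state in code; the following Python sketch builds
the converted objects for made-up toy shapes (illustrative only; the parameter values
in $B$ are placeholders for the estimate being optimized):
\begin{verbatim}
import numpy as np

# Made-up toy shapes: n = 4 records, m = 2 features, k = 3 categories.
n, m, k, lam = 4, 2, 3, 1.0
rng = np.random.default_rng(0)
y = np.array([1, 3, 2, 1])                # observed labels in 1..k

Y = np.zeros((n, k))                      # indicator matrix of the labels
Y[np.arange(n), y - 1] = 1.0
X = np.hstack([rng.standard_normal((n, m)),
               np.ones((n, 1))])          # all-ones (m+1)-st column: intercept
B = np.hstack([rng.standard_normal((m + 1, k - 1)),
               np.zeros((m + 1, 1))])     # all-zero k-th column: baseline
Lam = lam * np.ones((m + 1, k))           # Lambda, same size as B
Lam[m, :] = 0.0                           # zero (m+1)-st row: no intercept penalty
\end{verbatim}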

Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given
by (\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob})
and the objective function $f$ in~(\ref{eqn:mlogreg:loss}) have the matrix form
\begin{align*}
P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\
f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \,
{\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \,
(1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B
\end{align*}
where operations $\cdot\,$, $/$, $\exp$, and $\log$ are applied cellwise,
and $\textstyle\sum$ denotes the sum of all cells in a matrix.

The gradient of~$f$ with respect to~$B$ can be represented as a matrix too:
\begin{equation*}
\nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B
\end{equation*}
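
Continuing the toy sketch above (reusing {\tt X}, {\tt Y}, {\tt B}, and {\tt Lam}),
the matrix forms of $P$, $f$, and $\nabla f$ translate directly into Python; the
multiplications by $1_{k\times k}$ and $1_{k\times 1}$ become row sums:
\begin{verbatim}
XB = X @ B
P  = np.exp(XB) / np.exp(XB).sum(axis=1, keepdims=True)  # P = exp(XB)/(exp(XB) 1)
f  = (-(Y * XB).sum()
      + np.log(np.exp(XB).sum(axis=1)).sum()
      + 0.5 * (Lam * B * B).sum())                       # objective in matrix form
G  = X.T @ (P - Y) + Lam * B                             # gradient of f w.r.t. B
\end{verbatim}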

The Hessian $\mathcal{H}$ of~$f$ is a tensor, but, fortunately, the conjugate
gradient inner loop of the trust region algorithm in~\cite{Lin2008:logistic}
does not need to instantiate it. We only need to multiply $\mathcal{H}$ by
ordinary matrices of the same size as $B$ and $\nabla f$, and this can be done
in matrix form:
\begin{equation*}
\mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\,
\Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV)
\end{equation*}
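
A matching Python sketch of this Hessian-vector product, again reusing {\tt X},
{\tt P}, {\tt B}, and {\tt Lam} from the sketches above (illustrative only):
\begin{verbatim}
def hessian_times(V):
    # Multiply the Hessian of f by a matrix V of the same shape as B,
    # without ever materializing the Hessian tensor itself.
    Q = P * (X @ V)                                      # Q = P . (XV), cellwise
    return X.T @ (Q - P * Q.sum(axis=1, keepdims=True)) + Lam * V

V = np.ones(B.shape)                                     # any matrix of B's shape
print(hessian_times(V).shape)                            # same shape as B
\end{verbatim}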

At each Newton iteration (the \emph{outer} iteration) the minimization algorithm
approximates the difference $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$
attained in the objective function after a step $B \mapsto B\,{+}\,S$ by a
second-degree formula
\begin{equation*}
\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S
\,+\, {\textstyle\sum}\,\,S\cdot \nabla f
\end{equation*}
This approximation is then minimized by trust-region conjugate gradient iterations
(the \emph{inner} iterations) subject to the constraint $\|S\|_2 \leq \delta$.
The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
and updated as described in~\cite{Lin2008:logistic}.
Users can specify the maximum number of the outer and the inner iterations with
input parameters {\tt moi} and {\tt mii}, respectively. The iterative minimizer
terminates successfully if $\|\nabla f\|_2 < \eps\,\|\nabla f_{B=0}\|_2$,
where $\eps > 0$ is a tolerance supplied by the user via input parameter~{\tt tol}.
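
For reference, the quadratic model and the stopping rule can be written as two small
Python helpers on top of the sketches above (a schematic restatement of the two
formulas, not the script's actual control flow):
\begin{verbatim}
def delta_f_approx(S, G):
    # Second-degree approximation of f(B + S; X, Y) - f(B; X, Y).
    return 0.5 * (S * hessian_times(S)).sum() + (S * G).sum()

def converged(G, G0, eps=1e-6):
    # Stop once the gradient norm drops below eps times its norm at B = 0.
    return np.linalg.norm(G) < eps * np.linalg.norm(G0)
\end{verbatim}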

\smallskip

\noindent{\bf Returns}
\smallskip

The estimated regression parameters (the $\hat{\beta}_{j, l}$'s) are populated into
a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
input argument. Only the non-baseline categories ($1\leq l \leq k\,{-}\,1$) have
their $\hat{\beta}_{j, l}$'s in the output; to add the baseline category, just append
a column of zeros. If {\tt icpt=0} in the input command line, no intercepts are used
and {\tt B} has size $m\times (k\,{-}\,1)$; otherwise {\tt B} has size
$(m\,{+}\,1)\times (k\,{-}\,1)$
and the intercepts are in the $m\,{+}\,1$-st row. If {\tt icpt=2}, then initially
the feature columns in~$X$ are shifted to mean${} = 0$ and rescaled to variance${} = 1$.
After the iterations converge, the $\hat{\beta}_{j, l}$'s are rescaled and shifted
to work with the original features.
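
For completeness, this de-standardization step has the standard closed form below,
where $\mu_j$ and $\sigma_j$ denote the mean and standard deviation of the $j$-th
feature and primes mark the estimates obtained in the shifted/rescaled space (a
general identity for linear models, obtained by substituting
$x'_{i,j} = (x_{i,j} - \mu_j)/\sigma_j$ into the linear term; stated here for
reference, since the script performs this rescaling internally):
\begin{equation*}
\hat{\beta}_{j,l} \,=\, \hat{\beta}'_{j,l} \big/ \sigma_j
\quad \textrm{for}\,\,\, 1 \leq j \leq m;
\qquad
\hat{\beta}_{0,l} \,=\, \hat{\beta}'_{0,l} \,-\,
\sum_{j=1}^m \mu_j\, \hat{\beta}'_{j,l} \big/ \sigma_j
\end{equation*}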

\smallskip

\noindent{\bf Examples}
\smallskip

{\hangindent=\parindent\noindent\tt
\hml -f MultiLogReg.dml -nvargs X=/user/biadmin/X.mtx
Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx fmt=csv
icpt=2 reg=1.0 tol=0.0001 moi=100 mii=10 Log=/user/biadmin/log.csv
}

\smallskip

\noindent{\bf References}
\begin{itemize}
\item A.~Agresti.
\newblock {\em Categorical Data Analysis}.
\newblock Wiley Series in Probability and Statistics. Wiley-Interscience, second edition, 2002.
\item C.-J.~Lin, R.~C.~Weng, and S.~S.~Keerthi.
\newblock Trust region {N}ewton method for large-scale logistic regression.
\newblock {\em Journal of Machine Learning Research}, 9:627--650, 2008.
\end{itemize}