\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Multinomial Logistic Regression}
\noindent{\bf Description}
\smallskip
Our logistic regression script performs both binomial and multinomial logistic regression.
The script is given a dataset $(X, Y)$ where matrix $X$ has $m$~columns and matrix $Y$ has
one column; both $X$ and~$Y$ have $n$~rows. The rows of $X$ and~$Y$ are viewed as a collection
of records: $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of explanatory
(feature) variables and $y_i$ is a categorical response variable.
Each row~$x_i$ in~$X$ has size~\mbox{$\dim x_i = m$}, while its corresponding $y_i$ is an
integer that represents the observed response value for record~$i$.
The goal of logistic regression is to learn a linear model over the feature vector
$x_i$ that can be used to predict how likely each categorical label is to be observed
as the actual~$y_i$.
Note that logistic regression predicts more than a label: it predicts the probability
of every possible label. The binomial case allows only two possible labels, whereas the
multinomial case has no such restriction.
Just as linear regression estimates the mean value $\mu_i$ of a numerical response
variable, logistic regression does the same for category label probabilities.
In linear regression, the mean of $y_i$ is estimated as a linear combination of the features:
$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
In logistic regression, the
label probability has to lie between 0 and~1, so a link function is applied to connect
it to $\beta_0 + x_i\beta_{1:m}$. If there are just two possible category labels, for example
0~and~1, the logistic link looks as follows:
\begin{equation*}
\Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\,
\frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}};
\quad
\Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\,
\frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}
\end{equation*}
Here category label~0 serves as the \emph{baseline}, and function
$\exp(\beta_0 + x_i\beta_{1:m})$
shows how likely we expect to see ``$y_i = 1$'' in comparison to the baseline.
As with a loaded coin, the predicted odds of seeing 1~versus~0 are
$\exp(\beta_0 + x_i\beta_{1:m})$ to~1,
with each feature $x_{i,j}$ contributing its own factor $\exp(\beta_j x_{i,j})$ to the odds.
Given a large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic regression seeks
to find the $\beta_j$'s that maximize the product of probabilities
\hbox{$\Prob[y_i\mid x_i; \beta]$}
for actually observed $y_i$-labels (assuming no regularization).
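As an illustration of the binomial link, the following minimal NumPy sketch (not part of
the {\tt MultiLogReg.dml} script; the names {\tt binomial\_probs}, {\tt beta0}, {\tt beta},
and {\tt x} are ours) evaluates both probabilities for a single record:
\begin{verbatim}
import numpy as np

def binomial_probs(x, beta0, beta):
    # linear term beta_0 + x * beta_{1:m}
    t = beta0 + x @ beta
    p1 = np.exp(t) / (1.0 + np.exp(t))            # Prob[y = 1 | x; beta]
    return p1, 1.0 - p1                           # Prob[y = 0 | x; beta]

x = np.array([0.5, -1.2, 3.0])                    # m = 3 features
beta0, beta = 0.1, np.array([0.4, -0.3, 0.2])
print(binomial_probs(x, beta0, beta))             # the two probabilities sum to 1
\end{verbatim}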
Multinomial logistic regression~\cite{Agresti2002:CDA} extends this link to $k \geq 3$ possible
categories. Again we identify one category as the baseline, for example the $k$-th category.
Instead of a coin, here we have a loaded multisided die, one side per category. Each non-baseline
category $l = 1\ldots k\,{-}\,1$ has its own vector $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$
of regression parameters with the intercept, making up a matrix $B$ of size
$(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing non-baseline category~$l$ versus
the baseline~$k$ are $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
to~1, and the predicted probabilities are:
\begin{align}
l < k:\quad\Prob[y_i\,{=}\,\makebox[0.5em][c]{$l$}\mid x_i; B] \,\,\,{=}\,\,\,&
\frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}%
{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)};
\label{eqn:mlogreg:nonbaseprob}\\
\Prob[y_i\,{=}\,\makebox[0.5em][c]{$k$}\mid x_i; B] \,\,\,{=}\,\,\,& \frac{1}%
{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}.
\label{eqn:mlogreg:baseprob}
\end{align}
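For illustration, Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}) can be
evaluated for a single record by appending an implicit zero linear term for the baseline
category. The NumPy sketch below is not taken from the script; it assumes, as in the {\tt B}
output described under ``Arguments'' below, that the intercepts $\beta_{0,l}$ occupy the last
row of the parameter matrix:
\begin{verbatim}
import numpy as np

def multinomial_probs(x, B):
    # B has shape (m + 1, k - 1): one column per non-baseline category,
    # intercepts beta_{0,l} in the last row
    scores = B[-1, :] + x @ B[:-1, :]     # beta_{0,l} + sum_j x_j * beta_{j,l}
    scores = np.append(scores, 0.0)       # the baseline category k gets score 0
    e = np.exp(scores)
    return e / e.sum()                    # probabilities of categories 1..k

x = np.array([1.0, 2.0])                  # m = 2 features
B = np.array([[0.3, -0.1],                # beta_{1,l}
              [0.2,  0.4],                # beta_{2,l}
              [0.0,  0.5]])               # intercepts beta_{0,l}
print(multinomial_probs(x, B))            # k = 3 probabilities summing to 1
\end{verbatim}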
The goal of the regression is to estimate the parameter matrix~$B$ from the provided dataset
$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of \hbox{$\Prob[y_i\mid x_i; B]$}
over the observed labels~$y_i$. Taking its logarithm, negating, and adding a regularization term
gives us a minimization objective:
\begin{equation}
f(B; X, Y) \,\,=\,\,
-\sum_{i=1}^n \,\log \Prob[y_i\mid x_i; B] \,+\,
\frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2
\,\,\to\,\,\min
\label{eqn:mlogreg:loss}
\end{equation}
The optional regularization term is added to mitigate overfitting and degeneracy in the data;
to reduce bias, the intercepts $\beta_{0,l}$ are not regularized. Once the~$\beta_{j,l}$'s
are accurately estimated, we can make predictions about the category label~$y$ for a new
feature vector~$x$ using Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}).
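A minimal NumPy sketch of the objective~(\ref{eqn:mlogreg:loss}), evaluated record by record
with the intercepts excluded from regularization, is given below (illustrative only; the
function and variable names are ours, and the same intercepts-in-the-last-row layout as above
is assumed):
\begin{verbatim}
import numpy as np

def objective(B, X, y, lam):
    # B: (m + 1) x (k - 1) with intercepts beta_{0,l} in the last row;
    # y: integer labels in 1..k, where k is the baseline category
    y = np.asarray(y)
    scores = np.hstack([B[-1, :] + X @ B[:-1, :],      # non-baseline linear terms
                        np.zeros((X.shape[0], 1))])    # baseline term is zero
    logZ = np.log(np.exp(scores).sum(axis=1))          # log of the common denominator
    loglik = scores[np.arange(len(y)), y - 1] - logZ   # log Prob[y_i | x_i; B]
    return -loglik.sum() + 0.5 * lam * np.sum(B[:-1, :] ** 2)
\end{verbatim}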
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}MultiLogReg.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} Log=}path/file
{\tt{} icpt=}int
{\tt{} reg=}double
{\tt{} tol=}double
{\tt{} moi=}int
{\tt{} mii=}int
{\tt{} fmt=}format
}
\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the input matrix of feature vectors; each row constitutes
one feature vector.
\item[{\tt Y}:]
Location to read the input one-column matrix of category labels that correspond to
feature vectors in~{\tt X}; a sketch of this label handling follows the argument list.
Note the following:\\
-- Each non-baseline category label must be a positive integer.\\
-- If all labels are positive, the largest represents the baseline category.\\
-- If non-positive labels such as $-1$ or~$0$ are present, then they represent the (same)
baseline category and are converted to label $\max(\texttt{Y})\,{+}\,1$.
\item[{\tt B}:]
Location to store the matrix of estimated regression parameters (the $\beta_{j, l}$'s),
with the intercept parameters~$\beta_{0, l}$ at position {\tt B[}$m\,{+}\,1$, $l${\tt ]}
if available. The size of {\tt B} is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the intercepts
or $m \times (k\,{-}\,1)$ without the intercepts, one column per non-baseline category
and one row per feature.
\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
Location to store iteration-specific variables for monitoring and debugging purposes;
see Table~\ref{table:mlogreg:log} for details.
\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
Intercept and shifting/rescaling of the features in~$X$:\\
{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
L2-regularization parameter (lambda)
\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
Tolerance (epsilon) used in the convergence criterion
\item[{\tt moi}:] (default:\mbox{ }{\tt 100})
Maximum number of outer (Fisher scoring) iterations
\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
Maximum number of inner (conjugate gradient) iterations; set to~0 to impose no maximum
limit
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
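The label handling for {\tt Y} described above can be sketched as follows (a hypothetical
NumPy helper for illustration, not part of the script):
\begin{verbatim}
import numpy as np

def recode_labels(y):
    # Non-positive labels (e.g. -1 or 0) all denote the baseline category
    # and are mapped to max(y) + 1; otherwise the largest label is the baseline.
    y = np.asarray(y).copy()
    if (y <= 0).any():
        y[y <= 0] = y.max() + 1
    return y, y.max()                     # recoded labels and the baseline label k

print(recode_labels([1, 2, -1, 2, 0]))    # labels become [1, 2, 3, 2, 3], baseline is 3
\end{verbatim}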
\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt LINEAR\_TERM\_MIN} & The minimum value of $X \pxp B$, used to check for overflows \\
{\tt LINEAR\_TERM\_MAX} & The maximum value of $X \pxp B$, used to check for overflows \\
{\tt NUM\_CG\_ITERS} & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
{\tt POINT\_STEP\_NORM} & L2-norm of iteration step from old point (matrix $B$) to new point \\
{\tt OBJECTIVE} & The loss function we minimize (negative regularized log-likelihood) \\
{\tt OBJ\_DROP\_REAL} & Reduction in the objective during this iteration, actual value \\
{\tt OBJ\_DROP\_PRED} & Reduction in the objective predicted by a quadratic approximation \\
{\tt OBJ\_DROP\_RATIO} & Actual-to-predicted reduction ratio, used to update the trust region \\
{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
{\tt GRADIENT\_NORM} & L2-norm of the loss function gradient (omitted if point is rejected) \\
{\tt TRUST\_DELTA} & Updated trust region size, the ``delta'' \\
\hline
\end{tabular}}
\caption{
The {\tt Log} file for multinomial logistic regression contains the above \mbox{per-}iteration
variables in CSV format, each line containing a (Name, Iteration\#, Value) triple, with
Iteration\# being~0 for the initial values.}
\label{table:mlogreg:log}
\end{table}
\noindent{\bf Details}
\smallskip
We estimate the logistic regression parameters via L2-regularized negative
log-likelihood minimization~(\ref{eqn:mlogreg:loss}).
The optimization method used in the script closely follows the trust region
Newton method for logistic regression described in~\cite{Lin2008:logistic}.
For convenience, let us make some changes in notation:
\begin{Itemize}
\item Convert the input vector of observed category labels into an indicator matrix $Y$
of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th category label is~$l$ and
$Y_{i, l} = 0$ otherwise;
\item Append an extra column of all ones, i.e.\ $(1, 1, \ldots, 1)^T$, as the
$m\,{+}\,1$-st column to the feature matrix $X$ to represent the intercept;
\item Append an all-zero column as the $k$-th column to $B$, the matrix of regression
parameters, to represent the baseline category;
\item Convert the regularization constant $\lambda$ into matrix $\Lambda$ of the same
size as $B$, placing 0's into the $m\,{+}\,1$-st row to disable intercept regularization,
and placing $\lambda$'s everywhere else.
\end{Itemize}
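A minimal NumPy sketch of this setup is given below (illustrative only; NumPy and the names
used here are not part of the DML script):
\begin{verbatim}
import numpy as np

def setup(X, y, k, lam):
    # X: n x m features, y: labels in 1..k, lam: regularization constant lambda
    n, m = X.shape
    Y = np.zeros((n, k))                   # indicator matrix of the labels
    Y[np.arange(n), np.asarray(y) - 1] = 1.0
    X1 = np.hstack([X, np.ones((n, 1))])   # column of ones appended as column m + 1
    B = np.zeros((m + 1, k))               # last column stays zero: baseline category
    Lam = np.full((m + 1, k), float(lam))  # regularization matrix Lambda
    Lam[m, :] = 0.0                        # intercept row is not regularized
    return Y, X1, B, Lam
\end{verbatim}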
Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given
by (\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob})
and the objective function $f$ in~(\ref{eqn:mlogreg:loss}) have the matrix form
\begin{align*}
P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\
f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \,
{\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \,
(1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B
\end{align*}
where operations $\cdot\,$, $/$, $\exp$, and $\log$ are applied cellwise,
and $\textstyle\sum$ denotes the sum of all cells in a matrix.
The gradient of~$f$ with respect to~$B$ can be represented as a matrix too:
\begin{equation*}
\nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B
\end{equation*}
The Hessian $\mathcal{H}$ of~$f$ is a tensor, but, fortunately, the conjugate
gradient inner loop of the trust region algorithm in~\cite{Lin2008:logistic}
does not need to instantiate it. We only need to multiply $\mathcal{H}$ by
ordinary matrices of the same size as $B$ and $\nabla f$, and this can be done
in matrix form:
\begin{equation*}
\mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\,
\Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV)
\end{equation*}
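Continuing the setup sketch above, these matrix formulas translate almost directly into
NumPy (again an illustration, not the script itself):
\begin{verbatim}
import numpy as np

def probabilities(X1, B):
    E = np.exp(X1 @ B)                        # n x k matrix exp(XB)
    return E / E.sum(axis=1, keepdims=True)   # P = exp(XB) / (exp(XB) 1_{k x k})

def loss(X1, Y, B, Lam):
    XB = X1 @ B
    return (-np.sum(Y * XB)
            + np.sum(np.log(np.exp(XB).sum(axis=1)))
            + 0.5 * np.sum(Lam * B * B))

def gradient(X1, Y, B, Lam):
    return X1.T @ (probabilities(X1, B) - Y) + Lam * B

def hess_vec(X1, B, Lam, V):
    # Multiplies the Hessian of f by a matrix V of the same shape as B
    P = probabilities(X1, B)
    Q = P * (X1 @ V)                          # Q = P . (XV), cellwise
    return X1.T @ (Q - P * Q.sum(axis=1, keepdims=True)) + Lam * V
\end{verbatim}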
At each Newton iteration (the \emph{outer} iteration) the minimization algorithm
approximates the difference $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$
attained in the objective function after a step $B \mapsto B\,{+}\,S$ by a
second-degree formula
\begin{equation*}
\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S
\,+\, {\textstyle\sum}\,\,S\cdot \nabla f
\end{equation*}
This approximation is then minimized by trust-region conjugate gradient iterations
(the \emph{inner} iterations) subject to the constraint $\|S\|_2 \leq \delta$.
The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
and updated as described in~\cite{Lin2008:logistic}.
Users can specify the maximum number of the outer and the inner iterations with
input parameters {\tt moi} and {\tt mii}, respectively. The iterative minimizer
terminates successfully if $\|\nabla f\|_2 < \eps\,\|\nabla f_{B=0}\|_2$,
where $\eps > 0$ is a tolerance supplied by the user via input parameter~{\tt tol}.
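Using the {\tt gradient} and {\tt hess\_vec} helpers from the sketch above, the quadratic
model and the outer-loop stopping test can be illustrated as follows (a simplified sketch;
the full trust-region conjugate gradient logic of~\cite{Lin2008:logistic} is more involved):
\begin{verbatim}
import numpy as np
# Relies on gradient() and hess_vec() from the previous sketch.

def quadratic_drop(X1, Y, B, Lam, S):
    # Second-degree approximation of f(B + S; X, Y) - f(B; X, Y)
    return (0.5 * np.sum(S * hess_vec(X1, B, Lam, S))
            + np.sum(S * gradient(X1, Y, B, Lam)))

def converged(X1, Y, B, Lam, eps):
    # Outer-loop stopping test: ||grad f||_2 < eps * ||grad f at B = 0||_2
    g = np.linalg.norm(gradient(X1, Y, B, Lam))
    g0 = np.linalg.norm(gradient(X1, Y, np.zeros_like(B), Lam))
    return g < eps * g0
\end{verbatim}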
\smallskip
\noindent{\bf Returns}
\smallskip
The estimated regression parameters (the $\hat{\beta}_{j, l}$) are populated into
a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
input argument. Only the non-baseline categories ($1\leq l \leq k\,{-}\,1$) have
their $\hat{\beta}_{j, l}$ in the output; to add the baseline category, just append
a column of zeros. If {\tt icpt=0} in the input command line, no intercepts are used
and {\tt B} has size $m\times (k\,{-}\,1)$; otherwise {\tt B} has size
$(m\,{+}\,1)\times (k\,{-}\,1)$
and the intercepts are in the $m\,{+}\,1$-st row. If {\tt icpt=2}, then initially
the feature columns in~$X$ are shifted to mean${} = 0$ and rescaled to variance${} = 1$.
After the iterations converge, the $\hat{\beta}_{j, l}$'s are rescaled and shifted
to work with the original features.
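For instance, predictions for a new feature matrix can be obtained from the returned {\tt B}
(here assuming {\tt icpt}${}\neq{}${\tt 0}, so that the intercepts occupy row $m\,{+}\,1$) by
appending the baseline column of zeros as described above; the NumPy sketch below is
illustrative only:
\begin{verbatim}
import numpy as np

def predict(Xnew, B_hat):
    # B_hat: (m + 1) x (k - 1) as written by the script, intercepts in row m + 1
    n = Xnew.shape[0]
    X1 = np.hstack([Xnew, np.ones((n, 1))])                 # intercept column
    B = np.hstack([B_hat, np.zeros((B_hat.shape[0], 1))])   # baseline column of zeros
    E = np.exp(X1 @ B)
    P = E / E.sum(axis=1, keepdims=True)                    # per-category probabilities
    return P, P.argmax(axis=1) + 1                          # probabilities and labels 1..k
\end{verbatim}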
\smallskip
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f MultiLogReg.dml -nvargs X=/user/biadmin/X.mtx
Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx fmt=csv
icpt=2 reg=1.0 tol=0.0001 moi=100 mii=10 Log=/user/biadmin/log.csv
}
\smallskip
\noindent{\bf References}
\begin{itemize}
\item A.~Agresti.
\newblock {\em Categorical Data Analysis}.
\newblock Wiley Series in Probability and Statistics. Wiley-Interscience, second edition, 2002.
\item C.-J.~Lin, R.~C.~Weng, and S.~S.~Keerthi.
\newblock Trust region {N}ewton method for large-scale logistic regression.
\newblock {\em Journal of Machine Learning Research}, 9:627--650, 2008.
\end{itemize}