% When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines.
% !TEX root = ../design.tex
\chapter[Regression]{Regression}
\begin{moduleinfo}
\item[History]
\begin{modulehistory}
\item[v0.1] Initial version, including background of regression
\end{modulehistory}
\end{moduleinfo}
% Abstract. What is the problem we want to solve?
Regression analysis is a statistical tool for the investigation of
relationships between variables. Usually, the investigator seeks to ascertain
the causal effect of one variable upon another—the effect of a price increase
upon demand, for example, or the effect of changes in the money supply upon the
inflation rate. More specifically, regression analysis helps one understand how
the typical value of the dependent variable changes when any one of the
independent variables is varied, while the other independent variables are held
fixed.
Regression models involve the following variables:
\begin{enumerate}
\item The unknown parameters, denoted as $\beta$, which may represent a scalar or a vector.
\item The independent variables, $x$.
\item The dependent variable, $y$.
\end{enumerate}
\section{Linear Methods for Regression} % (fold)
\label{sub:linear_methods_for_regression}
% subsection linear_methods_for_regression (end)
\section{Regularization} % (fold)
\label{sub:regularization}
Usually, $y$ is the result of measurements contaminated by small errors
(noise), and one seeks the coefficients $x$ of a linear model $y = A x$. Ill-conditioned or singular systems also arise frequently in the iterative solution of nonlinear systems or optimization problems. In all such situations, the vector $x = A^{-1}y$ (or, in the full-rank overdetermined
case, $A^+ y$ with the pseudoinverse $A^+ = (A^T A)^{-1}A^T$), if it exists at all, is usually a meaningless, poor approximation to the true $x$.
Regularization techniques are needed to obtain meaningful solution estimates
for such ill-posed problems, and in particular when the number of parameters
is larger than the number of available measurements, so that standard least
squares techniques break down.
\subsection{Linear Ridge Regression}
Ridge regression is the most commonly used method of regularization of
ill-posed problems. Mathematically, it seeks to minimize
\begin{equation}
Q\left(\vec{w},w_0;\lambda\right)\equiv \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - w_0 -
\vec{w} \cdot \vec{x}_i \right)^2
+\frac{\lambda}{2}\|\vec{w}\|_2^2\ ,
\end{equation}
for a given value of $\lambda$, where $\vec{w}$ and $w_0$ are the fitting coefficients, and $\lambda$
is a non-negative regularization parameter. $\vec{w}$ is a vector in
$d$-dimensional space, and
\begin{equation}
\|\vec{w}\|_2^2 = \sum_{j=1}^{d}w_j^2 = \vec{w}^T\vec{w}\ .
\end{equation}
When $\lambda = 0$, $Q$ reduces to one half of
the mean squared error of the fit.
The intercept term $w_0$ is not regularized, because it is
fully determined by the mean values of $y_i$ and $\vec{x}_i$ and the
values of $\vec{w}$, and it does not affect the model's complexity.
$Q\left(\vec{w},w_0;\lambda\right)$ is a quadratic function of $\vec{w}$ and
$w_0$, and thus can be minimized analytically:
\begin{equation}
\vec{w}_{ridge}=\left(\lambda\vec{I}_d +
\vec{X}^T\vec{X}\right)^{-1}\vec{X}^T\vec{y}\ .
\end{equation}
Using the Newton method described in Sec. 6.2.4, the above quantity can be
computed in a single Newton step.
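
For illustration, the closed-form solution can be sketched in Python with NumPy (a minimal sketch, not the MADlib implementation; centering the data so that the intercept stays unpenalized is an assumption consistent with the discussion above):
\begin{verbatim}
import numpy as np

def ridge_closed_form(X, y, lam):
    # Leave the intercept unpenalized: center X and y, solve for w on the
    # centered data, then recover w0 from the means.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    # w_ridge = (lambda * I_d + X^T X)^{-1} X^T y, evaluated on centered data
    w = np.linalg.solve(lam * np.eye(d) + Xc.T @ Xc, Xc.T @ yc)
    w0 = y_mean - x_mean @ w   # w0 is fixed by the means, as noted above
    return w0, w
\end{verbatim}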
Many packages for ridge regularization actually regularize the fitting
coefficients of a model fit to scaled data rather than to the original
data. MADlib also provides this option. When the normalization parameter
is set to True (it is False by default), the data are first transformed as
follows before the ridge regularization is applied:
\begin{equation}
x_i' \leftarrow \frac{x_i - \langle x_i \rangle}{\langle (x_i -
\langle x_i \rangle)^2\rangle} \ ,
\end{equation}
\begin{equation}
y_i \leftarrow y_i - \langle y_i \rangle \ ,
\end{equation}
where $\langle \cdot \rangle = \sum_{i=1}^{N} \cdot / N$.
Note that Ridge regressions for scaled data and un-scaled data are not equivalent.
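
The transformation above can be sketched as follows (a hypothetical helper, not the MADlib implementation; note that the denominator is the mean squared deviation exactly as written in the equation above):
\begin{verbatim}
import numpy as np

def normalize_for_ridge(X, y):
    # x_i' = (x_i - <x_i>) / <(x_i - <x_i>)^2>, applied column-wise
    x_mean = X.mean(axis=0)
    x_msd = ((X - x_mean) ** 2).mean(axis=0)
    X_scaled = (X - x_mean) / x_msd
    # y_i <- y_i - <y_i>
    y_centered = y - y.mean()
    return X_scaled, y_centered
\end{verbatim}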
\subsection{Elastic Net Regularization} % (fold)
\label{ssub:elastic_net_regularization}
As a continuous shrinkage method, ridge regression achieves its better prediction performance through a bias–variance trade-off. However, ridge regression cannot produce a parsimonious model, for it always keeps all the predictors in the model~\cite{zou2005}. Best subset selection in contrast produces a sparse model, but it is extremely variable because of its inherent discreteness.
A promising technique called the lasso was proposed by Tibshirani (1996). The
lasso is a penalized least squares method imposing an L1-penalty on the
regression coefficients. Owing to the nature of the L1-penalty, the lasso does
both continuous shrinkage and automatic variable selection simultaneously.
Although the lasso has shown success in many situations, it has some
limitations. Consider the following three scenarios:
\begin{enumerate}
\item In the case where the number of features ($p$) is much larger than the number of observations ($n$), $p \gg n$, the lasso selects at most $n$ variables before it saturates, because of the nature of the convex optimization problem. This seems to be a limiting feature for a variable selection method. Moreover, the lasso is not well defined unless the bound on the $L_1$-norm of the coefficients is smaller than a certain value.
\item If there is a group of variables among which the pairwise correlations are very high, then the lasso tends to select only one variable from the group and does not care which one is selected.
\item For the usual $n > p$ situations, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression.
\end{enumerate}
These scenarios make the lasso an inappropriate variable selection method in some situations.
Hui Zou and Trevor Hastie~\cite{zou2005} introduced a new regularization
technique called the `elastic net'. Similar to the lasso, the elastic net
simultaneously does automatic variable selection and continuous
shrinkage, and it can select groups of correlated variables. It is like a
stretchable fishing net that retains `all the big fish'.
The elastic net regularization minimizes the following target function
\begin{equation} \label{eq:target}
\min_{\vec{w} \in R^N}L(\vec{w}) + \lambda \left[\frac{1-\alpha}{2}\|\vec{w}\|_2^2 +
\alpha \|\vec{w}\|_1\right]\ ,
\end{equation}
where $\|\vec{w}\|_1 = \sum_{i=1}^N|w_i|$ and $N$ is the number of features.
For the elastic net regularization on linear models,
\begin{equation}
L(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot
\vec{x}_m\right)^2\ ,
\end{equation}
where the sum is over all observations and $M$ is the total number of
observations.
For the elastic net regularization on logistic models,
\begin{equation}
L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 +
\vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 +
\vec{w}\cdot\vec{x}_m}\right)\right]\ ,
\end{equation}
where $y_m \in \{0,1\}$.
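
As a concrete illustration, the objective of Eq. (\ref{eq:target}) for both models can be evaluated with the following Python sketch (assuming a design matrix \texttt{X} whose rows are the $\vec{x}_m$; this is not MADlib code):
\begin{verbatim}
import numpy as np

def penalty(w, lam, alpha):
    # lambda * [ (1 - alpha)/2 * ||w||_2^2 + alpha * ||w||_1 ]
    return lam * ((1.0 - alpha) / 2.0 * np.dot(w, w) + alpha * np.abs(w).sum())

def linear_objective(w0, w, X, y, lam, alpha):
    # L(w) = 1/(2M) * sum_m (y_m - w0 - w . x_m)^2
    r = y - w0 - X @ w
    return (r ** 2).sum() / (2.0 * len(y)) + penalty(w, lam, alpha)

def logistic_objective(w0, w, X, y, lam, alpha):
    # L(w) = sum_m [ y_m log(1 + e^{-(w0 + w.x_m)})
    #              + (1 - y_m) log(1 + e^{w0 + w.x_m}) ],
    # with log(1 + e^x) computed stably as logaddexp(0, x)
    z = w0 + X @ w
    L = (y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z)).sum()
    return L + penalty(w, lam, alpha)
\end{verbatim}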
\subsubsection{Optimizer Algorithms}
Right now, we support two optimizer algorithms: the default is
FISTA, and the other is IGD.
\paragraph{FISTA}
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with {\bf
backtracking step size} [4]:
\vspace{0.2in}
\hrule
\vspace{0.2in}
{\bf Step $0$}: Choose $\delta_0>0$ and $\eta > 1$, and
$\vec{w}^{(0)} \in R^N$. Set $\vec{v}^{(1)}=\vec{w}^{(0)}$ and
$t_1=1$.

{\bf Step $k$}: ($k \ge 1$) Find the smallest nonnegative integer
$i_k$ such that, with $\bar{\delta} = \delta_{k-1}/\eta^{i_k-1}$,
\begin{equation}
F(p_{\bar{\delta}}(\vec{v}^{(k)})) \le
Q_{\bar{\delta}}(p_{\bar{\delta}}(\vec{v}^{(k)}), \vec{v}^{(k)})\ .
\end{equation}
Set $\delta_k = \delta_{k-1}/\eta^{i_k-1}$ and compute
\begin{equation}
\vec{w}^{(k)} = p_{\delta_k}\left(\vec{v}^{(k)}\right)\ ,
\end{equation}
\begin{equation}
t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}\ ,
\end{equation}
\begin{equation}
\vec{v}^{(k+1)} = \vec{w}^{(k)} +
\frac{t_k-1}{t_{k+1}}\left(\vec{w}^{(k)} - \vec{w}^{(k-1)}\right)\ .
\end{equation}
\vspace{0.2in}
\hrule
\vspace{0.2in}
Here,
\begin{equation}
F(\vec{w}) = f(\vec{w}) + g(\vec{w})\ ,
\end{equation}
where $f(\vec{w})$ is the differentiable part of
Eq. (\ref{eq:target}) and $g(\vec{w})$ is the non-differentiable part. For linear models,
\begin{equation}
f(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot
\vec{x}_m\right)^2 + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ ,
\end{equation}
and for logistic models,
\begin{equation}
f(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 +
\vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 +
\vec{w}\cdot\vec{x}_m}\right)\right] + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ .
\end{equation}
And for both types of models,
\begin{equation}
g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|\ .
\end{equation}
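
For concreteness, the gradient of the smooth part $f$ for the linear model, which the gradient steps below require, might look as follows in Python (a minimal sketch under the same assumptions as before, with the intercept $w_0$ carrying no ridge term):
\begin{verbatim}
import numpy as np

def grad_f_linear(w0, w, X, y, lam, alpha):
    # f = 1/(2M) * sum_m (y_m - w0 - w . x_m)^2
    #     + lambda*(1-alpha)/2 * ||w||_2^2
    M = len(y)
    r = y - w0 - X @ w
    grad_w0 = -r.sum() / M
    grad_w = -(X.T @ r) / M + lam * (1.0 - alpha) * w
    return grad_w0, grad_w
\end{verbatim}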
The function $Q_\delta$ used in the backtracking condition is defined as
\begin{equation}
Q_{\delta}(\vec{a}, \vec{b}) := f(\vec{b}) + \langle \vec{a} -
\vec{b}, \nabla f(\vec{b})\rangle +
\frac{1}{2\delta}\|\vec{a} - \vec{b}\|^2 + g(\vec{a})\ ,
\end{equation}
where $\langle\cdot,\cdot\rangle$ is the usual vector dot product. The
proxy function is defined as
\begin{equation}
p_\delta (\vec{v}) := \underset{\vec{w}}{\operatorname{arg\,min}} \left[g(\vec{w}) +
\frac{1}{2\delta}\left\|\vec{w} - \left(\vec{v} - \delta\,\nabla f(\vec{v})\right)\right\|^2 \right]
\end{equation}
For our case, where $g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|$, the
proxy function is simply the component-wise soft-thresholding function
\begin{equation}
\left[p_\delta (\vec{v})\right]_i = \left\{ \begin{array}{ll}
u_i - \lambda\alpha\delta\ , \quad & \mbox{if } u_i > \lambda\alpha\delta \\
0\ , \quad & \mbox{if } \vert u_i \vert \le \lambda\alpha\delta \\
u_i + \lambda\alpha\delta\ , \quad & \mbox{if } u_i < - \lambda\alpha\delta
\end{array}
\right.
\end{equation}
where
\begin{equation}
\vec{u} = \vec{v} - \delta\,\nabla f(\vec{v})\ .
\end{equation}
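
Putting the pieces together, the boxed procedure can be condensed into the Python sketch below. This is a simplified illustration that uses a fixed step size $\delta$ instead of the backtracking search and operates on the penalized coefficients $\vec{w}$ only (the intercept, if any, would be handled separately); \texttt{grad\_f} is assumed to map a coefficient vector to the gradient of $f$, for example a closure over the data built from the linear-model gradient sketched earlier.
\begin{verbatim}
import numpy as np

def soft_threshold(u, t):
    # proxy p_delta for g(w) = lambda*alpha*||w||_1
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def fista(grad_f, w_init, lam_alpha, delta, n_iter=100):
    w_prev = w_init.copy()
    v = w_init.copy()
    t = 1.0
    for _ in range(n_iter):
        # w^{(k)} = p_delta(v^{(k)}): gradient step, then soft-thresholding
        w = soft_threshold(v - delta * grad_f(v), lam_alpha * delta)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # v^{(k+1)} = w^{(k)} + (t_k - 1)/t_{k+1} * (w^{(k)} - w^{(k-1)})
        v = w + (t - 1.0) / t_next * (w - w_prev)
        w_prev, t = w, t_next
    return w_prev
\end{verbatim}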
{\bf Active set method} is used in our implementation of FISTA to
speed up the computation. Considerable speedup is obtained by
organizing the iterations around the active set of features, i.e., those
with nonzero coefficients. After a complete cycle through all the
variables, we iterate only on the active set until convergence. If
another complete cycle does not change the active set, we are done;
otherwise the process is repeated.
{\bf Warm-up method} is also used to speed up the computation. When
the parameter \emph{warmup} is set to True, a strictly decreasing
sequence of lambda values, ending at the lambda value that the user
wants to compute, is used. A larger lambda gives a sparser solution,
and that sparse solution is used as the initial guess for the next
lambda's solution, which speeds up the computation for the next
lambda. For larger data sets, this can sometimes accelerate the whole
computation and may be faster than computing the single lambda value
directly.
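
A minimal sketch of this warm-up strategy is shown below; the particular descending sequence of lambda values is an assumption, and \texttt{fit\_one} stands for any routine that fits the model for a single lambda from a given initial guess.
\begin{verbatim}
def fit_with_warmup(fit_one, lam_target, w_init, n_steps=5, factor=10.0):
    # Strictly decreasing lambda values ending at the target value.
    lambdas = [lam_target * factor ** k for k in range(n_steps - 1, -1, -1)]
    w = w_init
    for lam in lambdas:
        # The sparse solution for the larger lambda seeds the next fit.
        w = fit_one(lam, w)
    return w
\end{verbatim}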
{\bf Note:} Our implementation differs slightly from the
original FISTA. In the original FISTA, the backtracking
procedure searches for a nonnegative integer $i_k$
and sets the new step size to $\delta_k = \delta_{k-1}/\eta^{i_k}$, so
the step size is non-increasing. Here, we allow the step size to
increase by using $\delta_k = \delta_{k-1}/\eta^{i_k-1}$, so that
larger step sizes can be tried by the algorithm. Tests show that this
is actually faster.
\paragraph{IGD}
The Incremental Gradient Descent (IGD) algorithm is stochastic by
nature, so it is difficult to obtain sparse
solutions. What we implemented is the Stochastic Mirror Descent Algorithm
made Sparse (SMIDAS). The inverse $p$-norm link function is used:
\begin{equation} \label{eq:f}
h_j^{-1}(\vec{\theta}) = \frac{\mbox{sign}(\theta_j)\vert \theta_j
\vert^{p-1}}{\|\theta\|_p^{p-2}}\ ,
\end{equation}
where
\begin{equation}
\|\theta\|_p = \left(\sum_j \vert \theta_j \vert ^p\right)^{1/p}\ ,
\end{equation}
and $\mbox{sign}(0) = 0$.
\vspace{0.2in}
\hrule
\vspace{0.2in}
Choose a step size $\delta > 0$.\\
Let $p = 2\log d$, where $d$ is the number of features, and let $h^{-1}$ be as in Eq. (\ref{eq:f}).\\
Let $\vec{\theta} = \vec{0}$.\\
{\bf for} $k = 1,2,\dots$\\
\quad $\vec{w} = h^{-1}(\vec{\theta})$\\
\quad $\vec{v} = \nabla f(\vec{w})$\\
\quad $\tilde{\vec{\theta}} = \vec{\theta} - \delta \, \vec{v}$\\
\quad $\forall j,\ \theta_j = \mbox{sign}(\tilde{\theta}_j)\max (0,
\vert \tilde{\theta}_j \vert- \lambda\alpha\delta)$
\vspace{0.2in}
\hrule
\vspace{0.2in}
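
The boxed iteration might be rendered in Python as follows (a minimal sketch, not the MADlib implementation; \texttt{grad\_f} is assumed to return the gradient of the smooth part $f$ at the given coefficients):
\begin{verbatim}
import numpy as np

def h_inverse(theta, p):
    # Inverse p-norm link function from Eq. (eq:f); returns 0 for theta = 0.
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def smidas(grad_f, d, lam_alpha, delta, n_iter=100):
    p = 2.0 * np.log(d)          # p = 2 log d, with d the number of features
    theta = np.zeros(d)
    for _ in range(n_iter):
        w = h_inverse(theta, p)
        v = grad_f(w)
        theta_tilde = theta - delta * v
        # theta_j = sign(theta~_j) * max(0, |theta~_j| - lambda*alpha*delta)
        theta = np.sign(theta_tilde) * np.maximum(
            np.abs(theta_tilde) - lam_alpha * delta, 0.0)
    return h_inverse(theta, p)
\end{verbatim}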
The resulting fitting coefficients of this algorithm are not truly
sparse, but the values are very small (usually $< 10^{-15}$), so they can
be safely set to zero after filtering with a threshold.
This is done as follows: (1) multiply each coefficient by the
standard deviation of the corresponding feature; (2) compute the
average of the absolute values of the re-scaled coefficients; (3) divide each
re-scaled coefficient by that average, and if the resulting absolute
value is smaller than \emph{threshold}, set the original coefficient
to zero.
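
A Python sketch of this filtering step (the feature standard deviations and the \emph{threshold} value are assumed to be supplied by the caller):
\begin{verbatim}
import numpy as np

def filter_coefficients(w, feature_std, threshold):
    # (1) rescale by each feature's standard deviation
    rescaled = w * feature_std
    # (2) average absolute value of the rescaled coefficients
    avg = np.abs(rescaled).mean()
    # (3) zero out coefficients whose normalized magnitude is below threshold
    keep = np.abs(rescaled) / avg >= threshold
    return np.where(keep, w, 0.0)
\end{verbatim}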
IGD is by nature a sequential algorithm. When run in a
distributed way, each segment of the data runs its own SGD model, and
the models are averaged to produce a model for each iteration. This
averaging may slow down convergence, although it gives us the
ability to process large data sets on multiple machines. Therefore, this
algorithm provides an option \emph{parallel} to let the user choose
whether to do the computation in parallel.
IGD also implements the {\bf warm-up method}.
{\bf Stopping Criteria} Both {\bf FISTA} and {\bf IGD} compute the average difference
between the coefficients of two consecutive iterations, and if the
difference is smaller than \emph{tolerance} or the iteration number
is larger than \emph{max\_iter}, the computation stops.
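
This stopping rule corresponds to a check like the following (a sketch only; the actual parameter handling in MADlib may differ):
\begin{verbatim}
import numpy as np

def should_stop(w_new, w_old, iteration, tolerance, max_iter):
    # average difference between coefficients of two consecutive iterations
    avg_diff = np.abs(w_new - w_old).mean()
    return avg_diff < tolerance or iteration > max_iter
\end{verbatim}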
% subsubsection elastic_net_regularization (end)
% subsection regularization (end)