% When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines.
% !TEX root = ../design.tex
\chapter[Regression]{Regression}
\begin{moduleinfo}
\item[History]
\begin{modulehistory}
\item[v0.1] Initial version, including background of regression
\end{modulehistory}
\end{moduleinfo}
% Abstract. What is the problem we want to solve?
Regression analysis is a statistical tool for the investigation of
relationships between variables. Usually, the investigator seeks to ascertain
the causal effect of one variable upon another—the effect of a price increase
upon demand, for example, or the effect of changes in the money supply upon the
inflation rate. More specifically, regression analysis helps one understand how
the typical value of the dependent variable changes when any one of the
independent variables is varied, while the other independent variables are held
fixed.
Regression models involve the following variables:
\begin{enumerate}
\item The unknown parameters, denoted as $\beta$, which may represent a scalar or a vector.
\item The independent variables, $x$.
\item The dependent variable, $y$.
\end{enumerate}
\section{Linear Methods for Regression} % (fold)
\label{sub:linear_methods_for_regression}
% subsection linear_methods_for_regression (end)
\section{Regularization} % (fold)
\label{sub:regularization}
Usually, $y$ is the result of measurements contaminated by small errors
(noise), and one seeks the coefficients $x$ of a linear model $y = A x$. Ill-conditioned or singular systems also arise frequently in the iterative solution of nonlinear systems or optimization problems. In all such situations, the vector $x = A^{-1}y$ (or, in the full-rank overdetermined
case, $A^+ y$ with the pseudoinverse $A^+ = (A^T A)^{-1}A^T$), if it exists at all, is usually a meaningless, poor approximation to the true $x$.
Regularization techniques are needed to obtain meaningful solution estimates
for such ill-posed problems, and in particular when the number of parameters
is larger than the number of available measurements, so that standard least
squares techniques break down.
\subsection{Linear Ridge Regression}
Ridge regression is the most commonly used method of regularization of
ill-posed problems. Mathematically, it seeks to minimize
\begin{equation}
Q\left(\vec{w},w_0;\lambda\right)\equiv \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - w_0 -
\vec{w} \cdot \vec{x}_i \right)^2
+\frac{\lambda}{2}\|\vec{w}\|_2^2\ ,
\end{equation}
for a given value of $\lambda$, where $\vec{w}$ and $w_0$ are the fitting coefficients, and $\lambda$
is a non-negative regularization parameter. $\vec{w}$ is a vector in
$d$-dimensional space, and
\begin{equation}
\|\vec{w}\|_2^2 = \sum_{j=1}^{d}w_j^2 = \vec{w}^T\vec{w}\ .
\end{equation}
When $\lambda = 0$, $Q$ reduces to one half of
the mean squared error of the fit.
The intercept term $w_0$ is not regularized, because it is
fully determined by the mean values of $y_i$ and $\vec{x}_i$ and the
values of $\vec{w}$, and it does not affect the model's complexity.
$Q\left(\vec{w},w_0;\lambda\right)$ is a quadratic function of $\vec{w}$ and
$w_0$, and thus can be minimized analytically:
\begin{equation}
\vec{w}_{ridge}=\left(\lambda\vec{I}_d +
\vec{X}^T\vec{X}\right)^{-1}\vec{X}^T\vec{y}\ .
\end{equation}
Using the Newton method described in Sec. 6.2.4, the above quantity can be
computed in a single Newton step.
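
For illustration, the closed-form solution can be sketched in Python with NumPy (a minimal sketch, not the MADlib implementation; centering the data so that the intercept stays unpenalized is an assumption consistent with the discussion above):
\begin{verbatim}
import numpy as np

def ridge_closed_form(X, y, lam):
    # Leave the intercept unpenalized: center X and y, solve for w on the
    # centered data, then recover w0 from the means.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    # w_ridge = (lambda * I_d + X^T X)^{-1} X^T y, evaluated on centered data
    w = np.linalg.solve(lam * np.eye(d) + Xc.T @ Xc, Xc.T @ yc)
    w0 = y_mean - x_mean @ w   # w0 is fixed by the means, as noted above
    return w0, w
\end{verbatim}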
Many packages for ridge regularization actually regularize the fitting
coefficients of a model fit to scaled data rather than to the original
data. MADlib also provides this option. When the normalization parameter
is set to True (it is False by default), the data are first transformed as
follows before the ridge regularization is applied:
\begin{equation}
x_i' \leftarrow \frac{x_i - \langle x_i \rangle}{\langle (x_i -
\langle x_i \rangle)^2\rangle} \ ,
\end{equation}
\begin{equation}
y_i \leftarrow y_i - \langle y_i \rangle \ ,
\end{equation}
where $\langle \cdot \rangle = \sum_{i=1}^{N} \cdot / N$.
Note that Ridge regressions for scaled data and un-scaled data are not equivalent.
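
The transformation above can be sketched as follows (a hypothetical helper, not the MADlib implementation; note that the denominator is the mean squared deviation exactly as written in the equation above):
\begin{verbatim}
import numpy as np

def normalize_for_ridge(X, y):
    # x_i' = (x_i - <x_i>) / <(x_i - <x_i>)^2>, applied column-wise
    x_mean = X.mean(axis=0)
    x_msd = ((X - x_mean) ** 2).mean(axis=0)
    X_scaled = (X - x_mean) / x_msd
    # y_i <- y_i - <y_i>
    y_centered = y - y.mean()
    return X_scaled, y_centered
\end{verbatim}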
\subsection{Elastic Net Regularization} % (fold)
\label{ssub:elastic_net_regularization}
As a continuous shrinkage method, ridge regression achieves its better prediction performance through a bias–variance trade-off. However, ridge regression cannot produce a parsimonious model, for it always keeps all the predictors in the model~\cite{zou2005}. Best subset selection in contrast produces a sparse model, but it is extremely variable because of its inherent discreteness.
A promising technique called the lasso was proposed by Tibshirani (1996). The
lasso is a penalized least squares method imposing an L1-penalty on the
regression coefficients. Owing to the nature of the L1-penalty, the lasso does
both continuous shrinkage and automatic variable selection simultaneously.
Although the lasso has shown success in many situations, it has some
limitations. Consider the following three scenarios:
\begin{enumerate}
\item In the case where the number of features ($p$) is much larger than the number of observations ($n$), $p \gg n$, the lasso selects at most $n$ variables before it saturates, because of the nature of the convex optimization problem. This seems to be a limiting feature for a variable selection method. Moreover, the lasso is not well defined unless the bound on the $L_1$-norm of the coefficients is smaller than a certain value.
\item If there is a group of variables among which the pairwise correlations are very high, then the lasso tends to select only one variable from the group and does not care which one is selected.
\item For the usual $n > p$ situations, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression.
\end{enumerate}
These scenarios make the lasso an inappropriate variable selection method in some situations.
Hui Zou and Trevor Hastie~\cite{zou2005} introduced a new regularization
technique called the `elastic net'. Similar to the lasso, the elastic net
simultaneously does automatic variable selection and continuous
shrinkage, and it can select groups of correlated variables. It is like a
stretchable fishing net that retains `all the big fish'.
The elastic net regularization minimizes the following target function
\begin{equation} \label{eq:target}
\min_{\vec{w} \in R^N}L(\vec{w}) + \lambda \left[\frac{1-\alpha}{2}\|\vec{w}\|_2^2 +
\alpha \|\vec{w}\|_1\right]\ ,
\end{equation}
where $\|\vec{w}\|_1 = \sum_{i=1}^N|w_i|$ and $N$ is the number of features.
For the elastic net regularization on linear models,
\begin{equation}
L(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot
\vec{x}_m\right)^2\ ,
\end{equation}
where the sum is over all observations and $M$ is the total number of
observations.
For the elastic net regularization on logistic models,
\begin{equation}
L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 +
\vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 +
\vec{w}\cdot\vec{x}_m}\right)\right]\ ,
\end{equation}
where $y_m \in \{0,1\}$.
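
As a concrete illustration, the objective of Eq. (\ref{eq:target}) for both models can be evaluated with the following Python sketch (assuming a design matrix \texttt{X} whose rows are the $\vec{x}_m$; this is not MADlib code):
\begin{verbatim}
import numpy as np

def penalty(w, lam, alpha):
    # lambda * [ (1 - alpha)/2 * ||w||_2^2 + alpha * ||w||_1 ]
    return lam * ((1.0 - alpha) / 2.0 * np.dot(w, w) + alpha * np.abs(w).sum())

def linear_objective(w0, w, X, y, lam, alpha):
    # L(w) = 1/(2M) * sum_m (y_m - w0 - w . x_m)^2
    r = y - w0 - X @ w
    return (r ** 2).sum() / (2.0 * len(y)) + penalty(w, lam, alpha)

def logistic_objective(w0, w, X, y, lam, alpha):
    # L(w) = sum_m [ y_m log(1 + e^{-(w0 + w.x_m)})
    #              + (1 - y_m) log(1 + e^{w0 + w.x_m}) ],
    # with log(1 + e^x) computed stably as logaddexp(0, x)
    z = w0 + X @ w
    L = (y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z)).sum()
    return L + penalty(w, lam, alpha)
\end{verbatim}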
\subsubsection{Optimizer Algorithms}
Right now, we support two optimizer algorithms: the default is
FISTA, and the other is IGD.
\paragraph{FISTA}
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with {\bf
backtracking step size} [4]:
\vspace{0.2in}
\hrule
\vspace{0.2in}
{\bf Step $0$}: Choose $\delta_0>0$ and $\eta > 1$, and
$\vec{w}^{(0)} \in R^N$. Set $\vec{v}^{(1)}=\vec{w}^{(0)}$ and
$t_1=1$.

{\bf Step $k$}: ($k \ge 1$) Find the smallest nonnegative integer
$i_k$ such that, with $\bar{\delta} = \delta_{k-1}/\eta^{i_k-1}$,
\begin{equation}
F(p_{\bar{\delta}}(\vec{v}^{(k)})) \le
Q_{\bar{\delta}}(p_{\bar{\delta}}(\vec{v}^{(k)}), \vec{v}^{(k)})\ .
\end{equation}
Set $\delta_k = \delta_{k-1}/\eta^{i_k-1}$ and compute
\begin{equation}
\vec{w}^{(k)} = p_{\delta_k}\left(\vec{v}^{(k)}\right)\ ,
\end{equation}
\begin{equation}
t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}\ ,
\end{equation}
\begin{equation}
\vec{v}^{(k+1)} = \vec{w}^{(k)} +
\frac{t_k-1}{t_{k+1}}\left(\vec{w}^{(k)} - \vec{w}^{(k-1)}\right)\ .
\end{equation}
\vspace{0.2in}
\hrule
\vspace{0.2in}
Here,
\begin{equation}
F(\vec{w}) = f(\vec{w}) + g(\vec{w})\ ,
\end{equation}
where $f(\vec{w})$ is the differentiable part of
Eq. (\ref{eq:target}) and $g(\vec{w})$ is the non-differentiable part. For linear models,
\begin{equation}
f(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot
\vec{x}_m\right)^2 + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ ,
\end{equation}
and for logistic models,
\begin{equation}
f(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 +
\vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 +
\vec{w}\cdot\vec{x}_m}\right)\right] + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ .
\end{equation}
And for both types of models,
\begin{equation}
g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|\ .
\end{equation}
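
For concreteness, the gradient of the smooth part $f$ for the linear model, which the gradient steps below require, might look as follows in Python (a minimal sketch under the same assumptions as before, with the intercept $w_0$ carrying no ridge term):
\begin{verbatim}
import numpy as np

def grad_f_linear(w0, w, X, y, lam, alpha):
    # f = 1/(2M) * sum_m (y_m - w0 - w . x_m)^2
    #     + lambda*(1-alpha)/2 * ||w||_2^2
    M = len(y)
    r = y - w0 - X @ w
    grad_w0 = -r.sum() / M
    grad_w = -(X.T @ r) / M + lam * (1.0 - alpha) * w
    return grad_w0, grad_w
\end{verbatim}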
The function $Q_\delta$ used in the backtracking condition is defined as
\begin{equation}
Q_{\delta}(\vec{a}, \vec{b}) := f(\vec{b}) + \langle \vec{a} -
\vec{b}, \nabla f(\vec{b})\rangle +
\frac{1}{2\delta}\|\vec{a} - \vec{b}\|^2 + g(\vec{a})\ ,
\end{equation}
where $\langle\cdot,\cdot\rangle$ is the usual vector dot product. The
proxy function is defined as
\begin{equation}
p_\delta (\vec{v}) := \underset{\vec{w}}{\operatorname{arg\,min}} \left[g(\vec{w}) +
\frac{1}{2\delta}\left\|\vec{w} - \left(\vec{v} - \delta\,\nabla f(\vec{v})\right)\right\|^2 \right]
\end{equation}
For our case, where $g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|$, the
proxy function is simply the component-wise soft-thresholding function
\begin{equation}
\left[p_\delta (\vec{v})\right]_i = \left\{ \begin{array}{ll}
u_i - \lambda\alpha\delta\ , \quad & \mbox{if } u_i > \lambda\alpha\delta \\
0\ , \quad & \mbox{if } \vert u_i \vert \le \lambda\alpha\delta \\
u_i + \lambda\alpha\delta\ , \quad & \mbox{if } u_i < - \lambda\alpha\delta
\end{array}
\right.
\end{equation}
where
\begin{equation}
\vec{u} = \vec{v} - \delta\,\nabla f(\vec{v})\ .
\end{equation}
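
Putting the pieces together, the boxed procedure can be condensed into the Python sketch below. This is a simplified illustration that uses a fixed step size $\delta$ instead of the backtracking search and operates on the penalized coefficients $\vec{w}$ only (the intercept, if any, would be handled separately); \texttt{grad\_f} is assumed to map a coefficient vector to the gradient of $f$, for example a closure over the data built from the linear-model gradient sketched earlier.
\begin{verbatim}
import numpy as np

def soft_threshold(u, t):
    # proxy p_delta for g(w) = lambda*alpha*||w||_1
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def fista(grad_f, w_init, lam_alpha, delta, n_iter=100):
    w_prev = w_init.copy()
    v = w_init.copy()
    t = 1.0
    for _ in range(n_iter):
        # w^{(k)} = p_delta(v^{(k)}): gradient step, then soft-thresholding
        w = soft_threshold(v - delta * grad_f(v), lam_alpha * delta)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # v^{(k+1)} = w^{(k)} + (t_k - 1)/t_{k+1} * (w^{(k)} - w^{(k-1)})
        v = w + (t - 1.0) / t_next * (w - w_prev)
        w_prev, t = w, t_next
    return w_prev
\end{verbatim}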
{\bf Active set method} is used in our implementation of FISTA to
speed up the computation. Considerable speedup is obtained by
organizing the iterations around the active set of features, i.e., those
with nonzero coefficients. After a complete cycle through all the
variables, we iterate only on the active set until convergence. If
another complete cycle does not change the active set, we are done;
otherwise the process is repeated.
{\bf Warm-up method} is also used to speed up the computation. When
the parameter \emph{warmup} is set to True, a strictly decreasing
sequence of lambda values, ending at the lambda value that the user
wants to compute, is used. A larger lambda gives a sparser solution,
and that sparse solution is used as the initial guess for the next
lambda's solution, which speeds up the computation for the next
lambda. For larger data sets, this can sometimes accelerate the whole
computation and may be faster than computing the single lambda value
directly.
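
A minimal sketch of this warm-up strategy is shown below; the particular descending sequence of lambda values is an assumption, and \texttt{fit\_one} stands for any routine that fits the model for a single lambda from a given initial guess.
\begin{verbatim}
def fit_with_warmup(fit_one, lam_target, w_init, n_steps=5, factor=10.0):
    # Strictly decreasing lambda values ending at the target value.
    lambdas = [lam_target * factor ** k for k in range(n_steps - 1, -1, -1)]
    w = w_init
    for lam in lambdas:
        # The sparse solution for the larger lambda seeds the next fit.
        w = fit_one(lam, w)
    return w
\end{verbatim}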
{\bf Note:} Our implementation differs slightly from the
original FISTA. In the original FISTA, the backtracking
procedure searches for a nonnegative integer $i_k$
and sets the new step size to $\delta_k = \delta_{k-1}/\eta^{i_k}$, so
the step size is non-increasing. Here, we allow the step size to
increase by using $\delta_k = \delta_{k-1}/\eta^{i_k-1}$, so that
larger step sizes can be tried by the algorithm. Tests show that this
is actually faster.
\paragraph{IGD}
The Incremental Gradient Descent (IGD) algorithm is stochastic by
nature, so it is difficult to obtain sparse
solutions. What we implemented is the Stochastic Mirror Descent Algorithm
made Sparse (SMIDAS). The inverse $p$-norm link function is used:
\begin{equation} \label{eq:f}
h_j^{-1}(\vec{\theta}) = \frac{\mbox{sign}(\theta_j)\vert \theta_j
\vert^{p-1}}{\|\theta\|_p^{p-2}}\ ,
\end{equation}
where
\begin{equation}
\|\theta\|_p = \left(\sum_j \vert \theta_j \vert ^p\right)^{1/p}\ ,
\end{equation}
and $\mbox{sign}(0) = 0$.
\vspace{0.2in}
\hrule
\vspace{0.2in}
Choose a step size $\delta > 0$.\\
Let $p = 2\log d$, where $d$ is the number of features, and let $h^{-1}$ be as in Eq. (\ref{eq:f}).\\
Let $\vec{\theta} = \vec{0}$.\\
{\bf for} $k = 1,2,\dots$\\
\quad $\vec{w} = h^{-1}(\vec{\theta})$\\
\quad $\vec{v} = \nabla f(\vec{w})$\\
\quad $\tilde{\vec{\theta}} = \vec{\theta} - \delta \, \vec{v}$\\
\quad $\forall j,\ \theta_j = \mbox{sign}(\tilde{\theta}_j)\max (0,
\vert \tilde{\theta}_j \vert- \lambda\alpha\delta)$
\vspace{0.2in}
\hrule
\vspace{0.2in}
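
The boxed iteration might be rendered in Python as follows (a minimal sketch, not the MADlib implementation; \texttt{grad\_f} is assumed to return the gradient of the smooth part $f$ at the given coefficients):
\begin{verbatim}
import numpy as np

def h_inverse(theta, p):
    # Inverse p-norm link function from Eq. (eq:f); returns 0 for theta = 0.
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def smidas(grad_f, d, lam_alpha, delta, n_iter=100):
    p = 2.0 * np.log(d)          # p = 2 log d, with d the number of features
    theta = np.zeros(d)
    for _ in range(n_iter):
        w = h_inverse(theta, p)
        v = grad_f(w)
        theta_tilde = theta - delta * v
        # theta_j = sign(theta~_j) * max(0, |theta~_j| - lambda*alpha*delta)
        theta = np.sign(theta_tilde) * np.maximum(
            np.abs(theta_tilde) - lam_alpha * delta, 0.0)
    return h_inverse(theta, p)
\end{verbatim}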
The resulting fitting coefficients of this algorithm are not truly
sparse, but the values are very small (usually $< 10^{-15}$), so they can
be safely set to zero after filtering with a threshold.
This is done as follows: (1) multiply each coefficient by the
standard deviation of the corresponding feature; (2) compute the
average of the absolute values of the re-scaled coefficients; (3) divide each
re-scaled coefficient by that average, and if the resulting absolute
value is smaller than \emph{threshold}, set the original coefficient
to zero.
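
A Python sketch of this filtering step (the feature standard deviations and the \emph{threshold} value are assumed to be supplied by the caller):
\begin{verbatim}
import numpy as np

def filter_coefficients(w, feature_std, threshold):
    # (1) rescale by each feature's standard deviation
    rescaled = w * feature_std
    # (2) average absolute value of the rescaled coefficients
    avg = np.abs(rescaled).mean()
    # (3) zero out coefficients whose normalized magnitude is below threshold
    keep = np.abs(rescaled) / avg >= threshold
    return np.where(keep, w, 0.0)
\end{verbatim}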
IGD is by nature a sequential algorithm. When run in a
distributed way, each segment of the data runs its own SGD model, and
the models are averaged to produce a model for each iteration. This
averaging may slow down convergence, although it gives us the
ability to process large data sets on multiple machines. Therefore, this
algorithm provides an option \emph{parallel} to let the user choose
whether to do the computation in parallel.
IGD also implements the {\bf warm-up method}.
{\bf Stopping Criteria} Both {\bf FISTA} and {\bf IGD} compute the average difference
between the coefficients of two consecutive iterations, and if the
difference is smaller than \emph{tolerance} or the iteration number
is larger than \emph{max\_iter}, the computation stops.
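
This stopping rule corresponds to a check like the following (a sketch only; the actual parameter handling in MADlib may differ):
\begin{verbatim}
import numpy as np

def should_stop(w_new, w_old, iteration, tolerance, max_iter):
    # average difference between coefficients of two consecutive iterations
    avg_diff = np.abs(w_new - w_old).mean()
    return avg_diff < tolerance or iteration > max_iter
\end{verbatim}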
% subsubsection elastic_net_regularization (end)
% subsection regularization (end)