| % When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines. |
%!TEX root = ../design.tex
| |
| \chapter[Regression]{Regression} |
| \begin{moduleinfo} |
| \item[History] |
| \begin{modulehistory} |
| \item[v0.1] Initial version, including background of regression |
| \end{modulehistory} |
| \end{moduleinfo} |
| |
| % Abstract. What is the problem we want to solve? |
| |
| Regression analysis is a statistical tool for the investigation of |
| relationships between variables. Usually, the investigator seeks to ascertain |
| the causal effect of one variable upon another—the effect of a price increase |
| upon demand, for example, or the effect of changes in the money supply upon the |
| inflation rate. More specifically, regression analysis helps one understand how |
| the typical value of the dependent variable changes when any one of the |
| independent variables is varied, while the other independent variables are held |
| fixed. |
| |
| Regression models involve the following variables: |
| \begin{enumerate} |
| \item The unknown parameters, denoted as $\beta$, which may represent a scalar or a vector. |
	\item The independent variables, $x$.
	\item The dependent variables, $y$.
| \end{enumerate} |
| |
| \section{Linear Methods for Regression} % (fold) |
| \label{sub:linear_methods_for_regression} |
| |
| % subsection linear_methods_for_regression (end) |
| |
| \section{Regularization} % (fold) |
| \label{sub:regularization} |
| |
Consider a linear system $A x = y$, where $y$ is the result of measurements
contaminated by small errors (noise). Ill-conditioned or singular systems of
this kind also arise frequently in the iterative solution of nonlinear systems
or optimization problems. In all such situations, the vector $x = A^{-1}y$ (or,
in the full-rank overdetermined case, $A^{+} y$, with the pseudo-inverse
$A^{+} = (A^T A)^{-1}A^T$), if it exists at all, is usually a meaningless, bad
approximation to the true $x$.
| |
| Regularization techniques are needed to obtain meaningful solution estimates |
| for such ill-posed problems, and in particular when the number of parameters |
| is larger than the number of available measurements, so that standard least |
| squares techniques break down. |
| |
| \subsection{Linear Ridge Regression} |
| Ridge regression is the most commonly used method of regularization of |
| ill-posed problems. Mathematically, it seeks to minimize |
| |
| \begin{equation} |
| Q\left(\vec{w},w_0;\lambda\right)\equiv \min_{\vec{w},w_0}\left[ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - w_0 - |
| \vec{w} \cdot \vec{x}_i \right)^2 |
| +\frac{\lambda}{2}\|\vec{w}\|_2^2 \right]\ , |
| \end{equation} |
| for a given value of $\lambda$, where $\vec{w}$ and $w_0$ are the fitting coefficients, and $\lambda$ |
| is a non-negative regularization parameter. $\vec{w}$ is a vector in |
| $d$ dimensional space, and |
| \begin{equation} |
| \|\vec{w}\|_2^2 = \sum_{j=1}^{d}w_j^2 = \vec{w}^T\vec{w}\ . |
| \end{equation} |
| When $\lambda = 0$, $Q$ is |
| the mean squared error of the fitting. |
| |
| The intercept term $w_0$ is not regularized, because this term is |
| fully decided by the mean values of $y_i$ and $\vec{x}_i$ and the |
| values of $\vec{w}$, and does not affect the model's complexity. |
| |
$Q\left(\vec{w},w_0;\lambda\right)$ is a quadratic function of $\vec{w}$ and
$w_0$, and thus its minimizer can be obtained analytically:
| \begin{equation} |
| \vec{w}_{ridge}=\left(\lambda\vec{I}_d + |
| \vec{X}^T\vec{X}\right)^{-1}\vec{X}^T\vec{y}\ . |
| \end{equation} |
Using the Newton method described in Sec.~6.2.4, the above quantity can be
computed in a single Newton step.
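For concreteness, the closed-form expression above can be evaluated directly
with standard linear algebra. The following NumPy sketch is illustrative only
(it is not the MADlib implementation); the function name and the assumption
that the intercept has been handled separately, e.g.\ by centering the data,
are ours.
\begin{verbatim}
import numpy as np

def ridge_coefficients(X, y, lam):
    """Evaluate w_ridge = (lam * I_d + X^T X)^{-1} X^T y.

    X is the N x d design matrix and y the length-N response; the
    intercept w_0 is assumed to be handled separately (e.g. by
    centering X and y), since it is not regularized.
    """
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
\end{verbatim}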
| |
Many packages for ridge regularization regularize the fitting
coefficients not for the original data but for scaled data. MADlib
also provides this option. When the normalization parameter is set to
\texttt{True} (it is \texttt{False} by default), the data are first
transformed as follows before the ridge regularization is applied:
| \begin{equation} |
| x_i' \leftarrow \frac{x_i - \langle x_i \rangle}{\langle (x_i - |
| \langle x_i \rangle)^2\rangle} \ , |
| \end{equation} |
| \begin{equation} |
| y_i \leftarrow y_i - \langle y_i \rangle \ , |
| \end{equation} |
where $\langle \cdot \rangle = \frac{1}{N}\sum_{i=1}^{N}\,\cdot\,$ denotes the average over the $N$ data points.
| |
| Note that Ridge regressions for scaled data and un-scaled data are not equivalent. |
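As an illustration of the transformation above (taken literally, with the mean
squared deviation in the denominator), a minimal NumPy sketch follows;
\texttt{scale\_for\_ridge} is a hypothetical helper name, not MADlib's
normalization routine.
\begin{verbatim}
import numpy as np

def scale_for_ridge(X, y):
    """Center each column of X and divide it by its mean squared
    deviation, and center y, as in the equations above."""
    x_mean = X.mean(axis=0)
    x_msd = ((X - x_mean) ** 2).mean(axis=0)   # <(x - <x>)^2>
    return (X - x_mean) / x_msd, y - y.mean()
\end{verbatim}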
| |
| \subsection{Elastic Net Regularization} % (fold) |
| \label{ssub:elastic_net_regularization} |
| As a continuous shrinkage method, ridge regression achieves its better prediction performance through a bias–variance trade-off. However, ridge regression cannot produce a parsimonious model, for it always keeps all the predictors in the model~\cite{zou2005}. Best subset selection in contrast produces a sparse model, but it is extremely variable because of its inherent discreteness. |
| |
| A promising technique called the lasso was proposed by Tibshirani (1996). The |
| lasso is a penalized least squares method imposing an L1-penalty on the |
| regression coefficients. Owing to the nature of the L1-penalty, the lasso does |
| both continuous shrinkage and automatic variable selection simultaneously. |
| |
| Although the lasso has shown success in many situations, it has some |
| limitations. Consider the following three scenarios: |
| \begin{enumerate} |
	\item In the case where the number of features ($p$) is much larger than the number of observations ($n$), i.e.\ $p \gg n$, the lasso selects at most $n$ variables before it saturates, because of the nature of the convex optimization problem. This seems to be a limiting feature for a variable selection method. Moreover, the lasso is not well defined unless the bound on the $L_1$-norm of the coefficients is smaller than a certain value.
| \item If there is a group of variables among which the pairwise correlations are very high, then the lasso tends to select only one variable from the group and does not care which one is selected. |
	\item For usual $n > p$ situations, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression.
| \end{enumerate} |
| |
| These scenarios make lasso an inappropriate variable selection method in some situations. |
| |
Hui Zou and Trevor Hastie~\cite{zou2005} introduced a new regularization
technique called the `elastic net'. Similar to the lasso, the elastic net
simultaneously does automatic variable selection and continuous
shrinkage, and it can select groups of correlated variables. It is like a
stretchable fishing net that retains `all the big fish'.
| |
| The elastic net regularization minimizes the following target function |
| \begin{equation} \label{eq:target} |
  \min_{\vec{w} \in R^N}L(\vec{w}) + \lambda \left[\frac{1-\alpha}{2}\|\vec{w}\|_2^2 +
    \alpha \|\vec{w}\|_1\right]\ ,
| \end{equation} |
| where $\|\vec{w}\|_1 = \sum_{i=1}^N|w_i|$ and $N$ is the number of features. |
| |
| For the elastic net regularization on linear models, |
| \begin{equation} |
| L(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot |
| \vec{x}_m\right)^2\ , |
| \end{equation} |
where the sum is over all observations and $M$ is the total number of
observations.
| |
| For the elastic net regularization on logistic models, |
| \begin{equation} |
| L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 + |
| \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 + |
| \vec{w}\cdot\vec{x}_m}\right)\right]\ , |
| \end{equation} |
| where $y_m \in \{0,1\}$. |
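The pieces of the target function can be written down compactly. The following
NumPy sketch (illustrative only; all names are ours) evaluates the penalty of
Eq.~(\ref{eq:target}) together with the linear and logistic loss terms defined
above.
\begin{verbatim}
import numpy as np

def enet_penalty(w, lam, alpha):
    """lam * [ (1 - alpha)/2 * ||w||_2^2 + alpha * ||w||_1 ]."""
    return lam * ((1.0 - alpha) / 2.0 * np.dot(w, w)
                  + alpha * np.abs(w).sum())

def linear_loss(w, w0, X, y):
    """(1/2M) * sum_m (y_m - w0 - w . x_m)^2."""
    r = y - w0 - X @ w
    return 0.5 * np.mean(r ** 2)

def logistic_loss(w, w0, X, y):
    """sum_m [ y_m log(1 + e^{-z_m}) + (1 - y_m) log(1 + e^{z_m}) ],
    with z_m = w0 + w . x_m and y_m in {0, 1}."""
    z = w0 + X @ w
    # logaddexp(0, x) = log(1 + e^x), computed stably
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))
\end{verbatim}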
| |
| \subsubsection{Optimizer Algorithms} |
We currently support two optimizer algorithms: the default is FISTA, and the
other is IGD.
| |
| \paragraph{FISTA} |
| |
| Fast Iterative Shrinkage Thresholding Algorithm (FISTA) with {\bf |
| backtracking step size} [4]: |
| \vspace{0.2in} |
\hrule
| \vspace{0.2in} |
| {\bf Step $0$}: Choose $\delta>0$ and $\eta > 1$, and |
| $\vec{w}^{(0)} \in R^N$. Set $\vec{v}^{(1)}=\vec{w}^{(0)}$ and |
| $t_1=1$. |
| |
{\bf Step $k$}: ($k \ge 1$) Find the smallest nonnegative integer
$i_k$ such that, with $\bar{\delta} = \delta_{k-1}/\eta^{i_k-1}$,
| \begin{equation} |
  F(p_{\bar{\delta}}(\vec{v}^{(k)})) \le
  Q_{\bar{\delta}}(p_{\bar{\delta}}(\vec{v}^{(k)}), \vec{v}^{(k)})\ .
| \end{equation} |
| Set $\delta_k = \delta_{k-1}/\eta^{i_k-1}$ and compute |
| \begin{equation} |
| \vec{w}^{(k)} = p_{\delta_k}\left(\vec{v}^{(k)}\right)\ , |
| \end{equation} |
| \begin{equation} |
| t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}\ , |
| \end{equation} |
| \begin{equation} |
| \vec{v}^{(k+1)} = \vec{w}^{(k)} + |
| \frac{t_k-1}{t_{k+1}}\left(\vec{w}^{(k)} - \vec{w}^{(k-1)}\right)\ . |
| \end{equation} |
| \vspace{0.2in} |
\hrule
| \vspace{0.2in} |
| Here, |
| \begin{equation} |
| F(\vec{w}) = f(\vec{w}) + g(\vec{w})\ , |
| \end{equation} |
| where $f(\vec{w})$ is the differentiable part of |
| Eq. (\ref{eq:target}) and $g(\vec{w})$ is the non-differentiable part. For linear models, |
| \begin{equation} |
| f(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot |
| \vec{x}_m\right)^2 + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ , |
| \end{equation} |
| and for logistic models, |
| \begin{equation} |
| f(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 + |
| \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 + |
| \vec{w}\cdot\vec{x}_m}\right)\right] + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ . |
| \end{equation} |
| And for both types of models, |
| \begin{equation} |
| g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|\ . |
| \end{equation} |
| |
The function $Q_{\delta}$ is defined as
| \begin{equation} |
| Q_{\delta}(\vec{a}, \vec{b}) := f(\vec{b}) + \langle \vec{a} - |
| \vec{b}, \nabla f(\vec{b})\rangle + |
| \frac{1}{2\delta}\|\vec{a} - \vec{b}\|^2 + g(\vec{a})\ , |
| \end{equation} |
where $\langle\cdot,\cdot\rangle$ is the usual vector inner product.
| |
The proxy function is defined as
| \begin{equation} |
| p_\delta (\vec{v}) := \underset{\vec{w}}{\operatorname{arg\,min}} \left[g(\vec{w}) + |
| \frac{1}{2\delta}\left\|\vec{w} - \left(\vec{v} - \delta\,\nabla f(\vec{v})\right)\right\|^2 \right] |
| \end{equation} |
| For our case, where $g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|$, the |
| proxy function is simply equal to the soft-thresholding function |
\begin{equation}
  p_\delta (\vec{v})_i = \left\{ \begin{array}{ll}
      u_i - \lambda\alpha\delta\ , \quad & \mbox{if } u_i > \lambda\alpha\delta \\
      0\ , \quad & \mbox{if } \vert u_i \vert \le \lambda\alpha\delta \\
      u_i + \lambda\alpha\delta\ , \quad & \mbox{if } u_i < - \lambda\alpha\delta
    \end{array}
  \right.
\end{equation}
| where |
| \begin{equation} |
| \vec{u} = \vec{v} - \delta\,\nabla f(\vec{v})\ . |
| \end{equation} |
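Putting the pieces together, a compact NumPy sketch of the FISTA iteration with
the backtracking rule above is given below. It is illustrative only: the
function names are ours, the intercept $w_0$ is omitted for brevity, and
\texttt{f}/\texttt{grad\_f} stand for the smooth part $f$ and its gradient.
\begin{verbatim}
import numpy as np

def soft_threshold(u, t):
    """Component-wise soft-thresholding: sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def fista(f, grad_f, lam, alpha, w_init, delta0=1.0, eta=2.0, max_iter=100):
    """FISTA with backtracking; g(w) = lam * alpha * ||w||_1."""
    g = lambda w: lam * alpha * np.abs(w).sum()
    w_prev = np.array(w_init, dtype=float)
    v = w_prev.copy()
    t, delta = 1.0, delta0
    for _ in range(max_iter):
        # The first candidate step is delta * eta (i_k = 0), so the
        # step size is allowed to grow between iterations.
        delta *= eta
        fv, gv = f(v), grad_f(v)
        while True:
            p = soft_threshold(v - delta * gv, lam * alpha * delta)
            d = p - v
            # Q_delta(p, v) = f(v) + <p - v, grad f(v)>
            #                 + ||p - v||^2 / (2 delta) + g(p)
            q = fv + d @ gv + d @ d / (2.0 * delta) + g(p)
            if f(p) + g(p) <= q:
                break
            delta /= eta
        w = p
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w + (t - 1.0) / t_next * (w - w_prev)
        w_prev, t = w, t_next
    return w_prev
\end{verbatim}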
| |
{\bf Active set method} is used in our implementation of FISTA to
speed up the computation. Considerable speedup is obtained by
organizing the iterations around the active set of features, i.e.,
those with nonzero coefficients. After a complete cycle through all
the variables, we iterate only on the active set until convergence.
If another complete cycle does not change the active set, we are done;
otherwise the process is repeated.
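The strategy can be outlined as follows; this is a hypothetical sketch, with
\texttt{full\_cycle} and \texttt{active\_cycle} standing in for one optimizer
pass over all features and over the active features only.
\begin{verbatim}
import numpy as np

def active_set_fit(full_cycle, active_cycle, w_init, max_rounds=100):
    """Outline of the active-set strategy described above."""
    w = np.array(w_init, dtype=float)
    for _ in range(max_rounds):
        w = full_cycle(w)                    # one complete cycle
        active = set(np.flatnonzero(w))      # nonzero coefficients
        w = active_cycle(w, active)          # iterate on active set only
        w_next = full_cycle(w)               # another complete cycle
        if set(np.flatnonzero(w_next)) == active:
            return w_next                    # active set unchanged: done
        w = w_next
    return w
\end{verbatim}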
| |
{\bf Warm-up method} is also used to speed up the computation. When
the option parameter \texttt{warmup} is set to \texttt{True}, a
strictly decreasing sequence of $\lambda$ values, ending at the
$\lambda$ value that the user wants to compute, is used. The larger
$\lambda$ values give very sparse solutions, and each sparse solution
is used as the initial guess for the next $\lambda$'s solution, which
speeds up the computation for that $\lambda$. For larger data sets,
this can sometimes accelerate the whole computation and may be faster
than computing only the single target $\lambda$ value.
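A sketch of the warm-up loop is shown below; the multiplicative spacing of the
$\lambda$ sequence and the helper names are our illustrative choices, not the
exact values MADlib uses.
\begin{verbatim}
def warmup_path(fit, lam_target, n_steps=5, factor=10.0):
    """Solve for a strictly decreasing sequence of lambda values ending
    at lam_target, warm-starting each solve with the previous solution.
    fit(lam, w_init) returns the coefficients for one lambda value and
    must accept w_init=None for a cold start."""
    lams = [lam_target * factor ** k for k in range(n_steps - 1, -1, -1)]
    w = None
    for lam in lams:
        w = fit(lam, w)   # sparse solution at a larger lambda seeds the next
    return w
\end{verbatim}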
| |
{\bf Note:} Our implementation is slightly different from the
original FISTA. In the original FISTA, during the backtracking
procedure the algorithm searches for a nonnegative integer $i_k$ and
the new step size is $\delta_k = \delta_{k-1}/\eta^{i_k}$; thus the
step size is non-increasing. Here, we allow the step size to increase
by using $\delta_k = \delta_{k-1}/\eta^{i_k-1}$, so that larger step
sizes can be tried by the algorithm. Tests show that this is actually
faster.
| |
| \paragraph{IGD} |
| |
The Incremental Gradient Descent (IGD) algorithm is stochastic by
nature, so it is difficult to obtain sparse solutions with it. What we
implement is the Stochastic Mirror Descent Algorithm made Sparse
(SMIDAS). The inverse $p$-form link function is used,
| \begin{equation} \label{eq:f} |
| h_j^{-1}(\vec{\theta}) = \frac{\mbox{sign}(\theta_j)\vert \theta_j |
| \vert^{p-1}}{\|\theta\|_p^{p-2}}\ , |
| \end{equation} |
| where |
| \begin{equation} |
  \|\theta\|_p = \left(\sum_j \vert \theta_j \vert ^p\right)^{1/p}\ ,
| \end{equation} |
| and $\mbox{sign}(0) = 0$. |
| \vspace{0.2in} |
\hrule
| \vspace{0.2in} |
| Choose step size $\delta > 0$. |
| |
Let $p = 2\log d$, where $d$ is the number of features, and let $h^{-1}$ be as in Eq. (\ref{eq:f}).
| |
| Let $\vec{\theta} = \vec{0}$ |
| |
| {\bf for} $k = 1,2,\dots$ |
| |
| \quad $\vec{w} = h^{-1}(\vec{\theta})$ |
| |
| \quad $\vec{v} = \nabla f(\vec{w})$ |
| |
\quad $\tilde{\vec{\theta}} = \vec{\theta} - \delta \, \vec{v}$
| |
| \quad $\forall j, \theta_j = \mbox{sign}(\tilde{\theta}_j)\max (0, |
| \vert \tilde{\theta}_j \vert- \lambda\alpha\delta)$ |
| \vspace{0.2in} |
\hrule
| \vspace{0.2in} |
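A direct transcription of the iteration above into NumPy follows (illustrative
only; the names are ours and \texttt{grad\_f} is the gradient of the smooth
part $f$).
\begin{verbatim}
import numpy as np

def p_norm_link_inverse(theta, p):
    """h^{-1}(theta)_j = sign(theta_j)|theta_j|^(p-1) / ||theta||_p^(p-2)."""
    norm = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def smidas(grad_f, d, lam, alpha, delta, n_iter=100):
    """The SMIDAS iteration written out above, for d features."""
    p = 2.0 * np.log(d)
    theta = np.zeros(d)
    for _ in range(n_iter):
        w = p_norm_link_inverse(theta, p)
        v = grad_f(w)
        theta_tilde = theta - delta * v
        theta = np.sign(theta_tilde) * np.maximum(
            0.0, np.abs(theta_tilde) - lam * alpha * delta)
    return p_norm_link_inverse(theta, p)
\end{verbatim}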
| |
The resulting fitting coefficients of this algorithm are not truly
sparse, but the values are very small (usually $< 10^{-15}$), so they
can be safely set to zero after filtering with a threshold.
| |
This is done as follows: (1) multiply each coefficient by the standard
deviation of the corresponding feature; (2) compute the average of the
absolute values of the re-scaled coefficients; (3) divide each
re-scaled coefficient by that average, and if the resulting absolute
value is smaller than \emph{threshold}, set the original coefficient
to zero.
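A sketch of this post-processing step (hypothetical names, not MADlib's
internal routine):
\begin{verbatim}
import numpy as np

def threshold_coefficients(w, feature_std, threshold):
    """Zero out effectively-zero coefficients, following (1)-(3) above."""
    rescaled = w * feature_std                 # (1) scale by feature std
    avg = np.mean(np.abs(rescaled))            # (2) average absolute value
    if avg == 0.0:
        return w
    w = w.copy()
    w[np.abs(rescaled / avg) < threshold] = 0  # (3) compare to threshold
    return w
\end{verbatim}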
| |
IGD is by nature a sequential algorithm. When run in a distributed
fashion, each segment of the data runs its own SGD model, and the
models are averaged to obtain the model for each iteration. This
averaging might slow down the convergence, although it gives us the
ability to process large data sets on multiple machines. Therefore,
this algorithm provides an option \emph{parallel} that lets the user
choose whether to do the computation in parallel.
| |
| IGD also implements the {\bf warm-up method}. |
| |
{\bf Stopping Criteria} Both {\bf FISTA} and {\bf IGD} compute the average difference
between the coefficients of two consecutive iterations, and if the
difference is smaller than \emph{tolerance} or the iteration number
is larger than \emph{max\_iter}, the computation stops.
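Interpreting ``average difference'' as the mean absolute change in the
coefficients (our assumption), the check is simply:
\begin{verbatim}
import numpy as np

def should_stop(w_new, w_old, iteration, tolerance, max_iter):
    """Stop when the mean absolute coefficient change drops below
    tolerance, or the iteration count exceeds max_iter."""
    return np.mean(np.abs(w_new - w_old)) < tolerance or iteration > max_iter
\end{verbatim}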
| |
| |
| % subsubsection elastic_net_regularization (end) |
| % subsection regularization (end) |