% When using TeXShop on the Mac, let it know the root document.
% The following must be one of the first 20 lines.
% !TEX root = ../design.tex
\chapter[Convex Programming Framework]{Convex Programming Framework}
\begin{moduleinfo}
\item[Author] \href{mailto:xfeng@cs.wisc.edu}{Xixuan (Aaron) Feng}
\item[History]
\begin{modulehistory}
\item[v0.5] Initial revision
\end{modulehistory}
\end{moduleinfo}
% Motivation. Why do we want to have this abstract layer?
By its nature, MADlib aims to support many different kinds of data modeling modules, such as logistic regression, support vector machines, and matrix factorization.
However, keeping up with the state of the art and experimenting with individual data modeling modules require significant development and quality-assurance effort.
Therefore, to lower the bar for adding and maintaining new modules, it is crucial to identify the invariants shared by many important modules and, in turn, to abstract and encapsulate them as reusable components.
Bismarck \cite{DBLP:conf/sigmod/FengKRR12} is such a unified framework: it links many useful statistical modeling modules to the relational DBMS by introducing a well-studied formulation, convex programming, in between.
Incremental Gradient Descent (IGD) has also been shown to be a very effective algorithm for solving convex programs in the relational DBMS environment.
Naturally, however, IGD does not always fit the needs of MADlib users, who apply convex statistical modeling to various domains.
Driven by this, the convex programming framework in MADlib also implements other algorithms that solve convex programs, such as Newton's method and conjugate gradient methods.
\section{Introduction}
% Problem definition. What are the problems that we can solve, formally and example applications?
% linearly separable, unconstrained, continuous, deterministic, convex, minimization problems.
This section first explains, formally, the type of problems that we consider in the MADlib convex programming framework, and then gives a few example modules.
\subsection{Formulation}
We support numerical optimization problems with an objective function that is a sum of many component functions \cite{springerlink:10.1007/s10107-011-0472-0}, such as
\[\min_{w \in \mathbb{R}^N} \sum_{m=1}^M f_{z_m}(w),\]
where $z_m \in \mathcal{O}, m = 1,...,M$, are observations, and $f_{z_m} : \mathbb{R}^N \to \mathbb{R}$ are convex functions.
For simplicity, let $z_{1:M}$ denote $\{z_m \in \mathcal{O} | m = 1,...,M\}$.
Note: given $z_{1:M}$, let $F(w) = \sum_{m=1}^M f_{z_m}(w)$, and $F : \mathbb{R}^N \to \mathbb{R}$ is also convex.
\subsection{Examples}
Many popular models can be formulated in the above form, with $f_{z_m}$ being properly specified.
\paragraph{Logistic Regression.} The component function is given by
\[f_{(x_m, y_m)}(w) = \log(1 + e^{- y_m w^{T} x_m}),\]
where $x_m \in \mathbb{R}^N$ are values of independent variables, and $y_m \in \{-1, 1\}$ are values of the dependent variable, $m = 1,...,M$.
\paragraph{Linear SVM with hinge loss.} The component function is given by
\[f_{(x_m, y_m)}(w) = \max(0, 1 - y_m w^{T} x_m),\]
where $x_m \in \mathbb{R}^N$ are values of features, and $y_m \in \{-1, 1\}$ are values of the label, $m = 1,...,M$.
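To make these component functions concrete, the following is a minimal Python sketch (illustrative, not MADlib code) of the two examples above and their (sub)gradients, assuming \texttt{w} and \texttt{x} are NumPy vectors and \texttt{y} is a label in $\{-1, 1\}$:
\begin{verbatim}
import numpy as np

def logistic_loss(w, x, y):
    # f_{(x, y)}(w) = log(1 + exp(-y w^T x))
    return np.log1p(np.exp(-y * w.dot(x)))

def logistic_gradient(w, x, y):
    # grad_w log(1 + exp(-y w^T x)) = -y x / (1 + exp(y w^T x))
    return -y * x / (1.0 + np.exp(y * w.dot(x)))

def hinge_loss(w, x, y):
    # f_{(x, y)}(w) = max(0, 1 - y w^T x)
    return max(0.0, 1.0 - y * w.dot(x))

def hinge_subgradient(w, x, y):
    # A subgradient: -y x where the margin is violated, 0 elsewhere
    return -y * x if y * w.dot(x) < 1.0 else np.zeros_like(x)
\end{verbatim}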
Bertsekas \cite{springerlink:10.1007/s10107-011-0472-0} gives many other examples across application domains.
\section{Algorithms}
\paragraph{Gradient Descent.}
One of the most frequently mentioned algorithms for solving convex programs is gradient descent.
This is an iterative algorithm whose iteration is given by
\[w_{k+1} = w_k - \alpha_k \nabla F(w_k),\]
where, given $z_{1:M}$, $F(w) = \sum_{m=1}^M f_{z_m}(w)$, and $\alpha_k$ is a positive scalar, called stepsize (or step length).
The gradient descent algorithm is simple but usually recognized as slow, with a linear convergence rate, while other algorithms such as conjugate gradient methods and Newton's method have super-linear convergence rates \cite{nocedal2006numerical}.
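For reference, a plain gradient descent loop can be written in a few lines of Python; this is a sketch, where \texttt{grad\_F} computes the full gradient $\nabla F$ and the constant stepsize \texttt{alpha} and stopping rule are illustrative assumptions:
\begin{verbatim}
import numpy as np

def gradient_descent(grad_F, w0, alpha=0.01, max_iter=100, tol=1e-6):
    # Full-gradient descent: w_{k+1} = w_k - alpha * grad_F(w_k)
    w = w0
    for _ in range(max_iter):
        step = alpha * grad_F(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # stop when updates become tiny
            break
    return w
\end{verbatim}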
\paragraph{Line Search: A Class of Algorithms.}
% line search & trust region
Convex programming has been well studied in the past few decades, and two main classes of algorithms are widely considered: line search and trust region (\cite{nocedal2006numerical}, section 2.2).
Because line search is more commonly deployed and discussed, we focus on line search in MADlib, although some of the algorithms we discuss in this section can also easily be formulated as trust-region strategies.
% general form of line search: w += \alpha p_k
All algorithms of line search strategies have the iteration given by
\[w_{k+1} = w_k + \alpha_k p_k,\]
where $p_k \in \mathbb{R}^N$ is the search direction and $\alpha_k$ is the stepsize \cite{nocedal2006numerical}.
Specifically, for gradient descent, $p_k$ is the steepest descent direction $- \nabla \sum_{m=1}^M f_{z_m}(w_k)$.
\subsection{Formal Description of Line Search}
\begin{algorithm}[line-search$(z_{1:M})$] \label{alg:line-search}
\alginput{Observation set $z_{1:M}$,\\
convergence criterion $\mathit{Convergence}()$,\\
start strategy $\mathit{Start}()$,\\
initialization strategy $\mathit{Initialization}()$,\\
transition strategy $\mathit{Transition}()$,\\
finalization strategy $\mathit{Finalization}()$}
\algoutput{Coefficients $w \in \mathbb{R}^N$}
\algprecond{$iteration = 0, k = 0$}
\begin{algorithmic}[1]
\State $w_\text{new} \set \mathit{Start}(z_{1:M})$
\Repeat
\State $w_\text{old} \set w_\text{new}$
\State $\mathit{state} \set \mathit{Initialization}(w_\text{new})$
\For{$m \in 1..M$} \Comment{Single entry in the observation set}
\State $\mathit{state} \set \mathit{Transition}(\mathit{state}, z_m)$
\Comment{Usually computing derivative}
\EndFor
\State $w_\text{new} \set \mathit{Finalization}(\mathit{state})$
\Until{$\mathit{Convergence}(w_\text{old}, w_\text{new}, \mathit{iteration})$}
\State \Return $w_\text{new}$
\end{algorithmic}
\end{algorithm}
\paragraph{Programming Model.}
The above presents the generic line search strategy in the fashion of the programming model supported by MADlib (mainly user-defined aggregates).
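To illustrate this mapping, the following Python sketch mirrors Algorithm~\ref{alg:line-search}, with the strategy callables assumed to have the signatures used in the pseudocode (this is an illustration of the control flow, not the MADlib driver itself):
\begin{verbatim}
def line_search(observations, start, initialization,
                transition, finalization, convergence):
    # Generic line search: sequential outer loop around a
    # data-parallel inner loop (the user-defined aggregate).
    iteration = 0
    w_new = start(observations)
    while True:
        w_old = w_new
        state = initialization(w_new)
        for z in observations:  # inner loop over observations
            state = transition(state, z)
        w_new = finalization(state)
        iteration += 1
        if convergence(w_old, w_new, iteration):
            return w_new
\end{verbatim}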
\paragraph{Parallelism.}
The outer loop is inherently sequential, but we require the inner loop to be data-parallel.
Simple component-wise addition or model averaging \cite{DBLP:conf/nips/DuchiAW10} is used to merge two states; see the sketch below.
A merge function is not explicitly added to the pseudocode for simplicity; a separate discussion is given where necessary.
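As a sketch of the two merge flavors, assuming hypothetical state fields \texttt{g\_k} (a partial gradient sum), \texttt{w\_k} (a model), and \texttt{count} (observations seen):
\begin{verbatim}
def merge_sum(state_a, state_b):
    # Partial aggregates (e.g., gradient sums) merge by
    # component-wise addition.
    state_a.g_k = state_a.g_k + state_b.g_k
    return state_a

def merge_average(state_a, state_b):
    # Model averaging, weighted by the number of observations
    # each state has seen (field names are illustrative).
    n = state_a.count + state_b.count
    state_a.w_k = (state_a.count * state_a.w_k +
                   state_b.count * state_b.w_k) / n
    state_a.count = n
    return state_a
\end{verbatim}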
\paragraph{Convergence criterion.}
Usually, the following conditions are combined with AND, OR, or NOT.
\begin{enumerate}
\item The change in the objective drops below a given threshold (e.g., negative log-likelihood, root-mean-square error).
\item The value of the objective matches some pre-computed value.
\item The maximum number of iterations has been reached.
\item Other criteria can be added as needed.
\end{enumerate}
In the MADlib implementation, the computation of the objective is paired with line search to share data I/O; a sketch of a combined criterion follows.
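A minimal sketch of such a criterion, assuming the objective values are tracked alongside the coefficients and with an illustrative tolerance and iteration cap:
\begin{verbatim}
def convergence(obj_old, obj_new, iteration, tol=1e-4, max_iter=100):
    # Condition 1 OR condition 3: the change in the objective is
    # small, or the maximum number of iterations has been reached.
    return abs(obj_old - obj_new) < tol or iteration >= max_iter
\end{verbatim}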
\paragraph{Start strategy.}
In most cases, zeros are used unless otherwise specified.
\paragraph{Transition and finalization strategies.}
The coefficient update code ($w_{k+1} \set w_k + \alpha_k p_k$) is put into either $\mathit{Transition}()$ or $\mathit{Finalization}()$.
These two functions contain most of the computation logic, which computes the search direction $p_k$.
We discuss details of the individual algorithms in the following sections.
For simplicity, the global iterator $k$ is read and updated in place by these functions without being specified as an additional argument.
\subsection{Incremental Gradient Descent (IGD)}
% motivation and introduction of IGD
A main challenge arises when we handle large amounts of data, $M \gg 1$, where computing $\nabla (\sum_{m=1}^M f_{z_m})$ requires a full pass over the observation data, which is usually expensive.
What distinguishes IGD from other algorithms is that it approximates $\nabla (\sum_{m=1}^M f_{z_m}) = \sum_{m=1}^M (\nabla f_{z_m})$ by the gradient of a single component function $\nabla f_{z_m}$
\footnote{$z_m$ is usually selected in a stochastic fashion.
Therefore, IGD is also referred to as stochastic gradient descent.
The convergence and convergence rate of IGD are well established \cite{springerlink:10.1007/s10107-011-0472-0}, and IGD is often considered to be very effective when $M$ is very large \cite{DBLP:conf/nips/BottouB07}.}.
This is reflected in the pseudocode by placing the coefficient update code ($w_{k+1} \set w_k + \alpha_k p_k$) in $\mathit{Transition}()$ instead of in $\mathit{Finalization}()$.
\subsubsection{Initialization Strategy}
\begin{algorithm}[initialization-igd$(w)$] \label{alg:initialization-igd}
\alginput{Coefficients $w \in \mathbb{R}^N$}
\algoutput{Transition state $\mathit{state}$}
\begin{algorithmic}[1]
\State $\mathit{state}.w_k \set w$
\State \Return $\mathit{state}$
\end{algorithmic}
\end{algorithm}
\subsubsection{Transition Strategy}
\begin{algorithm}[transition-igd$(\mathit{state}, z_m)$] \label{alg:transition-igd}
\alginput{Transition state $\mathit{state}$,\\
observation entry $z_m$,\\
stepsize $\alpha \in \mathbb{R}^+$,\\
gradient function $\mathit{Gradient}()$}
\algoutput{Transition state $\mathit{state}$}
\begin{algorithmic}[1]
\State $p_k \set - \mathit{Gradient}(\mathit{state}.w_k, z_m)$
\Comment{Previously mentioned as $p_k = - \nabla f_{z_m}$}
\State $\mathit{state}.w_{k+1} \set \mathit{state}.w_k + \alpha p_k$
\State $k \set k + 1$
\Comment{In-place update of the global iterator}
\State \Return $\mathit{state}$
\end{algorithmic}
\end{algorithm}
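Instantiated for logistic regression, the transition strategy reduces to a few lines. A Python sketch, with \texttt{state} assumed to be a simple object holding the coefficients \texttt{w}:
\begin{verbatim}
import numpy as np

def transition_igd(state, x, y, alpha=0.01):
    # Gradient of a single component function (logistic loss)
    g = -y * x / (1.0 + np.exp(y * state.w.dot(x)))
    # In-place coefficient update: w <- w + alpha * p, with p = -g
    state.w = state.w - alpha * g
    return state
\end{verbatim}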
\paragraph{Stepsize.}
In MADlib, we support only a constant stepsize, for simplicity.
Although IGD with a constant stepsize does not even have a convergence guarantee \cite{springerlink:10.1007/s10107-011-0472-0}, it has so far worked reasonably well in practice \cite{DBLP:conf/sigmod/FengKRR12} with proper tuning.
Commonly used algorithms for tuning the stepsize (\cite{bertsekas1999nonlinear}, appendix C) are mostly heuristics and do not have strong guarantees on the convergence rate.
More importantly, these algorithms require many evaluations of the objective function, which is usually very costly in MADlib use cases.
\paragraph{Gradient function.}
The function through which IGD accepts the computational logic of a specified module.
The MADlib convex programming framework currently has no support for objective functions that do not have a gradient or subgradient.
Objective functions that are not linearly separable, such as that of Cox proportional hazards models \cite{Cox1972}, are also not currently supported by the framework.
\subsubsection{Finalization Strategy}
\begin{algorithm}[finalization-igd$(\mathit{state})$] \label{alg:finalization-igd}
\alginput{Transition state $\mathit{state}$}
\algoutput{Coefficients $w \in \mathbb{R}^N$}
\begin{algorithmic}[1]
\State \Return $\mathit{state}.w_k$
\Comment{Trivially return $w_k$}
\end{algorithmic}
\end{algorithm}
\subsection{Conjugate Gradient Methods}
% Simple description of conjugate gradient
Conjugate gradient methods that solve convex programs are usually referred to as nonlinear conjugate gradient methods.
The key difference between conjugate gradient methods and gradient descent is that conjugate gradient methods adjust the search direction $p_k$ by taking into account the gradient directions of previous iterations.
We skip the formal description of conjugate gradient methods, which can be found in the references (such as Nocedal \& Wright \cite{nocedal2006numerical}, section 5.2).
\subsubsection{Initialization Strategy}
\begin{algorithm}[initialization-cg$(w)$] \label{alg:initialization-cg}
\alginput{Coefficients $w \in \mathbb{R}^N$,\\
gradient value $g \in \mathbb{R}^N$ (i.e., $\sum_{m=1}^M \nabla f_{z_m}(w_{k-1})$),\\
previous search direction $p \in \mathbb{R}^N$}
\algoutput{Transition state $\mathit{state}$}
\begin{algorithmic}[1]
\State $\mathit{state}.p_{k-1} \set p$
\State $\mathit{state}.g_{k-1} \set g$
\State $\mathit{state}.w_k \set w$
\State $\mathit{state}.g_k \set 0$
\State \Return $\mathit{state}$
\end{algorithmic}
\end{algorithm}
\subsubsection{Transition Strategy}
\begin{algorithm}[transition-cg$(\mathit{state}, z_m)$] \label{alg:transition-cg}
\alginput{Transition state $\mathit{state}$,\\
observation entry $z_m$,\\
gradient function $\mathit{Gradient}()$}
\algoutput{Transition state $\mathit{state}$}
\begin{algorithmic}[1]
\State $\mathit{state}.g_k \set \mathit{state}.g_k + \mathit{Gradient}(\mathit{state}.w_k, z_m)$
\State \Return $\mathit{state}$
\end{algorithmic}
\end{algorithm}
\subsubsection{Finalization Strategy}
\begin{algorithm}[finalization-cg$(\mathit{state})$] \label{alg:finalization-cg}
\alginput{Transition state $\mathit{state}$,\\
stepsize $\alpha \in \mathbb{R}^+$,\\
update parameter strategy $\mathit{Beta}()$}
\algoutput{Coefficients $w \in \mathbb{R}^N$,\\
gradient value $g \in \mathbb{R}^N$ (i.e., $\sum_{m=1}^M \nabla f_{z_m}(w_{k-1})$),\\
previous search direction $p \in \mathbb{R}^N$}
\begin{algorithmic}[1]
\If{$k = 0$}
\State $\mathit{state}.p_k \set - \mathit{state}.g_k$
\Else
\State $\beta_k \set \mathit{Beta}(\mathit{state})$
\State $\mathit{state}.p_k \set - \mathit{state}.g_k + \beta_k\, \mathit{state}.p_{k-1}$
\EndIf
\State $\mathit{state}.w_{k+1} \set \mathit{state}.w_k + \alpha\, \mathit{state}.p_k$
\State $k \set k + 1$
\State $p \set \mathit{state}.p_{k-1}$
\Comment{Implicitly returned}
\State $g \set \mathit{state}.g_{k-1}$
\Comment{Implicitly returned}
\State \Return $\mathit{state}.w_k$
\end{algorithmic}
\end{algorithm}
\paragraph{Update parameter strategy.}
For cases where $F$ is a strongly convex quadratic (e.g., ordinary least squares), $\beta_k$ can be computed in closed form, making $p_k$ conjugate to $p_0,...,p_{k-1}$.
For more general objective functions, many different choices of update parameter are proposed \cite{hager2006survey, nocedal2006numerical}, such as
\[\beta_k^{FR} = \frac{\|g_k\|^2}{\|g_{k-1}\|^2},\]
\[\beta_k^{HS} = \frac{g_k^T (g_k - g_{k-1})}{p_{k-1}^T (g_k - g_{k-1})},\]
\[\beta_k^{PR} = \frac{g_k^T (g_k - g_{k-1})}{\|g_{k-1}\|^2},\]
\[\beta_k^{DY} = \frac{\|g_k\|^2}{p_{k-1}^T (g_k - g_{k-1})},\]
where $g_k = \sum_{m=1}^M \nabla f_{z_m}(w_k)$, and $p_k = - g_k + \beta_k p_{k-1}$.
We choose the strategy proposed by Dai and Yuan ($\beta_k^{DY}$) due to the lack of a mechanism for stepsize tuning in MADlib, which the other alternatives require to guarantee their convergence rates (see Theorem 4.1 in Hager and Zhang \cite{hager2006survey}).
In fact, the lack of sufficient stepsize tuning for each individual iteration could make conjugate gradient methods exhibit a convergence rate similar to, or even worse than, that of gradient descent.
This should be fixed in the future.
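A sketch of the Dai and Yuan update parameter and the resulting direction update, with \texttt{g\_new}, \texttt{g\_old}, and \texttt{p\_old} standing for $g_k$, $g_{k-1}$, and $p_{k-1}$:
\begin{verbatim}
import numpy as np

def beta_dy(g_new, g_old, p_old):
    # beta_k^{DY} = ||g_k||^2 / (p_{k-1}^T (g_k - g_{k-1}))
    return g_new.dot(g_new) / p_old.dot(g_new - g_old)

def cg_direction(g_new, g_old, p_old):
    # p_k = -g_k + beta_k p_{k-1}; the first iteration (no
    # previous direction) uses steepest descent.
    if p_old is None:
        return -g_new
    return -g_new + beta_dy(g_new, g_old, p_old) * p_old
\end{verbatim}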
\subsection{Newton's Method}
Newton's method uses a search direction other than the steepest descent direction: the \emph{Newton direction}.
The Newton direction is very effective when the objective function is not too different from a quadratic approximation, and it gives a quadratic convergence rate, as can be seen from Taylor's theorem.
Formally, the Newton direction is given by
\[p_k = -(\nabla^2 F(w_k))^{-1} \nabla F(w_k),\]
where, given $z_{1:M}$, $F(w) = \sum_{m=1}^M f_{z_m}(w)$, and $H_k = \nabla^2 F(w_k)$ is called the Hessian matrix.
\subsubsection{Initialization Strategy}
\begin{algorithm}[initialization-newton$(w)$] \label{alg:initialization-newton}
\alginput{Coefficients $w \in \mathbb{R}^N$}
\algoutput{Transition state $\mathit{state}$}
\begin{algorithmic}[1]
\State $\mathit{state}.w_k \set w$
\State $\mathit{state}.g_k \set 0$
\State $\mathit{state}.H_k \set 0$
\State \Return $\mathit{state}$
\end{algorithmic}
\end{algorithm}
\subsubsection{Transition Strategy}
\begin{algorithm}[transition-newton$(\mathit{state}, z_m)$] \label{alg:transition-newton}
\alginput{Transition state $\mathit{state}$,\\
observation entry $z_m$,\\
gradient function $\mathit{Gradient}()$,\\
Hessian matrix function $\mathit{Hessian}()$}
\algoutput{Transition state $\mathit{state}$}
\begin{algorithmic}[1]
\State $\mathit{state}.g_k \set \mathit{state}.g_k + \mathit{Gradient}(\mathit{state}.w_k, z_m)$
\State $\mathit{state}.H_k \set \mathit{state}.H_k + \mathit{Hessian}(\mathit{state}.w_k, z_m)$
\State \Return $\mathit{state}$
\end{algorithmic}
\end{algorithm}
\subsubsection{Finalization Strategy}
\begin{algorithm}[finalization-newton$(\mathit{state})$] \label{alg:finalization-newton}
\alginput{Transition state $\mathit{state}$}
\algoutput{Coefficients $w \in \mathbb{R}^N$}
\begin{algorithmic}[1]
\State $p_k \set - (\mathit{state}.H_k)^{-1} \mathit{state}.g_k$
\State $\mathit{state}.w_{k+1} \set \mathit{state}.w_k + p_k$
\State $k \set k + 1$
\State \Return $\mathit{state}.w_k$
\end{algorithmic}
\end{algorithm}
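The finalization above amounts to solving one linear system per iteration. A minimal Python sketch, using a linear solve rather than an explicit inverse (the numerically preferable choice):
\begin{verbatim}
import numpy as np

def finalization_newton(H, g, w):
    # Newton direction: solve H p = -g instead of forming H^{-1}
    p = np.linalg.solve(H, -g)
    return w + p  # w_{k+1} = w_k + p_k
\end{verbatim}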
\paragraph{Hessian Matrix Function.}
The function through which Newton's method accepts further computational logic of a specified module; see also the gradient function above.
\paragraph{Inverse of the Hessian Matrix.}
The inverse of the Hessian matrix may not exist if the Hessian is not guaranteed to be positive definite (e.g., $\nabla^2 F = 0$ when $F$ is linear).
We currently support Newton's method only for objective functions that are strongly convex.
This may sometimes mean an objective function that is not globally strongly convex; Newton's method still works well with a good starting point, as long as the objective function is strongly convex over a convex set that contains both the given starting point and the minimum.
A few techniques that modify Newton's method to handle objective functions that are not strongly convex can be found in the references \cite{bertsekas1999nonlinear, nocedal2006numerical}; one example is sketched below.
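One standard such modification (described in \cite{nocedal2006numerical}; sketched here for illustration, not what MADlib currently implements) adds a multiple of the identity to the Hessian until it becomes safely positive definite:
\begin{verbatim}
import numpy as np

def damped_newton_direction(H, g, tau=1e-4):
    # Increase tau until H + tau * I admits a Cholesky
    # factorization, i.e., is positive definite.
    n = H.shape[0]
    while True:
        try:
            L = np.linalg.cholesky(H + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau *= 10.0
    # Solve (H + tau I) p = -g using the Cholesky factor
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)
\end{verbatim}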
\paragraph{Feeding a Good Starting Point.}
Since Newton's method is sensitive to the starting point $w_0$, we provide a start strategy $\mathit{Start()}$ to accept a starting point that may be nonzero.
It may come from the results of other algorithms, e.g., IGD.
% \subsection{Quasi-Newton Method}
\section{Implemented Machine Learning Algorithms}
We have implemented several machine learning algorithms under the
framework of convex optimization.
\subsection{Linear Ridge Regression}
Ridge regression is the most commonly used method of regularization for
ill-posed problems. Mathematically, it seeks to minimize, over $\vec{w}$ and $w_0$,
\begin{equation}
Q\left(\vec{w},w_0;\lambda\right)\equiv \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - w_0 -
\vec{w} \cdot \vec{x}_i \right)^2
+\frac{\lambda}{2}\|\vec{w}\|_2^2 \ ,
\end{equation}
for a given value of $\lambda$, where $\vec{w}$ and $w_0$ are the fitting coefficients, and $\lambda$
is a non-negative regularization parameter. $\vec{w}$ is a vector in
$d$ dimensional space, and
\begin{equation}
\|\vec{w}\|_2^2 = \sum_{j=1}^{d}w_j^2 = \vec{w}^T\vec{w}\ .
\end{equation}
When $\lambda = 0$, $Q$ is
the mean squared error of the fitting.
The intercept term $w_0$ is not regularized, because this term is
fully decided by the mean values of $y_i$ and $\vec{x}_i$ and the
values of $\vec{w}$, and does not affect the model's complexity.
$Q\left(\vec{w},w_0;\lambda\right)$ is a quadratic function of $\vec{w}$ and
$w_0$, and thus can be minimized analytically:
\begin{equation}
\vec{w}_{ridge}=\left(\lambda\vec{I}_d +
\vec{X}^T\vec{X}\right)^{-1}\vec{X}^T\vec{y}\ .
\end{equation}
By using the available Newton's method implementation (Sec.~6.2.4), the above
quantity can easily be computed in one single Newton step.
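For reference, the closed-form solution itself is a one-line linear solve; a Python sketch, assuming \texttt{X} excludes the intercept column (the intercept is handled separately, as noted above):
\begin{verbatim}
import numpy as np

def ridge_coefficients(X, y, lam):
    # w_ridge = (lam * I_d + X^T X)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T.dot(X),
                           X.T.dot(y))
\end{verbatim}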
Many packages for ridge regularization actually regularize the fitting
coefficients not for the model fit to the original data, but for a model
fit to scaled data. MADlib also provides this option. When the
normalization parameter is set to True (it is False by default), the data
is first transformed as follows before ridge regularization is applied.
\begin{equation}
x_i' \leftarrow \frac{x_i - \langle x_i \rangle}{\langle (x_i -
\langle x_i \rangle)^2\rangle} \ ,
\end{equation}
\begin{equation}
y_i \leftarrow y_i - \langle y_i \rangle \ ,
\end{equation}
where $\langle \cdot \rangle = \sum_{i=1}^{N} \cdot / N$.
Note that Ridge regressions for scaled data and un-scaled data are not equivalent.
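For completeness, a Python sketch of the scaling exactly as written in the equations above (note that each column is divided by its mean squared deviation, as stated, rather than by its standard deviation):
\begin{verbatim}
import numpy as np

def scale_data(X, y):
    # x' = (x - <x>) / <(x - <x>)^2>, applied column-wise
    x_mean = X.mean(axis=0)
    x_msd = ((X - x_mean) ** 2).mean(axis=0)
    return (X - x_mean) / x_msd, y - y.mean()  # y is centered
\end{verbatim}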