| % !TEX root = ../design.tex |
| \chapter[SVM]{SVM} |
| |
| \newcommand{\Pcal}{{\cal P}} |
| \newcommand{\Ncal}{{\cal N}} |
| |
| \begin{moduleinfo} |
| \item[Authors] {} |
| \item[History] |
| \begin{modulehistory} |
| |
| \item[v0.1] Initial version |
| \end{modulehistory} |
| \end{moduleinfo} |
| |
| |
| |
| |
| |
| % Abstract. What is the problem we want to solve? |
| |
| Support Vector Machines (SVMs) are a commonly used technique for approaching |
| regression and classification problems. SVM models have two particularly |
| desirable features: robustness in the presence of noisy data, and applicability |
| to a variety of data schemes. At its core, a \textit{linear} SVM model is a |
| hyperplane separating the two distinct classes of data (in the case of |
| classification problems), in such a way that the \textit{margin} is maximized. |
| Using kernels, one can approximate a large variety of decision boundaries. |
| |
| |
| |
| %\section{Linear Methods for Regression} % (fold) |
| %\label{sub:linear_methods_for_regression} |
| |
| % subsection linear_methods_for_regression (end) |
| |
| |
| |
| |
| |
| \section{Linear SVM}\label{sec:linear} |
| |
| %Cite Smola's book for this section |
| |
| |
Suppose we have a classification task with training data
$(x_1,y_1),\ldots,(x_n,y_n) \in \R^d\times \{-1,1\}$. A hyperplane in $\R^d$ is
determined by $w\in \R^d$ and $b\in \R$: a point $x\in\R^d$ lies on the plane if $\langle w,
x\rangle + b = 0$. To solve the classification problem, we seek a hyperplane
which separates the data, and therefore coefficients $(w,b) \in \R^d\times\R$
such that for all $i\leq n$, $y_i = \text{sgn}(\langle w, x_i\rangle + b)$.\\
| |
| The \textit{margin} determined by a hyperplane and the training data is the |
| smallest distance from a data point to the hyperplane. For convenience, the |
| length of $w$ can be modified so that the margin is equal to $\frac{1}{||w||}$. |
In addition to merely separating the data, it is desirable to find a hyperplane
which maximizes the margin; intuitively, a larger margin will lead to better
predictions on future data.\\
| |
The \textit{support vectors} are the members of the training data whose distance
to the hyperplane is exactly the margin; that is, the training points which are
closest to the hyperplane. Learning with SVM lends itself to large datasets in
part because the model depends only on these support vectors; no other
training data influence the final model. When assigning labels to new
data, one needs only to check it against the support vectors.\\
| |
| Thus, we are left with the following convex programming problem: |
| |
| \begin{align*} |
| \underset{w,b}{\text{Minimize }} \frac{1}{2}||w||^2, w\in \R^d, \\ |
| \text{subject to } y_i(\langle w, x_i \rangle + b) \geq 1 \text{ for } i=1\ldots n. |
| \end{align*} |
| |
| If the data are not linearly separable, \textit{slack variables} $\xi_i$ are |
| introduced to allow for some points to be classified incorrectly. We do not want |
| to sacrifice too much to capture outliers, so we introduce an extra term in the |
| objective function to control the error. The optimization problem now becomes |
| |
| \begin{align*} |
| \underset{w,\vec{\xi},b}{\text{Minimize }} \frac12 ||w||^2 + \frac{C}{n}\sum_{i=1}^n \xi_i \\ |
\text{subject to } y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i,\\
| \xi_i\geq 0. |
| \end{align*} |
| |
| |
The parameter $C$ is free and should be chosen to optimize the model; one can use cross-validation, for example, to select $C$. \\
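As an illustration, the following is a minimal sketch (in Python, assuming NumPy and
scikit-learn are available) of selecting $C$ by cross-validation on a toy data set.
Note that scikit-learn's \texttt{LinearSVC} scales the loss term by $C$ rather than
$C/n$, so its values of $C$ are not directly comparable to the formulation above.

\begin{verbatim}
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Toy, nearly separable data with labels in {-1, +1}.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.r_[np.ones(50), -np.ones(50)]

# Search C on a logarithmic grid using 5-fold cross-validation.
search = GridSearchCV(LinearSVC(), {"C": np.logspace(-3, 3, 7)}, cv=5)
search.fit(X, y)
print(search.best_params_)
\end{verbatim}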
| |
| |
| \subsection{Unconstrained optimization} |
| |
| One can formulate the same problem without constraints: |
| |
| \begin{equation} |
| \underset{w}{\text{Minimize }} \lambda ||w||^2 + \frac{1}{n}\sum_{i=1}^n \ell(y_i,f(x_i)) |
| \end{equation} |
| |
| |
where $\ell(y,f(x)) = \max(0,1-yf(x))$ is the \textit{hinge loss}. In the case
of linear SVM, $f(x) = \langle w, x\rangle + b$. (Note: there are a couple of
ways to handle the variable $b$; one can include it in the gradient
calculations, or append a constant feature to the data.) Also note that this
loss function is not smooth. However, one can solve this unconstrained convex
problem using the techniques outlined in the ``Convex Programming Framework''
section of this document. In particular, the subgradient of this loss function
used in our gradient-descent method is:
| $$ |
| v(y,f(x)) = \left\{ |
| \begin{array}{ll} |
| 0 & \quad 1-yf(x) \leq 0 \\ |
| -y\cdot x & \quad 1-yf(x) > 0 |
| \end{array} |
| \right. |
| $$ |
| |
| See \cite{ShSS07} (extended version) for details. |
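The following is a minimal sketch (in Python/NumPy; the function name and
parameters are illustrative, not part of any existing implementation) of batch
subgradient descent on this unconstrained objective, handling $b$ by appending a
constant feature.

\begin{verbatim}
import numpy as np

def linear_svm_subgradient_descent(X, y, lam=0.01, eta=0.1, epochs=100):
    """Minimize lam*||w||^2 + (1/n) * sum_i max(0, 1 - y_i*<w, x_i>)
    by batch subgradient descent.  Labels y_i are in {-1, +1}; the bias b
    is handled by appending a constant feature of 1 to every data point
    (so here it is regularized along with w, a simplification)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])          # append bias feature
    w = np.zeros(d + 1)
    for _ in range(epochs):
        margins = y * (Xb @ w)
        active = margins < 1                      # points with positive hinge loss
        # Subgradient: 2*lam*w + (1/n) * sum over active points of (-y_i * x_i)
        grad = 2 * lam * w - (y[active, None] * Xb[active]).sum(axis=0) / n
        w -= eta * grad
    return w[:-1], w[-1]                          # (w, b)
\end{verbatim}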
| |
| |
| |
| % \subsection{$\nu$-SVM} |
| %Another formulation of the SVM optimization problem is \textit{$\nu$-SVM}. |
| % |
| %\begin{align*} |
| %\underset{w,\vec{\xi},\rho,b}{\text{Minimize }} \frac{1}{2} ||w||^2 -\nu\rho + \frac{1}{n}\sum_{i=1}^n \xi_i \\ |
| %\text{subject to } y_i(\langle w, x_i \rangle + b) \geq \rho - \xi_i,,\\ |
| %\xi_i,\rho\geq 0. |
| %\end{align*} |
| % |
| %The parameters $\nu$ take the place of the parameter $C$ in this formulation. The advantage of this is that, unlike $C$, $\nu$ has a nice interpretation: $\nu$ is a lower bound on the fraction of vectors which are support vectors in the resulting decision boundary. An additional variable $\rho$ is also introduced, which one thinks of as controlling the margin of the data.\\ |
| % |
| %It is unclear at this point how to convert this to an unconstrained optimization problem. |
| |
| |
| \subsection{$\epsilon$-Regression} |
| |
SVM can also be used to predict the values of an affine function $f(x) = \langle
w, x\rangle+b$, given sample input-output pairs $(x_1,y_1),\ldots,(x_n,y_n)$. If
we allow ourselves an error bound of $\epsilon>0$, plus some additional error
controlled by the slack variables $\xi_i,\xi^*_i$, it is a matter of simply
modifying the above convex problem. By demanding that our function is relatively
``flat,'' and that it approximates the true $f$ reasonably, the relevant
optimization problem is:
| |
| \begin{align*} |
\underset{w,\vec{\xi},\vec{\xi}^*,b}{\text{Minimize }} & \frac{1}{2}||w||^2 + \frac{C}{n}\sum_{i=1}^n (\xi_i + \xi^*_i) \\
| \text{subject to } & y_i - \langle w, x_i\rangle - b \leq \epsilon + \xi_i,\\ |
| & \langle w, x_i\rangle + b - y_i \leq \epsilon + \xi^*_i, \\ |
| & \xi_i,\xi^*_i \geq 0 |
| \end{align*} |
| |
| One can also formulate $\epsilon$-regression as an unconstrained optimization |
| problem just as we did with classification. In fact, the objective function |
| remains the same, except rather than hinge loss we would use the loss function |
| |
| \[ \ell(y,f(x)) = \max(0,|y-f(x)|-\epsilon) \] |
| |
| whose subgradient is |
| |
| $$ |
| v(y,f(x)) = \left\{ |
| \begin{array}{ll} |
| x & \quad f(x) - y > \epsilon \\ |
| -x & \quad y - f(x) > \epsilon \\ |
| 0 & \quad \text{otherwise} |
| \end{array} |
| \right. |
| $$ |
| |
| \section{Nonlinear SVM}\label{sec:nonlinear} |
| |
One can generalize the above optimization problems using \textit{kernels}. A
kernel is nothing more than a continuous, symmetric, positive-definite function
$k:X\times X \to \R$. Replacing the usual inner product with a kernel
evaluation, known as the \textit{kernel trick}, allows one to compute decision
boundaries which are more general than hyperplanes. \\
| |
| |
| |
| Two commonly used kernels are polynomial kernels, of the form $k(x,y) = (\langle |
| x,y\rangle +q)^r$, and Gaussian kernels, $k(x,y) = e^{-\gamma||x-y||^2}$, |
| $\gamma$ a free parameter.\\ |
| |
| The approach we take to learning with these kernels, following \cite{RR07}, is |
| to approximate them with linear kernels and then apply the above gradient |
| descent based algorithm for learning linear kernels. The approximation takes the |
| following form: we embed the data into a finite dimensional feature space, |
| $z:\R^d\to \R^D$, so that the inner product between transformed points |
| approximates the kernel: |
| |
| \[ k(x,y) \approx \langle z(x), z(y)\rangle.\] |
| |
This map $z$ is randomly generated (in a manner depending on the kernel)
by the methods outlined in the papers \cite{KK12,RR07}. We then use $z$ to embed
both the training data and the test data into $\R^D$ and run linear SVM on the
embedded data.\\
| |
| \subsection{Gaussian kernels} |
| |
A kernel $k$ is said to be shift-invariant if, for all $x,y,z$, $k(x,y) =
k(x-z,y-z)$. In the paper \cite{RR07}, the authors show that for any
shift-invariant kernel $k$, there exists a probability measure $p(w)$ on its
domain such that
| |
| \[ k(x,y) = k(x-y,0) = \int_X p(w)e^{i\langle w, x-y\rangle} dw = E_w[e^{i\langle w,x\rangle}e^{-i\langle w,y\rangle}] |
| \] |
| |
| where the expectation above is taken with respect to the measure $p$. In the |
| particular case of $k(x,y) = \exp(-\gamma||x-y||^2)$, |
| |
| \[ p(w) = \left(\dfrac{1}{\sqrt{4\pi\gamma}}\right)^d \exp\left(\frac{-||w||^2}{4\gamma}\right)\] |
| |
| where $d$ is the dimension of the sample data. In other words, $p(w)$ is a |
| Gaussian distribution with mean zero and standard deviation |
| $\sigma = \sqrt{2\gamma}$.\\ |
| |
| Since both the kernel and the probability distribution are real-valued in the |
| above integral, the complex exponential can be replaced with |
| $\cos(\langle w, x-y\rangle)$. Define a random function |
| $z: x\mapsto \sqrt{2}\cos(\langle w,x\rangle + b)$, |
| where $w$ is drawn from the distribution $p$ and $b$ uniformly |
from $[0,2\pi]$. Then $z(x)z(y)$ is an unbiased estimator of
$k(x,y)=k(x-y,0)$. To see this, use the sum of angles formula:
| |
\begin{align*}
z(x)z(y) &= 2\cos(\langle w,x\rangle+b)\cos(\langle w,y\rangle+b)\\
&= 2\cos(\langle w,x\rangle)\cos(\langle w,y\rangle)\cos^2(b)
 + 2\sin(\langle w, x\rangle)\sin(\langle w, y\rangle)\sin^2(b) \\
&\quad - 2\sin(\langle w, x\rangle+\langle w, y\rangle)\cos(b)\sin(b)
\end{align*}
| |
| Since $w,b$ are chosen independently, |
| |
\begin{align*}
E[2\cos(\langle w,x\rangle)\cos(\langle w,y\rangle)\cos^2(b)] &= 2E[\cos(\langle w,x\rangle)\cos(\langle w,y\rangle)]E[\cos^2(b)]\\
&= 2E[\cos(\langle w,x\rangle)\cos(\langle w,y\rangle)]\cdot \frac12 \\
&= E[\cos(\langle w,x\rangle)\cos(\langle w,y\rangle)]
\end{align*}
| |
| and similarly with the other terms. Finally, |
| |
| \begin{align*} |
| E_{w,b}[z(x)z(y)] &= E[\cos(\langle w,x\rangle)\cos(\langle w,y\rangle) + \sin(\langle w, x\rangle)\sin(\langle w, y\rangle)]\\ |
| &= E[\cos(\langle w, x-y\rangle)]. |
| \end{align*} |
| |
| |
| |
To lower the variance of our estimate, we sample $D$ pairs $(w_j,b_j)$ and
average the resulting products; equivalently, we stack the individual features
into a vector $z(x)\in\R^D$, scaled by $\sqrt{2/D}$, and take the inner product
$\langle z(x), z(y)\rangle$.\\
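The unbiasedness can be checked numerically; the following short Python/NumPy
sketch compares the Monte Carlo average of $z(x)z(y)$ against the exact Gaussian
kernel value for a pair of random points.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
gamma, d, num_samples = 0.5, 3, 200000
x, y = rng.randn(d), rng.randn(d)

# Draw w ~ N(0, 2*gamma*I) and b ~ Uniform[0, 2*pi], then form the features.
W = rng.normal(scale=np.sqrt(2 * gamma), size=(num_samples, d))
b = rng.uniform(0, 2 * np.pi, size=num_samples)
zx = np.sqrt(2) * np.cos(W @ x + b)
zy = np.sqrt(2) * np.cos(W @ y + b)

print(np.mean(zx * zy))                        # Monte Carlo estimate
print(np.exp(-gamma * np.sum((x - y) ** 2)))   # exact kernel value
\end{verbatim}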
| |
| \subsubsection{Formal Description} |
| |
| \begin{algorithm}[Random Fourier Features] |
| \alginput{Training data $X$, an $n\times d$ matrix representing $n$ data points in dimension $d$,\\ |
the parameter $\gamma$ of the Gaussian kernel $e^{-\gamma ||x-y||^2}$,\\
| dimension of range space, $D$,\\ |
| random normal generator seed} |
| \algoutput{$X'$, an $n\times D$ dimension matrix of data in feature space to be sent to linear solver, |
| as well as the $\Omega$ and $\vec{b}$ used to compute $X'$, to be used by the predictor.} |
| |
| \begin{algorithmic}[1] |
\State $\Omega \set d\times D $ matrix of samples drawn from the standard normal distribution,
then scaled by a factor of $\sqrt{2\gamma}$ to simulate a Gaussian with $\sigma=\sqrt{2\gamma}$
(cf.\ the \texttt{fit} method of the \texttt{RBFSampler} class in scikit-learn)
| \State $\vec{b} \set$ vector of length $D$, each entry a uniform random sample from $[0,2\pi]$ |
| \State $X' \set X\cdot \Omega$ |
| \State $X' \set X' + \vec{b} $ ($\vec{b}$ is added to each row) |
| \State $X' \set \cos(X')$ (take the cosine of each entry) |
| \State $X' \set \sqrt{\frac{2}{D}}\cdot X'$ |
| \State return $X', \Omega, \vec{b}$ |
| \end{algorithmic} |
| |
| \end{algorithm} |
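The algorithm above translates directly into a few lines of Python/NumPy; the
sketch below (with illustrative function and variable names) returns the embedded
data together with $\Omega$ and $\vec{b}$, so that test data can be embedded with
the same map before prediction.

\begin{verbatim}
import numpy as np

def random_fourier_features(X, gamma, D, seed=0):
    """Map the n x d matrix X to an n x D matrix X' such that X' @ X'.T
    approximates the Gaussian kernel matrix exp(-gamma * ||x - y||^2)."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    Omega = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # w ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=D)
    Xp = np.sqrt(2.0 / D) * np.cos(X @ Omega + b)              # b added to each row
    return Xp, Omega, b
\end{verbatim}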
| |
| |
| \subsubsection{Parallelization} |
| Since the random cosine features are generated independently, each coordinate |
| could be computed independently in parallel, as long as each node has access to |
| the distribution $p$. |
| |
| |
| \subsection{Dot product kernels} |
| |
| As in the case of Gaussian kernels, we can use random feature maps to |
| approximate polynomial kernels, $k(x,y)=(\langle x,y\rangle+q)^r$. Again, the |
| embedding takes the following simple form: |
| |
| \begin{align*} |
| z: \R^d\to \R^D\\ |
| x\mapsto \frac{1}{\sqrt{D}} (z_1(x), \ldots, z_D(x)). |
| \end{align*} |
| |
The idea here is to choose each random feature $z_i$ such that it satisfies
$E[z_i(x)z_i(y)] = k(x, y)$. The concentration of measure phenomenon then
ensures that, as $D$ goes to infinity, $|k(x, y) - z(x)^Tz(y)|$ is small with
high probability. We describe the process of generating $z_i$ below. Note that
it applies to any positive definite kernel
$k: (x, y) \mapsto f(\langle x, y \rangle)$,
where $f$ admits a Maclaurin expansion, i.e.,
$f(x) = \sum_{N=0}^{\infty} a_N x^N$,
where $a_N = f^{(N)}(0) / N!$.
For example, in the case of polynomial kernels
$k(x,y)=(\langle x,y\rangle+q)^r$, $f(x)$ is
$(x + q)^r$ and $a_N$ is the $N$-th derivative of $(x + q)^r$ evaluated at $0$,
divided by $N!$.
| |
| \begin{enumerate} |
\item Randomly pick a number $N \in \N \cup \{0\}$ with $\Pcal[N] = \frac{1}{p^{N+1}}$.
Here $p$ is a positive number larger than 1, and in practice $p=2$ is often a
good choice since it yields a normalized measure over $\N \cup \{0\}$ \cite{KK12}.
\item Pick $N$ independent Rademacher vectors $w_1, ..., w_N$, i.e., each
component of each Rademacher vector is chosen independently from the set $\{-1, 1\}$ using a fair coin toss.
| \item Compute $z_i(x)$ as follows, |
| \begin{align} |
| \label{equ:svm-zi} |
| z_i(x) = |
| \begin{cases} |
| \sqrt{a_0p}, &\quad \mbox{ if } N = 0,\\ |
| \sqrt{a_Np^{N+1}}\prod_{j=1}^N w_j^Tx, &\quad \mbox{ otherwise. } |
| \end{cases} |
| \end{align} |
| \end{enumerate} |
| |
The reason why we pick $N$ from $\N \cup \{0\}$ with probability $\Pcal[N] =
\frac{1}{p^{N+1}}$ becomes clear in the following simple proof, which establishes
that the random feature product $z_i(x)z_i(y)$ is an unbiased estimator of the
kernel value $k(x, y)$ for any $x, y \in \R^d$, i.e., $E[z_i(x)z_i(y)]
= k(x, y)$:
| |
| \begin{align*} |
| E[z_i(x)z_i(y)] |
&= E_N [E_{w_1, ..., w_N} [z_i(x)z_i(y) \mid N]] \\
| &= E_N[a_N p^{N+1} E_{w_1, ..., w_N} [\prod_{j=1}^N w_j^T x \prod_{j=1}^N w_j^T y]] \\ |
| &= E_N[a_N p^{N+1} (E_{w} [w^Tx \cdot w^Ty])^N] \\ |
| &= E_N[a_N p^{N+1} \langle x, y \rangle^N] \\ |
| &= \sum_{n=0}^{\infty} \frac{1}{p^{n+1}} \cdot a_n p^{n+1} \langle x, y \rangle^n \\ |
| &= f(\langle x, y \rangle) = k(x, y) |
| \end{align*} |
| |
| See the paper of Kar and Karnick \cite{KK12} for more details. |
| |
| \subsubsection{Formal Description} |
| |
| \begin{algorithm}[Random Features for Polynomial Kernels] |
| \alginput{Training data $X$, an $n\times d$ matrix representing $n$ data points in dimension $d$, \\ |
| Parameters of a polynomial kernel, i.e. degree $r$, dimension of linear feature space $D$\\} |
| \algoutput{$X'$, an $n\times D$ dimension matrix of data in feature space to be sent to linear solver \\ |
| $\Omega, \Ncal$ generated by the algorithm, to be used by the predictor.\\ |
| } |
| \begin{algorithmic}[1] |
\State $\Ncal \set $ array of $D$ entries, each drawn from the geometric distribution $\Pcal[N] = \frac{1}{2^{N+1}}$
\State $\Omega \set $ matrix of dimensions $d\times \text{sum}(\Ncal)$ with $\{-1,1\}$ entries, each chosen with a fair coin flip
\State $X'\set X\cdot \Omega,$ now a matrix of dimensions $n \times \text{sum}(\Ncal)$
| \label{alg:x-omega} |
| \For {each $row$ in $X'$}\label{alg:x-for} |
\State $row' \set $ $row$ divided into $D$ segments, with the $i$th segment consisting of $\Ncal_i$ entries.
| \State $row'' \set $ array of D entries aggregating from each segment of $row'$ according to \eqref{equ:svm-zi} |
| \EndFor |
\State $X'' \set $ the transformed $X'$ multiplied by $\frac{1}{\sqrt{D}}$
| \State return $X'',\Omega, \Ncal$ |
| \end{algorithmic} |
| \end{algorithm} |
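For concreteness, the following Python/NumPy sketch (function name and parameters
are illustrative) generates the random Maclaurin features for the polynomial
kernel $(\langle x,y\rangle + q)^r$ with $p=2$, computing each $z_i$ directly
from \eqref{equ:svm-zi} rather than via the matrix manipulations above.

\begin{verbatim}
import math
import numpy as np

def random_maclaurin_features(X, q, r, D, seed=0):
    """Embed the n x d matrix X into R^D so that inner products of embedded
    points approximate the polynomial kernel (<x, y> + q)^r.  Here p = 2,
    and a_N = binom(r, N) * q^(r - N) is the N-th Maclaurin coefficient
    of f(t) = (t + q)^r (zero for N > r)."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    Z = np.zeros((n, D))
    for i in range(D):
        N = rng.geometric(0.5) - 1                    # P[N] = 1 / 2^(N+1)
        a_N = math.comb(r, N) * q ** (r - N) if N <= r else 0.0
        if a_N == 0.0:
            continue                                  # this feature is identically zero
        if N == 0:
            Z[:, i] = math.sqrt(a_N * 2)
        else:
            W = rng.choice([-1.0, 1.0], size=(d, N))  # N Rademacher vectors
            Z[:, i] = math.sqrt(a_N * 2 ** (N + 1)) * np.prod(X @ W, axis=1)
    return Z / math.sqrt(D)
\end{verbatim}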
| |
| \subsubsection{Parallelization} |
In the above algorithm, Step \ref{alg:x-omega} is done by broadcasting the
matrix $\Omega$ to each row of $X$ and computing the matrix-vector product
locally in parallel for each row; similarly, Step \ref{alg:x-for} can also be
distributed, since the computations for each row are independent of the others.
| |
| \section{Novelty Detection} |
| |
Suppose we have training data $x_1, x_2, \ldots, x_n \in \R^d$. The goal of
novelty detection is to learn a hyperplane in $\R^d$ that separates the training
data from the origin with maximum margin. We model this as a one-class
| classification problem by transforming the training data to |
| $(x_1,y_1),\ldots,(x_n,y_n) \in \R^d\times \{1\}$, indicating that the dependent |
| variable of each training instance is assumed to be the same. The origin is |
| treated as a negative class data point and all input data points are treated as |
| part of the positive class. Given such a mapping, we use the SVM classification |
| mechanisms detailed in Sections~\ref{sec:linear} and~\ref{sec:nonlinear} to |
| learn a one-class classification model. |
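As a minimal illustration (Python/NumPy, with a hypothetical helper name), the
transformation described above amounts to labeling every input point $+1$ and
adding the origin as a single negative example; the resulting labeled set can
then be passed to the classifiers of the previous sections.

\begin{verbatim}
import numpy as np

def novelty_training_set(X):
    """Turn unlabeled data X (an n x d matrix) into a one-class
    classification problem: each input point gets label +1 and the
    origin is appended as a single negative example."""
    n, d = X.shape
    X_aug = np.vstack([X, np.zeros((1, d))])
    y_aug = np.r_[np.ones(n), -1.0]
    return X_aug, y_aug
\end{verbatim}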