
 \begin{comment}

  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.

 \end{comment}

 \subsubsection{Multi-class Support Vector Machines}
 \label{msvm}

 \noindent{\bf Description}

 Support Vector Machines are used to model the relationship between a categorical
 dependent variable y and one or more explanatory variables denoted X. This
 implementation supports dependent variables with a domain size greater than or
 equal to 2 and hence is not restricted to binary class labels.
 \\

 \noindent{\bf Usage}

 \begin{tabbing}
 \texttt{-f} \textit{path}/\texttt{m-svm.dml -nvargs}
 \=\texttt{X=}\textit{path}/\textit{file}
   \texttt{Y=}\textit{path}/\textit{file}
   \texttt{icpt=}\textit{int}\\
 \>\texttt{tol=}\textit{double}
   \texttt{reg=}\textit{double}
   \texttt{maxiter=}\textit{int}
   \texttt{model=}\textit{path}/\textit{file}\\
 \>\texttt{Log=}\textit{path}/\textit{file}
   \texttt{fmt=}\textit{csv}$\vert$\textit{text}
 \end{tabbing}

 \begin{tabbing}
 \texttt{-f} \textit{path}/\texttt{m-svm-predict.dml -nvargs}
 \=\texttt{X=}\textit{path}/\textit{file}
   \texttt{Y=}\textit{path}/\textit{file}
   \texttt{icpt=}\textit{int}
   \texttt{model=}\textit{path}/\textit{file}\\
 \>\texttt{scores=}\textit{path}/\textit{file}
   \texttt{accuracy=}\textit{path}/\textit{file}\\
 \>\texttt{confusion=}\textit{path}/\textit{file}
   \texttt{fmt=}\textit{csv}$\vert$\textit{text}
 \end{tabbing}

 \noindent{\bf Arguments}

 \begin{itemize}
 \item X: Location (on HDFS) containing the explanatory variables
 in a matrix. Each row constitutes an example.
 \item Y: Location (on HDFS) containing a 1-column matrix specifying
 the categorical dependent variable (label). Labels are assumed to be
 contiguously numbered from 1 $\ldots$ \#classes. Note that this
 argument is optional for prediction.
 \item icpt (default: {\tt 0}): If set to 1 then a constant bias column
 is added to X.
 \item tol (default: {\tt 0.001}): The procedure terminates early if the reduction
 in the objective function value is less than the tolerance times the initial objective
 function value.
 \item reg (default: {\tt 1}): Regularization constant $\lambda$. See Details for
 where $\lambda$ appears in the objective function. To draw an analogy
 with C-SVM, C = 2/$\lambda$. Usually, cross-validation
 is employed to determine the optimum value of $\lambda$.
 \item maxiter (default: {\tt 100}): The maximum number of iterations.
 \item model: Location (on HDFS) that contains the learnt weights.
 \item Log: Location (on HDFS) to collect various metrics (e.g., the objective
 function value) that depict progress across iterations while training.
 \item fmt (default: {\tt text}): Specifies the output format, either
 comma-separated values (csv) or sparse matrix (text).
 \item scores: Location (on HDFS) to store scores for a held-out test set.
 This argument is optional.
 \item accuracy: Location (on HDFS) to store the accuracy computed on a
 held-out test set. This argument is optional.
 \item confusion: Location (on HDFS) to store the confusion matrix
 computed using a held-out test set. This argument is optional.
 \end{itemize}
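
 The relationship between \texttt{reg} and $C$, and the \texttt{tol} stopping rule, can be written out as follows. This is a sketch under assumptions that the argument descriptions do not state: the C-SVM objective is taken to be $\frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_i \xi_i^2$, and $f(w^{(t)})$ is notation introduced here for the objective function value after iteration $t$. Dividing the objective given in Details by $\lambda$ does not change its minimizer:
 \begin{eqnarray*}
 \frac{1}{\lambda}\left(\frac{\lambda}{2}\|w\|_2^2 + \sum_i \xi_i^2\right)
 &=& \frac{1}{2}\|w\|_2^2 + \frac{1}{\lambda}\sum_i \xi_i^2 ,
 \end{eqnarray*}
 so matching $\frac{C}{2} = \frac{1}{\lambda}$ gives $C = 2/\lambda$. Under a convention that weights the loss term by $C$ instead of $\frac{C}{2}$, the same argument would give $C = 1/\lambda$. In the same notation, the early-termination test reads
 \begin{eqnarray*}
 f(w^{(t-1)}) - f(w^{(t)}) &<& \texttt{tol} \cdot f(w^{(0)}) .
 \end{eqnarray*}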

 \noindent{\bf Details}

 Support vector machines learn a classification function by solving the
 following optimization problem ($L_2$-SVM):
 \begin{eqnarray*}
 &\textrm{argmin}_w& \frac{\lambda}{2} \|w\|_2^2 + \sum_i \xi_i^2\\
 &\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i
 \end{eqnarray*}
 where $x_i$ is an example from the training set with its label given by $y_i$,
 $w$ is the vector of parameters and $\lambda$ is the regularization constant
 specified by the user.
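
 The slack variables can be eliminated to obtain an unconstrained form that is convenient for primal optimization. The following is a short derivation from the problem above, not a statement about how the script is coded; $f$ and the active set $A(w)$ are names introduced here. For fixed $w$, each $\xi_i^2$ is minimized subject to its constraint by $\xi_i = \max(0, 1 - y_i w^{\top} x_i)$, hence
 \begin{eqnarray*}
 f(w) &=& \frac{\lambda}{2} \|w\|_2^2 + \sum_i \max(0, 1 - y_i w^{\top} x_i)^2 ,\\
 \nabla f(w) &=& \lambda w - 2 \sum_{i \in A(w)} (1 - y_i w^{\top} x_i)\, y_i x_i ,
 \qquad A(w) = \{ i : y_i w^{\top} x_i < 1 \} .
 \end{eqnarray*}
 The squared hinge term is differentiable everywhere, which is what makes gradient-based methods applicable.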

 To extend the above formulation (binary class SVM) to the multiclass setting,
 one standard approach is to learn one binary class SVM per class that
 separates data belonging to that class from the rest of the training data
 (one-against-the-rest SVM, see B. Sch\"olkopf et al, 1995).
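
 The one-against-the-rest reduction can be made concrete as follows. This is an illustrative sketch: the symbols $K$, $y_i^{(k)}$ and $w_k$, and the arg-max decision rule, are the conventional choices for this scheme and are assumed here rather than taken from the text. For $K$ classes and each $k = 1, \ldots, K$, let $w_k$ be the solution of the binary problem above with $y_i$ replaced by $y_i^{(k)}$, where
 \begin{eqnarray*}
 y_i^{(k)} &=& \left\{ \begin{array}{rl} +1 & \textrm{if } y_i = k \\ -1 & \textrm{otherwise} \end{array} \right. ,\\
 \hat{y}(x) &=& \textrm{argmax}_{k} ~ w_k^{\top} x .
 \end{eqnarray*}
 For example, the labels $y = (1, 3, 2, 1)$ with $K = 3$ give $y^{(1)} = (+1, -1, -1, +1)$, $y^{(2)} = (-1, -1, +1, -1)$ and $y^{(3)} = (-1, +1, -1, -1)$.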

 To account for the missing bias term, one may augment the data with a column
 of constants, which is achieved by setting the \texttt{icpt} argument to 1 (C-J Hsieh
 et al, 2008).
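
 The augmentation can be written out explicitly. In this sketch the symbols $\tilde{x}_i$, $\tilde{w}$ and $b$ are introduced for illustration, and the constant is assumed to be 1 and appended as the last column, consistent with the bias terms being stored in the last row of the model:
 \begin{eqnarray*}
 \tilde{x}_i = \left[ \begin{array}{c} x_i \\ 1 \end{array} \right], \quad
 \tilde{w} = \left[ \begin{array}{c} w \\ b \end{array} \right]
 &\Longrightarrow&
 \tilde{w}^{\top} \tilde{x}_i = w^{\top} x_i + b .
 \end{eqnarray*}
 If the regularizer is applied to the whole augmented vector, then $\|\tilde{w}\|_2^2 = \|w\|_2^2 + b^2$, so the bias is penalized as well.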

 This implementation optimizes the primal directly (Chapelle, 2007). It uses
 nonlinear conjugate gradient descent to minimize the objective function,
 choosing each step size by one-dimensional Newton
 minimization along the current search direction.
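
 The one-dimensional Newton step can be derived from the unconstrained form of the objective, $f(w) = \frac{\lambda}{2}\|w\|_2^2 + \sum_i \max(0, 1 - y_i w^{\top} x_i)^2$, obtained by eliminating the slack variables. The following is a sketch of the standard derivation; the symbols $d$, $t$, $g$, $A(t)$ and $\beta$ are introduced here, and the Fletcher-Reeves formula for $\beta$ is one common choice that is assumed rather than taken from the text. For a search direction $d$ and $g(t) = f(w + t d)$:
 \begin{eqnarray*}
 g'(t) &=& \lambda (w + t d)^{\top} d - 2 \sum_{i \in A(t)} \left(1 - y_i (w + t d)^{\top} x_i\right) y_i\, d^{\top} x_i ,\\
 g''(t) &=& \lambda \|d\|_2^2 + 2 \sum_{i \in A(t)} (d^{\top} x_i)^2 ,\\
 t &\leftarrow& t - g'(t) / g''(t) ,
 \end{eqnarray*}
 where $A(t) = \{ i : y_i (w + t d)^{\top} x_i < 1 \}$, $g''$ is a generalized second derivative, and $y_i^2 = 1$ was used to simplify it. After the step $w \leftarrow w + t d$, a conjugate direction is formed from the new and old gradients:
 \begin{eqnarray*}
 \beta &=& \frac{\|\nabla f(w_{\textrm{new}})\|_2^2}{\|\nabla f(w_{\textrm{old}})\|_2^2} , \qquad
 d ~\leftarrow~ -\nabla f(w_{\textrm{new}}) + \beta\, d .
 \end{eqnarray*}
 Because $g$ is piecewise quadratic in $t$, a Newton update lands on the minimizer of the current quadratic piece, so only a few updates are typically needed until the active set $A(t)$ stops changing.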
 \\

 \noindent{\bf Returns}

 The learnt weights produced by m-svm.dml are populated into a matrix that
 has as many columns as there are classes in the training data, and written
 to the file provided on HDFS (see model in section Arguments). The number of rows
 in this matrix is ncol(X) if icpt was set to 0 during invocation and ncol(X) + 1
 otherwise. The bias terms, if used, are placed in the last row. Depending on which
 arguments are provided during invocation, m-svm-predict.dml may compute one or more
 of scores, accuracy and confusion matrix in the output format specified.
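
 How the model matrix yields the three outputs can be sketched as follows. The symbols $W$, $S$, $n$, $\hat{y}_i$ and $M_{jk}$ are introduced here, and the arg-max rule, the definition of accuracy as a fraction (it may be reported on another scale, e.g. as a percentage) and the row/column orientation of the confusion matrix are assumptions based on common practice, not details confirmed by the text. With $W$ the learnt weight matrix and $n$ test examples:
 \begin{eqnarray*}
 S &=& X W \quad (\textrm{or } [X ~ \mathbf{1}]\, W \textrm{ when icpt is 1}) ,\\
 \hat{y}_i &=& \textrm{argmax}_k ~ S_{ik} ,\\
 \textrm{accuracy} &=& \frac{1}{n} \left| \{ i : \hat{y}_i = y_i \} \right| ,\\
 M_{jk} &=& \left| \{ i : \hat{y}_i = j, ~ y_i = k \} \right| .
 \end{eqnarray*}
 Accuracy and the confusion matrix both require the true labels, so they can only be computed when Y is supplied.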
 \\

 %%\noindent{\bf See Also}
 %%
 %%In case of binary classification problems, please consider using a binary class classifier
 %%learning algorithm, e.g., binary class $L_2$-SVM (see Section \ref{l2svm}) or logistic regression
 %%(see Section \ref{logreg}). To model the relationship between a scalar dependent variable
 %%y and one or more explanatory variables X, consider Linear Regression instead (see Section
 %%\ref{linreg-solver} or Section \ref{linreg-iterative}).
 %%\\
 %%
 \noindent{\bf Examples}
 \begin{verbatim}
 hadoop jar SystemML.jar -f m-svm.dml -nvargs X=/user/biadmin/X.mtx
                                              Y=/user/biadmin/y.mtx
                                              icpt=0 tol=0.001
                                              reg=1.0 maxiter=100 fmt=csv
                                              model=/user/biadmin/weights.csv
                                              Log=/user/biadmin/Log.csv
 \end{verbatim}

 \begin{verbatim}
 hadoop jar SystemML.jar -f m-svm-predict.dml -nvargs X=/user/biadmin/X.mtx
                                                      Y=/user/biadmin/y.mtx
                                                      icpt=0 fmt=csv
                                                      model=/user/biadmin/weights.csv
                                                      scores=/user/biadmin/scores.csv
                                                      accuracy=/user/biadmin/accuracy.csv
                                                      confusion=/user/biadmin/confusion.csv
 \end{verbatim}

 \noindent{\bf References}

 \begin{itemize}
 \item W. T. Vetterling and B. P. Flannery. \newblock{\em Conjugate Gradient Methods in Multidimensions in
 Numerical Recipes in C - The Art in Scientific Computing.} \newblock W. H. Press and S. A. Teukolsky
 (eds.), Cambridge University Press, 1992.
 \item J. Nocedal and S. J. Wright. \newblock{\em Numerical Optimization.} \newblock Springer-Verlag, 1999.
 \item C-J Hsieh, K-W Chang, C-J Lin, S. S. Keerthi and S. Sundararajan. \newblock {\em A Dual Coordinate
 Descent Method for Large-scale Linear SVM.} \newblock International Conference on Machine Learning
 (ICML), 2008.
 \item Olivier Chapelle. \newblock{\em Training a Support Vector Machine in the Primal.} \newblock Neural
 Computation, 2007.
 \item B. Sch\"olkopf, C. Burges and V. Vapnik. \newblock{\em Extracting Support Data for a Given Task.} \newblock International Conference on Knowledge Discovery and Data Mining (KDD), 1995.
 \end{itemize}