\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Naive Bayes}
\label{naive_bayes}
\noindent{\bf Description}
Naive Bayes is a very simple generative model used for classifying data.
This implementation learns a multinomial naive Bayes classifier, which is
applicable when all features are counts of categorical values, e.g., word
counts in text classification.
\\
\noindent{\bf Usage}
\begin{tabbing}
\texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs}
\=\texttt{X=}\textit{path}/\textit{file}
\texttt{Y=}\textit{path}/\textit{file}
\texttt{laplace=}\textit{double}\\
\>\texttt{prior=}\textit{path}/\textit{file}
\texttt{conditionals=}\textit{path}/\textit{file}\\
\>\texttt{accuracy=}\textit{path}/\textit{file}
\texttt{fmt=}\textit{csv}$\vert$\textit{text}
\end{tabbing}
\begin{tabbing}
\texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs}
\=\texttt{X=}\textit{path}/\textit{file}
\texttt{Y=}\textit{path}/\textit{file}
\texttt{prior=}\textit{path}/\textit{file}\\
\>\texttt{conditionals=}\textit{path}/\textit{file}
\texttt{fmt=}\textit{csv}$\vert$\textit{text}\\
\>\texttt{accuracy=}\textit{path}/\textit{file}
\texttt{confusion=}\textit{path}/\textit{file}\\
\>\texttt{probabilities=}\textit{path}/\textit{file}
\end{tabbing}
\noindent{\bf Arguments}
\begin{itemize}
\item X: Location (on HDFS) to read the matrix of feature vectors;
each row constitutes one feature vector.
\item Y: Location (on HDFS) to read the one-column matrix of (categorical)
labels that correspond to feature vectors in X. Classes are assumed to be
contiguously labeled beginning from 1. Note that this argument is optional
for prediction.
\item laplace (default: {\tt 1}): Laplace smoothing constant specified by the
user to avoid assigning zero probability to features that never occur in a class.
\item prior: Location (on HDFS) that contains the class prior probabilities.
\item conditionals: Location (on HDFS) that contains the class conditional
feature distributions.
\item fmt (default: {\tt text}): Specifies the output format. Choice of
comma-separated values ({\tt csv}) or sparse-matrix text format ({\tt text}).
\item probabilities: Location (on HDFS) to store class membership probabilities
for a held-out test set. Note that this is an optional argument.
\item accuracy: Location (on HDFS) to store the training accuracy during
learning and the testing accuracy from a held-out test set during prediction.
Note that this is an optional argument for prediction.
\item confusion: Location (on HDFS) to store the confusion matrix
computed using a held-out test set. Note that this is an optional
argument.
\end{itemize}
\noindent{\bf Details}
Naive Bayes is a very simple generative classification model. It posits that
given the class label, features can be generated independently of each other.
More precisely, the (multinomial) naive Bayes model uses the following
equation to model the joint probability of a feature vector $x$ and its
class label $y$:
\begin{equation*}
\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)}
\end{equation*}
where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature
present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the
class conditional probability of feature $i$ in class $y$. The usual
constraints hold on $\pi$ and $\theta$:
\begin{eqnarray*}
&& \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\
\forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1
\end{eqnarray*}
where $\mathcal{C}$ is the set of classes.
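At prediction time, Bayes' rule gives $\text{Prob}(y \mid x) \propto \text{Prob}(y, x)$,
and a new feature vector $x$ is assigned to the class with the highest posterior
probability. Working in log-space avoids numerical underflow in the product over
features:
\begin{equation*}
\hat{y} = \arg\max_{y \in \mathcal{C}} \left( \log \pi_y + \sum_{i \in x} n(i,x) \log \theta_{iy} \right)
\end{equation*}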
Given a fully labeled training dataset, it is possible to learn a naive Bayes
model using simple counting (group-by aggregates). When estimating the class
conditional probabilities, it is advisable to avoid setting $\theta_{iy}$ to 0
for feature/class pairs that never occur in the training data. One way to
achieve this is additive smoothing, also known as Laplace smoothing, and some
authors have argued that the smoothing constant should be 1 (add-one smoothing).
This implementation uses add-one smoothing by default but lets the user specify
a different constant, if required. This model is commonly referred to as
\emph{multinomial} naive Bayes; other flavours of naive Bayes are also popular.
\\
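To make the counting step concrete, let $c_{iy}$ denote the total count of
feature $i$ over all training examples labeled $y$, $n_y$ the number of training
examples labeled $y$, $n$ the total number of training examples, $d$ the number
of features, and $\ell$ the value of the {\tt laplace} argument. The additively
smoothed estimates then take the standard form
\begin{equation*}
\hat{\pi}_y = \frac{n_y}{n}, \qquad
\hat{\theta}_{iy} = \frac{c_{iy} + \ell}{\sum_{j=1}^{d} c_{jy} + \ell \, d}
\end{equation*}
so that $\ell = 1$ recovers add-one smoothing. The following is a minimal DML
sketch of this counting-based estimation; it is for illustration only and is
not the shipped naive-bayes.dml script, which additionally validates its inputs
and reports training accuracy.
\begin{verbatim}
# Minimal DML sketch of counting-based training (illustrative only)
X = read($X)                  # n x d matrix of feature counts
Y = read($Y)                  # n x 1 matrix of labels in {1, ..., k}
laplace = ifdef($laplace, 1)  # smoothing constant; add-one by default
# one-hot indicator of class membership: n x k
I = table(seq(1, nrow(X)), Y)
classCounts = t(colSums(I))   # k x 1: number of examples per class
featureCounts = t(I) %*% X    # k x d: per-class feature counts
prior = classCounts / nrow(X)
conditionals = (featureCounts + laplace) / (rowSums(featureCounts) + laplace * ncol(X))
write(prior, $prior, format=$fmt)
write(conditionals, $conditionals, format=$fmt)
\end{verbatim}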
\noindent{\bf Returns}
The learnt model produced by naive-bayes.dml is stored in two separate files.
The first file stores the class prior (a single-column matrix). The second file
stores the class conditional probabilities organized into a matrix with as many
rows as there are class labels and as many columns as there are features.
Depending on the arguments provided during invocation, naive-bayes-predict.dml
may compute one or more of the class membership probabilities, the prediction
accuracy, and the confusion matrix, in the specified output format, as sketched
below.
\\
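To make the prediction step concrete, the following is a minimal DML sketch of
how class membership probabilities can be computed from the two model files; it
is for illustration only and is not the shipped naive-bayes-predict.dml script,
which additionally handles the optional accuracy and confusion matrix outputs.
\begin{verbatim}
# Minimal DML sketch of scoring (illustrative only)
X = read($X)                        # n x d matrix of feature counts
prior = read($prior)                # k x 1 class prior
conditionals = read($conditionals)  # k x d class conditionals
# log Prob(y, x) up to terms that do not depend on y: n x k
logScores = X %*% t(log(conditionals)) + t(log(prior))
# normalize in log-space to obtain class membership probabilities
probs = exp(logScores - rowMaxs(logScores))
probs = probs / rowSums(probs)
predictions = rowIndexMax(probs)    # predicted label per row
write(probs, $probabilities, format=$fmt)
\end{verbatim}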
\noindent{\bf Examples}
\begin{verbatim}
hadoop jar SystemML.jar -f naive-bayes.dml -nvargs
X=/user/biadmin/X.mtx
Y=/user/biadmin/y.mtx
laplace=1 fmt=csv
prior=/user/biadmin/prior.csv
conditionals=/user/biadmin/conditionals.csv
accuracy=/user/biadmin/accuracy.csv
\end{verbatim}
\begin{verbatim}
hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs
X=/user/biadmin/X.mtx
Y=/user/biadmin/y.mtx
prior=/user/biadmin/prior.csv
conditionals=/user/biadmin/conditionals.csv
fmt=csv
accuracy=/user/biadmin/accuracy.csv
probabilities=/user/biadmin/probabilities.csv
confusion=/user/biadmin/confusion.csv
\end{verbatim}
\noindent{\bf References}
\begin{itemize}
\item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009.
\item A. McCallum and K. Nigam. \newblock{\em A Comparison of Event Models for Naive Bayes Text Classification.}
\newblock In AAAI-98 Workshop on Learning for Text Categorization, 1998.
\end{itemize}