\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Naive Bayes}
\label{naive_bayes}
\noindent{\bf Description}
Naive Bayes is a very simple generative model used for classifying data.
This implementation learns a multinomial naive Bayes classifier, which is
applicable when all features are counts of categorical values, e.g., word
counts in text classification.
\\
\noindent{\bf Usage}
\begin{tabbing}
\texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs}
\=\texttt{X=}\textit{path}/\textit{file}
\texttt{Y=}\textit{path}/\textit{file}
\texttt{laplace=}\textit{double}\\
\>\texttt{prior=}\textit{path}/\textit{file}
\texttt{conditionals=}\textit{path}/\textit{file}\\
\>\texttt{accuracy=}\textit{path}/\textit{file}
\texttt{fmt=}\textit{csv}$\vert$\textit{text}
\end{tabbing}
\begin{tabbing}
\texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs}
\=\texttt{X=}\textit{path}/\textit{file}
\texttt{Y=}\textit{path}/\textit{file}
\texttt{prior=}\textit{path}/\textit{file}\\
\>\texttt{conditionals=}\textit{path}/\textit{file}
\texttt{fmt=}\textit{csv}$\vert$\textit{text}\\
\>\texttt{accuracy=}\textit{path}/\textit{file}
\texttt{confusion=}\textit{path}/\textit{file}\\
\>\texttt{probabilities=}\textit{path}/\textit{file}
\end{tabbing}
\noindent{\bf Arguments}
\begin{itemize}
\item X: Location (on HDFS) to read the matrix of feature vectors;
each row constitutes one feature vector.
\item Y: Location (on HDFS) to read the one-column matrix of (categorical)
labels that correspond to feature vectors in X. Classes are assumed to be
contiguously labeled beginning from 1. Note that this argument is optional
for prediction.
\item laplace (default: {\tt 1}): Laplace smoothing constant specified by the
user to avoid assigning zero probability to features that never occur in a class.
\item prior: Location (on HDFS) that contains the class prior probabilities.
\item conditionals: Location (on HDFS) that contains the class conditional
feature distributions.
\item fmt (default: {\tt text}): Specifies the output format. Choice of
comma-separated values ({\tt csv}) or sparse-matrix text format ({\tt text}).
\item probabilities: Location (on HDFS) to store class membership probabilities
for a held-out test set. Note that this is an optional argument.
\item accuracy: Location (on HDFS) to store the training accuracy during
learning and the testing accuracy from a held-out test set during prediction.
Note that this is an optional argument for prediction.
\item confusion: Location (on HDFS) to store the confusion matrix
computed using a held-out test set. Note that this is an optional
argument.
\end{itemize}
\noindent{\bf Details}
Naive Bayes is a very simple generative classification model. It posits that
given the class label, features can be generated independently of each other.
More precisely, the (multinomial) naive Bayes model uses the following
equation to model the joint probability of a feature vector $x$ and its
class label $y$:
\begin{equation*}
\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)}
\end{equation*}
where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature
present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the
class conditional probability of feature $i$ in class $y$. The usual
constraints hold on $\pi$ and $\theta$:
\begin{eqnarray*}
&& \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\
\forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1
\end{eqnarray*}
where $\mathcal{C}$ is the set of classes.
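At prediction time, Bayes' rule gives $\text{Prob}(y \mid x) \propto \text{Prob}(y, x)$,
and a new feature vector $x$ is assigned to the class with the highest posterior
probability. Working in log-space avoids numerical underflow in the product over
features:
\begin{equation*}
\hat{y} = \arg\max_{y \in \mathcal{C}} \left( \log \pi_y + \sum_{i \in x} n(i,x) \log \theta_{iy} \right)
\end{equation*}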
Given a fully labeled training dataset, it is possible to learn a naive Bayes
model using simple counting (group-by aggregates). When estimating the class
conditional probabilities, it is advisable to avoid setting $\theta_{iy}$ to 0
for feature/class pairs that never occur in the training data. One way to
achieve this is additive smoothing, also known as Laplace smoothing, and some
authors have argued that the smoothing constant should be 1 (add-one smoothing).
This implementation uses add-one smoothing by default but lets the user specify
a different constant, if required. This model is commonly referred to as
\emph{multinomial} naive Bayes; other flavours of naive Bayes are also popular.
\\
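To make the counting step concrete, let $c_{iy}$ denote the total count of
feature $i$ over all training examples labeled $y$, $n_y$ the number of training
examples labeled $y$, $n$ the total number of training examples, $d$ the number
of features, and $\ell$ the value of the {\tt laplace} argument. The additively
smoothed estimates then take the standard form
\begin{equation*}
\hat{\pi}_y = \frac{n_y}{n}, \qquad
\hat{\theta}_{iy} = \frac{c_{iy} + \ell}{\sum_{j=1}^{d} c_{jy} + \ell \, d}
\end{equation*}
so that $\ell = 1$ recovers add-one smoothing. The following is a minimal DML
sketch of this counting-based estimation; it is for illustration only and is
not the shipped naive-bayes.dml script, which additionally validates its inputs
and reports training accuracy.
\begin{verbatim}
# Minimal DML sketch of counting-based training (illustrative only)
X = read($X)                  # n x d matrix of feature counts
Y = read($Y)                  # n x 1 matrix of labels in {1, ..., k}
laplace = ifdef($laplace, 1)  # smoothing constant; add-one by default
# one-hot indicator of class membership: n x k
I = table(seq(1, nrow(X)), Y)
classCounts = t(colSums(I))   # k x 1: number of examples per class
featureCounts = t(I) %*% X    # k x d: per-class feature counts
prior = classCounts / nrow(X)
conditionals = (featureCounts + laplace) / (rowSums(featureCounts) + laplace * ncol(X))
write(prior, $prior, format=$fmt)
write(conditionals, $conditionals, format=$fmt)
\end{verbatim}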
\noindent{\bf Returns}
The learnt model produced by naive-bayes.dml is stored in two separate files.
The first file stores the class prior (a single-column matrix). The second file
stores the class conditional probabilities organized into a matrix with as many
rows as there are class labels and as many columns as there are features.
Depending on the arguments provided during invocation, naive-bayes-predict.dml
may compute one or more of the class membership probabilities, the prediction
accuracy, and the confusion matrix, in the specified output format, as sketched
below.
\\
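To make the prediction step concrete, the following is a minimal DML sketch of
how class membership probabilities can be computed from the two model files; it
is for illustration only and is not the shipped naive-bayes-predict.dml script,
which additionally handles the optional accuracy and confusion matrix outputs.
\begin{verbatim}
# Minimal DML sketch of scoring (illustrative only)
X = read($X)                        # n x d matrix of feature counts
prior = read($prior)                # k x 1 class prior
conditionals = read($conditionals)  # k x d class conditionals
# log Prob(y, x) up to terms that do not depend on y: n x k
logScores = X %*% t(log(conditionals)) + t(log(prior))
# normalize in log-space to obtain class membership probabilities
probs = exp(logScores - rowMaxs(logScores))
probs = probs / rowSums(probs)
predictions = rowIndexMax(probs)    # predicted label per row
write(probs, $probabilities, format=$fmt)
\end{verbatim}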
\noindent{\bf Examples}
\begin{verbatim}
hadoop jar SystemML.jar -f naive-bayes.dml -nvargs
X=/user/biadmin/X.mtx
Y=/user/biadmin/y.mtx
laplace=1 fmt=csv
prior=/user/biadmin/prior.csv
conditionals=/user/biadmin/conditionals.csv
accuracy=/user/biadmin/accuracy.csv
\end{verbatim}
\begin{verbatim}
hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs
X=/user/biadmin/X.mtx
Y=/user/biadmin/y.mtx
prior=/user/biadmin/prior.csv
conditionals=/user/biadmin/conditionals.csv
fmt=csv
accuracy=/user/biadmin/accuracy.csv
probabilities=/user/biadmin/probabilities.csv
confusion=/user/biadmin/confusion.csv
\end{verbatim}
\noindent{\bf References}
\begin{itemize}
\item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009.
\item A. McCallum and K. Nigam. \newblock{\em A Comparison of Event Models for Naive Bayes Text Classification.}
\newblock In AAAI-98 Workshop on Learning for Text Categorization, 1998.
\end{itemize}