\begin{comment} | |

Licensed to the Apache Software Foundation (ASF) under one | |

or more contributor license agreements. See the NOTICE file | |

distributed with this work for additional information | |

regarding copyright ownership. The ASF licenses this file | |

to you under the Apache License, Version 2.0 (the | |

"License"); you may not use this file except in compliance | |

with the License. You may obtain a copy of the License at | |

http://www.apache.org/licenses/LICENSE-2.0 | |

Unless required by applicable law or agreed to in writing, | |

software distributed under the License is distributed on an | |

"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |

KIND, either express or implied. See the License for the | |

specific language governing permissions and limitations | |

under the License. | |

\end{comment} | |

\subsection{Naive Bayes} | |

\label{naive_bayes} | |

\noindent{\bf Description} | |

Naive Bayes is very simple generative model used for classifying data. | |

This implementation learns a multinomial naive Bayes classifier which | |

is applicable when all features are counts of categorical values. | |

\\ | |

\noindent{\bf Usage} | |

\begin{tabbing} | |

\texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs} | |

\=\texttt{X=}\textit{path}/\textit{file} | |

\texttt{Y=}\textit{path}/\textit{file} | |

\texttt{laplace=}\textit{double}\\ | |

\>\texttt{prior=}\textit{path}/\textit{file} | |

\texttt{conditionals=}\textit{path}/\textit{file}\\ | |

\>\texttt{accuracy=}\textit{path}/\textit{file} | |

\texttt{fmt=}\textit{csv}$\vert$\textit{text} | |

\end{tabbing} | |

\begin{tabbing} | |

\texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs} | |

\=\texttt{X=}\textit{path}/\textit{file} | |

\texttt{Y=}\textit{path}/\textit{file} | |

\texttt{prior=}\textit{path}/\textit{file}\\ | |

\>\texttt{conditionals=}\textit{path}/\textit{file} | |

\texttt{fmt=}\textit{csv}$\vert$\textit{text}\\ | |

\>\texttt{accuracy=}\textit{path}/\textit{file} | |

\texttt{confusion=}\textit{path}/\textit{file}\\ | |

\>\texttt{probabilities=}\textit{path}/\textit{file} | |

\end{tabbing} | |

\noindent{\bf Arguments} | |

\begin{itemize} | |

\item X: Location (on HDFS) to read the matrix of feature vectors; | |

each row constitutes one feature vector. | |

\item Y: Location (on HDFS) to read the one-column matrix of (categorical) | |

labels that correspond to feature vectors in X. Classes are assumed to be | |

contiguously labeled beginning from 1. Note that, this argument is optional | |

for prediction. | |

\item laplace (default: {\tt 1}): Laplace smoothing specified by the | |

user to avoid creation of 0 probabilities. | |

\item prior: Location (on HDFS) that contains the class prior probabilites. | |

\item conditionals: Location (on HDFS) that contains the class conditional | |

feature distributions. | |

\item fmt (default: {\tt text}): Specifies the output format. Choice of | |

comma-separated values (csv) or as a sparse-matrix (text). | |

\item probabilities: Location (on HDFS) to store class membership probabilities | |

for a held-out test set. Note that, this is an optional argument. | |

\item accuracy: Location (on HDFS) to store the training accuracy during | |

learning and testing accuracy from a held-out test set during prediction. | |

Note that, this is an optional argument for prediction. | |

\item confusion: Location (on HDFS) to store the confusion matrix | |

computed using a held-out test set. Note that, this is an optional | |

argument. | |

\end{itemize} | |

\noindent{\bf Details} | |

Naive Bayes is a very simple generative classification model. It posits that | |

given the class label, features can be generated independently of each other. | |

More precisely, the (multinomial) naive Bayes model uses the following | |

equation to estimate the joint probability of a feature vector $x$ belonging | |

to class $y$: | |

\begin{equation*} | |

\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)} | |

\end{equation*} | |

where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature | |

present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the | |

class conditional probability of feature $i$ in class $y$. The usual | |

constraints hold on $\pi$ and $\theta$: | |

\begin{eqnarray*} | |

&& \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\ | |

\forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1 | |

\end{eqnarray*} | |

where $\mathcal{C}$ is the set of classes. | |

Given a fully labeled training dataset, it is possible to learn a naive Bayes | |

model using simple counting (group-by aggregates). To compute the class conditional | |

probabilities, it is usually advisable to avoid setting $\theta_{iy}$ to 0. One way to | |

achieve this is using additive smoothing or Laplace smoothing. Some authors have argued | |

that this should in fact be add-one smoothing. This implementation uses add-one smoothing | |

by default but lets the user specify her/his own constant, if required. | |

This implementation is sometimes referred to as \emph{multinomial} naive Bayes. Other | |

flavours of naive Bayes are also popular. | |

\\ | |

\noindent{\bf Returns} | |

The learnt model produced by naive-bayes.dml is stored in two separate files. | |

The first file stores the class prior (a single-column matrix). The second file | |

stores the class conditional probabilities organized into a matrix with as many | |

rows as there are class labels and as many columns as there are features. | |

Depending on what arguments are provided during invocation, naive-bayes-predict.dml | |

may compute one or more of probabilities, accuracy and confusion matrix in the | |

output format specified. | |

\\ | |

\noindent{\bf Examples} | |

\begin{verbatim} | |

hadoop jar SystemML.jar -f naive-bayes.dml -nvargs | |

X=/user/biadmin/X.mtx | |

Y=/user/biadmin/y.mtx | |

laplace=1 fmt=csv | |

prior=/user/biadmin/prior.csv | |

conditionals=/user/biadmin/conditionals.csv | |

accuracy=/user/biadmin/accuracy.csv | |

\end{verbatim} | |

\begin{verbatim} | |

hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs | |

X=/user/biadmin/X.mtx | |

Y=/user/biadmin/y.mtx | |

prior=/user/biadmin/prior.csv | |

conditionals=/user/biadmin/conditionals.csv | |

fmt=csv | |

accuracy=/user/biadmin/accuracy.csv | |

probabilities=/user/biadmin/probabilities.csv | |

confusion=/user/biadmin/confusion.csv | |

\end{verbatim} | |

\noindent{\bf References} | |

\begin{itemize} | |

\item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009. | |

\item A. McCallum and K. Nigam. \newblock{\em A comparison of event models for naive bayes text classification.} | |

\newblock AAAI-98 workshop on learning for text categorization, 1998. | |

\end{itemize} |