| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \subsection{Naive Bayes} |
| \label{naive_bayes} |
| |
| \noindent{\bf Description} |
| |
Naive Bayes is a very simple generative model used for classifying data.
This implementation learns a multinomial naive Bayes classifier, which
is applicable when all features are counts of categorical values.
| \\ |
| |
| \noindent{\bf Usage} |
| |
| \begin{tabbing} |
| \texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs} |
| \=\texttt{X=}\textit{path}/\textit{file} |
| \texttt{Y=}\textit{path}/\textit{file} |
| \texttt{laplace=}\textit{double}\\ |
| \>\texttt{prior=}\textit{path}/\textit{file} |
| \texttt{conditionals=}\textit{path}/\textit{file}\\ |
| \>\texttt{accuracy=}\textit{path}/\textit{file} |
| \texttt{fmt=}\textit{csv}$\vert$\textit{text} |
| \end{tabbing} |
| |
| \begin{tabbing} |
| \texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs} |
| \=\texttt{X=}\textit{path}/\textit{file} |
| \texttt{Y=}\textit{path}/\textit{file} |
| \texttt{prior=}\textit{path}/\textit{file}\\ |
| \>\texttt{conditionals=}\textit{path}/\textit{file} |
| \texttt{fmt=}\textit{csv}$\vert$\textit{text}\\ |
| \>\texttt{accuracy=}\textit{path}/\textit{file} |
| \texttt{confusion=}\textit{path}/\textit{file}\\ |
| \>\texttt{probabilities=}\textit{path}/\textit{file} |
| \end{tabbing} |
| |
| \noindent{\bf Arguments} |
| |
| \begin{itemize} |
| \item X: Location (on HDFS) to read the matrix of feature vectors; |
| each row constitutes one feature vector. |
\item Y: Location (on HDFS) to read the one-column matrix of (categorical)
labels that correspond to feature vectors in X. Classes are assumed to be
contiguously labeled beginning from 1. Note that this argument is optional
for prediction.
\item laplace (default: {\tt 1}): Laplace smoothing constant specified by the
user to avoid the creation of 0 probabilities.
\item prior: Location (on HDFS) that contains the class prior probabilities.
| \item conditionals: Location (on HDFS) that contains the class conditional |
| feature distributions. |
\item fmt (default: {\tt text}): Specifies the output format. Choice of
comma-separated values ({\tt csv}) or sparse-matrix format ({\tt text}).
\item probabilities: Location (on HDFS) to store class membership probabilities
for a held-out test set. Note that this is an optional argument.
\item accuracy: Location (on HDFS) to store the training accuracy during
learning and the testing accuracy from a held-out test set during prediction.
Note that this is an optional argument for prediction.
\item confusion: Location (on HDFS) to store the confusion matrix
computed using a held-out test set. Note that this is an optional
argument.
| \end{itemize} |
| |
| \noindent{\bf Details} |
| |
| Naive Bayes is a very simple generative classification model. It posits that |
| given the class label, features can be generated independently of each other. |
| More precisely, the (multinomial) naive Bayes model uses the following |
| equation to estimate the joint probability of a feature vector $x$ belonging |
| to class $y$: |
| \begin{equation*} |
| \text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)} |
| \end{equation*} |
| where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature |
| present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the |
| class conditional probability of feature $i$ in class $y$. The usual |
| constraints hold on $\pi$ and $\theta$: |
| \begin{eqnarray*} |
| && \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\ |
| \forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1 |
| \end{eqnarray*} |
| where $\mathcal{C}$ is the set of classes. |
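
In practice, the most probable class is found by comparing these joint
probabilities across all classes, usually in log space to avoid numerical
underflow. The following is a minimal NumPy sketch of this scoring rule;
the array names ({\tt prior}, {\tt conditionals}, {\tt x}) are illustrative
assumptions mirroring the quantities above, not SystemML code.

\begin{verbatim}
import numpy as np

def predict(x, prior, conditionals):
    # log Prob(y, x) = log(pi_y) + sum_i n(i,x) * log(theta_iy);
    # prior is (k,), conditionals is (k, d), x is a (d,) count vector
    log_scores = np.log(prior) + np.log(conditionals) @ x
    return np.argmax(log_scores) + 1  # classes are labeled from 1
\end{verbatim}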
| |
Given a fully labeled training dataset, it is possible to learn a naive Bayes
model using simple counting (group-by aggregates). To compute the class conditional
probabilities, it is usually advisable to avoid setting $\theta_{iy}$ to 0, which is
commonly achieved via additive smoothing, also known as Laplace smoothing. Some
authors have argued that the smoothing constant should in fact be 1 (add-one
smoothing). This implementation uses add-one smoothing by default but lets the user
specify their own constant via the {\tt laplace} argument, if required.
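
As a concrete illustration of this counting procedure, here is a minimal NumPy
sketch of the estimation step (a simplification for exposition, not the DML
implementation); it assumes X holds count-valued feature vectors and y holds
contiguous, integer-typed 1-based class labels, as described above.

\begin{verbatim}
import numpy as np

def train(X, y, laplace=1.0):
    # group-by aggregate: per-class totals of each feature count
    k, d = int(y.max()), X.shape[1]
    counts = np.zeros((k, d))
    for c in range(1, k + 1):
        counts[c - 1] = X[y == c].sum(axis=0)
    # additive (Laplace) smoothing avoids zero probabilities
    smoothed = counts + laplace
    theta = smoothed / smoothed.sum(axis=1, keepdims=True)
    pi = np.bincount(y - 1, minlength=k) / len(y)
    return pi, theta
\end{verbatim}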
| |
This implementation is sometimes referred to as \emph{multinomial} naive Bayes.
Other flavours of naive Bayes, which make different distributional assumptions
about the features, are also popular.
| \\ |
| |
| \noindent{\bf Returns} |
| |
The learnt model produced by naive-bayes.dml is stored in two separate files.
The first file stores the class prior (a single-column matrix). The second file
stores the class conditional probabilities organized into a matrix with as many
rows as there are class labels and as many columns as there are features.
Depending on what arguments are provided during invocation, naive-bayes-predict.dml
may compute one or more of the class membership probabilities, the prediction
accuracy, and the confusion matrix, in the output format specified.
| \\ |
| |
| \noindent{\bf Examples} |
| |
| \begin{verbatim} |
| hadoop jar SystemML.jar -f naive-bayes.dml -nvargs |
| X=/user/biadmin/X.mtx |
| Y=/user/biadmin/y.mtx |
| laplace=1 fmt=csv |
| prior=/user/biadmin/prior.csv |
| conditionals=/user/biadmin/conditionals.csv |
| accuracy=/user/biadmin/accuracy.csv |
| \end{verbatim} |
| |
| \begin{verbatim} |
| hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs |
| X=/user/biadmin/X.mtx |
| Y=/user/biadmin/y.mtx |
| prior=/user/biadmin/prior.csv |
| conditionals=/user/biadmin/conditionals.csv |
| fmt=csv |
| accuracy=/user/biadmin/accuracy.csv |
| probabilities=/user/biadmin/probabilities.csv |
| confusion=/user/biadmin/confusion.csv |
| \end{verbatim} |
| |
| \noindent{\bf References} |
| |
| \begin{itemize} |
| \item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009. |
\item A. McCallum and K. Nigam. \newblock{\em A Comparison of Event Models for Naive Bayes Text Classification.}
\newblock In AAAI-98 Workshop on Learning for Text Categorization, 1998.
| \end{itemize} |