 ---
 layout: global
 title: Naive Bayes - MLlib
 displayTitle: <a href="mllib-guide.html">MLlib</a> - Naive Bayes
 ---

 [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
 multiclass classification algorithm that assumes every pair of features is
 independent given the label. Naive Bayes can be trained very efficiently: in a
 single pass over the training data, it computes the conditional probability
 distribution of each feature given the label. It then applies Bayes' theorem to
 compute the conditional probability distribution of the label given an observation
 and uses that distribution for prediction.
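
As a sketch of the computation described above (the notation here is ours and is not taken from the MLlib API): write $y$ for a label and $x = (x_1, \dots, x_n)$ for an observation. Under the naive assumption that the features are independent given the label, the likelihood factors over the features, and Bayes' theorem gives

$$
P(y \mid x_1, \dots, x_n)
  = \frac{P(y) \prod_{j=1}^{n} P(x_j \mid y)}{P(x_1, \dots, x_n)}
  \propto P(y) \prod_{j=1}^{n} P(x_j \mid y).
$$

The denominator does not depend on $y$, so the predicted label is the one that maximizes the numerator, which is usually evaluated in log space to avoid underflow:

$$
\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{j=1}^{n} \log P(x_j \mid y) \right].
$$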

 MLlib supports [multinomial naive
 Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
 and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
 These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
 Within that context, each observation is a document and each
 feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or
 a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
 Feature values must be nonnegative. The model type is selected with an optional parameter
 whose value is either "multinomial" (the default) or "bernoulli".
 [Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
 setting the parameter $\lambda$ (default: $1.0$). For document classification, the input feature
 vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
 sparsity. Since the training data is only used once, it is not necessary to cache it.
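
To make the roles of the model type, $\lambda$, and sparse input concrete, here is a minimal pure-Python sketch of textbook naive Bayes with additive smoothing. It is an illustration under stated assumptions, not MLlib's implementation: the function names `train_naive_bayes` and `predict`, the dict-based sparse representation, and the choice to smooth the class priors are all ours.

```python
import math
from collections import defaultdict


def train_naive_bayes(points, num_features, lam=1.0, model_type="multinomial"):
    """Fit a naive Bayes model in one pass over `points` (hypothetical helper).

    `points` is an iterable of (label, features) pairs, where `features`
    is a sparse {feature_index: nonnegative value} dict.
    """
    if model_type not in ("multinomial", "bernoulli"):
        raise ValueError("model_type must be 'multinomial' or 'bernoulli'")
    if lam <= 0:
        raise ValueError("this sketch assumes lam > 0")

    counts = defaultdict(int)                       # label -> number of observations
    sums = defaultdict(lambda: defaultdict(float))  # label -> feature index -> summed value

    # Single pass: only the nonzero entries of each sparse vector are touched.
    for label, features in points:
        counts[label] += 1
        for j, value in features.items():
            if value < 0 or (model_type == "bernoulli" and value not in (0, 1)):
                raise ValueError("invalid feature value: %r" % (value,))
            sums[label][j] += value

    total = sum(counts.values())
    classes = {}
    for label, n in counts.items():
        # Assumption of this sketch: the class prior is smoothed with lam as well.
        log_prior = math.log((n + lam) / (total + lam * len(counts)))
        if model_type == "multinomial":
            # Smoothed share of feature j among all term occurrences in the class.
            denom = sum(sums[label].values()) + lam * num_features
        else:
            # Smoothed fraction of the class's observations that contain feature j.
            denom = n + 2.0 * lam
        log_theta = [math.log((sums[label].get(j, 0.0) + lam) / denom)
                     for j in range(num_features)]
        classes[label] = (log_prior, log_theta)
    return {"model_type": model_type, "classes": classes}


def predict(model, features):
    """Return the label with the highest log-score for a sparse observation."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, log_theta) in model["classes"].items():
        score = log_prior
        if model["model_type"] == "multinomial":
            score += sum(value * log_theta[j] for j, value in features.items())
        else:
            for j, lt in enumerate(log_theta):
                # log(theta) if the term is present, log(1 - theta) if it is absent.
                score += lt if features.get(j, 0) else math.log1p(-math.exp(lt))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Four toy "documents" over a three-term vocabulary (made-up data).
docs = [(0.0, {0: 2.0, 1: 1.0}), (0.0, {0: 3.0}),
        (1.0, {1: 1.0, 2: 2.0}), (1.0, {2: 4.0})]
model = train_naive_bayes(docs, num_features=3, lam=1.0)
print(predict(model, {0: 1.0}))  # 0.0
print(predict(model, {2: 3.0}))  # 1.0
```

Because training only accumulates per-label sums, the cost of the pass is proportional to the number of nonzero entries, which is why sparse input pays off. With $\lambda > 0$, a term never seen with a label still receives a nonzero probability instead of forcing that label's score to $-\infty$.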

 ## Examples

 <div class="codetabs">
 <div data-lang="scala" markdown="1">

 [NaiveBayes](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
 multinomial naive Bayes. It takes as input an RDD of
 [LabeledPoint](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint), an optional
 smoothing parameter `lambda`, and an optional model type parameter (default is "multinomial"), and outputs a
 [NaiveBayesModel](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
 can be used for evaluation and prediction.

 {% highlight scala %}
 import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint

 val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
 val parsedData = data.map { line =>
   val parts = line.split(',')
   LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
 }
 // Split data into training (60%) and test (40%).
 val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
 val training = splits(0)
 val test = splits(1)

 val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

 val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
 val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

 // Save and load model
 model.save(sc, "myModelPath")
 val sameModel = NaiveBayesModel.load(sc, "myModelPath")
 {% endhighlight %}
 </div>

 <div data-lang="java" markdown="1">

 [NaiveBayes](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) implements
 multinomial naive Bayes. It takes a Scala RDD of
 [LabeledPoint](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) and an
 optional smoothing parameter `lambda` as input, and outputs a
 [NaiveBayesModel](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
 can be used for evaluation and prediction.

 {% highlight java %}
 import scala.Tuple2;

 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.Function;
 import org.apache.spark.api.java.function.PairFunction;
 import org.apache.spark.mllib.classification.NaiveBayes;
 import org.apache.spark.mllib.classification.NaiveBayesModel;
 import org.apache.spark.mllib.regression.LabeledPoint;

 JavaRDD<LabeledPoint> training = ... // training set
 JavaRDD<LabeledPoint> test = ... // test set

 final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

 JavaPairRDD<Double, Double> predictionAndLabel =
   test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
     @Override public Tuple2<Double, Double> call(LabeledPoint p) {
       return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
     }
   });
 double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
     @Override public Boolean call(Tuple2<Double, Double> pl) {
       return pl._1().equals(pl._2());
     }
   }).count() / (double) test.count();

 // Save and load model
 model.save(sc.sc(), "myModelPath");
 NaiveBayesModel sameModel = NaiveBayesModel.load(sc.sc(), "myModelPath");
 {% endhighlight %}
 </div>

 <div data-lang="python" markdown="1">

 [NaiveBayes](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) implements multinomial
 naive Bayes. It takes an RDD of
 [LabeledPoint](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) and an optional
 smoothing parameter `lambda` as input, and outputs a
 [NaiveBayesModel](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel), which can be
 used for evaluation and prediction.

 Note that model save/load in the Python API, as used at the end of the example below, is not available in older Spark releases.

 {% highlight python %}
 from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
 from pyspark.mllib.linalg import Vectors
 from pyspark.mllib.regression import LabeledPoint

 def parseLine(line):
     parts = line.split(',')
     label = float(parts[0])
     features = Vectors.dense([float(x) for x in parts[1].split(' ')])
     return LabeledPoint(label, features)

 data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)

 # Split data approximately into training (60%) and test (40%)
 training, test = data.randomSplit([0.6, 0.4], seed=0)

 # Train a naive Bayes model.
 model = NaiveBayes.train(training, 1.0)

 # Make prediction and test accuracy.
 predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
 # Index into the pair instead of unpacking it in the lambda signature,
 # which is a syntax error in Python 3.
 accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

 # Save and load model
 model.save(sc, "myModelPath")
 sameModel = NaiveBayesModel.load(sc, "myModelPath")
 {% endhighlight %}

 </div>
 </div>