 ---
 layout: global
 title: Naive Bayes - RDD-based API
 displayTitle: Naive Bayes - RDD-based API
 license: |
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
 ---

 [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
 multiclass classification algorithm that assumes every pair of features is
 conditionally independent given the label. Naive Bayes can be trained very
 efficiently: in a single pass over the training data, it computes the conditional
 probability distribution of each feature given the label. For prediction, it
 applies Bayes' theorem to compute the conditional probability distribution of
 the label given an observation.
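
As a brief illustration of the rule described above, and using notation introduced here only for this sketch ($y$ for a label, $x_1, \dots, x_n$ for the feature values of one observation), the standard naive Bayes formulation can be written as:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)},
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator does not depend on $y$, so prediction only needs the class prior $P(y)$ and the per-feature conditionals $P(x_i \mid y)$, which are the quantities gathered in the single training pass.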

 `spark.mllib` supports [multinomial naive
 Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
 and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
 These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
 Within that context, each observation is a document and each
 feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or
 a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
 Feature values must be nonnegative. The model type is selected with an optional parameter,
 "multinomial" or "bernoulli", with "multinomial" as the default.
 [Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be applied by
 setting the parameter $\lambda$ (default $1.0$). For document classification, the input feature
 vectors are usually sparse, so supply sparse vectors as input to take advantage of
 sparsity. Since the training data is only used once, it is not necessary to cache it.
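
To make the two model types and the role of $\lambda$ concrete, here is a minimal pure-Python sketch. It is not the `spark.mllib` implementation: the function names `train_naive_bayes` and `predict`, the use of plain dicts as sparse vectors (feature index to value), and the choice to smooth the class priors with the same $\lambda$ are all assumptions made for illustration. It also assumes $\lambda > 0$ so that no probability is exactly zero.

```python
import math
from collections import defaultdict


def train_naive_bayes(data, num_features, lam=1.0, model_type="multinomial"):
    """Train on (label, sparse_features) pairs; sparse_features maps index -> value."""
    if model_type not in ("multinomial", "bernoulli"):
        raise ValueError("unknown model type: " + model_type)
    doc_counts = defaultdict(int)                            # observations per label
    feature_sums = defaultdict(lambda: defaultdict(float))   # per-label feature totals

    # Single pass over the training data: only counts are accumulated.
    for label, features in data:
        doc_counts[label] += 1
        for index, value in features.items():
            if value < 0:
                raise ValueError("feature values must be nonnegative")
            if model_type == "bernoulli" and value not in (0, 1):
                raise ValueError("bernoulli features must be 0 or 1")
            feature_sums[label][index] += value

    num_docs = sum(doc_counts.values())
    num_labels = len(doc_counts)
    model = {"type": model_type, "log_prior": {}, "log_theta": {}}
    for label, n_docs in doc_counts.items():
        # Smoothed class prior (an assumption of this sketch).
        model["log_prior"][label] = math.log(
            (n_docs + lam) / (num_docs + lam * num_labels))
        if model_type == "multinomial":
            # Denominator: total term count for the label, plus smoothing mass.
            denom = sum(feature_sums[label].values()) + lam * num_features
        else:
            # Denominator: documents with the label, plus mass for {0, 1}.
            denom = n_docs + 2.0 * lam
        model["log_theta"][label] = [
            math.log((feature_sums[label].get(i, 0.0) + lam) / denom)
            for i in range(num_features)
        ]
    return model


def predict(model, features):
    """Return the label with the highest log posterior for one sparse observation."""
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        log_theta = model["log_theta"][label]
        score = log_prior
        if model["type"] == "multinomial":
            # Only nonzero entries contribute, which is why sparsity helps.
            score += sum(value * log_theta[i] for i, value in features.items())
        else:
            # Bernoulli also scores absent terms via log(1 - theta).
            for i, lt in enumerate(log_theta):
                present = features.get(i, 0) == 1
                score += lt if present else math.log1p(-math.exp(lt))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    (0.0, {0: 2.0, 1: 1.0}),
    (0.0, {0: 1.0}),
    (1.0, {2: 3.0}),
    (1.0, {1: 1.0, 2: 1.0}),
]
model = train_naive_bayes(training, num_features=3, lam=1.0)
print(predict(model, {0: 1.0}))  # 0.0
print(predict(model, {2: 2.0}))  # 1.0
```

In the multinomial case, the score of an observation touches only its nonzero entries, which is the reason sparse input pays off. Training reads each observation exactly once, consistent with the note above that caching the training data is unnecessary.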

 ## Examples

 <div class="codetabs">

 <div data-lang="python" markdown="1">

 [NaiveBayes](api/python/reference/api/pyspark.mllib.classification.NaiveBayes.html) implements multinomial
 naive Bayes. It takes an RDD of
 [LabeledPoint](api/python/reference/api/pyspark.mllib.regression.LabeledPoint.html) and an optional
 smoothing parameter `lambda_` as input, and outputs a
 [NaiveBayesModel](api/python/reference/api/pyspark.mllib.classification.NaiveBayesModel.html), which can be
 used for evaluation and prediction.

 Note that the Python API supports model save/load through `NaiveBayesModel.save` and `NaiveBayesModel.load`.

 Refer to the [`NaiveBayes` Python docs](api/python/reference/api/pyspark.mllib.classification.NaiveBayes.html) and [`NaiveBayesModel` Python docs](api/python/reference/api/pyspark.mllib.classification.NaiveBayesModel.html) for more details on the API.

 {% include_example python/mllib/naive_bayes_example.py %}
 </div>

 <div data-lang="scala" markdown="1">

 [NaiveBayes](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) implements
 multinomial naive Bayes. It takes an RDD of
 [LabeledPoint](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html), an optional
 smoothing parameter `lambda`, and an optional model type parameter (default is "multinomial") as input, and outputs a
 [NaiveBayesModel](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
 can be used for evaluation and prediction.

 Refer to the [`NaiveBayes` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) and [`NaiveBayesModel` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %}
 </div>
 <div data-lang="java" markdown="1">

 [NaiveBayes](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) implements
 multinomial naive Bayes. It takes a Scala RDD of
 [LabeledPoint](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) and an
 optional smoothing parameter `lambda` as input, and outputs a
 [NaiveBayesModel](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
 can be used for evaluation and prediction.

 Refer to the [`NaiveBayes` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) and [`NaiveBayesModel` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.

 {% include_example java/org/apache/spark/examples/mllib/JavaNaiveBayesExample.java %}
 </div>

 </div>