 ---
 layout: global
 title: Clustering - spark.ml
 displayTitle: Clustering - spark.ml
 ---

 This section introduces the pipeline (`spark.ml`) API for [clustering in MLlib](mllib-clustering.html).

 **Table of Contents**

 * This will become a table of contents (this text will be scraped).
 {:toc}

 ## K-means

 [k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
 most commonly used clustering algorithms. It partitions the data points into a
 predefined number of clusters. The MLlib implementation includes a parallelized
 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) initialization method
 called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
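To make the clustering step concrete, here is a minimal plain-Scala sketch of the basic k-means (Lloyd) iteration. It is an illustration only, not the MLlib implementation: it runs on a single machine, takes the initial centers as an argument instead of computing them with k-means||, and the names `KMeansSketch`, `sqDist`, `closest`, `step`, and `run` are made up for this example.

```scala
// Illustrative single-machine k-means (Lloyd) iteration; NOT Spark's implementation.
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(d => (a(d) - b(d)) * (a(d) - b(d))).sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // One step: assign every point to its nearest center, then move each center
  // to the mean of its assigned points. A center with no points stays put.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => closest(p, centers))
    centers.indices.map { i =>
      groups.get(i) match {
        case Some(ps) =>
          Array.tabulate(centers(i).length)(d => ps.map(_(d)).sum / ps.size)
        case None => centers(i)
      }
    }
  }

  // Repeat the step a fixed number of times, starting from the given centers.
  def run(points: Seq[Point], init: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(init)((cs, _) => step(points, cs))
}
```

The number of clusters is fixed up front by the number of initial centers passed to `run`, which is the "predefined number of clusters" mentioned above. A production implementation would also stop early once the centers stop moving.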

 `KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.

 ### Input Columns

 <table class="table">
   <thead>
     <tr>
       <th align="left">Param name</th>
       <th align="left">Type(s)</th>
       <th align="left">Default</th>
       <th align="left">Description</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td>featuresCol</td>
       <td>Vector</td>
       <td>"features"</td>
       <td>Feature vector</td>
     </tr>
   </tbody>
 </table>

 ### Output Columns

 <table class="table">
   <thead>
     <tr>
       <th align="left">Param name</th>
       <th align="left">Type(s)</th>
       <th align="left">Default</th>
       <th align="left">Description</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td>predictionCol</td>
       <td>Int</td>
       <td>"prediction"</td>
       <td>Index of the predicted cluster center</td>
     </tr>
   </tbody>
 </table>
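The column parameters listed above can be set explicitly on the estimator. The following is a minimal sketch, assuming Spark 2.0 or later, where `SparkSession` and `org.apache.spark.ml.linalg.Vectors` are available (earlier releases use `SQLContext` and `org.apache.spark.mllib.linalg.Vectors` instead). The object name, the `clusterToyData` helper, the toy data, and the `id` column are made up for illustration.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansColumnsSketch {
  // Fits k-means on a tiny made-up dataset and returns the predicted
  // cluster index of each row, ordered by the hypothetical "id" column.
  def clusterToyData(spark: SparkSession): Array[Int] = {
    // Two well-separated groups of 2-dimensional points.
    val dataset = spark.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 0.0)),
      (1, Vectors.dense(0.1, 0.1)),
      (2, Vectors.dense(9.0, 9.0)),
      (3, Vectors.dense(9.1, 9.1))
    )).toDF("id", "features")

    val kmeans = new KMeans()
      .setK(2)                         // predefined number of clusters
      .setSeed(1L)                     // fixed seed for repeatable initialization
      .setFeaturesCol("features")      // input column (this is the default)
      .setPredictionCol("prediction")  // output column (this is the default)

    // fit() returns a KMeansModel; transform() appends the prediction column.
    val model = kmeans.fit(dataset)
    model.clusterCenters.foreach(println)

    model.transform(dataset)
      .orderBy("id")
      .select("prediction")
      .collect()
      .map(_.getInt(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KMeansColumnsSketch")
      .master("local[*]")
      .getOrCreate()
    println(clusterToyData(spark).mkString(", "))
    spark.stop()
  }
}
```

Because both setters are given their default values here, the `setFeaturesCol` and `setPredictionCol` calls could be dropped; they matter only when the DataFrame uses different column names.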

 ### Example

 <div class="codetabs">

 <div data-lang="scala" markdown="1">
 Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.

 {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
 </div>

 <div data-lang="java" markdown="1">
 Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

 {% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
 </div>

 </div>


 ## Latent Dirichlet allocation (LDA)

 `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
 and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
 `EMLDAOptimizer` to a `DistributedLDAModel` if needed.
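The optimizer choice and the cast described above can be sketched as follows. This is a minimal example under stated assumptions: Spark 2.0 or later (for `SparkSession` and `org.apache.spark.ml.linalg.Vectors`), a made-up dataset of term-count vectors over a four-word vocabulary, and hypothetical names `LDAOptimizerSketch` and `fit`. The optimizer is selected by name, `"em"` or `"online"`, through `setOptimizer`.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LDAOptimizerSketch {
  // Fits LDA on toy term-count vectors with the named optimizer ("em" or "online").
  def fit(spark: SparkSession, optimizer: String): LDAModel = {
    // Each row is one document; each vector entry is the count of one vocabulary term.
    val dataset = spark.createDataFrame(Seq(
      (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0)),
      (1L, Vectors.dense(3.0, 2.0, 0.0, 1.0)),
      (2L, Vectors.dense(0.0, 0.0, 4.0, 2.0)),
      (3L, Vectors.dense(0.0, 1.0, 3.0, 3.0))
    )).toDF("id", "features")

    new LDA()
      .setK(2)                 // number of topics
      .setMaxIter(10)
      .setSeed(1L)
      .setOptimizer(optimizer)
      .fit(dataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LDAOptimizerSketch")
      .master("local[*]")
      .getOrCreate()

    val model = fit(spark, "em")

    // A pattern match is a safe way to perform the cast: only a model trained
    // with the EM optimizer is a DistributedLDAModel.
    model match {
      case distributed: DistributedLDAModel =>
        println(s"Training log likelihood: ${distributed.trainingLogLikelihood}")
      case _ =>
        println("Model is local; no distributed-only statistics available.")
    }

    // Top 3 terms per topic, available on any LDAModel.
    model.describeTopics(3).show(false)
    spark.stop()
  }
}
```

Matching on the type instead of calling `asInstanceOf` avoids a `ClassCastException` if the optimizer is later switched to `"online"`.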

 <div class="codetabs">

 <div data-lang="scala" markdown="1">

 Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.

 {% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
 </div>

 <div data-lang="java" markdown="1">

 Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

 {% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
 </div>

 </div>