| --- |
| layout: global |
| title: Clustering - spark.ml |
| displayTitle: Clustering - spark.ml |
| --- |
| |
| This section introduces the pipeline API (`spark.ml`) for [clustering in MLlib](mllib-clustering.html). |
| |
| **Table of Contents** |
| |
| * This will become a table of contents (this text will be scraped). |
| {:toc} |
| |
| ## K-means |
| |
| [k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the |
| most commonly used clustering algorithms. It partitions the data points into a |
| predefined number of clusters. The MLlib implementation includes a parallelized |
| variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method |
| called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). |
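
To make the idea of "a predefined number of clusters" concrete, the following is a minimal single-machine sketch of the basic k-means iteration (assign each point to its nearest center, then recompute each center as the mean of its points). It is an illustration only and not the MLlib implementation: the class `ToyKMeans` and its methods are hypothetical names, the initialization simply takes the first `k` points instead of using k-means||, and nothing here is distributed.

```java
import java.util.Arrays;

/** Toy single-machine k-means sketch; hypothetical helper, not part of Spark. */
public class ToyKMeans {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to the given point. */
    static int closestCenter(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs at most maxIter assign/update rounds and returns the final centers. */
    static double[][] fit(double[][] points, int k, int maxIter) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        // Assumption: naive initialization from the first k points (MLlib uses k-means||).
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach every point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = closestCenter(points[i], centers);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // Converged: no point switched cluster.
            }
            // Update step: move each center to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // Leave an empty cluster's center where it is.
                    for (int d = 0; d < dim; d++) {
                        centers[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0.0, 0.0}, {1.0, 1.0}, {9.0, 8.0}, {8.0, 9.0}};
        double[][] centers = fit(points, 2, 20);
        for (double[] p : points) {
            System.out.println(Arrays.toString(p) + " -> cluster " + closestCenter(p, centers));
        }
    }
}
```

The integer returned by `closestCenter` plays the same role as the value in the prediction column described below: it is the index of a cluster center, not the center's coordinates.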
| |
| `KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model. |
| |
| ### Input Columns |
| |
| <table class="table"> |
| <thead> |
| <tr> |
| <th align="left">Param name</th> |
| <th align="left">Type(s)</th> |
| <th align="left">Default</th> |
| <th align="left">Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>featuresCol</td> |
| <td>Vector</td> |
| <td>"features"</td> |
| <td>Feature vector</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| ### Output Columns |
| |
| <table class="table"> |
| <thead> |
| <tr> |
| <th align="left">Param name</th> |
| <th align="left">Type(s)</th> |
| <th align="left">Default</th> |
| <th align="left">Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>predictionCol</td> |
| <td>Int</td> |
| <td>"prediction"</td> |
| <td>Index of the predicted cluster center</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| ### Example |
| |
| <div class="codetabs"> |
| |
| <div data-lang="scala" markdown="1"> |
| Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details. |
| |
| {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %} |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details. |
| |
| {% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %} |
| </div> |
| |
| </div> |
| |
| |
| ## Latent Dirichlet allocation (LDA) |
| |
| `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, |
| and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by |
| `EMLDAOptimizer` to a `DistributedLDAModel` if needed. |
| |
| <div class="codetabs"> |
| |
| <div data-lang="scala" markdown="1"> |
| |
| Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details. |
| |
| {% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %} |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| |
| Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details. |
| |
| {% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %} |
| </div> |
| |
| </div> |