| --- |
| layout: global |
| title: Clustering |
| displayTitle: Clustering |
| license: | |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
| This page describes clustering algorithms in MLlib. |
| The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information |
| about these algorithms. |
| |
| **Table of Contents** |
| |
| * This will become a table of contents (this text will be scraped). |
| {:toc} |
| |
| ## K-means |
| |
| [k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the |
| most commonly used clustering algorithms. It partitions the data points into a |
| predefined number of clusters. The MLlib implementation includes a parallelized |
| variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method |
| called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). |
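
To make the procedure concrete, here is a minimal single-machine sketch in plain Python. It is not the MLlib implementation: it uses sequential k-means++ seeding rather than the parallel k-means|| variant, and the function names (`squared_distance`, `init_centers`, `kmeans`) and the toy data are assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k, rng):
    """Sequential k-means++ seeding (illustrative; not the parallel k-means||)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Sample the next center with probability proportional to the squared
        # distance from each point to its nearest already-chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            centers.append(rng.choice(points))  # every point coincides with a center
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, max_iter=20, seed=1):
    """Return (centers, assignments) after at most max_iter Lloyd iterations."""
    rng = random.Random(seed)
    centers = init_centers(points, k, rng)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assignments = kmeans(points, k=2)
print(centers, assignments)
```

The returned assignments are cluster indices, which is the role the `prediction` output column plays for a fitted model.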
| |
| `KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model. |
| |
| ### Input Columns |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr> |
| <th align="left">Param name</th> |
| <th align="left">Type(s)</th> |
| <th align="left">Default</th> |
| <th align="left">Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>featuresCol</td> |
| <td>Vector</td> |
| <td>"features"</td> |
| <td>Feature vector</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| ### Output Columns |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr> |
| <th align="left">Param name</th> |
| <th align="left">Type(s)</th> |
| <th align="left">Default</th> |
| <th align="left">Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>predictionCol</td> |
| <td>Int</td> |
| <td>"prediction"</td> |
| <td>Index of the predicted cluster</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| **Examples** |
| |
| <div class="codetabs"> |
| |
| <div data-lang="python" markdown="1"> |
| Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.KMeans.html) for more details. |
| |
| {% include_example python/ml/kmeans_example.py %} |
| </div> |
| |
| <div data-lang="scala" markdown="1"> |
| Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/KMeans.html) for more details. |
| |
| {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %} |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details. |
| |
| {% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %} |
| </div> |
| |
| <div data-lang="r" markdown="1"> |
| |
| Refer to the [R API docs](api/R/reference/spark.kmeans.html) for more details. |
| |
| {% include_example r/ml/kmeans.R %} |
| </div> |
| |
| </div> |
| |
| ## Latent Dirichlet allocation (LDA) |
| |
| `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, |
| and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by |
| `EMLDAOptimizer` to a `DistributedLDAModel` if needed. |
| |
| **Examples** |
| |
| <div class="codetabs"> |
| |
| <div data-lang="python" markdown="1"> |
| |
| Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.LDA.html) for more details. |
| |
| {% include_example python/ml/lda_example.py %} |
| </div> |
| |
| <div data-lang="scala" markdown="1"> |
| |
| Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/LDA.html) for more details. |
| |
| {% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %} |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| |
| Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details. |
| |
| {% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %} |
| </div> |
| |
| <div data-lang="r" markdown="1"> |
| |
| Refer to the [R API docs](api/R/reference/spark.lda.html) for more details. |
| |
| {% include_example r/ml/lda.R %} |
| </div> |
| |
| </div> |
| |
| ## Bisecting k-means |
| |
| Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a |
| divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one |
| moves down the hierarchy. |
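
The divisive procedure described above can be sketched in plain Python as follows. This is a simplified illustration, not the MLlib implementation: it assumes that the cluster with the largest within-cluster sum of squared errors is split next, and the helper names (`sq_dist`, `mean`, `sse`, `bisect`, `bisecting_kmeans`) and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def bisect(points, rng, max_iter=20):
    """Split one cluster in two with plain 2-means; returns (left, right)."""
    # Seed with a random point and the point farthest away from it.
    first = rng.choice(points)
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    left, right = [], []
    for _ in range(max_iter):
        left = [p for p in points if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # degenerate split; the caller stops splitting
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, seed=1):
    """Top-down clustering: start with one cluster and bisect until k exist."""
    rng = random.Random(seed)
    clusters = [list(points)]  # all observations start in one cluster
    while len(clusters) < k:
        # Only clusters with at least two points and nonzero spread can be split.
        candidates = [c for c in clusters if len(c) > 1 and sse(c) > 0]
        if not candidates:
            break
        target = max(candidates, key=sse)  # assumed selection rule for this sketch
        left, right = bisect(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Hypothetical toy data: three groups on a line.
data = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,), (51.0,)]
print(bisecting_kmeans(data, k=3))
```

Each split only touches the points of one cluster, which is one intuition for why the approach can be cheaper than repeatedly reassigning every point against all centers.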
| |
| Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering. |
| |
| `BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model. |
| |
| **Examples** |
| |
| <div class="codetabs"> |
| |
| <div data-lang="python" markdown="1"> |
| Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.BisectingKMeans.html) for more details. |
| |
| {% include_example python/ml/bisecting_k_means_example.py %} |
| </div> |
| |
| <div data-lang="scala" markdown="1"> |
| Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details. |
| |
| {% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %} |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details. |
| |
| {% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %} |
| </div> |
| |
| <div data-lang="r" markdown="1"> |
| |
| Refer to the [R API docs](api/R/reference/spark.bisectingKmeans.html) for more details. |
| |
| {% include_example r/ml/bisectingKmeans.R %} |
| </div> |
| </div> |
| |
| ## Gaussian Mixture Model (GMM) |
| |
| A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) |
| represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions, |
| each with its own probability. The `spark.ml` implementation uses the |
| [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) |
| algorithm to estimate the maximum-likelihood model given a set of samples. |
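
As an illustration of expectation-maximization for a mixture model, the following minimal sketch fits a one-dimensional Gaussian mixture in plain Python. It is not the `spark.ml` implementation, which works on multivariate feature vectors. The function names (`gaussian_pdf`, `em_gmm_1d`), the caller-supplied initial means, the unit starting variances, and the variance floor are all assumptions of this sketch.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, initial_means, max_iter=100, tol=1e-6):
    """Fit a 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k        # assumed starting variances
    weights = [1.0 / k] * k      # start from equal mixing weights
    prev_ll = None
    resp = []
    for _ in range(max_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        ll = 0.0
        for x in xs:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # component received no mass; leave it unchanged
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
        # Stop once the log-likelihood no longer improves noticeably.
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances, resp


# Hypothetical toy data: two groups centered near 0 and 5.
samples = [-0.5, 0.0, 0.5, 4.5, 5.0, 5.5]
weights, means, variances, resp = em_gmm_1d(samples, initial_means=[1.0, 4.0])
predictions = [max(range(len(r)), key=lambda j: r[j]) for r in resp]
print(weights, means, variances, predictions)
```

In this sketch the per-sample responsibilities correspond to a per-cluster probability vector, and taking the index of the largest responsibility gives a hard cluster prediction.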
| |
| `GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base |
| model. |
| |
| ### Input Columns |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr> |
| <th align="left">Param name</th> |
| <th align="left">Type(s)</th> |
| <th align="left">Default</th> |
| <th align="left">Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>featuresCol</td> |
| <td>Vector</td> |
| <td>"features"</td> |
| <td>Feature vector</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| ### Output Columns |
| |
| <table class="table table-striped"> |
| <thead> |
| <tr> |
| <th align="left">Param name</th> |
| <th align="left">Type(s)</th> |
| <th align="left">Default</th> |
| <th align="left">Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>predictionCol</td> |
| <td>Int</td> |
| <td>"prediction"</td> |
| <td>Index of the predicted cluster</td> |
| </tr> |
| <tr> |
| <td>probabilityCol</td> |
| <td>Vector</td> |
| <td>"probability"</td> |
| <td>Probability of each cluster</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| **Examples** |
| |
| <div class="codetabs"> |
| |
| <div data-lang="python" markdown="1"> |
| Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html) for more details. |
| |
| {% include_example python/ml/gaussian_mixture_example.py %} |
| </div> |
| |
| <div data-lang="scala" markdown="1"> |
| Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/GaussianMixture.html) for more details. |
| |
| {% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %} |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details. |
| |
| {% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %} |
| </div> |
| |
| <div data-lang="r" markdown="1"> |
| |
| Refer to the [R API docs](api/R/reference/spark.gaussianMixture.html) for more details. |
| |
| {% include_example r/ml/gaussianMixture.R %} |
| </div> |
| |
| </div> |
| |
| ## Power Iteration Clustering (PIC) |
| |
| Power Iteration Clustering (PIC) is a scalable graph clustering algorithm |
| developed by [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf). |
| From the abstract: PIC finds a very low-dimensional embedding of a dataset |
| using truncated power iteration on a normalized pair-wise similarity matrix of the data. |
| |
| `spark.ml`'s `PowerIterationClustering` implementation takes the following parameters: |
| |
| * `k`: the number of clusters to create |
| * `initMode`: the initialization algorithm |
| * `maxIter`: the maximum number of iterations |
| * `srcCol`: the name of the input column for source vertex IDs |
| * `dstCol`: the name of the input column for destination vertex IDs |
| * `weightCol`: the name of the input column for edge weights |
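
The idea of clustering a low-dimensional embedding obtained by truncated power iteration can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the `spark.ml` implementation: edges are given as `(src, dst, weight)` tuples and treated as undirected, the start vector is degree-based, a fixed iteration count replaces any convergence test, and the final step is a simple one-dimensional k-means. All function names are hypothetical.

```python
def embed(edges, max_iter):
    """Truncated power iteration on the row-normalized affinity matrix."""
    # Build a symmetric adjacency map from (src, dst, weight) tuples.
    neighbors = {}
    for src, dst, weight in edges:
        neighbors.setdefault(src, {})[dst] = weight
        neighbors.setdefault(dst, {})[src] = weight
    degree = {u: sum(nbrs.values()) for u, nbrs in neighbors.items()}
    total = sum(degree.values())
    # Degree-based start vector, normalized to sum to 1.
    v = {u: degree[u] / total for u in neighbors}
    for _ in range(max_iter):
        # One step: v <- W v with W = D^-1 A, then rescale to unit L1 norm.
        nxt = {
            u: sum(w * v[n] for n, w in nbrs.items()) / degree[u]
            for u, nbrs in neighbors.items()
        }
        norm = sum(abs(x) for x in nxt.values())
        v = {u: x / norm for u, x in nxt.items()}
    return v


def kmeans_1d(values, k, max_iter=20):
    """Plain k-means on scalars, seeded with evenly spaced centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in values]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


def power_iteration_clustering(edges, k, max_iter=10):
    """Return {vertex: cluster index} for an undirected weighted graph."""
    v = embed(edges, max_iter)
    vertices = sorted(v)
    labels = kmeans_1d([v[u] for u in vertices], k)
    return dict(zip(vertices, labels))


# Hypothetical toy graph: a triangle (0, 1, 2) and a 4-clique (3, 4, 5, 6)
# joined by one weak edge between vertices 2 and 3.
edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 1.0), (3, 5, 1.0), (3, 6, 1.0),
    (4, 5, 1.0), (4, 6, 1.0), (5, 6, 1.0),
    (2, 3, 0.01),
]
print(power_iteration_clustering(edges, k=2))
```

Stopping the iteration early matters here: run to convergence, the vector becomes constant across all vertices, whereas after a few steps the values are nearly constant within each tightly connected group but still differ between groups.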
| |
| **Examples** |
| |
| <div class="codetabs"> |
| |
| <div data-lang="python" markdown="1"> |
| Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.PowerIterationClustering.html) for more details. |
| |
| {% include_example python/ml/power_iteration_clustering_example.py %} |
| </div> |
| |
| <div data-lang="scala" markdown="1"> |
| Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details. |
| |
| {% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %} |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details. |
| |
| {% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %} |
| </div> |
| |
| <div data-lang="r" markdown="1"> |
| |
| Refer to the [R API docs](api/R/reference/spark.powerIterationClustering.html) for more details. |
| |
| {% include_example r/ml/powerIterationClustering.R %} |
| </div> |
| |
| </div> |