docs/ml-clustering.md

layout: global
title: Clustering
displayTitle: Clustering

This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.

Table of Contents

* This will become a table of contents (this text will be scraped).
{:toc}

K-means

k-means is one of the most commonly used clustering algorithms. It partitions the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called k-means||.
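To illustrate the seeding idea behind k-means++, here is a minimal sketch in plain Python. It is the sequential k-means++ procedure, not the parallel k-means|| variant and not MLlib's code: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_pp_seed`, its parameters, and the sample points are assumptions made for this illustration.

```python
import math
import random

def kmeans_pp_seed(points, k, seed=0):
    """Pick k initial centers with sequential k-means++ seeding (illustrative only)."""
    rng = random.Random(seed)
    # First center: uniform at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        if sum(d2) == 0.0:
            break  # every point coincides with an already chosen center
        # Next center: sampled with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Hypothetical sample data: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
seeds = kmeans_pp_seed(points, k=2, seed=42)
```

Because a point that is already a center has zero weight, it is not drawn again, so far-apart points are favored as initial centers.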

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
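To make concrete what fitting a k-means model involves, the following is a minimal single-machine sketch of the standard iteration (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its assigned points, and repeat until nothing changes. It is not the MLlib implementation. The helper names (`assign`, `update`, `kmeans`), the fixed initial centers, and the sample points are assumptions chosen for illustration.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    """Move each center to the mean of its members; leave empty clusters where they are."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers

def kmeans(points, init_centers, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(init_centers)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical sample data: two well-separated groups, one initial center in each.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, labels = kmeans(points, init_centers=[points[0], points[3]])
```

In this sketch `labels` plays the role of the per-row cluster prediction, and `centers` corresponds to the fitted cluster centers.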

Input Columns

Output Columns

Examples

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}

{% include_example python/ml/kmeans_example.py %}

Refer to the R API docs for more details.

{% include_example r/ml/kmeans.R %}

Latent Dirichlet allocation (LDA)

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.

Examples

Refer to the Scala API docs for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}

Refer to the Java API docs for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}

Refer to the Python API docs for more details.

{% include_example python/ml/lda_example.py %}

Refer to the R API docs for more details.

{% include_example r/ml/lda.R %}

Bisecting k-means

Bisecting k-means is a kind of hierarchical clustering using a divisive (or “top-down”) approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.
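The divisive approach can be sketched in plain Python as follows. This is a simplified illustration, not MLlib's implementation: it starts with every point in one cluster and repeatedly splits the cluster with the largest within-cluster sum of squared errors using a small 2-means run, until `k` clusters exist. The splitting rule, the deterministic seeding inside `bisect`, and all function names are assumptions made for the example.

```python
import math

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(len(cluster[0])))

def sse(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)

def bisect(cluster, max_iter=20):
    """Split one cluster in two with a small 2-means run."""
    # Deterministic seeding for this sketch: the first point and the point farthest from it.
    a = cluster[0]
    b = max(cluster, key=lambda p: math.dist(p, a))
    for _ in range(max_iter):
        left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
        if not right:
            break  # all points coincide, so the cluster cannot be split
        new_a, new_b = mean(left), mean(right)
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return left, right

def bisecting_kmeans(points, k):
    """Top-down clustering: keep splitting the highest-error cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        target = max(candidates, key=sse)
        left, right = bisect(target)
        if not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters

# Hypothetical one-dimensional sample data with three visible groups.
points = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (20.0,), (20.5,)]
clusters = bisecting_kmeans(points, k=3)
```

Each split only looks at the points of one cluster, which is one intuition for why the divisive approach can be cheaper than running k-means over all points with all `k` centers at once.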

BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.

Examples

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}

{% include_example python/ml/bisecting_k_means_example.py %}

Refer to the R API docs for more details.

{% include_example r/ml/bisectingKmeans.R %}

Gaussian Mixture Model (GMM)

A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.
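As a rough illustration of expectation-maximization for a mixture of Gaussians, the sketch below fits a one-dimensional, two-component model in plain Python. It is not the spark.ml implementation, which works on feature vectors with full covariance matrices. The function names, the fixed initial means, the variance floor `min_var`, and the sample data are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, weights, means, variances, min_var=1e-6):
    """One EM iteration; assumes every component keeps non-zero total responsibility."""
    k = len(weights)
    # E-step: responsibility of each component for each sample.
    resp = []
    for x in xs:
        joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([v / total for v in joint])
    # M-step: re-estimate weight, mean, and variance from the responsibilities.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mj = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        vj = sum(r[j] * (x - mj) ** 2 for r, x in zip(resp, xs)) / nj
        new_w.append(nj / len(xs))
        new_m.append(mj)
        new_v.append(max(vj, min_var))  # floor keeps the variance strictly positive
    return new_w, new_m, new_v

def fit_gmm(xs, init_means, n_iter=25):
    """Run a fixed number of EM iterations from equal weights and unit variances."""
    k = len(init_means)
    weights, means, variances = [1.0 / k] * k, list(init_means), [1.0] * k
    for _ in range(n_iter):
        weights, means, variances = em_step(xs, weights, means, variances)
    return weights, means, variances

# Hypothetical sample: five values near -1 and five values near 5.
xs = [-1.2, -0.8, -1.0, -1.1, -0.9, 4.9, 5.1, 5.0, 5.2, 4.8]
weights, means, variances = fit_gmm(xs, init_means=[-2.0, 6.0])
```

Here `weights` holds the probability of each sub-distribution, and the per-sample responsibilities computed in the E-step are the soft cluster memberships.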

GaussianMixture is implemented as an Estimator and generates a GaussianMixtureModel as the base model.

Input Columns

Output Columns

Examples

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}

{% include_example python/ml/gaussian_mixture_example.py %}

Refer to the R API docs for more details.

{% include_example r/ml/gaussianMixture.R %}