docs/ml-clustering.md

layout: global
title: Clustering - spark.ml
displayTitle: Clustering - spark.ml

This section introduces the pipeline API (spark.ml) for the clustering algorithms in MLlib.

Table of Contents

* This will become a table of contents (this text will be scraped).
{:toc}

K-means

k-means is one of the most commonly used clustering algorithms. It partitions the data points into a predefined number of clusters, k. The MLlib implementation includes a parallelized variant of the k-means++ initialization method called k-means||.
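To make the assign-and-update loop behind k-means concrete, here is a minimal single-machine sketch in plain Java. It is not the MLlib implementation: the class `KMeansSketch` and its methods are hypothetical names chosen for illustration, and the sketch seeds the centers with the first k points rather than using k-means++ or its parallelized variant, purely to keep the example small and deterministic.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Spark or MLlib.
final class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the center closest to the given point (ties go to the lower index).
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs at most maxIterations assign/update rounds and returns the k centers.
    // Assumptions: points.length >= k, and the first k points seed the centers
    // (a simplification; this is NOT k-means++ or its parallelized variant).
    static double[][] fit(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: attach every point to its nearest center.
            for (double[] p : points) {
                int c = nearestCenter(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each center to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centers[c][d]) {
                        changed = true;
                    }
                    centers[c][d] = mean;
                }
            }
            if (!changed) {
                break; // centers are stable, so assignments will not change either
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 1}, {9, 8}, {8, 9}};
        double[][] centers = fit(points, 2, 20);
        System.out.println(Arrays.deepToString(centers)); // [[0.5, 0.5], [8.5, 8.5]]
    }
}
```

On the four sample points the loop stabilizes after three rounds, with one center per visible group. A distributed implementation differs mainly in how it initializes the centers and in computing the per-cluster sums and counts in parallel across partitions.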

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

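Input Columns

featuresCol (type: Vector, default: "features"): the feature vector.

Output Columns

predictionCol (type: Int, default: "prediction"): the predicted cluster center.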

Example

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}

Latent Dirichlet allocation (LDA)

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.

Refer to the Scala API docs of org.apache.spark.ml.clustering.LDA for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}

Refer to the Java API docs of org.apache.spark.ml.clustering.LDA for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}