docs/ml-guide.md - spark - Git at Google

 ---
 layout: global
 title: "MLlib: Main Guide"
 displayTitle: "Machine Learning Library (MLlib) Guide"
 ---

 MLlib is Spark's machine learning (ML) library.
 Its goal is to make practical machine learning scalable and easy.
 At a high level, it provides tools such as:

 * ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
 * Featurization: feature extraction, transformation, dimensionality reduction, and selection
 * Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
 * Persistence: saving and load algorithms, models, and Pipelines
 * Utilities: linear algebra, statistics, data handling, etc.

 # Announcement: DataFrame-based API is primary API

 **The MLlib RDD-based API is now in maintenance mode.**

 As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
 The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.

 *What are the implications?*

 * MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
 * MLlib will not add new features to the RDD-based API.
 * In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
 * After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
 * The RDD-based API is expected to be removed in Spark 3.0.

 *Why is MLlib switching to the DataFrame-based API?*

 * DataFrames provide a more user-friendly API than RDDs.  The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
 * The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
 * DataFrames facilitate practical ML Pipelines, particularly feature transformations.  See the [Pipelines guide](ml-pipeline.html) for details.

 *What is "Spark ML"?*

 * "Spark ML" is not an official name but occasionally used to refer to the MLlib DataFrame-based API.
   This is majorly due to the `org.apache.spark.ml` Scala package name used by the DataFrame-based API,
   and the "Spark ML Pipelines" term we used initially to emphasize the pipeline concept.

 *Is MLlib deprecated?*

 * No. MLlib includes both the RDD-based API and the DataFrame-based API.
   The RDD-based API is now in maintenance mode.
   But neither API is deprecated, nor MLlib as a whole.

 # Dependencies

 MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
 [netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
 If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
 implementation will be used instead.

 Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
 proxies by default.
 To configure `netlib-java` / Breeze to use system optimised binaries, include
 `com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
 project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
 platform's additional installation instructions.

 To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.

 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to
     watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

 # Migration guide

 MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.

 ## From 2.0 to 2.1

 ### Breaking changes

 **Deprecated methods removed**

 * `setLabelCol` in `feature.ChiSqSelectorModel`
 * `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
 * `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
 * `model` in `regression.LinearRegressionSummary`
 * `validateParams` in `PipelineStage`
 * `validateParams` in `Evaluator`

 ### Deprecations and changes of behavior

 **Deprecations**

 * [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
   Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`

 **Changes of behavior**

 * [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
  Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
 * [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
  `KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
 * [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
  `KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.

 ## Previous Spark versions

 Earlier migration guides are archived [on this page](ml-migration-guides.html).

 ---
	---
	layout: global
	title: "MLlib: Main Guide"
	displayTitle: "Machine Learning Library (MLlib) Guide"
	---

	MLlib is Spark's machine learning (ML) library.
	Its goal is to make practical machine learning scalable and easy.
	At a high level, it provides tools such as:

	* ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
	* Featurization: feature extraction, transformation, dimensionality reduction, and selection
	* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
	* Persistence: saving and load algorithms, models, and Pipelines
	* Utilities: linear algebra, statistics, data handling, etc.

	# Announcement: DataFrame-based API is primary API

	The MLlib RDD-based API is now in maintenance mode.

	As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
	The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.

	What are the implications?

	* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
	* MLlib will not add new features to the RDD-based API.
	* In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
	* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
	* The RDD-based API is expected to be removed in Spark 3.0.

	Why is MLlib switching to the DataFrame-based API?

	* DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
	* The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
	* DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the [Pipelines guide](ml-pipeline.html) for details.

	What is "Spark ML"?

	* "Spark ML" is not an official name but occasionally used to refer to the MLlib DataFrame-based API.
	This is majorly due to the `org.apache.spark.ml` Scala package name used by the DataFrame-based API,
	and the "Spark ML Pipelines" term we used initially to emphasize the pipeline concept.

	Is MLlib deprecated?

	* No. MLlib includes both the RDD-based API and the DataFrame-based API.
	The RDD-based API is now in maintenance mode.
	But neither API is deprecated, nor MLlib as a whole.

	# Dependencies

	MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
	[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
	If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
	implementation will be used instead.

	Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
	proxies by default.
	To configure `netlib-java` / Breeze to use system optimised binaries, include
	`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
	project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
	platform's additional installation instructions.

	To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.

	[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
	watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

	# Migration guide

	MLlib is under active development.
	The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
	and the migration guide below will explain all changes between releases.

	## From 2.0 to 2.1

	### Breaking changes

	Deprecated methods removed

	* `setLabelCol` in `feature.ChiSqSelectorModel`
	* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
	* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
	* `model` in `regression.LinearRegressionSummary`
	* `validateParams` in `PipelineStage`
	* `validateParams` in `Evaluator`

	### Deprecations and changes of behavior

	Deprecations

	* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
	Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`

	Changes of behavior

	* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
	Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
	* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
	`KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
	* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
	`KMeans` reduces the default number of steps from 5 to 2 for the k-means\|\| initialization mode.

	## Previous Spark versions

	Earlier migration guides are archived [on this page](ml-migration-guides.html).

	---