
 // Licensed to the Apache Software Foundation (ASF) under one or more
 // contributor license agreements.  See the NOTICE file distributed with
 // this work for additional information regarding copyright ownership.
 // The ASF licenses this file to You under the Apache License, Version 2.0
 // (the "License"); you may not use this file except in compliance with
 // the License.  You may obtain a copy of the License at
 //
 // http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
 = K-Means Clustering

 K-means is one of the most commonly used clustering algorithms. It groups data points into a predefined number of clusters.

 == Model

 K-Means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean (the cluster center, or centroid), which serves as the prototype of the cluster.

 The model holds a vector of k centers and one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan.

 The model predicts the cluster label for an input vector as follows:


 [source, java]
 ----
 KMeansModel mdl = trainer.fit(
     ignite,
     dataCache,
     vectorizer
 );


 double clusterLabel = mdl.predict(inputVector);
 ----
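
Conceptually, a prediction is a nearest-center lookup: the label is the index of the center closest to the input under the chosen metric. The plain-Java sketch below illustrates that idea only. It is not Ignite's implementation, and the class `NearestCenterSketch` and all of its method names are hypothetical.

```java
import java.util.function.ToDoubleBiFunction;

public class NearestCenterSketch {
    /** Euclidean distance between two points of equal dimension. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan distance: sum of absolute coordinate differences. */
    public static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Hamming distance: number of coordinates that differ. */
    public static double hamming(double[] a, double[] b) {
        int diff = 0;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i])
                diff++;
        return diff;
    }

    /** Returns the index of the center nearest to the point (ties go to the lower index). */
    public static int predict(double[][] centers, double[] point,
        ToDoubleBiFunction<double[], double[]> distance) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance.applyAsDouble(centers[i], point);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical centers of a model with k = 2.
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};

        System.out.println(predict(centers, new double[] {1.0, 2.0}, NearestCenterSketch::euclidean));
        System.out.println(predict(centers, new double[] {9.0, 8.0}, NearestCenterSketch::manhattan));
    }
}
```

Passing the metric as a function mirrors the way the model description pairs the centers with a pluggable distance measure.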

 == Trainer


 KMeans is an unsupervised learning algorithm. It solves a clustering task: grouping a set of objects so that objects in the same group (a cluster) are more similar, in some sense, to each other than to objects in other clusters.

 KMeans is a parametrized iterative algorithm. On each iteration, it assigns every observation to the nearest centroid and then recomputes each centroid as the mean of the observations assigned to its cluster.

 Presently, Ignite supports the following parameters for the KMeans clustering algorithm:

 * `k` - the number of clusters
 * `maxIterations` - a stopping criterion: the maximum number of iterations (the other criterion is `epsilon`)
 * `epsilon` - the convergence delta (the difference between the old and new centroid values)
 * `distance` - one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan
 * `seed` - an initialization parameter that helps to reproduce models (the trainer has a random initialization step to get the first centroids)


 [source, java]
 ----
 // Set up the trainer
 KMeansTrainer trainer = new KMeansTrainer()
     .withDistance(new EuclideanDistance())
     .withK(AMOUNT_OF_CLUSTERS)
     .withMaxIterations(MAX_ITERATIONS)
     .withEpsilon(PRECISION);

 // Build the model
 KMeansModel mdl = trainer.fit(
     ignite,
     dataCache,
     vectorizer
 );
 ----
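
To make the roles of `k`, `maxIterations`, `epsilon`, and `seed` more concrete, here is a minimal single-node sketch of the iteration in plain Java. It is not the Ignite trainer. The class `KMeansLoopSketch` and its methods are hypothetical, and two details are assumptions of this sketch rather than documented Ignite behavior: the first centroids are `k` data points chosen by a seeded shuffle, and convergence means that no centroid moved by `epsilon` or more (Euclidean distance).

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansLoopSketch {
    /** Euclidean distance between two points of equal dimension. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the centroid closest to the point (ties go to the lower index). */
    public static int nearest(double[][] centroids, double[] p) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (euclidean(centroids[c], p) < euclidean(centroids[best], p))
                best = c;
        return best;
    }

    /** Seeded initialization (assumption): k distinct data points, requires k <= data.length. */
    public static double[][] init(double[][] data, int k, long seed) {
        Random rnd = new Random(seed);
        int[] idx = new int[data.length];
        for (int i = 0; i < idx.length; i++)
            idx[i] = i;
        // Fisher-Yates shuffle driven by the seeded generator.
        for (int i = idx.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++)
            centroids[c] = data[idx[c]].clone();
        return centroids;
    }

    /** Repeats the assignment and update steps until one stopping criterion is met. */
    public static double[][] iterate(double[][] data, double[][] start, int maxIterations, double epsilon) {
        int k = start.length;
        int dim = data[0].length;
        double[][] centroids = start.clone();

        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // Assignment step: add each point to the running sum of its nearest centroid.
            for (double[] p : data) {
                int c = nearest(centroids, p);
                counts[c]++;
                for (int d = 0; d < dim; d++)
                    sums[c][d] += p[d];
            }

            // Update step: move each centroid to the mean of its points and track the largest move.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0)
                    continue; // An empty cluster keeps its previous centroid.
                double[] next = new double[dim];
                for (int d = 0; d < dim; d++)
                    next[d] = sums[c][d] / counts[c];
                maxShift = Math.max(maxShift, euclidean(centroids[c], next));
                centroids[c] = next;
            }

            if (maxShift < epsilon)
                break; // Converged before reaching maxIterations.
        }
        return centroids;
    }

    public static double[][] fit(double[][] data, int k, int maxIterations, double epsilon, long seed) {
        return iterate(data, init(data, k, seed), maxIterations, epsilon);
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8}};
        for (double[] c : fit(data, 2, 10, 1e-4, 1234L))
            System.out.println(Arrays.toString(c));
    }
}
```

Calling `fit` twice with the same seed yields the same centroids in this sketch, which is the kind of reproducibility the `seed` parameter is described as providing. A different seed may start from different points and, on less well-separated data, end in a different local optimum.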


 == Example


 To see how K-Means clustering can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/KMeansClusterizationExample.java[example^], which is available on GitHub and delivered with every Apache Ignite distribution.

 The training dataset is a subset of the Iris dataset (the classes with labels 1 and 2, which form a linearly separable two-class dataset) that can be loaded from the https://archive.ics.uci.edu/ml/datasets/iris[UCI Machine Learning Repository].