// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= K-Means Clustering
K-Means is one of the most commonly used clustering algorithms. It partitions data points into a predefined number of clusters.
== Model
K-Means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster.
The model holds a vector of k centers and one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan.
The model predicts a cluster label as follows:
[source, java]
----
KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);

double clusterLabel = mdl.predict(inputVector);
----
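
To illustrate what "the cluster with the nearest mean" means in practice, the following plain-Java sketch assigns a point to the closest of k centers. It is not Ignite's implementation: the class `NearestCenterSketch` and its methods are hypothetical names introduced here, and Euclidean distance is assumed (compared in squared form, which yields the same nearest center).

```java
// Hypothetical illustration; not part of the Ignite ML API.
public class NearestCenterSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the index of the center closest to the given point. */
    public static int predict(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = squaredDistance(centers[c], point);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        System.out.println(predict(centers, new double[] {1.0, 2.0})); // prints 0
        System.out.println(predict(centers, new double[] {9.0, 8.0})); // prints 1
    }
}
```

Swapping `squaredDistance` for a Manhattan or Hamming distance would change which center counts as nearest, which is why the model stores the metric alongside the centers.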
== Trainer
KMeans is an unsupervised learning algorithm. It solves the clustering task: grouping a set of objects so that objects in the same group (called a cluster) are more similar, in some sense, to each other than to objects in other groups.
KMeans is a parameterized iterative algorithm: on each iteration, it recalculates each mean as the centroid of the observations currently assigned to that cluster.
Presently, Ignite supports the following parameters for the KMeans clustering algorithm:
* `k` - the number of clusters
* `maxIterations` - the maximum number of iterations; one of the two stopping criteria (the other is `epsilon`)
* `epsilon` - the convergence threshold (the delta between old and new centroid values)
* `distance` - one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan
* `seed` - an initialization parameter that makes models reproducible (the trainer has a random initialization step that picks the first centroids)
[source, java]
----
// Set up the trainer
KMeansTrainer trainer = new KMeansTrainer()
    .withDistance(new EuclideanDistance())
    .withK(AMOUNT_OF_CLUSTERS)
    .withMaxIterations(MAX_ITERATIONS)
    .withEpsilon(PRECISION);

// Build the model
KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);
----
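
To make the roles of `k`, `maxIterations`, `epsilon`, and `seed` more concrete, here is a minimal single-machine sketch of the iteration described above. It is an illustration under stated assumptions, not Ignite's distributed trainer: the class `KMeansSketch` and its methods are hypothetical, the distance is Euclidean, the initial centers are data points drawn at random with replacement, and convergence is assumed to mean that no center moved by `epsilon` or more.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of the iteration; not Ignite's implementation.
public class KMeansSketch {
    /** Euclidean distance between two points of equal dimension. */
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Index of the center nearest to point p. */
    public static int nearest(double[][] centers, double[] p) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(centers[c], p) < distance(centers[best], p)) {
                best = c;
            }
        }
        return best;
    }

    /** Runs the iteration and returns the k final centers. */
    public static double[][] fit(double[][] data, int k, int maxIterations,
                                 double epsilon, long seed) {
        Random rnd = new Random(seed); // same seed, same initial centers
        int dim = data[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = data[rnd.nextInt(data.length)].clone();
        }

        for (int iter = 0; iter < maxIterations; iter++) { // first stopping criterion
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to the sum of its nearest center.
            for (double[] p : data) {
                int c = nearest(centers, p);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each center to the mean of its assigned points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                maxShift = Math.max(maxShift, distance(centers[c], updated));
                centers[c] = updated;
            }
            if (maxShift < epsilon) { // second stopping criterion
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9}};
        System.out.println(Arrays.deepToString(fit(data, 2, 100, 1e-6, 42L)));
    }
}
```

Because the first centers are chosen at random, different seeds can lead to different final clusterings, while repeating a run with the same seed reproduces the same centers.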
== Example
To see how K-Means clustering can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/KMeansClusterizationExample.java[example^], which is available on GitHub and delivered with every Apache Ignite distribution.
The training dataset is a subset of the Iris dataset (the classes with labels 1 and 2, which form a linearly separable two-class dataset) and can be loaded from the https://archive.ics.uci.edu/ml/datasets/iris[UCI Machine Learning Repository].