
 // Licensed to the Apache Software Foundation (ASF) under one or more
 // contributor license agreements.  See the NOTICE file distributed with
 // this work for additional information regarding copyright ownership.
 // The ASF licenses this file to You under the Apache License, Version 2.0
 // (the "License"); you may not use this file except in compliance with
 // the License.  You may obtain a copy of the License at
 //
 // http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
 = Gaussian mixture (GMM)

 A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

 NOTE: You could think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.

 == Model

 This algorithm represents a soft clustering model where each cluster is a Gaussian distribution with its own mean vector and covariance matrix. Such a model predicts the cluster of a vector using the maximum likelihood principle: the vector is assigned to the cluster under which it is most likely.

 It defines the labels in the following way:


 [source, java]
 ----
 GmmModel mdl = trainer.fit(
     ignite,
     dataCache,
     vectorizer
 );

 double clusterLabel = mdl.predict(inputVector);
 ----
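To make the maximum likelihood rule concrete, here is a minimal sketch in plain Java that does not use the Ignite API. The class `MaxLikelihoodSketch`, its methods, and the restriction to one-dimensional Gaussians are assumptions made for illustration only; they are not part of Ignite ML.

```java
class MaxLikelihoodSketch {
    /** Density of a one-dimensional Gaussian with the given mean and variance. */
    static double gaussianPdf(double x, double mean, double variance) {
        double diff = x - mean;
        return Math.exp(-diff * diff / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    /** Returns the index of the component with the largest weighted likelihood for x. */
    static int predict(double x, double[] weights, double[] means, double[] variances) {
        int best = 0;
        double bestLikelihood = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < weights.length; k++) {
            // Weighted likelihood: mixing weight times the component density at x.
            double likelihood = weights[k] * gaussianPdf(x, means[k], variances[k]);
            if (likelihood > bestLikelihood) {
                bestLikelihood = likelihood;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical two-component mixture, chosen only for the demonstration.
        double[] weights = {0.5, 0.5};
        double[] means = {0.0, 10.0};
        double[] variances = {1.0, 1.0};

        System.out.println(predict(1.0, weights, means, variances)); // prints 0
        System.out.println(predict(9.0, weights, means, variances)); // prints 1
    }
}
```

The returned index plays the role of the cluster label. A multivariate model follows the same rule, with the density computed from a mean vector and a covariance matrix.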


 == Trainer


 GMM is an unsupervised learning algorithm. The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models. It can compute the Bayesian Information Criterion to assess the number of clusters in the data.
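For readers unfamiliar with EM, the following is a minimal sketch of the algorithm for a one-dimensional mixture, written in plain Java without the Ignite API. The class `EmSketch`, the `fit` signature, the variance floor, and the sample data are all assumptions for illustration; the Ignite trainer works on multivariate, distributed data and may differ in its details.

```java
import java.util.Arrays;

class EmSketch {
    /** Density of a one-dimensional Gaussian with the given mean and variance. */
    static double gaussianPdf(double x, double mean, double variance) {
        double diff = x - mean;
        return Math.exp(-diff * diff / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    /**
     * Runs EM on one-dimensional data, updating weights, means and variances in place.
     * Stops after maxIterations, or earlier once no mean moves by more than epsilon.
     * Returns the number of iterations performed.
     */
    static int fit(double[] data, double[] weights, double[] means, double[] variances,
                   int maxIterations, double epsilon) {
        int n = data.length;
        int k = weights.length;
        double[][] resp = new double[n][k];

        for (int iter = 1; iter <= maxIterations; iter++) {
            // E-step: responsibility of each component for each point.
            for (int i = 0; i < n; i++) {
                double total = 0;
                for (int j = 0; j < k; j++) {
                    resp[i][j] = weights[j] * gaussianPdf(data[i], means[j], variances[j]);
                    total += resp[i][j];
                }
                for (int j = 0; j < k; j++)
                    resp[i][j] /= total;
            }

            // M-step: re-estimate parameters from the responsibilities.
            double maxShift = 0;
            for (int j = 0; j < k; j++) {
                double respSum = 0;
                double weightedSum = 0;
                for (int i = 0; i < n; i++) {
                    respSum += resp[i][j];
                    weightedSum += resp[i][j] * data[i];
                }
                double newMean = weightedSum / respSum;

                double squaredSum = 0;
                for (int i = 0; i < n; i++) {
                    double diff = data[i] - newMean;
                    squaredSum += resp[i][j] * diff * diff;
                }

                maxShift = Math.max(maxShift, Math.abs(newMean - means[j]));
                means[j] = newMean;
                // Floor the variance (an assumption of this sketch) to avoid degenerate components.
                variances[j] = Math.max(squaredSum / respSum, 1e-6);
                weights[j] = respSum / n;
            }

            if (maxShift < epsilon)
                return iter;
        }
        return maxIterations;
    }

    public static void main(String[] args) {
        // Hypothetical data: two groups, around 1.0 and around 5.0.
        double[] data = {1.0, 1.2, 0.8, 5.0, 5.2, 4.8};
        double[] weights = {0.5, 0.5};
        double[] means = {0.0, 6.0};
        double[] variances = {1.0, 1.0};

        int iterations = fit(data, weights, means, variances, 50, 1e-3);
        System.out.println("iterations: " + iterations);
        System.out.println("means: " + Arrays.toString(means));
        System.out.println("weights: " + Arrays.toString(weights));
    }
}
```

The two stopping conditions in `fit` mirror the idea of an iteration limit combined with a convergence delta on the component means.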

 Presently, Ignite ML supports the following parameters for the GMM clustering algorithm:

 * `maxCountOfClusters` - the maximum number of clusters
 * `maxCountOfIterations` - one stopping criterion (the other one is `epsilon`)
 * `epsilon` - the convergence delta (the difference between the old and new centroid values)
 * `countOfComponents` - the number of components
 * `maxLikelihoodDivergence` - the maximum divergence between the highest likelihood of a vector in the dataset and the likelihood of other vectors, used to identify anomalies
 * `minElementsForNewCluster` - the minimum number of anomalies (in terms of `maxLikelihoodDivergence`) required to create a new cluster
 * `minClusterProbability` - minimum cluster probability


 [source, java]
 ----
 // Set up the trainer
 GmmTrainer trainer = new GmmTrainer(COUNT_OF_COMPONENTS);

 // Build the model
 GmmModel mdl = trainer
     .withMaxCountIterations(MAX_COUNT_ITERATIONS)
     .withMaxCountOfClusters(MAX_AMOUNT_OF_CLUSTERS)
     .fit(ignite, dataCache, vectorizer);
 ----
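The anomaly-related parameters can be hard to picture from their descriptions alone. The sketch below shows one possible reading of them in plain Java: a vector counts as an anomaly when the highest likelihood in the dataset exceeds its own likelihood by more than a factor of `maxLikelihoodDivergence`, and a new cluster is considered only when enough anomalies exist and the cluster limit has not been reached. The class `AnomalySketch`, both methods, and the ratio-based rule are assumptions for illustration; the actual Ignite implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

class AnomalySketch {
    /**
     * Returns the indices of vectors whose likelihood is more than
     * maxLikelihoodDivergence times smaller than the largest likelihood.
     */
    static List<Integer> findAnomalies(double[] likelihoods, double maxLikelihoodDivergence) {
        double maxLikelihood = 0;
        for (double likelihood : likelihoods)
            maxLikelihood = Math.max(maxLikelihood, likelihood);

        List<Integer> anomalies = new ArrayList<>();
        for (int i = 0; i < likelihoods.length; i++) {
            if (maxLikelihood / likelihoods[i] > maxLikelihoodDivergence)
                anomalies.add(i);
        }
        return anomalies;
    }

    /** Decides whether the anomalies justify adding a cluster (assumed rule). */
    static boolean shouldCreateCluster(int anomalyCount, int minElementsForNewCluster,
                                       int currentClusters, int maxCountOfClusters) {
        return anomalyCount >= minElementsForNewCluster && currentClusters < maxCountOfClusters;
    }

    public static void main(String[] args) {
        // Hypothetical model likelihoods for five vectors.
        double[] likelihoods = {0.40, 0.38, 0.01, 0.35, 0.02};

        List<Integer> anomalies = findAnomalies(likelihoods, 5.0);
        System.out.println(anomalies); // prints [2, 4]
        System.out.println(shouldCreateCluster(anomalies.size(), 2, 1, 2)); // prints true
        System.out.println(shouldCreateCluster(anomalies.size(), 3, 1, 2)); // prints false
    }
}
```

Under this reading, raising `maxLikelihoodDivergence` or `minElementsForNewCluster` makes the trainer more reluctant to add clusters, while `maxCountOfClusters` caps the total.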

 == Example

 To see how GMM clustering can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/GmmClusterizationExample.java[example], which is available on GitHub and delivered with every Apache Ignite distribution.