| // Licensed to the Apache Software Foundation (ASF) under one or more |
| // contributor license agreements. See the NOTICE file distributed with |
| // this work for additional information regarding copyright ownership. |
| // The ASF licenses this file to You under the Apache License, Version 2.0 |
| // (the "License"); you may not use this file except in compliance with |
| // the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| = Gaussian mixture (GMM) |
| |
| A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. |
| |
| NOTE: You could think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians. |
| |
| == Model |
| |
| This algorithm represents a soft clustering model in which each cluster is a Gaussian distribution with its own mean and covariance matrix. Such a model predicts the cluster for a vector using the maximum likelihood principle. |
| |
| The model assigns labels as follows: |
| |
| |
| [source, java] |
| ---- |
| GmmModel mdl = trainer.fit( |
|     ignite, |
|     dataCache, |
|     vectorizer |
| ); |
| |
| double clusterLabel = mdl.predict(inputVector); |
| ---- |
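
To make the maximum likelihood rule concrete, the following standalone sketch computes the weighted likelihood of a 2-D vector under each component and returns the index of the most likely one. It does not use the Ignite API: the class `GmmPredictSketch` and its methods are hypothetical names introduced only for illustration, and the restriction to two dimensions is an assumption that keeps the covariance inverse in closed form.

```java
/** Illustrative sketch only; not the Ignite ML implementation. */
final class GmmPredictSketch {
    /** Log-density of a 2-D Gaussian with mean mu and a symmetric positive definite 2x2 covariance. */
    static double logDensity(double[] x, double[] mu, double[][] cov) {
        double det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0];
        double dx = x[0] - mu[0];
        double dy = x[1] - mu[1];
        // Quadratic form (x - mu)^T cov^{-1} (x - mu), using the closed-form 2x2 inverse.
        double q = (cov[1][1] * dx * dx - (cov[0][1] + cov[1][0]) * dx * dy + cov[0][0] * dy * dy) / det;
        return -Math.log(2 * Math.PI) - 0.5 * Math.log(det) - 0.5 * q;
    }

    /** Soft assignment: probability that x belongs to each component (sums to 1). */
    static double[] responsibilities(double[] x, double[] weights, double[][] means, double[][][] covs) {
        int k = weights.length;
        double[] p = new double[k];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < k; i++) {
            p[i] = Math.log(weights[i]) + logDensity(x, means[i], covs[i]);
            max = Math.max(max, p[i]);
        }
        // Subtract the largest log value before exponentiating to avoid underflow.
        double sum = 0;
        for (int i = 0; i < k; i++) {
            p[i] = Math.exp(p[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < k; i++)
            p[i] /= sum;
        return p;
    }

    /** Hard label: index of the component with the highest weighted likelihood. */
    static int predict(double[] x, double[] weights, double[][] means, double[][][] covs) {
        double[] p = responsibilities(x, weights, means, covs);
        int best = 0;
        for (int i = 1; i < p.length; i++)
            if (p[i] > p[best])
                best = i;
        return best;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 0.5};
        double[][] means = {{0, 0}, {5, 5}};
        double[][][] covs = {{{1, 0}, {0, 1}}, {{1, 0}, {0, 1}}};
        double[] x = {4.5, 5.2};
        System.out.println("label = " + predict(x, weights, means, covs));
    }
}
```

The soft assignment (`responsibilities`) is what distinguishes a mixture model from k-means: a vector lying between two components receives a non-trivial probability for each, and the hard label is simply the largest of these.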
| |
| |
| == Trainer |
| |
| |
| GMM is an unsupervised learning algorithm. The `GmmTrainer` class implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models. It can compute the Bayesian Information Criterion to assess the number of clusters in the data. |
| |
| Presently, Ignite ML supports the following parameters for the GMM clustering algorithm: |
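
As a reminder of how the Bayesian Information Criterion trades fit against model size, here is a minimal sketch of the standard formula `BIC = p * ln(n) - 2 * ln(L)`, where `p` is the number of free parameters, `n` the number of data points, and `L` the maximized likelihood. The class `BicSketch` is a hypothetical helper, not part of Ignite ML, and the parameter count assumes full covariance matrices.

```java
/** Illustrative sketch only; not the Ignite ML implementation. */
final class BicSketch {
    /** Free parameters of a k-component GMM in d dimensions with full covariance matrices. */
    static int freeParameters(int k, int d) {
        int means = k * d;
        int covariances = k * d * (d + 1) / 2; // symmetric d x d matrix per component
        int weights = k - 1;                   // weights sum to 1, so one is determined
        return means + covariances + weights;
    }

    /** BIC for a fitted model; a lower value indicates a better trade-off. */
    static double bic(double logLikelihood, int k, int d, int n) {
        return freeParameters(k, d) * Math.log(n) - 2.0 * logLikelihood;
    }

    public static void main(String[] args) {
        // Hypothetical log-likelihoods for models with 1, 2 and 3 components on 200 two-dimensional points.
        double[] logLikelihoods = {-950.0, -820.0, -815.0};
        for (int k = 1; k <= logLikelihoods.length; k++)
            System.out.println("k = " + k + ", BIC = " + bic(logLikelihoods[k - 1], k, 2, 200));
    }
}
```

Adding a component always raises the likelihood somewhat, so the penalty term `p * ln(n)` is what stops the criterion from always preferring more clusters.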
| |
| * `maxCountOfClusters` - the maximum number of clusters |
| * `maxCountOfIterations` - the maximum number of iterations; one of the two stopping criteria (the other is `epsilon`) |
| * `epsilon` - the convergence threshold (the delta between the old and new centroid values) |
| * `countOfComponents` - the number of components |
| * `maxLikelihoodDivergence` - the maximum divergence between the highest likelihood of a vector in the dataset and the likelihoods of the other vectors; used to identify anomalies |
| * `minElementsForNewCluster` - the minimum number of anomalies (in terms of `maxLikelihoodDivergence`) required to create a new cluster |
| * `minClusterProbability` - the minimum cluster probability |
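
The interplay of the two stopping criteria can be sketched with a small one-dimensional EM loop. This is a simplified illustration under stated assumptions, not the Ignite trainer: the class `EmStopSketch`, its `fit` method, and the choice to measure convergence as the largest shift of any component mean are all hypothetical.

```java
/** Illustrative sketch only; not the Ignite ML implementation. */
final class EmStopSketch {
    /** Fitted component means and the number of EM iterations that were run. */
    static final class Result {
        final double[] means;
        final int iterations;

        Result(double[] means, int iterations) {
            this.means = means;
            this.iterations = iterations;
        }
    }

    /** Runs 1-D EM until the means move by less than eps or maxIterations is reached. */
    static Result fit(double[] data, double[] initMeans, int maxIterations, double eps) {
        int k = initMeans.length;
        double[] means = initMeans.clone();
        double[] vars = new double[k];
        double[] weights = new double[k];
        java.util.Arrays.fill(vars, 1.0);
        java.util.Arrays.fill(weights, 1.0 / k);
        double[][] resp = new double[data.length][k];
        int iter = 0;
        while (iter < maxIterations) {
            iter++;
            // E-step: responsibility of each component for each point.
            for (int i = 0; i < data.length; i++) {
                double sum = 0;
                for (int c = 0; c < k; c++) {
                    double diff = data[i] - means[c];
                    resp[i][c] = weights[c] * Math.exp(-0.5 * diff * diff / vars[c])
                        / Math.sqrt(2 * Math.PI * vars[c]);
                    sum += resp[i][c];
                }
                for (int c = 0; c < k; c++)
                    resp[i][c] /= sum;
            }
            // M-step: re-estimate weight, mean and variance of each component.
            double maxShift = 0;
            for (int c = 0; c < k; c++) {
                double nc = 0;
                double mean = 0;
                for (int i = 0; i < data.length; i++) {
                    nc += resp[i][c];
                    mean += resp[i][c] * data[i];
                }
                mean /= nc;
                double var = 0;
                for (int i = 0; i < data.length; i++) {
                    double diff = data[i] - mean;
                    var += resp[i][c] * diff * diff;
                }
                vars[c] = Math.max(var / nc, 1e-6); // floor keeps the variance positive
                weights[c] = nc / data.length;
                maxShift = Math.max(maxShift, Math.abs(mean - means[c]));
                means[c] = mean;
            }
            // Second stopping criterion: the means have effectively stopped moving.
            if (maxShift < eps)
                break;
        }
        return new Result(means, iter);
    }

    public static void main(String[] args) {
        double[] data = {-0.2, 0.1, 0.0, 0.3, -0.1, 4.8, 5.1, 5.0, 5.2, 4.9};
        Result res = fit(data, new double[] {1.0, 4.0}, 50, 1e-6);
        System.out.println("iterations = " + res.iterations
            + ", means = " + java.util.Arrays.toString(res.means));
    }
}
```

On well-separated data the `eps` check usually ends the loop long before the iteration cap is hit; the cap matters mainly as a safeguard when components overlap heavily and the estimates keep drifting.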
| |
| |
| [source, java] |
| ---- |
| // Set up the trainer |
| GmmTrainer trainer = new GmmTrainer(COUNT_OF_COMPONENTS); |
| |
| // Build the model |
| GmmModel mdl = trainer |
| .withMaxCountIterations(MAX_COUNT_ITERATIONS) |
| .withMaxCountOfClusters(MAX_AMOUNT_OF_CLUSTERS) |
| .fit(ignite, dataCache, vectorizer); |
| ---- |
| |
| == Example |
| |
| To see how GMM clustering can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/GmmClusterizationExample.java[example] that is available on GitHub and delivered with every Apache Ignite distribution. |
| |