blob: 68cb586255e554e20e2bbeed2eaf226e1e7b8fc6 [file] [log] [blame]
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--><head>
<meta charset='utf-8'/><meta http-equiv='X-UA-Compatible' content='IE=edge'/><meta name='viewport' content='width=device-width, initial-scale=1'/><meta name='keywords' content='data science, groovy, ignite, kmeans, clustering'/><meta name='description' content='This post looks at using Apache Ignite with Apache Groovy and the K-Means algorithm to cluster scotch whiskeys.'/><title>The Apache Groovy programming language - Blogs - Whiskey Clustering with Groovy and Apache Ignite</title><link href='../img/favicon.ico' type='image/x-ico' rel='icon'/><link rel='stylesheet' type='text/css' href='../css/bootstrap.css'/><link rel='stylesheet' type='text/css' href='../css/font-awesome.min.css'/><link rel='stylesheet' type='text/css' href='../css/style.css'/><link rel='stylesheet' type='text/css' href='https://cdnjs.cloudflare.com/ajax/libs/prettify/r298/prettify.min.css'/>
</head><body>
<div id='fork-me'>
<a href='https://github.com/apache/groovy'>
<img style='position: fixed; top: 20px; right: -58px; border: 0; z-index: 100; transform: rotate(45deg);' src='/img/horizontal-github-ribbon.png'/>
</a>
</div><div id='st-container' class='st-container st-effect-9'>
<nav class='st-menu st-effect-9' id='menu-12'>
<h2 class='icon icon-lab'>Socialize</h2><ul>
<li>
<a href='https://groovy-lang.org/mailing-lists.html' class='icon'><span class='fa fa-envelope'></span> Discuss on the mailing-list</a>
</li><li>
<a href='https://twitter.com/ApacheGroovy' class='icon'><span class='fa fa-twitter'></span> Groovy on Twitter</a>
</li><li>
<a href='https://groovy-lang.org/events.html' class='icon'><span class='fa fa-calendar'></span> Events and conferences</a>
</li><li>
<a href='https://github.com/apache/groovy' class='icon'><span class='fa fa-github'></span> Source code on GitHub</a>
</li><li>
<a href='https://groovy-lang.org/reporting-issues.html' class='icon'><span class='fa fa-bug'></span> Report issues in Jira</a>
</li><li>
<a href='http://stackoverflow.com/questions/tagged/groovy' class='icon'><span class='fa fa-stack-overflow'></span> Stack Overflow questions</a>
</li><li>
<a href='http://groovycommunity.com/' class='icon'><span class='fa fa-slack'></span> Slack Community</a>
</li>
</ul>
</nav><div class='st-pusher'>
<div class='st-content'>
<div class='st-content-inner'>
<!--[if lt IE 7]>
<p class="browsehappy">You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.</p>
<![endif]--><div><div class='navbar navbar-default navbar-static-top' role='navigation'>
<div class='container'>
<div class='navbar-header'>
<button type='button' class='navbar-toggle' data-toggle='collapse' data-target='.navbar-collapse'>
<span class='sr-only'></span><span class='icon-bar'></span><span class='icon-bar'></span><span class='icon-bar'></span>
</button><a class='navbar-brand' href='../index.html'>
<i class='fa fa-star'></i> Apache Groovy
</a>
</div><div class='navbar-collapse collapse'>
<ul class='nav navbar-nav navbar-right'>
<li class=''><a href='https://groovy-lang.org/learn.html'>Learn</a></li><li class=''><a href='https://groovy-lang.org/documentation.html'>Documentation</a></li><li class=''><a href='/download.html'>Download</a></li><li class=''><a href='https://groovy-lang.org/support.html'>Support</a></li><li class=''><a href='/'>Contribute</a></li><li class=''><a href='https://groovy-lang.org/ecosystem.html'>Ecosystem</a></li><li class=''><a href='/blog'>Blog posts</a></li><li class=''><a href='https://groovy.apache.org/events.html'></a></li><li>
<a data-effect='st-effect-9' class='st-trigger' href='#'>Socialize</a>
</li><li class=''>
<a href='../search.html'>
<i class='fa fa-search'></i>
</a>
</li>
</ul>
</div>
</div>
</div><div id='content' class='page-1'><div class='row'><div class='row-fluid'><div class='col-lg-3'><ul class='nav-sidebar'><li><a href='./'>Blog index</a></li><li class='active'><a href='#doc'>Whiskey Clustering with Groovy and Apache Ignite</a></li><li><a href='#_whiskey_clustering' class='anchor-link'>Whiskey Clustering</a></li><li><a href='#_apache_ignite' class='anchor-link'>Apache Ignite</a></li><li><a href='#_implementation_details' class='anchor-link'>Implementation Details</a></li><li><a href='#_results' class='anchor-link'>Results</a></li><li><a href='#_more_information' class='anchor-link'>More Information</a></li></ul><br/><ul class='nav-sidebar'><li style='padding: 0.35em 0.625em; background-color: #eee'><span>Related posts</span></li><li><a href='./fruity-eclipse-collections'>Fruity Eclipse Collections</a></li><li><a href='./using-groovy-with-apache-wayang'>Using Groovy with Apache Wayang and Apache Spark</a></li><li><a href='./matrix-calculations-with-groovy-apache'>Matrix calculations with Groovy, Apache Commons Math, ojAlgo, Nd4j and EJML</a></li><li><a href='./reading-and-writing-csv-files'>Reading and Writing CSV files with Groovy</a></li><li><a href='./detecting-objects-with-groovy-the'>Detecting objects with Groovy, the Deep Java Library (DJL), and Apache MXNet</a></li><li><a href='./classifying-iris-flowers-with-deep'>Classifying Iris Flowers with Deep Learning, Groovy and GraalVM</a></li></ul></div><div class='col-lg-8 col-lg-pull-0'><a name='doc'></a><h1>Whiskey Clustering with Groovy and Apache Ignite</h1><p><span>Author: <i>Paul King</i></span><br/><span>Published: 2022-10-27 11:13AM</span></p><hr/><div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>In a previous <a href="https://groovy.apache.org/blog/using-groovy-with-apache-wayang">blog post</a>,
<span class="image right"><img src="https://www.apache.org/logos/res/ignite/default.png" alt="Ignite logo" width="150"></span> we looked at
using <a href="https://wayang.apache.org/">Apache Wayang</a> (incubating) and
<a href="https://spark.apache.org/">Apache Spark</a> to scale up the
<a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means</a> clustering algorithm.
Let&#8217;s look at another useful technology for scaling up this problem,
<a href="https://ignite.apache.org/">Apache Ignite</a>.
The Ignite team recently released a <a href="https://ignite.apache.org/releases/2.14.0/release_notes.html">new version</a>,
but earlier versions are also fine for our example.</p>
</div>
<div class="paragraph">
<p>Before we start, a quick reminder of the problem.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_whiskey_clustering">Whiskey Clustering</h2>
<div class="sectionbody">
<div class="paragraph">
<p><span class="image right"><img src="img/groovy_logo.png" alt="Groovy logo" width="200"></span>
This problem looks at the quest of finding the perfect single-malt Scotch whiskey. The whiskies produced from <a href="https://www.niss.org/sites/default/files/ScotchWhisky01.txt">86 distilleries</a> have been ranked by expert tasters according to 12 criteria (Body, Sweetness, Malty, Smoky, Fruity, etc.). We&#8217;ll use a K-means algorithm to calculate the centroids.</p>
</div>
<div class="paragraph">
<p><span class="image right"><img src="img/whiskey_bottles.jpg" alt="whiskey_bottles" width="300"></span>
K-means is a standard data-science clustering technique. In our case, it groups whiskies with similar characteristics (according to the 12 criteria) into clusters. If we have a favourite whiskey, chances are we can find something similar by looking at other instances in the same cluster. If we are feeling like a change, we can look for a whiskey in some other cluster. The centroid is the notional "point" in the middle of the cluster. For us, it reflects the typical measure of each criteria for a whiskey in that cluster.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_apache_ignite">Apache Ignite</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Apache Ignite is a distributed database for high-performance computing with in-memory speed. It makes a cluster (or <em>grid</em>) of nodes appear like an in-memory cache.
This explanation drastically simplifies Ignite&#8217;s feature set. Ignite can be used as:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>an in-memory cache with special features like SQL querying and transactional properties</p>
</li>
<li>
<p>an in-memory data-grid with advanced read-through &amp; write-through capabilities on top of one or more distributed databases</p>
</li>
<li>
<p>an ultra-fast and horizontally scalable in-memory database</p>
</li>
<li>
<p>a high-performance computing engine for custom or built-in tasks including machine learning</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>It is mostly this last capability that we will use. Ignite&#8217;s <em>Machine Learning API</em> has purpose built, cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation among others. We&#8217;ll use the distributed <a href="https://ignite.apache.org/docs/latest/machine-learning/clustering/k-means-clustering">K-means Clustering</a> algorithm from their library.
<span class="image"><img src="img/apache_ignite_architecture.png" alt="Machine Learning _ Ignite Documentation"></span></p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_implementation_details">Implementation Details</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Apache Ignite has special capabilities for reading data into the cache. We could use <code>IgniteDataStreamer</code> or <code>IgniteCache.loadCache()</code> and load data from files, stream sources, various database sources and so forth. This is particularly relevant when using a cluster.</p>
</div>
<div class="paragraph">
<p>For our little example, our data is in a relatively small CSV file and we will be using a single node, so we&#8217;ll just read our data using <a href="https://commons.apache.org/csv/">Apache Commons CSV</a>:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="prettyprint highlight"><code data-lang="groovy">var file = getClass().classLoader.getResource('whiskey.csv').file as File
var rows = file.withReader {r -&gt; RFC4180.parse(r).records*.toList() }
var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]</code></pre>
</div>
</div>
<div class="paragraph">
<p>We&#8217;ll configure our single node Ignite data cache using code (but we could place the details in a configuration file in more complex scenarios):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="prettyprint highlight"><code data-lang="groovy">var cfg = new IgniteConfiguration(
peerClassLoadingEnabled: true,
discoverySpi: new TcpDiscoverySpi(
ipFinder: new TcpDiscoveryMulticastIpFinder(
addresses: ['127.0.0.1:47500..47509']
)
)
)</code></pre>
</div>
</div>
<div class="paragraph">
<p>We&#8217;ll create a few helper variables:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="prettyprint highlight"><code data-lang="groovy">var features = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
var pretty = this.&amp;sprintf.curry('%.4f')
var dist = new EuclideanDistance()
var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)</code></pre>
</div>
</div>
<div class="paragraph">
<p>Now we start the node, populate the cache, run our k-means algorithm, and print the result.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="prettyprint highlight"><code data-lang="groovy">Ignition.start(cfg).withCloseable { ignite -&gt;
println "&gt;&gt;&gt; Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
var dataCache = ignite.createCache(new CacheConfiguration&lt;Integer, double[]&gt;(
name: "TEST_${UUID.randomUUID()}",
affinity: new RendezvousAffinityFunction(false, 10)))
data.indices.each { int i -&gt; dataCache.put(i, data[i]) }
var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
var mdl = trainer.fit(ignite, dataCache, vectorizer)
println "&gt;&gt;&gt; KMeans centroids:\n${features.join(', ')}"
var centroids = mdl.centers*.all()
centroids.each { c -&gt; println c*.get().collect(pretty).join(', ') }
dataCache.destroy()
}</code></pre>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_results">Results</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Here is the output:</p>
</div>
<div class="listingblock">
<div class="content">
<pre>[18:13:11] __________ ________________
[18:13:11] / _/ ___/ |/ / _/_ __/ __/
[18:13:11] _/ // (7 7 // / / / / _/
[18:13:11] /___/\___/_/|_/___/ /_/ /x___/
[18:13:11]
[18:13:11] ver. 2.14.0#20220929-sha1:951e8deb
[18:13:11] 2022 Copyright(C) Apache Software Foundation
...
[18:13:11] Configured plugins:
[18:13:11] ^-- ml-inference-plugin 1.0.0
[18:13:14] Ignite node started OK (id=f731e4ab)
...
&gt;&gt;&gt; Ignite grid started for data: 86 rows X 13 cols
&gt;&gt;&gt; KMeans centroids
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
2.7037, 2.4444, 1.4074, 0.0370, 0.0000, 1.8519, 1.6667, 1.8519, 1.8889, 2.0370, 2.1481, 1.6667
1.8500, 1.9000, 2.0000, 0.9500, 0.1500, 1.1000, 1.5000, 0.6000, 1.5500, 1.7000, 1.3000, 1.5000
1.2667, 2.1333, 0.9333, 0.1333, 0.0000, 1.0667, 0.8000, 0.5333, 1.8000, 1.7333, 2.2667, 2.2667
3.6667, 1.5000, 3.6667, 3.3333, 0.6667, 0.1667, 1.6667, 0.5000, 1.1667, 1.3333, 1.1667, 0.1667
1.5000, 2.8889, 1.0000, 0.2778, 0.1667, 1.0000, 1.2222, 0.6111, 0.5556, 1.7778, 1.6667, 2.0000
[18:13:15] Ignite node stopped OK [uptime=00:00:00.663]</pre>
</div>
</div>
<div class="paragraph">
<p>We can plot the centroid characteristics in a spider plot.
<span class="image"><img src="img/whiskey_spider_plot.png" alt="Whiskey clusters with Apache Ignite"></span></p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_more_information">More Information</h2>
<div class="sectionbody">
<div class="ulist">
<ul>
<li>
<p>Repo containing the source code:
<a href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyIgnite">WhiskeyIgnite</a></p>
</li>
<li>
<p>Repo containing similar examples using a variety of libraries including Apache Commons CSV,
Weka, Smile, Tribuo and others:
<a href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/Whiskey">Whiskey</a></p>
</li>
<li>
<p>A similar example using Apache Spark directly but with a built-in parallelized k-means from the spark-mllib library rather than a hand-crafted algorithm:
<a href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeySpark">WhiskeySpark</a></p>
</li>
</ul>
</div>
</div>
</div></div></div></div></div><footer id='footer'>
<div class='row'>
<div class='colset-3-footer'>
<div class='col-1'>
<h1>Groovy</h1><ul>
<li><a href='https://groovy-lang.org/learn.html'>Learn</a></li><li><a href='https://groovy-lang.org/documentation.html'>Documentation</a></li><li><a href='/download.html'>Download</a></li><li><a href='https://groovy-lang.org/support.html'>Support</a></li><li><a href='/'>Contribute</a></li><li><a href='https://groovy-lang.org/ecosystem.html'>Ecosystem</a></li><li><a href='/blog'>Blog posts</a></li><li><a href='https://groovy.apache.org/events.html'></a></li>
</ul>
</div><div class='col-2'>
<h1>About</h1><ul>
<li><a href='https://github.com/apache/groovy'>Source code</a></li><li><a href='https://groovy-lang.org/security.html'>Security</a></li><li><a href='https://groovy-lang.org/learn.html#books'>Books</a></li><li><a href='https://groovy-lang.org/thanks.html'>Thanks</a></li><li><a href='http://www.apache.org/foundation/sponsorship.html'>Sponsorship</a></li><li><a href='https://groovy-lang.org/faq.html'>FAQ</a></li><li><a href='https://groovy-lang.org/search.html'>Search</a></li>
</ul>
</div><div class='col-3'>
<h1>Socialize</h1><ul>
<li><a href='https://groovy-lang.org/mailing-lists.html'>Discuss on the mailing-list</a></li><li><a href='https://twitter.com/ApacheGroovy'>Groovy on Twitter</a></li><li><a href='https://groovy-lang.org/events.html'>Events and conferences</a></li><li><a href='https://github.com/apache/groovy'>Source code on GitHub</a></li><li><a href='https://groovy-lang.org/reporting-issues.html'>Report issues in Jira</a></li><li><a href='http://stackoverflow.com/questions/tagged/groovy'>Stack Overflow questions</a></li><li><a href='http://groovycommunity.com/'>Slack Community</a></li>
</ul>
</div><div class='col-right'>
<p>
The Groovy programming language is supported by the <a href='http://www.apache.org'>Apache Software Foundation</a> and the Groovy community.
</p><div text-align='right'>
<img src='../img/asf_logo.png' title='The Apache Software Foundation' alt='The Apache Software Foundation' style='width:60%'/>
</div><p>Apache&reg; and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.</p>
</div>
</div><div class='clearfix'>&copy; 2003-2023 the Apache Groovy project &mdash; Groovy is Open Source: <a href='http://www.apache.org/licenses/LICENSE-2.0.html' alt='Apache 2 License'>license</a>, <a href='https://privacy.apache.org/policies/privacy-policy-public.html'>privacy policy</a>.</div>
</div>
</footer></div>
</div>
</div>
</div>
</div><script src='../js/vendor/jquery-1.10.2.min.js' defer></script><script src='../js/vendor/classie.js' defer></script><script src='../js/vendor/bootstrap.js' defer></script><script src='../js/vendor/sidebarEffects.js' defer></script><script src='../js/vendor/modernizr-2.6.2.min.js' defer></script><script src='../js/plugins.js' defer></script><script src='https://cdnjs.cloudflare.com/ajax/libs/prettify/r298/prettify.min.js'></script><script>document.addEventListener('DOMContentLoaded',prettyPrint)</script><script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-257558-10', 'auto');
ga('send', 'pageview');
</script>
</body></html>