| <!DOCTYPE html> |
| <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> |
| <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> |
| <!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--><head> |
| <meta charset='utf-8'/><meta http-equiv='X-UA-Compatible' content='IE=edge'/><meta name='viewport' content='width=device-width, initial-scale=1'/><meta name='keywords' content='data science, groovy, ignite, kmeans, clustering'/><meta name='description' content='This post looks at using Apache Ignite with Apache Groovy and the K-Means algorithm to cluster scotch whiskeys.'/><title>The Apache Groovy programming language - Blogs - Whiskey Clustering with Groovy and Apache Ignite</title><link href='../img/favicon.ico' type='image/x-ico' rel='icon'/><link rel='stylesheet' type='text/css' href='../css/bootstrap.css'/><link rel='stylesheet' type='text/css' href='../css/font-awesome.min.css'/><link rel='stylesheet' type='text/css' href='../css/style.css'/><link rel='stylesheet' type='text/css' href='https://cdnjs.cloudflare.com/ajax/libs/prettify/r298/prettify.min.css'/> |
| </head><body> |
| <div id='fork-me'> |
| <a href='https://github.com/apache/groovy'> |
| <img style='position: fixed; top: 20px; right: -58px; border: 0; z-index: 100; transform: rotate(45deg);' src='/img/horizontal-github-ribbon.png'/> |
| </a> |
| </div><div id='st-container' class='st-container st-effect-9'> |
| <nav class='st-menu st-effect-9' id='menu-12'> |
| <h2 class='icon icon-lab'>Socialize</h2><ul> |
| <li> |
| <a href='https://groovy-lang.org/mailing-lists.html' class='icon'><span class='fa fa-envelope'></span> Discuss on the mailing-list</a> |
| </li><li> |
| <a href='https://twitter.com/ApacheGroovy' class='icon'><span class='fa fa-twitter'></span> Groovy on Twitter</a> |
| </li><li> |
| <a href='https://groovy-lang.org/events.html' class='icon'><span class='fa fa-calendar'></span> Events and conferences</a> |
| </li><li> |
| <a href='https://github.com/apache/groovy' class='icon'><span class='fa fa-github'></span> Source code on GitHub</a> |
| </li><li> |
| <a href='https://groovy-lang.org/reporting-issues.html' class='icon'><span class='fa fa-bug'></span> Report issues in Jira</a> |
| </li><li> |
| <a href='http://stackoverflow.com/questions/tagged/groovy' class='icon'><span class='fa fa-stack-overflow'></span> Stack Overflow questions</a> |
| </li><li> |
| <a href='http://groovycommunity.com/' class='icon'><span class='fa fa-slack'></span> Slack Community</a> |
| </li> |
| </ul> |
| </nav><div class='st-pusher'> |
| <div class='st-content'> |
| <div class='st-content-inner'> |
| <!--[if lt IE 7]> |
| <p class="browsehappy">You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.</p> |
| <![endif]--><div><div class='navbar navbar-default navbar-static-top' role='navigation'> |
| <div class='container'> |
| <div class='navbar-header'> |
| <button type='button' class='navbar-toggle' data-toggle='collapse' data-target='.navbar-collapse'> |
| <span class='sr-only'></span><span class='icon-bar'></span><span class='icon-bar'></span><span class='icon-bar'></span> |
| </button><a class='navbar-brand' href='../index.html'> |
| <i class='fa fa-star'></i> Apache Groovy |
| </a> |
| </div><div class='navbar-collapse collapse'> |
| <ul class='nav navbar-nav navbar-right'> |
| <li class=''><a href='https://groovy-lang.org/learn.html'>Learn</a></li><li class=''><a href='https://groovy-lang.org/documentation.html'>Documentation</a></li><li class=''><a href='/download.html'>Download</a></li><li class=''><a href='https://groovy-lang.org/support.html'>Support</a></li><li class=''><a href='/'>Contribute</a></li><li class=''><a href='https://groovy-lang.org/ecosystem.html'>Ecosystem</a></li><li class=''><a href='/blog'>Blog posts</a></li><li class=''><a href='https://groovy.apache.org/events.html'></a></li><li> |
| <a data-effect='st-effect-9' class='st-trigger' href='#'>Socialize</a> |
| </li><li class=''> |
| <a href='../search.html'> |
| <i class='fa fa-search'></i> |
| </a> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </div><div id='content' class='page-1'><div class='row'><div class='row-fluid'><div class='col-lg-3'><ul class='nav-sidebar'><li><a href='./'>Blog index</a></li><li class='active'><a href='#doc'>Whiskey Clustering with Groovy and Apache Ignite</a></li><li><a href='#_whiskey_clustering' class='anchor-link'>Whiskey Clustering</a></li><li><a href='#_apache_ignite' class='anchor-link'>Apache Ignite</a></li><li><a href='#_implementation_details' class='anchor-link'>Implementation Details</a></li><li><a href='#_results' class='anchor-link'>Results</a></li><li><a href='#_more_information' class='anchor-link'>More Information</a></li></ul><br/><ul class='nav-sidebar'><li style='padding: 0.35em 0.625em; background-color: #eee'><span>Related posts</span></li><li><a href='./fruity-eclipse-collections'>Fruity Eclipse Collections</a></li><li><a href='./using-groovy-with-apache-wayang'>Using Groovy with Apache Wayang and Apache Spark</a></li><li><a href='./matrix-calculations-with-groovy-apache'>Matrix calculations with Groovy, Apache Commons Math, ojAlgo, Nd4j and EJML</a></li><li><a href='./reading-and-writing-csv-files'>Reading and Writing CSV files with Groovy</a></li><li><a href='./detecting-objects-with-groovy-the'>Detecting objects with Groovy, the Deep Java Library (DJL), and Apache MXNet</a></li><li><a href='./classifying-iris-flowers-with-deep'>Classifying Iris Flowers with Deep Learning, Groovy and GraalVM</a></li></ul></div><div class='col-lg-8 col-lg-pull-0'><a name='doc'></a><h1>Whiskey Clustering with Groovy and Apache Ignite</h1><p><span>Author: <i>Paul King</i></span><br/><span>Published: 2022-10-27 11:13AM</span></p><hr/><div id="preamble"> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>In a previous <a href="https://groovy.apache.org/blog/using-groovy-with-apache-wayang">blog post</a>, |
| <span class="image right"><img src="https://www.apache.org/logos/res/ignite/default.png" alt="Ignite logo" width="150"></span> we looked at |
| using <a href="https://wayang.apache.org/">Apache Wayang</a> (incubating) and |
| <a href="https://spark.apache.org/">Apache Spark</a> to scale up the |
| <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means</a> clustering algorithm. |
| Let’s look at another useful technology for scaling up this problem, |
| <a href="https://ignite.apache.org/">Apache Ignite</a>. |
| The Ignite team recently released a <a href="https://ignite.apache.org/releases/2.14.0/release_notes.html">new version</a>, |
| but earlier versions are also fine for our example.</p> |
| </div> |
| <div class="paragraph"> |
| <p>Before we start, a quick reminder of the problem.</p> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_whiskey_clustering">Whiskey Clustering</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p><span class="image right"><img src="img/groovy_logo.png" alt="Groovy logo" width="200"></span> |
| This problem looks at the quest of finding the perfect single-malt Scotch whiskey. The whiskies produced from <a href="https://www.niss.org/sites/default/files/ScotchWhisky01.txt">86 distilleries</a> have been ranked by expert tasters according to 12 criteria (Body, Sweetness, Malty, Smoky, Fruity, etc.). We’ll use a K-means algorithm to calculate the centroids.</p> |
| </div> |
| <div class="paragraph"> |
| <p><span class="image right"><img src="img/whiskey_bottles.jpg" alt="whiskey_bottles" width="300"></span> |
| K-means is a standard data-science clustering technique. In our case, it groups whiskies with similar characteristics (according to the 12 criteria) into clusters. If we have a favourite whiskey, chances are we can find something similar by looking at other instances in the same cluster. If we are feeling like a change, we can look for a whiskey in some other cluster. The centroid is the notional "point" in the middle of the cluster. For us, it reflects the typical measure of each criteria for a whiskey in that cluster.</p> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_apache_ignite">Apache Ignite</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>Apache Ignite is a distributed database for high-performance computing with in-memory speed. It makes a cluster (or <em>grid</em>) of nodes appear like an in-memory cache. |
| This explanation drastically simplifies Ignite’s feature set. Ignite can be used as:</p> |
| </div> |
| <div class="ulist"> |
| <ul> |
| <li> |
| <p>an in-memory cache with special features like SQL querying and transactional properties</p> |
| </li> |
| <li> |
| <p>an in-memory data-grid with advanced read-through & write-through capabilities on top of one or more distributed databases</p> |
| </li> |
| <li> |
| <p>an ultra-fast and horizontally scalable in-memory database</p> |
| </li> |
| <li> |
| <p>a high-performance computing engine for custom or built-in tasks including machine learning</p> |
| </li> |
| </ul> |
| </div> |
| <div class="paragraph"> |
| <p>It is mostly this last capability that we will use. Ignite’s <em>Machine Learning API</em> has purpose built, cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation among others. We’ll use the distributed <a href="https://ignite.apache.org/docs/latest/machine-learning/clustering/k-means-clustering">K-means Clustering</a> algorithm from their library. |
| <span class="image"><img src="img/apache_ignite_architecture.png" alt="Machine Learning _ Ignite Documentation"></span></p> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_implementation_details">Implementation Details</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>Apache Ignite has special capabilities for reading data into the cache. We could use <code>IgniteDataStreamer</code> or <code>IgniteCache.loadCache()</code> and load data from files, stream sources, various database sources and so forth. This is particularly relevant when using a cluster.</p> |
| </div> |
| <div class="paragraph"> |
| <p>For our little example, our data is in a relatively small CSV file and we will be using a single node, so we’ll just read our data using <a href="https://commons.apache.org/csv/">Apache Commons CSV</a>:</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="prettyprint highlight"><code data-lang="groovy">var file = getClass().classLoader.getResource('whiskey.csv').file as File |
| var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() } |
| var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]</code></pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>We’ll configure our single node Ignite data cache using code (but we could place the details in a configuration file in more complex scenarios):</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="prettyprint highlight"><code data-lang="groovy">var cfg = new IgniteConfiguration( |
| peerClassLoadingEnabled: true, |
| discoverySpi: new TcpDiscoverySpi( |
| ipFinder: new TcpDiscoveryMulticastIpFinder( |
| addresses: ['127.0.0.1:47500..47509'] |
| ) |
| ) |
| )</code></pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>We’ll create a few helper variables:</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="prettyprint highlight"><code data-lang="groovy">var features = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', |
| 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral'] |
| var pretty = this.&sprintf.curry('%.4f') |
| var dist = new EuclideanDistance() |
| var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)</code></pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>Now we start the node, populate the cache, run our k-means algorithm, and print the result.</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="prettyprint highlight"><code data-lang="groovy">Ignition.start(cfg).withCloseable { ignite -> |
| println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" |
| var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>( |
| name: "TEST_${UUID.randomUUID()}", |
| affinity: new RendezvousAffinityFunction(false, 10))) |
| data.indices.each { int i -> dataCache.put(i, data[i]) } |
| var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5) |
| var mdl = trainer.fit(ignite, dataCache, vectorizer) |
| println ">>> KMeans centroids:\n${features.join(', ')}" |
| var centroids = mdl.centers*.all() |
| centroids.each { c -> println c*.get().collect(pretty).join(', ') } |
| dataCache.destroy() |
| }</code></pre> |
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_results">Results</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>Here is the output:</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre>[18:13:11] __________ ________________ |
| [18:13:11] / _/ ___/ |/ / _/_ __/ __/ |
| [18:13:11] _/ // (7 7 // / / / / _/ |
| [18:13:11] /___/\___/_/|_/___/ /_/ /x___/ |
| [18:13:11] |
| [18:13:11] ver. 2.14.0#20220929-sha1:951e8deb |
| [18:13:11] 2022 Copyright(C) Apache Software Foundation |
| ... |
| [18:13:11] Configured plugins: |
| [18:13:11] ^-- ml-inference-plugin 1.0.0 |
| [18:13:14] Ignite node started OK (id=f731e4ab) |
| ... |
| >>> Ignite grid started for data: 86 rows X 13 cols |
| >>> KMeans centroids |
| Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral |
| 2.7037, 2.4444, 1.4074, 0.0370, 0.0000, 1.8519, 1.6667, 1.8519, 1.8889, 2.0370, 2.1481, 1.6667 |
| 1.8500, 1.9000, 2.0000, 0.9500, 0.1500, 1.1000, 1.5000, 0.6000, 1.5500, 1.7000, 1.3000, 1.5000 |
| 1.2667, 2.1333, 0.9333, 0.1333, 0.0000, 1.0667, 0.8000, 0.5333, 1.8000, 1.7333, 2.2667, 2.2667 |
| 3.6667, 1.5000, 3.6667, 3.3333, 0.6667, 0.1667, 1.6667, 0.5000, 1.1667, 1.3333, 1.1667, 0.1667 |
| 1.5000, 2.8889, 1.0000, 0.2778, 0.1667, 1.0000, 1.2222, 0.6111, 0.5556, 1.7778, 1.6667, 2.0000 |
| [18:13:15] Ignite node stopped OK [uptime=00:00:00.663]</pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>We can plot the centroid characteristics in a spider plot. |
| <span class="image"><img src="img/whiskey_spider_plot.png" alt="Whiskey clusters with Apache Ignite"></span></p> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_more_information">More Information</h2> |
| <div class="sectionbody"> |
| <div class="ulist"> |
| <ul> |
| <li> |
| <p>Repo containing the source code: |
| <a href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyIgnite">WhiskeyIgnite</a></p> |
| </li> |
| <li> |
| <p>Repo containing similar examples using a variety of libraries including Apache Commons CSV, |
| Weka, Smile, Tribuo and others: |
| <a href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/Whiskey">Whiskey</a></p> |
| </li> |
| <li> |
| <p>A similar example using Apache Spark directly but with a built-in parallelized k-means from the spark-mllib library rather than a hand-crafted algorithm: |
| <a href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeySpark">WhiskeySpark</a></p> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </div></div></div></div></div><footer id='footer'> |
| <div class='row'> |
| <div class='colset-3-footer'> |
| <div class='col-1'> |
| <h1>Groovy</h1><ul> |
| <li><a href='https://groovy-lang.org/learn.html'>Learn</a></li><li><a href='https://groovy-lang.org/documentation.html'>Documentation</a></li><li><a href='/download.html'>Download</a></li><li><a href='https://groovy-lang.org/support.html'>Support</a></li><li><a href='/'>Contribute</a></li><li><a href='https://groovy-lang.org/ecosystem.html'>Ecosystem</a></li><li><a href='/blog'>Blog posts</a></li><li><a href='https://groovy.apache.org/events.html'></a></li> |
| </ul> |
| </div><div class='col-2'> |
| <h1>About</h1><ul> |
| <li><a href='https://github.com/apache/groovy'>Source code</a></li><li><a href='https://groovy-lang.org/security.html'>Security</a></li><li><a href='https://groovy-lang.org/learn.html#books'>Books</a></li><li><a href='https://groovy-lang.org/thanks.html'>Thanks</a></li><li><a href='http://www.apache.org/foundation/sponsorship.html'>Sponsorship</a></li><li><a href='https://groovy-lang.org/faq.html'>FAQ</a></li><li><a href='https://groovy-lang.org/search.html'>Search</a></li> |
| </ul> |
| </div><div class='col-3'> |
| <h1>Socialize</h1><ul> |
| <li><a href='https://groovy-lang.org/mailing-lists.html'>Discuss on the mailing-list</a></li><li><a href='https://twitter.com/ApacheGroovy'>Groovy on Twitter</a></li><li><a href='https://groovy-lang.org/events.html'>Events and conferences</a></li><li><a href='https://github.com/apache/groovy'>Source code on GitHub</a></li><li><a href='https://groovy-lang.org/reporting-issues.html'>Report issues in Jira</a></li><li><a href='http://stackoverflow.com/questions/tagged/groovy'>Stack Overflow questions</a></li><li><a href='http://groovycommunity.com/'>Slack Community</a></li> |
| </ul> |
| </div><div class='col-right'> |
| <p> |
| The Groovy programming language is supported by the <a href='http://www.apache.org'>Apache Software Foundation</a> and the Groovy community. |
| </p><div text-align='right'> |
| <img src='../img/asf_logo.png' title='The Apache Software Foundation' alt='The Apache Software Foundation' style='width:60%'/> |
| </div><p>Apache® and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.</p> |
| </div> |
| </div><div class='clearfix'>© 2003-2023 the Apache Groovy project — Groovy is Open Source: <a href='http://www.apache.org/licenses/LICENSE-2.0.html' alt='Apache 2 License'>license</a>, <a href='https://privacy.apache.org/policies/privacy-policy-public.html'>privacy policy</a>.</div> |
| </div> |
| </footer></div> |
| </div> |
| </div> |
| </div> |
| </div><script src='../js/vendor/jquery-1.10.2.min.js' defer></script><script src='../js/vendor/classie.js' defer></script><script src='../js/vendor/bootstrap.js' defer></script><script src='../js/vendor/sidebarEffects.js' defer></script><script src='../js/vendor/modernizr-2.6.2.min.js' defer></script><script src='../js/plugins.js' defer></script><script src='https://cdnjs.cloudflare.com/ajax/libs/prettify/r298/prettify.min.js'></script><script>document.addEventListener('DOMContentLoaded',prettyPrint)</script><script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| |
| ga('create', 'UA-257558-10', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| </body></html> |