blob: 59239fa14e285d4964176e396fdaf9f74e8314ed [file] [log] [blame]
<!doctype html>
<html lang="en" dir="ltr" class="blog-wrapper blog-list-page plugin-blog plugin-id-default" data-has-hydrated="false">
<head>
<meta charset="UTF-8">
<meta name="generator" content="Docusaurus v3.1.1">
<title data-rh="true">Blog | Apache Wayang (incubating)</title><meta data-rh="true" name="viewport" content="width=device-width,initial-scale=1"><meta data-rh="true" name="twitter:card" content="summary_large_image"><meta data-rh="true" property="og:url" content="https://wayang.apache.org/blog"><meta data-rh="true" property="og:locale" content="en"><meta data-rh="true" name="docusaurus_locale" content="en"><meta data-rh="true" name="docsearch:language" content="en"><meta data-rh="true" property="og:title" content="Blog | Apache Wayang (incubating)"><meta data-rh="true" name="description" content="Blog"><meta data-rh="true" property="og:description" content="Blog"><meta data-rh="true" name="docusaurus_tag" content="blog_posts_list"><meta data-rh="true" name="docsearch:docusaurus_tag" content="blog_posts_list"><link data-rh="true" rel="icon" href="/img/wayang-logo.jpg"><link data-rh="true" rel="canonical" href="https://wayang.apache.org/blog"><link data-rh="true" rel="alternate" href="https://wayang.apache.org/blog" hreflang="en"><link data-rh="true" rel="alternate" href="https://wayang.apache.org/blog" hreflang="x-default"><link rel="alternate" type="application/rss+xml" href="/blog/rss.xml" title="Apache Wayang (incubating) RSS Feed">
<link rel="alternate" type="application/atom+xml" href="/blog/atom.xml" title="Apache Wayang (incubating) Atom Feed"><link rel="stylesheet" href="/assets/css/styles.ecf70413.css">
<script src="/assets/js/runtime~main.db1fac0d.js" defer="defer"></script>
<script src="/assets/js/main.f50bad53.js" defer="defer"></script>
</head>
<body class="navigation-with-keyboard">
<script>!function(){function t(t){document.documentElement.setAttribute("data-theme",t)}var e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return localStorage.getItem("theme")}catch(t){}}();t(null!==e?e:"light")}(),function(){try{const a=new URLSearchParams(window.location.search).entries();for(var[t,e]of a)if(t.startsWith("docusaurus-data-")){var n=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(n,e)}}catch(t){}}(),document.documentElement.setAttribute("data-announcement-bar-initially-dismissed",function(){try{return"true"===localStorage.getItem("docusaurus.announcement.dismiss")}catch(t){}return!1}())</script><div id="__docusaurus"><div role="region" aria-label="Skip to main content"><a class="skipToContent_fXgn" href="#__docusaurus_skipToContent_fallback">Skip to main content</a></div><div class="announcementBar_mb4j" style="background-color:#fafbfc;color:#091E42" role="banner"><div class="announcementBarPlaceholder_vyr4"></div><div class="content_knG7 announcementBarContent_xLdY">⭐️ If you like Apache Wayang, give it a star on <a target="_blank" href="https://github.com/apache/incubator-wayang">GitHub</a>! ⭐ </div><button type="button" aria-label="Close" class="clean-btn close closeButton_CVFx announcementBarClose_gvF7"><svg viewBox="0 0 15 15" width="14" height="14"><g stroke="currentColor" stroke-width="3.1"><path d="M.75.75l13.5 13.5M14.25.75L.75 14.25"></path></g></svg></button></div><nav aria-label="Main" class="navbar navbar--fixed-top"><div class="navbar__inner"><div class="navbar__items"><button aria-label="Toggle navigation bar" aria-expanded="false" class="navbar__toggle clean-btn" type="button"><svg width="30" height="30" viewBox="0 0 30 30" aria-hidden="true"><path stroke="currentColor" stroke-linecap="round" stroke-miterlimit="10" stroke-width="2" d="M4 7h22M4 15h22M4 23h22"></path></svg></button><a class="navbar__brand" href="/"><div class="navbar__logo"><img src="/img/wayang.png" alt="Wayang Logo" class="themedComponent_mlkZ themedComponent--light_NVdE"><img src="/img/wayang.png" alt="Wayang Logo" class="themedComponent_mlkZ themedComponent--dark_xIcU"></div><b class="navbar__title text--truncate"></b></a></div><div class="navbar__items navbar__items--right"><a class="navbar__item navbar__link" href="/docs/start/download">Download</a><a class="navbar__item navbar__link" href="/docs/introduction/about">About</a><a class="navbar__item navbar__link" href="/docs/guide/installation">Developers</a><div class="navbar__item dropdown dropdown--hoverable dropdown--right"><a href="#" aria-haspopup="true" aria-expanded="false" role="button" class="navbar__link">Community</a><ul class="dropdown__menu"><li><a aria-current="page" class="dropdown__link dropdown__link--active" href="/blog/">Blog</a></li><li><a class="dropdown__link" href="/docs/community/mailinglist">Project</a></li></ul></div><div class="navbar__item dropdown dropdown--hoverable dropdown--right"><a href="#" aria-haspopup="true" aria-expanded="false" role="button" class="navbar__link">ASF</a><ul class="dropdown__menu"><li><a href="https://www.apache.org/" target="_blank" rel="noopener noreferrer" class="dropdown__link">Foundation</a></li><li><a href="https://www.apache.org/licenses/" target="_blank" rel="noopener noreferrer" class="dropdown__link">License</a></li><li><a href="https://www.apache.org/events/current-event.html" target="_blank" rel="noopener noreferrer" class="dropdown__link">Events</a></li><li><a href="https://privacy.apache.org/policies/privacy-policy-public.html" target="_blank" rel="noopener noreferrer" class="dropdown__link">Privacy</a></li><li><a href="https://www.apache.org/security/" target="_blank" rel="noopener noreferrer" class="dropdown__link">Security</a></li><li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank" rel="noopener noreferrer" class="dropdown__link">Sponsorship</a></li><li><a href="https://www.apache.org/foundation/thanks.html" target="_blank" rel="noopener noreferrer" class="dropdown__link">Thanks</a></li><li><a href="https://www.apache.org/foundation/policies/conduct.html" target="_blank" rel="noopener noreferrer" class="dropdown__link">Code of Conduct</a></li></ul></div><a href="https://github.com/apache/incubator-wayang" target="_blank" rel="noopener noreferrer" class="navbar__item navbar__link header-github-link" aria-label="GitHub repository"></a><div class="toggle_vylO colorModeToggle_DEke"><button class="clean-btn toggleButton_gllP toggleButtonDisabled_aARS" type="button" disabled="" title="Switch between dark and light mode (currently light mode)" aria-label="Switch between dark and light mode (currently light mode)" aria-live="polite"><svg viewBox="0 0 24 24" width="24" height="24" class="lightToggleIcon_pyhR"><path fill="currentColor" d="M12,9c1.65,0,3,1.35,3,3s-1.35,3-3,3s-3-1.35-3-3S10.35,9,12,9 M12,7c-2.76,0-5,2.24-5,5s2.24,5,5,5s5-2.24,5-5 S14.76,7,12,7L12,7z M2,13l2,0c0.55,0,1-0.45,1-1s-0.45-1-1-1l-2,0c-0.55,0-1,0.45-1,1S1.45,13,2,13z M20,13l2,0c0.55,0,1-0.45,1-1 s-0.45-1-1-1l-2,0c-0.55,0-1,0.45-1,1S19.45,13,20,13z M11,2v2c0,0.55,0.45,1,1,1s1-0.45,1-1V2c0-0.55-0.45-1-1-1S11,1.45,11,2z M11,20v2c0,0.55,0.45,1,1,1s1-0.45,1-1v-2c0-0.55-0.45-1-1-1C11.45,19,11,19.45,11,20z M5.99,4.58c-0.39-0.39-1.03-0.39-1.41,0 c-0.39,0.39-0.39,1.03,0,1.41l1.06,1.06c0.39,0.39,1.03,0.39,1.41,0s0.39-1.03,0-1.41L5.99,4.58z M18.36,16.95 c-0.39-0.39-1.03-0.39-1.41,0c-0.39,0.39-0.39,1.03,0,1.41l1.06,1.06c0.39,0.39,1.03,0.39,1.41,0c0.39-0.39,0.39-1.03,0-1.41 L18.36,16.95z M19.42,5.99c0.39-0.39,0.39-1.03,0-1.41c-0.39-0.39-1.03-0.39-1.41,0l-1.06,1.06c-0.39,0.39-0.39,1.03,0,1.41 s1.03,0.39,1.41,0L19.42,5.99z M7.05,18.36c0.39-0.39,0.39-1.03,0-1.41c-0.39-0.39-1.03-0.39-1.41,0l-1.06,1.06 c-0.39,0.39-0.39,1.03,0,1.41s1.03,0.39,1.41,0L7.05,18.36z"></path></svg><svg viewBox="0 0 24 24" width="24" height="24" class="darkToggleIcon_wfgR"><path fill="currentColor" d="M9.37,5.51C9.19,6.15,9.1,6.82,9.1,7.5c0,4.08,3.32,7.4,7.4,7.4c0.68,0,1.35-0.09,1.99-0.27C17.45,17.19,14.93,19,12,19 c-3.86,0-7-3.14-7-7C5,9.07,6.81,6.55,9.37,5.51z M12,3c-4.97,0-9,4.03-9,9s4.03,9,9,9s9-4.03,9-9c0-0.46-0.04-0.92-0.1-1.36 c-0.98,1.37-2.58,2.26-4.4,2.26c-2.98,0-5.4-2.42-5.4-5.4c0-1.81,0.89-3.42,2.26-4.4C12.92,3.04,12.46,3,12,3L12,3z"></path></svg></button></div><div class="navbarSearchContainer_Bca1"><div class="navbar__search"><span aria-label="expand searchbar" role="button" class="search-icon" tabindex="0"></span><input id="search_input_react" type="search" placeholder="Loading..." aria-label="Search" class="navbar__search-input search-bar" disabled=""></div></div></div></div><div role="presentation" class="navbar-sidebar__backdrop"></div></nav><div id="__docusaurus_skipToContent_fallback" class="main-wrapper mainWrapper_z2l0"><div class="container margin-vert--lg"><div class="row"><aside class="col col--3"><nav class="sidebar_re4s thin-scrollbar" aria-label="Blog recent posts navigation"><div class="sidebarItemTitle_pO2u margin-bottom--md">All our posts</div><ul class="sidebarItemList_Yudw clean-list"><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/wayang-tensorflow">Integrating ML platforms in Wayang</a></li><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/wayang-federated-ai">Wayang and the Federated AI</a></li><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/wayang-python-api">Pywayang - Apache Wayang&#x27;s Python API</a></li><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/kafka-meets-wayang-3">Apache Kafka meets Apache Wayang - Part 3</a></li><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/wayang-vs-trino">Apache Wayang vs. Presto/Trino</a></li><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/kafka-meets-wayang-2">Apache Kafka meets Apache Wayang - Part 2</a></li><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/kafka-meets-wayang-1">Apache Kafka meets Apache Wayang - Part 1</a></li><li class="sidebarItem__DBe"><a class="sidebarItemLink_mo7H" href="/blog/website_update">Website updated</a></li></ul></nav></aside><main class="col col--7" itemscope="" itemtype="https://schema.org/Blog"><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="We are happy to announce that we have extended Wayang to be able to utilize any ML platform and any ML operators."><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/wayang-tensorflow">Integrating ML platforms in Wayang</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-05-07T00:00:00.000Z" itemprop="datePublished">May 7, 2024</time> · <!-- -->3 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/zkaoudi" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/zkaoudi" alt="Zoi Kaoudi" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/zkaoudi" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Zoi Kaoudi</span></a></div><small class="avatar__subtitle" itemprop="description">(P)PMC Apache Wayang</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><p>We are happy to announce that we have extended Wayang to be able to utilize any ML platform and any ML operators.
Thanks to the extensible nature of Wayang, the only core changes we had to do were introducing the concept of a <code>Model</code> and implement a new driver for the newly added platform.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-1-introducing-a-model">Step 1: Introducing a Model<a href="#step-1-introducing-a-model" class="hash-link" aria-label="Direct link to Step 1: Introducing a Model" title="Direct link to Step 1: Introducing a Model"></a></h2>
<p>With respect to the model, we followed Wayang’s abstraction philosophy: We created a <code>Model</code> interface to be used as input or output by Wayang operators and then extended it for the platform-specific operators. Different model interfaces can be found here:</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/model" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/model</a></p>
<p>A platform-specific model needs to be instantiated to be used as the output of a training operator and as input for an inference operator. You can see an example of the <code>SparkMLModel</code> here:</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/model/SparkMLModel.java" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/model/SparkMLModel.java</a></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-2-introducing-training-operators">Step 2: Introducing Training Operators<a href="#step-2-introducing-training-operators" class="hash-link" aria-label="Direct link to Step 2: Introducing Training Operators" title="Direct link to Step 2: Introducing Training Operators"></a></h2>
<p>We added the desired Wayang (platform-agnostic) training operators which are binary to unary operators, taking as input the X and y values and outputting a Model. You can find an example of a LinearRegressionOperator here:</p>
<p><a href="https://github.com/apache/incubator-wayang/blob/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/LinearRegressionOperator.java" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/blob/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/LinearRegressionOperator.java</a></p>
<p>Platform-specific execution operators, such as SparkLinearRegressionOperator, can be easily added as any other execution operator: extending the corresponding Wayang operator and providing the mappings from the Wayang to the execution operator. See, for example, the <code>SparkLinearRegressionOperator</code>:</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/operators/ml/SparkLinearRegressionOperator.java" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/operators/ml/SparkLinearRegressionOperator.java</a></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="step-3-introducing-prediction-operators">Step 3: Introducing Prediction Operators<a href="#step-3-introducing-prediction-operators" class="hash-link" aria-label="Direct link to Step 3: Introducing Prediction Operators" title="Direct link to Step 3: Introducing Prediction Operators"></a></h2>
<p>Additionally, we created a <code>PredictOperator</code>, a BinaryToUnary Wayang (platform-agnostic) operator which takes as input the data quanta and a model and outputs the data quanta with the predictions output by the model.</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/PredictOperator.java" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/PredictOperator.java</a></p>
<p>Then, a concrete platform-specific operator extends from the abstract one. See the <code>SparkPredictOperator</code> for an example:</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/operators/ml/SparkPredictOperator.java" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/operators/ml/SparkPredictOperator.java</a></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="deep-learning-models">Deep Learning Models<a href="#deep-learning-models" class="hash-link" aria-label="Direct link to Deep Learning Models" title="Direct link to Deep Learning Models"></a></h2>
<p>Unlike traditional machine learning models, the definition of deep learning models is more flexible. Users can combine different blocks (e.g., fully connected blocks, convolutional blocks) to build their desired models. The whole model can be represented as a graph on which the vertices represent blocks and the edges represent connections between blocks. In this case, we built a <code>DLModel</code> class that implements the <code>Model</code> interface, which contains a user-defined, platform-agnostic graph of the model:</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/model/DLModel.java" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/model/DLModel.java</a></p>
<p>For training, we implemented the platform-agnostic <code>DLModelTrainingOperator</code> Wayang operator:</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/DLTrainingOperator.java" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/DLTrainingOperator.java</a></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-ml-platform----tensorflow-integration">New ML platform -- Tensorflow Integration<a href="#new-ml-platform----tensorflow-integration" class="hash-link" aria-label="Direct link to New ML platform -- Tensorflow Integration" title="Direct link to New ML platform -- Tensorflow Integration"></a></h2>
<p>We have added Tensorflow as a new platform by creating a new module (<code>wayang-tensorflow</code>) inside the <code>wayang-platforms</code> parent module and implementing a Tensorflow driver. The TensorflowExecutor driver is responsible for creating and destroying Tensorflow resources, such as a model graph and a model parameter context. When a training task scheduled on Tensorflow, it will be mapped to TensorflowDLModelTrainingOperator. In this process, the <code>DLModel</code> will be converted to <code>TensorflowModel</code>, which means that the user-defined model graph will be converted to a Tensorflow model graph. Likewise, for inference, the <code>PredictOperator</code> will be mapped to <code>TensorflowPredictOperator</code>. All the code for the tensorflow platform can be found here:</p>
<p><a href="https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-tensorflow/src/main/java/org/apache/wayang/tensorflow" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-wayang/tree/main/wayang-platforms/wayang-tensorflow/src/main/java/org/apache/wayang/tensorflow</a></p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="acknowledgement">Acknowledgement<a href="#acknowledgement" class="hash-link" aria-label="Direct link to Acknowledgement" title="Direct link to Acknowledgement"></a></h3>
<p>The source code for the support of ML operators and the Tensorflow integration has been contributed by Mingxi Liu.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="follow-wayang">Follow Wayang<a href="#follow-wayang" class="hash-link" aria-label="Direct link to Follow Wayang" title="Direct link to Follow Wayang"></a></h3>
<p>Apache Wayang is in incubation phase and has a potential roadmap of implementations
coming soon (including the federated learning aspect as well as an SQL interface and a novel
data debugging functionality). If you want to hear or join the community, consult the link
<a href="https://wayang.apache.org/community/" target="_blank" rel="noopener noreferrer">https://wayang.apache.org/community/</a> , join the mailing lists, contribute with new ideas,
write documentation, or fix bugs.</p></div><footer class="row docusaurus-mt-lg"><div class="col"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/ml">ML</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/tensorflow">tensorflow</a></li></ul></div></footer></article><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="AI systems and applications are widely used nowadays, from assisting grammar spellings to"><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/wayang-federated-ai">Wayang and the Federated AI</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-04-17T00:00:00.000Z" itemprop="datePublished">April 17, 2024</time> · <!-- -->3 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/glauesppen" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/glauesppen" alt="Gláucia Esppenchutz" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/glauesppen" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Gláucia Esppenchutz</span></a></div><small class="avatar__subtitle" itemprop="description">(P)PMC Apache Wayang</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><p>AI systems and applications are widely used nowadays, from assisting grammar spellings to
detecting early signs of cancer cells. Building an AI requires a lot of data and training to achieve
the desired results, and federated learning is an approach to make AI training more viable.
Federated learning (or collaborative learning) is a technique that trains AI models on data
distributed across multiple serves or devices. It does so without centralizing data on a single
place or storage. It also prevents the possibility of data breaches and protects sensitive
personal data. One of the significant challenges in working with AI is the variety of tools found
in the market or the open-source community. Each tool provides results in a different form;
integrating them can be pretty challenging. Let&#x27;s talk about Apache Wayang (incubating) and
how it can help to solve this problem.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="apache-wayang-in-the-federated-ai-world">Apache Wayang in the Federated AI world<a href="#apache-wayang-in-the-federated-ai-world" class="hash-link" aria-label="Direct link to Apache Wayang in the Federated AI world" title="Direct link to Apache Wayang in the Federated AI world"></a></h2>
<p>Apache Wayang (Wayang, for short), a project in an incubation phase at Apache Software
Foundation (ASF), integrates big data platforms and tools by removing the complexity of
worrying about low-level details. Interestingly, even if it was not designed for, Wayang could
also serve as a scalable platform for federated learning: the Wayang community is starting to
work on integrating federated learning capabilities. In a federated learning approach, Wayang
would allow different local models to be built and exchange its model results across other data
centers to combine them into a single enhanced model.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-real-world-example">A real-world example<a href="#a-real-world-example" class="hash-link" aria-label="Direct link to A real-world example" title="Direct link to A real-world example"></a></h2>
<p>Let&#x27;s consider a real-world scenario. Hospitals and health organizations have increased their
investments in machine/deep learning initiatives to learn more and predict diagnostics.
However, due to legal frameworks, sharing patients&#x27; information or diagnostics is impossible,
and the solution would be to apply federated learning. To solve this problem, we could use
Wayang to help to train the models. See the diagram 1 below:</p>
<br>
<img width="75%" alt="wayang stack" src="/img/architecture/federated-ai-architecture-1.png">
<br>
<br>
<p>As a first step, the data scientists would send an ML task to Wayang, which will work as an
abstraction layer to connect to different data processing platforms, sparing the time to build
integration code for each. Then, the data platforms process and generate the results that will
be sent back to Wayang. Wayang aggregates the results into one &quot;global result&quot; and sends it
back to the requestor as a next step.</p>
<br>
<img width="75%" alt="wayang stack" src="/img/architecture/federated-ai-architecture-2.png">
<br>
<br>
<p>The process repeats until the desired results are achieved.
Although it is very much like a Federated learning pipeline, Wayang removes a considerable
layer of complexity from the developers by integrating with diverse types of data platforms. It
also brings fast development and reduces the need for a deep understanding of data
infrastructure or integrations. Developers can focus on the logic and how to execute tasks
instead of details about data processors.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="follow-wayang">Follow Wayang<a href="#follow-wayang" class="hash-link" aria-label="Direct link to Follow Wayang" title="Direct link to Follow Wayang"></a></h3>
<p>Apache Wayang is in an incubation phase and has a potential roadmap of implementations
coming soon (including the federated learning aspect as well as an SQL interface and a novel
data debugging functionality). If you want to hear or join the community, consult the link
<a href="https://wayang.apache.org/community/" target="_blank" rel="noopener noreferrer">https://wayang.apache.org/community/</a> , join the mailing lists, contribute with new ideas,
write documentation, or fix bugs.</p>
<br>
<h5 class="anchor anchorWithStickyNavbar_LWe7" id="thank-you">Thank you!<a href="#thank-you" class="hash-link" aria-label="Direct link to Thank you!" title="Direct link to Thank you!"></a></h5>
<p>I (Gláucia) want to thank professor Jorge Quiané for the guidance to write this blog post.
Thanks for incentivate me to join the project and for the knowledge shared. I will always remember you.</p></div><footer class="row docusaurus-mt-lg"><div class="col"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/federated">federated</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/ai">ai</a></li></ul></div></footer></article><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="In the vast landscape of data processing, efficiency and flexibility are"><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/wayang-python-api">Pywayang - Apache Wayang&#x27;s Python API</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-04-09T00:00:00.000Z" itemprop="datePublished">April 9, 2024</time> · <!-- -->4 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/juripetersen" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/u/43411515?v=4" alt="Juri Petersen" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/juripetersen" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Juri Petersen</span></a></div><small class="avatar__subtitle" itemprop="description">Apache Committer</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><p>In the vast landscape of data processing, efficiency and flexibility are
important. However, navigating through a multitude of tools and
languages often is a major inconvenience.
Apache Wayang&#x27;s upcoming Python API will allow you to seamlessly
orchestrate data processing tasks without ever leaving the comfort
of Python, irrespective of the underlying framework written in Java.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="expanding-apache-wayangs-apis">Expanding Apache Wayang&#x27;s APIs<a href="#expanding-apache-wayangs-apis" class="hash-link" aria-label="Direct link to Expanding Apache Wayang&#x27;s APIs" title="Direct link to Expanding Apache Wayang&#x27;s APIs"></a></h2>
<p>Apache Wayang&#x27;s architecture decouples the process of planning from the
resulting execution, allowing users to specify platform agnostic plans
through the provided APIs.</p>
<br>
<img width="75%" alt="wayang stack" src="/img/architecture/wayang-stack.png">
<br>
<br>
<p>Python&#x27;s popularity and convenience for data
processing workloads makes it an obvious candidate for a desired API.
Previous APIs, such as the Scala API <code>wayang-api-scala-java</code> benefited
from the interoperability of Java and Scala that allows to reuse objects
from other languages to provide new interfaces. Accessing JVM objects in
Python is possible through several libraries, but in doing so,
future APIs in other programming languages would need similar libraries and
implementations in order to exist. As a contrast to that, providing an
API within Apache Wayang that receives input plans from any source and
executes them within allows to create plans and submit them in any
programming language. The following figure shows the architecture of <code>pywayang</code>:</p>
<br>
<img width="75%" alt="pywayang stack" src="/img/architecture/pywayang.png">
<br>
<br>
<p>The Python API allows users to specify WayangPlans with UDFs in Python.
<code>pywayang</code> then serializes the UDFs and constructs the WayangPlan in
JSON format, preparing it to be sent to Apache Wayang&#x27;s JSON API.
When receiving a valid JSON plan, the JSON API uses the optimizer to
construct an execution plan. However, since UDFs are defined in Python
and thus need to be executed in Python as well, an operators function needs to be
wrapped into a <code>WrappedPythonFunction</code>:</p>
<div class="language-scala codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-scala codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">val</span><span class="token plain"> mapOperator </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> MapPartitionsOperator</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Input</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Output</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> MapPartitionsDescriptor</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Input</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Output</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> WrappedPythonFunction</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Input</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Output</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> ByteString</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">copyFromUtf8</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">udf</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> classOf</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Input</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> classOf</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Output</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This wrapped functional descriptor allows to handle execution of
UDFs in Python through a socket connection with the <code>pywayang</code> worker.
Input data is sourced from the platform chosen by the optimizer and Apache
Wayang handles routing the output data to the next operator.</p>
<br>
<p>A new API in any programming languages would have
to specify two things:</p>
<ul>
<li>A way to create plans that conform to a JSON format specified in the
Wayang JSON API.</li>
<li>A <code>worker</code> that handles encoding and decoding of user defined
functions (UDFs), as they need to
be executed on iterables in their respective language.
After that, the API can be added as a module in Wayang, so that
operators will be wrapped and UDFs can be executed in the desired
programming language.</li>
</ul></div><footer class="row docusaurus-mt-lg"><div class="col col--9"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/python">python</a></li></ul></div><div class="col text--right col--3"><a aria-label="Read more about Pywayang - Apache Wayang&#x27;s Python API" href="/blog/wayang-python-api"><b>Read More</b></a></div></footer></article><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="The third part of this article series is an activity log."><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/kafka-meets-wayang-3">Apache Kafka meets Apache Wayang - Part 3</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-03-10T00:00:00.000Z" itemprop="datePublished">March 10, 2024</time> · <!-- -->5 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/kamir" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/u/1241122?v=4" alt="Mirko Kämpf" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/kamir" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Mirko Kämpf</span></a></div><small class="avatar__subtitle" itemprop="description">Apache Committer</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><p>The third part of this article series is an activity log.
Motivated by the learnings from last time, I stated implementing a Kafka Source component and a Kafka Sink component for the Apache Spark platform in Apache Wayang.
In our previous article we shared the results of the work on the frist Apache Kafka integration using the Java Platform.</p>
<p>Let&#x27;s see how it goes this time with Apache Spark.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-goal-of-this-implementation">The goal of this implementation<a href="#the-goal-of-this-implementation" class="hash-link" aria-label="Direct link to The goal of this implementation" title="Direct link to The goal of this implementation"></a></h2>
<p>We want to process data from Apache Kafka topics, which are hosted on Confluent cloud.
In our example scenario, the data is available in multiple different clusters, in different regions and owned by different organizations.</p>
<p>We assume, that the operator of our job has been granted appropriate permissions, and the topic owner already provided the configuration properties, including access coordinates and credentials.</p>
<p><img decoding="async" loading="lazy" alt="images/image-1.png" src="/assets/images/image-1-9cc35d5aea2b867d7e5759a96bd02334.png" width="904" height="550" class="img_ev3q"></p>
<p>This illustration has already been introduced in part one.
We focus on <strong>Job 4</strong> in the image and start to implement it.
This time we expect the processing load to be higher so that we want to utilize the scalability capabilities of Apache Spark.</p>
<p>Again, we start with a <strong>WayangContext</strong>, as shown by examples in the Wayang code repository.</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">WayangContext wayangContext = new WayangContext().with(Spark.basicPlugin());</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>We simply switched the backend system towards Apache Spark by using the <em>WayangContext</em> with <em>Spark.basicPlugin()</em>.
The <strong>JavaPlanBuilder</strong> and all other logic of our example job won&#x27;t be touched.</p>
<p>In order to make this working we will now implement the Mappings and the Operators for the Apache Spark platform module.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="implementation-of-input--and-output-operators">Implementation of Input- and Output Operators<a href="#implementation-of-input--and-output-operators" class="hash-link" aria-label="Direct link to Implementation of Input- and Output Operators" title="Direct link to Implementation of Input- and Output Operators"></a></h2>
<p>We reuse the Kafka Source and Kafka Sink components which have been created for the JavaKafkaSource and JavaKafkaSink.
Hence we work with Wayang&#x27;s Java API.</p>
<p><strong>Level 1 – Wayang execution plan with abstract operators</strong></p>
<p>Since the <em>JavaPlanBuilder</em> already exposes the function for selecting a Kafka topic as source
and the <em>DataQuantaBuilder</em> class exposes the <em>writeKafkaTopic</em> function we can move on quickly.</p>
<p>Remember, in this API layer we use the Scala programming language, but we utilize the Java classes, implemented in the layer below.</p>
<p><strong>Level 2 – Wiring between Platform Abstraction and Implementation</strong></p>
<p>As in the case with the Java Platform, in the second layer we build a bridge between the WayangContext and the PlanBuilders, which work together with DataQuanta and the DataQuantaBuilder.</p>
<p>We must provide the mapping between the abstract components and the specific implementations in this layer.</p>
<p>Therefore, the mappings package in project <strong>wayang-platforms/wayang-spark</strong> has a class <em>Mappings</em> in which
our <em>KafkaTopicSinkMapping</em> and <em>KafkaTopicSourceMapping</em> will be registered.</p>
<p>Again, these classes allow the Apache Wayang framework to use the Java implementation of the KafkaTopicSource component (and KafkaTopicSink respectively).</p>
<p>While the Wayang execution plan uses the higher abstractions, here on the “platform level” we have to link the specific implementation for the target platform.
In this case this leads to an Apache Spark job, running on a Spark cluster which is set up by the Apache Wayang framework using the logical components of the execution plan, and the Apache Spark configuration provided at runtime.</p>
<p>A mapping links an operator implementation to the abstraction used in an execution plan.
We define two new mappings for our purpose, namely KafkaTopicSourceMapping, and KafkaTopicSinkMapping, both could be reused from last round.</p>
<p>For the Spark platform we simply replace the occurences of <em>JavaPlatform</em> with <em>SparkPlatform</em>.</p>
<p>Furthermore, we create an implementation of the <em>SparkKafkaTopicSource</em> and <em>SparkKafkaTopicSink</em>.</p>
<p><strong>Layer 3 – Input/Output Connector Layer</strong></p>
<p>Let&#x27;s quickly recap, how does Apache Spark interacts with Apache Kafka?</p>
<p>There is already an integration which gives us a DataSet using the Spark SQL framework.
For Spark Streaming, there is also a Kafka integration using the <em>SparkSession</em>&#x27;s <em>readStream()</em> function.
Kafka client properties are provided as key value pairs <em>k</em> and <em>v</em> by using the <em>option( k, v )</em> function.
For writing into a topic, we can use the <em>writeStream()</em> function.
But from a first look, it seems to be not the best fit.</p>
<p>Another approach is possible.
We can use simple RDDs to process data previously consumed from Apache Kafka.
This is a more low-level approach compared to using Datasets with Spark Structured Streaming,
and it typically involves using the Kafka RDD API provided by Spark.</p>
<p>This approach is less common with newer versions of Spark, as Structured Streaming provides a higher-level abstraction that simplifies stream processing.
However, we might need that approach for the integration with Apache Wayang.</p>
<p>For now, we will focus on the lower level approach and plan to consume data from Kafka using a Kafka client, and then
we parallelize the records in an RDD.</p>
<p>This allows us to reuse <em>KafkaTopicSource</em> and <em>KafkaTopicSink</em> classes we built last time.
Those were made specifically for a simple non parallel Java program, using one Consumer and one Producer.</p>
<p>The selected approach does not yet fully take advantage from Spark&#x27;s parallelism at load time.
For higher loads and especially for streaming processing we would have to investigate another approache, using a <em>SparkStreamingContext</em>, but this is out of scope for now.</p>
<p>Since we can&#x27;t reuse the <em>JavaKafkaTopicSource</em> and <em>JavaKafkaTopicSink</em> we rather implement <em>SparkKafkaTopicSource</em> and <em>SparkKafkaTopicSink</em> based on given <em>SparkTextFileSource</em> and <em>SparkTextFileSink</em> which both cary all needed RDD specific logic.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="summary">Summary<a href="#summary" class="hash-link" aria-label="Direct link to Summary" title="Direct link to Summary"></a></h2>
<p>As expected, the integration of Apache Spark with Apache Wayang was no magic, thanks to a fluent API design and a well structured architecture of Apache Wayang.
We could easily follow the pattern we have worked out in the previous exercise.</p>
<p>But a bunch of much more interesting work will follow next.
More testing, more serialization schemes, and Kafka Schema Registry support should follow, and full parallelization as well.</p>
<p>The code has been submitted to the Apache Wayang repository.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="outlook">Outlook<a href="#outlook" class="hash-link" aria-label="Direct link to Outlook" title="Direct link to Outlook"></a></h2>
<p>The next part of the article series will cover the real world example as described in image 1.
We will show how analysts and developers can use the Apache Kafka integration for Apache Wayang to solve cross organizational collaboration issues.
Therefore, we will bring all puzzles together, and show the full implementation of the multi organizational data collaboration use case.</p></div><footer class="row docusaurus-mt-lg"><div class="col"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/kafka">kafka</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/spark">spark</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/cross-organization-data-collaboration">cross organization data collaboration</a></li></ul></div></footer></article><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="We have been asked several times about the difference between Apache Wayang and Presto/Trino. In this blog post, we will clarify the main differences and how they impact various applications and use cases."><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/wayang-vs-trino">Apache Wayang vs. Presto/Trino</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-03-08T00:00:00.000Z" itemprop="datePublished">March 8, 2024</time> · <!-- -->3 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/zkaoudi" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/zkaoudi" alt="Zoi Kaoudi" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/zkaoudi" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Zoi Kaoudi</span></a></div><small class="avatar__subtitle" itemprop="description">(P)PMC Apache Wayang</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><p>We have been asked several times about the difference between Apache Wayang and Presto/Trino. In this blog post, we will clarify the main differences and how they impact various applications and use cases.</p></div><footer class="row docusaurus-mt-lg"><div class="col col--9"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/presto">presto</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/trino">trino</a></li></ul></div><div class="col text--right col--3"><a aria-label="Read more about Apache Wayang vs. Presto/Trino" href="/blog/wayang-vs-trino"><b>Read More</b></a></div></footer></article><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="In the second part of the article series we describe the implementation of the Kafka Source and Kafka Sink component for Apache Wayang."><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/kafka-meets-wayang-2">Apache Kafka meets Apache Wayang - Part 2</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-03-06T00:00:00.000Z" itemprop="datePublished">March 6, 2024</time> · <!-- -->6 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/kamir" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/u/1241122?v=4" alt="Mirko Kämpf" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/kamir" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Mirko Kämpf</span></a></div><small class="avatar__subtitle" itemprop="description">Apache Committer</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><p>In the second part of the article series we describe the implementation of the Kafka Source and Kafka Sink component for Apache Wayang.
We look into the “Read- and Write-Path” for our data items, called <em>DataQuanta</em>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="apache-wayangs-read--write-path-for-kafka-topics">Apache Wayang’s Read &amp; Write Path for Kafka topics<a href="#apache-wayangs-read--write-path-for-kafka-topics" class="hash-link" aria-label="Direct link to Apache Wayang’s Read &amp; Write Path for Kafka topics" title="Direct link to Apache Wayang’s Read &amp; Write Path for Kafka topics"></a></h2>
<p>To describe the read and write paths for data in the context of the created Apache Wayang code snippet, the primary classes and interfaces we need to understand are as follows:</p>
<p><strong>WayangContext:</strong> This class is essential for initializing the Wayang processing environment.
It allows you to configure the execution environment and register plugins that define which platforms Wayang can use for data processing tasks, such as <em>Java.basicPlugin()</em> for local Java execution.</p>
<p><strong>JavaPlanBuilder:</strong> This class is used to build and define the data processing pipeline (or plan) in Wayang.
It provides a fluent API to specify the operations to be performed on the data, from reading the input to processing it and writing the output.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="read-path">Read Path<a href="#read-path" class="hash-link" aria-label="Direct link to Read Path" title="Direct link to Read Path"></a></h3>
<p>The read path describes how data is ingested from a source into the Wayang processing pipeline:</p>
<p><em>Reading from Kafka Topic:</em> The method <em>readKafkaTopic(topicName)</em> is used to ingest data from a specified Kafka topic.
This is the starting point of the data processing pipeline, where topicName represents the name of the Kafka topic from which data is read.</p>
<p><em>Data Tokenization and Preparation:</em> Once the data is read from Kafka, it undergoes several transformations such as Splitting, Filtering, and Mapping.
What follows are the procedures known as Reducing, Grouping, Co-Grouping, and Counting.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="write-path">Write Path<a href="#write-path" class="hash-link" aria-label="Direct link to Write Path" title="Direct link to Write Path"></a></h3>
<p><em>Writing to Kafka Topic:</em> The final step in the pipeline involves writing the processed data back to a Kafka topic using <em>.writeKafkaTopic(...)</em>.
This method takes parameters that specify the target Kafka topic, a serialization function to format the data as strings, and additional configuration for load profile estimation, which optimizes the writing process.</p>
<p>This read-write path provides a comprehensive flow of data from ingestion from Kafka, through various processing steps, and finally back to Kafka, showcasing a full cycle of data processing within Apache Wayang&#x27;s abstracted environment and is implemented in our example program shown in <em>listing 1</em>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="implementation-of-input--and-output-operators">Implementation of Input- and Output Operators<a href="#implementation-of-input--and-output-operators" class="hash-link" aria-label="Direct link to Implementation of Input- and Output Operators" title="Direct link to Implementation of Input- and Output Operators"></a></h2>
<p>The next section shows how a new pair of operators can be implemented to extend Apache Wayang’s capabilities on the input and output side.
We created the Kafka Source and Kafka Sink components so that our cross organizational data collaboration scenario can be implemented using data streaming infrastructure.</p>
<p><strong>Level 1 – Wayang execution plan with abstract operators</strong></p>
<p>The implementation of our Kafka Source and Kafka Sink components for Apache Wayang requires new methods and classes on three layers.
First of all in the API package.
Here we use the JavaPlanBuilder to expose the function for selecting a Kafka topic as the source to be used by client.<br>
<!-- -->The class <em>JavaPlanBuilder</em> in package <em>org.apache.wayang.api</em> in the project <em>wayang-api/wayang-api-scala-java</em> exposes our new functionality to our external client.
An instance of the JavaPlanBuilder is used to define the data processing pipeline.
We use its <em>readKafkaTopic()</em> which specifies the source Kafka topic to read from, and for the write path we use the <em>writeKafkaTopic()</em> method.
Both Methods do only trigger activities in the background.</p>
<p>For the output side, we use the <em>DataQuantaBuilder</em> class, which offers an implementation of the writeKafkaTopic function.
This function is designed to send processed data, referred to as DataQuanta, to a specified Kafka topic.
Essentially, it marks the final step in a data processing sequence constructed using the Apache Wayang framework.</p>
<p>In the DataQuanta class we implemented the methods writeKafkaTopic and writeKafkaTopicJava which use the KafkaTopicSink class.
In this API layer we use the Scala programming language, but we utilize the Java classes, implemented in the layer below.</p>
<p><strong>Level 2 – Wiring between Platform Abstraction and Implementation</strong></p>
<p>The second layer builds the bridge between the WayangContext and PlanBuilders which work together with DataQuanta and the DataQuantaBuilder.</p>
<p>Also, the mapping between the abstract components and the specific implementations are defined in this layer.</p>
<p>Therefore, the mappings package has a class <em>Mappings</em> in which all relevant input and output operators are listed.
We use it to register the KafkaSourceMapping and a KafkaSinkMapping for the particular platform, Java in our case.
These classes allow the Apache Wayang framework to use the Java implementation of the KafkaTopicSource component (and KafkaTopicSink respectively).
While the Wayang execution plan uses the higher abstractions, here on the “platform level” we have to link the specific implementation for the target platform.
In our case this leads to a Java program running on a JVM which is set up by the Apache Wayang framework using the logical components of the execution plan.</p>
<p>Those mappings link the real implementation of our operators the ones used in an execution plan.
The JavaKafkaTopicSource and the JavaKafkaTopicSink extend the KafkaTopicSource and KafkaTopicSink so that the lower level implementation of those classes become available within Wayang’s Java Platform context.</p>
<p>In this layer, the KafkaConsumer class and the KafkaProducer class are used, but both are configured and instantiated in the next layer underneath.
All this is done in the project <em>wayang-plarforms/wayang-java</em>.</p>
<p><strong>Layer 3 – Input/Output Connector Layer</strong></p>
<p>The <em>KafkaTopicSource</em> and <em>KafkaTopicSink</em> classes build the third layer of our implementation.
Both are implemented in Java programming language.
In this layer, the real Kafka-Client logic is defined.
Details about consumer and producers, client configuration, and schema handling have to be handled here.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="summary">Summary<a href="#summary" class="hash-link" aria-label="Direct link to Summary" title="Direct link to Summary"></a></h2>
<p>Both classes in the third layer implement the Kafka client logic which is needed by the Wayang-execution plan when external data flows should be established.
The layer above handles the mapping of the components at startup time.
All this wiring is needed to keep Wayang open and flexible so that multiple external systems can be used in a variety of combinations and using multiple target platforms in combinations.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="outlook">Outlook<a href="#outlook" class="hash-link" aria-label="Direct link to Outlook" title="Direct link to Outlook"></a></h2>
<p>The next part of the article series will cover the creation of an Kafka Source and Sink component for the Apache Spark platform, which allows our use case to scale.
Finally, in part four we bring all puzzles together, and show the full implementation of the multi organizational data collaboration use case.</p></div><footer class="row docusaurus-mt-lg"><div class="col"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/kafka">kafka</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/cross-organization-data-collaboration">cross organization data collaboration</a></li></ul></div></footer></article><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="Intro"><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/kafka-meets-wayang-1">Apache Kafka meets Apache Wayang - Part 1</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-03-05T00:00:00.000Z" itemprop="datePublished">March 5, 2024</time> · <!-- -->4 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/kamir" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/u/1241122?v=4" alt="Mirko Kämpf" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/kamir" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Mirko Kämpf</span></a></div><small class="avatar__subtitle" itemprop="description">Apache Committer</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><h2 class="anchor anchorWithStickyNavbar_LWe7" id="intro">Intro<a href="#intro" class="hash-link" aria-label="Direct link to Intro" title="Direct link to Intro"></a></h2>
<p>This article is the first of a four part series about federated data analysis using Apache Wayang.
The first article starts with an introduction of a typical data colaboration scenario which will emerge in our digital future.</p>
<p>In part two and three we will share a summary of our Apache Kafka client implementation for Apache Wayang.
We started with the Java Platform (part 2) and the Apache Spark implementation follows (W.I.P.) in part three.</p>
<p>The use case behind this work is an imaginary data collaboration scenario.
We see this example and the demand for a solution already in many places.<br>
<!-- -->For us this is motivation enough to propose a solution.
This would also allow us to do more local data processing, and businesses can stop moving data around the world, but rather care about data locality while they expose and share specific information to others by using data federation.
This reduces complexity of data management and cost dramatically.</p>
<p>For this purpose, we illustrate a cross organizational data sharing scenario from the finance sector soon.
This analysis pattern will also be relevant in the context of data analysis along supply chains, another typical example where data from many stakeholder together is needed but never managed in one place, for good reasons.</p>
<p>Data federation can help us to unlock the hidden value of all those isolated data lakes.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-cross-organizational-data-sharing-scenario">A cross organizational data sharing scenario<a href="#a-cross-organizational-data-sharing-scenario" class="hash-link" aria-label="Direct link to A cross organizational data sharing scenario" title="Direct link to A cross organizational data sharing scenario"></a></h2>
<p>Our goal is the implementation of a cross organization decentralized data processing scenario, in which protected local data should be processed in combination with public data from public sources in a collaborative manner.
Instead of copying all data into a central data lake or a central data platform we decided to use federated analytics.
Apache Wayang is the tool we work with.
In our case, the public data is hosted on publicly available websites or data pods.
A client can use the HTTP(S) protocol to read the data which is given in a well defined format.
For simplicity we decided to use CSV format.
When we look into the data of each participant we have a different perspective.</p>
<p>Our processing procedure should calculate a particular metric on the <em>local data</em> of each participant.
An example of such a metric is the average spending of all users on a particular product category per month.
This can vary from partner to partner, hence, we want to be able to calculate a peer-group comparison so that each partner can see its own metric compared with a global average calculated from contributions by all partners.
Such a process requires global averaging and local averaging.
And due to governance constraints, we can’t bring all raw data together in one place.</p>
<p>Instead, we want to use Apache Wayang for this purpose.
We simplify the procedure and split it into two phases.
Phase one is the process, which allows each participant to calculate the local metrics.
This requires only local data. The second phase requires data from all collaborating partners.
The monthly sum and counter values per partner and category are needed in one place by all other parties.
Hence, the algorithm of the first phase stores the local results locally, and the contributions to the global results in an externally accessible Kafka topic.
We assume this is done by each of the partners.</p>
<p>Now we have a scenario, in which an Apache Wayang process must be able to read data from multiple Apache Kafka topics from multiple Apache Kafka clusters but finally writes into a single Kafka topic, which then can be accessed by all the participating clients.</p>
<p><img decoding="async" loading="lazy" alt="images/image-1.png" src="/assets/images/image-1-9cc35d5aea2b867d7e5759a96bd02334.png" width="904" height="550" class="img_ev3q"></p>
<p>The illustration shows the data flows in such a scenario.
Jobs with red border are executed by the participants in isolation within their own data processing environments.
But they share some of the data, using publicly accessible Kafka topics, marked by A. Job 4 is the Apache Wayang job in our focus: here we intent to read data from 3 different source systems, and write results into a fourth system (marked as B), which can be accesses by all participants again.</p>
<p>With this in mind we want to implement an Apache Wayang application which implements the illustrated <em>Job 4</em>.
Since as of today, there is now <em>KafkaSource</em> and <em>KafkaSink</em> available in Apache Wayang, an implementation of both will be our first step.
Our assumption is, that in the beginning, there won’t be much data.</p>
<p>Apache Spark is not required to cope with the load, but we expect, that in the future, a single Java application would not be able to handle our workload.
Hence, we want to utilize the Apache Wayang abstraction over multiple processing platforms, starting with Java.
Later, we want to switch to Apache Spark.</p></div><footer class="row docusaurus-mt-lg"><div class="col"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/kafka">kafka</a></li><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/cross-organization-data-collaboration">cross organization data collaboration</a></li></ul></div></footer></article><article class="margin-bottom--xl" itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="We&#x27;re updated our website and use now Docusaurus."><header><h2 class="title_f1Hy" itemprop="headline"><a itemprop="url" href="/blog/website_update">Website updated</a></h2><div class="container_mt6G margin-vert--md"><time datetime="2024-01-25T00:00:00.000Z" itemprop="datePublished">January 25, 2024</time> · <!-- -->One min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--6 authorCol_Hf19"><div class="avatar margin-bottom--sm"><a href="https://github.com/2pk03" target="_blank" rel="noopener noreferrer" class="avatar__photo-link"><img class="avatar__photo" src="https://avatars.githubusercontent.com/u/1323575?v=4" alt="Alexander Alten" itemprop="image"></a><div class="avatar__intro" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><div class="avatar__name"><a href="https://github.com/2pk03" target="_blank" rel="noopener noreferrer" itemprop="url"><span itemprop="name">Alexander Alten</span></a></div><small class="avatar__subtitle" itemprop="description">(P)PMC Apache Wayang</small></div></div></div></div></header><div class="markdown" itemprop="articleBody"><p>We&#x27;re updated our website and use now Docusaurus.</p></div><footer class="row docusaurus-mt-lg"><div class="col col--9"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/wayang">wayang</a></li></ul></div><div class="col text--right col--3"><a aria-label="Read more about Website updated" href="/blog/website_update"><b>Read More</b></a></div></footer></article><nav class="pagination-nav" aria-label="Blog list page navigation"></nav></main></div></div></div><footer class="footer footer--dark"><div class="container container-fluid"><div class="row footer__links"><div class="col footer__col"><div class="footer__title">Community</div><ul class="footer__items clean-list"><li class="footer__item"><a href="https://lists.apache.org/list.html?dev@wayang.apache.org" target="_blank" rel="noopener noreferrer" class="footer__link-item">Mailing list<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li><li class="footer__item"><a href="https://www.youtube.com/@apachewayang" target="_blank" rel="noopener noreferrer" class="footer__link-item">YouTube<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li><li class="footer__item"><a href="https://www.linkedin.com/company/apachewayang" target="_blank" rel="noopener noreferrer" class="footer__link-item">LinkedIn<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li><li class="footer__item"><a href="https://www.reddit.com/r/ApacheWayang" target="_blank" rel="noopener noreferrer" class="footer__link-item">Reddit<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li><li class="footer__item"><a href="https://twitter.com/apachewayang" target="_blank" rel="noopener noreferrer" class="footer__link-item">Twitter<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li></ul></div><div class="col footer__col"><div class="footer__title">Docs</div><ul class="footer__items clean-list"><li class="footer__item"><a class="footer__link-item" href="/docs/start/download">Install</a></li><li class="footer__item"><a class="footer__link-item" href="/docs/introduction/features">Features</a></li><li class="footer__item"><a class="footer__link-item" href="/docs/introduction/benchmark">Benchmark</a></li></ul></div><div class="col footer__col"><div class="footer__title">Repositories</div><ul class="footer__items clean-list"><li class="footer__item"><a href="https://github.com/apache/incubator-wayang" target="_blank" rel="noopener noreferrer" class="footer__link-item">Wayang<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li><li class="footer__item"><a href="https://github.com/apache/incubator-wayang-website" target="_blank" rel="noopener noreferrer" class="footer__link-item">Website<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li></ul></div></div><div class="footer__bottom text--center"><div class="margin-bottom--sm"><a href="https://incubator.apache.org/" rel="noopener noreferrer" class="footerLogoLink_BH7S"><img src="/img/apache-incubator.svg" alt="Apache Incubator logo" class="footer__logo themedComponent_mlkZ themedComponent--light_NVdE" width="200"><img src="/img/apache-incubator.svg" alt="Apache Incubator logo" class="footer__logo themedComponent_mlkZ themedComponent--dark_xIcU" width="200"></a></div><div class="footer__copyright"><div>
<p> Apache Wayang is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. </p>
<p>
Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0. <br>
Apache, the names of Apache projects, and the feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
</p>
</div></div></div></div></footer></div>
</body>
</html>