 <!--
  ▄▄▄       ██▓███   ▄▄▄       ▄████▄   ██░ ██ ▓█████     ██▓  ▄████  ███▄    █  ██▓▄▄▄█████▓▓█████
 ▒████▄    ▓██░  ██▒▒████▄    ▒██▀ ▀█  ▓██░ ██▒▓█   ▀    ▓██▒ ██▒ ▀█▒ ██ ▀█   █ ▓██▒▓  ██▒ ▓▒▓█   ▀
 ▒██  ▀█▄  ▓██░ ██▓▒▒██  ▀█▄  ▒▓█    ▄ ▒██▀▀██░▒███      ▒██▒▒██░▄▄▄░▓██  ▀█ ██▒▒██▒▒ ▓██░ ▒░▒███
 ░██▄▄▄▄██ ▒██▄█▓▒ ▒░██▄▄▄▄██ ▒▓▓▄ ▄██▒░▓█ ░██ ▒▓█  ▄    ░██░░▓█  ██▓▓██▒  ▐▌██▒░██░░ ▓██▓ ░ ▒▓█  ▄
  ▓█   ▓██▒▒██▒ ░  ░ ▓█   ▓██▒▒ ▓███▀ ░░▓█▒░██▓░▒████▒   ░██░░▒▓███▀▒▒██░   ▓██░░██░  ▒██▒ ░ ░▒████▒
  ▒▒   ▓▒█░▒▓▒░ ░  ░ ▒▒   ▓▒█░░ ░▒ ▒  ░ ▒ ░░▒░▒░░ ▒░ ░   ░▓   ░▒   ▒ ░ ▒░   ▒ ▒ ░▓    ▒ ░░   ░░ ▒░ ░
   ▒   ▒▒ ░░▒ ░       ▒   ▒▒ ░  ░  ▒    ▒ ░▒░ ░ ░ ░  ░    ▒ ░  ░   ░ ░ ░░   ░ ▒░ ▒ ░    ░     ░ ░  ░
   ░   ▒   ░░         ░   ▒   ░         ░  ░░ ░   ░       ▒ ░░ ░   ░    ░   ░ ░  ▒ ░  ░         ░
       ░  ░               ░  ░░ ░       ░  ░  ░   ░  ░    ░        ░          ░  ░              ░  ░
 -->

 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 <!DOCTYPE html>
 <html lang="en">
 <head>
     <link rel="canonical" href="https://ignite.apache.org/use-cases/spark-acceleration.html"/>
     <meta charset="utf-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">

     <meta name="description"
           content="Apache Ignite integrates with Apache Spark to accelerate the performance of Spark applications
           and APIs by keeping data in a shared in-memory cluster."/>

     <title>Apache Spark Performance Acceleration</title>

     <!--#include virtual="/includes/styles.html" -->


 </head>
 <body>
 <!--#include virtual="/includes/header.html" -->
 <article>
     <header>
         <div class="container">
             <h1>Apache Spark <strong>Performance Acceleration</strong></h1>
         </div>
     </header>
     <div class="container">
         <p>
             The performance of Apache Spark® applications can be accelerated by keeping data in a shared
             Apache Ignite® in-memory cluster. Spark works with Ignite as a data source, similar to how it uses Hadoop
             or a relational database. You can start an Ignite cluster, set it as a data source for Spark workers, and
             continue using the Spark RDD or DataFrame APIs. You can gain even more speed by running Ignite SQL or
             compute APIs directly on the Spark dataset. Ignite can also be used as a distributed in-memory layer by
             Spark workers that need to share both data and state.
         </p>
         <img class="img-fluid diagram-right" alt="Apache Spark Performance Acceleration" src="/images/svg-diagrams/spark_acceleration.svg"/>


         <p>
             The performance increase is achievable for several reasons. First, Ignite is designed to store data sets
             in memory across a cluster of nodes, which reduces the latency of Spark operations that usually need to
             pull data from disk-based systems. Second, Ignite tries to minimize data shuffling over the network
             between its store and Spark applications by running certain Spark tasks, produced by the RDD or
             DataFrame APIs, in place on Ignite nodes. This optimization helps to reduce the effect of network
             latency on the performance of Spark calls. Finally, the network impact can be further reduced if native
             Ignite APIs, such as SQL, are called from Spark applications directly. Doing so can eliminate
             data shuffling between Spark and Ignite because Ignite SQL queries are always executed on
             Ignite nodes and return a much smaller final result set to the application layer.
         </p>
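
         <p>
             The last point can be illustrated with a toy model. The sketch below uses plain Python lists as
             stand-ins for Ignite nodes and does not call any Ignite or Spark API; the function names and the
             sample data are hypothetical. It compares how many items cross the "network" when a filtered sum is
             computed in the application versus on the nodes that hold the data.
         </p>

```python
def make_cluster(rows, node_count):
    """Spread rows across node_count in-memory partitions by key (toy partitioning)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row["id"] % node_count].append(row)
    return nodes


def sum_in_application(nodes, city):
    """Pull every row to the application, then filter and aggregate there."""
    pulled = [row for node in nodes for row in node]  # every row is transferred
    total = sum(row["amount"] for row in pulled if row["city"] == city)
    return total, len(pulled)


def sum_on_nodes(nodes, city):
    """Filter and aggregate on each node; only one partial sum per node is transferred."""
    partials = [
        sum(row["amount"] for row in node if row["city"] == city)
        for node in nodes
    ]
    return sum(partials), len(partials)


# Hypothetical sample data: 1000 rows, every fifth one belongs to "Paris".
rows = [
    {"id": i, "city": "Paris" if i % 5 == 0 else "Rome", "amount": i}
    for i in range(1000)
]
nodes = make_cluster(rows, node_count=4)

app_total, app_transferred = sum_in_application(nodes, "Paris")
node_total, node_transferred = sum_on_nodes(nodes, "Paris")

print(app_total, app_transferred)    # 99500 1000
print(node_total, node_transferred)  # 99500 4
```

         <p>
             Both strategies return the same total, but the node-side version transfers 4 partial results instead
             of 1000 rows. A real cluster adds serialization, scheduling, and query planning costs that this model
             ignores.
         </p>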

         <h2>Ignite Shared RDDs</h2>
         <p>
             Apache Ignite provides an implementation of the Spark RDD, which allows any data and state to be shared
             in memory as RDDs across Spark jobs. The Ignite RDD provides a shared, mutable view of the data stored
             in Ignite caches across different Spark jobs, workers, or applications.
         </p>

         <p>
             The Ignite RDD is implemented as a view over a distributed Ignite table (also known as a cache). It can
             be deployed with an Ignite node within the Spark job executing process, on a Spark worker, or in a
             separate Ignite cluster. This means that, depending on the chosen deployment mode, the shared state may
             either exist only during the lifespan of a Spark application (embedded mode) or outlive the Spark
             application (standalone mode).
         </p>
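
         <p>
             The difference between the two deployment modes can be sketched with a small model. The classes below
             are hypothetical stand-ins written in plain Python, not the Ignite RDD API: <code>ToyStore</code> plays
             the role of the cluster, and <code>SharedView</code> plays the role of a mutable view over one named
             cache. In the embedded case the store is created inside the application and disappears with it; in the
             standalone case the store is passed in from outside and survives.
         </p>

```python
class ToyStore:
    """Hypothetical stand-in for a set of named caches; not an Ignite class."""

    def __init__(self):
        self.caches = {}


class SharedView:
    """Mutable view over one named cache, shared by every job that opens it."""

    def __init__(self, store, cache_name):
        self._data = store.caches.setdefault(cache_name, {})

    def save_pairs(self, pairs):
        self._data.update(pairs)

    def collect(self):
        return sorted(self._data.items())


def run_application(jobs, standalone_store=None):
    """Run jobs inside one application.

    Embedded mode (no store passed in): a private store is created here and
    is discarded when the function returns.
    Standalone mode: the external store is reused, so its state persists.
    """
    store = standalone_store if standalone_store is not None else ToyStore()
    return [job(SharedView(store, "shared")) for job in jobs]


def writer(view):
    view.save_pairs({i: i * i for i in range(1, 4)})
    return len(view.collect())


def reader(view):
    return view.collect()


# Embedded mode: jobs in the same application share state...
embedded_first = run_application([writer, reader])
# ...but a later application starts from an empty store.
embedded_second = run_application([reader])

# Standalone mode: the store outlives each application.
cluster = ToyStore()
run_application([writer], standalone_store=cluster)
standalone_second = run_application([reader], standalone_store=cluster)

print(embedded_first)     # [3, [(1, 1), (2, 4), (3, 9)]]
print(embedded_second)    # [[]]
print(standalone_second)  # [[(1, 1), (2, 4), (3, 9)]]
```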

         <h2>Ignite DataFrames</h2>
         <p>
             The Apache Spark DataFrame API introduced the concept of a schema to describe the data,
             allowing Spark to manage the schema and organize the data into a tabular format. To put it simply,
             a DataFrame is a distributed collection of data organized into named columns. It is conceptually
             equivalent to a table in a relational database and allows Spark to leverage the Catalyst query
             optimizer to produce much more efficient query execution plans in comparison to RDDs, which are
             collections of elements partitioned across the nodes of the cluster.
         </p>
         <p>
             Ignite supports the DataFrame APIs, allowing Spark to write to and read from Ignite through that
             interface. Furthermore, Ignite analyses execution plans produced by Spark's Catalyst engine and can
             execute parts of the plan on Ignite nodes directly, which reduces data shuffling and consequently makes
             your Spark SQL queries perform better.
         </p>
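
         <p>
             The idea of executing only part of a plan next to the data can be sketched as follows. This is a
             simplified illustration in plain Python, not how Catalyst or Ignite represent plans: the plan is a flat
             list of steps, and the set of steps the store can run (<code>PUSHABLE</code>) is an assumption made for
             the example. Leading steps that the store understands run there; the remainder runs in the application
             on the rows that were transferred.
         </p>

```python
# Assumption for this toy model: the store can evaluate these step types itself.
PUSHABLE = {"filter_eq", "project"}


def split_plan(plan):
    """Split a linear plan into the leading steps the store can run and the rest."""
    cut = 0
    while cut != len(plan) and plan[cut][0] in PUSHABLE:
        cut += 1
    return plan[:cut], plan[cut:]


def apply_step(rows, step):
    """Apply one plan step to a list of row dictionaries."""
    op, arg = step
    if op == "filter_eq":
        column, value = arg
        return [row for row in rows if row[column] == value]
    if op == "project":
        return [{column: row[column] for column in arg} for row in rows]
    if op == "map":
        return [arg(row) for row in rows]
    raise ValueError("unknown step: " + op)


def execute(table, plan):
    """Run pushable steps at the store, then the remaining steps in the application."""
    pushed, remaining = split_plan(plan)
    rows = table
    for step in pushed:       # executed where the data lives
        rows = apply_step(rows, step)
    transferred = len(rows)   # only these rows travel to the application
    for step in remaining:    # executed in the application
        rows = apply_step(rows, step)
    return rows, transferred


# Hypothetical table of nine people with ages 20, 21, or 22.
table = [{"id": i, "name": "p%d" % i, "age": 20 + i % 3} for i in range(9)]
plan = [
    ("filter_eq", ("age", 21)),
    ("project", ["name"]),
    ("map", lambda row: row["name"].upper()),  # arbitrary code: not pushable
]

result, transferred = execute(table, plan)
print(result)       # ['P1', 'P4', 'P7']
print(transferred)  # 3
```

         <p>
             Of the nine stored rows, only the three that pass the filter leave the store, and they carry a single
             column each. The arbitrary <code>map</code> function stays in the application because the store has
             no equivalent for it.
         </p>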


         <div class="jumbotron jumbotron-fluid">
             <div class="container">
                 <div class="title display-6">Learn More</div>
                 <hr class="my-4">
                 <div class="row">
                     <div class="col-sm-6">
                         <ul>
                             <li>
                                 <a href="https://apacheignite-fs.readme.io/docs/installation-deployment" target="docs">
                                     Ignite and Spark Installation and Deployment <i
                                         class="fa fa-angle-double-right"></i>
                                 </a>
                             </li>
                             <li>
                                 <a href="https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd" target="docs">
                                     Ignite RDDs in Detail <i class="fa fa-angle-double-right"></i>
                                 </a>
                             </li>
                         </ul>
                     </div>
                     <div class="col-sm-6">
                         <ul>
                             <li>
                                 <a href="https://apacheignite-fs.readme.io/docs/ignite-data-frame" target="docs">
                                     Ignite DataFrames in Detail <i class="fa fa-angle-double-right"></i>
                                 </a>
                             </li>
                             <li>

                                 <a href="/use-cases/digital-integration-hub.html">
                                     Ignite as a Digital Integration Hub <i class="fa fa-angle-double-right"></i>
                                 </a>

                             </li>
                         </ul>
                     </div>
                 </div>
             </div>
         </div>

     </div>

 </article>
 <!--#include virtual="/includes/footer.html" -->
 <!--#include virtual="/includes/scripts.html" -->
 </body>
 </html>