<!--
▄▄▄ ██▓███ ▄▄▄ ▄████▄ ██░ ██ ▓█████ ██▓ ▄████ ███▄ █ ██▓▄▄▄█████▓▓█████
▒████▄ ▓██░ ██▒▒████▄ ▒██▀ ▀█ ▓██░ ██▒▓█ ▀ ▓██▒ ██▒ ▀█▒ ██ ▀█ █ ▓██▒▓ ██▒ ▓▒▓█ ▀
▒██ ▀█▄ ▓██░ ██▓▒▒██ ▀█▄ ▒▓█ ▄ ▒██▀▀██░▒███ ▒██▒▒██░▄▄▄░▓██ ▀█ ██▒▒██▒▒ ▓██░ ▒░▒███
░██▄▄▄▄██ ▒██▄█▓▒ ▒░██▄▄▄▄██ ▒▓▓▄ ▄██▒░▓█ ░██ ▒▓█ ▄ ░██░░▓█ ██▓▓██▒ ▐▌██▒░██░░ ▓██▓ ░ ▒▓█ ▄
▓█ ▓██▒▒██▒ ░ ░ ▓█ ▓██▒▒ ▓███▀ ░░▓█▒░██▓░▒████▒ ░██░░▒▓███▀▒▒██░ ▓██░░██░ ▒██▒ ░ ░▒████▒
▒▒ ▓▒█░▒▓▒░ ░ ░ ▒▒ ▓▒█░░ ░▒ ▒ ░ ▒ ░░▒░▒░░ ▒░ ░ ░▓ ░▒ ▒ ░ ▒░ ▒ ▒ ░▓ ▒ ░░ ░░ ▒░ ░
▒ ▒▒ ░░▒ ░ ▒ ▒▒ ░ ░ ▒ ▒ ░▒░ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░░ ░ ▒░ ▒ ░ ░ ░ ░ ░
░ ▒ ░░ ░ ▒ ░ ░ ░░ ░ ░ ▒ ░░ ░ ░ ░ ░ ░ ▒ ░ ░ ░
░ ░ ░ ░░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░
-->
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE html>
<html lang="en">
<head>
<link rel="canonical" href="https://ignite.apache.org/use-cases/spark-acceleration.html"/>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description"
content="Apache Ignite integrates with Apache Spark to accelerate the performance of Spark applications
and APIs by keeping data in a shared in-memory cluster."/>
<title>Apache Spark Performance Acceleration</title>
<!--#include virtual="/includes/styles.html" -->
</head>
<body>
<!--#include virtual="/includes/header.html" -->
<article>
<header>
<div class="container">
<h1>Apache Spark <strong>Performance Acceleration</strong></h1>
</div>
</header>
<div class="container">
<p>
The performance of Apache Spark® applications can be accelerated by keeping data in a shared
Apache Ignite® in-memory cluster. Spark uses Ignite as a data source, much as it uses Hadoop or a
relational database. You can start an Ignite cluster, set it as a data source for Spark workers, and
continue using the Spark RDD or DataFrame APIs. You can gain even more speed by running Ignite SQL or
compute APIs directly on the Spark dataset. Spark workers that need to share both data and state
can also use Ignite as a distributed in-memory layer.
</p>
<img class="img-fluid diagram-right" alt="Apache Spark Performance Acceleration" src="/images/svg-diagrams/spark_acceleration.svg"/>
<p>
The performance gain comes from three sources. First, Ignite is designed to store data sets
in memory across a cluster of nodes, which reduces the latency of Spark operations that would otherwise
pull data from disk-based systems. Second, Ignite tries to minimize data shuffling over the network between its
store and Spark applications by running certain Spark tasks, produced by the RDD or DataFrame APIs,
in place on Ignite nodes. This optimization reduces the effect of network latency on the
performance of Spark calls. Finally, the network impact can be reduced further if native
Ignite APIs, such as SQL, are called from Spark applications directly. Because Ignite SQL queries
always execute on Ignite nodes and return a much smaller final result set to the application layer,
this approach eliminates data shuffling between Spark and Ignite.
</p>
<h2>Ignite Shared RDDs</h2>
<p>
Apache Ignite provides an implementation of the Spark RDD, which allows any data and state to be shared
in memory as RDDs across Spark jobs. The Ignite RDD provides a shared, mutable view of the data stored
in Ignite caches across different Spark jobs, workers, or applications.
</p>
<p>
The Ignite RDD is implemented as a view over a distributed Ignite table (also known as a cache). It can be deployed
with an Ignite node either within the Spark job executing process, on a Spark worker, or in a separate
Ignite cluster. Depending on the chosen deployment mode, the shared state may therefore either
exist only during the lifespan of a Spark application (embedded mode) or outlive the Spark
application (standalone mode).
</p>
<h2>Ignite DataFrames</h2>
<p>
The Apache Spark DataFrame API introduced the concept of a schema to describe the data,
allowing Spark to manage the schema and organize the data into a tabular format. To put it simply,
a DataFrame is a distributed collection of data organized into named columns. It is conceptually
equivalent to a table in a relational database and allows Spark to leverage the Catalyst query
optimizer to produce much more efficient query execution plans in comparison to RDDs, which are
collections of elements partitioned across the nodes of the cluster.
</p>
<p>
Ignite supports the DataFrame APIs, allowing Spark to read from and write to Ignite through that interface.
Furthermore, Ignite analyses execution plans produced by Spark's Catalyst engine and can execute
parts of a plan directly on Ignite nodes, which reduces data shuffling and, consequently, improves the
performance of your Spark SQL queries.
</p>
<div class="jumbotron jumbotron-fluid">
<div class="container">
<div class="title display-6">Learn More</div>
<hr class="my-4">
<div class="row">
<div class="col-sm-6">
<ul>
<li>
<a href="https://apacheignite-fs.readme.io/docs/installation-deployment" target="docs">
Ignite and Spark Installation and Deployment <i
class="fa fa-angle-double-right"></i>
</a>
</li>
<li>
<a href="https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd" target="docs">
Ignite RDDs in Detail <i class="fa fa-angle-double-right"></i>
</a>
</li>
</ul>
</div>
<div class="col-sm-6">
<ul>
<li>
<a href="https://apacheignite-fs.readme.io/docs/ignite-data-frame" target="docs">
Ignite DataFrames in Detail <i class="fa fa-angle-double-right"></i>
</a>
</li>
<li>
<a href="/use-cases/digital-integration-hub.html">
Ignite as a Digital Integration Hub <i class="fa fa-angle-double-right"></i>
</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</article>
<!--#include virtual="/includes/footer.html" -->
<!--#include virtual="/includes/scripts.html" -->
</body>
</html>