layout: post title: “Ballista: A Distributed Scheduler for Apache Arrow” description: “We are excited to announce that Ballista has been donated to the Apache Arrow project. Ballista is a distributed scheduler for the Rust implementation of Apache Arrow.” date: “2021-04-12 00:00:00 -0600” author: agrove categories: [application]

We are excited to announce that Ballista has been donated to the Apache Arrow project.

Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

Ballista can be deployed as a standalone cluster and also supports Kubernetes. In either case, the scheduler can be configured to use etcd as a backing store to (eventually) provide redundancy in the case of a scheduler failing.

Status

The Ballista project is at an early stage of development. However, it is capable of running complex analytics queries in a distributed cluster with reasonable performance.

The following chart shows the current performance of Ballista for a number of TPC-H queries at scale factor 100. The current performance at this scale is similar to Apache Spark, and better in some cases. More work is now needed to make Ballista scale with larger data sets.

Ballista Performance

Benchmarks were executed on a 24 core desktop with 64 GB RAM and NVMe drives. The 100 GB dataset was converted to Parquet and repartitioned with 8 partitions.

One of the benefits of Ballista being part of the Arrow codebase is that there is now an opportunity to push parts of the scheduler down to DataFusion so that is possible to seamlessly scale across cores in DataFusion, and across nodes in Ballista, using the same unified query scheduler.

Contributors Welcome!

If you are excited about being able to use Rust for distributed compute and ETL and would like to contribute to this work then there are many ways to get involved. The simplest way to get started is to try out Ballista against your own datasets and file bug reports for any issues that you find. You could also check out the current list of issues and have a go at fixing one.

The Arrow Rust Community section of the Rust README provides information on other ways to interact with the Ballista contributors and maintainers.