The easiest way to get started is to run one of the standalone or distributed examples. After that, refer to the Getting Started Guide.
Ballista supports a wide range of SQL, including CTEs, joins, and subqueries, and can execute complex queries at scale.
Refer to the DataFusion SQL Reference for more information on supported SQL.
Ballista is maturing quickly and is now working towards being production ready. See the following roadmap for more details.
There is an excellent discussion in https://github.com/apache/arrow-ballista/issues/30 about the future of the project, and we encourage you to participate and add your feedback there if you are interested in using or contributing to Ballista.
The current focus is on the following items:
- Make production ready
  - Shuffle file cleanup
    - Periodically (#185)
    - Add gRPC & REST interfaces for clients/UI to actively call the cleanup for a job or the whole system
  - Fill functional gaps between DataFusion and Ballista
  - Improve task scheduling and data exchange efficiency
  - Better error handling
  - Improve monitoring, logging, and metrics
  - Auto scaling support
  - Better configuration management
  - Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support sharding for scalability and more efficient caching.
- Shuffle improvement
  - Shuffle memory control (#320)
  - Improve shuffle IO to avoid producing too many files
  - Support sort-based shuffle
  - Support range partition
  - Support broadcast shuffle (#342)
- Scheduler Improvements
  - All-at-once job task scheduling
  - Executor deployment grouping based on resource allocation
- Cloud Support
  - Support Azure Blob Storage (#294)
  - Support Google Cloud Storage (#293)
- Performance and scalability
  - Implement Adaptive Query Execution (#387)
  - Implement bubble execution (#408)
  - Improve benchmark results (#339)
- Python Support
  - Support Python UDFs (#173)
There are currently no up-to-date architecture documents available. You can get a general overview of the architecture by watching the talk "Ballista: Distributed Compute with Rust and Apache Arrow" from the New York Open Statistical Programming Meetup (Feb 2021).
Please see the Contribution Guide for information about contributing to Ballista.