Python client for Ballista.
This project is versioned and released independently from the main Ballista project and is intentionally not part of the default Cargo workspace so that it doesn't cause overhead for maintainers of the main Ballista codebase.
[!IMPORTANT] The current approach is to support the DataFusion Python API. This approach has known limitations, and some cases produce errors. We are still working out the best way to support a Ballista Python interface. More details can be found in #1142.
Creates a new context and connects to a Ballista scheduler process.
>>> from ballista import BallistaBuilder
>>> ctx = BallistaBuilder().standalone()
>>> ctx.sql("create external table t stored as parquet location './testdata/test.parquet'")
>>> df = ctx.sql("select * from t limit 5")
>>> pyarrow_batches = df.collect()
>>> df = ctx.read_parquet('./testdata/test.parquet').limit(5)
>>> pyarrow_batches = df.collect()
The scheduler and executors can be configured and started from Python code.
To start the scheduler:
from ballista import BallistaScheduler
scheduler = BallistaScheduler()
scheduler.start()
scheduler.wait_for_termination()
To start an executor:
from ballista import BallistaExecutor
executor = BallistaExecutor()
executor.start()
executor.wait_for_termination()
A detailed explanation of the development process can be found in the datafusion python documentation. Its section on improving build speed may be particularly relevant.
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
uv sync --dev --no-install-package ballista
maturin develop
Note that you can also run maturin develop --release to get a release build locally.
uv run --no-project maturin develop --uv
Or uv run --no-project maturin build --release --strip to get a release build.
python3 -m pytest
uv run --no-project pytest