Ballista support for datafusion python.
This project is tracked under its own Cargo.toml and is intentionally not part of the default Cargo workspace so that it doesn't cause overhead for maintainers of the main Ballista codebase. Its version is bumped in lockstep with the workspace crates by dev/update_ballista_versions.py, and the wheels are built against the in-repo ballista crates via path dependencies (not crates.io), so an RC can produce wheels for an unpublished version.
[!IMPORTANT] Current approach is to support datafusion python API, there are know limitations of current approach, with some cases producing errors.
We are trying to come up with the best approach to support ballista python interface.
More details could be found at #1142
Creates a new context which connects to a Ballista scheduler process.
from datafusion import col, lit from datafusion import DataFrame # we do not need datafusion context # it will be replaced by BallistaSessionContext # from datafusion import SessionContext from ballista import BallistaSessionContext # Change from: # # ctx = SessionContext() # # to: ctx = BallistaSessionContext("df://localhost:50050") # all other functions and functions are from # datafusion module ctx.sql("create external table t stored as parquet location './testdata/test.parquet'") df : DataFrame = ctx.sql("select * from t limit 5") df.show()
Known limitations and inefficiencies of the current approach:
SessionConfig is not propagated to Ballista.datafusion_proto::logical_plan::LogicalExtensionCodec.UDF as DataFusion Python does not serialise them.ctx = BallistaSessionContext("df://localhost:50050") df = ctx.read_parquet('./testdata/test.parquet').filter(col(id) > lit(4)).limit(5) pyarrow_batches = df.collect()
Check DataFusion python provides more examples and manuals.
PyBallista provides first-class Jupyter notebook support with SQL magic commands and rich HTML rendering.
pip install "ballista[jupyter]"
DataFrames automatically render as styled HTML tables in Jupyter notebooks:
from ballista import BallistaSessionContext ctx = BallistaSessionContext("df://localhost:50050") df = ctx.sql("SELECT * FROM my_table LIMIT 10") df # Renders as HTML table via _repr_html_()
For a more interactive SQL experience, load the Ballista Jupyter extension:
# Load the extension %load_ext ballista.jupyter # Connect to a Ballista cluster %ballista connect df://localhost:50050 # Register .parquet table %register parquet public.test_data_v1 ../testdata/test.parquet # Check connection status %ballista status # List registered tables %ballista tables # Show table schema %ballista schema my_table # Execute a simple query (line magic) %sql SELECT COUNT(*) FROM orders # Execute a complex query (cell magic) %%sql SELECT customer_id, SUM(amount) as total FROM orders GROUP BY customer_id ORDER BY total DESC LIMIT 10
You can also store results in a variable:
%%sql my_result SELECT * FROM orders WHERE status = 'pending'
Visualize query execution plans directly in notebooks:
df = ctx.sql("SELECT * FROM orders WHERE amount > 100") df.explain_visual() # Displays SVG visualization # With runtime statistics df.explain_visual(analyze=True)
Note: Full SVG visualization requires graphviz to be installed (
brew install graphvizon macOS).
For long-running queries, use collect_with_progress() to see execution status:
df = ctx.sql("SELECT * FROM large_table") batches = df.collect_with_progress()
See the examples/ directory for Jupyter notebooks demonstrating various features:
getting_started.ipynb - Basic connection and queriesdataframe_api.ipynb - DataFrame transformationsdistributed_queries.ipynb - Multi-stage distributed query examplesScheduler and executors can be configured and started from python code.
To start scheduler:
from ballista import BallistaScheduler scheduler = BallistaScheduler() scheduler.start() scheduler.wait_for_termination()
For executor:
from ballista import BallistaExecutor executor = BallistaExecutor() executor.start() executor.wait_for_termination()
Detailed development process explanation can be found in datafusion python documentation. Improving build speed section can be relevant.
python3 -m venv .venv source .venv/bin/activate pip3 install -r requirements.txt
uv sync --dev --no-install-package ballista
maturin develop
Note that you can also run maturin develop --release to get a release build locally.
uv run --no-project maturin develop --uv
Or uv run --no-project maturin build --release --strip to get a release build.
python3 -m pytest
uv run --no-project pytest