blob: a5c38afd1f2df3b1d5137a8582e32002c47caae8 [file] [log] [blame] [view]
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# PyBallista
Python client for Ballista.
This project is versioned and released independently from the main Ballista project and is intentionally not
part of the default Cargo workspace so that it doesn't cause overhead for maintainers of the main Ballista codebase.
## Creating a SessionContext
> [!IMPORTANT]
> Current approach is to support datafusion python API, there are know limitations of current approach,
> with some cases producing errors.
> We trying to come up with the best approach to support datafusion python interface.
> More details could be found at [#1142](https://github.com/apache/datafusion-ballista/issues/1142)
Creates a new context and connects to a Ballista scheduler process.
```python
from ballista import BallistaBuilder
>>> ctx = BallistaBuilder().standalone()
```
### Example SQL Usage
```python
>>> ctx.sql("create external table t stored as parquet location './testdata/test.parquet'")
>>> df = ctx.sql("select * from t limit 5")
>>> pyarrow_batches = df.collect()
```
### Example DataFrame Usage
```python
>>> df = ctx.read_parquet('./testdata/test.parquet').limit(5)
>>> pyarrow_batches = df.collect()
```
## Scheduler and Executor
Scheduler and executors can be configured and started from python code.
To start scheduler:
```python
from ballista import BallistaScheduler
scheduler = BallistaScheduler()
scheduler.start()
scheduler.wait_for_termination()
```
For executor:
```python
from ballista import BallistaExecutor
executor = BallistaExecutor()
executor.start()
executor.wait_for_termination()
```
## Development Process
### Creating Virtual Environment
#### pip
```shell
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```
#### uv
```shell
uv sync --dev --no-install-package ballista
```
### Building
#### pip
```shell
maturin develop
```
Note that you can also run `maturin develop --release` to get a release build locally.
#### uv
```shell
uv run --no-project maturin build
```
Or `uv run --no-project maturin build --release --strip` to get a release build.
### Testing
#### pip
```shell
python3 -m pytest
```
#### uv
```shell
uv run --no-project pytest
```