Testing

Single Node Testing

Install Ray on one (head) node.

sudo apt install -y python3-pip python3.12-venv
python3 -m venv venv
source venv/bin/activate
pip3 install -U "ray[default]"

Start Ray Head Node

 ray start --head --dashboard-host 0.0.0.0 --include-dashboard true

Start Ray Worker Nodes(s) (Optional)

This is optional, if you go add Ray worker noeds, it becomes distributed.

Also Ray doesn't support MacOS multi-node cluster

ray start --address=127.0.0.1:6379

Install DataFusion Ray (on head node)

Clone the repo with the version that you want to test. Run maturin build --release in the virtual env.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
pip3 install maturin
git clone https://github.com/apache/datafusion-ray.git
cd datafusion-ray
maturin develop --release

Submit Job

  1. If started the cluster manually, simply connect to the existing cluster instead of reinitializing it.
# Start a local cluster
# ray.init(resources={"worker": 1})

# Connect to a cluster
ray.init()
  1. Submit the job to Ray Cluster
RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir examples -- python3 tips.py

Distributed Testing

Install Ray on at least two nodes.

sudo apt install -y python3-pip python3.12-venv
python3 -m venv venv
source venv/bin/activate
pip3 install -U "ray[default]"

Start Ray Head Node

ray start --head --dashboard-host 0.0.0.0 --include-dashboard true

Start Ray Worker Nodes(s)

Replace NODE_IP_ADDRESS with the address accessible in your distributed setup, which will be displayed after the previous step.

ray start --address={NODE_IP_ADDRESS}:6379

Install DataFusion Ray (on each node)

Clone the repo with the version that you want to test. Run maturin build --release in the virtual env.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
pip3 install maturin
git clone https://github.com/apache/datafusion-ray.git
cd datafusion-ray
maturin develop --release

Submit Job

  1. If starting the cluster manually, simply connect to the existing cluster instead of reinitializing it.
# Start a local cluster
# ray.init(resources={"worker": 1})

# Connect to a cluster
ray.init()
  1. Submit the job to Ray Cluster
RAY_ADDRESS='http://{NODE_IP_ADDRESS}:8265' ray job submit --working-dir examples -- python3 tips.py