This tutorial demonstrates how to access Apache Ozone from Python using PyArrow, with Ozone running in Docker.
Download the latest Docker Compose file for Ozone and start the cluster with 3 DataNodes:
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
docker compose up -d --scale datanode=3
docker exec -it <your-scm-container-name-or-id> bash
Replace <your-scm-container-name-or-id> with the name or ID of your SCM container.
The remaining steps of the tutorial run inside this container.
Create a volume and a bucket inside Ozone:
ozone sh volume create volume
ozone sh bucket create volume/bucket
pip install pyarrow
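PyArrow's HadoopFileSystem relies on the Hadoop native library libhdfs. Depending on your system architecture, run one of the following to download it and extract it from the Hadoop release: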
For ARM64 (Apple Silicon, ARM servers):
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
For x86_64 (most desktops and servers):
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
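The choice between the two downloads above can also be made programmatically. The following is a minimal sketch, not part of the original tutorial: the helper name `hadoop_tarball` and the set of accepted machine strings are assumptions for illustration, while the two tarball names come from the commands above.

```python
import platform

# Machine strings commonly reported by platform.machine(); this list is an
# assumption for illustration and may not cover every platform.
ARM_NAMES = {"aarch64", "arm64"}
X86_NAMES = {"x86_64", "amd64"}


def hadoop_tarball(machine=None, version="3.4.0"):
    """Return the Hadoop tarball name matching the given CPU architecture.

    `hadoop_tarball` is a hypothetical helper; if `machine` is omitted,
    the value of platform.machine() is used.
    """
    machine = (machine or platform.machine()).lower()
    if machine in ARM_NAMES:
        return f"hadoop-{version}-aarch64.tar.gz"
    if machine in X86_NAMES:
        return f"hadoop-{version}.tar.gz"
    raise ValueError(f"unrecognized architecture: {machine!r}")
```

Calling `hadoop_tarball()` with no arguments inside the container should return the file name to substitute into the `curl` command for that machine.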
Set environment variables to point to the native libraries and Ozone classpath:
export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)
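Before running any PyArrow code, it can help to confirm that both variables are visible to Python. The sketch below uses only the standard library; the function name `check_libhdfs_env` and the wording of its messages are assumptions for illustration, not part of PyArrow or Ozone.

```python
import os
from pathlib import Path


def check_libhdfs_env(env=None):
    """Return a list of problems found in the environment (empty if none).

    `check_libhdfs_env` is a hypothetical helper. It checks only that the
    variables are set and that a libhdfs.* file exists in the directory;
    it does not verify that the library can actually be loaded.
    """
    env = os.environ if env is None else env
    problems = []

    lib_dir = env.get("ARROW_LIBHDFS_DIR")
    if not lib_dir:
        problems.append("ARROW_LIBHDFS_DIR is not set")
    elif not Path(lib_dir).is_dir():
        problems.append(f"{lib_dir} is not a directory")
    elif not any(Path(lib_dir).glob("libhdfs.*")):
        problems.append(f"no libhdfs.* found in {lib_dir}")

    if not env.get("CLASSPATH"):
        problems.append("CLASSPATH is not set or empty")

    return problems
```

An empty list from `check_libhdfs_env()` suggests the exports above took effect in the current shell session.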
Add the following to /etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
    <description>Ozone Manager endpoint</description>
  </property>
</configuration>
Note: the Docker container sets the environment variable OZONE_CONF_DIR=/etc/hadoop/, so it knows where to locate the configuration files.
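To double-check which file system URI the configuration resolves to, the property can be read back with Python's standard XML parser. This is a minimal sketch: `read_hadoop_property` is a hypothetical helper, and `SAMPLE` simply mirrors the configuration shown above rather than reading the real file.

```python
import xml.etree.ElementTree as ET

# Inline copy of the configuration from this tutorial, used for illustration.
SAMPLE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
    <description>Ozone Manager endpoint</description>
  </property>
</configuration>
"""


def read_hadoop_property(xml_text, name):
    """Return the value of a named property in Hadoop-style XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


print(read_hadoop_property(SAMPLE, "fs.defaultFS"))  # prints ofs://om:9862
```

Inside the container, the same function could be given the contents of /etc/hadoop/core-site.xml instead of `SAMPLE`.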
Create a Python script (ozone_pyarrow_example.py) with the following code:
#!/usr/bin/python
import pyarrow.fs as pafs

# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default")

# Create a directory inside the bucket
fs.create_dir("volume/bucket/aaa")

# Write data to a file
path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')
Run the script:
python ozone_pyarrow_example.py
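To confirm that the write reached Ozone, the file can be read back through the same file system object. The sketch below is an assumption-labeled illustration: `write_and_read_back` is a hypothetical helper that relies only on the `open_output_stream` and `open_input_stream` methods of a PyArrow file system, so it is written against any object that provides them.

```python
def write_and_read_back(fs, path, payload):
    """Write `payload` to `path` on `fs`, then return the bytes read back.

    `fs` is expected to behave like a pyarrow.fs.FileSystem, for example
    pyarrow.fs.HadoopFileSystem("default") inside the Ozone container.
    """
    # Write the payload; the stream is flushed and closed on exit.
    with fs.open_output_stream(path) as out:
        out.write(payload)

    # Re-open the same path for reading and return its full contents.
    with fs.open_input_stream(path) as src:
        return src.read()
```

In the container, `write_and_read_back(pafs.HadoopFileSystem("default"), "volume/bucket/file1", b"data")` would be expected to return the same bytes that were written, assuming the cluster and environment are set up as described in this tutorial.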
✅ Congratulations! You’ve successfully accessed Ozone from Python using PyArrow and Docker.
If the script fails, check that ARROW_LIBHDFS_DIR is set and points to the correct native library path, and that the Ozone Manager endpoint (om:9862) is correct and reachable.
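Whether the Ozone Manager address from core-site.xml (om:9862) accepts connections can be tested with a plain TCP connection attempt. This is a rough sketch using only the standard library; `is_reachable` is a hypothetical helper, and a successful connection shows only that the port is open, not that the service behind it is healthy.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

From inside the container, `is_reachable("om", 9862)` returning False would point to a networking or host name problem rather than a PyArrow configuration issue.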