This getting started guide provides a `docker-compose` file to set up Apache Spark with Apache Polaris. Apache Polaris is configured as an Iceberg REST Catalog in Spark. A Jupyter notebook is used to run PySpark.
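For orientation, the Spark-to-Polaris wiring amounts to registering an Iceberg REST catalog in the Spark session. The sketch below shows the general shape of that configuration; the catalog name (`polaris`), the endpoint (`http://polaris:8181/api/catalog`), the Iceberg runtime version, and the credentials are illustrative assumptions, and the shipped notebook sets the actual values.

```python
from pyspark.sql import SparkSession

# Sketch of a Spark session wired to Polaris as an Iceberg REST catalog.
# Catalog name, endpoint, credentials, and Iceberg version below are
# illustrative placeholders; the shipped notebook sets the real values.
spark = (
    SparkSession.builder.appName("polaris-getting-started")
    # Pull in the Iceberg Spark runtime (version is an assumption).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register a catalog named "polaris" that speaks the Iceberg REST protocol.
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
    # Name of the catalog entity created in Polaris (placeholder).
    .config("spark.sql.catalog.polaris.warehouse", "<catalog_name>")
    # OAuth client credentials issued by Polaris (placeholders).
    .config("spark.sql.catalog.polaris.credential", "<client_id>:<client_secret>")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    .getOrCreate()
)
```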
If a Polaris image is not already present locally, build one with the following command:
```shell
./gradlew \
  :polaris-server:assemble \
  :polaris-server:quarkusAppPartsBuild --rerun \
  -Dquarkus.container-image.build=true
```
## Run the `docker-compose` file

To start the `docker-compose` file with the necessary dependencies, run this command from the repo's root directory:
```shell
sh getting-started/spark/launch-docker.sh
```
This will spin up 2 container services:

* The `polaris` service for running Apache Polaris using an in-memory metastore
* The `jupyter` service for running Jupyter notebook with PySpark

In the Jupyter notebook container log, look for the URL to access the Jupyter notebook. The URL should be in the format `http://127.0.0.1:8888/lab?token=<token>`.
Open the Jupyter notebook in a browser. Navigate to `notebooks/SparkPolaris.ipynb`.
You can now run all cells in the notebook or write your own code!
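If you write your own code, a cell along the following lines exercises the catalog end to end; it assumes a Spark session configured as in the earlier sketch, and the `polaris` catalog name and `demo` namespace are placeholders, not the notebook's own.

```python
# Illustrative notebook cell: create a namespace and an Iceberg table through
# the Polaris-backed catalog, insert a row, and read it back.
# "polaris" and "demo" are placeholder names, not the notebook's own.
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.demo")
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS polaris.demo.events (
        id BIGINT,
        name STRING
    ) USING iceberg
    """
)
spark.sql("INSERT INTO polaris.demo.events VALUES (1, 'hello-polaris')")
spark.sql("SELECT * FROM polaris.demo.events").show()
```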