This tutorial will guide you through the process of installing Sedona on Azure Synapse Analytics when Data Exfiltration Protection (DEP) is enabled or when you have no internet connection from the Spark pools due to other networking constraints.
This tutorial focuses on getting you up and running with Sedona 1.6.1 on Spark 3.4 with Python 3.10.
If you want to run a newer version, you will need to dive into the detailed build and diagnosis process described in the lower part of this document.
Caution: precise versions are critical; latest is not always greatest here.
From Maven:

sedona-spark-shaded-3.4_2.12-1.6.1.jar
geotools-wrapper-1.6.1-28.2.jar
From PyPI:
rasterio-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
shapely-2.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
apache_sedona-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
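To fetch these wheels on a machine that does have internet access, a `pip download` along these lines should work. This is a sketch: the version pins mirror the filenames above, so adjust them if your list differs, and expect pip to pull a few extra dependency wheels as well.

```bash
# download cp310 manylinux wheels only, without building anything locally
pip download apache-sedona==1.6.1 shapely==2.0.6 rasterio==1.4.2 numpy==2.1.2 \
  --only-binary=:all: --implementation cp --python-version 3.10 \
  --platform manylinux2014_x86_64 -d ./wheels
```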
This tutorial used the second method on this page: "If you are updating from the Synapse Studio".
Start your notebook with:
```python
from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.4_2.12-1.6.1,"
        "org.datasyslab:geotools-wrapper-1.6.1-28.2",
    )
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config(
        "spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator"
    )
    .config(
        "spark.sql.extensions",
        "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```
Run a test:

```python
sedona.sql("SELECT ST_GeomFromEWKT('SRID=4269;POINT(40.7128 -74.0060)')").show()
```
If you see the point printed (something like `POINT (40.7128 -74.006)`), the installation is successful and you are all done with the setup.
For reference, this is the full set of packages that was used for Sedona 1.6.0 on Spark 3.4:

spark-xml_2.12-0.17.0.jar
sedona-spark-shaded-3.4_2.12-1.6.0.jar
geotools-wrapper-1.6.0-28.2.jar
apache_sedona-1.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
shapely-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
rasterio-1.3.10-cp310-cp310-manylinux2014_x86_64.whl
affine-2.4.0-py3-none-any.whl
click_plugins-1.1.1-py2.py3-none-any.whl
cligj-0.7.2-py3-none-any.whl
snuggs-1.4.7-py3-none-any.whl
Warning: this process is going to require tenacity and some serious troubleshooting skills.
Broad steps:

1. Build a Linux VM from the same image as the deployed Spark pool.
2. Configure it for Synapse.
3. Install the Sedona packages.
4. Identify the required packages over and above the baseline Synapse config.
This is the process for Sedona 1.6.1 on Spark 3.4 with Python 3.10. (The same process was used for Sedona 1.6.0.)
References:

- Synapse Spark 3.4 runtime: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-34-runtime
- Azure Linux 2.0 (the Spark pool's host OS): https://github.com/microsoft/azurelinux/tree/2.0
- Building an Azure Linux ISO image: https://github.com/microsoft/azurelinux/blob/2.0/toolkit/docs/quick_start/quickstart.md#iso-image
Important settings if using Hyper-V
Connect to the VM. Note: the first boot will take longer than you'd expect.
```bash
sudo dnf upgrade
sudo tdnf install -y openssh-server
```
Enable root login and password authentication:

```bash
sudo vi /etc/ssh/sshd_config
```

Set:

- PasswordAuthentication yes
- PermitRootLogin yes
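If you'd rather not edit interactively, an equivalent pair of sed one-liners should work; this is a sketch assuming the stock sshd_config, where these directives may be present but commented out:

```bash
# uncomment/overwrite the two directives in place (GNU sed)
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
```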
Start the SSH server:

```bash
sudo systemctl enable --now sshd.service
```
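To confirm the daemon is up before you try to connect:

```bash
systemctl status sshd --no-pager
```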
Identify the IP address of the VM (I'm using Hyper-V on a Windows 10 desktop):

```powershell
Get-VMNetworkAdapter -VMName "Synapse Spark 3.4 Python 3.10 Sedona 1.6.1" | Select-Object -ExpandProperty IPAddresses
```
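If you're not on Hyper-V (or the cmdlet returns nothing yet), you can read the address from inside the VM instead; this assumes the iproute2 tools are present in the image:

```bash
# list IPv4 addresses on all interfaces
ip -4 addr show
```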
Install Miniconda:

```bash
cd /tmp
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
```
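If `conda` isn't found after the installer exits, load it into your current shell first; this sketch assumes you accepted the default install location (`~/miniconda3`):

```bash
# put conda on PATH for this shell session
source ~/miniconda3/bin/activate
# optional: write the init block to ~/.bashrc so future logins get conda automatically
conda init bash
```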
Install compilers (some packages build from source):

```bash
sudo tdnf -y install gcc g++
```
Download the virtual env spec:

```bash
wget -O Synapse-Python310-CPU.yml https://raw.githubusercontent.com/microsoft/synapse-spark-runtime/refs/heads/main/Synapse/spark3.4/Synapse-Python310-CPU.yml
```
Create the environment:

```bash
conda env create -f Synapse-Python310-CPU.yml -n synapse
```
- If you get errors due to fsspec_wrapper, remove `fsspec_wrapper==0.1.13=py_3` from the yml and run again.
- If you then get further, different errors from pip after making the above change, you can ignore them and still proceed.
Install Sedona into the new environment and capture pip's output:

```bash
conda activate synapse
echo "apache-sedona==1.6.1" > requirements.txt
pip install -r requirements.txt > pip-output.txt
```
```bash
grep Downloading pip-output.txt
```
This will be the list of packages you need to locate and download from PyPI.

Example output:
```
Downloading apache_sedona-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)
Downloading shapely-2.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Downloading rasterio-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.2 MB)
Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
```
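If you want just the filenames (say, to feed a download script), a small variation on the grep works; this sketch assumes pip's current "Downloading <file> (<size>)" log format:

```bash
# the second whitespace-separated field of each matching line is the wheel filename
grep Downloading pip-output.txt | awk '{print $2}'
```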
Pay careful attention to errors reported back from Synapse and troubleshoot to resolve conflicts.
Note: we didn't have issues with Sedona 1.6.0 on Spark 3.4, but Sedona 1.6.1 and its supporting packages had a conflict around numpy, which required us to download a specific version and add it to the package list. numpy was not listed in the output of the grep.
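Since numpy never appears in the grep output, one way to find the version the environment actually resolved (and therefore which wheel to download) is to query the activated env directly:

```bash
# report the numpy version pip resolved inside the synapse env
python -c "import numpy; print(numpy.__version__)"
pip show numpy
```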