<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
You can run Sedona in Databricks to leverage the functionality that Sedona provides. Here’s an example of a Databricks notebook that’s running Sedona code:
![Example notebook](../image/databricks/image1.png)
Sedona isn’t available in all Databricks environments because of platform limitations. This page explains how and where you can run Sedona in Databricks.
## Databricks and Sedona version requirements
Databricks and Sedona depend on Spark, Scala, and other libraries.
For example, Databricks Runtime 16.4 depends on Scala 2.12 and Spark 3.5. Here are the version requirements for a few Databricks Runtimes.
![Databricks runtimes](../image/databricks/image2.png)
If you use a Databricks Runtime compiled with Spark 3.5 and Scala 2.12, then you should use a Sedona version compiled with Spark 3.5 and Scala 2.12. You need to make sure the Scala versions are aligned, even if you’re using the Python or SQL APIs.
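You can check both versions from a notebook cell. A quick sketch (`spark` is predefined in Databricks notebooks; the `_jvm` gateway is an internal PySpark handle, used here only for a one-off check):

```python
# Print the Spark and Scala versions of the attached cluster so you can
# pick a matching Sedona artifact (e.g. sedona-spark-shaded-3.5_2.12).
print("Spark:", spark.version)
print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionString())
```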
## Install the Sedona library in Databricks
Download the required Sedona packages by executing the following commands:
```sh
%sh
# Create JAR directory for Sedona
mkdir -p /Workspace/Shared/sedona/{{ sedona.current_version }}
# Download the dependencies from Maven into the Workspace directory
curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_geotools }}.jar "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/{{ sedona.current_geotools }}/geotools-wrapper-{{ sedona.current_geotools }}.jar"
curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/sedona-spark-shaded-3.5_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.5_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.5_2.12-{{ sedona.current_version }}.jar"
```
Here are the software versions used to compile `sedona-spark-shaded-3.5_2.12-{{ sedona.current_version }}.jar`:
* Spark 3.5
* Scala 2.12
* Sedona {{ sedona.current_version }}
Ensure that you use a Databricks Runtime compatible with this JAR.
After the downloads finish, the JARs appear in your Databricks workspace:
![Downloaded JARs](../image/databricks/image3.png)
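You can also list them programmatically. A minimal check, assuming Workspace files are mounted at `/Workspace` on the driver (the case on recent runtimes):

```python
import os

# List the downloaded Sedona JARs to confirm both files are in place.
jar_dir = "/Workspace/Shared/sedona/{{ sedona.current_version }}"
for name in sorted(os.listdir(jar_dir)):
    print(name)
```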
Create an init script as follows:
```sh
%sh
# Create init script directory for Sedona
mkdir -p /Workspace/Shared/sedona/
# Create init script
cat > /Workspace/Shared/sedona/sedona-init.sh <<'EOF'
#!/bin/bash
#
# File: sedona-init.sh
#
# On cluster startup, this script will copy the Sedona jars to the cluster's default jar directory.
cp /Workspace/Shared/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
EOF
```
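To double-check that the heredoc wrote what you expect, print the script back from a notebook cell (again assuming the `/Workspace` mount):

```python
# Print the init script to verify its contents before attaching it to a cluster.
with open("/Workspace/Shared/sedona/sedona-init.sh") as f:
    print(f.read())
```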
## Create a Databricks cluster
You need to create a Databricks cluster compatible with the Sedona JAR files. If you use Sedona JAR files compiled with Scala 2.12, you must use a Databricks cluster that runs Scala 2.12.
Go to the compute tab and configure the cluster:
![Cluster config](../image/databricks/image4.png)
Set the proper cluster configurations:
![Spark configs](../image/databricks/image5.png)
Here’s a list of the cluster configurations that’s easy to copy and paste:
```
spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
spark.sedona.enableParserExtension false
```
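After the cluster starts, you can confirm these settings took effect. A small check that reads them back from the cluster's `SparkConf`:

```python
# Read the Sedona-related settings back from the running cluster.
conf = spark.sparkContext.getConf()
for key in [
    "spark.sql.extensions",
    "spark.serializer",
    "spark.kryo.registrator",
    "spark.sedona.enableParserExtension",
]:
    print(key, "=", conf.get(key, "<unset>"))
```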
Specify the path to the init script:
![init script](../image/databricks/image6.png)
If you are creating a Shared cluster, you won't be able to use init scripts and JARs stored under Workspace; store them in Unity Catalog volumes instead. The overall process is otherwise the same, as sketched below.
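A minimal sketch of copying the files into a volume, assuming a hypothetical volume at `/Volumes/your_catalog/your_schema/your_volume` (replace with your own):

```python
# Hypothetical Unity Catalog volume path -- substitute your catalog/schema/volume.
volume_dir = "/Volumes/your_catalog/your_schema/your_volume/sedona/{{ sedona.current_version }}"

dbutils.fs.mkdirs(volume_dir)
# Copy the JARs from the Workspace into the volume; the "file:" scheme
# addresses the driver-local filesystem where Workspace files are mounted.
dbutils.fs.cp(
    "file:/Workspace/Shared/sedona/{{ sedona.current_version }}/",
    volume_dir,
    recurse=True,
)
dbutils.fs.cp(
    "file:/Workspace/Shared/sedona/sedona-init.sh",
    volume_dir + "/sedona-init.sh",
)
```

Remember to also update the `cp` source path inside `sedona-init.sh` to point at the volume location.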
Add the required dependencies in the Library tab:
![Python libraries](../image/databricks/image7.png)
Here’s the full list of libraries:
```
apache-sedona=={{ sedona.current_version }}
geopandas==1.0.1
keplergl==0.3.7
pydeck==0.9.1
```
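Alternatively, the Python packages can be installed notebook-scoped with the `%pip` magic once the cluster is running (the Sedona JARs still need the init script above):

```
%pip install apache-sedona=={{ sedona.current_version }} geopandas==1.0.1 keplergl==0.3.7 pydeck==0.9.1
```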
Then click “Create compute” to start the cluster.
## Create a Databricks notebook
Create a Databricks notebook and connect it to the cluster. Verify that you can run a Python computation with a Sedona function:
![Python computation](../image/databricks/image1.png)
If you want to use Sedona Python functions such as [DataFrame APIs](../api/sql/DataFrameAPI.md) or [StructuredAdapter](../tutorial/sql.md#spatialrdd-to-dataframe-with-spatial-partitioning), you need to initialize Sedona as follows:
```python
from sedona.spark import *
sedona = SedonaContext.create(spark)
```
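As a quick smoke test, you can build a geometry through the DataFrame-style functions shipped in the `apache-sedona` Python package; if the JARs are missing from the cluster, this fails with an unresolved-function error:

```python
from pyspark.sql.functions import lit
from sedona.sql.st_constructors import ST_Point
from sedona.sql.st_functions import ST_AsText

# Construct a point and print its WKT to confirm Sedona is wired up.
sedona.range(1).select(ST_AsText(ST_Point(lit(1.0), lit(2.0))).alias("wkt")).show()
```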
You can also use the SQL API as follows:
![SQL computation](../image/databricks/image8.png)
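For example, once the SQL extensions are registered, spatial functions are available in plain SQL (a minimal query sketch):

```python
# Buffer a point and return the result as WKT, entirely in SQL.
sedona.sql("""
    SELECT ST_AsText(ST_Buffer(ST_Point(0.0, 0.0), 1.0)) AS buffered
""").show(truncate=False)
```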
## Saving geometry in Databricks Delta Lake tables
Here’s how to create a Sedona DataFrame with a geometry column:
```python
from pyspark.sql.functions import expr

df = sedona.createDataFrame(
    [
        ("a", "POLYGON((1.0 1.0,1.0 3.0,2.0 3.0,2.0 1.0,1.0 1.0))"),
        ("b", "LINESTRING(4.0 1.0,4.0 2.0,6.0 4.0)"),
        ("c", "POINT(9.0 2.0)"),
    ],
    ["id", "geometry"],
)
df = df.withColumn("geometry", expr("ST_GeomFromWKT(geometry)"))
```
Write the Sedona DataFrame to a Delta Lake table:
```python
df.write.saveAsTable("your_org.default.geotable")
```
Here’s how to read the table: `sedona.table("your_org.default.geotable").display()`
This is what the results look like in Databricks:
![Write table](../image/databricks/image9.png)
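Since the geometry column reads back as a geometry (as shown above), you can apply spatial functions directly to the stored table. For example, computing the area of each geometry:

```python
from pyspark.sql.functions import expr

# Read the Delta table back and compute each geometry's area.
sedona.table("your_org.default.geotable") \
    .select("id", expr("ST_Area(geometry) AS area")) \
    .show()
```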
## Known bugs
To ensure stability, we recommend using a currently supported Long-Term Support (LTS) version, such as Databricks Runtime 16.4 LTS or 15.4 LTS. Some Databricks Runtimes, such as 16.2 (non-LTS), are not compatible with Apache Sedona, as this particular runtime introduced a change in the json4s dependency.
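If you are unsure which runtime a cluster is running, you can read it from a notebook. This relies on a Databricks-internal configuration tag, so treat it as a convenience rather than a stable API:

```python
# Databricks-internal tag; returns something like "16.4.x-scala2.12".
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
```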