docs/programming-guide.md - sedona-db - Git at Google

 <!---
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
 -->

 # Working with Vector Data

 > Note: Before running this notebook, ensure that you have installed SedonaDB: `pip install "apache-sedona[db]"`

 Process vector data using sedona.db. You will learn to create DataFrames, run spatial queries, and manage file I/O. Let's begin by connecting to sedona.db.

 Let's start by establishing a SedonaDB connection.

 ## Establish SedonaDB connection

 Here's how to create the SedonaDB connection:


 ```python
 import sedona.db

 sd = sedona.db.connect()
 ```

 Now, let's see how to create SedonaDB dataframes.

 ## Create SedonaDB DataFrame

 **Manually creating SedonaDB DataFrame**

 Here's how to manually create a SedonaDB DataFrame:


 ```python
 df = sd.sql("""
 SELECT * FROM (VALUES
     ('one', ST_GeomFromWkt('POINT(1 2)')),
     ('two', ST_GeomFromWkt('POLYGON((-74.0 40.7, -74.0 40.8, -73.9 40.8, -73.9 40.7, -74.0 40.7))')),
     ('three', ST_GeomFromWkt('LINESTRING(-74.0060 40.7128, -73.9352 40.7306, -73.8561 40.8484)')))
 AS t(val, point)""")
 ```

 Check the type of the DataFrame.


 ```python
 type(df)
 ```


     sedona.db.dataframe.DataFrame


 **Create SedonaDB DataFrame from files in S3**

 For most production applications, you will create SedonaDB DataFrames by reading data from a file.  Let's see how to read GeoParquet files in AWS S3 into a SedonaDB DataFrame.


 ```python
 sd.read_parquet(
     "s3://overturemaps-us-west-2/release/2025-08-20.0/theme=divisions/type=division_area/",
     options={"aws.skip_signature": True, "aws.region": "us-west-2"},
 ).to_view("division_area")
 ```

 Now, let's run some spatial queries.

 ### Read from GeoPandas DataFrame

 This section shows how to convert a GeoPandas DataFrame into a SedonaDB DataFrame.

 Start by reading a FlatGeoBuf file into a GeoPandas DataFrame:


 ```python
 import geopandas as gpd

 path = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.fgb"
 gdf = gpd.read_file(path)
 ```

 Now convert the GeoPandas DataFrame to a SedonaDB DataFrame and view three rows of content:


 ```python
 df = sd.create_data_frame(gdf)
 df.show(3)
 ```

     ┌──────────────┬──────────────────────────────┐
     │     name     ┆           geometry           │
     │     utf8     ┆           geometry           │
     ╞══════════════╪══════════════════════════════╡
     │ Vatican City ┆ POINT(12.4533865 41.9032822) │
     ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ San Marino   ┆ POINT(12.4417702 43.9360958) │
     ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Vaduz        ┆ POINT(9.5166695 47.1337238)  │
     └──────────────┴──────────────────────────────┘


 ## Spatial queries

 Let's see how to run spatial operations like filtering, joins, and clustering algorithms.

 ### Spatial filtering

 Let's run a spatial filtering operation to fetch all the objects in the following polygon:


 ```python
 nova_scotia_bbox_wkt = (
     "POLYGON((-66.5 43.4, -66.5 47.1, -59.8 47.1, -59.8 43.4, -66.5 43.4))"
 )

 ns = sd.sql(f"""
 SELECT country, region, geometry
 FROM division_area
 WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{nova_scotia_bbox_wkt}'), 4326))
 """)

 ns.show(3)
 ```

     ┌──────────┬──────────┬────────────────────────────────────────────────────────────────────────────┐
     │  country ┆  region  ┆                                  geometry                                  │
     │ utf8view ┆ utf8view ┆                                  geometry                                  │
     ╞══════════╪══════════╪════════════════════════════════════════════════════════════════════════════╡
     │ CA       ┆ CA-NS    ┆ POLYGON((-66.0528452 43.4531336,-66.0883401 43.3978188,-65.9647654 43.361… │
     ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ CA       ┆ CA-NS    ┆ POLYGON((-66.0222822 43.5166842,-66.0252286 43.5100071,-66.0528452 43.453… │
     ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ CA       ┆ CA-NS    ┆ POLYGON((-65.7451389 43.5336263,-65.7450818 43.5347004,-65.7449545 43.535… │
     └──────────┴──────────┴────────────────────────────────────────────────────────────────────────────┘


 You can see it only includes the divisions in the Nova Scotia area.

 ### K-nearest neighbors (KNN) joins

 Create `restaurants` and `customers` views so we can demonstrate the KNN join functionality.


 ```python
 df = sd.sql("""
 SELECT name, ST_Point(lng, lat) AS location
 FROM (VALUES
     (101, -74.0, 40.7, 'Pizza Palace'),
     (102, -73.99, 40.69, 'Burger Barn'),
     (103, -74.02, 40.72, 'Taco Town'),
     (104, -73.98, 40.75, 'Sushi Spot'),
     (105, -74.05, 40.68, 'Deli Direct')
 ) AS t(id, lng, lat, name)
 """)
 sd.sql("drop view if exists restaurants")
 df.to_view("restaurants")

 df = sd.sql("""
 SELECT name, ST_Point(lng, lat) AS location
 FROM (VALUES
     (1, -74.0, 40.7, 'Alice'),
     (2, -73.9, 40.8, 'Bob'),
     (3, -74.1, 40.6, 'Carol')
 ) AS t(id, lng, lat, name)
 """)
 sd.sql("drop view if exists customers")
 df.to_view("customers")
 ```


 ```python
 df.show()
 ```

     ┌───────┬───────────────────┐
     │  name ┆      location     │
     │  utf8 ┆      geometry     │
     ╞═══════╪═══════════════════╡
     │ Alice ┆ POINT(-74 40.7)   │
     ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Bob   ┆ POINT(-73.9 40.8) │
     ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Carol ┆ POINT(-74.1 40.6) │
     └───────┴───────────────────┘


 Perform a KNN join to identify the two restaurants that are nearest to each customer:


 ```python
 sd.sql("""
 SELECT
     c.name AS customer,
     r.name AS restaurant
 FROM customers c, restaurants r
 WHERE ST_KNN(c.location, r.location, 2, false)
 ORDER BY c.name, r.name;
 """).show()
 ```

     ┌──────────┬──────────────┐
     │ customer ┆  restaurant  │
     │   utf8   ┆     utf8     │
     ╞══════════╪══════════════╡
     │ Alice    ┆ Burger Barn  │
     ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Alice    ┆ Pizza Palace │
     ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Bob      ┆ Pizza Palace │
     ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Bob      ┆ Sushi Spot   │
     ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Carol    ┆ Deli Direct  │
     ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
     │ Carol    ┆ Pizza Palace │
     └──────────┴──────────────┘


 Notice how each customer has two rows - one for each of the two closest restaurants.
	<!---
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	# Working with Vector Data

	> Note: Before running this notebook, ensure that you have installed SedonaDB: `pip install "apache-sedona[db]"`

	Process vector data using sedona.db. You will learn to create DataFrames, run spatial queries, and manage file I/O. Let's begin by connecting to sedona.db.

	Let's start by establishing a SedonaDB connection.

	## Establish SedonaDB connection

	Here's how to create the SedonaDB connection:


	```python
	import sedona.db

	sd = sedona.db.connect()
	```

	Now, let's see how to create SedonaDB dataframes.

	## Create SedonaDB DataFrame

	Manually creating SedonaDB DataFrame

	Here's how to manually create a SedonaDB DataFrame:


	```python
	df = sd.sql("""
	SELECT * FROM (VALUES
	('one', ST_GeomFromWkt('POINT(1 2)')),
	('two', ST_GeomFromWkt('POLYGON((-74.0 40.7, -74.0 40.8, -73.9 40.8, -73.9 40.7, -74.0 40.7))')),
	('three', ST_GeomFromWkt('LINESTRING(-74.0060 40.7128, -73.9352 40.7306, -73.8561 40.8484)')))
	AS t(val, point)""")
	```

	Check the type of the DataFrame.


	```python
	type(df)
	```




	sedona.db.dataframe.DataFrame



	Create SedonaDB DataFrame from files in S3

	For most production applications, you will create SedonaDB DataFrames by reading data from a file. Let's see how to read GeoParquet files in AWS S3 into a SedonaDB DataFrame.


	```python
	sd.read_parquet(
	"s3://overturemaps-us-west-2/release/2025-08-20.0/theme=divisions/type=division_area/",
	options={"aws.skip_signature": True, "aws.region": "us-west-2"},
	).to_view("division_area")
	```

	Now, let's run some spatial queries.

	### Read from GeoPandas DataFrame

	This section shows how to convert a GeoPandas DataFrame into a SedonaDB DataFrame.

	Start by reading a FlatGeoBuf file into a GeoPandas DataFrame:


	```python
	import geopandas as gpd

	path = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.fgb"
	gdf = gpd.read_file(path)
	```

	Now convert the GeoPandas DataFrame to a SedonaDB DataFrame and view three rows of content:


	```python
	df = sd.create_data_frame(gdf)
	df.show(3)
	```

	┌──────────────┬──────────────────────────────┐
	│ name ┆ geometry │
	│ utf8 ┆ geometry │
	╞══════════════╪══════════════════════════════╡
	│ Vatican City ┆ POINT(12.4533865 41.9032822) │
	├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ San Marino ┆ POINT(12.4417702 43.9360958) │
	├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Vaduz ┆ POINT(9.5166695 47.1337238) │
	└──────────────┴──────────────────────────────┘


	## Spatial queries

	Let's see how to run spatial operations like filtering, joins, and clustering algorithms.

	### Spatial filtering

	Let's run a spatial filtering operation to fetch all the objects in the following polygon:


	```python
	nova_scotia_bbox_wkt = (
	"POLYGON((-66.5 43.4, -66.5 47.1, -59.8 47.1, -59.8 43.4, -66.5 43.4))"
	)

	ns = sd.sql(f"""
	SELECT country, region, geometry
	FROM division_area
	WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{nova_scotia_bbox_wkt}'), 4326))
	""")

	ns.show(3)
	```

	┌──────────┬──────────┬────────────────────────────────────────────────────────────────────────────┐
	│ country ┆ region ┆ geometry │
	│ utf8view ┆ utf8view ┆ geometry │
	╞══════════╪══════════╪════════════════════════════════════════════════════════════════════════════╡
	│ CA ┆ CA-NS ┆ POLYGON((-66.0528452 43.4531336,-66.0883401 43.3978188,-65.9647654 43.361… │
	├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ CA ┆ CA-NS ┆ POLYGON((-66.0222822 43.5166842,-66.0252286 43.5100071,-66.0528452 43.453… │
	├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ CA ┆ CA-NS ┆ POLYGON((-65.7451389 43.5336263,-65.7450818 43.5347004,-65.7449545 43.535… │
	└──────────┴──────────┴────────────────────────────────────────────────────────────────────────────┘


	You can see it only includes the divisions in the Nova Scotia area.

	### K-nearest neighbors (KNN) joins

	Create `restaurants` and `customers` views so we can demonstrate the KNN join functionality.


	```python
	df = sd.sql("""
	SELECT name, ST_Point(lng, lat) AS location
	FROM (VALUES
	(101, -74.0, 40.7, 'Pizza Palace'),
	(102, -73.99, 40.69, 'Burger Barn'),
	(103, -74.02, 40.72, 'Taco Town'),
	(104, -73.98, 40.75, 'Sushi Spot'),
	(105, -74.05, 40.68, 'Deli Direct')
	) AS t(id, lng, lat, name)
	""")
	sd.sql("drop view if exists restaurants")
	df.to_view("restaurants")

	df = sd.sql("""
	SELECT name, ST_Point(lng, lat) AS location
	FROM (VALUES
	(1, -74.0, 40.7, 'Alice'),
	(2, -73.9, 40.8, 'Bob'),
	(3, -74.1, 40.6, 'Carol')
	) AS t(id, lng, lat, name)
	""")
	sd.sql("drop view if exists customers")
	df.to_view("customers")
	```


	```python
	df.show()
	```

	┌───────┬───────────────────┐
	│ name ┆ location │
	│ utf8 ┆ geometry │
	╞═══════╪═══════════════════╡
	│ Alice ┆ POINT(-74 40.7) │
	├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Bob ┆ POINT(-73.9 40.8) │
	├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Carol ┆ POINT(-74.1 40.6) │
	└───────┴───────────────────┘


	Perform a KNN join to identify the two restaurants that are nearest to each customer:


	```python
	sd.sql("""
	SELECT
	c.name AS customer,
	r.name AS restaurant
	FROM customers c, restaurants r
	WHERE ST_KNN(c.location, r.location, 2, false)
	ORDER BY c.name, r.name;
	""").show()
	```

	┌──────────┬──────────────┐
	│ customer ┆ restaurant │
	│ utf8 ┆ utf8 │
	╞══════════╪══════════════╡
	│ Alice ┆ Burger Barn │
	├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Alice ┆ Pizza Palace │
	├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Bob ┆ Pizza Palace │
	├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Bob ┆ Sushi Spot │
	├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Carol ┆ Deli Direct │
	├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
	│ Carol ┆ Pizza Palace │
	└──────────┴──────────────┘


	Notice how each customer has two rows - one for each of the two closest restaurants.