blob: a59e7ced63c9f5e55f152607e20622c690b3c82e [file] [log] [blame] [view]
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Working with Vector Data
> Note: Before running this notebook, ensure that you have installed SedonaDB: `pip install "apache-sedona[db]"`
Process vector data using sedona.db. You will learn to create DataFrames, run spatial queries, and manage file I/O. Let's begin by connecting to sedona.db.
Let's start by establishing a SedonaDB connection.
## Establish SedonaDB connection
Here's how to create the SedonaDB connection:
```python
import sedona.db
sd = sedona.db.connect()
```
Now, let's see how to create SedonaDB dataframes.
## Create SedonaDB DataFrame
**Manually creating SedonaDB DataFrame**
Here's how to manually create a SedonaDB DataFrame:
```python
df = sd.sql("""
SELECT * FROM (VALUES
('one', ST_GeomFromWkt('POINT(1 2)')),
('two', ST_GeomFromWkt('POLYGON((-74.0 40.7, -74.0 40.8, -73.9 40.8, -73.9 40.7, -74.0 40.7))')),
('three', ST_GeomFromWkt('LINESTRING(-74.0060 40.7128, -73.9352 40.7306, -73.8561 40.8484)')))
AS t(val, point)""")
```
Check the type of the DataFrame.
```python
type(df)
```
sedona.db.dataframe.DataFrame
**Create SedonaDB DataFrame from files in S3**
For most production applications, you will create SedonaDB DataFrames by reading data from a file. Let's see how to read GeoParquet files in AWS S3 into a SedonaDB DataFrame.
```python
sd.read_parquet(
"s3://overturemaps-us-west-2/release/2025-08-20.0/theme=divisions/type=division_area/",
options={"aws.skip_signature": True, "aws.region": "us-west-2"},
).to_view("division_area")
```
Now, let's run some spatial queries.
### Read from GeoPandas DataFrame
This section shows how to convert a GeoPandas DataFrame into a SedonaDB DataFrame.
Start by reading a FlatGeoBuf file into a GeoPandas DataFrame:
```python
import geopandas as gpd
path = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.fgb"
gdf = gpd.read_file(path)
```
Now convert the GeoPandas DataFrame to a SedonaDB DataFrame and view three rows of content:
```python
df = sd.create_data_frame(gdf)
df.show(3)
```
┌──────────────┬──────────────────────────────┐
│ name ┆ geometry │
│ utf8 ┆ geometry │
╞══════════════╪══════════════════════════════╡
│ Vatican City ┆ POINT(12.4533865 41.9032822) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ San Marino ┆ POINT(12.4417702 43.9360958) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Vaduz ┆ POINT(9.5166695 47.1337238) │
└──────────────┴──────────────────────────────┘
## Spatial queries
Let's see how to run spatial operations like filtering, joins, and clustering algorithms.
### Spatial filtering
Let's run a spatial filtering operation to fetch all the objects in the following polygon:
```python
nova_scotia_bbox_wkt = (
"POLYGON((-66.5 43.4, -66.5 47.1, -59.8 47.1, -59.8 43.4, -66.5 43.4))"
)
ns = sd.sql(f"""
SELECT country, region, geometry
FROM division_area
WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{nova_scotia_bbox_wkt}'), 4326))
""")
ns.show(3)
```
┌──────────┬──────────┬────────────────────────────────────────────────────────────────────────────┐
│ country ┆ region ┆ geometry │
│ utf8view ┆ utf8view ┆ geometry │
╞══════════╪══════════╪════════════════════════════════════════════════════════════════════════════╡
│ CA ┆ CA-NS ┆ POLYGON((-66.0528452 43.4531336,-66.0883401 43.3978188,-65.9647654 43.361… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CA ┆ CA-NS ┆ POLYGON((-66.0222822 43.5166842,-66.0252286 43.5100071,-66.0528452 43.453… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CA ┆ CA-NS ┆ POLYGON((-65.7451389 43.5336263,-65.7450818 43.5347004,-65.7449545 43.535… │
└──────────┴──────────┴────────────────────────────────────────────────────────────────────────────┘
You can see it only includes the divisions in the Nova Scotia area.
### K-nearest neighbors (KNN) joins
Create `restaurants` and `customers` views so we can demonstrate the KNN join functionality.
```python
df = sd.sql("""
SELECT name, ST_Point(lng, lat) AS location
FROM (VALUES
(101, -74.0, 40.7, 'Pizza Palace'),
(102, -73.99, 40.69, 'Burger Barn'),
(103, -74.02, 40.72, 'Taco Town'),
(104, -73.98, 40.75, 'Sushi Spot'),
(105, -74.05, 40.68, 'Deli Direct')
) AS t(id, lng, lat, name)
""")
sd.sql("drop view if exists restaurants")
df.to_view("restaurants")
df = sd.sql("""
SELECT name, ST_Point(lng, lat) AS location
FROM (VALUES
(1, -74.0, 40.7, 'Alice'),
(2, -73.9, 40.8, 'Bob'),
(3, -74.1, 40.6, 'Carol')
) AS t(id, lng, lat, name)
""")
sd.sql("drop view if exists customers")
df.to_view("customers")
```
```python
df.show()
```
┌───────┬───────────────────┐
│ name ┆ location │
│ utf8 ┆ geometry │
╞═══════╪═══════════════════╡
│ Alice ┆ POINT(-74 40.7) │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bob ┆ POINT(-73.9 40.8) │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Carol ┆ POINT(-74.1 40.6) │
└───────┴───────────────────┘
Perform a KNN join to identify the two restaurants that are nearest to each customer:
```python
sd.sql("""
SELECT
c.name AS customer,
r.name AS restaurant
FROM customers c, restaurants r
WHERE ST_KNN(c.location, r.location, 2, false)
ORDER BY c.name, r.name;
""").show()
```
┌──────────┬──────────────┐
│ customer ┆ restaurant │
│ utf8 ┆ utf8 │
╞══════════╪══════════════╡
│ Alice ┆ Burger Barn │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Alice ┆ Pizza Palace │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bob ┆ Pizza Palace │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bob ┆ Sushi Spot │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Carol ┆ Deli Direct │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Carol ┆ Pizza Palace │
└──────────┴──────────────┘
Notice how each customer has two rows - one for each of the two closest restaurants.