| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "id": "1932983e-1cd2-41d0-a5eb-0537b3ac3feb", |
| "metadata": {}, |
| "source": [ |
| "<!---\n", |
| " Licensed to the Apache Software Foundation (ASF) under one\n", |
| " or more contributor license agreements. See the NOTICE file\n", |
| " distributed with this work for additional information\n", |
| " regarding copyright ownership. The ASF licenses this file\n", |
| " to you under the Apache License, Version 2.0 (the\n", |
| " \"License\"); you may not use this file except in compliance\n", |
| " with the License. You may obtain a copy of the License at\n", |
| "\n", |
| " http://www.apache.org/licenses/LICENSE-2.0\n", |
| "\n", |
| " Unless required by applicable law or agreed to in writing,\n", |
| " software distributed under the License is distributed on an\n", |
| " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", |
| " KIND, either express or implied. See the License for the\n", |
| " specific language governing permissions and limitations\n", |
| " under the License.\n", |
| "-->\n", |
| "\n", |
| "# Working with Vector Data\n", |
| "\n", |
| "> Note: Before running this notebook, ensure that you have installed SedonaDB: `pip install \"apache-sedona[db]\"`\n", |
| "\n", |
| "Process vector data using sedona.db. You will learn to create DataFrames, run spatial queries, and manage file I/O. Let's begin by connecting to sedona.db.\n", |
| "\n", |
| "Let's start by establishing a SedonaDB connection." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "119fcbae", |
| "metadata": {}, |
| "source": [ |
| "## Establish SedonaDB connection\n", |
| "\n", |
| "Here's how to create the SedonaDB connection:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 1, |
| "id": "53c3b7a8-c42a-407a-a454-6ee1e943fbcc", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "import sedona.db\n", |
| "\n", |
| "sd = sedona.db.connect()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "7aeaa60f-2325-418c-8e72-4344bd4a75fe", |
| "metadata": {}, |
| "source": [ |
| "Now, let's see how to create SedonaDB dataframes.\n", |
| "\n", |
| "## Create SedonaDB DataFrame\n", |
| "\n", |
| "**Manually creating SedonaDB DataFrame**\n", |
| "\n", |
| "Here's how to manually create a SedonaDB DataFrame:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 2, |
| "id": "b3377767-d747-407c-92c0-8786c1998131", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "df = sd.sql(\"\"\"\n", |
| "SELECT * FROM (VALUES\n", |
| " ('one', ST_GeomFromWkt('POINT(1 2)')),\n", |
| " ('two', ST_GeomFromWkt('POLYGON((-74.0 40.7, -74.0 40.8, -73.9 40.8, -73.9 40.7, -74.0 40.7))')),\n", |
| " ('three', ST_GeomFromWkt('LINESTRING(-74.0060 40.7128, -73.9352 40.7306, -73.8561 40.8484)')))\n", |
| "AS t(val, point)\"\"\")" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "0f9e1319-2e7a-4d98-9df0-47a9a73cfff3", |
| "metadata": {}, |
| "source": [ |
| "Check the type of the DataFrame." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 3, |
| "id": "e8be30ab-4818-4db8-bae2-83e973ad1b77", |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "sedonadb.dataframe.DataFrame" |
| ] |
| }, |
| "execution_count": 3, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "type(df)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "8225ed1f-45a4-4915-a582-8ae191ec53ed", |
| "metadata": {}, |
| "source": [ |
| "**Create SedonaDB DataFrame from files in S3**\n", |
| "\n", |
| "For most production applications, you will create SedonaDB DataFrames by reading data from a file. Let's see how to read GeoParquet files in AWS S3 into a SedonaDB DataFrame." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 4, |
| "id": "151df287-4b2d-433e-9769-c3378df03b1b", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "sd.read_parquet(\n", |
| " \"s3://overturemaps-us-west-2/release/2025-11-19.0/theme=divisions/type=division_area/\",\n", |
| " options={\"aws.skip_signature\": True, \"aws.region\": \"us-west-2\"},\n", |
| ").to_view(\"division_area\")" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "858fcc66-816d-4c71-8875-82b74169eccd", |
| "metadata": {}, |
| "source": [ |
| "Now, let's run some spatial queries.\n", |
| "\n", |
| "### Read from GeoPandas DataFrame\n", |
| "\n", |
| "This section shows how to convert a GeoPandas DataFrame into a SedonaDB DataFrame.\n", |
| "\n", |
| "Start by reading a FlatGeoBuf file into a GeoPandas DataFrame:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 11, |
| "id": "b81549f2-0f58-49e4-9011-8de6578c2b0e", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "import geopandas as gpd\n", |
| "\n", |
| "path = \"https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.fgb\"\n", |
| "gdf = gpd.read_file(path)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "2265f94b-ccb3-4634-8c52-a8799c68c76a", |
| "metadata": {}, |
| "source": [ |
| "Now convert the GeoPandas DataFrame to a SedonaDB DataFrame and view three rows of content:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 6, |
| "id": "0e4819db-bf58-42d7-8b5b-f272d0f19266", |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "┌──────────────┬──────────────────────────────┐\n", |
| "│ name ┆ geometry │\n", |
| "│ utf8 ┆ geometry │\n", |
| "╞══════════════╪══════════════════════════════╡\n", |
| "│ Vatican City ┆ POINT(12.4533865 41.9032822) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ San Marino ┆ POINT(12.4417702 43.9360958) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Vaduz ┆ POINT(9.5166695 47.1337238) │\n", |
| "└──────────────┴──────────────────────────────┘\n" |
| ] |
| } |
| ], |
| "source": [ |
| "df = sd.create_data_frame(gdf)\n", |
| "df.show(3)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "6890bcc3-f3bd-4c47-bf86-2607bed5e480", |
| "metadata": {}, |
| "source": [ |
| "## Spatial queries\n", |
| "\n", |
| "Let's see how to run spatial operations like filtering, joins, and clustering algorithms.\n", |
| "\n", |
| "### Spatial filtering\n", |
| "\n", |
| "Let's run a spatial filtering operation to fetch all the objects in the following polygon:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 12, |
| "id": "8c8a4b48-8c4e-412e-900f-8c0f6f4ccc1d", |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "┌─────────┬────────┬───────────────────────────────────────────────────────────────────────────────┐\n", |
| "│ country ┆ region ┆ geometry │\n", |
| "│ utf8 ┆ utf8 ┆ geometry │\n", |
| "╞═════════╪════════╪═══════════════════════════════════════════════════════════════════════════════╡\n", |
| "│ CA ┆ CA-NB ┆ MULTIPOLYGON(((-67.1074147 44.4817314,-67.1058772 44.4815007,-67.104319 44.4… │\n", |
| "├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ CA ┆ CA-NB ┆ POLYGON((-66.2598821 45.1380421,-66.2599962 45.1381233,-66.2600591 45.138285… │\n", |
| "├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ CA ┆ CA-NB ┆ POLYGON((-66.4595418 45.2215004,-66.4595406 45.221468,-66.4595396 45.2213915… │\n", |
| "└─────────┴────────┴───────────────────────────────────────────────────────────────────────────────┘\n" |
| ] |
| } |
| ], |
| "source": [ |
| "nova_scotia_bbox_wkt = (\n", |
| " \"POLYGON((-66.5 43.4, -66.5 47.1, -59.8 47.1, -59.8 43.4, -66.5 43.4))\"\n", |
| ")\n", |
| "\n", |
| "ns = sd.sql(f\"\"\"\n", |
| "SELECT country, region, geometry\n", |
| "FROM division_area\n", |
| "WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{nova_scotia_bbox_wkt}'), 4326))\n", |
| "\"\"\")\n", |
| "\n", |
| "ns.show(3)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "32076e01-d807-40ed-8457-9d8c4244e89f", |
| "metadata": {}, |
| "source": [ |
| "You can see it only includes the divisions in the Nova Scotia area.\n", |
| "\n", |
| "### K-nearest neighbors (KNN) joins\n", |
| "\n", |
| "Create `restaurants` and `customers` views so we can demonstrate the KNN join functionality." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 8, |
| "id": "deaa36db-2fee-4ba2-ab79-1dc756cb1655", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "df = sd.sql(\"\"\"\n", |
| "SELECT name, ST_Point(lng, lat) AS location\n", |
| "FROM (VALUES\n", |
| " (101, -74.0, 40.7, 'Pizza Palace'),\n", |
| " (102, -73.99, 40.69, 'Burger Barn'),\n", |
| " (103, -74.02, 40.72, 'Taco Town'),\n", |
| " (104, -73.98, 40.75, 'Sushi Spot'),\n", |
| " (105, -74.05, 40.68, 'Deli Direct')\n", |
| ") AS t(id, lng, lat, name)\n", |
| "\"\"\")\n", |
| "sd.sql(\"drop view if exists restaurants\")\n", |
| "df.to_view(\"restaurants\")\n", |
| "\n", |
| "df = sd.sql(\"\"\"\n", |
| "SELECT name, ST_Point(lng, lat) AS location\n", |
| "FROM (VALUES\n", |
| " (1, -74.0, 40.7, 'Alice'),\n", |
| " (2, -73.9, 40.8, 'Bob'),\n", |
| " (3, -74.1, 40.6, 'Carol')\n", |
| ") AS t(id, lng, lat, name)\n", |
| "\"\"\")\n", |
| "sd.sql(\"drop view if exists customers\")\n", |
| "df.to_view(\"customers\")" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 9, |
| "id": "e3bc4976-4245-432f-b265-7f6aa13f35b9", |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "┌───────┬───────────────────┐\n", |
| "│ name ┆ location │\n", |
| "│ utf8 ┆ geometry │\n", |
| "╞═══════╪═══════════════════╡\n", |
| "│ Alice ┆ POINT(-74 40.7) │\n", |
| "├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Bob ┆ POINT(-73.9 40.8) │\n", |
| "├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Carol ┆ POINT(-74.1 40.6) │\n", |
| "└───────┴───────────────────┘\n" |
| ] |
| } |
| ], |
| "source": [ |
| "df.show()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "9df227d6-0972-457a-87e3-5a89802c460f", |
| "metadata": {}, |
| "source": [ |
| "Perform a KNN join to identify the two restaurants that are nearest to each customer:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 10, |
| "id": "05565e15-ee18-431c-8fd2-673291d8d0ee", |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "┌──────────┬──────────────┐\n", |
| "│ customer ┆ restaurant │\n", |
| "│ utf8 ┆ utf8 │\n", |
| "╞══════════╪══════════════╡\n", |
| "│ Alice ┆ Burger Barn │\n", |
| "├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Alice ┆ Pizza Palace │\n", |
| "├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Bob ┆ Pizza Palace │\n", |
| "├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Bob ┆ Sushi Spot │\n", |
| "├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Carol ┆ Deli Direct │\n", |
| "├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Carol ┆ Pizza Palace │\n", |
| "└──────────┴──────────────┘\n" |
| ] |
| } |
| ], |
| "source": [ |
| "sd.sql(\"\"\"\n", |
| "SELECT\n", |
| " c.name AS customer,\n", |
| " r.name AS restaurant\n", |
| "FROM customers c, restaurants r\n", |
| "WHERE ST_KNN(c.location, r.location, 2, false)\n", |
| "ORDER BY c.name, r.name;\n", |
| "\"\"\").show()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "2e93fe6a-b0a7-4ec0-952c-dde9edcacdc4", |
| "metadata": {}, |
| "source": [ |
| "Notice how each customer has two rows - one for each of the two closest restaurants." |
| ] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "Python 3 (ipykernel)", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.12.4" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 5 |
| } |