blob: e23c07c798c752b251ba816fee1d7b9f0a3361e4 [file] [log] [blame]
{
"cells": [
{
"cell_type": "markdown",
"id": "91910e50-a5ae-4d5a-a431-62ac5fbc11ca",
"metadata": {},
"source": [
"# Joining Spatial Data with Different Coordinate Systems\n",
"\n",
"> Note: Before running this notebook, ensure that you have installed SedonaDB: `pip install \"apache-sedona[db]\"`\n",
"\n",
"This example demonstrates how one table with an EPSG 4326 CRS cannot be joined with another table that uses EPSG 3857.\n",
"\n",
"A Coordinate Reference System (CRS) defines how the two-dimensional coordinates of a map relate to real locations on Earth. Operations like spatial joins, distance calculations, or overlays require all datasets to be in the same CRS to produce accurate results.\n",
"\n",
"This notebook demonstrates a key feature of SedonaDB: it protects users from generating incorrect results by raising an error if you attempt to join tables with mismatched coordinate reference systems."
]
},
{
"cell_type": "markdown",
"id": "84493c4d",
"metadata": {},
"source": [
"We will walk through two examples:\n",
"\n",
"- Joining countries (using EPSG:4326, a geographic CRS) and cities (using EPSG:3857, a projected CRS).\n",
"\n",
"- Finding the number of buildings in Vermont by joining two large datasets with different coordinate reference systems."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "be8ffe47-dc89-4842-bb1e-1e8640afffc3",
"metadata": {},
"outputs": [],
"source": [
"import sedona.db\n",
"\n",
"sd = sedona.db.connect()"
]
},
{
"cell_type": "markdown",
"id": "54b48173-6be0-4827-ac42-1439eb31e9f7",
"metadata": {},
"source": [
"Read a table with a geometry column that uses EPSG 4326.\n",
"\n",
"Note how SedonaDB reads the CRS specified in the Parquet file."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "143f00d5-6878-4dab-a82c-c9fb4dbfaf00",
"metadata": {},
"outputs": [],
"source": [
"countries = sd.read_parquet(\n",
" \"https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries_geo.parquet\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8ef94b7b-b65d-4da5-9443-3253e84e2e7f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SedonaSchema with 3 fields:\n",
" name: utf8<Utf8View>\n",
" continent: utf8<Utf8View>\n",
" geometry: geometry<WkbView(epsg:4326)>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"countries.schema"
]
},
{
"cell_type": "markdown",
"id": "34676ec5",
"metadata": {},
"source": [
"## Example 1: Create a cities table with a Projected CRS\n",
"\n",
"We will now create a DataFrame containing several major US cities. The coordinates are provided in Web Mercator (EPSG:3857), which uses meters as its unit.\n",
"\n",
"We use the `ST_SetSRID` function to assign the correct CRS identifier to our geometry data. It's important to distinguish between these two key functions:\n",
"\n",
"`ST_SetSRID(geometry, srid)`: This function assigns an SRID to a geometry. It does not change the underlying coordinate values. You should only use this when your data has a CRS that SedonaDB was unable to infer.\n",
"\n",
"`ST_Transform(geometry, target_srid)`: This function transforms the geometry from its current CRS to a new one. It re-projects the coordinate values themselves."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "579e843f",
"metadata": {},
"outputs": [],
"source": [
"cities = sd.sql(\"\"\"\n",
"SELECT city, ST_SetSRID(ST_GeomFromText(wkt), 3857) AS geometry FROM (VALUES\n",
" ('New York', 'POINT(-8238310.24 4969803.34)'),\n",
" ('Los Angeles', 'POINT(-13153204.78 4037636.04)'),\n",
" ('Chicago', 'POINT(-9757148.04 5138517.44)'))\n",
"AS t(city, wkt)\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "36e53438-c2d2-444e-9f34-d391f0f3f588",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SedonaSchema with 2 fields:\n",
" city: utf8<Utf8>\n",
" geometry: geometry<Wkb(epsg:3857)>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cities.schema"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "62c87571-50aa-4f57-a7dd-4afa3210320a",
"metadata": {},
"outputs": [],
"source": [
"cities.to_view(\"cities\", overwrite=True)\n",
"countries.to_view(\"countries\", overwrite=True)"
]
},
{
"cell_type": "markdown",
"id": "561b3c8c-4952-4fa7-9fe1-3fa0522b0d9f",
"metadata": {},
"source": [
"### Join with mismatched Coordinate Reference Systems\n",
"\n",
"The cities and countries tables have different CRSs.\n",
"\n",
"The cities table uses EPSG:3857 and the countries table uses EPSG:4326.\n",
"\n",
"Let's confirm that the code errors out if we try to join the mismatched tables."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "906bad37-4f3f-4028-82b4-487fabe5957f",
"metadata": {},
"outputs": [
{
"ename": "SedonaError",
"evalue": "type_coercion\ncaused by\nError during planning: Mismatched CRS arguments: epsg:3857 vs epsg:4326\nUse ST_Transform() or ST_SetSRID() to ensure arguments are compatible.",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mSedonaError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[7]\u001b[39m\u001b[32m, line 6\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# join doesn't work when CRSs don't match\u001b[39;00m\n\u001b[32m 2\u001b[39m \u001b[43msd\u001b[49m\u001b[43m.\u001b[49m\u001b[43msql\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\"\"\u001b[39;49m\n\u001b[32m 3\u001b[39m \u001b[33;43mselect * from cities\u001b[39;49m\n\u001b[32m 4\u001b[39m \u001b[33;43mjoin countries\u001b[39;49m\n\u001b[32m 5\u001b[39m \u001b[33;43mwhere ST_Intersects(cities.geometry, countries.geometry)\u001b[39;49m\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m \u001b[33;43m\"\"\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43mshow\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/new-sedonadb/sedona-db/python/sedonadb/python/sedonadb/dataframe.py:380\u001b[39m, in \u001b[36mDataFrame.show\u001b[39m\u001b[34m(self, limit, width, ascii)\u001b[39m\n\u001b[32m 356\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"Print the first limit rows to the console\u001b[39;00m\n\u001b[32m 357\u001b[39m \n\u001b[32m 358\u001b[39m \u001b[33;03mArgs:\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 377\u001b[39m \n\u001b[32m 378\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 379\u001b[39m width = \u001b[38;5;28mself\u001b[39m._out_width(width)\n\u001b[32m--> \u001b[39m\u001b[32m380\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_impl\u001b[49m\u001b[43m.\u001b[49m\u001b[43mshow\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_ctx\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlimit\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mwidth\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mascii\u001b[49m\u001b[43m)\u001b[49m, end=\u001b[33m\"\u001b[39m\u001b[33m\"\u001b[39m)\n",
"\u001b[31mSedonaError\u001b[39m: type_coercion\ncaused by\nError during planning: Mismatched CRS arguments: epsg:3857 vs epsg:4326\nUse ST_Transform() or ST_SetSRID() to ensure arguments are compatible."
]
}
],
"source": [
"# join doesn't work when CRSs don't match\n",
"sd.sql(\"\"\"\n",
"select * from cities\n",
"join countries\n",
"where ST_Intersects(cities.geometry, countries.geometry)\n",
"\"\"\").show()"
]
},
{
"cell_type": "markdown",
"id": "41e6f59f-5217-40b2-b05a-9c95eae29df8",
"metadata": {},
"source": [
"### Convert CRS and then join\n",
"\n",
"Let's convert the cities table to use EPSG:4326 and then perform the join with the two tables once they have matching CRSs."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "122857c1-f68d-4037-9787-54c20706e60f",
"metadata": {},
"outputs": [],
"source": [
"# update cities to use 4326\n",
"cities = sd.sql(\"\"\"\n",
"SELECT city, ST_Transform(geometry, 'EPSG:4326') as geometry\n",
"FROM cities\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "947e085c-62a4-4315-b155-007a95156964",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SedonaSchema with 2 fields:\n",
" city: utf8<Utf8>\n",
" geometry: geometry<Wkb(ogc:crs84)>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cities.schema"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3bcbaf7a-ec40-4b7e-85c2-db5ef1e3232e",
"metadata": {},
"outputs": [],
"source": [
"cities.to_view(\"cities\", overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "819c8d04-fa03-4ef8-aecb-e79d48f0b820",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"┌─────────────┬──────────────────────┬──────────────────────┬───────────────┬──────────────────────┐\n",
"│ city ┆ geometry ┆ name ┆ continent ┆ geometry │\n",
"│ utf8 ┆ geometry ┆ utf8 ┆ utf8 ┆ geometry │\n",
"╞═════════════╪══════════════════════╪══════════════════════╪═══════════════╪══════════════════════╡\n",
"│ New York ┆ POINT(-74.006000039… ┆ United States of Am… ┆ North America ┆ MULTIPOLYGON(((-122… │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Los Angeles ┆ POINT(-118.15724889… ┆ United States of Am… ┆ North America ┆ MULTIPOLYGON(((-122… │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Chicago ┆ POINT(-87.649952137… ┆ United States of Am… ┆ North America ┆ MULTIPOLYGON(((-122… │\n",
"└─────────────┴──────────────────────┴──────────────────────┴───────────────┴──────────────────────┘\n"
]
}
],
"source": [
"# join works when CRSs match\n",
"sd.sql(\"\"\"\n",
"select * from cities\n",
"join countries\n",
"where ST_Intersects(cities.geometry, countries.geometry)\n",
"\"\"\").show()"
]
},
{
"cell_type": "markdown",
"id": "5279bebd-1d8d-4f33-bcd9-2c1e93ff7221",
"metadata": {},
"source": [
"## Example #2: Joining two tables with different CRSs\n",
"\n",
"This example shows how to join a `vermont` table with an EPSG 32618 CRS with a `buildings` table that uses an EPSG 4326 CRS.\n",
"\n",
"The example highlights the following features:\n",
"\n",
"1. SedonaDB reads the CRS stored in the files.\n",
"2. SedonaDB protects you from accidentally joining files with mismatched CRSs.\n",
"3. It's easy to convert a GeoPandas DataFrame to a SedonaDB DataFrame and maintain the CRS."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d9ea1469-8e6d-4ef1-a440-5573c1345f0d",
"metadata": {},
"outputs": [],
"source": [
"import geopandas as gpd\n",
"\n",
"path = \"https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/example-crs/files/example-crs_vermont-utm.fgb\"\n",
"gdf = gpd.read_file(path)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "46660482-3fed-4e6c-b37a-2947326e884a",
"metadata": {},
"outputs": [],
"source": [
"vermont = sd.create_data_frame(gdf)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "1571aa5e-638c-493a-90a1-0cffbeea0bd9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SedonaSchema with 1 field:\n",
" geometry: geometry<Wkb(epsg:32618)>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vermont.schema"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "eb4b33f1-972e-4c18-bdbd-fd4fe268b339",
"metadata": {},
"outputs": [],
"source": [
"buildings = sd.read_parquet(\n",
" \"https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/microsoft-buildings_point_geo.parquet\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "22baf082-736e-4881-9704-d57eea07068c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"┌─────────────────────────────────┐\n",
"│ geometry │\n",
"│ geometry │\n",
"╞═════════════════════════════════╡\n",
"│ POINT(-97.16154292 26.08759861) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ POINT(-97.1606625 26.08481) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ POINT(-97.16133375 26.08519809) │\n",
"└─────────────────────────────────┘\n"
]
}
],
"source": [
"buildings.show(3)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "e1d2d89d-8227-43ea-8507-fd6524fe2ac5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SedonaSchema with 1 field:\n",
" geometry: geometry<WkbView(ogc:crs84)>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"buildings.schema"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "e959081e-3a8d-4041-b00f-a19bca10be39",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"129735970"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"buildings.count()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "d9ef702f-e424-4e53-9629-da6ed256ee7f",
"metadata": {},
"outputs": [],
"source": [
"buildings.to_view(\"buildings\", overwrite=True)\n",
"vermont.to_view(\"vermont\", overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "0c300dad-aa60-4291-a10b-d2a332b37593",
"metadata": {},
"outputs": [
{
"ename": "SedonaError",
"evalue": "type_coercion\ncaused by\nError during planning: Mismatched CRS arguments: ogc:crs84 vs epsg:32618\nUse ST_Transform() or ST_SetSRID() to ensure arguments are compatible.",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mSedonaError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[20]\u001b[39m\u001b[32m, line 8\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Again, SedonaDB prevents accidentally joining files with mismatched CRSs.\u001b[39;00m\n\u001b[32m 2\u001b[39m \u001b[43msd\u001b[49m\u001b[43m.\u001b[49m\u001b[43msql\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\"\"\u001b[39;49m\n\u001b[32m 3\u001b[39m \u001b[33;43mSELECT count(*) from buildings\u001b[39;49m\n\u001b[32m 4\u001b[39m \u001b[33;43mJOIN vermont\u001b[39;49m\n\u001b[32m 5\u001b[39m \u001b[33;43mWHERE ST_Intersects(\u001b[39;49m\n\u001b[32m 6\u001b[39m \u001b[33;43m buildings.geometry,\u001b[39;49m\n\u001b[32m 7\u001b[39m \u001b[33;43m vermont.geometry)\u001b[39;49m\n\u001b[32m----> \u001b[39m\u001b[32m8\u001b[39m \u001b[33;43m\"\"\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43mshow\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/new-sedonadb/sedona-db/python/sedonadb/python/sedonadb/dataframe.py:380\u001b[39m, in \u001b[36mDataFrame.show\u001b[39m\u001b[34m(self, limit, width, ascii)\u001b[39m\n\u001b[32m 356\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"Print the first limit rows to the console\u001b[39;00m\n\u001b[32m 357\u001b[39m \n\u001b[32m 358\u001b[39m \u001b[33;03mArgs:\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 377\u001b[39m \n\u001b[32m 378\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 379\u001b[39m width = \u001b[38;5;28mself\u001b[39m._out_width(width)\n\u001b[32m--> \u001b[39m\u001b[32m380\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_impl\u001b[49m\u001b[43m.\u001b[49m\u001b[43mshow\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_ctx\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlimit\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mwidth\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mascii\u001b[49m\u001b[43m)\u001b[49m, end=\u001b[33m\"\u001b[39m\u001b[33m\"\u001b[39m)\n",
"\u001b[31mSedonaError\u001b[39m: type_coercion\ncaused by\nError during planning: Mismatched CRS arguments: ogc:crs84 vs epsg:32618\nUse ST_Transform() or ST_SetSRID() to ensure arguments are compatible."
]
}
],
"source": [
"# Again, SedonaDB prevents accidentally joining files with mismatched CRSs.\n",
"sd.sql(\"\"\"\n",
"SELECT count(*) from buildings\n",
"JOIN vermont\n",
"WHERE ST_Intersects(\n",
" buildings.geometry,\n",
" vermont.geometry)\n",
"\"\"\").show()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "648984ce-ac7a-4f76-ac9f-17bd5c628bd0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"┌──────────┐\n",
"│ count(*) │\n",
"│ int64 │\n",
"╞══════════╡\n",
"│ 361856 │\n",
"└──────────┘\n"
]
}
],
"source": [
"sd.sql(\"\"\"\n",
"SELECT count(*)\n",
"FROM buildings\n",
"JOIN vermont\n",
"WHERE ST_Intersects(\n",
" buildings.geometry,\n",
" ST_Transform(vermont.geometry, 'EPSG:4326')\n",
")\n",
"\"\"\").show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv (3.13.3)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}