docs/usecases/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb - sedona - Git at Google

 {
  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "```\n",
     "Licensed to the Apache Software Foundation (ASF) under one\n",
     "or more contributor license agreements.  See the NOTICE file\n",
     "distributed with this work for additional information\n",
     "regarding copyright ownership.  The ASF licenses this file\n",
     "to you under the Apache License, Version 2.0 (the\n",
     "\"License\"); you may not use this file except in compliance\n",
     "with the License.  You may obtain a copy of the License at\n",
     "  http://www.apache.org/licenses/LICENSE-2.0\n",
     "Unless required by applicable law or agreed to in writing,\n",
     "software distributed under the License is distributed on an\n",
     "\"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
     "KIND, either express or implied.  See the License for the\n",
     "specific language governing permissions and limitations\n",
     "under the License.\n",
     "```"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "import os\n",
     "\n",
     "import geopandas as gpd\n",
     "from pyspark.sql import SparkSession\n",
     "from pyspark.sql.functions import col, expr, when, explode, hex\n",
     "\n",
     "\n",
     "from sedona.spark import *\n",
     "from utilities import getConfig"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Setup Sedona environment"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "config = (\n",
     "    SedonaContext.builder()\n",
     "    .config(\n",
     "        \"spark.jars.packages\",\n",
     "        \"uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4\",\n",
     "    )\n",
     "    .master(\"spark://localhost:7077\")\n",
     "    .getOrCreate()\n",
     ")\n",
     "\n",
     "sedona = SedonaContext.create(config)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Read countries shapefile into a Sedona DataFrame \n",
     "Data link: https://www.naturalearthdata.com/downloads/50m-cultural-vectors/"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "countries_df = sedona.read.format(\"shapefile\").load(\n",
     "    \"data/ne_50m_admin_0_countries_lakes\"\n",
     ")\n",
     "countries_df.createOrReplaceTempView(\"country\")\n",
     "countries_df.printSchema()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Read airports shapefile into a Sedona DataFrame \n",
     "Data link: https://www.naturalearthdata.com/downloads/50m-cultural-vectors/"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "airports_df = sedona.read.format(\"shapefile\").load(\"data/ne_50m_airports/\")\n",
     "airports_df.createOrReplaceTempView(\"airport\")\n",
     "airports_df.printSchema()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "\n",
     "\n",
     "## Run Spatial Join using SQL API"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "result = sedona.sql(\n",
     "    \"SELECT c.geometry as country_geom, c.NAME_EN, a.geometry as airport_geom, a.name FROM country c, airport a WHERE ST_Contains(c.geometry, a.geometry)\"\n",
     ")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Run Spatial Join using RDD API"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "from sedona.spark import StructuredAdapter\n",
     "\n",
     "airports_rdd = StructuredAdapter.toSpatialRdd(airports_df, \"geometry\")\n",
     "# Drop the duplicate name column in countries_df\n",
     "countries_df = countries_df.drop(\"NAME\")\n",
     "countries_rdd = StructuredAdapter.toSpatialRdd(countries_df, \"geometry\")\n",
     "\n",
     "airports_rdd.analyze()\n",
     "countries_rdd.analyze()\n",
     "\n",
     "# 4 is the num partitions used in spatial partitioning. This is an optional parameter\n",
     "airports_rdd.spatialPartitioning(GridType.KDBTREE, 4)\n",
     "countries_rdd.spatialPartitioning(airports_rdd.getPartitioner())\n",
     "\n",
     "buildOnSpatialPartitionedRDD = True\n",
     "usingIndex = True\n",
     "considerBoundaryIntersection = True\n",
     "airports_rdd.buildIndex(IndexType.QUADTREE, buildOnSpatialPartitionedRDD)\n",
     "\n",
     "result_pair_rdd = JoinQueryRaw.SpatialJoinQueryFlat(\n",
     "    airports_rdd, countries_rdd, usingIndex, considerBoundaryIntersection\n",
     ")\n",
     "\n",
     "result2 = StructuredAdapter.pairRddToDf(\n",
     "    result_pair_rdd, countries_df.schema, airports_df.schema, sedona\n",
     ")\n",
     "\n",
     "result2.createOrReplaceTempView(\"join_result_with_all_cols\")\n",
     "# Select the columns needed in the join\n",
     "result2 = sedona.sql(\"SELECT * FROM join_result_with_all_cols\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Print spatial join results"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# The result of SQL API\n",
     "result.show()\n",
     "# The result of RDD API\n",
     "result2.show()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Group airports by country"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "scrolled": true
    },
    "outputs": [],
    "source": [
     "result.createOrReplaceTempView(\"result\")\n",
     "groupedresult = sedona.sql(\n",
     "    \"SELECT c.NAME_EN, c.country_geom, count(*) as AirportCount FROM result c GROUP BY c.NAME_EN, c.country_geom\"\n",
     ")\n",
     "groupedresult.show()\n",
     "groupedresult.createOrReplaceTempView(\"grouped_result\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Visualize the number of airports in each country"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Visualize using SedonaKepler"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "sedona_kepler_map = SedonaKepler.create_map(\n",
     "    df=groupedresult, name=\"AirportCount\", config=getConfig()\n",
     ")\n",
     "sedona_kepler_map"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Visualize using SedonaPyDeck\n",
     "The above visualization is generated by a pre-set config informing SedonaKepler that the map to be rendered has to be a choropleth map with choropleth of the `AirportCount` column value.\n",
     "\n",
     "This can be also be achieved using [SedonaPyDeck](https://sedona.apache.org/1.5.0/tutorial/sql/#sedonapydeck) and its `create_choropleth_map` API."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "tags": []
    },
    "outputs": [],
    "source": [
     "sedona_pydeck_map = SedonaPyDeck.create_choropleth_map(\n",
     "    df=groupedresult, plot_col=\"AirportCount\"\n",
     ")\n",
     "sedona_pydeck_map"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Visualize Uber H3 cells using SedonaKepler\n",
     "The following tutorial depicts how Uber H3 cells can be generated using Sedona and visualized using SedonaKepler."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Generate H3 cell IDs\n",
     "[ST_H3CellIDs](https://sedona.apache.org/1.5.0/api/flink/Function/#st_h3cellids) can be used to generated cell IDs for given geometries"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "tags": []
    },
    "outputs": [],
    "source": [
     "h3_df = sedona.sql(\n",
     "    \"SELECT g.NAME_EN, g.country_geom, ST_H3CellIDs(g.country_geom, 3, false) as h3_cellID from grouped_result g\"\n",
     ")\n",
     "h3_df.show(2)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Since each geometry can have multiple H3 cell IDs, let's explode the generated H3 cell ID array to get individual cells"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "tags": []
    },
    "outputs": [],
    "source": [
     "exploded_h3 = h3_df.select(\n",
     "    h3_df.NAME_EN, h3_df.country_geom, explode(h3_df.h3_cellID).alias(\"h3\")\n",
     ")\n",
     "exploded_h3.show(2)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Convert generated long H3 cell ID to a hex cell ID\n",
     "SedonaKepler accepts each H3 cell ID as a hexadecimal to automatically visualize them. Also, let us sample the data to be able to visualize sparse cells on the map."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "tags": []
    },
    "outputs": [],
    "source": [
     "exploded_h3 = exploded_h3.sample(0.3)\n",
     "exploded_h3.createOrReplaceTempView(\"exploded_h3\")\n",
     "hex_exploded_h3 = exploded_h3.select(\n",
     "    exploded_h3.NAME_EN, hex(exploded_h3.h3).alias(\"ex_h3\")\n",
     ")\n",
     "hex_exploded_h3.show(2)\n",
     "hex_exploded_h3.printSchema()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Visualize using SedonaKepler\n",
     "Now, simply provide the final df to SedonaKepler.create_map and you can automagically visualize the H3 cells on the map!"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "tags": []
    },
    "outputs": [],
    "source": [
     "sedona_kepler_h3 = SedonaKepler.create_map(df=hex_exploded_h3, name=\"h3\")\n",
     "sedona_kepler_h3"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": []
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.10.12"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 4
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"```\n",
	"Licensed to the Apache Software Foundation (ASF) under one\n",
	"or more contributor license agreements. See the NOTICE file\n",
	"distributed with this work for additional information\n",
	"regarding copyright ownership. The ASF licenses this file\n",
	"to you under the Apache License, Version 2.0 (the\n",
	"\"License\"); you may not use this file except in compliance\n",
	"with the License. You may obtain a copy of the License at\n",
	" http://www.apache.org/licenses/LICENSE-2.0\n",
	"Unless required by applicable law or agreed to in writing,\n",
	"software distributed under the License is distributed on an\n",
	"\"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
	"KIND, either express or implied. See the License for the\n",
	"specific language governing permissions and limitations\n",
	"under the License.\n",
	"```"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import os\n",
	"\n",
	"import geopandas as gpd\n",
	"from pyspark.sql import SparkSession\n",
	"from pyspark.sql.functions import col, expr, when, explode, hex\n",
	"\n",
	"\n",
	"from sedona.spark import *\n",
	"from utilities import getConfig"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Setup Sedona environment"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"config = (\n",
	" SedonaContext.builder()\n",
	" .config(\n",
	" \"spark.jars.packages\",\n",
	" \"uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4\",\n",
	" )\n",
	" .master(\"spark://localhost:7077\")\n",
	" .getOrCreate()\n",
	")\n",
	"\n",
	"sedona = SedonaContext.create(config)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Read countries shapefile into a Sedona DataFrame \n",
	"Data link: https://www.naturalearthdata.com/downloads/50m-cultural-vectors/"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"countries_df = sedona.read.format(\"shapefile\").load(\n",
	" \"data/ne_50m_admin_0_countries_lakes\"\n",
	")\n",
	"countries_df.createOrReplaceTempView(\"country\")\n",
	"countries_df.printSchema()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Read airports shapefile into a Sedona DataFrame \n",
	"Data link: https://www.naturalearthdata.com/downloads/50m-cultural-vectors/"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"airports_df = sedona.read.format(\"shapefile\").load(\"data/ne_50m_airports/\")\n",
	"airports_df.createOrReplaceTempView(\"airport\")\n",
	"airports_df.printSchema()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"\n",
	"\n",
	"## Run Spatial Join using SQL API"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"result = sedona.sql(\n",
	" \"SELECT c.geometry as country_geom, c.NAME_EN, a.geometry as airport_geom, a.name FROM country c, airport a WHERE ST_Contains(c.geometry, a.geometry)\"\n",
	")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Run Spatial Join using RDD API"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"from sedona.spark import StructuredAdapter\n",
	"\n",
	"airports_rdd = StructuredAdapter.toSpatialRdd(airports_df, \"geometry\")\n",
	"# Drop the duplicate name column in countries_df\n",
	"countries_df = countries_df.drop(\"NAME\")\n",
	"countries_rdd = StructuredAdapter.toSpatialRdd(countries_df, \"geometry\")\n",
	"\n",
	"airports_rdd.analyze()\n",
	"countries_rdd.analyze()\n",
	"\n",
	"# 4 is the num partitions used in spatial partitioning. This is an optional parameter\n",
	"airports_rdd.spatialPartitioning(GridType.KDBTREE, 4)\n",
	"countries_rdd.spatialPartitioning(airports_rdd.getPartitioner())\n",
	"\n",
	"buildOnSpatialPartitionedRDD = True\n",
	"usingIndex = True\n",
	"considerBoundaryIntersection = True\n",
	"airports_rdd.buildIndex(IndexType.QUADTREE, buildOnSpatialPartitionedRDD)\n",
	"\n",
	"result_pair_rdd = JoinQueryRaw.SpatialJoinQueryFlat(\n",
	" airports_rdd, countries_rdd, usingIndex, considerBoundaryIntersection\n",
	")\n",
	"\n",
	"result2 = StructuredAdapter.pairRddToDf(\n",
	" result_pair_rdd, countries_df.schema, airports_df.schema, sedona\n",
	")\n",
	"\n",
	"result2.createOrReplaceTempView(\"join_result_with_all_cols\")\n",
	"# Select the columns needed in the join\n",
	"result2 = sedona.sql(\"SELECT * FROM join_result_with_all_cols\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Print spatial join results"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# The result of SQL API\n",
	"result.show()\n",
	"# The result of RDD API\n",
	"result2.show()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Group airports by country"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"result.createOrReplaceTempView(\"result\")\n",
	"groupedresult = sedona.sql(\n",
	" \"SELECT c.NAME_EN, c.country_geom, count(*) as AirportCount FROM result c GROUP BY c.NAME_EN, c.country_geom\"\n",
	")\n",
	"groupedresult.show()\n",
	"groupedresult.createOrReplaceTempView(\"grouped_result\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Visualize the number of airports in each country"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Visualize using SedonaKepler"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"sedona_kepler_map = SedonaKepler.create_map(\n",
	" df=groupedresult, name=\"AirportCount\", config=getConfig()\n",
	")\n",
	"sedona_kepler_map"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Visualize using SedonaPyDeck\n",
	"The above visualization is generated by a pre-set config informing SedonaKepler that the map to be rendered has to be a choropleth map with choropleth of the `AirportCount` column value.\n",
	"\n",
	"This can be also be achieved using [SedonaPyDeck](https://sedona.apache.org/1.5.0/tutorial/sql/#sedonapydeck) and its `create_choropleth_map` API."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"tags": []
	},
	"outputs": [],
	"source": [
	"sedona_pydeck_map = SedonaPyDeck.create_choropleth_map(\n",
	" df=groupedresult, plot_col=\"AirportCount\"\n",
	")\n",
	"sedona_pydeck_map"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Visualize Uber H3 cells using SedonaKepler\n",
	"The following tutorial depicts how Uber H3 cells can be generated using Sedona and visualized using SedonaKepler."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Generate H3 cell IDs\n",
	"[ST_H3CellIDs](https://sedona.apache.org/1.5.0/api/flink/Function/#st_h3cellids) can be used to generated cell IDs for given geometries"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"tags": []
	},
	"outputs": [],
	"source": [
	"h3_df = sedona.sql(\n",
	" \"SELECT g.NAME_EN, g.country_geom, ST_H3CellIDs(g.country_geom, 3, false) as h3_cellID from grouped_result g\"\n",
	")\n",
	"h3_df.show(2)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Since each geometry can have multiple H3 cell IDs, let's explode the generated H3 cell ID array to get individual cells"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"tags": []
	},
	"outputs": [],
	"source": [
	"exploded_h3 = h3_df.select(\n",
	" h3_df.NAME_EN, h3_df.country_geom, explode(h3_df.h3_cellID).alias(\"h3\")\n",
	")\n",
	"exploded_h3.show(2)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Convert generated long H3 cell ID to a hex cell ID\n",
	"SedonaKepler accepts each H3 cell ID as a hexadecimal to automatically visualize them. Also, let us sample the data to be able to visualize sparse cells on the map."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"tags": []
	},
	"outputs": [],
	"source": [
	"exploded_h3 = exploded_h3.sample(0.3)\n",
	"exploded_h3.createOrReplaceTempView(\"exploded_h3\")\n",
	"hex_exploded_h3 = exploded_h3.select(\n",
	" exploded_h3.NAME_EN, hex(exploded_h3.h3).alias(\"ex_h3\")\n",
	")\n",
	"hex_exploded_h3.show(2)\n",
	"hex_exploded_h3.printSchema()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Visualize using SedonaKepler\n",
	"Now, simply provide the final df to SedonaKepler.create_map and you can automagically visualize the H3 cells on the map!"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"tags": []
	},
	"outputs": [],
	"source": [
	"sedona_kepler_h3 = SedonaKepler.create_map(df=hex_exploded_h3, name=\"h3\")\n",
	"sedona_kepler_h3"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3 (ipykernel)",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.10.12"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}