| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "# Working with Parquet Files\n", |
| "\n", |
| "The easiest way to read a GeoParquet or Parquet file is with `sd.read_parquet()`. Alternatively, you can query a file directly in SQL by referencing its path." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Install SedonaDB\n", |
| "\n", |
| "Use pip to install SedonaDB from the Python Package Index (PyPI)." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "> **Note**: Before running this notebook on your local machine, install SedonaDB in your environment with the following command: `pip install \"apache-sedona[db]\"`" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Implementation\n", |
| "\n", |
| "A common workflow for GeoParquet or Parquet files is:\n", |
| "\n", |
| "1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.\n", |
| "2. **Register** the data frame as a view with `.to_view()`.\n", |
| "3. **Query** the view using `sd.sql()`.\n", |
| "4. **Write** your results to a Parquet file with `.to_parquet()`, or use `.to_pandas()` to export them to a pandas DataFrame or GeoPandas GeoDataFrame." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 1, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "# Import the sedona.db module and connect to SedonaDB\n", |
| "import sedona.db\n", |
| "\n", |
| "sd = sedona.db.connect()" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 2, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "┌──────────────┬───────────────────────────────┐\n", |
| "│ name ┆ geometry │\n", |
| "│ utf8 ┆ geometry │\n", |
| "╞══════════════╪═══════════════════════════════╡\n", |
| "│ Vatican City ┆ POINT(12.4533865 41.9032822) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ San Marino ┆ POINT(12.4417702 43.9360958) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Vaduz ┆ POINT(9.5166695 47.1337238) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Lobamba ┆ POINT(31.1999971 -26.4666675) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Luxembourg ┆ POINT(6.1300028 49.6116604) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Palikir ┆ POINT(158.1499743 6.9166437) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Majuro ┆ POINT(171.3800002 7.1030043) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Funafuti ┆ POINT(179.2166471 -8.516652) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Melekeok ┆ POINT(134.6265485 7.4873962) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Bir Lehlou ┆ POINT(-9.6525222 26.1191667) │\n", |
| "└──────────────┴───────────────────────────────┘\n" |
| ] |
| } |
| ], |
| "source": [ |
| "# 1. Load the Parquet file\n", |
| "df = sd.read_parquet(\n", |
| " \"https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/\"\n", |
| " \"natural-earth/files/natural-earth_cities_geo.parquet\"\n", |
| ")\n", |
| "\n", |
| "# 2. Register the data frame as a view\n", |
| "df.to_view(\"zone\")\n", |
| "\n", |
| "# 3. Query the view and store the result in a new DataFrame\n", |
| "query_result_df = sd.sql(\"SELECT * FROM zone LIMIT 10\")\n", |
| "query_result_df.show()" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 3, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "\n", |
| "Verifying the written file at 'query_results.parquet'...\n", |
| "┌──────────────┬───────────────────────────────┐\n", |
| "│ name ┆ geometry │\n", |
| "│ utf8 ┆ geometry │\n", |
| "╞══════════════╪═══════════════════════════════╡\n", |
| "│ Vatican City ┆ POINT(12.4533865 41.9032822) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ San Marino ┆ POINT(12.4417702 43.9360958) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Vaduz ┆ POINT(9.5166695 47.1337238) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Lobamba ┆ POINT(31.1999971 -26.4666675) │\n", |
| "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", |
| "│ Luxembourg ┆ POINT(6.1300028 49.6116604) │\n", |
| "└──────────────┴───────────────────────────────┘\n" |
| ] |
| } |
| ], |
| "source": [ |
| "# 4. Write the result to a new Parquet file\n", |
| "output_path = \"query_results.parquet\"\n", |
| "query_result_df.to_parquet(output_path)\n", |
| "\n", |
| "# (Optional) Verify the written file\n", |
| "print(f\"\\nVerifying the written file at '{output_path}'...\")\n", |
| "verified_df = sd.read_parquet(output_path)\n", |
| "verified_df.show(5)" |
| ] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": ".venv (3.13.3)", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.13.3" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 4 |
| } |