{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with Parquet Files\n",
"\n",
"The easiest way to read a GeoParquet or Parquet file is to use `sd.read_parquet()`. Alternatively, you can query these files directly by their path in SQL."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install SedonaDB\n",
"\n",
"Use pip to install SedonaDB from the Python Package Index (PyPI)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**: To run this notebook locally, SedonaDB must be installed in your environment. Install it with `pip install \"apache-sedona[db]\"`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation\n",
"\n",
"A common workflow for working with GeoParquet or Parquet files is:\n",
"\n",
"1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.\n",
"2. **Register** the data frame as a view with `.to_view()`.\n",
"3. **Query** the view using `sd.sql()`.\n",
"4. **Write** the results to a Parquet file with `.to_parquet()`, or call `.to_pandas()` to export them to a pandas DataFrame or GeoPandas GeoDataFrame."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import the sedona.db module and connect to SedonaDB\n",
"import sedona.db\n",
"\n",
"sd = sedona.db.connect()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"┌──────────────┬───────────────────────────────┐\n",
"│ name ┆ geometry │\n",
"│ utf8 ┆ geometry │\n",
"╞══════════════╪═══════════════════════════════╡\n",
"│ Vatican City ┆ POINT(12.4533865 41.9032822) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ San Marino ┆ POINT(12.4417702 43.9360958) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Vaduz ┆ POINT(9.5166695 47.1337238) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Lobamba ┆ POINT(31.1999971 -26.4666675) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Luxembourg ┆ POINT(6.1300028 49.6116604) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Palikir ┆ POINT(158.1499743 6.9166437) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Majuro ┆ POINT(171.3800002 7.1030043) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Funafuti ┆ POINT(179.2166471 -8.516652) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Melekeok ┆ POINT(134.6265485 7.4873962) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Bir Lehlou ┆ POINT(-9.6525222 26.1191667) │\n",
"└──────────────┴───────────────────────────────┘\n"
]
}
],
"source": [
"# 1. Load the Parquet file\n",
"df = sd.read_parquet(\n",
" \"https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/\"\n",
" \"natural-earth/files/natural-earth_cities_geo.parquet\"\n",
")\n",
"\n",
"# 2. Register the data frame as a view named \"cities\"\n",
"df.to_view(\"cities\")\n",
"\n",
"# 3. Query the view and store the result in a new data frame\n",
"query_result_df = sd.sql(\"SELECT * FROM cities LIMIT 10\")\n",
"query_result_df.show()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Verifying the written file at 'query_results.parquet'...\n",
"┌──────────────┬───────────────────────────────┐\n",
"│ name ┆ geometry │\n",
"│ utf8 ┆ geometry │\n",
"╞══════════════╪═══════════════════════════════╡\n",
"│ Vatican City ┆ POINT(12.4533865 41.9032822) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ San Marino ┆ POINT(12.4417702 43.9360958) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Vaduz ┆ POINT(9.5166695 47.1337238) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Lobamba ┆ POINT(31.1999971 -26.4666675) │\n",
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
"│ Luxembourg ┆ POINT(6.1300028 49.6116604) │\n",
"└──────────────┴───────────────────────────────┘\n"
]
}
],
"source": [
"# 4. Write the result to a new Parquet file\n",
"output_path = \"query_results.parquet\"\n",
"query_result_df.to_parquet(output_path)\n",
"\n",
"# (Optional) Verify the written file\n",
"print(f\"\\nVerifying the written file at '{output_path}'...\")\n",
"verified_df = sd.read_parquet(output_path)\n",
"verified_df.show(5)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv (3.13.3)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}