This page explains how to build spatial data lakes with Apache Sedona and GeoParquet.
GeoParquet offers many advantages for spatial data sets, including embedded schemas, column pruning, and metadata that enables file and row-group skipping.
Let's see how to write a Sedona DataFrame to GeoParquet files.
Start by creating a Sedona DataFrame:
```python
from pyspark.sql.functions import col
from sedona.sql.st_constructors import ST_GeomFromText

df = sedona.createDataFrame(
    [
        ("a", "LINESTRING(2.0 5.0,6.0 1.0)"),
        ("b", "LINESTRING(7.0 4.0,9.0 2.0)"),
        ("c", "LINESTRING(1.0 3.0,3.0 1.0)"),
    ],
    ["id", "geometry"],
)
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
```
Now write the DataFrame to GeoParquet files:
```python
df.write.format("geoparquet").mode("overwrite").save("/tmp/somewhere")
```
Here are the files created in storage:
```
somewhere/
  _SUCCESS
  part-00000-1c13be9e-6d4c-401e-89d8-739000ad3aba-c000.snappy.parquet
  part-00003-1c13be9e-6d4c-401e-89d8-739000ad3aba-c000.snappy.parquet
  part-00007-1c13be9e-6d4c-401e-89d8-739000ad3aba-c000.snappy.parquet
  part-00011-1c13be9e-6d4c-401e-89d8-739000ad3aba-c000.snappy.parquet
```
Sedona writes many Parquet files in parallel because that's much faster than writing a single file.
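If you prefer a single output file, for example for a small dataset you plan to share, you can repartition the DataFrame before writing. Here is a minimal sketch; the output path is illustrative:

```python
# Optional: repartition to a single partition so the write produces one data file.
# Writing one file is slower for large datasets, but convenient for small ones.
df.repartition(1).write.format("geoparquet").mode("overwrite").save("/tmp/somewhere_single_file")
```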
Let's read the GeoParquet files into a Sedona DataFrame:
```python
df = sedona.read.format("geoparquet").load("/tmp/somewhere")
df.show(truncate=False)
```
Here are the results:
```
+---+---------------------+
|id |geometry             |
+---+---------------------+
|a  |LINESTRING (2 5, 6 1)|
|b  |LINESTRING (7 4, 9 2)|
|c  |LINESTRING (1 3, 3 1)|
+---+---------------------+
```
Sedona queries on GeoParquet files often execute faster and more reliably than queries on other file formats, like CSV. CSV files don't store the schema in the file footer, don't support column pruning, and don't have row-group metadata for data skipping.
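As a quick illustration of column pruning, here is a minimal sketch (reusing the example path above) where only one column needs to be read from disk:

```python
# Column pruning: only the "id" column is read from the GeoParquet files,
# something a row-oriented format like CSV cannot do.
ids = sedona.read.format("geoparquet").load("/tmp/somewhere").select("id")
ids.show()
```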
Since v1.5.1, Sedona provides a Spark SQL data source `geoparquet.metadata` for inspecting GeoParquet metadata. The resulting DataFrame contains the "geo" metadata for each input file.
=== "Scala"

    ```scala
    val df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
    df.printSchema()
    ```

=== "Java"

    ```java
    Dataset<Row> df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1);
    df.printSchema();
    ```

=== "Python"

    ```python
    df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
    df.printSchema()
    ```
The output will be as follows:
```
root
 |-- path: string (nullable = true)
 |-- version: string (nullable = true)
 |-- primary_column: string (nullable = true)
 |-- columns: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- encoding: string (nullable = true)
 |    |    |-- geometry_types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- bbox: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |    |    |-- crs: string (nullable = true)
```
The GeoParquet footer stores the following data for each geometry column: the encoding, the geometry types, the bounding box (bbox), and the CRS.
The rest of the metadata is stored in the standard Parquet footer, just as it is for any other Parquet file.
If the input Parquet file does not have GeoParquet metadata, the version, primary_column, and columns fields of the resulting DataFrame will be null.
`geoparquet.metadata` only supports reading GeoParquet-specific metadata. Users can use G-Research/spark-extension to read comprehensive metadata of generic Parquet files.
Let’s check out the content of the metadata and the schema:
```python
df.show(truncate=False)
```

```
+-----------------------------+-------------------------------------------------------------------+
|path                         |columns                                                            |
+-----------------------------+-------------------------------------------------------------------+
|file:/part-00003-1c13.parquet|{geometry -> {WKB, [LineString], [2.0, 1.0, 6.0, 5.0], null, NULL}}|
|file:/part-00007-1c13.parquet|{geometry -> {WKB, [LineString], [7.0, 2.0, 9.0, 4.0], null, NULL}}|
|file:/part-00011-1c13.parquet|{geometry -> {WKB, [LineString], [1.0, 1.0, 3.0, 3.0], null, NULL}}|
|file:/part-00000-1c13.parquet|{geometry -> {WKB, [], [0.0, 0.0, 0.0, 0.0], null, NULL}}          |
+-----------------------------+-------------------------------------------------------------------+
```
The `columns` column contains bounding box information for each file in the GeoParquet data lake. These bbox values are useful for skipping files or row groups. We'll come back to bbox metadata in a later section.
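For example, here is a minimal sketch (assuming the geometry column is named `geometry` and the files live under `/tmp/somewhere`) that pulls the per-file bounding boxes out of the metadata:

```python
from pyspark.sql.functions import col, element_at

# Read the "geo" metadata for every file and extract each file's bbox
# ([xmin, ymin, xmax, ymax]) from the columns map.
meta = sedona.read.format("geoparquet.metadata").load("/tmp/somewhere")
bboxes = meta.select(
    "path",
    element_at(col("columns"), "geometry").getField("bbox").alias("bbox"),
)
bboxes.show(truncate=False)
```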
Since v1.5.1, Sedona supports writing GeoParquet files with a custom GeoParquet spec version and CRS. The default GeoParquet spec version is 1.0.0 and the default CRS is null. You can specify the GeoParquet spec version and CRS as follows:
```scala
val projjson = "{...}" // PROJJSON string for all geometry columns
df.write.format("geoparquet")
  .option("geoparquet.version", "1.0.0")
  .option("geoparquet.crs", projjson)
  .save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")
```
If you have multiple geometry columns written to the GeoParquet file, you can specify the CRS for each column. For example, if g0 and g1 are two geometry columns in the DataFrame df, you can specify a CRS for each of them as follows:
```scala
val projjson_g0 = "{...}" // PROJJSON string for g0
val projjson_g1 = "{...}" // PROJJSON string for g1
df.write.format("geoparquet")
  .option("geoparquet.version", "1.0.0")
  .option("geoparquet.crs.g0", projjson_g0)
  .option("geoparquet.crs.g1", projjson_g1)
  .save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")
```
The value of geoparquet.crs and geoparquet.crs.<column_name> can be one of the following:
"null": Explicitly setting crs field to null. This is the default behavior."" (empty string): Omit the crs field. This implies that the CRS is OGC:CRS84 for CRS-aware implementations."{...}" (PROJJSON string): The crs field will be set as the PROJJSON object representing the Coordinate Reference System (CRS) of the geometry. You can find the PROJJSON string of a specific CRS from here: https://epsg.io/ (click the JSON option at the bottom of the page). You can also customize your PROJJSON string as needed.Please note that Sedona currently cannot set/get a projjson string to/from a CRS. Its geoparquet reader will ignore the projjson metadata and you will have to set your CRS via ST_SetSRID after reading the file. Its geoparquet writer will not leverage the SRID field of a geometry so you will have to always set the geoparquet.crs option manually when writing the file, if you want to write a meaningful CRS field.
For the same reason, the Sedona GeoParquet reader and writer do NOT check the axis order (lon/lat or lat/lon) and assume it is handled by the users themselves when writing/reading the files. You can always use ST_FlipCoordinates to swap the axis order of your geometries.
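Here is a minimal Python sketch of both workarounds; the path and the EPSG code 4326 are only illustrative:

```python
from pyspark.sql.functions import expr

df = sedona.read.format("geoparquet").load("/tmp/somewhere")

# The reader ignores the projjson metadata, so assign the CRS yourself after reading.
df = df.withColumn("geometry", expr("ST_SetSRID(geometry, 4326)"))

# If the data was written with the axes in lat/lon order, swap them back to lon/lat.
df = df.withColumn("geometry", expr("ST_FlipCoordinates(geometry)"))
```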
Since v1.6.1, Sedona supports writing the covering field to geometry column metadata. The covering field specifies a bounding box column that helps accelerate spatial data retrieval. The bounding box column should be a top-level struct column containing xmin, ymin, xmax, ymax fields. If the DataFrame you are writing contains such a column, you can specify the `.option("geoparquet.covering.<geometryColumnName>", "<coveringColumnName>")` option to write covering metadata to GeoParquet files:
```scala
df.write.format("geoparquet")
  .option("geoparquet.covering.geometry", "bbox")
  .save("/path/to/saved_geoparquet.parquet")
```
If the DataFrame has only one geometry column, you can simply specify the geoparquet.covering option and omit the geometry column name:
```scala
df.write.format("geoparquet")
  .option("geoparquet.covering", "bbox")
  .save("/path/to/saved_geoparquet.parquet")
```
If the DataFrame does not have a covering column, you can construct one using Sedona's SQL functions:
```scala
val df_bbox = df.withColumn(
  "bbox",
  expr("struct(ST_XMin(geometry) AS xmin, ST_YMin(geometry) AS ymin, ST_XMax(geometry) AS xmax, ST_YMax(geometry) AS ymax)")
)
df_bbox.write.format("geoparquet")
  .option("geoparquet.covering.geometry", "bbox")
  .save("/path/to/saved_geoparquet.parquet")
```
To maximize the performance of Sedona GeoParquet filter pushdown, we suggest sorting the data by geohash (see ST_GeoHash) before saving it as GeoParquet. For example:
```sql
SELECT col1, col2, geom, ST_GeoHash(geom, 5) AS geohash
FROM spatialDf
ORDER BY geohash
```
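A minimal Python sketch of the full pattern, reusing the hypothetical `spatialDf` table and `geom` column from the SQL above:

```python
# Compute a geohash per row, sort by it so nearby geometries land in the same
# files and row groups, then write the result as GeoParquet.
sorted_df = sedona.sql("""
    SELECT col1, col2, geom, ST_GeoHash(geom, 5) AS geohash
    FROM spatialDf
    ORDER BY geohash
""")
sorted_df.write.format("geoparquet").mode("overwrite").save("/path/to/sorted_geoparquet")
```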
Let's look closer at how Sedona uses the GeoParquet bbox metadata to optimize queries.
The bounding box metadata specifies the area covered by the geometries in a given file. Suppose you query points in a region not covered by the bounding box of a given file. The engine can skip that entire file when executing the query, because it's known that the file contains no relevant data.
Skipping entire files of data can massively improve performance. The more data that’s skipped, the faster the query will run.
Let’s look at an example of a dataset with points and three bounding boxes.
Now, let's apply a spatial filter to read the points within a particular area. Here is the query:
```python
my_shape = "POLYGON((4.0 3.5, 4.0 6.0, 8.0 6.0, 8.0 4.5, 4.0 3.5))"
res = sedona.sql(
    f"""
    select *
    from points
    where st_intersects(geometry, ST_GeomFromWKT('{my_shape}'))
    """
)
res.show(truncate=False)
```
Here are the results:
```
+---+-----------+
|id |geometry   |
+---+-----------+
|e  |POINT (5 4)|
|f  |POINT (6 4)|
|g  |POINT (6 5)|
+---+-----------+
```
We don’t need the data in bounding boxes 1 or 3 to run this query. We just need points in bounding box 2. File skipping lets us read one file instead of three for this query. Here are the bounding boxes for the files in the GeoParquet data lake:
```
+--------------------------------------------------------------+
|columns                                                       |
+--------------------------------------------------------------+
|{geometry -> {WKB, [Point], [2.0, 1.0, 3.0, 3.0], null, NULL}}|
|{geometry -> {WKB, [Point], [7.0, 1.0, 8.0, 2.0], null, NULL}}|
|{geometry -> {WKB, [Point], [5.0, 4.0, 6.0, 5.0], null, NULL}}|
+--------------------------------------------------------------+
```
Skipping entire files can result in massive performance gains.
As previously mentioned, GeoParquet files have many advantages over row-oriented files like CSV or GeoJSON: the schema is embedded in the file footer, columnar storage supports column pruning, and the bbox metadata enables file and row-group skipping.

These are huge advantages, but they don't offer the complete feature set that spatial data practitioners need.
Parquet data lakes and GeoParquet data lakes face many of the same limitations as other data lakes: there are no reliable transactions, and common operations like updates and deletes are hard to perform.
To overcome these limitations, the data community has migrated to open table formats like Iceberg, Delta Lake, and Hudi.
Iceberg recently rolled out support for geometry and geography columns. This lets data practitioners combine the best features of open table formats with the benefits of the Parquet files in which the data is stored. Iceberg overcomes the limitations of GeoParquet data lakes and allows for reliable transactions and easy access to common operations like updates and deletes.
GeoParquet is a significant innovation for the spatial data community. It provides geo engineers with embedded schemas, performance optimizations, and native support for geometry columns.
Spatial data engineers can also migrate to Iceberg now that it supports these positive features of GeoParquet, and more. Iceberg provides many useful open table features and is almost always a better option than vanilla GeoParquet (except for single-file datasets that will never change, or for compatibility with other engines).
Switching from legacy file formats like Shapefile, GeoPackage, CSV, and GeoJSON to high-performance formats like GeoParquet or Iceberg should immediately speed up your computational workflows.