SpatialBench is a benchmark for assessing geospatial SQL analytics query performance across database systems. It provides a reproducible and scalable way to evaluate the performance of spatial data engines using realistic synthetic workloads.
Goals:
SpatialBench includes a set of 12 SQL queries. Because spatial SQL syntax varies widely across systems, we provide a CLI that prints all 12 queries in the dialect of your choice. Currently supported dialects are:
We tried to vary the queries only as much as necessary to accommodate dialect differences.
GeoPandas is a distinct case because it is not SQL-based; however, given its popularity, we felt it was important to include it. Pandas/GeoPandas users often hand-optimize their queries in ways that SQL engines handle automatically. We felt hand-tuning the queries was unfair for this exercise, so we did as little of it as possible while still writing “idiomatic” pandas code. We would be interested in hearing feedback on this approach, as well as seeing a “fully hand-optimized” version of the queries.
We welcome contributions and civil discussions on how to improve the queries and their implementations.
You can print the queries in your dialect of choice using the following command:
./print_queries.py <dialect>
SpatialBench defines a spatial star schema with the following tables:
Table | Type | Abbr. | Description | Spatial Attributes | Cardinality per SF |
---|---|---|---|---|---|
Trip | Fact Table | t_ | Individual trip records | pickup & dropoff points | 6M × SF |
Customer | Dimension | c_ | Trip customer info | None | 30K × SF |
Driver | Dimension | s_ | Trip driver info | None | 500 × SF |
Vehicle | Dimension | v_ | Trip vehicle info | None | 100 × SF |
Zone | Dimension | z_ | Administrative zones (SF-aware scaling) | Polygon | Tiered by SF range (see below) |
Building | Dimension | b_ | Building footprints | Polygon | 20K × (1 + log₂(SF)) |
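
The "Cardinality per SF" column can be turned into a quick row-count estimate. The following is a minimal sketch, not code from the generator: the function name `table_cardinalities` is hypothetical, truncating to an integer is an assumption about rounding, and scale factors below 1 are rejected because the table does not say how the logarithmic Building formula behaves there. Zone is omitted because it is tiered rather than formula-based.

```python
import math


def table_cardinalities(sf: float) -> dict[str, int]:
    """Estimate row counts per table for a scale factor, per the schema table.

    Assumptions (not confirmed by the generator): results are truncated to
    integers, and only sf >= 1 is supported.
    """
    if sf < 1:
        raise ValueError("this sketch only handles scale factors >= 1")
    return {
        "trip": int(6_000_000 * sf),      # 6M x SF
        "customer": int(30_000 * sf),     # 30K x SF
        "driver": int(500 * sf),          # 500 x SF
        "vehicle": int(100 * sf),         # 100 x SF
        # 20K x (1 + log2(SF)): grows logarithmically, not linearly
        "building": int(20_000 * (1 + math.log2(sf))),
    }


if __name__ == "__main__":
    for name, rows in table_cardinalities(8).items():
        print(f"{name:>8}: {rows:,}")
```

Note how the Building table grows far more slowly than the fact table: going from SF 1 to SF 8 multiplies Trip by 8 but Building only by 4.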
The Zone table uses scale factor–aware generation so that zone granularity scales with dataset size and keeps query cost realistic. At small scales, this feels like querying ZIP-level units; at large scales, it uses coarser administrative units.
Scale Factor (SF) | Zone Subtypes Included | Zone Cardinality |
---|---|---|
[0, 10) | microhood, macrohood, county | 156,095 |
[10, 100) | + neighborhood | 455,711 |
[100, 1000) | + localadmin, locality, region, dependency | 1,035,371 |
[1000+) | + country | 1,035,749 |
This tiered scaling reflects the geometry complexity and area distributions observed in the Overture division_area dataset (release version 2025-08-20.1), which represents administrative boundaries.
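
The tier table above amounts to a lookup on half-open scale-factor ranges. Here is a small sketch of that lookup; the names `zone_cardinality`, `_TIER_BOUNDS`, and `_ZONE_COUNTS` are illustrative and are not taken from the generator's source.

```python
from bisect import bisect_right

# Upper bounds of the half-open tiers [0, 10), [10, 100), [100, 1000);
# anything at or above 1000 falls into the last tier.
_TIER_BOUNDS = [10, 100, 1000]
# Zone cardinalities copied from the tier table, one per tier.
_ZONE_COUNTS = [156_095, 455_711, 1_035_371, 1_035_749]


def zone_cardinality(sf: float) -> int:
    """Return the Zone table row count for a given scale factor."""
    if sf < 0:
        raise ValueError("scale factor must be non-negative")
    # bisect_right places a boundary value (e.g. sf == 10) in the next tier,
    # which matches the half-open intervals in the table.
    return _ZONE_COUNTS[bisect_right(_TIER_BOUNDS, sf)]
```

Unlike the other tables, the Zone count is a step function of SF, so SF 1 and SF 9 produce the same number of zones.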
SpatialBench's data generator uses continent-bounded affine transforms. Each continent is defined by a bounding polygon, which keeps generated data mostly over land and introduces the natural skew of real geographies.
Bounding polygons:
- Africa: POLYGON ((-20.062752 -40.044425, 64.131567 -40.044425, 64.131567 37.579421, -20.062752 37.579421, -20.062752 -40.044425))
- Europe: POLYGON ((-11.964479 37.926872, 64.144374 37.926872, 64.144374 71.82884, -11.964479 71.82884, -11.964479 37.926872))
- South Asia: POLYGON ((64.58354 -9.709049, 145.526096 -9.709049, 145.526096 51.672557, 64.58354 51.672557, 64.58354 -9.709049))
- North Asia: POLYGON ((64.495655 51.944267, 178.834704 51.944267, 178.834704 77.897255, 64.495655 77.897255, 64.495655 51.944267))
- Oceania: POLYGON ((112.481901 -48.980212, 180.768942 -48.980212, 180.768942 -10.228433, 112.481901 -10.228433, 112.481901 -48.980212))
- South America: POLYGON ((-83.833822 -56.170016, -33.904338 -56.170016, -33.904338 12.211188, -83.833822 12.211188, -83.833822 -56.170016))
- South North America: POLYGON ((-124.890724 12.382931, -69.511192 12.382931, -69.511192 42.55308, -124.890724 42.55308, -124.890724 12.382931))
- North North America: POLYGON ((-166.478008 42.681087, -52.053245 42.681087, -52.053245 72.659041, -166.478008 72.659041, -166.478008 42.681087))
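
To illustrate what a continent-bounded affine map can look like, the sketch below scales and translates a point from the unit square into one of the bounding rectangles listed above. This is only an illustration under stated assumptions: the `BBox` type, the `CONTINENT_BOUNDS` dictionary, and `unit_to_continent` are hypothetical names, and the generator's actual transform may differ (for example, in how it picks continents or distributes points).

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


# Two of the bounding rectangles from the list above (corner coordinates only).
CONTINENT_BOUNDS = {
    "Africa": BBox(-20.062752, -40.044425, 64.131567, 37.579421),
    "Europe": BBox(-11.964479, 37.926872, 64.144374, 71.82884),
}


def unit_to_continent(u: float, v: float, box: BBox) -> tuple[float, float]:
    """Affinely map (u, v) in [0, 1] x [0, 1] onto the bounding rectangle.

    (0, 0) lands on the south-west corner and (1, 1) on the north-east corner.
    """
    lon = (1.0 - u) * box.min_lon + u * box.max_lon
    lat = (1.0 - v) * box.min_lat + v * box.max_lat
    return lon, lat


if __name__ == "__main__":
    # The centre of the unit square maps to the centre of the Europe rectangle.
    print(unit_to_continent(0.5, 0.5, CONTINENT_BOUNDS["Europe"]))
```

Because the map is affine, any distribution sampled in the unit square keeps its shape and is only stretched to the continent's extent.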
SpatialBench inherits its speed and efficiency from the tpchgen-rs project, which is one of the fastest open-source data generators available.
Key performance benefits:
SpatialBench is a Rust-based fork of the tpchgen-rs project. It preserves the original’s high-performance, multi-threaded, streaming architecture, while extending it with a spatial star schema and geometry generation logic.
You can build the SpatialBench data generator using Cargo:
cargo build --release
Alternatively, install it directly using:
cargo install --path ./spatialbench-cli
For contribution or debugging, refer to the ARCHITECTURE.md guide.
spatialbench-cli -s 1 --format=parquet
spatialbench-cli -s 1 --format=parquet --tables trip,building --output-dir sf1-parquet
for PART in $(seq 1 4); do
  mkdir part-$PART
  spatialbench-cli -s 10 --tables trip,building --output-dir part-$PART --parts 4 --part $PART
done
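
The same partitioned run can be driven from Python, which may be convenient when the loop is part of a larger benchmark script. The sketch below is an assumption-laden convenience wrapper, not part of SpatialBench: `part_command` and `generate_parts` are hypothetical helpers, and they use only the flags shown in the commands above (`-s`, `--tables`, `--output-dir`, `--parts`, `--part`). It assumes `spatialbench-cli` is on your `PATH`.

```python
import subprocess
from pathlib import Path


def part_command(scale: int, tables: list[str], parts: int, part: int,
                 output_dir: str) -> list[str]:
    """Build the argument list for generating one part of the dataset."""
    return [
        "spatialbench-cli",
        "-s", str(scale),
        "--tables", ",".join(tables),
        "--output-dir", str(output_dir),
        "--parts", str(parts),
        "--part", str(part),
    ]


def generate_parts(scale: int, tables: list[str], parts: int) -> None:
    """Generate every part sequentially, one output directory per part."""
    for part in range(1, parts + 1):
        out_dir = Path(f"part-{part}")
        out_dir.mkdir(exist_ok=True)
        # check=True raises CalledProcessError if the generator exits non-zero.
        subprocess.run(
            part_command(scale, tables, parts, part, str(out_dir)),
            check=True,
        )


if __name__ == "__main__":
    generate_parts(scale=10, tables=["trip", "building"], parts=4)
```

Passing the command as a list avoids shell quoting issues, and each part remains independent, so the calls could also be dispatched to separate machines.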
The generator CLI itself supports generating multiple files via the --parts and --part flags. However, if you want to generate multiple files per table, each of roughly a specific size, you can use the provided script tools/generate_data.py.
This script is how the data was generated for the benchmark results cited in the SedonaDB launch blog post.
tools/generate_data.py --scale-factor 10 --mb-per-file 256 --output-dir sf10-parquet
You can override the built-in defaults at runtime by passing a YAML file via the --config flag:
spatialbench-cli -s 1 --format=parquet --tables trip,building --config spatialbench-config.yml
If --config is not provided, SpatialBench checks for ./spatialbench-config.yml. If absent, it falls back to built-in defaults.
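
The lookup order described above (explicit `--config`, then `./spatialbench-config.yml`, then built-in defaults) can be summarized in a few lines. This is a sketch of the described behaviour only, not the CLI's actual Rust implementation; `resolve_config_path` is a hypothetical name, and returning `None` stands in for "use built-in defaults".

```python
from pathlib import Path
from typing import Optional

DEFAULT_CONFIG_NAME = "spatialbench-config.yml"


def resolve_config_path(cli_config: Optional[str], cwd: Path) -> Optional[Path]:
    """Return the config file to load, or None to signal built-in defaults."""
    # 1. An explicit --config value always wins.
    if cli_config is not None:
        return Path(cli_config)
    # 2. Otherwise look for spatialbench-config.yml in the working directory.
    candidate = cwd / DEFAULT_CONFIG_NAME
    if candidate.is_file():
        return candidate
    # 3. Nothing found: fall back to the built-in defaults.
    return None


if __name__ == "__main__":
    print(resolve_config_path(None, Path.cwd()))
```

One practical consequence of this order is that a stray `spatialbench-config.yml` in the working directory silently changes the generated data, so it is worth checking for one before a benchmark run.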
For reference, see the provided spatialbench-config.yml.
See CONFIGURATION.md for more details about spatial data generation and the full YAML schema and examples.