A benchmark for assessing geospatial SQL analytics query performance across database systems

SpatialBench

SpatialBench is a high-performance geospatial benchmark for generating synthetic spatial data at scale. Inspired by the Star Schema Benchmark (SSB) and real-world mobility data like the NYC TLC dataset, SpatialBench is designed to evaluate spatial query performance in modern data platforms.

Built in Rust and powered by Apache Arrow, SpatialBench offers fast, scalable, streaming-friendly data generation for spatial workloads with minimal dependencies.

SpatialBench provides a reproducible and scalable way to evaluate the performance of spatial data engines using realistic synthetic workloads.

Goals:

  • Establish a fair and extensible benchmark suite for spatial data processing.
  • Help users compare engines and frameworks across different data scales.
  • Support open standards and foster collaboration in the spatial computing community.

Data Model

SpatialBench defines a spatial star schema with the following tables:

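| Table    | Type       | Abbr. | Description                             | Spatial Attributes      | Cardinality per SF             |
|----------|------------|-------|-----------------------------------------|-------------------------|--------------------------------|
| Trip     | Fact Table | t_    | Individual trip records                 | pickup & dropoff points | 6M × SF                        |
| Customer | Dimension  | c_    | Trip customer info                      | None                    | 30K × SF                       |
| Driver   | Dimension  | s_    | Trip driver info                        | None                    | 500 × SF                       |
| Vehicle  | Dimension  | v_    | Trip vehicle info                       | None                    | 100 × SF                       |
| Zone     | Dimension  | z_    | Administrative zones (SF-aware scaling) | Polygon                 | Tiered by SF range (see below) |
| Building | Dimension  | b_    | Building footprints                     | Polygon                 | 20K × (1 + log₂(SF))           |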
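
To make the cardinality rules concrete, the following sketch estimates row counts for a given scale factor. It only restates the formulas from the table; the function name `estimate_rows` is hypothetical, and rounding to the nearest integer is an assumption (the generator may truncate or handle fractional scale factors differently). Zone is left out because its size is tiered rather than computed.

```python
import math


def estimate_rows(sf: float) -> dict[str, int]:
    """Estimate per-table row counts from the data model table.

    Zone is omitted: its cardinality is tiered by SF range, not a formula.
    """
    if sf <= 0:
        raise ValueError("scale factor must be positive")
    return {
        "trip": round(6_000_000 * sf),
        "customer": round(30_000 * sf),
        "driver": round(500 * sf),
        "vehicle": round(100 * sf),
        # Assumption: round to nearest; the real generator may differ.
        "building": round(20_000 * (1 + math.log2(sf))),
    }
```

For example, `estimate_rows(8)["building"]` evaluates to 80,000, since log₂(8) = 3. The logarithmic term means the Building table grows far more slowly than Trip as SF increases.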

Zone Table Scaling

The Zone table uses scale factor–aware generation so that zone granularity scales with dataset size and query cost stays realistic. At small scales, only fine-grained units (comparable to ZIP-level areas) are included; larger scales add progressively coarser administrative units.

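| Scale Factor (SF) | Zone Subtypes Included                     | Zone Cardinality |
|-------------------|--------------------------------------------|------------------|
| [0, 10)           | microhood, macrohood                       | 117,416          |
| [10, 100)         | + neighborhood, county                     | 455,711          |
| [100, 1000)       | + localadmin, locality, region, dependency | 1,035,371        |
| [1000, ∞)         | + country                                  | 1,035,749        |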

This tiered scaling reflects the geometry complexity and area distributions observed in the Overture division_area dataset (release 2025-08-20.1), which represents administrative boundaries.
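
The tier boundaries in the table above can be expressed as a simple lookup. This is a sketch for planning purposes only, not the generator's own code; `zone_cardinality` is a hypothetical helper, and the counts are copied from the table.

```python
import bisect

# Inclusive lower bound of each SF tier and the matching zone count,
# copied from the Zone Table Scaling table.
_TIER_STARTS = [0, 10, 100, 1000]
_ZONE_COUNTS = [117_416, 455_711, 1_035_371, 1_035_749]


def zone_cardinality(sf: float) -> int:
    """Return the Zone table row count for a given scale factor."""
    if sf < 0:
        raise ValueError("scale factor must be non-negative")
    # bisect_right counts the tier starts that are <= sf, so subtracting
    # one gives the index of the half-open interval containing sf.
    return _ZONE_COUNTS[bisect.bisect_right(_TIER_STARTS, sf) - 1]
```

Because each interval is closed on the left, `zone_cardinality(10)` falls into the second tier (455,711), while any value just below 10 stays in the first.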

Performance

SpatialBench inherits its speed and efficiency from the tpchgen-rs project, which is one of the fastest open-source data generators available.

Key performance benefits:

  • Zero-copy, streaming architecture: Generates data in constant memory, suitable for very large datasets.
  • Multithreaded from the ground up: Leverages all CPU cores for high-throughput generation.
  • Arrow-native output: Supports fast serialization to Parquet and other formats without bottlenecks.
  • Fast geometry generation: The Spider module generates millions of spatial geometries per second, with deterministic output and affine transforms.
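
As an illustration of the last point, the sketch below shows the general idea of deterministic generation followed by an affine transform. It is not SpatialBench's implementation: the six-element matrix layout, the seeding scheme, and the bounding box are all assumptions chosen for the example.

```python
import random


def affine(point, m):
    """Apply a 2D affine transform m = [a, b, c, d, e, f]:
    x' = a*x + b*y + c,  y' = d*x + e*y + f.
    """
    x, y = point
    a, b, c, d, e, f = m
    return (a * x + b * y + c, d * x + e * y + f)


def unit_square_points(n, seed):
    """Generate n points in the unit square from a fixed seed."""
    rng = random.Random(seed)  # same seed -> same sequence on every run
    return [(rng.random(), rng.random()) for _ in range(n)]


# Hypothetical transform: stretch the unit square into a 2 x 1 degree box
# whose lower-left corner is at longitude -74.5, latitude 40.0.
TO_BBOX = [2.0, 0.0, -74.5, 0.0, 1.0, 40.0]

points = [affine(p, TO_BBOX) for p in unit_square_points(3, seed=42)]
```

Generating in a unit square and mapping afterwards keeps the random stage independent of the target region, so the same seed reproduces the same layout under any transform.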

How is SpatialBench dbgen built?

SpatialBench is a Rust-based fork of the tpchgen-rs project. It preserves the original’s high-performance, multi-threaded, streaming architecture, while extending it with a spatial star schema and geometry generation logic.

You can build the SpatialBench data generator using Cargo:

cargo build --release

Alternatively, install it directly using:

cargo install --path ./spatialbench-cli

Notes

  • The core generator logic lives in the spatialbench crate.
  • Geometry-aware logic is in spatialbench-arrow and integrated via Arrow-based schemas.
  • Spatial extension modules, such as the Spider geometry generator, reside in spider.rs.
  • The generator supports output formats like .tbl and Apache Parquet via the Arrow writer.

For contribution or debugging, refer to the ARCHITECTURE.md guide.

Usage

Generate All Tables (Scale Factor 1)

spatialbench-cli -s 1 --format=parquet

Generate Individual Tables

spatialbench-cli -s 1 --format=parquet --tables trip,building --output-dir sf1-parquet

Partitioned Output Example

for PART in $(seq 1 4); do
  mkdir -p "part-$PART"
  spatialbench-cli -s 10 --tables trip,building --output-dir "part-$PART" --parts 4 --part "$PART"
done
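
If you drive generation from a script rather than the shell, the same loop might look like the sketch below. It uses only the flags shown above; the helper names `part_commands` and `run_all` are hypothetical, and the code assumes `spatialbench-cli` is on your PATH.

```python
import subprocess
from pathlib import Path


def part_commands(sf, tables, parts):
    """Build one spatialbench-cli argument list per partition."""
    return [
        [
            "spatialbench-cli", "-s", str(sf),
            "--tables", ",".join(tables),
            "--output-dir", f"part-{part}",
            "--parts", str(parts),
            "--part", str(part),
        ]
        for part in range(1, parts + 1)  # parts are numbered from 1
    ]


def run_all(commands):
    """Create each output directory, then run the commands one by one."""
    for argv in commands:
        out_dir = argv[argv.index("--output-dir") + 1]
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run(argv, check=True)  # raise if the CLI exits non-zero
```

Calling `run_all(part_commands(10, ["trip", "building"], 4))` reproduces the four invocations of the shell loop.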

Custom Spider Configuration

SpatialBench ships with built-in defaults for the Spider geometry generator. You can override them at runtime by passing a YAML file via the --config flag:

spatialbench-cli -s 1 --format=parquet --tables trip,building --config spatialbench-config.yml

If --config is not provided, SpatialBench checks for ./spatialbench-config.yml. If absent, it falls back to built-in defaults.
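
The lookup order described above can be summarized in a few lines. This is a sketch of the documented behavior, not the CLI's actual code; `resolve_config` is a hypothetical name, and returning `None` stands in for "use the built-in defaults".

```python
from pathlib import Path
from typing import Optional

DEFAULT_NAME = "spatialbench-config.yml"


def resolve_config(cli_path: Optional[str], cwd: Path = Path(".")) -> Optional[Path]:
    """Return the config file to load, or None for built-in defaults."""
    # 1. An explicit --config value always wins.
    if cli_path is not None:
        return Path(cli_path)
    # 2. Otherwise look for the default file in the working directory.
    candidate = cwd / DEFAULT_NAME
    if candidate.is_file():
        return candidate
    # 3. Nothing found: the caller falls back to built-in defaults.
    return None
```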

For reference, see the provided spatialbench-config.yml.

See CONFIGURATION.md for details on spatial data generation, the full YAML schema, and examples.

Acknowledgements