The GeoPandas API for Apache Sedona provides a familiar GeoPandas interface that scales your geospatial analysis beyond single-node limitations. This API combines the intuitive GeoPandas DataFrame syntax with the distributed processing power of Apache Sedona on Apache Spark, enabling you to work with planetary-scale datasets using the same code patterns you already know.
The GeoPandas API for Apache Sedona is a compatibility layer that allows you to use GeoPandas-style operations on distributed geospatial data. Instead of being limited to single-node processing, your GeoPandas code can leverage the full power of Apache Spark clusters for large-scale geospatial analysis.
The GeoPandas API for Apache Sedona automatically handles SparkSession management through PySpark's pandas-on-Spark integration. There are a few options for setup:
The GeoPandas API automatically uses the default SparkSession from PySpark:
```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet

# No explicit SparkSession setup needed - uses the default session
# The API automatically handles Sedona context initialization
```
If you need to configure a custom SparkSession or are working in an environment where you need explicit control:
```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext

# Create and configure the SparkSession
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# The GeoPandas API will use this configured session
```
If you already have a SparkSession (e.g., in Databricks, EMR, or other managed environments):
```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext

# Use the existing SparkSession (e.g., 'spark' in Databricks)
sedona = SedonaContext.create(spark)  # 'spark' is the existing session
```
The GeoPandas API leverages PySpark's pandas-on-Spark functionality, which automatically manages the SparkSession lifecycle:
- **Default Session**: When you import `sedona.spark.geopandas`, it automatically uses PySpark's default session via `pyspark.pandas.utils.default_session()`
- **Automatic Sedona Registration**: The API automatically registers Sedona's spatial functions and optimizations with the SparkSession when needed
- **Transparent Integration**: All GeoPandas operations are translated to Spark SQL operations under the hood, using the configured SparkSession
- **No Manual Context Management**: Unlike traditional Sedona usage, you don't need to explicitly call `SedonaContext.create()` unless you need custom configuration
This design makes the API more user-friendly by hiding the complexity of SparkSession management while still providing the full power of distributed processing.
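As a quick way to see this in action, the sketch below (not part of the official examples, and assuming a working PySpark installation) asks PySpark for the default session that the GeoPandas API will run on:

```python
from pyspark.pandas.utils import default_session

import sedona.spark.geopandas as gpd  # importing the API is enough; no session setup needed

# Inspect the default SparkSession that GeoPandas-style operations will run on
spark = default_session()
print(spark.version)
print(spark.sparkContext.appName)
```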
When working with S3 data, the GeoPandas API uses Spark's built-in S3 support rather than external libraries like s3fs. Configure anonymous access to public S3 buckets using Spark configuration:
```python
from sedona.spark import SedonaContext

# For anonymous access to public S3 buckets
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.bucket-name.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```
For authenticated S3 access, use appropriate AWS credential providers:
```python
# For IAM roles (recommended for EC2/EMR)
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)

# For access keys (not recommended for production)
config = (
    SedonaContext.builder()
    .config("spark.hadoop.fs.s3a.access.key", "your-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "your-secret-key")
    .getOrCreate()
)
```
Instead of importing GeoPandas directly, import from the Sedona GeoPandas module:
```python
# Traditional GeoPandas import
# import geopandas as gpd

# Sedona GeoPandas API import
import sedona.spark.geopandas as gpd

# or
from sedona.spark.geopandas import GeoDataFrame, read_parquet
```
The API supports reading from various geospatial formats, including Parquet files from cloud storage. For S3 access with anonymous credentials, configure Spark to use anonymous AWS credentials:
```python
from sedona.spark import SedonaContext

# Configure Spark for anonymous S3 access
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)

# Load GeoParquet file directly from S3
s3_path = "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
nyc_buildings = gpd.read_parquet(s3_path)

# Display basic information
print(f"Dataset shape: {nyc_buildings.shape}")
print(f"Columns: {nyc_buildings.columns.tolist()}")
nyc_buildings.head()
```
Use spatial filtering methods to subset data by a bounding box. Note that the `cx` spatial indexer is not yet implemented in the current version:
```python
from shapely.geometry import box

# Define bounding box for Central Park
central_park_bbox = box(
    -73.973, 40.764,  # bottom-left (longitude, latitude)
    -73.951, 40.789,  # top-right (longitude, latitude)
)

# Filter buildings that intersect the bounding box
# Note: This requires collecting data to the driver for spatial filtering
# For large datasets, consider using spatial joins instead
buildings_sample = nyc_buildings.sample(1000)  # Sample for demonstration
central_park_buildings = buildings_sample[
    buildings_sample.geometry.intersects(central_park_bbox)
]

# Display results
print(
    central_park_buildings[["BUILD_ID", "PROP_ADDR", "height_val", "geometry"]].head()
)
```
Alternative approach for large datasets using spatial joins:
```python
# Create a GeoDataFrame with the bounding box
bbox_gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[central_park_bbox], crs="EPSG:4326")

# Use a spatial join to filter buildings within the bounding box
central_park_buildings = nyc_buildings.sjoin(bbox_gdf, predicate="intersects")
```
Perform spatial joins using the same syntax as GeoPandas:
```python
# Load two datasets
left_df = gpd.read_parquet("s3://bucket/left_data.parquet")
right_df = gpd.read_parquet("s3://bucket/right_data.parquet")

# Spatial join with a distance predicate
result = left_df.sjoin(right_df, predicate="dwithin", distance=50)

# Other spatial predicates
intersects_result = left_df.sjoin(right_df, predicate="intersects")
contains_result = left_df.sjoin(right_df, predicate="contains")
```
Transform geometries between different coordinate reference systems:
```python
# Set the initial CRS
buildings = gpd.read_parquet("buildings.parquet")
buildings = buildings.set_crs("EPSG:4326")

# Transform to a projected CRS for area calculations
buildings_projected = buildings.to_crs("EPSG:3857")

# Calculate areas
buildings_projected["area"] = buildings_projected.geometry.area
```
Apply geometric transformations and analysis:
```python
# Buffer operations (buffer distance is in the units of the CRS;
# use a projected CRS such as EPSG:3857 for distances in meters)
buffered = buildings.geometry.buffer(100)

# Geometric properties
buildings["is_valid"] = buildings.geometry.is_valid
buildings["is_simple"] = buildings.geometry.is_simple
bounds = buildings.geometry.bounds  # DataFrame with minx, miny, maxx, maxy columns

# Distance calculations (in CRS units; degrees for EPSG:4326)
from shapely.geometry import Point

reference_point = Point(-73.9857, 40.7484)  # Times Square
buildings["distance_to_times_square"] = buildings.geometry.distance(reference_point)

# Area and length calculations (requires a projected CRS)
buildings_projected = buildings.to_crs("EPSG:3857")  # Web Mercator
buildings_projected["area"] = buildings_projected.geometry.area
buildings_projected["perimeter"] = buildings_projected.geometry.length
```
The GeoPandas API for Apache Sedona has implemented 39 GeoSeries functions and 10 GeoDataFrame functions, covering the most commonly used GeoPandas operations:
- `read_parquet()` - Read GeoParquet files
- `read_file()` - Read various geospatial formats
- `to_parquet()` - Write to Parquet format
- `sjoin()` - Spatial joins with various predicates
- `buffer()` - Geometric buffering
- `distance()` - Distance calculations
- `intersects()`, `contains()`, `within()` - Spatial predicates
- `intersection()` - Geometric intersection
- `make_valid()` - Geometry validation and repair
- `sindex` - Spatial indexing (limited functionality)
- `set_crs()` - Set coordinate reference system
- `to_crs()` - Transform between CRS
- `crs` - Access CRS information
- `area`, `length`, `bounds` - Geometric measurements
- `is_valid`, `is_simple`, `is_empty` - Geometric validation
- `centroid`, `envelope`, `boundary` - Geometric properties
- `x`, `y`, `z`, `has_z` - Coordinate access
- `total_bounds`, `estimate_utm_crs` - Bounds and CRS utilities
- `to_geopandas()` - Convert to traditional GeoPandas
- `to_wkb()`, `to_wkt()` - Convert to WKB/WKT formats
- `from_xy()` - Create geometries from coordinates
- `geom_type` - Get geometry types

The example below brings several of these operations together in a single workflow:

```python
import sedona.spark.geopandas as gpd
from sedona.spark import SedonaContext

# Configure Spark for anonymous S3 access
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)

# Load data
DATA_DIR = "s3://wherobots-examples/data/geopandas_blog/"
overture_size = "1M"
postal_codes_path = DATA_DIR + "postal-code/"
overture_path = DATA_DIR + overture_size + "/" + "overture-buildings/"

postal_codes = gpd.read_parquet(postal_codes_path)
buildings = gpd.read_parquet(overture_path)

# Spatial analysis
buildings = buildings.set_crs("EPSG:4326")
buildings_projected = buildings.to_crs("EPSG:3857")

# Calculate areas and filter
buildings_projected["area"] = buildings_projected.geometry.area
large_buildings = buildings_projected[buildings_projected["area"] > 1000]

result = large_buildings.sjoin(postal_codes, predicate="intersects")

# Aggregate by postal code
summary = (
    result.groupby("postal_code")
    .agg({"area": "sum", "BUILD_ID": "count"})
    .rename(columns={"BUILD_ID": "building_count"})
)
print(summary.head())
```
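Since `to_geopandas()` and `to_wkt()` are among the implemented conversion methods, a distributed result can also be handed back to single-node tools once it has been reduced to a manageable size. The following is a sketch with a placeholder path, not an official example:

```python
import sedona.spark.geopandas as gpd

# Placeholder path - use a dataset (or filtered result) that fits in driver memory
small_result = gpd.read_parquet("s3://bucket/filtered_subset.parquet")

# Materialize the distributed result as a traditional GeoPandas GeoDataFrame
local_gdf = small_result.to_geopandas()
print(local_gdf.head())

# Or export geometries as WKT strings for non-spatial tools
wkt = small_result.geometry.to_wkt()
```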
For detailed and up-to-date API documentation, including complete method signatures, parameters, and examples, see:
📚 GeoPandas API Documentation
The GeoPandas API for Apache Sedona is an open-source project. Contributions are welcome through the GitHub issue tracker for reporting bugs, requesting features, or contributing code. For more information on contributing, see the Contributor Guide.