<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# GeoPandas API for Apache Sedona
The GeoPandas API for Apache Sedona provides a familiar GeoPandas interface that scales your geospatial analysis beyond single-node limitations. This API combines the intuitive GeoPandas DataFrame syntax with the distributed processing power of Apache Sedona on Apache Spark, enabling you to work with planetary-scale datasets using the same code patterns you already know.
## Overview
### What is the GeoPandas API for Apache Sedona?
The GeoPandas API for Apache Sedona is a compatibility layer that allows you to use GeoPandas-style operations on distributed geospatial data. Instead of being limited to single-node processing, your GeoPandas code can leverage the full power of Apache Spark clusters for large-scale geospatial analysis.
### Key Benefits
- **Familiar API**: Use the same GeoPandas syntax and methods you're already familiar with
- **Distributed Processing**: Scale beyond single-node limitations to handle large datasets
- **Lazy Evaluation**: Benefit from Apache Sedona's query optimization and lazy execution
- **Performance**: Leverage distributed computing for complex geospatial operations
- **Seamless Migration**: Minimal code changes required to migrate existing GeoPandas workflows
## Setup
The GeoPandas API for Apache Sedona automatically handles SparkSession management through PySpark's pandas-on-Spark integration. You have three options for setup:
### Option 1: Automatic SparkSession (Recommended)
The GeoPandas API automatically uses the default SparkSession from PySpark:
```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
# No explicit SparkSession setup needed - uses default session
# The API automatically handles Sedona context initialization
```
### Option 2: Manual SparkSession Setup
If you need to configure a custom SparkSession or are working in an environment where you need explicit control:
```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext
# Create and configure SparkSession
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
# The GeoPandas API will use this configured session
```
### Option 3: Using Existing SparkSession
If you already have a SparkSession (e.g., in Databricks, EMR, or other managed environments):
```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext
# Use existing SparkSession (e.g., 'spark' in Databricks)
sedona = SedonaContext.create(spark) # 'spark' is the existing session
```
### How SparkSession Management Works
The GeoPandas API leverages PySpark's pandas-on-Spark functionality, which automatically manages the SparkSession lifecycle:
1. **Default Session**: When you import `sedona.spark.geopandas`, it automatically uses PySpark's default session via `pyspark.pandas.utils.default_session()`
2. **Automatic Sedona Registration**: The API automatically registers Sedona's spatial functions and optimizations with the SparkSession when needed
3. **Transparent Integration**: All GeoPandas operations are translated to Spark SQL operations under the hood, using the configured SparkSession
4. **No Manual Context Management**: Unlike traditional Sedona usage, you don't need to explicitly call `SedonaContext.create()` unless you need custom configuration
This design makes the API more user-friendly by hiding the complexity of SparkSession management while still providing the full power of distributed processing.
### S3 Configuration
When working with S3 data, the GeoPandas API uses Spark's built-in S3 support rather than external libraries like s3fs. Configure anonymous access to public S3 buckets using Spark configuration:
```python
from sedona.spark import SedonaContext
# For anonymous access to public S3 buckets
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.bucket-name.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```
For authenticated S3 access, use appropriate AWS credential providers:
```python
from sedona.spark import SedonaContext
# For IAM roles (recommended for EC2/EMR)
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)
# For access keys (not recommended for production)
config = (
    SedonaContext.builder()
    .config("spark.hadoop.fs.s3a.access.key", "your-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "your-secret-key")
    .getOrCreate()
)
```
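The per-bucket setting shown earlier follows a fixed key pattern, `spark.hadoop.fs.s3a.bucket.<bucket>.aws.credentials.provider`. As a minimal sketch, the key can be generated from the bucket name. The helper `anonymous_s3_conf` is a hypothetical name used for illustration only and is not part of Sedona or Spark; the code uses only the Python standard library.

```python
def anonymous_s3_conf(bucket: str) -> dict:
    """Return the per-bucket Spark setting for anonymous S3A access.

    Hypothetical helper for illustration; not part of Sedona or Spark.
    """
    if not bucket or "/" in bucket:
        raise ValueError("expected a bare bucket name, e.g. 'my-bucket'")
    key = f"spark.hadoop.fs.s3a.bucket.{bucket}.aws.credentials.provider"
    return {key: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"}


conf = anonymous_s3_conf("wherobots-examples")
for key, value in conf.items():
    # With a real builder, each pair would be passed to .config(key, value)
    print(f"{key} = {value}")
```

Scoping the provider to one bucket leaves the credentials used for every other bucket unchanged.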
## Basic Usage
### Importing the API
Instead of importing GeoPandas directly, import from the Sedona GeoPandas module:
```python
# Traditional GeoPandas import
# import geopandas as gpd
# Sedona GeoPandas API import
import sedona.spark.geopandas as gpd
# or
from sedona.spark.geopandas import GeoDataFrame, read_parquet
```
### Reading Data
The API supports reading from various geospatial formats, including Parquet files from cloud storage. To read from a public S3 bucket, configure Spark to use anonymous AWS credentials:
```python
from sedona.spark import SedonaContext
# Configure Spark for anonymous S3 access
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
# Load GeoParquet file directly from S3
s3_path = "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
nyc_buildings = gpd.read_parquet(s3_path)
# Display basic information
print(f"Dataset shape: {nyc_buildings.shape}")
print(f"Columns: {nyc_buildings.columns.tolist()}")
nyc_buildings.head()
```
### Spatial Filtering
Filter geometries with spatial predicates such as `intersects`. Note that the `cx` coordinate indexer is not yet implemented in the current version:
```python
from shapely.geometry import box
# Define bounding box for Central Park
central_park_bbox = box(
    -73.973,
    40.764,  # bottom-left (longitude, latitude)
    -73.951,
    40.789,  # top-right (longitude, latitude)
)
# Filter buildings that intersect the bounding box
# Note: This requires collecting data to driver for spatial filtering
# For large datasets, consider using spatial joins instead
buildings_sample = nyc_buildings.sample(1000)  # Sample for demonstration
central_park_buildings = buildings_sample[
    buildings_sample.geometry.intersects(central_park_bbox)
]
# Display results
print(
    central_park_buildings[["BUILD_ID", "PROP_ADDR", "height_val", "geometry"]].head()
)
```
**Alternative approach for large datasets using spatial joins:**
```python
# Create a GeoDataFrame with the bounding box
bbox_gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[central_park_bbox], crs="EPSG:4326")
# Use spatial join to filter buildings within the bounding box
central_park_buildings = nyc_buildings.sjoin(bbox_gdf, predicate="intersects")
```
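Both approaches above rely on the `intersects` predicate against a rectangle. For intuition, when both geometries are axis-aligned boxes the predicate reduces to an interval-overlap check on each axis. The sketch below is plain Python and illustrates only that check; `BBox` and `bboxes_intersect` are hypothetical names, and this is not how Sedona evaluates the predicate internally.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in (longitude, latitude) order."""
    minx: float
    miny: float
    maxx: float
    maxy: float


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch on both axes."""
    return (
        a.minx <= b.maxx
        and b.minx <= a.maxx
        and a.miny <= b.maxy
        and b.miny <= a.maxy
    )


central_park = BBox(-73.973, 40.764, -73.951, 40.789)
inside = BBox(-73.970, 40.770, -73.969, 40.771)      # small box inside the park box
far_away = BBox(-74.020, 40.700, -74.010, 40.710)    # box well to the south-west

print(bboxes_intersect(central_park, inside))
print(bboxes_intersect(central_park, far_away))
```

Boxes that only share an edge count as intersecting here, which matches the usual meaning of `intersects` for closed geometries.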
## Advanced Operations
### Spatial Joins
Perform spatial joins using the same syntax as GeoPandas:
```python
# Load two datasets
left_df = gpd.read_parquet("s3://bucket/left_data.parquet")
right_df = gpd.read_parquet("s3://bucket/right_data.parquet")
# Spatial join with distance predicate
result = left_df.sjoin(right_df, predicate="dwithin", distance=50)
# Other spatial predicates
intersects_result = left_df.sjoin(right_df, predicate="intersects")
contains_result = left_df.sjoin(right_df, predicate="contains")
```
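To make the `dwithin` predicate concrete, here is a naive nested-loop version over plain coordinate tuples. It is a conceptual sketch only: `dwithin_pairs` is a hypothetical helper, it handles points rather than arbitrary geometries, and it assumes planar coordinates, so `distance` is expressed in the same units as the coordinates. Sedona's distributed execution does not work this way.

```python
import math


def dwithin_pairs(left, right, distance):
    """Return (left_index, right_index) pairs whose points are at most `distance` apart."""
    pairs = []
    for i, (lx, ly) in enumerate(left):
        for j, (rx, ry) in enumerate(right):
            # Planar (Euclidean) distance in coordinate units
            if math.hypot(lx - rx, ly - ry) <= distance:
                pairs.append((i, j))
    return pairs


left = [(0.0, 0.0), (100.0, 0.0)]
right = [(30.0, 40.0), (100.0, 60.0)]

# Only left[0] and right[0] are within 50 units of each other (exactly 50 apart)
print(dwithin_pairs(left, right, distance=50))
```

Because the distance is measured in coordinate units, a value such as `50` means 50 meters only when both inputs use a meter-based projected CRS.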
### Coordinate Reference System Operations
Transform geometries between different coordinate reference systems:
```python
# Set initial CRS
buildings = gpd.read_parquet("buildings.parquet")
buildings = buildings.set_crs("EPSG:4326")
# Transform to projected CRS for area calculations
buildings_projected = buildings.to_crs("EPSG:3857")
# Calculate areas
buildings_projected["area"] = buildings_projected.geometry.area
```
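Areas computed in EPSG:3857 (Web Mercator) are not true ground areas: the projection stretches lengths by roughly `1 / cos(latitude)`, so areas grow by roughly the square of that factor. The sketch below estimates the inflation under a spherical approximation, using only the standard library; `web_mercator_area_factor` is a hypothetical helper name. Where accurate areas matter, a local projected CRS (for example one suggested by `estimate_utm_crs`) is usually a better choice.

```python
import math


def web_mercator_area_factor(lat_deg: float) -> float:
    """Approximate factor by which EPSG:3857 inflates areas at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


nyc_factor = web_mercator_area_factor(40.75)  # latitude near New York City
print(f"Area inflation near New York: {nyc_factor:.2f}x")

# A footprint measured as 1000 square units in EPSG:3857 covers less ground
print(f"1000 projected m^2 is about {1000 / nyc_factor:.0f} m^2 on the ground")
```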
### Geometric Operations
Apply geometric transformations and analysis:
```python
# Buffer operations
# Buffer distance is in CRS units: 100 means 100 meters only in a meter-based projected CRS
buffered = buildings.geometry.buffer(100)
# Geometric properties
buildings["is_valid"] = buildings.geometry.is_valid
buildings["is_simple"] = buildings.geometry.is_simple
buildings["bounds"] = buildings.geometry.bounds
# Distance calculations
from shapely.geometry import Point
reference_point = Point(-73.9855, 40.7580)  # Times Square (longitude, latitude)
# Note: the result is in CRS units (degrees for EPSG:4326)
buildings["distance_to_times_square"] = buildings.geometry.distance(reference_point)
# Area and length calculations (requires projected CRS)
buildings_projected = buildings.to_crs("EPSG:3857") # Web Mercator
buildings_projected["area"] = buildings_projected.geometry.area
buildings_projected["perimeter"] = buildings_projected.geometry.length
```
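Planar operations such as `buffer()` and `distance()` work in the units of the CRS, which for EPSG:4326 are degrees. The sketch below shows why degrees are a poor stand-in for meters: near New York, one degree of longitude spans noticeably less ground than one degree of latitude. It uses the haversine formula with a spherical Earth, which is an approximation; `haversine_m` is a hypothetical helper and is not part of the Sedona API.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius (spherical approximation)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


one_deg_lon = haversine_m(-74.0, 40.75, -73.0, 40.75)  # east-west at NYC latitude
one_deg_lat = haversine_m(-74.0, 40.0, -74.0, 41.0)    # north-south

print(f"1 degree of longitude near NYC: {one_deg_lon / 1000:.1f} km")
print(f"1 degree of latitude: {one_deg_lat / 1000:.1f} km")
```

For distances or buffers in meters, transform to a projected CRS with `to_crs()` first, as the area example above does.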
## Performance Considerations
### Use Traditional GeoPandas when:
- Working with small datasets (< 1GB)
- Simple operations on local data
- Complete functional coverage is required
- Single-node processing is sufficient
### Use GeoPandas API for Apache Sedona when:
- Working with large datasets (> 1GB)
- Complex geospatial analyses
- Distributed processing is needed
- Data is stored in cloud storage (S3, HDFS, etc.)
## Supported Operations
The GeoPandas API for Apache Sedona has implemented **39 GeoSeries functions** and **10 GeoDataFrame functions**, covering the most commonly used GeoPandas operations:
### Data I/O
- `read_parquet()` - Read GeoParquet files
- `read_file()` - Read various geospatial formats
- `to_parquet()` - Write to Parquet format
### Spatial Operations
- `sjoin()` - Spatial joins with various predicates
- `buffer()` - Geometric buffering
- `distance()` - Distance calculations
- `intersects()`, `contains()`, `within()` - Spatial predicates
- `intersection()` - Geometric intersection
- `make_valid()` - Geometry validation and repair
- `sindex` - Spatial indexing (limited functionality)
### CRS Operations
- `set_crs()` - Set coordinate reference system
- `to_crs()` - Transform between CRS
- `crs` - Access CRS information
### Geometric Properties
- `area`, `length`, `bounds` - Geometric measurements
- `is_valid`, `is_simple`, `is_empty` - Geometric validation
- `centroid`, `envelope`, `boundary` - Geometric properties
- `x`, `y`, `z`, `has_z` - Coordinate access
- `total_bounds`, `estimate_utm_crs` - Bounds and CRS utilities
### Data Conversion
- `to_geopandas()` - Convert to traditional GeoPandas
- `to_wkb()`, `to_wkt()` - Convert to WKB/WKT formats
- `from_xy()` - Create geometries from coordinates
- `geom_type` - Get geometry types
## Complete Workflow Example
```python
import sedona.spark.geopandas as gpd
from sedona.spark import SedonaContext
# Configure Spark for anonymous S3 access
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
# Load data
DATA_DIR = "s3://wherobots-examples/data/geopandas_blog/"
overture_size = "1M"
postal_codes_path = DATA_DIR + "postal-code/"
overture_path = DATA_DIR + overture_size + "/" + "overture-buildings/"
postal_codes = gpd.read_parquet(postal_codes_path)
buildings = gpd.read_parquet(overture_path)
# Spatial analysis
buildings = buildings.set_crs("EPSG:4326")
buildings_projected = buildings.to_crs("EPSG:3857")
# Calculate areas and filter
buildings_projected["area"] = buildings_projected.geometry.area
large_buildings = buildings_projected[buildings_projected["area"] > 1000]
result = large_buildings.sjoin(postal_codes, predicate="intersects")
# Aggregate by postal code
summary = (
    result.groupby("postal_code")
    .agg({"area": "sum", "BUILD_ID": "count"})
    .rename(columns={"BUILD_ID": "building_count"})
)
print(summary.head())
```
## Resources and Contributing
For detailed and up-to-date API documentation, including complete method signatures, parameters, and examples, see:
**📚 [GeoPandas API Documentation](https://sedona.apache.org/latest/api/pydocs/sedona.spark.geopandas.html)**
The GeoPandas API for Apache Sedona is an open-source project. Contributions are welcome through the [GitHub issue tracker](https://github.com/apache/sedona/issues/2230) for reporting bugs, requesting features, or contributing code. For more information on contributing, see the [Contributor Guide](../community/develop.md).