GeoSpark Logo

Page hit count (since Jan. 2018): HitCount

Status      | Stable                                   | Latest                     | Source code  | Spark compatibility
GeoSpark    | Maven Central with version prefix filter | Sonatype Nexus (Snapshots) | Build Status | Spark 2.X, 1.X
GeoSparkSQL | Maven Central with version prefix filter | Sonatype Nexus (Snapshots) | Build Status | Spark SQL 2.1, 2.2, 2.3
GeoSparkViz | Maven Central with version prefix filter | Sonatype Nexus (Snapshots) | Build Status | Spark 2.X, 1.X

GeoSpark@Twitter || GeoSpark Discussion Board || Join the chat at https://gitter.im/geospark-datasys/Lobby

GeoSpark is listed as an Infrastructure Project on the Apache Spark Official Third Party Projects page.

GeoSpark is a cluster computing system for processing large-scale spatial data. It extends Apache Spark with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) that efficiently load, process, and analyze large-scale spatial data across machines. GeoSpark provides APIs that let Apache Spark programmers easily develop spatial analysis programs with SRDDs, which have in-house support for geometrical operations and spatial queries (range, k nearest neighbors, join).

GeoSpark artifacts are hosted in Maven Central: Maven Central Coordinates

Companies that are using GeoSpark (incomplete list)

Please make a Pull Request to add yourself!

Version release notes: click here

News!

  • GeoSpark 1.1.0 is released. This release contains several bug fixes, new SQL functions, and a new index serializer. Note that the GeoSparkSQL Maven coordinate has changed. Release notes || Maven Coordinate
  • The GeoSpark wiki has moved to the new GeoSpark website! Users are welcome to contribute their tutorials and stories by making a PR!
  • We just released a template project that shows how to use GeoSpark for spatial data mining.
  • GeoSpark 1.0 is released. This release mainly includes a complete version of GeoSparkSQL.

Important features (more)

Spatial SQL on Spark

GeoSparkSQL fully supports Apache Spark SQL. Its features are as follows:

  • Supports SQL/MM Part 3, the Spatial SQL standard
  • Supports Spark SQL statements.
  • Works with the Spark query optimizer: the beloved GeoSpark spatial join / predicate pushdown! For example:
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham';
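
A minimal Scala sketch of how such a query might be registered and run, assuming the GeoSparkSQLRegistrator helper from the GeoSparkSQL module; the city and superhero rows below are hypothetical and are built inline with ST_GeomFromWKT:

import org.apache.spark.sql.SparkSession
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

val spark = SparkSession.builder()
  .appName("GeoSparkSQLExample")
  .master("local[*]")
  .getOrCreate()

// Register the GeoSpark SQL functions (ST_GeomFromWKT, ST_Contains, ...) with this session.
// Depending on the GeoSparkSQL version, registerAll may also accept the SparkSession directly.
GeoSparkSQLRegistrator.registerAll(spark.sqlContext)

// Hypothetical sample data: one city polygon and one superhero point
spark.sql("SELECT ST_GeomFromWKT('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))') AS geom, 'Gotham' AS name")
  .createOrReplaceTempView("city")
spark.sql("SELECT ST_GeomFromWKT('POINT(5 5)') AS geom, 'Batman' AS name")
  .createOrReplaceTempView("superhero")

spark.sql(
  """SELECT superhero.name
    |FROM city, superhero
    |WHERE ST_Contains(city.geom, superhero.geom)
    |AND city.name = 'Gotham'""".stripMargin).show()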

Spatial Resilient Distributed Datasets (SRDDs)

Supported specialized Spatial RDDs: PointRDD, RectangleRDD, PolygonRDD, LineStringRDD

The generic SpatialRDD supports all the following geometries (they can be mixed in a SpatialRDD):

Point, Polygon, Line string, Multi-point, Multi-polygon, Multi-line string, GeometryCollection, Circle
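
A short Scala sketch of building one of the typed SRDDs, assuming the PointRDD constructor from GeoSpark core and a hypothetical checkins.csv whose first two columns are longitude and latitude:

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.datasyslab.geospark.enums.FileDataSplitter
import org.datasyslab.geospark.spatialRDD.PointRDD

val conf = new SparkConf().setAppName("SRDDExample").setMaster("local[*]")
val sc = new JavaSparkContext(conf)

// checkins.csv is a hypothetical input file: longitude,latitude,other attributes...
val pointRDD = new PointRDD(
  sc,
  "/path/to/checkins.csv",
  0,                     // offset: column index where the coordinates start
  FileDataSplitter.CSV,  // input format
  true)                  // carry the remaining columns as user data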

Supported input data format

Native input format support:

  • CSV
  • TSV
  • WKT
  • GeoJSON (single-line compact format)
  • NASA Earth Data NetCDF/HDF
  • ESRI Shapefile (.shp, .shx, .dbf)

User-supplied input format mapper: any single-line input format
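
A sketch of one of these loaders: reading an ESRI Shapefile into a generic SpatialRDD, assuming the ShapefileReader helper class and a hypothetical directory holding the .shp/.shx/.dbf parts:

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.datasyslab.geospark.formatMapper.shapefileParser.ShapefileReader

val sc = new JavaSparkContext(new SparkConf().setAppName("ShapefileExample").setMaster("local[*]"))

// /path/to/shapefile is a hypothetical directory containing the .shp, .shx and .dbf files
val spatialRDD = ShapefileReader.readToGeometryRDD(sc, "/path/to/shapefile")
println(spatialRDD.rawSpatialRDD.count())  // number of geometries read from the shapefile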

Spatial Partitioning

Supported spatial partitioning techniques: Quad-Tree (recommended), KDB-Tree (recommended), R-Tree, Voronoi diagram, Uniform grids (experimental), Hilbert Curve (experimental)
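
A sketch of applying one of these partitioners, assuming the GridType enum and the pointRDD built in the earlier SRDD sketch:

import org.datasyslab.geospark.enums.GridType

// pointRDD: the PointRDD (or any SpatialRDD) loaded in the earlier sketch
pointRDD.analyze()                               // the partitioner needs the dataset boundary
pointRDD.spatialPartitioning(GridType.QUADTREE)  // or GridType.KDBTREE, GridType.RTREE, ...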

Spatial Index

Supported Spatial Indexes: Quad-Tree and R-Tree. The R-Tree index also supports the Spatial K Nearest Neighbors query.
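
A sketch of building an index on each partition, assuming the IndexType enum and the spatially partitioned pointRDD from the previous sketch:

import org.datasyslab.geospark.enums.IndexType

// Build an R-Tree on every spatial partition produced by spatialPartitioning(...)
pointRDD.buildIndex(IndexType.RTREE, true)

// Alternatively, build the index on the raw (un-partitioned) RDD:
// pointRDD.buildIndex(IndexType.QUADTREE, false)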

Geometrical operation

DatasetBoundary, Minimum Bounding Rectangle, Polygon Union
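
For instance, the dataset boundary is exposed after a single analyze() pass (a sketch, continuing the earlier pointRDD):

// analyze() scans the data once and caches summary statistics on the RDD
pointRDD.analyze()
println(pointRDD.boundaryEnvelope)  // minimum bounding rectangle of the whole dataset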

Spatial Operation

Spatial Range Query, Distance Join Query, Spatial Join Query, and Spatial K Nearest Neighbors Query.
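
A Scala sketch of a range query and a k nearest neighbors query, assuming the RangeQuery and KNNQuery operators from GeoSpark core and the pointRDD from the earlier sketches (the query window and point below are hypothetical):

import com.vividsolutions.jts.geom.{Coordinate, Envelope, GeometryFactory}
import org.datasyslab.geospark.spatialOperator.{KNNQuery, RangeQuery}

// Range query: all points that fall inside a query window
val queryWindow = new Envelope(-90.0, -80.0, 30.0, 40.0)  // minX, maxX, minY, maxY
val pointsInWindow = RangeQuery.SpatialRangeQuery(pointRDD, queryWindow, true, false)
// arguments: SRDD, query window, consider boundary intersection, use index

// K nearest neighbors query: the 5 points closest to a query point
val queryPoint = new GeometryFactory().createPoint(new Coordinate(-84.0, 34.0))
val neighbors = KNNQuery.SpatialKnnQuery(pointRDD, queryPoint, 5, false)
// arguments: SRDD, query geometry, k, use index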

Coordinate Reference System (CRS) Transformation (aka. Coordinate projection)

GeoSpark allows users to transform the original CRS (e.g., degree-based coordinates such as EPSG:4326 / WGS84) to any other CRS (e.g., meter-based coordinates such as EPSG:3857) so that it can accurately process both geographic and geometrical data.
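
A sketch of the transformation call on an SRDD, assuming the CRSTransform method and the pointRDD from the earlier sketch:

// Reproject from degree-based EPSG:4326 (WGS84) to meter-based EPSG:3857,
// so that distance-based operations work in meters rather than degrees
pointRDD.CRSTransform("epsg:4326", "epsg:3857")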

GeoSpark Tutorial (more)

The full GeoSpark tutorial is available on the GeoSpark website.

GeoSpark Scala and Java template project is available here: Template Project

GeoSpark Function Use Cases: Scala Example, Java Example

GeoSpark Visualization Extension (GeoSparkViz)

GeoSparkViz is a large-scale in-memory geospatial visualization system.

GeoSparkViz provides native support for general cartographic design by extending GeoSpark to process large-scale spatial data. It can visualize Spatial RDDs and spatial query results and render super-high-resolution images in parallel.

More details are available here: GeoSpark Visualization Extension

GeoSparkViz Gallery

Watch High Resolution on a real map

Publication

Jia Yu, Jinxuan Wu, Mohamed Sarwat. “A Demonstration of GeoSpark: A Cluster Computing Framework for Processing Big Spatial Data”. (demo paper) In Proceedings of the IEEE International Conference on Data Engineering (ICDE 2016), Helsinki, Finland, May 2016.

Jia Yu, Jinxuan Wu, Mohamed Sarwat. “GeoSpark: A Cluster Computing Framework for Processing Large-Scale Spatial Data”. (short paper) In Proceedings of the ACM International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS 2015), Seattle, WA, USA, November 2015.

Statement from our team

Benchmark

We welcome people to use GeoSpark for benchmarking purposes. To achieve the best performance and enjoy all features of GeoSpark:

  • Please always use the latest version, or state the version used in your benchmark, so that we can trace issues back to the right release.
  • Please consider using GeoSpark core instead of GeoSparkSQL. Due to limitations of Spark SQL (for instance, it does not support clustered indexes), we are not able to expose all features through GeoSparkSQL.
  • Please enable the GeoSpark Kryo serializer to reduce the memory footprint (see the sketch below).
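
A sketch of how the Kryo serializer might be switched on when building the Spark session, assuming the GeoSparkKryoRegistrator class that ships with GeoSpark core 1.1.0:

import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator

val spark = SparkSession.builder()
  .appName("GeoSparkBenchmark")
  .master("local[*]")
  // Use Kryo and register GeoSpark's geometry and index serializers
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .getOrCreate()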

Paper citation

Currently, we have published two papers about GeoSpark. Only these two papers are associated with the GeoSpark development team.

Contact

Questions

Contact

Project website

Please visit the GeoSpark website for the latest news and releases.

Data Systems Lab

GeoSpark is one of the projects initiated by the Data Systems Lab at Arizona State University. The mission of the Data Systems Lab is to design and develop experimental data management systems (e.g., database systems).