| commit | 630d0249896ef4fc40015f8eae8948bb5a64a530 | |
|---|---|---|
| author | Jia Yu <jiayu2@asu.edu> | Thu Oct 29 12:29:04 2015 -0700 |
| committer | Jia Yu <jiayu2@asu.edu> | Thu Oct 29 12:29:04 2015 -0700 |
| tree | a56e7cd266b506e18f87918db6d518c9cc217bd6 | |
| parent | 8525fd41c52c1c467c32130525db7f5fa6b1106a | |
Add test data
current version: v0.2
GeoSpark is a cluster computing system for processing large-scale spatial data. GeoSpark extends Apache Spark with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) that efficiently load, process, and analyze large-scale spatial data across machines. Processing such data is challenging because (1) spatial data may be quite complex, e.g., the geometrical boundaries of rivers and cities, and (2) spatial (and geometric) operations (e.g., Overlap, Intersect, Convex Hull, Cartographic Distances) cannot be easily and efficiently expressed using regular RDD transformations and actions. GeoSpark provides APIs that let Apache Spark programmers easily develop spatial analysis programs with SRDDs, which have built-in support for geometrical and distance operations. Experiments show that GeoSpark is scalable and exhibits faster run-time performance than Hadoop-based systems in spatial analysis applications such as spatial join, spatial aggregation, spatial autocorrelation analysis, and spatial co-location pattern recognition.
Note: GeoSpark has been tested on Apache Spark 1.2, 1.3, 1.4 and Apache Hadoop 2.4, 2.6.
Please refer to the GeoSpark Java API Usage.
GeoSpark extends RDDs to form Spatial RDDs (SRDDs) and efficiently partitions SRDD data elements across machines. It introduces novel parallelized spatial transformations and actions for SRDDs (geometric operations that follow the Open Geospatial Consortium (OGC) standard), which provide a more intuitive interface for users to write spatial data analytics programs. Moreover, GeoSpark extends the SRDD layer to execute spatial queries (e.g., range query, KNN query, and join query) on large-scale spatial datasets. After geometrical objects are retrieved in the Spatial RDD layer, users can invoke the spatial query processing operations provided in the Spatial Query Processing Layer of GeoSpark, which runs over the in-memory cluster, decides how spatial object-relational tuples are stored, indexed, and accessed using SRDDs, and returns the spatial query results required by the user.
GeoSpark supports Comma-Separated Values (CSV) and Tab-Separated Values (TSV) as input formats. When calling a constructor, users only need to specify the input format as Splitter and the start column of the spatial information in a tuple as Offset.
(column, column,..., Longitude, Latitude, column, column,...)
(column, column,...,Longitude 1, Longitude 2, Latitude 1, Latitude 2,column, column,...)
The two longitude/latitude pairs are the vertices on one diagonal of the rectangle.
(column, column,...,Longitude 1, Latitude 1, Longitude 2, Latitude 2, ...)
Each tuple may contain any number of points.
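To make the role of Splitter and Offset concrete, here is a minimal sketch in plain Java of how the spatial columns could be pulled out of the tuple layouts above. `TupleParser` and its methods are hypothetical names chosen for illustration and are not part of the GeoSpark API. The sketch also assumes the splitter is given as the string `"csv"` or `"tsv"`, and that in a multi-point tuple the coordinate pairs run from the offset to the end of the line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of the GeoSpark API. It only illustrates how
// a splitter and an offset locate the spatial columns inside one tuple.
class TupleParser {

    /** Maps a splitter name ("csv" or "tsv") to its column delimiter. */
    static String delimiter(String splitter) {
        switch (splitter) {
            case "csv": return ",";
            case "tsv": return "\t";
            default: throw new IllegalArgumentException("Unknown splitter: " + splitter);
        }
    }

    /** Reads the column at the given index as a double. */
    private static double column(String[] cols, int index) {
        return Double.parseDouble(cols[index].trim());
    }

    /** Point tuple: returns {longitude, latitude} from columns offset and offset + 1. */
    static double[] parsePoint(String line, String splitter, int offset) {
        String[] cols = line.split(delimiter(splitter));
        return new double[] {column(cols, offset), column(cols, offset + 1)};
    }

    /**
     * Rectangle tuple in the column order listed above
     * (Longitude 1, Longitude 2, Latitude 1, Latitude 2).
     * Returns {minLon, minLat, maxLon, maxLat}.
     */
    static double[] parseRectangle(String line, String splitter, int offset) {
        String[] cols = line.split(delimiter(splitter));
        double lon1 = column(cols, offset);
        double lon2 = column(cols, offset + 1);
        double lat1 = column(cols, offset + 2);
        double lat2 = column(cols, offset + 3);
        return new double[] {
            Math.min(lon1, lon2), Math.min(lat1, lat2),
            Math.max(lon1, lon2), Math.max(lat1, lat2)
        };
    }

    /** Multi-point tuple: assumes (longitude, latitude) pairs run from offset to the end of the line. */
    static List<double[]> parsePoints(String line, String splitter, int offset) {
        String[] cols = line.split(delimiter(splitter));
        List<double[]> points = new ArrayList<>();
        for (int i = offset; i + 1 < cols.length; i += 2) {
            points.add(new double[] {column(cols, i), column(cols, i + 1)});
        }
        return points;
    }

    public static void main(String[] args) {
        // Two leading non-spatial columns, so the spatial columns start at offset 2.
        double[] p = parsePoint("id1,school,-112.07,33.45,open", "csv", 2);
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

Keeping the offset as an explicit parameter means the same parser works regardless of how many non-spatial columns precede the coordinates.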
GeoSpark supports two spatial indexes: Quad-Tree and R-Tree. GeoSpark provides two methods for creating the desired spatial index.
GeoSpark currently provides native support for Inside, Overlap, DatasetBoundary, Minimum Bounding Rectangle, and Polygon Union in SRDDs, following the Open Geospatial Consortium (OGC) standard.
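As a rough illustration of what some of these operations compute, the following plain-Java sketch implements a minimum bounding rectangle together with simple Inside and Overlap tests on axis-aligned rectangles. The class `Mbr` and its methods are hypothetical and do not reflect GeoSpark's actual classes or signatures. Treating boundary points as inside is an assumption of this sketch.

```java
// Hypothetical illustration in plain Java, NOT GeoSpark's implementation.
final class Mbr {
    final double minX, minY, maxX, maxY;

    /** Builds a rectangle from any two opposite corners. */
    Mbr(double x1, double y1, double x2, double y2) {
        this.minX = Math.min(x1, x2);
        this.minY = Math.min(y1, y2);
        this.maxX = Math.max(x1, x2);
        this.maxY = Math.max(y1, y2);
    }

    /** Minimum bounding rectangle of a non-empty array of {x, y} points. */
    static Mbr of(double[][] points) {
        if (points.length == 0) {
            throw new IllegalArgumentException("At least one point is required");
        }
        double minX = points[0][0], maxX = points[0][0];
        double minY = points[0][1], maxY = points[0][1];
        for (double[] p : points) {
            minX = Math.min(minX, p[0]);
            maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]);
            maxY = Math.max(maxY, p[1]);
        }
        return new Mbr(minX, minY, maxX, maxY);
    }

    /** Inside test for a point; boundary points count as inside (an assumption). */
    boolean contains(double x, double y) {
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }

    /** Overlap test: true when the two rectangles share at least one point. */
    boolean overlaps(Mbr other) {
        return minX <= other.maxX && other.minX <= maxX
            && minY <= other.maxY && other.minY <= maxY;
    }

    /** Smallest rectangle enclosing both this rectangle and the other one. */
    Mbr enclose(Mbr other) {
        return new Mbr(
            Math.min(minX, other.minX), Math.min(minY, other.minY),
            Math.max(maxX, other.maxX), Math.max(maxY, other.maxY));
    }

    public static void main(String[] args) {
        Mbr m = Mbr.of(new double[][] {{1, 5}, {3, 2}, {2, 8}});
        System.out.println(m.minX + " " + m.minY + " " + m.maxX + " " + m.maxY);
    }
}
```

Because `enclose` is associative and commutative, it could be applied pairwise across partitions to obtain the boundary of a whole dataset, which is one plausible way to think about a DatasetBoundary-style operation in a distributed setting.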
GeoSpark currently provides spatial range, join, and KNN queries on SRDDs.
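For readers unfamiliar with these query types, the sketch below shows single-machine reference semantics for a range query and a KNN query over a list of points. It is plain Java with hypothetical names (`SpatialQueries`, `rangeQuery`, `knnQuery`) and does not use or mirror the GeoSpark API. It also assumes planar Euclidean distance and performs a full scan, whereas a distributed system would typically partition the data and may use a spatial index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical single-machine reference semantics, NOT the GeoSpark API.
class SpatialQueries {

    /** Range query: all {x, y} points inside the query window, boundary included. */
    static List<double[]> rangeQuery(List<double[]> points,
                                     double minX, double minY,
                                     double maxX, double maxY) {
        List<double[]> result = new ArrayList<>();
        for (double[] p : points) {
            if (p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY) {
                result.add(p);
            }
        }
        return result;
    }

    /** Squared Euclidean distance; enough for ranking, so no square root is needed. */
    static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    /** KNN query: the k points closest to the query point, nearest first. */
    static List<double[]> knnQuery(List<double[]> points, double[] query, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        Comparator<double[]> byDistance =
            Comparator.comparingDouble((double[] p) -> squaredDistance(p, query));
        // Max-heap of the best k candidates seen so far: the farthest one is on top.
        PriorityQueue<double[]> heap = new PriorityQueue<>(byDistance.reversed());
        for (double[] p : points) {
            heap.add(p);
            if (heap.size() > k) {
                heap.poll(); // drop the current farthest candidate
            }
        }
        List<double[]> result = new ArrayList<>(heap);
        result.sort(byDistance);
        return result;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0, 0});
        points.add(new double[] {1, 1});
        points.add(new double[] {5, 5});
        points.add(new double[] {2, 2});
        System.out.println(rangeQuery(points, 0.5, 0.5, 2.5, 2.5).size());
        System.out.println(knnQuery(points, new double[] {0, 0}, 2).size());
    }
}
```

The bounded max-heap keeps memory at k elements per scan, so the same routine could be run per partition and the partial results merged by taking the k nearest of the combined candidates.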
GeoSpark: A Cluster Computing Framework for Processing Large-Scale Spatial Data [PDF]
Jia Yu, Jinxuan Wu, Mohamed Sarwat
To appear at the ACM International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS 2015), Seattle, WA, USA, November 2015.
GeoSpark makes use of JTS Topology Suite Version 1.13 for some geometrical computations.
Please refer to the JTS Topology Suite website for more details.
Jia Yu (Email: jiayu2@asu.edu)
Jinxuan Wu (Email: jinxuanw@asu.edu)
Mohamed Sarwat (Email: msarwat@asu.edu)
### Project website

Please visit the GeoSpark project website for the latest news and releases.
GeoSpark is one of the projects under the DataSys Lab at Arizona State University. The mission of the DataSys Lab is to design and develop experimental data management systems (e.g., database systems).