What is spatial index

A spatial index is a data structure that allows for accessing a spatial object efficiently. It is a common technique used by spatial databases. Without indexing, any search for a feature would require a “sequential scan” of every record in the database, resulting in much longer processing time. In a spatial index construction process, the minimum bounding rectangle serves as an object approximation. Various types of spatial indices across commercial and open-source databases yield measurable performance differences. Spatial indexing techniques are playing a central role in time-critical applications and the manipulation of spatial big data.

How does CarbonData implement spatial index

There are many open source implementations for spatial indexing and to process spatial queries. CarbonData implements a different way of spatial index. Its core idea is to use the raster data. Raster is made up of matrix of cells organized into rows and columns(called a grid). Each cell represents a coordinate. The index for the coordinate is generated using longitude and latitude, like the Z order curve.

CarbonData rasterize the user data during data load into segments. A set of latitude and longitude represents a grid range. The size of the grid can be configured. Hence, the coordinates loaded are often discrete and not continuous.

Below figure shows the relationship between the grid and the points residing in it. Black point represents the center point of the grid, and the red points are the coordinates at the arbitrary positions inside the grid. The red points can be replaced by the center point of the grid to indicate that the points lies within the grid. During data load, CarbonData generates an index for the coordinate according to row and column of the grid(in the raster) where that coordinate lies. These indices are the same as Z order. For the detailed conversion algorithm, please refer to the design documents of spatial index.

File Directory Structure

Carbon supports Polygon User Defined Function(UDF) as filter condition in the query to return all the data points lying within it. Polygon UDF takes multiple points(i.e., pair of longitude and latitude) separated by a comma. Longitude and latitude in the pair are separated by a space. The first and last points in the polygon must be same to form a closed loop. CarbonData builds a quad tree using this polygon and spatial region information passed while creating a table. The nodes in the quad tree are composed of indices generated by the row and column information projected in the polygon area. When the grid center point lies within the polygon area, the grid is considered as selected. In the following figure, user selects a quadrilateral shaped polygon. The grid at the center of the region is chosen to build a quad tree. Once tree is build, all the leafs are scanned to get the list of range of indices(with each range consisting of minimum index and maximum index in the range). All the indices starting from minimum to maximum in each range forms the result.

File Directory Structure

There are some other UDFs supporting more filter conditions in the query, including Polygon List, Polyline List, and spatial index range list.

Polygon List UDF takes multiple polygons(i.e., a set of points) and operation type for combining polygons. Only OR and AND are supported at present, operation ‘OR’ means union of multiple polygons and ‘AND’ means intersection of that, shown as the following figure. Then CarbonData gets the list of range of indices from the combined region by quad tree, which is the same processing as Polygon UDF.

File Directory Structure

Polyline List UDF takes multiple polylines(i.e., a set of points) and buffer in meter. CarbonData first converts polyline to polygon and then gets the list of range of indices from these polygons. The processing is the same as Polygon UDF and return all the data points lying within the buffer region of polylines.

File Directory Structure

Polygon Range List UDF takes multiple range lists and operation type for merging the range lists. Range is an area bounded by start spatial index and end spatial index(i.e., minimum index and maximum index of range) in a quad tree. Range List is internal representation of a range definition that may contains one or multiple polygons. Operation includes OR and AND at present, means the union and intersection set of multiple range lists. This UDF returns all the data points whose spatial index is lying within the input range lists.

File Directory Structure

The main reasons for faster query response are as follows :

Data is sorted based on the index values.
Above UDF filter is pushed down from engine to the carbon layer such that CarbonData scans only matched blocklets avoiding full scan.

Beside, CarbonData also provides some spatial conversion utils UDFs. Such as converting spatial index to spatial grid coordinate x,y, converting spatial index to longitude and latitude pair, converting longitude and latitude pair to spatial index, converting spatial index to upper layer spatial index of pyramid model, and converting input polygon string to list of range of indices.

Installation and Deployment

Geo is a separate module in the Project. It can be included or excluded from the project build based on the requirement.

Basic Command

Create Table

Create table with spatial index table properties

create table source_index(id BIGINT, latitude long, longitude long) stored by 'carbondata' TBLPROPERTIES (
'SPATIAL_INDEX'='mygeohash',
'SPATIAL_INDEX.mygeohash.type'='geohash',
'SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude',
'SPATIAL_INDEX.mygeohash.originLatitude'='19.832277',
'SPATIAL_INDEX.mygeohash.gridSize'='50',
'SPATIAL_INDEX.mygeohash.conversionRatio'='1000000');

Create spatial table using spark dataframe

val geoSchema = StructType(Seq(StructField("timevalue", LongType, nullable = true),
StructField("longitude", LongType, nullable = false),
StructField("latitude", LongType, nullable = false)))

val geoDf = sqlContext.read.option("delimeter", ",").option("header", "true").schema(geoSchema)
  .csv(s"$resourcesPath/geodata.csv")

geoDf.write
  .format("carbondata")
  .option("tableName", "geo1")
  .option("SPATIAL_INDEX", "mygeohash")
  .option("SPATIAL_INDEX.mygeohash.type", "geohash")
  .option("SPATIAL_INDEX.mygeohash.sourcecolumns", "longitude, latitude")
  .option("SPATIAL_INDEX.mygeohash.originLatitude", "39.832277")
  .option("SPATIAL_INDEX.mygeohash.gridSize", "50")
  .option("SPATIAL_INDEX.mygeohash.conversionRatio", "1000000")
  .option("SPATIAL_INDEX.mygeohash.class", "org.apache.carbondata.geo.GeoHashIndex")
  .mode(SaveMode.Overwrite)
  .save()

Note:

mygeohash in the above example represent the index name.
Columns present in spatial_index table properties cannot be altered i.e., sourcecolumns: longitude, latitude and index column: mygeohash in the above example.
To make the spatial instance compatible with previous versions, trigger refresh table command. In direct upgrade scenario, if spatial table already exists then refresh command fails but updates the instance property in metadata.

List of spatial index table properties

Name	Description
SPATIAL_INDEX	Used to configure Spatial Index name. This name is appended to `SPATIAL_INDEX` in the subsequent sub-property configurations. `xxx` in the below sub-properties refer to index name. Generated spatial index column is not allowed in any properties except in `SORT_COLUMNS` table property.
SPATIAL_INDEX.xxx.type	Type of algorithm for processing spatial data. Currently, supports only ‘geohash’.
SPATIAL_INDEX.xxx.sourcecolumns	longitude and latitude column names as in the table. These columns are used to generate index value for each row.
SPATIAL_INDEX.xxx.originLatitude	Latitude of origin.
SPATIAL_INDEX.xxx.gridSize	Grid size of raster data in metres. Currently, spatial index supports raster data.
SPATIAL_INDEX.xxx.conversionRatio	Conversion factor. It allows user to translate longitude and latitude to long. For example, if the data to load is longitude = 13.123456, latitude = 101.12356. User can configure conversion ratio sub-property value as 1000000, and change data to load as longitude = 13123456 and latitude = 10112356. Operations on long is much faster compared to floating-point numbers.
SPATIAL_INDEX.xxx.class	Optional user custom implementation class. Value is fully qualified class name.

Load/Insert

Load/Insert with no geoId column, then geoId will be generated internally.

insert into source_index select 1,116.285807,40.084087;

Load/Insert with custom geoId

insert into source_index select 0, 1,116.285807,40.084087;

Note:

Load custom geoId values using dataframe is not supported.

Select Query

Query with Polygon UDF predicate

select * from source_index where IN_POLYGON('16.321011 4.123503,16.137676 5.947911,16.560993 5.935276,16.321011 4.123503')

Query with Polygon List UDF predicate

select * from source_index where IN_POLYGON_LIST('POLYGON ((116.137676 40.163503, 116.137676 39.935276, 116.560993 39.935276, 116.137676 40.163503)), POLYGON ((116.560993 39.935276, 116.560993 40.163503, 116.137676 40.163503, 116.560993 39.935276))', 'OR')

select * from source_index where IN_POLYGON_LIST('select polygon from polyon_table', 'OR')

Polygon table example for above sub-query:

polygon: String Type	poiId: Int Type
POLYGON ((116.137676 40.163503, 116.137676 39.935276, 116.560993 39.935276, 116.137676 40.163503))	1
POLYGON ((116.560993 39.935276, 116.560993 40.163503, 116.137676 40.163503, 116.560993 39.935276))	2

Query with Polyline List UDF predicate

select * from source_index where IN_POLYLINE_LIST('LINESTRING (116.137676 40.163503, 116.137676 39.935276, 116.260993 39.935276), LINESTRING (116.260993 39.935276, 116.560993 39.935276, 116.560993 40.163503)', 65)

select * from source_index where IN_POLYLINE_LIST('select polyLine from polyon_table', 65)

PolyLine table example for above sub-query:

polyLine: String Type	poiId: Int Type
LINESTRING (116.137676 40.163503, 116.137676 39.935276, 116.260993 39.935276)	1
LINESTRING (116.260993 39.935276, 116.560993 39.935276, 116.560993 40.163503)	2

Query with Polygon Range List UDF predicate

select * from source_index where IN_POLYGON_RANGE_LIST('RANGELIST (855279368848 855279368850, 855280799610 855280799612, 855282156300 855282157400), RANGELIST (855279368852 855279368854, 855280799613 855280799615, 855282156200 855282157500)', 'OR')

Query having Join on Spatial and Polygon table with Polygon Join UDF predicate

select sum(t1.col1), t2.poiId
from spatial_table t1
inner join
(select polygon, poiId from polygon_table where poiType='abc') t2
on IN_POLYGON_JOIN(t1.mygeohash, t2.polygon)
group by t2.poiId

Polygon table example for above query:

polygon: String Type	poiType: String Type	poiId: Int Type
POLYGON ((116.137676 40.163503, 116.137676 39.935276, 116.560993 39.935276, 116.137676 40.163503))	abc	1
POLYGON ((116.560993 39.935276, 116.560993 40.163503, 116.137676 40.163503, 116.560993 39.935276))	def	2
POLYGON ((116.560993 40.935276, 116.360993 40.163503, 116.137676 40.163403, 116.560993 39.935276))	abc	3

Query having Join on Spatial and Polygon table with Polygon Join RangeList UDF predicate

select sum(t1.col1), t2.poiId
from spatial_table t1
inner join
(select polygonRanges, poiId from polygon_table) t2
on IN_POLYGON_JOIN_RANGE_LIST(t1.mygeohash, t2.polygonRanges)
group by t2.poiId

Polygon table example for above query:

polygonRanges: String Type	poiType: String Type	poiId: Int Type
RANGELIST (855279368848 855279368850, 855280799610 855280799612, 855282156300 855282157400)	abc	1
rangelist (855279368852 855279368854, 855280799613 855280799615, 855282156200 855282157500)	def	2
RANGELIST (855279368848 855279368850, 855280799613 855280799617, 855282156300 855282157400)	abc	3

Convert spatial index to spatial grid x, y

select GeoIdToGridXy(mygeohash) as GridXY from source_index

Convert longitude and latitude pair to spatial index
The UDF needs two other parameters, oriLatitude and gridSize

select LatLngToGeoId(latitude, longitude, 39.832277, 50) as geoId from source_index

Convert spatial index to longitude and latitude pair
The UDF needs two other parameters, oriLatitude and gridSize

select GeoIdToLatLng(mygeohash, 39.832277, 50) as LatitudeAndLongitude from source_index

Convert spatial index to upper layer spatial index of pyramid model

select ToUpperLayerGeoId(mygeohash) as upperLayerGeoId from source_index

Convert string polygon to internal spatial index range list

select ToRangeList('116.321011 40.123503, 116.320311 40.122503, 116.321111 40.121503, 116.321011 40.123503', 39.832277, 50) as rangeList

Reference

[1] https://issues.apache.org/jira/browse/CARBONDATA-3548
[2] https://gistbok.ucgis.org/topic-keywords/indexing
[3] https://en.wikipedia.org/wiki/Z-order_curve