Before getting into this advanced tutorial, please make sure that you have tried several Sedona functions on your local machine.
Sedona version numbers have three levels: X.X.X (e.g., 0.8.1).
A change in the first level means that the version contains a major structural redesign, which may bring big changes in APIs and performance.
A change in the second level (e.g., 0.8) indicates that the version contains significant performance enhancements, major new features, and API changes. Existing Sedona users who want to adopt such a version need to be careful about the API changes. Before you move to it, please read the Sedona release notes and make sure you are ready to accept the API changes.
A change in the third level (e.g., 0.8.1) means that the version contains only bug fixes, some small new features, and slight performance enhancements. It does not contain any API changes, so moving to it is safe. We highly recommend that all Sedona users who stay on the same second-level version move to the latest third-level release of it.
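The version rules above can be sketched as a small check that tells you which level changed between two versions. This is only an illustration: `VersionLevels` and its methods are hypothetical helpers, not part of Sedona, and the sketch assumes plain `X.X.X` strings without suffixes.

```java
/** Hypothetical helper (not a Sedona API) that classifies a version change. */
final class VersionLevels {
    /** Which level of the X.X.X version differs between two releases. */
    enum Change { NONE, FIRST_LEVEL, SECOND_LEVEL, THIRD_LEVEL }

    /** Parses a plain "X.X.X" string into its three integer levels. */
    static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected X.X.X but got: " + version);
        }
        int[] levels = new int[3];
        for (int i = 0; i < 3; i++) {
            levels[i] = Integer.parseInt(parts[i]);
        }
        return levels;
    }

    /** Returns the highest level at which the two versions differ. */
    static Change classify(String from, String to) {
        int[] a = parse(from);
        int[] b = parse(to);
        if (a[0] != b[0]) return Change.FIRST_LEVEL;
        if (a[1] != b[1]) return Change.SECOND_LEVEL;
        if (a[2] != b[2]) return Change.THIRD_LEVEL;
        return Change.NONE;
    }

    /** True when only the third level differs, i.e. no API changes are expected. */
    static boolean isApiSafe(String from, String to) {
        Change change = classify(from, to);
        return change == Change.NONE || change == Change.THIRD_LEVEL;
    }
}
```

For example, `VersionLevels.isApiSafe("0.8.0", "0.8.1")` is true, while a move from `0.8.1` to `0.9.0` is classified as `SECOND_LEVEL` and calls for reading the release notes first.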
Sedona provides a number of constructors for each SpatialRDD (PointRDD, PolygonRDD and LineStringRDD). In general, you have two options to start with.
public PointRDD(JavaSparkContext sparkContext, String InputLocation, Integer Offset, FileDataSplitter splitter, boolean carryInputData, Integer partitions, StorageLevel newLevel)
public PointRDD(JavaRDD<Point> rawSpatialRDD, StorageLevel newLevel)
You may notice that these constructors all take a “StorageLevel” parameter as input. It tells Apache Spark to cache the “rawSpatialRDD”, one attribute of SpatialRDD. Sedona does this because it calculates the dataset boundary and approximate total count using several Apache Spark “Action”s. This information is useful when running a Spatial Join Query or a Distance Join Query.
However, in some cases you may already know your dataset well. If so, you can provide this information manually by calling this kind of SpatialRDD constructor:
public PointRDD(JavaSparkContext sparkContext, String InputLocation, Integer Offset, FileDataSplitter splitter, boolean carryInputData, Integer partitions, Envelope datasetBoundary, Integer approximateTotalCount)
Manually providing the dataset boundary and approximate total count helps Sedona avoid several slow “Action”s during initialization.
Each SpatialRDD (PointRDD, PolygonRDD and LineStringRDD) possesses four RDD attributes. They are:
These four RDDs don't co-exist, so you don't need to worry about memory consumption. They are used by different queries:
Therefore, if you run one of the queries above many times, you should cache the associated RDD in memory. There are several possible use cases:
Users sometimes report that execution is slow in some cases. As a first step, you should always consider increasing the number of SpatialRDD partitions (to 2 to 8 times the original number). You can do this when you initialize a SpatialRDD. This may significantly improve performance.
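As a rough sketch of the partition advice above, the helper below multiplies the original partition count by a factor kept within the suggested 2 to 8 range. `PartitionPlanner` and `targetPartitions` are hypothetical names for illustration, not Sedona APIs; the result is meant to be passed as the `partitions` argument of a SpatialRDD constructor.

```java
/** Hypothetical helper (not a Sedona API) for picking a larger partition count. */
final class PartitionPlanner {
    static final int MIN_FACTOR = 2; // lower bound of the suggested range
    static final int MAX_FACTOR = 8; // upper bound of the suggested range

    /**
     * Returns originalPartitions multiplied by the factor, after clamping
     * the factor into the range [MIN_FACTOR, MAX_FACTOR].
     */
    static int targetPartitions(int originalPartitions, int factor) {
        if (originalPartitions < 1) {
            throw new IllegalArgumentException("originalPartitions must be positive");
        }
        int clamped = Math.max(MIN_FACTOR, Math.min(MAX_FACTOR, factor));
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(originalPartitions, clamped);
    }
}
```

The best factor depends on the dataset and cluster, so it is worth trying a few values within the range and measuring the query time for each.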
After that, you may consider tuning some other Apache Spark parameters. For example, you may use the Kryo serializer or change the fraction of memory that is used for caching RDDs.
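The two Spark settings mentioned above might be collected as follows. This is a sketch under assumptions: `SparkTuningSettings` is a hypothetical helper, the choice of `spark.memory.storageFraction` as the memory setting is an assumption about which parameter is meant, and the value `0.6` is purely illustrative. The keys themselves are standard Spark configuration properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Sedona or Spark API) that gathers tuning settings. */
final class SparkTuningSettings {
    /** Returns the settings in insertion order. */
    static Map<String, String> kryoAndMemory() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Switch Spark from Java serialization to Kryo.
        conf.put("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // Assumption: this is the memory setting to tune; 0.6 is an illustrative value.
        conf.put("spark.memory.storageFraction", "0.6");
        return conf;
    }

    /** Renders the settings as "--conf key=value" arguments for spark-submit. */
    static String toSubmitArgs(Map<String, String> conf) {
        StringBuilder args = new StringBuilder();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            if (args.length() > 0) {
                args.append(' ');
            }
            args.append("--conf ").append(entry.getKey()).append('=').append(entry.getValue());
        }
        return args.toString();
    }
}
```

The same key and value pairs can also be set on a `SparkConf` object or in `spark-defaults.conf`; older Spark releases use different memory settings, so check the configuration page for your Spark version.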