This page shows how to read and write single-line and multiline GeoJSON files with Apache Sedona and Spark.
It concludes with a summary of the benefits and drawbacks of the GeoJSON file format for spatial analyses.
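GeoJSON is based on JSON and supports the following geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection.
See the GeoJSON specification (RFC 7946) for more details about the format.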
Here’s how to read a multiline GeoJSON file with Sedona:
from pyspark.sql.functions import expr

df = (
    sedona.read.format("geojson")
    .option("multiLine", "true")
    .load("data/multiline_geojson.json")
    .selectExpr("explode(features) as features")
    .select("features.*")
    .withColumn("prop0", expr("properties['prop0']"))
    .drop("properties")
    .drop("type")
)
df.show(truncate=False)
Here’s the output:
+---------------------------------------------+------+
|geometry                                     |prop0 |
+---------------------------------------------+------+
|POINT (102 0.5)                              |value0|
|LINESTRING (102 0, 103 1, 104 0, 105 1)      |value1|
|POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2|
+---------------------------------------------+------+
The multiline GeoJSON file contains a point, a linestring, and a polygon. Let’s inspect the content of the file:
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
        ]
      },
      "properties": {
        "prop0": "value1",
        "prop1": 0.0
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
            [100.0, 1.0], [100.0, 0.0]
          ]
        ]
      },
      "properties": {
        "prop0": "value2",
        "prop1": {"this": "that"}
      }
    }
  ]
}
Notice how the data is modeled as a FeatureCollection. Each feature has a geometry type, geometry coordinates, and properties.
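To make the FeatureCollection structure concrete, here is a minimal plain-Python sketch of the same flattening that the explode and select steps perform in the Sedona example. It is only an illustration, not how Sedona works internally; the `flatten_features` helper and the inline sample data are hypothetical.

```python
import json

# Hypothetical inline FeatureCollection, shortened from the file shown above.
feature_collection = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
     "properties": {"prop0": "value0"}},
    {"type": "Feature",
     "geometry": {"type": "LineString",
                  "coordinates": [[102.0, 0.0], [103.0, 1.0]]},
     "properties": {"prop0": "value1", "prop1": 0.0}}
  ]
}
""")


def flatten_features(collection):
    """Return one flat dict per feature: geometry type, coordinates, prop0."""
    rows = []
    for feature in collection["features"]:
        rows.append({
            "geometry_type": feature["geometry"]["type"],
            "coordinates": feature["geometry"]["coordinates"],
            # Features may carry different properties, so use .get().
            "prop0": feature["properties"].get("prop0"),
        })
    return rows


rows = flatten_features(feature_collection)
for row in rows:
    print(row["geometry_type"], row["prop0"])
```

Each feature becomes one row, which is the shape the DataFrame has after the features array is exploded.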
You can also read many multiline GeoJSON files. Suppose you have the following GeoJSON files:
many_geojsons/
  file1.json
  file2.json
Here's how you can read many GeoJSON files:
df = (
    sedona.read.format("geojson")
    .option("multiLine", "true")
    .load("data/many_geojsons")
)
You just need to pass the directory that contains the JSON files.
Multiline GeoJSON is nicely formatted for humans but inefficient for machines. It’s better to store each feature as a complete JSON object on its own line.
Here’s how to read single-line GeoJSON files with Sedona:
df = (
    sedona.read.format("geojson")
    .load("data/singleline_geojson.json")
    .withColumn("prop0", expr("properties['prop0']"))
    .drop("properties")
    .drop("type")
)
df.show(truncate=False)
Here’s the result:
+---------------------------------------------+------+
|geometry                                     |prop0 |
+---------------------------------------------+------+
|POINT (102 0.5)                              |value0|
|LINESTRING (102 0, 103 1, 104 0, 105 1)      |value1|
|POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2|
+---------------------------------------------+------+
Here’s the data:
{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}
Notice how the multiline GeoJSON file wraps all features in a single FeatureCollection, whereas each row of the single-line GeoJSON file is a standalone Feature.
Single-line GeoJSON files are better because query engines can split them at line boundaries and read the pieces in parallel.
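If you have a multiline FeatureCollection and want the single-line layout, a conversion can be sketched with the Python standard library. This is an illustrative sketch under the assumption that the whole collection fits in memory; the `feature_collection_to_lines` helper is hypothetical and not part of Sedona.

```python
import json


def feature_collection_to_lines(collection):
    """Serialize each feature of a FeatureCollection as one compact JSON line."""
    return [
        # Compact separators remove the whitespace json.dumps adds by default.
        json.dumps(feature, separators=(",", ":"))
        for feature in collection["features"]
    ]


# Hypothetical in-memory sample; in practice this would come from json.load().
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
            "properties": {"prop0": "value0"},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[102.0, 0.0], [103.0, 1.0]],
            },
            "properties": {"prop0": "value1"},
        },
    ],
}

lines = feature_collection_to_lines(collection)
print("\n".join(lines))
```

Writing the joined lines to a file gives one Feature per row, matching the single-line layout shown above.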
Now, let's see how to create GeoJSON files with Sedona by writing out DataFrames.
Let’s create a Sedona DataFrame and then write it out to GeoJSON files:
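from pyspark.sql.functions import col
from sedona.sql.st_constructors import ST_GeomFromText

df = sedona.createDataFrame([
    ("a", 'LINESTRING(2.0 5.0,6.0 1.0)'),
    ("b", 'LINESTRING(7.0 4.0,9.0 2.0)'),
    ("c", 'LINESTRING(1.0 3.0,3.0 1.0)'),
], ["id", "geometry"])
# Convert the WKT strings into a geometry column before writing.
actual = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
actual.write.format("geojson").mode("overwrite").save("/tmp/a_thing")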
Here are the files that get written:
a_thing/
  _SUCCESS
  part-00000-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
  part-00003-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
  part-00007-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
  part-00011-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
Sedona writes multiple GeoJSON files in parallel, which is faster than writing a single file.
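Some downstream tools expect a single FeatureCollection rather than a directory of part files. The following standard-library sketch merges the part files, assuming each one holds one Feature per line. The `merge_part_files` helper and the sample file names and contents are hypothetical, and the approach only suits output small enough to fit in memory.

```python
import glob
import json
import os
import tempfile


def merge_part_files(directory):
    """Collect features from every part-*.json file into one FeatureCollection."""
    features = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*.json"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    features.append(json.loads(line))
    return {"type": "FeatureCollection", "features": features}


# Demo with two hypothetical part files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = {
        "part-00000.json": {
            "type": "Feature",
            "geometry": {"type": "LineString", "coordinates": [[2.0, 5.0], [6.0, 1.0]]},
            "properties": {"id": "a"},
        },
        "part-00001.json": {
            "type": "Feature",
            "geometry": {"type": "LineString", "coordinates": [[7.0, 4.0], [9.0, 2.0]]},
            "properties": {"id": "b"},
        },
    }
    for name, feature in sample.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(json.dumps(feature) + "\n")
    # Marker files such as _SUCCESS do not match the glob pattern and are skipped.
    open(os.path.join(tmp_dir, "_SUCCESS"), "w").close()
    merged = merge_part_files(tmp_dir)

print(len(merged["features"]))
```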
Note that the DataFrame must contain at least one column with geometry type for the write operation to work. Sedona will use the following rules to determine which column to use as the geometry:
You can also manually specify which geometry column to use with the “geometry.column” option:
df.write.format("geojson").option("geometry.column", "geometry").save("/tmp/a_thing")
Now let’s read these GeoJSON files into a DataFrame:
df = sedona.read.format("geojson").load("/tmp/a_thing")
df.show(truncate=False)
+---------------------+----------+-------+
|geometry             |properties|type   |
+---------------------+----------+-------+
|LINESTRING (1 3, 3 1)|{c}       |Feature|
|LINESTRING (2 5, 6 1)|{a}       |Feature|
|LINESTRING (7 4, 9 2)|{b}       |Feature|
+---------------------+----------+-------+
The GeoJSON file format has many advantages:
However, GeoJSON has many downsides, making it a suboptimal choice for storing geospatial data.
The GeoJSON format has many limitations that can make it a slow option for spatial data lakes:
GeoJSON is a common file format in spatial data analyses, and it’s convenient that Apache Sedona offers full read and write capabilities.
GeoJSON is well-supported and human-readable, but it’s pretty slow compared to formats like GeoParquet. It’s generally best to use GeoParquet or Iceberg for spatial data analyses because the performance is much better.