Spark IOTDB connector

aim of design

Use Spark SQL to read IOTDB data and return it to the client in the form of a Spark DataFrame

main idea

Because IOTDB has the ability to parse and execute SQL, this part can directly forward SQL to the IOTDB process for execution, and then convert the data to RDD.

Implementation process

1.Entrance

src/main/scala/org/apache/iotdb/spark/db/DefaultSource.scala

2. Building Relation

Relation mainly saves RDD meta-information, such as column names, partitioning strategies, and so on. Calling Relation's buildScan method can create RDDs

src/main/scala/org/apache/iotdb/spark/db/IoTDBRelation.scala

3. Building RDD

RDD executes SQL request to IOTDB and saves cursor

The compute method in src / main / scala / org / apache / iotdb / spark / db / IoTDBRDD.scala

4.Iterative RDD

Due to Spark's lazy loading mechanism, the RDD iteration is called specifically when the user traverses the RDD, which is the fetch result of IOTDB

getNext method in src / main / scala / org / apache / iotdb / spark / db / IoTDBRDD.scala

Wide and narrow table structure conversion

Wide table structure: IOTDB native path format

time	root.ln.wf02.wt02.temperature	root.ln.wf02.wt02.status	root.ln.wf02.wt02.hardware	root.ln.wf01.wt01.temperature	root.ln.wf01.wt01.status	root.ln.wf01.wt01.hardware
1	null	true	null	2.2	true	null
2	null	false	aaa	2.2	null	null
3	null	null	null	2.1	true	null
4	null	true	bbb	null	null	null
5	null	null	null	null	false	null
6	null	null	ccc	null	null	null

Narrow table structure: Relational database schema, IOTDB align by device format

time	device_name	status	hardware	temperature
1	root.ln.wf02.wt01	true	null	2.2
1	root.ln.wf02.wt02	true	null	null
2	root.ln.wf02.wt01	null	null	2.2
2	root.ln.wf02.wt02	false	aaa	null
3	root.ln.wf02.wt01	true	null	2.1
4	root.ln.wf02.wt02	true	bbb	null
5	root.ln.wf02.wt01	false	null	null
6	root.ln.wf02.wt02	null	ccc	null

Because the data queried by IOTDB defaults to a wide table structure, a wide-narrow table conversion is required. There are two implementation methods as follows

1. Use the IOTDB group by device statement

This way you can get the narrow table structure directly, and the calculation is done by IOTDB

2. Use Transformer

You can use Transformer to convert between wide and narrow tables. The calculation is done by Spark.

src/main/scala/org/apache/iotdb/spark/db/Transformer.scala

Wide table to narrow table uses traversing the device list to generate the corresponding narrow table. The parallelization strategy is better (no shuffle). The narrow table to wide table uses a timestamp-based join operation. There is potential for shuffle. Performance issues