
Spark-Pinot Connector

Spark-Pinot connector for reading data from and writing data to Pinot.

Detailed read model documentation is available in the Spark-Pinot Connector Read Model document.

Features

  • Query realtime, offline, or hybrid tables
  • Distributed, parallel scan
  • SQL support instead of PQL
  • Column and filter push-down to optimize performance
  • Overlap between realtime and offline segments is queried exactly once for hybrid tables
  • Schema discovery
    • Dynamic inference
    • Static analysis of case classes (see the sketch after this list)
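
The two schema discovery modes differ in where the read schema comes from: dynamic inference asks Pinot, while static analysis derives the schema from a Scala case class. Below is a minimal sketch of the static route, assuming the airlineStats table from the Quick Start below, a hypothetical AirlineStat case class, and that tableType also accepts "hybrid"; the exact mechanism the connector uses for case class analysis may differ.

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class covering two airlineStats columns; its derived
// schema is handed to the reader instead of being inferred dynamically.
case class AirlineStat(DestStateName: String, Origin: String)

val spark = SparkSession.builder()
  .appName("schema-discovery-example")
  .master("local")
  .getOrCreate()
import spark.implicits._

// "hybrid" reads realtime and offline segments together; the overlapping
// time range is scanned exactly once.
val stats = spark.read
  .format("pinot")
  .schema(Encoders.product[AirlineStat].schema)  // static schema from the case class
  .option("table", "airlineStats")
  .option("tableType", "hybrid")
  .load()
  .as[AirlineStat]

stats.show(10)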

Quick Start

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .appName("spark-pinot-connector-test")
  .master("local")
  .getOrCreate()

import spark.implicits._

// Read the offline part of the airlineStats table; the filter is pushed
// down to Pinot rather than applied after the scan.
val data = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "offline")
  .load()
  .filter($"DestStateName" === "Florida")

data.show(100)
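
Because both column and filter push-down are supported, selecting only the columns you need prunes the scan on the Pinot side as well. A small follow-on sketch using the data frame above (the Origin column is assumed to exist in airlineStats):

// Requesting two columns lets the connector push the projection down to
// Pinot along with the DestStateName filter from the Quick Start.
val originCounts = data
  .select($"Origin", $"DestStateName")
  .groupBy($"Origin")
  .count()

originCounts.show()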

For more examples, see src/test/scala/example/ExampleSparkPinotConnectorTest.scala

The Spark-Pinot connector is built on the Spark DataSourceV2 API. For background on DataSourceV2, see this Databricks presentation:

https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang
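
To make the DataSourceV2 shape concrete, here is a minimal self-contained batch source in the Spark 3 form of the API. All names and the hard-coded rows are illustrative only; they are not the connector's actual internals.

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.unsafe.types.UTF8String

// Illustrative DataSourceV2 entry point: a single-column source that
// returns three hard-coded rows. A real connector would resolve the
// schema from the server and plan one partition per segment/server.
class ExampleSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("state", "string")
  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new ExampleTable(schema)
}

class ExampleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "example"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ExampleScan(tableSchema)
}

class ExampleScan(schema: StructType) extends ScanBuilder with Scan with Batch {
  override def build(): Scan = this
  override def readSchema(): StructType = schema
  override def toBatch(): Batch = this
  // One partition here; distributed sources return many for parallel scans.
  override def planInputPartitions(): Array[InputPartition] =
    Array(new ExamplePartition)
  override def createReaderFactory(): PartitionReaderFactory =
    new ExampleReaderFactory
}

class ExamplePartition extends InputPartition

class ExampleReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new PartitionReader[InternalRow] {
      private val rows = Iterator("Florida", "Texas", "Ohio")
      private var current: String = _
      override def next(): Boolean = {
        val has = rows.hasNext
        if (has) current = rows.next()
        has
      }
      override def get(): InternalRow = InternalRow(UTF8String.fromString(current))
      override def close(): Unit = ()
    }
}

With the class on the classpath, spark.read.format(classOf[ExampleSource].getName).load() resolves the source by its fully qualified class name.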

Future Work

  • Add integration tests for the read operation
  • Streaming endpoint support for the read operation
  • Add write support (Pinot segment write logic will change in later versions of Pinot)