
# Spark-Pinot Connector

A Spark connector for reading data from Apache Pinot.

Detailed read model documentation is available under the `documentation/` directory: Spark-Pinot Connector Read Model.

## Features

- Query realtime, offline or hybrid tables
- Distributed, parallel scan
- Streaming reads using gRPC (optional; see the option sketch after this list)
- SQL support instead of PQL
- Column and filter push-down to optimize performance (see the example after the Quick Start)
- Overlap between realtime and offline segments is queried exactly once for hybrid tables
- Schema discovery
  - Dynamic inference
  - Static analysis of case classes
- Supports query options (also shown in the sketch below)
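
For instance, the gRPC streaming read path and query options are enabled through reader options. A minimal sketch, assuming the option names `useGrpcServer` and `queryOptions`; verify both against the read model documentation for your connector version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pinot-read-options")
  .master("local")
  .getOrCreate()

val df = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "hybrid")
  // Assumed option name: stream results from Pinot servers over gRPC.
  .option("useGrpcServer", "true")
  // Assumed option name and format: forwarded to Pinot as query options.
  .option("queryOptions", "timeoutMs=30000")
  .load()

df.printSchema()
```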

## Quick Start

```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .appName("spark-pinot-connector-test")
  .master("local")
  .getOrCreate()

import spark.implicits._

val data = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "offline")
  .load()
  .filter($"DestStateName" === "Florida")

data.show(100)
```
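
Since column and filter push-down is supported, selecting only the columns you need shrinks the scan on the Pinot side. The snippet below is plain Spark on top of the Quick Start read; the column names other than `DestStateName` are assumed from the standard `airlineStats` sample table:

```scala
// Project and filter early: the connector pushes both the column list
// and the predicate down to Pinot, so only matching rows and the three
// selected columns are transferred.
val delayed = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "offline")
  .load()
  .select($"Carrier", $"DestStateName", $"ArrDelay")
  .filter($"ArrDelay" > 0)

delayed.show()
```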

## Examples

There are more examples included in `src/test/scala/.../ExampleSparkPinotConnectorTest.scala`. You can run the examples locally (e.g. using your IDE) in standalone mode by starting a local Pinot cluster. See: https://docs.pinot.apache.org/basics/getting-started/running-pinot-locally

You can also run the tests in cluster mode using the following command:

```bash
export SPARK_CLUSTER=<YOUR_YARN_OR_SPARK_CLUSTER>

# Edit the ExampleSparkPinotConnectorTest to get rid of `.master("local")` and rebuild the jar before running this command
spark-submit \
    --class org.apache.pinot.connector.spark.v3.datasource.ExampleSparkPinotConnectorTest \
    --jars ./target/pinot-spark-3-connector-0.13.0-SNAPSHOT-shaded.jar \
    --master $SPARK_CLUSTER \
    --deploy-mode cluster \
    ./target/pinot-spark-3-connector-0.13.0-SNAPSHOT-tests.jar
```

The Spark-Pinot connector uses the Spark DataSourceV2 API. For background on the DataSourceV2 API, see this Databricks presentation:

https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang

## Future Work

- Add integration tests for the read operation
- Add write support (Pinot segment write logic will change in later versions of Pinot)