
Qualification Tool

The Qualification Tool analyzes Spark event log files to determine the compatibility and performance of SQL workloads with Gluten.

Build

To compile and package the Qualification Tool, run the following Maven command:

mvn clean package

This creates qualification-tool-1.3.0-SNAPSHOT-jar-with-dependencies.jar in the target directory.

Run

To execute the tool, use the following command:

java -jar target/qualification-tool-1.3.0-SNAPSHOT-jar-with-dependencies.jar -f <eventFile>

Parameters:

  • -f <eventFile>: Path to the Spark event log file(s). This can be:
    • A single event log file.
    • A folder containing multiple event log files.
    • Deeply nested folders of event log files.
    • Compressed event log files.
    • Rolling event log files.
    • A comma-separated list of files.
  • -k <gcsKey>: (Optional) Path to Google Cloud Storage service account keys.
  • -o <output>: (Optional) Path to the directory where output will be written. Defaults to a temporary directory.
  • -t <threads>: (Optional) Number of processing threads. Defaults to 4.
  • -v: (Optional) Enable non-verbose output. Omit this flag for verbose mode.
  • -p <project>: (Optional) Project ID for the run.
  • -d <dateFilter>: (Optional) Analyze only files created after this date (format: YYYY-MM-DD). Defaults to the last 90 days.

Example Usage:

java -jar target/qualification-tool-1.3.0-SNAPSHOT-jar-with-dependencies.jar -f /path/to/eventlog

Advanced Example:

java -jar target/qualification-tool-1.3.0-SNAPSHOT-jar-with-dependencies.jar -f /path/to/folder -o /output/path -t 8 -d 2023-01-01 -k /path/to/gcs_keys.json -p my_project

Features

  • Analyzes Spark SQL execution plans for compatibility with Gluten.
  • Supports single files, folders, deeply nested folders, compressed files, and rolling event logs.
  • Provides detailed reports on supported and unsupported operations.
  • Generates metrics on SQL execution times and operator impact.
  • Configurable verbosity and threading.

How It Works

The Qualification Tool analyzes a Spark plan to determine the compatibility of its nodes/operators and clusters with Gluten. Here's a step-by-step explanation of the process:

Example Spark Plan

Consider the following Spark plan:

            G
        /        \
      G[2]       G[3]
       |            |
      S[2]       G[3]
       |            |
      G          S
    /    \
  G[1]    G
  |
  G[1]
  |
  G
  • G: Represents a node supported by Gluten.
  • S: Represents a node not supported by Gluten.
  • [1], [2], [3]: Indicate the Whole Stage Code Gen block (cluster) the node belongs to.
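
For illustration, such a plan can be modeled as a small tree. The sketch below is hypothetical (PlanNode and its fields are illustrative names, not the tool's actual classes); each visitor pass described next walks a tree of this shape:

import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one node in the Spark plan.
class PlanNode {
    final String operator;   // e.g. "FileSourceScanExec"
    final Integer clusterId; // WSCG cluster id ([1], [2], ...), or null if none
    final List<PlanNode> children = new ArrayList<>();
    boolean supported;       // filled in by the visitor passes below

    PlanNode(String operator, Integer clusterId) {
        this.operator = operator;
        this.clusterId = clusterId;
    }
}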

1. NodeSupportVisitor

The first step marks each node as supported (*) or not supported (!) by Gluten:

            *G
        /        \
      *G[2]       *G[3]
       |            |
      !S[2]       *G[3]
       |            |
      *G           !S
    /    \
  *G[1]    *G
  |
  *G[1]
  |
  *G
  • All supported nodes are marked with *.
  • All unsupported nodes are marked with !.
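
A minimal sketch of this pass, assuming the set of operators Gluten supports is available as a simple lookup table (the real tool derives this from its own support rules):

import java.util.Set;

// Pass 1: mark each node individually, based on an operator lookup.
class NodeSupportVisitor {
    private final Set<String> supportedOperators;

    NodeSupportVisitor(Set<String> supportedOperators) {
        this.supportedOperators = supportedOperators;
    }

    void visit(PlanNode node) {
        node.supported = supportedOperators.contains(node.operator);
        for (PlanNode child : node.children) {
            visit(child);
        }
    }
}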

2. ClusterSupportVisitor

The second step marks entire clusters as not supported (!) if any node in the cluster is unsupported:

            *G
        /        \
      !G[2]       *G[3]
       |            |
      !S[2]       *G[3]
       |            |
      *G           !S
    /    \
  *G[1]    *G
  |
  *G[1]
  |
  *G

Reasoning:

Although Gluten supports the remaining operators in the cluster, breaking Whole Stage Code Gen (WSCG) boundaries introduces row-to-columnar and columnar-to-row conversions, degrading performance. Hence, we pessimistically mark the entire cluster as not supported.
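
A sketch of this pass under the same assumptions: one walk collects the ids of clusters containing an unsupported node, and a second walk marks every node in those clusters as unsupported:

import java.util.HashSet;
import java.util.Set;

// Pass 2: poison whole WSCG clusters that contain any unsupported node.
class ClusterSupportVisitor {
    void visit(PlanNode root) {
        Set<Integer> badClusters = new HashSet<>();
        collect(root, badClusters);
        mark(root, badClusters);
    }

    private void collect(PlanNode node, Set<Integer> bad) {
        if (!node.supported && node.clusterId != null) {
            bad.add(node.clusterId);
        }
        for (PlanNode child : node.children) {
            collect(child, bad);
        }
    }

    private void mark(PlanNode node, Set<Integer> bad) {
        if (node.clusterId != null && bad.contains(node.clusterId)) {
            node.supported = false;
        }
        for (PlanNode child : node.children) {
            mark(child, bad);
        }
    }
}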

3. ChildSupportVisitor

The final step marks a node as not supported if any of its children is unsupported, propagating the result up through its parents:

            !G
        /        \
      !G[2]       !G[3]
       |            |
      !S[2]       !G[3]
       |            |
      *G           !S
    /    \
  *G[1]    *G
  |
  *G[1]
  |
  *G

Reasoning:

If a child node is not supported by Gluten, row-to-columnar and columnar-to-row conversions are added, degrading performance. Therefore, we pessimistically mark such nodes as not supported.
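
A sketch of this pass: a post-order walk that marks a node as not supported as soon as any of its children is, so a break low in the plan propagates all the way to the root:

// Pass 3: propagate unsupported status from children to parents.
class ChildSupportVisitor {
    void visit(PlanNode node) {
        for (PlanNode child : node.children) {
            visit(child); // resolve the child's subtree first
            if (!child.supported) {
                node.supported = false;
            }
        }
    }
}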

Sample Output

AppsRecommendedForBoost.tsv

| applicationId | applicationName | batchUuid | rddPercentage | unsupportedSqlPercentage | totalTaskTime | supportedTaskTime | supportedSqlPercentage | recommendedForBoost | expectedRuntimeReduction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| app-20241001064609-0000 | POC - StoreItem | a56bee42 | 0.0% | 5.2% | 3244484900 | 3074633672 | 94.7% | true | 28.4% |
| application_1729530422325_0001 | job-audience-generation-21 | UNKNOWN | 0.3% | 0.6% | 1378942742 | 1365259621 | 99.0% | true | 29.7% |
| application_1729410290443_0001 | job-audience-generation-27 | UNKNOWN | 0.0% | 0.7% | 1212612881 | 1203690936 | 99.2% | true | 29.7% |
| app-20241001080158-0000 | POC - StoreItem | 94088807 | 0.0% | 12.5% | 668601513 | 584812056 | 87.4% | true | 26.2% |
| application_1729991008434_0001 | DailyLoad | UNKNOWN | 0.0% | 12.6% | 17134675 | 14963535 | 87.3% | true | 26.1% |
| application_1730097715348_0003 | Spark shell | UNKNOWN | 0.0% | 0.0% | 4680 | 4680 | 100.0% | true | 30.0% |
| application_1728526688221_0001 | job-audience-generation-27 | UNKNOWN | 0.4% | 59.4% | 805991113 | 323274568 | 40.1% | false | 12.0% |
| application_1726070403340_0450 | driver.py | UNKNOWN | 0.0% | 81.1% | 398992332 | 75187879 | 18.8% | false | 5.6% |
| application_1727842554841_0001 | driver.py | UNKNOWN | 0.0% | 58.3% | 166686890 | 69492539 | 41.6% | false | 12.5% |
| application_1726070403340_0474 | driver.py | UNKNOWN | 0.0% | 81.6% | 325389669 | 59687634 | 18.3% | false | 5.5% |
| application_1729133595704_0001 | job-audience-generation-54 | UNKNOWN | 0.6% | 99.3% | 472621872 | 20205 | 0.0% | false | 0.0% |
| application_1730097715348_0002 | Spark shell | UNKNOWN | 33.8% | 16.6% | 4197 | 2077 | 49.5% | false | 14.8% |
| application_1730097715348_0001 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1712155629437_0011 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1712155629437_0012 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1712155629437_0007 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| app-20241120163343-0000 | Config Test | c2884285 | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1712155629437_0008 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1712155629437_0010 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1730097715348_0004 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1718075857669_0261 | UnregisteredAsset | UNKNOWN | 0.0% | 100.0% | 12205458 | 0 | 0.0% | false | 0.0% |
| application_1712155629437_0009 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |
| application_1712155629437_0013 | Spark shell | UNKNOWN | 0.0% | 0.0% | 0 | 0 | 0.0% | false | 0.0% |

UnsupportedOperators.tsv

| unsupportedOperator | cumulativeCpuMs | count |
| --- | --- | --- |
| Scan with format JSON not supported | 172260671668 | 73105 |
| Parquet with struct | 172260671668 | 73105 |
| Execute InsertIntoHadoopFsRelationCommand not supported | 37884897400 | 405732 |
| BHJ Not Supported | 20120117622 | 132164 |
| SerializeFromObject not supported | 13856742899 | 16098 |
| MapPartitions not supported | 13856742899 | 16098 |
| FlatMapGroupsInPandas not supported | 257970400 | 224 |
| Sample not supported | 157615298 | 432 |
| DeserializeToObject not supported | 157610379 | 6721 |
| MapElements not supported | 52616286 | 94 |
| Parquet with struct and map | 290204 | 4 |
| Execute CreateViewCommand not supported | 0 | 157 |
| Execute CreateTableCommand not supported | 0 | 42 |
| WriteFiles not supported | 0 | 20 |
| Execute SetCommand not supported | 0 | 12 |
| Execute SaveIntoDataSourceCommand not supported | 0 | 10 |
| CreateTable not supported | 0 | 4 |
| Execute RefreshTableCommand not supported | 0 | 3 |
| Execute DeltaDynamicPartitionOverwriteCommand not supported | 0 | 2 |
| Execute TruncateTableCommand not supported | 0 | 2 |
| Execute AlterTableAddPartitionCommand not supported | 0 | 1 |
| CreateNamespace not supported | 0 | 1 |

Summary

The tool estimates how well a Spark plan will perform with Gluten by:

  1. Identifying individual node compatibility.
  2. Accounting for cluster boundaries and WSCG optimizations.
  3. Considering child dependencies and their impact on parent nodes.
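
Putting it together, a hypothetical driver over the sketches above would chain the three passes in order:

import java.util.Set;

class QualificationPipeline {
    public static void main(String[] args) {
        // A tiny plan: Project <- Filter <- Scan, with Filter and Scan in WSCG cluster 1.
        PlanNode scan = new PlanNode("FileSourceScanExec", 1);
        PlanNode filter = new PlanNode("FilterExec", 1);
        PlanNode root = new PlanNode("ProjectExec", null);
        filter.children.add(scan);
        root.children.add(filter);

        // Illustrative support matrix; the real tool ships its own rules.
        Set<String> supported = Set.of("FileSourceScanExec", "FilterExec", "ProjectExec");
        new NodeSupportVisitor(supported).visit(root);
        new ClusterSupportVisitor().visit(root);
        new ChildSupportVisitor().visit(root);

        System.out.println("Plan fully supported by Gluten: " + root.supported);
    }
}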

Requirements

  • Java: Ensure you have JDK 11 or later installed.