Comet Fuzz is a standalone project for generating random data and queries, executing the queries against Spark with Comet disabled and then enabled, and checking for incompatibilities between the two runs.
Although it is a simple tool, it has already been useful in finding many bugs.
Comet Fuzz is inspired by the SparkFuzz paper from Databricks and CWI.
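At its core this is differential testing: the same query runs once with Comet disabled and once with Comet enabled, and the results are compared. The snippet below is only a minimal sketch of that idea, not the tool's actual implementation; the test file, view name, and query are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object DifferentialCheckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("comet-fuzz-sketch").getOrCreate()

    // Placeholder test file and query; Comet Fuzz generates these randomly.
    spark.read.parquet("test0.parquet").createOrReplaceTempView("test0")
    val query = "SELECT c0, count(*) FROM test0 GROUP BY c0 ORDER BY c0"

    // Run with plain Spark ...
    spark.conf.set("spark.comet.enabled", "false")
    val sparkRows = spark.sql(query).collect()

    // ... then with Comet, and compare row by row (the ORDER BY makes this valid).
    spark.conf.set("spark.comet.enabled", "true")
    val cometRows = spark.sql(query).collect()

    if (!sparkRows.sameElements(cometRows)) {
      println(s"Incompatibility found for query: $query")
    }
    spark.stop()
  }
}
```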
Planned areas of improvement:
From the root of the project, run `mvn install -DskipTests` to install Comet.
Then build the fuzz testing jar:

```shell
mvn package
```
Set appropriate values for the `SPARK_HOME`, `SPARK_MASTER`, and `COMET_JAR` environment variables and then use `spark-submit` to run CometFuzz against a Spark cluster.
```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --class org.apache.comet.fuzz.Main \
    target/comet-fuzz-spark3.4_2.12-0.7.0-SNAPSHOT-jar-with-dependencies.jar \
    data --num-files=2 --num-rows=200 --exclude-negative-zero --generate-arrays --generate-structs --generate-maps
```
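The data generation step writes out random test files containing both primitive and nested columns (hence the `--generate-arrays`, `--generate-structs`, and `--generate-maps` flags). The following is only a conceptual sketch of that kind of generation using plain Spark APIs, not the tool's implementation; the schema, row count, and output path are made up for illustration.

```scala
import scala.util.Random
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object DataGenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("datagen-sketch").getOrCreate()
    val rand = new Random(42)

    // A schema mixing primitive and nested types.
    val schema = StructType(Seq(
      StructField("c0", IntegerType),
      StructField("c1", DoubleType),
      StructField("c2", ArrayType(IntegerType)),
      StructField("c3", StructType(Seq(StructField("a", StringType)))),
      StructField("c4", MapType(StringType, IntegerType))))

    // Generate random rows matching the schema.
    val rows = (0 until 200).map { _ =>
      Row(
        rand.nextInt(1000),
        rand.nextDouble(),
        Seq.fill(rand.nextInt(4))(rand.nextInt(10)),
        Row(rand.alphanumeric.take(4).mkString),
        Map(rand.alphanumeric.take(2).mkString -> rand.nextInt(10)))
    }

    // Write the random data as a Parquet test file (placeholder path).
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
      .write.mode("overwrite").parquet("test0.parquet")
    spark.stop()
  }
}
```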
There is an optional `--exclude-negative-zero` flag for excluding `-0.0` from the generated data. This is sometimes useful because Rust and Java handle this value differently, so we already expect differing behavior for this edge case.
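The reason `-0.0` is a known trouble spot is that even on the JVM alone its semantics are inconsistent: primitive comparison treats `-0.0` and `0.0` as equal, while total ordering and boxed equality (which underpin sorting and grouping) do not, and Comet's native (Rust) side makes its own choices on top of that. A quick illustration of the JVM behavior:

```scala
object NegativeZeroDemo extends App {
  // Primitive comparison says the two values are equal ...
  println(-0.0 == 0.0)                                // true

  // ... but total ordering and boxed equality distinguish them,
  // and these are what sorting and grouping rely on.
  println(java.lang.Double.compare(-0.0, 0.0))        // -1
  println(java.lang.Double.valueOf(-0.0).equals(0.0)) // false

  // The underlying bit patterns also differ.
  println(java.lang.Double.doubleToRawLongBits(-0.0)) // -9223372036854775808
  println(java.lang.Double.doubleToRawLongBits(0.0))  // 0
}
```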
Generate random queries based on the available test files:
```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --class org.apache.comet.fuzz.Main \
    target/comet-fuzz-spark3.4_2.12-0.7.0-SNAPSHOT-jar-with-dependencies.jar \
    queries --num-files=2 --num-queries=500
```
Note that the output filename is currently hard-coded as `queries.sql`.
Finally, run the generated queries, with the Comet plugin on the classpath:

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.comet.exec.all.enabled=true \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --class org.apache.comet.fuzz.Main \
    target/comet-fuzz-spark3.4_2.12-0.7.0-SNAPSHOT-jar-with-dependencies.jar \
    run --num-files=2 --filename=queries.sql
```
Note that the output filename is currently hard-coded as `results-${System.currentTimeMillis()}.md`.