
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 # Comet Fuzz

 Comet Fuzz is a standalone project that generates random data and queries, executes each query against Spark
 with Comet disabled and with Comet enabled, and checks the results for incompatibilities.

 Although it is a simple tool, it has already been useful in finding many bugs.

 Comet Fuzz is inspired by the [SparkFuzz](https://ir.cwi.nl/pub/30222) paper from Databricks and CWI.

 ## Roadmap

 Planned areas of improvement:

 - ANSI mode
 - Support for all data types, expressions, and operators supported by Comet
 - IF and CASE WHEN expressions
 - Complex (nested) expressions
 - Literal scalar values in queries
 - Add option to avoid grouping and sorting on floating-point columns
 - Improve join query support:
   - Support joins without join keys
   - Support composite join keys
   - Support multiple join keys
   - Support join conditions that use expressions

 ## Usage

 From the root of the project, run `mvn install -DskipTests` to install Comet.

 Then build the fuzz testing jar.

 ```shell
 mvn package
 ```

 Set the `SPARK_HOME`, `SPARK_MASTER`, and `COMET_JAR` environment variables to appropriate values, then use
 `spark-submit` to run Comet Fuzz against a Spark cluster.
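
As a rough illustration, the three variables might be set as follows. Every value here is a placeholder, not a path that the project prescribes: the Spark location, the master URL, and the Comet jar path all depend on your installation and build.

```shell
# Hypothetical values; adjust all three to match your environment.

# Directory containing bin/spark-submit.
export SPARK_HOME=/opt/spark

# local[*] runs Spark in-process on all cores; a standalone cluster
# would use a URL of the form spark://host:7077 instead.
export SPARK_MASTER='local[*]'

# Placeholder path to the Comet jar built in the previous step; the
# real filename depends on the Spark and Scala versions you built for.
export COMET_JAR=/path/to/comet-spark.jar
```

The master URL is quoted so that the shell does not treat `[*]` as a glob pattern.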

 ### Generating Data Files

 ```shell
 $SPARK_HOME/bin/spark-submit \
     --master $SPARK_MASTER \
     --class org.apache.comet.fuzz.Main \
     target/comet-fuzz-spark3.5_2.12-0.13.0-SNAPSHOT-jar-with-dependencies.jar \
     data --num-files=2 --num-rows=200 --exclude-negative-zero --generate-arrays --generate-structs --generate-maps
 ```

 The optional `--exclude-negative-zero` flag excludes `-0.0` from the generated data. This is sometimes useful
 because Comet and Spark are already known to behave differently for this edge case, owing to differences in
 how Rust and Java handle this value.

 ### Generating Queries

 Generate random queries that are based on the available test files.

 ```shell
 $SPARK_HOME/bin/spark-submit \
     --master $SPARK_MASTER \
     --class org.apache.comet.fuzz.Main \
     target/comet-fuzz-spark3.5_2.12-0.13.0-SNAPSHOT-jar-with-dependencies.jar \
     queries --num-files=2 --num-queries=500
 ```

 Note that the output filename is currently hard-coded as `queries.sql`.

 ### Execute Queries

 ```shell
 $SPARK_HOME/bin/spark-submit \
     --master $SPARK_MASTER \
     --conf spark.memory.offHeap.enabled=true \
     --conf spark.memory.offHeap.size=16G \
     --conf spark.plugins=org.apache.spark.CometPlugin \
     --conf spark.comet.enabled=true \
     --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
     --conf spark.comet.exec.shuffle.enabled=true \
     --jars $COMET_JAR \
     --conf spark.driver.extraClassPath=$COMET_JAR \
     --conf spark.executor.extraClassPath=$COMET_JAR \
     --class org.apache.comet.fuzz.Main \
     target/comet-fuzz-spark3.5_2.12-0.13.0-SNAPSHOT-jar-with-dependencies.jar \
     run --num-files=2 --filename=queries.sql
 ```

 Note that the output filename is currently hard-coded as `results-${System.currentTimeMillis()}.md`.
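
Because each run writes a new timestamped file, it can be handy to locate the most recent one. The helper below is a minimal sketch, not part of Comet Fuzz: `latest_results` is a hypothetical function name, and it assumes the results files are written to the current directory.

```shell
# Hypothetical helper: print the results file with the largest embedded
# timestamp, or print nothing if no results files exist.
latest_results() {
    for f in results-*.md; do
        # If the glob matched nothing, "$f" is the literal pattern; skip it.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | sort -t - -k 2n | tail -n 1
}

latest="$(latest_results)"
if [ -n "$latest" ]; then
    echo "Latest results: $latest"
else
    echo "No results files found"
fi
```

Sorting numerically on the timestamp in the filename, rather than on modification time, keeps the result stable even if the files are copied or touched later.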

 ### Compare existing datasets

 To compare a pair of existing datasets, you can use the comparison tool.
 The example below compares TPC-H query results generated by pure Spark and by Comet.

 ```shell
 $SPARK_HOME/bin/spark-submit \
     --master $SPARK_MASTER \
     --class org.apache.comet.fuzz.ComparisonTool \
     target/comet-fuzz-spark3.5_2.12-0.13.0-SNAPSHOT-jar-with-dependencies.jar \
     compareParquet --input-spark-folder=/tmp/tpch/spark --input-comet-folder=/tmp/tpch/comet
 ```

 The tool takes a pair of existing folders with the same layout and compares their subfolders, treating each as a Parquet-based dataset.
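
Since the two folders are expected to share a layout, a quick check before running the comparison can catch a missing or extra dataset early. This is a sketch under assumptions: `list_subfolders` and `same_layout` are hypothetical helpers, not part of the tool, and the check only looks at the names of the immediate subfolders.

```shell
# Hypothetical helper: list the immediate subfolder names of a directory,
# one per line, in sorted order.
list_subfolders() {
    for d in "$1"/*/; do
        # If the glob matched nothing, "$d" is the literal pattern; skip it.
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done | sort
}

# Hypothetical helper: succeed only if both arguments are directories
# that contain the same set of subfolder names.
same_layout() {
    [ -d "$1" ] && [ -d "$2" ] &&
        [ "$(list_subfolders "$1")" = "$(list_subfolders "$2")" ]
}

# Example paths taken from the command above.
if same_layout /tmp/tpch/spark /tmp/tpch/comet; then
    echo "Layouts match"
else
    echo "Layouts differ or a folder is missing" >&2
fi
```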