docs/source/contributor-guide/development.md - datafusion-comet - Git at Google

 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 # Comet Development Guide

 ## Project Layout

 ```
 ├── common     <- common Java/Scala code
 ├── conf       <- configuration files
 ├── native     <- native code, in Rust
 ├── spark      <- Spark integration
 ```

 ## Threading Architecture

 Comet's native execution runs on a shared tokio multi-threaded runtime. Understanding this
 architecture is important because it affects how you write native operators and JVM callbacks.

 ### How execution works

 Spark calls into native code via JNI from an **executor task thread**. There are two execution
 paths depending on whether the plan reads data from the JVM:

 **Async I/O path (no JVM data sources, e.g. Iceberg scans):** The DataFusion stream is spawned
 onto a tokio worker thread and batches are delivered to the executor thread via an `mpsc` channel.
 The executor thread parks in `blocking_recv()` until the next batch is ready. This avoids
 busy-polling on I/O-bound workloads.

 **JVM data source path (ScanExec present):** The executor thread calls `block_on()` and polls the
 DataFusion stream directly, interleaving `pull_input_batches()` calls on `Poll::Pending` to feed
 data from the JVM into ScanExec operators.

 In both cases, DataFusion operators execute on **tokio worker threads**, not on the Spark executor
 task thread. All Spark tasks on an executor share one tokio runtime.

 ### Rules for native code

 **Do not use `thread_local!` or assume thread identity.** Tokio may run your operator's `poll`
 method on any worker thread, and may move it between threads across polls. Any state must live
 in the operator struct or be shared via `Arc`.

 **JNI calls work from any thread, but have overhead.** `JVMClasses::get_env()` calls
 `AttachCurrentThread`, which acquires JVM internal locks. The `AttachGuard` detaches the thread
 when dropped. Repeated attach/detach cycles on tokio workers add overhead, so avoid calling
 into the JVM on hot paths during stream execution.

 **Do not call `TaskContext.get()` from JVM callbacks during execution.** Spark's `TaskContext` is
 a `ThreadLocal` on the executor task thread. JVM methods invoked from tokio worker threads will
 see `null`. If you need task metadata, capture it at construction time (in `createPlan` or
 operator setup) and store it in the operator. See `CometTaskMemoryManager` for an example — it
 captures `TaskContext.get().taskMemoryManager()` in its constructor and uses the stored reference
 thereafter.

 **Memory pool operations call into the JVM.** `CometUnifiedMemoryPool` and `CometFairMemoryPool`
 call `acquireMemory()` / `releaseMemory()` via JNI whenever DataFusion operators grow or shrink
 memory reservations. This happens on whatever thread the operator is executing on. These calls
 are thread-safe (they use stored `GlobalRef`s, not thread-locals), but they do trigger
 `AttachCurrentThread`.

 **Scalar subqueries call into the JVM.** `Subquery::evaluate()` calls static methods on
 `CometScalarSubquery` via JNI. These use a static `HashMap`, not thread-locals, so they are
 safe from any thread.

 **Parquet encryption calls into the JVM.** `CometKeyRetriever::retrieve_key()` calls the JVM
 to unwrap decryption keys during Parquet reads. It uses a stored `GlobalRef` and a cached
 `JMethodID`, so it is safe from any thread.

 ### The tokio runtime

 The runtime is created once per executor JVM in a `Lazy<Runtime>` static:

 - **Worker threads:** `num_cpus` by default, configurable via `COMET_WORKER_THREADS`
 - **Max blocking threads:** 512 by default, configurable via `COMET_MAX_BLOCKING_THREADS`
 - All async I/O (S3, HTTP, Parquet reads) runs on worker threads as non-blocking futures

 ### Summary of what is safe and what is not

 | Pattern                                   | Safe?  | Notes                                    |
 | ----------------------------------------- | ------ | ---------------------------------------- |
 | `Arc<T>` shared across operators          | Yes    | Standard Rust thread safety              |
 | `JVMClasses::get_env()` from tokio worker | Yes    | Attaches thread to JVM automatically     |
 | `thread_local!` in operator code          | **No** | Tokio moves tasks between threads        |
 | `TaskContext.get()` in JVM callback       | **No** | Returns `null` on non-executor threads   |
 | Storing `JNIEnv` in an operator           | **No** | `JNIEnv` is thread-specific              |
 | Capturing state at plan creation time     | Yes    | Runs on executor thread, store in struct |

 ## Global singletons

 Comet code runs in both the driver and executor JVM processes, and different parts of the
 codebase run in each. Global singletons have **process lifetime** — they are created once and
 never dropped until the JVM exits. Since multiple Spark jobs, queries, and tasks share the same
 process, this makes it difficult to reason about what state a singleton holds and whether it is
 still valid.

 ### How to recognize them

 **Rust:** `static` variables using `OnceLock`, `LazyLock`, `OnceCell`, `Lazy`, or `lazy_static!`:

 ```rust
 static TOKIO_RUNTIME: OnceLock<Runtime> = OnceLock::new();
 static TASK_SHARED_MEMORY_POOLS: Lazy<Mutex<HashMap<i64, PerTaskMemoryPool>>> = Lazy::new(..);
 ```

 **Java:** `static` fields, especially mutable collections:

 ```java
 private static final HashMap<Long, HashMap<Long, ScalarSubquery>> subqueryMap = new HashMap<>();
 ```

 **Scala:** `object` declarations (companion objects are JVM singletons) holding mutable state:

 ```scala
 object MyCache {
   private val cache = new ConcurrentHashMap[String, Value]()
 }
 ```

 ### Why they are dangerous

 - **Credential staleness.** A singleton caching an authenticated client will hold stale
   credentials after token rotation, causing silent failures mid-job.
 - **Unbounded growth.** A cache keyed by file path or configuration grows with every query
   but never shrinks. Over hours of process uptime this becomes a memory leak.
 - **Cross-job contamination.** Different Spark jobs on the same process may use different
   configurations. A singleton initialized by the first job silently serves wrong state to
   subsequent jobs.
 - **Testing difficulty.** Global state persists across test cases, making tests
   order-dependent.

 ### When a singleton is acceptable

 Some state genuinely has process lifetime:

 | Singleton                                     | Why it is safe                                      |
 | --------------------------------------------- | --------------------------------------------------- |
 | `TOKIO_RUNTIME`                               | One runtime per executor, no configuration variance |
 | `JAVA_VM` / `JVM_CLASSES`                     | One JVM per process, set once at JNI load           |
 | `OperatorRegistry` / `ExpressionRegistry`     | Immutable after initialization                      |
 | Compiled `Regex` patterns (`LazyLock<Regex>`) | Stateless and immutable                             |

 ### When to avoid a singleton

 If any of these apply, do **not** use a global singleton:

 - The state depends on configuration that can vary between jobs or queries
 - The state holds credentials or authenticated connections that will not expire or invalidate appropriately
 - The state grows proportionally to the number of queries or files processed
 - The state needs cleanup or refresh during process lifetime

 Instead, scope state to the plan or task by adding the cache as a field in an existing session or context object.

 If a singleton is truly needed, add a comment explaining why `static` is the right lifetime,
 whether the cache is bounded, and how credential refresh is handled (if applicable).

 ## Development Setup

 1. Make sure `JAVA_HOME` is set and point to JDK using [support matrix](../user-guide/latest/installation.md)
 2. Install Rust toolchain. The easiest way is to use
    [rustup](https://rustup.rs).

 ## Build & Test

 A few common commands are specified in project's `Makefile`:

 - `make`: compile the entire project, but don't run tests
 - `make test-rust`: compile the project and run tests in Rust side
 - `make test-jvm`: compile the project and run tests in Java side
 - `make test`: compile the project and run tests in both Rust and Java
   side.
 - `make release`: compile the project and creates a release build. This
   is useful when you want to test Comet local installation in another project
   such as Spark.
 - `make clean`: clean up the workspace

 ## Common Build and Test Pitfalls

 ### Native Code Must Be Built First

 The native Rust code must be compiled before running JVM tests. If you skip this step, tests will
 fail because they cannot find the native library. Always run `make core` (or `cd native && cargo build`)
 before running Maven tests.

 ```sh
 # Correct order
 make core                    # Build native code first
 ./mvnw test -Dsuites="..."   # Then run JVM tests
 ```

 ### Debug vs Release Mode

 There is no need to use release mode (`make release`) during normal development. Debug builds
 are faster to compile and provide better error messages. Only use release mode when:

 - Running benchmarks
 - Testing performance
 - Creating a release build for deployment

 For regular development and testing, use `make` or `make core` which build in debug mode.

 ### Running Rust Tests

 When running Rust tests directly with `cargo test`, the JVM library (`libjvm.so`) must be on
 your library path. Set the `LD_LIBRARY_PATH` environment variable to include your JDK's `lib/server`
 directory:

 ```sh
 # Find your libjvm.so location (example for typical JDK installation)
 export LD_LIBRARY_PATH=$JAVA_HOME/lib/server:$LD_LIBRARY_PATH

 # Now you can run Rust tests
 cd native && cargo test
 ```

 Alternatively, use `make test-rust` which handles the JVM compilation dependency automatically.

 ### Avoid Using `-pl` to Select Modules

 When running Maven tests, avoid using `-pl spark` to select only the spark module. This can cause
 Maven to pick up the `common` module from your local Maven repository instead of using the current
 codebase, leading to inconsistent test results:

 ```sh
 # Avoid this - may use stale common module from local repo
 ./mvnw test -pl spark -Dsuites="..."

 # Do this instead - builds and tests with current code
 ./mvnw test -Dsuites="..."
 ```

 ### Use `wildcardSuites` for Running Tests

 When running specific test suites, use `wildcardSuites` instead of `suites` for more flexible
 matching. The `wildcardSuites` parameter allows partial matching of suite names:

 ```sh
 # Run all suites containing "CometCast"
 ./mvnw test -DwildcardSuites="CometCast"

 # Run specific suite with filter
 ./mvnw test -Dsuites="org.apache.comet.CometCastSuite valid"
 ```

 ## Development Environment

 Comet is a multi-language project with native code written in Rust and JVM code written in Java and Scala.
 For Rust code, the CLion IDE is recommended. For JVM code, IntelliJ IDEA is recommended.

 Before opening the project in an IDE, make sure to run `make` first to generate the necessary files for the IDEs. Currently, it's mostly about
 generating protobuf message classes for the JVM side. It's only required to run `make` once after cloning the repo.

 ### IntelliJ IDEA

 First make sure to install the Scala plugin in IntelliJ IDEA.
 After that, you can open the project in IntelliJ IDEA. The IDE should automatically detect the project structure and import as a Maven project.

 Comet uses generated source files that are too large for IntelliJ's default size limit for code inspections. To avoid IDE errors
 (missing definitions, etc.) caused by IntelliJ skipping these generated files, modify
 [IntelliJ's Platform Properties](https://intellij-support.jetbrains.com/hc/en-us/articles/206544869-Configuring-JVM-options-and-platform-properties)
 by going to `Help -> Edit Custom Properties...`. For example, adding `idea.max.intellisense.filesize=16384` increases the file
 size limit to 16 MB.

 ### CLion

 First make sure to install the Rust plugin in CLion or you can use the dedicated Rust IDE: RustRover.
 After that you can open the project in CLion. The IDE should automatically detect the project structure and import as a Cargo project.

 ### SQL file tests (recommended for expressions)

 For testing expressions and operators, prefer using SQL file tests over writing Scala test
 code. SQL file tests are plain `.sql` files that are automatically discovered and executed --
 no Scala code to write, and no recompilation needed when tests change. This makes it easy to
 iterate quickly and to get good coverage of edge cases and argument combinations.

 See the [SQL File Tests](sql-file-tests) guide for the full documentation on how to write
 and run these tests.

 ### Running Tests in IDEA

 Like other Maven projects, you can run tests in IntelliJ IDEA by right-clicking on the test class or test method and selecting "Run" or "Debug".
 However if the tests is related to the native side. Please make sure to run `make core` or `cd native && cargo build` before running the tests in IDEA.

 ### Running Tests from command line

 It is possible to specify which ScalaTest suites you want to run from the CLI using the `suites`
 argument, for example if you only want to execute the test cases that contains _valid_
 in their name in `org.apache.comet.CometCastSuite` you can use

 ```sh
 ./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometCastSuite valid"
 ```

 Other options for selecting specific suites are described in the [ScalaTest Maven Plugin documentation](https://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin)

 ## Plan Stability Testing

 Comet has a plan stability testing framework that can be used to test the stability of the query plans generated by Comet.
 The plan stability testing framework is located in the `spark` module.

 ### Using the Helper Script

 The easiest way to regenerate golden files is to use the provided script:

 ```sh
 # Regenerate golden files for all Spark versions
 ./dev/regenerate-golden-files.sh

 # Regenerate only for a specific Spark version
 ./dev/regenerate-golden-files.sh --spark-version 3.5
 ```

 The script verifies that JDK 17+ is configured (required for Spark 4.0), installs Comet for each
 Spark version, and runs the plan stability tests with `SPARK_GENERATE_GOLDEN_FILES=1`.

 ### Manual Instructions

 Alternatively, you can run the tests manually using the following commands.

 Note that the output files get written to `$SPARK_HOME`.

 The tests can be run with:

 ```sh
 export SPARK_HOME=`pwd`
 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
 ```

 and

 ```sh
 export SPARK_HOME=`pwd`
 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test
 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test
 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test
 ```

 If your pull request changes the query plans generated by Comet, you should regenerate the golden files.
 To regenerate the golden files, you can run the following commands.

 ```sh
 export SPARK_HOME=`pwd`
 SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
 SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
 SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
 ```

 and

 ```sh
 export SPARK_HOME=`pwd`
 SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test
 SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test
 SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test
 ```

 ## Benchmark

 There's a `make` command to run micro benchmarks in the repo. For
 instance:

 ```
 make benchmark-org.apache.spark.sql.benchmark.CometReadBenchmark
 ```

 To run TPC-H or TPC-DS micro benchmarks, please follow the instructions
 in the respective source code, e.g., `CometTPCHQueryBenchmark`.

 ## Debugging

 Comet is a multi-language project with native code written in Rust and JVM code written in Java and Scala.
 It is possible to debug both native and JVM code concurrently as described in the [DEBUGGING guide](debugging)

 ## Submitting a Pull Request

 Before submitting a pull request, follow this checklist to ensure your changes are ready:

 ### 1. Format Your Code

 Comet uses `cargo fmt`, [Scalafix](https://github.com/scalacenter/scalafix) and [Spotless](https://github.com/diffplug/spotless/tree/main/plugin-maven) to
 automatically format the code. Run the following command to format all code:

 ```sh
 make format
 ```

 ### 2. Build and Verify

 After formatting, run a full build to ensure everything compiles correctly and generated
 documentation is up to date:

 ```sh
 make
 ```

 This builds both native and JVM code. Fix any compilation errors before proceeding.

 ### 3. Run Clippy (Recommended)

 It's strongly recommended to run Clippy locally to catch potential issues before the CI/CD pipeline does. You can run the same Clippy checks used in CI/CD with:

 ```bash
 cd native
 cargo clippy --color=never --all-targets --workspace -- -D warnings
 ```

 Make sure to resolve any Clippy warnings before submitting your pull request, as the CI/CD pipeline will fail if warnings are present.

 ### 4. Run Tests

 Run the relevant tests for your changes:

 ```sh
 # Run all tests
 make test

 # Or run only Rust tests
 make test-rust

 # Or run only JVM tests (native must be built first)
 make test-jvm
 ```

 ### 5. Register New Test Suites in CI

 Comet's CI does not automatically discover test suites. Instead, test suites are explicitly listed
 in the GitHub Actions workflow files so they can be grouped by category and run as separate parallel
 jobs. This reduces overall CI time.

 If you add a new Scala test suite, you must add it to the `suite` matrix in **both** workflow files:

 - `.github/workflows/pr_build_linux.yml`
 - `.github/workflows/pr_build_macos.yml`

 Each file contains a `suite` matrix with named groups such as `fuzz`, `shuffle`, `parquet`, `csv`,
 `exec`, `expressions`, and `sql`. Add your new suite's fully qualified class name to the
 appropriate group. For example, if you add a new expression test suite, add it to the
 `expressions` group:

 ```yaml
 - name: "expressions"
   value: |
     org.apache.comet.CometExpressionSuite
     # ... existing suites ...
     org.apache.comet.YourNewExpressionSuite  # <-- add here
 ```

 Choose the group that best matches the area your test covers:

 | Group         | Covers                                                     |
 | ------------- | ---------------------------------------------------------- |
 | `fuzz`        | Fuzz testing and data generation                           |
 | `shuffle`     | Shuffle operators and related exchange behavior            |
 | `parquet`     | Parquet read/write and native reader tests                 |
 | `csv`         | CSV native read tests                                      |
 | `exec`        | Execution operators, joins, aggregates, plan rules, TPC-\* |
 | `expressions` | Expression evaluation, casts, and SQL file tests           |
 | `sql`         | SQL-level behavior tests                                   |

 **Important:** The suite lists in both workflow files must stay in sync. A separate CI check
 (`.github/workflows/pr_missing_suites.yml`) runs `dev/ci/check-suites.py` on every pull request.
 It scans for all `*Suite.scala` files in the repository and verifies that each one appears in both
 workflow files. If any suite is missing, this check will fail and block the PR.

 ### Pre-PR Summary

 ```sh
 make format   # Format code
 make          # Build everything and update generated docs
 make test     # Run tests (optional but recommended)
 ```

 ## How to format `.md` document

 We are using `prettier` to format `.md` files.

 You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. Using `npx` required a working node environment. Upgrading to the latest prettier is recommended (by adding `--upgrade` to the `npm` command).

 ```bash
 $ prettier --version
 2.3.0
 ```

 After you've confirmed your prettier version, you can format all the `.md` files:

 ```bash
 npx prettier "**/*.md" --write
 ```