```
├── common <- common Java/Scala code
├── conf   <- configuration files
├── native <- native code, in Rust
├── spark  <- Spark integration
```
Comet's native execution runs on a shared tokio multi-threaded runtime. Understanding this architecture is important because it affects how you write native operators and JVM callbacks.
Spark calls into native code via JNI from an executor task thread. There are two execution paths depending on whether the plan reads data from the JVM:
Async I/O path (no JVM data sources, e.g. Iceberg scans): The DataFusion stream is spawned onto a tokio worker thread and batches are delivered to the executor thread via an mpsc channel. The executor thread parks in blocking_recv() until the next batch is ready. This avoids busy-polling on I/O-bound workloads.
JVM data source path (ScanExec present): The executor thread calls block_on() and polls the DataFusion stream directly, interleaving pull_input_batches() calls on Poll::Pending to feed data from the JVM into ScanExec operators.
In both cases, DataFusion operators execute on tokio worker threads, not on the Spark executor task thread. All Spark tasks on an executor share one tokio runtime.
Do not use thread_local! or assume thread identity. Tokio may run your operator's poll method on any worker thread, and may move it between threads across polls. Any state must live in the operator struct or be shared via Arc.
JNI calls work from any thread, but have overhead. JVMClasses::get_env() calls AttachCurrentThread, which acquires JVM internal locks. The AttachGuard detaches the thread when dropped. Repeated attach/detach cycles on tokio workers add overhead, so avoid calling into the JVM on hot paths during stream execution.
Do not call TaskContext.get() from JVM callbacks during execution. Spark's TaskContext is a ThreadLocal on the executor task thread. JVM methods invoked from tokio worker threads will see null. If you need task metadata, capture it at construction time (in createPlan or operator setup) and store it in the operator. See CometTaskMemoryManager for an example — it captures TaskContext.get().taskMemoryManager() in its constructor and uses the stored reference thereafter.
Memory pool operations call into the JVM. CometUnifiedMemoryPool and CometFairMemoryPool call acquireMemory() / releaseMemory() via JNI whenever DataFusion operators grow or shrink memory reservations. This happens on whatever thread the operator is executing on. These calls are thread-safe (they use stored GlobalRefs, not thread-locals), but they do trigger AttachCurrentThread.
Scalar subqueries call into the JVM. Subquery::evaluate() calls static methods on CometScalarSubquery via JNI. These use a static HashMap, not thread-locals, so they are safe from any thread.
Parquet encryption calls into the JVM. CometKeyRetriever::retrieve_key() calls the JVM to unwrap decryption keys during Parquet reads. It uses a stored GlobalRef and a cached JMethodID, so it is safe from any thread.
The runtime is created once per executor JVM in a `Lazy<Runtime>` static:
The number of worker threads defaults to `num_cpus` and is configurable via `COMET_WORKER_THREADS`; the maximum number of blocking threads is configurable via `COMET_MAX_BLOCKING_THREADS`.

| Pattern | Safe? | Notes |
|---|---|---|
| `Arc<T>` shared across operators | Yes | Standard Rust thread safety |
| `JVMClasses::get_env()` from tokio worker | Yes | Attaches thread to JVM automatically |
| `thread_local!` in operator code | No | Tokio moves tasks between threads |
| `TaskContext.get()` in JVM callback | No | Returns null on non-executor threads |
| Storing `JNIEnv` in an operator | No | `JNIEnv` is thread-specific |
| Capturing state at plan creation time | Yes | Runs on executor thread, store in struct |
Comet code runs in both the driver and executor JVM processes, and different parts of the codebase run in each. Global singletons have process lifetime — they are created once and never dropped until the JVM exits. Since multiple Spark jobs, queries, and tasks share the same process, this makes it difficult to reason about what state a singleton holds and whether it is still valid.
Rust: static variables using OnceLock, LazyLock, OnceCell, Lazy, or lazy_static!:
```rust
static TOKIO_RUNTIME: OnceLock<Runtime> = OnceLock::new();
static TASK_SHARED_MEMORY_POOLS: Lazy<Mutex<HashMap<i64, PerTaskMemoryPool>>> =
    Lazy::new(..);
```
Java: static fields, especially mutable collections:
```java
private static final HashMap<Long, HashMap<Long, ScalarSubquery>> subqueryMap =
    new HashMap<>();
```
Scala: object declarations (companion objects are JVM singletons) holding mutable state:
```scala
object MyCache {
  private val cache = new ConcurrentHashMap[String, Value]()
}
```
Some state genuinely has process lifetime:
| Singleton | Why it is safe |
|---|---|
| `TOKIO_RUNTIME` | One runtime per executor, no configuration variance |
| `JAVA_VM` / `JVM_CLASSES` | One JVM per process, set once at JNI load |
| `OperatorRegistry` / `ExpressionRegistry` | Immutable after initialization |
| Compiled `Regex` patterns (`LazyLock<Regex>`) | Stateless and immutable |
If the state is tied to a particular query, session, or task, can grow without bound, or holds credentials that may need refreshing, do not use a global singleton.
Instead, scope state to the plan or task by adding the cache as a field in an existing session or context object.
If a singleton is truly needed, add a comment explaining why static is the right lifetime, whether the cache is bounded, and how credential refresh is handled (if applicable).
Make sure `JAVA_HOME` is set and points to a JDK version listed in the support matrix. A few common commands are specified in the project's Makefile:
- `make`: compile the entire project, but don't run tests
- `make test-rust`: compile the project and run the Rust-side tests
- `make test-jvm`: compile the project and run the Java-side tests
- `make test`: compile the project and run tests on both the Rust and Java sides
- `make release`: compile the project and create a release build. This is useful when you want to test a Comet local installation in another project such as Spark.
- `make clean`: clean up the workspace

The native Rust code must be compiled before running JVM tests. If you skip this step, tests will fail because they cannot find the native library. Always run `make core` (or `cd native && cargo build`) before running Maven tests.
```shell
# Correct order
make core                    # Build native code first
./mvnw test -Dsuites="..."   # Then run JVM tests
```
There is no need to use release mode (`make release`) during normal development. Debug builds are faster to compile and provide better error messages. Only use release mode when measuring performance, or when creating a build to install into another project such as Spark.
For regular development and testing, use make or make core which build in debug mode.
When running Rust tests directly with cargo test, the JVM library (libjvm.so) must be on your library path. Set the LD_LIBRARY_PATH environment variable to include your JDK's lib/server directory:
```shell
# Find your libjvm.so location (example for a typical JDK installation)
export LD_LIBRARY_PATH=$JAVA_HOME/lib/server:$LD_LIBRARY_PATH

# Now you can run Rust tests
cd native && cargo test
```
Alternatively, use make test-rust which handles the JVM compilation dependency automatically.
When running Maven tests, avoid using `-pl spark` to select only the `spark` module. This can cause Maven to pick up the `common` module from your local Maven repository instead of using the current codebase, leading to inconsistent test results:
```shell
# Avoid this - may use a stale common module from the local repo
./mvnw test -pl spark -Dsuites="..."

# Do this instead - builds and tests with the current code
./mvnw test -Dsuites="..."
```
When running specific test suites, use `wildcardSuites` instead of `suites` for more flexible matching. The `wildcardSuites` parameter allows partial matching of suite names:
```shell
# Run all suites containing "CometCast"
./mvnw test -DwildcardSuites="CometCast"

# Run a specific suite with a filter
./mvnw test -Dsuites="org.apache.comet.CometCastSuite valid"
```
Comet is a multi-language project with native code written in Rust and JVM code written in Java and Scala. For Rust code, the CLion IDE is recommended. For JVM code, IntelliJ IDEA is recommended.
Before opening the project in an IDE, make sure to run `make` first to generate the necessary files for the IDEs. Currently, it's mostly about generating protobuf message classes for the JVM side. It's only required to run `make` once after cloning the repo.
First make sure to install the Scala plugin in IntelliJ IDEA. After that, you can open the project in IntelliJ IDEA. The IDE should automatically detect the project structure and import as a Maven project.
Comet uses generated source files that are too large for IntelliJ's default size limit for code inspections. To avoid IDE errors (missing definitions, etc.) caused by IntelliJ skipping these generated files, modify IntelliJ's Platform Properties by going to Help -> Edit Custom Properties.... For example, adding idea.max.intellisense.filesize=16384 increases the file size limit to 16 MB.
First make sure to install the Rust plugin in CLion or you can use the dedicated Rust IDE: RustRover. After that you can open the project in CLion. The IDE should automatically detect the project structure and import as a Cargo project.
For testing expressions and operators, prefer using SQL file tests over writing Scala test code. SQL file tests are plain .sql files that are automatically discovered and executed -- no Scala code to write, and no recompilation needed when tests change. This makes it easy to iterate quickly and to get good coverage of edge cases and argument combinations.
See the SQL File Tests guide for the full documentation on how to write and run these tests.
Like other Maven projects, you can run tests in IntelliJ IDEA by right-clicking on the test class or test method and selecting "Run" or "Debug". However, if the tests are related to the native side, make sure to run `make core` or `cd native && cargo build` before running them in the IDE.
It is possible to specify which ScalaTest suites you want to run from the CLI using the `suites` argument. For example, if you only want to execute the test cases that contain `valid` in their name in `org.apache.comet.CometCastSuite`, you can use:
```shell
./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometCastSuite valid"
```
Other options for selecting specific suites are described in the ScalaTest Maven Plugin documentation.
Comet has a plan stability testing framework that can be used to test the stability of the query plans generated by Comet. The plan stability testing framework is located in the spark module.
The easiest way to regenerate golden files is to use the provided script:
```shell
# Regenerate golden files for all Spark versions
./dev/regenerate-golden-files.sh

# Regenerate only for a specific Spark version
./dev/regenerate-golden-files.sh --spark-version 3.5
```
The script verifies that JDK 17+ is configured (required for Spark 4.0), installs Comet for each Spark version, and runs the plan stability tests with SPARK_GENERATE_GOLDEN_FILES=1.
Alternatively, you can run the tests manually using the following commands.
Note that the output files get written to $SPARK_HOME.
The tests can be run with:
```shell
export SPARK_HOME=`pwd`
./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
```
and
```shell
export SPARK_HOME=`pwd`
./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test
./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test
./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test
```
If your pull request changes the query plans generated by Comet, you should regenerate the golden files. To regenerate the golden files, you can run the following commands.
```shell
export SPARK_HOME=`pwd`
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
```
and
```shell
export SPARK_HOME=`pwd`
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test
```
There's a make command to run micro benchmarks in the repo. For instance:
```shell
make benchmark-org.apache.spark.sql.benchmark.CometReadBenchmark
```
To run TPC-H or TPC-DS micro benchmarks, please follow the instructions in the respective source code, e.g., CometTPCHQueryBenchmark.
Comet is a multi-language project with native code written in Rust and JVM code written in Java and Scala. It is possible to debug both native and JVM code concurrently as described in the DEBUGGING guide.
Before submitting a pull request, follow this checklist to ensure your changes are ready:
Comet uses cargo fmt, Scalafix and Spotless to automatically format the code. Run the following command to format all code:
make format
After formatting, run a full build to ensure everything compiles correctly and generated documentation is up to date:
make
This builds both native and JVM code. Fix any compilation errors before proceeding.
It's strongly recommended to run Clippy locally to catch potential issues before the CI/CD pipeline does. You can run the same Clippy checks used in CI/CD with:
```shell
cd native
cargo clippy --color=never --all-targets --workspace -- -D warnings
```
Make sure to resolve any Clippy warnings before submitting your pull request, as the CI/CD pipeline will fail if warnings are present.
Run the relevant tests for your changes:
```shell
# Run all tests
make test

# Or run only Rust tests
make test-rust

# Or run only JVM tests (native must be built first)
make test-jvm
```
Comet's CI does not automatically discover test suites. Instead, test suites are explicitly listed in the GitHub Actions workflow files so they can be grouped by category and run as separate parallel jobs. This reduces overall CI time.
If you add a new Scala test suite, you must add it to the suite matrix in both workflow files:
- `.github/workflows/pr_build_linux.yml`
- `.github/workflows/pr_build_macos.yml`

Each file contains a suite matrix with named groups such as `fuzz`, `shuffle`, `parquet`, `csv`, `exec`, `expressions`, and `sql`. Add your new suite's fully qualified class name to the appropriate group. For example, if you add a new expression test suite, add it to the `expressions` group:
```yaml
- name: "expressions"
  value: |
    org.apache.comet.CometExpressionSuite
    # ... existing suites ...
    org.apache.comet.YourNewExpressionSuite # <-- add here
```
Choose the group that best matches the area your test covers:
| Group | Covers |
|---|---|
| `fuzz` | Fuzz testing and data generation |
| `shuffle` | Shuffle operators and related exchange behavior |
| `parquet` | Parquet read/write and native reader tests |
| `csv` | CSV native read tests |
| `exec` | Execution operators, joins, aggregates, plan rules, TPC-* |
| `expressions` | Expression evaluation, casts, and SQL file tests |
| `sql` | SQL-level behavior tests |
Important: The suite lists in both workflow files must stay in sync. A separate CI check (.github/workflows/pr_missing_suites.yml) runs dev/ci/check-suites.py on every pull request. It scans for all *Suite.scala files in the repository and verifies that each one appears in both workflow files. If any suite is missing, this check will fail and block the PR.
```shell
make format  # Format code
make         # Build everything and update generated docs
make test    # Run tests (optional but recommended)
```
We are using prettier to format `.md` files.
You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. Using `npx` requires a working node environment. Upgrading to the latest prettier is recommended.
```shell
$ prettier --version
2.3.0
```
After you've confirmed your prettier version, you can format all the .md files:
```shell
npx prettier "**/*.md" --write
```