AGENTS.md

This file provides guidance to AI coding agents working with this repository.

Development Setup

Prerequisites: pyenv, JDK 21, Docker, docker-compose, jq

Optional: pbzip2 (parallel bzip2 — install via apt install pbzip2 or brew install pbzip2). Without it, .bz2 corpus decompression falls back to Python stdlib (slower).
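The fallback described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `decompress_bz2` and its return value are hypothetical. It assumes only that `pbzip2` accepts `-d` (decompress) and `-c` (write to stdout).

```python
import bz2
import shutil
import subprocess
from pathlib import Path


def decompress_bz2(archive: Path, target: Path) -> str:
    """Decompress `archive` into `target` and report which backend was used.

    Hypothetical helper: prefers pbzip2 when it is on PATH, otherwise
    streams the archive through the stdlib bz2 module.
    """
    pbzip2 = shutil.which("pbzip2")
    if pbzip2:
        # -d: decompress, -c: write to stdout (the input file is left in place)
        with target.open("wb") as out:
            subprocess.run([pbzip2, "-d", "-c", str(archive)], stdout=out, check=True)
        return "pbzip2"
    # Single-threaded fallback; copyfileobj streams in chunks, so large
    # corpora are never held in memory at once.
    with bz2.open(archive, "rb") as src, target.open("wb") as out:
        shutil.copyfileobj(src, out)
    return "stdlib"
```

Both paths produce the same output file; only the wall-clock time differs.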

make develop          # Install Python 3.12 via pyenv, create .venv, install all deps
source .venv/bin/activate  # Activate virtual environment

Common Commands

make lint             # Run ruff check on all Python source files
make test             # Run unit tests (pytest tests/)
pytest tests/path/to/test_file.py::TestClass::test_method  # Run a single test
make it               # Run integration tests via tox (requires Java, Docker; ~30 min)
make it312            # Integration tests for Python 3.12 only
make benchmark        # Run performance benchmarks (pytest benchmarks/)
make build            # Build distribution wheel
make clean            # Remove build artifacts, caches, tox environments

Code Style

Linter: ruff (configured in pyproject.toml under [tool.ruff])
Max line length: 180 characters
Run make lint before committing; CI enforces this on every PR

Architecture

Apache Solr Orbit is a macrobenchmarking framework for Apache Solr clusters. It runs benchmarks on an actor-based concurrent execution model built on the Thespian library.

Entry Points

solr-orbit / sb → solrorbit/benchmark.py:main — CLI for running benchmarks
solr-orbitd / sbd → solrorbit/benchmarkd.py:main — Daemon for distributed worker nodes

Core Package (`solrorbit/`)

Orchestration layer:

benchmark.py — CLI arg parsing, subcommands: run, list, info, generate, convert-workload
test_run_orchestrator.py — Pipeline execution: prepares the run, launches the cluster, runs the workload, publishes results
actor.py — Thespian actor system setup for parallel/distributed execution
config.py — Configuration loading and management
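As a rough picture of the subcommand layout listed for benchmark.py, here is an argparse sketch. The subcommand names and the two workload flags come from this file; everything else is an assumption, including the use of argparse, attaching the flags to `run`, and the `build_parser` name. Check benchmark.py for the real parser.

```python
import argparse

# Subcommand names as listed for benchmark.py.
SUBCOMMANDS = ("run", "list", "info", "generate", "convert-workload")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser layout; the real CLI defines many more options."""
    parser = argparse.ArgumentParser(prog="solr-orbit")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)
    for name in SUBCOMMANDS:
        sub = subparsers.add_parser(name)
        if name == "run":
            # Assumption: the workload source flags are accepted by `run`.
            sub.add_argument("--workload-path")
            sub.add_argument("--workload-repository")
    return parser


args = build_parser().parse_args(["run", "--workload-path", "./my-workload"])
print(args.subcommand, args.workload_path)
```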

Cluster management (builder/):

solr_provisioner.py — Download, install and launch Solr (from distribution, sources, or Docker)
provisioners/ — Generic node provisioning infrastructure
downloaders/ — Download Solr distributions
installers/ — Install Solr on provisioned nodes
launchers/ — Start/stop cluster nodes
executors/ — Execute remote commands on cluster nodes
configs/ — Jinja2 templates for cluster configuration

Benchmark execution:

workload/ — Load and manage workload definitions (test procedures, operations, schedules)
worker_coordinator/ — Coordinate distributed worker nodes; driver.py drives actual load
worker_coordinator/runner.py — Solr operation runners (SolrBulkIndex, SolrSearch, SolrCreateCollection, etc.)
metrics.py — Collect, store, and aggregate benchmark metrics (filesystem-backed; no external store)
telemetry.py — Solr-specific telemetry devices (JVM, node, collection, query, indexing, cache stats)
publisher.py — Publish and format benchmark results
result_writer.py — Write results to local filesystem (JSON/CSV)
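To make the JSON/CSV output concrete, the sketch below writes one list of result rows in both formats. It is illustrative only: `write_results`, the file names, and the row keys are hypothetical and do not reflect result_writer.py's actual interface.

```python
import csv
import json
from pathlib import Path


def write_results(results: list[dict], out_dir: Path, stem: str = "results") -> tuple[Path, Path]:
    """Write `results` as <stem>.json and <stem>.csv under `out_dir` (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / f"{stem}.json"
    csv_path = out_dir / f"{stem}.csv"

    json_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    # Union of all keys in first-seen order, so rows that lack a metric
    # still fit the header; DictWriter fills missing cells with "".
    fieldnames = list(dict.fromkeys(key for row in results for key in row))
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    return json_path, csv_path
```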

Data and connectivity:

client.py — SolrAdminClient and SolrClient (HTTP via requests/pysolr; Collections API, /select, /update)
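For orientation, the three Solr endpoints named above have the URL shapes sketched below. The helper names are hypothetical and are not the methods of SolrAdminClient or SolrClient. The sketch assumes a base URL of the form `http://host:8983/solr` and builds URLs only, without sending any request.

```python
from urllib.parse import urlencode


def collections_api_url(base_url: str, action: str, **params: str) -> str:
    """Collections API URL, e.g. action=CREATE with name and numShards."""
    query = urlencode({"action": action, **params})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def select_url(base_url: str, collection: str, q: str = "*:*", rows: int = 10) -> str:
    """Search handler URL for one collection."""
    query = urlencode({"q": q, "rows": rows})
    return f"{base_url.rstrip('/')}/{collection}/select?{query}"


def update_url(base_url: str, collection: str, commit: bool = False) -> str:
    """Update handler URL; documents are sent in the request body."""
    url = f"{base_url.rstrip('/')}/{collection}/update"
    return f"{url}?commit=true" if commit else url


print(collections_api_url("http://localhost:8983/solr", "CREATE", name="bench", numShards="2"))
print(select_url("http://localhost:8983/solr", "bench"))
print(update_url("http://localhost:8983/solr", "bench", commit=True))
```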
synthetic_data_generator/ — Generate synthetic test datasets
workload_generator/ — Generate workload definition files from existing Solr collections

Workload conversion:

conversion/workload_converter.py — Convert an OpenSearch Benchmark workload directory to Solr format
conversion/detector.py — Detect whether a workload uses OpenSearch-only operations/query DSL
conversion/query.py — Translate OpenSearch Query DSL to Solr JSON Query DSL
conversion/schema.py — Translate OpenSearch index mappings to Solr managed-schema.xml
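The kind of translation conversion/query.py performs can be illustrated with a small subset of clauses. This is a sketch under assumptions, not the module's actual logic. `translate_query` is a hypothetical name, only `match_all`, `term`, `match`, and `bool` are handled, and the mapping chosen here (Solr's `term` parser for `term`, the `lucene` parser with `df` for `match`) is one plausible choice. Query text is passed through without escaping.

```python
from typing import Any


def translate_query(os_query: dict[str, Any]) -> Any:
    """Translate a tiny subset of OpenSearch Query DSL to Solr JSON Query DSL."""
    if len(os_query) != 1:
        raise ValueError("expected exactly one top-level query clause")
    kind, body = next(iter(os_query.items()))

    if kind == "match_all":
        return "*:*"

    if kind in ("term", "match"):
        field, value = next(iter(body.items()))
        # Accept both {"field": value} and {"field": {"value"/"query": value}}.
        if isinstance(value, dict):
            value = value.get("value", value.get("query"))
        if kind == "term":
            # Unanalyzed exact match via Solr's term query parser.
            return {"term": {"f": field, "query": str(value)}}
        # Analyzed full-text match via the standard parser with a default field.
        return {"lucene": {"df": field, "query": str(value)}}

    if kind == "bool":
        translated: dict[str, list[Any]] = {}
        for occur in ("must", "should", "must_not", "filter"):
            clauses = body.get(occur)
            if clauses is None:
                continue
            if isinstance(clauses, dict):  # a single clause may be given without a list
                clauses = [clauses]
            translated[occur] = [translate_query(clause) for clause in clauses]
        return {"bool": translated}

    raise NotImplementedError(f"unsupported query clause: {kind}")
```

The result would go under the `query` key of a Solr JSON request body, for example `{"query": translate_query(q)}`.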

Utilities:

utils/ — IO, process management, console output, network, version parsing, options handling
cloud_provider/ — Cloud provider integrations (AWS via boto3, GCP via google-auth)
visualizations/ — Result visualization

Test Structure

tests/ — Unit tests mirroring solrorbit/ structure
it/ — Integration tests (spin up real Solr clusters via Docker/provisioning)
benchmarks/ — Performance benchmarks for Solr Orbit itself

Workload System

Workloads are defined as JSON/YAML files with:

Operations: individual actions (bulk indexing, search queries)
Test procedures: sequences of operations with parameters and schedules
Corpora: dataset files (compatible with OpenSearch Benchmark format)

Workloads must be in Solr format; use solr-orbit convert-workload to convert one from OpenSearch Benchmark format. A workload can be loaded from a local path (--workload-path) or from a Git workload repository (--workload-repository).
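To show how the three parts relate, here is a toy workload and two sanity checks. Every key name in it (`corpora`, `operations`, `test_procedures`, `schedule`, `operation-type`, `source-file`) is an assumption borrowed from OpenSearch Benchmark naming. The actual Solr workload schema may differ, so treat this as a shape sketch rather than a valid workload.

```python
import json

# Hypothetical section names; verify against a real workload file.
REQUIRED_SECTIONS = ("corpora", "operations", "test_procedures")

EXAMPLE_WORKLOAD = json.loads("""
{
  "corpora": [{"name": "docs", "documents": [{"source-file": "docs.json.bz2"}]}],
  "operations": [
    {"name": "index", "operation-type": "bulk-index"},
    {"name": "match-all", "operation-type": "search"}
  ],
  "test_procedures": [
    {"name": "default", "schedule": [{"operation": "index"}, {"operation": "match-all"}]}
  ]
}
""")


def missing_sections(workload: dict) -> list[str]:
    """Return the required top-level sections that are absent."""
    return [section for section in REQUIRED_SECTIONS if section not in workload]


def undefined_operations(workload: dict) -> list[str]:
    """Return scheduled operation names that no entry under operations defines."""
    defined = {op["name"] for op in workload.get("operations", [])}
    scheduled = [
        step["operation"]
        for procedure in workload.get("test_procedures", [])
        for step in procedure.get("schedule", [])
    ]
    return [name for name in scheduled if name not in defined]


print(missing_sections(EXAMPLE_WORKLOAD), undefined_operations(EXAMPLE_WORKLOAD))
```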

Pipeline Execution Flow

Prepare — Load workload, configure metrics store
Build (optional) — Download and provision Solr cluster
Run — Execute test procedure via worker coordinator and drivers
Publish — Store metrics, generate report
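The four stages above can be sketched as a simple ordered pipeline in which Build is skipped when an existing cluster is targeted. All names here are hypothetical placeholders: the real orchestration lives in test_run_orchestrator.py and runs through the actor system rather than as plain function calls.

```python
from typing import Callable

Stage = tuple[str, Callable[[dict], None]]


def prepare(ctx: dict) -> None:
    ctx["workload"] = "loaded"          # placeholder for loading the workload


def build(ctx: dict) -> None:
    ctx["cluster"] = "provisioned"      # placeholder for download and provisioning


def run(ctx: dict) -> None:
    ctx["samples"] = [1.0, 2.0]         # placeholder for executing the test procedure


def publish(ctx: dict) -> None:
    ctx["report"] = f"{len(ctx['samples'])} samples"  # placeholder for the report


def run_pipeline(ctx: dict, provision_cluster: bool) -> list[str]:
    """Run the stages in order and return the names of those executed."""
    stages: list[Stage] = [("prepare", prepare)]
    if provision_cluster:
        stages.append(("build", build))  # optional stage
    stages += [("run", run), ("publish", publish)]

    executed = []
    for name, stage in stages:
        stage(ctx)
        executed.append(name)
    return executed


print(run_pipeline({}, provision_cluster=False))
```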

Key Technologies

Python 3.12+ with pysolr (data ops), requests (HTTP admin), psutil (I/O metrics), thespian (actor model), pytest (tests), tabulate (console output)
Metrics store: local filesystem — JSON/CSV result files at ~/.solr-orbit/, SQLite test-runs store
Docs: Jekyll 4.x + just-the-docs gem in docs/; deployed to GitHub Pages via .github/workflows/docs.yml