AGENTS.md

This file provides guidance to AI coding agents working with this repository.

Development Setup

Prerequisites: pyenv, JDK 21, Docker, docker-compose, jq

Optional: pbzip2 (parallel bzip2 — install via apt install pbzip2 or brew install pbzip2). Without it, .bz2 corpus decompression falls back to Python stdlib (slower).
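
The fallback described above can be pictured roughly as follows. This is a minimal sketch, not the repository's actual code: the function name `decompress_bz2` and its return value are hypothetical, and it assumes only that `pbzip2` is looked up on `PATH`.

```python
import bz2
import shutil
import subprocess
from pathlib import Path


def decompress_bz2(src: Path, dest: Path) -> str:
    """Decompress src (.bz2) into dest and return the backend that was used.

    Hypothetical helper for illustration only.
    """
    pbzip2 = shutil.which("pbzip2")
    if pbzip2:
        # -d: decompress, -k: keep the source file, -c: write to stdout
        with dest.open("wb") as out:
            subprocess.run([pbzip2, "-d", "-k", "-c", str(src)], stdout=out, check=True)
        return "pbzip2"
    # Fallback: single-threaded stdlib decompression, streamed in chunks
    with bz2.open(src, "rb") as fin, dest.open("wb") as out:
        shutil.copyfileobj(fin, out)
    return "stdlib"
```

Both branches produce the same output file; only throughput on large corpora should differ.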

make develop          # Install Python 3.12 via pyenv, create .venv, install all deps
source .venv/bin/activate  # Activate virtual environment

Common Commands

make lint             # Run ruff check on all Python source files
make test             # Run unit tests (pytest tests/)
pytest tests/path/to/test_file.py::TestClass::test_method  # Run a single test
make it               # Run integration tests via tox (requires Java, Docker; ~30 min)
make it312            # Integration tests for Python 3.12 only
make benchmark        # Run performance benchmarks (pytest benchmarks/)
make build            # Build distribution wheel
make clean            # Remove build artifacts, caches, tox environments

Code Style

  • Linter: ruff (configured in pyproject.toml under [tool.ruff])
  • Max line length: 180 characters
  • Run make lint before committing; CI enforces this on every PR

Architecture

Apache Solr Orbit is a macrobenchmarking framework for Apache Solr clusters, using an actor-based concurrent execution model via the Thespian library.

Entry Points

  • solr-orbit / sb → solrorbit/benchmark.py:main — CLI for running benchmarks
  • solr-orbitd / sbd → solrorbit/benchmarkd.py:main — Daemon for distributed worker nodes

Core Package (solrorbit/)

Orchestration layer:

  • benchmark.py — CLI arg parsing, subcommands: run, list, info, generate, convert-workload
  • test_run_orchestrator.py — Pipeline execution: prepares the run, launches the cluster, runs the workload, publishes results
  • actor.py — Thespian actor system setup for parallel/distributed execution
  • config.py — Configuration loading and management

Cluster management (builder/):

  • solr_provisioner.py — Download, install and launch Solr (from distribution, sources, or Docker)
  • provisioners/ — Generic node provisioning infrastructure
  • downloaders/ — Download Solr distributions
  • installers/ — Install Solr on provisioned nodes
  • launchers/ — Start/stop cluster nodes
  • executors/ — Execute remote commands on cluster nodes
  • configs/ — Jinja2 templates for cluster configuration

Benchmark execution:

  • workload/ — Load and manage workload definitions (test procedures, operations, schedules)
  • worker_coordinator/ — Coordinate distributed worker nodes; driver.py drives actual load
  • worker_coordinator/runner.py — Solr operation runners (SolrBulkIndex, SolrSearch, SolrCreateCollection, etc.)
  • metrics.py — Collect, store, and aggregate benchmark metrics (filesystem-backed; no external store)
  • telemetry.py — Solr-specific telemetry devices (JVM, node, collection, query, indexing, cache stats)
  • publisher.py — Publish and format benchmark results
  • result_writer.py — Write results to local filesystem (JSON/CSV)
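
As a rough illustration of writing results to the local filesystem as JSON and CSV, the sketch below uses only the standard library. The function name `write_results`, the file names, and the row layout are assumptions made for this example; they are not taken from result_writer.py.

```python
import csv
import json
from pathlib import Path


def write_results(results: list[dict], out_dir: Path, name: str = "results") -> tuple[Path, Path]:
    """Write the same rows as <name>.json and <name>.csv (hypothetical layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / f"{name}.json"
    csv_path = out_dir / f"{name}.csv"

    json_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    # The CSV header is the union of keys across rows, in first-seen order.
    fieldnames = list(dict.fromkeys(key for row in results for key in row))
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)  # keys missing from a row are written as ""
    return json_path, csv_path
```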

Data and connectivity:

  • client.py — SolrAdminClient and SolrClient (HTTP via requests/pysolr; Collections API, /select, /update)
  • synthetic_data_generator/ — Generate synthetic test datasets
  • workload_generator/ — Generate workload definition files from existing Solr collections
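
For orientation, the endpoints mentioned for client.py follow Solr's standard URL layout. The sketch below only builds URLs and sends no requests; the helper names are hypothetical and it does not reflect how SolrAdminClient or SolrClient are implemented.

```python
from urllib.parse import urlencode


def select_url(base_url: str, collection: str, q: str = "*:*", rows: int = 10) -> str:
    """Build a /select URL for a collection (hypothetical helper)."""
    query = urlencode({"q": q, "rows": rows})
    return f"{base_url.rstrip('/')}/solr/{collection}/select?{query}"


def create_collection_url(base_url: str, name: str, num_shards: int = 1, replication_factor: int = 1) -> str:
    """Build a Collections API CREATE URL (hypothetical helper)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url.rstrip('/')}/solr/admin/collections?{urlencode(params)}"


print(select_url("http://localhost:8983", "demo"))
print(create_collection_url("http://localhost:8983", "demo", num_shards=2))
```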

Workload conversion:

  • conversion/workload_converter.py — Convert an OpenSearch Benchmark workload directory to Solr format
  • conversion/detector.py — Detect whether a workload uses OpenSearch-only operations/query DSL
  • conversion/query.py — Translate OpenSearch Query DSL to Solr JSON Query DSL
  • conversion/schema.py — Translate OpenSearch index mappings to Solr managed-schema.xml
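
To make the query translation step concrete, here is a minimal sketch of mapping a few OpenSearch Query DSL clauses onto Solr's JSON Query DSL. It is not the logic in conversion/query.py: the function name `translate_query` is hypothetical, only `match_all`, `term`, `match`, and `bool` are handled, and `term` is approximated with the `lucene` parser without escaping special characters.

```python
from typing import Any


def translate_query(os_query: dict[str, Any]) -> Any:
    """Translate a small subset of OpenSearch Query DSL to Solr JSON Query DSL."""
    if len(os_query) != 1:
        raise ValueError("expected exactly one top-level query clause")
    kind, body = next(iter(os_query.items()))

    if kind == "match_all":
        return "*:*"

    if kind in ("term", "match"):
        field, value = next(iter(body.items()))
        # OpenSearch also allows the long forms {"value": ...} and {"query": ...}
        if isinstance(value, dict):
            value = value.get("value", value.get("query"))
        # Approximation: the query text is passed through unescaped
        return {"lucene": {"df": field, "query": str(value)}}

    if kind == "bool":
        solr_bool: dict[str, list[Any]] = {}
        for clause in ("must", "must_not", "should", "filter"):
            if clause in body:
                subs = body[clause]
                subs = subs if isinstance(subs, list) else [subs]
                solr_bool[clause] = [translate_query(q) for q in subs]
        return {"bool": solr_bool}

    raise NotImplementedError(f"unsupported query type: {kind}")
```

The result would be sent as the value of the "query" key in a JSON request body, for example `{"query": translate_query(...)}`.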

Utilities:

  • utils/ — IO, process management, console output, network, version parsing, options handling
  • cloud_provider/ — Cloud provider integrations (AWS via boto3, GCP via google-auth)
  • visualizations/ — Result visualization

Test Structure

  • tests/ — Unit tests mirroring solrorbit/ structure
  • it/ — Integration tests (spin up real Solr clusters via Docker/provisioning)
  • benchmarks/ — Performance benchmarks for Solr Orbit itself

Workload System

Workloads are defined as JSON/YAML files with:

  • Operations: individual actions (bulk indexing, search queries)
  • Test procedures: sequences of operations with parameters and schedules
  • Corpora: dataset files (compatible with OpenSearch Benchmark format)

Workloads must be in Solr format. Use solr-orbit convert-workload to convert from OpenSearch Benchmark format. Workloads can be loaded from a local path (--workload-path) or from a git workload repository (--workload-repository).
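
The two workload sources can be modeled as alternative CLI options. The sketch below uses argparse with the flag names given above; treating the flags as mutually exclusive and required is an assumption made for illustration, and `parse_workload_source` is a hypothetical name, not the real CLI code.

```python
import argparse


def parse_workload_source(argv: list[str]) -> tuple[str, str]:
    """Return ("path", value) or ("repository", value) from CLI arguments."""
    parser = argparse.ArgumentParser(prog="solr-orbit-sketch")
    # Assumption: exactly one of the two sources is given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--workload-path", help="local workload directory")
    group.add_argument("--workload-repository", help="git workload repository")
    args = parser.parse_args(argv)

    if args.workload_path is not None:
        return ("path", args.workload_path)
    return ("repository", args.workload_repository)


print(parse_workload_source(["--workload-path", "./my-workload"]))
```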

Pipeline Execution Flow

  1. Prepare — Load workload, configure metrics store
  2. Build (optional) — Download and provision Solr cluster
  3. Run — Execute test procedure via worker coordinator and drivers
  4. Publish — Store metrics, generate report
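
The four stages above, with Build being optional, can be sketched as an ordered list of callables. The stage bodies and the shared context dictionary are placeholders invented for this example; the real orchestration in test_run_orchestrator.py is actor-based and more involved.

```python
from typing import Callable

Stage = tuple[str, Callable[[dict], None]]


def run_pipeline(ctx: dict, build: bool = True) -> list[str]:
    """Run the stages in order and return the names of those executed."""

    def prepare(c: dict) -> None:
        c["workload"] = "loaded"      # placeholder: load workload, configure metrics store

    def build_cluster(c: dict) -> None:
        c["cluster"] = "provisioned"  # placeholder: download and provision Solr

    def run(c: dict) -> None:
        c["metrics"] = []             # placeholder: execute the test procedure

    def publish(c: dict) -> None:
        c["report"] = "written"       # placeholder: store metrics, generate report

    stages: list[Stage] = [("prepare", prepare)]
    if build:
        stages.append(("build", build_cluster))
    stages += [("run", run), ("publish", publish)]

    executed = []
    for name, stage in stages:
        stage(ctx)
        executed.append(name)
    return executed
```

Calling `run_pipeline({}, build=False)` skips the Build stage, which corresponds to benchmarking a cluster that is already running.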

Key Technologies

  • Python 3.12+ with pysolr (data ops), requests (HTTP admin), psutil (I/O metrics), thespian (actor model), pytest (tests), tabulate (console output)
  • Metrics store: local filesystem — JSON/CSV result files at ~/.solr-orbit/, SQLite test-runs store
  • Docs: Jekyll 4.x + just-the-docs gem in docs/; deployed to GitHub Pages via .github/workflows/docs.yml