# AGENTS.md
This file provides guidance to AI coding agents working with this repository.
## Development Setup
Prerequisites: `pyenv`, JDK 21, Docker, `docker-compose`, `jq`
Optional: `pbzip2` (parallel bzip2; install via `apt install pbzip2` or `brew install pbzip2`).
Without it, `.bz2` corpus decompression falls back to the Python standard library, which is slower.
```bash
make develop # Install Python 3.12 via pyenv, create .venv, install all deps
source .venv/bin/activate # Activate virtual environment
```
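The `pbzip2` fallback described above can be pictured roughly as follows. This is a minimal sketch, not the repository's actual code: the function name `decompress_bz2` and its return value are assumptions made for illustration.

```python
import bz2
import shutil
import subprocess
from pathlib import Path


def decompress_bz2(src: Path, dest: Path) -> str:
    """Decompress src (.bz2) into dest and return the method that was used.

    Hypothetical helper for illustration; not part of Solr Orbit's API.
    """
    pbzip2 = shutil.which("pbzip2")
    if pbzip2:
        # -d decompresses, -c writes to stdout, so the archive stays in place.
        with open(dest, "wb") as out:
            subprocess.run([pbzip2, "-d", "-c", str(src)], stdout=out, check=True)
        return "pbzip2"
    # Fallback: single-threaded stdlib decompression, streamed in chunks.
    with bz2.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return "bz2"
```

Both branches produce the same output file; only the speed differs on large corpora.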
## Common Commands
```bash
make lint # Run ruff check on all Python source files
make test # Run unit tests (pytest tests/)
pytest tests/path/to/test_file.py::TestClass::test_method # Run a single test
make it # Run integration tests via tox (requires Java, Docker; ~30 min)
make it312 # Integration tests for Python 3.12 only
make benchmark # Run performance benchmarks (pytest benchmarks/)
make build # Build distribution wheel
make clean # Remove build artifacts, caches, tox environments
```
## Code Style
- **Linter**: [ruff](https://docs.astral.sh/ruff/) (configured in `pyproject.toml` under `[tool.ruff]`)
- **Max line length**: 180 characters
- Run `make lint` before committing; CI enforces this on every PR
## Architecture
Apache Solr Orbit is a **macrobenchmarking framework** for Apache Solr clusters, using an **actor-based concurrent execution model** via the [Thespian](https://thespianpy.com/) library.
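To illustrate the actor idea in general terms, here is a toy actor built only on the Python standard library. It deliberately does not use Thespian's API, and the names `CounterActor`, `tell`, and `demo` are hypothetical; it only shows the core pattern of private state plus a mailbox that is processed one message at a time.

```python
import queue
import threading


class CounterActor:
    """Toy actor: owns private state and handles one message at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message, reply_to=None):
        """Enqueue a message; the caller never touches actor state directly."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)


def demo() -> int:
    actor = CounterActor()
    for _ in range(3):
        actor.tell("increment")
    reply = queue.Queue()
    actor.tell("get", reply_to=reply)
    result = reply.get(timeout=5)  # mailbox is FIFO, so all increments ran first
    actor.tell("stop")
    return result
```

Because all state changes go through the mailbox, no locks are needed around `_count`; that property is what makes the model attractive for coordinating many concurrent load drivers.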
### Entry Points
- `solr-orbit` / `sb` (`solrorbit/benchmark.py:main`): CLI for running benchmarks
- `solr-orbitd` / `sbd` (`solrorbit/benchmarkd.py:main`): Daemon for distributed worker nodes
### Core Package (`solrorbit/`)
**Orchestration layer:**
- `benchmark.py`: CLI argument parsing and subcommands (`run`, `list`, `info`, `generate`, `convert-workload`)
- `test_run_orchestrator.py`: Pipeline execution (prepares the run, launches the cluster, runs the workload, publishes results)
- `actor.py`: Thespian actor system setup for parallel/distributed execution
- `config.py`: Configuration loading and management
**Cluster management (`builder/`):**
- `solr_provisioner.py`: Download, install, and launch Solr (from distribution, sources, or Docker)
- `provisioners/`: Generic node provisioning infrastructure
- `downloaders/`: Download Solr distributions
- `installers/`: Install Solr on provisioned nodes
- `launchers/`: Start/stop cluster nodes
- `executors/`: Execute remote commands on cluster nodes
- `configs/`: Jinja2 templates for cluster configuration
**Benchmark execution:**
- `workload/`: Load and manage workload definitions (test procedures, operations, schedules)
- `worker_coordinator/`: Coordinate distributed worker nodes; `driver.py` drives the actual load
- `worker_coordinator/runner.py`: Solr operation runners (`SolrBulkIndex`, `SolrSearch`, `SolrCreateCollection`, etc.)
- `metrics.py`: Collect, store, and aggregate benchmark metrics (filesystem-backed; no external store)
- `telemetry.py`: Solr-specific telemetry devices (JVM, node, collection, query, indexing, cache stats)
- `publisher.py`: Publish and format benchmark results
- `result_writer.py`: Write results to local filesystem (JSON/CSV)
**Data and connectivity:**
- `client.py`: `SolrAdminClient` and `SolrClient` (HTTP via `requests`/`pysolr`; Collections API, `/select`, `/update`)
- `synthetic_data_generator/`: Generate synthetic test datasets
- `workload_generator/`: Generate workload definition files from existing Solr collections
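For orientation, the Solr endpoints that `client.py` is described as talking to look roughly like the URLs built below. This is a sketch that only constructs URLs with the standard library; the helper names `create_collection_url` and `select_url` are hypothetical and are not the methods of `SolrAdminClient` or `SolrClient`. The base URL is assumed to end in `/solr`.

```python
from urllib.parse import urlencode


def create_collection_url(base_url: str, name: str, num_shards: int = 1,
                          replication_factor: int = 1) -> str:
    """Build a Collections API CREATE request URL (hypothetical helper)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


def select_url(base_url: str, collection: str, query: str = "*:*", rows: int = 10) -> str:
    """Build a /select search URL for one collection (hypothetical helper)."""
    params = {"q": query, "rows": rows}
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"
```

For example, `select_url("http://localhost:8983/solr", "docs")` targets the `/select` handler of a collection named `docs` on a local node.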
**Workload conversion:**
- `conversion/workload_converter.py`: Convert an OpenSearch Benchmark workload directory to Solr format
- `conversion/detector.py`: Detect whether a workload uses OpenSearch-only operations/query DSL
- `conversion/query.py`: Translate OpenSearch Query DSL to Solr JSON Query DSL
- `conversion/schema.py`: Translate OpenSearch index mappings to Solr `managed-schema.xml`
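To give a feel for what a query translation involves, the sketch below maps a small subset of OpenSearch Query DSL (`match_all`, `match`, `bool`) onto Solr's JSON Query DSL. It is an illustration under stated assumptions, not the logic in `conversion/query.py`: the function name `translate_query` is hypothetical, and escaping of Lucene special characters is not handled.

```python
def translate_query(os_query: dict):
    """Translate a tiny subset of OpenSearch Query DSL to Solr JSON Query DSL.

    Hypothetical sketch; the real converter may cover far more query types.
    """
    if len(os_query) != 1:
        raise ValueError("expected exactly one top-level query clause")
    kind, body = next(iter(os_query.items()))

    if kind == "match_all":
        return "*:*"
    if kind == "match":
        field, value = next(iter(body.items()))
        # OpenSearch accepts {"field": "text"} or {"field": {"query": "text"}}.
        text = value["query"] if isinstance(value, dict) else value
        # Note: special characters in the text are passed through unescaped.
        return {"lucene": {"df": field, "query": str(text)}}
    if kind == "bool":
        solr_bool = {}
        for occur in ("must", "must_not", "should", "filter"):
            clauses = body.get(occur)
            if clauses is None:
                continue
            if isinstance(clauses, dict):  # a single clause may be given without a list
                clauses = [clauses]
            solr_bool[occur] = [translate_query(clause) for clause in clauses]
        return {"bool": solr_bool}
    raise NotImplementedError(f"unsupported query type: {kind}")
```

The returned value would go under the `query` key of a Solr JSON request body. Unsupported query types raise instead of being silently dropped, so a converted workload never runs a different query than the original intended.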
**Utilities:**
- `utils/`: IO, process management, console output, network, version parsing, options handling
- `cloud_provider/`: Cloud provider integrations (AWS via boto3, GCP via google-auth)
- `visualizations/`: Result visualization
### Test Structure
- `tests/`: Unit tests mirroring `solrorbit/` structure
- `it/`: Integration tests (spin up real Solr clusters via Docker/provisioning)
- `benchmarks/`: Performance benchmarks for Solr Orbit itself
### Workload System
Workloads are defined as JSON/YAML files with:
- **Operations**: individual actions (bulk indexing, search queries)
- **Test procedures**: sequences of operations with parameters and schedules
- **Corpora**: dataset files (compatible with OpenSearch Benchmark format)
Workloads must be in Solr format. Use `solr-orbit convert-workload` to convert from
OpenSearch Benchmark format. Workloads can be loaded from a local path (`--workload-path`)
or from a git workload repository (`--workload-repository`).
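The relationship between the three parts can be sketched as a plain Python structure. Every key name below (`corpora`, `operations`, `test_procedures`, `schedule`, `type`, `files`) is an assumption chosen for illustration and may not match the real workload schema; the point is only that test procedures refer to operations by name.

```python
# Hypothetical workload definition: all key names are illustrative,
# not the documented Solr Orbit schema.
EXAMPLE_WORKLOAD = {
    "corpora": [{"name": "docs", "files": ["docs.json.bz2"]}],
    "operations": [
        {"name": "index-docs", "type": "bulk-index"},
        {"name": "match-all", "type": "search"},
    ],
    "test_procedures": [
        {"name": "default", "schedule": ["index-docs", "match-all"]},
    ],
}


def undefined_operations(workload: dict) -> list[str]:
    """Return operation names used in a schedule but never defined."""
    defined = {op["name"] for op in workload.get("operations", [])}
    missing = []
    for procedure in workload.get("test_procedures", []):
        for op_name in procedure.get("schedule", []):
            if op_name not in defined and op_name not in missing:
                missing.append(op_name)
    return missing
```

A check like `undefined_operations` is the kind of validation that catches a schedule referencing a misspelled operation before any load is generated.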
### Pipeline Execution Flow
1. **Prepare**: Load workload, configure metrics store
2. **Build** (optional): Download and provision Solr cluster
3. **Run**: Execute test procedure via worker coordinator and drivers
4. **Publish**: Store metrics, generate report
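The four stages above can be sketched as a simple ordered loop in which only the build stage is skippable. The function `run_pipeline` and the `provision_cluster` flag are hypothetical names for illustration; the actual orchestration in `test_run_orchestrator.py` may be structured differently.

```python
from typing import Callable

STAGE_ORDER = ("prepare", "build", "run", "publish")


def run_pipeline(stages: dict[str, Callable[[], None]], provision_cluster: bool) -> list[str]:
    """Run the pipeline stages in order and return the names that executed.

    Hypothetical sketch: 'build' is skipped when no cluster needs provisioning.
    """
    executed = []
    for name in STAGE_ORDER:
        if name == "build" and not provision_cluster:
            continue  # optional stage: nothing to download or provision
        stages[name]()  # a failing stage raises and stops the pipeline
        executed.append(name)
    return executed
```

Usage, with no-op stages standing in for real work:

```python
noop = {name: (lambda: None) for name in STAGE_ORDER}
run_pipeline(noop, provision_cluster=False)
```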
## Key Technologies
- **Python 3.12+** with `pysolr` (data ops), `requests` (HTTP admin), `psutil` (I/O metrics), `thespian` (actor model), `pytest` (tests), `tabulate` (console output)
- **Metrics store**: local filesystem JSON/CSV result files under `~/.solr-orbit/`, plus a SQLite test-runs store
- **Docs**: Jekyll 4.x + just-the-docs gem in `docs/`; deployed to GitHub Pages via `.github/workflows/docs.yml`