blob: 5405ecf81f2c8171128a9e312bb36209d80f47ce [file] [view]
---
title: Randomizing Queries
parent: Optimizing Benchmarks
grand_parent: User Guide
nav_order: 40
---
# Randomizing Queries
By default, a workload runs the same fixed queries on every iteration. This produces unrealistically optimistic latency numbers because Solr's filter cache and query result cache will be warm after the first pass — subsequent identical queries hit the cache and complete much faster than they would in production.
Randomizing queries generates varied parameter values across iterations, so each run exercises a realistic mix of cache hits and misses.
## How it works
Apache Solr Orbit uses a **Zipf probability distribution** to model realistic cache behavior:
1. At benchmark startup, N value pairs are generated and stored in an indexed list.
2. For each operation, the benchmark probabilistically decides whether to reuse a stored pair (cache hit scenario) or generate a new random pair (cache miss scenario).
3. The **repeat frequency** (`rf`, 0.0–1.0) controls the maximum fraction of queries that reuse stored values.
With the default settings (`rf=0.3`, `N=5000`), 30% of queries reuse stored value pairs (likely cache hits) and 70% generate fresh random values (likely cache misses).
## Implementing randomized queries in a workload
Randomization requires a `workload.py` file in your workload directory. This file registers functions that generate random parameter values.
### Example: randomizing range query parameters
```python
import random
def random_fare_range(max_value):
gte_cents = random.randrange(0, max_value * 100)
lte_cents = random.randrange(gte_cents, max_value * 100)
return {
"gte": gte_cents / 100,
"lte": lte_cents / 100,
}
def fare_range_value_source():
return random_fare_range(120.00)
def register(registry):
registry.register_standard_value_source(
"range", # query type
"fare_amount", # field name
fare_range_value_source,
)
```
The `register` function is called once at startup. The `register_standard_value_source` call tells the benchmark: "when running a `range` query on the `fare_amount` field, use this function to generate parameter values."
### Example: randomizing non-range queries
For queries that are not range queries, use `register_query_randomization_info`:
```python
def register(registry):
registry.register_query_randomization_info(
"bbox", # operation name in the workload
"geo_bounding_box", # Solr query type
[["top_left"], ["bottom_right"]], # parameter variants
[], # optional parameters
)
```
## CLI flags
| Flag | Default | Description |
|------|---------|-------------|
| `--randomization-enabled` | `false` | Activate query randomization |
| `--randomization-repeat-frequency` | `0.3` | Fraction of queries that reuse stored value pairs (0.0–1.0) |
| `--randomization-n` | `5000` | Number of value pairs to generate at startup |
| `--randomization-alpha` | `1.0` | Zipf distribution alpha (≥ 0); higher values skew selection toward lower-indexed pairs |
## Enabling randomization at runtime
```bash
solr-orbit run \
--workload nyc_taxis \
--pipeline benchmark-only \
--target-hosts localhost:8983 \
--randomization-enabled true \
--randomization-repeat-frequency 0.2 \
--randomization-n 10000
```
## Choosing the right repeat frequency
| rf value | Interpretation |
|----------|---------------|
| `0.0` | Every query is unique — maximum cache miss rate |
| `0.3` | 30% reuse (default) — models typical mixed workloads |
| `1.0` | All queries reuse stored pairs — maximum cache hit rate |
Set `rf` to match your production cache hit ratio if you know it. If you don't know it, the default of `0.3` is a reasonable starting point.
## The Zipf distribution
The probability of selecting value pair *i* from the stored list follows the Zipf distribution: *P(i) ∝ 1/i^α*. This means:
- The first stored pair is selected most frequently
- Frequency drops off sharply for higher-indexed pairs
- `alpha=1.0` (default) gives the standard Zipf distribution
- Higher `alpha` increases the skew (more of the probability mass on the first few pairs)
- `alpha=0.0` makes all stored pairs equally likely
## See also
- [Fine-tuning workloads](../working-with-workloads/finetune-workloads.html)
- [Creating custom workloads](../working-with-workloads/creating-custom-workloads.html)