 ---
 title: Randomizing Queries
 parent: Optimizing Benchmarks
 grand_parent: User Guide
 nav_order: 40
 ---

 # Randomizing Queries

 By default, a workload runs the same fixed queries on every iteration. The resulting latency numbers are unrealistically optimistic: after the first pass, Solr's filter cache and query result cache are warm, so subsequent identical queries are served from cache and complete much faster than they would in production.

 Randomizing queries generates varied parameter values across iterations, so each run exercises a realistic mix of cache hits and misses.

 ## How it works

 Apache Solr Orbit uses a **Zipf probability distribution** to model realistic cache behavior:

 1. At benchmark startup, N value pairs are generated and stored in an indexed list.
 2. For each operation, the benchmark probabilistically decides whether to reuse a stored pair (cache hit scenario) or generate a new random pair (cache miss scenario).
 3. The **repeat frequency** (`rf`, 0.0–1.0) sets the fraction of queries that reuse stored values.

 With the default settings (`rf=0.3`, `N=5000`), about 30% of queries reuse stored value pairs (likely cache hits) and about 70% generate fresh random values (likely cache misses).
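The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Solr Orbit's implementation: `make_pair_selector` and its parameters are hypothetical names, and the sketch assumes `rf` is applied as a per-query probability and that stored pairs are picked with Zipf weights `1 / i**alpha`.

```python
import itertools
import random


def make_pair_selector(generate_pair, n=5000, rf=0.3, alpha=1.0, rng=None):
    """Return a function that picks the value pair for each query (illustrative only)."""
    rng = rng or random.Random()
    # Step 1: generate N pairs once, at startup, and keep them in an indexed list.
    stored = [generate_pair() for _ in range(n)]
    # Zipf weights: the pair at 1-based position i has weight 1 / i**alpha.
    cum_weights = list(itertools.accumulate(1.0 / i ** alpha for i in range(1, n + 1)))

    def next_pair():
        # Step 2: with probability rf, reuse a stored pair (likely cache hit).
        if rng.random() < rf:
            return rng.choices(stored, cum_weights=cum_weights, k=1)[0]
        # Otherwise generate a fresh pair (likely cache miss).
        return generate_pair()

    return next_pair


# Demo: number the pairs so stored ones (0..n-1) can be told apart from fresh ones.
ids = itertools.count()
select = make_pair_selector(lambda: next(ids), n=100, rf=0.3, rng=random.Random(7))
picks = [select() for _ in range(10_000)]
reused = sum(1 for p in picks if p < 100)
print(f"reused {reused / len(picks):.1%} of queries")
```

Over many queries the reused share settles close to `rf`. Which stored pairs are reused most often depends on `alpha`.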

 ## Implementing randomized queries in a workload

 Randomization requires a `workload.py` file in your workload directory. This file registers functions that generate random parameter values.

 ### Example: randomizing range query parameters

 ```python
 import random

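 def random_fare_range(max_value):
     # randrange() requires integer bounds, so convert the dollar limit to whole cents.
     max_cents = int(round(max_value * 100))
     gte_cents = random.randrange(0, max_cents)
     # The upper bound is drawn from [gte_cents, max_cents), so "lte" is never below "gte".
     lte_cents = random.randrange(gte_cents, max_cents)
     return {
         "gte": gte_cents / 100,
         "lte": lte_cents / 100,
     }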

 def fare_range_value_source():
     return random_fare_range(120.00)

 def register(registry):
     registry.register_standard_value_source(
         "range",           # query type
         "fare_amount",     # field name
         fare_range_value_source,
     )
 ```

 The `register` function is called once at startup. The `register_standard_value_source` call tells the benchmark: "when running a `range` query on the `fare_amount` field, use this function to generate parameter values."
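Before running a benchmark, it can help to check that a value source always returns a valid range. The sketch below is one way to do that outside Solr Orbit. `RecordingRegistry` and `check_value_source` are hypothetical helpers written for this illustration, not part of the tool. The value source is repeated from the example above so the block runs on its own, with the cent bound converted to an integer because `random.randrange` rejects float arguments on recent Python versions.

```python
import random


def random_fare_range(max_value):
    max_cents = int(round(max_value * 100))  # integer bound for randrange()
    gte_cents = random.randrange(0, max_cents)
    lte_cents = random.randrange(gte_cents, max_cents)
    return {"gte": gte_cents / 100, "lte": lte_cents / 100}


def fare_range_value_source():
    return random_fare_range(120.00)


def register(registry):
    registry.register_standard_value_source("range", "fare_amount", fare_range_value_source)


class RecordingRegistry:
    """Hypothetical stand-in for the registry object passed to register()."""

    def __init__(self):
        self.sources = {}

    def register_standard_value_source(self, query_type, field_name, source):
        # Remember which function was registered for each (query type, field) pair.
        self.sources[(query_type, field_name)] = source


def check_value_source(source, upper_bound, samples=1000):
    """Call the source repeatedly and verify every result is a well-formed range."""
    for _ in range(samples):
        values = source()
        assert set(values) == {"gte", "lte"}
        assert 0 <= values["gte"] <= values["lte"] < upper_bound
    return True


registry = RecordingRegistry()
register(registry)
source = registry.sources[("range", "fare_amount")]
print(check_value_source(source, upper_bound=120.00))
```

A check like this catches inverted ranges or type errors in the generator before they surface as failed queries during a run.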

 ### Example: randomizing non-range queries

 For queries that are not range queries, use `register_query_randomization_info`:

 ```python
 def register(registry):
     registry.register_query_randomization_info(
         "bbox",                # operation name in the workload
         "geo_bounding_box",    # Solr query type
         [["top_left"], ["bottom_right"]],  # parameter variants
         [],                    # optional parameters
     )
 ```

 ## CLI flags

 | Flag | Default | Description |
 |------|---------|-------------|
 | `--randomization-enabled` | `false` | Activate query randomization |
 | `--randomization-repeat-frequency` | `0.3` | Fraction of queries that reuse stored value pairs (0.0–1.0) |
 | `--randomization-n` | `5000` | Number of value pairs to generate at startup |
 | `--randomization-alpha` | `1.0` | Zipf distribution alpha (≥ 0); higher values skew selection toward lower-indexed pairs |

 ## Enabling randomization at runtime

 ```bash
 solr-orbit run \
   --workload nyc_taxis \
   --pipeline benchmark-only \
   --target-hosts localhost:8983 \
   --randomization-enabled true \
   --randomization-repeat-frequency 0.2 \
   --randomization-n 10000
 ```

 ## Choosing the right repeat frequency

 | rf value | Interpretation |
 |----------|---------------|
 | `0.0` | Every query is unique — maximum cache miss rate |
 | `0.3` | 30% reuse (default) — models typical mixed workloads |
 | `1.0` | All queries reuse stored pairs — maximum cache hit rate |

 Set `rf` to match your production cache hit ratio if you know it. If you don't know it, the default of `0.3` is a reasonable starting point.

 ## The Zipf distribution

 The probability of selecting value pair *i* from the stored list follows the Zipf distribution: *P(i) ∝ 1/i^α*. This means:

 - The first stored pair is selected most frequently
 - Frequency drops off sharply for higher-indexed pairs
 - `alpha=1.0` (default) gives the standard Zipf distribution
 - Higher `alpha` increases the skew (more of the probability mass on the first few pairs)
 - `alpha=0.0` makes all stored pairs equally likely
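To get a feel for how `alpha` shapes reuse, the proportionality above can be normalized into actual probabilities. The sketch below does that in plain Python. `zipf_probabilities` and `head_mass` are illustrative helper names, not Solr Orbit functions, and the computation assumes 1-based indexing of the stored list.

```python
def zipf_probabilities(n, alpha):
    """Return P(1)..P(n), where P(i) is proportional to 1 / i**alpha."""
    weights = [1.0 / i ** alpha for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def head_mass(n, alpha, k):
    """Share of selections that land on the first k stored pairs."""
    return sum(zipf_probabilities(n, alpha)[:k])


# Compare how concentrated reuse is for a few alpha values with N=5000.
for alpha in (0.0, 1.0, 2.0):
    share = head_mass(5000, alpha, 10)
    print(f"alpha={alpha}: first 10 of 5000 pairs receive {share:.1%} of reuses")
```

With `alpha=0.0` every weight is 1, so each pair is equally likely. As `alpha` grows, a larger share of reuses concentrates on the first few pairs, which means fewer distinct queries end up cached.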

 ## See also

 - [Fine-tuning workloads](../working-with-workloads/finetune-workloads.html)
 - [Creating custom workloads](../working-with-workloads/creating-custom-workloads.html)