By default, a workload runs the same fixed queries on every iteration. This produces unrealistically optimistic latency numbers because Solr's filter cache and query result cache will be warm after the first pass — subsequent identical queries hit the cache and complete much faster than they would in production.
Randomizing queries generates varied parameter values across iterations, so each run exercises a realistic mix of cache hits and misses.
Apache Solr Orbit uses a Zipf probability distribution to model realistic cache behavior:
rf, 0.0–1.0) controls the maximum fraction of queries that reuse stored values.With the default settings (rf=0.3, N=5000), 30% of queries reuse stored value pairs (likely cache hits) and 70% generate fresh random values (likely cache misses).
Randomization requires a workload.py file in your workload directory. This file registers functions that generate random parameter values.
import random def random_fare_range(max_value): gte_cents = random.randrange(0, max_value * 100) lte_cents = random.randrange(gte_cents, max_value * 100) return { "gte": gte_cents / 100, "lte": lte_cents / 100, } def fare_range_value_source(): return random_fare_range(120.00) def register(registry): registry.register_standard_value_source( "range", # query type "fare_amount", # field name fare_range_value_source, )
The register function is called once at startup. The register_standard_value_source call tells the benchmark: “when running a range query on the fare_amount field, use this function to generate parameter values.”
For queries that are not range queries, use register_query_randomization_info:
def register(registry): registry.register_query_randomization_info( "bbox", # operation name in the workload "geo_bounding_box", # Solr query type [["top_left"], ["bottom_right"]], # parameter variants [], # optional parameters )
| Flag | Default | Description |
|---|---|---|
--randomization-enabled | false | Activate query randomization |
--randomization-repeat-frequency | 0.3 | Fraction of queries that reuse stored value pairs (0.0–1.0) |
--randomization-n | 5000 | Number of value pairs to generate at startup |
--randomization-alpha | 1.0 | Zipf distribution alpha (≥ 0); higher values skew selection toward lower-indexed pairs |
solr-orbit run \ --workload nyc_taxis \ --pipeline benchmark-only \ --target-hosts localhost:8983 \ --randomization-enabled true \ --randomization-repeat-frequency 0.2 \ --randomization-n 10000
| rf value | Interpretation |
|---|---|
0.0 | Every query is unique — maximum cache miss rate |
0.3 | 30% reuse (default) — models typical mixed workloads |
1.0 | All queries reuse stored pairs — maximum cache hit rate |
Set rf to match your production cache hit ratio if you know it. If you don't know it, the default of 0.3 is a reasonable starting point.
The probability of selecting value pair i from the stored list follows the Zipf distribution: P(i) ∝ 1/i^α. This means:
alpha=1.0 (default) gives the standard Zipf distributionalpha increases the skew (more of the probability mass on the first few pairs)alpha=0.0 makes all stored pairs equally likely