 # Benchmark Guide

 ## Generating Benchmark Data

 Use the CLI to generate a comprehensive benchmark suite:

 ```bash
 otava-gen generate --output-dir ./benchmark --lengths 50 500 --seed 42
 ```

 This creates:
 - CSV files for each test case
 - `manifest.json` with metadata about each file
 - `summary.json` with overall statistics

 ## Running Otava

 ```bash
 # Example Otava invocation (adjust based on Otava's actual CLI)
 otava analyze --input ./benchmark/0001_step_function_L500.csv
 ```

 ## Comparing Algorithms

 The `manifest.json` file contains the ground truth for each test case:

 ```python
 import json

 with open("benchmark/manifest.json") as f:
     manifest = json.load(f)

 for entry in manifest:
     print(f"{entry['filename']}: {entry['n_change_points']} change points")
     print(f"  Expected indices: {entry['change_point_indices']}")
 ```
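To compare an algorithm's output against this ground truth, it can help to index the manifest by filename. The sketch below assumes the manifest is a list of entries with the fields used above; the helper name `ground_truth_by_file` and the sample values are hypothetical and only illustrate the shape of the data.

```python
def ground_truth_by_file(manifest):
    """Map each filename to its sorted list of expected change-point indices."""
    return {
        entry["filename"]: sorted(entry["change_point_indices"])
        for entry in manifest
    }


# Illustrative stand-in for the result of json.load(); the index value is made up.
manifest = [
    {
        "filename": "0001_step_function_L500.csv",
        "n_change_points": 1,
        "change_point_indices": [250],
    },
]

truth = ground_truth_by_file(manifest)
print(truth["0001_step_function_L500.csv"])
```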

 ## Metrics

 When comparing algorithms, consider:

 1. **True Positive Rate**: Percentage of actual change points that are detected
 2. **False Positive Rate**: Percentage of non-change points incorrectly flagged as change points
 3. **Location Accuracy**: How close each detected point is to the actual change point
 4. **Latency**: Number of data points between a change and its detection
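
One possible way to compute these four metrics for a single test case is sketched below. It is not part of this project: the function name `score_detection`, the `tolerance` window used to decide whether a detection counts as a hit, and the greedy nearest-match rule are all assumptions chosen for illustration. Latency is taken here as the mean offset of matched detections that fall at or after the actual change point.

```python
from statistics import mean


def score_detection(actual, detected, n_points, tolerance=5):
    """Score one test case; indices are positions in a series of n_points values."""
    unmatched = sorted(detected)
    offsets = []  # signed offset (detected - actual) for each matched change point
    for cp in sorted(actual):
        # Candidates are still-unmatched detections within the tolerance window.
        candidates = [d for d in unmatched if abs(d - cp) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda d: abs(d - cp))
            unmatched.remove(best)
            offsets.append(best - cp)

    n_negatives = n_points - len(actual)  # points that are not change points
    late = [o for o in offsets if o >= 0]  # detections at or after the change
    return {
        "true_positive_rate": len(offsets) / len(actual) if actual else None,
        "false_positive_rate": len(unmatched) / n_negatives if n_negatives else None,
        "mean_abs_location_error": mean(abs(o) for o in offsets) if offsets else None,
        "mean_latency": mean(late) if late else None,
    }


# Made-up example: two real change points, three detections, one of them spurious.
result = score_detection(actual=[100, 300], detected=[102, 250, 299], n_points=500)
print(result)
```

A wider `tolerance` raises the true positive rate at the cost of a looser location requirement, so the same value should be used for every algorithm being compared.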