Design document for testing, evaluating, and operating the ASVS audit pipeline at scale across hundreds of ASF projects.
In the LLM context, an eval is a repeatable measurement of output quality against known-good answers. For a security audit pipeline, that means checking that the pipeline finds the vulnerabilities we know are present, avoids flagging things that aren't problems, and correctly marks inapplicable sections as N/A.
This is different from traditional unit testing. LLM outputs are non-deterministic — the same input can produce different (but equivalent) findings across runs. Evals need to measure semantic correctness, not string equality.
```
eval/
├── fixtures/                     # Known codebases with expected results
│   ├── webgoat-minimal/          # Small app with intentional vulns
│   │   ├── src/                  # Source code
│   │   ├── expected.json         # Expected findings (section → severity → count)
│   │   └── false_positives.json  # Known FP patterns to check for absence
│   ├── secure-app/               # Clean app, should produce few/no findings
│   │   ├── src/
│   │   └── expected.json
│   ├── library-only/             # Pure library, many N/A sections
│   │   ├── src/
│   │   └── expected.json
│   └── edge-cases/
│       ├── empty-repo/
│       ├── binary-only/
│       ├── single-file/
│       └── huge-repo/            # 10k+ files, tests path prefix scoping
├── harness.py                    # Eval runner
├── judge.py                      # LLM-as-judge for semantic comparison
├── metrics.py                    # Scoring functions
├── report.py                     # Eval report generator
└── README.md
```
A fixture is a small, stable codebase with documented security properties. Each fixture needs source code under `src/` plus an `expected.json` manifest recording what the pipeline should (and should not) find.
Vulnerability fixtures — intentionally insecure code targeting specific ASVS sections:
{ "name": "webgoat-minimal", "type": "web_app", "framework": "flask", "expected_findings": { "1.2.1": {"min_severity": "HIGH", "min_count": 1, "category": "XSS"}, "6.2.1": {"min_severity": "MEDIUM", "min_count": 1, "category": "weak_password"}, "3.4.1": {"min_severity": "HIGH", "min_count": 1, "category": "insecure_cookie"} }, "expected_na": ["9.1.1", "9.1.2"], "false_positive_patterns": [ {"section": "1.3.1", "pattern": "CSRF on login form", "reason": "Login forms don't need CSRF"} ] }
Clean fixtures — secure code that should produce minimal findings:
{ "name": "secure-app", "type": "web_app", "max_high_findings": 2, "max_critical_findings": 0, "notes": "If pipeline finds critical issues here, it's probably hallucinating" }
Edge case fixtures — test robustness, not finding quality:
{ "name": "binary-only", "type": "edge_case", "expected_behavior": "graceful_na", "should_not_crash": true, "expected_reports": 0 }
The best fixtures come from real audits where we've manually verified findings:
`ASVS/reports/tooling-trusted-releases/da901ba/`, including the consolidated report, issues, and triage notes.

Finding quality metrics:

| Metric | How to Measure | Target |
|---|---|---|
| Finding recall | Known vulns found / known vulns in fixture | > 80% |
| False positive rate | FP findings / total findings | < 20% |
| N/A accuracy | Correctly identified N/A sections / total N/A sections | > 90% |
| Severity accuracy | Findings with correct severity / findings with any severity | > 70% |
| Report completeness | Reports with all required sections (summary, findings, remediation) | 100% |
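A sketch of how `metrics.py` might compute the first three rows. The shapes of `fixture` and `results`, and the `matches_fp_pattern` helper, are assumptions for illustration, not the pipeline's real interfaces:

```python
# metrics.py (sketch) -- fixture/results shapes are assumptions.
from dataclasses import dataclass

@dataclass
class Scores:
    recall: float
    false_positive_rate: float
    na_accuracy: float

def score_results(fixture, results) -> Scores:
    """Score pipeline output against a fixture's expected results."""
    expected = fixture.expected_findings              # section -> expectation
    found_sections = {f.section for f in results.findings}

    # Finding recall: known vulns found / known vulns in fixture.
    hits = sum(1 for section in expected if section in found_sections)
    recall = hits / len(expected) if expected else 1.0

    # False positive rate: findings matching a documented FP pattern
    # (matches_fp_pattern is a hypothetical helper).
    fps = sum(1 for f in results.findings if matches_fp_pattern(f, fixture))
    fp_rate = fps / len(results.findings) if results.findings else 0.0

    # N/A accuracy: correctly identified N/A sections / total N/A sections.
    expected_na = set(fixture.expected_na)
    na_accuracy = (
        len(expected_na & set(results.na_sections)) / len(expected_na)
        if expected_na else 1.0
    )

    return Scores(recall, fp_rate, na_accuracy)
```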
Operational metrics:

| Metric | How to Measure | Target |
|---|---|---|
| Completion rate | Sections with reports / total sections attempted | > 98% |
| Extraction success | Reports successfully extracted into consolidation | > 95% |
| Consolidation dedup rate | Unique findings / raw findings before dedup | 40-70% |
| Cost per section | Average token cost per ASVS section audit | Track trend |
| Time per section | Average wall clock time per section | Track trend |
For semantic comparison (did the pipeline find the same vulnerability, even if described differently), use an LLM judge:
```python
JUDGE_PROMPT = """Compare these two security findings and determine if they describe the same vulnerability.

Expected finding: {expected}
Actual finding: {actual}

Respond with JSON: {"match": true/false, "confidence": 0.0-1.0, "reason": "..."}
"""
```
This handles the non-determinism problem — the pipeline might describe a finding differently across runs, but the judge can determine if they're semantically equivalent.
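A minimal `judge.py` sketch, assuming litellm as the model client (it already appears in the pipeline's error taxonomy below) and a placeholder judge model. Note the prompt's example JSON contains literal braces, so plain `str.replace` substitution is safer than `str.format`:

```python
# judge.py (sketch) -- JUDGE_MODEL is a placeholder; pick any capable model.
import json
import litellm

JUDGE_MODEL = "gpt-4o-mini"

async def judge_match(expected: str, actual: str) -> dict:
    """Ask the judge model whether two findings describe the same vulnerability."""
    # str.replace instead of str.format: the prompt's example JSON
    # ({"match": ...}) contains literal braces that format() would choke on.
    prompt = (JUDGE_PROMPT
              .replace("{expected}", expected)
              .replace("{actual}", actual))
    response = await litellm.acompletion(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    # The prompt instructs the judge to reply with a single JSON object.
    return json.loads(response.choices[0].message.content)
```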
```python
# harness.py (sketch)
async def run_eval(fixture_path: str, pipeline_config: dict) -> EvalResult:
    """Run a single fixture through the pipeline and score results."""
    fixture = load_fixture(fixture_path)

    # 1. Load fixture source into data store
    namespace = f"eval:{fixture.name}"
    load_fixture_code(namespace, fixture.src_path)

    # 2. Run pipeline (discovery + audit, no GitHub push)
    results = await run_pipeline_local(
        namespace=namespace,
        level=fixture.level or "L1",
        sections=fixture.target_sections,  # or all if not specified
    )

    # 3. Score results
    scores = score_results(fixture, results)

    # 4. Generate report
    return EvalResult(
        fixture=fixture.name,
        scores=scores,
        findings=results.findings,
        duration=results.duration,
        cost=results.token_cost,
    )


async def run_eval_suite(suite_path: str, pipeline_config: dict) -> EvalSuiteResult:
    """Run all fixtures and produce aggregate scores."""
    results = []
    for fixture_path in discover_fixtures(suite_path):
        result = await run_eval(fixture_path, pipeline_config)
        results.append(result)
        print(
            f"  {result.fixture}: recall={result.scores.recall:.0%} "
            f"precision={result.scores.precision:.0%} "
            f"FP={result.scores.false_positive_rate:.0%}"
        )
    return aggregate(results)
```
```bash
# Run full eval suite
python eval/harness.py eval/fixtures/

# Run single fixture
python eval/harness.py eval/fixtures/webgoat-minimal/

# Compare two pipeline versions
python eval/harness.py eval/fixtures/ --baseline results/v1.json --output results/v2.json
python eval/report.py results/v1.json results/v2.json
```
After any agent change (prompt update, model switch, parameter tweak), run the eval suite and compare:
```
Pipeline v1 → v2 Comparison
============================

                        v1      v2       Δ
Finding recall          82%     85%     +3%     ✅
False positive rate     18%     12%     -6%     ✅
N/A accuracy            91%     93%     +2%     ✅
Completion rate         98.5%   99.1%   +0.6%   ✅
Extraction success      96%     98%     +2%     ✅
Cost per section        $0.42   $0.38   -$0.04  ✅

Regressions:
  (none)

New findings in v2 not in v1:
  webgoat-minimal 3.4.1: Found cookie without Secure flag (HIGH) ← NEW TP

Findings in v1 lost in v2:
  (none)
```
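The ✅/❌ verdicts can be computed mechanically. A sketch of the comparison step in `report.py`, assuming each results file stores aggregate metrics as a flat JSON object keyed by metric name (an assumption, not the harness's actual output format):

```python
# report.py comparison sketch -- the flat-JSON results layout is assumed.
import json

# Metrics where a decrease is an improvement.
LOWER_IS_BETTER = {"false_positive_rate", "cost_per_section"}

def compare(baseline_path: str, candidate_path: str) -> list[str]:
    """Print per-metric deltas and return the names of regressed metrics."""
    with open(baseline_path) as f:
        v1 = json.load(f)
    with open(candidate_path) as f:
        v2 = json.load(f)

    regressions = []
    for metric in sorted(v1.keys() & v2.keys()):
        delta = v2[metric] - v1[metric]
        improved = delta <= 0 if metric in LOWER_IS_BETTER else delta >= 0
        if not improved:
            regressions.append(metric)
        print(f"{metric:24} {v1[metric]:>8} {v2[metric]:>8} "
              f"{delta:+8.3f} {'✅' if improved else '❌'}")
    return regressions
```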
When running across hundreds of projects, the pipeline will encounter errors it's never seen before. These need to be surfaced automatically, not silently swallowed.
```python
KNOWN_ERRORS = {
    "litellm.Timeout": {
        "action": "retry",
        "max_retries": 2,
        "escalate_after": 3,   # file issue after 3 occurrences in 24h
    },
    "json.JSONDecodeError": {
        "action": "retry_with_fallback",
        "fallback": "parse_llm_json",
        "escalate_after": 10,
    },
    "httpx.HTTPStatusError:404": {
        "action": "skip",
        "reason": "File not found in repo",
        "escalate_after": None,  # never escalate, expected for some repos
    },
    "httpx.HTTPStatusError:403": {
        "action": "abort",
        "reason": "Rate limited or token expired",
        "escalate_after": 1,
    },
}
```
When the pipeline encounters an error not in KNOWN_ERRORS, or a known error exceeds its escalation threshold:
```python
async def handle_error(error, context):
    """Classify error and optionally file a GitHub issue."""
    error_key = classify_error(error)

    if error_key in KNOWN_ERRORS:
        config = KNOWN_ERRORS[error_key]
        # Track occurrence count
        count = increment_error_count(error_key, window_hours=24)
        if config["escalate_after"] and count >= config["escalate_after"]:
            await file_issue(error, context, label="known-error-escalation")
        return config["action"]
    else:
        # Unknown error — always file an issue
        await file_issue(error, context, label="unknown-error")
        return "abort"


async def file_issue(error, context, label):
    """File a GitHub issue for an error, deduplicating by error signature."""
    signature = error_signature(error)  # e.g., hash of error type + message pattern

    # Check if issue already exists
    existing = await search_issues(
        repo="apache/tooling-agents",
        query=f"label:{label} {signature} is:open",
    )
    if existing:
        # Add comment to existing issue with new occurrence
        await add_comment(existing[0], format_occurrence(error, context))
        return

    # Create new issue
    await create_issue(
        repo="apache/tooling-agents",
        title=f"[Pipeline Error] {error.__class__.__name__}: {str(error)[:80]}",
        labels=[label, "pipeline", context.get("agent_name", "unknown")],
        body=format_issue_body(error, context),
    )
```
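`handle_error` assumes a `classify_error` that maps exceptions onto `KNOWN_ERRORS` keys. A plausible sketch, using the top-level package name plus the class name, with the HTTP status appended for httpx errors:

```python
import httpx

def classify_error(error: Exception) -> str:
    """Map an exception to a KNOWN_ERRORS key like 'httpx.HTTPStatusError:403'."""
    cls = type(error)
    # Use the top-level package so json.decoder.JSONDecodeError
    # classifies as "json.JSONDecodeError".
    key = f"{cls.__module__.split('.')[0]}.{cls.__name__}"
    if isinstance(error, httpx.HTTPStatusError):
        # 404 (skip) and 403 (abort) need different handling, so key on status.
        key = f"{key}:{error.response.status_code}"
    return key
```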
```markdown
## Pipeline Error Report

**Error:** `json.JSONDecodeError: Expecting property name enclosed in double quotes`
**Agent:** `consolidate_asvs_security_audit_reports`
**Project:** apache/steve (v3, commit d0aa7e9)
**Section:** 16.3.4
**Signature:** `err_7f3a2b`

### Context

- Report size: 45,231 chars
- Extraction attempt: 2 of 2
- LLM response first 200 chars: `{'timestamp': self.formatTime(record)...`

### Error Classification

- **Type:** Known error exceeding threshold (10 occurrences in 24h)
- **Root cause:** LLM returning Python-style dicts instead of JSON for reports with extensive code blocks
- **Current mitigation:** `parse_llm_json` with regex fallback

### Occurrences (last 24h)

| Time | Project | Section | Attempt |
|------|---------|---------|---------|
| 04:41 | apache/steve | 16.3.4 | 1/2 |
| 04:42 | apache/steve | 16.3.4 | 2/2 |
| ... | ... | ... | ... |
```
The error signature should group related errors without creating duplicate issues:
```python
import hashlib
import re

def error_signature(error, context=None):
    """Generate a stable signature for deduplication."""
    components = [
        error.__class__.__name__,
        # Normalize the message: strip specific values, keep pattern
        re.sub(r'\d+', 'N', str(error)[:100]),
        context.get("agent_name", "") if context else "",
    ]
    return hashlib.sha256("|".join(components).encode()).hexdigest()[:8]
```
This groups “JSONDecodeError at line 2 column 9” and “JSONDecodeError at line 5 column 12” into the same issue (both are JSON parse failures in the same agent), while separating them from a JSONDecodeError in a different agent.
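A quick illustration of that grouping behavior:

```python
import json

ctx = {"agent_name": "consolidate_asvs_security_audit_reports"}
e1 = json.JSONDecodeError("Expecting value", '{"x": }', 6)         # column 7
e2 = json.JSONDecodeError("Expecting value", '{"other": , }', 10)  # column 11
# Same message pattern + same agent -> same signature (one issue).
assert error_signature(e1, ctx) == error_signature(e2, ctx)
# Same error in a different agent -> different signature (separate issue).
assert error_signature(e1, ctx) != error_signature(e1, {"agent_name": "run_asvs_security_audit"})
```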
At scale (hundreds of projects), we need visibility into pipeline health:
Per-run metrics (stored in the data store):

- project, commit, level, timestamp
- sections_attempted, sections_completed, sections_failed
- findings_total, findings_by_severity
- extraction_success_rate
- consolidation_success (bool)
- errors[] (type, agent, section, message)
- duration_seconds
- estimated_cost

Aggregate metrics (computed):

- completion_rate_7d (rolling)
- error_rate_by_type_7d
- avg_findings_per_project
- projects_audited_total
- sections_audited_total
After each pipeline run, the orchestrator could write a summary to the data store:
```python
run_summary = {
    "project": "apache/steve",
    "commit": "d0aa7e9",
    "level": "L3",
    "started_at": "2026-04-22T04:00:00Z",
    "completed_at": "2026-04-22T06:30:00Z",
    "sections": {"attempted": 345, "completed": 340, "failed": 5},
    "findings": {"critical": 3, "high": 28, "medium": 142, "low": 89},
    "extraction": {"success": 339, "failed": 1, "failed_reports": ["16.3.4.md"]},
    "consolidation": {"success": True, "total_findings": 577},
    "errors": [
        {"type": "timeout", "agent": "run_asvs_security_audit",
         "section": "1.3.3", "retried": True, "resolved": False},
        {"type": "json_parse", "agent": "consolidate",
         "section": "16.3.4", "retried": True, "resolved": False},
    ],
}
```
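Aggregates then fall out of simple queries over these summaries. A sketch of the 7-day rolling completion rate; `load_run_summaries` is a hypothetical data-store query:

```python
from datetime import datetime, timedelta, timezone

def completion_rate_7d(now: datetime | None = None) -> float:
    """Rolling 7-day completion rate across all stored run summaries."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=7)
    recent = [
        r for r in load_run_summaries()  # hypothetical data-store query
        # Summaries store ISO-8601 strings; map "Z" so fromisoformat
        # also works on Python < 3.11.
        if datetime.fromisoformat(r["completed_at"].replace("Z", "+00:00")) >= cutoff
    ]
    attempted = sum(r["sections"]["attempted"] for r in recent)
    completed = sum(r["sections"]["completed"] for r in recent)
    return completed / attempted if attempted else 1.0
```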
| Phase | What | Why | Effort |
|---|---|---|---|
| 1 | ATR regression fixture from existing verified L1+L2 run | We already have manually reviewed and triaged results — capture them as baseline | Low |
| 2 | Eval harness (run fixture, score, compare) | Enables confident agent changes | Medium |
| 3 | Run summary in data store | Operational visibility for multi-project runs | Low |
| 4 | Error classification + auto-filing | Scales operational support to hundreds of projects | Medium |
| 5 | LLM-as-judge for semantic comparison | Handles non-determinism in eval scoring | Medium |
| 6 | Additional fixtures (clean app, library, edge cases) | Broadens eval coverage | Ongoing |
| 7 | Dashboard / reporting | Aggregate visibility across all projects | Medium |