| # ASVS Pipeline: Eval Framework and Operational QA |
| |
| Design document for testing, evaluating, and operating the ASVS audit pipeline at scale across hundreds of ASF projects. |
| |
| ## What "Eval" Means Here |
| |
| In the LLM context, an eval is a repeatable measurement of output quality against known-good answers. For a security audit pipeline, this means: |
| |
| - **Does the pipeline find known vulnerabilities?** (recall) |
| - **Are the findings real?** (precision) |
| - **Does it correctly handle code that's secure?** (false positive rate) |
| - **Does it gracefully handle edge cases?** (robustness) |
| - **Do agent changes improve or degrade quality?** (regression) |
| |
| This is different from traditional unit testing. LLM outputs are non-deterministic — the same input can produce different (but equivalent) findings across runs. Evals need to measure semantic correctness, not string equality. |
| |
| ## Prior Art |
| |
| Potentially useful starting points from [ai-discuss](https://lists.apache.org/thread/85bnlfr4jb647lsppfkg19qzk8k3x122): |
| |
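| 1. [Inspect](https://inspect.aisi.org.uk/): LLM evaluation framework from the UK AI Security Institute |
| 2. [sling-whiteboard `skill-evals`](https://github.com/apache/sling-whiteboard/tree/master/skill-evals): eval project in the Apache Sling whiteboard repo |
| 3. [Inspect model providers](https://inspect.aisi.org.uk/providers.html) |
| 4. [`jcr_js_nodetypes/dataset.jsonl`](https://github.com/apache/sling-whiteboard/blob/master/skill-evals/src/skill_evals/jcr_js_nodetypes/dataset.jsonl): example eval dataset |
| 5. [`jcr_js_nodetypes/Dockerfile`](https://github.com/apache/sling-whiteboard/blob/master/skill-evals/src/skill_evals/jcr_js_nodetypes/Dockerfile): container for the same eval |
| 6. [`jcr_js_nodetypes/scorer.py`](https://github.com/apache/sling-whiteboard/blob/master/skill-evals/src/skill_evals/jcr_js_nodetypes/scorer.py): scorer for the same eval |
| 7. [Inspect scorers](https://inspect.aisi.org.uk/scorers.html) |
| 8. [`compare_eval_runs.py`](https://github.com/apache/sling-whiteboard/blob/master/skill-evals/src/skill_evals/compare_eval_runs.py): script for comparing eval runs |
| 9. [Gist by rombert](https://gist.github.com/rombert/c099c13013fbdf27445816c976005aba) |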
| |
| ## Eval Architecture |
| |
| ``` |
| eval/ |
| ├── fixtures/                     # Known codebases with expected results |
| │   ├── webgoat-minimal/          # Small app with intentional vulns |
| │   │   ├── src/                  # Source code |
| │   │   ├── expected.json         # Expected findings (section → severity → count) |
| │   │   └── false_positives.json  # Known FP patterns to check for absence |
| │   ├── secure-app/               # Clean app, should produce few/no findings |
| │   │   ├── src/ |
| │   │   └── expected.json |
| │   ├── library-only/             # Pure library, many N/A sections |
| │   │   ├── src/ |
| │   │   └── expected.json |
| │   └── edge-cases/ |
| │       ├── empty-repo/ |
| │       ├── binary-only/ |
| │       ├── single-file/ |
| │       └── huge-repo/            # 10k+ files, tests path prefix scoping |
| ├── harness.py                    # Eval runner |
| ├── judge.py                      # LLM-as-judge for semantic comparison |
| ├── metrics.py                    # Scoring functions |
| ├── report.py                     # Eval report generator |
| └── README.md |
| ``` |
| |
| ## Fixtures |
| |
| ### What Makes a Good Fixture |
| |
| A fixture is a small, stable codebase with documented security properties. Each fixture needs: |
| |
| 1. **Source code** — small enough to audit quickly (< 50 files), large enough to be realistic |
| 2. **Expected findings** — what the pipeline should find, expressed as ranges not exact counts |
| 3. **False positive patterns** — specific things the pipeline should NOT flag |
| 4. **Not-applicable sections** — ASVS sections that don't apply (for library fixtures) |
| |
| ### Fixture Types |
| |
| **Vulnerability fixtures** — intentionally insecure code targeting specific ASVS sections: |
| |
| ```json |
| { |
|   "name": "webgoat-minimal", |
|   "type": "web_app", |
|   "framework": "flask", |
|   "expected_findings": { |
|     "1.2.1": {"min_severity": "HIGH", "min_count": 1, "category": "XSS"}, |
|     "6.2.1": {"min_severity": "MEDIUM", "min_count": 1, "category": "weak_password"}, |
|     "3.4.1": {"min_severity": "HIGH", "min_count": 1, "category": "insecure_cookie"} |
|   }, |
|   "expected_na": ["9.1.1", "9.1.2"], |
|   "false_positive_patterns": [ |
|     {"section": "1.3.1", "pattern": "CSRF on login form", "reason": "Login forms don't need CSRF"} |
|   ] |
| } |
| ``` |
| |
| **Clean fixtures** — secure code that should produce minimal findings: |
| |
| ```json |
| { |
|   "name": "secure-app", |
|   "type": "web_app", |
|   "max_high_findings": 2, |
|   "max_critical_findings": 0, |
|   "notes": "If pipeline finds critical issues here, it's probably hallucinating" |
| } |
| ``` |
| |
| **Edge case fixtures** — test robustness, not finding quality: |
| |
| ```json |
| { |
|   "name": "binary-only", |
|   "type": "edge_case", |
|   "expected_behavior": "graceful_na", |
|   "should_not_crash": true, |
|   "expected_reports": 0 |
| } |
| ``` |
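| |
| The three fixture shapes above share `name` and `type` and differ in the remaining keys. A minimal loader sketch, assuming each fixture directory holds a single `expected.json` in one of those shapes. The `Fixture` dataclass, `parse_fixture`, and the `extra` catch-all are illustrative names, not existing pipeline code: |
| |
```python
import json
from dataclasses import dataclass, field, fields
from pathlib import Path
from typing import Optional


@dataclass
class Fixture:
    """In-memory view of one expected.json (hypothetical schema)."""
    name: str
    type: str
    expected_findings: dict = field(default_factory=dict)
    expected_na: list = field(default_factory=list)
    false_positive_patterns: list = field(default_factory=list)
    max_high_findings: Optional[int] = None
    max_critical_findings: Optional[int] = None
    expected_behavior: Optional[str] = None
    extra: dict = field(default_factory=dict)  # keys this sketch does not model


def parse_fixture(raw: dict) -> Fixture:
    """Validate required keys and sort the rest into known fields or `extra`."""
    missing = {"name", "type"} - set(raw)
    if missing:
        raise ValueError(f"fixture is missing required keys: {sorted(missing)}")
    modeled = {f.name for f in fields(Fixture)} - {"extra"}
    known = {k: v for k, v in raw.items() if k in modeled}
    extra = {k: v for k, v in raw.items() if k not in modeled}
    return Fixture(**known, extra=extra)


def load_fixture(fixture_dir: str) -> Fixture:
    """Read <fixture_dir>/expected.json and parse it."""
    path = Path(fixture_dir) / "expected.json"
    return parse_fixture(json.loads(path.read_text(encoding="utf-8")))
```
| |
| Keeping unmodeled keys in `extra` rather than rejecting them lets edge-case fixtures carry flags such as `should_not_crash` without a schema change. |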
| |
| ### Building Fixtures from Real Runs |
| |
| The best fixtures come from real audits where we've manually verified findings: |
| |
| 1. Take the ATR da901ba L1+L2 run (253 sections, manually reviewed and triaged) — reports at [`ASVS/reports/tooling-trusted-releases/da901ba/`](../../ASVS/reports/tooling-trusted-releases/da901ba/) including [consolidated report](../../ASVS/reports/tooling-trusted-releases/da901ba/consolidated-L1-L2.md), [issues](../../ASVS/reports/tooling-trusted-releases/da901ba/issues-L1-L2.md), and [triage notes](../../ASVS/reports/tooling-trusted-releases/da901ba/triage.txt) |
| 2. Mark each finding as TP (true positive), FP (false positive), or PARTIAL |
| 3. Use this as a regression baseline — future pipeline changes should not lose TPs or reintroduce FPs |
| 4. The audit_guidance documents are effectively the "answer key" for false positive patterns |
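| |
| The regression rule in step 3 can be sketched as a set comparison. This assumes each triaged finding carries a stable identifier that survives across runs (in practice that mapping would likely come from semantic matching); `TriagedFinding`, its `key` field, and `check_against_baseline` are hypothetical names: |
| |
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriagedFinding:
    section: str   # ASVS section, e.g. "3.4.1"
    key: str       # hypothetical stable identifier for the finding
    verdict: str   # "TP", "FP", or "PARTIAL"


def check_against_baseline(baseline, current_keys):
    """Return (lost_tps, reintroduced_fps) for a new run.

    baseline: iterable of TriagedFinding from the manually triaged run.
    current_keys: set of (section, key) pairs reported by the new run.
    PARTIAL findings are ignored in this sketch.
    """
    lost_tps = sorted(
        (f.section, f.key) for f in baseline
        if f.verdict == "TP" and (f.section, f.key) not in current_keys
    )
    reintroduced_fps = sorted(
        (f.section, f.key) for f in baseline
        if f.verdict == "FP" and (f.section, f.key) in current_keys
    )
    return lost_tps, reintroduced_fps
```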
| |
| ## Eval Metrics |
| |
| ### Per-Section Metrics |
| |
| | Metric | How to Measure | Target | |
| |--------|---------------|--------| |
| | **Finding recall** | Known vulns found / known vulns in fixture | > 80% | |
| | **False positive rate** | FP findings / total findings reported (equivalently, 1 - precision) | < 20% | |
| | **N/A accuracy** | Correctly identified N/A sections / total N/A sections | > 90% | |
| | **Severity accuracy** | Findings with correct severity / findings with any severity | > 70% | |
| | **Report completeness** | Reports with all required sections (summary, findings, remediation) / total reports | 100% | |
| |
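| As a rough sketch of how `metrics.py` might compute the first three rows of the table, assuming findings have already been matched to expected vulnerabilities by section ID. The function names and the `TARGETS` layout are illustrative; a metric with a zero denominator is reported as `None` rather than as 0% or 100%: |
| |
```python
from typing import Optional

# Targets from the Per-Section Metrics table: (comparison, threshold)
TARGETS = {
    "finding_recall": (">", 0.80),
    "false_positive_rate": ("<", 0.20),
    "na_accuracy": (">", 0.90),
}


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when the metric is undefined."""
    return numerator / denominator if denominator else None


def score_fixture(known_vulns, found_vulns, total_findings, fp_findings,
                  expected_na, reported_na) -> dict:
    """Compute per-fixture metrics from sets of section IDs and finding counts."""
    known_vulns, expected_na = set(known_vulns), set(expected_na)
    return {
        "finding_recall": ratio(len(known_vulns & set(found_vulns)), len(known_vulns)),
        "false_positive_rate": ratio(fp_findings, total_findings),
        "na_accuracy": ratio(len(expected_na & set(reported_na)), len(expected_na)),
    }


def meets_targets(scores: dict) -> dict:
    """Map each metric to True/False, or None when it could not be computed."""
    verdicts = {}
    for metric, (op, threshold) in TARGETS.items():
        value = scores.get(metric)
        if value is None:
            verdicts[metric] = None
        else:
            verdicts[metric] = value > threshold if op == ">" else value < threshold
    return verdicts
```
| |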
| ### Pipeline Metrics |
| |
| | Metric | How to Measure | Target | |
| |--------|---------------|--------| |
| | **Completion rate** | Sections with reports / total sections attempted | > 98% | |
| | **Extraction success** | Reports successfully extracted into consolidation / total reports | > 95% | |
| | **Consolidation dedup rate** | Unique findings / raw findings before dedup | 40-70% | |
| | **Cost per section** | Average token cost per ASVS section audit | Track trend | |
| | **Time per section** | Average wall clock time per section | Track trend | |
| |
| ### LLM-as-Judge |
| |
| For semantic comparison (did the pipeline find the same vulnerability, even if described differently), use an LLM judge: |
| |
| ```python |
| # Literal braces are doubled so the template can be filled with str.format(). |
| JUDGE_PROMPT = """Compare these two security findings and determine if they describe the same vulnerability. |
| |
| Expected finding: |
| {expected} |
| |
| Actual finding: |
| {actual} |
| |
| Respond with JSON: {{"match": true/false, "confidence": 0.0-1.0, "reason": "..."}} |
| """ |
| ``` |
| |
| This handles the non-determinism problem — the pipeline might describe a finding differently across runs, but the judge can determine if they're semantically equivalent. |
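| |
| A sketch of how `judge.py` might consume the judge's reply, assuming the judge is exposed as a callable that takes the two finding texts and returns the raw response string. `parse_judge_response`, `match_findings`, and the 0.7 confidence cutoff are assumptions for illustration; malformed or low-confidence replies are treated as "no match" so a misbehaving judge cannot inflate recall: |
| |
```python
import json


def parse_judge_response(text: str, min_confidence: float = 0.7) -> bool:
    """Return True only for a well-formed, sufficiently confident match."""
    try:
        verdict = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(verdict, dict):
        return False
    confidence = verdict.get("confidence", 0.0)
    if not isinstance(confidence, (int, float)):
        return False
    return verdict.get("match") is True and confidence >= min_confidence


def match_findings(expected, actual, judge, min_confidence: float = 0.7) -> dict:
    """Greedy one-to-one matching; returns {expected_index: actual_index}.

    judge(expected_text, actual_text) must return the judge's raw reply.
    """
    matches = {}
    used = set()
    for i, exp in enumerate(expected):
        for j, act in enumerate(actual):
            if j in used:
                continue
            if parse_judge_response(judge(exp, act), min_confidence):
                matches[i] = j
                used.add(j)
                break
    return matches
```
| |
| Expected findings missing from the returned mapping count against recall; actual findings that were never matched are candidates for false-positive review. |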
| |
| ## Eval Runner |
| |
| ```python |
| # harness.py (sketch) |
| |
| async def run_eval(fixture_path: str, pipeline_config: dict) -> EvalResult: |
|     """Run a single fixture through the pipeline and score results.""" |
| |
|     fixture = load_fixture(fixture_path) |
| |
|     # 1. Load fixture source into data store |
|     namespace = f"eval:{fixture.name}" |
|     load_fixture_code(namespace, fixture.src_path) |
| |
|     # 2. Run pipeline (discovery + audit, no GitHub push) |
|     results = await run_pipeline_local( |
|         namespace=namespace, |
|         level=fixture.level or "L1", |
|         sections=fixture.target_sections,  # or all if not specified |
|         config=pipeline_config,  # agent/model settings under test |
|     ) |
| |
|     # 3. Score results |
|     scores = score_results(fixture, results) |
| |
|     # 4. Generate report |
|     return EvalResult( |
|         fixture=fixture.name, |
|         scores=scores, |
|         findings=results.findings, |
|         duration=results.duration, |
|         cost=results.token_cost, |
|     ) |
| |
| async def run_eval_suite(suite_path: str, pipeline_config: dict) -> EvalSuiteResult: |
|     """Run all fixtures and produce aggregate scores.""" |
|     results = [] |
|     for fixture_path in discover_fixtures(suite_path): |
|         result = await run_eval(fixture_path, pipeline_config) |
|         results.append(result) |
|         print(f"  {result.fixture}: recall={result.scores.recall:.0%} " |
|               f"precision={result.scores.precision:.0%} " |
|               f"FP={result.scores.false_positive_rate:.0%}") |
|     return aggregate(results) |
| ``` |
| |
| ### Running Evals |
| |
| ```bash |
| # Run full eval suite |
| python eval/harness.py eval/fixtures/ |
| |
| # Run single fixture |
| python eval/harness.py eval/fixtures/webgoat-minimal/ |
| |
| # Compare two pipeline versions |
| python eval/harness.py eval/fixtures/ --baseline results/v1.json --output results/v2.json |
| python eval/report.py results/v1.json results/v2.json |
| ``` |
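| |
| The commands above imply a small command-line surface for `harness.py`. A minimal `argparse` sketch, assuming one positional path plus the two optional flags shown; `build_parser` and `parse_args` are illustrative names: |
| |
```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Describe the harness CLI: one fixture path plus optional result files."""
    parser = argparse.ArgumentParser(
        prog="harness.py",
        description="Run ASVS pipeline evals against fixtures.",
    )
    parser.add_argument("path", help="A fixture directory, or a directory of fixtures")
    parser.add_argument("--baseline", help="Results JSON from an earlier run to compare against")
    parser.add_argument("--output", help="Where to write this run's results JSON")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:] when None)."""
    return build_parser().parse_args(argv)
```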
| |
| ### Regression Detection |
| |
| After any agent change (prompt update, model switch, parameter tweak), run the eval suite and compare: |
| |
| ``` |
| Pipeline v1 → v2 Comparison |
| ============================ |
|                         v1      v2      Δ |
| Finding recall          82%     85%     +3%    ✅ |
| False positive rate     18%     12%     -6%    ✅ |
| N/A accuracy            91%     93%     +2%    ✅ |
| Completion rate         98.5%   99.1%   +0.6%  ✅ |
| Extraction success      96%     98%     +2%    ✅ |
| Cost per section        $0.42   $0.38   -$0.04 ✅ |
| |
| Regressions: |
|   (none) |
| |
| New findings in v2 not in v1: |
|   webgoat-minimal 3.4.1: Found cookie without Secure flag (HIGH)  ← NEW TP |
| |
| Findings in v1 lost in v2: |
|   (none) |
| ``` |
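| |
| The comparison logic behind such a report can be sketched as follows, assuming each run is summarized as a flat dict of metric values. The metric keys, `compare_runs`, `regressions`, and the `tolerance` parameter are hypothetical; the one design point worth noting is that direction matters, since a drop in false positive rate or cost is an improvement: |
| |
```python
# True when a larger value is better for that metric.
HIGHER_IS_BETTER = {
    "finding_recall": True,
    "false_positive_rate": False,
    "na_accuracy": True,
    "completion_rate": True,
    "extraction_success": True,
    "cost_per_section": False,
}


def compare_runs(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Return {metric: (delta, status)} for metrics present in both runs.

    status is "improved", "regressed", or "unchanged"; changes whose
    magnitude is within `tolerance` count as unchanged.
    """
    report = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        if metric not in baseline or metric not in candidate:
            continue
        delta = candidate[metric] - baseline[metric]
        signed = delta if higher_is_better else -delta
        if signed > tolerance:
            status = "improved"
        elif signed < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        report[metric] = (delta, status)
    return report


def regressions(report: dict) -> list:
    """Names of regressed metrics, sorted for stable output."""
    return sorted(m for m, (_, status) in report.items() if status == "regressed")
```
| |
| Because LLM output varies between runs, a non-zero `tolerance` may be needed in practice so that run-to-run noise is not reported as a regression. |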
| |
| ## Operational Error Handling at Scale |
| |
| When running across hundreds of projects, the pipeline will encounter errors it's never seen before. These need to be surfaced automatically, not silently swallowed. |
| |
| ### Error Classification |
| |
| ```python |
| KNOWN_ERRORS = { |
|     "litellm.Timeout": { |
|         "action": "retry", |
|         "max_retries": 2, |
|         "escalate_after": 3,  # file issue after 3 occurrences in 24h |
|     }, |
|     "json.JSONDecodeError": { |
|         "action": "retry_with_fallback", |
|         "fallback": "parse_llm_json", |
|         "escalate_after": 10, |
|     }, |
|     "httpx.HTTPStatusError:404": { |
|         "action": "skip", |
|         "reason": "File not found in repo", |
|         "escalate_after": None,  # never escalate, expected for some repos |
|     }, |
|     "httpx.HTTPStatusError:403": { |
|         "action": "abort", |
|         "reason": "Rate limited or token expired", |
|         "escalate_after": 1, |
|     }, |
| } |
| ``` |
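| |
| The keys in `KNOWN_ERRORS` follow a `package.ExceptionName[:status]` pattern. One way `classify_error` could derive such a key is sketched below. It assumes HTTP-style exceptions expose `response.status_code` (as `httpx.HTTPStatusError` does) and uses only the top-level package of the exception's module; both choices are illustrative, and other libraries may need special cases: |
| |
```python
def classify_error(error: BaseException) -> str:
    """Map an exception to a key such as "json.JSONDecodeError".

    An HTTP status code is appended when the exception carries a response
    object, e.g. "httpx.HTTPStatusError:404".
    """
    cls = type(error)
    # "json.decoder" -> "json", so keys stay stable across internal module moves
    top_package = (cls.__module__ or "builtins").split(".")[0]
    key = f"{top_package}.{cls.__name__}"

    response = getattr(error, "response", None)
    status = getattr(response, "status_code", None)
    if status is not None:
        key = f"{key}:{status}"
    return key
```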
| |
| ### Auto-Filing GitHub Issues |
| |
| When the pipeline encounters an error not in `KNOWN_ERRORS`, or a known error exceeds its escalation threshold, it files a GitHub issue (or comments on the existing one): |
| |
| ```python |
| async def handle_error(error, context): |
|     """Classify error and optionally file a GitHub issue.""" |
| |
|     error_key = classify_error(error) |
| |
|     if error_key in KNOWN_ERRORS: |
|         config = KNOWN_ERRORS[error_key] |
| |
|         # Track occurrence count |
|         count = increment_error_count(error_key, window_hours=24) |
| |
|         if config["escalate_after"] and count >= config["escalate_after"]: |
|             await file_issue(error, context, label="known-error-escalation") |
| |
|         return config["action"] |
|     else: |
|         # Unknown error: always file an issue |
|         await file_issue(error, context, label="unknown-error") |
|         return "abort" |
| |
| async def file_issue(error, context, label): |
|     """File a GitHub issue for an error, deduplicating by error signature.""" |
| |
|     # Hash of error type + message pattern + agent name; passing context |
|     # keeps the same error in different agents in separate issues |
|     signature = error_signature(error, context) |
| |
|     # Check if issue already exists |
|     existing = await search_issues( |
|         repo="apache/tooling-agents", |
|         query=f"label:{label} {signature} is:open", |
|     ) |
| |
|     if existing: |
|         # Add comment to existing issue with new occurrence |
|         await add_comment(existing[0], format_occurrence(error, context)) |
|         return |
| |
|     # Create new issue |
|     await create_issue( |
|         repo="apache/tooling-agents", |
|         title=f"[Pipeline Error] {error.__class__.__name__}: {str(error)[:80]}", |
|         labels=[label, "pipeline", context.get("agent_name", "unknown")], |
|         body=format_issue_body(error, context), |
|     ) |
| ``` |
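| |
| `increment_error_count` is not defined above. A minimal in-memory sketch of a 24-hour sliding-window counter follows, assuming a single process and timestamps that arrive in non-decreasing order; a multi-worker deployment would presumably keep these counts in the shared data store instead. The `now` parameter exists only to make the function testable: |
| |
```python
import time
from collections import defaultdict, deque

# error_key -> occurrence timestamps in seconds, oldest first
_occurrences = defaultdict(deque)


def increment_error_count(error_key: str, window_hours: float = 24, now=None) -> int:
    """Record one occurrence and return how many fall inside the window."""
    now = time.time() if now is None else now
    cutoff = now - window_hours * 3600
    timestamps = _occurrences[error_key]
    timestamps.append(now)
    # Drop occurrences that have aged out of the window
    while timestamps and timestamps[0] <= cutoff:
        timestamps.popleft()
    return len(timestamps)
```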
| |
| ### Issue Body Format |
| |
| ```markdown |
| ## Pipeline Error Report |
| |
| **Error:** `json.JSONDecodeError: Expecting property name enclosed in double quotes` |
| **Agent:** `consolidate_asvs_security_audit_reports` |
| **Project:** apache/steve (v3, commit d0aa7e9) |
| **Section:** 16.3.4 |
| **Signature:** `err_7f3a2b` |
| |
| ### Context |
| - Report size: 45,231 chars |
| - Extraction attempt: 2 of 2 |
| - LLM response first 200 chars: `{'timestamp': self.formatTime(record)...` |
| |
| ### Error Classification |
| - **Type:** Known error exceeding threshold (10 occurrences in 24h) |
| - **Root cause:** LLM returning Python-style dicts instead of JSON for reports with extensive code blocks |
| - **Current mitigation:** `parse_llm_json` with regex fallback |
| |
| ### Occurrences (last 24h) |
| | Time | Project | Section | Attempt | |
| |------|---------|---------|---------| |
| | 04:41 | apache/steve | 16.3.4 | 1/2 | |
| | 04:42 | apache/steve | 16.3.4 | 2/2 | |
| | ... | ... | ... | ... | |
| ``` |
| |
| ### Error Signature Deduplication |
| |
| The error signature should group related errors without creating duplicate issues: |
| |
| ```python |
| import hashlib |
| import re |
| |
| def error_signature(error, context=None): |
|     """Generate a stable signature for deduplication.""" |
|     components = [ |
|         error.__class__.__name__, |
|         # Normalize the message (digits -> N) before truncating, so messages |
|         # that differ only in numeric values are cut at the same point |
|         re.sub(r'\d+', 'N', str(error))[:100], |
|         context.get("agent_name", "") if context else "", |
|     ] |
|     return hashlib.sha256("|".join(components).encode()).hexdigest()[:8] |
| ``` |
| |
| This groups "JSONDecodeError at line 2 column 9" and "JSONDecodeError at line 5 column 12" into the same issue (both are JSON parse failures in the same agent), while separating them from a JSONDecodeError in a different agent. |
| |
| ## Operational Dashboard |
| |
| At scale (hundreds of projects), we need visibility into pipeline health. |
| |
| ### Key Metrics to Track |
| |
| ``` |
| Per-run metrics (stored in data store): |
| - project, commit, level, timestamp |
| - sections_attempted, sections_completed, sections_failed |
| - findings_total, findings_by_severity |
| - extraction_success_rate |
| - consolidation_success (bool) |
| - errors[] (type, agent, section, message) |
| - duration_seconds |
| - estimated_cost |
| |
| Aggregate metrics (computed): |
| - completion_rate_7d (rolling) |
| - error_rate_by_type_7d |
| - avg_findings_per_project |
| - projects_audited_total |
| - sections_audited_total |
| ``` |
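| |
| The rolling aggregates can be derived directly from the per-run records. A sketch, assuming each record uses the flat field names listed above with an ISO 8601 UTC `timestamp`, and defining error rate as errors per section attempted (one possible definition; the document does not fix one). `rolling_metrics` is an illustrative name: |
| |
```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def _parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def rolling_metrics(runs, now=None, days: int = 7) -> dict:
    """Aggregate per-run records whose timestamp falls in the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    attempted = completed = 0
    errors_by_type = Counter()
    for run in runs:
        if _parse_utc(run["timestamp"]) < cutoff:
            continue  # outside the rolling window
        attempted += run["sections_attempted"]
        completed += run["sections_completed"]
        errors_by_type.update(err["type"] for err in run.get("errors", []))
    if not attempted:
        return {"completion_rate": None, "error_rate_by_type": {}}
    return {
        "completion_rate": completed / attempted,
        "error_rate_by_type": {
            kind: count / attempted for kind, count in errors_by_type.items()
        },
    }
```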
| |
| ### Run Summary |
| |
| After each pipeline run, the orchestrator could write a summary to the data store: |
| |
| ```python |
| run_summary = { |
|     "project": "apache/steve", |
|     "commit": "d0aa7e9", |
|     "level": "L3", |
|     "started_at": "2026-04-22T04:00:00Z", |
|     "completed_at": "2026-04-22T06:30:00Z", |
|     "sections": {"attempted": 345, "completed": 340, "failed": 5}, |
|     "findings": {"critical": 3, "high": 28, "medium": 142, "low": 89}, |
|     "extraction": {"success": 339, "failed": 1, "failed_reports": ["16.3.4.md"]}, |
|     "consolidation": {"success": True, "total_findings": 577}, |
|     "errors": [ |
|         {"type": "timeout", "agent": "run_asvs_security_audit", "section": "1.3.3", "retried": True, "resolved": False}, |
|         {"type": "json_parse", "agent": "consolidate", "section": "16.3.4", "retried": True, "resolved": False}, |
|     ], |
| } |
| ``` |
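| |
| Before a summary is stored, a cheap internal-consistency check can catch bookkeeping bugs in the orchestrator. The sketch below assumes extraction is attempted exactly once per completed section, which holds for the example above (339 + 1 = 340) but is an assumption about the pipeline; `check_run_summary` is a hypothetical helper: |
| |
```python
def check_run_summary(summary: dict) -> list:
    """Return human-readable inconsistencies found in a run summary (empty if none)."""
    problems = []

    sections = summary["sections"]
    if sections["completed"] + sections["failed"] != sections["attempted"]:
        problems.append("sections: completed + failed != attempted")

    extraction = summary["extraction"]
    # Assumption: one extraction attempt per completed section
    if extraction["success"] + extraction["failed"] != sections["completed"]:
        problems.append("extraction: success + failed != sections completed")
    if len(extraction.get("failed_reports", [])) != extraction["failed"]:
        problems.append("extraction: failed_reports length != failed count")

    return problems
```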
| |
| ## Implementation Priority |
| |
| | Phase | What | Why | Effort | |
| |-------|------|-----|--------| |
| | **1** | ATR regression fixture from existing verified L1+L2 run | We already have manually reviewed and triaged results — capture them as baseline | Low | |
| | **2** | Eval harness (run fixture, score, compare) | Enables confident agent changes | Medium | |
| | **3** | Run summary in data store | Operational visibility for multi-project runs | Low | |
| | **4** | Error classification + auto-filing | Scales operational support to hundreds of projects | Medium | |
| | **5** | LLM-as-judge for semantic comparison | Handles non-determinism in eval scoring | Medium | |
| | **6** | Additional fixtures (clean app, library, edge cases) | Broadens eval coverage | Ongoing | |
| | **7** | Dashboard / reporting | Aggregate visibility across all projects | Medium | |