The pipeline as written has two systemic inefficiencies that compound:
for section in sections: with await inside, no asyncio.gather. The opus_semaphore=2 and sonnet_semaphore=5 inside run_asvs_security_audit only parallelize batches within a single section, never across sections.Fixing just these two gets the recommended-hybrid run from ~511 hours down to ~70–90 hours (5–7× speedup) with no loss of audit quality. Layering in three further wins (better caching, tarball download, smarter consolidation) takes it to ~57 hours.
| Tier | What changes | Total h | Days | vs. baseline |
|---|---|---|---|---|
| Baseline | Current pipeline | 511 | 64 | — |
| T1 | Orchestrator parallel-4 (one-line fix) | 141 | 18 | −72 % |
| T1+T2 | + raise opus_semaphore to 4 | 108 | 14 | −79 % |
| T1+T2+T3 | + tarball download | 108 | 14 | (cosmetic) |
| T1–T3 + bundling | + bundle sections per pass into one Opus call | ~68 | 8.5 | −87 % |
| Full | + global inventory cache + skip empty-finding format + smarter consolidation | ~57 | 7.2 | −89 % |
(All numbers solo-runner. With 4 parallel orchestrator runners on top, full-optimization gets to ~14 hours total wall-clock for all 11 repos.)
The rest of this doc walks through every optimization, what code to touch, expected impact, and risk to audit quality.
From the calibration-fit model, agents consume:
| Agent | Hours (recommended hybrid) | % | Why |
|---|---|---|---|
run_asvs_security_audit | 460.6 | 90.2 % | Step 4 Opus deep analysis dominates. Called 11×N_sections times. |
consolidate_asvs_security_audit_reports | 44.8 | 8.8 % | Phase 3 domain consolidation + Phase 4 final merge are the heavy LLM calls. |
discover_codebase_architecture | 2.8 | 0.5 % | Flat ~15 min per repo regardless of size. |
download_github_repo_to_data_store | 1.9 | 0.4 % | One GitHub API GET per file, sequential. |
add_markdown_file_to_github_directory | 0.9 | 0.2 % | One PUT per section. |
Within run_asvs_security_audit, per-section breakdown:
So the entire optimization story is really “make Step 4 cheaper or run it less often.”
Problem. orchestrate_asvs_audit_to_github.py lines 380–438 (and 619–653 for no-discover mode) run sections in a strict await-in-for-loop pattern:
for pass_def in passes: for section in sections: section_idx += 1 audit_result = await gofannon_client.call( agent_name="run_asvs_security_audit", input_dict={...} ) await gofannon_client.call( agent_name="add_markdown_file_to_github_directory", input_dict={...} )
This means with 345 sections per repo at ~10 min each = 57.5 hours of pure serial wall-clock. The opus_semaphore=2 inside run_asvs_security_audit is useless for this — there's only ever one section running at a time.
Fix. Wrap section execution in an asyncio.gather with an outer semaphore:
SECTION_CONCURRENCY = 4 # tune to your bedrock quota section_semaphore = asyncio.Semaphore(SECTION_CONCURRENCY) async def run_one_section(section, pass_def): async with section_semaphore: audit_result = await gofannon_client.call( agent_name="run_asvs_security_audit", input_dict={...} ) await gofannon_client.call( agent_name="add_markdown_file_to_github_directory", input_dict={...} ) return section, audit_result # Inside the pass loop: section_tasks = [run_one_section(s, pass_def) for s in sections] results = await asyncio.gather(*section_tasks, return_exceptions=True)
Impact. With concurrency=4, audit time drops from 460 h → ~115 h for the recommended hybrid. Total project: ~511 h → ~141 h (−72%). Concurrency=8 would cut another half if your bedrock quota allows.
Risk. Low. The agents are stateless w.r.t. each other; per-section caches are keyed by asvs so they don't collide. Watch out for: GitHub API rate limits on the PUT side (push step), and Bedrock InvocationRateLimit. Backoff already exists in call_llm.
Effort. ~30 lines of code changes in two files. A half-day's work for a >70% speedup.
opus_semaphore from 2 to 4 [TRIVIAL, BIG IMPACT]Problem. run_asvs_security_audit.py:93 hardcodes opus_semaphore = asyncio.Semaphore(2). For repos where a section's relevant file set spans many Opus batches (e.g., a 150 k LOC repo where a section flags 80 files relevant), this serializes batches that could run in parallel.
Fix. Bump to 4 (or make it configurable via env var):
import os OPUS_CONCURRENCY = int(os.environ.get("OPUS_CONCURRENCY", "4")) opus_semaphore = asyncio.Semaphore(OPUS_CONCURRENCY)
Impact. Within a single section, halves Step 4 wall-clock when there are ≥4 batches. Combined with T1, gets us to ~108 h total.
Risk. Bedrock account quotas. Most us-east accounts allow 5–10 concurrent Opus invocations. If you start hitting ThrottlingException, drop to 3. The retry logic at lines 625–645 + 667–684 handles transient failures correctly.
Effort. One-line change. Pair this with T1 and you're at −79% in an afternoon.
Problem. download_github_repo_to_data_store.py uses git/trees?recursive=1 to get the file list (good), then makes one GET /repos/.../contents/{path} per file (bad). For grails-core‘s ~7,000 files that’s 7,000 sequential API calls and a substantial chunk of GitHub's 5,000-req/hour quota.
Fix. Use the tarball endpoint, which streams the entire repo as a single response:
tarball_url = f"https://api.github.com/repos/{repo}/tarball/{default_branch}" response = await http_client.get(tarball_url, headers=headers, follow_redirects=True) # stream to /tmp, extract, walk, push to data_store
Impact. Reduces download from ~7–15 min on big repos to ~1–2 min. Recovers ~10–20 min total across all 11 repos. This is mostly a quota and robustness win — the time savings are small in the overall budget but it removes flaky failures on large repos.
Risk. Need to add tarfile extraction logic; need to recreate the same vendor-dir / >1MB filtering after extracting; binary-skip detection moves to local rather than via API. Modest refactor.
Effort. ~2–3 hours.
This is where the real savings live and it's worth the engineering effort.
Problem. The discovery agent produces “passes” — groups of ASVS sections that share file scope (same include_files). The orchestrator iterates each section in each pass, and each section's run_asvs_security_audit call independently:
If a pass has 6 sections sharing the same 50 files, Opus reads those 50 files 6 times (in 6 separate billing-and-wall-clock-expensive deep-analysis calls) just to answer 6 different ASVS questions about them. That's exactly the kind of work that should be batched.
Fix. Add a new agent (or a mode on run_asvs_security_audit) that takes multiple ASVS sections in one call:
# Modified system prompt structure analysis_system_prompt = f"""You are auditing the following code against MULTIPLE ASVS requirements. For each requirement below, produce a separate findings section. ## ASVS Requirements {requirements_block} # all section descriptions, levels, IDs ## Output Format For each requirement, emit: ### Requirement <ID>: <name> [findings, controls inventory, etc.] --- """
Discovery already produces the right grouping in pass_def["asvs_sections"]. The orchestrator change becomes:
# Instead of: for section in sections: await audit(section) audit_result = await gofannon_client.call( agent_name="run_asvs_security_audit_multi", # or pass list to existing agent input_dict={ "asvs_sections": sections, # <-- list, not single "namespaces": namespaces, "includeFiles": include_files, ... } ) # Then split the response into per-section files for the existing PUT loop
Impact. Reduces effective Opus calls from 345 → ~50 passes per repo (typical pass size is 5–10 sections). Audit step shrinks by ~5–6×. Combined with T1+T2, gets to ~68 h total (−87%).
There's a quality bonus too: Opus reasoning about multiple related ASVS reqs in one trace tends to surface cross-cutting issues better than 6 separate analyses.
Risk. Output parsing — need a reliable separator between per-section outputs. Token budget — multi-section prompts are larger system prompts but the user content (the code) stays the same. The main risk is Opus producing a less-thorough analysis per section because it's juggling multiple. Can mitigate by:
max_tokens on Opus from 64k to 128kEffort. Bigger lift — ~1–2 days. Need to add the multi-section mode, modify the orchestrator's pass loop, and update consolidation to handle the multi-output format.
Recommendation. Do this after T1+T2 are stable in production. T1+T2 already get you most of the way; T4 is the next leg.
Problem. run_asvs_security_audit.py:97 keys the inventory cache as audit-cache:inventory:asvs-{asvs}-{namespaces}. But reading the inventory prompt at lines 351–364, the inventory has no ASVS-specific content — it's pure structural extraction (imports, classes, functions, routes, security patterns). Yet the cache key includes the ASVS section, so 345 sections each compute a fresh inventory of the same files.
Fix. Change the cache key for inventory only:
# Old: inventory_cache_ns = data_store.use_namespace(f"audit-cache:inventory:{cache_key_prefix}") # New: keyed by file-set hash, not by asvs section import hashlib file_set_hash = hashlib.sha256( json.dumps(sorted(filtered_files.keys())).encode() ).hexdigest()[:16] inventory_cache_ns = data_store.use_namespace(f"audit-cache:inventory:{file_set_hash}")
When two sections share the same file scope (which they do within a pass), the second hits the cache.
Impact. Without bundling (T4), this saves ~30–60 sec per cached section on Step 3 = ~2–4 h across the run. With bundling, it‘s mostly redundant since each pass only inventories once anyway. Worth doing as a fallback for the no-discover mode and for when bundling can’t be used (single section pulled from a larger pass on retry).
Risk. Very low. Cache key is content-derived, can't go stale.
Effort. 5 lines of code.
Problem. Same issue as T5 but for relevance. Relevance does depend on the ASVS req (the prompt includes it at line 252), so per-section caching is correct. However, sections in the same chapter typically have very similar relevance patterns — e.g., 5.1.1 through 5.1.7 all care about the same input-validation files.
Fix. Two options:
Impact. Modest — relevance is only ~30–60 s per section, and ~2 h total in a baseline run. Within a pass it's already cached.
Risk. Low. Slightly less precise relevance scoring; can be mitigated by running per-section refinement only when chapter-level relevance is borderline (score 4–6).
Effort. Light option ~1 hour. Heavier option ~half-day.
Recommendation. Skip this unless T4 isn't viable. T4 makes it moot.
Problem. Many ASVS sections will have zero applicable findings for a given repo (e.g., session-management requirements against a logging library). Step 6 still pays a full Sonnet formatting call (~30–60 s) to produce a “no findings” report.
Fix. Detect zero findings before Step 6 and emit a templated stub:
findings_count = count_findings(consolidated_analysis) if sum(findings_count.values()) == 0: final_report = render_no_findings_template(asvs, asvs_description, repo_name, ...) else: # Existing Step 6 path final_report = await format_with_sonnet(...)
Impact. Across 345 sections × 11 repos, if even 30 % of sections have zero findings, that's ~1,000 skipped Sonnet calls = ~10 hours saved. Material.
Risk. Need accurate count_findings. The current heuristic (lines 808–824) checks three different patterns; could be fooled by formatting variation in the consolidated_analysis. Mitigation: only short-circuit when all three patterns return 0 AND the analysis is below a length threshold (say 500 chars).
Effort. ~1 hour.
Problem. Step 5 in run_asvs_security_audit.py:733–797 runs up to 5 rounds of pairwise consolidation when there are multiple Opus batches. Each round is a Sonnet call. For sections with 2–3 batches, this is overkill.
Fix. When len(analysis_results) <= 2, skip the iterative consolidation and do a single combined pass. When <= 4, do at most one round. Keep the multi-round logic only for genuinely large results sets.
if len(analysis_results) == 1: consolidated_analysis = analysis_results[0] elif len(analysis_results) <= 4: # Single-pass consolidation, no rounds consolidated_analysis = await single_pass_consolidate(analysis_results) else: # Existing multi-round logic ...
Impact. Saves ~30–90 sec per section that has 2–4 batches. ~3–5 hours across the run.
Risk. Quality of consolidation may dip slightly for medium-size results. Mitigation: keep the dedup/contradiction-check rules in the single-pass prompt.
Effort. ~30 lines.
Problem. Relevance filtering (Step 2) uses Sonnet at lines 295–301. The task — score 0–10 how relevant a file‘s first 200 lines are to an ASVS req — is well within Haiku 4.5’s capabilities and Haiku is roughly 1/5 the cost and 2× the speed.
Fix. Add a HAIKU model config and use it for relevance only:
HAIKU_MODEL = "us.anthropic.claude-haiku-4-5-20251001-v1:0" HAIKU_PARAMS = {"temperature": 0.3, "max_tokens": 4096} # In filter_batch(): content_resp, _ = await call_llm( provider=SONNET_PROVIDER, model=HAIKU_MODEL, # <-- swap messages=messages, parameters=HAIKU_PARAMS, timeout=60, )
Impact. Relevance is ~10 % of per-section wall-clock. Switching to Haiku halves that, plus reduces token cost meaningfully. Maybe ~3–5 h overall savings, plus material $$ savings.
Risk. Haiku might score slightly less precisely. Mitigation: validate against a sample of known-good repos; the existing fallback to score=5 on parse failure means even if Haiku misbehaves, you don't lose files outright. Also: keep Sonnet for the inventory step where it matters more.
Effort. ~10 lines.
Problem. consolidate_asvs_security_audit_reports runs after every section completes (~45 min on big repos). With T1 (parallel sections), there‘s a long tail where sections are still finishing and consolidation can’t start.
Fix. Start Phase 1 (read reports) and Phase 2 (extract findings) of consolidation incrementally, as soon as each section's PUT completes. By the time the last section finishes, Phases 1 and 2 are nearly done.
Impact. Saves ~20–30 min per repo of “everything is waiting on consolidation” time. ~3–5 h across the run.
Risk. Adds orchestration complexity. Probably not worth doing until T1–T4 are in place.
Effort. ~half-day.
Problem. The Opus system prompt at run_asvs_security_audit.py:452–547 is ~95 lines and ~3,000 tokens. It‘s identical for every Opus call (345 × 11 repos = 3,795 calls). That’s ~11M tokens of input redundancy.
Fix. Two options:
Impact. Mostly a $$ savings rather than wall-clock — Opus has high TTFT but the input tokens don't dominate generation time. Maybe 2–5 % wall-clock saving, but meaningful cost reduction.
Risk. Caching: low risk, just operational. Trimming: medium risk, could change Opus's behavior on edge cases.
Effort. Caching: ~1 hour to add the cache control headers. Trimming: ~half-day with careful regression testing on the calibration runs.
Problem. discover_codebase_architecture takes a flat ~15 min regardless of input size. For mahout (22 k LOC) or task-sdk (20 k LOC), discovery's value is questionable — these are small enough that you can just audit every file against every section.
Fix. In the orchestrator, skip discovery when loc < 30k:
if estimated_loc < 30_000: # Skip discovery, run all sections against all files in one big pass passes = [{"name": "all", "asvs_sections": all_sections, "files": [], ...}]
Impact. Saves 15 min × 3 small repos = 45 min. Tiny in the overall picture but operationally cleaner.
Risk. None for small repos.
Effort. ~15 lines.
reasoning_effort from “high” to “medium” on the Opus deep analysis. Initial testing during steve/v3 calibration showed a ~30 % drop in finding count. Speed gain ~2× but quality loss is too steep.consolidate_asvs_security_audit_reports run is naturally bounded; cross-repo merging adds complexity for negligible gain.Given the curve of effort vs. impact:
Week 1 (target: 79% reduction, ~108 h baseline)
opus_semaphore=4 — 5 minutesWeek 2 (target: 87% reduction, ~68 h)
Week 3 (operational polish, ~57 h)
Defer until needed: T6 (relevance domain-warm), T10 (streaming consolidation).
Each tier should be validated against the existing trusted-releases benchmark before moving to the next.
The pipeline's per-section reports go to GitHub, so diffing finding counts and severity distributions across versions of the pipeline is straightforward:
| Check | What to look for |
|---|---|
| Finding count drift | Total findings within ±10 % of baseline run |
| Severity distribution | Critical % and High % within ±20 % of baseline |
| Specific findings present | All previously-found CVE-equivalent issues still appear |
| Wall-clock | Within ±15 % of model prediction |
| Cost | Track Bedrock invoice — should drop ~70–85 % proportionally |
If any tier introduces a >15 % finding regression, roll back that tier and investigate.
After full optimization with 4-way orchestrator parallelism on top:
| Project | Baseline (h) | Optimized solo (h) | Optimized 4-runner (h) |
|---|---|---|---|
| grails-core | 80.0 | ~14.6 | ~3.7 |
| directory-ldap-api | 66.7 | ~11.7 | ~2.9 |
| airflow-core | 60.3 | ~10.7 | ~2.7 |
| directory-server | 59.1 | ~10.2 | ~2.6 |
| superset (backend) | 57.8 | ~10.2 | ~2.6 |
| superset (frontend) | 54.3 | ~10.0 | ~2.5 |
| airflow/providers/google | 39.5 | ~7.1 | ~1.8 |
| airflow/task-sdk | 28.4 | ~4.4 | ~1.1 |
| mina | 28.3 | ~4.4 | ~1.1 |
| log4net | 25.5 | ~4.3 | ~1.1 |
| mahout | 10.9 | ~2.3 | ~0.6 |
| TOTAL | 511 | ~90 | ~22 |
Equivalent: from “solo auditor for 3 months” to “solo auditor for 11 days” or “4 parallel runners for ~3 days”. The painful numbers become tolerable.