Adversarial Agent Testing
AI agents that attack each other to find vulnerabilities. Red Team probes, Blue Team defends, a Referee scores both -- all using Claude Code with worktree isolation. Two rounds of live exercises against a real target drove ASR from 65% CRITICAL to 47% HIGH, with a regression wave proving patches hold at 20% and an escalation wave exposing architectural gaps at 85.7%.
The Problem
Security testing for AI agent systems is manual, inconsistent, and rarely done. Most Claude Code projects ship with zero security infrastructure. Developers know they should test for prompt injection, credential leakage, and excessive agency — but the gap between knowing and doing is a full-time security team that most projects don’t have.
The OWASP Top 10 for Agentic Applications defines the threat categories. But knowing the categories and systematically testing against them are different problems. A one-time manual audit finds today’s vulnerabilities. It says nothing about whether your patches hold against tomorrow’s attack techniques.
We asked: what if the testing itself were agentic? What if AI agents could attack a codebase, other AI agents could defend it, and a third agent could score both sides — producing repeatable, comparable results across rounds?
What We Built
A multi-agent adversarial testing platform where three teams operate in isolation against a real target codebase. Red Team (2 agents) probes for vulnerabilities using the OWASP Agentic Top 10 as a classification framework. Blue Team (2 agents) detects attacks and patches vulnerabilities. A Referee (1 agent) scores both sides using severity-weighted metrics.

Figure 1 - The Adversarial Architecture: Three teams operate in complete isolation. Red Team cannot see Blue Team’s defenses. Blue Team cannot see Red Team’s attack plans. The Referee reads both sides’ outputs and produces an impartial score. Information asymmetry is enforced by hooks — not prompts.
| Before | After |
|---|---|
| Manual ad-hoc security testing | Structured 7-phase exercise lifecycle |
| No way to compare security posture across time | ASR metric makes rounds directly comparable |
| Patches assumed effective until next breach | Regression wave proves patches hold (or don’t) |
| Unknown OWASP coverage | 10/10 categories tested across 2 rounds |
| Security findings in free-text reports | JSON Schema-validated findings with severity weights |
Key Results
| Metric | Round 1 | Round 2 | Change |
|---|---|---|---|
| Attack Success Rate (ASR) | 65% CRITICAL | 47.06% HIGH | -17.94pp |
| Red Team Score | 41.25/100 | 30.88/100 | -10.37 |
| Blue Team Score | 67.86/100 | 63.76/100 | -4.10 |
| Detection Rate | 100% | 75% | -25pp |
| Total Attacks | 10 | 17 | +7 |
| Confirmed Findings | 7 | 8 | +1 |
| Defense Patches | 9 | 5 | 14 cumulative |
| Blind Spots | 0 | 2 | +2 |
| OWASP Categories Tested | 7/10 | 9/10 | +2 |
The headline numbers tell the story: patches work against known vectors (regression wave ASR dropped to 20%), but new attack techniques expose structural gaps (escalation wave ASR hit 85.7%). Per-vulnerability patching has a ceiling. Architecture-level remediation — authentication, input normalization, method-agnostic middleware — is the next step.
The Target
The target application — obsidian-youtube-agent — is a personal productivity tool built by the authors. It automates YouTube transcript processing into Obsidian notes, generates note backlinks, and provides a RAG chatbot over the vault. The stack: 30+ API endpoints on a FastAPI backend, a React 19 frontend, PostgreSQL, a Qdrant vector database, and integrations with Anthropic Claude and OpenAI.
It was never designed as a commercial or multi-user application. Authentication was disabled by default for LAN-only convenience. Input normalization and rate limiting were minimal. This makes it a realistic test target — most internal tools and personal projects share these same gaps. The point is that even well-built personal tools have structural security assumptions that break when you point adversarial agents at them.
Platform Architecture
5 specialized agents across 3 teams:
| Agent | Team | Model | Role |
|---|---|---|---|
| Recon Agent | Red | Sonnet | Maps attack surfaces, prioritizes targets |
| Exploit Agent | Red | Opus | Executes attacks, writes PoC exploits |
| Monitor Agent | Blue | Sonnet | Analyzes audit logs, detects anomalies |
| Hardener Agent | Blue | Opus | Patches vulnerabilities, hardens defenses |
| Scorer Agent | Referee | Opus | Evaluates both teams, calculates ASR |
7-phase exercise lifecycle:
Setup → Recon → Attack → Detect → Defend → Score → Report

Each phase is gated. Red Team attacks first. Blue Team detects and patches without seeing Red’s findings (information asymmetry). The Referee scores both sides after all phases complete. Turn-based execution makes results deterministic and reproducible.
Isolation via worktrees and hooks:
Claude Code worktrees (`claude --worktree red-team` / `claude --worktree blue-team`) provide filesystem separation. Seven arena hooks enforce boundaries: team isolation, target protection, rate limiting, and audit logging. The critical insight: information asymmetry is enforced by deterministic hooks, not by prompt instructions that agents can ignore under context pressure.
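A deterministic isolation check of the kind the arena hooks perform might look like this (team names and worktree roots are illustrative, not the platform's actual configuration):

```python
import os.path

# Hypothetical worktree roots; each team is confined to its own subtree.
TEAM_ROOTS = {
    "red-team": "/arena/red-team",
    "blue-team": "/arena/blue-team",
}

def is_allowed(team: str, path: str) -> bool:
    """A team may only touch files under its own worktree root.

    normpath collapses ../ segments first, so traversal attempts are judged
    by where they actually land. A check like this runs as code, so unlike a
    prompt instruction it cannot be argued out of the rule.
    """
    root = TEAM_ROOTS[team]
    resolved = os.path.normpath(path)
    return resolved == root or resolved.startswith(root + os.sep)
```

A pre-tool-use hook would apply a check like this to every file operation and signal a block on failure, regardless of what the agent's context says.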
Scoring Methodology
Attack Success Rate (ASR) is the primary metric: confirmed findings divided by total attempts, with severity weighting (Critical 4x, High 3x, Medium 2x, Low 1x). Partial findings count at 0.5 weight. Thresholds: >50% CRITICAL, 20-50% HIGH, 5-20% MEDIUM, <5% LOW.
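Under the stated weights, the ASR computation can be sketched as follows (the platform's actual 900-line calculator does far more; the attempt-record field names here are assumptions):

```python
SEVERITY_WEIGHT = {"critical": 4, "high": 3, "medium": 2, "low": 1}

def attack_success_rate(attempts: list[dict]) -> float:
    """Severity-weighted ASR: confirmed weight over total attempted weight.

    Each attempt is assumed to look like
    {"severity": "critical", "status": "confirmed" | "partial" | "failed"}.
    Partial findings count at 0.5 weight, per the scoring rules.
    """
    total = sum(SEVERITY_WEIGHT[a["severity"]] for a in attempts)
    scored = sum(
        SEVERITY_WEIGHT[a["severity"]] * (1.0 if a["status"] == "confirmed" else 0.5)
        for a in attempts
        if a["status"] in ("confirmed", "partial")
    )
    return 100 * scored / total if total else 0.0

def asr_band(asr: float) -> str:
    """Map an ASR percentage onto the methodology's threshold bands."""
    if asr > 50:
        return "CRITICAL"
    if asr >= 20:
        return "HIGH"
    if asr >= 5:
        return "MEDIUM"
    return "LOW"
```

Note how the weighting changes the headline: one confirmed Critical against one failed Low attack scores 4/5 = 80% ASR, not the unweighted 50%.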
Blue Team Score combines 5 components: detection rate (25%), inverse ASR (30%), OWASP coverage (10%), time efficiency (30%), and false positive rate (5%). When time data is unavailable, the 30% time weight is redistributed proportionally to the other four components.
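The proportional redistribution of the time weight can be expressed as a small helper (the weights come from the text; the component keys are illustrative):

```python
BLUE_WEIGHTS = {
    "detection_rate": 0.25,
    "inverse_asr": 0.30,
    "owasp_coverage": 0.10,
    "time_efficiency": 0.30,
    "false_positive_rate": 0.05,
}

def blue_team_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores (each on a 0-100 scale).

    If time data is unavailable, the 30% time weight is redistributed
    proportionally across the remaining four components, so the weights
    still sum to 1.0 and the score stays on the same 0-100 scale.
    """
    weights = dict(BLUE_WEIGHTS)
    if "time_efficiency" not in components:
        dropped = weights.pop("time_efficiency")
        remaining = sum(weights.values())  # 0.70
        weights = {k: w + dropped * (w / remaining) for k, w in weights.items()}
    return sum(components[k] * w for k, w in weights.items())
```

Proportional (rather than equal) redistribution preserves the relative importance of the surviving components: inverse ASR still counts six times as much as false positives.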
The scoring calculator is a 900-line Python script (UV single-file, PEP 723) that takes both teams’ JSON outputs and produces a comprehensive scoring-result.json with per-finding evaluation, blind spot analysis, and OWASP category breakdown.
Security Findings Highlights
Round 1 — The Baseline (ASR 65% CRITICAL):
- Arbitrary file read via base64 path traversal — exfiltrated API keys, database credentials, and system files by exploiting path validation that existed on one endpoint but was missing on another
- Complete hook evasion — all 10 attacks executed via HTTP, triggering zero hook warnings. The hook monitoring layer and HTTP API operate in separate domains
- The prompt injection sanitizer covered only 3 of 22 known patterns; 9 of 9 test payloads passed through unmodified
- Blue Team achieved a 100% detection rate via proactive code analysis; 9 patches were applied and verified
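The shape of the first finding is worth spelling out (endpoint and parameter names here are hypothetical): a path parameter that arrives base64-encoded sails past naive string checks, because the traversal sequence only appears after decoding. The fix is to decode, resolve, and bounds-check before touching the filesystem.

```python
import base64
from pathlib import Path

ALLOWED_ROOT = Path("/app/data").resolve()  # illustrative vault root

def unsafe_path(encoded: str) -> str:
    """Vulnerable pattern: decode and use the path with no validation.

    A payload like base64("../../etc/passwd") never contains "../" on the
    wire, so a pre-decode substring check never fires.
    """
    return base64.b64decode(encoded).decode()

def safe_resolve(encoded: str) -> Path:
    """Patched pattern: decode, resolve, and confirm the path stays in bounds."""
    target = (ALLOWED_ROOT / base64.b64decode(encoded).decode()).resolve()
    if not target.is_relative_to(ALLOWED_ROOT):
        raise PermissionError(f"path escapes allowed root: {target}")
    return target
```

The Round 1 lesson is the second function: validation must run on the fully decoded, fully resolved path, and it must run on every endpoint, not just one.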
Round 2 — The Two-Wave Strategy (ASR 47.06% HIGH):
- Wave 1 regression: 8 of 10 original attacks blocked. Patches hold against known vectors (20% ASR)
- Wave 2 escalation: 6 of 7 new attacks confirmed (85.7% ASR). Unicode zero-width characters bypassed all regex-based security. Event loop exhaustion crashed the server with 5 rapid requests. Pattern gaps between middleware (8 patterns) and sanitizer (19 patterns) allowed complete bypass of the HTTP defense layer
- 2 blind spots emerged: credential leakage via a config endpoint and an auth bypass, neither of which was re-flagged across rounds
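The Wave 2 Unicode finding generalizes beyond this target: zero-width characters split keywords so regex filters never match, and the fix is to strip and normalize before matching. A minimal sketch (the blocked pattern here is illustrative, not the target's actual filter list):

```python
import re
import unicodedata

BLOCKED = re.compile(r"ignore\s+previous\s+instructions", re.IGNORECASE)

# Zero-width space/joiners and BOM, mapped to None so translate() deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def naive_filter(text: str) -> bool:
    """Regex-only check: defeated by zero-width characters inside keywords."""
    return BLOCKED.search(text) is None

def normalized_filter(text: str) -> bool:
    """Strip zero-width characters and NFKC-normalize before matching."""
    cleaned = unicodedata.normalize("NFKC", text.translate(ZERO_WIDTH))
    return BLOCKED.search(cleaned) is None
```

A payload like `"ig\u200bnore previous instructions"` passes the naive filter (the zero-width space breaks the keyword) but is caught after normalization. This is also why the middleware/sanitizer pattern-count mismatch mattered: two regex layers with different pattern lists share the same structural weakness.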
Technologies
| Layer | Technology |
|---|---|
| Agent Runtime | Claude Code (Opus for leads, Sonnet for implementation) |
| Hook Scripts | Python UV single-file scripts (PEP 723), stdlib only |
| Inter-Agent Validation | JSON Schema Draft 2020-12 |
| Scoring | Python calculator with severity weighting and time redistribution |
| Isolation | Claude Code worktrees + 7 arena hooks |
| Classification | OWASP Top 10 for Agentic Applications |
| Target | FastAPI + React 19 + PostgreSQL + Qdrant + Anthropic Claude |
The Article Series
This project is documented in a 4-part series covering the platform architecture, both exercise rounds, and transferable lessons:
Part 1: When Your AI Agents Attack Each Other — The platform architecture. Five agents, three teams, seven phases. How worktree isolation and deterministic hooks enforce information asymmetry that prompts cannot.
Part 2: 65% Attack Success Rate Against an Unpatched Target — Round 1 results. Ten attacks, seven confirmed, one critical credential exfiltration. Blue Team achieves 100% detection and patches every finding — but the ASR tells the real story.
Part 3: The Escalation Wave — Round 2 results. Regression wave proves patches hold (20% ASR). Escalation wave proves architecture doesn’t (85.7% ASR). Unicode bypasses, event loop crashes, and the ceiling of per-vulnerability patching.
Part 4: Securing Agentic AI Systems — What we learned. Patching vs. architecture, agent-driven testing advantages, the two-wave methodology, and recommendations for teams building agentic AI applications.
Builds on: Building the Bootstrap Framework — The security hooks from Part 3 of that series were the starting point for this platform’s defense infrastructure.
The throughline: Patching known vulnerabilities is necessary but insufficient. Adversarial agents find attack vectors that human testers miss — zero-width Unicode, event loop exhaustion, pattern gaps between security layers. The question is not whether your patches work against yesterday’s attacks, but whether your architecture survives tomorrow’s.