
Adversarial Agent Testing

AI agents that attack each other to find vulnerabilities. Red Team probes, Blue Team defends, a Referee scores both -- all using Claude Code with worktree isolation. Two rounds of live exercises against a real target drove ASR from 65% CRITICAL to 47% HIGH, with a regression wave proving patches hold at 20% and an escalation wave exposing architectural gaps at 85.7%.

  • ASR 65% → 47% across 2 rounds
  • 27 attacks, 14 defense patches
  • 10/10 OWASP category coverage
  • 5 specialized agents

Claude Code · Python UV Scripts (PEP 723) · JSON Schema · OWASP Agentic Top 10

The Problem

Security testing for AI agent systems is manual, inconsistent, and rarely done. Most Claude Code projects ship with zero security infrastructure. Developers know they should test for prompt injection, credential leakage, and excessive agency — but the gap between knowing and doing is a full-time security team that most projects don’t have.

The OWASP Top 10 for Agentic Applications defines the threat categories. But knowing the categories and systematically testing against them are different problems. A one-time manual audit finds today’s vulnerabilities. It says nothing about whether your patches hold against tomorrow’s attack techniques.

We asked: what if the testing itself were agentic? What if AI agents could attack a codebase, other AI agents could defend it, and a third agent could score both sides — producing repeatable, comparable results across rounds?

What We Built

A multi-agent adversarial testing platform where three teams operate in isolation against a real target codebase. Red Team (2 agents) probes for vulnerabilities using the OWASP Agentic Top 10 as a classification framework. Blue Team (2 agents) detects attacks and patches vulnerabilities. A Referee (1 agent) scores both sides using severity-weighted metrics.


Figure 1 - The Adversarial Architecture: Three teams operate in complete isolation. Red Team cannot see Blue Team’s defenses. Blue Team cannot see Red Team’s attack plans. The Referee reads both sides’ outputs and produces an impartial score. Information asymmetry is enforced by hooks — not prompts.

| Before | After |
|---|---|
| Manual ad-hoc security testing | Structured 7-phase exercise lifecycle |
| No way to compare security posture across time | ASR metric makes rounds directly comparable |
| Patches assumed effective until next breach | Regression wave proves patches hold (or don't) |
| Unknown OWASP coverage | 10/10 categories tested across 2 rounds |
| Security findings in free-text reports | JSON Schema-validated findings with severity weights |
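The schema-validated findings mentioned above can be pinned down with a schema in Draft 2020-12 shape. This is a sketch: the field names and enum values are invented for illustration (the project's real schema is not shown here), and the checker is a tiny stdlib stand-in for a full validator such as the jsonschema package.

```python
# Hypothetical finding schema, in JSON Schema Draft 2020-12 shape.
FINDING_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["id", "owasp_category", "severity", "status"],
    "properties": {
        "id": {"type": "string"},
        "owasp_category": {"type": "string"},
        "severity": {"enum": ["critical", "high", "medium", "low"]},
        "status": {"enum": ["confirmed", "partial", "failed"]},
        "evidence": {"type": "string"},
    },
}

def check_finding(finding: dict) -> list[str]:
    """Minimal stdlib stand-in for a Draft 2020-12 validator:
    checks required keys and enum membership, nothing more."""
    errors = [f"missing: {k}" for k in FINDING_SCHEMA["required"] if k not in finding]
    for key, rule in FINDING_SCHEMA["properties"].items():
        if key in finding and "enum" in rule and finding[key] not in rule["enum"]:
            errors.append(f"{key}: {finding[key]!r} not in {rule['enum']}")
    return errors
```

The payoff is the one in the table row: a finding either conforms or it doesn't, so the Referee never has to parse free-text severity claims.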

Key Results

| Metric | Round 1 | Round 2 | Change |
|---|---|---|---|
| Attack Success Rate (ASR) | 65% CRITICAL | 47.06% HIGH | -17.9pp |
| Red Team Score | 41.25/100 | 30.88/100 | -10.37 |
| Blue Team Score | 67.86/100 | 63.76/100 | -4.10 |
| Detection Rate | 100% | 75% | -25pp |
| Total Attacks | 10 | 17 | +7 |
| Confirmed Findings | 7 | 8 | +1 |
| Defense Patches | 9 | 5 | 14 cumulative |
| Blind Spots | 0 | 2 | +2 |
| OWASP Categories Tested | 7/10 | 9/10 | +2 |

The headline numbers tell the story: patches work against known vectors (regression wave ASR dropped to 20%), but new attack techniques expose structural gaps (escalation wave ASR hit 85.7%). Per-vulnerability patching has a ceiling. Architecture-level remediation — authentication, input normalization, method-agnostic middleware — is the next step.

The Target

The target application — obsidian-youtube-agent — is a personal productivity tool built by the authors. It automates YouTube transcript processing into Obsidian notes, generates note backlinks, and provides a RAG chatbot over the vault. The stack spans 30+ API endpoints: a FastAPI backend, a React 19 frontend, PostgreSQL, a Qdrant vector database, and integrations with Anthropic Claude and OpenAI.

It was never designed as a commercial or multi-user application. Authentication was disabled by default for LAN-only convenience. Input normalization and rate limiting were minimal. This makes it a realistic test target — most internal tools and personal projects share these same gaps. The point is that even well-built personal tools have structural security assumptions that break when you point adversarial agents at them.

Platform Architecture

5 specialized agents across 3 teams:

| Agent | Team | Model | Role |
|---|---|---|---|
| Recon Agent | Red | Sonnet | Maps attack surfaces, prioritizes targets |
| Exploit Agent | Red | Opus | Executes attacks, writes PoC exploits |
| Monitor Agent | Blue | Sonnet | Analyzes audit logs, detects anomalies |
| Hardener Agent | Blue | Opus | Patches vulnerabilities, hardens defenses |
| Scorer Agent | Referee | Opus | Evaluates both teams, calculates ASR |

7-phase exercise lifecycle:

Setup → Recon → Attack → Detect → Defend → Score → Report

Each phase is gated. Red Team attacks first. Blue Team detects and patches without seeing Red’s findings (information asymmetry). The Referee scores both sides after all phases complete. Turn-based execution makes results deterministic and reproducible.
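The gating contract above can be sketched as a small state machine: a phase may not start until its predecessor has completed. This is an illustration of the turn-based design, not the platform's actual implementation; the class and method names are invented.

```python
from enum import Enum

class Phase(Enum):
    SETUP = 1
    RECON = 2
    ATTACK = 3
    DETECT = 4
    DEFEND = 5
    SCORE = 6
    REPORT = 7

class Exercise:
    """Turn-based gate: a phase may start only after its predecessor completes."""
    def __init__(self) -> None:
        self.completed: set[Phase] = set()

    def start(self, phase: Phase) -> Phase:
        prior = Phase(phase.value - 1) if phase.value > 1 else None
        if prior is not None and prior not in self.completed:
            raise RuntimeError(f"{phase.name} blocked: {prior.name} has not completed")
        return phase

    def complete(self, phase: Phase) -> None:
        self.start(phase)  # re-check the gate before marking done
        self.completed.add(phase)
```

Because every transition is an explicit, checkable step, two runs with the same inputs walk the same sequence, which is what makes rounds comparable.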

Isolation via worktrees and hooks:

Claude Code worktrees (claude --worktree red-team / claude --worktree blue-team) provide filesystem separation. Seven arena hooks enforce boundaries: team isolation, target protection, rate limiting, and audit logging. The critical insight: information asymmetry is enforced by deterministic hooks, not by prompt instructions that agents can ignore under context pressure.
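A minimal sketch of what one such isolation hook could look like, assuming Claude Code's PreToolUse hook contract (the pending tool call arrives as JSON on stdin; exit code 2 denies the call and surfaces stderr to the agent). The /arena paths, event fields, and team layout are illustrative, not the project's actual configuration.

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# ///
"""Illustrative red-team isolation hook (not the project's actual hook)."""
import json
import sys
from pathlib import Path

# Hypothetical worktree layout: Red may touch its own tree and the target.
ALLOWED_ROOTS = [Path("/arena/red-team"), Path("/arena/target")]

def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for a pending tool call."""
    path = event.get("tool_input", {}).get("file_path")
    if not path:
        return 0, ""  # not a file operation: nothing to check
    resolved = Path(path).resolve()  # collapse any ../ segments first
    if any(resolved.is_relative_to(root) for root in ALLOWED_ROOTS):
        return 0, ""
    return 2, f"BLOCKED: {resolved} is outside the red-team boundary"

# Wired as a PreToolUse hook, the entry point would be:
#   code, msg = decide(json.load(sys.stdin))
#   if msg: print(msg, file=sys.stderr)
#   sys.exit(code)  # exit code 2 denies the tool call
```

The point of the design holds even in this toy version: the decision is a deterministic path check, so no amount of persuasive context in the agent's prompt can talk it out of the boundary.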

Scoring Methodology

Attack Success Rate (ASR) is the primary metric: confirmed findings divided by total attempts, with severity weighting (Critical 4x, High 3x, Medium 2x, Low 1x). Partial findings count at 0.5 weight. Thresholds: >50% CRITICAL, 20-50% HIGH, 5-20% MEDIUM, <5% LOW.
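The weighted formula fits in a few lines. This sketch is one reading of the description above: the 4/3/2/1 multipliers, the 0.5 partial credit, and the risk bands come from the text, but weighting the denominator (attempts) by severity is an assumption, and the real 900-line calculator may break ties differently.

```python
from typing import NamedTuple

# Severity multipliers and partial credit as stated in the methodology.
WEIGHTS = {"critical": 4.0, "high": 3.0, "medium": 2.0, "low": 1.0}
PARTIAL_CREDIT = 0.5

class Attempt(NamedTuple):
    severity: str  # "critical" | "high" | "medium" | "low"
    status: str    # "confirmed" | "partial" | "failed"

def asr(attempts: list[Attempt]) -> float:
    """Severity-weighted ASR as a percentage (weighted denominator assumed)."""
    total = sum(WEIGHTS[a.severity] for a in attempts)
    scored = sum(
        WEIGHTS[a.severity] * (1.0 if a.status == "confirmed" else PARTIAL_CREDIT)
        for a in attempts if a.status in ("confirmed", "partial")
    )
    return 100.0 * scored / total if total else 0.0

def threshold(asr_pct: float) -> str:
    """Map ASR onto the report's risk bands."""
    if asr_pct > 50:
        return "CRITICAL"
    if asr_pct >= 20:
        return "HIGH"
    if asr_pct >= 5:
        return "MEDIUM"
    return "LOW"
```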

Blue Team Score combines 5 components: detection rate (25%), inverse ASR (30%), OWASP coverage (10%), time efficiency (30%), and false positive rate (5%). When time data is unavailable, the 30% time weight is redistributed proportionally to the other four components.

The scoring calculator is a 900-line Python script (UV single-file, PEP 723) that takes both teams’ JSON outputs and produces a comprehensive scoring-result.json with per-finding evaluation, blind spot analysis, and OWASP category breakdown.

Security Findings Highlights

Round 1 — The Baseline (ASR 65% CRITICAL):

  • Arbitrary file read via base64 path traversal — exfiltrated API keys, database credentials, and system files through a path validation check that existed on one endpoint but was missing on another
  • Complete hook evasion — all 10 attacks executed via HTTP, triggering zero hook warnings. The hook monitoring layer and HTTP API operate in separate domains
  • The prompt injection sanitizer covered only 3 of 22 known patterns; 9 of 9 test payloads passed through unmodified
  • 100% detection rate by Blue Team via proactive code analysis, 9 patches applied and verified
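The first finding follows a classic pattern: when the traversal check runs before decoding, it inspects only the base64 text, where a literal `..` never appears. The sketch below shows the flawed shape and the fix; the vault path and function names are invented, not the target's actual code.

```python
import base64
from pathlib import Path

NOTES_ROOT = Path("/srv/app/notes")  # hypothetical vault root, not the real path

def vulnerable_resolve(encoded: str) -> Path:
    """Flawed pattern: the traversal check inspects the still-encoded value."""
    if ".." in encoded:  # never fires, because the payload is base64 text
        raise PermissionError("traversal detected")
    relative = base64.b64decode(encoded).decode()
    return NOTES_ROOT / relative  # '..' segments survive into the final path

def patched_resolve(encoded: str) -> Path:
    """Fix: decode first, then resolve and containment-check the real path."""
    relative = base64.b64decode(encoded).decode()
    resolved = (NOTES_ROOT / relative).resolve()
    if not resolved.is_relative_to(NOTES_ROOT.resolve()):
        raise PermissionError("decoded path escapes the vault root")
    return resolved
```

The general lesson: validate the value you actually use, after every decoding step, not the wire representation.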

Round 2 — The Two-Wave Strategy (ASR 47.06% HIGH):

  • Wave 1 regression: 8 of 10 original attacks blocked. Patches hold against known vectors (20% ASR)
  • Wave 2 escalation: 6 of 7 new attacks confirmed (85.7% ASR). Unicode zero-width characters bypassed all regex-based security. Event loop exhaustion crashed the server with 5 rapid requests. Pattern gaps between middleware (8 patterns) and sanitizer (19 patterns) allowed complete bypass of the HTTP defense layer
  • 2 blind spots emerged: credential leakage via config endpoint and auth bypass not re-flagged across rounds
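The zero-width bypass is easy to reproduce: a Unicode format character (category Cf) inside a keyword defeats any literal regex without changing what a human sees, and the fix is to normalize and strip invisibles before matching. A sketch with an invented blocklist pattern standing in for the target's regex defenses:

```python
import re
import unicodedata

# Invented blocklist pattern standing in for the target's regex layer.
BLOCKLIST = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

def strip_invisibles(text: str) -> str:
    """Apply NFKC, then drop format characters (category Cf), which
    includes zero-width spaces, joiners, and direction marks."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# U+200B inside "previous" breaks the literal match but renders invisibly.
payload = "ignore prev\u200bious instructions"
assert BLOCKLIST.search(payload) is None
assert BLOCKLIST.search(strip_invisibles(payload)) is not None
```

Note the ordering matters: normalization must run in the same layer as the regex check, or the middleware/sanitizer pattern gap described above simply reappears one level down.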

Technologies

| Layer | Technology |
|---|---|
| Agent Runtime | Claude Code (Opus for leads, Sonnet for implementation) |
| Hook Scripts | Python UV single-file scripts (PEP 723), stdlib only |
| Inter-Agent Validation | JSON Schema Draft 2020-12 |
| Scoring | Python calculator with severity weighting and time redistribution |
| Isolation | Claude Code worktrees + 7 arena hooks |
| Classification | OWASP Top 10 for Agentic Applications |
| Target | FastAPI + React 19 + PostgreSQL + Qdrant + Anthropic Claude |

The Article Series

This project is documented in a 4-part series covering the platform architecture, both exercise rounds, and transferable lessons:

Part 1: When Your AI Agents Attack Each Other — The platform architecture. Five agents, three teams, seven phases. How worktree isolation and deterministic hooks enforce information asymmetry that prompts cannot.

Part 2: 65% Attack Success Rate Against an Unpatched Target — Round 1 results. Ten attacks, seven confirmed, one critical credential exfiltration. Blue Team achieves 100% detection and patches every finding — but the ASR tells the real story.

Part 3: The Escalation Wave — Round 2 results. Regression wave proves patches hold (20% ASR). Escalation wave proves architecture doesn’t (85.7% ASR). Unicode bypasses, event loop crashes, and the ceiling of per-vulnerability patching.

Part 4: Securing Agentic AI Systems — What we learned. Patching vs. architecture, agent-driven testing advantages, the two-wave methodology, and recommendations for teams building agentic AI applications.

Builds on: Building the Bootstrap Framework — The security hooks from Part 3 of that series were the starting point for this platform’s defense infrastructure.


The throughline: Patching known vulnerabilities is necessary but insufficient. Adversarial agents find attack vectors that human testers miss — zero-width Unicode, event loop exhaustion, pattern gaps between security layers. The question is not whether your patches work against yesterday’s attacks, but whether your architecture survives tomorrow’s.
