
Adversarial Agent Testing

AI agents that attack each other to find vulnerabilities. Red Team probes, Blue Team defends, a Referee scores both -- all using Claude Code with worktree isolation. Two rounds of live exercises against a real target drove ASR from 65% CRITICAL to 47% HIGH, with a regression wave proving patches hold at 20% and an escalation wave exposing architectural gaps at 85.7%.

  • ASR 65% → 47% across 2 rounds
  • 27 attacks, 14 defense patches
  • 10/10 OWASP category coverage
  • 5 specialized agents

Claude Code · Python UV Scripts (PEP 723) · JSON Schema · OWASP Agentic Top 10

The Problem

Security testing for AI agent systems is manual, inconsistent, and rarely done. Most Claude Code projects ship with zero security infrastructure. Developers know they should test for prompt injection, credential leakage, and excessive agency — but the gap between knowing and doing is a full-time security team that most projects don’t have.

The OWASP Top 10 for Agentic Applications defines the threat categories. But knowing the categories and systematically testing against them are different problems. A one-time manual audit finds today’s vulnerabilities. It says nothing about whether your patches hold against tomorrow’s attack techniques.

We asked: what if the testing itself were agentic? What if AI agents could attack a codebase, other AI agents could defend it, and a third agent could score both sides — producing repeatable, comparable results across rounds?

What We Built

A multi-agent adversarial testing platform where three teams operate in isolation against a real target codebase. Red Team (2 agents) probes for vulnerabilities using the OWASP Agentic Top 10 as a classification framework. Blue Team (2 agents) detects attacks and patches vulnerabilities. A Referee (1 agent) scores both sides using severity-weighted metrics.


Figure 1 - The Adversarial Architecture: Three teams operate in complete isolation. Red Team cannot see Blue Team’s defenses. Blue Team cannot see Red Team’s attack plans. The Referee reads both sides’ outputs and produces an impartial score. Information asymmetry is enforced by hooks — not prompts.

| Before | After |
|---|---|
| Manual ad-hoc security testing | Structured 7-phase exercise lifecycle |
| No way to compare security posture across time | ASR metric makes rounds directly comparable |
| Patches assumed effective until next breach | Regression wave proves patches hold (or don't) |
| Unknown OWASP coverage | 10/10 categories tested across 2 rounds |
| Security findings in free-text reports | JSON Schema-validated findings with severity weights |
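The schema-validated findings mentioned above can be pinned down with a schema in Draft 2020-12 shape. This is a sketch: the field names and enum values are invented for illustration (the project's real schema is not shown here), and the checker is a tiny stdlib stand-in for a full validator such as the jsonschema package.

```python
# Hypothetical finding schema, in JSON Schema Draft 2020-12 shape.
FINDING_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["id", "owasp_category", "severity", "status"],
    "properties": {
        "id": {"type": "string"},
        "owasp_category": {"type": "string"},
        "severity": {"enum": ["critical", "high", "medium", "low"]},
        "status": {"enum": ["confirmed", "partial", "failed"]},
        "evidence": {"type": "string"},
    },
}

def check_finding(finding: dict) -> list[str]:
    """Minimal stdlib stand-in for a Draft 2020-12 validator:
    checks required keys and enum membership, nothing more."""
    errors = [f"missing: {k}" for k in FINDING_SCHEMA["required"] if k not in finding]
    for key, rule in FINDING_SCHEMA["properties"].items():
        if key in finding and "enum" in rule and finding[key] not in rule["enum"]:
            errors.append(f"{key}: {finding[key]!r} not in {rule['enum']}")
    return errors
```

The payoff is the one in the table row: a finding either conforms or it doesn't, so the Referee never has to parse free-text severity claims.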

Key Results

| Metric | Round 1 | Round 2 | Change |
|---|---|---|---|
| Attack Success Rate (ASR) | 65% CRITICAL | 47.06% HIGH | -17.9pp |
| Red Team Score | 41.25/100 | 30.88/100 | -10.37 |
| Blue Team Score | 67.86/100 | 63.76/100 | -4.10 |
| Detection Rate | 100% | 75% | -25pp |
| Total Attacks | 10 | 17 | +7 |
| Confirmed Findings | 7 | 8 | +1 |
| Defense Patches | 9 | 5 | 14 cumulative |
| Blind Spots | 0 | 2 | +2 |
| OWASP Categories Tested | 7/10 | 9/10 | +2 |

The headline numbers tell the story: patches work against known vectors (regression wave ASR dropped to 20%), but new attack techniques expose structural gaps (escalation wave ASR hit 85.7%). Per-vulnerability patching has a ceiling. Architecture-level remediation — authentication, input normalization, method-agnostic middleware — is the next step.

The Target

The target application — obsidian-youtube-agent — is a personal productivity tool built by the authors. It automates YouTube transcript processing into Obsidian notes, generates note backlinks, and provides a RAG chatbot over the vault. The stack spans 30+ API endpoints: a FastAPI backend, a React 19 frontend, PostgreSQL, a Qdrant vector database, and integrations with Anthropic Claude and OpenAI.

It was never designed as a commercial or multi-user application. Authentication was disabled by default for LAN-only convenience. Input normalization and rate limiting were minimal. This makes it a realistic test target — most internal tools and personal projects share these same gaps. The point is that even well-built personal tools have structural security assumptions that break when you point adversarial agents at them.

Platform Architecture

5 specialized agents across 3 teams:

| Agent | Team | Model | Role |
|---|---|---|---|
| Recon Agent | Red | Sonnet | Maps attack surfaces, prioritizes targets |
| Exploit Agent | Red | Opus | Executes attacks, writes PoC exploits |
| Monitor Agent | Blue | Sonnet | Analyzes audit logs, detects anomalies |
| Hardener Agent | Blue | Opus | Patches vulnerabilities, hardens defenses |
| Scorer Agent | Referee | Opus | Evaluates both teams, calculates ASR |

7-phase exercise lifecycle:

Setup → Recon → Attack → Detect → Defend → Score → Report

Each phase is gated. Red Team attacks first. Blue Team detects and patches without seeing Red’s findings (information asymmetry). The Referee scores both sides after all phases complete. Turn-based execution makes results deterministic and reproducible.
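The gating contract above can be sketched as a small state machine: a phase may not start until its predecessor has completed. This is an illustration of the turn-based design, not the platform's actual implementation; the class and method names are invented.

```python
from enum import Enum

class Phase(Enum):
    SETUP = 1
    RECON = 2
    ATTACK = 3
    DETECT = 4
    DEFEND = 5
    SCORE = 6
    REPORT = 7

class Exercise:
    """Turn-based gate: a phase may start only after its predecessor completes."""
    def __init__(self) -> None:
        self.completed: set[Phase] = set()

    def start(self, phase: Phase) -> Phase:
        prior = Phase(phase.value - 1) if phase.value > 1 else None
        if prior is not None and prior not in self.completed:
            raise RuntimeError(f"{phase.name} blocked: {prior.name} has not completed")
        return phase

    def complete(self, phase: Phase) -> None:
        self.start(phase)  # re-check the gate before marking done
        self.completed.add(phase)
```

Because every transition is an explicit, checkable step, two runs with the same inputs walk the same sequence, which is what makes rounds comparable.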

Isolation via worktrees and hooks:

Claude Code worktrees (claude --worktree red-team / claude --worktree blue-team) provide filesystem separation. Seven arena hooks enforce boundaries: team isolation, target protection, rate limiting, and audit logging. The critical insight: information asymmetry is enforced by deterministic hooks, not by prompt instructions that agents can ignore under context pressure.
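A minimal sketch of what one such isolation hook could look like, assuming Claude Code's PreToolUse hook contract (the pending tool call arrives as JSON on stdin; exit code 2 denies the call and surfaces stderr to the agent). The /arena paths, event fields, and team layout are illustrative, not the project's actual configuration.

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# ///
"""Illustrative red-team isolation hook (not the project's actual hook)."""
import json
import sys
from pathlib import Path

# Hypothetical worktree layout: Red may touch its own tree and the target.
ALLOWED_ROOTS = [Path("/arena/red-team"), Path("/arena/target")]

def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for a pending tool call."""
    path = event.get("tool_input", {}).get("file_path")
    if not path:
        return 0, ""  # not a file operation: nothing to check
    resolved = Path(path).resolve()  # collapse any ../ segments first
    if any(resolved.is_relative_to(root) for root in ALLOWED_ROOTS):
        return 0, ""
    return 2, f"BLOCKED: {resolved} is outside the red-team boundary"

# Wired as a PreToolUse hook, the entry point would be:
#   code, msg = decide(json.load(sys.stdin))
#   if msg: print(msg, file=sys.stderr)
#   sys.exit(code)  # exit code 2 denies the tool call
```

The point of the design holds even in this toy version: the decision is a deterministic path check, so no amount of persuasive context in the agent's prompt can talk it out of the boundary.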

Scoring Methodology

Attack Success Rate (ASR) is the primary metric: confirmed findings divided by total attempts, with severity weighting (Critical 4x, High 3x, Medium 2x, Low 1x). Partial findings count at 0.5 weight. Thresholds: >50% CRITICAL, 20-50% HIGH, 5-20% MEDIUM, <5% LOW.
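The weighted formula fits in a few lines. This sketch is one reading of the description above: the 4/3/2/1 multipliers, the 0.5 partial credit, and the risk bands come from the text, but weighting the denominator (attempts) by severity is an assumption, and the real 900-line calculator may break ties differently.

```python
from typing import NamedTuple

# Severity multipliers and partial credit as stated in the methodology.
WEIGHTS = {"critical": 4.0, "high": 3.0, "medium": 2.0, "low": 1.0}
PARTIAL_CREDIT = 0.5

class Attempt(NamedTuple):
    severity: str  # "critical" | "high" | "medium" | "low"
    status: str    # "confirmed" | "partial" | "failed"

def asr(attempts: list[Attempt]) -> float:
    """Severity-weighted ASR as a percentage (weighted denominator assumed)."""
    total = sum(WEIGHTS[a.severity] for a in attempts)
    scored = sum(
        WEIGHTS[a.severity] * (1.0 if a.status == "confirmed" else PARTIAL_CREDIT)
        for a in attempts if a.status in ("confirmed", "partial")
    )
    return 100.0 * scored / total if total else 0.0

def threshold(asr_pct: float) -> str:
    """Map ASR onto the report's risk bands."""
    if asr_pct > 50:
        return "CRITICAL"
    if asr_pct >= 20:
        return "HIGH"
    if asr_pct >= 5:
        return "MEDIUM"
    return "LOW"
```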

Blue Team Score combines 5 components: detection rate (25%), inverse ASR (30%), OWASP coverage (10%), time efficiency (30%), and false positive rate (5%). When time data is unavailable, the 30% time weight is redistributed proportionally to the other four components.

The scoring calculator is a 900-line Python script (UV single-file, PEP 723) that takes both teams’ JSON outputs and produces a comprehensive scoring-result.json with per-finding evaluation, blind spot analysis, and OWASP category breakdown.

Security Findings Highlights

Round 1 — The Baseline (ASR 65% CRITICAL):

  • Arbitrary file read via base64 path traversal — exfiltrated API keys, database credentials, and system files through a path validation check that existed on one endpoint but was missing on another
  • Complete hook evasion — all 10 attacks executed via HTTP, triggering zero hook warnings. The hook monitoring layer and HTTP API operate in separate domains
  • The prompt injection sanitizer covered only 3 of 22 known patterns; 9 of 9 test payloads passed through unmodified
  • 100% detection rate by Blue Team via proactive code analysis, 9 patches applied and verified
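The first finding follows a classic pattern: when the traversal check runs before decoding, it inspects only the base64 text, where a literal `..` never appears. The sketch below shows the flawed shape and the fix; the vault path and function names are invented, not the target's actual code.

```python
import base64
from pathlib import Path

NOTES_ROOT = Path("/srv/app/notes")  # hypothetical vault root, not the real path

def vulnerable_resolve(encoded: str) -> Path:
    """Flawed pattern: the traversal check inspects the still-encoded value."""
    if ".." in encoded:  # never fires, because the payload is base64 text
        raise PermissionError("traversal detected")
    relative = base64.b64decode(encoded).decode()
    return NOTES_ROOT / relative  # '..' segments survive into the final path

def patched_resolve(encoded: str) -> Path:
    """Fix: decode first, then resolve and containment-check the real path."""
    relative = base64.b64decode(encoded).decode()
    resolved = (NOTES_ROOT / relative).resolve()
    if not resolved.is_relative_to(NOTES_ROOT.resolve()):
        raise PermissionError("decoded path escapes the vault root")
    return resolved
```

The general lesson: validate the value you actually use, after every decoding step, not the wire representation.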

Round 2 — The Two-Wave Strategy (ASR 47.06% HIGH):

  • Wave 1 regression: 8 of 10 original attacks blocked. Patches hold against known vectors (20% ASR)
  • Wave 2 escalation: 6 of 7 new attacks confirmed (85.7% ASR). Unicode zero-width characters bypassed all regex-based security. Event loop exhaustion crashed the server with 5 rapid requests. Pattern gaps between middleware (8 patterns) and sanitizer (19 patterns) allowed complete bypass of the HTTP defense layer
  • 2 blind spots emerged: credential leakage via config endpoint and auth bypass not re-flagged across rounds
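The zero-width bypass is easy to reproduce: a Unicode format character (category Cf) inside a keyword defeats any literal regex without changing what a human sees, and the fix is to normalize and strip invisibles before matching. A sketch with an invented blocklist pattern standing in for the target's regex defenses:

```python
import re
import unicodedata

# Invented blocklist pattern standing in for the target's regex layer.
BLOCKLIST = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

def strip_invisibles(text: str) -> str:
    """Apply NFKC, then drop format characters (category Cf), which
    includes zero-width spaces, joiners, and direction marks."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# U+200B inside "previous" breaks the literal match but renders invisibly.
payload = "ignore prev\u200bious instructions"
assert BLOCKLIST.search(payload) is None
assert BLOCKLIST.search(strip_invisibles(payload)) is not None
```

Note the ordering matters: normalization must run in the same layer as the regex check, or the middleware/sanitizer pattern gap described above simply reappears one level down.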

Technologies

| Layer | Technology |
|---|---|
| Agent Runtime | Claude Code (Opus for leads, Sonnet for implementation) |
| Hook Scripts | Python UV single-file scripts (PEP 723), stdlib only |
| Inter-Agent Validation | JSON Schema Draft 2020-12 |
| Scoring | Python calculator with severity weighting and time redistribution |
| Isolation | Claude Code worktrees + 7 arena hooks |
| Classification | OWASP Top 10 for Agentic Applications |
| Target | FastAPI + React 19 + PostgreSQL + Qdrant + Anthropic Claude |

The Article Series

This project is documented in a 4-part series covering the platform architecture, both exercise rounds, and transferable lessons:

Part 1: When Your AI Agents Attack Each Other — The platform architecture. Five agents, three teams, seven phases. How worktree isolation and deterministic hooks enforce information asymmetry that prompts cannot.

Part 2: 65% Attack Success Rate Against an Unpatched Target — Round 1 results. Ten attacks, seven confirmed, one critical credential exfiltration. Blue Team achieves 100% detection and patches every finding — but the ASR tells the real story.

Part 3: The Escalation Wave — Round 2 results. Regression wave proves patches hold (20% ASR). Escalation wave proves architecture doesn’t (85.7% ASR). Unicode bypasses, event loop crashes, and the ceiling of per-vulnerability patching.

Part 4: Securing Agentic AI Systems — What we learned. Patching vs. architecture, agent-driven testing advantages, the two-wave methodology, and recommendations for teams building agentic AI applications.

Builds on: Building the Bootstrap Framework — The security hooks from Part 3 of that series were the starting point for this platform’s defense infrastructure.


The throughline: Patching known vulnerabilities is necessary but insufficient. Adversarial agents find attack vectors that human testers miss — zero-width Unicode, event loop exhaustion, pattern gaps between security layers. The question is not whether your patches work against yesterday’s attacks, but whether your architecture survives tomorrow’s.
