Adversarial Agent Testing: When Your AI Agents Attack Each Other
Part 1 of 4: Adversarial Agent Testing

Five Claude Code agents. Three teams. One target. The Red Team found 7 vulnerabilities in 5 minutes. The Blue Team patched every one. Then Round 2 proved that patches alone aren’t enough.


Figure 1 - The Adversarial Architecture: Red Team (recon + exploit agents) attacks the target from one worktree. Blue Team (monitor + hardener agents) defends from another. The Referee scores both sides from a read-only vantage point. Information asymmetry is enforced by hooks — not prompts.


Two rounds of adversarial exercises against a real application. 27 total attacks across 10 OWASP categories. 14 defense patches adding 550 lines of security hardening. An Attack Success Rate that dropped from 65% CRITICAL to 47% HIGH — but only because regression attacks hit patched defenses. New attack techniques succeeded at 85.7%.

This is the story of building a platform where AI agents test each other’s security, and what we learned when we actually ran it.

The Bootstrap Framework series built the infrastructure — hooks, skills, agents, and slash commands that generate Claude Code configurations for any project. Part 3 of that series, Securing Agentic AI, established the security patterns: defense-in-depth rings, OWASP coverage, per-call hooks. This series takes those patterns and proves they work by running adversarial exercises against a live target.

The Problem: Security Testing That Scales

Most AI agent security follows a familiar pattern. A developer reads about prompt injection, adds a basic sanitizer, and moves on. Maybe someone runs a one-time audit. The sanitizer covers 3 of 22 known injection patterns, and nobody discovers this until an attacker does.

The gap is not knowledge — the OWASP Top 10 for Agentic Applications catalogs the threats clearly. The gap is execution. Systematic testing requires:

  1. An attacker who thinks creatively about bypass techniques
  2. A defender who patches without seeing the attack plan
  3. A scorer who evaluates both sides impartially
  4. A methodology that makes results comparable across rounds

Humans can do all of this, but it takes a full security team and weeks of calendar time. For the vast majority of projects — internal tools, personal applications, startup MVPs — that investment never happens.

We built a platform where Claude Code agents fill each role. The agents are not better than human security testers (they miss things humans wouldn’t, and find things humans wouldn’t think to look for). But they are available, fast, and consistent. A full exercise — from initial reconnaissance through scoring and reporting — completes in under an hour.

The Architecture: Three Teams, Total Isolation

The platform runs three teams against a target codebase, each operating in its own Claude Code worktree:


Figure 2 - Team Composition: Red Team uses Sonnet for fast reconnaissance and Opus for complex exploitation. Blue Team mirrors the pattern: Sonnet for monitoring, Opus for patching. The Referee uses Opus for the nuanced evaluation of both sides. Each agent has explicitly restricted tool access.

Red Team operates two agents:

  • Recon Agent (Sonnet, read-only) maps the attack surface. It reads target code, identifies injection points, catalogs API endpoints, and produces a prioritized attack plan. It cannot write files or execute commands beyond read-only operations.
  • Exploit Agent (Opus) executes the attack plan. It crafts PoC exploits, sends HTTP requests to the target, documents evidence, and writes structured findings in JSON Schema format. It operates with SECURITY_SCAN_MODE=permissive so it can write exploit code without triggering security hooks.

Blue Team operates two agents:

  • Monitor Agent (Sonnet, read-only) analyzes audit logs and codebase state for evidence of attacks. It does not see Red Team’s findings — it must discover vulnerabilities independently. This information asymmetry is critical to the exercise’s validity.
  • Hardener Agent (Opus) patches detected vulnerabilities. It writes code, adds middleware, extends sanitizers, and verifies each patch with tests. It operates with SECURITY_SCAN_MODE=strict to ensure patches don’t introduce new vulnerabilities.

Referee operates one agent:

  • Scorer Agent (Opus, read-only for target) reads both teams’ outputs and calculates metrics. It produces a structured scoring-result.json and a human-readable exercise report. Its only write access is to the scoring and reports directories.

Information Asymmetry: The Design That Makes It Work

The single most important architectural decision: Red Team cannot see Blue Team’s defenses, and Blue Team cannot see Red Team’s attack plans. This is not enforced by prompt instructions telling agents to “please don’t look.” It is enforced by a PreToolUse hook that reads the ACTIVE_TEAM environment variable and blocks file access across team boundaries.


Figure 3 - Hook-Enforced Information Asymmetry: The pretooluse_team_isolation.py hook reads ACTIVE_TEAM and EXERCISE_PHASE environment variables on every tool call. Red Team agents attempting to read Blue’s detection/ or patches/ directories get a deterministic block. No amount of prompt engineering can bypass this — the hook runs before the tool executes.

This matters because prompt instructions achieve roughly 90% compliance. An agent under context pressure — processing complex code, managing long conversations, juggling multiple findings — occasionally ignores soft constraints. A hook achieves 100% compliance. That 10% gap is the difference between a valid security exercise and theater.

KEY INSIGHT: Information asymmetry in adversarial testing must be enforced by deterministic mechanisms (hooks, file permissions, worktree isolation) — not by prompt instructions. Prompts achieve 90% compliance. Hooks achieve 100%. The 10% gap is where exercises lose their validity.
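To make the mechanism concrete, here is a minimal sketch of what a team-isolation PreToolUse hook can look like. The environment variable and the hook contract (tool-call JSON on stdin, exit 0 to allow, exit 2 to block with feedback on stderr) follow Claude Code's hook conventions; the directory layout and the `decide()` helper are illustrative, not the actual `pretooluse_team_isolation.py`.

```python
import json

# Directories each team may NOT touch (layout assumed for illustration).
BLOCKED_PREFIXES = {
    "red": ("blue/detection/", "blue/patches/"),
    "blue": ("red/recon/", "red/exploits/"),
    "referee": (),  # the Referee reads both; its writes are policed elsewhere
}

def decide(team: str, payload: str) -> int:
    """Return a hook exit code: 0 = allow, 2 = deterministic block."""
    try:
        path = json.loads(payload).get("tool_input", {}).get("file_path", "")
    except (json.JSONDecodeError, AttributeError):
        return 0  # fail-open: malformed input allows the operation
    for prefix in BLOCKED_PREFIXES.get(team, ()):
        if prefix in path:
            return 2  # blocked before the tool ever executes
    return 0

# In the real hook, the payload arrives on stdin and the script exits with
# decide(os.environ["ACTIVE_TEAM"], sys.stdin.read()).
call = json.dumps({"tool_name": "Read",
                   "tool_input": {"file_path": "blue/detection/report.json"}})
print(decide("red", call), decide("referee", call))  # → 2 0
```

The same tool call gets a hard block for Red and a pass for the Referee — no prompt interpretation involved, which is what closes the 10% compliance gap.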

The Exercise Lifecycle: Seven Phases

Every exercise follows a turn-based lifecycle. Turn-based is deliberate — it simplifies analysis, produces clean phase boundaries for reporting, and eliminates the complexity of concurrent Red/Blue operations.


Figure 4 - The Seven-Phase Lifecycle: Each phase has a single team owner, defined inputs and outputs, and a clear completion gate. The lifecycle is orchestrated by the /run-exercise slash command, which sequences all seven phases and handles worktree setup and teardown.

Phase 1: Setup — Create exercise run directory, initialize Red Team and Blue Team worktrees, copy target configuration, set environment variables (ACTIVE_TEAM, EXERCISE_PHASE, SECURITY_SCAN_MODE).

Phase 2: Recon (Red Team only) — Recon Agent maps the target’s attack surface. It reads source code, identifies API endpoints, catalogs injection points, analyzes existing security hooks, and produces a prioritized attack plan. Output: attack-surface-report.json.

Phase 3: Attack (Red Team only) — Exploit Agent executes the attack plan. Each finding includes: OWASP category, severity, PoC code, curl commands for reproduction, evidence of exploitation, and remediation hints. Output: attack-findings.json.

Phase 4: Detect (Blue Team only) — Monitor Agent analyzes audit logs and codebase state. It does not see Red’s findings. It must discover vulnerabilities through the same evidence trail an actual defender would have: audit entries, code analysis, endpoint probing. Output: detection-report.json.

Phase 5: Defend (Blue Team only) — Hardener Agent patches each detected vulnerability. Patches are applied in the Blue Team’s worktree branch, verified with tests, and documented with diffs. Output: defense-actions.json.

Phase 6: Score (Referee only) — Scorer Agent reads both teams’ JSON outputs. A 900-line Python calculator computes ASR, per-finding evaluation, blind spots, and composite scores for both teams. Output: scoring-result.json.

Phase 7: Report — Generate human-readable exercise report from scoring data. Includes executive summary, per-OWASP-category breakdown, attack timeline, blind spot analysis, and recommendations. Output: {exercise-id}-report.md.
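Sequencing the seven phases is mechanically simple; the part worth getting right is that the orchestrator, not the agents, sets the environment variables the hooks enforce. A sketch of that sequencing, with two hedges: the Setup phase's owner is not named in the text (we assume the Referee side), and the per-team `SECURITY_SCAN_MODE` mapping is a simplification of the per-agent assignment described above (permissive for the Exploit Agent, strict for the Hardener).

```python
import os

# (phase, owning team, expected output artifact) — from the lifecycle above;
# the Setup phase's owner is not stated in the article, so "referee" is assumed.
PHASES = [
    ("setup",  "referee", None),
    ("recon",  "red",     "attack-surface-report.json"),
    ("attack", "red",     "attack-findings.json"),
    ("detect", "blue",    "detection-report.json"),
    ("defend", "blue",    "defense-actions.json"),
    ("score",  "referee", "scoring-result.json"),
    ("report", "referee", "exercise-report.md"),
]

def run_exercise(run_phase):
    """Drive the turn-based lifecycle; run_phase(name, env) invokes one agent."""
    for phase, team, artifact in PHASES:
        env = {**os.environ,
               "ACTIVE_TEAM": team,
               "EXERCISE_PHASE": phase,
               # simplification: permissive while Red writes exploit code,
               # strict while Blue patches (per-agent in the real platform)
               "SECURITY_SCAN_MODE": "permissive" if team == "red" else "strict"}
        run_phase(phase, env)
        # a completion gate here would verify `artifact` exists before advancing
```

Because the environment is rebuilt per phase, an agent cannot carry elevated scan permissions from one phase into the next.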

The OWASP Framework: 10 Categories of Agent Threats

We use the OWASP Top 10 for Agentic Applications as the classification framework for all findings. Every attack and every defense maps to one of 10 categories:


Figure 5 - OWASP Coverage Across Two Rounds: Green cells indicate defended categories. Red cells indicate confirmed attacks without defense. Gray cells indicate no activity. Round 1 tested 7 of 10 categories. Round 2 expanded to 9 of 10, with only A06 remaining untested.

| # | Category | What It Means for Agent Systems |
|-----|------|----------------------------|
| A01 | Prompt Injection | Adversarial inputs that hijack LLM behavior |
| A02 | Credential Leakage | Exposed API keys, tokens, or secrets |
| A03 | Excessive Agency | Agents with more permissions than needed |
| A04 | YAML/Data Injection | Malicious data injected into structured formats |
| A05 | Information Disclosure | Leaked internal paths, model names, infrastructure |
| A06 | Trajectory Drift | Agents deviating from intended behavior |
| A07 | Cascading Failure | Errors propagating through multi-agent pipelines |
| A08 | Hook Evasion | Bypassing monitoring and enforcement layers |
| A09 | Output Manipulation | Untrusted LLM output written without validation |
| A10 | Denial of Service | Resource exhaustion or service disruption |

The classification serves two purposes. First, it provides a common vocabulary for Red and Blue teams to describe findings. Second, it makes exercises comparable: you can track which OWASP categories improved between rounds and which still have gaps.

The Scoring Methodology: Making Security Measurable

Raw vulnerability counts are not useful for comparing security posture across rounds. “We found 7 vulnerabilities” versus “we found 8 vulnerabilities” tells you almost nothing — severity, detection, and patch effectiveness all matter.

The platform uses a scoring methodology built around Attack Success Rate (ASR) as the primary metric:


Figure 6 - The Scoring Methodology: ASR is severity-weighted: a Critical finding (4x) counts four times more than a Low finding (1x). Partial findings count at 0.5 weight. Thresholds map ASR to ratings: >50% is CRITICAL, 20-50% is HIGH, 5-20% is MEDIUM, <5% is LOW.

ASR = weighted confirmed findings / total attempts

Severity weights ensure that a single Critical finding (credential exfiltration) matters more than three Low findings (verbose error messages):

| Severity | Weight | Example |
|----------|--------|---------|
| Critical | 4x | Remote code execution, credential exfiltration |
| High | 3x | Prompt injection affecting output, YAML injection to filesystem |
| Medium | 2x | Information disclosure, authentication bypass (known risk) |
| Low | 1x | Verbose error messages, configuration exposure |
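Put together, the weighting and the threshold ratings are a few lines of code. One hedge: the article does not say whether the denominator ("total attempts") is also severity-weighted; the sketch below weights both sides, so that a blocked Critical attempt drags ASR down more than a blocked Low one.

```python
WEIGHTS = {"critical": 4.0, "high": 3.0, "medium": 2.0, "low": 1.0}

def attack_success_rate(attempts):
    """attempts: (severity, outcome) pairs; outcomes are 'confirmed',
    'partial' (counted at 0.5 weight), or 'blocked'."""
    credit = {"confirmed": 1.0, "partial": 0.5, "blocked": 0.0}
    scored = sum(WEIGHTS[sev] * credit[out] for sev, out in attempts)
    total = sum(WEIGHTS[sev] for sev, _ in attempts)
    return scored / total if total else 0.0

def rating(asr):
    """Map ASR to the threshold ratings from Figure 6."""
    if asr > 0.50:
        return "CRITICAL"
    if asr > 0.20:
        return "HIGH"
    if asr > 0.05:
        return "MEDIUM"
    return "LOW"

print(rating(0.65), rating(0.47))  # → CRITICAL HIGH
```

The two printed ratings correspond to the article's Round 1 (65% CRITICAL) and Round 2 (47% HIGH) results.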

Blue Team Score combines 5 weighted components:

Blue Score = (detection_rate × 25% + (1 − ASR) × 30% + coverage × 10%
              + time_efficiency × 30% + (1 − FPR) × 5%) × 100

When time data is unavailable (as in both our exercises, where HTTP attacks bypassed audit logging), the 30% time weight is redistributed proportionally to the other four components. This ensures the score reflects only measured metrics rather than a neutral default.
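The redistribution is worth showing in code because it is easy to get wrong: defaulting a missing component to a neutral 0.5 silently distorts the score, while renormalizing the remaining weights does not. A sketch, with (1 − ASR) and (1 − FPR) pre-computed as inputs so every component lives in [0, 1]; the component names are ours, not the platform's.

```python
BLUE_WEIGHTS = {"detection_rate": 0.25, "inverse_asr": 0.30,
                "coverage": 0.10, "time_efficiency": 0.30,
                "inverse_fpr": 0.05}

def blue_score(metrics):
    """metrics: component -> value in [0, 1]. Components with no data
    (e.g. time_efficiency when HTTP attacks bypassed audit logging) are
    simply omitted; their weight is spread proportionally over the rest."""
    present = {k: w for k, w in BLUE_WEIGHTS.items() if k in metrics}
    scale = 1.0 / sum(present.values())  # renormalize remaining weights to 1
    return 100 * sum(metrics[k] * w * scale for k, w in present.items())

# With time data missing, the other four weights (0.70 total) scale up,
# so the score still spans the full 0–100 range:
score = blue_score({"detection_rate": 1.0, "inverse_asr": 0.35,
                    "coverage": 1.0, "inverse_fpr": 1.0})
```

A perfect score on the measured components still yields 100, whether or not time data exists — which is exactly the round-to-round comparability the methodology is after.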

KEY INSIGHT: Security metrics must be severity-weighted and round-comparable. A single Critical finding that exfiltrates credentials matters more than five Low findings that leak error messages. And a scoring formula that handles missing data (redistributing weights instead of defaulting) produces honest comparisons across rounds with different conditions.

The Target: A Real Application with Real Gaps

The target for both exercises is obsidian-youtube-agent — a personal productivity tool built by the authors. It automates YouTube transcript processing into Obsidian Notes, generates note backlinks, and provides a RAG chatbot over the Obsidian vault.


Figure 7 - The Target: A full-stack application with 30+ API endpoints, a React 19 frontend, PostgreSQL for state, Qdrant for vector search, and integrations with Anthropic Claude and OpenAI. Authentication was disabled by default (accepted risk KR-001). Input sanitization covered 3 of 22 known injection patterns. 6 Claude Code security hooks were active pre-exercise.

The application was never designed as a commercial or multi-user system. It is a personal tool that runs on a Synology NAS for a household of three. Authentication was disabled because the API only serves a local network. Rate limiting was minimal because usage is light. Input normalization was incomplete because the developers (us) prioritized features over hardening.

This is exactly the point. Most internal tools, personal projects, and startup MVPs share these same gaps. The target is realistic because it is honest about what “well-built for its purpose” looks like when you point adversarial agents at it.

Pre-exercise security posture:

  • 30+ API endpoints, all accessible without authentication
  • 11 Claude Code security hooks active (security scan, input sanitization, rate limiting, audit logging, heartbeat monitoring, and more)
  • 10 known findings documented in a security report, 2 formally accepted as known risks
  • _sanitize_chat_input() covering 3 prompt injection patterns (XML role tags, role prefixes, code fences)
  • _validate_file_path() applied to 1 of 2 file access code paths
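For a sense of scale: the sanitizer's entire coverage, reconstructed from the three patterns the article names, fits in a dozen lines. The regexes below are our illustration, not the target's actual `_sanitize_chat_input()`.

```python
import re

# The three documented patterns: XML role tags, role prefixes, code fences.
# Illustrative regexes — not the target application's real implementation.
PATTERNS = [
    re.compile(r"</?(?:system|assistant|user)>", re.IGNORECASE),
    re.compile(r"^\s*(?:system|assistant|user)\s*:",
               re.IGNORECASE | re.MULTILINE),
    re.compile(r"```"),
]

def sanitize_chat_input(text: str) -> str:
    """Strip the three known injection markers; 19 of 22 patterns remain."""
    for pattern in PATTERNS:
        text = pattern.sub("", text)
    return text
```

Three patterns down, nineteen to go — which is precisely the kind of gap adversarial agents are built to find.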

The Hook Infrastructure: Seven Arena Hooks

Seven hooks enforce the exercise rules. All seven are single-file Python scripts run with UV, using only the standard library and following the fail-open pattern: if a hook receives malformed input, it exits with code 0 and the operation proceeds.


Figure 8 - Arena Hook Architecture: Four arena-wide hooks (team isolation, target protection, rate limiting, audit logging) apply to all agents. Three per-team hooks (attack scope, defense scope, read-only enforcement) constrain each team’s capabilities. Every tool call passes through at least 2 hooks before executing.

| Hook | Type | What It Enforces |
|------|------|------------------|
| pretooluse_team_isolation.py | PreToolUse | Blocks cross-team file access based on ACTIVE_TEAM |
| pretooluse_target_protection.py | PreToolUse | Prevents modification of target's .claude/ infrastructure |
| pretooluse_arena_rate_limit.py | PreToolUse | Per-team, per-tool rate limits (Red Read: 2000, Write: 200) |
| posttooluse_arena_audit.py | PostToolUse | Extended audit log with team, phase, and OWASP metadata |
| pretooluse_attack_scope.py | PreToolUse | Constrains Red Team to target + own exercise directories |
| pretooluse_defense_scope.py | PreToolUse | Constrains Blue Team to target + own exercise directories |
| pretooluse_readonly_enforce.py | PreToolUse | Blocks Referee writes except to scoring/ and reports/ |

The team isolation hook is the linchpin. It reads two environment variables on every tool call: ACTIVE_TEAM (red/blue/referee) and EXERCISE_PHASE (recon/attack/detect/defend/score). Based on these values, it blocks or allows file access deterministically. Red Team during the attack phase cannot read Blue Team’s detection directory. Blue Team during the defend phase cannot read Red Team’s exploit directory. The Referee during scoring can read both.

What Comes Next

The platform is built. The lifecycle works. The scoring methodology produces comparable results. Two exercises are complete.

Part 2 covers Round 1 in detail: 10 attacks against an unpatched target, a 65% CRITICAL ASR, a critical credential exfiltration, and the architectural gap where HTTP attacks bypass hook monitoring entirely. Blue Team achieves 100% detection and patches every finding — but the ASR tells the real story about the target’s security posture.


Next: 65% Attack Success Rate Against an Unpatched Target — Round 1 results: 10 attacks, 7 confirmed, one critical credential exfiltration. Blue Team achieves 100% detection, but the damage is already done.


The Series

This is Part 1 of a 4-part series on Adversarial Agent Testing:

  1. When Your AI Agents Attack Each Other (this article) — The platform: five agents, three teams, seven phases, and the hook infrastructure that makes it work
  2. 65% Attack Success Rate Against an Unpatched Target — Round 1: 10 attacks, 7 confirmed, 100% detection, 9 defense patches
  3. The Escalation Wave — Round 2: patches hold at 20% ASR, new attacks succeed at 85.7%
  4. Securing Agentic AI Systems — Lessons: patching vs. architecture, agent-driven testing, and recommendations
https://dotzlaw.com/insights/adversarial-testing-01/
Authors: Gary Dotzlaw, Katrina Dotzlaw, Ryan Dotzlaw
Published: 2026-03-01
License: CC BY-NC-SA 4.0