27 attacks. 14 defense patches. 550 lines of security hardening. Two rounds proved the same thing from opposite directions: targeted patches drop the attack success rate from 65% to 20% against known vectors. Structural weaknesses keep it at 85.7% for new ones. Patching and architecture are complements, not substitutes.

Figure 1 - The Two-Round Journey: From 65% CRITICAL to 47% HIGH, but the headline obscures the real story. Regression ASR (20%) proves patches work. Escalation ASR (85.7%) proves architecture doesn’t. The gap between these two numbers is the gap between per-vulnerability patching and structural defense.
We built a platform where AI agents attack and defend a real application. We ran two rounds of exercises. We generated structured data: JSON findings, severity-weighted scores, OWASP category breakdowns, patch durability assessments. We have numbers.
But numbers are not lessons. This article extracts the transferable insights — the patterns that apply beyond our specific target, our specific platform, and our specific results. If you are building, deploying, or securing an agentic AI application, these are the things we wish we had known before we started.
Lesson 1: Patching and Architecture Are Different Strategies for Different Threats
The central finding of both rounds, stated simply: patches fix specific vulnerabilities. Architecture fixes categories of vulnerabilities. You need both.

Figure 2 - Two Defense Strategies: Patches (left) close specific code paths. Each patch addresses one finding at one location. Architectural changes (right) close categories of attacks. Authentication blocks all unauthenticated access. Input normalization blocks all Unicode-based bypasses. The Round 2 results quantify the difference.
Round 1 patches addressed 7 specific findings:
- _validate_vault_path() closed the base64 path traversal on one endpoint
- 19 compiled regex patterns closed the sanitizer gap for known injection phrases
- HTTP middleware closed the hook evasion gap with 8 detection patterns
Each patch is correct, verified, and durable. The regression wave proved it: 8 of 10 original attacks were blocked. Patches work.
But the escalation wave found 6 new vulnerabilities in 7 attempts. Unicode zero-width characters bypassed every regex. A new endpoint exposed credentials. Five rapid requests crashed the server. The pattern gap between security layers let 11 injection techniques through.
Every escalation finding traces to a structural weakness:
- No authentication means every endpoint is accessible to every attacker
- No input normalization means Unicode tricks bypass any string-based defense
- Inconsistent security layers means gaps between layers become the attack surface
Patches cannot close these. You can add Unicode normalization to the middleware (DEF-010 does this). But until normalize-before-validate is an application-wide invariant, the next encoding trick will find the next unnormalized entry point.
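The bypass mechanics are easy to reproduce. A minimal sketch, using an illustrative pattern rather than the platform's actual list:

```python
import re

# Illustrative regex in the style of a string-based sanitizer
pattern = re.compile(r"ignore previous instructions", re.IGNORECASE)

clean = "ignore previous instructions"
evasive = "ignore\u200b previous instructions"  # U+200B zero-width space

pattern.search(clean)    # matches
pattern.search(evasive)  # None: the invisible codepoint breaks the literal match
```

The payloads render identically on screen, which is exactly why an unnormalized entry point defeats any number of regex patterns.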
| Strategy | What It Fixes | Durability | Round 2 Evidence |
|---|---|---|---|
| Patches | Specific code paths at specific locations | High for known vectors | Regression ASR 20% |
| Architecture | Categories of attack across all locations | High for current + future vectors | Escalation ASR 85.7% |
KEY INSIGHT: When regression ASR is low and escalation ASR is high, your patches work but your architecture doesn’t. This is the signal to shift investment from per-vulnerability patching to structural remediation. The two-wave methodology makes this distinction measurable.
Lesson 2: Agent-Driven Testing Finds What Human Testers Miss (and Misses What They Find)
The Red Team agents discovered attack techniques that were not in our threat model. The Unicode zero-width bypass (ATK-N03), the event loop exhaustion via empty batch requests (ATK-N02), and the startswith() prefix collision (ATK-N04) were not attacks we had anticipated. The agents found them through systematic exploration of the target’s behavior.

Figure 3 - Agent Testing Complements Human Testing: Agents excel at systematic enumeration, encoding variations, and exhaustive pattern testing. Humans excel at business logic attacks, social engineering, and creative scenario construction. The ideal security testing program uses both.
What the agents found that humans likely wouldn’t:
- Unicode zero-width bypass: The agent systematically tested encoding variations against the regex patterns. A human tester might try URL encoding or HTML entities, but zero-width Unicode insertion requires specific knowledge of how regex engines handle invisible codepoints.
- Event loop exhaustion: The agent sent rapid requests without rate limiting concerns. A human tester would likely test one request at a time and never discover the concurrency failure.
- Pattern enumeration: The agent tested all 22 known injection patterns against the middleware, identified the 11 missing ones, and documented the gap systematically. A human tester might test 3-5 patterns and call it done.
What the agents missed that humans wouldn’t:
- Business logic attacks: The agents did not attempt to manipulate the YouTube processing pipeline’s business logic — for example, submitting a YouTube URL that redirects to an attacker-controlled page serving crafted transcripts.
- Social engineering via LLM: The agents tested prompt injection as pattern bypass. A human red teamer might craft prompts that exploit the LLM’s helpfulness rather than its pattern matching — subtler attacks that don’t trigger regex-based defenses.
- A06: The one OWASP category never tested across both rounds. Agent trajectory drift requires observing behavior over time — something the current platform's turn-based model doesn't naturally produce.
The practical recommendation: use agent-driven testing for systematic coverage and encoding variation testing. Use human testers for business logic, creative scenarios, and the attacks that require understanding context rather than exhausting patterns.
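The systematic coverage that agents provide can be approximated even without an agent. A hypothetical sketch of encoding-variant enumeration against a pattern list (both the patterns and the variant generators are illustrative, not the platform's code):

```python
import base64
import re

PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"ignore previous instructions",
    r"system prompt",
)]

def variants(payload: str):
    # Mechanically generated encodings a human tester rarely exhausts
    yield payload
    yield payload.replace(" ", "\u200b ")            # zero-width insertion
    yield "".join(c + "\u200c" for c in payload)     # interleaved ZWNJ
    yield base64.b64encode(payload.encode()).decode()

def find_bypasses(payload: str) -> list[str]:
    # A variant "bypasses" if no defensive pattern matches it
    return [v for v in variants(payload)
            if not any(p.search(v) for p in PATTERNS)]

len(find_bypasses("ignore previous instructions"))  # 3 of 4 variants slip through
```

Extending `variants()` is cheap, which is why exhaustive encoding coverage is the agents' natural strength.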
Lesson 3: The Two-Wave Methodology Makes Results Actionable
Before Round 2, we considered simply running 17 new attacks against the patched codebase. The combined ASR would have told us something, but not what we needed to know. A 47% ASR could mean “patches are partially effective” or “patches are fully effective but new attacks are devastating.” Those are different problems requiring different responses.
Splitting into regression and escalation waves makes the results actionable:

Figure 4 - The Two-Wave Decision Tree: Regression results determine whether to keep patching (if patches failed) or shift strategy (if patches held). Escalation results determine whether to patch new findings (if few succeeded) or invest in architecture (if many succeeded). The combination drives a specific, evidence-based action plan.
If regression ASR is high and escalation ASR is high: Patches are failing. Go back and fix the patches before attempting new defenses.
If regression ASR is low and escalation ASR is low: Both patches and architecture are holding. Increase attack complexity for the next round.
If regression ASR is low and escalation ASR is high: This is our result. Patches work, architecture doesn’t. Shift investment from patching to structural remediation.
If regression ASR is high and escalation ASR is low: Unusual. Patches are incomplete but the attacker isn’t finding new vectors. Review patch quality and test methodology.
The methodology is transferable. Any team running repeated security assessments can separate regression testing (do old fixes still work?) from escalation testing (what new things break?). The regression/escalation split turns a single ASR number into an actionable diagnostic.
KEY INSIGHT: A single combined ASR hides the distinction between patch failure and architectural weakness. The two-wave methodology — regression to validate patches, escalation to test resilience — produces actionable intelligence that a combined metric cannot. Adopt this split for any repeated security assessment.
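The decision tree reduces to a few lines. A sketch, with a 50% threshold chosen arbitrarily for illustration:

```python
def diagnose(regression_asr: float, escalation_asr: float,
             threshold: float = 0.5) -> str:
    # Low regression ASR: old patches still block old attacks.
    # Low escalation ASR: the architecture resists new attacks.
    patches_hold = regression_asr < threshold
    architecture_holds = escalation_asr < threshold
    if not patches_hold and not architecture_holds:
        return "fix the failing patches before adding new defenses"
    if patches_hold and architecture_holds:
        return "increase attack complexity next round"
    if patches_hold:
        return "shift investment to structural remediation"
    return "review patch quality and test methodology"

diagnose(0.20, 0.857)  # the Round 2 result: patches work, architecture doesn't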
Lesson 4: Scoring Makes Security Comparable
Before the scoring methodology, security assessments produced reports. After the methodology, they produce numbers that can be compared across rounds, across targets, and across teams.

Figure 5 - Score Components Across Rounds: The Blue Team score decreased from 67.86 to 63.76 despite ASR improving. The reason: detection rate dropped 25pp (missed 2 of 8 findings) and coverage dropped 20pp (only 5 of 10 OWASP categories defended). The formula rewards breadth, not just depth.
Key properties of the scoring methodology:
Severity weighting matters. A single Critical finding (credential exfiltration, 4x weight) impacts the score more than three Low findings (verbose error messages, 1x each). Without weighting, Round 1 looks like “7 out of 10 attacks succeeded.” With weighting, it’s “16.5 severity-weighted points out of a possible 40” — a more precise picture of the damage.
Partial findings are explicitly handled. The prompt injection sanitizer bypass (ATK-003) was confirmed at the code level but couldn’t complete end-to-end because the LLM model was unavailable. Counting it as fully confirmed would overstate the risk. Counting it as failed would understate it. The 0.5 partial weight captures the reality: the vulnerability exists, the full exploitation chain doesn’t (yet).
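Both properties combine into a small calculation. A sketch using the weights the text states (4x for Critical, 1x for Low, 0.5 for partial confirmation); the 3x High and 2x Medium weights are assumptions for illustration:

```python
SEVERITY = {"critical": 4, "high": 3, "medium": 2, "low": 1}  # 3x/2x assumed
STATUS = {"confirmed": 1.0, "partial": 0.5, "failed": 0.0}

def weighted_points(findings: list[tuple[str, str]]) -> float:
    # Each finding is (severity, status); failed attempts score zero
    return sum(SEVERITY[sev] * STATUS[status] for sev, status in findings)

# One confirmed Critical outweighs three confirmed Lows
weighted_points([("critical", "confirmed")])   # 4.0
weighted_points([("low", "confirmed")] * 3)    # 3.0
weighted_points([("high", "partial")])         # 1.5
```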
Missing data is redistributed, not defaulted. Both exercises lacked time-to-detect and time-to-patch data (HTTP attacks bypassed audit logging). Rather than defaulting to a neutral 0.5 for time efficiency, the calculator redistributes the 30% time weight to the other four components. This ensures the Blue Score reflects only measured metrics and remains comparable across exercises with different data availability.
The Blue Team paradox is real. Improving defense against known attacks while being blindsided by new categories produces a worse composite score than mediocre defense with broad coverage. This is not a flaw in the formula — it reflects a genuine security principle: breadth of defense matters as much as depth.
Lesson 5: Honest Targets Produce Honest Results
The target application was built by the authors for personal use. Authentication was disabled for convenience. Input sanitization was minimal. Security hooks were present but designed for agent-level threats, not HTTP-level attacks. We knew about the gaps — several were documented in a known-risks register.
This honesty is the point.

Figure 6 - Assumptions vs. Reality: The target was built for a trust model where the user is local, known, and non-malicious. Every architectural decision — disabled auth, minimal rate limiting, incomplete sanitization — is reasonable within that model. Adversarial testing invalidates the model, and the 65% ASR quantifies the gap.
Most internal tools, personal projects, and startup MVPs share the same security profile. They are built for a trust model where the user is known, the network is local, and the threat is accidental. Authentication is disabled because it’s inconvenient. Rate limiting is absent because traffic is light. Input validation is minimal because users aren’t adversarial.
This is not negligence. It is a rational design choice for the intended deployment context. The 65% ASR does not mean the application was built badly — it means the application was built for a trust model that adversarial testing deliberately violates.
The lesson for teams building agentic AI: start with an honest assessment of your trust model. If your agents interact with untrusted input (user prompts, external data, third-party APIs), your trust model is adversarial whether you designed for it or not. The security posture that’s appropriate for a localhost tool is not appropriate for an internet-facing API, and the gap between them is measurable.
Lesson 6: The Bootstrap Framework Connection
This platform did not emerge from nothing. The security hooks, the defense-in-depth architecture, and the OWASP classification framework all trace back to the Bootstrap Framework series — specifically Part 3: Securing Agentic AI.

Figure 7 - From Framework to Validation: The Bootstrap Framework (left) established the security patterns: defense-in-depth rings, per-call hooks, trajectory monitoring, OWASP coverage. The adversarial testing platform (right) validates those patterns against a real target. The exercise results feed back into the framework as empirical evidence for which patterns matter most.
The Bootstrap Framework’s Part 3 established:
- Four Defense Rings: per-call, trajectory, structural, and validation layers
- Per-archetype security patterns: different project types need different defenses
- OWASP Top 10 for Agentic Applications: the classification framework for all 10 threat categories
- Attack Success Rate (ASR): the metric that makes security improvement measurable
- Zero Trust for Agents: verify then trust, least privilege, assume breach
The adversarial testing platform takes these patterns and proves which ones work. The four defense rings? Validated — per-call hooks (Ring 1) catch specific attacks, but trajectory monitoring (Ring 2) and structural safeguards (Ring 3) are where architectural resilience lives. Per-archetype patterns? Validated — a FastAPI application needs HTTP-layer defenses that the hook layer cannot provide. OWASP coverage? Validated — 9 of 10 categories were tested, and the untested category (A06) remains a documented blind spot.
The empirical loop is the key contribution. The Bootstrap Framework proposed patterns. The adversarial testing platform tests them. The results feed back into both the framework (updating which patterns are recommended) and the target (hardening the application). Theory proposes, practice validates, results improve both.
KEY INSIGHT: Security patterns are hypotheses until tested. The Bootstrap Framework proposed defense-in-depth rings, OWASP coverage, and per-archetype security. The adversarial exercises proved which patterns held (targeted patches, HTTP middleware), which didn’t (regex-only defenses without normalization), and which matter most (authentication as the root defense). Build the patterns, then test them adversarially.
Recommendations for Teams Building Agentic AI
Based on two rounds of adversarial testing, here are the concrete actions that would have the highest impact for any team building agentic AI applications:

Figure 8 - Recommendation Priority Matrix: High-impact, low-effort actions (top-left) should be implemented first. Authentication and input normalization are the highest-leverage structural changes. The two-wave testing methodology is high-impact, moderate-effort, and produces the data needed to prioritize everything else.
Structural (Implement First)
1. Enable authentication and fail closed. This single change would have blocked or mitigated the majority of findings across both rounds. When authentication is disabled, every endpoint is an attack surface. When it fails closed (rejecting requests when the key is misconfigured rather than allowing them), misconfiguration is safe by default.
2. Normalize input before validation. Apply Unicode NFKC normalization and zero-width character stripping before any regex-based security check. This is a middleware, not a per-endpoint fix. The principle: normalize first, validate second, everywhere, every time.
3. Centralize security patterns. Define injection detection patterns in a single module. Import them into every security layer — middleware, sanitizer, hook. When the middleware and sanitizer use the same 19 patterns, gap-based bypasses become impossible.
Operational (Implement Next)
4. Run the two-wave methodology. Separate regression testing from escalation testing. Regression validates patches. Escalation tests resilience. The combination tells you whether to keep patching or invest in architecture.
5. Track OWASP coverage across rounds. Maintain a persistent risk register that carries findings across exercises; without one, a category-level result (such as the A02 finding) can silently drop out of scope between rounds.
6. Use severity-weighted scoring. Raw vulnerability counts are misleading. A single Critical finding that exfiltrates credentials matters more than five Low findings that expose error messages. Weight severity, handle partial findings explicitly, and redistribute missing data rather than defaulting.
Testing (Continuous)
7. Combine agent and human testing. Agents excel at systematic enumeration, encoding variations, and exhaustive pattern testing. Humans excel at business logic, creative scenarios, and context-dependent attacks. Neither is sufficient alone.
8. Test at the HTTP layer, not just the agent layer. If your application serves an API, your security tests must include HTTP-level attacks. Hook-based security monitors agent tool calls, not HTTP requests. Test both layers independently.
The Complete Picture

Figure 9 - The Complete Summary: Every number from both rounds in one view. 27 total attacks, 15 confirmed, 14 defense patches, 550 lines of security hardening. ASR improved from 65% CRITICAL to 47% HIGH. Regression wave: 20%. Escalation wave: 85.7%. The gap between those two numbers is the measure of architectural debt.
| Metric | Round 1 | Round 2 | Cumulative |
|---|---|---|---|
| Total Attacks | 10 | 17 | 27 |
| Confirmed Findings | 7 | 8 | 15 |
| Defense Patches | 9 | 5 | 14 |
| Lines of Security Code | +321 | +229 | +550 |
| Files Modified | 7 | 5 | 12 |
| OWASP Categories Tested | 7/10 | 9/10 | 9/10 |
| Blind Spots | 0 | 2 | 2 cumulative |
| ASR | 65% CRITICAL | 47.06% HIGH | — |
| Regression ASR | — | 20% | — |
| Escalation ASR | — | 85.7% | — |
The throughline across both rounds is clear. Round 1 established a baseline: an unpatched personal tool has a 65% attack success rate against adversarial agents. Round 2 proved that targeted patches work (20% regression ASR) but structural weaknesses persist (85.7% escalation ASR). The path forward is architectural: authentication, normalization, centralized security patterns.
The platform itself — 5 agents, 7 phases, severity-weighted scoring, JSON Schema validation — is the contribution. The specific findings are interesting but perishable. The methodology for producing those findings, comparing them across rounds, and extracting actionable intelligence from the comparison is what transfers to any team building agentic AI.

Figure 10 - The Adversarial Testing Cycle: Exercise produces findings. Findings drive patches (tactical) and architectural changes (strategic). The next exercise validates patches via regression and tests architecture via escalation. Each round builds on the last. Security is not a destination — it is a cycle.
Security is a cycle, not a destination. Each exercise finds vulnerabilities. Each defense phase patches them. Each subsequent exercise validates the patches and discovers new attack surfaces. The adversarial agents evolve their techniques. The defense agents evolve their responses. The scoring methodology makes progress measurable.
Two rounds complete. The platform works. The methodology transfers. The question is not whether your application has vulnerabilities — it does. The question is whether you can find them before an attacker does, fix them systematically, and verify that the fixes hold. That is what adversarial agent testing provides.
The Series
This is Part 4 of a 4-part series on Adversarial Agent Testing:
- When Your AI Agents Attack Each Other — The platform: five agents, three teams, seven phases, and the hook infrastructure that makes it work
- 65% Attack Success Rate Against an Unpatched Target — Round 1: 10 attacks, 7 confirmed, 100% detection, 9 defense patches
- The Escalation Wave — Round 2: patches hold at 20% ASR, new attacks succeed at 85.7%
- Securing Agentic AI Systems (this article) — Lessons: patching vs. architecture, agent-driven testing, and recommendations
Related Reading
- Securing Agentic AI — The Bootstrap Framework security article that established the defense patterns validated in this series
- An Agent Swarm That Builds Agent Swarms — The Bootstrap Framework origin story — the infrastructure this platform builds on
- Hooks, Agents, and the Deterministic Control Layer — Why hooks enforce what prompts cannot