We shipped an AI agent that passed every unit test in the suite. Green across the board. Then it hit production and started routing customer complaints to the sales team, hallucinating product names that didn’t exist, and occasionally responding in French. The tests never caught any of it because we were testing the wrong things at the wrong level. After rebuilding our entire test strategy around LangGraph’s state machine model and Pydantic AI’s validation primitives, we cut production incidents by 90% and dropped our test execution time from 12 minutes to 40 seconds.
That experience taught us a hard lesson. Traditional testing assumes deterministic inputs and outputs. AI agents laugh at that assumption. The same prompt can produce different responses across runs. State mutates as it flows through graph nodes. External LLM calls fail in ways you can’t predict. You need a testing strategy built specifically for this kind of system, one that accounts for probabilistic behavior, validates state transitions, and catches the subtle orchestration bugs that only surface three nodes deep in a workflow.
The good news: LangGraph and Pydantic AI give you the tools to make this tractable. LangGraph’s explicit state graphs turn implicit agent behavior into something you can inspect and verify at every step. Pydantic AI’s TestModel and FunctionModel let you rip out the LLM entirely and test your logic in milliseconds. Together, they turn “hope it works” into “prove it works.”
Why AI Agent Testing Breaks Traditional Approaches
Five Layers of Complexity You Didn’t Ask For
When we first tried to test our LangGraph agents with standard pytest patterns, we kept running into the same wall. The tests passed, but the system still broke. Here is why.
Stateful complexity is the biggest culprit. LangGraph maintains state as information flows through nodes, and that state mutates at each step. A bug in node two might not manifest until node five, when the corrupted state triggers an unexpected conditional branch. You cannot catch this with isolated unit tests.
LLM non-determinism compounds the problem. Even with temperature set to 0, language models produce slightly different outputs for the same input. Your tests need to verify the intent and structure of responses, not exact string matches.
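Concretely, that means asserting on fields, categories, and key phrases rather than exact strings. Here is a minimal sketch; the summarize_ticket helper and its output fields are hypothetical, used only to illustrate the assertion style:

# Hypothetical helper for illustration: returns a dict like {"category": ..., "reply": ...}
from support_bot import summarize_ticket  # assumed import, not part of the system described here

def test_reply_structure_not_wording():
    result = summarize_ticket("My package arrived broken")

    # Brittle: exact wording shifts between runs, even at temperature 0
    # assert result["reply"] == "I'm sorry your package arrived broken."

    # Robust: assert on structure and intent instead
    assert result["category"] in {"damaged_item", "shipping_issue"}
    assert "sorry" in result["reply"].lower()
    assert len(result["reply"]) > 20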
Validation boundaries in Pydantic AI double your test surface. You need to verify that valid data passes through and that invalid data gets properly rejected with the right error messages. Testing only the happy path is testing half the system.
Tool integration introduces external dependencies that break test isolation. When your agent calls APIs, queries databases, or hits other services, you need strategies that test the integration logic without creating brittle or expensive external calls.
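One pattern that helps here is dependency injection: pass the external client into the tool so tests can substitute a fake. This is a sketch under assumed names (get_order_status and FakeOrdersClient are hypothetical), showing the tool's own logic tested without any network call:

class FakeOrdersClient:
    """Stands in for the real HTTP client; the interface is illustrative."""
    def get_status(self, order_id: str) -> str:
        return "shipped"

def get_order_status(client, order_id: str) -> str:
    """Hypothetical tool body: the formatting and error handling are what we test."""
    status = client.get_status(order_id)
    return f"Order {order_id} is {status}"

def test_order_status_tool_uses_injected_client():
    fake = FakeOrdersClient()
    assert get_order_status(fake, "A-1") == "Order A-1 is shipped"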
Orchestration logic demands testing at multiple levels. Individual nodes might work perfectly in isolation but fail when composed into a graph, because the routing conditions, state merges, or error propagation paths contain bugs that only appear during full execution.
The Testing Pyramid for AI Agents
To catch bugs at every level, we need a layered strategy where each tier builds on the one below.

Figure 1: AI Agent Testing Pyramid. Unit tests form the base, verifying individual node functions and validation rules. Integration tests check component interactions. Workflow tests validate graph execution and state transitions. End-to-end tests at the top verify complete system behavior. Each layer catches different failure modes.
Unit tests verify individual components: node functions, validation rules, tool interfaces. They run in milliseconds and catch basic logic errors.
Integration tests validate interactions between components. Can a Pydantic AI agent call its tools correctly? Does a state update from one LangGraph node arrive intact at the next? These catch the subtle “components don’t quite fit together” bugs.
Workflow tests exercise complete LangGraph graphs. They validate state transitions, conditional routing, and error propagation across multiple nodes. These are where orchestration bugs surface.
End-to-end tests confirm the system works from the user’s perspective. They simulate full conversations and verify that all pieces deliver the expected result together.
You need all four layers. Unit tests alone miss orchestration bugs. End-to-end tests alone make it impossible to pinpoint which component failed.
KEY INSIGHT: Test AI agents at every layer of the pyramid. Unit tests that pass tell you components work in isolation. Only workflow and E2E tests tell you the system actually works.
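One way to make the layers explicit in a pytest project is to register a marker per tier and keep each tier in its own directory. This is a sketch; the directory names mirror those used by the CI configuration later in this article, and the llm_required marker matches the nightly job's filter:

# conftest.py -- one marker per pyramid layer (names are illustrative)
def pytest_configure(config):
    config.addinivalue_line("markers", "unit: fast, isolated component tests (tests/unit/)")
    config.addinivalue_line("markers", "integration: component-interaction tests (tests/integration/)")
    config.addinivalue_line("markers", "workflow: full LangGraph graph executions (tests/workflows/)")
    config.addinivalue_line("markers", "llm_required: end-to-end tests that call real LLMs (tests/e2e/)")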
Testing Pydantic AI Components
Unit Testing with TestModel
Pydantic AI’s TestModel changed how we write agent tests. Instead of making expensive, non-deterministic LLM calls during testing, TestModel provides predictable responses. Tests run in milliseconds, and they actually test your logic instead of the LLM’s mood that day.
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
from pydantic import BaseModel, Field
from typing import List, Optional
import pytest

# First, let's define a structured output model
class CustomerAnalysis(BaseModel):
    """Structured analysis of customer sentiment and needs."""
    sentiment: float = Field(..., ge=-1, le=1, description="Sentiment score from -1 to 1")
    primary_issue: str = Field(..., description="Main customer concern")
    urgency_level: int = Field(..., ge=1, le=5, description="Urgency from 1-5")
    recommended_actions: List[str] = Field(..., min_items=1, max_items=3)

# Create our agent with structured output
customer_agent = Agent(
    'openai:gpt-4o',
    result_type=CustomerAnalysis,
    system_prompt="You are a customer service analyst. Analyze customer messages for sentiment, issues, and recommended actions."
)

# Now let's write comprehensive tests
class TestCustomerAgent:
    def test_basic_analysis(self):
        """Test that agent produces valid structured output."""
        # TestModel returns a simple response by default
        with customer_agent.override(model=TestModel()):
            result = customer_agent.run_sync("I'm frustrated with the slow shipping")

        # The result is guaranteed to be a CustomerAnalysis instance
        assert isinstance(result.data, CustomerAnalysis)
        assert -1 <= result.data.sentiment <= 1
        assert 1 <= result.data.urgency_level <= 5
        assert len(result.data.recommended_actions) >= 1

    def test_custom_responses(self):
        """Test agent with specific mock responses."""
        # Create a TestModel with a specific response
        mock_response = CustomerAnalysis(
            sentiment=-0.8,
            primary_issue="Shipping delays",
            urgency_level=4,
            recommended_actions=["Expedite shipping", "Offer compensation", "Follow up"]
        )

        # TestModel can return structured data directly
        test_model = TestModel(response=mock_response.model_dump_json())

        with customer_agent.override(model=test_model):
            result = customer_agent.run_sync("My order is 2 weeks late!")

        # Verify we got our expected response
        assert result.data.sentiment == -0.8
        assert result.data.primary_issue == "Shipping delays"
        assert "Expedite shipping" in result.data.recommended_actions

    def test_error_handling(self):
        """Test that agent handles errors gracefully."""
        # TestModel can simulate errors too
        error_model = TestModel(response=Exception("API Error"))

        with customer_agent.override(model=error_model):
            with pytest.raises(Exception) as exc_info:
                customer_agent.run_sync("Test message")

        assert "API Error" in str(exc_info.value)

The pattern here is straightforward: override the model, run your agent, assert on the structured output. You are testing how your agent handles responses, validates data, and deals with errors, all without touching an external LLM.
Controlling Agent Behavior with FunctionModel
When you need finer-grained control over how your mock agent responds, FunctionModel lets you write custom response logic. We used this to test multi-step tool workflows where the agent’s behavior depends on what happened in previous turns.
from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart, ToolCallPart
import asyncio

# Let's create a more complex agent with tools
agent_with_tools = Agent(
    'openai:gpt-4o',
    system_prompt="You are a helpful assistant with access to various tools."
)

@agent_with_tools.tool
async def check_inventory(product_id: str) -> int:
    """Check inventory levels for a product."""
    # In tests, this might return mock data
    return 42

@agent_with_tools.tool
async def calculate_shipping(weight: float, destination: str) -> float:
    """Calculate shipping cost."""
    return weight * 2.5  # Simplified calculation

# Now create sophisticated test scenarios
async def test_multi_tool_workflow():
    """Test complex workflows involving multiple tool calls."""
    call_sequence = []

    async def custom_model_behavior(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        """Simulate specific model behavior based on conversation state."""
        # Track what's been called
        nonlocal call_sequence

        # Get the last user message
        last_message = messages[-1].content

        if "check stock" in last_message.lower():
            # First, indicate we'll check inventory
            call_sequence.append("intent_recognized")
            return ModelResponse(
                parts=[
                    TextPart("I'll check the inventory for you."),
                    ToolCallPart(
                        tool_name="check_inventory",
                        args={"product_id": "PROD-123"}
                    )
                ]
            )
        elif any(part.part_kind == "tool-return" for part in messages[-1].parts):
            # We got tool results back, now calculate shipping
            call_sequence.append("tool_result_processed")
            return ModelResponse(
                parts=[
                    TextPart("The product is in stock. Let me calculate shipping."),
                    ToolCallPart(
                        tool_name="calculate_shipping",
                        args={"weight": 2.5, "destination": "New York"}
                    )
                ]
            )
        else:
            # Final response after all tools
            call_sequence.append("final_response")
            return ModelResponse(
                parts=[TextPart("Product PROD-123 is in stock (42 units) with shipping cost of $6.25 to New York.")]
            )

    # Run the test with our custom function
    with agent_with_tools.override(model=FunctionModel(custom_model_behavior)):
        result = await agent_with_tools.run("Check stock for PROD-123 and shipping to New York")

    # Verify the workflow executed correctly
    assert call_sequence == ["intent_recognized", "tool_result_processed", "final_response"]
    assert "42 units" in result.data
    assert "$6.25" in result.data

# Test error recovery in tool execution
async def test_tool_error_recovery():
    """Test how agent handles tool failures."""

    async def model_with_fallback(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        # Check if we just got a tool error
        last_message = messages[-1]
        if hasattr(last_message, 'parts'):
            for part in last_message.parts:
                if part.part_kind == "tool-return" and "error" in str(part.content).lower():
                    # Provide helpful fallback response
                    return ModelResponse(
                        parts=[TextPart("I encountered an error checking inventory, but I can help you place a backorder instead.")]
                    )

        # Initial tool call that will fail
        return ModelResponse(
            parts=[ToolCallPart(tool_name="check_inventory", args={"product_id": "INVALID"})]
        )

    # Override the tool to simulate failure
    async def failing_inventory_check(product_id: str) -> int:
        raise ValueError("Product not found")

    agent_with_tools._tools["check_inventory"].func = failing_inventory_check

    with agent_with_tools.override(model=FunctionModel(model_with_fallback)):
        result = await agent_with_tools.run("Check stock for INVALID")
        assert "backorder" in result.data.lower()

With FunctionModel, you control exactly what the agent “thinks” at each step. You can simulate multi-step tool chains, test error recovery paths, and verify conversation flow, all deterministically and all in milliseconds.

Figure 2: Pydantic AI Testing Flow. TestModel provides predetermined responses for fast, simple tests. FunctionModel lets you write custom response logic based on conversation state, enabling sophisticated scenario testing. Both approaches avoid real LLM calls while thoroughly exercising your agent logic.
KEY INSIGHT: Use TestModel for “does the plumbing work” tests and FunctionModel for “does the agent behave correctly across multi-turn interactions” tests. Together they cover the full range of agent behavior without a single LLM call.
Validation Testing at the Boundaries
One of Pydantic AI’s biggest strengths is its validation layer. But we learned the hard way that validation only protects you if you test both sides of every boundary. Our first agent accepted malformed meeting requests for weeks before anyone noticed, because we only tested the happy path.
Here is how we test validation comprehensively now:
from pydantic import BaseModel, Field, validator, model_validator
from typing import List, Optional, Dict
from datetime import datetime, timedelta
import json
import pytest

class MeetingScheduleRequest(BaseModel):
    """Complex model with multiple validation rules."""
    title: str = Field(..., min_length=3, max_length=100)
    attendees: List[str] = Field(..., min_items=2, max_items=20)
    duration_minutes: int = Field(..., ge=15, le=480)  # 15 min to 8 hours
    proposed_time: datetime
    meeting_type: str = Field(..., pattern="^(video|audio|in-person)$")

    @validator('attendees')
    def validate_attendees(cls, v):
        """Ensure all attendees have valid email format."""
        import re
        email_pattern = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')

        invalid_emails = [email for email in v if not email_pattern.match(email)]
        if invalid_emails:
            raise ValueError(f"Invalid email addresses: {invalid_emails}")

        # Check for duplicates
        if len(set(v)) != len(v):
            raise ValueError("Duplicate attendees not allowed")

        return v

    @validator('proposed_time')
    def validate_future_time(cls, v):
        """Ensure meeting is scheduled in the future."""
        if v <= datetime.now():
            raise ValueError("Meeting must be scheduled in the future")

        # Not too far in the future
        if v > datetime.now() + timedelta(days=365):
            raise ValueError("Cannot schedule meetings more than a year in advance")

        return v

    @model_validator(mode='after')
    def validate_video_meeting_duration(self):
        """Video meetings shouldn't exceed 2 hours."""
        if self.meeting_type == 'video' and self.duration_minutes > 120:
            raise ValueError("Video meetings should not exceed 2 hours")
        return self

# Comprehensive validation tests
class TestMeetingValidation:
    def test_boundary_values(self):
        """Test validation at the edges of acceptable ranges."""
        base_data = {
            "title": "Test Meeting",
            "attendees": ["user1@example.com", "user2@example.com"],
            "proposed_time": datetime.now() + timedelta(hours=1),
            "meeting_type": "video"
        }

        # Test duration boundaries
        # Minimum valid duration
        valid_min = MeetingScheduleRequest(**{**base_data, "duration_minutes": 15})
        assert valid_min.duration_minutes == 15

        # Maximum valid duration
        valid_max = MeetingScheduleRequest(**{**base_data, "duration_minutes": 480})
        assert valid_max.duration_minutes == 480

        # Below minimum
        with pytest.raises(ValueError) as exc_info:
            MeetingScheduleRequest(**{**base_data, "duration_minutes": 14})
        assert "greater than or equal to 15" in str(exc_info.value)

        # Above maximum
        with pytest.raises(ValueError) as exc_info:
            MeetingScheduleRequest(**{**base_data, "duration_minutes": 481})
        assert "less than or equal to 480" in str(exc_info.value)

    def test_custom_validators(self):
        """Test complex custom validation logic."""
        base_data = {
            "title": "Team Standup",
            "duration_minutes": 30,
            "proposed_time": datetime.now() + timedelta(days=1),
            "meeting_type": "video"
        }

        # Test invalid email format
        with pytest.raises(ValueError) as exc_info:
            MeetingScheduleRequest(**{
                **base_data,
                "attendees": ["valid@example.com", "invalid-email"]
            })
        assert "Invalid email addresses" in str(exc_info.value)

        # Test duplicate attendees
        with pytest.raises(ValueError) as exc_info:
            MeetingScheduleRequest(**{
                **base_data,
                "attendees": ["user@example.com", "user@example.com"]
            })
        assert "Duplicate attendees" in str(exc_info.value)

        # Test past meeting time
        with pytest.raises(ValueError) as exc_info:
            MeetingScheduleRequest(**{
                **base_data,
                "attendees": ["user1@example.com", "user2@example.com"],
                "proposed_time": datetime.now() - timedelta(hours=1)
            })
        assert "must be scheduled in the future" in str(exc_info.value)

    def test_model_level_validation(self):
        """Test validation that depends on multiple fields."""
        base_data = {
            "title": "Long Video Call",
            "attendees": ["user1@example.com", "user2@example.com"],
            "proposed_time": datetime.now() + timedelta(hours=1)
        }

        # Video meeting within 2-hour limit - should pass
        valid_video = MeetingScheduleRequest(**{
            **base_data,
            "meeting_type": "video",
            "duration_minutes": 120
        })
        assert valid_video.meeting_type == "video"

        # Video meeting exceeding 2-hour limit - should fail
        with pytest.raises(ValueError) as exc_info:
            MeetingScheduleRequest(**{
                **base_data,
                "meeting_type": "video",
                "duration_minutes": 150
            })
        assert "should not exceed 2 hours" in str(exc_info.value)

        # In-person meeting can exceed 2 hours - should pass
        valid_in_person = MeetingScheduleRequest(**{
            **base_data,
            "meeting_type": "in-person",
            "duration_minutes": 240
        })
        assert valid_in_person.duration_minutes == 240

# Test validation in the context of an agent
async def test_agent_validation_handling():
    """Test how agents handle validation errors."""

    scheduling_agent = Agent(
        'openai:gpt-4o',
        result_type=MeetingScheduleRequest,
        system_prompt="You are a meeting scheduler. Extract meeting details from requests."
    )

    # Create a test scenario where the model returns invalid data
    invalid_response = {
        "title": "Quick Sync",
        "attendees": ["only-one@example.com"],  # Too few attendees
        "duration_minutes": 30,
        "proposed_time": "2024-01-01T10:00:00",  # Likely in the past
        "meeting_type": "video"
    }

    test_model = TestModel(response=json.dumps(invalid_response))

    with scheduling_agent.override(model=test_model):
        with pytest.raises(ValueError) as exc_info:
            await scheduling_agent.run("Schedule a quick sync with just me")

    # The validation error should bubble up
    assert "at least 2 items" in str(exc_info.value)

The key patterns here: test boundaries (15 and 480 minutes, not just 60), test custom validators with intentionally bad data, test cross-field validation (video meetings have different duration limits than in-person), and test that validation errors propagate correctly through the agent.
Testing LangGraph Workflows
Verifying Nodes in Isolation
LangGraph workflows are built from node functions, and those nodes are where you start testing. Each node takes a state dict, does some work, and returns a partial state update. Test them as pure functions first.
from langgraph.graph import StateGraph
from typing import TypedDict, List, Optional, Dict, Any
import pytest

# Define a complex state structure
class DocumentProcessingState(TypedDict):
    document_id: str
    content: str
    entities: Optional[List[Dict[str, str]]]
    sentiment_analysis: Optional[Dict[str, float]]
    summary: Optional[str]
    processing_errors: List[str]
    metadata: Dict[str, Any]

# Individual node functions to test
def extract_entities_node(state: DocumentProcessingState) -> Dict:
    """Extract named entities from document content."""
    try:
        # In production, this would use NLP models
        # For testing, we'll simulate the extraction
        content = state["content"]

        # Simulate entity extraction
        entities = []
        if "Apple" in content:
            entities.append({"text": "Apple", "type": "ORG", "confidence": 0.95})
        if "Tim Cook" in content:
            entities.append({"text": "Tim Cook", "type": "PERSON", "confidence": 0.98})
        if "Cupertino" in content:
            entities.append({"text": "Cupertino", "type": "LOC", "confidence": 0.92})

        return {
            "entities": entities,
            "metadata": {**state.get("metadata", {}), "entities_extracted": True}
        }
    except Exception as e:
        return {
            "processing_errors": state.get("processing_errors", []) + [f"Entity extraction failed: {str(e)}"]
        }

def analyze_sentiment_node(state: DocumentProcessingState) -> Dict:
    """Analyze document sentiment."""
    try:
        content = state["content"]

        # Simulate sentiment analysis
        # In reality, this would use a sentiment model
        sentiment_keywords = {
            "positive": ["excellent", "amazing", "innovative", "breakthrough"],
            "negative": ["disappointing", "failed", "problem", "issue"],
            "neutral": ["announced", "stated", "reported", "mentioned"]
        }

        scores = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}

        for category, keywords in sentiment_keywords.items():
            for keyword in keywords:
                if keyword in content.lower():
                    scores[category] += 0.25

        # Normalize scores
        total = sum(scores.values()) or 1.0
        normalized_scores = {k: v / total for k, v in scores.items()}

        return {
            "sentiment_analysis": normalized_scores,
            "metadata": {**state.get("metadata", {}), "sentiment_analyzed": True}
        }
    except Exception as e:
        return {
            "processing_errors": state.get("processing_errors", []) + [f"Sentiment analysis failed: {str(e)}"]
        }

# Comprehensive node-level tests
class TestDocumentProcessingNodes:
    def test_entity_extraction_success(self):
        """Test successful entity extraction."""
        test_state = DocumentProcessingState(
            document_id="doc123",
            content="Apple CEO Tim Cook announced new products in Cupertino today.",
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={}
        )

        result = extract_entities_node(test_state)

        # Verify entities were extracted
        assert "entities" in result
        assert len(result["entities"]) == 3

        # Check specific entities
        entity_texts = {e["text"] for e in result["entities"]}
        assert "Apple" in entity_texts
        assert "Tim Cook" in entity_texts
        assert "Cupertino" in entity_texts

        # Verify metadata update
        assert result["metadata"]["entities_extracted"] is True

    def test_entity_extraction_empty_content(self):
        """Test entity extraction with no recognizable entities."""
        test_state = DocumentProcessingState(
            document_id="doc124",
            content="This is a generic statement with no specific entities.",
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={}
        )

        result = extract_entities_node(test_state)

        assert "entities" in result
        assert len(result["entities"]) == 0
        assert result["metadata"]["entities_extracted"] is True

    def test_sentiment_analysis_mixed(self):
        """Test sentiment analysis with mixed sentiment."""
        test_state = DocumentProcessingState(
            document_id="doc125",
            content="The product launch was amazing but faced some disappointing technical issues.",
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={}
        )

        result = analyze_sentiment_node(test_state)

        assert "sentiment_analysis" in result
        scores = result["sentiment_analysis"]

        # Should have both positive and negative sentiment
        assert scores["positive"] > 0
        assert scores["negative"] > 0

        # Verify scores sum to 1 (normalized)
        assert abs(sum(scores.values()) - 1.0) < 0.001

    def test_node_error_handling(self):
        """Test that nodes handle errors gracefully."""
        # Create state that will cause an error (missing required field)
        test_state = {
            "document_id": "doc126",
            # Missing 'content' field
            "processing_errors": [],
            "metadata": {}
        }

        result = extract_entities_node(test_state)

        # Should add error to processing_errors
        assert "processing_errors" in result
        assert len(result["processing_errors"]) > 0
        assert "Entity extraction failed" in result["processing_errors"][0]

# Test node composition and data flow
def test_node_composition():
    """Test that nodes can be composed and data flows correctly."""
    initial_state = DocumentProcessingState(
        document_id="doc127",
        content="Apple's innovative products continue to amaze customers worldwide.",
        entities=None,
        sentiment_analysis=None,
        summary=None,
        processing_errors=[],
        metadata={"source": "test"}
    )

    # Process through multiple nodes
    state = initial_state.copy()

    # First node: entity extraction
    entity_result = extract_entities_node(state)
    state.update(entity_result)

    # Second node: sentiment analysis
    sentiment_result = analyze_sentiment_node(state)
    state.update(sentiment_result)

    # Verify cumulative results
    assert state["entities"] is not None
    assert len(state["entities"]) > 0
    assert state["sentiment_analysis"] is not None
    assert state["sentiment_analysis"]["positive"] > state["sentiment_analysis"]["negative"]

    # Verify metadata accumulation
    assert state["metadata"]["entities_extracted"] is True
    assert state["metadata"]["sentiment_analyzed"] is True
    assert state["metadata"]["source"] == "test"  # Original metadata preserved

Notice the error handling test. We pass a state dict missing the content key and verify the node catches the exception and appends to processing_errors instead of crashing. Nodes in a LangGraph workflow must be resilient, because a crash in one node can take down the entire graph execution.

Figure 3: LangGraph Testing Architecture. Node-level tests verify individual functions with controlled inputs. Graph-level tests check full workflow execution from initial to final state. State transition tests validate conditional routing and verify the correct execution paths fire based on state conditions.
Testing Complete Graph Execution
Once individual nodes pass their tests, you move up the pyramid to graph-level testing. Here we compile an actual LangGraph workflow and run it end-to-end with controlled inputs.
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
import asyncio

def create_document_processing_workflow(checkpointer=None):
    """Create a complete document processing workflow."""
    builder = StateGraph(DocumentProcessingState)

    # Add nodes
    builder.add_node("extract_entities", extract_entities_node)
    builder.add_node("analyze_sentiment", analyze_sentiment_node)
    builder.add_node("generate_summary", generate_summary_node)
    builder.add_node("quality_check", quality_check_node)

    # Define the flow
    builder.add_edge(START, "extract_entities")
    builder.add_edge("extract_entities", "analyze_sentiment")

    # Conditional edge based on sentiment
    def route_after_sentiment(state: DocumentProcessingState) -> str:
        if state.get("processing_errors"):
            return "quality_check"

        sentiment = state.get("sentiment_analysis", {})
        # Only generate summary for positive/neutral content
        if sentiment.get("negative", 0) > 0.7:
            return "quality_check"
        return "generate_summary"

    builder.add_conditional_edges(
        "analyze_sentiment",
        route_after_sentiment,
        {
            "generate_summary": "generate_summary",
            "quality_check": "quality_check"
        }
    )

    builder.add_edge("generate_summary", "quality_check")
    builder.add_edge("quality_check", END)

    return builder.compile(checkpointer=checkpointer)

# Additional nodes for complete workflow
def generate_summary_node(state: DocumentProcessingState) -> Dict:
    """Generate document summary."""
    try:
        content = state["content"]
        entities = state.get("entities", [])

        # Simple summary generation (in production, use LLM)
        entity_names = [e["text"] for e in entities]
        summary = f"Document discusses {', '.join(entity_names)}. " if entity_names else ""
        summary += f"Content length: {len(content)} characters."

        return {
            "summary": summary,
            "metadata": {**state.get("metadata", {}), "summary_generated": True}
        }
    except Exception as e:
        return {
            "processing_errors": state.get("processing_errors", []) + [f"Summary generation failed: {str(e)}"]
        }

def quality_check_node(state: DocumentProcessingState) -> Dict:
    """Perform final quality checks."""
    issues = []

    # Check for processing errors
    if state.get("processing_errors"):
        issues.append("Processing errors encountered")

    # Check completeness
    if not state.get("entities"):
        issues.append("No entities extracted")
    if not state.get("sentiment_analysis"):
        issues.append("No sentiment analysis performed")

    # For negative content, flag for review
    sentiment = state.get("sentiment_analysis", {})
    if sentiment.get("negative", 0) > 0.7:
        issues.append("High negative sentiment detected")

    return {
        "metadata": {
            **state.get("metadata", {}),
            "quality_check_completed": True,
            "quality_issues": issues
        }
    }

# Comprehensive graph-level tests
class TestDocumentWorkflow:
    async def test_complete_positive_flow(self):
        """Test the happy path through the workflow."""
        workflow = create_document_processing_workflow()

        initial_state = DocumentProcessingState(
            document_id="test001",
            content="Apple announced groundbreaking innovations at their Cupertino headquarters. CEO Tim Cook expressed excitement about the future.",
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={"test_case": "positive_flow"}
        )

        # Execute the workflow
        result = await workflow.ainvoke(initial_state)

        # Verify all steps completed successfully
        assert result["entities"] is not None
        assert len(result["entities"]) == 3  # Apple, Cupertino, Tim Cook

        assert result["sentiment_analysis"] is not None
        assert result["sentiment_analysis"]["positive"] > result["sentiment_analysis"]["negative"]

        assert result["summary"] is not None
        assert "Apple" in result["summary"]

        # Check metadata for execution tracking
        assert result["metadata"]["entities_extracted"] is True
        assert result["metadata"]["sentiment_analyzed"] is True
        assert result["metadata"]["summary_generated"] is True
        assert result["metadata"]["quality_check_completed"] is True
        assert len(result["metadata"]["quality_issues"]) == 0

    async def test_negative_sentiment_routing(self):
        """Test that negative content skips summary generation."""
        workflow = create_document_processing_workflow()

        initial_state = DocumentProcessingState(
            document_id="test002",
            content="The product launch was a complete disaster. Multiple critical failures and disappointed customers.",
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={"test_case": "negative_flow"}
        )

        result = await workflow.ainvoke(initial_state)

        # Should have completed sentiment analysis
        assert result["sentiment_analysis"] is not None
        assert result["sentiment_analysis"]["negative"] > 0.5

        # Should NOT have generated summary due to negative sentiment
        assert result["summary"] is None
        assert "summary_generated" not in result["metadata"]

        # Should have quality issues flagged
        assert "High negative sentiment detected" in result["metadata"]["quality_issues"]

    async def test_error_propagation(self):
        """Test that errors are properly propagated through the workflow."""
        workflow = create_document_processing_workflow()

        # Create state that will cause errors
        initial_state = DocumentProcessingState(
            document_id="test003",
            content="",  # Empty content should cause issues
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=["Pre-existing error"],
            metadata={"test_case": "error_flow"}
        )

        result = await workflow.ainvoke(initial_state)

        # Errors should be accumulated
        assert len(result["processing_errors"]) >= 1
        assert "Pre-existing error" in result["processing_errors"]

        # Quality check should flag the errors
        assert "Processing errors encountered" in result["metadata"]["quality_issues"]

# Test with checkpointing for complex workflows
async def test_workflow_checkpointing():
    """Test workflow execution with checkpointing."""
    checkpointer = MemorySaver()

    # Compile with the checkpointer attached
    app = create_document_processing_workflow(checkpointer=checkpointer)

    initial_state = DocumentProcessingState(
        document_id="test004",
        content="Test content for checkpointing workflow.",
        entities=None,
        sentiment_analysis=None,
        summary=None,
        processing_errors=[],
        metadata={}
    )

    # Execute with thread ID for checkpointing
    thread_config = {"configurable": {"thread_id": "test-thread-001"}}

    # First execution
    result1 = await app.ainvoke(initial_state, config=thread_config)

    # Verify checkpoint was saved
    saved_state = await checkpointer.aget(thread_config)
    assert saved_state is not None

    # Simulate resuming from checkpoint
    # In a real scenario, this might be after a failure
    result2 = await app.ainvoke(None, config=thread_config)

    # Results should be consistent
    assert result2["document_id"] == result1["document_id"]
    assert result2["entities"] == result1["entities"]

The negative sentiment routing test is the critical one. It verifies that when sentiment exceeds the 0.7 negative threshold, the workflow skips summary generation and routes directly to quality check. That conditional branch is exactly the kind of logic that unit tests cannot catch, because it only manifests during full graph execution.
Validating State Transitions and Routing
State transitions are where LangGraph bugs hide. We built a dedicated test pattern that tracks which nodes execute and in what order, giving us full visibility into the routing logic.
class TestStateTransitions:
    def test_conditional_routing_logic(self):
        """Test that conditional edges route correctly based on state."""
        # Create a simple workflow with conditional routing
        builder = StateGraph(DocumentProcessingState)

        # Track which nodes were visited
        visited_nodes = []

        def track_node(name: str):
            def node_func(state):
                visited_nodes.append(name)
                return state
            return node_func

        builder.add_node("start", track_node("start"))
        builder.add_node("path_a", track_node("path_a"))
        builder.add_node("path_b", track_node("path_b"))
        builder.add_node("end", track_node("end"))

        # Conditional routing based on document length
        def route_by_length(state):
            if len(state.get("content", "")) > 100:
                return "path_a"
            return "path_b"

        builder.add_edge(START, "start")
        builder.add_conditional_edges(
            "start",
            route_by_length,
            {
                "path_a": "path_a",
                "path_b": "path_b"
            }
        )
        builder.add_edge("path_a", "end")
        builder.add_edge("path_b", "end")
        builder.add_edge("end", END)

        workflow = builder.compile()

        # Test path A (long content)
        visited_nodes.clear()
        long_content_state = DocumentProcessingState(
            document_id="test",
            content="x" * 200,  # Long content
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={}
        )

        workflow.invoke(long_content_state)
        assert visited_nodes == ["start", "path_a", "end"]

        # Test path B (short content)
        visited_nodes.clear()
        short_content_state = DocumentProcessingState(
            document_id="test",
            content="short",  # Short content
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={}
        )

        workflow.invoke(short_content_state)
        assert visited_nodes == ["start", "path_b", "end"]

    async def test_parallel_execution_paths(self):
        """Test workflows with parallel execution branches."""
        builder = StateGraph(DocumentProcessingState)

        execution_times = {}

        async def slow_node_a(state):
            start = asyncio.get_event_loop().time()
            await asyncio.sleep(0.1)  # Simulate work
            execution_times["node_a"] = asyncio.get_event_loop().time() - start
            return {"metadata": {**state.get("metadata", {}), "node_a_completed": True}}

        async def slow_node_b(state):
            start = asyncio.get_event_loop().time()
            await asyncio.sleep(0.1)  # Simulate work
            execution_times["node_b"] = asyncio.get_event_loop().time() - start
            return {"metadata": {**state.get("metadata", {}), "node_b_completed": True}}

        def merge_results(state):
            # This node runs after both parallel nodes complete
            return {
                "metadata": {
                    **state.get("metadata", {}),
                    "merge_completed": True,
                    "parallel_execution_verified": True
                }
            }

        # Build workflow with parallel execution
        builder.add_node("split", lambda x: x)  # Pass-through node
        builder.add_node("node_a", slow_node_a)
        builder.add_node("node_b", slow_node_b)
        builder.add_node("merge", merge_results)

        builder.add_edge(START, "split")
        builder.add_edge("split", "node_a")
        builder.add_edge("split", "node_b")
        builder.add_edge("node_a", "merge")
        builder.add_edge("node_b", "merge")
        builder.add_edge("merge", END)

        workflow = builder.compile()

        # Execute workflow
        start_time = asyncio.get_event_loop().time()
        result = await workflow.ainvoke(DocumentProcessingState(
            document_id="parallel_test",
            content="test",
            entities=None,
            sentiment_analysis=None,
            summary=None,
            processing_errors=[],
            metadata={}
        ))
        total_time = asyncio.get_event_loop().time() - start_time

        # Verify both nodes executed
        assert result["metadata"]["node_a_completed"] is True
        assert result["metadata"]["node_b_completed"] is True
        assert result["metadata"]["merge_completed"] is True

        # Verify parallel execution (total time should be ~0.1s, not ~0.2s)
        assert total_time < 0.15  # Allow some overhead
        assert execution_times["node_a"] >= 0.1
        assert execution_times["node_b"] >= 0.1

The track_node pattern is one of our most reused test utilities. Wrap each node in a tracker, run the workflow, then assert on the exact sequence. The parallel execution test goes further: it verifies that LangGraph actually runs the branches concurrently by checking that total wall-clock time is roughly equal to one branch, not both added together.
KEY INSIGHT: Build “node visitor” tracking into your LangGraph tests. Asserting on the exact sequence of visited nodes catches routing bugs that would be invisible if you only checked final state.
Bringing It All Together: End-to-End Testing
Full Pipeline Verification
End-to-end tests wire together everything: Pydantic AI agents, LangGraph workflows, mocked external services, and validation logic. These tests verify that the entire pipeline behaves correctly from input to output.
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
from langgraph.graph import StateGraph, START, END
from pydantic import BaseModel, Field
from typing import List, Dict, Optional, TypedDict
from unittest.mock import MagicMock, AsyncMock
import json

# Define our domain models
class NewsArticle(BaseModel):
    """Structured news article data."""
    headline: str = Field(..., min_length=10, max_length=200)
    content: str = Field(..., min_length=50)
    category: str = Field(..., pattern="^(tech|business|health|sports)$")
    entities: List[str] = Field(default_factory=list)
    sentiment_score: float = Field(default=0.0, ge=-1, le=1)

class AnalysisResult(BaseModel):
    """Result of article analysis."""
    article_id: str
    key_insights: List[str] = Field(..., min_items=1, max_items=5)
    recommended_actions: List[str] = Field(default_factory=list)
    risk_level: str = Field(default="low", pattern="^(low|medium|high)$")

# Create specialized agents
content_analyzer = Agent(
    'openai:gpt-4o',
    result_type=AnalysisResult,
    system_prompt="Analyze news articles for key insights and actionable recommendations."
)

# Define workflow state
class NewsAnalysisState(TypedDict):
    article: NewsArticle
    raw_content: str
    analysis: Optional[AnalysisResult]
    external_data: Optional[Dict]
    notifications_sent: List[str]

# Mock external services
class ExternalServices:
    def __init__(self):
        self.database = AsyncMock()
        self.notification_service = AsyncMock()
        self.enrichment_api = AsyncMock()

# Comprehensive end-to-end test
class TestNewsAnalysisSystem:
    async def test_complete_article_processing_flow(self):
        """Test the entire article processing pipeline end-to-end."""
        # Set up mocks
        services = ExternalServices()

        # Mock database responses
        services.database.fetch_article.return_value = {
            "id": "article-123",
            "raw_content": "Apple announced record profits today. CEO Tim Cook stated that innovation continues to drive growth. The tech giant's stock rose 5% in after-hours trading.",
            "metadata": {"source": "Reuters", "timestamp": "2024-01-15T10:00:00Z"}
        }

        # Mock enrichment API
        services.enrichment_api.enrich.return_value = {
            "related_articles": ["article-120", "article-121"],
            "market_data": {"AAPL": {"change": "+5%", "volume": "high"}}
        }

        # Create workflow with mocked services
        def create_workflow(services):
            builder = StateGraph(NewsAnalysisState)

            async def fetch_article(state):
                # Fetch from database
                article_data = await services.database.fetch_article("article-123")
                return {"raw_content": article_data["raw_content"]}

            async def parse_article(state):
                # In real system, this would use an LLM
                # For testing, we create structured data
                article = NewsArticle(
                    headline="Apple Reports Record Profits",
                    content=state["raw_content"],
                    category="tech",
                    entities=["Apple", "Tim Cook"],
                    sentiment_score=0.8
                )
                return {"article": article}

            async def enrich_data(state):
                # Call external enrichment service
                enrichment = await services.enrichment_api.enrich(
                    entities=state["article"].entities
                )
                return {"external_data": enrichment}

            async def analyze_article(state):
                # Use our Pydantic AI agent
                with content_analyzer.override(model=TestModel()):
                    # Simulate agent response
                    analysis = AnalysisResult(
                        article_id="article-123",
                        key_insights=[
                            "Apple shows strong financial performance",
                            "Positive market reaction with 5% stock increase",
                            "Leadership emphasizes continued innovation"
                        ],
                        recommended_actions=[
                            "Monitor competitor responses",
                            "Track sustained stock performance"
                        ],
                        risk_level="low"
                    )
                return {"analysis": analysis}

            async def send_notifications(state):
                # Send notifications based on analysis
                notifications = []

                if state["analysis"].risk_level == "high":
                    await services.notification_service.send_alert(
                        "High risk article detected",
                        state["article"].headline
                    )
                    notifications.append("risk_alert")

                if state["article"].sentiment_score < -0.5:
                    await services.notification_service.send_alert(
                        "Negative sentiment detected",
                        state["article"].headline
                    )
                    notifications.append("sentiment_alert")

                return {"notifications_sent": notifications}

            # Build the workflow
            builder.add_node("fetch", fetch_article)
            builder.add_node("parse", parse_article)
            builder.add_node("enrich", enrich_data)
            builder.add_node("analyze", analyze_article)
            builder.add_node("notify", send_notifications)

            # Define flow
            builder.add_edge(START, "fetch")
            builder.add_edge("fetch", "parse")
            builder.add_edge("parse", "enrich")
            builder.add_edge("enrich", "analyze")
            builder.add_edge("analyze", "notify")
            builder.add_edge("notify", END)

            return builder.compile()

        # Execute the workflow
        workflow = create_workflow(services)
        initial_state = NewsAnalysisState(
            article=None,
            raw_content="",
            analysis=None,
            external_data=None,
            notifications_sent=[]
        )

        result = await workflow.ainvoke(initial_state)

        # Comprehensive assertions
        # 1. Verify article was fetched and parsed
        assert result["article"] is not None
        assert result["article"].headline == "Apple Reports Record Profits"
        assert result["article"].category == "tech"
        assert len(result["article"].entities) == 2

        # 2. Verify enrichment occurred
        assert result["external_data"] is not None
        assert "related_articles" in result["external_data"]
        assert "market_data" in result["external_data"]

        # 3. Verify analysis was performed
        assert result["analysis"] is not None
        assert len(result["analysis"].key_insights) == 3
        assert result["analysis"].risk_level == "low"

        # 4. Verify service interactions
        services.database.fetch_article.assert_called_once_with("article-123")
        services.enrichment_api.enrich.assert_called_once()

        # 5. Verify notifications (none sent for positive low-risk article)
        assert len(result["notifications_sent"]) == 0
        services.notification_service.send_alert.assert_not_called()

    async def test_error_handling_and_recovery(self):
        """Test system behavior when components fail."""
        services = ExternalServices()

        # Configure enrichment API to fail
        services.enrichment_api.enrich.side_effect = Exception("API timeout")

        # Create workflow with error handling
        def create_resilient_workflow(services):
            builder = StateGraph(NewsAnalysisState)

            async def enrich_with_fallback(state):
                try:
                    enrichment = await services.enrichment_api.enrich(
                        entities=state["article"].entities
                    )
                    return {"external_data": enrichment}
                except Exception as e:
                    # Fallback to basic data
                    return {
                        "external_data": {
                            "error": str(e),
                            "fallback": True,
                            "related_articles": []
                        }
                    }

            # ... (other nodes remain the same)

            builder.add_node("enrich", enrich_with_fallback)
            # ... (build rest of workflow)

            return builder.compile()

        workflow = create_resilient_workflow(services)
        result = await workflow.ainvoke(initial_state)

        # Verify graceful degradation
        assert result["external_data"] is not None
        assert result["external_data"]["fallback"] is True
        assert "error" in result["external_data"]

        # Verify workflow continued despite enrichment failure
        assert result["analysis"] is not None

The error recovery test is particularly important. When the enrichment API times out, the workflow should degrade gracefully, filling in fallback data and continuing rather than crashing. We check both that the fallback was used and that downstream nodes still executed successfully.
CI/CD Pipeline for Agent Testing
Running these tests consistently requires a CI pipeline that separates fast deterministic tests from slow LLM-dependent ones. Here is the GitHub Actions configuration we use:
name: Langgraph + Pydantic AI Test Suite

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
  schedule:
    # Run nightly tests with live LLMs
    - cron: '0 2 * * *'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9', '3.10', '3.11']

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Cache dependencies
        uses: actions/cache@v3
        with:
          path: |
            ~/.cache/pip
            .venv
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest pytest-asyncio pytest-cov pytest-timeout

      - name: Run unit tests with coverage
        run: |
          pytest tests/unit/ -v --cov=src --cov-report=xml --timeout=30
        env:
          TESTING: "true"

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests

    services:
      redis:
        image: redis:alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v3

      - name: Run integration tests
        run: |
          pytest tests/integration/ -v --timeout=60
        env:
          REDIS_URL: redis://localhost:6379
          USE_TEST_MODELS: "true"

  workflow-tests:
    runs-on: ubuntu-latest
    needs: integration-tests

    steps:
      - uses: actions/checkout@v3

      - name: Run workflow tests
        run: |
          pytest tests/workflows/ -v -m "not slow" --timeout=120

  nightly-llm-tests:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Run tests with real LLMs
        run: |
          pytest tests/e2e/ -v -m "llm_required" --timeout=300
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          RUN_EXPENSIVE_TESTS: "true"

  performance-benchmarks:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'

    steps:
      - uses: actions/checkout@v3

      - name: Run performance benchmarks
        run: |
          python -m pytest tests/benchmarks/ -v --benchmark-only

      - name: Store benchmark results
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'pytest'
          output-file-path: output.json
          github-token: ${{ secrets.GITHUB_TOKEN }}
          auto-push: true

The pipeline has five layers. Unit tests run on every push across 3 Python versions with a 30-second timeout. Integration tests spin up Redis and use TestModel. Workflow tests exercise full LangGraph graphs. Nightly tests hit real LLMs to catch model behavior changes. Performance benchmarks run on main branch pushes and track regressions over time.
Building an Evaluation Framework
Beyond pass/fail testing, AI agents need quality evaluation. We built a lightweight evaluation framework that scores agent responses across multiple dimensions: required entities, forbidden phrases, and flexible output matching.
from typing import List, Dict, Tuple
from dataclasses import dataclass
from contextlib import nullcontext
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
import numpy as np

@dataclass
class EvaluationCase:
    """Single test case for agent evaluation."""
    input_text: str
    expected_outputs: List[str]  # Multiple acceptable outputs
    required_entities: List[str]  # Entities that must be detected
    forbidden_phrases: List[str]  # Phrases that shouldn't appear
    min_quality_score: float = 0.8

@dataclass
class EvaluationResult:
    """Results from evaluating an agent."""
    total_cases: int
    passed_cases: int
    failed_cases: List[Tuple[str, str]]  # (input, reason)
    accuracy: float
    average_quality_score: float
    performance_metrics: Dict[str, float]

class AgentEvaluator:
    """Comprehensive evaluation framework for AI agents."""

    def __init__(self, agent: Agent, test_cases: List[EvaluationCase]):
        self.agent = agent
        self.test_cases = test_cases

    async def evaluate(self, use_test_model: bool = True) -> EvaluationResult:
        """Run comprehensive evaluation of the agent."""
        passed = 0
        failed_cases = []
        quality_scores = []

        # Override with TestModel for consistent evaluation
        if use_test_model:
            context_manager = self.agent.override(model=TestModel())
        else:
            context_manager = nullcontext()  # No override

        with context_manager:
            for case in self.test_cases:
                try:
                    # Run agent
                    result = await self.agent.run(case.input_text)

                    # Evaluate result
                    evaluation = self._evaluate_single_case(result.data, case)

                    if evaluation["passed"]:
                        passed += 1
                        quality_scores.append(evaluation["quality_score"])
                    else:
                        failed_cases.append((case.input_text, evaluation["reason"]))

                except Exception as e:
                    failed_cases.append((case.input_text, f"Exception: {str(e)}"))

        return EvaluationResult(
            total_cases=len(self.test_cases),
            passed_cases=passed,
            failed_cases=failed_cases,
            accuracy=passed / len(self.test_cases),
            average_quality_score=np.mean(quality_scores) if quality_scores else 0.0,
            performance_metrics=self._calculate_performance_metrics()
        )

    def _evaluate_single_case(self, output: str, case: EvaluationCase) -> Dict:
        """Evaluate a single test case result."""
        reasons = []
        quality_score = 1.0

        # Check for required entities
        missing_entities = []
        for entity in case.required_entities:
            if entity.lower() not in output.lower():
                missing_entities.append(entity)
                quality_score -= 0.1

        if missing_entities:
            reasons.append(f"Missing entities: {missing_entities}")

        # Check for forbidden phrases
        found_forbidden = []
        for phrase in case.forbidden_phrases:
            if phrase.lower() in output.lower():
                found_forbidden.append(phrase)
                quality_score -= 0.2

        if found_forbidden:
            reasons.append(f"Contains forbidden phrases: {found_forbidden}")

        # Check if output matches any expected outputs
        matched_expected = False
        for expected in case.expected_outputs:
            # Flexible matching - could be substring, semantic similarity, etc.
            if self._flexible_match(output, expected):
                matched_expected = True
                break

        if not matched_expected and case.expected_outputs:
            reasons.append("Output doesn't match expected patterns")
            quality_score -= 0.3

        # Ensure quality score is within bounds
        quality_score = max(0.0, min(1.0, quality_score))

        return {
            "passed": quality_score >= case.min_quality_score and not reasons,
            "quality_score": quality_score,
            "reason": "; ".join(reasons) if reasons else "Passed"
        }

    def _flexible_match(self, output: str, expected: str) -> bool:
        """Flexible matching that handles variations."""
        # Simple implementation - in practice, use semantic similarity
        output_lower = output.lower().strip()
        expected_lower = expected.lower().strip()

        # Exact match
        if output_lower == expected_lower:
            return True

        # Substring match
        if expected_lower in output_lower:
            return True

        # Key phrases match (80% of words present)
        expected_words = set(expected_lower.split())
        output_words = set(output_lower.split())
        overlap = len(expected_words.intersection(output_words))

        return overlap / len(expected_words) >= 0.8 if expected_words else False

    def _calculate_performance_metrics(self) -> Dict[str, float]:
        """Calculate additional performance metrics."""
        # In a real implementation, track timing, token usage, etc.
        return {
            "avg_response_time": 0.1,  # seconds
            "avg_tokens_used": 150,
            "error_rate": 0.02
        }

# Example usage
async def evaluate_customer_service_agent():
    """Evaluate a customer service agent comprehensively."""

    # Define evaluation cases
    test_cases = [
        EvaluationCase(
            input_text="My order #12345 hasn't arrived and it's been 2 weeks!",
            expected_outputs=[
                "I sincerely apologize for the delay with order #12345",
                "I'm sorry to hear about the delay with your order #12345"
            ],
            required_entities=["#12345", "apologize"],
            forbidden_phrases=["calm down", "not my problem"],
            min_quality_score=0.8
        ),
        EvaluationCase(
            input_text="How do I return a defective product?",
            expected_outputs=[
                "To return a defective product, please follow these steps",
                "I'll help you with the return process for your defective product"
            ],
            required_entities=["return", "defective"],
            forbidden_phrases=["figure it out yourself", "too bad"],
            min_quality_score=0.85
        ),
        # ... more test cases
    ]

    # Create agent
    agent = Agent(
        'openai:gpt-4o',
        system_prompt="You are a helpful customer service representative."
    )

    # Run evaluation
    evaluator = AgentEvaluator(agent, test_cases)
    results = await evaluator.evaluate(use_test_model=True)

    # Report results
    print(f"Evaluation Results:")
    print(f"- Accuracy: {results.accuracy:.1%}")
    print(f"- Average Quality: {results.average_quality_score:.2f}")
    print(f"- Failed Cases: {len(results.failed_cases)}")

    for input_text, reason in results.failed_cases[:3]:  # Show first 3 failures
        print(f"  - Input: '{input_text[:50]}...'")
        print(f"    Reason: {reason}")

Performance Testing Under Load
Measuring What Matters in Production
Performance testing for AI agent systems goes beyond simple response time. You need to track latency percentiles, throughput under concurrency, per-node execution time within LangGraph workflows, and resource consumption over sustained load.

Figure 4: Performance Testing Framework. Test queries of varying complexity feed into the test executor, which runs the agent system and collects raw data. The analysis covers four dimensions: latency metrics (avg, p50, p95, p99), throughput (requests per second), resource usage (memory and CPU), and per-node execution times. Bottleneck identification drives optimization priorities.
Here is the benchmarking framework we use:
import asyncioimport timeimport psutilimport statisticsfrom dataclasses import dataclass, fieldfrom typing import List, Dict, Optional, Callablefrom concurrent.futures import ThreadPoolExecutorimport matplotlib.pyplot as pltfrom datetime import datetime
@dataclassclass PerformanceMetrics: """Comprehensive performance metrics for agent execution.""" request_id: str start_time: float end_time: float total_duration: float node_timings: Dict[str, float] = field(default_factory=dict) memory_usage_mb: float = 0.0 cpu_usage_percent: float = 0.0 tokens_used: int = 0 error_occurred: bool = False error_message: Optional[str] = None
@dataclassclass BenchmarkResult: """Aggregated benchmark results.""" total_requests: int successful_requests: int failed_requests: int avg_latency_ms: float p50_latency_ms: float p95_latency_ms: float p99_latency_ms: float throughput_rps: float avg_memory_mb: float peak_memory_mb: float avg_cpu_percent: float node_performance: Dict[str, Dict[str, float]]
class PerformanceBenchmark: """Comprehensive performance benchmarking for Langgraph + Pydantic AI systems."""
def __init__(self, agent_system, test_data: List[Dict]): self.agent_system = agent_system self.test_data = test_data self.metrics: List[PerformanceMetrics] = []
async def run_benchmark( self, duration_seconds: int = 60, concurrent_requests: int = 10, warmup_requests: int = 5 ) -> BenchmarkResult: """Run a comprehensive performance benchmark."""
# Warmup phase print(f"Running {warmup_requests} warmup requests...") for i in range(warmup_requests): await self._execute_single_request(f"warmup-{i}", self.test_data[0])
# Clear warmup metrics self.metrics.clear()
# Main benchmark print(f"Running benchmark for {duration_seconds} seconds with {concurrent_requests} concurrent requests...")
start_time = time.time() end_time = start_time + duration_seconds request_count = 0
# Create a pool of requests async def request_worker(worker_id: int): nonlocal request_count while time.time() < end_time: test_case = self.test_data[request_count % len(self.test_data)] request_id = f"req-{worker_id}-{request_count}" request_count += 1
await self._execute_single_request(request_id, test_case)
# Run concurrent workers workers = [request_worker(i) for i in range(concurrent_requests)] await asyncio.gather(*workers)
# Calculate results return self._calculate_results(time.time() - start_time)
    async def _execute_single_request(self, request_id: str, test_input: Dict) -> PerformanceMetrics:
        """Execute a single request and collect metrics."""
        metrics = PerformanceMetrics(
            request_id=request_id,
            start_time=time.time(),
            end_time=0,
            total_duration=0
        )

        # Monitor system resources for this process
        process = psutil.Process()
        initial_memory = process.memory_info().rss / 1024 / 1024  # MB

        try:
            # Track per-node execution times when the system exposes a LangGraph graph.
            # This assumes the wrapped system keeps a mutable node mapping on `_graph.nodes`.
            if hasattr(self.agent_system, '_graph'):
                node_timings = {}

                # Wrap each node so it records its own execution time
                original_nodes = {}
                for node_name, node_func in self.agent_system._graph.nodes.items():
                    original_nodes[node_name] = node_func

                    async def timed_node(state, _node_name=node_name, _original=node_func):
                        node_start = time.time()
                        if asyncio.iscoroutinefunction(_original):
                            result = await _original(state)
                        else:
                            result = _original(state)
                        node_timings[_node_name] = time.time() - node_start
                        return result

                    self.agent_system._graph.nodes[node_name] = timed_node

                try:
                    # Execute the request through the graph
                    result = await self.agent_system.ainvoke(test_input)
                finally:
                    # Always restore the original nodes, even if the request fails
                    for node_name, node_func in original_nodes.items():
                        self.agent_system._graph.nodes[node_name] = node_func

                metrics.node_timings = node_timings
            else:
                # Direct agent execution
                result = await self.agent_system.run(test_input['query'])

            # Collect resource usage
            metrics.memory_usage_mb = process.memory_info().rss / 1024 / 1024 - initial_memory
            metrics.cpu_usage_percent = process.cpu_percent(interval=0.1)

            # Extract token usage if the result exposes it
            if hasattr(result, 'usage'):
                metrics.tokens_used = result.usage.get('total_tokens', 0)

        except Exception as e:
            metrics.error_occurred = True
            metrics.error_message = str(e)

        metrics.end_time = time.time()
        metrics.total_duration = metrics.end_time - metrics.start_time

        self.metrics.append(metrics)
        return metrics
    def _calculate_results(self, total_duration: float) -> BenchmarkResult:
        """Calculate aggregate benchmark results."""
        successful_metrics = [m for m in self.metrics if not m.error_occurred]
        failed_count = len(self.metrics) - len(successful_metrics)

        if not successful_metrics:
            raise ValueError("No successful requests to analyze")

        # Latency percentiles (sorted, in milliseconds)
        latencies = sorted(m.total_duration * 1000 for m in successful_metrics)

        # Per-node performance
        node_performance = {}
        all_nodes = set()
        for m in successful_metrics:
            all_nodes.update(m.node_timings.keys())

        for node in all_nodes:
            # Sort node timings so the p95 index is meaningful
            node_times = sorted(
                m.node_timings[node] * 1000
                for m in successful_metrics
                if node in m.node_timings
            )
            if node_times:
                node_performance[node] = {
                    'avg_ms': statistics.mean(node_times),
                    'p95_ms': node_times[int(len(node_times) * 0.95)],
                    'percentage': statistics.mean(node_times) / statistics.mean(latencies) * 100
                }

        return BenchmarkResult(
            total_requests=len(self.metrics),
            successful_requests=len(successful_metrics),
            failed_requests=failed_count,
            avg_latency_ms=statistics.mean(latencies),
            p50_latency_ms=latencies[int(len(latencies) * 0.50)],
            p95_latency_ms=latencies[int(len(latencies) * 0.95)],
            p99_latency_ms=latencies[int(len(latencies) * 0.99)],
            throughput_rps=len(successful_metrics) / total_duration,
            avg_memory_mb=statistics.mean(m.memory_usage_mb for m in successful_metrics),
            peak_memory_mb=max(m.memory_usage_mb for m in successful_metrics),
            avg_cpu_percent=statistics.mean(m.cpu_usage_percent for m in successful_metrics),
            node_performance=node_performance
        )
    def generate_report(self, result: BenchmarkResult, output_file: str = "benchmark_report.png"):
        """Generate a visual performance report."""
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

        # Latency distribution
        latencies = [m.total_duration * 1000 for m in self.metrics if not m.error_occurred]
        ax1.hist(latencies, bins=50, alpha=0.7, color='blue', edgecolor='black')
        ax1.axvline(result.p50_latency_ms, color='red', linestyle='--',
                    label=f'P50: {result.p50_latency_ms:.1f}ms')
        ax1.axvline(result.p95_latency_ms, color='orange', linestyle='--',
                    label=f'P95: {result.p95_latency_ms:.1f}ms')
        ax1.set_xlabel('Latency (ms)')
        ax1.set_ylabel('Frequency')
        ax1.set_title('Latency Distribution')
        ax1.legend()

        # Throughput over time (requests started per one-second bucket)
        time_buckets = {}
        for m in self.metrics:
            bucket = int(m.start_time - self.metrics[0].start_time)
            time_buckets[bucket] = time_buckets.get(bucket, 0) + 1

        times = sorted(time_buckets.keys())
        throughputs = [time_buckets[t] for t in times]
        ax2.plot(times, throughputs, marker='o')
        ax2.set_xlabel('Time (seconds)')
        ax2.set_ylabel('Requests per second')
        ax2.set_title('Throughput Over Time')
        ax2.grid(True, alpha=0.3)

        # Node performance breakdown
        if result.node_performance:
            nodes = list(result.node_performance.keys())
            avg_times = [result.node_performance[n]['avg_ms'] for n in nodes]
            ax3.barh(nodes, avg_times, color='green', alpha=0.7)
            ax3.set_xlabel('Average Time (ms)')
            ax3.set_title('Node Performance Breakdown')
            ax3.grid(True, alpha=0.3)

        # Resource usage per request
        memory_usage = [m.memory_usage_mb for m in self.metrics if not m.error_occurred]
        cpu_usage = [m.cpu_usage_percent for m in self.metrics if not m.error_occurred]

        ax4_twin = ax4.twinx()
        ax4.plot(range(len(memory_usage)), memory_usage, 'b-', label='Memory (MB)')
        ax4_twin.plot(range(len(cpu_usage)), cpu_usage, 'r-', label='CPU (%)')
        ax4.set_xlabel('Request Number')
        ax4.set_ylabel('Memory (MB)', color='b')
        ax4_twin.set_ylabel('CPU (%)', color='r')
        ax4.set_title('Resource Usage')
        ax4.tick_params(axis='y', labelcolor='b')
        ax4_twin.tick_params(axis='y', labelcolor='r')

        plt.tight_layout()
        plt.savefig(output_file)
        plt.close()

        # Print summary
        print("\n" + "=" * 50)
        print("PERFORMANCE BENCHMARK RESULTS")
        print("=" * 50)
        print(f"Total Requests: {result.total_requests}")
        print(f"Successful: {result.successful_requests} "
              f"({result.successful_requests / result.total_requests * 100:.1f}%)")
        print(f"Failed: {result.failed_requests}")
        print("\nLatency Metrics:")
        print(f"  Average: {result.avg_latency_ms:.1f}ms")
        print(f"  P50: {result.p50_latency_ms:.1f}ms")
        print(f"  P95: {result.p95_latency_ms:.1f}ms")
        print(f"  P99: {result.p99_latency_ms:.1f}ms")
        print(f"\nThroughput: {result.throughput_rps:.1f} requests/second")
        print("\nResource Usage:")
        print(f"  Avg Memory: {result.avg_memory_mb:.1f}MB")
        print(f"  Peak Memory: {result.peak_memory_mb:.1f}MB")
        print(f"  Avg CPU: {result.avg_cpu_percent:.1f}%")

        if result.node_performance:
            print("\nNode Performance:")
            for node, perf in sorted(
                result.node_performance.items(),
                key=lambda x: x[1]['avg_ms'],
                reverse=True
            ):
                print(f"  {node}: {perf['avg_ms']:.1f}ms avg, "
                      f"{perf['p95_ms']:.1f}ms p95 ({perf['percentage']:.1f}% of total)")
# Example usage
async def benchmark_document_processing_system():
    """Benchmark a complete document processing system."""
    # Create test data
    test_documents = [
        {
            "document_id": f"doc-{i}",
            "content": f"Sample document {i} with various content..." * 50,
            "processing_options": {
                "extract_entities": True,
                "analyze_sentiment": True,
                "generate_summary": i % 2 == 0  # Only half generate summaries
            }
        }
        for i in range(10)
    ]

    # Create your agent system (workflow or agent)
    document_processor = create_document_processing_workflow()

    # Run benchmark
    benchmark = PerformanceBenchmark(document_processor, test_documents)
    result = await benchmark.run_benchmark(
        duration_seconds=60,
        concurrent_requests=5,
        warmup_requests=10
    )

    # Generate report
    benchmark.generate_report(result)

The node-level timing breakdown is the most actionable part of this framework. When your p95 latency spikes, you can immediately see which node is the bottleneck. In our system, we discovered that the entity extraction node was taking 4x longer than expected because of an unoptimized regex, something we never would have found with aggregate timing alone.
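For instance, once a run completes you can surface the worst offender straight from the report data. A small sketch, assuming result is the BenchmarkResult returned by the run above:

# Surface the slowest node from the benchmark result
slowest_node, perf = max(result.node_performance.items(), key=lambda kv: kv[1]['avg_ms'])
print(f"Bottleneck: {slowest_node} at {perf['avg_ms']:.1f}ms avg "
      f"({perf['percentage']:.1f}% of total latency)")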
Strategies That Hold Up in Production
Eight Practices We Learned the Hard Way
We have built and tested multiple LangGraph + Pydantic AI systems; these are the eight practices that made the biggest difference.
Isolate LLM dependencies. Always use TestModel or FunctionModel in your test suite. We save real LLM calls for nightly evaluation runs. Our unit tests went from 12 minutes and $3 per run to 40 seconds and $0.
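A minimal sketch of that isolation with Pydantic AI's TestModel; the agent definition and prompt are placeholders, and the result attribute is .data on older pydantic-ai versions:

from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

# Hypothetical agent; in a real suite this lives in the application package
support_agent = Agent('openai:gpt-4o', system_prompt='Route customer messages.')

def test_routing_logic_without_llm():
    # override() swaps the real model for TestModel, so no network call is made
    with support_agent.override(model=TestModel()):
        result = support_agent.run_sync('My order never arrived')
    # TestModel returns synthetic but schema-valid output in milliseconds
    assert result.output is not None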
Test at every pyramid level. Unit tests catch logic errors in seconds. Integration tests verify component handoffs. Workflow tests expose routing bugs. E2E tests confirm the user experience. Skip any layer and bugs slip through.
Validate state transitions explicitly. In LangGraph, the conditional edges are where the logic lives and where the bugs hide. Use the node-visitor tracking pattern to assert on exact execution paths, not just final outputs.
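A minimal sketch of that tracking pattern; the state shape, node names, and routing rule are illustrative. Each node appends its name to the state, and the test asserts on the exact path taken:

from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    query: str
    visited_nodes: List[str]

def classify(state: AgentState) -> dict:
    return {"visited_nodes": state["visited_nodes"] + ["classify"]}

def escalate(state: AgentState) -> dict:
    return {"visited_nodes": state["visited_nodes"] + ["escalate"]}

def route(state: AgentState) -> str:
    # Illustrative condition: refund requests go to escalation
    return "escalate" if "refund" in state["query"].lower() else END

graph = StateGraph(AgentState)
graph.add_node("classify", classify)
graph.add_node("escalate", escalate)
graph.add_edge(START, "classify")
graph.add_conditional_edges("classify", route)
graph.add_edge("escalate", END)
workflow = graph.compile()

def test_refund_requests_take_the_escalation_path():
    result = workflow.invoke({"query": "I want a refund", "visited_nodes": []})
    # Assert the exact execution path, not just the final output
    assert result["visited_nodes"] == ["classify", "escalate"]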
Test both sides of every validation boundary. With Pydantic AI, test that valid data passes and that invalid data fails with the correct error message. Our meeting scheduler accepted solo meetings for three weeks because we never tested the minimum attendees constraint.
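A sketch of testing both sides of a boundary; the MeetingRequest model and its minimum-attendees rule are illustrative stand-ins for your own schemas:

import pytest
from pydantic import BaseModel, Field, ValidationError

class MeetingRequest(BaseModel):
    # Illustrative constraint: a meeting needs at least two attendees
    title: str
    attendees: list[str] = Field(min_length=2)

def test_valid_meeting_passes():
    request = MeetingRequest(title="Planning", attendees=["ana@x.com", "raj@x.com"])
    assert len(request.attendees) == 2

def test_solo_meeting_is_rejected_with_the_right_error():
    with pytest.raises(ValidationError) as exc_info:
        MeetingRequest(title="Planning", attendees=["ana@x.com"])
    # Verify the failure points at the attendees constraint, not just that it failed
    assert "attendees" in str(exc_info.value)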
Mock external services with dependency injection. Pass services as parameters to your workflow factory function. In tests, inject AsyncMock instances. In production, inject real clients. No monkeypatching, no test pollution.
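A minimal sketch of that injection pattern; create_support_workflow and the CRM/ticketing clients are placeholders for whatever your workflow actually depends on:

import pytest
from unittest.mock import AsyncMock

@pytest.mark.asyncio
async def test_workflow_uses_injected_services():
    # In production these are real clients; in tests they are AsyncMock instances
    crm_client = AsyncMock()
    crm_client.lookup_customer.return_value = {"id": "cust-42", "tier": "gold"}
    ticket_client = AsyncMock()

    # Hypothetical factory that accepts its dependencies as parameters
    workflow = create_support_workflow(crm_client=crm_client, ticket_client=ticket_client)
    await workflow.ainvoke({"query": "Where is my order?"})

    # Verify the integration logic without hitting a real CRM or ticketing system
    crm_client.lookup_customer.assert_awaited_once()
    ticket_client.create_ticket.assert_not_awaited()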
Separate test tiers in CI. Fast tests on every push, integration tests on PR, workflow tests on merge, LLM tests on a nightly schedule. Developers get feedback in seconds, not minutes.
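One way to wire that up is with pytest markers; the marker names below are illustrative:

# pytest.ini — one marker per tier, selected with `pytest -m` in each CI job
[pytest]
markers =
    integration: component-boundary tests (run on pull requests)
    workflow: full graph execution tests (run on merge)
    llm: tests that call a real model (nightly schedule only)

Each CI job then selects its tier, for example pytest -m "not integration and not workflow and not llm" on every push and pytest -m llm on the nightly schedule.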
Establish performance baselines early. Run benchmarks from week one. A 50ms regression in a hot path compounds into seconds at scale. Track p95 and p99, not just averages, because tail latency is where users feel pain.
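Those baselines can sit right next to the benchmark harness above as a regression gate. A sketch, with illustrative budgets and the test_documents fixture from the earlier example:

import pytest

# Illustrative budgets; tune them to your own baseline measurements
P95_BUDGET_MS = 800
P99_BUDGET_MS = 1500

@pytest.mark.workflow
@pytest.mark.asyncio
async def test_latency_stays_within_baseline():
    benchmark = PerformanceBenchmark(create_document_processing_workflow(), test_documents)
    result = await benchmark.run_benchmark(duration_seconds=30, concurrent_requests=5)

    # Gate on tail latency, not the average
    assert result.p95_latency_ms < P95_BUDGET_MS
    assert result.p99_latency_ms < P99_BUDGET_MS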
Test error handling as thoroughly as happy paths. AI systems fail in unique ways: API timeouts, rate limits, malformed model outputs, hallucinated tool calls. Every failure mode needs a test that verifies graceful degradation, not just crash-free execution.
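A sketch of one such failure-mode test, reusing the injected-client pattern above; the workflow factory and the degraded-response fields are assumptions about your own system:

import asyncio
import pytest
from unittest.mock import AsyncMock

@pytest.mark.asyncio
async def test_crm_timeout_degrades_gracefully():
    crm_client = AsyncMock()
    # Simulate the external dependency timing out
    crm_client.lookup_customer.side_effect = asyncio.TimeoutError()

    workflow = create_support_workflow(crm_client=crm_client, ticket_client=AsyncMock())
    result = await workflow.ainvoke({"query": "Where is my order?"})

    # Graceful degradation: the user still gets a response, flagged as degraded
    assert result["response"] is not None
    assert result["degraded"] is True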
KEY INSIGHT: The highest-value tests for AI agent systems are workflow-level tests that verify conditional routing and state transitions. Unit tests prove components work. Workflow tests prove the system works.
What Comes Next
The AI agent testing landscape is moving fast. Automated test generation using LLMs to write test cases for other LLMs is already practical. Adversarial testing frameworks that probe agents for prompt injection vulnerabilities and edge-case failures are maturing. Semantic verification, where test assertions check meaning rather than string equality, is replacing brittle exact-match patterns.
The biggest shift ahead is distributed testing for multi-agent systems. As agent architectures scale across services, we will need chaos engineering approaches adapted for AI, injecting failures, latency, and malformed responses to verify that agent orchestration degrades gracefully under real-world conditions.
The fundamentals, though, stay the same. Test at every level. Isolate what you can control. Verify the paths your system actually takes. Measure what matters in production. If you build your testing strategy on those principles, you will have agent systems that are reliable enough to trust with real users and real stakes.