2,991 words · 15 minute read
The Development Workflow: How Seven Agents Turn a Ticket into Reviewed Code
Part 2 of 5 · Copilot Agent Pipelines

We typed /workflow in VS Code, picked “Full Development Workflow,” and watched seven agents hand off structured artifacts through five stages (research, architecture, code, review, documentation) without us managing context once. The ticket went from blank to code-reviewed implementation in a single session. That is what a structured workflow does that a single AI agent cannot.

The application was a Servoy enterprise system backed by PostgreSQL, with 10,000+ functions across 1,000+ files and 22 modules spanning over a decade of accumulated knowledge. When we started working with GitHub Copilot on it, a single agent session would collapse under the competing demands: research the codebase, understand the architecture, write code, enforce data isolation rules, catch anti-patterns, and produce documentation. Each of those tasks requires different depth, different context, and a different cognitive posture. Asking one agent to do all of them produces mediocre output on all of them.

The solution is structural separation. Seven specialized agents. One development workflow.


Figure 1 - Workflow Overview: Seven agents with distinct roles. The development workflow (teal) handles ticket implementation across five stages. The skill agents (gold) manage the knowledge base that every development stage draws from.


Why One Agent Is Not Enough

A single agent asked to “implement this business module change” faces a fundamental problem: the tasks inside that request are not the same kind of work.

Researching which functions are in scope requires breadth: scanning the Neo4j code graph, identifying module boundaries, finding what gets called and by what. Architectural planning requires depth: understanding the platform’s transaction model, the data isolation patterns enforced at the application layer, and the historical gotchas for this domain. Code generation requires precision: legacy syntax, platform API conventions, and the exact parameter signatures for existing functions. Code review requires skepticism: scanning for data isolation violations, security gaps, and the class of mistakes that burned developers before.

A single agent trying to do all of this switches modes constantly. It accumulates context that is useful for research but polluting for code generation. It makes architectural assumptions during research that should not be locked in until planning is complete. It cannot hold the skeptical reviewer posture while it is in code-generation mode.


Figure 2 - The Single Agent Problem: One agent receiving competing demands for research breadth, architectural depth, code precision, review skepticism, and documentation clarity cannot optimize for any of them. Context from early research pollutes later code generation. The reviewer cannot be skeptical about code the same agent just wrote.

The specialization fix is not just about capability. It is about context hygiene. Each agent starts with a clean slate shaped for its role. The researcher gets structural query tools and a broad mandate. The architect gets the researcher’s output plus domain skills. The developer gets the architect’s plan plus coding conventions. The reviewer gets the developer’s code plus a checklist mindset. None of them inherit context they do not need.

KEY INSIGHT: Specialization is not about what each agent can do. It is about what context each agent starts with. A researcher loaded with review checklists is worse at research. A reviewer that did the coding is worse at reviewing. Separation creates better outputs at every stage.


The Development Workflow in Detail

The workflow runs five stages for a full ticket. A short version runs three stages for targeted work that does not need structural research upfront.


Figure 3 - Development Workflow Detail: Five stages with assigned models and artifact outputs. The @researcher produces research/[ticket].md. The @architect produces plans/[ticket].md. The @developer writes implementation code. The @reviewer produces reviews/[ticket].md. The @documenter produces docs/[ticket].md. Each stage consumes the prior stage’s artifact as structured input.

Stage 1: @researcher (GPT-4.1)

The researcher does not read every file in the application’s 22 modules. It queries the Neo4j code graph. Given a ticket description, it identifies the affected functions, maps their callers and callees, finds module boundary crossings, and flags complexity hotspots. For tickets with deep domain complexity, it can query beyond the immediate call graph to surface structural patterns across the codebase. The output is a structured research document that answers: what is in scope, what depends on it, and where does the risk concentrate.

GPT-4.1 runs here because speed matters more than depth at this stage. The researcher is gathering structural facts, not making judgments. Fast, accurate, broad.
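The kind of scope query the researcher issues can be sketched as follows. This is a minimal sketch: the labels (`Function`, `Module`) and relationship types (`CALLS`, `BELONGS_TO`) are assumptions, since the article does not specify the graph schema.

```javascript
// Build a parameterized Cypher query that maps one function's callers,
// callees, and owning module. Schema names here are illustrative.
function buildScopeQuery(functionName) {
  return {
    text: [
      "MATCH (f:Function {name: $name})",
      "OPTIONAL MATCH (caller:Function)-[:CALLS]->(f)",
      "OPTIONAL MATCH (f)-[:CALLS]->(callee:Function)",
      "OPTIONAL MATCH (f)-[:BELONGS_TO]->(m:Module)",
      "RETURN f.name AS fn, m.name AS module,",
      "       collect(DISTINCT caller.name) AS callers,",
      "       collect(DISTINCT callee.name) AS callees",
    ].join("\n"),
    params: { name: functionName },
  };
}
```

Running one such query per candidate function is what keeps the research stage in the minutes range rather than the hours a file-by-file scan would take.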

Stage 2: @architect (Claude Opus 4.6 / GPT-5.4)

The architect receives the research document and the relevant domain skills. From that foundation it designs the implementation approach: which functions to modify, what new functions to create, what the call flow looks like, which validation functions to invoke, and which data isolation patterns apply. The output is a step-by-step plan with enough specificity that the developer does not make architectural guesses.

This is the most demanding cognitive stage in the workflow. Claude Opus 4.6 handles complex multi-constraint reasoning across Servoy’s architectural requirements, language limitations, and module-specific rules. GPT-5.4 provides a cross-model option for architectural decisions with unusually high risk. The choice between them is made at session start based on ticket complexity.

Stage 3: @developer (Claude Sonnet 4.6)

The developer receives the architect’s plan and writes Servoy server-side JavaScript following platform conventions. It has access to the relevant domain skills for the module it is modifying, including business rules, data isolation patterns, and the specific API signatures the platform uses for database operations. It writes code, not explanations. If the reviewer finds issues, this agent runs again with the review findings as additional context.

Stage 4: @reviewer (Claude Sonnet 4.6)

The reviewer examines the developer’s output against a multi-layered checklist: legacy compatibility, platform API correctness, data isolation enforcement, security (SQL injection patterns, input sanitization), business rule compliance for the module, and format conventions. It produces a structured review document. It also runs a second task simultaneously: harvesting any new business rules, gotchas, or patterns discovered during review that are not yet in the existing skills.
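The checklist layers named above can be sketched as a small review gate. The layer names come from the article; the finding shape and grouping logic are illustrative.

```javascript
// The six checklist layers the reviewer walks on every pass.
const REVIEW_LAYERS = [
  "legacy compatibility",
  "platform API correctness",
  "data isolation enforcement",
  "security",
  "business rule compliance",
  "format conventions",
];

// Group findings by layer and decide whether the quality gate passes.
// findings: [{ layer, severity, note }]
function summarizeReview(findings) {
  const byLayer = {};
  for (const layer of REVIEW_LAYERS) byLayer[layer] = [];
  for (const f of findings) {
    if (!(f.layer in byLayer)) throw new Error("unknown layer: " + f.layer);
    byLayer[f.layer].push(f);
  }
  return { passed: findings.length === 0, byLayer };
}
```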

KEY INSIGHT: The reviewer does two jobs in one pass: validation and knowledge harvesting. Every review is simultaneously a quality gate and a skill improvement opportunity. New findings go to @skill-builder for incorporation into the knowledge base. This is how the system gets smarter over time without additional developer effort.

Stage 5: @documenter (GPT-4.1)

The documenter assembles a structured record from all prior artifacts, organized into five categories: Problem (the research summary and what triggered the work), Solution (the architectural approach and implementation decisions), Code Review (the reviewer’s findings and resolutions), Test Results (verification outcomes), and Files Changed (the full list of modifications). It runs on GPT-4.1 because this is document assembly, not reasoning. Speed and consistency matter, not depth.


Handoff Buttons and Context Transfer

The workflow would collapse without mechanical handoffs. Manually deciding what context to carry forward between stages introduces cognitive overhead and context loss. Handoff buttons eliminate both.


Figure 4 - Handoff Mechanism: When a stage completes, it outputs a formatted handoff block with structured context fields and a stage-transition button. Clicking the button opens the next agent with the handoff block pre-loaded. No copy-paste, no context reconstruction, no cognitive overhead.

Each handoff block has a standard structure: what was done, what was found, the path to the artifact file, and the specific context the next agent needs. The researcher’s handoff tells the architect which functions are in scope and where risk concentrates. The architect’s handoff tells the developer which functions to write and what decisions were made. The developer’s handoff tells the reviewer exactly what changed and why.
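A researcher-to-architect handoff might carry fields like the following. The field names and the ticket id are our own illustration; the article specifies the structure (what was done, what was found, artifact path, context for the next agent) but not the exact format.

```javascript
// Illustrative handoff payload for the researcher -> architect transition.
const handoff = {
  stage: "researcher",
  nextAgent: "@architect",
  done: "Mapped scope for TICKET-123 via the Neo4j code graph",
  found: "23 functions in scope across 3 modules; risk concentrates in validation",
  artifact: "copilot/research/TICKET-123.md",
  contextForNext: [
    "Functions in scope are listed in the artifact",
    "Module boundary crossings are flagged as risk items",
  ],
};
```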

The button does not just open a new agent. It opens the correct agent with the handoff block already in context. The next agent reads the handoff, loads the referenced artifact, and begins its task without any manual setup from us.

This is what makes the workflow feel mechanical rather than artisanal. The handoff does not depend on us remembering what to tell the next agent. The system carries the context forward.


Short Workflow vs Full Workflow

Not every ticket needs a researcher and an architect. Feature additions to well-understood modules, bug fixes with a known root cause, and changes where the scope is already clear do not benefit from a full research pass.


Figure 5 - Short vs Full Workflow: The decision tree routes tickets based on scope certainty and structural risk. Known scope plus known approach plus low structural risk uses the Short Workflow. Unknown scope or multi-module dependencies or high complexity uses the Full Workflow. The short workflow skips research and architecture. The full workflow runs all five stages.

The Short Workflow runs @developer, then @reviewer, then @documenter. The developer starts with the ticket description plus the relevant domain skills. No research document, no architectural plan. Just the instructions and the accumulated domain knowledge already loaded in the skills.

The failure mode we found early: using the short workflow on tickets with structural risk we did not recognize upfront. A “simple” change to a validation function turned out to have 47 callers across 6 modules. The developer wrote correct code for the function it was asked to modify. The reviewer caught three callers with incompatible assumptions. What should have been a 30-minute task became a 3-hour structural analysis.

The fix is conservative routing. When in doubt, run the full workflow. The researcher stage is fast because GPT-4.1 querying the Neo4j code graph takes minutes, not hours. The cost of unnecessary research is low. The cost of discovering structural risk at the review stage is high.
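The conservative routing rule reduces to a short predicate: the short workflow requires every risk signal to be absent, and anything else falls through to the full workflow. A minimal sketch, with the signal names as our own shorthand for the decision tree in Figure 5:

```javascript
// Route a ticket to the short or full workflow. Defaults to "full"
// whenever any risk signal is present, per the "when in doubt" rule.
function routeTicket(ticket) {
  const { scopeKnown, approachKnown, multiModule, highComplexity } = ticket;
  if (scopeKnown && approachKnown && !multiModule && !highComplexity) {
    return "short"; // @developer -> @reviewer -> @documenter
  }
  return "full"; // all five stages, research first
}
```

The asymmetry is deliberate: a false "full" costs minutes of graph queries, while a false "short" can cost hours of structural analysis at review time.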


Skill Management

The two skill agents operate outside the development workflow. They maintain the knowledge base that every development session draws from.


Figure 6 - Skill Management: @skill-builder creates and updates skill files from two sources: knowledge harvested by the reviewer and directly curated context. @skill-auditor validates updated skills for accuracy and consistency. The 18 domain skills covering the application’s modules are the product of this loop.

@skill-builder (Claude Sonnet 4.6)

The skill-builder creates new skills and updates existing ones. It works from two inputs: knowledge harvest findings queued by the reviewer after development sessions, and curated context prepared directly by developers for domains not yet covered. When building skills for domains with deep structural complexity, it can query the Neo4j graph for additional context rather than relying solely on what has been manually provided.

@skill-auditor (Claude Sonnet 4.6)

The skill-auditor validates skills before they go into rotation. It checks for internal consistency, coverage gaps, contradictions with adjacent skills, and accuracy against the codebase. A skill that passes audit goes live and is auto-loaded for every relevant development session. A skill that fails audit goes back to @skill-builder with specific findings.
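The audit gate can be sketched as a function that runs a set of checks and either promotes the skill or sends it back with findings. The check categories mirror the article; the check implementations here are stubs passed in by the caller.

```javascript
// Run audit checks against a skill. Each check returns null (pass)
// or a finding string (fail). A clean run promotes the skill to live;
// any finding sends it back to @skill-builder.
function auditSkill(skill, checks) {
  const findings = [];
  for (const check of checks) {
    const result = check(skill);
    if (result) findings.push(result);
  }
  return findings.length === 0
    ? { status: "live", findings: [] }
    : { status: "back-to-builder", findings };
}
```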

The 18 domain skills covering the application’s 22 modules are the cumulative product of this loop, built over time through review harvesting and direct curation.


Model Selection Strategy

Four models run across seven agents. The assignment is not arbitrary. Each model’s characteristics match what its assigned agents actually need.


Figure 7 - Model Selection Strategy: GPT-4.1 handles high-speed, broad tasks. Claude Opus 4.6 and GPT-5.4 handle deep architectural reasoning. Claude Sonnet 4.6 handles high-volume, high-precision code tasks. Model assignment follows cognitive load, not status.

GPT-4.1 powers @researcher and @documenter. Fast, broad, consistent. Research is graph queries and fact gathering. Document assembly is structured formatting. Speed matters, depth does not.

Claude Opus 4.6 powers @architect (primary). Deep reasoning across Servoy’s architectural requirements, language limitations, application-layer data isolation patterns, and module-specific constraints. This is where deliberate, constrained reasoning earns its cost.

GPT-5.4 powers @architect (alternate). Cross-model validation for high-risk architectural decisions. Running the same architectural problem through a different reasoning lineage catches assumptions Claude Opus 4.6 treated as settled.

Claude Sonnet 4.6 powers four agents: @developer, @reviewer, @skill-builder, @skill-auditor. Code generation and review need consistent instruction-following and high-quality output at volume. Sonnet delivers that at inference speeds that keep the workflow from stalling.
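The full assignment, collected from the paragraphs above into one lookup (the @architect entry lists primary first, cross-model alternate second):

```javascript
// Agent-to-model assignment as described in this section.
const MODEL_ASSIGNMENT = {
  "@researcher": ["GPT-4.1"],
  "@documenter": ["GPT-4.1"],
  "@architect": ["Claude Opus 4.6", "GPT-5.4"],
  "@developer": ["Claude Sonnet 4.6"],
  "@reviewer": ["Claude Sonnet 4.6"],
  "@skill-builder": ["Claude Sonnet 4.6"],
  "@skill-auditor": ["Claude Sonnet 4.6"],
};
```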

KEY INSIGHT: Model selection is an economic decision, not a prestige decision. Using a top-tier reasoning model for document assembly wastes money and adds latency. Using a fast research model for architectural design risks missed constraints. Matching model capability to task complexity is where workflow efficiency lives.


File-Based Artifacts

Every workflow stage writes to disk. This is not just for auditability. It is how context transfers reliably between stages and how the knowledge base accumulates.

copilot/
├── research/
│   └── [ticket-id].md
├── plans/
│   └── [ticket-id].md
├── reviews/
│   └── [ticket-id].md
├── docs/
│   └── [ticket-id].md
└── knowledge-harvest/
    └── [date]-[domain].md
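The stage-to-directory mapping above can be expressed as a small path helper, which is roughly what a handoff needs to point the next agent at the right file (the function name is our own; knowledge-harvest files are keyed by date and domain rather than ticket id, so they are left out here):

```javascript
// Resolve the artifact path a workflow stage writes for a given ticket.
function artifactPath(stage, ticketId) {
  const dirs = {
    researcher: "research",
    architect: "plans",
    reviewer: "reviews",
    documenter: "docs",
  };
  if (!(stage in dirs)) throw new Error("no artifact dir for stage: " + stage);
  return "copilot/" + dirs[stage] + "/" + ticketId + ".md";
}
```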

File-based artifacts solve three problems:

Context window limits: A full research document plus architectural plan can exceed what fits cleanly in a single context transfer. Referencing a file path and having the next agent load it avoids that ceiling.

Reproducibility: If a workflow stage produces bad output, we can rerun that stage with the same input. The artifact from the prior stage is still on disk. We do not have to reconstruct context from memory.

Audit trail: Six months from now, when a change introduced a bug, the research document shows what functions were in scope, the review document shows what was checked, and the documentation artifact shows what was communicated. The workflow produces documentation as a side effect of doing the work.


What a Typical Ticket Produces

A medium-complexity business module ticket runs through the full workflow and produces five artifacts.

The researcher maps 23 functions in scope across 3 modules. The architect specifies 4 new functions and 2 modifications. The developer writes the implementation. The reviewer finds 1 medium finding, corrected in the same session. The knowledge harvest captures 2 new domain-specific rules for the skill-builder queue. The documenter assembles the full record.

The review finding in that session was a missed data isolation filter in a sub-query. The reviewer caught it. The developer fixed it. The knowledge harvest captured the rule that sub-queries must include the isolation filter even when the outer query already has it. That rule went into the domain skill via @skill-builder and now fires for every developer in every session touching that module.

One caught bug became a permanent guardrail.


The Failure That Shaped the Design

Early in the system’s development, we tried to keep the workflow in a single long context rather than using file-based artifacts and handoffs. The reasoning seemed efficient: less overhead, no file I/O, everything in one place.

The actual behavior: by the time the reviewer received context, it was 40,000 tokens deep. The model was tracking research notes, architectural debates, code generation attempts, and revision history simultaneously. Review quality degraded. The reviewer started treating early research notes as constraints and missing code-level issues that were obvious in isolation.

The fix was mandatory context isolation. Each stage reads only what it needs from the prior stage: the structured handoff and the artifact file. The noise from earlier stages does not accumulate. The reviewer reads the developer’s code and the handoff summary. It does not read the research debates from Stage 1.

This is the counterintuitive lesson: more context is not better context. A reviewer with 2,000 tokens of clean input outperforms a reviewer with 40,000 tokens of accumulated noise.


Before and After: Workflow Comparison


Figure 8 - Before/After Workflow Comparison: The left column shows ad-hoc Copilot usage: manual context setup each session, inconsistent review depth, knowledge lost between sessions, no structural risk detection, and documentation as an afterthought. The right column shows the structured workflow: one-click session start, consistent review checklist, self-improving knowledge base, and structural risk caught before code is written.

| Dimension | Ad-Hoc Copilot | Seven-Agent Workflow |
| --- | --- | --- |
| Session setup | Manual context reconstruction every time | `/workflow` + skill auto-load |
| Structural risk detection | Discovered during review or post-merge | Researcher maps Neo4j graph before any code |
| Review consistency | Depends on what we remember to check | Structured checklist, same every time |
| Knowledge retention | Lost between sessions | Harvested, structured, auto-loaded |
| Ticket documentation | Manual, often skipped | Documenter output as workflow artifact |
| New developer context | Months of codebase learning | 18 domain skills cover critical modules |

The numbers that matter: the structural risk catch rate went from “when we remembered to check” to “every ticket, before code is written.” The review consistency went from “varies by session energy” to “same checklist, every time.” The knowledge retention went from “in our heads” to “in skill files that outlast any individual developer.”


What Is Coming Next

The workflow handles development and knowledge capture. The Neo4j code graph makes structural intelligence possible. But neither of those fully answers the question of why the graph changes what agents can do at scale.

Article 3 goes into the Neo4j code graph in depth: how 10,000+ functions get indexed, what kinds of structural questions become queryable, and what it costs to maintain a live graph against a codebase that changes daily. The graph is not a one-time snapshot. It is an operational dependency for every researcher session, and keeping it current is its own engineering problem.


The Series

This is Part 2 of a 5-part series on building an AI development methodology with GitHub Copilot:

  1. Beyond Code Completion. The enterprise AI gap and why agent mode changes everything
  2. The Development Workflow (this article). How seven agents turn a ticket into reviewed code
  3. Neo4j Code Graph. How a code graph database makes AI agents understand your codebase
  4. The Knowledge Flywheel. How code reviews feed a self-improving knowledge loop
  5. Enterprise AI Lessons. What building an AI methodology taught us about enterprise software

https://dotzlaw.com/insights/copilot-02/
Author: Gary Dotzlaw, Katrina Dotzlaw, Ryan Dotzlaw
Published: 2026-03-23
License: CC BY-NC-SA 4.0