Skills, Slash Commands, and Harnesses: A Discipline Hierarchy

This article maps where each layer of the Claude Code discipline stack belongs: skills, slash commands, agents, hooks, and harnesses. They all exist. The question that keeps coming up is which one to reach for, and when.

This article closes the Harness Fundamentals sub-series by answering that question directly. We’ll walk through the four rules Anthropic engineers use to design skills, show you how Boris Cherny (Head of Claude Code at Anthropic) has applied them in his own stack, and then lay out the full discipline hierarchy from the smallest unit to the largest. By the end, you will have a mental model that tells you exactly where to reach and what discipline each level demands.

If you have not read the prior articles in this sub-series, a quick orientation is useful. We covered what a harness is and why it matters more than the model in What Is an Agent Harness?, the three eras of AI engineering and why 2026 is the harness era in Three Eras of AI Engineering, how harnesses evolve as models improve in The Harness Evolution Principle, and how to build a specialized Python harness from the nine components up in Building Your First Specialized Harness in Python. This one wraps the series by zooming back in to the bottom of that stack: the skills layer, and the discipline it demands.

Figure 1: vertical stack diagram of the five-level Claude Code discipline hierarchy from Skills at the bottom to Harness at the top.

Figure 1 - The five-level discipline hierarchy: Skills are not just a convenient feature. They are the first layer of a five-level system, and each level up the stack adds a layer of coordinating logic. Knowing where you are in this hierarchy, and what discipline each level demands, is what separates a Claude Code power user from someone who re-prompts from scratch every Monday morning.

The problem: writing a new prompt for every task

Most Claude Code users start the same way. A task comes up. They open Claude Code, type a descriptive prompt, and Claude does the work. Good result, close the session.

Next week, the same task comes up. They type another descriptive prompt. Maybe they remember what they wrote last time, maybe they don’t. Claude does the work again. If the prompt was a little different, the output will be a little different. The second session is not better than the first. There is no compounding loop, so the work on day 30 looks much like the work on day 1.

This is the prompting ceiling. It is not a flaw in Claude Code; it is a flaw in how you are using it. You are treating Claude Code as a one-shot answering machine instead of as a system that compounds over time.

The fix is a discipline, not a feature. Anthropic’s engineering team distilled that discipline into four rules at an AI engineering summit, and those rules are the clearest statement we have found of what effective Claude Code usage actually looks like. Let’s walk through them.

KEY INSIGHT: The compounding loop only works if you build skills. Prompts vanish when the session ends. Skills persist and improve. Everything else in this article is a consequence of that distinction.

Rule 1: Prompt skills, not Claude

The first rule sounds simple but it repositions where your effort goes.

Anthropic’s framing of the Claude Code stack has three layers. Layer 1 is the AI model, Anthropic’s domain. Layer 2 is the agents and prompts that orchestrate the model. Layer 3 is skills, the application layer, and that is your domain.

The cleanest way to frame this is by analogy with the phone in your pocket. Anthropic builds the phone. You build the apps. Just as you would not expect Apple to preload every app you will ever need, you should not expect Anthropic to pre-program every workflow you will ever run. Skills are how you install your apps.

The practical shift: instead of typing a new email-drafting prompt, you build a /draft-email skill. Instead of describing your PR process from scratch each time, you build a /commit-push-PR skill. The description in that skill tells Claude when to auto-invoke it. A specific, well-written description means Claude finds the skill automatically when the task comes up, without you typing the slash command at all.

Figure 2: three-layer diagram with AI Model at bottom, Agents and Prompts in middle, and Skills at top labeled as the user's application layer.

Figure 2 - The application layer: The three-layer Claude Code stack positions skills as the application layer, your domain and your responsibility. Layer 1 (the model) and Layer 2 (agents and prompts) are Anthropic’s domain. Skills in Layer 3 are where your compounding leverage lives. If you are still prompting directly every session, you are working in Layer 2 and handing Layer 3 to no one.

Most developers understand this in principle and still end up prompting directly in practice. The reason is usually that building a skill feels like overhead for a one-off task. Rule 1 is not about one-off tasks. It is about recognizing which of your “one-off” tasks are actually recurring ones in disguise.

Rule 2: Skills are more than prompts

The second rule is where the mechanical understanding comes in, and where most practitioners stop short.

Every skill has three internal layers. The description is Layer 1: it governs whether Claude auto-invokes the skill at all. If the description is vague, Claude will not find the skill unless you explicitly type the slash command. A specific description means Claude reaches for the skill automatically. It functions like a folder label. Vague label, invisible folder.

Layer 2 is the instructions: the step-by-step playbook Claude follows when the skill activates. This is what most people build. They write detailed instructions, they test them, they feel done.

Layer 3 is the tools: code scripts, API calls, and reference files. This is where most of the leverage lives. Most practitioners never get here.
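To make the three layers concrete, here is a minimal Python sketch that renders a skill file with the description in frontmatter and the instructions in the body. The name and description frontmatter fields follow our reading of the skills documentation [1]. The render_skill_md helper, the draft-email content, and the scripts/lookup_order.py tool are hypothetical, chosen only for illustration.

```python
def render_skill_md(name: str, description: str, instructions: list[str]) -> str:
    """Render a SKILL.md-style file: frontmatter (Layer 1) plus numbered steps (Layer 2)."""
    frontmatter = f"---\nname: {name}\ndescription: {description}\n---\n"
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(instructions, start=1))
    return f"{frontmatter}\n{steps}\n"


skill_text = render_skill_md(
    name="draft-email",
    # Layer 1: specific enough that Claude can match it to the task.
    description="Draft a reply to a customer email in our support tone. "
                "Use when the user pastes an email and asks for a response.",
    # Layer 2: the playbook. Step 2 points at a Layer 3 tool (hypothetical script).
    instructions=[
        "Read the pasted email and identify the customer's question.",
        "Run scripts/lookup_order.py if an order number is present.",
        "Draft the reply and show it for approval before sending.",
    ],
)
print(skill_text)
```

The point of the sketch is the shape, not the helper: a skill that stops at the numbered steps has no Layer 3, and step 2 is where a saved script would plug in.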

Eric, an Anthropic engineer, described the mismatch he keeps seeing, carefully written prompts paired with undocumented tools, in a way that stuck with us: “I think maybe the funniest things I see is that people will put a lot of effort into creating these really beautiful, detailed prompts. And then the tools that they make to give the model are sort of these incredibly bare-bones, like, you know, no documentation, the parameters are named A and B, and it’s kind of like, oh, like an engineer wouldn’t be able to work with this as a function they had to use” [3].

The counterintuitive implication: Anthropic engineers focus on Layer 3, not Layer 2. The instructions matter, but the tools are where the skill actually becomes reliable.

Barry, another Anthropic engineer, gave the clearest practical example. His team kept watching Claude regenerate the same Python script to apply styling to slides, session after session. The fix was obvious once they thought of it. They asked Claude to save the script inside the skill folder as a tool for its future self. Barry, verbatim: “We kept seeing Claude write the same Python script over and over again to apply styling to slides. So we just asked Claude to save it inside of the skill as a tool for its future self. Now we can just run the script and that makes everything a lot more consistent, a lot more efficient” [3].

This is the scripts-inside-skills pattern. Once you have saved a deterministic script inside the skill, future runs execute the saved code directly instead of regenerating it. You trade AI inference tokens for deterministic code compute. Cheaper, faster, and the output stops varying between sessions.
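The scripts-inside-skills pattern can be sketched as follows. Barry’s actual script is not public, so everything here is an assumption made for illustration: the file location, the HOUSE_STYLE values, the apply_house_style function, and the choice to model slides as plain dictionaries rather than a real presentation file.

```python
# Hypothetical file: <skill-folder>/scripts/apply_style.py
# Slides are modeled as plain dicts for illustration; a real script would
# operate on an actual presentation file.

HOUSE_STYLE = {"font": "Inter", "title_size": 36, "body_size": 18}


def apply_house_style(slides: list[dict]) -> list[dict]:
    """Return new slide dicts with the house style applied; the input is not mutated."""
    styled = []
    for slide in slides:
        styled.append({
            **slide,
            "font": HOUSE_STYLE["font"],
            "title_size": HOUSE_STYLE["title_size"],
            "body_size": HOUSE_STYLE["body_size"],
        })
    return styled


deck = [
    {"title": "Q3 results", "body": "Revenue up", "font": "Comic Sans"},
    {"title": "Next steps", "body": "Hire two engineers"},
]
styled_deck = apply_house_style(deck)

# Same input, same output on every run: nothing is regenerated by the model.
assert styled_deck == apply_house_style(deck)
```

Once a script like this lives in the skill folder, the skill instructions only need to say when to run it, not how to write it.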

Figure 3: anatomy of a single skill showing three internal layers (Description, Instructions, Tools) with a marker noting most practitioners stop at the Instructions layer.

Figure 3 - Three layers, one common stopping point: Most skill authors stop at the instructions layer. The tools layer (code scripts, API calls, reference files) is where Claude Code becomes deterministic, cost-efficient, and reliably consistent. Barry’s slide-styling story is the canonical example: one saved script eliminates every future re-inference of the same code.

KEY INSIGHT: When Claude writes the same code more than once across sessions, that code belongs inside the skill as a saved tool. Re-inference is waste. Saved deterministic code costs almost nothing to rerun.

Rule 3: Build composable skills, not monolithic ones

The third rule is about design discipline once you are building skills.

The temptation when you start building skills is to make them comprehensive. One content-creation skill that handles idea generation, script writing, and social post drafting. One research skill that does topic research, competitive analysis, and report formatting. The appeal is obvious: one skill, one command, one thing to remember.

The problem is that when it breaks, you cannot tell where. And when one capability needs updating, you have to touch the same file as all the other capabilities. Improvements do not propagate because the concerns are entangled.

Anthropic’s framing of the correct approach uses the word “composable” explicitly. A composable skill is small, focused, reusable, and chainable. The example from the engineering summit:

Bad pattern: one skill that generates YouTube ideas, writes the script, and drafts the LinkedIn companion post.

Good pattern: three skills that chain.

/youtube-idea-research researches and ranks video ideas.
/youtube-script-writer takes a chosen idea and writes the script.
/linkedin-post drafts the companion post.

When you improve /youtube-script-writer, better hooks and clearer structure, every workflow that uses scripts improves automatically. When you update the LinkedIn format guidelines, only that skill changes. You know exactly where to look when something breaks.
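The chaining idea can be sketched in Python, with one small function standing in for each skill. The function bodies are placeholders we made up; only the three skill names come from the example above.

```python
def youtube_idea_research(topic: str) -> list[str]:
    """Stand-in for /youtube-idea-research: return ranked ideas, best first."""
    return [f"{topic}: common mistakes", f"{topic}: beginner walkthrough"]


def youtube_script_writer(idea: str) -> str:
    """Stand-in for /youtube-script-writer: turn one chosen idea into a script."""
    return f"HOOK: Why {idea} matters\nBODY: three key points\nCLOSE: Subscribe for more."


def linkedin_post(script: str) -> str:
    """Stand-in for /linkedin-post: draft a companion post from the script's hook line."""
    hook = script.splitlines()[0].removeprefix("HOOK: ")
    return f"New video out. {hook}."


# Each step consumes only the previous step's output, so any one of them
# can be improved or replaced without touching the other two.
ideas = youtube_idea_research("skill design")
script = youtube_script_writer(ideas[0])
post = linkedin_post(script)
print(post)
```

If the post comes out wrong, there are exactly three places to look, and each one can be tested on its own.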

Figure 4: side-by-side comparison of one cracked monolithic skill against a chain of three small composable skills linked by arrows.

Figure 4 - Composable versus monolithic: A monolithic skill that does three things fails opaquely. A composable chain of three focused skills fails specifically. More importantly, improving one composable skill propagates to every workflow that calls it, while improving part of a monolithic skill requires careful surgery around the parts you did not intend to change.

Composable design also exposes a decision you will need to make once you have more than a few skills: who can invoke what.

Claude Code’s Skills feature includes two invocation-control flags in the skill’s frontmatter. The official field names use hyphens:

user-invocable: false hides the skill from the user’s slash command menu. The model can still invoke it automatically based on the description. Use this for internal utility sub-skills that are not meaningful user commands. A skill that loads background context on a legacy system, for example. The user does not need to type /legacy-system-context. Claude will load it when it’s needed. Note carefully: this flag controls menu visibility only. It does not block the model from invoking the skill.

disable-model-invocation: true blocks the model from invoking the skill automatically. Only the user can trigger it. Use this for skills with side effects: /commit, /deploy, /send-slack-message. You want a human decision gate before those run. Claude should not decide to deploy because the code looks ready.

The two flags are not symmetric. One controls the menu; one controls the model. In a composable chain, early steps (research, drafting, analysis) typically have neither flag set. The final step (deploy, send, publish) typically has disable-model-invocation: true so the human gates the irreversible action.

Figure 5: 2x2 truth table showing the four combinations of the user-invocable and disable-model-invocation skill frontmatter flags.

Figure 5 - Two flags, four behaviors: The invocation-control flags in Claude Code’s skill frontmatter are not symmetric. user-invocable: false hides the skill from the slash menu but leaves model invocation intact. disable-model-invocation: true blocks model invocation but leaves user invocation intact. The intended use cases are different: the first is for utility sub-skills; the second is for side-effect actions that need a human gate.
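The four combinations can be sketched as a small Python function. This is a simplified model, not Claude Code’s implementation: it treats menu visibility as “the user can invoke”, and it assumes the defaults are user-invocable: true and disable-model-invocation: false, which matches the statement above that early chain steps set neither flag.

```python
from itertools import product


def who_can_invoke(user_invocable: bool = True,
                   disable_model_invocation: bool = False) -> set[str]:
    """Return the callers allowed to trigger a skill under the two flags."""
    callers = set()
    if user_invocable:
        callers.add("user")    # skill appears in the slash command menu
    if not disable_model_invocation:
        callers.add("model")   # Claude may auto-invoke based on the description
    return callers


# Print the 2x2 truth table.
for user_flag, model_block in product([True, False], repeat=2):
    allowed = sorted(who_can_invoke(user_flag, model_block)) or ["nobody"]
    print(f"user-invocable={user_flag!s:5}  "
          f"disable-model-invocation={model_block!s:5}  ->  {', '.join(allowed)}")
```

The fourth combination, hidden from the menu and blocked from the model, leaves no caller in this model, which is a useful sanity check when auditing frontmatter.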

Rule 4: Update skills every session

The fourth rule is the one most often skipped, and skipping it is how skill debt accumulates.

Skills persist between sessions. Prompts vanish. That persistence is the source of the compounding loop, but it only compounds if you maintain the skills.

After every skill run, ask one question: “Is this a one-time fix, or should it live in the skill forever?” If the answer is forever, update the skill right now, before closing the session. If you wait until “later,” later does not come.

As Anthropic engineers put it at the summit, the goal is that working with Claude on day 30 should feel noticeably better than working with Claude on day 1 [3]. That only happens if the skills that make up the day-30 system have been updated with 30 days of edge cases, corrections, and learned patterns.

The maintenance math is simple but ruthless. Every skill is a maintenance obligation. Not a passive one that stays stable, but an active one that drifts. Workflows change. The model improves and your previous guardrails become noise. New skills you added contradict old ones you forgot to update. Boris Cherny makes the point about the full harness layer, but it applies equally to the skills inside it: “All of Claude Code has just been written and rewritten and rewritten and rewritten over and over and over. There is no part of Claude Code that was around 6 months ago” [3].

The implication for skills: the question is not how many skills you should build. It is how many you can actively maintain. Boris’s answer for most practitioners is 5 to 10 [3]. He had 40-plus skills at one point, deleted and rewrote, and kept 7 [3].

Figure 6: line chart of skills built versus skills actively maintained, with the widening gap between them shaded as skill debt.

Figure 6 - The skill debt gap: As the number of skills grows, the gap between skills built and skills actively maintained widens. Outdated skills contradict each other, confuse the model, and cost maintenance time without delivering value. Boris Cherny’s answer is to prune aggressively and keep the maintained set to 5 to 10.

KEY INSIGHT: The 5-to-10 rule is not an arbitrary cap. It is a maintenance budget. Don’t ask how many skills you can build; ask how many you can keep current. That answer sets your ceiling.
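One way to act on a maintenance budget is to rank skills by how recently they were updated and flag everything past the budget for review. The sketch below is our own illustration, not a procedure Boris described: using days since last edit as the staleness signal, the skills_to_prune function, and the inventory are all assumptions.

```python
def skills_to_prune(days_since_update: dict[str, int], budget: int = 10) -> list[str]:
    """Keep the `budget` most recently updated skills; return the rest, stalest first."""
    by_freshness = sorted(days_since_update, key=days_since_update.get)
    over_budget = by_freshness[budget:]
    return sorted(over_budget, key=days_since_update.get, reverse=True)


# Hypothetical inventory: skill name -> days since the skill file was last edited.
inventory = {
    "commit-push-pr": 1,
    "run-tests": 3,
    "review-code": 2,
    "session-cleanup": 5,
    "old-release-notes": 120,
    "legacy-migration": 200,
}
print(skills_to_prune(inventory, budget=4))
```

A skill on the prune list is a candidate for deletion or a rewrite, not an automatic removal. The list only tells you where the maintenance gap is widest.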

How the head of Claude Code actually does it

That is the four-rule framework from Anthropic’s engineers. Now let’s look at what it looks like when someone has applied it seriously.

Boris Cherny is Head of Claude Code at Anthropic, and he has been running Claude Code on his own development work for long enough that his habits have settled into something systematic. His documented stack has seven skills, of which he has named four publicly:

/commit-push-PR, his most-used skill: commits code, pushes to GitHub, and opens a pull request, all in one command.
Running tests.
Reviewing code for bugs.
End-of-session cleanup.

The other three skills are not enumerated in the available record, and we will not invent them.

Before building any skill, Boris applies what amounts to a three-question inner-loop filter [3]:

Do you do it two to three times per week?
Does it follow the same pattern each time?
Would preloaded context help?

If all three answers are yes, build the skill. If any answer is no, do not build it. The filter is intentionally restrictive. Most practitioners fall into one of two failure modes: never building skills (no compounding, so Claude on day 30 is identical to Claude on day 1), or building skills for everything (skill clutter, exploding maintenance cost, and skills that contradict each other).
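The three-question filter can be sketched as a single boolean check. The TaskProfile type, the should_build_skill function, and the choice of 2 as the weekly threshold (the low end of “two to three times per week”) are our assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    times_per_week: float          # how often the task actually comes up
    same_pattern_each_time: bool   # does it follow the same steps every time?
    preloaded_context_helps: bool  # would preloaded context improve the result?


def should_build_skill(task: TaskProfile, min_times_per_week: float = 2) -> bool:
    """All three answers must be yes; a single no means do not build."""
    return (
        task.times_per_week >= min_times_per_week
        and task.same_pattern_each_time
        and task.preloaded_context_helps
    )


pr_workflow = TaskProfile(
    times_per_week=10, same_pattern_each_time=True, preloaded_context_helps=True
)
one_off_migration = TaskProfile(
    times_per_week=0.1, same_pattern_each_time=False, preloaded_context_helps=True
)
print(should_build_skill(pr_workflow), should_build_skill(one_off_migration))
```

The check uses a logical AND on purpose: a task that is frequent but different every time still fails the filter.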

Figure 7: decision-tree flowchart of three sequential yes-or-no questions; any No routes to do not build a skill, three Yeses route to build a skill.

Figure 7 - The inner-loop test: Boris Cherny’s three-question filter determines whether a task has earned a skill. All three answers must be yes. The filter prevents both under-investment (never building skills) and over-investment (skill clutter that accumulates faster than maintenance can keep up).

Boris describes his own output this way: “Three days I should put 10, 20, 30 pull requests, something like that. 100% written by Claude Code. I have not edited it a single line by hand” [3]. That is the result of seven tightly maintained skills and a consistent update discipline, not 50 skills left to drift.

Skills are one of four components in Boris’s full automation stack [3]:

Skills handle the inner-loop workflows.
Agents handle delegated work: he has six agents, of which he names one publicly, a “staff reviewer” that acts as a skeptical senior engineer, reviews plans, and returns a verdict of approve, request changes, or needs to rethink. It reviews only; it does not build anything.
Hooks handle automatic enforcement: a post-tool-use hook runs the auto-formatter immediately after Claude writes to files, a stop hook returns “keep going” when work remains, and a session-start hook loads context into new sessions.
Loops and schedules handle persistent attention: /loop 5 minutes /babysit checks for anything needing attention every five minutes; schedules run specific tasks at specific times in the cloud outside the local project.

Figure 8: four-tier diagram of Boris Cherny's automation stack: skills, agents, hooks, and loops or schedules, with hedge cards for the unnamed entries.

Figure 8 - Boris’s automation stack: Seven skills handle the inner-loop workflows that Boris does repeatedly. Six agents handle delegated work, each with one job. Three hooks enforce automatic behavior. Loops and schedules extend the system past the session boundary. The whole structure is a real-world instance of the five-level discipline hierarchy, built and maintained by the person who built the harness it runs on.
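A post-tool-use formatting hook of the kind described above can be sketched in Python. Boris’s actual hook is not public, so this is a sketch under assumptions: the payload shape (a tool_name field and a tool_input.file_path field) follows our reading of the hooks reference [5], and the FORMATTERS mapping and the formatter_command function are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file suffix to formatter command.
FORMATTERS = {
    ".py": ["black"],
    ".ts": ["prettier", "--write"],
}


def formatter_command(payload: dict) -> Optional[list[str]]:
    """Return the formatter command for a file-writing tool call, or None to skip."""
    if payload.get("tool_name") not in {"Write", "Edit"}:
        return None
    file_path = payload.get("tool_input", {}).get("file_path")
    if not file_path:
        return None
    command = FORMATTERS.get(Path(file_path).suffix)
    return [*command, file_path] if command else None


# A real hook would read the payload with json.load(sys.stdin) and run the
# command with subprocess.run; here we only parse a sample payload.
sample = json.loads('{"tool_name": "Write", "tool_input": {"file_path": "src/app.py"}}')
print(formatter_command(sample))
```

Note what the hook does not contain: no review criteria and no workflow steps. It applies one cross-cutting behavior to every file write, which is the kind of logic that belongs in a hook rather than a skill.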

The dual-use detail is worth pulling out because it changes how you think about slash commands. Boris on his /commit skill: “Everything is dual use. The model can also call slash commands. I have a slash command for /commit where I run through diffing and generating a reasonable commit message. I run it manually, but Claude can run this for me. We get to share this logic” [3].

A skill is not just a user shortcut. It is a tool you equip Claude with. Boris can invoke /commit, or Claude can invoke /commit autonomously as part of a larger workflow. The same logic serves both invocation paths. That is the design insight behind slash commands: they are the interface contract between humans and skills, and between agents and skills.

The discipline hierarchy: where each layer belongs

Now we can map the full stack.

The five levels are not independent. They feed each other. Understanding what each level is for, and what discipline it demands, is the practical takeaway from this series.

Level 1: Skills are the atomic composable units. Each skill encodes one repeatable workflow. Apply the inner-loop filter before building. Write composable rather than monolithic skills. Build Layer 3 (scripts, tools, API calls) rather than stopping at Layer 2 (instructions). Update after every session.

Level 2: Slash commands are the invocation surface [4]. A slash command is how a skill becomes a first-class action: it is the skill’s directory name, the frontmatter-driven menu entry, and the interface Boris shares between himself and his agents. The discipline at this level is naming and access control. Name commands specifically and descriptively. Apply the invocation flags (user-invocable and disable-model-invocation) to route each skill to the right caller.

Level 3: Agents are where skills get applied to delegated work [6]. An agent is a Claude instance with a focused role, a restricted tool set, and a system prompt scoped to one job. Boris frames the relationship this way: skills tell Claude how to do something, agents decide when to do it. The discipline at this level is the same as for skills: one agent, one job. No general-purpose agents. Boris’s staff reviewer reviews and returns a verdict. It does not also write code, run tests, or format slides.

Level 4: Hooks are automatic enforcement [5]. They fire on harness events without requiring invocation. Post-tool-use, stop, session-start. The discipline: hooks are not skills. Write hooks for cross-cutting automatic behavior that should apply to every tool call or every session boundary. Do not put business logic in hooks. A hook that runs the formatter on every file write is correct use. A hook that contains your PR review criteria is wrong use: that belongs in a skill.

Level 5: The harness is the fixed architecture that coordinates everything below it. Its components include the while loop, context management, tool registry, sub-agent management, session persistence, system prompt assembly, hook dispatcher, and permissions. The prior articles in this series covered it in detail. The discipline: harnesses evolve with the model. Components that compensate for model limitations have a half-life of months. Pruning is as important as building. The Boris rewriting quote applies here at full force.

Figure 9: closing diagram of the five-level discipline hierarchy with the specific discipline each level demands annotated on the right.

Figure 9 - The full discipline hierarchy: Five levels, each with a specific purpose and a specific discipline requirement. Most practitioners operate at Levels 1 and 2. The ones who reach Levels 3, 4, and 5 and apply the discipline at each level are the ones whose Claude Code stacks compound meaningfully over time.

The non-developer proof point

It would be easy to walk away from this article thinking the discipline hierarchy is a developer concern. It is not.

The COO of a tech startup worked with a client, an engineering firm that generates NYC Local Law 97 carbon-emissions compliance reports for building owners. Local Law 97 is a New York City law, administered by the Department of Buildings, that requires large buildings to reduce greenhouse gas emissions [9]. The compliance reports were a 2-week batch job, at roughly 5 hours of work per individual report.

The COO spent 2 hours with the firm’s engineer building a /LL97-report skill using the three-folder Claude Code structure: skills/ for the execution playbooks, knowledge/ for the LL97 reference material and edge-case learnings, and projects/ for the report outputs. The same reports now generate in 5 minutes. After every batch, they run /improve-skill to fold any new edge cases back into the skill so the next run does not repeat them. Rule 4 in production, for a compliance use case in an engineering firm, run by a COO who does not write code.
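The three-folder structure is simple enough to scaffold in a few lines. The sketch below creates it in a temporary directory; the scaffold_workspace function and the folder descriptions are illustrative assumptions, and only the three folder names come from the account above.

```python
import tempfile
from pathlib import Path

# Folder name -> what it holds, per the structure described above.
FOLDERS = {
    "skills": "execution playbooks",
    "knowledge": "reference material and edge-case learnings",
    "projects": "report outputs",
}


def scaffold_workspace(root: Path) -> list[Path]:
    """Create the three folders under `root` and return their paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Use a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = scaffold_workspace(Path(tmp))
    created_names = sorted(p.name for p in paths)
    print(created_names)
```

The separation matters more than the tooling: playbooks, reference knowledge, and outputs change at different rates, so keeping them apart makes the per-batch skill update a small, targeted edit.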

The discipline is the same at every level of the stack, for every type of user. Apply the inner-loop filter. Build the Layer 3 tools, not just the instructions. Keep skills composable. Update after every run.

Conclusion

This series started by asking what a harness actually is. We walked through nine components, three eras, the evolution principle, a Python implementation, and now, finally, the skills layer that sits at the foundation of everything the harness coordinates.

The mental model we want you to leave with:

Skills encode the knowledge. Slash commands expose it as invocable actions. Agents decide when to apply it. Hooks enforce behavior automatically around it. The harness orchestrates all of it.

None of these levels replaces the others. They are layers in a stack that compounds when you maintain the discipline at each level and collapses when you skip it. Most practitioners operate at Level 1 or 2 and wonder why their stack does not improve over time. The answer is usually not the model. The answer is the maintenance gap between what they built and what they are keeping current.

The next sub-series on this site, Compiled Knowledge, starts on June 11 with an article on why Karpathy’s wiki idea is reshaping how teams think about retrieval and the knowledge layer that sits behind their skills. The three-folder pattern the COO used for LL97 is the entry point to that conversation.

The Series

This is Part 5 of the five-part Harness Fundamentals sub-series on Claude Code engineering:

What Is an Agent Harness, Really? Nine Components Most Builders Miss - a working definition and the nine components every modern harness needs
Three Eras of AI Engineering: Prompt to Context to Harness - how the discipline moved and what each era absorbed from the one before
The Harness Evolution Principle: Why Mature Harnesses Look Like Pruning - the V1/V2 case study, the Boris anchor, and a practitioner’s pruning playbook
Building Your First Specialized Harness in Python: 9 Components, 12 Design Decisions - hands-on construction of a minimal harness with all nine components mapped to working code
Skills, Slash Commands, and Harnesses: A Discipline Hierarchy (this article) - where individual skills fit inside the broader harness and how the five levels interact

References

[1] Anthropic, “Extend Claude with skills,” Claude Code Documentation, 2025. https://code.claude.com/docs/en/skills

[2] Anthropic, “Effective harnesses for long-running agents,” Anthropic Engineering Blog, Nov 2025. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

[3] B. Cherny, E. Catto, et al., “How Anthropic engineers use Claude Code,” AI Engineering Summit, 2026. https://www.youtube.com/@aiengineer

[4] Anthropic, “Slash commands,” Claude Code Documentation, 2025. https://code.claude.com/docs/en/slash-commands

[5] Anthropic, “Hooks reference,” Claude Code Documentation, 2025. https://code.claude.com/docs/en/hooks

[6] Anthropic, “Subagents,” Claude Code Documentation, 2025. https://code.claude.com/docs/en/subagents

[7] B. Cherny, “Head of Claude Code: What happens after coding is solved,” Lenny’s Newsletter, 2025. https://www.lennysnewsletter.com/p/head-of-claude-code-what-happens

[8] B. Cherny, “Building Claude Code with Boris Cherny,” The Pragmatic Engineer, 2025. https://newsletter.pragmaticengineer.com/p/building-claude-code-with-boris-cherny

[9] NYC Department of Buildings, “Local Law 97: Greenhouse Gas Emissions Reductions,” 2024. https://www.nyc.gov/site/buildings/codes/ll97-greenhouse-gas-emissions-reductions.page