
Vincent Burckhardt

Agent skills look like markdown. Treat them like software.

Quality, testing, and security for a format that is brand new and already everywhere

I have spent the last couple of weeks exploring agent skills, in the context of making it easier for our customers to build complex infrastructure-as-code solutions on top of our curated Terraform modules. The domain is not directly relevant here, but the fact that we want to eventually ship skills to customers is what shaped how I think about what follows.

Skills look simple: markdown instructions, optional scripts, maybe some reference files. In practice that usually means front matter, a core SKILL.md, and supporting docs or code. The Agent Skills specification turned this into a shared format with progressive disclosure, so assistants can read metadata first and only load the full body when needed. Anthropic introduced it in December 2025, and major coding assistants adopted it quickly.
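
As a rough illustration of that layering, here is a minimal Python sketch of metadata-first loading: only the front matter is parsed up front, and the full body is read later if the skill is actually triggered. The field names and file path are assumptions for illustration, not anything mandated by the spec.

```python
# Minimal sketch of progressive disclosure: read only the front matter up front,
# and defer loading the full SKILL.md body until the skill is actually needed.
from pathlib import Path

import yaml  # pip install pyyaml


def read_front_matter(skill_path: Path) -> dict:
    """Parse only the YAML front matter block at the top of SKILL.md."""
    text = skill_path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return {}
    _, front_matter, _body = text.split("---", 2)
    return yaml.safe_load(front_matter) or {}


def load_body(skill_path: Path) -> str:
    """Load the full instruction body only when the skill is triggered."""
    text = skill_path.read_text(encoding="utf-8")
    return text.split("---", 2)[2] if text.startswith("---") else text


# At startup the assistant only sees name + description (a tiny token footprint);
# the body gets pulled in later if the metadata matches the user's intent.
meta = read_front_matter(Path("skills/terraform-module-review/SKILL.md"))  # hypothetical path
print(meta.get("name"), "-", meta.get("description"))
```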

That speed makes one thing urgent: quality and governance. The first serious, customer-facing skills are only now starting to show up.

Creating a skill is easy. Creating a good skill is hard

A skill can be just markdown, and that is the appeal. A compact way to package procedural knowledge. The buzz around Anthropic's contract-review skill for Claude Cowork showed this well: it rattled legal tech stocks, and it is a couple hundred lines of markdown. Not a product. A SKILL.md file.

But low creation cost creates noise. Catalogs and registries have multiplied. skills.sh alone lists tens of thousands of skills. That proves demand, not quality. Maybe 90% of published skills are not really useful, not well designed, or just someone's personal workflow opinions packaged as reusable assets. Discovery and filtering can end up costing more than just doing the task directly.

The hard part is not writing instructions. The hard part is designing a reliable flow with clear boundaries and good triggers, with gates that reduce failure rates. Good skills checkpoint risky branches, force validation steps, avoid ambiguous handoffs.

A slightly more technical aside, but worth mentioning: skills are not the only way to augment an assistant. Slash commands (user-triggered instruction sets) and MCP servers (external tools and data sources the assistant can call) are converging with skills into a similar pattern: reusable instruction bundles, often with optional scripts. Frameworks like spec-kit lean into that convergence. Skills also compose well with MCP servers, which can provide information in a more token-efficient format than raw API calls (compact representations like TOON instead of full JSON payloads). And MCP gateways like MCP Context Forge can add governance and security between the assistant and the data source, which is harder when a skill instructs the assistant to make direct calls.

Are skills documentation, or are they software?

I have started to see them as software. At least for skills we want to push to customers and be able to support. That is the main takeaway from these last couple of weeks: if it needs to be supportable, treat it like a software artifact.

Start with static analysis in CI. Tools like skills-ref validate check structure and naming against the spec, but that is just one check among many: markdown linting, context budget enforcement (keeping startup metadata near its ~100-token footprint, keeping SKILL.md under recommended size thresholds), and whatever project-specific rules make sense.
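
As a sketch of what a context budget check might look like in CI, assuming a `skills/` directory layout and a crude characters-to-tokens heuristic. The thresholds are illustrative choices, not values from the spec.

```python
# Illustrative CI check for context budgets. The thresholds are assumptions;
# the ~4-chars-per-token estimate keeps the check dependency-free.
import sys
from pathlib import Path

METADATA_TOKEN_BUDGET = 100     # assumed budget for startup metadata
SKILL_BODY_TOKEN_BUDGET = 5000  # assumed threshold for the full SKILL.md


def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic; swap in a real tokenizer for precision


def check_skill(skill_md: Path) -> list[str]:
    text = skill_md.read_text(encoding="utf-8")
    front_matter = text.split("---", 2)[1] if text.startswith("---") else ""
    errors = []
    if approx_tokens(front_matter) > METADATA_TOKEN_BUDGET:
        errors.append(f"{skill_md}: front matter exceeds ~{METADATA_TOKEN_BUDGET} tokens")
    if approx_tokens(text) > SKILL_BODY_TOKEN_BUDGET:
        errors.append(f"{skill_md}: SKILL.md exceeds ~{SKILL_BODY_TOKEN_BUDGET} tokens")
    return errors


if __name__ == "__main__":
    problems = [e for p in Path("skills").rglob("SKILL.md") for e in check_skill(p)]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```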

Then handle lifecycle. Versioning helps, but semver alone is not enough. Skills should include ownership and last-updated metadata. This gets harder in monorepos that hold unrelated skills, and many installers still do not provide a clean update path for already-installed skills.
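
A lifecycle gate can be equally small. The sketch below assumes `version`, `owner`, and `last_updated` front matter fields plus a 180-day staleness window; both the field names and the window are my own conventions, not part of the format.

```python
# Sketch of a lifecycle metadata gate over parsed front matter.
# Required fields and staleness threshold are assumptions for illustration.
import re
from datetime import date, timedelta

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
MAX_AGE = timedelta(days=180)  # assumed staleness threshold


def check_lifecycle(meta: dict) -> list[str]:
    errors = []
    if not SEMVER.match(str(meta.get("version", ""))):
        errors.append("version must be semver (e.g. 1.2.0)")
    if not meta.get("owner"):
        errors.append("owner is required so someone is accountable for updates")
    updated = meta.get("last_updated")
    if not updated:
        errors.append("last_updated is required")
    elif date.today() - date.fromisoformat(str(updated)) > MAX_AGE:
        errors.append(f"skill has not been reviewed in over {MAX_AGE.days} days")
    return errors
```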

Modularity matters just as much. Small, single-purpose skills are easier to compose and less likely to conflict. Large catch-all skills drift into ambiguity. If a skill ships Bash or Python, those scripts follow the normal lifecycle too: review, tests, security checks, clear ownership.

Testing skills needs a TDD-like loop

This is where I see the biggest gap. Most teams still test skills informally: try a couple of prompts, get one good answer, ship it. That is not enough for probabilistic systems.

There is also a real variance problem across assistants. Some trigger skills naturally from intent, others need explicit prompting. Codex often triggers on loosely related tasks. Claude Code and GitHub Copilot can need more prompt shaping depending on context. Once a skill is loaded, instruction-following quality still differs by assistant and model. Testing a skill on one assistant and calling it done is not enough.

I have been toying with a TDD-like approach to this, adapted for non-determinism.

  1. Run a baseline without skill or MCP augmentation and capture failures (red).
  2. Add the skill and MCP context until behavior is acceptable (green).
  3. Refactor for stability and token efficiency.

Gates should be statistical, not binary. Run each scenario multiple times, track aggregate scores and standard deviation, inspect tail behavior. A good average can hide an ugly long tail. Ideally you run this across a matrix of assistants and models, so teams can publish minimum support expectations for each skill instead of claiming universal compatibility.
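
A minimal version of such a gate, with illustrative thresholds and the worst run standing in as a crude tail proxy (a percentile works better once run counts are higher):

```python
# Sketch of a statistical gate: run a scenario many times and gate on the
# aggregate and the tail, not on a single lucky pass. Thresholds are illustrative.
import statistics


def gate(scores: list[float], min_mean=0.85, max_stdev=0.10, min_worst=0.60) -> bool:
    """scores: one 0..1 score per run of the same scenario."""
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    worst = min(scores)  # crude tail proxy; use a percentile with more runs
    print(f"mean={mean:.2f} stdev={stdev:.2f} worst={worst:.2f}")
    return mean >= min_mean and stdev <= max_stdev and worst >= min_worst


# e.g. 20 runs of one scenario on one assistant/model pair
runs = [0.9, 0.95, 0.88, 0.91, 0.4, 0.93, 0.9, 0.89, 0.92, 0.94,
        0.9, 0.87, 0.93, 0.91, 0.9, 0.95, 0.88, 0.9, 0.92, 0.91]
print("pass" if gate(runs) else "fail")  # fails: one bad run drags down the worst case and inflates the variance
```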

Scoring should mix deterministic and non-deterministic checks. Deterministic checks run assertions on outputs (files generated, content matching, expected tool calls, and so on) and can also verify conversation structure. Non-deterministic checks can use semantic similarity scores or LLM-as-judge models, with targeted human review on high-risk workflows. Cost matters too: token usage, latency, retry count. If a skill improves quality but doubles cost and variance, that may be a bad trade.
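
For the per-run scoring itself, here is a sketch of how deterministic assertions, a judge score, and cost metrics might be combined. The `RunResult` fields, the guardrail assertion, and the externally supplied judge score are placeholders to be wired into whatever harness and judge model you actually use.

```python
# Sketch of a mixed scorer for one evaluation run. Field names are assumptions.
from dataclasses import dataclass


@dataclass
class RunResult:
    files_created: set[str]
    tool_calls: list[str]
    transcript: str
    tokens_used: int
    latency_s: float


def deterministic_score(run: RunResult, expected_files: set[str], expected_tools: list[str]) -> float:
    checks = [
        expected_files <= run.files_created,                # all expected files exist
        all(t in run.tool_calls for t in expected_tools),   # expected tool calls happened
        "terraform apply" not in run.transcript,            # example guardrail assertion
    ]
    return sum(checks) / len(checks)


def combined_score(run: RunResult, judge_score: float, expected_files, expected_tools) -> dict:
    # judge_score would come from an LLM-as-judge rubric (0..1); here it is just an input
    return {
        "deterministic": deterministic_score(run, expected_files, expected_tools),
        "judge": judge_score,
        "tokens": run.tokens_used,
        "latency_s": run.latency_s,
    }
```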

And there is a more basic question that many teams skip: is the skill useful at all? In many cases, the base model already has the knowledge. If a skill does not measurably improve quality, consistency, safety, or cost, it should not exist. Offline evals help, but you still need a feedback loop from real users in live usage.

Security is still the wild west

Security is the least mature part of this space, and it shows.

The integration guidance says script execution is risky and should be sandboxed. That warning is justified. Skills can be pulled from many catalogs through different installers, with little shared trust model.

The most installed skill on skills.sh is one that automatically discovers and installs more skills from the internet, with a -y flag that skips user confirmation. Not an edge case. That is the default user pathway. No realistic central gate exists today. Many catalogs and installer implementations are out there, each with different review standards. Trust does not transfer across them.

Script execution is only one part of the risk though. Markdown instructions alone can encode harmful behavior: a skill could tell the assistant to ask users for sensitive information and route it to a third party. No scripts needed. Security review has to cover instruction content, not just executable assets.
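
Even a naive content scan over the instruction body catches the crudest cases, as long as hits are treated as triggers for human review rather than a real control. The patterns below are deliberately simple illustrations, not a vetted rule set.

```python
# Illustrative instruction-content scan. Pattern matching will not catch clever
# prompt injection, so treat hits as "needs human review", not as a security control.
import re

RISKY_PATTERNS = [
    r"api[ _-]?key|secret|password|credential",    # asks for sensitive data
    r"curl\s+-[a-zA-Z]*\s+http",                   # instructs direct outbound calls
    r"send (this|the) (data|output|file) to",      # routes data to a third party
    r"ignore (previous|prior) instructions",       # classic injection phrasing
]


def scan_instructions(skill_body: str) -> list[str]:
    hits = []
    for pattern in RISKY_PATTERNS:
        for match in re.finditer(pattern, skill_body, flags=re.IGNORECASE):
            hits.append(f"review: '{match.group(0)}' matches /{pattern}/")
    return hits
```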

I see two practical paths:

  1. Curated, security-vetted catalogs or some kind of certification scheme.
  2. Stronger assistant-side guardrails: sandboxing, tool allowlists, confirmation prompts for high-risk actions.

The second probably matters more in the short term because it protects users even when catalog governance is inconsistent. Many of these protections still sit outside the core skill format and depend on the assistant implementers.

Monetization comes after trust

I do think there is a monetization path for high-quality skills, especially in dense procedural domains: compliance, legal, medical, security, specialized infrastructure. Areas where there is a lot of procedural knowledge that people would pay for, if they trust the quality.

You are not selling markdown. You are selling well-maintained, tested knowledge. Closer to buying a technical playbook than a prompt file.

Monetization will probably look less like selling one static skill and more like subscriptions to curated pattern libraries, or MCP-backed knowledge systems where the skill is a thin layer over a gated database of patterns. Companies could also sell skill packs that integrate with their own products. Services like uupm.cc already point in that direction.

Where this leaves me

The format is good. Adoption is fast. Serious skills are just starting to emerge. What is missing is treating them like actual software: clear ownership, static checks, versioning, modular design, statistical evaluation, security controls, real user feedback loops. The basics we already know from software engineering, applied to a new kind of artifact.

I am still figuring out parts of this, especially around the eval tooling and how far cross-assistant testing is worth pushing for early-stage skills. But the direction feels clear. If we want skills to hold up in enterprise delivery, the bar has to go up. Otherwise they are just fancy prompts.
