Infra & Security 30 April 2026 · 5 min

Agent Skills Are Build Artifacts, Not Prompt Snippets

By wGrow Project Team · 30 April 2026

The Procurement API Pagination Incident

A developer updated a data-extraction skill for our internal procurement agent crew directly via a web UI. The agent immediately started hammering a vendor API in a pagination loop. Three hours later — after digging through logs, reconstructing the prompt string from memory, and verifying the vendor wasn’t about to block our IP — we had a working system again. We had no rollback path because the previous instruction string was gone.

The failure was specific. The system treated agent skills as mutable configuration strings rather than executable logic. The skill lived in a text field — anyone with editor access could change it, no review required, no hash recorded, no version history.

If a piece of text dictates how a system interacts with external infrastructure, that text requires version control.

Natural Language as Executable Code

Figure: Agent Execution Flow. Step 01: Receive Goal → Step 02: Fetch Versioned Skill → Step 03: Generate Payload → Step 04: Execute API.

The industry keeps misclassifying agentic instructions. A prompt attached to a tool-calling capability is not configuration data. It is code.

The mechanics are simple: an agent receives a goal, retrieves a skill (an instruction block paired with a tool signature), and executes it. The skill determines which API endpoint gets called, what parameters get constructed, and how the response gets handled. Change “summarise” to “extract exactly” and you can change both the execution path and the API payload shape. That’s a code change, not a settings tweak.
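
To make the "instruction plus tool signature" idea concrete, here is a minimal sketch in Python. The `Skill` shape, its field names, and the `procurement_api.list_orders` tool name are assumptions made for illustration, not the structure of any system described here. It shows only that a small wording change yields a different artifact once the skill is content-hashed.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill shape: an instruction block paired with a tool signature."""
    name: str
    instruction: str
    tool: str            # assumed tool identifier, illustrative only
    parameters: tuple    # parameter names the tool accepts

    def digest(self) -> str:
        # Canonical JSON so identical content always produces the same hash.
        payload = json.dumps(
            {
                "name": self.name,
                "instruction": self.instruction,
                "tool": self.tool,
                "parameters": list(self.parameters),
            },
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


before = Skill("extract_orders", "Summarise the open purchase orders.",
               "procurement_api.list_orders", ("page", "page_size"))
after = Skill("extract_orders", "Extract exactly the open purchase orders.",
              "procurement_api.list_orders", ("page", "page_size"))

# Two words changed in the instruction; the artifact identity changed with them.
changed = before.digest() != after.digest()
```

Hashing the instruction together with the tool signature means a change to either one is visible as a new artifact, which is the property a build pipeline relies on.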

Standard CI/CD discipline applies. Python modules have ownership, changelogs, semver, and deprecation notices. Go packages go through code review before touching production. A natural-language instruction that calls a live API deserves the same treatment — the fact that it’s written in English rather than a typed language does not reduce its blast radius.

Skills need the same lifecycle as build artifacts: ownership, versioning, review gates, and a retirement process. The procurement incident would have been a two-minute rollback if we’d treated the skill that way.

OWASP Classifies Skills as Supply-Chain Risks

Figure: Coverage Gap (Standard SAST/DAST vs. Agentic NL Skills). Syntax & CVEs: ✓ / N/A. Instruction Drift: ✗ / blind spot. Blast Radius Limits: blind spot.

OWASP Top 10 for LLM Applications v1.1, item LLM05, covers Supply Chain Vulnerabilities. The category explicitly calls out compromised third-party models, plugins, and tools as attack surfaces. Sourcing an unversioned, externally hosted agent skill introduces a direct supply-chain exposure by that definition. So does an internal skill with no audit trail and no approved-version record.

Here’s where security teams tend to underestimate the risk: the scanner gap. Traditional SAST tools analyze Python and JavaScript source by parsing syntax trees and intermediate representations; DAST tools exercise running services from the outside. Neither class is built to detect semantic drift inside a natural-language skill string. An attacker who modifies a skill string, or a developer who does so carelessly, is unlikely to trigger existing automated security controls. To that toolchain, the manipulation is effectively invisible.

If a security team cannot reproduce the exact instruction payload an agent executed at 14:00 on a Tuesday, the system fails a basic compliance audit. You cannot investigate an incident you cannot reconstruct.
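
One way to make an execution reconstructable is to write an audit line at call time that ties the timestamp to the exact skill bytes. The sketch below is illustrative: the record fields, the `execution_record` helper, and the choice of SHA-256 are assumptions, and the payload is a placeholder.

```python
import hashlib
import json
from datetime import datetime, timezone


def execution_record(skill_urn: str, payload: bytes, when: datetime) -> str:
    """Return one JSON audit line linking an execution to the exact skill payload."""
    record = {
        "executed_at": when.astimezone(timezone.utc).isoformat(),
        "skill_urn": skill_urn,
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


# Placeholder payload; 28 April 2026 is a Tuesday, logged here at 14:00 UTC.
line = execution_record(
    "skill_urn:wgrow:waterdoctor:query_timeseries:v1.2.0",
    b'{"instruction": "placeholder"}',
    datetime(2026, 4, 28, 14, 0, tzinfo=timezone.utc),
)
```

With the URN and the hash in the log, an investigator can fetch that version from the registry and confirm the bytes match what actually ran.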

Building the WaterDoctor Skill Registry

Agent Manifest

agent:
  name: water-ops
  skills:
    # Pinned version guarantees deterministic rollbacks
    - skill_urn:wgrow:waterdoctor:query_timeseries:v1.2.0
    - skill_urn:wgrow:core:format_report:v2.1.1

That compliance constraint drove the approach we took while building WaterDoctor, a platform for parsing time-series IoT data from industrial sensors. Agent crews needed reusable skills for querying telemetry, detecting anomalies, and formatting reports for plant operators. The skills were moderately complex and had to work correctly against live sensor data — getting a query window wrong by a single unit could mean a crew reporting a clean sensor that was, in fact, failing.

We stopped allowing inline prompt strings. Skills are now stored as versioned JSON payloads in S3, managed through Git. An agent’s configuration file requests a skill by URN: skill_urn:wgrow:waterdoctor:query_timeseries:v1.2.0. The agent runtime fetches the payload at that exact version and executes it. The previous version, v1.1.9, remains in the registry.

If v1.2.0 causes the agent to misinterpret a sensor’s telemetry window, we revert the agent manifest to v1.1.9 and redeploy. The rollback is deterministic. The three-hour procurement incident becomes a two-minute operation because we know exactly what changed, when it changed, and what state is known-good.
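
The pin-and-revert flow can be sketched as follows. This is an assumption-heavy illustration: the segment names (`org`, `project`, `name`) are inferred from the example URN, and the in-memory `REGISTRY` with placeholder payloads stands in for the real versioned store.

```python
from typing import NamedTuple


class SkillRef(NamedTuple):
    org: str
    project: str
    name: str
    version: tuple  # (major, minor, patch) as integers


def parse_skill_urn(urn: str) -> SkillRef:
    """Parse 'skill_urn:<org>:<project>:<name>:v<major>.<minor>.<patch>'."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "skill_urn" or not parts[4].startswith("v"):
        raise ValueError(f"not a pinned skill URN: {urn!r}")
    numbers = parts[4][1:].split(".")
    if len(numbers) != 3 or not all(n.isdigit() for n in numbers):
        raise ValueError(f"version must be vMAJOR.MINOR.PATCH: {urn!r}")
    return SkillRef(parts[1], parts[2], parts[3], tuple(int(n) for n in numbers))


# Hypothetical registry contents; payloads are placeholders.
REGISTRY = {
    "skill_urn:wgrow:waterdoctor:query_timeseries:v1.1.9": b'{"instruction": "placeholder v1.1.9"}',
    "skill_urn:wgrow:waterdoctor:query_timeseries:v1.2.0": b'{"instruction": "placeholder v1.2.0"}',
}


def fetch_skill(urn: str) -> bytes:
    parse_skill_urn(urn)   # rejects floating references such as "latest"
    return REGISTRY[urn]   # KeyError if that exact version was never published


def rollback(manifest_skills: list, bad_urn: str, good_urn: str) -> list:
    """Return a new skill list with bad_urn replaced by good_urn."""
    fetch_skill(good_urn)  # confirm the known-good version still exists
    return [good_urn if s == bad_urn else s for s in manifest_skills]


manifest = [
    "skill_urn:wgrow:waterdoctor:query_timeseries:v1.2.0",
    "skill_urn:wgrow:core:format_report:v2.1.1",
]
rolled_back = rollback(
    manifest,
    "skill_urn:wgrow:waterdoctor:query_timeseries:v1.2.0",
    "skill_urn:wgrow:waterdoctor:query_timeseries:v1.1.9",
)
```

Because the parser refuses anything that is not an exact version, a manifest can never silently pick up a newer skill; moving forward or backward is always an explicit edit.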

Review Gates for Database Credentials

Figure: Review Gate. Commit (Developer: update NL instruction) → Review (Security: audit blast radius) → Deploy (CI/CD: attach IAM & hash).

Natural language dictates how credentials get used — and that link between instruction text and credential scope is precisely what makes security leads pay attention.

The WaterDoctor IoT parsing skill required read-only database credentials scoped to the sensor data schema. We refused to attach those credentials to an unversioned text string. The rule: no automated credential injection without a cryptographic hash of the approved skill payload. If the hash of the skill fetched at runtime does not match the hash stored in the approval record held by our credential broker, the credential is not released.
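
A hash gate of this kind might look like the sketch below. The text specifies only "a cryptographic hash", so SHA-256 is an assumption here, as are the `release_credential` function, the `CredentialDenied` exception, and the dictionary-backed approval record and secret store.

```python
import hashlib
import hmac


class CredentialDenied(Exception):
    """Raised when a skill payload does not match its approval record."""


def release_credential(skill_urn: str, fetched_payload: bytes,
                       approvals: dict, secrets: dict) -> str:
    """Release a credential only if the fetched payload hashes to the approved value."""
    approved_hash = approvals.get(skill_urn)
    if approved_hash is None:
        raise CredentialDenied(f"no approval record for {skill_urn}")
    actual_hash = hashlib.sha256(fetched_payload).hexdigest()
    # Constant-time comparison of the two hex digests.
    if not hmac.compare_digest(actual_hash, approved_hash):
        raise CredentialDenied(f"payload hash mismatch for {skill_urn}")
    return secrets[skill_urn]


# Placeholder data for illustration only.
urn = "skill_urn:wgrow:waterdoctor:query_timeseries:v1.2.0"
approved_payload = b'{"instruction": "placeholder"}'
approvals = {urn: hashlib.sha256(approved_payload).hexdigest()}
secrets = {urn: "readonly-token-placeholder"}

token = release_credential(urn, approved_payload, approvals, secrets)
```

The gate fails closed: a missing approval record and a mismatched hash both raise, so an edited payload never reaches the credential lookup.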

Changing a skill requires a pull request. The security lead reviews the prompt diff — not just the metadata, but the actual instruction language — to verify the agent is not being directed to widen its query scope, bypass tenant isolation, or exceed the credential boundary. Only after that review does the new version hash get written into the credential broker’s allowlist and the deployment manifest.
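
For the review step, the instruction text can be diffed like any other source file. The sketch below uses Python's standard `difflib`; the two instruction strings are invented for illustration and are not the real WaterDoctor skill text.

```python
import difflib


def instruction_diff(old: str, new: str, old_label: str, new_label: str) -> str:
    """Return a unified diff of two instruction texts for reviewer inspection."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical instruction texts: one word changes the query scope.
old_text = ("Query sensor readings for the requested plant.\n"
            "Limit the window to the last 24 hours.")
new_text = ("Query sensor readings for the requested plant.\n"
            "Limit the window to the last 24 days.")

diff = instruction_diff(old_text, new_text, "v1.1.9", "v1.2.0")
```

Keeping one instruction sentence per line in the stored skill makes diffs like this easier to read, since a scope change shows up as a single removed and added line.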

This is overhead. It is also exactly what you already do before deploying code that touches a production database. The cost of not doing it showed up as three hours of incident response.

Force the Commit

Stop editing agent behaviors in web consoles. Disconnect the UI from live agent logic.

Agentic systems are software systems. The instructions running inside them have blast radii. A poorly worded extraction skill can hammer a vendor API. An unapproved skill version can violate a tenant boundary. A missing rollback path turns a one-line fix into a three-hour recovery.

The DevSecOps baseline for agentic infrastructure isn’t complicated. If it executes against live systems, force the commit. Require a pull request. Keep a changelog. Give every skill an owner and a review date.
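
The ownership and review-date rule can be enforced mechanically in CI. Here is a minimal sketch, assuming each skill ships a metadata dictionary with `owner`, `review_by` (an ISO date string), and `changelog` keys; those field names are hypothetical.

```python
from datetime import date

# Assumed metadata fields; adjust to whatever the registry actually stores.
REQUIRED_FIELDS = ("owner", "review_by", "changelog")


def lint_skill_metadata(meta: dict, today: date) -> list:
    """Return a list of problems; an empty list means the skill passes."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not meta.get(field)]
    review_by = meta.get("review_by")
    if review_by:
        try:
            if date.fromisoformat(review_by) < today:
                problems.append("review date has passed")
        except ValueError:
            problems.append("review_by is not an ISO date")
    return problems


# Placeholder metadata with an overdue review date.
stale = {"owner": "water-ops-team", "review_by": "2026-01-31", "changelog": "placeholder entry"}
problems = lint_skill_metadata(stale, today=date(2026, 4, 30))
```

A pipeline could fail the build whenever the returned list is non-empty, so a skill without an owner or with an overdue review never ships.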

Words are executable now. Version control them accordingly.
