Playbooks, Not Prompts: Why Agent Work Runs on Contracts

The default approach to AI agents is to write a long prompt, add a few examples, and hope the model does what you want. This works for demos. It fails for operations. The failure modes are predictable, and they all trace back to the same root cause: prompts are unstructured text with no formal contract.

A 2,000-word prompt that works 90% of the time is not a workflow. It is a weighted coin that still comes up tails one run in ten. Running a company needs better odds than that. When a prompt fails, you cannot diff it, you cannot roll it back, and you cannot test it in isolation. You rewrite it and hope again.

This is why every workflow in Capx Casa is a playbook, not a prompt. A playbook is a declarative specification for a unit of company work. It describes what needs to happen, step by step. It defines what a good output looks like. And it carries the governance rules that decide what the agents may do on their own and what waits for the founder. Playbooks are to agent-run companies what Dockerfiles are to containers: a portable, versionable, composable format that the platform knows how to run.

Prompts vs playbooks

Dimension	Raw prompt	Playbook
Version control	Blob of text, diffs are unreadable	Structured fields, clean diffs, meaningful history
Testability	Run it and see what happens	Each step testable in isolation, every output graded
Composability	Copy-paste between prompts	Playbooks chain into pipelines, outputs feed inputs
Governance	No structure to attach policies to	Approval mode and spend cap declared per playbook
Debugging	Re-read the whole prompt	Trace which step failed, inspect its inputs and outputs
Rollback	Ctrl-Z in your editor	Revert to the last known-good version, instantly

Every row in that table is a real failure mode of prompt-driven systems. The structure is not added complexity. It is the minimum structure needed to run agent work reliably when nobody is watching.

What a playbook looks like

A playbook declares what the agent does (steps), how its output is judged (rubric), and what governance rules apply (execution). Here is the shape of one, simplified for illustration.

YAML: competitor-analysis (illustrative)

name: competitor-analysis
role: strategist
schedule: every monday 9am

steps:
  - id: gather
    action: research competitor pages and recent changes
  - id: analyze
    action: compare pricing, positioning, and product moves
  - id: draft
    action: write a strategy memo for the founder

rubric:
  - actionability: does the memo contain specific next steps?
  - evidence: is every claim backed by the research?
  - brevity: under 800 words, no filler

execution:
  mode: review       # the founder approves before it ships
  spend_cap: 18      # credits this run may not exceed

Every field serves a purpose. The schedule controls when it runs. The steps define what happens, in order. The rubric defines what good looks like. The execution block defines the governance rules. No ambiguity, no interpretation.
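One way to picture "the platform knows how to run it" is a loader that turns the parsed file into typed fields and rejects anything malformed before a single step executes. The sketch below is a guess at that idea in Python, not Capx Casa's actual loader: `load_playbook`, the `Playbook` and `Step` classes, and the `ALLOWED_MODES` set are hypothetical names, and only the `review` mode appears in the example above. The dict stands in for the YAML after parsing.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed set of approval modes; only "review" appears in the example playbook.
ALLOWED_MODES = {"review", "auto"}


@dataclass(frozen=True)
class Step:
    id: str
    action: str


@dataclass(frozen=True)
class Playbook:
    name: str
    role: str
    schedule: str
    steps: tuple[Step, ...]
    rubric: dict[str, str]
    mode: str
    spend_cap: int


def load_playbook(raw: dict) -> Playbook:
    """Validate a parsed playbook mapping and return a typed Playbook."""
    steps = tuple(Step(s["id"], s["action"]) for s in raw["steps"])
    if not steps:
        raise ValueError("a playbook needs at least one step")
    ids = [s.id for s in steps]
    if len(ids) != len(set(ids)):
        raise ValueError("step ids must be unique")
    # Each rubric entry is a single-key mapping, as in the YAML above.
    rubric = {k: v for entry in raw["rubric"] for k, v in entry.items()}
    execution = raw["execution"]
    if execution["mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {execution['mode']!r}")
    if execution["spend_cap"] <= 0:
        raise ValueError("spend_cap must be positive")
    return Playbook(
        raw["name"], raw["role"], raw["schedule"],
        steps, rubric, execution["mode"], execution["spend_cap"],
    )


# The illustrative playbook, as a YAML parser would hand it over.
raw = {
    "name": "competitor-analysis",
    "role": "strategist",
    "schedule": "every monday 9am",
    "steps": [
        {"id": "gather", "action": "research competitor pages and recent changes"},
        {"id": "analyze", "action": "compare pricing, positioning, and product moves"},
        {"id": "draft", "action": "write a strategy memo for the founder"},
    ],
    "rubric": [
        {"actionability": "does the memo contain specific next steps?"},
        {"evidence": "is every claim backed by the research?"},
        {"brevity": "under 800 words, no filler"},
    ],
    "execution": {"mode": "review", "spend_cap": 18},
}

playbook = load_playbook(raw)
print(playbook.name, [s.id for s in playbook.steps], playbook.spend_cap)
```

A typo in `mode` or a negative `spend_cap` fails at load time, not halfway through a run. That is the practical difference between a field and a sentence buried in a prompt.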

The rubric: quality as a parameter

The rubric is the most underappreciated part of the playbook. It is an automated quality gate. After the agent produces its output, the output is evaluated against the declared criteria and scored. If the score falls below the bar, the output is flagged for review or retried instead of being delivered.

This is what makes playbooks self-correcting. A prompt-based system produces output and you hope it is good. A playbook-based system produces output, grades it, and only delivers it if it passes. Quality stops being a subjective judgment and becomes a measurable, tunable parameter of the company.
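The gate described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function name `run_with_gate`, the 0.8 pass bar, the two-attempt limit, and the rule that the weakest criterion decides the score. The toy grader uses keyword and word-count checks only so the example runs on its own. It stands in for whatever evaluator a real platform would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    status: str      # "delivered" or "flagged"
    output: str
    score: float
    attempts: int


def run_with_gate(
    produce: Callable[[int], str],
    grade: Callable[[str], dict[str, float]],
    pass_bar: float = 0.8,      # assumed threshold
    max_attempts: int = 2,      # assumed retry budget
) -> GateResult:
    """Produce, grade, and deliver only if every criterion clears the bar."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    output, score = "", 0.0
    for attempt in range(1, max_attempts + 1):
        output = produce(attempt)
        # Assumption: the weakest criterion decides the overall score.
        score = min(grade(output).values())
        if score >= pass_bar:
            return GateResult("delivered", output, score, attempt)
    # Out of retries: hold the output for human review instead of shipping it.
    return GateResult("flagged", output, score, max_attempts)


def toy_grade(memo: str) -> dict[str, float]:
    """Stand-in grader covering two of the example rubric's criteria."""
    return {
        "actionability": 1.0 if "next step" in memo.lower() else 0.0,
        "brevity": 1.0 if len(memo.split()) < 800 else 0.0,
    }


drafts = [
    "Competitor X cut prices.",
    "Competitor X cut prices. Next step: test a lower tier.",
]
result = run_with_gate(lambda attempt: drafts[attempt - 1], toy_grade)
print(result.status, result.attempts)  # delivered 2
```

The first draft fails on actionability, so it is retried. The second passes and is delivered. Raising or lowering `pass_bar` is what "tunable" means in practice.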

Versioning and rollback

Playbooks are files. They are versioned. When something breaks, you do not debug a 2,000-word prompt from memory. You look at the diff between the current version and the last version that worked.

Diff: competitor-analysis, v2.3 to v2.4

--- competitor-analysis (v2.3)
+++ competitor-analysis (v2.4)
@@ steps
   - id: analyze
     action: compare pricing, positioning, and product moves
-      including social mentions
+      including hiring signals and content themes

@@ rubric
+  - brevity: under 800 words, no filler

@@ execution
-  spend_cap: 15
+  spend_cap: 18

The diff tells the whole story: social mentions swapped out for hiring signals and content themes in the analysis step, a brevity criterion added, the spend cap raised by 3 credits. If the new version performs worse, the rollback is instant. Try doing that with a prompt.
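Because playbooks are plain text, ordinary tooling is enough to diff them and roll them back. The sketch below uses Python's standard `difflib` and an in-memory history. `PlaybookHistory` is a hypothetical helper, not a Capx Casa API, and a real setup would more likely lean on a version control system such as git. Only the execution block is shown, to keep the output short.

```python
import difflib


class PlaybookHistory:
    """Hypothetical in-memory version history for one playbook."""

    def __init__(self) -> None:
        self._versions: list[tuple[str, str]] = []

    def commit(self, label: str, text: str) -> None:
        self._versions.append((label, text))

    def current(self) -> tuple[str, str]:
        return self._versions[-1]

    def diff_latest(self) -> str:
        """Unified diff between the previous version and the current one."""
        if len(self._versions) < 2:
            raise ValueError("need at least two versions to diff")
        (old_label, old), (new_label, new) = self._versions[-2:]
        lines = difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile=old_label, tofile=new_label, lineterm="",
        )
        return "\n".join(lines)

    def rollback(self) -> tuple[str, str]:
        """Drop the latest version and return the one before it."""
        if len(self._versions) < 2:
            raise ValueError("no earlier version to roll back to")
        self._versions.pop()
        return self.current()


V23 = "execution:\n  mode: review\n  spend_cap: 15\n"
V24 = "execution:\n  mode: review\n  spend_cap: 18\n"

history = PlaybookHistory()
history.commit("competitor-analysis (v2.3)", V23)
history.commit("competitor-analysis (v2.4)", V24)

diff_text = history.diff_latest()
print(diff_text)

# v2.4 performs worse, so go back to the last known-good version.
label, text = history.rollback()
print("now running:", label)
```

The rollback is a pointer move, not a rewrite. Nothing about v2.3 has to be remembered or reconstructed.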

Composition

Playbooks compose. The output of one becomes the input of the next. This lets a company build complex operations from simple, tested building blocks.

competitor-analysis: Runs Monday morning. Produces a strategy memo with competitive insights.

content-calendar: Runs after the analysis completes. Uses the memo to identify content gaps and plans the week.

blog-post-draft: Triggered for each item in the calendar. Drafts a post targeting the identified gap.

social-distribution: Triggered when a draft passes its rubric. Prepares platform-specific posts to promote it.

Four playbooks, each independently testable, each with its own rubric, each with its own governance rules. They compose into a full content pipeline that runs every week, and every output is graded before it moves to the next stage. A bad analysis never becomes a bad blog post, because it fails its rubric before the next playbook ever sees it.
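The hand-off between stages can be sketched as a loop that only passes an output forward once it clears its own rubric. This is a simplified, linear picture under stated assumptions: `Stage` and `run_pipeline` are hypothetical names, the rubric is reduced to a pass/fail check, and the per-item fan-out of blog-post-draft is left out.

```python
from typing import Callable, NamedTuple, Optional


class Stage(NamedTuple):
    name: str
    run: Callable[[str], str]              # takes the previous stage's output
    passes_rubric: Callable[[str], bool]   # simplified pass/fail rubric


def run_pipeline(
    stages: list[Stage], seed: str
) -> tuple[list[str], Optional[str]]:
    """Run stages in order; stop at the first output that fails its rubric.

    Returns the names of completed stages and the name of the stage that
    halted the pipeline (None if every stage passed).
    """
    completed: list[str] = []
    payload = seed
    for stage in stages:
        output = stage.run(payload)
        if not stage.passes_rubric(output):
            return completed, stage.name
        completed.append(stage.name)
        payload = output  # a graded output becomes the next stage's input
    return completed, None


calls: list[str] = []


def weak_analysis(_: str) -> str:
    calls.append("competitor-analysis")
    return "Competitors exist."  # no next steps, so it should fail


def plan_calendar(memo: str) -> str:
    calls.append("content-calendar")
    return f"Calendar based on: {memo}"


stages = [
    Stage("competitor-analysis", weak_analysis, lambda m: "next step" in m.lower()),
    Stage("content-calendar", plan_calendar, lambda c: len(c) > 0),
]

completed, halted_at = run_pipeline(stages, seed="monday run")
print(completed, halted_at, calls)
```

The weak memo stops at the first stage, and `plan_calendar` is never called. Each stage can also be tested on its own by handing it a fixed input.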

Why this matters

The choice between prompts and playbooks is not a matter of taste. It is the difference between a system that works when you are watching and a system that works when you are not. Prompts are conversations. Playbooks are contracts. Conversations drift. Contracts hold.

You do not have to write playbooks from scratch. Capx Casa ships with a library of templates for common company operations. Pick one, customize the fields that matter for your business, and let your AI cofounder run it.