The default approach to AI agents is: write a long prompt, add a few examples, and hope the model does what you want. This works for demos. It fails for operations. The failure modes are predictable and consistent, and they all trace back to the same root cause: prompts are unstructured text with no formal contract.
This is why every workflow in Capx Casa is a playbook, not a prompt. A playbook is a declarative specification for a unit of company work. It describes what needs to happen, step by step. It defines what a good output looks like. And it carries the governance rules that decide what the agents may do on their own and what waits for the founder. Playbooks are to agent-run companies what Dockerfiles are to containers: a portable, versionable, composable format that the platform knows how to run.
Prompts vs playbooks
| Dimension | Raw prompt | Playbook |
|---|---|---|
| Version control | Blob of text, diffs are unreadable | Structured fields, clean diffs, meaningful history |
| Testability | Run it and see what happens | Each step testable in isolation, every output graded |
| Composability | Copy-paste between prompts | Playbooks chain into pipelines, outputs feed inputs |
| Governance | No structure to attach policies to | Approval mode and spend cap declared per playbook |
| Debugging | Re-read the whole prompt | Trace which step failed, inspect its inputs and outputs |
| Rollback | Ctrl-Z in your editor | Revert to the last known-good version, instantly |
Every row in that table is a real failure mode of prompt-driven systems. The structure is not added complexity. It is the minimum structure needed to run agent work reliably when nobody is watching.
What a playbook looks like
A playbook declares what the agent does (steps), how its output is judged (rubric), and what governance rules apply (execution). Here is the shape of one, simplified for illustration.
```yaml
name: competitor-analysis
role: strategist
schedule: every monday 9am
steps:
  - id: gather
    action: research competitor pages and recent changes
  - id: analyze
    action: compare pricing, positioning, and product moves
  - id: draft
    action: write a strategy memo for the founder
rubric:
  - actionability: does the memo contain specific next steps?
  - evidence: is every claim backed by the research?
  - brevity: under 800 words, no filler
execution:
  mode: review     # the founder approves before it ships
  spend_cap: 18    # credits this run may not exceed
```

Every field serves a purpose. The schedule controls when it runs. The steps define what happens, in order. The rubric defines what good looks like. The execution block defines the governance rules. No ambiguity, no interpretation.
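To make the "formal contract" idea concrete, here is a minimal Python sketch of how a playbook with this shape could be modeled and checked before it runs. It is an illustration, not the platform's implementation: the class names, the `validate` function, and the `"auto"` mode are assumptions made for the example. Only `"review"` appears in the playbook above.

```python
from dataclasses import dataclass

# Assumption: "review" comes from the example playbook; "auto" is a hypothetical
# name for a mode in which agents act without founder approval.
VALID_MODES = {"review", "auto"}


@dataclass(frozen=True)
class Step:
    id: str
    action: str


@dataclass(frozen=True)
class Execution:
    mode: str
    spend_cap: int  # credits


@dataclass(frozen=True)
class Playbook:
    name: str
    role: str
    schedule: str
    steps: tuple      # tuple of Step, in execution order
    rubric: dict      # criterion name -> question the output must satisfy
    execution: Execution


def validate(pb: Playbook) -> list:
    """Return a list of problems; an empty list means the playbook is well-formed."""
    problems = []
    if not pb.steps:
        problems.append("playbook has no steps")
    ids = [s.id for s in pb.steps]
    if len(ids) != len(set(ids)):
        problems.append("step ids must be unique")
    if not pb.rubric:
        problems.append("playbook has no rubric criteria")
    if pb.execution.mode not in VALID_MODES:
        problems.append(f"unknown execution mode: {pb.execution.mode!r}")
    if pb.execution.spend_cap <= 0:
        problems.append("spend_cap must be positive")
    return problems


competitor_analysis = Playbook(
    name="competitor-analysis",
    role="strategist",
    schedule="every monday 9am",
    steps=(
        Step("gather", "research competitor pages and recent changes"),
        Step("analyze", "compare pricing, positioning, and product moves"),
        Step("draft", "write a strategy memo for the founder"),
    ),
    rubric={
        "actionability": "does the memo contain specific next steps?",
        "evidence": "is every claim backed by the research?",
        "brevity": "under 800 words, no filler",
    },
    execution=Execution(mode="review", spend_cap=18),
)

print(validate(competitor_analysis))
```

Because every field has a type and a rule, a malformed playbook can be rejected before any credits are spent, which is something a free-text prompt cannot offer.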
The rubric: quality as a parameter
The rubric is the most underappreciated part of the playbook. It is an automated quality gate. After the agent produces its output, the output is evaluated against the declared criteria and scored. If the score falls below the bar, the output is flagged for review or retried instead of being delivered.
This is what makes playbooks self-correcting. A prompt-based system produces output and you hope it is good. A playbook-based system produces output, grades it, and only delivers it if it passes. Quality stops being a subjective judgment and becomes a measurable, tunable parameter of the company.
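The produce, grade, then deliver-or-retry loop described above can be sketched in a few lines of Python. This is a hedged illustration: `run_with_gate`, `GateResult`, the `bar` threshold, and the keyword-based `toy_grade` function are all hypothetical stand-ins. How the platform actually scores output against a rubric is not specified in this article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    output: str
    score: float
    status: str    # "delivered" or "flagged"
    attempts: int


def run_with_gate(produce: Callable[[], str],
                  grade: Callable[[str], float],
                  bar: float = 0.8,
                  max_attempts: int = 2) -> GateResult:
    """Produce output, grade it, and deliver it only if the score meets the bar.

    Retries up to max_attempts times; if no attempt passes, the last output
    is flagged for review instead of being delivered.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        output = produce()
        score = grade(output)
        if score >= bar:
            return GateResult(output, score, "delivered", attempt)
    return GateResult(output, score, "flagged", max_attempts)


def toy_grade(memo: str) -> float:
    """Crude stand-in grader: the fraction of rubric checks that pass."""
    checks = [
        len(memo.split()) < 800,         # brevity
        "next step" in memo.lower(),     # actionability (keyword proxy only)
    ]
    return sum(checks) / len(checks)


# Two canned drafts simulate an agent whose first attempt misses the rubric.
drafts = iter([
    "Competitors changed pricing.",
    "Competitors changed pricing. Next step: test a lower tier.",
])
result = run_with_gate(lambda: next(drafts), toy_grade, bar=1.0, max_attempts=2)
print(result.status, result.attempts)
```

In this run the first draft fails the actionability check, so it is retried, and the second draft is delivered. Raising or lowering `bar` is what "tunable" means in practice.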
Versioning and rollback
Playbooks are files. They are versioned. When something breaks, you do not debug a 2,000-word prompt from memory. You look at the diff between the current version and the last version that worked.
```diff
--- competitor-analysis (v2.3)
+++ competitor-analysis (v2.4)
@@ steps
   - id: analyze
     action: compare pricing, positioning, and product moves
+      including hiring signals and content themes
-      including social mentions
@@ rubric
+  - brevity: under 800 words, no filler
@@ execution
-  spend_cap: 15
+  spend_cap: 18
```

The diff tells the whole story: one analysis angle (social mentions) swapped for two (hiring signals and content themes), a brevity criterion added, the spend cap raised by 3 credits. If the new version performs worse, the rollback is instant. Try doing that with a prompt.
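As a rough sketch of why structured fields make diffs and rollback cheap, the Python below keeps playbook versions as flat dictionaries, reports which fields changed, and reverts to the previous version. The `History` class, `changed_fields`, and the dotted key names are assumptions for illustration. A real setup would more likely lean on an ordinary version control system.

```python
def changed_fields(old: dict, new: dict) -> dict:
    """Map each field that differs to an (old_value, new_value) pair."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


class History:
    """Hypothetical in-memory version list, oldest first."""

    def __init__(self):
        self._versions = []

    def commit(self, label: str, spec: dict) -> None:
        self._versions.append((label, dict(spec)))

    def current(self):
        return self._versions[-1]

    def rollback(self):
        # Simplification: drops the newest version rather than archiving it.
        if len(self._versions) < 2:
            raise ValueError("no earlier version to roll back to")
        self._versions.pop()
        return self._versions[-1]


v2_3 = {
    "steps.analyze.action": "compare pricing, positioning, and product moves "
                            "including social mentions",
    "execution.spend_cap": 15,
}
v2_4 = {
    "steps.analyze.action": "compare pricing, positioning, and product moves "
                            "including hiring signals and content themes",
    "rubric.brevity": "under 800 words, no filler",
    "execution.spend_cap": 18,
}

history = History()
history.commit("v2.3", v2_3)
history.commit("v2.4", v2_4)

for name, (before, after) in changed_fields(v2_3, v2_4).items():
    print(f"{name}: {before!r} -> {after!r}")

label, spec = history.rollback()
print("rolled back to", label)
```

The field-level comparison surfaces exactly the three changes in the diff above, and the rollback is a single call because the last known-good version is still on hand.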
Composition
Playbooks compose. The output of one becomes the input of the next. This lets a company build complex operations from simple, tested building blocks.
1. Runs Monday morning. Produces a strategy memo with competitive insights.
2. Runs after the analysis completes. Uses the memo to identify content gaps and plans the week.
3. Triggered for each item in the calendar. Drafts a post targeting the identified gap.
4. Triggered when a draft passes its rubric. Prepares platform-specific posts to promote it.
Four playbooks, each independently testable, each with its own rubric, each with its own governance rules. They compose into a full content pipeline that runs every week, and every output is graded before it moves to the next stage. A bad analysis never becomes a bad blog post, because it fails before the next playbook ever sees it.
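The chaining behavior can be sketched as a small Python pipeline in which each stage's output is graded before the next stage receives it. Everything here is illustrative: `Stage`, `run_pipeline`, the pass bar, and the stage names after `competitor-analysis` are hypothetical, since this article does not name the downstream playbooks. The sketch is also linear, so it leaves out the per-item fan-out described for the drafting stage.

```python
from typing import Callable, NamedTuple, Optional


class Stage(NamedTuple):
    name: str
    run: Callable[[str], str]       # takes the previous stage's output
    grade: Callable[[str], float]   # scores this stage's output


def run_pipeline(stages, seed: str, bar: float = 0.8):
    """Run stages in order, halting at the first output that fails its rubric.

    Returns (names of stages that passed, name of the stage that halted or None).
    """
    data = seed
    completed = []
    halted_at: Optional[str] = None
    for stage in stages:
        output = stage.run(data)
        if stage.grade(output) < bar:
            halted_at = stage.name
            break  # downstream stages never see the failing output
        completed.append(stage.name)
        data = output
    return completed, halted_at


def nonempty(output: str) -> float:
    """Toy grader: any non-empty output passes."""
    return 1.0 if output.strip() else 0.0


# Stage names other than competitor-analysis are placeholders.
stages = [
    Stage("competitor-analysis", lambda _: "memo: rivals cut prices", nonempty),
    Stage("content-calendar", lambda memo: f"calendar based on [{memo}]", nonempty),
    Stage("blog-draft", lambda cal: f"draft for [{cal}]", nonempty),
    Stage("social-promo", lambda draft: f"posts for [{draft}]", nonempty),
]

print(run_pipeline(stages, seed="monday run"))

# Swap in an analysis stage that produces nothing: the pipeline stops there.
broken = [Stage("competitor-analysis", lambda _: "", nonempty)] + stages[1:]
print(run_pipeline(broken, seed="monday run"))
```

In the second run the pipeline halts at the first stage and no later stage runs, which is the "bad analysis never becomes a bad blog post" property in miniature.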
Why this matters
The choice between prompts and playbooks is not a matter of taste. It is the difference between a system that works when you are watching and a system that works when you are not. Prompts are conversations. Playbooks are contracts. Conversations drift. Contracts hold.
