Here's an idea I've been advocating for the last year: write the runbook before you ship the feature.
Sounds backwards. It's transformative. Let me explain.
The usual way
Team ships feature → feature breaks at 3 AM → on-call engineer tries to debug without context → writes runbook after the post-mortem.
The problem: the runbook gets written under duress, incomplete, and often never. Worse, the first on-call engineer suffers needlessly.
The runbook-first way
Before shipping, the team writes:
- What alerts this feature will introduce
- What each alert means
- The first 3 things to check for each alert
- The most likely causes
- Escalation path
The runbook becomes a design review artifact. If you can't write the runbook, your design isn't clear.
What this exposes
Writing the runbook forces the team to answer questions most feature specs don't:
- How will this fail?
- What will on-call see when it does?
- Who owns the fix?
- What metrics should we add to make this debuggable?
I've seen feature designs get reworked just because writing the runbook exposed observability gaps. That's the signal you're doing it right.
The format
Keep the runbook short — one page. Structure:
- New alerts (what they mean)
- New dashboards (where to find them)
- Common failure modes
- First actions for on-call
- Owner and escalation path
The hard part
Engineers resist this because it feels like extra work. Frame it as 'this is part of the design, not an add-on.' Block merges until the runbook exists.
After 3 months, the team will thank you. Their 3 AM pages start having actual context. Their post-mortems don't start from zero.
The bigger insight
You cannot ship reliable software without thinking about how it fails. Runbook-first makes that thinking explicit and early. It's the cheapest reliability investment you can make.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
Top comments (1)
Runbook-driven development is a sharp idea because it inverts the usual order: instead of building first and documenting the operational steps later (if ever), you write the executable runbook as the spec and the system grows to satisfy it. That front-loads the questions people defer until an incident - how do I deploy this, roll it back, recover it, verify it's healthy - and makes them first-class instead of tribal knowledge in one person's head. It's TDD's cousin but for operations: define "how this is operated and verified" before you've built the thing that needs operating.
The reason this clicks for me is it's the same instinct I build around - the operational/verification layer isn't an afterthought, it's the spec. In Moonshift, the thing I work on (a multi-agent pipeline that takes a prompt to a deployed SaaS), the deploy/verify steps are defined and automated as part of producing the app, not bolted on - same "the runbook is the source of truth" philosophy, executed by the pipeline. Multi-model routing keeps a build ~$3 flat, first run free no card. Genuinely like this framing. Are your runbooks executable (the system runs them) or human-followed checklists? Executable is where this goes from a nice discipline to an actual safety net.