Runbook-Driven Development: A New Way to Ship

#sre #devops #runbook #process

Here's an idea I've been advocating for the last year: write the runbook before you ship the feature.

It sounds backwards, but it changes how the team designs the feature. Let me explain.

The usual way

Team ships feature → feature breaks at 3 AM → on-call engineer tries to debug without context → writes runbook after the post-mortem.

The problem: the runbook gets written under duress, comes out incomplete, and often never gets written at all. Worse, the first on-call engineer suffers needlessly.

The runbook-first way

Before shipping, the team writes:

What alerts this feature will introduce
What each alert means
The first 3 things to check for each alert
The most likely causes
Escalation path

The runbook becomes a design review artifact. If you can't write the runbook, your design isn't clear.
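To make that concrete, here is a minimal Python sketch of the per-alert checklist above as a data structure with a gap check. The class name `AlertEntry`, the `gaps` helper, and the `ExportQueueBacklog` example are all hypothetical, made up for illustration. The only things taken from the list are the fields and the "first 3 things to check" rule.

```python
from dataclasses import dataclass, field


@dataclass
class AlertEntry:
    """One alert the feature introduces, as described in the runbook."""
    name: str
    meaning: str = ""
    first_checks: list[str] = field(default_factory=list)
    likely_causes: list[str] = field(default_factory=list)
    escalation: str = ""


def gaps(entry: AlertEntry) -> list[str]:
    """Return the runbook questions this entry leaves unanswered."""
    problems = []
    if not entry.meaning.strip():
        problems.append("meaning is empty")
    if len(entry.first_checks) < 3:
        problems.append(f"only {len(entry.first_checks)} of 3 first checks")
    if not entry.likely_causes:
        problems.append("no likely causes listed")
    if not entry.escalation.strip():
        problems.append("no escalation path")
    return problems


# Hypothetical alert for an imaginary export feature.
draft = AlertEntry(
    name="ExportQueueBacklog",
    meaning="Export jobs are queuing faster than workers drain them.",
    first_checks=["Check worker pod count", "Check queue depth dashboard"],
)
print(gaps(draft))
# ['only 2 of 3 first checks', 'no likely causes listed', 'no escalation path']
```

A non-empty list is the "you can't write the runbook yet" signal: each item is a design question the team has not answered.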

What this exposes

Writing the runbook forces the team to answer questions most feature specs don't:

How will this fail?
What will on-call see when it does?
Who owns the fix?
What metrics should we add to make this debuggable?

I've seen feature designs get reworked just because writing the runbook exposed observability gaps. That's the signal you're doing it right.

The format

Keep the runbook to one page. Structure:

New alerts (what they mean)
New dashboards (where to find them)
Common failure modes
First actions for on-call
Owner and escalation path
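One way to hold a team to that structure is to generate the page from data. The sketch below is a rough illustration, not a prescribed tool: `render_runbook` is a hypothetical helper, and the 60-line limit is my assumed stand-in for "one page". The five section names come straight from the list above.

```python
REQUIRED_SECTIONS = [
    "New alerts",
    "New dashboards",
    "Common failure modes",
    "First actions for on-call",
    "Owner and escalation path",
]
MAX_LINES = 60  # assumption: a rough proxy for "one page"


def render_runbook(feature: str, sections: dict[str, list[str]]) -> str:
    """Render the five sections in order.

    Raises ValueError if a section is missing or empty, or if the
    rendered page exceeds MAX_LINES.
    """
    missing = [s for s in REQUIRED_SECTIONS if not sections.get(s)]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    lines = [f"Runbook: {feature}", ""]
    for title in REQUIRED_SECTIONS:
        lines.append(title)
        lines.extend(f"  {item}" for item in sections[title])
        lines.append("")
    if len(lines) > MAX_LINES:
        raise ValueError(f"runbook is {len(lines)} lines; limit is {MAX_LINES}")
    return "\n".join(lines)
```

Calling it with a dict that lacks, say, "New dashboards" fails loudly, which is the point: an empty section is a gap in the design, not a formatting problem.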

The hard part

Engineers resist this because it feels like extra work. Frame it as 'this is part of the design, not an add-on.' Block merges until the runbook exists.
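Blocking merges can be automated with a small check in CI. The following is a minimal sketch under an assumed repository layout (code in `features/<name>/`, runbooks in `runbooks/<name>.md`). Both paths and the function name `features_missing_runbook` are hypothetical, so adapt them to however your repo is organized.

```python
from pathlib import PurePosixPath


def features_missing_runbook(changed: list[str], tracked: set[str]) -> list[str]:
    """Return feature names touched by a change that have no runbook file."""
    touched = set()
    for path in changed:
        parts = PurePosixPath(path).parts
        # Assumed layout: features/<name>/... holds a feature's code.
        if len(parts) >= 3 and parts[0] == "features":
            touched.add(parts[1])
    # Assumed layout: runbooks/<name>.md holds that feature's runbook.
    return sorted(n for n in touched if f"runbooks/{n}.md" not in tracked)


changed = ["features/export/worker.py", "features/billing/api.py", "README.md"]
tracked = {"runbooks/billing.md", "features/billing/api.py"}
missing = features_missing_runbook(changed, tracked)
print(missing)  # ['export']
```

In a pipeline, `changed` could come from `git diff --name-only` against the target branch and `tracked` from `git ls-files`, with the job failing whenever the returned list is non-empty.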

After 3 months, the team will thank you: their 3 AM pages come with actual context, and their post-mortems no longer start from zero.

The bigger insight

You cannot ship reliable software without thinking about how it fails. Runbook-first makes that thinking explicit and early. It's one of the cheapest reliability investments you can make.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

Top comments (1)

Harjot Singh • May 31

Runbook-driven development is a sharp idea because it inverts the usual order: instead of building first and documenting the operational steps later (if ever), you write the executable runbook as the spec and the system grows to satisfy it. That front-loads the questions people defer until an incident - how do I deploy this, roll it back, recover it, verify it's healthy - and makes them first-class instead of tribal knowledge in one person's head. It's TDD's cousin but for operations: define "how this is operated and verified" before you've built the thing that needs operating.

The reason this clicks for me is it's the same instinct I build around - the operational/verification layer isn't an afterthought, it's the spec. In Moonshift, the thing I work on (a multi-agent pipeline that takes a prompt to a deployed SaaS), the deploy/verify steps are defined and automated as part of producing the app, not bolted on - same "the runbook is the source of truth" philosophy, executed by the pipeline. Multi-model routing keeps a build ~$3 flat, first run free no card. Genuinely like this framing. Are your runbooks executable (the system runs them) or human-followed checklists? Executable is where this goes from a nice discipline to an actual safety net.