DEV Community

Samson Tanimawo

Building the first Agentic SRE Platform. 100 AI agents that detect, investigate, and resolve incidents autonomously.

Location: Houston
Personal website: https://novaaiops.com

Pronouns: He/Him/His

Error Budget Policies That Hold Leadership Accountable

2 min read
Dependency Injection for Observability

2 min read
Load Balancer Tuning: Lessons from Production

2 min read
Capacity Planning for Startups

2 min read
How We Handled Our First Major Outage (And Survived)

2 min read
The Economics of Reliability: When to Invest, When to Accept Risk

2 min read
Why Your Status Page Should Be Boring

2 min read
Building Trust with Product Teams as an SRE

2 min read
Incident Command: The Skills They Don't Teach You

2 min read
How AI Is Changing SRE Workflows (Without Replacing SREs)

2 min read
Security Monitoring for SRE Teams

2 min read
Instrumenting Legacy Code Without Rewriting It

2 min read
The Case for a Dedicated Reliability Engineer

2 min read
Runbook-Driven Development: A New Way to Ship

2 min read
Zero-Downtime Database Migrations

2 min read
API Rate Limiting: Patterns That Scale

2 min read
Kubernetes Upgrades Without Downtime

2 min read
The Dashboard Audit: Finding and Killing Dead Metrics

2 min read
Cost Attribution in Shared Infrastructure

2 min read
How We Killed Our Worst Alert (And What We Learned)

2 min read
The Reliability Roadmap: A 90-Day Plan for New SRE Teams

2 min read
Scaling On-Call When You Only Have 5 Engineers

2 min read
TLS Certificate Management Without Tears

2 min read
DNS: The SRE's Most Underrated Skill

2 min read
The Silent Outage: Monitoring What You Can't See

2 min read
Why Every SRE Should Learn a Little Rust

2 min read
How We Built Our Own Incident Management System

2 min read
The Role of Platform Engineering in a Startup

2 min read
Building Dashboards People Actually Use

2 min read
SRE Maturity Models: Where Is Your Team?

2 min read
The Art of Writing a Good Post-Mortem

1 min read
Why We Stopped Using Log Aggregation for Everything

1 min read
Running Postgres at Scale: Lessons Learned

2 min read
How We Reduced Our Deployment Failure Rate to Under 2%

1 min read
The Hidden Cost of Flaky Tests

1 min read
Observability for Serverless: What's Different

2 min read
From DevOps to SRE: Making the Transition

2 min read
The SRE Interview: Questions I Actually Ask

1 min read
Incident Retrospectives Without Blame

1 min read
Alert Fatigue: The Silent Productivity Killer

1 min read
Why SLIs Matter More Than SLOs

1 min read
The PagerDuty Migration Playbook

1 min read
How We Cut Datadog Bills by 60% Without Losing Observability

1 min read
Building Your First Runbook: A Template That Actually Works

1 min read
AIOps vs Traditional Monitoring: What Actually Changed

1 min read
Eventual Consistency: Debugging the Hardest Class of Bugs

4 min read
The Economics of Self-Hosting vs. Managed Monitoring

4 min read
Building an Incident Response Playbook Library

4 min read
Kubernetes Network Policies: Lessons from Production Incidents

4 min read
Reducing Toil: The Google SRE Book Applied to Startups

4 min read
Incident Severity Levels: SEV-1 to SEV-5 Calibration

4 min read
Memory Leak Detection in Long-Running Services

3 min read
CI/CD Reliability: When Your Deploy Pipeline is Your SPOF

3 min read
Multi-Region Failover: Lessons from Running It Hot

3 min read
Disaster Recovery Drills That Actually Work

3 min read
Feature Flags as a Reliability Tool, Not Just an A/B Platform

3 min read
eBPF for SREs: Observability Without Agents

3 min read
Observability as Code: Managing Dashboards and Alerts with Terraform

2 min read