DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

Building a Chaos Testing Harness for Multi-Region Video API Endpoints

10 min read
Error budgets when downtime costs money: reliability engineering for payment-critical systems

10 min read
Distributed Tracing 101: The Mental Model, the Standards, and Your First Pipeline

5 min read
Safe Operating Throughput (SOT) as a First-Class SRE Metric: Derivation and Operationalization

17 min read
AI SRE: What an Autonomous Agent Doing On-Call Actually Looks Like

6 min read
Monitoring and Logging: How They Work Together and When You Need Both

8 min read
MCP Server Monitoring: How to Keep AI Agent Infrastructure Reliable

6 min read
Deploying Production Systems on Raspberry Pi: Lessons from the Field

7 min read
maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures

5 min read
Model Selection for Weibull Series Systems: When Simpler Models Suffice

3 min read
The Economics of Reliability: When to Invest, When to Accept Risk

2 min read
Your Scraper Died at Row 12,000. The Rerun Pattern.

13 min read
How we route around a 20-minute Anthropic outage

5 min read
AWS Daily Digest — June 04, 2026

3 min read
The Eighth Server: How One Missed Deploy Ended Knight Capital, 2012

9 min read