Platform teams often find themselves trapped: they build infrastructure but cannot answer "how reliable is it?" We break down how to define SLIs for internal platforms and negotiate SLOs with product teams.
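The core of any SLI/SLO negotiation is error-budget arithmetic. A minimal sketch of that math, assuming a ratio-style availability SLI (the function names and sample numbers below are illustrative, not from the article):

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Ratio-style SLI: fraction of requests that succeeded."""
    return good_events / total_events

def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent for the window.

    Budget = 1 - slo_target; spent = 1 - sli. Floors at zero once
    the budget is exhausted.
    """
    budget = 1.0 - slo_target
    spent = 1.0 - sli
    return max(0.0, 1.0 - spent / budget)

# Example: a 99.9% SLO with 999,500 good out of 1,000,000 requests
# leaves roughly half the monthly error budget unspent.
sli = availability_sli(999_500, 1_000_000)      # 0.9995
remaining = error_budget_remaining(sli, 0.999)  # ≈ 0.5
```

The same shape works for latency SLIs if "good" means "under the threshold"; the budget math does not change.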
We ran both tools on clusters ranging from 50 to 400 nodes. Here is where each one excels, where it struggles, and why the choice is about team culture, not technology.
Blameless postmortems are not just a document template. They require systematic changes to processes, metrics, and even the language teams use to describe failures. Here is a framework that works.
eBPF is no longer exotic — tools like Cilium, Pixie, and Tetragon make it accessible. We show how to get traces without code instrumentation and profile CPU with near-zero overhead in production.
Toil is manual, repetitive, automatable work that scales linearly with service growth. We introduced a toil budget for every team and automated everything that exceeded 30%. Here is what happened.
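A toil budget reduces to simple bookkeeping: track toil hours against total hours and flag teams over the threshold. A hypothetical sketch (the 40-hour baseline and function names are assumptions, not the article's tooling):

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of engineering time spent on toil in a tracking window."""
    return toil_hours / total_hours

def exceeds_toil_budget(toil_hours: float, total_hours: float,
                        budget: float = 0.30) -> bool:
    """True when toil share crosses the budget, triggering automation work."""
    return toil_fraction(toil_hours, total_hours) > budget

# A team logging 14 toil hours in a 40-hour week is over a 30% budget.
exceeds_toil_budget(14, 40)  # 0.35 > 0.30, so True
```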
The OTel Collector is the backbone of modern observability stacks. We share our battle-tested pipeline configuration for processing 2M spans/sec with filtering, sampling, and multi-backend export.
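A heavily simplified sketch of such a Collector pipeline — filtering, probabilistic sampling, and dual-backend export. Endpoints, the health-check route, and all thresholds are hypothetical placeholders; a real 2M spans/sec deployment would need tuning well beyond this:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
  filter/drop-health:           # drop health-check noise before sampling
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
  probabilistic_sampler:
    sampling_percentage: 10
  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlp/primary:
    endpoint: tempo.internal:4317       # assumed primary backend
  otlp/secondary:
    endpoint: backup.internal:4317      # assumed secondary backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop-health, probabilistic_sampler, batch]
      exporters: [otlp/primary, otlp/secondary]
```

Processor order matters: the memory limiter runs first so backpressure kicks in before any expensive work, and batching runs last so both exporters see the same batches.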
Resource requests and limits are only the beginning. Real capacity planning involves understanding burst patterns, node failure domains, and scheduling headroom. A practical guide to right-sizing clusters.
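One concrete piece of that planning — sizing a node pool so the workload survives node loss at a target utilization — can be sketched as a small calculation. The function name, the 70% utilization target, and the N-1 default are assumptions for illustration:

```python
import math

def nodes_needed(total_cpu_request: float, node_cpu: float,
                 failure_tolerance: int = 1,
                 utilization_target: float = 0.7) -> int:
    """Nodes required so pods still fit after losing `failure_tolerance`
    nodes, while keeping scheduling headroom via a utilization target."""
    usable_per_node = node_cpu * utilization_target
    base = math.ceil(total_cpu_request / usable_per_node)
    return base + failure_tolerance

# 100 cores of requests on 16-core nodes at 70% target:
# ceil(100 / 11.2) = 9 nodes, plus 1 spare for node failure = 10.
nodes_needed(100, 16)  # 10
```

The same headroom factor also absorbs burst patterns: requests sized to steady-state usage plus a utilization target below 100% is what leaves room for limits above requests to mean anything.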
On-call rotations burn people out when done wrong. We document our approach to on-call that keeps engineers sane: escalation policies, runbooks, alert fatigue reduction, and compensation models.
Running PostgreSQL at scale requires more than replication. We cover our HA stack with Patroni, PgBouncer connection pooling, WAL archiving, and the failover scenarios that caught us off guard.
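For a flavor of the pooling layer, here is a minimal PgBouncer sketch in transaction-pooling mode. The hostname, database name, and pool sizes are hypothetical; this is not the article's production config:

```ini
[databases]
; Route clients at the current Patroni leader (assumed hostname).
appdb = host=pg-leader.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction        ; highest reuse; breaks session state
default_pool_size = 20         ; server connections per user/db pair
max_client_conn = 2000         ; clients far exceed server connections
server_reset_query = DISCARD ALL
```

The usual caveat applies: transaction pooling is incompatible with session-level features such as named prepared statements and session advisory locks, which is exactly the kind of failover-adjacent surprise worth testing before it catches you off guard.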
You do not need a Netflix-scale platform to benefit from chaos engineering. We walk through lightweight fault injection using LitmusChaos and simple scripts that exposed real production blind spots.
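As a taste of what "lightweight" means here, a LitmusChaos pod-delete experiment fits in one ChaosEngine manifest. The target labels, namespace, and service account below are hypothetical placeholders, and the experiment CRDs must already be installed:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete      # hypothetical experiment name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=checkout       # assumed target workload label
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"        # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"        # seconds between pod kills
            - name: FORCE
              value: "false"     # graceful deletion
```

The experiment only proves something if you watch the right signals while it runs: did alerts fire, did traffic shift, did the deployment reconverge within its SLO.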