BRO SRE

Reliability practices, infrastructure, automation


The On-Call Handbook: Sustainable Practices for Small Teams

2025-12-08 · On-Call, SRE, Team Health

On-call is a necessary part of running production systems. Done poorly, it burns people out, drives attrition, and paradoxically makes systems less reliable because exhausted engineers make worse decisions. Done well, it distributes knowledge, builds confidence, and catches problems before customers notice. Here is how we make it work with a team of six.

Rotation Design

With six engineers, we run a weekly rotation with one primary and one secondary. Each engineer is primary every sixth week and secondary every sixth week, with the secondary schedule offset by three weeks. In practice that means you carry a pager every third week, get two consecutive pager-free weeks between any two shifts, and have five weeks between primary shifts.
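As a sanity check, the rotation can be sketched in a few lines. The engineer names are placeholders, not our real roster:

```python
ENGINEERS = ["alice", "bob", "carol", "dave", "erin", "frank"]  # placeholder names
OFFSET = 3  # secondary schedule trails the primary schedule by three weeks

def rotation(week: int) -> tuple[str, str]:
    """Return the (primary, secondary) engineers for a given week number."""
    primary = ENGINEERS[week % len(ENGINEERS)]
    secondary = ENGINEERS[(week + OFFSET) % len(ENGINEERS)]
    return primary, secondary

# alice is primary in weeks 0, 6, ... and secondary in weeks 3, 9, ...
alice_weeks = [w for w in range(12) if "alice" in rotation(w)]
print(alice_weeks)  # [0, 3, 6, 9]: on-call every third week
```

A three-week offset also guarantees primary and secondary are never the same person, since 3 is not 0 modulo 6.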

Rules we enforce:

Escalation Policy

Our escalation policy has three tiers with strict timeouts:

  1. Primary on-call: Paged immediately. Must acknowledge within 5 minutes. If no ack, escalate to secondary.
  2. Secondary on-call: Paged after 5 minutes without ack, or when primary requests help. Must acknowledge within 10 minutes.
  3. Engineering manager: Paged after 15 minutes without any ack, or for any Sev1 incident. Manager coordinates response and decides whether to pull in additional engineers.

# PagerDuty escalation policy (Terraform).
# escalation_delay_in_minutes on a rule is how long PagerDuty waits
# before escalating past that rule to the next one.
resource "pagerduty_escalation_policy" "production" {
  name      = "Production Escalation"
  num_loops = 2

  # Tier 1: primary on-call. Escalate after 5 minutes without ack.
  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  # Tier 2: secondary on-call. Escalate after 10 more minutes (15 total).
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.secondary.id
    }
  }

  # Tier 3: engineering manager.
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.eng_manager.id
    }
  }
}

Alert Quality: The Only Rule That Matters

Every alert must be actionable. If an engineer is paged, they must be able to do something about it. Alerts that require no action — informational alerts, transient blips, known issues waiting for a fix — must be removed or downgraded to notifications.

We audit alert quality monthly. Any alert that fired more than twice without resulting in human action gets reviewed. Common outcomes:
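The monthly audit is easy to mechanize. A minimal sketch of the "fired more than twice with no human action" check, assuming incident records are simple (alert name, did a human act) pairs rather than any particular paging tool's export format:

```python
from collections import defaultdict

# Hypothetical incident records: (alert_name, human_acted) pairs.
# In practice these would come from your paging tool's incident export.
incidents = [
    ("HighErrorRate", True),
    ("DiskAlmostFull", False),
    ("DiskAlmostFull", False),
    ("DiskAlmostFull", False),
    ("CertExpiringSoon", True),
]

def alerts_to_review(records, threshold=2):
    """Flag alerts that fired more than `threshold` times without human action."""
    no_action = defaultdict(int)
    for name, acted in records:
        if not acted:
            no_action[name] += 1
    return sorted(name for name, count in no_action.items() if count > threshold)

print(alerts_to_review(incidents))  # ['DiskAlmostFull']
```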

Runbook Structure

Every alert links to a runbook. Every runbook follows the same structure:

  1. What is this alert? One sentence explaining what triggered and what it means for customers.
  2. Immediate actions. Step-by-step commands to diagnose and mitigate. Copy-pasteable. No judgment required at 3 AM.
  3. Escalation criteria. When to wake up someone else. Specific conditions, not vibes.
  4. Background. Architecture context, related services, historical incidents. For when you have time to understand, not just react.

On-Call Handoff

Every Monday at 10 AM, the outgoing and incoming on-call engineers meet for 15 minutes. The agenda is fixed:

This meeting is recorded as a brief note in our on-call channel. It creates a written record and helps engineers who missed the meeting catch up.

Burnout Prevention

On-call burnout is real and cumulative. We watch for it with two metrics:

Compensation matters too. Our on-call compensation model: flat weekly stipend for being on-call, plus per-incident bonus for off-hours pages. The stipend compensates for the constraint on your life (you carry a laptop, you stay sober, you stay in cell range). The per-incident bonus compensates for actual disruption. Both components are necessary.
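The model reduces to a one-line formula. A worked example with made-up dollar amounts, not our actual rates:

```python
def weekly_oncall_pay(stipend: float, off_hours_pages: int, per_page_bonus: float) -> float:
    """Flat stipend for carrying the pager, plus a bonus per off-hours page."""
    return stipend + off_hours_pages * per_page_bonus

# Hypothetical rates: $500/week stipend, $75 per off-hours page.
print(weekly_oncall_pay(500, 3, 75))  # 725: a week with three off-hours pages
```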

Measuring On-Call Health

We review on-call health quarterly using four metrics: pages per shift (trending down), MTTA (stable under 3 minutes), alert noise ratio (percentage of non-actionable alerts, target under 10%), and on-call satisfaction survey (anonymous, quarterly). The survey asks one question: "On a scale of 1-5, how sustainable is the current on-call rotation?" Anything below 3.5 triggers a retrospective focused on on-call improvements.