BRO SRE

Reliability practices, infrastructure, automation


The On-Call Handbook: Sustainable Practices for Small Teams

2025-12-08 · On-Call, SRE, Team Health

On-call is a necessary part of running production systems. Done poorly, it burns people out, drives attrition, and paradoxically makes systems less reliable because exhausted engineers make worse decisions. Done well, it distributes knowledge, builds confidence, and catches problems before customers notice. Here is how we make it work with a team of six.

Rotation Design

With six engineers, we run a weekly rotation with one primary and one secondary. Each engineer is primary every sixth week and secondary every sixth week, with the secondary schedule offset by three weeks. In practice that means you carry a pager every third week, get two consecutive pager-free weeks between any two shifts, and have five weeks between primary shifts.
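As a sanity check, the rotation can be sketched in a few lines. The engineer names are placeholders, not our real roster:

```python
ENGINEERS = ["alice", "bob", "carol", "dave", "erin", "frank"]  # placeholder names
OFFSET = 3  # secondary schedule trails the primary schedule by three weeks

def rotation(week: int) -> tuple[str, str]:
    """Return the (primary, secondary) engineers for a given week number."""
    primary = ENGINEERS[week % len(ENGINEERS)]
    secondary = ENGINEERS[(week + OFFSET) % len(ENGINEERS)]
    return primary, secondary

# alice is primary in weeks 0, 6, ... and secondary in weeks 3, 9, ...
alice_weeks = [w for w in range(12) if "alice" in rotation(w)]
print(alice_weeks)  # [0, 3, 6, 9]: on-call every third week
```

A three-week offset also guarantees primary and secondary are never the same person, since 3 is not 0 modulo 6.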

Rules we enforce:

Escalation Policy

Our escalation policy has three tiers with strict timeouts:

  1. Primary on-call: Paged immediately. Must acknowledge within 5 minutes. If no ack, escalate to secondary.
  2. Secondary on-call: Paged after 5 minutes without ack, or when primary requests help. Must acknowledge within 10 minutes.
  3. Engineering manager: Paged after 15 minutes without any ack, or for any Sev1 incident. Manager coordinates response and decides whether to pull in additional engineers.

# PagerDuty escalation policy (Terraform).
# escalation_delay_in_minutes on a rule is how long PagerDuty waits
# before escalating past that rule to the next one.
resource "pagerduty_escalation_policy" "production" {
  name      = "Production Escalation"
  num_loops = 2

  # Tier 1: primary on-call. Escalate after 5 minutes without ack.
  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  # Tier 2: secondary on-call. Escalate after 10 more minutes (15 total).
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.secondary.id
    }
  }

  # Tier 3: engineering manager.
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.eng_manager.id
    }
  }
}

Alert Quality: The Only Rule That Matters

Every alert must be actionable. If an engineer is paged, they must be able to do something about it. Alerts that require no action — informational alerts, transient blips, known issues waiting for a fix — must be removed or downgraded to notifications.

We audit alert quality monthly. Any alert that fired more than twice without resulting in human action gets reviewed. Common outcomes:
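The monthly audit is easy to mechanize. A minimal sketch of the "fired more than twice with no human action" check, assuming incident records are simple (alert name, did a human act) pairs rather than any particular paging tool's export format:

```python
from collections import defaultdict

# Hypothetical incident records: (alert_name, human_acted) pairs.
# In practice these would come from your paging tool's incident export.
incidents = [
    ("HighErrorRate", True),
    ("DiskAlmostFull", False),
    ("DiskAlmostFull", False),
    ("DiskAlmostFull", False),
    ("CertExpiringSoon", True),
]

def alerts_to_review(records, threshold=2):
    """Flag alerts that fired more than `threshold` times without human action."""
    no_action = defaultdict(int)
    for name, acted in records:
        if not acted:
            no_action[name] += 1
    return sorted(name for name, count in no_action.items() if count > threshold)

print(alerts_to_review(incidents))  # ['DiskAlmostFull']
```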

Runbook Structure

Every alert links to a runbook. Every runbook follows the same structure:

  1. What is this alert? One sentence explaining what triggered and what it means for customers.
  2. Immediate actions. Step-by-step commands to diagnose and mitigate. Copy-pasteable. No judgment required at 3 AM.
  3. Escalation criteria. When to wake up someone else. Specific conditions, not vibes.
  4. Background. Architecture context, related services, historical incidents. For when you have time to understand, not just react.

On-Call Handoff

Every Monday at 10 AM, the outgoing and incoming on-call engineers meet for 15 minutes. The agenda is fixed:

This meeting is recorded as a brief note in our on-call channel. It creates a written record and helps engineers who missed the meeting catch up.

Burnout Prevention

On-call burnout is real and cumulative. We watch for it with two metrics:

Compensation matters too. Our on-call compensation model: flat weekly stipend for being on-call, plus per-incident bonus for off-hours pages. The stipend compensates for the constraint on your life (you carry a laptop, you stay sober, you stay in cell range). The per-incident bonus compensates for actual disruption. Both components are necessary.
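The model reduces to a one-line formula. A worked example with made-up dollar amounts, not our actual rates:

```python
def weekly_oncall_pay(stipend: float, off_hours_pages: int, per_page_bonus: float) -> float:
    """Flat stipend for carrying the pager, plus a bonus per off-hours page."""
    return stipend + off_hours_pages * per_page_bonus

# Hypothetical rates: $500/week stipend, $75 per off-hours page.
print(weekly_oncall_pay(500, 3, 75))  # 725: a week with three off-hours pages
```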

Measuring On-Call Health

We review on-call health quarterly using four metrics: pages per shift (trending down), MTTA (stable under 3 minutes), alert noise ratio (percentage of non-actionable alerts, target under 10%), and on-call satisfaction survey (anonymous, quarterly). The survey asks one question: "On a scale of 1-5, how sustainable is the current on-call rotation?" Anything below 3.5 triggers a retrospective focused on on-call improvements.