Toil Budgets: How We Cut Operational Toil by 60% in One Quarter
Google's SRE book defines toil as work that is manual, repetitive, automatable, tactical, and devoid of enduring value. That definition is correct but abstract. In practice, toil is the thing that makes your best engineers update their LinkedIn profiles. We decided to measure it, budget it, and systematically eliminate it.
Measuring Toil
You cannot reduce what you do not measure. We started by asking each team to categorize their work for two weeks using a simple time-tracking spreadsheet with four categories:
- Project work — Building new capabilities, improving existing systems
- Toil — Repetitive operational tasks that could be automated
- Overhead — Meetings, planning, reviews
- Incident response — Firefighting, debugging production issues
The results were sobering. Across six teams, toil consumed an average of 45% of engineering time. The worst team was at 62%. The most common toil categories were:
- Certificate rotation and renewal (manual cert-manager was not yet deployed)
- Access permission requests (manual kubectl role bindings)
- Capacity scaling (manually adding nodes before known traffic events)
- Log investigation for support tickets (grepping through CloudWatch)
- Database schema migrations (coordinating with DBA team)
Setting Budgets
Google recommends keeping toil below 50% of SRE time. We set a more aggressive target: 30% maximum toil per team per quarter. Any team exceeding 30% gets priority access to platform engineering resources for automation projects.
The budget is not punitive — it is a signal. When a team exceeds its toil budget, it means the system is generating more operational burden than the team can sustainably handle. The response is to invest in automation, not to demand the team work harder.
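The budget check itself is easy to script. A minimal sketch (team names and hours are hypothetical) that computes each team's toil share from the four tracked categories and flags anyone over the 30% budget:

```python
# Toil-budget check: compute toil's share of tracked time per team
# and flag teams over the 30% budget. All data here is illustrative.
TOIL_BUDGET = 0.30

def toil_report(teams):
    """teams maps team name -> hours tracked per category."""
    report = {}
    for name, hours in teams.items():
        total = sum(hours.values())
        toil_pct = hours.get("toil", 0) / total
        report[name] = {
            "toil_pct": round(toil_pct, 2),
            "over_budget": toil_pct > TOIL_BUDGET,
        }
    return report

teams = {
    "payments": {"project": 30, "toil": 50, "overhead": 10, "incident": 10},
    "search":   {"project": 60, "toil": 20, "overhead": 15, "incident": 5},
}
print(toil_report(teams))
# payments is at 50% toil (over budget); search is at 20% (within budget)
```

Teams that cross the threshold are the ones routed to platform engineering for automation help.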
Prioritizing Automation ROI
Not all toil is equally worth automating. We rank automation projects by a simple formula:
ROI = (time_per_occurrence * frequency_per_month * 12) / estimated_automation_effort
Any project with ROI greater than 3 (meaning it pays for itself within 4 months) gets approved automatically. Projects between 1 and 3 are evaluated case by case. Below 1, we accept the toil unless it has reliability implications.
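The formula and thresholds above translate directly into a small prioritization helper. A sketch, assuming hours as the unit for both occurrence time and automation effort:

```python
def automation_roi(hours_per_occurrence, occurrences_per_month, automation_effort_hours):
    """Annual hours saved divided by the one-time automation effort."""
    annual_savings = hours_per_occurrence * occurrences_per_month * 12
    return annual_savings / automation_effort_hours

def decision(roi):
    if roi > 3:
        return "auto-approve"    # pays for itself within 4 months
    if roi >= 1:
        return "case-by-case"
    return "accept toil"         # unless it has reliability implications

# Certificate rotation from this post: 8 hours/month saved,
# roughly 24 hours (3 days) to automate.
roi = automation_roi(8, 1, 24)
print(roi, decision(roi))  # 4.0 auto-approve
```

Note that ROI of 3 corresponds to a 12/3 = 4-month payback, which is where the auto-approve threshold comes from.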
What We Automated
Certificate rotation: Deployed cert-manager with Let's Encrypt ClusterIssuers. Eliminated 8 hours/month of manual certificate work across all teams. Implementation time: 3 days.
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
```
Permission requests: Built a Slack bot that accepts access requests, validates them against policy, creates time-bound RBAC bindings, and auto-revokes after expiry. Eliminated 12 hours/month of back-and-forth. Implementation time: 2 weeks.
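The heart of that bot is the expiry bookkeeping. A minimal in-memory sketch of the grant/auto-revoke logic (the real bot creates and deletes Kubernetes RoleBindings; the class and names here are hypothetical):

```python
from datetime import datetime, timedelta

class AccessGrants:
    """Track time-bound access grants; revoke_expired runs on a schedule."""

    def __init__(self):
        self._grants = {}  # (user, role) -> expiry timestamp

    def grant(self, user, role, hours=4):
        # In production: create a RoleBinding via the Kubernetes API here.
        self._grants[(user, role)] = datetime.utcnow() + timedelta(hours=hours)

    def revoke_expired(self, now=None):
        now = now or datetime.utcnow()
        expired = [k for k, exp in self._grants.items() if exp <= now]
        for key in expired:
            # In production: delete the corresponding RoleBinding here.
            del self._grants[key]
        return expired

grants = AccessGrants()
grants.grant("alice", "pod-reader", hours=2)
# A scheduled sweep some hours later revokes the binding:
later = datetime.utcnow() + timedelta(hours=3)
print(grants.revoke_expired(now=later))  # [('alice', 'pod-reader')]
```

Keeping revocation in a periodic sweep rather than a per-grant timer makes the bot stateless-restart safe: the grant table is the only state that matters.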
Capacity scaling: Implemented Karpenter with provisioner profiles for different workload types. Predictive scaling based on historical patterns handles known events (Black Friday, product launches). Eliminated 6 hours/month of manual node management. Implementation time: 1 week.
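The predictive piece does not need to be sophisticated. A sketch of the pre-scaling heuristic described above, assuming per-event historical peak request rates (all numbers are illustrative, not from our production data):

```python
import math

def nodes_needed(historical_peak_rps, rps_per_node, headroom=1.2):
    """Size the node group from the worst historical peak plus headroom."""
    peak = max(historical_peak_rps)
    return math.ceil(peak * headroom / rps_per_node)

# Peaks from the last three comparable events, assuming 500 RPS per node:
print(nodes_needed([4200, 3900, 4800], rps_per_node=500))  # 12
```

The point is not forecasting accuracy; it is that a known event triggers pre-provisioning automatically instead of someone adding nodes by hand the night before.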
Log investigation: Created a self-service Grafana dashboard with pre-built queries for common support ticket patterns. Added correlation IDs to all services. Support team now resolves 70% of log-related tickets without engineering involvement. Implementation time: 2 weeks.
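Correlation IDs are what make those pre-built queries work: every log line for a single request shares an ID that support can search on. A minimal sketch of the propagation pattern using Python's standard logging and contextvars (field names are our assumption, not any specific library's API):

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's correlation ID for this execution context.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Inject the current correlation ID into every log record."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(format="%(correlation_id)s %(levelname)s %(message)s")
logger = logging.getLogger("api")
logger.addFilter(CorrelationFilter())

def handle_request():
    # Set once at the edge (or propagate an incoming X-Request-ID header).
    correlation_id.set(uuid.uuid4().hex)
    logger.warning("payment declined")  # this line carries the ID

handle_request()
```

Once every service stamps the same ID, a support engineer can paste it into the Grafana dashboard and see the whole request path without asking anyone.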
Cultural Resistance
The hardest part was not the technology. Two patterns of resistance emerged:
"It is faster to just do it manually." This is true for any individual occurrence. It is false when you multiply by frequency and team size. We made this visible by showing the cumulative time cost on a dashboard. When engineers saw "Certificate rotation: 96 hours/year" next to "Automation cost: 24 hours," the argument resolved itself.
"Automation will eliminate my job." This fear is real and must be addressed directly. We reframed automation not as replacing people but as freeing them to work on more interesting problems. The team that automated the most toil also shipped the most features that quarter — because they had the time.
Results
After one quarter of focused toil reduction:
- Average toil dropped from 45% to 18% across all teams
- Feature velocity (measured by story points delivered) increased by 35%
- On-call page volume decreased by 28% (many pages were toil-related)
- Engineer satisfaction scores improved from 3.2 to 4.1 (out of 5)
The toil budget is now a permanent part of our quarterly planning. Each team reports their toil percentage in sprint retrospectives, and automation projects compete for resources alongside feature work through the same prioritization framework.