SLIs and SLOs for Platform Teams: Stop Guessing, Start Measuring
Platform teams occupy a peculiar position in engineering organizations. They do not ship features to end users. Their customers are other engineers, and the "product" is infrastructure: CI/CD pipelines, Kubernetes clusters, internal APIs, and developer tooling. Despite this, most platform teams either borrow product SLOs verbatim or operate with no measurable reliability targets at all. Both approaches lead to the same outcome: misaligned priorities, frustrated internal consumers, and reactive firefighting instead of strategic investment.
This article outlines a practical framework for defining Service Level Indicators and Objectives specifically tailored to platform engineering, drawing from two years of implementing this at an organization running 1,200 microservices across six product teams.
Why Platform SLIs Are Different
Product SLIs typically measure user-facing behavior: HTTP request latency, error rates, throughput. Platform SLIs measure developer-facing behavior. The distinction matters because the failure modes, measurement points, and consumer expectations are fundamentally different.
When a product API returns a 500 error, a user sees a broken page. When a deploy pipeline fails, a developer loses thirty minutes diagnosing why their merge did not reach staging. When cluster autoscaling lags, a team's load test produces meaningless results. These failures compound differently and demand different indicators.
A useful mental model: platform SLIs should answer the question, "Can our engineers ship and operate their software without being blocked by infrastructure?"
Concrete Platform SLIs Worth Tracking
Deploy Pipeline Latency (p99)
Measure the wall-clock time from merge-to-main to deployment completion in staging. This is arguably the single most important platform SLI because it directly governs developer iteration speed. We define it as:
SLI: deploy_pipeline_duration_seconds (p99)
Measurement point: CI webhook received -> last deployment probe healthy
Excludes: manual approval gates, optional integration test suites
A common mistake is measuring only the CI build step. The SLI must capture the full path, including image push, manifest update, ArgoCD sync, and readiness probe success. Our target is p99 under 12 minutes. Anything above 15 minutes triggers an error budget burn investigation.
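The ratio form of this SLI can be expressed directly in PromQL. A minimal sketch, assuming the pipeline emits a `deploy_duration_seconds` Prometheus histogram with a bucket boundary at 720 seconds:

```promql
# Fraction of deploys completing within 12 minutes (720s) over a 28-day window
sum(rate(deploy_duration_seconds_bucket{le="720"}[28d]))
/
sum(rate(deploy_duration_seconds_count[28d]))

# Approximate p99 deploy duration over the last hour, useful for dashboards
histogram_quantile(0.99, sum by (le) (rate(deploy_duration_seconds_bucket[1h])))
```

Note that `histogram_quantile` interpolates within buckets, so the p99 estimate is only as precise as the bucket layout; the ratio query against a fixed 720s boundary is exact and better suited to SLO accounting.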
Cluster Provisioning Success Rate
For teams running multi-tenant or ephemeral clusters (preview environments, load-test clusters), the provisioning success rate matters enormously. We track:
SLI: cluster_provision_success_ratio
Good events: cluster reaches Ready state within 10 minutes
Total events: all cluster provisioning requests
Target SLO: 99.5% over 30-day rolling window
The "within 10 minutes" qualifier is critical. A cluster that takes 45 minutes to provision is functionally a failure even if it eventually succeeds.
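One way to compute this ratio, assuming the provisioning controller increments a pair of counters (the metric names here are illustrative, not a standard):

```promql
# "Good" = cluster reached Ready within 10 minutes of the request.
# The deadline is enforced at increment time: the controller only counts
# a provision as good if Ready arrived inside the 10-minute window.
sum(increase(cluster_provision_good_total[30d]))
/
sum(increase(cluster_provision_requests_total[30d]))
```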
API Gateway Availability
If your platform exposes an internal API gateway (for service mesh routing, authentication, rate limiting), its availability is a platform SLI:
SLI: gateway_request_success_ratio
Good events: all responses except 5xx generated by the gateway itself; 5xx passed through from upstream services count as good
Total events: all routed requests
Target SLO: 99.95% over 30-day rolling window
The exclusion of upstream 5xx errors is essential. The gateway SLO measures the gateway itself, not the services behind it. Conflating the two makes the SLO unactionable.
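In PromQL, the exclusion might look like the following, assuming the gateway labels each request with where a 5xx originated (both the metric name and the `source` label are assumptions about instrumentation):

```promql
# Success ratio counting only gateway-originated 5xx as errors;
# 5xx proxied through from upstream services do not burn gateway budget
1 - (
    sum(rate(gateway_requests_total{code=~"5..", source="gateway"}[30d]))
  /
    sum(rate(gateway_requests_total[30d]))
)
```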
Secret Delivery Latency
Often overlooked, but critical in environments using Vault or external secret operators:
SLI: secret_sync_latency_seconds (p95)
Measurement: time from secret update in Vault to pod receiving rotated secret
Target: p95 under 60 seconds
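Given a `secret_sync_latency_seconds` histogram, observed from the Vault update timestamp to the pod picking up the rotated secret, the p95 can be read with a standard quantile query:

```promql
# p95 secret propagation latency over the last hour
histogram_quantile(0.95, sum by (le) (rate(secret_sync_latency_seconds_bucket[1h])))
```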
Negotiating SLOs with Internal Consumers
The most common failure mode is the platform team unilaterally setting SLOs. This produces targets that are either too aggressive (burning out the team) or too lax (not reflecting actual needs).
We run quarterly SLO negotiation sessions with each product team. The format is straightforward:
- Present current SLI data for the previous quarter, including error budget consumption.
- Each product team states what platform reliability level they actually need, with justification tied to their own SLOs.
- Jointly agree on targets for the next quarter, factoring in planned platform migrations or upgrades.
- Document the agreed SLOs in a shared repository, versioned alongside infrastructure-as-code.
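The shared repository entry can be as simple as one YAML file per service per quarter. A sketch of one possible layout (all field names are illustrative, not a standard schema):

```yaml
# slo-registry/platform-cicd/<quarter>.yaml
service: platform-cicd
negotiated_with:
  - product-team-a
  - product-team-b
slos:
  - name: deploy-latency
    sli: deploy_pipeline_duration_seconds (p99)
    target: p99 < 12m
    window: 30d
  - name: provision-success
    sli: cluster_provision_success_ratio
    target: "99.5%"
    window: 30d
notes: >
  Gateway SLO temporarily relaxed during the planned service mesh migration.
```

Because the file lives next to the infrastructure-as-code it governs, a reviewer can see SLO changes and the platform changes that motivated them in the same pull request.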
This process surfaces surprising mismatches. One product team needed deploy latency under 8 minutes because they shipped hotfixes multiple times per day. Another was perfectly fine with 20 minutes because they deployed weekly. A single SLO would have been wrong for both.
Error Budget Policies That Actually Work
An SLO without an error budget policy is just a dashboard. The policy must define what happens when the budget runs low. Our tiered approach:
- Budget > 50% remaining: Normal operations. Feature work and reliability work proceed per roadmap.
- Budget 20-50% remaining: Freeze non-critical platform changes. Prioritize reliability improvements. Weekly review of burn rate.
- Budget < 20% remaining: All hands on reliability. No new feature work. Daily burn-rate review. Escalation to engineering leadership if product teams are contributing to budget consumption (e.g., deploying broken manifests at high frequency).
- Budget exhausted: Product teams are informed that the platform is operating below target. Joint incident review. Roadmap reprioritization meeting with stakeholders.
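Catching burn early enough for these tiers to matter requires burn-rate alerts, not just dashboards. A common pattern is the multiwindow, multi-burn-rate rule from the Google SRE Workbook; a sketch for the 99.5% provisioning SLO, assuming a `slo:error_ratio` recording rule (bad events divided by total events):

```promql
# Page when the budget burns at ~14.4x the sustainable rate, which would
# exhaust a 30-day budget in roughly two days. The short window confirms
# the burn is still happening rather than a spike that already passed.
(slo:error_ratio:rate1h > 14.4 * (1 - 0.995))
and
(slo:error_ratio:rate5m > 14.4 * (1 - 0.995))
```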
The key insight is that exhausting the error budget is not a punishment; it is a signal for resource reallocation. If the platform team consistently exhausts its budget, the organization is under-investing in infrastructure reliability.
Tooling: Sloth and Pyrra
We evaluated several SLO-as-code tools and settled on a combination of Sloth for SLO definition and Pyrra for visualization.
Sloth generates Prometheus recording rules and alerting rules from a declarative YAML spec:
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: deploy-pipeline
spec:
  service: "platform-cicd"
  labels:
    team: "platform"
  slos:
    - name: "deploy-latency-p99"
      objective: 99
      description: "p99 deploy pipeline latency under 12 minutes"
      sli:
        events:
          # Error events are deploys slower than 720s: total observations
          # minus those that landed in the le="720" bucket or below.
          error_query: |
            sum(rate(deploy_duration_seconds_count[5m]))
            -
            sum(rate(deploy_duration_seconds_bucket{le="720"}[5m]))
          total_query: |
            sum(rate(deploy_duration_seconds_count[5m]))
      alerting:
        name: DeployLatencyBudgetBurn
        labels:
          severity: critical
        annotations:
          summary: "Deploy pipeline error budget burning fast"
        page_alert:
          labels:
            severity: page
        ticket_alert:
          labels:
            severity: ticket
Pyrra then reads the generated recording rules and provides a dashboard showing current SLO compliance, error budget remaining, and burn rate trends. The combination gives us code-reviewed SLO definitions stored in Git and real-time visibility without building custom dashboards.
Lessons Learned
After two years of running this framework, several hard-won lessons stand out:
- Start with three SLIs, not thirty. Every SLI you track requires instrumentation, alerting, and review cadence. Overloading on indicators leads to alert fatigue and abandoned dashboards.
- Measure from the consumer's perspective. Internal Prometheus metrics from your CI system are useful for debugging, but the SLI should reflect what the developer experiences. Synthetic probes that mimic real deploy workflows are invaluable.
- Version your SLOs. Requirements change. A new service mesh migration will temporarily degrade gateway latency. Versioned SLOs let you communicate planned regressions without losing accountability.
- Tie SLO reviews to sprint planning. If the error budget is burning, that should directly influence what the platform team works on next sprint. Otherwise, SLOs become a reporting exercise disconnected from actual work.
Platform SLOs are not about achieving perfection. They are about making infrastructure reliability visible, negotiable, and actionable. When product teams can see exactly how much reliability budget remains, and platform teams can justify reliability investments with data rather than intuition, the entire organization makes better engineering decisions.