BRO SRE

Reliability practices, infrastructure, automation


SLIs and SLOs for Platform Teams: Stop Guessing, Start Measuring

2026-03-18 · SRE, SLO, Platform Engineering

Platform teams occupy a peculiar position in engineering organizations. They do not ship features to end users. Their customers are other engineers, and the "product" is infrastructure: CI/CD pipelines, Kubernetes clusters, internal APIs, and developer tooling. Despite this, most platform teams either borrow product SLOs verbatim or operate with no measurable reliability targets at all. Both approaches lead to the same outcome: misaligned priorities, frustrated internal consumers, and reactive firefighting instead of strategic investment.

This article outlines a practical framework for defining Service Level Indicators and Objectives specifically tailored to platform engineering, drawing from two years of implementing this at an organization running 1,200 microservices across six product teams.

Why Platform SLIs Are Different

Product SLIs typically measure user-facing behavior: HTTP request latency, error rates, throughput. Platform SLIs measure developer-facing behavior. The distinction matters because the failure modes, measurement points, and consumer expectations are fundamentally different.

When a product API returns a 500 error, a user sees a broken page. When a deploy pipeline fails, a developer loses thirty minutes diagnosing why their merge did not reach staging. When cluster autoscaling lags, a team's load test produces meaningless results. These failures compound differently and demand different indicators.

A useful mental model: platform SLIs should answer the question, "Can our engineers ship and operate their software without being blocked by infrastructure?"

Concrete Platform SLIs Worth Tracking

Deploy Pipeline Latency (p99)

Measure the wall-clock time from merge-to-main to deployment completion in staging. This is arguably the single most important platform SLI because it directly governs developer iteration speed. We define it as:

SLI: deploy_pipeline_duration_seconds (p99)
Measurement point: CI webhook received -> last deployment probe healthy
Excludes: manual approval gates, optional integration test suites

A common mistake is measuring only the CI build step. The SLI must capture the full path, including image push, manifest update, ArgoCD sync, and readiness probe success. Our target is p99 under 12 minutes. Anything above 15 minutes triggers an error budget burn investigation.
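The full-path definition above is easy to spot-check offline. A minimal sketch, assuming a list of merge-to-healthy durations in seconds (the sample values and threshold wiring are illustrative, not production measurement code):

```python
# Illustrative sketch: compute the p99 deploy duration and check it
# against the 12-minute target. Sample durations are example values.
def percentile(samples, pct):
    """Nearest-rank percentile; good enough for SLI spot checks."""
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Wall-clock seconds from merge webhook to last readiness probe healthy
durations = [240, 310, 295, 410, 380, 505, 620, 290, 330, 710]

p99 = percentile(durations, 99)
within_target = p99 <= 12 * 60  # SLO target: p99 under 12 minutes
```

In production this would be a `histogram_quantile` over a Prometheus histogram rather than raw samples, but the arithmetic is the same.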

Cluster Provisioning Success Rate

For teams running multi-tenant or ephemeral clusters (preview environments, load-test clusters), the provisioning success rate matters enormously. We track:

SLI: cluster_provision_success_ratio
Good events: cluster reaches Ready state within 10 minutes
Total events: all cluster provisioning requests
Target SLO: 99.5% over 30-day rolling window

The "within 10 minutes" qualifier is critical. A cluster that takes 45 minutes to provision is functionally a failure even if it eventually succeeds.
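The "good within a deadline" definition translates directly into the event classifier. A sketch, where the attempt tuple shape is an assumption for illustration rather than the team's actual schema:

```python
# Illustrative: a provision attempt only counts as good if the cluster
# reached Ready within the 10-minute deadline, even if it eventually succeeds.
READY_DEADLINE_S = 10 * 60

def provision_success_ratio(attempts):
    """attempts: list of (reached_ready: bool, seconds_to_ready: float)."""
    good = sum(
        1 for reached_ready, seconds in attempts
        if reached_ready and seconds <= READY_DEADLINE_S
    )
    return good / len(attempts)

attempts = [
    (True, 420),   # good: Ready in 7 minutes
    (True, 2700),  # bad: Ready, but only after 45 minutes
    (False, 0),    # bad: never reached Ready
    (True, 580),   # good
]
ratio = provision_success_ratio(attempts)  # 0.5
```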

API Gateway Availability

If your platform exposes an internal API gateway (for service mesh routing, authentication, rate limiting), its availability is a platform SLI:

SLI: gateway_request_success_ratio
Good events: all responses except gateway-generated 5xx (upstream 5xx count as good)
Total events: all routed requests
Target SLO: 99.95% over 30-day rolling window

The exclusion of upstream 5xx errors is essential. The gateway SLO measures the gateway itself, not the services behind it. Conflating the two makes the SLO unactionable.
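The gateway-only scoping can be made precise in the event classifier. A sketch, assuming each response record carries its status code and a flag for whether a 5xx originated upstream (the field names are hypothetical):

```python
# Illustrative: only 5xx generated by the gateway itself count against the SLO.
# Upstream 5xx passed through by the gateway are treated as good events.
def is_good_event(status, from_upstream):
    if status < 500:
        return True
    return from_upstream  # upstream 5xx: not the gateway's fault

responses = [
    (200, False),
    (503, True),   # upstream failure routed correctly -> good
    (502, False),  # gateway-generated -> bad
    (404, False),  # client error, gateway worked -> good
]
good = sum(1 for status, upstream in responses if is_good_event(status, upstream))
ratio = good / len(responses)  # 0.75
```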

Secret Delivery Latency

Often overlooked, but critical in environments using Vault or external secret operators:

SLI: secret_sync_latency_seconds (p95)
Measurement: time from secret update in Vault to pod receiving rotated secret
Target: p95 under 60 seconds
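A percentile target like this can equivalently be tracked as an events-based SLI, which composes better with error budgets: if at least 95% of syncs complete within the 60-second bound, the p95 is under 60 seconds. A sketch with illustrative latencies:

```python
# Illustrative: recast "p95 under 60s" as a good/total ratio.
SYNC_BOUND_S = 60

def within_bound_ratio(latencies, bound=SYNC_BOUND_S):
    """Fraction of secret syncs landing within the latency bound."""
    return sum(1 for latency in latencies if latency <= bound) / len(latencies)

# Seconds from Vault secret update to pod receiving the rotated secret
latencies = [12, 8, 45, 61, 30, 15, 22, 9, 58, 40]
ratio = within_bound_ratio(latencies)  # 0.9 -> below the 0.95 needed
```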

Negotiating SLOs with Internal Consumers

The most common failure mode is the platform team unilaterally setting SLOs. This produces targets that are either too aggressive (burning out the team) or too lax (not reflecting actual needs).

We run quarterly SLO negotiation sessions with each product team. The format is straightforward:

  1. Present current SLI data for the previous quarter, including error budget consumption.
  2. Each product team states what platform reliability level they actually need, with justification tied to their own SLOs.
  3. Jointly agree on targets for the next quarter, factoring in planned platform migrations or upgrades.
  4. Document the agreed SLOs in a shared repository, versioned alongside infrastructure-as-code.

This process surfaces surprising mismatches. One product team needed deploy latency under 8 minutes because they shipped hotfixes multiple times per day. Another was perfectly fine with 20 minutes because they deployed weekly. A single SLO would have been wrong for both.

Error Budget Policies That Actually Work

An SLO without an error budget policy is just a dashboard. The policy must define what happens when the budget runs low. Our tiered approach:
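As an illustration of what a tiered policy can look like, the sketch below maps remaining budget to an escalating response. The thresholds and actions here are illustrative assumptions in the spirit of the Google SRE workbook, not this team's exact policy:

```python
# Illustrative tiered error budget policy: map remaining budget to a response.
# Thresholds and actions are example values, not a prescription.
def policy_tier(budget_remaining):
    """budget_remaining: fraction of the window's error budget left (0.0-1.0)."""
    if budget_remaining > 0.5:
        return "normal: feature and platform work proceed as planned"
    if budget_remaining > 0.25:
        return "caution: reliability fixes jump the platform backlog"
    if budget_remaining > 0.0:
        return "restricted: risky changes (upgrades, migrations) are deferred"
    return "exhausted: platform work pauses except reliability remediation"

tier = policy_tier(0.18)  # "restricted: ..."
```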

The key insight is that exhausting the error budget is not a punishment -- it is a signal for resource reallocation. If the platform team consistently exhausts its budget, the organization is under-investing in infrastructure reliability.
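The "burning the budget" framing has simple arithmetic behind it. A sketch of the standard burn-rate calculation, with illustrative numbers:

```python
# Illustrative: burn rate = observed error ratio / budgeted error ratio.
# A burn rate of 1.0 spends the budget exactly over the full SLO window;
# higher rates exhaust it proportionally faster.
def burn_rate(error_ratio, slo_target):
    budget_ratio = 1.0 - slo_target
    return error_ratio / budget_ratio

# 99.5% SLO, currently failing 2% of events:
rate = burn_rate(0.02, 0.995)  # 4.0 -> budget gone in 1/4 of the window
```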

Tooling: Sloth and Pyrra

We evaluated several SLO-as-code tools and settled on a combination of Sloth for SLO definition and Pyrra for visualization.

Sloth generates Prometheus recording rules and alerting rules from a declarative YAML spec:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: deploy-pipeline
spec:
  service: "platform-cicd"
  labels:
    team: "platform"
  slos:
    - name: "deploy-latency-p99"
      objective: 99
      description: "p99 deploy pipeline latency under 12 minutes"
      sli:
        events:
          # Error events: deploys slower than 720s (total minus the <=720s bucket)
          error_query: |
            sum(rate(deploy_duration_seconds_count[5m]))
            -
            sum(rate(deploy_duration_seconds_bucket{le="720"}[5m]))
          total_query: |
            sum(rate(deploy_duration_seconds_count[5m]))
      alerting:
        name: DeployLatencyBudgetBurn
        labels:
          severity: critical
        annotations:
          summary: "Deploy pipeline error budget burning fast"
        page_alert:
          labels:
            severity: page
        ticket_alert:
          labels:
            severity: ticket

Pyrra then reads the generated recording rules and provides a dashboard showing current SLO compliance, error budget remaining, and burn rate trends. The combination gives us code-reviewed SLO definitions stored in Git and real-time visibility without building custom dashboards.

Lessons Learned

After two years of running this framework, the hardest-won lesson is this:

Platform SLOs are not about achieving perfection. They are about making infrastructure reliability visible, negotiable, and actionable. When product teams can see exactly how much reliability budget remains, and platform teams can justify reliability investments with data rather than intuition, the entire organization makes better engineering decisions.