BRO SRE

Reliability practices, infrastructure, automation

FinOps for SRE: Cutting Cloud Spend Without Breaking Reliability

2026-04-08 · FinOps, Cloud, Cost Optimization

Cost optimization has historically lived in a different orbit from reliability work. Finance teams negotiated commitments with cloud vendors, engineering managers approved instance sizes during sprint planning, and SREs occasionally received quarterly reports asking why one service consumed disproportionate resources. This separation is breaking down. As cloud bills cross meaningful thresholds, organizations are discovering that the engineers who understand reliability tradeoffs are also best positioned to identify which costs preserve reliability and which are pure waste.

Our team reduced annual cloud spend by 31% over twelve months while simultaneously improving p99 latency and reducing critical incidents. The work was not glamorous, and no single optimization accounted for more than 6% of savings. Here is what produced the bulk of the improvement.

Start With Cost Visibility, Not Cost Cutting

Before any optimization work, we built cost attribution down to the service level. The default cloud billing dashboard groups spend by service type (compute, storage, network) and account. This is useless for engineering decisions. What you need is spend attributed to the application teams who can actually change consumption.

We used Kubecost for Kubernetes workload attribution and built a small reconciliation pipeline that joined cloud billing data with Kubernetes namespace labels. Within a month, every product team had a weekly report showing their actual cloud cost, broken down by service, environment, and resource type. Three observations emerged immediately:

None of these were surprising in retrospect, but they were invisible without proper attribution.
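The reconciliation step described above can be sketched in a few lines. This is a minimal illustration, not our actual pipeline: the row layout, the label keys (`team`, `env`), and the sample figures are all assumptions made for the example, and a real billing export carries many more columns.

```python
from collections import defaultdict

# Hypothetical billing rows: (namespace, resource_type, cost_usd).
billing_rows = [
    ("checkout", "compute", 1200.0),
    ("checkout", "storage", 300.0),
    ("search", "compute", 2500.0),
    ("ci-runners", "compute", 900.0),
]

# Hypothetical namespace labels, as they might be read from the cluster.
namespace_labels = {
    "checkout": {"team": "payments", "env": "prod"},
    "search": {"team": "discovery", "env": "prod"},
}


def attribute_costs(rows, labels):
    """Sum cost per (team, env, resource_type).

    Namespaces without labels are grouped under 'unattributed' so that
    missing ownership shows up in the report instead of disappearing.
    """
    totals = defaultdict(float)
    for namespace, resource_type, cost in rows:
        meta = labels.get(namespace, {})
        key = (
            meta.get("team", "unattributed"),
            meta.get("env", "unknown"),
            resource_type,
        )
        totals[key] += cost
    return dict(totals)


report = attribute_costs(billing_rows, namespace_labels)
for (team, env, resource_type), cost in sorted(report.items()):
    print(f"{team:14} {env:8} {resource_type:8} ${cost:,.2f}")
```

Keeping an explicit "unattributed" bucket is a deliberate choice in this sketch: spend that nobody owns is usually the first thing worth chasing.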

Right-Sizing Without Risking Reliability

The temptation with right-sizing is to set requests and limits to observed p99 usage. This is a mistake. Resource consumption is bursty, and the gap between p99 and worst-case is often substantial. We use a different rule: requests at the p95 of peak usage over a seven-day window, limits at 1.5x requests, except for workloads that cannot tolerate eviction (databases, stateful services).
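A rough sketch of that rule follows. It assumes the input is a list of per-interval peak usage samples collected over seven days, which is one reading of the rule, and it uses a nearest-rank percentile to stay dependency-free. The function names and sample data are illustrative, not taken from our tooling.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def size_container(peak_samples, limit_factor=1.5):
    """Return (request, limit) from seven days of per-interval peaks.

    request = p95 of the peaks, limit = limit_factor * request.
    """
    request = percentile(peak_samples, 95)
    return request, request * limit_factor


# Illustrative data: 100 per-interval CPU peaks, in millicores.
cpu_peaks_m = list(range(1, 101))
request, limit = size_container(cpu_peaks_m)
print(f"request={request}m limit={limit}m")
```

Workloads that cannot tolerate eviction would bypass this function entirely and be sized by hand.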

The Vertical Pod Autoscaler's recommendation mode is invaluable here. We run VPA in recommendation-only mode (updateMode: "Off") for all services, then have teams adjust manifests during their normal sprint cycle. Running VPA with updateMode: "Auto" caused too many unexpected restarts for our taste, but the recommendations themselves are reliable.

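# VPA in recommendation-only mode: computes recommendations, never evicts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # recommend only; do not apply changes
  resourcePolicy:
    containerPolicies:
      - containerName: '*'  # applies to every container in the pod
        minAllowed:  # floor for recommendations
          cpu: 100m
          memory: 128Mi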

One nuance: VPA bases its memory recommendations on observed usage, so it cannot tell memory a runtime has merely reserved (a JVM heap sized by its flags, or a Go heap allowed to grow under GOGC) from memory the application actually needs. For runtime-managed memory, we cap recommendations at 1.3x the current request and let teams tune the runtime separately.

Spot Instances for the Right Workloads

Spot instances offer a 60-80% discount but can be reclaimed on as little as two minutes' notice. The blanket advice "use spot for stateless workloads" is too simple. We classify workloads into four tiers:

For Tier 2, we use Karpenter with a node pool configuration that maintains a minimum on-demand capacity but provisions additional nodes from spot. The NodePool's spec.weight field controls preference between pools, and spec.limits caps the resources each pool may provision; we apply that cap to spot capacity per AZ to limit blast radius during spot reclaim storms.

Network Costs: The Hidden Iceberg

Cross-AZ and egress traffic costs are the most underestimated component of cloud bills. A service that gossips at 50 MB/s across three AZs can cost more in network than in compute. The optimization opportunities here are substantial:

Topology-aware routing. Kubernetes' Topology Aware Routing (known as Topology Aware Hints before v1.27) keeps traffic within the same AZ when possible. Enabling this for high-throughput internal services reduced our cross-AZ traffic by 60% with no application changes. The catch: it requires sufficient endpoints per AZ to avoid imbalanced load.
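To put the 50 MB/s gossip example and a 60% reduction in dollar terms, here is a back-of-the-envelope calculation. The price is an assumption chosen for illustration ($0.01 per GB billed on each side of a cross-AZ transfer); check your provider's current price sheet. The sketch also assumes every byte crosses an AZ boundary, which overstates the cost for most real services.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, for round numbers

# Assumed price, for illustration only: $0.01 per GB on each side of the
# transfer, so each GB that crosses an AZ boundary is billed twice.
PRICE_PER_GB_EACH_SIDE = 0.01


def monthly_cross_az_cost(mb_per_second, cross_az_fraction=1.0):
    """Estimated monthly USD cost of sustained cross-AZ traffic."""
    gb_per_month = mb_per_second * SECONDS_PER_MONTH / 1000  # decimal GB
    billed_per_gb = 2 * PRICE_PER_GB_EACH_SIDE
    return gb_per_month * cross_az_fraction * billed_per_gb


before = monthly_cross_az_cost(50)                         # all traffic crosses AZs
after = monthly_cross_az_cost(50, cross_az_fraction=0.4)   # after a 60% reduction
print(f"before: ${before:,.0f}/month, after: ${after:,.0f}/month")
```

Under these assumptions a single chatty service costs roughly $2,600 a month in transfer alone, which is more than many services spend on compute.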

Egress consolidation. Multiple services calling the same external API independently miss caching opportunities. We introduced an internal HTTP cache (Squid, surprisingly) for high-volume external API calls. This reduced egress to specific vendors by 70% and had the side benefit of improving p99 latency for cached responses.

VPC endpoints for cloud services. From private subnets, calls to S3, DynamoDB, and other cloud services are routed through the NAT gateway to the services' public endpoints by default, incurring NAT data-processing charges. VPC endpoints eliminate this for marginal monthly cost. Verify your service is actually using the endpoint by checking the response IP -- misconfiguration is silent.
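The IP check can be scripted. The sketch below resolves a hostname and reports whether every returned address is private; the helper names are hypothetical. One caveat worth verifying against your provider's documentation: this check is meaningful for interface-style endpoints, which resolve to addresses inside the VPC. Gateway-style endpoints (as used for S3 and DynamoDB on AWS) work through route tables, so DNS still returns public addresses and the route table is the thing to inspect.

```python
import ipaddress
import socket


def resolve_addresses(hostname, port=443):
    """Resolve a hostname to the set of IPs the OS resolver returns."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def all_private(addresses):
    """True if there is at least one address and all are non-public.

    ipaddress' is_private covers the RFC 1918 ranges plus other
    non-global ranges such as loopback and link-local.
    """
    addresses = list(addresses)
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )


# Offline demonstration with literal addresses; in practice you would pass
# resolve_addresses("<service hostname>") from inside the VPC.
print(all_private({"10.0.12.34"}))   # address inside a private range
print(all_private({"52.216.0.1"}))   # publicly routable address
```

Running the check from a pod or instance in the affected subnet matters, because resolution from a laptop will not reflect the VPC's private DNS.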

The Non-Production Environment Trap

Non-production environments are the most consistent source of waste because nobody owns their cost. We implemented three policies that together reduced non-prod spend by 55%:

Aggressive scale-down schedules. Dev and staging environments scale to zero or near-zero from 8 PM to 7 AM and on weekends. We use KEDA for time-based scaling. The implementation is straightforward, but the harder problem is identifying which services actually need to run overnight (some batch jobs do).

Preview environment expiration. Preview environments per pull request are valuable but rarely cleaned up. We enforce a 7-day max lifetime, with automated extension only on PR activity. Stale environments are deleted automatically with a 24-hour warning.

Quotas per team. Each team gets a non-production budget. Exceeding it requires explicit approval. This sounds bureaucratic, but in practice it shifts the question from "do we want more capacity" to "do we want this capacity more than the other thing we asked for."
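The preview environment expiration policy above reduces to a small decision function. This sketch assumes that PR activity restarts the seven-day clock and that the warning goes out 24 hours before deletion; the function name and return values are illustrative rather than taken from our tooling.

```python
from datetime import datetime, timedelta, timezone

MAX_IDLE = timedelta(days=7)          # lifetime, restarted by PR activity
WARNING_WINDOW = timedelta(hours=24)  # notice period before deletion


def preview_env_action(created_at, last_pr_activity, now):
    """Return 'keep', 'warn', or 'delete' for one preview environment."""
    # Assumption: the most recent PR activity restarts the 7-day clock.
    anchor = max(created_at, last_pr_activity)
    expires_at = anchor + MAX_IDLE
    if now >= expires_at:
        return "delete"
    if now >= expires_at - WARNING_WINDOW:
        return "warn"
    return "keep"


utc = timezone.utc
created = datetime(2026, 4, 1, tzinfo=utc)

# No PR activity since creation: expires on April 8.
print(preview_env_action(created, created, datetime(2026, 4, 3, tzinfo=utc)))
print(preview_env_action(created, created, datetime(2026, 4, 7, 12, tzinfo=utc)))
print(preview_env_action(created, created, datetime(2026, 4, 8, tzinfo=utc)))

# A push on April 5 extends the lifetime to April 12.
pushed = datetime(2026, 4, 5, tzinfo=utc)
print(preview_env_action(created, pushed, datetime(2026, 4, 8, tzinfo=utc)))
```

A scheduled job can run this over every preview namespace and act on the result.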

Reserved Capacity and Savings Plans

Commitment-based discounts (Reserved Instances, Savings Plans, Committed Use Discounts) are the largest single lever for cost reduction, often delivering 30-60% savings. The risk is over-committing and paying for unused capacity.

Our approach: commit to baseline capacity at three-year terms, and leave 20-30% of capacity at on-demand or one-year terms for flexibility. The baseline is calculated from the past six months of minimum daily consumption, smoothed for outages and known business changes.
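A simplified version of the baseline calculation might look like the following. It treats "smoothed for outages" as "drop days flagged as anomalous", which is an assumption, and it does not model known business changes. The sample uses five days instead of six months to keep the example short.

```python
def commitment_baseline(daily_min_usage, excluded_days=()):
    """Lowest daily minimum in the window, ignoring flagged days.

    daily_min_usage: minimum consumption per day (e.g. vCPUs in use).
    excluded_days: indexes of days to ignore, such as outages.
    """
    excluded = set(excluded_days)
    usable = [u for day, u in enumerate(daily_min_usage) if day not in excluded]
    if not usable:
        raise ValueError("no usable days in the window")
    return min(usable)


def flexible_share(baseline, typical_capacity):
    """Fraction of typical capacity left outside the long-term commitment."""
    return 1 - baseline / typical_capacity


# Illustrative figures in vCPUs; day 2 was an outage and is excluded.
daily_minimums = [400, 420, 150, 410, 430]
baseline = commitment_baseline(daily_minimums, excluded_days=[2])
share = flexible_share(baseline, typical_capacity=540)
print(f"commit {baseline} vCPUs, leaving {share:.0%} flexible")
```

If the flexible share lands well outside the 20-30% band, that is a signal to revisit either the baseline window or the capacity estimate before signing anything.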

Critically, commitments should be made against compute spend rather than specific instance types (Compute Savings Plans, not Reserved Instances) where possible. Spend-based commitments apply to whichever instance family and size you actually run, protecting you from instance family transitions during the commitment period.

The Reliability-Cost Tradeoff Frame

Every cost optimization is implicitly a reliability tradeoff. Reducing replica counts increases blast radius during failures. Using spot instances increases reclamation events. Right-sizing reduces headroom for traffic spikes. The frame we use during cost reviews:

For each optimization, explicitly state: what is the worst-case scenario this introduces, what is the probability, and what is the recovery cost if it occurs. Most optimizations survive this analysis. Some do not -- and those are the ones to push back on regardless of the savings.
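One way to make those three questions hard to skip is to record them in a fixed structure. The sketch below is illustrative: the field names and figures are hypothetical, and the expected-loss number is an input to the conversation rather than a decision rule, since some worst cases are unacceptable at any probability.

```python
from dataclasses import dataclass


@dataclass
class OptimizationReview:
    name: str
    annual_savings_usd: float
    worst_case: str             # plain-language description of the failure
    annual_probability: float   # estimated chance it happens within a year
    recovery_cost_usd: float    # estimated cost to recover if it does

    def expected_annual_loss(self):
        """Probability-weighted recovery cost per year."""
        return self.annual_probability * self.recovery_cost_usd

    def net_expected_benefit(self):
        """Savings minus expected loss; negative values need a second look."""
        return self.annual_savings_usd - self.expected_annual_loss()


# Hypothetical example values.
review = OptimizationReview(
    name="Reduce search replicas from 6 to 4",
    annual_savings_usd=40_000,
    worst_case="AZ loss leaves too few replicas to absorb peak traffic",
    annual_probability=0.25,
    recovery_cost_usd=20_000,
)
print(review.expected_annual_loss(), review.net_expected_benefit())
```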

Cost optimization done well makes reliability better, not worse. Eliminating waste frees budget for redundancy where it matters, right-sized workloads schedule more efficiently and fail more predictably, and visibility into spending forces conversations about what the organization actually values.