BRO SRE

Reliability practices, infrastructure, automation


Chaos Engineering on a Budget: Practical Fault Injection Without Breaking the Bank

2025-11-05 · Chaos Engineering, Resilience, Testing

Chaos engineering does not require a dedicated team, a custom platform, or a Netflix-scale budget. It requires a hypothesis, a controlled experiment, and the discipline to act on what you learn. We started with shell scripts and graduated to LitmusChaos. Here is the practical path.

The Principles

Before injecting any faults, internalize these rules:

  1. Start with a steady state hypothesis. Define what "normal" looks like in measurable terms: request latency under 200ms, error rate below 0.1%, queue depth under 1000. You need a baseline to compare against.
  2. Minimize blast radius. Start with one pod, not the whole deployment. Start in staging, not production. Start during business hours when everyone is awake.
  3. Have a kill switch. Every experiment must be immediately reversible. If you cannot stop the experiment in under 30 seconds, redesign it.
  4. Automate the observation. Set up dashboards before the experiment. If you are watching terminals manually, you will miss the interesting data.
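Principle 1 is automatable. A minimal sketch of a steady-state check, using the thresholds from the hypothesis above — in practice the metric values would come from your monitoring API, but here they are passed in as plain numbers:

```shell
#!/usr/bin/env bash
# check_steady_state P99_MS ERROR_RATE QUEUE_DEPTH
# Returns 0 if all metrics are within the steady-state hypothesis:
# p99 latency under 200ms, error rate below 0.1%, queue depth under 1000.
check_steady_state() {
  local p99_ms=$1 error_rate=$2 queue_depth=$3
  awk -v l="$p99_ms" -v e="$error_rate" -v q="$queue_depth" \
    'BEGIN { exit !(l < 200 && e < 0.001 && q < 1000) }'
}

# Healthy baseline passes; degraded state fails
check_steady_state 150 0.0005 420 && echo "steady state OK"
check_steady_state 850 0.02 1500 || echo "steady state VIOLATED"
```

Run this before the experiment to confirm the baseline, and continuously during it; the moment it fails outside your hypothesis, trigger the kill switch.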

Simple Fault Injection Scripts

You do not need a platform for your first chaos experiments. These tools are already on your Linux nodes:

Network Latency

# Add 100ms latency with 25ms jitter to eth0
tc qdisc add dev eth0 root netem delay 100ms 25ms

# Remove when done
tc qdisc del dev eth0 root netem
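Per the kill-switch principle, it is worth wrapping every injection so the revert is guaranteed to run. A small sketch — the inject and revert commands are whatever fault you are running — that uses a shell trap so cleanup fires even on Ctrl-C:

```shell
#!/usr/bin/env bash
# run_experiment DURATION INJECT REVERT
# Runs INJECT, waits DURATION seconds, then runs REVERT.
# The trap ensures REVERT also fires on Ctrl-C or early exit,
# which is the kill switch in script form.
run_experiment() {
  local duration=$1 inject=$2 revert=$3
  trap 'eval "$revert"; trap - INT TERM EXIT' INT TERM EXIT
  eval "$inject"
  sleep "$duration"
  eval "$revert"
  trap - INT TERM EXIT
}

# Example with the latency fault above (requires root):
# run_experiment 60 \
#   "tc qdisc add dev eth0 root netem delay 100ms 25ms" \
#   "tc qdisc del dev eth0 root netem"
```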

Packet Loss

# Drop 5% of packets
tc qdisc add dev eth0 root netem loss 5%

# Targeted: only drop packets to a specific service
iptables -A OUTPUT -d 10.96.42.15 -m statistic \
  --mode random --probability 0.05 -j DROP

# Remove the rule when done (-D with the same match)
iptables -D OUTPUT -d 10.96.42.15 -m statistic \
  --mode random --probability 0.05 -j DROP

CPU Stress

# Consume 4 CPU cores for 60 seconds
stress-ng --cpu 4 --timeout 60s

# Memory pressure: allocate 2GB
stress-ng --vm 1 --vm-bytes 2G --timeout 60s

DNS Failure

# Block DNS resolution for a specific service
iptables -A OUTPUT -p udp --dport 53 -m string \
  --string "payments-service" --algo bm -j DROP

# Restore: delete the matching rule
iptables -D OUTPUT -p udp --dport 53 -m string \
  --string "payments-service" --algo bm -j DROP

Kubernetes Pod Failures

The simplest Kubernetes chaos experiment is killing pods. If your services are stateless and properly configured with readiness probes, replica counts, and pod disruption budgets, a single pod deletion should cause zero customer impact.

# Kill a random pod from a deployment (shuf picks one at random)
kubectl delete $(kubectl get pods -l app=api-gateway \
  -o name | shuf -n 1) --grace-period=0 --force

# Kill a random pod every 30 seconds
while true; do
  kubectl delete $(kubectl get pods -l app=api-gateway \
    -o name | shuf -n 1) --grace-period=0 --force
  sleep 30
done

Our first pod-kill experiment revealed that our API gateway took 45 seconds to recover because the readiness probe had a 30-second initialDelaySeconds. We reduced it to 5 seconds with a proper health check endpoint and recovery dropped to 8 seconds.
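The fix was a readiness probe tuned for fast recovery. Roughly like this — the delay values match the numbers above, but the endpoint path and port are assumptions, not our exact manifest:

```yaml
readinessProbe:
  httpGet:
    path: /health        # dedicated health check endpoint (assumed path)
    port: 8080           # hypothetical container port
  initialDelaySeconds: 5 # was 30, which dominated recovery time
  periodSeconds: 2
  failureThreshold: 3
```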

LitmusChaos: When Scripts Are Not Enough

LitmusChaos adds structure to chaos experiments through CRDs. Experiments become declarative, repeatable, and version-controlled. A ChaosEngine ties a target application to the experiments to run and the probes that validate the steady state:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-gateway-pod-kill
spec:
  appinfo:
    appns: production
    applabel: app=api-gateway
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "300"
        - name: CHAOS_INTERVAL
          value: "30"
        - name: FORCE
          value: "true"
      probe:
      - name: api-availability
        type: httpProbe
        httpProbe/inputs:
          url: http://api-gateway.production/health
          expectedResponseCode: "200"
        mode: Continuous
        runProperties:
          probeTimeout: 5s
          interval: 5s
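Running the engine is a normal apply; the chaos operator picks it up and writes the verdict to a ChaosResult. Roughly (ChaosResult names follow the `<engine>-<experiment>` convention — confirm against your Litmus version):

```shell
# Apply the engine; the chaos operator starts the experiment
kubectl apply -f api-gateway-pod-kill.yaml

# Watch the engine status while the experiment runs
kubectl get chaosengine api-gateway-pod-kill -n production -w

# Afterwards, the verdict and probe results land in a ChaosResult
kubectl get chaosresult api-gateway-pod-kill-pod-delete \
  -n production -o yaml
```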

Network Partition Testing

Network partitions are the most interesting chaos experiments because they expose assumptions about distributed system behavior. Can your service handle a database that is reachable but slow? What happens when the cache is partitioned from the database?

We test three partition scenarios quarterly, each with its own steady-state hypothesis and kill switch.
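The "reachable but slow" database case, for example, can be simulated on the client node with netem applied only to traffic for the database's address — the IP here is hypothetical. This is the standard prio-qdisc recipe: classify matching packets into one band and delay only that band:

```shell
# Delay only traffic to the database (10.96.8.20, hypothetical IP),
# leaving all other traffic untouched.
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 800ms 100ms
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
  match ip dst 10.96.8.20/32 flowid 1:3

# Kill switch: remove the whole qdisc hierarchy in one command
tc qdisc del dev eth0 root
```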

Game Days

Once you have confidence in individual experiments, combine them into game days — structured events where the team runs multiple experiments in sequence or parallel, simulating realistic failure scenarios.

Our game day format applies the same discipline as individual experiments: a steady-state hypothesis, a bounded blast radius, a kill switch, and dashboards set up in advance.

The most valuable outcome of game days is not the bugs you find — it is the confidence you build. Teams that regularly practice failure response handle real incidents with less panic and more precision.