Chaos Engineering on a Budget: Practical Fault Injection Without Breaking the Bank
Chaos engineering does not require a dedicated team, a custom platform, or a Netflix-scale budget. It requires a hypothesis, a controlled experiment, and the discipline to act on what you learn. We started with shell scripts and graduated to LitmusChaos. Here is the practical path.
The Principles
Before injecting any faults, internalize these rules:
- Start with a steady-state hypothesis. Define what "normal" looks like in measurable terms: request latency under 200ms, error rate below 0.1%, queue depth under 1000. You need a baseline to compare against.
- Minimize blast radius. Start with one pod, not the whole deployment. Start in staging, not production. Start during business hours when everyone is awake.
- Have a kill switch. Every experiment must be immediately reversible. If you cannot stop the experiment in under 30 seconds, redesign it.
- Automate the observation. Set up dashboards before the experiment. If you are watching terminals manually, you will miss the interesting data.
Simple Fault Injection Scripts
You do not need a platform for your first chaos experiments. These tools are already on your Linux nodes:
Network Latency
# Add 100ms latency with 25ms jitter to eth0
tc qdisc add dev eth0 root netem delay 100ms 25ms
# Remove when done
tc qdisc del dev eth0 root netem
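The rule above delays every packet leaving eth0. To scope the delay to a single destination, pair netem with a classful prio qdisc and a filter. A sketch (the target IP 10.96.42.15 is illustrative):

```shell
# Create a 3-band prio qdisc, attach netem only to band 3
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 100ms 25ms
# Steer packets for one destination into the delayed band
tc filter add dev eth0 parent 1: protocol ip u32 \
  match ip dst 10.96.42.15/32 flowid 1:3
# Tear the whole hierarchy down with one command
tc qdisc del dev eth0 root
```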
Packet Loss
# Drop 5% of packets
tc qdisc add dev eth0 root netem loss 5%
# Targeted: only drop packets to a specific service
iptables -A OUTPUT -d 10.96.42.15 -m statistic \
--mode random --probability 0.05 -j DROP
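Unlike the tc rules, the iptables rule has no paired removal shown here, and it stays active until deleted. Cleanup sketch:

```shell
# -D deletes the first rule matching the same specification
iptables -D OUTPUT -d 10.96.42.15 -m statistic \
  --mode random --probability 0.05 -j DROP
# Or list the chain with rule numbers and delete by position
iptables -L OUTPUT --line-numbers -n
```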
CPU Stress
# Consume 4 CPU cores for 60 seconds
stress-ng --cpu 4 --timeout 60s
# Memory pressure: allocate 2GB
stress-ng --vm 1 --vm-bytes 2G --timeout 60s
DNS Failure
# Block DNS resolution for a specific service
iptables -A OUTPUT -p udp --dport 53 -m string \
--string "payments-service" --algo bm -j DROP
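To confirm the block is working, query the cluster DNS directly. The resolver IP below is an assumption; find yours with `kubectl get svc -n kube-system kube-dns`:

```shell
# Should time out: the query payload contains the blocked string
dig +time=2 +tries=1 payments-service.production.svc.cluster.local @10.96.0.10
# Control query: an unblocked name should still resolve
dig +time=2 +tries=1 kubernetes.default.svc.cluster.local @10.96.0.10
```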
Kubernetes Pod Failures
The simplest Kubernetes chaos experiment is killing pods. If your services are stateless and properly configured with readiness probes, replica counts, and pod disruption budgets, a single pod deletion should cause zero customer impact.
# Kill a random pod from a deployment
kubectl delete pod $(kubectl get pods -l app=api-gateway \
  -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | shuf -n 1) \
  --grace-period=0 --force
# Kill a random pod every 30 seconds
while true; do
  kubectl delete pod $(kubectl get pods -l app=api-gateway \
    -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | shuf -n 1) \
    --grace-period=0 --force
  sleep 30
done
Our first pod-kill experiment revealed that our API gateway took 45 seconds to recover because the readiness probe had a 30-second initialDelaySeconds. We reduced it to 5 seconds with a proper health check endpoint and recovery dropped to 8 seconds.
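That fix was a one-field change. A sketch of the equivalent JSON patch (the container index and the existence of a readinessProbe in our manifest are assumptions):

```shell
# Drop initialDelaySeconds from 30 to 5 on the first container's probe
kubectl patch deployment api-gateway --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds",
   "value": 5}
]'
```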
LitmusChaos: When Scripts Are Not Enough
LitmusChaos adds structure to chaos experiments through CRDs. Experiments become declarative, repeatable, and version-controlled. The key concepts:
- ChaosEngine: Binds an experiment to a target application
- ChaosExperiment: Defines the fault type and parameters
- ChaosResult: Records the outcome (pass/fail based on your probe definitions)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-gateway-pod-kill
spec:
  appinfo:
    appns: production
    applabel: app=api-gateway
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "300"
        - name: CHAOS_INTERVAL
          value: "30"
        - name: FORCE
          value: "true"
      probe:
      - name: api-availability
        type: httpProbe
        httpProbe/inputs:
          url: http://api-gateway.production/health
          method:
            get:
              criteria: ==
              responseCode: "200"
        mode: Continuous
        runProperties:
          probeTimeout: 5s
          interval: 5s
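Running the engine is a standard apply; the verdict lands in a ChaosResult, which Litmus names <engine>-<experiment>. A sketch, assuming the engine is applied in the production namespace:

```shell
# Apply the ChaosEngine and let the run complete
kubectl apply -f api-gateway-pod-kill.yaml
# Read the Pass/Fail verdict recorded by the probes
kubectl get chaosresult api-gateway-pod-kill-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}'
```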
Network Partition Testing
Network partitions are the most interesting chaos experiments because they expose assumptions about distributed system behavior. Can your service handle a database that is reachable but slow? What happens when the cache is partitioned from the database?
We test three partition scenarios quarterly:
- Service-to-database partition: Verifies circuit breakers activate, retries are bounded, and error messages are meaningful.
- Service-to-cache partition: Verifies the service degrades gracefully to database-only mode without cascading failures.
- Cross-AZ partition: Verifies leader election completes, clients reconnect, and data remains consistent.
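The first scenario can be injected with the same iptables primitives used earlier, run inside the service's network namespace. A sketch (the ClusterIP and the Postgres port are assumptions):

```shell
DB_IP=10.96.12.7   # hypothetical database ClusterIP
# Black-hole traffic to the database: network reachable, dependency dead
iptables -A OUTPUT -d "$DB_IP" -p tcp --dport 5432 -j DROP
# ...observe circuit-breaker state, retry counts, and error messages...
# Kill switch: delete the rule to heal the partition
iptables -D OUTPUT -d "$DB_IP" -p tcp --dport 5432 -j DROP
```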
Game Days
Once you have confidence in individual experiments, combine them into game days — structured events where the team runs multiple experiments in sequence or parallel, simulating realistic failure scenarios.
Our game day format:
- 2 hours, scheduled monthly, during business hours
- One engineer runs experiments, one monitors dashboards, one takes notes
- Scenarios are pre-planned by the experiment runner but not revealed to the response team in advance
- Debrief immediately after with findings and action items
The most valuable outcome of game days is not the bugs you find — it is the confidence you build. Teams that regularly practice failure response handle real incidents with less panic and more precision.