BRO SRE

Reliability practices, infrastructure, automation


DNS at Scale: The Failure Mode Nobody Talks About

2026-03-26 · DNS, Kubernetes, Networking

DNS is the protocol everyone assumes works. It is also the protocol that, in our experience, causes more subtle production incidents than any other component in a Kubernetes cluster. Slow lookups manifest as elevated p99 latency, intermittent NXDOMAIN responses look like application bugs, and conntrack table exhaustion presents as random connection failures that defy reproduction. This article catalogs the DNS failure modes that have actually caused us pain and the configurations that prevented recurrence.

The Default CoreDNS Configuration Is Wrong for Production

A fresh Kubernetes cluster ships with CoreDNS configured for general-purpose workloads. At any meaningful query rate, several defaults become liabilities.

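The default cache TTL of 30 seconds is too short for stable internal services. We see hit ratios climb from 70% to 95% by raising the cache TTL to 300 seconds for cluster-internal records, which cuts the miss rate from 30% to 5%, a sixfold reduction in queries that get past the cache. The risk is stale records pointing to terminated pods. For ClusterIP Services that risk is small, because the record points to the Service's stable virtual IP and Kubernetes' endpoint reconciliation updates the routing behind it without any DNS change. Headless Services are the exception: their records contain pod IPs, and a cached response can keep naming a terminated pod until the TTL expires, so weigh a shorter TTL for zones that serve them.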

kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods insecure
    fallthrough in-addr.arpa ip6.arpa
    ttl 300
}
cache 300 {
    success 9984 300
    denial 9984 30
    prefetch 10 60s 10%
}

The prefetch directive is particularly valuable. It causes CoreDNS to refresh popular records before they expire, eliminating the latency spike that would otherwise occur at cache miss.
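To make the three prefetch arguments concrete, here is a minimal Python sketch of the decision they describe. It is a simplified model, not CoreDNS source code: the `CacheEntry` type and `should_prefetch` function are hypothetical names, and the 60s window is assumed to be folded into the `recent_hits` counter.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    original_ttl: int   # TTL in seconds when the record entered the cache
    remaining_ttl: int  # seconds left before the record expires
    recent_hits: int    # hits seen with no gap longer than the duration window


def should_prefetch(entry: CacheEntry, amount: int = 10, percentage: int = 10) -> bool:
    """Simplified model of `prefetch 10 60s 10%`.

    A record is refreshed ahead of expiry only if it is popular (at least
    `amount` recent hits) and its remaining TTL has dropped below
    `percentage` percent of the original TTL.
    """
    popular = entry.recent_hits >= amount
    near_expiry = entry.remaining_ttl * 100 < entry.original_ttl * percentage
    return popular and near_expiry
```

With a 300-second TTL, a popular record becomes eligible for refresh once fewer than 30 seconds remain. Rarely queried records are left to expire normally.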

The ndots:5 Problem

Kubernetes injects ndots:5 into every pod's /etc/resolv.conf. Any name with fewer than five dots is therefore treated as relative, and the resolver appends each search domain in turn before it attempts the literal name. For an external hostname like api.example.com (two dots), this produces four wasted lookups before the correct one:

api.example.com.default.svc.cluster.local. NXDOMAIN
api.example.com.svc.cluster.local.         NXDOMAIN
api.example.com.cluster.local.             NXDOMAIN
api.example.com.eu-west-1.compute.internal. NXDOMAIN
api.example.com.                           A 93.184.216.34

This amplification multiplies your DNS query rate for external lookups by a factor of five, and the resolver typically repeats the whole sequence for both A and AAAA records. The fix is either to use fully qualified domain names (trailing dot) in application code or to set ndots:2 in the pod's dnsConfig. We recommend the latter for new clusters because changing application code across hundreds of services is impractical.

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
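The search-list expansion is easy to model. The following Python sketch approximates the ordering a glibc-style resolver uses. The function name `candidate_queries` is hypothetical, and the search list is taken from the example trace above rather than from any particular cluster.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Return the fully qualified names a resolver would try, in order.

    The resolver stops at the first name that resolves, so everything
    listed before the successful entry is a wasted lookup.
    """
    if name.endswith("."):
        return [name]                     # already fully qualified: no search
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + searched      # enough dots: literal name first
    return searched + [absolute]          # too few dots: search domains first


SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]

# With ndots=5 the literal name comes last; with ndots=2 it comes first.
for ndots in (5, 2):
    print(ndots, candidate_queries("api.example.com", SEARCH, ndots))
```

Note that ndots:2 only helps names with at least two dots. A bare external name such as example.com would still walk the search list first.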

NodeLocal DNSCache: Not Optional at Scale

The single most impactful DNS optimization we have implemented is NodeLocal DNSCache. It runs a DNS cache as a DaemonSet on every node, intercepting cluster-internal DNS traffic before it traverses the network to CoreDNS pods. The benefits compound:

The configuration requires a specific iptables setup that intercepts traffic to the cluster DNS service IP. Our deployment uses the upstream manifest with one modification: we increase the cache size to handle larger working sets.

The Conntrack Race Condition

Years before NodeLocal DNSCache, we lost weeks to a Linux kernel race condition involving conntrack and DNS. The symptom: occasional 1-second delays on DNS lookups, no clear pattern, no correlation with load. The cause: when two parallel DNS queries (A and AAAA records) from the same pod and the same UDP socket hit the kernel's connection tracking nearly simultaneously, the two packets raced to insert a conntrack entry and the kernel dropped the loser. DNS over UDP has no handshake to detect the loss, so the client only retried after its resolver timeout expired.

The fix path was rocky. We tried disabling IPv6 lookups, which helped some workloads but broke others. We tried single-request-reopen in resolv.conf, which helped but did not eliminate the issue. Ultimately, NodeLocal DNSCache fixed the problem by ensuring DNS queries hit a local UDP listener rather than traversing conntrack at all.

If you are seeing unexplained 1-second or 5-second DNS delays, this race condition is the prime suspect. The Weaveworks blog post from 2018 documents it in detail; the kernel fix landed in 5.0 but does not cover all the conditions that trigger the race.
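One way to check whether a node is hitting this race is to watch the `insert_failed` counter in the output of `conntrack -S`, which counts packets dropped because a conntrack entry could not be inserted. The Python sketch below sums that counter across CPUs. The helper name is hypothetical, and the sample line layout is an assumption based on typical conntrack-tools output that may differ between versions.

```python
import re


def total_insert_failed(conntrack_stats: str) -> int:
    """Sum the per-CPU insert_failed counters from `conntrack -S` output."""
    return sum(int(n) for n in re.findall(r"\binsert_failed=(\d+)", conntrack_stats))


# Illustrative sample, not captured from a real node.
SAMPLE = (
    "cpu=0 found=0 invalid=3 insert=0 insert_failed=7 drop=7 early_drop=0\n"
    "cpu=1 found=0 invalid=1 insert=0 insert_failed=5 drop=5 early_drop=0\n"
)

print(total_insert_failed(SAMPLE))
```

Take two readings a few minutes apart on an affected node. A counter that keeps rising while DNS delays are occurring is consistent with the race, although other conntrack problems can also increment it.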

External DNS: The Failure That Cascades

Internal DNS failures are bad. External DNS failures cascade. When your applications cannot resolve external service endpoints (payment providers, third-party APIs, your own services in other regions), the resulting timeouts produce visible user impact within seconds.

We configure CoreDNS with multiple upstream resolvers and a strict timeout policy:

forward . 8.8.8.8 1.1.1.1 8.8.4.4 {
    max_fails 2
    expire 10s
    health_check 5s
    policy sequential
    prefer_udp
}

The policy sequential setting ensures we always prefer the first resolver, falling back only on failure. This is intentional: cache locality on the resolver side matters, and round-robin across multiple providers fragments the cache.

Beyond CoreDNS configuration, we run a synthetic DNS prober that checks resolution of approximately 50 critical external hostnames every 30 seconds. This catches upstream DNS provider issues before they show up in application metrics.
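A prober of this kind can be very small. The following Python sketch shows the core loop under stated assumptions: it is not the prober described above, the function name `probe` and the one-second slowness threshold are illustrative, and a real deployment would export the results as metrics rather than return a dictionary.

```python
import socket
import time
from typing import Callable, Dict, Iterable


def probe(hostnames: Iterable[str],
          resolve: Callable[[str], object] = lambda h: socket.getaddrinfo(h, None),
          slow_after: float = 1.0) -> Dict[str, str]:
    """Resolve each hostname once and classify the result as ok, slow, or failed."""
    results: Dict[str, str] = {}
    for host in hostnames:
        start = time.monotonic()
        try:
            resolve(host)
        except OSError:                   # socket.gaierror is a subclass of OSError
            results[host] = "failed"
            continue
        elapsed = time.monotonic() - start
        results[host] = "slow" if elapsed > slow_after else "ok"
    return results
```

The `resolve` parameter defaults to the system resolver, so the probe exercises the same path as applications in the pod, and it can be swapped for a stub in tests. Calling `probe` on a 30-second schedule with the list of critical hostnames gives the coverage described above.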

The Hidden Cost of Headless Services

Headless Services in Kubernetes return all pod IPs in a single DNS response. For services with hundreds of backends, this produces enormous response payloads that exceed the classic 512-byte UDP limit and usually the larger EDNS0 buffer size the client advertises, so the response is truncated and the client retries over TCP. We have seen DNS responses approach 8 KB for a 200-pod headless service.
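A back-of-the-envelope estimate shows how quickly the limits are reached. The Python sketch below computes a lower bound for a plain A-record answer, assuming full name compression and ignoring the EDNS0 OPT record. The service name is hypothetical. SRV answers, which carry a target name per record plus additional A records, are considerably larger than this bound.

```python
def a_response_size(qname: str, records: int) -> int:
    """Estimate the wire size in bytes of a DNS answer holding `records` A records."""
    header = 12
    # QNAME is encoded as length-prefixed labels plus a terminating zero byte,
    # followed by 2 bytes of QTYPE and 2 bytes of QCLASS.
    labels = qname.rstrip(".").split(".")
    question = sum(1 + len(label) for label in labels) + 1 + 4
    # Each answer: 2-byte compression pointer to the name, TYPE (2), CLASS (2),
    # TTL (4), RDLENGTH (2), and a 4-byte IPv4 address.
    per_record = 2 + 2 + 2 + 4 + 2 + 4
    return header + question + records * per_record


print(a_response_size("db.default.svc.cluster.local", 200))
```

Under these assumptions each pod adds 16 bytes, so only about 29 A records fit in a 512-byte response for this name, and a 200-pod answer is over 3 KB before any SRV or additional-section overhead.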

For client-side load balancing scenarios where you genuinely need the full endpoint list (databases with replicas, certain gRPC patterns), this overhead is unavoidable. But for many use cases, EndpointSlice-aware clients or service mesh routing are better solutions than a headless service with massive DNS responses.

Observability for DNS

You cannot optimize what you cannot see. Our DNS observability stack includes:

The Mindset Shift

DNS is not infrastructure plumbing that you set up once and forget. At scale, it is an active system with its own performance characteristics, failure modes, and observability requirements. The teams that treat it with the same rigor as their application services have far fewer mysterious latency incidents. The teams that assume "DNS just works" eventually learn that it does not, usually during a postmortem.