DNS is weird inside k8s on AWS
ndots - why one hostname lookup can become many DNS queries
Pull /etc/resolv.conf from inside almost any Kubernetes pod and you'll see three interesting lines:
nameserver <cluster-dns>
search my-namespace.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
- nameserver - where queries are sent.
- search - a list of suffixes the resolver may append to a name before giving up.
- options ndots:N - the rule that decides when those suffixes get appended.
ndots is the quiet one, and it's the one that surprises people. It says: "If the name being looked up contains fewer than N dots, treat it as a partial/relative name - try it with each search suffix appended first, and only try it as an absolute name if all of those fail."
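To make the three directives concrete, here is a minimal sketch of a parser for them. The sample text and the function name `parse_resolv_conf` are illustrative assumptions (the nameserver IP is just an example value), and the parser deliberately ignores every directive other than these three.

```python
def parse_resolv_conf(text):
    """Parse resolv.conf text into nameservers, search suffixes, and options."""
    conf = {"nameservers": [], "search": [], "options": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            conf["nameservers"].append(values[0])
        elif keyword == "search":
            conf["search"] = values  # a later search line replaces an earlier one
        elif keyword == "options":
            for opt in values:
                key, _, val = opt.partition(":")
                # "ndots:5" becomes {"ndots": 5}; flag-style options become True
                conf["options"][key] = int(val) if val.isdigit() else True
    return conf


# Illustrative sample, shaped like what a pod typically sees (values are examples).
SAMPLE_RESOLV_CONF = """\
nameserver 10.96.0.10
search my-namespace.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
"""

conf = parse_resolv_conf(SAMPLE_RESOLV_CONF)
# glibc falls back to ndots:1 when the option is absent.
print(conf["nameservers"], len(conf["search"]), conf["options"].get("ndots", 1))
```

Inside a pod you could feed it the real file with `parse_resolv_conf(open("/etc/resolv.conf").read())`.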
Kubernetes defaults to ndots:5. That default exists for a good reason: service discovery. It lets your code say redis or payments.billing and have the resolver expand it to redis.my-namespace.svc.cluster.local. Very convenient inside the cluster.
The catch is what happens to real external hostnames, which usually have fewer than 5 dots. Take data.example.com - that's 2 dots. Since 2 < 5, the resolver assumes it's relative and walks the entire search list before trying the real thing:
data.example.com.my-namespace.svc.cluster.local → NXDOMAIN
data.example.com.svc.cluster.local → NXDOMAIN
data.example.com.cluster.local → NXDOMAIN
data.example.com.ec2.internal → NXDOMAIN
data.example.com → the real answer
That's 5 queries to resolve one name, four of them guaranteed misses. And because the resolver asks for both A (IPv4) and AAAA (IPv6) records in parallel, you're realistically looking at up to ~10 DNS queries for a single hostname.
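The expansion rule is small enough to sketch in a few lines. This is a simplified model of glibc-style behaviour, not the resolver's actual code: `candidate_names` is a hypothetical helper, it ignores the A/AAAA doubling, and other resolver implementations may differ in the details. A real resolver also stops at the first candidate that resolves, so the list is the worst case.

```python
def candidate_names(name, search, ndots):
    """Return the names a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, search list only as a fallback.
        return [name] + expanded
    # Too few dots: walk the search list first, try the name as-is last.
    return expanded + [name]


SEARCH = ["my-namespace.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "ec2.internal"]

for ndots in (5, 1):
    names = candidate_names("data.example.com", SEARCH, ndots)
    position = names.index("data.example.com") + 1
    print(f"ndots:{ndots} -> the real name is attempt {position} of {len(names)}")
```

With the search list above, the real name is the fifth attempt under `ndots:5` and the first under `ndots:1`. Multiply by two for the A and AAAA pair.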
The fix is one line: set ndots:1. Then any name with at least one dot is tried as an absolute name first, and data.example.com resolves in a single shot (well, two - A and AAAA). Single-label names like redis still go through the search list. What you give up is search-list-first expansion of dotted short names like payments.billing - which most apps that talk to FQDNs and external hosts don't rely on anyway.
# pod spec
dnsConfig:
  options:
    - name: ndots
      value: "1"
Takeaway: on Kubernetes, the number of DNS queries your app generates is not the number of lookups it makes. ndots is the multiplier, and the default of 5 is tuned for in-cluster discovery, not external traffic.
NodeLocal DNS: the (optional) cache hop you might not know is there
First, the baseline that is effectively universal: every cluster has a cluster DNS service - historically kube-dns, and CoreDNS by default since Kubernetes 1.13. It runs centrally as a Deployment (a handful of pods behind a ClusterIP Service, usually at .10 of the service CIDR, e.g. 10.96.0.10), and every pod's resolv.conf points its nameserver at that Service IP.
NodeLocal DNSCache is not mandatory - it's an optional add-on you opt into. When present, it inserts a per-node caching layer between the pod and CoreDNS, so your pod resolves against the local node first instead of reaching across the network to the central CoreDNS Service on every query:
┌─────────┐   link-local IP   ┌──────────────────┐
│   Pod   │ ────────────────► │  NodeLocal DNS   │  (DaemonSet - one per node)
│ (glibc) │      UDP :53      │  on-node cache   │
└─────────┘                   └────────┬─────────┘
                                       │ on cache miss, forward
                          ┌────────────┴────────────┐
                          │                         │
                   *.cluster.local           everything else
                          │                         │
                          ▼                         ▼
                  ┌───────────────┐       ┌──────────────────┐
                  │ CoreDNS /     │       │ Upstream / VPC   │
                  │ kube-dns      │       │ resolver         │
                  │ (in-cluster)  │       │ (cloud-provided) │
                  └───────────────┘       └──────────────────┘
What it is and why it exists: NodeLocal DNS runs as a DaemonSet - one instance per node - and listens on a link-local address (an IP in the 169.254.0.0/16 range, which is node-local and never routed off the box). Every pod on that node sends its DNS queries to this local instance first. On a cache hit, the query never leaves the node.
This solves two real problems: it cuts latency (no network hop for cached answers), and it avoids a known conntrack/UDP race that caused intermittent 5-second DNS hangs when pods talked to a cluster-wide DNS service directly.
On a cache miss, NodeLocal forwards the query upstream - cluster-internal names (anything under cluster.local, which includes *.svc.cluster.local) go to CoreDNS/kube-dns; everything else goes to the upstream resolver (on AWS, the VPC resolver / AmazonProvidedDNS).
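The forwarding split is a suffix match on the fully expanded name. A minimal sketch, assuming the cluster domain is `cluster.local` and using made-up labels (`cluster-dns`, `vpc-resolver`) for the two upstreams; a real NodeLocal configuration may route additional zones:

```python
CLUSTER_ZONE = "cluster.local"  # assumption: the default cluster domain


def pick_upstream(fqdn, cluster_zone=CLUSTER_ZONE):
    """Return which upstream a cache miss for fqdn would be forwarded to."""
    name = fqdn.rstrip(".").lower()
    if name == cluster_zone or name.endswith("." + cluster_zone):
        return "cluster-dns"   # CoreDNS / kube-dns
    return "vpc-resolver"      # everything else


# The five names tried for data.example.com under the search list shown earlier.
attempts = [
    "data.example.com.my-namespace.svc.cluster.local",
    "data.example.com.svc.cluster.local",
    "data.example.com.cluster.local",
    "data.example.com.ec2.internal",
    "data.example.com",
]
for name in attempts:
    print(f"{pick_upstream(name):13} <- {name}")
```

Under these assumptions the expanded names ending in cluster.local are answered (negatively) by the cluster DNS, while the ec2.internal expansion and the real name head for the VPC resolver.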
The important mental model (when NodeLocal is present): caching only helps for repeated names, and only within a TTL. A positive answer is cached for its record's TTL; a negative answer (NXDOMAIN) is cached for the negative-caching TTL taken from the zone's SOA record, subject to whatever cap the cache itself is configured with. If you re-resolve the same host frequently, NodeLocal absorbs most of it. If your names or TTLs churn, more queries forward upstream.
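A toy model makes the "repeated names, within a TTL" point tangible. This is a sketch of TTL caching in general, not of NodeLocal's implementation; `TtlCache`, `fake_upstream`, the IP address, and the TTL values are all made up for illustration.

```python
class TtlCache:
    """Tiny DNS-style cache: every entry expires at a fixed deadline."""

    def __init__(self, upstream, clock):
        self.upstream = upstream  # callable: name -> (answer, ttl_seconds)
        self.clock = clock        # callable returning the current time in seconds
        self.entries = {}         # name -> (answer, expires_at)
        self.forwarded = 0        # how many queries had to go upstream

    def resolve(self, name):
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]       # still fresh: served from the cache
        self.forwarded += 1
        answer, ttl = self.upstream(name)
        self.entries[name] = (answer, now + ttl)
        return answer


def fake_upstream(name):
    # Hypothetical data: one real host with a 30 s TTL; everything else is
    # NXDOMAIN (modelled as None) with a 5 s negative-caching TTL.
    if name == "data.example.com":
        return ("192.0.2.10", 30)
    return (None, 5)


now = [0.0]  # mutable fake clock so the example is deterministic
cache = TtlCache(fake_upstream, clock=lambda: now[0])

for _ in range(100):
    cache.resolve("data.example.com")
print(cache.forwarded)  # 1: one miss, then 99 cache hits

now[0] = 31.0           # the 30 s TTL has expired
cache.resolve("data.example.com")
print(cache.forwarded)  # 2: the entry had to be refreshed upstream
```

The same logic explains the churn case: a hundred distinct names, or a TTL of a second or two, would push nearly every lookup upstream.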
The EC2 per-ENI DNS packet limit (~1024 pps)
This is the one almost nobody knows until it bites: EC2 caps DNS traffic at roughly 1024 packets per second per network interface (ENI).
Details that matter:
- It's specifically the path to the VPC resolver (the .2 address / AmazonProvidedDNS) - packets to the Route 53 Resolver are what's metered.
- The limit is per-ENI, which in practice usually means per-node (or per-pod, if pods get their own ENIs).
- It is not a cluster-wide pool and it cannot be raised via a support ticket - it's a fixed allowance.
- When you exceed it, AWS doesn't return an error. It silently drops the excess packets. A dropped UDP query gets no response, so the client just waits and eventually reports i/o timeout.
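How you'd actually prove you're hitting it (rather than assuming): the ENA network driver exposes a per-ENI counter, linklocal_allowance_exceeded, that increments each time a packet to a link-local service is dropped for crossing this limit. Keep in mind that the same counter also covers the instance metadata and time sync services, not only DNS. Read it on the node: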
ethtool -S <interface> | grep allowance_exceeded
A few honest caveats, because this is exactly where people jump to conclusions:
- The counter only proves anything if it's increasing during the timeout window. A reading of 0, or a value on an idle/freshly-provisioned node, tells you nothing.
- Hitting this limit requires genuinely high DNS packet rates. A service that resolves a small set of hosts repeatedly - where NodeLocal caches the answers - typically won't get anywhere near 1024 pps. So before blaming this limit, confirm the packet rate is actually high and the counter is actually moving.
- The ndots amplification from concept #1 is what makes this limit easier to hit than you'd expect - because each logical lookup can be ~10 packets - but amplification of a low request rate is still a low request rate.
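Since what matters is whether the counter moves, it helps to compare two readings rather than eyeball one. A minimal sketch: the helper names are hypothetical, the sample text only imitates the general "name: value" shape of `ethtool -S` output with invented numbers, and `sample_interface` assumes `ethtool` is installed and the interface reports this counter.

```python
import re
import subprocess
import time


def read_counter(ethtool_output, counter="linklocal_allowance_exceeded"):
    """Extract one counter value from `ethtool -S` text; None if it is absent."""
    pattern = rf"^\s*{re.escape(counter)}:\s*(\d+)\s*$"
    match = re.search(pattern, ethtool_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def counter_delta(before, after):
    """Packets dropped between two readings, or None if a reading is missing."""
    if before is None or after is None:
        return None
    return after - before


def sample_interface(interface, interval_s=10.0):
    """Read the counter twice on a live node, interval_s seconds apart."""
    def read():
        result = subprocess.run(["ethtool", "-S", interface],
                                capture_output=True, text=True, check=True)
        return read_counter(result.stdout)
    first = read()
    time.sleep(interval_s)
    return counter_delta(first, read())


# Invented sample readings, for illustration only.
BEFORE = "NIC statistics:\n     pps_allowance_exceeded: 0\n     linklocal_allowance_exceeded: 120\n"
AFTER = "NIC statistics:\n     pps_allowance_exceeded: 0\n     linklocal_allowance_exceeded: 175\n"

print(counter_delta(read_counter(BEFORE), read_counter(AFTER)))  # 55
```

A delta of zero across the window in which the timeouts occurred points away from this limit; a positive delta that lines up with the timeouts points toward it.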
Takeaway: the ~1024 pps per-ENI DNS cap is real and unraiseable, and its failure mode (silent drops → timeouts) is genuinely confusing. But it's a ceiling you have to measure, not assume - linklocal_allowance_exceeded is the only thing that proves it.
How the three fit together
These aren't three separate facts - they're a chain:
- Your app makes a DNS lookup. ndots decides how many actual queries that becomes (default ndots:5 → potentially ~10×).
- Those queries hit NodeLocal DNS first; cache hits stay on the node, misses forward to CoreDNS or the VPC resolver.
- The forwarded traffic to the VPC resolver is metered against the ~1024 pps per-ENI limit, and silently dropped past it.
Knowing the chain is what lets you reason about a DNS problem instead of guessing: is my query count inflated (ndots)? is my cache actually being used (NodeLocal + TTLs)? am I genuinely hitting the packet ceiling (linklocal_allowance_exceeded)? Three different questions, three different places to look.
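The chain also lends itself to a back-of-envelope estimate before you go measuring. The sketch below is a deliberately crude model: the workloads and hit ratios are invented, and it treats every cache miss as a packet to the VPC resolver, which overstates things because cluster-internal names go to CoreDNS instead.

```python
ENI_DNS_PPS_LIMIT = 1024  # approximate per-ENI figure quoted above


def upstream_pps(lookups_per_s, queries_per_lookup, cache_hit_ratio):
    """Rough packets/s forwarded upstream after amplification and caching."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return lookups_per_s * queries_per_lookup * (1.0 - cache_hit_ratio)


# Hypothetical workloads: (label, lookups/s, queries per lookup, cache hit ratio)
scenarios = [
    ("quiet service, ndots:5", 40, 10, 0.75),
    ("busy service, ndots:5", 2000, 10, 0.5),
    ("busy service, ndots:1", 2000, 2, 0.5),
]
for label, rate, fanout, hits in scenarios:
    pps = upstream_pps(rate, fanout, hits)
    verdict = "over" if pps > ENI_DNS_PPS_LIMIT else "under"
    print(f"{label}: ~{pps:.0f} pps ({verdict} the ~{ENI_DNS_PPS_LIMIT} pps cap)")
```

An estimate like this only tells you whether the ceiling is plausible for your workload. The counter is still what confirms it.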
That's the toolkit. Where it points in any specific incident is a separate investigation - and proving which link in the chain is actually responsible takes measurement, not assumption.