5 DNS Troubleshooting Tips for Network Teams
DNS is a critical but often ignored component of the networking stack. Monitoring DNS query anomalies can help you detect and correct underlying issues.
“Set it and forget it” is the approach that most network teams follow with their authoritative Domain Name System (DNS). If the system is working and end-users find network connections to revenue-generating applications, services, and content, then administrators will generally say that you shouldn’t mess with success.
Unfortunately, the reliability of DNS often causes us to take it for granted. It’s easy to write DNS off as a background service precisely because it performs so well. Yet this very “set it and forget it” strategy often creates blind spots for network teams by leaving performance and reliability issues undiagnosed. When those undiagnosed issues pile up or go unaddressed for a while, they can easily metastasize into a more significant network performance problem.
The reality is that, like any machine or system, DNS requires the occasional tune-up. Even when it works well, specific DNS errors require attention so minor issues don’t flare up into something more consequential.
I want to share a few pointers for network teams on what to look for when they’re troubleshooting DNS issues.
Set Baseline DNS Metrics
No two networks are configured alike. No two networks have the same performance profile. Every network has quirks and peculiarities that make it unique. That’s why knowing what’s “normal” for your network is important before diagnosing any issues.
DNS data can give you a sense of average query volume over time. For most businesses, this is going to be a relatively stable number. There will probably be seasonal variations (especially in industries like retail), but these are usually predictable. Most businesses see gradual increases in query volume as their customer base or service volume grows, but this also generally follows a set pattern.
It’s also important to look at the mix of query volume. Is most of your DNS traffic to a particular domain? How steady (or volatile) is the mix of DNS queries among various back-end resources? The answers to these questions will be different for every enterprise and may change based on network team decisions on issues like load balancing, product resourcing, and delivery costs.
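As a rough illustration of what a baseline might look like in practice, the sketch below computes the mean and standard deviation of historical query counts, flags counts that stray far from them, and summarizes the query mix by name. The sample numbers, the function names, and the three-standard-deviation threshold are all assumptions made for illustration; your own "normal" will differ.

```python
from collections import Counter
from statistics import mean, stdev


def build_baseline(counts):
    """Return (mean, standard deviation) of historical query counts."""
    return mean(counts), stdev(counts)


def is_anomalous(count, baseline, threshold=3.0):
    """Flag a count more than `threshold` standard deviations from the mean."""
    avg, sd = baseline
    if sd == 0:
        return count != avg
    return abs(count - avg) / sd > threshold


def query_mix(names):
    """Return each queried name's share of total query volume."""
    counts = Counter(names)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Hypothetical hourly query counts from an ordinary week.
history = [1000, 1040, 980, 1010, 995, 1025, 990, 1005]
baseline = build_baseline(history)

print(is_anomalous(1015, baseline))  # False: within normal variation
print(is_anomalous(2500, baseline))  # True: far outside the baseline
print(query_mix(["www", "www", "api", "cdn"]))
```

A fixed threshold is the simplest possible choice. If your traffic has strong seasonal swings, comparing against the same hour of previous weeks may be a better fit than a single global mean.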
Monitor NXDOMAIN Responses
A high rate of NXDOMAIN responses is a clear indication that something’s wrong. Some NXDOMAINs are normal: “fat finger” queries, standard redirect errors, and user-side issues are likely outside of a network team’s control.
A recent Global DNS data report from NS1, an IBM Company, shows that 3-6% of DNS queries receive an NXDOMAIN response for one reason or another. Anything at or near that range is probably to be expected in a “normal” network setup.
When the NXDOMAIN rate climbs into double digits, something bigger is probably happening. The nature of the pattern matters, though. A slow but steady increase in NXDOMAIN responses that tracks overall traffic volume probably points to a long-standing misconfiguration. A sudden spike in NXDOMAINs could be either a localized (but highly impactful) misconfiguration or a DDoS attack.
The key is to keep a steady eye on NXDOMAIN responses as a percentage of overall query volume. Deviation from the norm is usually a clear sign that something is not right — then it becomes a question of why it’s not right and how to fix it. In most cases, a deeper dive into the timing and characteristics of the abnormal uptick will provide clues about why it’s happening.
NXDOMAIN responses aren’t always a bad thing. In fact, they could represent a potential business opportunity. If someone’s trying to query a domain or subdomain of yours and coming up empty, that could indicate that it’s a domain you should buy or start using.
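To make the idea of tracking NXDOMAIN as a percentage of overall query volume concrete, here is a minimal sketch. It assumes you can already export response counts per response code and a list of the names that returned NXDOMAIN. The function names are hypothetical, and the 6% and 10% cutoffs simply mirror the ranges discussed above rather than any standard.

```python
from collections import Counter


def nxdomain_rate(rcode_counts):
    """Return NXDOMAIN responses as a percentage of all responses."""
    total = sum(rcode_counts.values())
    if total == 0:
        return 0.0
    return 100.0 * rcode_counts.get("NXDOMAIN", 0) / total


def classify_rate(rate_pct, typical_max=6.0, alarm=10.0):
    """Bucket a rate using illustrative thresholds (assumptions, not rules)."""
    if rate_pct >= alarm:
        return "investigate"
    if rate_pct > typical_max:
        return "watch"
    return "typical"


def top_missing_names(nx_names, n=3):
    """Return the most frequently queried names that do not exist."""
    return Counter(name.lower() for name in nx_names).most_common(n)


# Hypothetical response counts for one reporting interval.
counts = {"NOERROR": 900, "NXDOMAIN": 50, "SERVFAIL": 50}
rate = nxdomain_rate(counts)
print(rate, classify_rate(rate))  # 5.0 typical
```

The list returned by `top_missing_names` serves two purposes: it points at likely misconfigurations, and it surfaces names people are asking for that you might want to start serving.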
Watch Out for Exposure of Internal DNS Data
One particularly concerning type of NXDOMAIN response is caused by misconfigurations that expose internal DNS zone and record data to the internet. Not only does this kind of misconfiguration weigh on performance by creating unnecessary query volume, but it’s also a significant security issue.
Stale URL redirects are often the cause of exposed internal records. In the upheaval of a merger or acquisition, systems sometimes get pointed at properties that fade away or are repurposed for other uses. The systems are still publicly looking for the old connection but not finding the expected answer. The smaller the workload, the more likely it is to go unnoticed.
Pay Attention to Geography
If you set a standard baseline for where your traffic is coming from, it’s easier to discover anomalies such as DDoS attacks, misconfigurations, and even broader changes in usage patterns as they emerge. A sudden uptick in traffic to a specific regional server is a different kind of issue than a broader increase in overall query volume. Tracking your DNS data by geography helps identify the issue you’re facing and ultimately provides clues on how to deal with it.
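One way to tell a regional uptick apart from a broad increase is to compare each region's query count with its own baseline. The sketch below does that with plain dictionaries. The region labels, the counts, and the 2x growth factor are assumptions chosen for illustration.

```python
def growth_by_region(baseline, current):
    """Return the current/baseline query ratio for each region."""
    ratios = {}
    for region, count in current.items():
        base = baseline.get(region, 0)
        if base == 0:
            # A region with no baseline traffic counts as new if it has any queries.
            ratios[region] = float("inf") if count > 0 else 1.0
        else:
            ratios[region] = count / base
    return ratios


def classify_shift(baseline, current, factor=2.0):
    """Return ('normal' | 'regional' | 'broad', sorted list of spiking regions)."""
    ratios = growth_by_region(baseline, current)
    spiking = sorted(r for r, ratio in ratios.items() if ratio > factor)
    if not spiking:
        return "normal", spiking
    if len(spiking) == len(ratios):
        return "broad", spiking
    return "regional", spiking


# Hypothetical per-region query counts for a comparable time window.
baseline = {"us": 500, "eu": 300, "apac": 200}
print(classify_shift(baseline, {"us": 520, "eu": 310, "apac": 1200}))
# ('regional', ['apac'])
print(classify_shift(baseline, {"us": 1500, "eu": 900, "apac": 600}))
# ('broad', ['apac', 'eu', 'us'])
```

Comparing each region against its own history, rather than comparing shares of the total, keeps a spike in one region from making every other region look like it shrank.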
Check SERVFAILs for Misconfigured Alias Records
Alias records are a frequent source of misconfigurations and deserve regular audits in their own right. I’ve found that an increase in SERVFAIL responses — whether a sudden spike or a gradual increase — can often be traced back to problems with alias records.
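A regular alias audit can be as simple as following each alias chain and checking where it ends. The sketch below works on a hypothetical in-memory export of zone records, mapping each name to a `(type, value)` pair, and uses CNAME as a stand-in for whatever alias type your provider supports. It only knows about names inside that export, so targets hosted elsewhere would need a real lookup instead.

```python
def audit_alias(name, records, max_hops=8):
    """Follow an alias chain and return 'ok', 'loop', 'dangling', or 'too_long'."""
    seen = set()
    current = name
    for _ in range(max_hops):
        if current in seen:
            return "loop"  # the chain points back at itself
        seen.add(current)
        record = records.get(current)
        if record is None:
            return "dangling"  # the alias points at a name with no record
        rtype, value = record
        if rtype != "CNAME":
            return "ok"  # the chain ends at a non-alias record
        current = value
    return "too_long"


# Hypothetical zone export: name -> (record type, value).
records = {
    "www.example.com": ("CNAME", "app.example.com"),
    "app.example.com": ("A", "192.0.2.10"),
    "old.example.com": ("CNAME", "gone.example.com"),
    "a.example.com": ("CNAME", "b.example.com"),
    "b.example.com": ("CNAME", "a.example.com"),
}

for name in ("www.example.com", "old.example.com", "a.example.com"):
    print(name, audit_alias(name, records))
```

Names that come back as anything other than `ok` are reasonable first places to look when SERVFAIL counts start to climb.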
NOERROR NODATA? Consider IPv6
NXDOMAIN responses are pretty straightforward: the queried name wasn’t found. Things get a little more nuanced when the response comes back as NOERROR but contains no answer. There’s no dedicated response code for this situation; it’s usually known as a NOERROR NODATA response, where the answer count is “0”. NOERROR NODATA means that the name exists, but it has no record of the type that was requested.
If you’re seeing a lot of NOERROR NODATA responses, the resolver is usually looking for an AAAA record. I’ve found that adding support for IPv6 usually fixes the problem.
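To check whether AAAA lookups really are behind your NOERROR NODATA responses, you can group those responses by the query type that triggered them. The sketch below assumes a hypothetical log of `(query type, response code, answer count)` tuples; the field layout and function names are illustrative and not tied to any particular DNS server's log format.

```python
from collections import Counter


def classify_response(rcode, answer_count):
    """Label a response, treating NOERROR with zero answers as NODATA.

    Note: referrals also carry an empty answer section. This sketch
    ignores that case and assumes the log holds final answers only.
    """
    if rcode == "NOERROR" and answer_count == 0:
        return "NODATA"
    return rcode


def nodata_by_qtype(log):
    """Count NODATA responses per query type."""
    return Counter(
        qtype
        for qtype, rcode, answer_count in log
        if classify_response(rcode, answer_count) == "NODATA"
    )


# Hypothetical log entries: (query type, response code, answer count).
log = [
    ("AAAA", "NOERROR", 0),
    ("AAAA", "NOERROR", 0),
    ("A", "NOERROR", 1),
    ("MX", "NOERROR", 0),
    ("A", "NXDOMAIN", 0),
]

print(nodata_by_qtype(log).most_common())  # [('AAAA', 2), ('MX', 1)]
```

If AAAA dominates the result, publishing AAAA records for the affected names is the likely fix. If another type dominates, the cause is probably something other than missing IPv6 support.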
DNS Cardinality and Security Implications
In the world of DNS, there are two types of cardinality to worry about. Resolver cardinality refers to the number of resolvers querying your DNS records. Query name cardinality refers to the number of different DNS names for which you receive queries each minute.
Measuring DNS cardinality is important because it may indicate malicious activity. Specifically, an increase in DNS query name cardinality can indicate a random label attack or probing of your infrastructure at a mass level. A sudden increase in resolver cardinality is likely an indication of some sort of attack, and may mean that you are being targeted by a botnet.
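Both kinds of cardinality can be measured by counting distinct values per minute. The sketch below assumes a hypothetical query log of `(epoch seconds, resolver IP, query name)` tuples and normalizes names to lowercase without a trailing dot so that the same name is not counted twice.

```python
from collections import defaultdict


def cardinality_per_minute(queries):
    """Return {minute: (distinct resolvers, distinct query names)}."""
    resolvers = defaultdict(set)
    names = defaultdict(set)
    for timestamp, resolver_ip, qname in queries:
        minute = int(timestamp) // 60
        resolvers[minute].add(resolver_ip)
        names[minute].add(qname.lower().rstrip("."))
    return {m: (len(resolvers[m]), len(names[m])) for m in sorted(resolvers)}


# Hypothetical query log: (epoch seconds, resolver IP, query name).
queries = [
    (0, "198.51.100.1", "www.example.com"),
    (10, "198.51.100.1", "WWW.example.com."),
    (20, "198.51.100.2", "api.example.com"),
    (65, "198.51.100.3", "x1.example.com"),
    (70, "198.51.100.3", "x2.example.com"),
    (80, "198.51.100.3", "x3.example.com"),
]

print(cardinality_per_minute(queries))  # {0: (2, 2), 1: (1, 3)}
```

In the second minute of this sample, one resolver asks for three never-before-seen labels, which is the shape a random label attack tends to take at much larger scale. Exact sets are fine for a sketch; at production query volumes an approximate counter would likely be needed.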
Conclusion
These pointers should help you better understand the impact of DNS query behavior and some steps you can take to get your DNS to a healthy state. Feel free to comment below on any other tips you’ve learned throughout your career.