Designing your external DNS solution

by Ed Fisher on 2010-01-30

in Architecture

One of my favourite subjects is DNS. I really love resolvers and queries, discussing the relative merits of recursion versus iteration, the different types of records that are out there, and how to tune DNS just so. Primaries and secondaries, AXFR or IXFR, tuning the TTL of your individual records so that you can propagate a change to the world in seconds, while reducing traffic as much as possible; BIND versus Windows DNS services, TXT records and SOA records. I love it! I guess you could say that when it comes to DNS, folks like me are just

NERDS!!!

But hey, I don’t run out and buy DNS & BIND every time an updated version comes out, so it’s okay. I have it all under control. And speaking of control, I convinced the VP of IT that we need to look at bringing our external DNS in house. NetSol (our current DNS provider) requires a minimum TTL of 3600 seconds; he wanted to use DNS updates to fail over things like our VPN in the event of a problem, and didn’t like the sound of clients waiting an hour before picking up on an A record change. So here we’re going to discuss the requirements for hosting your own external DNS: the service requirements, design decisions, and security considerations. We’re not going to discuss any particular operating system solution, nor are we going to discuss other hosted solutions at this time. This is all high level design.

 

This assumes you know what DNS is, how the various records work, that there is a value called the TTL, and the difference between primary and secondary servers. Before we begin, we need to get our heads in the game, and level-set the relative importance of DNS. Ready? Here we go.

Every single application opened, email sent, web site visited, file transferred, and VPN client connected, starts with DNS.

Sure, in the microcosm, you can use hard coded ip.addrs, or HOSTS file entries, or even local broadcasts, but when it’s time to roll production, DNS is at the start of EVERYTHING. A well designed and properly scaled DNS solution makes everything else run smoothly. A DNS fail causes an EVERYTHING fail, so let’s make sure we don’t have a DNS failure…m’kay?

Like most solutions, we want our DNS solution to be

  1. Reliable,
  2. Responsive,
  3. Flexible,
  4. Scalable, and
  5. Secure.

Like most solutions, management wants our DNS solution to be

Pop Quiz fail.   

You know there are some PHBs that are going to check all three. I have something for them at the end of this post. Since I am more a geek than a management drone, let’s cover what we want.

Reliable

A very wise man has this to say about reliability…

Redundancy means never getting that 2 AM phone call.

We want redundancy; we want a solution that precludes any single point of failure. We are required by the InterNIC to have two DNS servers, but they say nothing about having them on different networks. Some fairly high profile frack-ups have resulted from major players putting all their DNS servers on the same network. Don’t be that guy. Separate your DNS servers into multiple datacenters. You don’t have multiple datacenters? Really? Then to be blunt, hosting your own DNS is not for you. Consider staying with NetSol, or Enom, or look to an outsource provider like UltraDNS. Those of you with multiple datacenters should also consider shopping your pipes so that both datacenters do not use the same ISP, unless you are dealing with Tier 1 providers, or your slice of the subnet has redundant connectivity.
The servers that you will run your DNS service upon should be current spec, reliable, and well monitored. They do NOT have to be major powerhouses…you can even virtualise them. You just need to make sure that you do not constantly have to reboot them, re-cable them, or deal with spotty power or cheap drives. We’re shooting for five nines of reliability, and we won’t get there with cheap hardware.
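
To make the "different networks" point concrete, here is a minimal Python sketch that flags name servers sharing a network. The server names and addresses are hypothetical (documentation ranges), and treating "same /24" as "same network" is only a rough assumption; real diversity also depends on datacenter, upstream ISP, and routing.

```python
import ipaddress

def shared_networks(ns_addrs, prefix_len=24):
    """Group name server addresses by their enclosing IPv4 network.

    Returns only the networks that hold more than one name server,
    i.e. the candidates for a single point of failure.
    """
    groups = {}
    for name, addr in ns_addrs.items():
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups.setdefault(net, []).append(name)
    return {net: sorted(names) for net, names in groups.items() if len(names) > 1}

# Hypothetical name servers using documentation address ranges.
servers = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.20",     # same /24 as ns1
    "ns3.example.com": "198.51.100.10",  # a different network
}

for net, names in shared_networks(servers).items():
    print(f"{net}: {', '.join(names)} share a network")
```

An empty result is what you are aiming for; anything printed is a pair of servers that one network outage could take down together.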

Responsive

Remember what we said up top? DNS is at the start of EVERYTHING. If you are waiting on your DNS query to resolve, you are NOT moving data. When potential clients are trying to view your website, they may be waiting for their browser’s query to their DNS server to recurse its way across the entire Internet, querying root servers for .com servers, querying .com servers for YOUR servers, then finally sending the query to your servers, so that they can provide the ip.addr to the client. It can take 200 milliseconds or more for the client to get an answer, and the browser won’t start requesting the page until then. That’s 2/10ths of a second, which is noticeable by normal people. We want our DNS servers to be quick like bunnies when they receive a query, so we need to make sure that they are purpose built if they are physical, or they have reserved processor time if they are virtual. We do NOT want to introduce extra latency into our clients’ experience.

Flexible

When we make a change to our services, we want our clients to start using the new service, and stop using the old service, as soon as possible. We don’t want them to notice any interruption in service. The secret to a successful cut-over (and the occasional though inevitable fallback) is to make it as quick, painless, and invisible as possible. We do this by controlling our TTLs. The Time To Live value is usually expressed in seconds, and tells all DNS servers between our clients and our own authoritative servers, as well as most modern clients, just how long they can cache a record before they have to look it up again. A TTL of 3600 (NetSol’s minimum) means that any system that has resolved one of our records can use that answer for an hour before they need to check it again. That blank look from the guy in the third row tells me I have to explain that a little more.

Say you are planning on cutting over to the new web server Friday at lunch time. You have the new one ready to go, it’s passed all of its tests. All we need to do is change the A record for www.example.com from 1.2.3.4 to 1.2.3.5. 12:00 hits, so you log on to your NetSol portal and change the ip.addr, hit apply, and tell everyone it’s done. You look at your new server to see that nobody is hitting it. You look at your old server to see the same volume of traffic as it always had. You see, DNS servers and clients out there are not going to ask your DNS server for the ip.addr to www.example.com until the TTL expires from the last time they asked. Sure, within an hour all your traffic should start hitting the new address…however, what do you do if it is at that point you realise the new server is borked and can’t handle the load? You quickly log back onto the NetSol portal, change the address back, and then sit there answering phone calls for the next hour from people hitting the new site because they got the new ip.addr, and it hasn’t aged out of their resolver cache yet.
If we control the horizontal, and the vertical, AND the TTL, then we can strike the right balance between quick changes, and low traffic. Start by listing all the services you want in your DNS that shouldn’t ever have to change without advance notice. Then get consensus on the longest period of time (in seconds) those services could be unavailable should an unplanned change be required. Divide that by 2, and use the result (rounded to the next round number) as your default TTL. If the collective mind comes up with ten minutes as an acceptable outage, then your TTL is 300.

10 minutes × 60 seconds per minute = 600 seconds; 600 seconds ÷ 2 = 300 seconds
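
The same rule of thumb as a tiny Python sketch, in case you want to try other outage windows. The function name is made up for illustration, and the "round to the next round number" step is left to you.

```python
def default_ttl(acceptable_outage_minutes):
    """Half of the acceptable outage window, expressed in seconds."""
    outage_seconds = acceptable_outage_minutes * 60
    return outage_seconds // 2

for minutes in (10, 30, 60):
    print(f"{minutes} minute outage tolerated -> default TTL {default_ttl(minutes)}")
```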

With a TTL of 300, any system, whether client or intermediate DNS server, that resolves one of your names will cache that record for five minutes. Remember that caching is intended to reduce traffic and increase performance by reducing the need to look up the same record again and again. On the day of any planned change, you can reduce the TTL for the specific record involved to 5. Do that at least five minutes (one full default TTL) before the change, so that every cached copy carrying the old 300 second TTL has expired. That way, when you make the change, essentially all clients will pick it up and start using it immediately. Once you are sure the new system is up and running, you can move the TTL back up to 300, but should the new system bork, you can change the ip.addr back to the original and, just as quickly, everything is back to normal.
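
A rough sketch of the timing around a planned change, in Python. The function name, the dictionary keys, and the cut-over date are illustrative assumptions, and it assumes that caches honour the TTLs they are given. The one non-obvious point it encodes is that a cache which fetched the record just before you lowered the TTL can keep it for a full normal TTL, so the lowering has to happen at least that long before the change.

```python
from datetime import datetime, timedelta

def cutover_plan(cutover, normal_ttl=300, change_ttl=5):
    """Return the key moments around a planned record change."""
    return {
        # Copies cached under the normal TTL need this long to expire.
        "lower_ttl_no_later_than": cutover - timedelta(seconds=normal_ttl),
        "change_record_at": cutover,
        # After the change, stale answers live at most change_ttl seconds.
        "old_answer_gone_by": cutover + timedelta(seconds=change_ttl),
    }

# Hypothetical cut-over: a Friday at noon.
plan = cutover_plan(datetime(2010, 1, 29, 12, 0, 0))
for step, when in plan.items():
    print(f"{step}: {when:%H:%M:%S}")
```

The same arithmetic covers the fallback: if the new system borks, the window during which clients still hold the bad address is the short TTL, not the normal one.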

Sure, short TTLs will result in more DNS traffic. Really short TTLs will result in LOTS more DNS traffic. That is no big deal though. A DNS query, whether from a client or another DNS server, is a whopping big 64 bytes, plus the size of the name requested. The reply could be much larger, depending upon how many records are in the response, but will almost certainly be smaller than 512 bytes. Let’s say your query was for a relatively large name, and the response included several records, so we’ll call that whole transaction 600 bytes. Remember, DNS uses UDP, so there is no connection setup or teardown, just a question and an answer. Further, let’s say you have a puny little T1, which weighs in at 1.544 Mbps. That 600 byte transaction is 4,800 bits, so one DNS query per second will use roughly 0.3% of your available pipe. I think you can afford a little more DNS traffic for the day!
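
Here is the back-of-the-envelope bandwidth arithmetic as a Python sketch, counting bits rather than bytes. It uses the generous 600 byte transaction from above and the nominal T1 line rate of 1.544 Mbps; it ignores IP and UDP header overhead and lumps query and response onto one direction of the link, so treat the numbers as a pessimistic ballpark rather than a measurement.

```python
T1_BITS_PER_SECOND = 1_544_000   # nominal T1 line rate
TRANSACTION_BYTES = 600          # query plus response, deliberately generous

def pipe_share(queries_per_second, link_bps=T1_BITS_PER_SECOND,
               transaction_bytes=TRANSACTION_BYTES):
    """Fraction of the link consumed by the given DNS query rate."""
    bits_per_second = queries_per_second * transaction_bytes * 8
    return bits_per_second / link_bps

for qps in (1, 10, 50):
    print(f"{qps:>3} queries/s -> {pipe_share(qps):.2%} of a T1")
```

Even at tens of queries per second the T1 has room to spare, which is the point: the extra traffic from a short TTL on change day is affordable.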

Scalable

As our pipes grow and our relative popularity grows, so too will the volume of DNS queries. As our servers begin to reach higher loads, we’ll need to scale out by adding more name servers. Whether we put more than one in each datacenter, or we obtain more datacenters to host one name server each depends on our circumstance, but we’re good either way. We can have up to 13 name servers without issue, or we can assign VIPs and use network load balancing if we ever really need more. I ran the DNS for over 250 domains, which received an aggregate average of over 3,000,000 emails each month, and 5,000 web site hits per day (plus VPN, FTP for EDI, SRV records for IM, and a myriad of other things I don’t have stats on) with only two DNS servers for almost two years before we upgraded to a more robust solution. Those zones used a 300 second TTL for every record by default, and we made scheduled changes with 5 second TTLs almost every week. Incidentally, DNS traffic, as a percentage of all incoming traffic, never broke 2% in all the time I monitored it.

Secure

This is where the rubber hits the road. Since (say it all together now) Everything starts with DNS, an attacker who can take down our DNS has effectively launched a DoS attack on everything we own. An attacker who can compromise our DNS now owns us; or rather, our clients. So we want to make sure our DNS servers are always fully patched, our a/v is installed, running, and updated daily. We need to minimise the number of admins, make sure they use unique credentials, and that they can only manage our DNS from the inside (VPN is acceptable, but no direct connections over the Internet.) Only port 53 should be open to our DNS servers from the Internet: UDP for ordinary queries, plus TCP for the occasional response that is too big for a UDP reply. Our credentials to our registrar must be well protected too, lest an attacker simply hijack our domain by telling the world that his servers are now authoritative for our zones. Do we really want to lasso the fail whale the same way Twitter did?

Harden your servers

These servers should be dedicated hosts, so follow your operating system’s guidelines regarding system hardening. That should include disabling all unnecessary services and installing no software that is not required. The servers should also be closely monitored to ensure that they are always up to date on patches and antivirus definitions.

Restrict zone transfers to your name servers only

Why make the bad guys’ job any easier? By restricting zone transfers so that only YOUR servers can pull a full zone file, an attacker is going to have to work a lot harder to enumerate all your names and network ranges. Sure, a determined attacker has other ways of getting this data…so make him work for it.
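
Every DNS server has its own setting for this, so here is only the logic, as a minimal Python sketch: a zone transfer request is honoured only when it comes from one of your own secondaries. The addresses are hypothetical documentation addresses, and `may_transfer` is an illustrative name, not any product's API.

```python
import ipaddress

# Hypothetical secondaries that are allowed to pull full zones.
ALLOWED_TRANSFER_SOURCES = [
    ipaddress.ip_network("192.0.2.53/32"),
    ipaddress.ip_network("198.51.100.53/32"),
]

def may_transfer(source_ip):
    """True only if the AXFR/IXFR request comes from a known secondary."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_TRANSFER_SOURCES)

print(may_transfer("192.0.2.53"))    # one of ours
print(may_transfer("203.0.113.7"))   # a stranger asking for the whole zone
```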

Disable recursion

DNS servers are nice, they try to be helpful. If you ask them a question, they will try their best to find you an answer. You should encourage this behaviour in your internal DNS servers. The ones hanging out on the Internet however, should ONLY answer queries for the domains they host. Anything else should be told to punt off. You do this by disabling recursion. Just make sure no one configures your internal servers to forward to your external ones.
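
What "only answer for the domains they host" boils down to, sketched in Python. The zone list and function names are hypothetical, and a real server makes this decision internally once recursion is switched off. REFUSED is the standard DNS response code for a query the server declines to handle.

```python
HOSTED_ZONES = {"example.com", "example.net"}   # hypothetical hosted zones

def is_hosted(qname):
    """True if qname is one of our zones or a name beneath one."""
    name = qname.rstrip(".").lower()
    return any(name == zone or name.endswith("." + zone) for zone in HOSTED_ZONES)

def handle(qname):
    """Answer from local zone data, or refuse; never go looking elsewhere."""
    return "ANSWER" if is_hosted(qname) else "REFUSED"

for q in ("www.example.com.", "EXAMPLE.NET", "www.google.com", "badexample.com"):
    print(f"{q}: {handle(q)}")
```

Note the leading dot in the suffix test: without it, a name like badexample.com would be mistaken for something inside example.com.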

Secure against cache pollution

A number of attacks against DNS have to do with providing answer information in a query. Say I send to a vulnerable server a request to resolve the name www.google.com, and in the query I populate the answer section with information that www.example.com has a TTL of 36000 and is at 1.2.3.4. A vulnerable server will resolve www.google.com, and at the same time, update their cache with the bogus A record for www.example.com. For the next ten hours, any client that asks this server for www.example.com will be given the answer 1.2.3.4. If that is my malware infested hack-in-a-box, bad things result. In the good old days, when the tubes were a kinder and gentler place, including answers with queries was just a helpful way to speed along the propagation of changes. Today, there is no legitimate reason for that. Configure your servers to disregard any data contained in the answer section of a query and you’ll be fine.

Use a hidden primary

Since we are NOT going to put domain controllers on the Internet, we’re dealing with a single writable primary server, supplemented by one or more read only secondaries. If an attacker wants to alter your zone data, s/he will have to modify it on the primary and let it XFR to the secondaries. While there is always the risk of an inside job, the easiest way to avoid this external threat is to implement a ‘hidden primary.’ Keep your primary DNS server on the inside, and don’t permit any traffic into it from the Internet. Have at least two secondaries configured to pull the zones from the primary, and let those servers handle all queries. That way, an attacker is only able to attack a read-only version of your zone.

Monitor your services

While DNS won’t require a great deal of care and feeding, it should be monitored very closely to ensure that it is up, responsive, and giving out the right answers. I like to set up an A record for canary (think canary in a coal mine) and then have a service like Pingdom monitor that name by querying my DNS servers at regular intervals. Pingdom can email you, send you an SMS message, or even tweet you if it detects a change to the record, or a failure to respond to a query by your DNS servers. They can also monitor other services, including websites. Sure, you can use your internal systems too, but for US$10 a month, getting that outside view is well worth it.
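
If you want a home-grown version of the canary check alongside the outside service, a minimal Python sketch might look like this. The canary name and address are hypothetical. By default it resolves through the operating system's resolver via `socket.gethostbyname`, which goes through caches rather than asking your authoritative servers directly, so it is an assumption-laden stand-in for what a real monitoring service does.

```python
import socket

def check_canary(name, expected_ip, resolve=socket.gethostbyname):
    """Compare what the canary name resolves to against what it should be."""
    try:
        actual = resolve(name)
    except OSError as exc:   # socket.gaierror is a subclass of OSError
        return f"ALERT: {name} did not resolve ({exc})"
    if actual != expected_ip:
        return f"ALERT: {name} resolved to {actual}, expected {expected_ip}"
    return f"OK: {name} -> {actual}"

# Demonstration with stand-in resolvers, so no network access is needed.
print(check_canary("canary.example.com", "192.0.2.80",
                   resolve=lambda name: "192.0.2.80"))
print(check_canary("canary.example.com", "192.0.2.80",
                   resolve=lambda name: "203.0.113.9"))
```

Run it from cron or your monitoring system and alert on anything that starts with ALERT. The `resolve` parameter exists so you can swap in a function that queries a specific name server.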

Remember, security is all about a layered approach. The more layers you implement, the better your chances of avoiding incidents. Putting the above into practice will not guarantee you will never suffer a DNS issue, but I’ve been managing DNS for over ten years, have done so for over 250 domains at a time, and I have <knocking wood> never had a DNS incident caused by anything other than me omitting a trailing dot </knocking wood.>  That, however, is a story for another time.

the trailing dot of shame, whose absence stopped incoming email for over an hour.
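
The story itself is not told here, but as a general illustration of why that dot matters: in a standard zone file, a name that ends in a dot is fully qualified, and any name that does not is treated as relative and gets the zone's origin appended. A small Python sketch of that rule, with a hypothetical mail host and origin:

```python
def expand(name, origin="example.com."):
    """Expand a zone-file name the way a name server reads it."""
    if name == "@":               # "@" is shorthand for the origin itself
        return origin
    if name.endswith("."):        # already fully qualified
        return name
    return f"{name}.{origin}"     # relative: origin gets appended

print(expand("mail.example.com."))   # what you meant
print(expand("mail.example.com"))    # what you get without the dot
```

The second line comes out as mail.example.com.example.com., which is exactly the kind of target that quietly stops mail from arriving.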

So, if you are considering hosting your own DNS, you should have a much better handle on things, and feel more comfortable about it. As critical as DNS is, you don’t have to be named Cricket to set it up well (though it helps.)

As promised, here’s a little something for the PHBs that checked all three boxes in their section. I’d like to send out a very special dedication to all of them.

I Want It All-Queen

This has been one of my longer posts, and was intentionally left as o/s agnostic as much as possible. I welcome any questions you might have in the comments. Please leave them, any suggestions you’d like to make, or your favourite Queen track there.
