CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups
KyoungSoo Park, Electrical Engineering, KAIST
EE513/IS535
DNS Background
• Domain Name System (DNS)
  • Distributed database of resource records (RRs)
  • Typically a (name, IP) pair lookup (A record)
• Hierarchical name resolution
  • www.kaist.ac.kr: . -> .kr -> .ac.kr -> kaist.ac.kr
  • Local resolver (local DNS server) handles the request
• Caching and redundancy
  • Each server aggressively caches RRs
  • More than two local resolvers
DNS Name Resolution Example (Iterated Query)
• Host at swan.kaist.ac.kr wants the IP address for www.princeton.edu
• Iterated query:
  • Contacted server replies with the name of the server to contact
  • “I don’t know this name, but ask this server”
[Figure: requesting host swan.kaist.ac.kr asks the local DNS server ns.kaist.ac.kr (1); the local server queries the root DNS server (2, 3), the .edu TLD DNS server (4, 5), and the authoritative DNS server dns.princeton.edu (6, 7) in turn, then returns the answer to the host (8).]
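DNS Name Resolution Example (Recursive Query)
• Recursive query:
  • Puts the burden of name resolution on the contacted name server
  • Heavy load?
[Figure: requesting host swan.kaist.ac.kr asks the local DNS server ns.kaist.ac.kr (1), which asks the root DNS server (2), which asks the .edu TLD DNS server (3), which asks the authoritative DNS server dns.princeton.edu (4); the answer returns along the same path in reverse (5-8). The target host in the figure is labelled www.umass.edu.]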
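The iterated scheme can be sketched with a toy, in-memory hierarchy. This is a minimal illustration, not a real resolver: the `SERVERS` table, the server labels `root` and `tld-edu`, the helper names, and the address 192.0.2.10 (taken from a documentation-only range) are all assumptions made up for the example.

```python
# Toy model of iterated resolution: every "server" is a dict, not a real nameserver.
# Server labels and the IP address are hypothetical.
SERVERS = {
    "root": {"referrals": {"edu": "tld-edu"}, "answers": {}},
    "tld-edu": {"referrals": {"princeton.edu": "dns.princeton.edu"}, "answers": {}},
    "dns.princeton.edu": {"referrals": {}, "answers": {"www.princeton.edu": "192.0.2.10"}},
}


def query(server, name):
    """Return ("answer", ip), ("referral", next_server), or ("nxdomain", None)."""
    zone = SERVERS[server]
    if name in zone["answers"]:
        return "answer", zone["answers"][name]
    # Look for the longest suffix of the name that this server delegates.
    labels = name.split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in zone["referrals"]:
            return "referral", zone["referrals"][suffix]
    return "nxdomain", None


def resolve_iteratively(name, start="root", max_hops=10):
    """Follow referrals from the root, as the local DNS server does."""
    server = start
    path = [server]
    for _ in range(max_hops):
        kind, value = query(server, name)
        if kind == "answer":
            return value, path
        if kind == "nxdomain":
            return None, path
        server = value          # "I don't know this name, but ask this server"
        path.append(server)
    raise RuntimeError("too many referrals")


ip, path = resolve_iteratively("www.princeton.edu")
```

In the recursive variant, each contacted server would call the next one itself instead of handing a referral back to the caller.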
DNS: Root Name Servers
• A.root-servers.net to M.root-servers.net
• Each server is a cluster of replicated servers
  • Each IP is shared by many machines (e.g., IP anycast)
• Responsible for top-level domain NS records
• How do we know the IP addresses of root servers?
• 13 root name servers worldwide
  • a: Verisign, Dulles, VA
  • b: USC-ISI, Marina del Rey, CA
  • c: Cogent, Herndon, VA (also LA)
  • d: U Maryland, College Park, MD
  • e: NASA, Mt View, CA
  • f: Internet Software Consortium, Palo Alto, CA (and 36 other locations)
  • g: US DoD, Vienna, VA
  • h: ARL, Aberdeen, MD
  • i: Autonomica, Stockholm (plus 28 other locations)
  • j: Verisign (21 locations)
  • k: RIPE, London (also 16 other locations)
  • l: ICANN, Los Angeles, CA
  • m: WIDE, Tokyo (also Seoul, Paris, SF)
TLD and Authoritative Servers
• Top-level domain (TLD) servers:
  • Responsible for com, org, net, edu, etc., and all top-level country domains (kr, uk, fr, ca, jp)
  • Network Solutions maintains servers for the com TLD
  • Educause for the edu TLD
• Authoritative DNS servers:
  • An organization’s DNS servers, providing authoritative hostname-to-IP mappings for the organization’s servers (e.g., Web, mail)
  • Can be maintained by the organization or a service provider
Local Name Server
• Does not strictly belong to the hierarchy
• Each ISP (residential ISP, company, university) has one set
  • Also called the “default name server”
• When a host makes a DNS query, the query is sent to its local DNS server
  • Acts as a proxy, forwards the query into the hierarchy
DNS: Caching and Updating Records
• Once (any) name server learns a mapping, it caches the mapping
  • Why caching?
  • Cache entries time out (disappear) after some time
• TLD servers are typically cached in local name servers
  • Thus root name servers are not often visited
• Typical cache hit rate: 80-90% at the local DNS server
• Negative caching of DNS queries (RFC 2308)
  • Caches negative responses (e.g., non-existent names)
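A rough sketch of the caching behavior described above: entries expire after their TTL, and a negative answer is cached like any other entry. The class name `TtlCache`, the injected `clock`, the TTL values, and the example names and address are assumptions for illustration only; real resolvers follow the more detailed rules of RFC 2308.

```python
import time


class TtlCache:
    """Tiny resolver-style cache: positive and negative entries both expire."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value or None, expiry time)

    def put(self, name, value, ttl):
        # value=None records a negative answer ("this name does not exist").
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        """Return (hit, value); hit is False on a miss or an expired entry."""
        entry = self._entries.get(name)
        if entry is None:
            return False, None
        value, expiry = entry
        if self._clock() >= expiry:
            del self._entries[name]  # the entry timed out
            return False, None
        return True, value


# A fake clock makes the expiry behavior easy to follow.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
cache.put("www.princeton.edu", "192.0.2.10", ttl=300)   # hypothetical address
cache.put("no-such-host.example", None, ttl=60)          # negative entry
```

A hit on the negative entry returns `(True, None)`, so the resolver can answer "no such name" without contacting the hierarchy again until the entry expires.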
DNS Records
• DNS: distributed DB storing resource records (RRs)
• RR format: (name, value, type, ttl)
• Type=A
  • name is hostname
  • value is IP address
• Type=CNAME
  • name is alias name for some “canonical” (the real) name
  • www.ibm.com is really www.ibm.com.cs186.net
  • value is canonical name
• Type=NS
  • name is domain (e.g., foo.com)
  • value is hostname of authoritative name server for this domain
• Type=MX
  • value is name of mail server associated with name
DNS Protocol, Messages
• DNS protocol: query and reply messages, both with the same message format
• Message header
  • Identification: 16-bit number for the query; the reply to the query uses the same number
  • Flags:
    • query or reply
    • recursion desired
    • recursion available
    • reply is authoritative
DNS Protocol, Messages (continued)
• Questions: name and type fields for a query
• Answers: RRs in response to the query
• Authority: records for authoritative servers
• Additional information: additional “helpful” info that may be used
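The header and question fields listed on the two message slides can be sketched by packing a query by hand. This is an illustration under assumptions, not a full DNS implementation: the function names and the query ID 0x1234 are made up, the layout follows RFC 1035 (12-byte header, then the question), and names with a trailing dot or non-ASCII labels are not handled.

```python
import struct

# Flag bit masks within the 16-bit flags field (RFC 1035).
QR, AA, RD, RA = 0x8000, 0x0400, 0x0100, 0x0080


def build_query(name, query_id, recursion_desired=True):
    """Build a query for an A record: 12-byte header plus one question."""
    flags = RD if recursion_desired else 0
    # id, flags, then question/answer/authority/additional counts
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def parse_header(message):
    """Decode the fixed header fields named on the slide."""
    ident, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    return {
        "id": ident,
        "is_reply": bool(flags & QR),
        "authoritative": bool(flags & AA),
        "recursion_desired": bool(flags & RD),
        "recursion_available": bool(flags & RA),
        "questions": qd,
        "answers": an,
        "authority": ns,
        "additional": ar,
    }


packet = build_query("www.princeton.edu", query_id=0x1234)
header = parse_header(packet)
```

A reply would carry the same identification value with the query/reply bit set, which is how a resolver matches answers to outstanding queries.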
CoDNS
Two Kinds of DNS Problems
• Server-side problems
  • Problems in server infrastructure
  • [Danzig92], [Jung01]
  • Nameserver and resolver bugs
  • Misconfigurations by operators
• Client-side problems
  • Between LDNS and clients
  • LDNS cache hit rate: 80-90%
  • CoDeeN experiences problems
Local DNS Lookup Problems
• Local DNS lookup failures
  • 5+ seconds of delay for cached records
  • Frequent and widely distributed
• Unpredictable service
  • Directly affects user-perceived latency
  • Random delays in web browsing
  • Critical for HTTP proxies, web crawlers, and busy mail servers
Experiment for Local Problems
• Local name lookup every 6 seconds
  • “yyy.domain” on xxx.domain at PlanetLab
  • e.g., “planetlab-2.cs.princeton.edu” on planetlab-1.cs.princeton.edu
  • Lookup should be handled locally
• Failure criteria
  • 5+ seconds of latency
  • Zero answers
• Rolling average of the past 100 queries
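The failure criteria and the rolling average above can be sketched as a small monitor. The class and method names, and the sample latencies, are hypothetical; only the 5-second threshold, the zero-answer rule, and the 100-query window come from the slide.

```python
from collections import deque


class FailureMonitor:
    """Track the failure rate over the most recent lookups."""

    def __init__(self, window=100, latency_limit_s=5.0):
        self.latency_limit_s = latency_limit_s
        self._outcomes = deque(maxlen=window)  # True means the lookup failed

    def record(self, latency_s, answer_count):
        # A lookup fails if it took 5+ seconds or returned no answers.
        failed = latency_s >= self.latency_limit_s or answer_count == 0
        self._outcomes.append(failed)
        return failed

    def failure_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)


monitor = FailureMonitor()
# Made-up (latency in seconds, number of answers) samples, one per probe.
for latency_s, answers in [(0.002, 1), (5.3, 1), (0.004, 0), (0.003, 1)]:
    monitor.record(latency_s, answers)
```

Because the deque has a fixed maximum length, old probes fall out of the window automatically as new ones arrive.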
Expected DNS Behavior
• planetlab3.flux.utah.edu
• ricepl-1.cs.rice.edu
DNS Failure on Various Nodes
• planetlab1.cs.cornell.edu
• planetlab2.tamu.edu
• planetlab2.cs.uoregon.edu
Possible Causes
• Packet loss
• LDNS overloading
• Cron jobs
• Maintenance problems
Packet Loss
• UDP is inherently unreliable
  • No ACKs or retransmission at the transport layer
  • A single lost packet forces the client to retransmit the query after a timeout
• Less than 0.1% in a LAN environment
  • Increases with the number of hops (measured at Princeton)
  • 0.00% at 2 hops
  • 0.02% at 3 hops
  • 0.09% at 4 hops
• Heavily dependent on local traffic
  • Losses last for ~1 min
• Cable modem/DSL users may see more
  • Average number of hops between LDNS and clients: 7.6 [Shaikh00]
Nameserver Overloading
• planetlab1.eecs.umich.edu
• planetlab2.di.unito.it
• miranda.tkn.tu-berlin.de
[Figure annotations: 6 pm, 8 am]
Nameserver Overloading
• 90%+ of nameservers are within 4 hops
  • 70% within 2 hops
• Many responses take 1 to 5 seconds
  • No timeout, simply late
• Pr(Overloading | Failure) = 90% for some nodes
• Socket buffer overflow under request bursts
Cron Jobs / Heavy Processes
• pl1.cs.utk.edu
• pl2.cs.utk.edu
• phys0bha-5a.chem.msu.ru
• Not a client problem!
Maintenance Problems
• /etc/resolv.conf
  • Configured to dead nameservers
  • Blocking services
  • Outside the firewall
• Complete outage
  • Berkeley Millennium nodes, 3/17/2004
• Blackout / natural disaster
  • Duke hit by Hurricane Isabel, Fall 2003
Solution: Cooperative Lookups
[Figure: several LANs, each containing clients, a CoDNS machine, and an LDNS. Each CoDNS machine sends queries to its own LDNS and receives answers; one CoDNS machine also sends a remote query to a CoDNS machine in another LAN and receives a remote answer.]
CoDNS: Cooperative DNS
• Cooperative name lookup scheme
  • If the local server is OK, use the local server
  • On failure, ask a peer for the lookup
• Insurance model
  • Share risk, share benefits
  • Spend resources only when needed
• Aggregate name lookup service
  • Aggregate cache effect
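The "local first, peer on failure" idea can be sketched with two resolver callables and a timeout. This is a simplified illustration, not the CoDNS implementation: the function names, the stub resolvers, the address, and the choice to accept whichever answer arrives first once the peer has been asked are all assumptions.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def cooperative_lookup(name, local_resolver, peer_resolver, local_timeout_s=0.2):
    """Ask the local resolver first; if it is slow or fails, also ask a peer.

    Returns (answer, source), where source is "local" or "peer".
    """
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        local = pool.submit(local_resolver, name)
        done, _ = wait([local], timeout=local_timeout_s)
        if done and local.exception() is None:
            return local.result(), "local"   # healthy case: no remote traffic
        # Local server is late or failed: spend a remote query.
        pending = {pool.submit(peer_resolver, name): "peer"}
        if not done:
            pending[local] = "local"         # a late local answer still counts
        while pending:
            finished, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in finished:
                source = pending.pop(future)
                if future.exception() is None:
                    return future.result(), source
        raise LookupError(f"no resolver answered for {name}")
    finally:
        pool.shutdown(wait=False)


# Hypothetical stand-ins for an LDNS and a CoDNS peer.
def healthy_ldns(name):
    return "192.0.2.10"


def overloaded_ldns(name):
    time.sleep(0.5)          # simulates a late (not lost) response
    return "192.0.2.10"


def dead_ldns(name):
    raise TimeoutError("LDNS unreachable")


def peer_codns(name):
    return "192.0.2.10"
```

With `healthy_ldns` the peer is never contacted, which matches the insurance model: remote resources are spent only when the local server misbehaves.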
Design Issues
• Proximity / liveness
  • Select nearby peers
  • Monitor nameservers’ health as well
• Request locality
  • Pick the same peer for the same names
  • Highest Random Weight (HRW) hashing
• Remote request timeout
  • Dynamically adjusted to the local server’s health
  • Exponentially backed off for each remote query
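Highest Random Weight hashing, used above for request locality, can be sketched as follows. The peer names, the use of SHA-1, and the helper names are assumptions for illustration; the slides do not say which hash function CoDNS uses.

```python
import hashlib


def hrw_weight(peer, name):
    """Pseudo-random weight for a (peer, name) pair, stable across calls."""
    digest = hashlib.sha1(f"{peer}|{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rank_peers(peers, name):
    """Order peers from highest to lowest weight for this name."""
    return sorted(peers, key=lambda peer: hrw_weight(peer, name), reverse=True)


def pick_peer(peers, name):
    """Return the peer with the highest weight for this name."""
    return max(peers, key=lambda peer: hrw_weight(peer, name))


# Hypothetical peer list.
peers = ["peer-a.example", "peer-b.example", "peer-c.example"]
ranked = rank_peers(peers, "www.princeton.edu")
first = pick_peer(peers, "www.princeton.edu")
```

Every node that knows the same peer list picks the same peer for a given name, so that peer's cache is reused. If the chosen peer disappears, only its names move, each to the next peer in its ranking.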
Status Quo
• CoDNS deployed on all PlanetLab nodes
  • Running 24/7 since August 2003
  • CoDeeN uses CoDNS as its primary DNS
• Remote query configuration
  • Top 10 nodes as neighbors
  • 200 ms as the starting timeout
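As a small illustration of the timeout schedule, the sketch below starts at 200 ms and backs off exponentially. The doubling factor, the number of attempts, and the function name are assumptions; the slides give only the starting value and say the back-off is exponential.

```python
def remote_query_timeouts(start_ms=200, factor=2, attempts=4):
    """Timeout in milliseconds before each successive remote query."""
    return [start_ms * factor ** attempt for attempt in range(attempts)]


timeouts = remote_query_timeouts()
```

In a deployment the starting value would also track the local server's recent health rather than staying fixed.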
Evaluation
• Live traffic for one week for CoDeeN (20k-30k queries)
Lookup Distribution
• Live traffic on a node for one week (20,333 queries)
• Total lookup time: 2,043,135 ms (CoDNS) / 5,809,265 ms (LDNS) = 35.1%
• 100 ms vs. 286 ms per query
• Great improvement on the weighted CDF (W-CDF)
[Figure annotations: 5.5% -> 0.06%; 76% -> 17.8%]
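The per-query figures follow from dividing the two totals by the query count. The check below assumes the smaller total belongs to CoDNS and the larger to the local DNS; the variable names are illustrative.

```python
queries = 20_333
codns_total_ms = 2_043_135   # assumed: total lookup time with CoDNS
ldns_total_ms = 5_809_265    # assumed: total lookup time with local DNS only

ratio = codns_total_ms / ldns_total_ms    # roughly 0.35
codns_avg_ms = codns_total_ms / queries   # about 100 ms per query
ldns_avg_ms = ldns_total_ms / queries     # about 286 ms per query
```

CoDNS therefore spent roughly 35% of the total lookup time that the local DNS alone needed for the same traffic.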
Finer-grained View
• Live traffic for one day
• Effectively flattens the spikes
Availability
• Adds one ‘9’, from 99% to 99.9%
[Figure: Availability (%) of CoDNS and LDNS, y-axis ticks from 9% to 99.99%, x-axis: Nodes Sorted By Availability (1 to 91).]
DNS-Based CDN Sites?
• A DNS-based CDN uses the client’s DNS server to select a nearby replica
[Figure: Latency Difference (ms)]
CoDNS Alternatives
• Private nameservers
• Secondary nameservers
• TCP queries
TCP Queries
• DNS supports TCP
  • Failure rate is better
  • Not used except for AXFR (zone transfers) or when the answer is big
• Simple TCP
  • 2 packets (UDP) vs. 9 packets (TCP: 3 handshake + 2 query/reply + 4 teardown = 9)
• Persistent TCP
  • ACK overhead
  • Resource waste for idle connections
• Vulnerable to overloading / server down
Simple TCP (S-TCP), Persistent TCP (P-TCP), UDP, CoDNS
• Replay test (10,792 names) on 107 nodes
• CoDNS first
CoDNS vs. Persistent TCP
[Figure: Average Response Time (ms)]
Conclusion
• Local failures are ubiquitous and relatively frequent
• Local failures lead to long latency
• CoDNS is an effective, low-cost “insurance” service
  • CoDNS effectively masks local failures
  • CoDNS reduces average response time by 27-82%
  • CoDNS improves DNS lookup availability by adding one more ‘9’