
CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups



Presentation Transcript


  1. CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups KyoungSoo Park Electrical Engineering KAIST EE513/IS535

  2. DNS Background • Domain Name System (DNS) • Distributed database of resource records (RR) • Typically, (name, IP) pair lookup (A-record) • Hierarchical Name Resolution • www.kaist.ac.kr: . -> .kr -> .ac.kr -> kaist.ac.kr • Local resolver (local DNS server) handles the request • Caching and Redundancy • Each server aggressively caches RRs • More than two local resolvers

  3. DNS Name Resolution Example • Host at swan.kaist.ac.kr wants IP address for www.princeton.edu • Iterated query: • contacted server replies with name of server to contact • “I don’t know this name, but ask this server” • [Figure: message steps 1-8 among requesting host swan.kaist.ac.kr, local DNS server ns.kaist.ac.kr, root DNS server, TLD DNS server (.edu), and authoritative DNS server dns.princeton.edu]
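The iterated query on this slide can be sketched as a loop run by the local DNS server: each contacted server either answers or names the next server to ask. This is a toy model, not a real resolver. The server table, the `ask` and `resolve_iteratively` helpers, and the IP address are all assumptions made up for illustration.

```python
# Toy referral tables. Server keys and the IP address are hypothetical
# illustration values, not real DNS data.
SERVERS = {
    "root": {"referrals": {"edu": "tld-edu"}},
    "tld-edu": {"referrals": {"princeton.edu": "dns.princeton.edu"}},
    "dns.princeton.edu": {"answers": {"www.princeton.edu": "192.0.2.10"}},
}


def ask(server, name):
    """Send one query; return ("answer", ip) or ("referral", next_server)."""
    entry = SERVERS[server]
    if name in entry.get("answers", {}):
        return "answer", entry["answers"][name]
    for suffix, next_server in entry.get("referrals", {}).items():
        if name == suffix or name.endswith("." + suffix):
            return "referral", next_server
    raise LookupError(f"{server} cannot help with {name}")


def resolve_iteratively(name, start="root"):
    """Follow referrals from the root until some server returns an answer."""
    server, path = start, []
    while True:
        path.append(server)
        kind, value = ask(server, name)
        if kind == "answer":
            return value, path
        server = value  # "I don't know this name, but ask this server"


ip, path = resolve_iteratively("www.princeton.edu")
print(ip, path)
```

In a recursive query the same chain of lookups happens, but each contacted server performs the next step itself instead of handing a referral back.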

  4. DNS name resolution example • Recursive query: • puts burden of name resolution on contacted name server • heavy load? • [Figure: message steps 1-8 among requesting host swan.kaist.ac.kr, local DNS server ns.kaist.ac.kr, root DNS server, TLD DNS server (.edu), and authoritative DNS server dns.princeton.edu; target host labeled www.umass.edu]

  5. DNS: Root Name Servers • A.root-servers.net to M.root-servers.net • Each server is a cluster of replicated servers • Each IP is shared by many machines (e.g., IP Anycast) • Responsible for top-level domain NS records • How do we know the IP addresses of root servers? • 13 root name servers worldwide: • a Verisign, Dulles, VA • b USC-ISI Marina del Rey, CA • c Cogent, Herndon, VA (also LA) • d U Maryland College Park, MD • e NASA Mt View, CA • f Internet Software C. Palo Alto, CA (and 36 other locations) • g US DoD Vienna, VA • h ARL Aberdeen, MD • i Autonomica, Stockholm (plus 28 other locations) • j Verisign (21 locations) • k RIPE London (also 16 other locations) • l ICANN Los Angeles, CA • m WIDE Tokyo (also Seoul, Paris, SF)

  6. TLD and Authoritative Servers • Top-level domain (TLD) servers: • Responsible for com, org, net, edu, etc, and all top-level country domains kr, uk, fr, ca, jp. • Network Solutions maintains servers for com TLD • Educause for edu TLD • Authoritative DNS servers: • Organization’s DNS servers, providing authoritative hostname to IP mappings for organization’s servers (e.g., Web, mail). • Can be maintained by organization or service provider

  7. Local Name Server • Does not strictly belong to hierarchy • Each ISP (residential ISP, company, university) has one set • Also called “default name server” • When host makes DNS query, query is sent to its local DNS server • Acts as proxy, forwards query into hierarchy

  8. DNS: Caching and Updating Records • Once (any) name server learns mapping, it caches mapping • Why caching? • Cache entries time out (disappear) after some time • TLD servers typically cached in local name servers • Thus root name servers not often visited • Typical cache hit rate: 80-90% at local DNS server • Negative caching of DNS queries (RFC 2308) • Caches negative responses (e.g., non-existent names)
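A minimal sketch of the caching behavior on this slide: entries expire after their TTL, and a non-existent name can be cached as a negative answer. The `DnsCache` class, the injectable clock, the TTL values, and the addresses are assumptions for illustration. A real resolver takes TTLs from the records themselves.

```python
import time


class DnsCache:
    """TTL cache; a stored value of None represents a cached negative answer."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value or None, expiry time)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        """Return (hit, value). A hit with value None means 'known not to exist'."""
        entry = self._entries.get(name)
        if entry is None:
            return False, None
        value, expires = entry
        if self._clock() >= expires:
            del self._entries[name]  # entry timed out
            return False, None
        return True, value


# Usage with a fake clock so expiry can be shown without sleeping.
now = [0.0]
cache = DnsCache(clock=lambda: now[0])
cache.put("www.kaist.ac.kr", "192.0.2.1", ttl=300)   # hypothetical address
cache.put("no-such-host.example", None, ttl=60)      # negative answer
print(cache.get("www.kaist.ac.kr"), cache.get("no-such-host.example"))
```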

  9. DNS records • DNS: distributed DB storing resource records (RR) • RR format: (name, value, type, ttl) • Type=A • name is hostname • value is IP address • Type=CNAME • name is alias name for some “canonical” (the real) name • www.ibm.com is really www.ibm.com.cs186.net • value is canonical name • Type=NS • name is domain (e.g., foo.com) • value is hostname of authoritative name server for this domain • Type=MX • value is name of mailserver associated with name
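The (name, value, type, ttl) format maps directly onto a small tuple type. The sketch below reuses the www.ibm.com alias from the slide. The TTLs, the IP address, and the hostnames dns.foo.com and mail.foo.com are invented for illustration, as is the `lookup_address` helper.

```python
from collections import namedtuple

# Field order follows the slide: (name, value, type, ttl).
ResourceRecord = namedtuple("ResourceRecord", "name value type ttl")

RECORDS = [
    ResourceRecord("www.ibm.com", "www.ibm.com.cs186.net", "CNAME", 300),
    ResourceRecord("www.ibm.com.cs186.net", "192.0.2.20", "A", 60),  # made-up IP
    ResourceRecord("foo.com", "dns.foo.com", "NS", 86400),           # made-up host
    ResourceRecord("foo.com", "mail.foo.com", "MX", 3600),           # made-up host
]


def lookup_address(name, records, max_aliases=8):
    """Follow CNAME aliases until an A record is found; None if there is none."""
    for _ in range(max_aliases):
        by_type = {r.type: r for r in records if r.name == name}
        if "A" in by_type:
            return by_type["A"].value
        if "CNAME" not in by_type:
            return None
        name = by_type["CNAME"].value  # continue with the canonical name
    return None


print(lookup_address("www.ibm.com", RECORDS))
```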

  10. DNS protocol, messages • DNS protocol: query and reply messages, both with same message format • msg header • identification: 16 bit # for query, reply to query uses same # • flags: • query or reply • recursion desired • recursion available • reply is authoritative
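As a sketch of the header fields listed here, the 12-byte DNS header can be packed with Python's `struct` module. The flag bit positions and the four count fields follow RFC 1035 and are not shown on the slide. The helper names and the identification value are assumptions for illustration.

```python
import struct

# Flag bits in the 16-bit flags word (RFC 1035, section 4.1.1).
QR = 0x8000  # 0 = query, 1 = reply
AA = 0x0400  # reply is authoritative
RD = 0x0100  # recursion desired
RA = 0x0080  # recursion available


def pack_header(msg_id, flags, qdcount=0, ancount=0, nscount=0, arcount=0):
    """Pack identification, flags, and the four section counts (network order)."""
    return struct.pack("!6H", msg_id, flags, qdcount, ancount, nscount, arcount)


def unpack_header(data):
    msg_id, flags, qd, an, ns, ar = struct.unpack("!6H", data[:12])
    return {
        "id": msg_id,
        "is_reply": bool(flags & QR),
        "authoritative": bool(flags & AA),
        "recursion_desired": bool(flags & RD),
        "recursion_available": bool(flags & RA),
        "counts": (qd, an, ns, ar),
    }


# A query and its reply carry the same identification number.
query = pack_header(0x1A2B, RD, qdcount=1)
reply = pack_header(0x1A2B, QR | AA | RD | RA, qdcount=1, ancount=1)
print(unpack_header(query), unpack_header(reply))
```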

  11. DNS protocol, messages • Name, type fields for a query • RRs in response to query • records for authoritative servers • additional “helpful” info that may be used

  12. CoDNS

  13. Two Kinds of DNS Problems • Server-side problems • Problems in server infrastructure • [Danzig92], [Jung01] • Nameserver, resolver bugs • Misconfigurations by operators • Client-side problems • Between LDNS and clients • LDNS cache hit rate : 80 ~ 90% • CoDeeN experiences problems

  14. Local DNS Lookup Problems • Local DNS lookup failures • 5+ seconds delay for cached records • Frequent & widely-distributed • Unpredictable service • Directly affecting user-perceived latency • Random delay in web browsing • Critical in HTTP proxies, web crawlers and busy mail servers

  15. Experiment For Local Problems • Local name lookup every 6 seconds • “yyy.domain” on xxx.domain at PlanetLab • “planetlab-2.cs.princeton.edu” on planetlab-1.cs.princeton.edu • Lookup should be handled locally • Failure criteria • 5+ seconds of latency • zero answer • Rolling average of the past 100 queries
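One way to read the failure criteria above is as a sliding window over the most recent lookups. In the sketch below, the `LookupMonitor` class is hypothetical, and treating the "rolling average" as a failure rate over the window is an assumption. The slide does not say which quantity was averaged.

```python
from collections import deque


class LookupMonitor:
    """Tracks the failure rate over the most recent `window` lookups."""

    def __init__(self, window=100, slow_seconds=5.0):
        self.slow_seconds = slow_seconds
        self._results = deque(maxlen=window)  # oldest result drops off

    def record(self, latency_seconds, answer_count):
        """A lookup fails if it took 5+ seconds or returned zero answers."""
        failed = latency_seconds >= self.slow_seconds or answer_count == 0
        self._results.append(failed)
        return failed

    def failure_rate(self):
        if not self._results:
            return 0.0
        return sum(self._results) / len(self._results)


monitor = LookupMonitor(window=100)
monitor.record(0.012, answer_count=1)   # fast, answered: success
monitor.record(5.3, answer_count=1)     # too slow: failure
monitor.record(0.020, answer_count=0)   # zero answers: failure
print(monitor.failure_rate())
```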

  16. Expected DNS Behavior • planetlab3.flux.utah.edu • ricepl-1.cs.rice.pl

  17. DNS Failure on Various Nodes • planetlab1.cs.cornell.edu • planetlab2.tamu.edu • planetlab2.cs.uoregon.edu

  18. Possible Causes • Packet loss • LDNS overloading • Cron jobs • Maintenance problems

  19. Packet Loss • UDP inherently unreliable • No ACK / retransmission • Single loss triggers query retransmission • Less than 0.1% in LAN environment • Increases with # of hops (Princeton) • 0.00% at 2 hops • 0.02% at 3 hops • 0.09% at 4 hops • Heavily dependent on local traffic • Losses last for ~1 min • Cable modem/DSL users may see more • Avg # hops between LDNS and clients: 7.6 [Shaikh00]

  20. Nameserver Overloading • planetlab1.eecs.umich.edu • planetlab2.di.unito.it • miranda.tkn.tu-berlin.de • [Figure: time axis marked at 8 am and 6 pm]

  21. Nameserver Overloading • 90%+ nameservers within 4 hops • 70%, within 2 hops • Many responses for 1 sec ~ 5 sec • No timeout but simply late • Pr (Overloading|Failure) = 90 % for some nodes • Socket buffer overflow under request bursts

  22. Cron jobs/heavy processes • Not a client problem! • pl1.cs.utk.edu • pl2.cs.utk.edu • phys0bha-5a.chem.msu.ru

  23. Maintenance Problems • /etc/resolv.conf • Configured to dead nameservers • Blocking services • Outside the firewall • Complete outage • Berkeley Millennium nodes, 3/17/2004 • Blackout / natural disaster • Duke hit by hurricane Isabel, Fall/2003

  24. Solution: Cooperative Lookups • [Figure: LANs, each with clients, a CoDNS machine, and LDNS; arrows labeled query, answer, remote query, remote answer]

  25. CoDNS: Cooperative DNS • Cooperative name lookup scheme • If local server OK, use local server • On failure, ask a peer for the lookup • Insurance model • Share risk, share benefits • Spend resources only when needed • Aggregate name lookup service • Aggregate cache effect
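The "local first, peer on failure" rule can be sketched with resolvers passed in as plain callables. This is a sequential simplification under stated assumptions, not the deployed CoDNS logic. `cooperative_lookup`, the stub resolvers, and the address are hypothetical, and failure is modeled as a `TimeoutError` or a `None` answer.

```python
def cooperative_lookup(name, local_lookup, peer_lookups):
    """Try the local resolver first; consult peers only if it fails."""
    try:
        answer = local_lookup(name)
        if answer is not None:
            return answer, "local"
    except TimeoutError:
        pass  # local resolver failed; fall through to the peers
    for peer in peer_lookups:
        try:
            answer = peer(name)
        except TimeoutError:
            continue  # this peer failed too; try the next one
        if answer is not None:
            return answer, "remote"
    return None, "failed"


# Stub resolvers standing in for a healthy and a failing nameserver.
def healthy(name):
    return "192.0.2.30"  # made-up address


def broken(name):
    raise TimeoutError("resolver did not answer in time")


print(cooperative_lookup("www.princeton.edu", healthy, [broken]))
print(cooperative_lookup("www.princeton.edu", broken, [broken, healthy]))
```

Peers are contacted only on the failure path, which matches the insurance framing: remote resources are spent only when the local server lets the client down.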

  26. Design Issues • Proximity / liveness • Select nearby peers • Monitors nameserver’s health as well • Request locality • Pick same peer for same names • Highest Random Weight (HRW) • Remote request timeout • Dynamically adjusted to local server’s health • Exponentially backed off for each remote query
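Highest Random Weight hashing gives the request locality described above: every node scores each peer against the name and picks the top score, so the same name goes to the same peer. The sketch below assumes SHA-256 as the scoring hash, doubling as the backoff factor, and made-up peer names. None of these choices is stated on the slides. The 200 ms figure is the starting timeout given on the Status Quo slide.

```python
import hashlib


def hrw_weight(peer, name):
    """Pseudo-random but deterministic score for a (peer, name) pair."""
    digest = hashlib.sha256(f"{peer}|{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_peer(name, peers):
    """HRW: choose the peer with the highest weight for this name."""
    return max(peers, key=lambda peer: hrw_weight(peer, name))


def remote_timeouts(initial_ms, attempts):
    """Timeouts for successive remote queries, doubling each time (assumed)."""
    return [initial_ms * 2 ** i for i in range(attempts)]


peers = ["peer-a.example", "peer-b.example", "peer-c.example", "peer-d.example"]
chosen = pick_peer("www.princeton.edu", peers)
print(chosen, remote_timeouts(200, 3))
```

A useful property of HRW is that removing a peer only remaps the names that peer was serving. Every other name keeps its choice.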

  27. Status Quo • CoDNS deployed on all PlanetLab nodes • Running 24/7 since August 2003 • CoDeeN uses CoDNS as primary DNS • Remote query configuration • Top 10 nodes as neighbors • 200ms as a starting timeout

  28. Evaluation • Live traffic for one week for CoDeeN (20k - 30k)

  29. Lookup Distribution • Live traffic on a node for one week (20,333 queries) • 2,043,135 ms / 5,809,265 ms = 35.1% • 100 ms vs. 286 ms per query • Great improvement on W-CDF • [Figure annotations: 5.5% -> 0.06%, 76% -> 17.8%]
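The per-query figures on this slide follow from dividing each total by the query count. The short check below assumes the smaller total is the CoDNS time and the larger one the local DNS time. The slide implies this reading but does not label the two totals, and the variable names are illustrative.

```python
queries = 20_333
total_codns_ms = 2_043_135   # assumed: total lookup time with CoDNS
total_local_ms = 5_809_265   # assumed: total lookup time with local DNS only

codns_per_query = round(total_codns_ms / queries)   # ms per query
local_per_query = round(total_local_ms / queries)   # ms per query
ratio = total_codns_ms / total_local_ms             # roughly 35%

print(codns_per_query, local_per_query, f"{ratio:.1%}")
```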

  30. Finer-grained View • Live traffic for one day • Effectively flattens the spikes

  31. Availability • Add one ‘9’, from 99% to 99.9% • [Figure: Availability (%) of CoDNS vs. LDNS, nodes sorted by availability]

  32. DNS-Based CDN Sites? • DNS-based CDN exploits DNS server to provide a near replica • [Figure: Latency Difference (ms)]

  33. CoDNS Alternatives • Private Nameservers • Secondary Nameservers • TCP Queries

  34. TCP Queries • DNS supports TCP • Failure rate is better • Not used except for AXFR or when answer is big • Simple TCP • 2 packets vs. 9 packets (3+2+4 = 9) • Persistent TCP • ACK overhead • Resource waste for idle connections • Vulnerable to overloading/server down

  35. S-TCP, P-TCP, UDP, CoDNS • Replay test (10,792 names) on 107 nodes • CoDNS First

  36. CoDNS vs. Persistent TCP • [Figure: Average Response Time (ms)]

  37. Conclusion • Local failures are ubiquitous and relatively frequent • Local failures lead to long latency • CoDNS is an effective, low-cost “insurance” service • CoDNS effectively masks local failures • CoDNS reduces average response time by 27-82% • CoDNS improves DNS lookup availability by adding an additional ‘9’.
