
Experience with some Principles for Building an Internet-Scale Reliable System

deepak
Presentation Transcript


  1. Experience with some Principles for Building an Internet-Scale Reliable System

  2. Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach

  3. Downloading www.xyz.com with Akamai’s EdgeSuite [Diagram: customer Web server, www.xyz.com, DNS, and the Akamai network (15,000+ servers, 1,100+ networks, 2,500+ locations), with arrows numbered 1 to 7] • 1. User enters www.xyz.com • 2. Browser requests IP address for www.xyz.com, which is CNAMEd to Akamai • 3. DNS returns IP address of optimal Akamai server • 4. Browser requests HTML • 5. Akamai server assembles page, contacting customer Web server if necessary • 6. Optimal Akamai server returns HTML • 7. Browser obtains objects from optimal Akamai servers, contacting the customer Web server if necessary
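The name-resolution part of this flow (the CNAME to Akamai, then DNS returning the optimal server) can be sketched roughly as follows. This is a toy model, not Akamai's mapping system: every hostname, IP address, and load figure is invented, and "optimal" is assumed here to mean "least-loaded live server", which the slide does not specify.

```python
# Hypothetical data: all names, addresses, and load scores are invented for illustration.
CNAMES = {"www.xyz.com": "www.xyz.com.cdn.example.net"}  # customer name aliased to the CDN

# Candidate edge servers for the CDN name, with a made-up load score (lower is better).
EDGE_SERVERS = {
    "www.xyz.com.cdn.example.net": [
        {"ip": "192.0.2.10", "alive": True, "load": 0.7},
        {"ip": "192.0.2.20", "alive": True, "load": 0.2},
        {"ip": "192.0.2.30", "alive": False, "load": 0.1},
    ]
}


def resolve(hostname):
    """Follow CNAME aliases, then return the IP of the least-loaded live edge server."""
    seen = set()
    while hostname in CNAMES:
        if hostname in seen:
            raise ValueError("CNAME loop")
        seen.add(hostname)
        hostname = CNAMES[hostname]
    candidates = [s for s in EDGE_SERVERS.get(hostname, []) if s["alive"]]
    if not candidates:
        raise LookupError(f"no live edge server for {hostname}")
    return min(candidates, key=lambda s: s["load"])["ip"]


print(resolve("www.xyz.com"))
```

The server with the lowest load score is skipped because it is marked as down, so the answer comes from the live servers only.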

  4. What is this Paper About? • Internal effort to assess and further formalize internal processes for reliability. • Produced a long list of principles, some quite basic • e.g. Input checking • A smaller set of principles capturing our basic approach to building distributed systems emerged. • Some we realized only in retrospect • Many are not unique or new to us

  5. Sharing our Principles • “Not always easy in practice” • Similarities with academic literature • Enables useful operational approach • This talk is not: • Detailed exposition or justification of entire system or architecture • Scientific reliability study • Adequate comparison with previous literature

  6. Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach

  7. Challenges • Failures all the time at different levels: • Machines, racks, datacenters, networks, multiple networks [Figure: connectivity statistics, “health” over time]

  8. Our Philosophy • Assumption: We assume that a significant and constantly changing number of component or other failures occur at all times in the network. • Philosophy: Our software is designed to work seamlessly despite numerous failures as part of the operational network.

  9. Consequence of Philosophy

  10. Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach

  11. Our Principles (stacked diagram, listed top to bottom) • Principle #6: Notice and Quarantine Faults • Principle #5: Zoning for Releases • Principle #4: Fail-Stop & Restart • Principle #3: Distributed Control • Principle #2: Logic and Software for Message Reliability • Principle #1: Ensure Significant Redundancy • Philosophy: Work with numerous failures • Assumption: Significant and constantly changing failures

  12. Principle #1: Ensure Significant Redundancy • Base Approach: Redundancy at every layer • Example Problem: • gTLDs return only 13 entries • The set is relatively static • Solution: IP Anycast to overload the IP addresses • Other Practical Constraints • DNS TTLs constrain flexibility • Third-party protocols • Cost • Simple in theory, often challenging in practice.

  13. Principle #2: Use Logic and Software to Provide Message Reliability • Many message types: • Monitoring information • Customer content • We use an overlay transport (UDP and HTTP) • We do not: • Have dedicated pipes • Own datacenters
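One way to picture reliability provided in software rather than by dedicated pipes is a sender that retries a message across several alternative paths. The sketch below is only an illustration under that assumption: the function names, the path labels, and the retry policy are hypothetical and are not taken from the talk.

```python
def send_reliably(message, paths, send, max_rounds=3):
    """Try each path in turn, for up to max_rounds rounds; return the path that delivered."""
    for _ in range(max_rounds):
        for path in paths:
            try:
                if send(path, message):
                    return path
            except ConnectionError:
                continue  # this path failed; fall through to the next one
    raise RuntimeError("message undeliverable on all paths")


def fake_send(path, message):
    """Hypothetical transport: the direct path is down, the relay path works."""
    if path == "direct":
        raise ConnectionError("link down")
    return True


print(send_reliably("status update", ["direct", "via-relay"], fake_send))
```

The caller never learns that the direct path failed; it only sees which path eventually carried the message.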

  14. Principle #3: Distributed Control • Different Layers: • Leader-Election • Failover [Diagram: items marked X; annotations: “Suspending region ensures reliability” and “Region contains the only reliable content!”]
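The slide names leader election and failover without describing an algorithm. As a minimal sketch, assuming a simple "lowest-numbered live machine wins" rule (a stand-in chosen for brevity, not the method used in the system described), failover falls out of re-running the election whenever the set of live machines changes:

```python
def elect_leader(machines):
    """Return the lowest-numbered live machine ID, or None if no machine is live."""
    live = [machine_id for machine_id, alive in machines.items() if alive]
    return min(live) if live else None


# Hypothetical region of three machines, all initially alive.
region = {1: True, 2: True, 3: True}
print(elect_leader(region))

# Machine 1 fails; re-running the election hands leadership to machine 2.
region[1] = False
print(elect_leader(region))
```

No central coordinator is involved: any machine that knows which peers are alive can compute the same answer.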

  15. Our ability to tolerate failures facilitates our approach to software development and operations.

  16. Principle #4: Fail Stop and Restart • Why? • Significant downside to a mistake • Strong mechanism for recovery • Akamai could be viewed as a seven-year experiment in running Recovery Oriented Computing.
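A fail-stop component halts as soon as it detects a problem instead of continuing in a doubtful state, and a supervisor restarts it cleanly. The sketch below shows the pattern in miniature; the supervisor, the task, and the restart limit are all hypothetical names and values chosen for illustration.

```python
def run_with_restart(task, max_restarts=3):
    """Run task; on any exception abandon that attempt and start a fresh one."""
    for attempt in range(max_restarts + 1):
        try:
            return task(attempt)
        except Exception as exc:
            print(f"attempt {attempt} failed ({exc}); restarting from a clean state")
    raise RuntimeError("task kept failing after restarts")


def flaky_task(attempt):
    """Hypothetical task that detects bad state on its first run and stops immediately."""
    if attempt == 0:
        raise ValueError("inconsistent state detected")
    return "ok"


print(run_with_restart(flaky_task))
```

The task makes no attempt to repair itself; recovery is left entirely to the restart, which is what makes a strong recovery mechanism a prerequisite for this principle.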

  17. A Cautious Approach [Diagram: label “1.2.3.4” and six X marks] • Problems: • Continual Rolls • System-wide Issues

  18. Principle #6: Notice and Quarantine Faults • Challenging Problem • Many classes of solution • Open problem with vital importance

  19. Principle #5: Zoning • Phase 1: One machine • Phase 2: Subset (< 1/8th) of the network • Phase 3: Entire network • The prior principles, unexpectedly, have enabled a much more reliable and aggressive release process.
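The three release phases can be sketched as a rollout loop that stops as soon as a health check fails. This is an illustration only: the `deploy` and `healthy` callbacks, the way the subset is chosen, and the abort rule are assumptions, with the phase-2 size kept strictly below one eighth of the network as on the slide.

```python
def phased_rollout(machines, deploy, healthy):
    """Deploy in three phases; stop as soon as any deployed machine is unhealthy.

    Returns (succeeded, machines_deployed_so_far).
    """
    n = len(machines)
    k = (n - 1) // 8 if n else 0  # largest count strictly below n / 8
    phases = [machines[:1], machines[1:1 + k], machines[1 + k:]]
    done = []
    for phase in phases:
        for machine in phase:
            deploy(machine)
            done.append(machine)
        if not all(healthy(m) for m in done):
            return False, done  # halt the release before it spreads further
    return True, done


# Hypothetical network of 80 machines where machine 5 breaks under the new release.
network = list(range(80))
ok, touched = phased_rollout(network, deploy=lambda m: None, healthy=lambda m: m != 5)
print(ok, len(touched))
```

In this run the fault surfaces in phase 2, so only the first ten machines ever receive the release and the remaining seventy are untouched.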

  20. Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach

  21. Benefits to Software Development • High rate of change per month: • ~22 network/software releases • ~1000 customer configuration releases • Our confidence in our network’s ability to handle faults enables an aggressive rate of change. • Data from 25 months: 556 software releases

  22. Benefit to Operations • Normal NOCC Staffing: • 7-8 people during the day • 3 people at night • Per person: • 1800 servers • 300 datacenters • Our ability to treat faults as normal occurrences, not as crises, helps us scale.

  23. Principles • Ensure Significant Redundancy • Use Logic and Software for Messaging • Employ Distributed Control • Fail Stop and Restart • Employ Zoning • Notice and Quarantine Faults • Key Points • These principles: • Build upon each other • Enable Akamai’s highly reliable service and ability to scale