Experience with some Principles for Building an Internet-Scale Reliable System
Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach
Downloading www.xyz.com with Akamai’s EdgeSuite [Diagram: browser, DNS, customer Web server (WWW.XYZ.COM), and the Akamai network of 15,000+ servers on 1,100+ networks in 2,500+ locations; arrows numbered 1 to 7] • 1. User enters www.xyz.com • 2. Browser requests IP address for www.xyz.com, which is CNAMEd to Akamai • 3. DNS returns IP address of optimal Akamai server • 4. Browser requests HTML • 5. Akamai server assembles page, contacting customer Web server if necessary • 6. Optimal Akamai server returns HTML • 7. Browser obtains objects from optimal Akamai servers, contacting the customer Web server if necessary
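The DNS part of this flow (the CNAME to Akamai and the selection of an optimal server) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the record names, the addresses, and the rule that "optimal" means the lowest-latency healthy server. The slides do not describe the actual mapping logic.

```python
# Hypothetical records; hostnames and addresses are illustrative only.
CNAME = {"www.xyz.com": "www.xyz.com.edge.example.net"}
EDGE_SERVERS = {
    "www.xyz.com.edge.example.net": [
        {"ip": "192.0.2.10", "healthy": True, "latency_ms": 40},
        {"ip": "192.0.2.20", "healthy": True, "latency_ms": 12},
        {"ip": "192.0.2.30", "healthy": False, "latency_ms": 5},
    ]
}


def resolve(hostname):
    """Follow the CNAME, then return the IP of the best healthy edge server."""
    target = CNAME.get(hostname, hostname)
    candidates = [s for s in EDGE_SERVERS.get(target, []) if s["healthy"]]
    if not candidates:
        raise LookupError(f"no healthy edge server for {hostname}")
    # Assumed selection rule: lowest measured latency among healthy servers.
    return min(candidates, key=lambda s: s["latency_ms"])["ip"]


print(resolve("www.xyz.com"))
```

The unhealthy server is skipped even though it has the lowest latency, so the lookup keeps working while that machine is down.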
What is this Paper About? • Internal effort to assess and further formalize internal processes for reliability. • Produced a long list of principles, some quite basic • e.g. Input checking • A smaller set of principles capturing our basic approach to building distributed systems emerged. • Some we realized only in retrospect • Many are not unique or new to us
Sharing our Principles • “Not always easy in practice” • Similarities with academic literature • Enables useful operational approach • This talk is not: • Detailed exposition or justification of entire system or architecture • Scientific reliability study • Adequate comparison with previous literature
Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach
Challenges • Failures all the time at different levels: • Machines, racks, datacenters, networks, multiple networks [Chart: connectivity statistics, “health” over time]
Our Philosophy • Assumption: A significant and constantly changing number of component or other failures occur at all times in the network. • Our software is designed to work seamlessly despite numerous failures as part of the operational network.
Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach
Our Principles • Assumption: Significant and constantly changing failures • Philosophy: Work with numerous failures • Principle #1: Ensure Significant Redundancy • Principle #2: Logic and Software for Message Reliability • Principle #3: Distributed Control • Principle #4: Fail-Stop & Restart • Principle #5: Zoning for Releases • Principle #6: Notice and Quarantine Faults
Principle #1: Ensure Significant Redundancy • Base Approach: Redundancy at every layer • Example Problem: • gTLDs return only 13 entries • The set is relatively static • Solution: IP Anycast to overload the IP addresses (many servers answer on each address) • Other Practical Constraints: • DNS TTLs constrain flexibility • Third-party protocols • Cost • Simple in theory, often challenging in practice.
Principle #2: Use Logic and Software to Provide Message Reliability • Many message types: • Monitoring information • Customer content • We use an overlay transport (UDP and HTTP) • We do not: • Have dedicated pipes • Own datacenters
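One way to picture reliability provided by software rather than by dedicated pipes is a sender that retries and falls back across several unreliable paths. The sketch below is a simulation under stated assumptions: the path names, the loss rates, and the retry policy are hypothetical, and the slides do not describe how the actual overlay transport works.

```python
import random


def send_reliably(message, paths, attempts_per_path=2, rng=None):
    """Try each (name, loss_rate) path in order, retrying on loss.

    Returns the name of the path that delivered the message,
    or None if every attempt on every path was lost.
    """
    rng = rng or random.Random()
    for name, loss_rate in paths:
        for _ in range(attempts_per_path):
            # Simulated network: the attempt succeeds unless the draw
            # falls below the path's loss rate.
            if rng.random() >= loss_rate:
                return name
    return None


# Hypothetical paths: the direct route is fully down, a relay route works.
paths = [("direct", 1.0), ("via-relay", 0.0)]
print(send_reliably("monitoring update", paths))
```

No single path has to be reliable; delivery only fails when all of them fail at once.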
Principle #3: Distributed Control • Different Layers: • Leader-Election • Failover [Diagram: failed components marked X. Captions: “Suspending region ensures reliability”; “Region contains the only reliable content!”]
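Leader election with failover can be sketched in a few lines. The rule used here (the lowest-numbered healthy machine in a region leads) and the machine names are assumptions chosen for illustration; the slides name the technique but not the algorithm.

```python
def elect_leader(machines):
    """Pick a leader from a dict of machine id -> healthy flag.

    Assumed rule: the lowest id among healthy machines wins.
    Returns None when no machine in the region is healthy.
    """
    healthy = sorted(m for m, ok in machines.items() if ok)
    return healthy[0] if healthy else None


region = {"m1": True, "m2": True, "m3": True}
print(elect_leader(region))   # m1 leads while it is healthy

region["m1"] = False          # the leader fails
print(elect_leader(region))   # failover: m2 takes over without central control
```

Because every machine can apply the same rule to the same health view, no central coordinator is needed to decide who takes over.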
Our ability to tolerate failures facilitates our approach to software development and operations.
Principle #4: Fail-Stop and Restart • Why? • Significant downside to a mistake • Strong mechanism for recovery • Akamai could be viewed as a seven-year experiment in running Recovery Oriented Computing.
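The fail-stop and restart pattern can be sketched as a worker that aborts as soon as it sees state it cannot trust, plus a supervisor that restarts it from fresh state. All names here (`InvariantViolation`, `serve`, `supervise`, the `cache_entries` check) are hypothetical and only illustrate the idea.

```python
class InvariantViolation(Exception):
    """Raised when a process detects state it cannot trust."""


def serve(state):
    # Fail-stop: stop immediately rather than risk acting on bad state.
    if state["cache_entries"] < 0:
        raise InvariantViolation("negative cache size")
    return "served"


def supervise(make_state, run, max_restarts=3):
    """Run the worker, restarting it from fresh state after each fail-stop."""
    restarts = 0
    while True:
        state = make_state()  # each restart begins from newly built state
        try:
            return run(state), restarts
        except InvariantViolation:
            restarts += 1
            if restarts > max_restarts:
                raise  # repeated failures: escalate instead of looping forever


# First start sees corrupted state, the restart sees clean state.
states = iter([{"cache_entries": -1}, {"cache_entries": 5}])
print(supervise(lambda: next(states), serve))
```

Stopping is only acceptable because redundancy elsewhere absorbs the lost capacity while the process restarts.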
A Cautious Approach [Diagram: address 1.2.3.4 with failed components marked X] • Problems: • Continual Rolls • System-wide Issues
Principle #6: Notice and Quarantine Faults • Challenging Problem • Many classes of solution • Open problem of vital importance
Principle #5: Zoning • Phase 1: One Machine • Phase 2: Subset (< 1/8th) of the network • Phase 3: Entire Network • The prior principles, unexpectedly, have enabled a much more reliable and aggressive release process.
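The three release phases can be sketched as a rollout that halts at the first unhealthy machine, so later zones never receive a bad release. The function names, the health check, and the way the subset is chosen are assumptions; only the phase sizes (one machine, less than 1/8th of the network, the entire network) come from the slide.

```python
def plan_phases(machines, denominator=8):
    """Split machines into the three release phases."""
    # (n - 1) // denominator is always strictly less than n / denominator.
    subset_size = (len(machines) - 1) // denominator
    return [
        ("one machine", machines[:1]),
        ("subset", machines[1:1 + subset_size]),
        ("entire network", machines[1 + subset_size:]),
    ]


def rollout(machines, update_ok, denominator=8):
    """Release phase by phase; halt as soon as an updated machine is unhealthy."""
    updated = []
    for name, phase in plan_phases(machines, denominator):
        for machine in phase:
            if not update_ok(machine):
                return name, updated  # later phases never get the release
            updated.append(machine)
    return "complete", updated


machines = [f"m{i}" for i in range(16)]
# Hypothetical fault: the release misbehaves on machine m1.
print(rollout(machines, lambda m: m != "m1"))
```

In this run the fault is caught in the subset phase after a single machine has been updated, and the remaining fourteen machines are untouched.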
Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach
Benefits to Software Development • High Rate of Change per Month: • ~22 network/software releases • ~1000 customer configuration releases • Data from 25 months: 556 software releases • Our confidence in our network’s ability to handle faults enables an aggressive rate of change.
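The monthly software release rate follows directly from the two totals on this slide, as a quick check shows (variable names are illustrative):

```python
software_releases = 556  # total over the measurement period
months = 25

per_month = software_releases / months
print(f"{per_month:.1f} software releases per month")
```

This gives roughly 22 releases per month, matching the figure quoted above.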
Benefit to Operations • Normal NOCC Staffing: • 7-8 people during the day • 3 people at night • Per person: • 1800 servers • 300 datacenters • Our ability to treat faults as normal occurrences, not as crises, helps us scale.
Principles • Ensure Significant Redundancy • Use Logic and Software for Messaging • Employ Distributed Control • Fail Stop and Restart • Employ Zoning • Notice and Quarantine Faults • Key Points • These principles: • Build upon each other • Enable Akamai’s highly reliable service and ability to scale