Experience with some Principles for Building an Internet-Scale Reliable System

Overview • Background • Our Development Philosophy • Guiding Principles • Metrics and Benefits of the Approach

Downloading www.xyz.com with Akamai’s EdgeSuite (diagram: customer Web server WWW.XYZ.COM, DNS, and the Akamai network: 15,000+ servers, 1,100+ networks, 2,500+ locations) • 1. User enters www.xyz.com • 2. Browser requests IP address for www.xyz.com, which is CNAMEd to Akamai • 3. DNS returns IP address of optimal Akamai server • 4. Browser requests HTML • 5. Akamai server assembles page, contacting customer Web server if necessary • 6. Optimal Akamai server returns HTML • 7. Browser obtains objects from optimal Akamai servers, contacting the customer Web server if necessary
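
The DNS part of the download sequence above (a customer hostname CNAMEd to the CDN, then resolved to a server address) can be illustrated with a toy resolver. This is only a sketch under stated assumptions: the record table, the name `www.xyz.com.cdn.example`, the documentation address `192.0.2.10`, and the `resolve` function are all made up for illustration, and real resolution involves network queries, TTLs, and server selection that are not modeled here.

```python
from typing import Dict, List, Tuple

# Toy zone data: name -> (record type, value). All values are hypothetical.
RECORDS: Dict[str, Tuple[str, str]] = {
    "www.xyz.com": ("CNAME", "www.xyz.com.cdn.example"),
    "www.xyz.com.cdn.example": ("A", "192.0.2.10"),
}


def resolve(
    name: str,
    records: Dict[str, Tuple[str, str]] = RECORDS,
    max_hops: int = 8,
) -> Tuple[str, List[str]]:
    """Follow CNAME records until an A record is found.

    Returns (ip_address, chain_of_names_visited).
    """
    chain = [name]
    for _ in range(max_hops):
        rtype, value = records[name]
        if rtype == "A":
            return value, chain
        # CNAME: continue the lookup under the aliased name.
        name = value
        chain.append(name)
    raise RuntimeError("CNAME chain too long")


if __name__ == "__main__":
    ip, chain = resolve("www.xyz.com")
    print(ip, chain)
```

The browser only ever asks for `www.xyz.com`; the alias is what hands the choice of server to the CDN's own DNS.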

What is this Paper About? • Internal effort to assess and further formalize internal processes for reliability. • Produced a long list of principles, some quite basic • e.g. Input checking • A smaller set of principles capturing our basic approach to building distributed systems emerged. • Some we realized only in retrospect • Many are not unique or new to us

Sharing our Principles • “Not always easy in practice” • Similarities with academic literature • Enables useful operational approach • This talk is not: • Detailed exposition or justification of entire system or architecture • Scientific reliability study • Adequate comparison with previous literature

Challenges • Failures all the time at different levels: • Machines, racks, datacenters, networks, multiple networks • Connectivity Statistics (figure: “Health” over time)

Our Philosophy • Assumption: A significant and constantly changing number of component or other failures occur at all times in the network. • Our software is designed to work seamlessly despite numerous failures as part of the operational network.

Consequence of Philosophy

Our Principles • Assumption: Significant and constantly changing failures • Philosophy: Work with numerous failures • Principle #1: Ensure Significant Redundancy • Principle #2: Logic and Software for Message Reliability • Principle #3: Distributed Control • Principle #4: Fail-Stop & Restart • Principle #5: Zoning for Releases • Principle #6: Notice and Quarantine Faults

Principle #1: Ensure Significant Redundancy • Base Approach: Redundancy at every layer • Example Problem: • gTLDs return only 13 entries • The set is relatively static • Solution: IP Anycast to overload the IP addresses • Other Practical Constraints • DNS TTLs constrain flexibility • Third-party protocols • Cost • Simple in theory, often challenging in practice.

Principle #2: Use Logic and Software to Provide Message Reliability • Many message types: • Monitoring information • Customer content • We use an overlay transport (UDP and HTTP) • We do not: • Have dedicated pipes • Own datacenters
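
One common way for software to provide message reliability over unreliable links is to send each message along several paths and have the receiver discard duplicates. The sketch below shows only that idea; it is not Akamai's overlay transport. The path model, `Receiver`, `send_redundant`, `working_path`, and `broken_path` are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional, Set, Tuple

Message = Tuple[int, str]  # (message id, payload)

# A "path" is modeled as a function that either delivers the message
# (returns it) or loses it (returns None). Hypothetical, for illustration.
Path = Callable[[Message], Optional[Message]]


class Receiver:
    """Accepts each message id once and ignores duplicate copies."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()
        self.delivered: List[str] = []

    def receive(self, message: Message) -> None:
        msg_id, payload = message
        if msg_id in self.seen:
            return  # duplicate copy that arrived over another path
        self.seen.add(msg_id)
        self.delivered.append(payload)


def send_redundant(message: Message, paths: List[Path], receiver: Receiver) -> bool:
    """Send one message over every path; True if at least one copy arrived."""
    arrived = False
    for path in paths:
        result = path(message)
        if result is not None:
            receiver.receive(result)
            arrived = True
    return arrived


def working_path(message: Message) -> Optional[Message]:
    return message


def broken_path(message: Message) -> Optional[Message]:
    return None  # simulates a failed network link


if __name__ == "__main__":
    receiver = Receiver()
    ok = send_redundant((1, "status"), [broken_path, working_path, working_path], receiver)
    print(ok, receiver.delivered)
```

The trade-off is bandwidth for reliability: every extra path costs a copy of the message, which is why the duplicate check on the receiving side matters.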

Principle #3: Distributed Control • Different Layers: • Leader-Election • Failover • Suspending region ensures reliability • Region contains the only reliable content!
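
Leader election with failover can be sketched with a deterministic rule: every node applies the same function to the same health view and therefore picks the same leader. This is a simplified illustration, not the election protocol used in the system described; `elect_leader` and the node names are hypothetical, and the sketch assumes all nodes already agree on which peers are healthy, which is the hard part in practice.

```python
from typing import Dict, Optional


def elect_leader(health: Dict[str, bool]) -> Optional[str]:
    """Return the healthy node with the smallest name, or None if none is healthy."""
    healthy = sorted(name for name, ok in health.items() if ok)
    return healthy[0] if healthy else None


if __name__ == "__main__":
    nodes = {"a": True, "b": True, "c": True}
    print(elect_leader(nodes))  # leader while every node is healthy

    nodes["a"] = False  # the current leader fails
    print(elect_leader(nodes))  # failover: the next healthy node takes over
```

Because the rule needs no central coordinator, losing any single machine, including the leader, leaves the remaining nodes able to choose a replacement on their own.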

Our ability to tolerate failures facilitates our approach to software development and operations.

Principle #4: Fail Stop and Restart • Why? • Significant downside to a mistake • Strong mechanism for recovery • Akamai could be viewed as a seven-year experiment in running Recovery Oriented Computing.
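
The fail-stop and restart pattern can be sketched as a small supervisor: a component that detects an inconsistent state stops immediately rather than continuing with possibly wrong data, and the supervisor starts it again from a clean state. This is a minimal illustration under stated assumptions; `InvariantViolation`, `run_with_restart`, and `make_flaky` are hypothetical names, and a real supervisor would restart processes rather than call functions.

```python
from typing import Callable, Tuple


class InvariantViolation(Exception):
    """Raised when a component detects that its own state is inconsistent."""


def run_with_restart(start: Callable[[], str], max_restarts: int) -> Tuple[str, int]:
    """Run start(); on InvariantViolation, discard state and start again.

    Returns (result, number_of_restarts). Re-raises once max_restarts is used up.
    """
    restarts = 0
    while True:
        try:
            return start(), restarts
        except InvariantViolation:
            if restarts >= max_restarts:
                raise
            restarts += 1


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a component that fail-stops `failures` times before it succeeds."""
    remaining = {"n": failures}

    def start() -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise InvariantViolation("inconsistent state detected")
        return "serving"

    return start


if __name__ == "__main__":
    print(run_with_restart(make_flaky(2), max_restarts=5))
```

The restart limit is a design choice in this sketch: a component that keeps failing is better surfaced as a fault than restarted forever.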

A Cautious Approach (diagram: 1.2.3.4) • Problems: • Continual Rolls • System-wide Issues

Principle #6: Notice and Quarantine Faults • Challenging Problem • Many classes of solution • Open problem of vital importance

Principle #5: Zoning • Phase 1: One Machine • Phase 2: Subset (< 1/8th) of the network • Phase 3: Entire Network • The prior principles have, unexpectedly, enabled a much more reliable and aggressive release process.
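
The three release phases can be sketched as a plan plus a gate between phases. This is an illustrative sketch, not the actual release tooling: `plan_phases`, `roll_out`, and the machine names are hypothetical, the phases are modeled as disjoint groups that together cover the network, and "fewer than 1/8th" is interpreted as the largest machine count strictly below one eighth of the total.

```python
from typing import Callable, List


def plan_phases(machines: List[str]) -> List[List[str]]:
    """Split machines into three phases: one machine, a subset strictly
    smaller than 1/8 of the network, then all remaining machines."""
    if not machines:
        return [[], [], []]
    n = len(machines)
    subset_size = (n - 1) // 8  # largest k with k < n / 8
    phase1 = machines[:1]
    phase2 = machines[1:1 + subset_size]
    phase3 = machines[1 + subset_size:]
    return [phase1, phase2, phase3]


def roll_out(
    phases: List[List[str]],
    install: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> int:
    """Install phase by phase; stop after the first phase that looks unhealthy.

    Returns the number of phases completed successfully.
    """
    completed = 0
    for phase in phases:
        for machine in phase:
            install(machine)
        if not all(healthy(m) for m in phase):
            break  # a fault stays confined to the machines touched so far
        completed += 1
    return completed


if __name__ == "__main__":
    machines = [f"m{i}" for i in range(80)]
    phases = plan_phases(machines)
    print([len(p) for p in phases])
```

Stopping at the first unhealthy phase is what limits the damage of a bad release to a small part of the network.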

Benefits to Software Development • High Rate of Change per Month: • ~22 network/software releases • ~1000 customer configuration releases • Data from 25 months = 556 software releases • Our confidence in our network’s ability to handle faults enables an aggressive rate of change.

Benefit to Operations • Normal NOCC Staffing: • 7-8 people during the day • 3 people at night • Per person: • 1800 servers • 300 datacenters • Our ability to treat faults as normal occurrences, not as crises, helps us scale.

Principles • Ensure Significant Redundancy • Use Logic and Software for Messaging • Employ Distributed Control • Fail Stop and Restart • Employ Zoning • Notice and Quarantine Faults • Key Points • These principles: • Build upon each other • Enable Akamai’s highly reliable service and ability to scale
