
Some Claims about Giant-Scale Systems



Presentation Transcript


  1. Some Claims about Giant-Scale Systems Prof. Eric A. Brewer, UC Berkeley Vanguard Conference, May 5, 2003

  2. Five Claims • Scalability is easy (availability is hard) • How to build a giant-scale system • Services are King (infrastructure centric) • P2P is cool but overrated • XML doesn’t help much • Need new IT for the Third World Vanguard Conference

  3. Claim 1: Availability >> Scalability

  4. Key New Problems • Unknown but large growth • Incremental & Absolute scalability • 1000’s of components • Must be truly highly available • Hot swap everything (no recovery time) • $6M/hour for stocks, $300k/hour for retailers • Graceful degradation under faults & saturation • Constant evolution (internet time) • Software will be buggy • Hardware will fail • These can’t be emergencies... Vanguard Conference

  5. Typical Cluster Vanguard Conference

  6. Scalability is EASY • Just add more nodes…. • Unless you want HA • Or you want to change the system… • Hard parts: • Availability • Overload • Evolution Vanguard Conference

  7. Step 1: Basic Architecture Vanguard Conference

  8. Step 2: Divide Persistent Data • Transactional (expensive, must be correct) • Supporting Data • HTML, images, applets, etc. • Read mostly • Publish in snapshots (normally stale) • Session state • Persistent across connections • Limited lifetime • Can be lost (rarely) Vanguard Conference

  9. Step 3: Basic Availability • a) depend on layer-7 switches • Isolate external names (IP address, port) from specific machines • b) automatic detection of problems • Node-level checks (e.g. memory footprint) • (Remote) App-level checks • c) Focus on MTTR • Easier to fix & test than MTBF Vanguard Conference
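A minimal sketch of what the step-3 health checks might look like in practice, assuming a hypothetical /healthz endpoint and an illustrative memory threshold (neither is from the talk):

```python
# Sketch of step 3(b): node-level and (remote) app-level checks that a
# layer-7 switch or monitor could poll. Thresholds and the /healthz URL
# are illustrative assumptions.
import resource
import urllib.request

MAX_RSS_BYTES = 2 * 1024**3  # assumed per-node memory-footprint limit

def node_ok() -> bool:
    """Node-level check: is this process's memory footprint within bounds?"""
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    return rss_kb * 1024 < MAX_RSS_BYTES

def app_ok(url: str = "http://localhost:8080/healthz", timeout: float = 1.0) -> bool:
    """App-level check: does the service answer a trivial request in time?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def healthy() -> bool:
    # The switch stops routing to this node as soon as either check fails;
    # detection is automatic, which is what keeps MTTR (not MTBF) low.
    return node_ok() and app_ok()
```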

  10. Step 4: Overload Strategy • Goal: degrade service (some) to allow more capacity during overload • Examples: • Simpler pages (less dynamic, smaller size) • Fewer options (just the basics) • Must kick in automatically • “overload mode” • Relatively easy to detect Vanguard Conference
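One way the step-4 “overload mode” could kick in automatically is a simple utilization governor with hysteresis; the metric, thresholds, and degraded rendering below are assumptions for illustration:

```python
# Sketch of an automatic "overload mode" switch. Thresholds and the
# render_basic/render_full methods are illustrative assumptions.
ENTER_OVERLOAD = 0.85   # fraction of capacity at which we start degrading
EXIT_OVERLOAD = 0.70    # hysteresis so the mode doesn't flap at the boundary

class OverloadGovernor:
    def __init__(self) -> None:
        self.overloaded = False

    def update(self, utilization: float) -> bool:
        if not self.overloaded and utilization >= ENTER_OVERLOAD:
            self.overloaded = True
        elif self.overloaded and utilization <= EXIT_OVERLOAD:
            self.overloaded = False
        return self.overloaded

def render(page, governor: OverloadGovernor, utilization: float) -> str:
    # In overload mode, serve the simpler page: less dynamic content,
    # smaller size, just the basic options.
    if governor.update(utilization):
        return page.render_basic()
    return page.render_full()
```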

  11. Step 5: Disaster Tolerance • a) pick a few locations • Independent failure • b) Dynamic redirection of load • Best: client-side control • Next: switch all traffic (long routes) • Worst: DNS remapping (takes a while) • c) Target site will get overloaded • -- but you have overload handling Vanguard Conference

  12. Step 6: Online Evolution • Goal: rapid evolution without downtime • a) Publishing model • Decouple development from live system • Atomic push of content • Automatic revert if trouble arises • b) Three methods Vanguard Conference

  13. Evolution: Three Approaches • Flash Upgrade • Fast reboot into new version • Focus on MTTR (< 10 sec) • Reduces yield (and uptime) • Rolling Upgrade • Upgrade nodes one at a time in a “wave” • Temporary 1/n harvest reduction, 100% yield • Requires co-existing versions • “Big Flip” Vanguard Conference
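A sketch of the rolling-upgrade “wave”, with hypothetical drain/upgrade/health helpers; the point is the temporary 1/n harvest reduction at 100% yield, and that old and new versions must co-exist while the wave is in flight:

```python
# Rolling upgrade: one node at a time, so at most 1/n of capacity (or
# harvest) is out at any moment. Helper functions are assumed, not real APIs.
def rolling_upgrade(nodes, new_version, drain, upgrade, healthy):
    for node in nodes:                # one node at a time => 1/n reduction
        drain(node)                   # stop routing new queries to this node
        upgrade(node, new_version)    # install the new version and restart
        if not healthy(node):
            raise RuntimeError(f"{node} failed health check; halt the wave")
        # node rejoins the pool; mixed versions serve until the wave finishes
```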

  14. The Big Flip • Steps: 1) take down 1/2 the nodes 2) upgrade that half 3) flip the “active half” (site upgraded) 4) upgrade second half 5) return to 100% • Avoids mixed versions (!) • can replace schema, protocols, ... • Twice used to change physical location Vanguard Conference
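For contrast, a sketch of the big-flip sequence above (again with hypothetical helpers); only one version is ever live, at the price of running at 50% capacity during the flip:

```python
# The "big flip": upgrade half the site offline, then swap which half
# serves traffic. take_down/upgrade/bring_up are assumed helpers.
def big_flip(half_a, half_b, new_version, take_down, upgrade, bring_up):
    take_down(half_b)                 # 1) take down 1/2 the nodes
    upgrade(half_b, new_version)      # 2) upgrade that half offline
    take_down(half_a)                 # 3) flip the "active half":
    bring_up(half_b)                  #    the upgraded half now serves the site
    upgrade(half_a, new_version)      # 4) upgrade the second half
    bring_up(half_a)                  # 5) return to 100% capacity, one version
```

In practice the flip in step 3 would presumably be done at the layer-7 switches from step 3 of the architecture, so clients never see both versions at once.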

  15. The Hat Trick • Merge the handling of: • Disaster tolerance • Online evolution • Overload handling • The first two reduce capacity, which then kicks in overload handling (!) Vanguard Conference

  16. Claim 2:Services are King

  17. Coming of The Infrastructure Vanguard Conference

  18. Infrastructure Services • Much simpler devices • lower cost & more functionality • longer battery life • Data is in the infrastructure • can lose the device • enables groupware • can update/access from home or work • phone book on the web, not in the phone • can use a real PC & keyboard • Much faster access • Surfing is 3-7 times faster • Graphics look good Vanguard Conference

  19. Transformation Examples • Tailor content for each user & device • [Figure: content-distillation examples with reduction factors of 6.8x, 65x, and 10x, including a distilled text excerpt from “The Remote Queue Model”]

  20. Infrastructure Services (2) • Much cheaper overall cost (20x?) • Device utilization = 4%, infrastructure = 80% • Admin & support costs also decrease • “Super Convergence” (all -> IP) • View powerpoint slides with teleconference • Integrated cell phone, pager, web/email access • Map, driving directions, location-based services • Can upgrade/add services in place! • Devices last longer and grow in usefulness • Easy to deploy new services => new revenue Vanguard Conference

  21. Internet Phases (prediction) • Internet as New Media • HTML, basic search • Consumer Services (today) • Shopping, travel, Yahoo!, eBay, tickets, … • Industrial Services • XML, micropayments, spot markets • Everything in the Infrastructure • store your data in the infrastructure • access anytime/anywhere Vanguard Conference

  22. Claim 3:P2P is cool but overrated

  23. P2P Services? • Not soon… • Challenges • Untrusted nodes !! • Network Partitions • Much harder to understand behavior • Harder to upgrade • Relatively few advantages… Vanguard Conference

  24. Better: Smart Clients • Mostly helps with: • Load balancing • Disaster tolerance • Overload • Can also offload work from servers • Can also personalize results • E.g. mix search results locally • Can include private data Vanguard Conference

  25. Claim 4:XML doesn’t help much…

  26. The Problem • Need services for computers to use • HTML only works for people • Sites depend on human interpretation of ambiguous content • “Scraping” content is BAD • Very error prone • No strategy for evolution • XML doesn’t solve any of these issues! • At best: RPC with an extensible schema Vanguard Conference

  27. Why it is hard… • The real problem is *social* • What do the fields mean? • Who gets to decide? • Doesn’t make evolution better… • Two sides still need to agree on schema • Can ignore stuff you don’t understand? • When can a field change? Consequences? • At least need versioning system… • XML can mislead us to ignore/postpone the real issues! Vanguard Conference

  28. Claim 5:New IT for the Third World

  29. Plug for new area… • Bridging the IT gap is the only long-term path to global stability • Convergence makes it possible: • 802.11 wireless ($5/chipset) • Systems on a chip (cost, power) • Infrastructure services (cost, power) • Goal: 10-100x reduction in overall cost and power Vanguard Conference

  30. Five Claims • Scalability is easy (availability is hard) • How to build a giant-scale system • Services are King (infrastructure centric) • P2P is cool but overrated • XML doesn’t help much • Need new IT for the Third World Vanguard Conference

  31. Backup

  32. Refinement • Retrieve part of distilled object at higher quality • [Figure: image distilled by 60x, with a zoom back in to original resolution]

  33. The CAP Theorem • Consistency • Availability • Tolerance to network Partitions • Theorem: You can have at most two of these properties for any shared-data system Vanguard Conference

  34. Forfeit Partitions (keep Consistency & Availability) • Examples • Single-site databases • Cluster databases • LDAP • xFS file system • Traits • 2-phase commit • cache validation protocols Vanguard Conference

  35. Forfeit Availability (keep Consistency & Partition tolerance) • Examples • Distributed databases • Distributed locking • Majority protocols • Traits • Pessimistic locking • Make minority partitions unavailable Vanguard Conference

  36. Forfeit Consistency (keep Availability & Partition tolerance) • Examples • Coda • Web caching • DNS • Traits • expirations/leases • conflict resolution • optimistic Vanguard Conference

  37. These Tradeoffs are Real • The whole space is useful • Real internet systems are a careful mixture of ACID and BASE subsystems • We use ACID for user profiles and logging (for revenue) • But there is almost no work in this area • Symptom of a deeper problem: systems and database communities are separate but overlapping (with distinct vocabulary) Vanguard Conference

  38. CAP Take Homes • Can have consistency & availability within a cluster (foundation of Ninja), but it is still hard in practice • OS/Networking good at BASE/Availability, but terrible at consistency • Databases better at C than Availability • Wide-area databases can’t have both • Disconnected clients can’t have both • All systems are probabilistic… Vanguard Conference

  39. The DQ Principle Data/query * Queries/sec = constant = DQ • for a given node • for a given app/OS release • A fault can reduce the capacity (Q), completeness (D) or both • Faults reduce this constant linearly (at best) Vanguard Conference
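A toy illustration of the DQ principle with made-up numbers; the only claim carried over from the slide is that DQ per node is roughly fixed and that faults remove capacity (at best) linearly:

```python
# DQ principle: (data per query) * (queries per second) ~ constant per node,
# so losing nodes shrinks the total DQ pool linearly -- you then choose
# whether the loss shows up in capacity (Q), completeness (D), or both.
DQ_PER_NODE = 50.0    # assumed constant for one node on one app/OS release
NODES = 100

def total_dq(failed_nodes: int) -> float:
    return DQ_PER_NODE * (NODES - failed_nodes)

print(total_dq(0), total_dq(5))   # 5000.0 4750.0 -> 5% of DQ is gone, somewhere
```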

  40. Harvest & Yield • Yield: Fraction of Answered Queries • Related to uptime but measured by queries, not by time • Drop 1 out of 10 connections => 90% yield • At full utilization: yield ~ capacity ~ Q • Harvest: Fraction of the Complete Result • Reflects that some of the data may be missing due to faults • Replication: maintain D under faults • DQ corollary: harvest * yield ~ constant • ACID => choose 100% harvest (reduce Q but 100% D) • Internet => choose 100% yield (available but reduced D) Vanguard Conference
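The DQ corollary on this slide can be written as a tiny calculation; the 10% fault cost below is just an example number:

```python
# harvest * yield ~ constant: under a fault you choose where to spend the loss.
def acid_style(dq_fraction: float):
    # 100% harvest: every answer is complete, but some queries are dropped.
    return {"harvest": 1.0, "yield": dq_fraction}

def internet_style(dq_fraction: float):
    # 100% yield: every query is answered, but answers may be missing data.
    return {"harvest": dq_fraction, "yield": 1.0}

after_fault = 0.90   # suppose a fault costs 10% of DQ
print(acid_style(after_fault))      # {'harvest': 1.0, 'yield': 0.9}
print(internet_style(after_fault))  # {'harvest': 0.9, 'yield': 1.0}
```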

  41. Harvest Options • 1) Ignore lost nodes • RPC gives up • forfeit small part of the database • reduce D, keep Q • 2) Pair up nodes (RAID-like mirroring) • RPC tries alternate • survives one fault per pair • reduce Q, keep D • 3) n-member replica groups • Decide when you care...

  42. Replica Groups With n members: • Each fault reduces Q by 1/n • D stable until nth fault • Added load is 1/(n-1) per fault • n=2 => double load or 50% capacity • n=4 => 133% load or 75% capacity • “load redirection problem” • Disaster tolerance: better have >3 mirrors Vanguard Conference
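The arithmetic behind those bullets, spelled out:

```python
# n-member replica group, one failed member: the group loses 1/n of its Q,
# and each survivor picks up 1/(n-1) extra load (the "load redirection problem").
def after_one_fault(n: int):
    capacity = (n - 1) / n            # fraction of the group's capacity left
    load_per_survivor = n / (n - 1)   # relative load on each remaining member
    return capacity, load_per_survivor

print(after_one_fault(2))   # (0.5, 2.0)    -> 50% capacity, double load
print(after_one_fault(4))   # (0.75, ~1.33) -> 75% capacity, ~133% load
```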

  43. Graceful Degradation • Goal: smooth decrease in harvest/yield proportional to faults • we know DQ drops linearly • Saturation will occur • high peak/average ratios... • must reduce harvest or yield (or both) • must do admission control!!! • One answer: reduce D dynamically • disaster => redirect load, then reduce D to compensate for extra load Vanguard Conference
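The slide insists on admission control without prescribing a policy; one possible sketch of such a policy, with entirely assumed thresholds and a hypothetical shrink_harvest hook, is:

```python
# Illustrative admission-control policy: under saturation, first shrink D per
# query (reduce harvest), then start rejecting queries (reduce yield).
def admit(utilization: float, shrink_harvest) -> str:
    if utilization < 0.9:
        return "serve-full"          # normal operation: full harvest and yield
    if utilization < 1.0:
        shrink_harvest(0.5)          # e.g. search half the index per query
        return "serve-degraded"      # harvest takes the hit
    return "reject"                  # drop the query: yield takes the hit
```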

  44. Thinking Probabilistically • Maximize symmetry • SPMD + simple replication schemes • Make faults independent • requires thought • avoid cascading errors/faults • understand redirected load • KISS • Use randomness • makes worst-case and average case the same • ex: Inktomi spreads data & queries randomly • Node loss implies a random 1% harvest reduction Vanguard Conference
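A sketch of the “use randomness” point: hash-spread data (and queries) uniformly so that losing any one of 100 nodes looks like a random ~1% harvest reduction and the worst case matches the average case. The spreading function is an assumption, not Inktomi's actual scheme:

```python
import hashlib

NODES = 100

def owner(key: str) -> int:
    # Pseudo-random, uniform placement of keys across nodes.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NODES

def lookup(key: str, alive: set):
    node = owner(key)
    if node in alive:
        return f"query node {node} for {key}"
    return None   # that random slice of the data is missing: reduced harvest
```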

  45. Server Pollution • Can’t fix all memory leaks • Third-party software leaks memory and sockets • so does the OS sometimes • Some failures tie up local resources • Solution: planned periodic “bounce” • Not worth the stress to do any better • Bounce time is less than 10 seconds • Nice to remove load first… Vanguard Conference
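A sketch of the planned periodic “bounce”: drain load first, restart quickly, and stagger the wave so only one node is out at a time. The helper functions and the one-minute stagger are assumptions:

```python
import time

def bounce_wave(nodes, drain, restart, rejoin, stagger_s=60):
    for node in nodes:
        drain(node)                   # nice to remove load first
        t0 = time.monotonic()
        restart(node)                 # clears leaked memory, sockets, etc.
        assert time.monotonic() - t0 < 10, "bounce should take < 10 seconds"
        rejoin(node)
        time.sleep(stagger_s)         # only one node out of the pool at a time
```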

  46. Key New Problems • Unknown but large growth • Incremental & Absolute scalability • 1000’s of components • Must be truly highly available • Hot swap everything (no recovery time allowed) • No “night” • Graceful degradation under faults & saturation • Constant evolution (internet time) • Software will be buggy • Hardware will fail • These can’t be emergencies... Vanguard Conference

  47. Conclusions • Parallel Programming is very relevant, except… • historically avoids availability • no notion of online evolution • limited notions of graceful degradation (checkpointing) • best for CPU-bound tasks • Must think probabilistically about everything • no such thing as a 100% working system • no such thing as 100% fault tolerance • partial results are often OK (and better than none) • Capacity * Completeness == Constant Vanguard Conference

  48. Partial checklist • What is shared? (namespace, schema?) • What kind of state in each boundary? • How would you evolve an API? • Lifetime of references? Expiration impact? • Graceful degradation as modules go down? • External persistent names? • Consistency semantics and boundary? Vanguard Conference

  49. The Move to Clusters • No single machine can handle the load • Only solution is clusters • Other cluster advantages: • Cost: about 50% cheaper per CPU • Availability: possible to build HA systems • Incremental growth: add nodes as needed • Replace whole nodes (easier) Vanguard Conference

  50. Goals • Sheer scale • Handle 100M users, going toward 1B • Largest: AOL Web Cache: 12B hits/day • High Availability • Large cost for downtime • $250K per hour for online retailers • $6M per hour for stock brokers • Disaster Tolerance? • Overload Handling • System Evolution • Decentralization? Vanguard Conference
