600 likes | 743 Views
Introduction to cloud computing. Jiaheng Lu Department of Computer Science Renmin University of China www.jiahenglu.net. Yahoo ! Cloud computing . Yahoo! Cloud Stack. EDGE. Horizontal Cloud Services. YCS. YCPI . Brooklyn. …. WEB. Horizontal Cloud Services. VM/OS. yApache. PHP.
E N D
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China www.jiahenglu.net
Yahoo! Cloud Stack EDGE Horizontal Cloud Services YCS YCPI Brooklyn … WEB Horizontal Cloud Services VM/OS yApache PHP App Engine APP Provisioning (Self-serve) Monitoring/Metering/Security Horizontal Cloud Services VM/OS Serving Grid … Data Highway STORAGE Horizontal Cloud Services Sherpa MOBStor … BATCH Horizontal Cloud Services Hadoop …
Web Data Management • CRUD • Point lookups and short scans • Index organized table and random I/Os • $ per latency • Scan oriented workloads • Focus on sequential disk I/O • $ per cpu cycle Structured record storage (PNUTS/Sherpa) Large data analysis (Hadoop) • Object retrieval and streaming • Scalable file storage • $ per GB Blob storage (SAN/NAS)
The World Has Changed • Web serving applications need: • Scalability! • Preferably elastic • Flexible schemas • Geographic distribution • High availability • Reliable storage • Web serving applications can do without: • Complicated queries • Strong transactions
PNUTS / SHERPA To Help You Scale Your Mountains of Data
Yahoo! Serving Storage Problem • Small records – 100KB or less • Structured records – lots of fields, evolving • Extreme data scale - Tens of TB • Extreme request scale - Tens of thousands of requests/sec • Low latency globally - 20+ datacenters worldwide • High Availability - outages cost $millions • Variable usage patterns - as applications and users change 9
What is PNUTS/Sherpa? A 42342 E A 42342 E B 42521 W B 42521 W C 66354 W D 12352 E F 15677 E A 42342 E E 75656 C B 42521 W C 66354 W C 66354 W D 12352 E D 12352 E E 75656 C E 75656 C F 15677 E F 15677 E CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … ) Structured, flexible schema Geographic replication Parallel database Hosted, managed infrastructure 11
A 42342 E A 42342 E A 42342 E B 42521 W B 42521 W B 42521 W C 66354 W C 66354 W C 66354 W D 12352 E D 12352 E D 12352 E E 75656 C E 75656 C E 75656 C F 15677 E F 15677 E F 15677 E What Will It Become? Indexes and views
Design Goals Consistency Per-record guarantees Timeline model Option to relax if needed Multiple access paths Hash table, ordered table Primary, secondary access Hosted service Applications plug and play Share operational cost Scalability Thousands of machines Easy to add capacity Restrict query language to avoid costly queries Geographic replication Asynchronous replication around the globe Low-latency local access High availability and fault tolerance Automatically recover from failures Serve reads and writes despite failures 14
Technology Elements Applications Tabular API PNUTS API • PNUTS • Query planning and execution • Index maintenance • Distributed infrastructure for tabular data • Data partitioning • Update consistency • Replication YCA: Authorization • YDOT FS • Ordered tables • YDHT FS • Hash tables • Tribble • Pub/sub messaging • Zookeeper • Consistency service 15
Data Manipulation Per-record operations Get Set Delete Multi-record operations Multiget Scan Getrange 16
Tablets—Hash Table Name Description Price 0x0000 $12 Grape Grapes are good to eat $9 Limes are green Lime $1 Apple Apple is wisdom $900 Strawberry Strawberry shortcake 0x2AF3 $2 Orange Arrgh! Don’t get scurvy! $3 Avocado But at what price? Lemon How much did you pay for this lemon? $1 $14 Is this a vegetable? Tomato 0x911F $2 The perfect fruit Banana $8 Kiwi New Zealand 0xFFFF 17
Tablets—Ordered Table Name Description Price A $1 Apple Apple is wisdom $3 Avocado But at what price? $2 Banana The perfect fruit $12 Grape Grapes are good to eat H $8 Kiwi New Zealand Lemon $1 How much did you pay for this lemon? Limes are green Lime $9 $2 Orange Arrgh! Don’t get scurvy! Q $900 Strawberry Strawberry shortcake $14 Is this a vegetable? Tomato Z 18
Detailed Architecture Remote regions Local region Clients REST API Routers Tribble Tablet Controller Storage units 20
Tablet Splitting and Balancing Storage unit Tablet Each storage unit has many tablets (horizontal partitions of the table) Storage unit may become a hotspot Tablets may grow over time Overfull tablets split Shed load by moving tablets to other servers 21
Accessing Data Record for key k Get key k Record for key k 1 2 3 4 Get key k SU SU SU 23
Bulk Read {k1, k2, … kn} Get k1 Get k2 Get k3 Scatter/ gather server 1 2 SU SU SU 24
Storage unit 1 Canteloupe Storage unit 3 Lime Storage unit 2 Strawberry Storage unit 1 Grapefruit…Pear? Grapefruit…Lime? Storage unit 1 Canteloupe Storage unit 3 Lime Storage unit 2 Strawberry Storage unit 1 Lime…Pear? Router Storage unit 1 Storage unit 2 Storage unit 3 Range Queries in YDOT • Clustered, ordered retrieval of records Apple Avocado Banana Blueberry Canteloupe Grape Kiwi Lemon Lime Mango Orange Strawberry Tomato Watermelon Apple Avocado Banana Blueberry Strawberry Tomato Watermelon Lime Mango Orange Canteloupe Grape Kiwi Lemon
Updates Write key k SU SU SU 6 5 2 4 1 8 7 3 Write key k Sequence # for key k Routers Message brokers Write key k Sequence # for key k SUCCESS Write key k 26
Goal: Make it easier for applications to reason about updates and cope with asynchrony What happens to a record with primary key “Alice”? Consistency Model Record inserted Delete Update Update Update Update Update Update Update v. 2 v. 5 v. 1 v. 3 v. 4 v. 6 v. 7 v. 8 Time Time Generation 1 As the record is updated, copies may get out of sync. 29
Example: Social Alice East Record Timeline West ___ Busy Free Free
Consistency Model Read Stale version Current version Stale version v. 2 v. 5 v. 1 v. 3 v. 4 v. 6 v. 7 v. 8 Time Generation 1 In general, reads are served using a local copy 31
Consistency Model Read up-to-date Stale version Current version Stale version v. 2 v. 5 v. 1 v. 3 v. 4 v. 6 v. 7 v. 8 Time Generation 1 But application can request and get current version 32
Consistency Model Read ≥ v.6 Stale version Current version Stale version v. 2 v. 5 v. 1 v. 3 v. 4 v. 6 v. 7 v. 8 Time Generation 1 Or variations such as “read forward”—while copies may lag the master record, every copy goes through the same sequence of changes 33
Consistency Model Write Stale version Current version Stale version v. 2 v. 5 v. 1 v. 3 v. 4 v. 6 v. 7 v. 8 Time Generation 1 Achieved via per-record primary copy protocol (To maximize availability, record masterships automaticlly transferred if site fails) Can be selectively weakened to eventual consistency (local writes that are reconciled using version vectors) 34
Consistency Model Write if = v.7 ERROR Stale version Current version Stale version v. 2 v. 5 v. 1 v. 3 v. 4 v. 6 v. 7 v. 8 Time Generation 1 Test-and-set writes facilitate per-record transactions 35
Consistency Techniques • Per-record mastering • Each record is assigned a “master region” • May differ between records • Updates to the record forwarded to the master region • Ensures consistent ordering of updates • Tablet-level mastering • Each tablet is assigned a “master region” • Inserts and deletes of records forwarded to the master region • Master region decides tablet splits • These details are hidden from the application • Except for the latency impact!
Mastering A 42342 E B 42521 W C 66354 W D 12352 E E 75656 C F 15677 E A 42342 E B 42521 W Tablet master C 66354 W D 12352 E E 75656 C F 15677 E A 42342 E B 42521 W C 66354 W D 12352 E E 75656 C F 15677 E 37
Bulk Insert/Update/Replace • Client feeds records to bulk manager • Bulk loader transfers records to SU’s in batches • Bypass routers and message brokers • Efficient import into storage unit Client Bulk manager Source Data
Bulk Load in YDOT • YDOT bulk inserts can cause performance hotspots • Solution: preallocate tablets
Index Maintenance • How to have lots of interesting indexes and views, without killing performance? • Solution: Asynchrony! • Indexes/views updated asynchronously when base table updated
Types of Record Stores • Query expressiveness S3 PNUTS Oracle Simple Feature rich Object retrieval Retrieval from single table of objects/records SQL
Types of Record Stores • Consistency model S3 PNUTS Oracle Best effort Strong guarantees Eventual consistency Timeline consistency ACID Program centric consistency Object-centric consistency
Types of Record Stores • Data model PNUTS CouchDB Oracle Flexibility, Schema evolution Optimized for Fixed schemas Object-centric consistency Consistency spans objects
Types of Record Stores • Elasticity (ability to add resources on demand) PNUTS S3 Oracle Inelastic Elastic Limited (via data distribution) VLSD (Very Large Scale Distribution /Replication)
User-partitioned SQL stores Microsoft Azure SDS Amazon SimpleDB Multi-tenant application databases Salesforce.com Oracle on Demand Mutable object stores Amazon S3 Versus PNUTS More expressive queries Users must control partitioning Limited elasticity Highly optimized for complex workloads Limited flexibility to evolving applications Inherit limitations of underlying data management system Object storage versus record management Data Stores Comparison
Application Design Space Get a few things Sherpa MObStor YMDB MySQL Oracle Filer BigTable Scan everything Hadoop Everest Files Records 47
Alternatives Matrix Consistency model Structured access Global low latency SQL/ACID Availability Operability Updates Elastic Sherpa Y! UDB MySQL Oracle HDFS BigTable Dynamo Cassandra 48
Further Reading Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008) Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008) Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni Asynchronous View Maintenance for VLSD Databases, Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava and Raghu Ramakrishnan SIGMOD 2009 (to appear) Cloud Storage Design in a PNUTShell Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava Beautiful Data, O’Reilly Media, 2009 (to appear)
QUESTIONS? 50
Problem How do you scale up applications? Run jobs processing 100’s of terabytes of data Takes 11 days to read on 1 computer Need lots of cheap computers Fixes speed problem (15 minutes on 1000 computers), but… Reliability problems In large clusters, computers fail every day Cluster size is not fixed Need common infrastructure Must be efficient and reliable
Solution Open Source Apache Project Hadoop Core includes: Distributed File System - distributes data Map/Reduce - distributes application Written in Java Runs on Linux, Mac OS/X, Windows, and Solaris Commodity hardware
Hardware Cluster of Hadoop Typically in 2 level architecture Nodes are commodity PCs 40 nodes/rack Uplink from rack is 8 gigabit Rack-internal is 1 gigabit