Introduction to cloud computing. Jiaheng Lu Department of Computer Science Renmin University of China www.jiahenglu.net. Advanced MapReduce Application Reference: Jimmy Lin http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/schedule.html. Managing Dependencies.
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China www.jiahenglu.net
Advanced MapReduce Application • Reference: Jimmy Lin • http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/schedule.html
Managing Dependencies • Remember: Mappers run in isolation • You have no idea in what order the mappers run • You have no idea on what node the mappers run • You have no idea when each mapper finishes • Tools for synchronization: • Ability to hold state in reducer across multiple key-value pairs • Sorting function for keys • Partitioner • Cleverly-constructed data structures
Motivating Example • Term co-occurrence matrix for a text collection • M = N x N matrix (N = vocabulary size) • Mij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence) • Why? • Distributional profiles as a way of measuring semantic distance • Semantic distance useful for many language processing tasks e.g., Mohammad and Hirst (EMNLP, 2006)
MapReduce: Large Counting Problems • Term co-occurrence matrix for a text collection = specific instance of a large counting problem • A large event space (number of terms) • A large number of events (the collection itself) • Goal: keep track of interesting statistics about the events • Basic approach • Mappers generate partial counts • Reducers aggregate partial counts
First Try: “Pairs” • Each mapper takes a sentence: • Generate all co-occurring term pairs • For all pairs, emit (a, b) → count • Reducers sum up the counts associated with these pairs • Use combiners!
“Pairs” Analysis • Advantages • Easy to implement, easy to understand • Disadvantages • Lots of pairs to sort and shuffle around (upper bound?)
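The “pairs” approach can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code; the function names `pairs_mapper` and `pairs_reducer` and the two toy sentences are assumptions made for the example. Tokens are split on whitespace, and every ordered pair of distinct positions in a sentence counts as one co-occurrence.

```python
from collections import Counter
from itertools import permutations


def pairs_mapper(sentence):
    """Emit ((a, b), 1) for every ordered pair of distinct positions in a sentence."""
    terms = sentence.split()
    for a, b in permutations(terms, 2):
        yield (a, b), 1


def pairs_reducer(emitted):
    """Sum the partial counts that share the same (a, b) key."""
    counts = Counter()
    for key, value in emitted:
        counts[key] += value
    return counts


# Toy collection (hypothetical data): two short "sentences".
sentences = ["a b c", "a b"]
emitted = [kv for s in sentences for kv in pairs_mapper(s)]
pair_counts = pairs_reducer(emitted)
```

A sentence with n tokens produces n(n - 1) pairs in this sketch, which illustrates why the number of intermediate key-value pairs grows quickly.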
Another Try: “Stripes” • Idea: group together pairs into an associative array • (a, b) → 1, (a, c) → 2, (a, d) → 5, (a, e) → 3, (a, f) → 2 becomes a → { b: 1, c: 2, d: 5, e: 3, f: 2 } • Each mapper takes a sentence: • Generate all co-occurring term pairs
Another Try: “Stripes” • Reducers perform element-wise sum of associative arrays • a → { b: 1, d: 5, e: 3 } + a → { b: 1, c: 2, d: 2, f: 2 } = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
“Stripes” Analysis • Advantages • Far less sorting and shuffling of key-value pairs • Can make better use of combiners • Disadvantages • More difficult to implement • Underlying object is more heavyweight • Fundamental limitation in terms of size of event space
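The “stripes” idea can be sketched the same way, again as plain Python rather than Hadoop code. The names `stripes_mapper` and `stripes_reducer` are assumptions for the example; `collections.Counter` stands in for the associative array, and `Counter.update` performs the element-wise sum.

```python
from collections import Counter, defaultdict


def stripes_mapper(sentence):
    """For each term position, emit (term, stripe of co-occurring terms)."""
    terms = sentence.split()
    for i, a in enumerate(terms):
        stripe = Counter(b for j, b in enumerate(terms) if j != i)
        if stripe:
            yield a, stripe


def stripes_reducer(emitted):
    """Element-wise sum of all stripes that share the same key."""
    totals = defaultdict(Counter)
    for a, stripe in emitted:
        totals[a].update(stripe)  # Counter.update adds counts
    return totals


# The two partial stripes for key "a" from the slide.
slide_stripes = [
    ("a", Counter({"b": 1, "d": 5, "e": 3})),
    ("a", Counter({"b": 1, "c": 2, "d": 2, "f": 2})),
]
merged = stripes_reducer(slide_stripes)

# Running the mapper on a toy sentence (hypothetical data).
from_sentence = dict(stripes_mapper("a b c"))
```

Each key now carries one larger value instead of many small pairs, which is the trade-off listed in the analysis: fewer key-value pairs to shuffle, but a heavier object per key.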
Cluster size: 38 cores Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
Conditional Probabilities • How do we compute conditional probabilities from counts? • Why do we want to do this? • How do we do this with MapReduce?
P(B|A): “Pairs” • Reducer input: (a, *) → 32 (reducer holds this value in memory), (a, b1) → 3, (a, b2) → 12, (a, b3) → 7, (a, b4) → 1, … • Reducer output: (a, b1) → 3 / 32, (a, b2) → 12 / 32, (a, b3) → 7 / 32, (a, b4) → 1 / 32, … • For this to work: • Must emit extra (a, *) for every bn in mapper • Must make sure all a’s get sent to same reducer (use Partitioner) • Must make sure (a, *) comes first (define sort order)
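The three requirements can be simulated in plain Python as follows. This is a sketch, not Hadoop code: `emit_with_marginals`, `partition`, `sort_key`, and `reducer` are hypothetical names, the CRC32-based partitioner is only one possible choice, and the count for `b5` is made up so that the marginal equals the 32 shown on the slide.

```python
import zlib
from itertools import groupby


def emit_with_marginals(pair_counts):
    """Mapper side: for every (a, b) count, also emit the same count under (a, '*')."""
    for (a, b), n in pair_counts:
        yield (a, "*"), n
        yield (a, b), n


def partition(key, num_reducers):
    """Partition on the left term only, so every (a, ...) key reaches the same reducer."""
    a, _ = key
    return zlib.crc32(a.encode("utf-8")) % num_reducers


def sort_key(item):
    """Sort order: group by a, and place (a, '*') before any (a, b)."""
    (a, b), _ = item
    return (a, b != "*", b)


def reducer(sorted_items):
    """Hold the (a, *) total in memory, then divide each (a, b) count by it."""
    marginal = None
    for (a, b), group in groupby(sorted_items, key=lambda kv: kv[0]):
        total = sum(n for _, n in group)
        if b == "*":
            marginal = total
        else:
            yield (a, b), total / marginal


# Counts from the slide; ("a", "b5") is a hypothetical filler for the elided entries.
counts = [(("a", "b1"), 3), (("a", "b2"), 12), (("a", "b3"), 7),
          (("a", "b4"), 1), (("a", "b5"), 9)]
shuffled = sorted(emit_with_marginals(counts), key=sort_key)
probabilities = dict(reducer(shuffled))
```

The reducer never needs to buffer the individual (a, b) counts, because the sort order guarantees that the marginal arrives before they do.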
P(B|A): “Stripes” • a → { b1: 3, b2: 12, b3: 7, b4: 1, … } • Easy! • One pass to compute (a, *) • Another pass to directly compute P(B|A)
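With a stripe, both passes happen inside one reducer call. A minimal sketch in plain Python, where `stripe_to_conditional` is a hypothetical name and the `b5` entry is made-up filler for the elided part of the stripe:

```python
def stripe_to_conditional(stripe):
    """Turn one stripe of counts into conditional probabilities P(b | a)."""
    total = sum(stripe.values())                       # first pass: the (a, *) marginal
    return {b: n / total for b, n in stripe.items()}   # second pass: normalize


# Stripe from the slide; "b5" is a hypothetical filler so the total is 32.
stripe_a = {"b1": 3, "b2": 12, "b3": 7, "b4": 1, "b5": 9}
p_b_given_a = stripe_to_conditional(stripe_a)
```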
Synchronization in Hadoop • Approach 1: turn synchronization into an ordering problem • Sort keys into correct order of computation • Partition key space so that each reducer gets the appropriate set of partial results • Hold state in reducer across multiple key-value pairs to perform computation • Approach 2: construct data structures that “bring the pieces together” • Each reducer receives all the data it needs to complete the computation
Issues and Tradeoffs • Number of key-value pairs • Object creation overhead • Time for sorting and shuffling pairs across the network • Size of each key-value pair • De/serialization overhead • Combiners make a big difference! • RAM vs. disk and network • Arrange data to maximize opportunities to aggregate partial results
Data Types in Hadoop • Writable: Defines a de/serialization protocol. Every data type in Hadoop is a Writable. • WritableComparable: Defines a sort order. All keys must be of this type (but not values). • IntWritable, LongWritable, Text, …: Concrete classes for different data types.
Complex Data Types in Hadoop • How do you implement complex data types? • The easiest way: • Encode it as Text, e.g., (a, b) = “a:b” • Use regular expressions to parse and extract data • The hard way: • Define a custom implementation of WritableComparable • Must implement: readFields, write, compareTo • Computationally efficient, but slow for rapid prototyping
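Both options can be illustrated with a Python analog. This is not the Hadoop API: `encode_pair`, `decode_pair`, and the `TermPair` class are hypothetical, and the methods `write`, `read_fields`, and `compare_to` only mirror the roles of `write`, `readFields`, and `compareTo`. The length-prefixed binary layout is an assumption chosen for the sketch.

```python
import io
import re
import struct
from functools import total_ordering

# Easy way: encode the pair as text and parse it back with a regular expression.
PAIR_RE = re.compile(r"^([^:]+):([^:]+)$")


def encode_pair(a, b):
    return f"{a}:{b}"


def decode_pair(text):
    match = PAIR_RE.match(text)
    if match is None:
        raise ValueError(f"not an encoded pair: {text!r}")
    return match.group(1), match.group(2)


# Hard way: a custom type with its own binary serialization and sort order.
@total_ordering
class TermPair:
    def __init__(self, left="", right=""):
        self.left, self.right = left, right

    def write(self, out):
        """Serialize each term as a 4-byte big-endian length followed by UTF-8 bytes."""
        for term in (self.left, self.right):
            data = term.encode("utf-8")
            out.write(struct.pack(">I", len(data)))
            out.write(data)

    def read_fields(self, inp):
        """Deserialize in the same order that write() used."""
        terms = []
        for _ in range(2):
            (length,) = struct.unpack(">I", inp.read(4))
            terms.append(inp.read(length).decode("utf-8"))
        self.left, self.right = terms

    def compare_to(self, other):
        """Return -1, 0, or 1; sort by left term, then right term."""
        mine, theirs = (self.left, self.right), (other.left, other.right)
        return (mine > theirs) - (mine < theirs)

    def __eq__(self, other):
        return self.compare_to(other) == 0

    def __lt__(self, other):
        return self.compare_to(other) < 0


buffer = io.BytesIO()
TermPair("a", "b").write(buffer)
buffer.seek(0)
restored = TermPair()
restored.read_fields(buffer)
```

The text encoding is quick to write but breaks if a term itself contains the separator, which is one reason the custom type is the more robust choice.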
Yahoo! Cloud Stack • EDGE (Horizontal Cloud Services): YCS, YCPI, Brooklyn, … • WEB (Horizontal Cloud Services): VM/OS, yApache, PHP, App Engine • APP (Horizontal Cloud Services): VM/OS, Serving Grid, … • STORAGE (Horizontal Cloud Services): Sherpa, MOBStor, … • BATCH (Horizontal Cloud Services): Hadoop, … • Also shown in the diagram: Provisioning (Self-serve), Monitoring/Metering/Security, Data Highway
Yahoo! CCDI Thrust Areas • Fast Provisioning and Machine Virtualization: On demand, deliver a set of hosts imaged with desired software and configured against standard services • Multiple hosts may be multiplexed onto the same physical machine. • Batch Storage and Processing: Scalable data storage optimized for batch processing, together with computational capabilities • Operational Storage: Persistent storage that supports low-latency updates and flexible retrieval • Edge Content Services: Support for dealing with network topology, communication protocols, caching, and BCP Rest of today’s talk
Web Data Management • Structured record storage (PNUTS/Sherpa) • CRUD • Point lookups and short scans • Index organized table and random I/Os • $ per latency • Large data analysis (Hadoop) • Scan oriented workloads • Focus on sequential disk I/O • $ per cpu cycle • Blob storage (SAN/NAS) • Object retrieval and streaming • Scalable file storage • $ per GB
The World Has Changed • Web serving applications need: • Scalability! • Preferably elastic • Flexible schemas • Geographic distribution • High availability • Reliable storage • Web serving applications can do without: • Complicated queries • Strong transactions
PNUTS / SHERPA To Help You Scale Your Mountains of Data
Yahoo! Serving Storage Problem • Small records – 100KB or less • Structured records – lots of fields, evolving • Extreme data scale - Tens of TB • Extreme request scale - Tens of thousands of requests/sec • Low latency globally - 20+ datacenters worldwide • High Availability - outages cost $millions • Variable usage patterns - as applications and users change
What is PNUTS/Sherpa? • CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … ) • Example rows (shown replicated in several copies): A 42342 E; B 42521 W; C 66354 W; D 12352 E; E 75656 C; F 15677 E • Structured, flexible schema • Geographic replication • Parallel database • Hosted, managed infrastructure
What Will It Become? • Indexes and views • Example rows (shown in three copies): A 42342 E; B 42521 W; C 66354 W; D 12352 E; E 75656 C; F 15677 E
Design Goals • Scalability • Thousands of machines • Easy to add capacity • Restrict query language to avoid costly queries • Geographic replication • Asynchronous replication around the globe • Low-latency local access • High availability and fault tolerance • Automatically recover from failures • Serve reads and writes despite failures • Consistency • Per-record guarantees • Timeline model • Option to relax if needed • Multiple access paths • Hash table, ordered table • Primary, secondary access • Hosted service • Applications plug and play • Share operational cost
Technology Elements • Applications • Tabular API, PNUTS API • PNUTS: Query planning and execution; Index maintenance • Distributed infrastructure for tabular data: Data partitioning; Update consistency; Replication • YCA: Authorization • YDOT FS: Ordered tables • YDHT FS: Hash tables • Tribble: Pub/sub messaging • Zookeeper: Consistency service
Data Manipulation • Per-record operations: Get, Set, Delete • Multi-record operations: Multiget, Scan, Getrange • Web service (RESTful) API
Tablets: Hash Table
Name | Description | Price
(tablet boundary 0x0000)
Grape | Grapes are good to eat | $12
Lime | Limes are green | $9
Apple | Apple is wisdom | $1
Strawberry | Strawberry shortcake | $900
(tablet boundary 0x2AF3)
Orange | Arrgh! Don’t get scurvy! | $2
Avocado | But at what price? | $3
Lemon | How much did you pay for this lemon? | $1
Tomato | Is this a vegetable? | $14
(tablet boundary 0x911F)
Banana | The perfect fruit | $2
Kiwi | New Zealand | $8
(tablet boundary 0xFFFF)
Tablets: Ordered Table
Name | Description | Price
(tablet boundary A)
Apple | Apple is wisdom | $1
Avocado | But at what price? | $3
Banana | The perfect fruit | $2
Grape | Grapes are good to eat | $12
(tablet boundary H)
Kiwi | New Zealand | $8
Lemon | How much did you pay for this lemon? | $1
Lime | Limes are green | $9
Orange | Arrgh! Don’t get scurvy! | $2
(tablet boundary Q)
Strawberry | Strawberry shortcake | $900
Tomato | Is this a vegetable? | $14
(tablet boundary Z)
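Both tablet layouts reduce to the same lookup: find the interval that contains a point, where the point is either a hash of the key or the key itself. A minimal Python sketch using the boundaries from the two slides; `tablet_index`, `hash_tablet`, and `ordered_tablet` are hypothetical names, and the 16-bit CRC32-based hash is an assumption, since the slides do not say which hash function is used.

```python
import bisect
import zlib

HASH_BOUNDS = [0x0000, 0x2AF3, 0x911F, 0xFFFF]   # boundaries from the hash table slide
ORDERED_BOUNDS = ["A", "H", "Q", "Z"]            # boundaries from the ordered table slide


def tablet_index(bounds, point):
    """Return i such that bounds[i] <= point < bounds[i + 1], clamped to a valid tablet."""
    i = bisect.bisect_right(bounds, point) - 1
    return min(max(i, 0), len(bounds) - 2)


def hash_tablet(name):
    """Hash-organized table: the tablet is chosen by a hash of the key (assumed hash)."""
    h = zlib.crc32(name.encode("utf-8")) & 0xFFFF
    return tablet_index(HASH_BOUNDS, h)


def ordered_tablet(name):
    """Ordered table: the tablet is chosen by the key itself."""
    return tablet_index(ORDERED_BOUNDS, name)


names = ["Apple", "Avocado", "Banana", "Grape", "Kiwi",
         "Lemon", "Lime", "Orange", "Strawberry", "Tomato"]
ordered_placement = {name: ordered_tablet(name) for name in names}
```

In the ordered layout neighbouring keys land in the same tablet, which is what makes short range scans cheap; in the hash layout they are scattered, which spreads load more evenly.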
Detailed Architecture (diagram) • Components shown: Clients, REST API, Routers, Tribble, Tablet Controller, Storage units • Regions shown: Local region, Remote regions
Tablet Splitting and Balancing • Each storage unit has many tablets (horizontal partitions of the table) • Storage unit may become a hotspot • Tablets may grow over time • Overfull tablets split • Shed load by moving tablets to other servers
Accessing Data (diagram, steps 1–4) • “Get key k” is passed on to a storage unit (SU) • “Record for key k” is passed back
Bulk Read (diagram, steps 1–2) • Request for {k1, k2, … kn} goes to a scatter/gather server • Get k1, Get k2, Get k3 are sent to the storage units (SU)
Range Queries in YDOT • Clustered, ordered retrieval of records • Router map: keys before Cantaloupe → Storage unit 1; Cantaloupe up to Lime → Storage unit 3; Lime up to Strawberry → Storage unit 2; Strawberry onward → Storage unit 1 • Storage unit 1: Apple, Avocado, Banana, Blueberry and Strawberry, Tomato, Watermelon • Storage unit 2: Lime, Mango, Orange • Storage unit 3: Cantaloupe, Grape, Kiwi, Lemon • The query Grapefruit…Pear? is split by the router into Grapefruit…Lime? and Lime…Pear?
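The splitting of a range query can be sketched with the interval map from the slide. This is an illustration in plain Python, not the YDOT implementation; `SPLITS`, `UNITS`, and `route_range` are hypothetical names, ranges are treated as half-open [lo, hi), and the split keys use the standard spelling “Cantaloupe”.

```python
import bisect

# Router interval map from the slide: split keys and the storage unit for each interval.
SPLITS = ["Cantaloupe", "Lime", "Strawberry"]
UNITS = [1, 3, 2, 1]   # one more entry than SPLITS


def route_range(lo, hi):
    """Split the half-open key range [lo, hi) into (start, end, storage_unit) pieces."""
    if lo >= hi:
        return []
    first = bisect.bisect_right(SPLITS, lo)   # interval that contains lo
    last = bisect.bisect_left(SPLITS, hi)     # number of split keys strictly below hi
    bounds = [lo] + SPLITS[first:last] + [hi]
    return [(bounds[i], bounds[i + 1], UNITS[first + i])
            for i in range(len(bounds) - 1)]


pieces = route_range("Grapefruit", "Pear")
```

Under this map, the first piece (Grapefruit up to Lime) goes to storage unit 3 and the second (Lime up to Pear) to storage unit 2, matching where those keys are stored on the slide.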
Updates (diagram, steps 1–8) • “Write key k” passes through the Routers and Message brokers to the storage units (SU) • “Sequence # for key k” is returned along the path • SUCCESS is reported for the write
Consistency Model • Goal: Make it easier for applications to reason about updates and cope with asynchrony • What happens to a record with primary key “Alice”? • Timeline (Generation 1): Record inserted, then a series of Updates, then Delete; versions v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8 over time • As the record is updated, copies may get out of sync.
Example: Social Alice • Regions: East, West • Record timeline: ___, Busy, Free, Free
Consistency Model • Read • Timeline (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8 over time, with copies marked “Stale version” or “Current version” • In general, reads are served using a local copy
Consistency Model • Read up-to-date • Timeline (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8 over time, with copies marked “Stale version” or “Current version” • But application can request and get current version
Consistency Model • Read ≥ v.6 • Timeline (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8 over time, with copies marked “Stale version” or “Current version” • Or variations such as “read forward”: while copies may lag the master record, every copy goes through the same sequence of changes
Consistency Model • Write • Timeline (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8 over time, with copies marked “Stale version” or “Current version” • Achieved via per-record primary copy protocol (to maximize availability, record masterships are automatically transferred if a site fails) • Can be selectively weakened to eventual consistency (local writes that are reconciled using version vectors)
Consistency Model • Write if = v.7 • ERROR • Timeline (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8 over time, with copies marked “Stale version” or “Current version” • Test-and-set writes facilitate per-record transactions
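The read and write variants from the consistency model slides can be collected in one small in-memory model. This is a sketch for intuition only, not the PNUTS API: the class `TimelineRecord` and its method names are hypothetical, the master is modeled as an append-only log of versions, and each copy is modeled as a position in that log. The scenario at the end (master at v. 8, one copy lagging at v. 5) is an assumption chosen to exercise each call.

```python
class VersionMismatch(Exception):
    """Raised when a requested version condition cannot be satisfied."""


class TimelineRecord:
    def __init__(self, value, num_copies=2):
        self.log = [value]                # log[i] holds version i + 1 at the master
        self.applied = [1] * num_copies   # latest version each copy has applied

    @property
    def current_version(self):
        return len(self.log)

    def write(self, value):
        """All writes go through the master, so versions form a single timeline."""
        self.log.append(value)
        return self.current_version

    def write_if(self, expected_version, value):
        """Test-and-set write: succeed only if the record is still at expected_version."""
        if expected_version != self.current_version:
            raise VersionMismatch(
                f"expected v.{expected_version}, record is at v.{self.current_version}")
        return self.write(value)

    def replicate(self, copy, steps=1):
        """A copy applies the next updates in order; it never skips or reorders them."""
        self.applied[copy] = min(self.applied[copy] + steps, self.current_version)

    def read(self, copy):
        """Plain read: served from the local copy, which may be stale."""
        version = self.applied[copy]
        return version, self.log[version - 1]

    def read_up_to_date(self):
        """Read the current version from the master."""
        version = self.current_version
        return version, self.log[version - 1]

    def read_at_least(self, copy, min_version):
        """Use the local copy if it is new enough, otherwise fall back to the master."""
        version = self.applied[copy]
        if version < min_version:
            version = self.current_version
            if version < min_version:
                raise VersionMismatch(f"no copy has reached v.{min_version}")
        return version, self.log[version - 1]


record = TimelineRecord("v1")
for n in range(2, 9):
    record.write(f"v{n}")          # master is now at v. 8
record.replicate(copy=0, steps=4)  # copy 0 lags at v. 5
```

In this scenario `record.read(0)` returns the stale v. 5, `record.read_at_least(0, 6)` falls back to the master, and `record.write_if(7, ...)` fails because the record has already moved on to v. 8.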
Consistency Techniques • Per-record mastering • Each record is assigned a “master region” • May differ between records • Updates to the record forwarded to the master region • Ensures consistent ordering of updates • Tablet-level mastering • Each tablet is assigned a “master region” • Inserts and deletes of records forwarded to the master region • Master region decides tablet splits • These details are hidden from the application • Except for the latency impact!