Slide 2:Overview
- Building a cloud service
- How PNUTS works
- "Advanced" features
- Lessons learned
Slide 3:Yahoo!
Yahoo! has almost 100 properties: Mail, Messenger, Finance, Shopping, Sports, OMG!, and more. 20 properties are #1 or #2.
- Yahoo! is #1 in time spent online in the U.S. (10.5%)
- 164 million unique U.S. visitors in January (79 percent of the U.S. online audience)
- 598 million unique worldwide visitors in January (48 percent of the global online audience)
This is where we make our money: users coming to Yahoo! sites and spending time. We are focusing on the "audience" side of Yahoo!, not the search engine. (Jan. 2010, source: ComScore)
Slide 4:CLOUD COMPUTING
Slide 5:Why?
Two competing needs:
- Accelerating innovation: focus on building your application, not the infrastructure
- Increasing availability: without infinite hardware and system operators
How will cloud services help? Cloud services will perform the heavy lifting of scaling and high availability.
Focus on horizontal cloud services: platforms to support multiple vertical applications.
Slide 6:Requirements for Cloud Services
- Multi-tenancy: support for multiple, organizationally distant customers
- Horizontal scaling: add cloud capacity incrementally and transparently as needed by tenants
- Elasticity: tenants can request and receive resources on demand, paying for usage
- Security & account management: accounts/IDs, authentication, access control; isolate tenants; data security
- Availability & operability: high availability and reliability over commodity hardware; easy to operate, with few operators; automated monitoring & metering
Slide 7:Cloud Data Management Systems
- Large data analysis (Hadoop): scan-oriented workloads; focus on sequential disk I/O; $ per CPU cycle
- Structured record storage (PNUTS): CRUD, point lookups and short scans; index-organized tables and random I/Os; $ per latency
- Blob storage (MObStor): object retrieval and streaming; scalable file storage; $ per GB
Slide 8:What Makes a Cloud Data Service?
DBA to the world!
- Many apps, each with hundreds or thousands of client processes
- Must automanage – cannot manually tweak knobs
- Must autobalance – load will constantly shift
Massive scalability
- Scaling up via shared or specialized hardware is infeasible
- Scale out with commodity hardware – 10,000 or 100,000 servers
- Failures are the common case: must continue to operate in the face of servers down
- Must autoscale – plug in new servers and let them go
These capabilities must be baked in from the start
Slide 9:WHAT IS PNUTS?
Slide 10: Example: social network updates
[Figure: a social graph of users – Brian, Sonja, Jimi, Brandon, Kurt]
Slide 11: Example: social network updates
[Figure: an update feed stored as records keyed by sequence number – 6 Jimi <ph.., 8 Mary <re.., 12 Sonja <ph.., 15 Brandon <po.., 16 Mike <ph.., 17 Bob <re.. – where each payload is a document such as: <photo><title>Flower</title><url>www.flickr.com</url></photo>]
(caveat: not necessarily how our Y! Updates product actually works)
Slide 12: The world has changed
Can trade away "standard" DBMS features:
- Complicated queries
- Strong transactions
But I must have my scalability, flexibility and availability!
Slide 13:The PNUTS Solution
- Record-orientation: optimized for low-latency record access
- Scale out: add machines to scale throughput
- Asynchrony: avoid expensive synchronous operations
- Consistency model: hide the complexity of asynchronous replication
- Flexible access: hashed or ordered, indexes, views; flexible schemas
- Cloud deployment model: hosted, managed service
[VLDB 08]
Slide 14: What PNUTS is not
- Not a SQL database: simple queries, simple transaction model
- Not a parallel processing engine: though it can play well with MapReduce
- Not a filesystem: record storage, not blob storage
- Not peer-to-peer: we own the servers and can save some complexity; servers are organized into natural groups (datacenters)
Slide 15:Data Model
Slide 16:Query Model
Simple call API:
- Get
- Set
- Delete
- Scan
- Getrange
- Scan and Getrange with predicate
Web service (RESTful) API: encode data as JSON (a client sketch follows below)
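A minimal client sketch in Python, assuming the REST shape shown in the curl example on the next slide; the /set path, the request body format, and the service being reachable at all are assumptions for illustration:

import json
import urllib.request

BASE = "http://pnuts.yahoo.com/PNUTSWebService/V1"  # from the curl example

def get(table, key):
    # GET returns the record's fields plus status and metadata as JSON
    with urllib.request.urlopen(f"{BASE}/get/{table}/{key}") as resp:
        return json.loads(resp.read())

def set_record(table, key, fields):
    # hypothetical write call: POST the new field values as JSON
    body = json.dumps({"fields": fields}).encode()
    req = urllib.request.Request(f"{BASE}/set/{table}/{key}", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

record = get("userTable", "yahoo")
print(record["record"]["fields"]["city"]["value"])  # "Sunnyvale" per slide 17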
Slide 17: Representing sparse data
$ curl http://pnuts.yahoo.com/PNUTSWebService/V1/get/userTable/yahoo
{"record": {
  "status": {"code": 200, "message": "OK"},
  "metadata": {
    "seq_id": "5",
    "modtime": 1234231551,
    "disk_size": 89},
  "fields": {
    "addr": {"value": "700 First Ave"},
    "city": {"value": "Sunnyvale"},
    "state": {"value": "CA"}
  }
}}
(some details changed to protect the innocent)
Slide 18:DISTRIBUTION
Slide 19: Architecture
[Figure: system architecture – storage units plus the routers and other components discussed on the following slides]
Slide 20:Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal partitions of the table).
- Tablets may grow over time; overfull tablets split
- A storage unit may become a hotspot; shed load by moving tablets to other servers
A toy sketch of the split rule follows below.
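In this sketch the record-count threshold, the Tablet class, and splitting at the median key are all hypothetical (real tablets would split by size):

from dataclasses import dataclass, field

MAX_RECORDS = 4  # tiny hypothetical threshold so the example splits

@dataclass
class Tablet:
    records: dict = field(default_factory=dict)  # key -> record

def maybe_split(tablet):
    # an overfull tablet splits into two halves at its median key
    if len(tablet.records) <= MAX_RECORDS:
        return [tablet]
    keys = sorted(tablet.records)
    mid = keys[len(keys) // 2]  # the median key becomes the new boundary
    left = Tablet({k: v for k, v in tablet.records.items() if k < mid})
    right = Tablet({k: v for k, v in tablet.records.items() if k >= mid})
    return [left, right]

t = Tablet({k: None for k in "abcdef"})
print([sorted(x.records) for x in maybe_split(t)])  # [['a','b','c'], ['d','e','f']]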
Slide 21:Tablets—Hash Table
[Figure: a hash-organized table of fruits with Name, Description, and Price columns (e.g. Apple "Apple is wisdom", Grape "Grapes are good to eat", Lime "Limes are green", Kiwi "New Zealand"; prices from $1 to $900); keys hash into the range 0x0000–0xFFFF, which is carved into tablets with boundaries at 0x2AF3 and 0x911F]
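How a router might resolve a key against hash-partitioned tablets (the boundary values come from the figure; the hash function and tablet names are illustrative):

import bisect
import hashlib

BOUNDARIES = [0x2AF3, 0x911F, 0x10000]  # tablet upper bounds, per the figure
TABLETS = ["tablet-0", "tablet-1", "tablet-2"]

def route(key):
    # hash the key into the 0x0000-0xFFFF space, then find its tablet
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:2], "big")
    return TABLETS[bisect.bisect_right(BOUNDARIES, h)]

print(route("Apple"), route("Banana"))  # which tablet depends on the hash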
Slide 22:Tablets—Ordered Table
[Figure: the same fruit table organized as an ordered table – rows are stored sorted by Name over the key space A–Z, carved into tablets with boundaries at H and Q]
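The same lookup for an ordered table compares keys directly, which is what makes range scans cheap (boundaries H and Q come from the figure; tablet names are illustrative):

import bisect

BOUNDARIES = ["H", "Q"]  # split points from the figure
TABLETS = ["tablet-A-H", "tablet-H-Q", "tablet-Q-Z"]

def route(name):
    return TABLETS[bisect.bisect_right(BOUNDARIES, name[:1].upper())]

def tablets_for_range(lo, hi):
    # a getrange(lo, hi) touches only this contiguous run of tablets
    i = bisect.bisect_right(BOUNDARIES, lo[:1].upper())
    j = bisect.bisect_right(BOUNDARIES, hi[:1].upper())
    return TABLETS[i:j + 1]

print(route("Kiwi"))                        # tablet-H-Q
print(tablets_for_range("Apple", "Lime"))   # ['tablet-A-H', 'tablet-H-Q']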
Slide 23:Accessing Data
[Figure: a "get key k" request goes to a router, which locates the tablet containing k and forwards the request to the storage unit holding that tablet]
Slide 24:Updates
[Figure: the write path – "write key k" reaches a router, which forwards it to the storage unit (SU) mastering that record; the update is committed to the log servers and assigned a sequence number for key k, SUCCESS is returned, and the write then flows to the other SUs]
A sketch of this flow follows below.
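A minimal sketch of that flow; the class names and the synchronous replica loop are simplifications (in PNUTS, propagation to replicas is asynchronous, via the log servers):

class LogServer:
    def __init__(self):
        self.entries = []
    def append(self, entry):
        self.entries.append(entry)      # stands in for a durable log write

class ReplicaSU:
    def __init__(self):
        self.data = {}
    def apply(self, key, seq, value):
        self.data[key] = (seq, value)

class MasterSU:
    def __init__(self, log_servers, replicas):
        self.seq = {}                    # key -> latest sequence number
        self.data = {}
        self.log_servers = log_servers
        self.replicas = replicas
    def write(self, key, value):
        self.seq[key] = self.seq.get(key, 0) + 1    # per-record ordering
        for log in self.log_servers:                # log before acking
            log.append((key, self.seq[key], value))
        self.data[key] = value
        for r in self.replicas:                     # asynchronous in reality
            r.apply(key, self.seq[key], value)
        return "SUCCESS"

su = MasterSU([LogServer(), LogServer()], [ReplicaSU()])
print(su.write("k", "hello"))   # SUCCESS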
Slide 25:ASYNCHRONOUS REPLICATION AND CONSISTENCY
Slide 26:Asynchronous Replication
Slide 27:Global Replication
[Figure: replicas in datacenters around the world (not necessarily actual Yahoo! datacenters)]
Slide 28: Consistency Model
Goal: make it easier for applications to reason about updates and cope with asynchrony. What happens to a record with primary key "Brian"?
We also support an eventual consistency model; applications can choose which kind of table to create.
[Figure: one record's timeline – insert, a series of updates, and a delete; versions v. 1 through v. 8 make up Generation 1]
Slide 29: Timeline Model
[Figure: a plain read may return the current version (v. 8) or a stale version]
Slide 30: Timeline Model
[Figure: a "read up-to-date" always returns the current version]
Slide 31: Timeline Model
[Figure: a versioned read, shown as "Read = v. 6", returns that specific version]
Slide 32: Timeline Model
[Figure: a write installs the next version on the record's timeline]
Slide 33: Timeline Model
[Figure: a conditional "write if = v. 7" returns ERROR because the current version has moved past v. 7]
A toy sketch of these timeline calls follows below.
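The per-record timeline calls above can be summarized with a toy in-memory model (a minimal sketch; the class and method names are illustrative, not the actual PNUTS client API):

class TimelineRecord:
    """Toy model of one record's version timeline (slides 28-33)."""
    def __init__(self):
        self.current = 0          # current version number at the master
        self.value = None
        self.stale_version = 0    # what a lagging replica would serve

    def read_any(self):
        # slide 29: may return a stale version (never refreshed in this toy)
        return self.stale_version

    def read_up_to_date(self):
        # slide 30: always returns the current version
        return self.current

    def write(self, value):
        # slide 32: installs the next version on the timeline
        self.current += 1
        self.value = value
        return self.current

    def test_and_set_write(self, expected, value):
        # slide 33: "write if = v.N" fails if the record moved past v.N
        if self.current != expected:
            raise RuntimeError(f"ERROR: current version is v.{self.current}")
        return self.write(value)

rec = TimelineRecord()
v = rec.write("Sleeping")            # v.1
rec.test_and_set_write(v, "Awake")   # succeeds: record was still at v.1
# rec.test_and_set_write(v, "Home")  # would raise: record is now at v.2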
Slide 34:Consistency levels
Eventual consistency. Transactions:
1. Alice changes status from "Sleeping" to "Awake"
2. Alice changes location from "Home" to "Work"
[Figure: Region 1 and Region 2 both start at (Alice, Home, Sleeping); the two updates may arrive in different orders, but the final state is consistent across regions]
Slide 35:Consistency levels
Timeline consistency. Transactions:
1. Alice changes status from "Sleeping" to "Awake"
2. Alice changes location from "Home" to "Work"
[Figure: Region 1 and Region 2 both start at (Alice, Home, Sleeping) and apply the updates in the same order, ending at (Alice, Work, Awake) in both regions]
Slide 36: Mastering
[Figure: a per-record mastering table – A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E – each key paired with a sequence number and the region (E/W/C) that masters it]
Slide 37: Coping With Failures
[Figure: the same mastering table replicated in three regions; one region fails (X), and mastership of the records it owned must move to surviving replicas]
Slide 38:“ADVANCED” FEATURES
Slide 39: Ordered tables
Ordered tables provide efficient scanning of clustered subranges:
- Time ranges
- Relationship graphs
- Hierarchical data
- Indexes and views
Slide 40:Ordered tables are tricky
Hotspots! Solution: proactive load balancing.
- Move tablets from hot servers to cold servers
- If necessary, split hot tablets
A sketch of the balancing loop follows below.
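A toy version of that balancing loop; the load metric, the threshold, and the assumption that each tablet carries an equal share of load are all hypothetical:

def balance(load_by_su, tablets_by_su, max_spread=1.5):
    """load_by_su: {su: req/sec}; tablets_by_su: {su: [tablet, ...]}"""
    for _ in range(100):                 # guard against oscillation
        hot = max(load_by_su, key=load_by_su.get)
        cold = min(load_by_su, key=load_by_su.get)
        if load_by_su[hot] <= max_spread * load_by_su[cold]:
            return
        if not tablets_by_su[hot]:
            return
        t = tablets_by_su[hot].pop()     # shed one tablet from the hot SU
        tablets_by_su[cold].append(t)
        moved = load_by_su[hot] / (len(tablets_by_su[hot]) + 1)
        load_by_su[hot] -= moved         # rough per-tablet load estimate
        load_by_su[cold] += moved

load = {"su1": 90.0, "su2": 10.0}
tablets = {"su1": ["t1", "t2", "t3"], "su2": []}
balance(load, tablets)
print(load, tablets)   # load roughly evened out; su2 now holds t3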
Slide 41:Parallel scans
[Figure: the scan engine runs a parallel scan across storage units on behalf of a client]
Slide 42: Adaptive server allocation
[Figure: the scan engine adapts how many servers it scans in parallel for a client]
Slide 43:Server scheduling
[Figure: the scan engine schedules scans from Client 1 and Client 2 across the storage units]
Slide 44:Indexes and views
How to have lots of interesting indexes, without killing performance? Solution: asynchrony!
- Indexes are updated asynchronously when the base table is updated
- Some interesting views can be represented as indexes
A toy sketch of asynchronous index maintenance follows below.
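In this sketch the write is acknowledged without touching the index, and a background maintainer patches a ByAuthor-style view from the update stream (the queue stands in for the log servers; all names are illustrative):

import queue
import threading

base_table = {}          # key -> {"author": ..., "title": ...}
by_author = {}           # author -> set of base-table keys
updates = queue.Queue()  # stands in for the log servers / brokers

def write(key, record):
    old = base_table.get(key)
    base_table[key] = record
    updates.put((key, old, record))   # ack without updating the index

def index_maintainer():
    while True:
        key, old, new = updates.get()
        if old is not None:
            by_author.get(old["author"], set()).discard(key)
        by_author.setdefault(new["author"], set()).add(key)
        updates.task_done()

threading.Thread(target=index_maintainer, daemon=True).start()
write("post1", {"author": "brian", "title": "Hello"})
updates.join()             # the index catches up asynchronously
print(by_author["brian"])  # {'post1'}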
Slide 45:View types
Index – remote view table
[Figure: a base table and a ByAuthor view table; updates to the base table maintain the ByAuthor index]
Slide 46:View types
Equijoin – co-clustered remote view tables; each sub-table is managed like an index
[Figure: a Posts table and a Comments table maintained into a PostComments view table that co-clusters each post with its comments]
Slide 47:Remote view tables
A regular table, but updated by the view maintainer instead of a client.
[Figure: an update travels through the log servers to the view maintainer, which applies it to the view table's storage unit (SU)]
Slide 48:SOME NUMBERS
Slide 49:Performance comparison
Setup:
- Six server-class machines: 8 cores (2 x quad-core) 2.5 GHz CPUs, 8 GB RAM, 6 x 146 GB 15K RPM SAS drives in RAID 1+0, gigabit ethernet, RHEL 4
- Plus extra machines for clients, routers, controllers, etc.
Workloads:
- 120 million 1 KB records = 20 GB per server
- Write-heavy workload: 50/50 read/update; updates write the whole record
- 50 client processes usually; up to 300 needed to generate higher throughputs
- Obviously many variations are possible; these are just two points in the space
Metrics:
- Latency versus throughput curves
Caveats:
- Write performance would be improved for Sherpa, Sharded and Cassandra with a dedicated log disk
- We tuned each system as well as we knew how
Slide 50:Results
Slide 51:YCSB
Developing a common benchmark for serving systems: the Yahoo! Cloud Serving Benchmark (YCSB)
Details coming soon…
Slide 52:CONCLUSIONS
Slide 53:Lessons learned (1)
Simpler is better than clever: clever approaches are hard to implement, test, debug and maintain.
Incremental is better than big-bang: how many new things do you want to test at once? Why throw away years of hardening?
Slide 54:Lessons learned (2)
Non-algorithmic challenges can be hard: dealing with network config, legacy software and requirements, the "corporate way," multiple stakeholders…
Researchers should get dirty hands: being a part of shipping a real system can radically readjust your worldview. Write some test cases to understand system complexity.
Slide 55:New in 2010!
SIGMOD and SIGOPS are starting a new conference, to be co-located alternately with SIGMOD and SOSP: the ACM Symposium on Cloud Computing (SoCC).
Steering committee: Phil Bernstein, Ken Birman, Joe Hellerstein, John Ousterhout, Raghu Ramakrishnan, Doug Terry, John Wilkes
PC Chairs: Surajit Chaudhuri & Mendel Rosenblum
http://research.microsoft.com/socc2010
Slide 56:Research + product collaboration
Yahoo! Research: Raghu Ramakrishnan, Brian Cooper, Utkarsh Srivastava, Adam Silberstein, Erwin Tam
Excellent interns – Parag Agrawal, Robert Ikeda, Ymir Vigfusson, Arvind Thiagarajan, Jeffrey Terrace, Yang Zhang, Mert Akdere, Prasang Upadhyaya
Excellent visitors – Arno Jacobsen, Rodrigo Fonseca
Cloud Computing: Chuck Neerdaels, P.P.S. Narayan, Toby Negrin
Plus Dev/QA/Ops teams
Slide 57:Thanks!
cooperb@yahoo-inc.com research.yahoo.com