PNUTS Building and running a cloud database system Brian Cooper
Overview • Building a cloud service • How PNUTS works • “Advanced” features • Lessons learned
Yahoo! • Yahoo! has almost 100 properties • Mail, Messenger, Finance, Shopping, Sports, OMG! … • 20 properties are #1 or #2 • Yahoo! is #1 in time spent online in the U.S. (10.5%) • 164 million unique U.S. visitors in January (79% of the U.S. online audience) • 598 million unique worldwide visitors in January (48% of the global online audience) • This is where we make our money: users coming to Yahoo! sites and spending time • We are focusing on the "audience" side of Yahoo!, not the search engine (Jan. 2010, source: comScore)
Why? Two competing needs • Accelerating innovation – focus on building your application, not the infrastructure • Increasing availability – without infinite hardware and system operators • How will cloud services help? • Cloud services perform the heavy lifting of scaling and high availability • Focus on horizontal cloud services – platforms to support multiple vertical applications
Requirements for Cloud Services • Multi-tenancy • Support for multiple, organizationally distant customers • Horizontal scaling • Add cloud capacity incrementally and transparently as needed by tenants • Elasticity • Tenants can request and receive resources on-demand, paying for usage • Security & Account management • Accounts/IDs, authentication, access control; isolate tenants; data security • Availability & Operability • High availability and reliability over commodity hardware • Easy to operate, with few operators; automated monitoring & metering
Cloud Data Management Systems • Structured record storage (PNUTS) – CRUD; point lookups and short scans; index-organized tables and random I/Os; $ per latency • Large data analysis (Hadoop) – scan-oriented workloads; focus on sequential disk I/O; $ per CPU cycle • Blob storage (MObStor) – object retrieval and streaming; scalable file storage; $ per GB
What Makes a Cloud Data Service? • DBA to the world! • Many apps, each with hundreds or thousands of client processes • Must automanage – cannot manually tweak knobs • Must autobalance – load will constantly shift • Massive scalability • Scaling up via shared or specialized hardware is infeasible • Scale out with commodity hardware – 10,000 or 100,000 servers • Failures are the common case • Must continue to operate in the face of failed servers • Must autoscale – plug in new servers and let them go • These capabilities must be baked in from the start
Example: social network updates • What are my friends up to? [Diagram: a social graph connecting Brian, Sonja, Jimi, Brandon, and Kurt, with updates from Sonja and Brandon flowing to Brian]
Example: social network updates [Diagram: an updates table keyed by sequence number – 6 Jimi &lt;ph.., 8 Mary &lt;re.., 12 Sonja &lt;ph.., 15 Brandon &lt;po.., 16 Mike &lt;ph.., 17 Bob &lt;re.. – where one entry expands to &lt;photo&gt;&lt;title&gt;Flower&lt;/title&gt;&lt;url&gt;www.flickr.com&lt;/url&gt;&lt;/photo&gt;] (caveat: not necessarily how our Y! Updates product actually works)
The world has changed • Can trade away “standard” DBMS features: • Complicated queries • Strong transactions • But I must have my scalability, flexibility and availability!
The PNUTS Solution • Record-orientation: optimized for low-latency record access • Scale out: add machines to scale throughput • Asynchrony: avoid expensive synchronous operations • Consistency model: hide complexity of asynchronous replication • Flexible access: hashed or ordered tables, indexes, views; flexible schemas • Cloud deployment model: hosted, managed service [VLDB 08]
PNUTS Is Not… • Not a SQL database – simple queries, simple transaction model • Not a parallel processing engine – though it can play well with MapReduce • Not a filesystem – record storage, not blob storage • Not peer-to-peer – we own the servers and can save some complexity; servers are organized into natural groups (datacenters)
Query Model • Simple call API: get, set, delete, scan, getrange • Scan and getrange take an optional predicate • Web service (RESTful) API – data encoded as JSON
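The semantics of that call API can be sketched with a small in-memory model. This is purely illustrative: the class and method names below are invented for the sketch (the real PNUTS client is a network service, not a local dict).

```python
# Minimal in-memory sketch of the PNUTS call API semantics:
# get / set / delete / scan / getrange, with optional predicates
# on the scan operations. Names are illustrative, not the real API.

class PnutsTableSketch:
    """Models one ordered PNUTS table as a dict of records."""

    def __init__(self):
        self._rows = {}  # key -> record (a dict of fields)

    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        self._rows[key] = record

    def delete(self, key):
        self._rows.pop(key, None)

    def scan(self, predicate=lambda rec: True):
        # Full-table scan in key order, optionally filtered.
        for key in sorted(self._rows):
            if predicate(self._rows[key]):
                yield key, self._rows[key]

    def getrange(self, lo, hi, predicate=lambda rec: True):
        # Range scan over the ordered key space [lo, hi).
        for key in sorted(self._rows):
            if lo <= key < hi and predicate(self._rows[key]):
                yield key, self._rows[key]
```

Note how getrange only makes sense for ordered tables; on a hash table it would require a full scan.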
Representing sparse data

$ curl http://pnuts.yahoo.com/PNUTSWebService/V1/get/userTable/yahoo

{"record": {
   "status":   {"code": 200, "message": "OK"},
   "metadata": {"seq_id": "5", "modtime": 1234231551, "disk_size": 89},
   "fields": {
     "addr":  {"value": "700 First Ave"},
     "city":  {"value": "Sunnyvale"},
     "state": {"value": "CA"}
   }
}}

(some details changed to protect the innocent)
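A client consuming this response might flatten the sparse "fields" map into a plain dictionary. The response shape below is copied from the slide; the helper function is an assumed client-side convenience, not part of PNUTS itself.

```python
import json

# Parse a PNUTS web-service response (shape taken from the slide above)
# and flatten the sparse "fields" map into field-name -> value.
response_text = """
{"record": {
   "status":   {"code": 200, "message": "OK"},
   "metadata": {"seq_id": "5", "modtime": 1234231551, "disk_size": 89},
   "fields": {
     "addr":  {"value": "700 First Ave"},
     "city":  {"value": "Sunnyvale"},
     "state": {"value": "CA"}
   }
}}
"""

def flatten_record(text):
    record = json.loads(text)["record"]
    if record["status"]["code"] != 200:
        raise RuntimeError(record["status"]["message"])
    return {name: f["value"] for name, f in record["fields"].items()}
```

Because records are sparse, a field absent from "fields" is simply absent from the result; no NULL padding is needed.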
DISTRIBUTION
Architecture [Diagram: clients issue requests through a REST API to routers; a tablet controller directs the routers; storage units hold the data, and log servers record updates]
Tablet Splitting and Balancing • Each storage unit has many tablets (horizontal partitions of the table) • Tablets may grow over time; overfull tablets split • A storage unit may become a hotspot; shed load by moving tablets to other servers
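The split-and-move policy can be sketched in a few lines. The threshold and the median-split rule here are assumptions for illustration; in PNUTS these decisions involve the tablet controller.

```python
# Sketch of tablet splitting and load shedding. The record-count
# threshold is tiny for the example; real tablets are sized in GB.
MAX_RECORDS = 4

def maybe_split(tablet):
    """tablet: sorted list of keys. Split an overfull tablet at its median."""
    if len(tablet) <= MAX_RECORDS:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]

def shed_load(hot_server, cold_server):
    """Move the largest tablet off a hot storage unit onto a colder one."""
    biggest = max(hot_server, key=len)
    hot_server.remove(biggest)
    cold_server.append(biggest)
```

Splitting at a key boundary keeps each tablet a contiguous key range, so routers only need the boundary keys to locate a record.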
Tablets—Hash Table (columns: Name | Description | Price)
Tablet [0x0000, 0x2AF3): Grape | Grapes are good to eat | $12 · Lime | Limes are green | $9 · Apple | Apple is wisdom | $1 · Strawberry | Strawberry shortcake | $900
Tablet [0x2AF3, 0x911F): Orange | Arrgh! Don't get scurvy! | $2 · Avocado | But at what price? | $3 · Lemon | How much did you pay for this lemon? | $1 · Tomato | Is this a vegetable? | $14
Tablet [0x911F, 0xFFFF]: Banana | The perfect fruit | $2 · Kiwi | New Zealand | $8
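Routing in the hash-table layout maps a key's hash into the 16-bit space and finds the enclosing tablet. The boundaries below come from the diagram; the choice of MD5 as the hash is an assumption for the sketch, not PNUTS's actual hash function.

```python
import hashlib

# Tablet boundaries from the diagram: three tablets over [0x0000, 0xFFFF].
BOUNDARIES = [0x0000, 0x2AF3, 0x911F, 0xFFFF]

def key_hash(key):
    # Stable 16-bit hash (MD5 here is illustrative only).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) & 0xFFFF

def tablet_for(key):
    """Return the index of the tablet whose hash range contains the key."""
    h = key_hash(key)
    for i in range(len(BOUNDARIES) - 1):
        if BOUNDARIES[i] <= h <= BOUNDARIES[i + 1]:
            return i
    return len(BOUNDARIES) - 2  # unreachable given the masking above
```

Hashing spreads hot keys uniformly, which is why the hash layout resists hotspots better than the ordered layout, at the cost of range scans.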
Tablets—Ordered Table (columns: Name | Description | Price)
Tablet [A, H): Apple | Apple is wisdom | $1 · Avocado | But at what price? | $3 · Banana | The perfect fruit | $2 · Grape | Grapes are good to eat | $12
Tablet [H, Q): Kiwi | New Zealand | $8 · Lemon | How much did you pay for this lemon? | $1 · Lime | Limes are green | $9 · Orange | Arrgh! Don't get scurvy! | $2
Tablet [Q, Z]: Strawberry | Strawberry shortcake | $900 · Tomato | Is this a vegetable? | $14
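For the ordered layout, routing is a binary search over the tablet split points. The split points below are taken from the diagram; the router sketch itself is illustrative.

```python
import bisect

# Split points from the diagram: tablet 0 holds [A, H), tablet 1
# holds [H, Q), tablet 2 holds [Q, Z]. A router keeps only these
# boundary keys and binary-searches them per request.
SPLIT_POINTS = ["H", "Q"]

def tablet_for(key):
    """Return the index of the tablet whose key range contains the key."""
    return bisect.bisect_right(SPLIT_POINTS, key)
```

Because keys in one tablet are contiguous, a range scan like getrange("K", "M") touches only tablet 1, which is exactly the "efficient scanning of clustered subranges" property claimed later.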
Accessing Data [Diagram: a four-step read path: (1) the client sends "get key k" to a router, (2) the router forwards it to the storage unit holding k's tablet, (3) the storage unit returns the record for key k, (4) the router relays the record to the client]
Updates • Log servers make storage units disposable [Diagram: an eight-step write path: the client sends "write key k" to a router, the router forwards it to the right storage unit, the update is committed to a log server and assigned a sequence number for key k, and SUCCESS flows back to the client]
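The key idea, that the log-server append is the commit point and storage units can be rebuilt by replay, can be sketched as follows. The classes and the synchronous apply step are simplifications; in PNUTS the apply can lag the commit.

```python
# Sketch of the write path: commit to the log first, then apply to
# the storage unit. A lost storage unit is "disposable" because it
# can be reconstructed by replaying the log.
class LogServer:
    def __init__(self):
        self.entries = []

    def append(self, key, seq, value):
        self.entries.append((key, seq, value))  # durable commit point

class StorageUnit:
    def __init__(self):
        self.data = {}

    def apply(self, key, seq, value):
        self.data[key] = (seq, value)

def write(log, su, key, value, next_seq):
    seq = next_seq[key] = next_seq.get(key, 0) + 1
    log.append(key, seq, value)   # 1) commit to the log, then ack the client
    su.apply(key, seq, value)     # 2) apply to storage (may be asynchronous)
    return seq

def rebuild(log, fresh_su):
    """Replay the log to reconstruct a failed storage unit."""
    for key, seq, value in log.entries:
        fresh_su.apply(key, seq, value)
```

Per-key sequence numbers are what later make the timeline consistency model possible: every replica applies a record's updates in sequence order.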
Global Replication (not necessarily actual Yahoo! datacenters)
Consistency Model • Goal: make it easier for applications to reason about updates and cope with asynchrony • What happens to a record with primary key "Brian"? [Timeline: record inserted, then a sequence of updates producing v.1 through v.8 within Generation 1, then a delete] • We also support an eventual consistency model • Applications can choose which kind of table to create
Timeline Model • Read: returns some version from the record's timeline, possibly stale [Timeline: v.1 through v.8 in Generation 1; any stale version or the current version v.8 may be returned]
Timeline Model • Read up-to-date: returns the current version (v.8), never a stale one
Timeline Model • Read ≥ v.6: returns v.6 or any later version, never anything older
Timeline Model • Write: appends a new current version to the record's timeline
Timeline Model • Write if = v.7: test-and-set write; returns ERROR because the current version is v.8, not v.7
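The five operations above can be summarized in one toy class. The method names follow the API in the PNUTS paper (read-any, read-critical, read-latest, write, test-and-set write); the single-object replica model is an assumption for the sketch.

```python
# Toy model of the per-record timeline API. One object stands in for
# a record replicated across regions; replica_version models how far
# a stale replica has caught up with the master's timeline.
class TimelineRecord:
    def __init__(self, value):
        self.versions = [value]   # v.1, v.2, ... (version = index + 1)
        self.replica_version = 1  # local replica's position on the timeline

    @property
    def current_version(self):
        return len(self.versions)

    def read_any(self):
        # May return a stale version: whatever the local replica has.
        return self.versions[self.replica_version - 1]

    def read_critical(self, min_version):
        # Must return v.min_version or later; fail if too stale.
        if self.replica_version < min_version:
            raise RuntimeError("replica too stale; retry at the master")
        return self.versions[self.replica_version - 1]

    def read_latest(self):
        return self.versions[-1]

    def write(self, value):
        self.versions.append(value)
        return self.current_version

    def test_and_set_write(self, expected_version, value):
        if self.current_version != expected_version:
            raise RuntimeError("version mismatch")  # the ERROR case above
        return self.write(value)
```

Test-and-set write gives per-record optimistic concurrency control without cross-record transactions.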
Consistency levels • Eventual consistency • Transactions: Alice changes status from "Sleeping" to "Awake"; Alice changes location from "Home" to "Work" [Diagram: Region 1 applies (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake); Region 2 applies the updates in the other order, passing through (Alice, Work, Sleeping): an "invalid" state is visible, though the final state is consistent]
Consistency levels • Timeline consistency • Transactions: the same two updates [Diagram: Region 1 applies (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake); Region 2 applies the updates in timeline order, possibly skipping straight from (Alice, Home, Sleeping) to (Alice, Work, Awake), so no invalid intermediate state is ever visible]
Mastering [Diagram: each region holds a replica of the per-record mastering table (record key, sequence number, master region), e.g. A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E. All replicas agree on each record's master; in the example, record B's mastership moves from region W to region E and the change propagates to every replica]
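Per-record mastering can be sketched as: a write arriving at a non-master region is forwarded to the record's master, which orders it and propagates the new version everywhere. Region names and sequence numbers below come from the diagram; modeling the asynchronous propagation as a synchronous loop is a simplification.

```python
# Sketch of per-record mastering. Each record carries its master
# region; non-master regions forward writes to the master, which
# assigns the next sequence number and replicates the update.
class Region:
    def __init__(self, name):
        self.name = name
        self.records = {}  # key -> (seq, master_region, value)

def write(regions, local, key, value):
    seq, master, _ = regions[local].records[key]
    if local != master:
        # Forward the write to the record's master region.
        return write(regions, master, key, value)
    new = (seq + 1, master, value)
    for r in regions.values():   # asynchronous replication, modeled
        r.records[key] = new     # synchronously for the sketch
    return new[0]
```

Since roughly 80% of writes to a record tend to come from one region, putting the master there keeps most writes local.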
Coping With Failures [Diagram: when a region fails, mastership of its records is overridden (OVERRIDE W → E), transferring record masters from the failed region W to region E in the surviving replicas of the mastering table]
Ordered tables • Ordered tables provide efficient scanning of clustered subranges • Use cases: time ranges, relationship graphs, hierarchical data, indexes and views
Ordered tables are tricky • Hotspots! • Solution: Proactive load balancing • Move tablets from hot servers to cold servers • If necessary, split hot tablets
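The proactive balancing policy can be sketched as a greedy loop: while the hottest server carries markedly more load than the coldest, move a tablet across. The 1.2x imbalance threshold and the move-smallest heuristic are assumptions for the sketch.

```python
# Greedy sketch of proactive load balancing: move tablets from the
# most-loaded storage unit to the least-loaded one while the
# imbalance exceeds a threshold. Loads are request rates per tablet.
def balance(servers, threshold=1.2, max_moves=100):
    """servers: dict name -> list of (tablet, load). Mutated in place."""
    def load(s):
        return sum(l for _, l in servers[s])

    for _ in range(max_moves):
        hot = max(servers, key=load)
        cold = min(servers, key=load)
        if not servers[hot] or load(hot) <= threshold * load(cold):
            return
        tablet = min(servers[hot], key=lambda t: t[1])  # limit overshoot
        if load(hot) - tablet[1] < load(cold):
            return  # moving would just invert the imbalance
        servers[hot].remove(tablet)
        servers[cold].append(tablet)
```

If no single tablet move helps because one tablet carries all the heat, that is exactly the "split hot tablets" case from the slide.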
Parallel scans [Diagram: a client issues a scan; the scan engine runs partial scans across multiple storage units in parallel]
Adaptive server allocation [Diagram: client, scan engine, and storage units, illustrating adaptive allocation of servers to a scan]
Server scheduling [Diagram: the scan engine schedules scans from Client 1 and Client 2 across the storage units]
Indexes and views • How to have lots of interesting indexes, without killing performance? • Solution: Asynchrony! • Indexes updated asynchronously when base table updated • Some interesting views can be represented as indexes
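The asynchronous maintenance idea can be sketched with a queue: the base-table write enqueues an index delta and returns immediately, and a maintainer applies deltas later. The class below is an invented illustration of that pattern, not the PNUTS view maintainer.

```python
from collections import deque

# Sketch of asynchronous index maintenance: base-table updates
# enqueue deltas on the fast path; a maintainer drains the queue
# later, so index work never sits on the client's write path.
class AsyncIndex:
    def __init__(self, key_fn):
        self.key_fn = key_fn   # extracts the index key from a record
        self.index = {}        # index key -> set of base-table keys
        self.pending = deque()

    def on_base_update(self, base_key, old_record, new_record):
        # Fast path: just record the delta and return.
        self.pending.append((base_key, old_record, new_record))

    def drain(self):
        # Maintainer path: apply queued deltas to the index.
        while self.pending:
            base_key, old, new = self.pending.popleft()
            if old is not None:
                self.index.get(self.key_fn(old), set()).discard(base_key)
            if new is not None:
                self.index.setdefault(self.key_fn(new), set()).add(base_key)
```

The trade-off is visible in the test below: between the update and the drain, the index is stale, which is acceptable under timeline or eventual consistency.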
View types • Index – a remote view table maintained alongside the base table [Diagram: a base table and a ByAuthor view table]
View types • Equijoin – co-clustered remote view tables, each sub-table managed like an index [Diagram: Posts and Comments view tables co-clustered to form a PostComments view table]
Remote view tables • A regular table, but updated by the view maintainer (VM) instead of a client [Diagram: an update flows from a storage unit's log server through the view maintainer to the view table's storage unit and its own log server]
SOME NUMBERS
Performance comparison • Setup • Six server-class machines • 8 cores (2 x quad-core) 2.5 GHz CPUs • 8 GB RAM • 6 x 146 GB 15K RPM SAS drives in RAID 1+0 • Gigabit Ethernet • RHEL 4 • Plus extra machines for clients, routers, controllers, etc. • Workloads • 120 million 1 KB records = 20 GB per server • Write-heavy workload: 50/50 read/update • Updates write the whole record • 50 client processes usually; up to 300 needed to generate higher throughputs • Obviously many variations are possible; these are just two points in the space • Metrics • Latency versus throughput curves • Caveats • Write performance would be improved for Sherpa (the internal name for PNUTS), Sharded and Cassandra with a dedicated log disk • We tuned each system as well as we knew how