Yahoo! Research, presented by Liyan & Fang
PNUTS: Yahoo!’s Hosted Data Serving Platform
What are my friends up to?
[Figure: social network websites; a user's friends (Brian, Sonja, Jimi, Brandon, Kurt), with status updates from Sonja and Brandon]
What does a web application need?
• Scalability
  • Architectural scalability
  • Scale during periods of rapid growth with minimal operational effort
• Response time and geographic scope
  • Fast response time for geographically distributed users
• High availability and fault tolerance
  • Read, and even write, data during failures
• Relaxed consistency guarantees
  • Eventual consistency: update one replica first, then update the others
What do we need from our DBMS?
• Web applications need:
  • Scalability
    • And the ability to scale linearly
  • Geographic scope
  • High availability
• Web applications typically have:
  • Simplified query needs
    • No joins or aggregations
  • Relaxed consistency needs
    • Applications can tolerate stale or reordered data
What is PNUTS?
[Figure: sample records A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E, repeated across several copies of a table]
CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … )
• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure
Query model
• Per-record operations
  • Get
  • Set
  • Delete
• Multi-record operations
  • Multiget
  • Scan
  • Getrange
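To make the operation list concrete, here is a minimal in-memory sketch in Python. It is not the PNUTS client API: the class name `ToyTable`, the method signatures, and the half-open range convention are all assumptions chosen for illustration, and the real system serves these calls over a REST API against partitioned, replicated tables.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


class ToyTable:
    """Hypothetical single-node stand-in for a table (illustration only)."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    # Per-record operations
    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)

    def set(self, key: str, record: dict) -> None:
        self._rows[key] = record

    def delete(self, key: str) -> None:
        self._rows.pop(key, None)

    # Multi-record operations
    def multiget(self, keys: Iterable[str]) -> Dict[str, dict]:
        # Missing keys are simply omitted from the result (an assumption).
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self) -> Iterator[Tuple[str, dict]]:
        # Visit every record; sorted by key here only to make output stable.
        for key in sorted(self._rows):
            yield key, self._rows[key]

    def getrange(self, start: str, end: str) -> List[Tuple[str, dict]]:
        # Records with start <= key < end (half-open range is an assumption).
        return [(k, v) for k, v in self.scan() if start <= k < end]
```

Usage: after `t.set("Brian", {"status": "busy"})`, a call to `t.get("Brian")` returns that record, and `t.getrange("A", "C")` returns every record whose key falls in that range.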
Detailed architecture: data-path components
[Figure: Clients, REST API, Routers, Message Broker, Tablet controller, Storage units]
• Data tables are horizontally partitioned into groups of records called tablets.
• Storage units
  • Store tablets
  • Respond to get() and scan() requests by retrieving and returning matching records
  • Respond to set() requests by processing the update
• Routers
  • Determine which storage unit is responsible for a given record to be read or written by the client: first determine which tablet contains the record, then determine which storage unit has that tablet
• Message Broker
  • To commit an update, it must first be written to the Message Broker
• Tablet controller
  • Determines when it is time to move a tablet between storage units for load balancing or recovery, or when a large tablet must be split
  • Updates the copy of the interval mapping
Detailed architecture
[Figure: local region and remote regions, with Clients, REST API, Routers, YMB, Tablet controller, and Storage units]
• YMB
  • Messages published to one YMB cluster are relayed to other YMB clusters for delivery to local subscribers.
  • YMB takes multiple steps to ensure messages are not lost before they are applied to the database.
• Record-level mastering
  • Mastership is assigned on a record-by-record basis, and different records in the same table can be mastered in different clusters.
  • In one week, 85 percent of the writes to a given record originated in the same datacenter.
  • A master publishes its updates to a single broker, and thus updates are delivered to replicas in commit order.
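Range queries
[Figure: the router splits the range query Grapefruit…Pear? into Grapefruit…Lime? and Lime…Pear? and sends each piece to a storage unit (Storage unit 1, Storage unit 2, Storage unit 3)]
Router interval mapping:
  MIN-Canteloupe     SU4
  Canteloupe-Lime    SU3
  Lime-Strawberry    SU2
  Strawberry-MAX     SU1
Records in key order: Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Pear, Strawberry, Tomato, Watermelon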
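The router's lookup can be sketched as a binary search over the tablet boundaries on this slide. This is only an illustration under assumptions: the function names are hypothetical, the empty string stands in for MIN, each tablet is treated as the half-open interval [lower, upper), and keys are compared as plain strings.

```python
from bisect import bisect_right
from typing import List, Tuple

# Lower bound of each tablet, in sorted order; "" stands in for MIN (assumption).
LOWER_BOUNDS = ["", "Canteloupe", "Lime", "Strawberry"]
# Storage unit holding the tablet that starts at the matching lower bound.
STORAGE_UNITS = ["SU4", "SU3", "SU2", "SU1"]


def tablet_index(key: str) -> int:
    """Index of the tablet whose [lower, upper) interval contains key."""
    return bisect_right(LOWER_BOUNDS, key) - 1


def route(key: str) -> str:
    """Storage unit responsible for a single record."""
    return STORAGE_UNITS[tablet_index(key)]


def split_range(start: str, end: str) -> List[Tuple[str, str, str]]:
    """Split [start, end) into one (start, end, storage_unit) piece per tablet."""
    pieces = []
    i = tablet_index(start)
    lo = start
    while True:
        is_last = i + 1 == len(LOWER_BOUNDS)
        if is_last or end <= LOWER_BOUNDS[i + 1]:
            pieces.append((lo, end, STORAGE_UNITS[i]))
            return pieces
        upper = LOWER_BOUNDS[i + 1]
        pieces.append((lo, upper, STORAGE_UNITS[i]))
        lo = upper
        i += 1
```

With this mapping, `split_range("Grapefruit", "Pear")` yields the two sub-queries from the figure: Grapefruit to Lime on SU3 and Lime to Pear on SU2.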
Updates
[Figure: eight numbered steps of a write to key k; "Write key k" passes between the client, the routers, the storage units (SU), and the message brokers, and "Sequence # for key k" and "SUCCESS" are returned]
Consistency model
• Goal: make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Brian”?
[Figure: timeline of the record; it is inserted (v. 1), updated seven times (v. 2 to v. 8), and then deleted; these versions form Generation 1]
Consistency model: read-any
[Figure: version timeline v. 1 to v. 8 (Generation 1), with stale versions and the current version marked; the read may land on any of them]
• Read-any: returns a possibly stale version of the record.
• Example: in a social networking application that displays the status of a user's friends, it is not absolutely essential to get the most up-to-date value, so read-any can be used.
Consistency model: read up-to-date
[Figure: version timeline v. 1 to v. 8 (Generation 1), with stale versions and the current version marked; the read targets the current version]
Consistency model: read-critical(required version)
[Figure: version timeline v. 1 to v. 8 (Generation 1), with stale versions and the current version marked; the read requires a version ≥ v. 6]
• Read-critical: returns a version of the record that is the same as, or strictly newer than, the required version.
• Example: a user writes a record and then wants to read a version of the record that definitely reflects those changes.
Consistency model: write
[Figure: version timeline v. 1 to v. 8 (Generation 1), with stale versions and the current version marked; a write is applied]
Consistency model: test-and-set-write(required version)
[Figure: version timeline v. 1 to v. 8 (Generation 1), with stale versions and the current version marked; a write conditioned on "= v. 7" returns ERROR]
• Test-and-set-write(required version): performs the requested write if and only if the present version of the record is the same as the required version.
• This call can be used to implement transactions that first read a record and then write to it based on what was read, e.g., incrementing the value of a counter.
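The five calls on these slides can be sketched against a toy record that has one master copy and one lagging replica. This is an illustrative model only, not the PNUTS implementation: the class, the method names, the `propagate` helper, and the choice to fall back to the master copy are all assumptions.

```python
from typing import Tuple


class VersionMismatch(Exception):
    """Raised when test-and-set-write sees a different current version."""


class ToyRecord:
    """Hypothetical record with a master copy and one asynchronous replica."""

    def __init__(self, value: str) -> None:
        self.master: Tuple[int, str] = (1, value)   # (version, value)
        self.replica: Tuple[int, str] = (1, value)  # may lag behind the master

    def write(self, value: str) -> int:
        version = self.master[0] + 1
        self.master = (version, value)
        return version

    def test_and_set_write(self, required_version: int, value: str) -> int:
        # Apply the write only if nobody else has written in the meantime.
        if self.master[0] != required_version:
            raise VersionMismatch(f"current version is {self.master[0]}")
        return self.write(value)

    def propagate(self) -> None:
        # Stand-in for asynchronous replication catching up.
        self.replica = self.master

    def read_any(self) -> Tuple[int, str]:
        return self.replica  # cheapest read, possibly stale

    def read_up_to_date(self) -> Tuple[int, str]:
        return self.master  # always the current version

    def read_critical(self, required_version: int) -> Tuple[int, str]:
        # The replica is good enough if it has reached the required version.
        if self.replica[0] >= required_version:
            return self.replica
        if self.master[0] >= required_version:
            return self.master
        raise ValueError("required version does not exist yet")
```

A counter increment then becomes a read followed by `test_and_set_write` with the version that was read, retried whenever `VersionMismatch` is raised.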
Record and tablet mastership
• Data in PNUTS is replicated across sites
• A hidden field in each record stores which copy is the master copy
  • Updates can be submitted to any copy
  • They are forwarded to the master and applied in the order received by the master
• The record also contains the origin of the last few updates
  • Mastership can be changed by the current master, based on this information
  • A mastership change is simply a record update
• Tablet mastership
  • Required to ensure primary key consistency
  • Can be different from record mastership
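A rough sketch of record-level mastership follows. The slide does not state the rule the master uses to hand over mastership, so the policy here (move when the last three updates all came from the same non-master region) and the history length of three are assumptions, as are all names.

```python
from collections import deque
from dataclasses import dataclass, field

HISTORY = 3  # assumed size of "the last few updates"


@dataclass
class Record:
    value: str
    master: str  # hidden field: region that holds the master copy
    version: int = 1
    origins: deque = field(default_factory=lambda: deque(maxlen=HISTORY))


def submit_update(record: Record, value: str, origin: str) -> bool:
    """Apply an update submitted in region `origin`.

    Returns True if the update had to be forwarded to a remote master.
    """
    forwarded = origin != record.master
    # The master applies updates in the order it receives them.
    record.value = value
    record.version += 1
    record.origins.append(origin)
    # Assumed policy: hand mastership to a region that sent all recent updates.
    recent = set(record.origins)
    if len(record.origins) == HISTORY and recent == {origin} and forwarded:
        record.master = origin
        record.version += 1  # a mastership change is itself a record update
    return forwarded
```

Under this policy, a record mastered in one region whose writes start arriving from another region pays the forwarding cost only until mastership follows the writers.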
Other features
• Per-record transactions
• Copying a tablet (e.g., for failure recovery)
  • Request copy
  • Publish checkpoint message
  • Get copy of tablet as of when checkpoint is received
  • Apply later updates
• Tablet split
  • Has to be coordinated across all copies
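The tablet-copy steps can be sketched as a replay over an ordered message stream. This is a simplified illustration under assumptions: the message tuples, the function name, and the single in-process log are hypothetical stand-ins for the message broker and the storage units.

```python
from typing import Dict, List, Tuple


def copy_tablet(log: List[tuple], checkpoint_id: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Replay `log` and return (source_tablet, copied_tablet).

    Messages are ("update", key, value) or ("checkpoint", id), in delivery order.
    """
    source: Dict[str, int] = {}
    snapshot = None
    later: List[Tuple[str, int]] = []
    for message in log:
        if message[0] == "checkpoint" and message[1] == checkpoint_id:
            # The copy is taken as of the moment the checkpoint is received.
            snapshot = dict(source)
        elif message[0] == "update":
            _, key, value = message
            source[key] = value
            if snapshot is not None:
                later.append((key, value))  # must be re-applied on the copy
    if snapshot is None:
        raise ValueError("checkpoint message was never delivered")
    copy = dict(snapshot)
    for key, value in later:
        copy[key] = value  # apply updates published after the checkpoint
    return source, copy
```

Because the checkpoint travels in the same ordered stream as the updates, every update is either already in the snapshot or replayed afterwards, so the copy converges to the source.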
Query processing
• Range scan can span tablets
  • Only one tablet is scanned at a time
  • Client may not need all results at once
  • A continuation object is returned to the client to indicate where the range scan should continue
• Notification
  • One pub-sub topic per tablet
  • Client knows about tables, but does not know about tablets
    • Automatically subscribed to all tablets, even as tablets are added or removed
  • Usual problem with pub-sub: undelivered notifications, handled in the usual way
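The continuation mechanism for range scans might look like the sketch below. The tablet contents, the shape of the continuation object (a small dict naming the next tablet), and the function names are assumptions made for illustration; the slide only says that one tablet is scanned at a time and that a continuation object tells the client where to resume.

```python
from typing import List, Optional, Tuple

# Hypothetical ordered table split into non-empty tablets of sorted (key, value) rows.
TABLETS = [
    [("Apple", 1), ("Banana", 2)],
    [("Grape", 3), ("Kiwi", 4), ("Lemon", 5)],
    [("Mango", 6), ("Pear", 7)],
]

Rows = List[Tuple[str, int]]


def range_scan(start: str, end: str,
               continuation: Optional[dict] = None) -> Tuple[Rows, Optional[dict]]:
    """Scan keys in [start, end), touching at most one tablet with results per call.

    Returns (rows, continuation); continuation is None once the scan is complete.
    """
    i = 0 if continuation is None else continuation["tablet"]
    # Skip tablets whose largest key is still below the start of the range.
    while i < len(TABLETS) and TABLETS[i][-1][0] < start:
        i += 1
    if i == len(TABLETS) or TABLETS[i][0][0] >= end:
        return [], None
    rows = [(k, v) for k, v in TABLETS[i] if start <= k < end]
    nxt = i + 1
    more = nxt < len(TABLETS) and TABLETS[nxt][0][0] < end
    return rows, ({"tablet": nxt} if more else None)


def scan_all(start: str, end: str) -> Rows:
    """Client loop: keep passing the continuation back until it is None."""
    rows, cont = range_scan(start, end)
    while cont is not None:
        more, cont = range_scan(start, end, cont)
        rows.extend(more)
    return rows
```

A client that only needs the first batch can stop after one call and discard the continuation, which is the point of not scanning all tablets up front.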
Experimental setup
• Production version, supporting
  • Hash tables
  • Ordered tables
• Database
  • 3 regions: 2 west coast, 1 east coast
  • 1 KB records, 128 tablets per region
  • Each process had 100 client threads, for a total of 300 clients across the system
• Workload
  • 1200-3600 requests/second
  • 0-50% writes
  • 80% locality
Inserts
• Inserts (hash tables)
  • 75.6 ms per insert in West 1 (tablet master)
  • 131.5 ms per insert in the non-master West 2
  • 315.5 ms per insert in the non-master East
• Inserts (ordered tables)
  • 33 ms per insert in West 1
  • 105.8 ms per insert in the non-master West 2
  • 324.5 ms per insert in the non-master East
• Latency decreases, and then increases, with increasing load.
  • The high latency at low request rates resulted from an anomaly in the HTTP client library we used, which closed TCP connections between requests at low request rates, requiring an expensive TCP setup for each call.
• 10% writes by default
• As the proportion of reads increases, the average latency decreases.