PNUTS Building and running a cloud database system Brian Cooper
Overview • Building a cloud service • How PNUTS works • “Advanced” features • Lessons learned
Yahoo! • Yahoo! has almost 100 properties • Mail, Messenger, Finance, Shopping, Sports, OMG! … • 20 properties are #1 or #2 • Yahoo! is #1 in time spent online in the U.S. (10.5%) • 164 million unique U.S. visitors in January (79% of the U.S. online audience) • 598 million unique worldwide visitors in January (48% of the global online audience) • This is where we make our money: users coming to Yahoo! sites and spending time • We are focusing on the "audience" side of Yahoo!, not the search engine (Jan. 2010, source: comScore)
Why? Two competing needs • Accelerating innovation – focus on building your application, not the infrastructure • Increasing availability – without infinite hardware and system operators • How will cloud services help? • Cloud services perform the heavy lifting of scaling and high availability • Focus on horizontal cloud services – platforms to support multiple vertical applications
Requirements for Cloud Services • Multi-tenancy • Support for multiple, organizationally distant customers • Horizontal scaling • Add cloud capacity incrementally and transparently as needed by tenants • Elasticity • Tenants can request and receive resources on-demand, paying for usage • Security & Account management • Accounts/IDs, authentication, access control; isolate tenants; data security • Availability & Operability • High availability and reliability over commodity hardware • Easy to operate, with few operators; automated monitoring & metering
Cloud Data Management Systems • Structured record storage (PNUTS) – CRUD; point lookups and short scans; index-organized tables and random I/Os; $ per latency • Large data analysis (Hadoop) – scan-oriented workloads; focus on sequential disk I/O; $ per CPU cycle • Blob storage (MObStor) – object retrieval and streaming; scalable file storage; $ per GB
What Makes a Cloud Data Service? • DBA to the world! • Many apps, each with hundreds or thousands of client processes • Must automanage – cannot manually tweak knobs • Must autobalance – load will constantly shift • Massive scalability • Scaling up via shared or specialized hardware is infeasible • Scale out with commodity hardware – 10,000 or 100,000 servers • Failures are the common case • Must continue to operate in the face of failed servers • Must autoscale – plug in new servers and let them go • These capabilities must be baked in from the start
Example: social network updates • What are my friends up to? [Diagram: a social graph connecting Brian, Sonja, Jimi, Brandon, and Kurt, with updates from Sonja and Brandon flowing to Brian]
Example: social network updates [Diagram: an updates table keyed by sequence number – 6 Jimi &lt;ph.., 8 Mary &lt;re.., 12 Sonja &lt;ph.., 15 Brandon &lt;po.., 16 Mike &lt;ph.., 17 Bob &lt;re.. – where one entry expands to &lt;photo&gt;&lt;title&gt;Flower&lt;/title&gt;&lt;url&gt;www.flickr.com&lt;/url&gt;&lt;/photo&gt;] (caveat: not necessarily how our Y! Updates product actually works)
The world has changed • Can trade away “standard” DBMS features: • Complicated queries • Strong transactions • But I must have my scalability, flexibility and availability!
The PNUTS Solution • Record-orientation: optimized for low-latency record access • Scale out: add machines to scale throughput • Asynchrony: avoid expensive synchronous operations • Consistency model: hide complexity of asynchronous replication • Flexible access: hashed or ordered tables, indexes, views; flexible schemas • Cloud deployment model: hosted, managed service [VLDB 08]
PNUTS Is Not… • Not a SQL database – simple queries, simple transaction model • Not a parallel processing engine – though it can play well with MapReduce • Not a filesystem – record storage, not blob storage • Not peer-to-peer – we own the servers and can save some complexity; servers are organized into natural groups (datacenters)
Query Model • Simple call API: get, set, delete, scan, getrange • Scan and getrange take an optional predicate • Web service (RESTful) API – data encoded as JSON
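The semantics of that call API can be sketched with a small in-memory model. This is purely illustrative: the class and method names below are invented for the sketch (the real PNUTS client is a network service, not a local dict).

```python
# Minimal in-memory sketch of the PNUTS call API semantics:
# get / set / delete / scan / getrange, with optional predicates
# on the scan operations. Names are illustrative, not the real API.

class PnutsTableSketch:
    """Models one ordered PNUTS table as a dict of records."""

    def __init__(self):
        self._rows = {}  # key -> record (a dict of fields)

    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        self._rows[key] = record

    def delete(self, key):
        self._rows.pop(key, None)

    def scan(self, predicate=lambda rec: True):
        # Full-table scan in key order, optionally filtered.
        for key in sorted(self._rows):
            if predicate(self._rows[key]):
                yield key, self._rows[key]

    def getrange(self, lo, hi, predicate=lambda rec: True):
        # Range scan over the ordered key space [lo, hi).
        for key in sorted(self._rows):
            if lo <= key < hi and predicate(self._rows[key]):
                yield key, self._rows[key]
```

Note how getrange only makes sense for ordered tables; on a hash table it would require a full scan.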
Representing sparse data

$ curl http://pnuts.yahoo.com/PNUTSWebService/V1/get/userTable/yahoo

{"record": {
   "status":   {"code": 200, "message": "OK"},
   "metadata": {"seq_id": "5", "modtime": 1234231551, "disk_size": 89},
   "fields": {
     "addr":  {"value": "700 First Ave"},
     "city":  {"value": "Sunnyvale"},
     "state": {"value": "CA"}
   }
}}

(some details changed to protect the innocent)
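A client consuming this response might flatten the sparse "fields" map into a plain dictionary. The response shape below is copied from the slide; the helper function is an assumed client-side convenience, not part of PNUTS itself.

```python
import json

# Parse a PNUTS web-service response (shape taken from the slide above)
# and flatten the sparse "fields" map into field-name -> value.
response_text = """
{"record": {
   "status":   {"code": 200, "message": "OK"},
   "metadata": {"seq_id": "5", "modtime": 1234231551, "disk_size": 89},
   "fields": {
     "addr":  {"value": "700 First Ave"},
     "city":  {"value": "Sunnyvale"},
     "state": {"value": "CA"}
   }
}}
"""

def flatten_record(text):
    record = json.loads(text)["record"]
    if record["status"]["code"] != 200:
        raise RuntimeError(record["status"]["message"])
    return {name: f["value"] for name, f in record["fields"].items()}
```

Because records are sparse, a field absent from "fields" is simply absent from the result; no NULL padding is needed.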
DISTRIBUTION
Architecture [Diagram: clients issue requests through a REST API to routers; a tablet controller directs the routers; storage units hold the data, and log servers record updates]
Tablet Splitting and Balancing • Each storage unit has many tablets (horizontal partitions of the table) • Tablets may grow over time; overfull tablets split • A storage unit may become a hotspot; shed load by moving tablets to other servers
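The split-and-move policy can be sketched in a few lines. The threshold and the median-split rule here are assumptions for illustration; in PNUTS these decisions involve the tablet controller.

```python
# Sketch of tablet splitting and load shedding. The record-count
# threshold is tiny for the example; real tablets are sized in GB.
MAX_RECORDS = 4

def maybe_split(tablet):
    """tablet: sorted list of keys. Split an overfull tablet at its median."""
    if len(tablet) <= MAX_RECORDS:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]

def shed_load(hot_server, cold_server):
    """Move the largest tablet off a hot storage unit onto a colder one."""
    biggest = max(hot_server, key=len)
    hot_server.remove(biggest)
    cold_server.append(biggest)
```

Splitting at a key boundary keeps each tablet a contiguous key range, so routers only need the boundary keys to locate a record.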
Tablets—Hash Table (columns: Name | Description | Price)
Tablet [0x0000, 0x2AF3): Grape | Grapes are good to eat | $12 · Lime | Limes are green | $9 · Apple | Apple is wisdom | $1 · Strawberry | Strawberry shortcake | $900
Tablet [0x2AF3, 0x911F): Orange | Arrgh! Don't get scurvy! | $2 · Avocado | But at what price? | $3 · Lemon | How much did you pay for this lemon? | $1 · Tomato | Is this a vegetable? | $14
Tablet [0x911F, 0xFFFF]: Banana | The perfect fruit | $2 · Kiwi | New Zealand | $8
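Routing in the hash-table layout maps a key's hash into the 16-bit space and finds the enclosing tablet. The boundaries below come from the diagram; the choice of MD5 as the hash is an assumption for the sketch, not PNUTS's actual hash function.

```python
import hashlib

# Tablet boundaries from the diagram: three tablets over [0x0000, 0xFFFF].
BOUNDARIES = [0x0000, 0x2AF3, 0x911F, 0xFFFF]

def key_hash(key):
    # Stable 16-bit hash (MD5 here is illustrative only).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) & 0xFFFF

def tablet_for(key):
    """Return the index of the tablet whose hash range contains the key."""
    h = key_hash(key)
    for i in range(len(BOUNDARIES) - 1):
        if BOUNDARIES[i] <= h <= BOUNDARIES[i + 1]:
            return i
    return len(BOUNDARIES) - 2  # unreachable given the masking above
```

Hashing spreads hot keys uniformly, which is why the hash layout resists hotspots better than the ordered layout, at the cost of range scans.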
Tablets—Ordered Table (columns: Name | Description | Price)
Tablet [A, H): Apple | Apple is wisdom | $1 · Avocado | But at what price? | $3 · Banana | The perfect fruit | $2 · Grape | Grapes are good to eat | $12
Tablet [H, Q): Kiwi | New Zealand | $8 · Lemon | How much did you pay for this lemon? | $1 · Lime | Limes are green | $9 · Orange | Arrgh! Don't get scurvy! | $2
Tablet [Q, Z]: Strawberry | Strawberry shortcake | $900 · Tomato | Is this a vegetable? | $14
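For the ordered layout, routing is a binary search over the tablet split points. The split points below are taken from the diagram; the router sketch itself is illustrative.

```python
import bisect

# Split points from the diagram: tablet 0 holds [A, H), tablet 1
# holds [H, Q), tablet 2 holds [Q, Z]. A router keeps only these
# boundary keys and binary-searches them per request.
SPLIT_POINTS = ["H", "Q"]

def tablet_for(key):
    """Return the index of the tablet whose key range contains the key."""
    return bisect.bisect_right(SPLIT_POINTS, key)
```

Because keys in one tablet are contiguous, a range scan like getrange("K", "M") touches only tablet 1, which is exactly the "efficient scanning of clustered subranges" property claimed later.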
Accessing Data [Diagram: a four-step read path: (1) the client sends "get key k" to a router, (2) the router forwards it to the storage unit holding k's tablet, (3) the storage unit returns the record for key k, (4) the router relays the record to the client]
Updates • Log servers make storage units disposable [Diagram: an eight-step write path: the client sends "write key k" to a router, the router forwards it to the right storage unit, the update is committed to a log server and assigned a sequence number for key k, and SUCCESS flows back to the client]
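The key idea, that the log-server append is the commit point and storage units can be rebuilt by replay, can be sketched as follows. The classes and the synchronous apply step are simplifications; in PNUTS the apply can lag the commit.

```python
# Sketch of the write path: commit to the log first, then apply to
# the storage unit. A lost storage unit is "disposable" because it
# can be reconstructed by replaying the log.
class LogServer:
    def __init__(self):
        self.entries = []

    def append(self, key, seq, value):
        self.entries.append((key, seq, value))  # durable commit point

class StorageUnit:
    def __init__(self):
        self.data = {}

    def apply(self, key, seq, value):
        self.data[key] = (seq, value)

def write(log, su, key, value, next_seq):
    seq = next_seq[key] = next_seq.get(key, 0) + 1
    log.append(key, seq, value)   # 1) commit to the log, then ack the client
    su.apply(key, seq, value)     # 2) apply to storage (may be asynchronous)
    return seq

def rebuild(log, fresh_su):
    """Replay the log to reconstruct a failed storage unit."""
    for key, seq, value in log.entries:
        fresh_su.apply(key, seq, value)
```

Per-key sequence numbers are what later make the timeline consistency model possible: every replica applies a record's updates in sequence order.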
Global Replication (not necessarily actual Yahoo! datacenters)
Consistency Model • Goal: make it easier for applications to reason about updates and cope with asynchrony • What happens to a record with primary key "Brian"? [Timeline: record inserted, then a sequence of updates producing v.1 through v.8 within Generation 1, then a delete] • We also support an eventual consistency model • Applications can choose which kind of table to create
Timeline Model • Read: returns some version from the record's timeline, possibly stale [Timeline: v.1 through v.8 in Generation 1; any stale version or the current version v.8 may be returned]
Timeline Model • Read up-to-date: returns the current version (v.8), never a stale one
Timeline Model • Read ≥ v.6: returns v.6 or any later version, never anything older
Timeline Model • Write: appends a new current version to the record's timeline
Timeline Model • Write if = v.7: test-and-set write; returns ERROR because the current version is v.8, not v.7
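The five operations above can be summarized in one toy class. The method names follow the API in the PNUTS paper (read-any, read-critical, read-latest, write, test-and-set write); the single-object replica model is an assumption for the sketch.

```python
# Toy model of the per-record timeline API. One object stands in for
# a record replicated across regions; replica_version models how far
# a stale replica has caught up with the master's timeline.
class TimelineRecord:
    def __init__(self, value):
        self.versions = [value]   # v.1, v.2, ... (version = index + 1)
        self.replica_version = 1  # local replica's position on the timeline

    @property
    def current_version(self):
        return len(self.versions)

    def read_any(self):
        # May return a stale version: whatever the local replica has.
        return self.versions[self.replica_version - 1]

    def read_critical(self, min_version):
        # Must return v.min_version or later; fail if too stale.
        if self.replica_version < min_version:
            raise RuntimeError("replica too stale; retry at the master")
        return self.versions[self.replica_version - 1]

    def read_latest(self):
        return self.versions[-1]

    def write(self, value):
        self.versions.append(value)
        return self.current_version

    def test_and_set_write(self, expected_version, value):
        if self.current_version != expected_version:
            raise RuntimeError("version mismatch")  # the ERROR case above
        return self.write(value)
```

Test-and-set write gives per-record optimistic concurrency control without cross-record transactions.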
Consistency levels • Eventual consistency • Transactions: Alice changes status from "Sleeping" to "Awake"; Alice changes location from "Home" to "Work" [Diagram: Region 1 applies (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake); Region 2 applies the updates in the other order, passing through (Alice, Work, Sleeping): an "invalid" state is visible, though the final state is consistent]
Consistency levels • Timeline consistency • Transactions: the same two updates [Diagram: Region 1 applies (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake); Region 2 applies the updates in timeline order, possibly skipping straight from (Alice, Home, Sleeping) to (Alice, Work, Awake), so no invalid intermediate state is ever visible]
Mastering [Diagram: each region holds a replica of the per-record mastering table (record key, sequence number, master region), e.g. A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E. All replicas agree on each record's master; in the example, record B's mastership moves from region W to region E and the change propagates to every replica]
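Per-record mastering can be sketched as: a write arriving at a non-master region is forwarded to the record's master, which orders it and propagates the new version everywhere. Region names and sequence numbers below come from the diagram; modeling the asynchronous propagation as a synchronous loop is a simplification.

```python
# Sketch of per-record mastering. Each record carries its master
# region; non-master regions forward writes to the master, which
# assigns the next sequence number and replicates the update.
class Region:
    def __init__(self, name):
        self.name = name
        self.records = {}  # key -> (seq, master_region, value)

def write(regions, local, key, value):
    seq, master, _ = regions[local].records[key]
    if local != master:
        # Forward the write to the record's master region.
        return write(regions, master, key, value)
    new = (seq + 1, master, value)
    for r in regions.values():   # asynchronous replication, modeled
        r.records[key] = new     # synchronously for the sketch
    return new[0]
```

Since roughly 80% of writes to a record tend to come from one region, putting the master there keeps most writes local.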
Coping With Failures [Diagram: when a region fails, mastership of its records is overridden (OVERRIDE W → E), transferring record masters from the failed region W to region E in the surviving replicas of the mastering table]
Ordered tables • Ordered tables provide efficient scanning of clustered subranges • Use cases: time ranges, relationship graphs, hierarchical data, indexes and views
Ordered tables are tricky • Hotspots! • Solution: Proactive load balancing • Move tablets from hot servers to cold servers • If necessary, split hot tablets
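The proactive balancing policy can be sketched as a greedy loop: while the hottest server carries markedly more load than the coldest, move a tablet across. The 1.2x imbalance threshold and the move-smallest heuristic are assumptions for the sketch.

```python
# Greedy sketch of proactive load balancing: move tablets from the
# most-loaded storage unit to the least-loaded one while the
# imbalance exceeds a threshold. Loads are request rates per tablet.
def balance(servers, threshold=1.2, max_moves=100):
    """servers: dict name -> list of (tablet, load). Mutated in place."""
    def load(s):
        return sum(l for _, l in servers[s])

    for _ in range(max_moves):
        hot = max(servers, key=load)
        cold = min(servers, key=load)
        if not servers[hot] or load(hot) <= threshold * load(cold):
            return
        tablet = min(servers[hot], key=lambda t: t[1])  # limit overshoot
        if load(hot) - tablet[1] < load(cold):
            return  # moving would just invert the imbalance
        servers[hot].remove(tablet)
        servers[cold].append(tablet)
```

If no single tablet move helps because one tablet carries all the heat, that is exactly the "split hot tablets" case from the slide.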
Parallel scans [Diagram: a client issues a scan; the scan engine runs partial scans across multiple storage units in parallel]
Adaptive server allocation [Diagram: client, scan engine, and storage units, illustrating adaptive allocation of servers to a scan]
Server scheduling [Diagram: the scan engine schedules scans from Client 1 and Client 2 across the storage units]
Indexes and views • How to have lots of interesting indexes, without killing performance? • Solution: Asynchrony! • Indexes updated asynchronously when base table updated • Some interesting views can be represented as indexes
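The asynchronous maintenance idea can be sketched with a queue: the base-table write enqueues an index delta and returns immediately, and a maintainer applies deltas later. The class below is an invented illustration of that pattern, not the PNUTS view maintainer.

```python
from collections import deque

# Sketch of asynchronous index maintenance: base-table updates
# enqueue deltas on the fast path; a maintainer drains the queue
# later, so index work never sits on the client's write path.
class AsyncIndex:
    def __init__(self, key_fn):
        self.key_fn = key_fn   # extracts the index key from a record
        self.index = {}        # index key -> set of base-table keys
        self.pending = deque()

    def on_base_update(self, base_key, old_record, new_record):
        # Fast path: just record the delta and return.
        self.pending.append((base_key, old_record, new_record))

    def drain(self):
        # Maintainer path: apply queued deltas to the index.
        while self.pending:
            base_key, old, new = self.pending.popleft()
            if old is not None:
                self.index.get(self.key_fn(old), set()).discard(base_key)
            if new is not None:
                self.index.setdefault(self.key_fn(new), set()).add(base_key)
```

The trade-off is visible in the test below: between the update and the drain, the index is stale, which is acceptable under timeline or eventual consistency.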
View types • Index – a remote view table maintained alongside the base table [Diagram: a base table and a ByAuthor view table]
View types • Equijoin – co-clustered remote view tables, each sub-table managed like an index [Diagram: Posts and Comments view tables co-clustered to form a PostComments view table]
Remote view tables • A regular table, but updated by the view maintainer (VM) instead of a client [Diagram: an update flows from a storage unit's log server through the view maintainer to the view table's storage unit and its own log server]
SOME NUMBERS
Performance comparison • Setup • Six server-class machines • 8 cores (2 x quad-core) 2.5 GHz CPUs • 8 GB RAM • 6 x 146 GB 15K RPM SAS drives in RAID 1+0 • Gigabit Ethernet • RHEL 4 • Plus extra machines for clients, routers, controllers, etc. • Workloads • 120 million 1 KB records = 20 GB per server • Write-heavy workload: 50/50 read/update • Updates write the whole record • 50 client processes usually; up to 300 needed to generate higher throughputs • Obviously many variations are possible; these are just two points in the space • Metrics • Latency versus throughput curves • Caveats • Write performance would be improved for Sherpa (the internal name for PNUTS), Sharded and Cassandra with a dedicated log disk • We tuned each system as well as we knew how