Distributed Systems – Tutorial 11: Yahoo! PNUTS. Written by Alex Libov, based on the OSCON 2011 presentation. Winter semester, 2013-2014
Yahoo! PNUTS • A massively parallel and geographically distributed database system for Yahoo!’s web applications • provides data storage organized as hashed or ordered tables • low latency for large numbers of concurrent requests including updates and queries • per-record consistency guarantees
Consistency • Serializability of general transactions is inefficient and often unnecessary • If a user changes an avatar, posts new pictures, or invites several friends to connect, little harm is done if the new avatar is not initially visible to one friend • Many distributed applications go to the other extreme of providing only eventual consistency • Too weak and inadequate for many web applications • PNUTS proposes a consistency model that falls between these two extremes
SYSTEM ARCHITECTURE • Data is organized into tables of records with attributes • In addition to typical data types, “blob” is a valid data type, allowing arbitrary structures inside a record • Data tables are horizontally partitioned into groups of records called tablets • Tablets are scattered across many servers • Each server might have hundreds or thousands of tablets, but each tablet is stored on a single server within a region
Distributed Hash Table (figure): the hash space is divided into intervals (e.g., boundaries 0x0000, 0x2AF3, 0x911F), and each interval forms a tablet
Distributed Ordered Table (figure): tablets are clustered by key range
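To make the partitioning concrete, here is a minimal sketch (in Python, not PNUTS's actual code) of how a record key could be mapped to a tablet in the hashed-table case; the hash-space boundaries echo the values pictured above and are purely illustrative.

import bisect
import hashlib

# Illustrative tablet boundaries over a 16-bit hash space (made-up values,
# echoing the 0x0000 / 0x2AF3 / 0x911F boundaries in the figure above).
TABLET_BOUNDARIES = [0x0000, 0x2AF3, 0x911F]

def key_to_tablet(key):
    # Hash the key into the 16-bit space, then find the right-most boundary
    # that is <= the hash; that interval is the owning tablet.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) & 0xFFFF
    return bisect.bisect_right(TABLET_BOUNDARIES, h) - 1

print(key_to_tablet("user:alice"))   # index of the tablet owning this key's hash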
Query model • PNUTS supports very simple queries, sacrificing a rich API in favor of response time and overall simplicity • No joins, group-by, etc. • These are stated as future work • The system is designed to work well with queries that read and write single records or small groups of records
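As an illustration of this query model, below is a small sketch of a client that offers only single-record gets/puts and small range scans over an ordered table; the class and method names are hypothetical, not the real PNUTS API.

# Hypothetical client interface mirroring the query model described above:
# single-record gets/puts and small range scans, but no joins or group-by.
class PnutsClientSketch:
    def __init__(self):
        self.table = {}           # ordered table: key -> record (dict of attributes)

    def get(self, key):
        return self.table.get(key)

    def put(self, key, record):
        self.table[key] = record

    def range_scan(self, start, end):
        # Small range scan over an ordered (key-range clustered) table.
        return [(k, v) for k, v in sorted(self.table.items()) if start <= k < end]

c = PnutsClientSketch()
c.put("user:alice", {"status": "Awake"})
c.put("user:bob", {"status": "Sleeping"})
print(c.range_scan("user:a", "user:c"))   # both records, in key order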
PNUTS – Single Region • Tablet controller: a single pair of active/standby servers; maintains the map from database.table.key to tablet to storage unit • Routers: route client requests to the correct storage unit; cache the maps from the tablet controller • Storage units: store records; serve get/set/delete requests
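A minimal sketch of this request path, assuming a tablet controller that owns the key-to-tablet map and routers that cache it; all names and boundary values are illustrative.

class TabletController:
    """An active/standby pair in the real system; here a single object."""
    def __init__(self):
        # Tablet boundaries (by key) and the storage unit holding each tablet.
        self.boundaries = ["a", "m", "t"]          # tablets: [a,m), [m,t), [t,...)
        self.tablet_to_su = {0: "su-1", 1: "su-2", 2: "su-1"}

    def get_map(self):
        return list(self.boundaries), dict(self.tablet_to_su)

class Router:
    def __init__(self, controller):
        # Routers cache the controller's map; a stale cache would be refreshed on a miss.
        self.boundaries, self.tablet_to_su = controller.get_map()

    def route(self, key):
        # Keys below the first boundary are out of range in this sketch.
        tablet = sum(1 for b in self.boundaries if key >= b) - 1
        return self.tablet_to_su[tablet]

router = Router(TabletController())
print(router.route("alice"))   # -> "su-1"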
Tablet Splitting & Balancing • Each storage unit has many tablets (horizontal partitions of the table) • A storage unit may become a hotspot • Tablets may grow over time • Overfull tablets split • Shed load by moving tablets to other servers
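The two balancing moves can be sketched roughly as follows (a toy model, not the real implementation): split an overfull tablet at its median key, and shed load by moving a tablet from the most loaded storage unit to the least loaded one.

MAX_TABLET_SIZE = 4   # illustrative threshold

def split_if_overfull(tablet):
    """tablet: sorted list of (key, record). Returns one or two tablets."""
    if len(tablet) <= MAX_TABLET_SIZE:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]      # split at the median key

def shed_load(storage_units):
    """storage_units: dict name -> list of tablets.
    Move one tablet from the most loaded unit to the least loaded one."""
    hottest = max(storage_units, key=lambda su: len(storage_units[su]))
    coolest = min(storage_units, key=lambda su: len(storage_units[su]))
    if hottest != coolest and storage_units[hottest]:
        storage_units[coolest].append(storage_units[hottest].pop())
    return storage_units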
Consistency Options (a spectrum trading availability against consistency) • Eventual Consistency: low-latency updates and inserts done locally • Record Timeline Consistency: each record is assigned a “master region”; inserts succeed, but updates could fail during outages • Primary Key Constraint + Record Timeline: each tablet and record is assigned a “master region”; inserts and updates could fail during outages
Record Timeline Consistency • One of the replicas is designated as the master, per record • All updates to that record are forwarded to the master • If a replica is receiving the majority of write requests, it becomes the master • Each update advances the generation of the record
Record Timeline Consistency – example. Transactions: 1. Alice changes status from “Sleeping” to “Awake”; 2. Alice changes location from “Home” to “Work”. The record evolves from (Alice, Home, Sleeping) to (Alice, Home, Awake) to (Alice, Work, Awake). Replicas in Region 1 and Region 2 may apply these updates with some delay, but because all updates lie on a single timeline, no replica should ever see the record as (Alice, Work, Sleeping)
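A minimal sketch of the mechanism behind this timeline, assuming one object per replica: the record's master serializes writes and advances a generation counter, while non-master replicas forward writes to it (asynchronous propagation back to the replicas is omitted here).

class RecordReplica:
    def __init__(self, region, master=None):
        self.region = region
        self.master = master          # None => this replica is the record's master
        self.value = None
        self.generation = 0

    def write(self, value):
        if self.master is not None:
            return self.master.write(value)   # forward to the master replica
        self.generation += 1                  # master serializes and versions writes
        self.value = value
        return self.generation

master = RecordReplica("region-1")
replica = RecordReplica("region-2", master=master)
replica.write(("Alice", "Home", "Awake"))     # forwarded; generation 1
replica.write(("Alice", "Work", "Awake"))     # forwarded; generation 2
print(master.value, master.generation)        # ('Alice', 'Work', 'Awake') 2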
API calls • Read-any • Returns a possibly stale version of the record. • The returned record is always a valid one from the record’s history. • This call has lower latency than other read calls with stricter guarantees • Read-critical(required version) • Returns a version of the record that is strictly newer than, or the same as, the required version. • Read-latest • Returns the latest copy of the record that reflects all writes that have succeeded. • Write • This call gives the same ACID guarantees as a transaction with a single write operation in it. It is useful for blind writes, e.g., a user updating the status on his profile. • Test-and-set-write(required version) • This call performs the requested write to the record if and only if the present version of the record is the same as the required version.
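These calls can be illustrated with a single versioned record, as in the sketch below; the class and method names are made up and only mimic the semantics described above.

class VersionedRecord:
    def __init__(self):
        self.history = []                     # committed versions, oldest first

    def write(self, value):
        self.history.append(value)
        return len(self.history)              # new version number

    def read_any(self):
        # May return a stale but valid version from the record's history.
        return self.history[0] if self.history else None

    def read_latest(self):
        return self.history[-1] if self.history else None

    def read_critical(self, required_version):
        # Returns a version at least as new as required_version (here: the latest).
        return self.history[-1] if len(self.history) >= required_version else None

    def test_and_set_write(self, required_version, value):
        # Write only if the present version equals required_version.
        if len(self.history) != required_version:
            return None
        return self.write(value)

r = VersionedRecord()
v = r.write("Sleeping")                       # version 1
print(r.test_and_set_write(v, "Awake"))       # succeeds -> version 2
print(r.test_and_set_write(v, "At lunch"))    # stale required version -> None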
Eventual Consistency • Timeline consistency comes at a price • Writes not originating in the record’s master region are forwarded to the master and have longer latency • The mastership of a record can migrate between replicas • When the master region is down, the record is unavailable for writes • Hence PNUTS also offers an eventual consistency mode • On conflict, the latest write per field wins • Target customers • Those that externally guarantee no conflicts • Those that understand the semantics and can cope with occasional conflicts
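The “latest write per field wins” rule might look like the following sketch, where each replica stores a (timestamp, value) pair per field; the field names and timestamps are illustrative.

def merge_latest_write_per_field(replica_a, replica_b):
    """Each replica maps field -> (timestamp, value); the higher timestamp wins per field."""
    merged = dict(replica_a)
    for field, (ts, value) in replica_b.items():
        if field not in merged or ts > merged[field][0]:
            merged[field] = (ts, value)
    return merged

a = {"status": (10, "Awake"),    "location": (12, "Work")}
b = {"status": (11, "At lunch"), "location": (9,  "Home")}
print(merge_latest_write_per_field(a, b))
# {'status': (11, 'At lunch'), 'location': (12, 'Work')}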
Yahoo! Message Broker (YMB) • A topic-based publish/subscribe system • Data updates are considered “committed” when they have been published to YMB • At some point after being committed, the update will be asynchronously propagated to the other regions and applied to their replicas • YMB guarantees that published messages will be delivered to all topic subscribers even in the presence of single broker machine failures • It does so by logging the message to multiple disks on different servers: two copies are logged initially, and more copies are logged as the message propagates • The message is not purged from the YMB log until PNUTS has verified that the update is applied to all replicas of the database • YMB provides partial ordering of published messages • Messages published to a particular YMB cluster will be delivered to all subscribers in the order they were published
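A toy sketch of this commit-then-propagate flow, with an in-process list standing in for YMB's replicated log; the real broker's durability, topics, and failure handling are not modeled.

class BrokerSketch:
    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []                              # messages not yet applied everywhere

    def publish(self, update):
        self.log.append(update)                    # commit point: update is logged
        return True

    def propagate(self):
        # Asynchronous delivery, in publication order; a message is purged only
        # once every replica has applied it.
        still_pending = []
        for update in self.log:
            acks = [replica.apply(update) for replica in self.replicas]
            if not all(acks):
                still_pending.append(update)
        self.log = still_pending

class ReplicaSketch:
    def __init__(self):
        self.records = {}
    def apply(self, update):
        key, value = update
        self.records[key] = value
        return True

broker = BrokerSketch([ReplicaSketch(), ReplicaSketch()])
broker.publish(("user:alice", "Awake"))            # committed here
broker.propagate()                                 # later, applied at every replica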
Recovery • Recovering from a failure involves copying lost tablets from another replica • A three-step process: • The tablet controller requests a copy from a particular remote replica (the “source tablet”) • A “checkpoint message” is published to YMB, to ensure that any in-flight updates at the time the copy is initiated are applied to the source tablet • The source tablet is copied to the destination region • To support this recovery protocol, tablet boundaries are kept synchronized across replicas, and tablet splits are conducted by having all regions split a tablet at the same point • This is coordinated by a two-phase commit between regions
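The three steps can be sketched as follows, with dictionaries standing in for regions and a simple list standing in for the in-flight updates that the checkpoint message flushes; all names are illustrative.

def recover_tablet(tablet_id, source_region, dest_region, in_flight_updates):
    # Step 1: the tablet controller requests a copy from a remote replica
    # (the "source tablet").
    source_tablet = source_region[tablet_id]

    # Step 2: a checkpoint message ensures any in-flight updates at the time
    # the copy was requested are applied to the source tablet first.
    for key, value in in_flight_updates:
        source_tablet[key] = value

    # Step 3: copy the now up-to-date source tablet to the destination region.
    dest_region[tablet_id] = dict(source_tablet)

source = {"tablet-7": {"user:alice": "Awake"}}
dest = {}
recover_tablet("tablet-7", source, dest, in_flight_updates=[("user:bob", "Asleep")])
print(dest)   # the recovered tablet, including the flushed in-flight update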
For more info http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf