D3S: Debugging Deployed Distributed Systems
Xuezheng Liu et al., Microsoft Research, NSDI 2008
Presenter: Shuo Tang, CS525@UIUC
Debugging distributed systems is difficult
• Bugs are difficult to reproduce
• Many machines executing concurrently
• Machines/network may fail
• Consistent snapshots are not easy to get
• Current approaches
  • Multi-threaded debugging
  • Model checking
  • Runtime checking
State of the Art
• Example: distributed reader-writer locks
• Log-based debugging
  • Step 1: add logs
      void ClientNode::OnLockAcquired(…) {
        …
        print_log(m_NodeID, lock, mode);
      }
  • Step 2: collect logs
  • Step 3: write checking scripts
Problems
• Too much manual effort
• Difficult to anticipate what to log
  • Too much? Too little?
• Checking a large system is challenging
  • A central checker cannot keep up
  • Snapshots must be consistent
D3S Contribution
• A simple language for writing distributed predicates
• Programmers can change what is being checked on-the-fly
• Failure-tolerant consistent snapshots for predicate checking
• Evaluation with five real-world applications
D3S Workflow
• [Figure: processes expose their lock states; checkers evaluate the predicate "no conflicting locks" over the exposed states and report a violation when a conflict appears]
Glance at a D3S Predicate
    V0: exposer { (client: ClientID, lock: LockID, mode: LockMode) }
    V1: V0 { (conflict: LockID) } as final

    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

    class MyChecker : vertex<V1> {
      virtual void Execute(const V0::Snapshot& snapshot) {
        …  // invariant logic, written in sequential style
      }
      static int64 Mapping(const V0::tuple& t);  // guidance for partitioning
    };
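The invariant logic elided in Execute() above could, for example, scan the snapshot of exposed (client, lock, mode) tuples and flag any lock that is held exclusively by one client while another client also holds it. Below is a minimal sketch of such a conflict check; the field names and lock modes are assumptions for illustration, not code from the paper.

    // Sketch of the conflict check a D3S Execute() body could perform.
    // Tuple layout and EXCLUSIVE/SHARED names are illustrative assumptions.
    #include <map>
    #include <vector>

    enum LockMode { SHARED, EXCLUSIVE };

    struct LockState {              // one exposed (client, lock, mode) tuple
        int      client;
        int      lock;
        LockMode mode;
    };

    // Return the IDs of locks whose holders conflict in this snapshot:
    // an EXCLUSIVE holder coexisting with any other holder.
    std::vector<int> FindConflicts(const std::vector<LockState>& snapshot) {
        std::map<int, std::vector<LockState>> holders;   // lock -> current holders
        for (const LockState& s : snapshot)
            holders[s.lock].push_back(s);

        std::vector<int> conflicts;
        for (const auto& entry : holders) {
            bool exclusive = false;
            for (const LockState& s : entry.second)
                if (s.mode == EXCLUSIVE) exclusive = true;
            if (exclusive && entry.second.size() > 1)
                conflicts.push_back(entry.first);
        }
        return conflicts;
    }

Mapping() would then partition tuples by LockID, so that each checker instance sees every holder of the locks assigned to it.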
D3S Parallel Predicate Checker
• [Figure: lock clients expose their states individually; exposed tuples such as (C1, L1, E), (C2, L3, S), (C5, L1, S) are partitioned by key (LockID) and sent to checkers, which reconstruct snapshots SN1, SN2, … and evaluate the predicate per partition]
Summary of the Checking Language
• Predicate: any property calculated from a finite number of consecutive state snapshots
• Highlights
  • Sequential programs (with a mapping function for partitioning)
  • Reuse of application types in the script and C++ code
  • Binary instrumentation
• Support for reducing the overhead (in the paper)
  • Incremental checking
  • Sampling in time or across snapshots
Snapshots
• Use Lamport clocks (see the sketch below)
  • Instrument the network library
  • 1000 logical clock ticks per second
• Problem: how does the checker know whether it has received all necessary states for a snapshot?
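For reference, the timestamping rule an instrumented network library would apply is the standard Lamport clock: increment the counter on each local event or send, and take the maximum with the sender's clock on each receive. A minimal sketch, with illustrative names (D3S's actual instrumentation is not shown in the slides):

    // Minimal Lamport clock, as an instrumented network library might keep it.
    #include <algorithm>
    #include <cstdint>

    struct LamportClock {
        uint64_t time = 0;

        uint64_t OnLocalEvent() { return ++time; }   // e.g., exposing state
        uint64_t OnSend()       { return ++time; }   // stamp an outgoing message
        uint64_t OnReceive(uint64_t msg_time) {      // merge with the sender's clock
            time = std::max(time, msg_time) + 1;
            return time;
        }
    };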
Consistent Snapshots
• [Figure: process A exposes {(A, L0, S)} at ts=2, an empty state {} at ts=10, and {(A, L1, E)} at ts=16; process B exposes {(B, L1, E)} at ts=6 and then fails around ts=12. The checker tracks membership per timestamp — M(2)={A,B}, M(6)={A,B}, M(10)={A,B}, then M(16)={A} after detecting B's failure — reuses the last exposed state when nothing new is reported (SA(6)=SA(2), SB(10)=SB(6)), and runs check(6), check(10), check(16) once all members' states are known]
• Membership
  • What if a process has no state to expose for a long time?
  • What if a checker fails?
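Concretely, the checker can declare the snapshot at timestamp ts ready once every process in the membership at ts has reported some state (possibly empty or unchanged), and a process declared failed is simply dropped from the membership so checking can continue. A rough sketch of that bookkeeping; the data structures and names are assumptions, not D3S's actual code:

    // Sketch: decide when the checker has every state needed for timestamp ts.
    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    struct SnapshotCollector {
        // Processes believed alive at each timestamp (from the failure detector).
        std::map<uint64_t, std::set<std::string>> membership;
        // Processes whose state (possibly empty/unchanged) has arrived for each ts.
        std::map<uint64_t, std::set<std::string>> reported;

        void OnStateReport(uint64_t ts, const std::string& process) {
            reported[ts].insert(process);
        }

        void OnFailureDetected(uint64_t ts, const std::string& process) {
            membership[ts].erase(process);   // stop waiting for a dead process
        }

        // The snapshot at ts can be checked once every live member has reported.
        bool Complete(uint64_t ts) const {
            auto m = membership.find(ts);
            if (m == membership.end()) return false;
            auto r = reported.find(ts);
            for (const std::string& p : m->second)
                if (r == reported.end() || !r->second.count(p)) return false;
            return true;
        }
    };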
Experimental Method
• Debugging five real systems
  • Can D3S help developers find bugs?
  • Are predicates simple to write?
  • Is the checking overhead acceptable?
• Case: a Chord implementation (i3)
  • Uses predecessor and successor lists to stabilize
  • "Holes" and overlap
Chord Overlay
• Consistency vs. availability: cannot get both
• Global measure of both factors
  • See the tradeoff quantitatively for performance tuning
  • Capable of checking detailed key coverage
• Perfect ring (a sketch of this check follows):
  • No overlap, no holes
  • Aggregated key coverage is 100%
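One way to phrase the perfect-ring predicate over a snapshot: every node's successor must be the next live node clockwise, and the arcs from each node to its successor must add up to exactly the identifier space. A sketch under those assumptions (the ring size and data layout are illustrative, not the paper's actual predicate):

    // Sketch: check a snapshot of a Chord ring for holes and overlap.
    // Each node exposes (id, successor) on a ring of size RING.
    #include <cstdint>
    #include <iterator>
    #include <map>

    const uint64_t RING = 1ull << 32;   // illustrative identifier-space size

    struct RingCheck {
        bool   perfect;    // every successor pointer is the next live node
        double coverage;   // aggregated key coverage (1.0 for a perfect ring)
    };

    RingCheck CheckRing(const std::map<uint64_t, uint64_t>& succ) {  // id -> successor
        RingCheck result{true, 0.0};
        if (succ.empty()) return result;
        uint64_t covered = 0;
        for (auto it = succ.begin(); it != succ.end(); ++it) {
            // In a perfect ring the successor is the next id in sorted order,
            // wrapping around from the largest id back to the smallest.
            auto next = std::next(it);
            uint64_t expected = (next == succ.end()) ? succ.begin()->first
                                                     : next->first;
            if (it->second != expected) result.perfect = false;
            uint64_t arc = (it->second + RING - it->first) % RING;
            covered += (arc == 0) ? RING : arc;   // keys in (node, successor]
        }
        // Overlaps push the sum above 100%, holes pull it below; the two can
        // partially mask each other in this simple aggregate.
        result.coverage = static_cast<double>(covered) / RING;
        return result;
    }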
Summary of Results
• [Table: applications checked and results, grouped into data center apps and wide-area apps]
Overhead (PacificA)
• Less than 8%, in most cases less than 4%
• I/O overhead < 0.5%
• Overhead is negligible in the other checked systems
Related Work
• Log analysis: Magpie [OSDI'04], Pip [NSDI'06], X-Trace [NSDI'07]
• Predicate checking at replay time: WiDS Checker [NSDI'07], Friday [NSDI'07]
• P2-based online monitoring: P2-monitor [EuroSys'06]
• Model checking: MaceMC [NSDI'07], CMC [OSDI'04]
Conclusion
• Predicate checking is effective for debugging deployed, large-scale distributed systems
• D3S enables:
  • Changing what is monitored on-the-fly
  • Checking with multiple checkers
  • Specifying predicates in a sequential, centralized manner
Thank You
• Thanks to the authors for providing some of the slides
PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper et al. @ Yahoo! Research
Presented by Ying-Yi Liang
* Some slides come from the authors' version
What Is the Problem?
• The web era: web applications
  • Users are picky – low latency, high availability
  • Enterprises are greedy – high scalability
  • Things move fast – new ideas expire very soon
• Two ways of developing a cool web application
  • Making your own fire: quick, cool, but tiring and error prone
  • Using huge "powerful" building blocks: wonderful, but the market may have shifted away by the time you are done
• Neither way scales very well
• Something is missing – an infrastructure specially tailored for web applications!
Web Application Model
• Object sharing: blogs, Flickr, Web Picasa, YouTube, …
• Social: Facebook, Twitter, …
• Listing: Yahoo! Shopping, del.icio.us, news
• They require:
  • High scalability, availability, and fault tolerance
  • Acceptable latency for geographically distributed requests
  • A simplified query API
  • Some consistency (weaker than sequential consistency)
PNUTS – DB in the Cloud
• [Figure: example records from the Parts table (ID, StockNumber, Status) replicated across regions]
    CREATE TABLE Parts (
      ID VARCHAR,
      StockNumber INT,
      Status VARCHAR,
      …
    )
• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure
Basic Concepts
• [Figure: anatomy of a table — primary key, record, field, and tablet]
PNUTS Storage Architecture
• [Figure: components — clients, REST API, routers, tablet controller, message broker, storage units]
Geographic Replication
• [Figure: the per-region stack (REST API, routers, tablet controller, message broker, storage units) replicated across Regions 1, 2, and 3]
In-Region Load Balance
• [Figure: tablets can be moved between storage units within a region to balance load]
Data and Query Models
• Simplified relational data model: tables of records
  • Typed columns; typical data types plus the blob type
  • Does not enforce inter-table relationships
• Operations: selection, projection (no join, aggregation, …)
• Options: point access, range query, multiget (illustrated in the sketch below)
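The paper presents these operations abstractly; the hypothetical client interface below only illustrates the shape of the three access options — none of the names or signatures are PNUTS's actual API:

    // Hypothetical interface illustrating PNUTS-style access options;
    // none of these names or signatures come from the actual system.
    #include <map>
    #include <string>
    #include <vector>

    using Record = std::map<std::string, std::string>;   // field name -> value

    class TableClient {
    public:
        virtual ~TableClient() = default;

        // Point access: fetch one record by primary key.
        virtual Record Get(const std::string& table,
                           const std::string& key) = 0;

        // Multiget: fetch several records in one request.
        virtual std::vector<Record> MultiGet(const std::string& table,
                                             const std::vector<std::string>& keys) = 0;

        // Range query over an ordered table: records with primary keys in
        // [begin, end), optionally projected onto a subset of fields.
        virtual std::vector<Record> Scan(const std::string& table,
                                         const std::string& begin,
                                         const std::string& end,
                                         const std::vector<std::string>& fields) = 0;
    };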
Record Assignment
• [Figure: the router maps tablet intervals to storage units — MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1 — and keys such as Apple, Grape, Lime, Mango, Strawberry, and Tomato are placed on the corresponding units]
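The lookup behind that figure is a plain interval search: keep the tablet boundaries sorted and map a key to the storage unit owning the enclosing interval. A minimal sketch using the boundaries from the slide (the Router class itself is an assumption):

    // Sketch of the router's interval mapping: tablet lower bound -> storage unit.
    #include <map>
    #include <string>

    class Router {
    public:
        Router() {
            // Boundaries from the slide; "" stands in for MIN.
            tablets_[""]           = "SU1";   // [MIN, Canteloupe)
            tablets_["Canteloupe"] = "SU3";   // [Canteloupe, Lime)
            tablets_["Lime"]       = "SU2";   // [Lime, Strawberry)
            tablets_["Strawberry"] = "SU1";   // [Strawberry, MAX)
        }

        // Find the storage unit whose tablet interval contains `key`.
        std::string Lookup(const std::string& key) const {
            auto it = tablets_.upper_bound(key);  // first boundary strictly greater
            --it;                                 // step back to the enclosing interval
            return it->second;
        }

    private:
        std::map<std::string, std::string> tablets_;
    };

A range query uses the same table, walking forward from the tablet containing the start key across consecutive tablets, as in the Range Query slide further down.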
Single Point Update
• [Figure: a write for key k goes through a router to the record's master storage unit, which assigns it a sequence number; the update is published to the message brokers, propagated to the other replicas' storage units, and SUCCESS is returned to the client (steps 1–8 in the original diagram)]
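The essential mechanism is per-record ordering: the master storage unit stamps each write on key k with the next sequence number, and every replica applies updates for k strictly in that order. A rough sketch with assumed data structures; the message broker provides the delivery guarantees the system relies on, so the buffering below is only there to make the ordering invariant explicit:

    // Sketch of per-record write ordering (names are illustrative).
    #include <cstdint>
    #include <map>
    #include <string>

    struct Update {
        uint64_t    seq;
        std::string value;
    };

    class MasterStorageUnit {
    public:
        // At the record's master: stamp the write with the next sequence number.
        Update Accept(const std::string& key, const std::string& value) {
            return Update{ ++next_seq_[key], value };
        }
    private:
        std::map<std::string, uint64_t> next_seq_;
    };

    class ReplicaStorageUnit {
    public:
        // Called when the message broker delivers an update for `key`.
        void Deliver(const std::string& key, const Update& u) {
            pending_[key][u.seq] = u.value;
            uint64_t& applied = applied_seq_[key];
            auto&     queue   = pending_[key];
            // Apply everything now contiguous with what has already been applied.
            while (!queue.empty() && queue.begin()->first == applied + 1) {
                store_[key] = queue.begin()->second;
                applied     = queue.begin()->first;
                queue.erase(queue.begin());
            }
        }
    private:
        std::map<std::string, std::string> store_;        // current value per key
        std::map<std::string, uint64_t> applied_seq_;     // last applied sequence #
        std::map<std::string, std::map<uint64_t, std::string>> pending_;
    };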
Range Query
• [Figure: a scan over Grapefruit…Pear is split by the router, using the same interval mapping (MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1), into per-tablet sub-ranges: Grapefruit…Lime served by storage unit 3 and Lime…Pear served by storage unit 2]
Relaxed Consistency
• ACID transactions / sequential consistency: too strong
  • Non-trivial overhead in asynchronous settings
  • Users can tolerate stale data in many cases
• Go hybrid: eventual consistency + a mechanism for sequential consistency when needed
• Use versioning to cope with asynchrony
• [Figure: a record's version timeline — insertion starts Generation 1, successive updates produce versions v.1 through v.8, and a delete ends the generation]
Relaxed Consistency: per-record API calls on the version timeline (sketched below)
• read_any(): may return any version, stale or current
• read_latest(): returns the current (most recent) version
• read_critical("v.6"): returns a version at least as new as v.6
• write(): performs an update, producing a new current version
• test_and_set_write(v.7): performs the write only if the current version is still v.7; otherwise it returns ERROR
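The calls above differ only in which version they may return or overwrite. The sketch below models a single record's version history in memory to make those semantics concrete; it illustrates the guarantees, not PNUTS's implementation, and all names and types are assumptions:

    // Illustrative model of one record's version timeline; not PNUTS code.
    // Assumes Write() has been called at least once before any read.
    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>

    class RecordTimeline {
    public:
        // read_any: any (possibly stale) version is acceptable; here the oldest
        // retained version stands in for whatever a replica happens to hold.
        std::string ReadAny() const { return versions_.begin()->second; }

        // read_latest: always the most recent version.
        std::string ReadLatest() const { return versions_.rbegin()->second; }

        // read_critical(v): any version at least as new as v.
        std::optional<std::string> ReadCritical(uint64_t v) const {
            auto it = versions_.lower_bound(v);
            if (it == versions_.end()) return std::nullopt;  // nothing that new yet
            return it->second;
        }

        // write: unconditionally install a new current version.
        uint64_t Write(const std::string& value) {
            uint64_t v = versions_.empty() ? 1 : versions_.rbegin()->first + 1;
            versions_[v] = value;
            return v;
        }

        // test_and_set_write(v): succeed only if the current version is still v
        // (the ERROR case on the slide corresponds to the std::nullopt return).
        std::optional<uint64_t> TestAndSetWrite(uint64_t expected,
                                                const std::string& value) {
            if (versions_.empty() || versions_.rbegin()->first != expected)
                return std::nullopt;
            return Write(value);
        }

    private:
        std::map<uint64_t, std::string> versions_;  // version number -> value
    };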
Membership Management
• Record timelines should be coherent at each replica
  • Updates must be applied to the latest version
• Use mastership
  • Per-record basis; only one replica holds the mastership at any time
  • All update requests are sent to the master to get ordered
  • Routers & YMB maintain mastership information
  • The replica receiving frequent write requests acquires the mastership
• Leader election service provided by ZooKeeper
ZooKeeper
• A distributed system is like a zoo: someone needs to be in charge of it
• ZooKeeper is a highly available, scalable coordination service
• ZooKeeper plays two roles in PNUTS
  • Coordination service
  • Publish/subscribe service
• Guarantees: sequential consistency; single system image; atomicity (as in ACID); durability; timeliness
• A tiny kernel for upper-level building blocks
ZooKeeper: High Availability • High availability via replication • A fault-tolerant persistent store • Providing sequential consistency
ZooKeeper: Services
• Publish/subscribe service
  • Contents stored in ZooKeeper are organized as directory trees
  • Publish: write to a specific znode
  • Subscribe: read a specific znode
• Coordination via automatic name resolution
  • By appending a sequence number to names: CREATE("/…/x-", host, EPHEMERAL | SEQUENCE) yields "/…/x-1", "/…/x-2", …
• Ephemeral znodes live only as long as the creating session
ZooKeeper Example: Lock
1) id = create("…/locks/x-", SEQUENCE | EPHEMERAL);
2) children = getChildren("…/locks", false);
3) if (children.head == id) exit();  // lowest sequence number: we hold the lock
4) test = exists(name of last child before id, true);
5) if (test == false) goto 2);
6) wait for modification to "…/locks";
7) goto 2);
ZooKeeper Is Powerful
• Many core services in distributed systems are built on ZooKeeper
  • Consensus
  • Distributed locks (exclusive, shared)
  • Membership
  • Leader election
  • Job tracker binding
  • …
• More information at http://hadoop.apache.org/zookeeper/
Experimental Setup
• Production PNUTS code, enhanced with the ordered table type
• Three PNUTS regions: 2 west coast, 1 east coast
  • 5 storage units, 2 message brokers, 1 router
  • West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
  • East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
• Workload: 1200–3600 requests/second, 0–50% writes, 80% locality
Related Work
• Google BigTable/GFS
  • Fault tolerance and consistency via Chubby
  • Strong consistency – Chubby is not scalable
  • Lacks geographic replication support
  • Targets analytical workloads
• Amazon Dynamo
  • Unstructured data; peer-to-peer style solution; eventual consistency
• Facebook Cassandra (still kind of a secret)
  • Structured storage over a peer-to-peer network; eventual consistency
  • Always-writable property – writes succeed even in the face of failures
Discussion
• Can all web applications tolerate stale data?
• Is doing replication entirely across the WAN a good idea?
• Single-level router vs. a B+-tree-style router hierarchy
• Tiny service kernel vs. stand-alone services
• Is relaxed consistency just right, or too weak?
• Is exposing record versions to applications a good idea?
• Should security be integrated into PNUTS?
• Using the pub/sub service as undo logs