Programming Model and Protocols for Reconfigurable Distributed Systems. COSMIN IONEL ARAD. https://www.kth.se/profile/icarad/page/doctoral-thesis/ Doctoral Thesis Defense, 5th June 2013, KTH Royal Institute of Technology
Presentation Overview • Context, Motivation, and Thesis Goals • Introduction & design philosophy • Distributed abstractions & P2P framework • Component execution & scheduling • Distributed systems experimentation • Development cycle: build, test, debug, deploy • Scalable & consistent key-value store • System architecture and testing using Kompics • Scalability, Elasticity, and Performance Evaluation • Conclusions
Trend 1: Computer systems are increasingly distributed • For fault-tolerance • E.g.: replicated state machines • For scalability • E.g.: distributed databases • Due to inherent geographic distribution • E.g.: content distribution networks
Trend 2: Distributed systems are increasingly complex: connection management, location and routing, failure detection, recovery, data persistence, load balancing, scheduling, self-optimization, access control, monitoring, garbage collection, encryption, compression, concurrency control, topology maintenance, bootstrapping, ...
Trend 3: Modern hardware is increasingly parallel • Multi-core and many-core processors • Concurrent/parallel software is needed to leverage hardware parallelism • Major software concurrency models • Message-passing concurrency • Data-flow concurrency viewed as a special case • Shared-state concurrency
Distributed Systems are still Hard… • … to implement, test, and debug • Sequential sorting is easy • Even for a first-year computer science student • Distributed consensus is hard • Even for an experienced practitioner having all the necessary expertise
Experience from building Chubby, Google's lock service, using Paxos: “The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms. The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems.” [Paxos Made Live] Tushar Deepak Chandra, Edsger W. Dijkstra Prize in Distributed Computing 2010
A call to action: “It appears that the fault-tolerant distributed computing community has not developed the tools and know-how to close the gaps between theory and practice with the same vigor as for instance the compiler community. Our experience suggests that these gaps are non-trivial and that they merit attention by the research community.” [Paxos Made Live] Tushar Deepak Chandra, Edsger W. Dijkstra Prize in Distributed Computing 2010
Thesis Goals • Raise the level of abstraction in programming distributed systems • Make it easy to implement, test, debug, and evaluate distributed systems • Attempt to bridge the gap between the theory and the practice of fault-tolerant distributed computing
A hierarchical component model with asynchronous communication and message-passing concurrency [figure: a component hierarchy composing Application, Consensus, Broadcast, Failure detector, Network, and Timer]
Design principles • Tackle increasing system complexity through abstraction and hierarchical composition • Decouple components from each other • publish-subscribe component interaction • dynamic reconfiguration for always-on systems • Decouple component code from its executor • same code executed in different modes: production deployment, interactive stress testing, deterministic simulation for replay debugging
Nested hierarchical composition • Model entire sub-systems as first-class composite components • Richer architectural patterns • Tackle system complexity • Hiding implementation details • Isolation • Natural fit for developing distributed systems • Virtual nodes • Model entire system: each node as a component
Message-passing concurrency • Compositional concurrency • Free from the idiosyncrasies of locks and threads • Easy to reason about • Many concurrency formalisms: the Actor model (1973), CSP (1978), CCS (1980), π-calculus (1992) • Easy to program • See the success of Erlang, Go, Rust, Akka, ... • Scales well on multi-core hardware • Almost all modern hardware
Loose coupling • “Where ignorance is bliss, 'tis folly to be wise.” • Thomas Gray, Ode on a Distant Prospect of Eton College (1742) • Communication integrity • Law of Demeter • Publish-subscribe communication • Dynamic reconfiguration
Design Philosophy • Nested hierarchical composition • Message-passing concurrency • Loose coupling • Multiple execution modes
Component Model • Event • Port • Component • Channel • Handler • Subscription • Publication / Event trigger
A simple distributed system [figure: two processes, Process1 and Process2, each composed of an Application component (handler1 <Ping>, handler2 <Pong>) and a Network component (handler <Message>); the applications exchange Ping and Pong messages over the network]
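To ground these concepts, here is a minimal sketch of a component that uses a Ping/Pong service, written in the style of the Kompics Java API described in the thesis (ComponentDefinition, port types, Handler, subscribe, trigger). The Ping and Pong events, the PingPongPort, the se.sics.kompics package, and the exact method signatures are assumptions reconstructed from the model's concepts, not verbatim framework code.

```java
// Sketch only: the package, class names, and signatures below follow the
// Kompics-style component model conceptually and are assumptions, not
// verbatim framework code.
import se.sics.kompics.*;

// Events are plain objects carried over ports.
class Ping extends Event {}
class Pong extends Event {}

// A port type declares which event types may cross it, and in which direction.
class PingPongPort extends PortType {
    {
        negative(Ping.class);  // requests flow toward the provider
        positive(Pong.class);  // indications flow back to the requirer
    }
}

// A component that requires the PingPong service: it publishes Ping events
// on its required port and subscribes a handler for Pong indications.
class Pinger extends ComponentDefinition {
    Positive<PingPongPort> port = requires(PingPongPort.class);

    Handler<Pong> onPong = new Handler<Pong>() {
        @Override
        public void handle(Pong event) {
            trigger(new Ping(), port);   // publication / event trigger
        }
    };

    {
        subscribe(onPong, port);         // subscription (publish-subscribe wiring)
        trigger(new Ping(), port);       // start the exchange
    }
}
```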
A Failure Detector Abstraction using a Network and a Timer Abstraction [figure: a Ping Failure Detector component provides the Eventually Perfect Failure Detector port (Suspect and Restore indications; StartMonitoring and StopMonitoring requests) and requires Network and Timer ports, provided by MyNetwork and MyTimer]
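As a companion to the figure, here is a small, self-contained sketch of the increasing-timeout logic behind an eventually perfect failure detector: a node is suspected if it misses a ping round, and the suspicion is revoked (and the timeout increased) when a reply later arrives. The class, its fields, and the driver hooks are illustrative assumptions, not the thesis implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the eventually-perfect failure detector logic
// (increasing-timeout variant); not the thesis implementation.
class EventuallyPerfectFailureDetector {
    private final Set<String> alive = new HashSet<>();     // replies seen this period
    private final Set<String> suspected = new HashSet<>();
    private final Set<String> monitored;
    private long timeoutMillis;

    EventuallyPerfectFailureDetector(Set<String> monitored, long initialTimeoutMillis) {
        this.monitored = monitored;
        this.timeoutMillis = initialTimeoutMillis;
    }

    // Called when a Pong arrives from a node.
    void onPong(String node) {
        alive.add(node);
    }

    // Called when the period timer fires.
    void onTimeout() {
        // A node we suspected answered after all: we were too aggressive,
        // so back off by increasing the timeout.
        for (String node : alive) {
            if (suspected.remove(node)) {
                timeoutMillis += 500;
                System.out.println("restore " + node + ", timeout now " + timeoutMillis);
            }
        }
        for (String node : monitored) {
            if (!alive.contains(node) && !suspected.contains(node)) {
                suspected.add(node);
                System.out.println("suspect " + node);
            }
        }
        alive.clear();
        // here we would send a Ping to every monitored node and re-arm the timer
    }

    long currentTimeout() { return timeoutMillis; }
}
```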
A Leader Election Abstraction using a Failure Detector Abstraction [figure: an Ω Leader Elector component provides the Leader Election port (Leader indication) and requires the Eventually Perfect Failure Detector port, provided by the Ping Failure Detector]
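The Ω leader elector on top of the eventually perfect failure detector can be summarized by a very small rule: trust the lowest-ranked process that is not currently suspected. A self-contained sketch of that rule follows; the integer ranks and method names are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of a monarchical Omega leader elector built on an
// eventually perfect failure detector: trust the lowest-ranked process
// that is not currently suspected. Names are assumptions, not thesis code.
class OmegaLeaderElector {
    private final TreeSet<Integer> candidates;          // all process ranks, ordered
    private final Set<Integer> suspected = new HashSet<>();
    private Integer leader = null;

    OmegaLeaderElector(Set<Integer> processes) {
        this.candidates = new TreeSet<>(processes);
        recomputeLeader();
    }

    void onSuspect(int process) { suspected.add(process); recomputeLeader(); }
    void onRestore(int process) { suspected.remove(process); recomputeLeader(); }

    private void recomputeLeader() {
        for (int p : candidates) {
            if (!suspected.contains(p)) {
                if (leader == null || leader != p) {
                    leader = p;
                    System.out.println("new leader: " + p); // would trigger a Leader event
                }
                return;
            }
        }
        leader = null; // everyone suspected (transient)
    }

    Integer currentLeader() { return leader; }
}
```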
A Reliable Broadcast Abstraction using a Best-Effort Broadcast Abstraction [figure: a Reliable Broadcast component provides a Broadcast port (RbBroadcast request, RbDeliver indication) and requires the Broadcast port (BebBroadcast, BebDeliver) provided by a Best-Effort Broadcast component, which in turn requires Network]
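A compact sketch of the eager reliable-broadcast rule on top of best-effort broadcast: the first time a message is beb-delivered, deliver it to the application and re-broadcast it so that a crashed sender cannot leave some correct processes without it. The BestEffortBroadcast interface and message identifiers are assumptions for the example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative sketch of (eager) reliable broadcast on top of best-effort
// broadcast: on first delivery of a message, deliver it locally and
// re-broadcast it so that correct processes agree on the delivered set.
// The BestEffortBroadcast interface is an assumption for the example.
interface BestEffortBroadcast {
    void bebBroadcast(String messageId, String payload);
}

class EagerReliableBroadcast {
    private final Set<String> delivered = new HashSet<>();
    private final BestEffortBroadcast beb;
    private final Consumer<String> rbDeliver; // application callback

    EagerReliableBroadcast(BestEffortBroadcast beb, Consumer<String> rbDeliver) {
        this.beb = beb;
        this.rbDeliver = rbDeliver;
    }

    void rbBroadcast(String messageId, String payload) {
        beb.bebBroadcast(messageId, payload);
    }

    // Called whenever the underlying best-effort broadcast delivers a message.
    void onBebDeliver(String messageId, String payload) {
        if (delivered.add(messageId)) {
            rbDeliver.accept(payload);              // deliver to the application
            beb.bebBroadcast(messageId, payload);   // relay to mask sender crashes
        }
    }
}
```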
A Consensus Abstraction using a Broadcast, a Network, and a Leader Election Abstraction [figure: a Paxos Consensus component provides the Consensus port and requires Broadcast, Network, and Leader Election ports, provided by Best-Effort Broadcast, MyNetwork, and the Ω Leader Elector]
A Shared Memory Abstraction [figure: an ABD component provides the Atomic Register port (ReadRequest/ReadResponse and WriteRequest/WriteResponse) and requires Broadcast and Network ports, provided by Best-Effort Broadcast and MyNetwork]
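The heart of the ABD atomic register is a quorum rule over timestamped values: replicas keep the freshest (timestamp, value) pair they have seen, and a reader adopts the highest-timestamped pair found in a majority (writing it back before returning). The sketch below shows only that rule; class names and the single-writer simplification are assumptions, not the thesis code.

```java
import java.util.List;

// Illustrative sketch of the core ABD quorum rule for an atomic register:
// every replica stores a (timestamp, value) pair, writes carry a higher
// timestamp, and a read returns the value with the highest timestamp seen
// in a majority (then writes it back so later reads cannot observe an
// older value). Names are assumptions, not the thesis implementation.
class TimestampedValue {
    final long timestamp;
    final String value;
    TimestampedValue(long timestamp, String value) {
        this.timestamp = timestamp;
        this.value = value;
    }
}

class AbdReplica {
    private TimestampedValue stored = new TimestampedValue(0, null);

    // Replica-side write: adopt the incoming pair only if it is newer.
    synchronized void onWrite(TimestampedValue incoming) {
        if (incoming.timestamp > stored.timestamp) {
            stored = incoming;
        }
    }

    // Replica-side read: return the current pair.
    synchronized TimestampedValue onRead() {
        return stored;
    }
}

class AbdClient {
    // Given replies from a majority of replicas, pick the freshest pair.
    static TimestampedValue highest(List<TimestampedValue> majorityReplies) {
        TimestampedValue best = majorityReplies.get(0);
        for (TimestampedValue reply : majorityReplies) {
            if (reply.timestamp > best.timestamp) {
                best = reply;
            }
        }
        return best;
        // a full read would now write `best` back to a majority before returning
    }
}
```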
A Replicated State Machine using a Total-Order Broadcast Abstraction [figure: a State Machine Replication component provides the Replicated State Machine port (Execute request, Output indication) and requires the Total-Order Broadcast port (TobBroadcast request, TobDeliver indication), provided by a Uniform Total-Order Broadcast component that requires a Consensus port (Propose, Decide) provided by Paxos Consensus]
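State machine replication over total-order broadcast reduces to a single invariant: every replica applies the same deterministic commands in the same delivery order. A minimal sketch, with an assumed TotalOrderBroadcast interface, is shown below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch of state machine replication over total-order
// broadcast: commands are to-broadcast, and every replica applies them to a
// deterministic state machine in the single delivery order, so all replicas
// stay in sync. The interfaces are assumptions for the example.
interface TotalOrderBroadcast {
    void tobBroadcast(String command);
}

class ReplicatedStateMachine {
    private final TotalOrderBroadcast tob;
    private final List<String> appliedLog = new ArrayList<>(); // deterministic state
    private final Consumer<String> output;                     // reply to the client

    ReplicatedStateMachine(TotalOrderBroadcast tob, Consumer<String> output) {
        this.tob = tob;
        this.output = output;
    }

    // Client-facing Execute request: just broadcast the command.
    void execute(String command) {
        tob.tobBroadcast(command);
    }

    // TobDeliver indication: apply commands in the agreed total order.
    void onTobDeliver(String command) {
        appliedLog.add(command);            // "apply" to the state machine
        output.accept("applied #" + appliedLog.size() + ": " + command);
    }
}
```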
Probabilistic Broadcast and Topology Maintenance Abstractions using a Peer Sampling Abstraction [figure: an Epidemic Dissemination component provides Probabilistic Broadcast and a T-Man component provides Topology; both require the Peer Sampling port provided by a Cyclon Random Overlay component built over Network and Timer]
A Structured Overlay Network implements a Distributed Hash Table [figure: the Distributed Hash Table port is provided by a structured overlay composed of Chord Periodic Stabilization (Consistent Hashing Ring Topology) and a One-Hop Router (Overlay Router), which require Failure Detector, Peer Sampling, Network, and Timer ports provided by Ping Failure Detector, Cyclon Random Overlay, and the network and timer components]
A Video on Demand Service using a Content Distribution Network and a Gradient Topology Overlay [figure: a Video On-Demand component requires Content Distribution Network and Gradient Topology ports; the Content Distribution Network is provided by a BitTorrent client that requires a Tracker port (Centralized Tracker, or Distributed Tracker with Peer Exchange), and the Gradient Topology by a Gradient Overlay; underneath sit Peer Sampling, Distributed Hash Table, Network, and Timer]
Generic Bootstrap and Monitoring Services provided by the Kompics Peer-to-Peer Protocol Framework [figure: three process mains, PeerMain, BootstrapServerMain, and MonitorServerMain, each composing a Peer, BootstrapServer, or MonitorServer component together with MyWebServer, MyNetwork, and MyTimer, wired through Web, Network, and Timer ports]
Whole-System Repeatable Simulation [figure: the whole system is executed by a deterministic simulation scheduler, driven by an experiment scenario and a network model]
Experiment scenario DSL • Define parameterized scenario events • Node failures, joins, system requests, operations • Define “stochastic processes” • Finite sequence of scenario events • Specify distribution of event inter-arrival times • Specify type and number of events in sequence • Specify distribution of each event parameter value • Scenario: composition of “stochastic processes” • Sequential and parallel composition (sketched below)
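To make these bullet points concrete, here is a small, self-contained Java sketch of the ideas behind such a scenario DSL: a "stochastic process" that emits a finite sequence of events with exponentially distributed inter-arrival times and randomized parameters, composed sequentially with another process. All class and method names are hypothetical illustrations, not the actual Kompics simulation DSL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

// Illustrative, self-contained sketch of the ideas behind an experiment
// scenario DSL. All names here are hypothetical; this is not the Kompics
// simulation DSL itself.
class ScenarioEvent {
    final double time;      // absolute virtual time (ms)
    final String operation; // e.g. "join", "fail", "put"
    ScenarioEvent(double time, String operation) {
        this.time = time;
        this.operation = operation;
    }
}

class StochasticProcess {
    final int count;                  // number of events in the sequence
    final double meanInterArrivalMs;  // exponential inter-arrival times
    final Supplier<String> operation; // randomized event parameters
    StochasticProcess(int count, double meanInterArrivalMs, Supplier<String> operation) {
        this.count = count;
        this.meanInterArrivalMs = meanInterArrivalMs;
        this.operation = operation;
    }

    List<ScenarioEvent> generate(double startTime, Random rng) {
        List<ScenarioEvent> events = new ArrayList<>();
        double t = startTime;
        for (int i = 0; i < count; i++) {
            t += -meanInterArrivalMs * Math.log(1 - rng.nextDouble()); // exponential sample
            events.add(new ScenarioEvent(t, operation.get()));
        }
        return events;
    }
}

class Scenario {
    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed => repeatable scenario
        Random keys = new Random(7);
        StochasticProcess churn = new StochasticProcess(5, 1000, () -> "join node");
        StochasticProcess load = new StochasticProcess(10, 100,
                () -> "put key " + keys.nextInt(100));

        // Sequential composition: load starts only after churn has finished.
        List<ScenarioEvent> events = new ArrayList<>(churn.generate(0, rng));
        double churnEnd = events.get(events.size() - 1).time;
        events.addAll(load.generate(churnEnd, rng));
        // (parallel composition would generate both from time 0 and merge)

        events.forEach(e -> System.out.printf("%8.1f ms  %s%n", e.time, e.operation));
    }
}
```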
Local Interactive Stress Testing [figure: the whole system is executed by a work-stealing multi-core scheduler, driven by an experiment scenario and a network model]
Execution Profiles • Distributed Production Deployment • One distributed system node per OS process • Multi-core component scheduler (work stealing) • Local / Distributed Stress Testing • Entire distributed system in one OS process • Interactive stress testing, multi-core scheduler • Local Repeatable Whole-System Simulation • Deterministic simulation component scheduler • Correctness testing, stepped / replay debugging
Incremental Development & Testing • Define emulated network topologies • processes and their addresses: <id, IP, port> • properties of links between processes • latency (ms) • loss rate (%) • Define small-scale execution scenarios • the sequence of service requests initiated by each process in the distributed system • Experiment with various topologies / scenarios • Launch all processes locally on one machine
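A hedged sketch of what an emulated topology description could look like, pinning down the items above (process addresses and per-link latency and loss rate); the classes and the builder-style usage are hypothetical and only serve to make the bullets concrete.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of an emulated topology description for local testing:
// processes identified by <id, IP, port> and per-link latency / loss rate.
// The classes and usage are hypothetical, not the thesis tooling.
class NodeSpec {
    final int id; final String ip; final int port;
    NodeSpec(int id, String ip, int port) { this.id = id; this.ip = ip; this.port = port; }
}

class LinkSpec {
    final int from; final int to; final int latencyMs; final double lossRate;
    LinkSpec(int from, int to, int latencyMs, double lossRate) {
        this.from = from; this.to = to; this.latencyMs = latencyMs; this.lossRate = lossRate;
    }
}

class EmulatedTopology {
    final List<NodeSpec> nodes = new ArrayList<>();
    final List<LinkSpec> links = new ArrayList<>();

    EmulatedTopology node(int id, String ip, int port) {
        nodes.add(new NodeSpec(id, ip, port)); return this;
    }
    EmulatedTopology link(int from, int to, int latencyMs, double lossRate) {
        links.add(new LinkSpec(from, to, latencyMs, lossRate)); return this;
    }

    public static void main(String[] args) {
        // Three local processes on one machine, with varying link quality.
        EmulatedTopology topology = new EmulatedTopology()
                .node(1, "127.0.0.1", 22031)
                .node(2, "127.0.0.1", 22032)
                .node(3, "127.0.0.1", 22033)
                .link(1, 2, 5, 0.00)    // 5 ms latency, no loss
                .link(2, 3, 50, 0.01)   // 50 ms latency, 1% loss
                .link(1, 3, 120, 0.05); // 120 ms latency, 5% loss
        System.out.println(topology.nodes.size() + " nodes, " + topology.links.size() + " links");
    }
}
```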
[screenshot: the script of service requests executed by each process is shown in one pane; after the Application completes the script, further commands can be entered interactively in another pane]
Programming in the Large • Events and ports are interfaces • service abstractions • packaged together as libraries • Components are implementations • provide or require interfaces • dependencies on provided / required interfaces • expressed as library dependencies [Apache Maven] • multiple implementations for an interface • separate libraries • deploy-time composition
Case Study: A Scalable, Self-Managing Key-Value Store with Atomic Consistency and Partition Tolerance
Key-Value Store? [figure: a client issues Put(”www.sics.se”, ”193.10.64.51”) and receives OK; a subsequent Get(”www.sics.se”) returns ”193.10.64.51”] • Store.Put(key, value) → OK [write] • Store.Get(key) → value [read]
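The client-visible contract is just put and get. A minimal sketch of that interface with a trivial single-node, in-memory implementation is shown below; the real CATS store partitions and replicates data across many nodes, so this only fixes the operations' signatures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the put/get service interface of a key-value store,
// with a trivial single-node in-memory implementation for illustration.
// The real CATS store partitions and replicates data across nodes; this
// sketch only pins down the client-visible operations.
interface KeyValueStore {
    void put(String key, String value); // write: returns when acknowledged
    String get(String key);             // read: returns the stored value or null
}

class InMemoryStore implements KeyValueStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    @Override
    public void put(String key, String value) { data.put(key, value); }

    @Override
    public String get(String key) { return data.get(key); }

    public static void main(String[] args) {
        KeyValueStore store = new InMemoryStore();
        store.put("www.sics.se", "193.10.64.51");
        System.out.println(store.get("www.sics.se")); // prints 193.10.64.51
    }
}
```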
Consistent Hashing • Incremental scalability • Self-organization • Simplicity • Used by systems such as Dynamo and Project Voldemort
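A self-contained sketch of the consistent-hashing lookup rule: node identifiers and keys are hashed onto a ring, and each key belongs to the first node found clockwise from the key's position, so adding a node only reassigns the keys of one ring segment. The hash function used here is a simplification chosen for brevity.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of consistent hashing: nodes and keys are hashed onto
// a ring, and each key is owned by the first node clockwise from the key's
// position. Adding or removing a node only moves the keys of its ring
// segment (incremental scalability).
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position -> node

    private long position(String id) {
        // Simple stand-in hash for the sketch; a real system would use a
        // stronger hash for uniform placement on the ring.
        return Integer.toUnsignedLong(id.hashCode());
    }

    void addNode(String node)    { ring.put(position(node), node); }
    void removeNode(String node) { ring.remove(position(node)); }

    // The successor of the key's position on the ring is responsible for it.
    String lookup(String key) {
        SortedMap<Long, String> tail = ring.tailMap(position(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");
        System.out.println("www.sics.se -> " + ring.lookup("www.sics.se"));
        ring.addNode("node-D"); // only keys in one segment change owner
        System.out.println("www.sics.se -> " + ring.lookup("www.sics.se"));
    }
}
```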
Single client, single server [figure: the client sends Put(X, 1); the server updates X from 0 to 1 and replies Ack(X); a subsequent Get(X) returns 1]
Multiple clients, multiple servers [figure: Client 1 sends Put(X, 1) and receives Ack(X) after Server 1 updates X; Client 2 issues two Get(X) requests that reach different replicas and observes both the stale value 0 and the new value 1, motivating the need for atomic consistency]
Atomic Consistency Informally • put/get ops appear to occur instantaneously • Once a put(key, newValue) completes • new value immediately visible to all readers • each get returns the value of the last completed put • Once a get(key) returns a new value • no other get may return an older, stale value
CATS node architecture [figure: a CATS Node composite component with a CATS Web Application (Web port), Load Balancer, and Status Monitor with Aggregation on top of a Distributed Hash Table; the DHT is implemented by an Operation Coordinator over a One-Hop Router (Overlay Router), a Replication layer with Group Member and Reconfiguration Coordinator, Bulk Data Transfer, a Garbage Collector, Consistent Hashing Ring (Ring Topology), Epidemic Dissemination (Broadcast), Cyclon Random Overlay (Peer Sampling), Ping Failure Detector, Bootstrap Client, and Persistent Storage (Local Store), all wired over Network and Timer ports]
Simulation and Stress Testing [figure: CATS Simulation Main executes many CATS Node components (each exposing Web and DHT ports over Network and Timer) on the deterministic simulation scheduler, while CATS Stress Testing Main executes the same composition on the multi-core scheduler; both are driven by a CATS Experiment component, a generic orchestrator / discrete-event simulator, a network model, and an experiment scenario]