Management of Distributed Systems and Applications • Indranil Gupta • March 9, 2006 • CS 598IG, Spring 2006
PlanetLab – 645 nodes over 309 sites (Mar 06) • TeraGrid – 10's of clusters • Commodity Clusters – O(10,000 nodes)
Management of distributed applications accounts for 24%-33% of the TCO (Total Cost of Ownership) • Many tools are available for managing clusters • Beowulf clusters • NOW clusters • But these are too heavyweight! • Need lightweight tools that are reusable.
MapReduce: Simplified Data Processing on Large Clusters • Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
Motivation • Google cluster – several 10's of 1000's of nodes (used for indexing and other operations). • Parallel & distributed programming on this cluster is complex: • Process failures. • Job scheduling. • Communication. • Managing TBs of input/output data. • Application code is hard to maintain.
Goal [diagram] • The user application is written in the MapReduce coding style. • The MapReduce library hides input data partitioning, task scheduling (tasks 1…M), and node failures from the user application, running on a commodity cluster.
Programming Model • Inspired by functional programming (Lisp). • map(k1, v1) -> list(k2, v2) • User implemented. • Input: key/value pairs. • Output: intermediate key/value pairs. • reduce(k2, list(v2)) -> list(v2) • User implemented. • Input: an intermediate key and the list of all values for that key. • Output: usually a single value.
Example: Word Frequency in a Set of Docs [diagram] • map takes (key = filename, value = file data) and emits intermediate pairs such as <"the", "1">, <"stereo", "1">, … • reduce sums the counts for each word, producing results such as "the = 100,000", "stereo = 20,000". (A minimal code sketch follows below.)
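A minimal single-process sketch of the word-count example in Python (Google's real library is a C++ framework; the driver, function names, and sample documents below are illustrative only):

```python
from collections import defaultdict

def map_fn(filename, contents):
    """Emit (word, 1) for every word in the document."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """Sum all counts emitted for this word."""
    return sum(counts)

def map_reduce(documents):
    """Tiny single-process driver that mimics the MapReduce data flow."""
    intermediate = defaultdict(list)
    for filename, contents in documents.items():
        for key, value in map_fn(filename, contents):   # map phase
            intermediate[key].append(value)              # shuffle: group by key
    return {key: reduce_fn(key, values)                  # reduce phase
            for key, values in intermediate.items()}

if __name__ == "__main__":
    docs = {"a.txt": "the stereo and the radio", "b.txt": "the cat"}
    print(map_reduce(docs))
    # {'the': 3, 'stereo': 1, 'and': 1, 'radio': 1, 'cat': 1}
```

The driver runs the map, group-by-key, and reduce steps sequentially; the real library runs them in parallel across thousands of machines.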
Impact on Google • 18 months, 900 applications. • Many domains: • Machine learning. • Clustering for Google News & Froogle. • Query reports (e.g., Google Zeitgeist). • Large-scale graph computations. • Property extraction from web pages. • Google's indexer rewritten. • Code is significantly smaller, and decoupled.
Execution [figure, built up over four slides] • The user program forks a master and worker processes; the input data resides on GFS (the Google File System) as M splits. • The master assigns map tasks to N (< M) map workers, exploiting locality: each worker reads its split from a local or nearby GFS replica. • Each map worker writes its output to local disk; a partitioning function divides the intermediate data into R regions, with output ordered by key (intermediate files stay on the local file system). • The master then assigns reduce tasks; reduce workers remote-read the intermediate files from the map workers' disks. • Each reduce worker writes its final result (Output 0, Output 1, …) back to GFS. (A sketch of the partitioning function follows below.)
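A sketch of the default partitioning scheme described in the paper (hash of the intermediate key modulo R); the hash choice and names below are illustrative, not Google's actual code:

```python
import zlib

def partition(key: str, R: int) -> int:
    """Assign an intermediate key to one of R reduce regions: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % R

# With R = 4, every <"the", "1"> pair emitted by any map worker lands in the
# same region, so a single reduce task sees all of the counts for "the".
print(partition("the", 4))  # deterministic value in 0..3
```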
Fault Tolerance [diagram] • The master pings workers periodically; workers that stop responding are marked failed. • Completed map tasks on a failed worker get rescheduled (their output lives on that worker's local disk and is lost). • In-progress map/reduce tasks on a failed worker get rescheduled. • If the master fails, the MapReduce operation is aborted and the user application is notified.
Backup Tasks • Near MapReduce completion, the master schedules backup (duplicate) executions of the remaining in-progress tasks, so a slow machine running task_i does not delay the whole job.
Grep Experiment • 1800 machines. • Scans 10^10 100-byte records. • Pattern matches 92,337 records. • M = 15,000; R = 1. • Peak with 1764 map workers. • Total time = 150 s.
Sort Experiment • 1800 machines. • Sorts 10^10 100-byte records. • Map function extracts a 10-byte sorting key from each record. • Reduce = identity function. • M = 15,000; R = 4,000. • Input: 1 TB; Output: 2 TB.
Conclusions • MapReduce is: • Expressive. • Efficiently implemented. • Task re-execution provides fault tolerance. • Locality optimization reduces network bandwidth.
Discussion • How general is this beyond Google's applications? • What about other parallel programming models? (Google's answer: we don't care; this one works, and that's all we need!) • The innovative contributions are not huge, but this is the only one-of-its-kind system. • The GFS paper (the only other major paper from Google at the time) is similar – it's just a system that works.
MON: On-Demand Overlays for Distributed System Management • Jin Liang, Steven Ko, Indranil Gupta, and Klara Nahrstedt, WORLDS 05
Motivation • Large distributed applications are emerging on: • Infrastructure: PlanetLab, Grid, … • Applications/services: CDN, DNS, storage, … • Difficult to manage due to: • Scale (100s to 1000s of nodes). • Failures (1/4 of nodes rebooted daily on PlanetLab!). • Cluster management tools are too heavyweight for managing distributed applications.
What do we Want to Manage? • Query and modify distributed (sets of) objects. • An object could be a system resource such as CPU utilization, RAM utilization, etc. • An object could be an application-generated element such as log/trace files (imagine a Pastry simulation running on 100 nodes). • An object could be application-specific (e.g., Pastry's internal routing tables, or software updates).
Management Operations [diagram: nodes n1–n6] • Query current system status. • Push software updates.
Existing Solutions • Centralized (CoMon, GIS, scripts…) • Efficient • Non-scalable, no in-network aggregation • Distributed but Persistent (Astrolabe, Ganglia, SWORD, …) • Scalable • Difficult to maintain, complex failure repair
MON: A New Approach • Management Overlay Networks (MON). • Distributed management framework. • On-demand overlay construction.
On-Demand, Why? • Simple • Light-weight • Better overlay performance • Suited to management • Sporadic usage • Short/medium term command execution
On-Demand, How? • Layered architecture: • Membership: membership exchange. • Overlay: on-demand overlay construction. • Management: command execution and result aggregation.
Membership Gossip [diagram: nodes n1–n6] • Partial membership list. • Periodic membership exchange. • Detect failure/recovery. • Measure delay. (A gossip sketch follows below.)
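A minimal sketch of gossip-style membership exchange, assuming a partial list of (address, last-heard timestamp) entries and an abstract `send` primitive; MON's actual message format and failure detector differ:

```python
import random
import time

class MembershipGossip:
    """Each node keeps a partial membership list and periodically exchanges a
    random sample of it with a random peer; stale entries suggest failures."""

    def __init__(self, fanout=3, fail_timeout=30.0):
        self.members = {}            # addr -> last-heard timestamp (seconds)
        self.fanout = fanout
        self.fail_timeout = fail_timeout

    def gossip_once(self, send):
        """Send a random sample of our list to one random peer.
        `send(peer, entries)` is an assumed network primitive."""
        if not self.members:
            return
        peer = random.choice(list(self.members))
        sample = random.sample(sorted(self.members.items()),
                               min(self.fanout, len(self.members)))
        send(peer, sample)

    def on_receive(self, entries):
        """Merge a peer's sample, keeping the freshest timestamp per node."""
        for addr, last_heard in entries:
            self.members[addr] = max(self.members.get(addr, 0.0), last_heard)

    def suspected_failed(self):
        """Nodes not heard about recently, directly or via gossip."""
        now = time.time()
        return [a for a, t in self.members.items()
                if now - t > self.fail_timeout]
```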
On-Demand Construction • The problem: • Directed graph G = (V, E). • Create a spanning subgraph (tree/DAG). • Goals: • Efficient, quick construction. • Good overlay performance. • Traditional algorithms may not work: Ω(E) messages, failure timeouts.
Randomized Algorithms (see the sketch below) • For constructing two kinds of overlays: tree and DAG. • Simple algorithm: • Each node randomly selects k children. • Each child acts recursively. • Improved algorithm: • Membership augmented with a local list. • Two stages: random + local selection. • DAG construction: similar to tree.
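A single-process simulation of the simple randomized tree algorithm (in the real system each node performs its own selection step and replies over the network; the `membership` structure and names below are assumptions for illustration):

```python
import random

def build_tree(root, membership, k):
    """Simple randomized algorithm: starting at the root, each node picks up to
    k random children (from its partial membership list) that are not yet in
    the tree, and each chosen child repeats the step."""
    tree = {root: []}          # node -> list of children
    joined = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        candidates = [m for m in membership.get(node, []) if m not in joined]
        children = random.sample(candidates, min(k, len(candidates)))
        tree[node] = children
        for child in children:
            joined.add(child)
            tree[child] = []
            frontier.append(child)
    return tree

# Example: six nodes that all know each other, fanout k = 2.
nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
membership = {n: [m for m in nodes if m != n] for n in nodes}
print(build_tree("n1", membership, k=2))
```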
Tree Building Example [diagram: nodes n1–n6]
Management Operations • Status query • Software push
Instant Status Query • Aggregate queries: • Average, histogram, top-k. • Supported attributes: load, memory, etc. (from CoMon). • Generic filtering: • Execute any operation on a node. • Return data based on the execution results. • Example: return the list of nodes where "Sendto: Operation not permitted" occurred in log file X. (An aggregation sketch follows below.)
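A sketch of in-network aggregation of an average over the on-demand tree, assuming each node knows its children and its locally measured load (the tree, load values, and function name here are illustrative, not MON's API):

```python
def aggregate_average(node, tree, local_load):
    """Each node combines its own value with its children's partial
    (sum, count) results, so only one small message travels up each edge."""
    total, count = local_load[node], 1
    for child in tree.get(node, []):
        child_sum, child_count = aggregate_average(child, tree, local_load)
        total += child_sum
        count += child_count
    return total, count

# Example overlay tree rooted at n1 and per-node load readings.
tree = {"n1": ["n2", "n3"], "n2": ["n4", "n5"], "n3": ["n6"]}
load = {"n1": 0.5, "n2": 1.0, "n3": 2.0, "n4": 0.2, "n5": 0.8, "n6": 1.5}
s, c = aggregate_average("n1", tree, load)
print(s / c)   # average load across the overlay = 1.0
```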
Software Push • Receiver-driven, multi-parent download. • Parents notify block availability. • Children request blocks. • The DAG structure helps: bandwidth, resilience. (A request-planning sketch follows below.)
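A sketch of one round of receiver-driven block requests in the DAG (the data structures and function name are assumptions for illustration, not MON's protocol):

```python
import random

def request_blocks(child, parents, have_blocks, total_blocks):
    """For each block the child is still missing, pick one parent (among those
    that announced the block) to request it from. Spreading requests over
    multiple parents uses each parent's upload bandwidth and tolerates a
    failed parent. Returns a dict block_id -> parent to ask."""
    plan = {}
    for block in range(total_blocks):
        if block in have_blocks[child]:
            continue
        candidates = [p for p in parents[child] if block in have_blocks[p]]
        if candidates:
            plan[block] = random.choice(candidates)
    return plan

# Example: n4 has two parents in the DAG; missing blocks get pulled from both.
parents = {"n4": ["n2", "n3"]}
have_blocks = {"n4": {0}, "n2": {0, 1, 2}, "n3": {2, 3}}
print(request_blocks("n4", parents, have_blocks, total_blocks=4))
# e.g. {1: 'n2', 2: 'n3', 3: 'n3'} (block 2 may come from either parent)
```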
Conclusion • On-demand overlays: • An alternative to long-term overlay maintenance (c.f. software rejuvenation). • Well suited to distributed system management. • MON achieves reasonable performance.
Discussion • Ongoing work: • Better construction algorithms, usability, etc. • Distributed log querying (a big challenge for PlanetLab applications). • How do on-demand overlays compare to long-lived overlays? • For long-running commands, can the on-demand approach beat the long-lived approach? • What are the tradeoffs?
ACMS: Akamai Configuration Management System • Sherman, Lisiecki, Berkheimer, Wein. NSDI 2005
Background • 1000s of Akamai servers store configuration files. • Individual Akamai clients (e.g., cnn.com) can submit a configuration update (as large as 100 MB) that needs: • Agreement on the update (for fault tolerance and ordering). • Fast propagation of the update. • To be lighter weight than a software install. • Requirements: • 24x7 availability. • Multiple entry points in multiple networks. • Efficiency and scalability. • Persistent, fault-tolerant storage. • Correctness: a unique ordering of all versions, and synchronization of the system to the latest version of each configuration file. • If a submission is accepted, it must be propagated to all servers through the Akamai CDN (content distribution network).
Architectural Overview • A small group of special Storage Points (SPs) is set aside for agreement (say 5 of them). • An application submitting an update contacts an SP called the "Accepting SP." • The Accepting SP replicates the message to a quorum of SPs. • SPs store the message persistently on disk as a file. • SPs run the Vector Exchange algorithm to agree on a submission. • SPs offer the data for download through the Akamai CDN.
Quorum-Based Replication • In order for an Accepting SP to accept an update submission we require that the update be both replicated and agreed upon by a quorum (majority) of the ACMS SPs. • Assumption: ACMS can maintain a majority of operational and connected SPs.
Vector Exchange • The Accepting SP initializes a bit vector by setting its own bit to 1 and the rest to 0, then broadcasts the vector along with the update's UID to the other SPs. • Any SP that sees the vector sets its corresponding bit to 1, stores the vector persistently on disk, and re-broadcasts the modified vector to the rest of the SPs. • Persistent storage guarantees that an SP will not lose its vector state on process restart or machine reboot. • When an SP sees a quorum of bits set, it knows agreement has been reached. (A sketch of the idea follows below.)
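A minimal sketch of the Vector Exchange idea (persistence and network broadcast are stubbed out, and the class and method names are assumptions, not ACMS's actual code):

```python
class VectorExchangeSP:
    """Each Storage Point (SP) keeps a bit vector per update UID; an update is
    agreed once a quorum (majority) of bits are set."""

    def __init__(self, sp_id, all_sp_ids):
        self.sp_id = sp_id
        self.all_sp_ids = list(all_sp_ids)
        self.vectors = {}                     # uid -> {sp_id: bit}

    def initiate(self, uid):
        """Accepting SP: set own bit to 1, the rest to 0, persist, broadcast."""
        vec = {sp: 0 for sp in self.all_sp_ids}
        vec[self.sp_id] = 1
        self.vectors[uid] = vec
        self._persist(uid)
        return vec                            # would be broadcast to all SPs

    def on_receive(self, uid, incoming_vec):
        """Any SP: merge the incoming vector, set own bit, persist, re-broadcast."""
        vec = self.vectors.setdefault(uid, {sp: 0 for sp in self.all_sp_ids})
        for sp, bit in incoming_vec.items():
            vec[sp] |= bit
        vec[self.sp_id] = 1
        self._persist(uid)
        return vec

    def agreed(self, uid):
        """Agreement is reached once a majority of bits are set."""
        vec = self.vectors.get(uid, {})
        return sum(vec.values()) > len(self.all_sp_ids) // 2

    def _persist(self, uid):
        pass  # real ACMS writes the vector to disk so it survives restarts

# Example with 5 SPs: A initiates, B and C see the vector -> quorum of 3.
sps = {name: VectorExchangeSP(name, "ABCDE") for name in "ABCDE"}
v = sps["A"].initiate("update-42")
v = sps["B"].on_receive("update-42", v)
v = sps["C"].on_receive("update-42", v)
print(sps["C"].agreed("update-42"))   # True: 3 of 5 bits set
```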
Failures and Maintenance • If the Accepting SP gets cut off from the quorum after Vector Exchange has been initiated, it returns a "Possible Accept." • Apparently, this has never occurred in the real system. • Maintenance: • Upgrade individual servers so as not to kill the quorum. • Adding or removing machines requires a temporary halt.
Recovery Via Index-Merging • Every few seconds SPs pick a random subset of other SPs and merge index files. • Helped cover up some early bugs. • This is nothing but a gossip mechanism!
Operational Issues • Connected Quorum Assumption • First 9 months of 2004, 36 instances where a Storage Point did not have connectivity to a quorum for > 10 minutes. • Never lost an operating quorum • 6 days of logs: 2 pairwise outages of 2 and 8 minutes. • 3 instances of file corruption in 6 months. • Agreement phase: < 50 ms on average.
Discussion • Again, the generality question – how general is ACMS and its techniques (not that it needs to be general beyond Akamai’s purposes)?
Summary • Management is an important problem. • We need lightweight solutions. • We need generically applicable solutions. • We need solutions that can be used for the applications, not just for the clusters themselves. • Lots of open (and tough) problems in this area… which makes it a good area for research yet to be done.