Management of Distributed Systems and Applications • Indranil Gupta • March 9, 2006 • CS 598IG, Spring 2006
PlanetLab – 645 nodes over 309 sites (Mar 06) • TeraGrid – 10's of clusters • Commodity Clusters – O(10,000 nodes)
Management of distributed applications accounts for 24%-33% of the TCO (Total Cost of Ownership) • Many tools are available for managing clusters • Beowulf clusters • NOW clusters • But these are too heavyweight! • Need lightweight tools that are reusable.
MapReduce: Simplified Data Processing on Large Clusters • Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
Motivation • Google cluster – several 10's of 1000's of nodes (used for indexing and other operations). • Parallel & distributed programming on this cluster is complex: • Process failures. • Job scheduling. • Communication. • Managing TBs of input/output data. • Application code is hard to maintain.
Goal [diagram] • The user application is written in the MapReduce coding style. • The MapReduce library hides input data partitioning, task scheduling (tasks 1…M), and node failures from the user application, running on a commodity cluster.
Programming Model • Inspired by functional programming (Lisp). • map(k1, v1) -> list(k2, v2) • User implemented. • Input: key/value pairs. • Output: intermediate key/value pairs. • reduce(k2, list(v2)) -> list(v2) • User implemented. • Input: an intermediate key and the list of all values for that key. • Output: usually a single value.
Example: Word Frequency in a Set of Docs [diagram] • map takes (key = filename, value = file data) and emits intermediate pairs such as <"the", "1">, <"stereo", "1">, … • reduce sums the counts for each word, producing results such as "the = 100,000", "stereo = 20,000". (A minimal code sketch follows below.)
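A minimal single-process sketch of the word-count example in Python (Google's real library is a C++ framework; the driver, function names, and sample documents below are illustrative only):

```python
from collections import defaultdict

def map_fn(filename, contents):
    """Emit (word, 1) for every word in the document."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """Sum all counts emitted for this word."""
    return sum(counts)

def map_reduce(documents):
    """Tiny single-process driver that mimics the MapReduce data flow."""
    intermediate = defaultdict(list)
    for filename, contents in documents.items():
        for key, value in map_fn(filename, contents):   # map phase
            intermediate[key].append(value)              # shuffle: group by key
    return {key: reduce_fn(key, values)                  # reduce phase
            for key, values in intermediate.items()}

if __name__ == "__main__":
    docs = {"a.txt": "the stereo and the radio", "b.txt": "the cat"}
    print(map_reduce(docs))
    # {'the': 3, 'stereo': 1, 'and': 1, 'radio': 1, 'cat': 1}
```

The driver runs the map, group-by-key, and reduce steps sequentially; the real library runs them in parallel across thousands of machines.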
Impact on Google • 18 months, 900 applications. • Many domains: • Machine learning. • Clustering for Google News & Froogle. • Query reports (e.g., Google Zeitgeist). • Large-scale graph computations. • Property extraction from web pages. • Google's indexer rewritten. • Code is significantly smaller, and decoupled.
Execution [figure, built up over four slides] • The user program forks a master and worker processes; the input data resides on GFS (the Google File System) as M splits. • The master assigns map tasks to N (< M) map workers, exploiting locality: each worker reads its split from a local or nearby GFS replica. • Each map worker writes its output to local disk; a partitioning function divides the intermediate data into R regions, with output ordered by key (intermediate files stay on the local file system). • The master then assigns reduce tasks; reduce workers remote-read the intermediate files from the map workers' disks. • Each reduce worker writes its final result (Output 0, Output 1, …) back to GFS. (A sketch of the partitioning function follows below.)
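A sketch of the default partitioning scheme described in the paper (hash of the intermediate key modulo R); the hash choice and names below are illustrative, not Google's actual code:

```python
import zlib

def partition(key: str, R: int) -> int:
    """Assign an intermediate key to one of R reduce regions: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % R

# With R = 4, every <"the", "1"> pair emitted by any map worker lands in the
# same region, so a single reduce task sees all of the counts for "the".
print(partition("the", 4))  # deterministic value in 0..3
```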
Fault Tolerance [diagram] • The master pings workers periodically; workers that stop responding are marked failed. • Completed map tasks on a failed worker get rescheduled (their output lives on that worker's local disk and is lost). • In-progress map/reduce tasks on a failed worker get rescheduled. • If the master fails, the MapReduce operation is aborted and the user application is notified.
Backup Tasks • Near MapReduce completion, the master schedules backup (duplicate) executions of the remaining in-progress tasks, so a slow machine running task_i does not delay the whole job.
Grep Experiment • 1800 machines. • Scans 10^10 100-byte records. • Pattern matches 92,337 records. • M = 15,000; R = 1. • Peak with 1764 map workers. • Total time = 150 s.
Sort Experiment • 1800 machines. • Sorts 10^10 100-byte records. • Map function extracts a 10-byte sorting key from each record. • Reduce = identity function. • M = 15,000; R = 4,000. • Input: 1 TB; Output: 2 TB.
Conclusions • MapReduce is: • Expressive. • Efficiently implemented. • Task re-execution provides fault tolerance. • Locality optimization reduces network bandwidth.
Discussion • How general is this beyond Google's applications? • What about other parallel programming models? (Google's answer: we don't care; this one works, and that's all we need!) • The innovative contributions are not huge, but this is the only one-of-its-kind system. • The GFS paper (the only other major paper from Google at the time) is similar – it's just a system that works.
MON: On-Demand Overlays for Distributed System Management • Jin Liang, Steven Ko, Indranil Gupta, and Klara Nahrstedt, WORLDS 05
Motivation • Large distributed applications are emerging on: • Infrastructure: PlanetLab, Grid, … • Applications/services: CDN, DNS, storage, … • Difficult to manage due to: • Scale (100s to 1000s of nodes). • Failures (1/4 of nodes rebooted daily on PlanetLab!). • Cluster management tools are too heavyweight for managing distributed applications.
What do we Want to Manage? • Query and modify distributed (sets of) objects. • An object could be a system resource such as CPU utilization, RAM utilization, etc. • An object could be an application-generated element such as log/trace files (imagine a Pastry simulation running on 100 nodes). • An object could be application-specific (e.g., Pastry's internal routing tables, or software updates).
Management Operations [diagram: nodes n1–n6] • Query current system status. • Push software updates.
Existing Solutions • Centralized (CoMon, GIS, scripts…) • Efficient • Non-scalable, no in-network aggregation • Distributed but Persistent (Astrolabe, Ganglia, SWORD, …) • Scalable • Difficult to maintain, complex failure repair
MON: A New Approach • Management Overlay Networks (MON). • Distributed management framework. • On-demand overlay construction.
On-Demand, Why? • Simple • Light-weight • Better overlay performance • Suited to management • Sporadic usage • Short/medium term command execution
On-Demand, How? • Layered architecture: • Membership: membership exchange. • Overlay: on-demand overlay construction. • Management: command execution and result aggregation.
Membership Gossip [diagram: nodes n1–n6] • Partial membership list. • Periodic membership exchange. • Detect failure/recovery. • Measure delay. (A gossip sketch follows below.)
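A minimal sketch of gossip-style membership exchange, assuming a partial list of (address, last-heard timestamp) entries and an abstract `send` primitive; MON's actual message format and failure detector differ:

```python
import random
import time

class MembershipGossip:
    """Each node keeps a partial membership list and periodically exchanges a
    random sample of it with a random peer; stale entries suggest failures."""

    def __init__(self, fanout=3, fail_timeout=30.0):
        self.members = {}            # addr -> last-heard timestamp (seconds)
        self.fanout = fanout
        self.fail_timeout = fail_timeout

    def gossip_once(self, send):
        """Send a random sample of our list to one random peer.
        `send(peer, entries)` is an assumed network primitive."""
        if not self.members:
            return
        peer = random.choice(list(self.members))
        sample = random.sample(sorted(self.members.items()),
                               min(self.fanout, len(self.members)))
        send(peer, sample)

    def on_receive(self, entries):
        """Merge a peer's sample, keeping the freshest timestamp per node."""
        for addr, last_heard in entries:
            self.members[addr] = max(self.members.get(addr, 0.0), last_heard)

    def suspected_failed(self):
        """Nodes not heard about recently, directly or via gossip."""
        now = time.time()
        return [a for a, t in self.members.items()
                if now - t > self.fail_timeout]
```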
On-Demand Construction • The problem: • Directed graph G = (V, E). • Create a spanning subgraph (tree/DAG). • Goals: • Efficient, quick construction. • Good overlay performance. • Traditional algorithms may not work: Ω(E) messages, failure timeouts.
Randomized Algorithms (see the sketch below) • For constructing two kinds of overlays: tree and DAG. • Simple algorithm: • Each node randomly selects k children. • Each child acts recursively. • Improved algorithm: • Membership augmented with a local list. • Two stages: random + local selection. • DAG construction: similar to tree.
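A single-process simulation of the simple randomized tree algorithm (in the real system each node performs its own selection step and replies over the network; the `membership` structure and names below are assumptions for illustration):

```python
import random

def build_tree(root, membership, k):
    """Simple randomized algorithm: starting at the root, each node picks up to
    k random children (from its partial membership list) that are not yet in
    the tree, and each chosen child repeats the step."""
    tree = {root: []}          # node -> list of children
    joined = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        candidates = [m for m in membership.get(node, []) if m not in joined]
        children = random.sample(candidates, min(k, len(candidates)))
        tree[node] = children
        for child in children:
            joined.add(child)
            tree[child] = []
            frontier.append(child)
    return tree

# Example: six nodes that all know each other, fanout k = 2.
nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
membership = {n: [m for m in nodes if m != n] for n in nodes}
print(build_tree("n1", membership, k=2))
```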
Tree Building Example [diagram: nodes n1–n6]
Management Operations • Status query • Software push
Instant Status Query • Aggregate queries: • Average, histogram, top-k. • Supported attributes: load, memory, etc. (from CoMon). • Generic filtering: • Execute any operation on a node. • Return data based on the execution results. • Example: return the list of nodes where "Sendto: Operation not permitted" occurred in log file X. (An aggregation sketch follows below.)
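A sketch of in-network aggregation of an average over the on-demand tree, assuming each node knows its children and its locally measured load (the tree, load values, and function name here are illustrative, not MON's API):

```python
def aggregate_average(node, tree, local_load):
    """Each node combines its own value with its children's partial
    (sum, count) results, so only one small message travels up each edge."""
    total, count = local_load[node], 1
    for child in tree.get(node, []):
        child_sum, child_count = aggregate_average(child, tree, local_load)
        total += child_sum
        count += child_count
    return total, count

# Example overlay tree rooted at n1 and per-node load readings.
tree = {"n1": ["n2", "n3"], "n2": ["n4", "n5"], "n3": ["n6"]}
load = {"n1": 0.5, "n2": 1.0, "n3": 2.0, "n4": 0.2, "n5": 0.8, "n6": 1.5}
s, c = aggregate_average("n1", tree, load)
print(s / c)   # average load across the overlay = 1.0
```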
Software Push • Receiver-driven, multi-parent download. • Parents notify block availability. • Children request blocks. • The DAG structure helps: bandwidth, resilience. (A request-planning sketch follows below.)
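A sketch of one round of receiver-driven block requests in the DAG (the data structures and function name are assumptions for illustration, not MON's protocol):

```python
import random

def request_blocks(child, parents, have_blocks, total_blocks):
    """For each block the child is still missing, pick one parent (among those
    that announced the block) to request it from. Spreading requests over
    multiple parents uses each parent's upload bandwidth and tolerates a
    failed parent. Returns a dict block_id -> parent to ask."""
    plan = {}
    for block in range(total_blocks):
        if block in have_blocks[child]:
            continue
        candidates = [p for p in parents[child] if block in have_blocks[p]]
        if candidates:
            plan[block] = random.choice(candidates)
    return plan

# Example: n4 has two parents in the DAG; missing blocks get pulled from both.
parents = {"n4": ["n2", "n3"]}
have_blocks = {"n4": {0}, "n2": {0, 1, 2}, "n3": {2, 3}}
print(request_blocks("n4", parents, have_blocks, total_blocks=4))
# e.g. {1: 'n2', 2: 'n3', 3: 'n3'} (block 2 may come from either parent)
```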
Conclusion • On-demand overlays: • An alternative to long-term overlay maintenance (c.f. software rejuvenation). • Well suited to distributed system management. • MON achieves reasonable performance.
Discussion • Ongoing work: • Better construction algorithms, usability, etc. • Distributed log querying (a big challenge for PlanetLab applications). • How do on-demand overlays compare to long-lived overlays? • For long-running commands, can the on-demand approach beat the long-lived approach? • What are the tradeoffs?
ACMS: Akamai Configuration Management System • Sherman, Lisiecki, Berkheimer, Wein. NSDI 2005
Background • 1000s of Akamai servers store configuration files. • Individual Akamai clients (e.g., cnn.com) can submit a configuration update (as large as 100 MB) that needs: • Agreement on the update (for fault tolerance and ordering). • Fast propagation of the update. • To be lighter weight than a software install. • Requirements: • 24x7 availability. • Multiple entry points in multiple networks. • Efficiency and scalability. • Persistent, fault-tolerant storage. • Correctness: a unique ordering of all versions, and synchronization of the system to the latest version of each configuration file. • If a submission is accepted, it must be propagated to all servers through the Akamai CDN (content distribution network).
Architectural Overview • A small group of special Storage Points (SPs) is set aside for agreement (say 5 of them). • An application submitting an update contacts an SP called the "Accepting SP." • The Accepting SP replicates the message to a quorum of SPs. • SPs store the message persistently on disk as a file. • SPs run the Vector Exchange algorithm to agree on a submission. • SPs offer the data for download through the Akamai CDN.
Quorum-Based Replication • In order for an Accepting SP to accept an update submission we require that the update be both replicated and agreed upon by a quorum (majority) of the ACMS SPs. • Assumption: ACMS can maintain a majority of operational and connected SPs.
Vector Exchange • The Accepting SP initializes a bit vector by setting its own bit to 1 and the rest to 0, then broadcasts the vector along with the update's UID to the other SPs. • Any SP that sees the vector sets its corresponding bit to 1, stores the vector persistently on disk, and re-broadcasts the modified vector to the rest of the SPs. • Persistent storage guarantees that an SP will not lose its vector state on process restart or machine reboot. • When an SP sees a quorum of bits set, it knows agreement has been reached. (A sketch of the idea follows below.)
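A minimal sketch of the Vector Exchange idea (persistence and network broadcast are stubbed out, and the class and method names are assumptions, not ACMS's actual code):

```python
class VectorExchangeSP:
    """Each Storage Point (SP) keeps a bit vector per update UID; an update is
    agreed once a quorum (majority) of bits are set."""

    def __init__(self, sp_id, all_sp_ids):
        self.sp_id = sp_id
        self.all_sp_ids = list(all_sp_ids)
        self.vectors = {}                     # uid -> {sp_id: bit}

    def initiate(self, uid):
        """Accepting SP: set own bit to 1, the rest to 0, persist, broadcast."""
        vec = {sp: 0 for sp in self.all_sp_ids}
        vec[self.sp_id] = 1
        self.vectors[uid] = vec
        self._persist(uid)
        return vec                            # would be broadcast to all SPs

    def on_receive(self, uid, incoming_vec):
        """Any SP: merge the incoming vector, set own bit, persist, re-broadcast."""
        vec = self.vectors.setdefault(uid, {sp: 0 for sp in self.all_sp_ids})
        for sp, bit in incoming_vec.items():
            vec[sp] |= bit
        vec[self.sp_id] = 1
        self._persist(uid)
        return vec

    def agreed(self, uid):
        """Agreement is reached once a majority of bits are set."""
        vec = self.vectors.get(uid, {})
        return sum(vec.values()) > len(self.all_sp_ids) // 2

    def _persist(self, uid):
        pass  # real ACMS writes the vector to disk so it survives restarts

# Example with 5 SPs: A initiates, B and C see the vector -> quorum of 3.
sps = {name: VectorExchangeSP(name, "ABCDE") for name in "ABCDE"}
v = sps["A"].initiate("update-42")
v = sps["B"].on_receive("update-42", v)
v = sps["C"].on_receive("update-42", v)
print(sps["C"].agreed("update-42"))   # True: 3 of 5 bits set
```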
Failures and Maintenance • If the Accepting SP gets cut off from the quorum after Vector Exchange has been initiated, it returns a "Possible Accept." • Apparently, this has never occurred in the real system. • Maintenance: • Upgrade individual servers so as not to kill the quorum. • Adding or removing machines requires a temporary halt.
Recovery Via Index-Merging • Every few seconds SPs pick a random subset of other SPs and merge index files. • Helped cover up some early bugs. • This is nothing but a gossip mechanism!
Operational Issues • Connected Quorum Assumption • First 9 months of 2004, 36 instances where a Storage Point did not have connectivity to a quorum for > 10 minutes. • Never lost an operating quorum • 6 days of logs: 2 pairwise outages of 2 and 8 minutes. • 3 instances of file corruption in 6 months. • Agreement phase: < 50 ms on average.
Discussion • Again, the generality question – how general is ACMS and its techniques (not that it needs to be general beyond Akamai’s purposes)?
Summary • Management is an important problem. • We need lightweight solutions. • We need generically applicable solutions. • We need solutions that can be used for the applications, not just for the clusters themselves. • Lots of open (and tough) problems in this area… which makes it a good area for research yet to be done.