
Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager


Presentation Transcript


  1. Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D., Cisco Systems, Inc.

  2. Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions

  3. System Software Requirements • Turn on once, with remote access thereafter • Non-Stop == max 20 events/day, each lasting < 200 ms • Hitless SW Upgrades and Downgrades: upgrade/downgrade SW components across delta versions • Field Patchable • Beta Test New Features in situ • Extensive Trace Facilities: on Routes, Tunnels, Subscribers, … • Configuration: clear APIs; minimize application awareness • Extensive remote capabilities for fault management, software maintenance, and software installations

  4. Our Approach • Distributed redundancy • NO master • Multiple copies of everything • Running in tracking mode • Parallel, seeing identical input • Multiple ways of selecting leader • Utilize component architecture • Multiple ways to do something => framework! • Create an initial working base • Encourage experimentation

  5. Methodology • Exploit open source software • Reduce development time • Encourage outside participation • Cross-fertilize with HPC community • Write new cluster manager (ORCM) • Exploit new capabilities • Potential dual-use for HPC clusters • Encourage outside contributions

  6. Open Source ≠ Free • Pro: widespread exposure (ORTE on thousands of systems around the world; surface & address problems); community support (others can help solve problems; expanded access to tools, e.g., debuggers; energy; other ideas and methods) • Con: your timeline ≠ my timeline (no penalty for late contributions; academic contributors have other priorities); compromise is a required art (code must be designed to support multiple approaches; nobody wins all the time; adds time to implementation)

  7. Outline • Overview • Key pieces • OpenRTE (3-day workshop) • uPNP • ORCM • Architecture • Fault behavior • Future directions

  8. A Convergence of Ideas (diagram): FT-MPI (U of TN), LA-MPI (LANL), LAM/MPI (IU), and PACX-MPI (HLRS) converged into Open MPI, whose runtime layer became OpenRTE; together with fault detection work (LANL, industry), FDDP (semiconductor mfg. industry), robustness (CSU), autonomous computing (many), and grid (many), these ideas feed into resilient computing systems

  9. Program Objective (diagram). *Cell = one or more computers sharing a common launch environment/point

  10. Participants • Developers: DOE/NNSA* (Los Alamos Nat Lab, Sandia Nat Lab, Oak Ridge Nat Lab); Universities (Indiana University, Univ of Tennessee, Univ of Houston, HLRS Stuttgart) • Support: Industry (Cisco, Oracle, IBM, Microsoft*, Apple*, multiple interconnect vendors); Open source teams (OFED, autotools, Mercurial) *Providing funding

  11. Reliance on Components • Formalized interfaces • Specifies a “black box” implementation • Different implementations available at run-time • Can compose different systems on the fly (diagram: a caller invoking Interface 1 / Interface 2 / Interface 3)
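
A minimal sketch of the component idea on this slide: a framework is a table of function pointers, and the caller codes only against that interface, so implementations can be swapped at run-time. The names here (example_module_t, USE_QUIET) are illustrative assumptions, not the actual OpenRTE/ORCM API.

```c
#include <stdio.h>
#include <stdlib.h>

/* A framework defines a "black box" interface as a table of function pointers. */
typedef struct {
    const char *name;
    int  (*init)(void);
    int  (*do_work)(const char *input);
    void (*finalize)(void);
} example_module_t;

/* Two interchangeable implementations of the same interface. */
static int  noisy_init(void)            { return 0; }
static int  noisy_work(const char *in)  { printf("noisy: %s\n", in); return 0; }
static void noisy_fini(void)            {}

static int  quiet_init(void)            { return 0; }
static int  quiet_work(const char *in)  { (void)in; return 0; }
static void quiet_fini(void)            {}

static example_module_t noisy = { "noisy", noisy_init, noisy_work, noisy_fini };
static example_module_t quiet = { "quiet", quiet_init, quiet_work, quiet_fini };

/* The caller never names a concrete implementation; which module is active can
 * be decided at run-time (here from the environment), composing systems on the fly. */
int main(void) {
    example_module_t *active = getenv("USE_QUIET") ? &quiet : &noisy;
    active->init();
    active->do_work("hello");
    active->finalize();
    return 0;
}
```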

  12. OpenRTE and Components • Components are shared libraries • Central set of components in installation tree • Users can also have components under $HOME • Can add / remove components after install • No need to recompile / re-link apps • Download / install new components • Develop new components safely • Update “on-the-fly” • Add, update components while running • Frameworks “pause” during update
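
Since the slide says components are shared libraries that can be dropped in or removed after install without relinking the application, here is a hedged sketch of how such a component might be loaded at run-time with dlopen(). The library path and entry-point symbol are hypothetical; the real component discovery and directory layout are not shown.

```c
/* Link with -ldl on Linux. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*component_entry_fn)(void);

/* Load one shared-library component and call its entry point. */
int load_component(const char *path, const char *symbol) {
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }
    component_entry_fn entry = (component_entry_fn)dlsym(handle, symbol);
    if (!entry) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }
    return entry();   /* component registers itself with its framework */
}

int main(void) {
    /* A component dropped into the install tree or under $HOME can be picked up
     * without recompiling or re-linking the application. */
    load_component("./libexample_component.so", "example_component_register");
    return 0;
}
```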

  13. Component Benefits • Stable, production quality environment for 3rd party researchers • Can experiment inside the system without rebuilding everything else • Small learning curve (learn a few components, not the entire implementation) • Allow wide use, experience before exposing work • Vendors can quickly roll out support for new platforms • Write only the components you want/need to change • Protect intellectual property

  14. ORTE: Resiliency* • Fault: an event that hinders the correct operation of a process; may not actually be a “failure” of a component, but can cause system-level failure or performance degradation below the specified level; the effect may be immediate or some time in the future; faults are usually rare, so there may not be many data examples • Fault prediction: estimate the probability of an incipient fault within some time period in the future • Fault tolerance (reactive, static): the ability to recover from a fault • Robustness (metric): how much the system can absorb without catastrophic consequences • Resilience (proactive, dynamic): dynamically configure the system to minimize the impact of potential faults *standalone presentation

  15. Key Frameworks • Sensor: monitors software and hardware state-of-health (sentinel file size, mod & access times; memory footprint; temperature; heartbeat; ECC errors); predicts incipient faults (trend, fingerprint; AI-based algorithms coming) • Error Manager (Errmgr): receives all process state updates (sensor, waitpid), including predictions; determines the response strategy (restart locally, restart globally, abort); executes recovery; accounts for fault groups to avoid repeated failover
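
A small sketch of how sensor updates might flow into an error manager that picks a response strategy, in the spirit of this slide. The enums, the errmgr_update() function, and the restart threshold are assumptions for illustration, not the real ORCM frameworks.

```c
#include <stdio.h>

typedef enum { PROC_OK, PROC_DEGRADED, PROC_FAILED } proc_state_t;
typedef enum { RESP_NONE, RESP_RESTART_LOCAL, RESP_RESTART_GLOBAL, RESP_ABORT } response_t;

/* Error manager: receives every state update (from sensors, waitpid, or
 * fault predictions) and decides how to recover. */
response_t errmgr_update(proc_state_t state, int local_restarts, int max_local) {
    if (state == PROC_OK)       return RESP_NONE;
    if (state == PROC_DEGRADED) return RESP_RESTART_LOCAL;     /* predicted fault */
    if (local_restarts < max_local) return RESP_RESTART_LOCAL; /* try locally first */
    return RESP_RESTART_GLOBAL;                                /* fail over elsewhere */
}

int main(void) {
    /* Example: a temperature sensor flags a process trending toward failure. */
    response_t r = errmgr_update(PROC_DEGRADED, 0, 3);
    printf("response = %d\n", r);
    return 0;
}
```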

  16. Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions

  17. Universal PNP • Widely adopted standard; ORCM uses only a part of it • PNP discovery via an announcement on a standard multicast channel; the announcement includes the application id and contact info, and all applications respond • The wireup “storm” limits scalability; various algorithms exist for storm reduction • Each application is assigned its own “channel”: all output from members of that application goes on its channel, and input sent to that application is delivered to all members
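
To make the discovery step concrete, here is a sketch of sending a PNP-style announcement on a multicast channel using standard sockets. The group address, port, and message format are assumptions for illustration; the actual ORCM wire format is not shown.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in grp;
    memset(&grp, 0, sizeof(grp));
    grp.sin_family = AF_INET;
    grp.sin_addr.s_addr = inet_addr("239.255.0.1");  /* assumed discovery group */
    grp.sin_port = htons(12345);                     /* assumed discovery port  */

    /* The announcement carries an application id and contact info; every
     * application that hears it responds on its own channel. */
    const char *announce = "app=my_app;version=1.0;contact=10.0.0.5:6000";
    if (sendto(sock, announce, strlen(announce), 0,
               (struct sockaddr *)&grp, sizeof(grp)) < 0) {
        perror("sendto");
    }
    close(sock);
    return 0;
}
```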

  18. Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions

  19. ORCM DVM • One orcmd daemon per node • Started at node boot or launched by a tool • Locally spawns and monitors processes and system-health sensors • Small footprint (≤ 1 MB) • Each daemon tracks the existence of the others (PNP wireup) and knows where all processes are located (diagram: orcmd daemons connected via the predefined “System” multicast channel)
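
A sketch of how each daemon might track its peers from heartbeats heard on the “System” channel. The peer table, heartbeat interval, and missed-beat threshold are illustrative assumptions; they connect to the daemon-failure detection described on slide 30.

```c
#include <stdbool.h>
#include <time.h>

#define MAX_NODES      64
#define HEARTBEAT_SECS  1
#define MISSED_LIMIT    5   /* heartbeats missed before declaring a node down */

typedef struct {
    bool   up;
    time_t last_heartbeat;
} peer_t;

static peer_t peers[MAX_NODES];

/* Called when a heartbeat arrives from daemon `vpid` on the System channel. */
void heartbeat_recv(int vpid) {
    if (vpid < 0 || vpid >= MAX_NODES) return;
    peers[vpid].up = true;
    peers[vpid].last_heartbeat = time(NULL);
}

/* Called periodically; marks a peer down once it misses too many beats. */
void heartbeat_sweep(void) {
    time_t now = time(NULL);
    for (int v = 0; v < MAX_NODES; v++) {
        if (peers[v].up &&
            now - peers[v].last_heartbeat > MISSED_LIMIT * HEARTBEAT_SECS) {
            peers[v].up = false;   /* trigger node-failure handling (slide 32) */
        }
    }
}
```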

  20. Parallel DVMs • Allows concurrent development and testing in the production environment, and sharing of development resources • A unique identifier (the ORTE jobid) maintains separation between orcmd’s • Each application belongs to its respective DVM • No cross-DVM communication is allowed

  21. Configuration Mgmt (diagram): the lowest-vpid orcmd opens the configuration interface (cfgi) framework, which has confd and tool/file components; the confd component connects and subscribes to a confd daemon to receive the configuration set, while the tool/file components receive a config file (e.g., from orcm-start)

  22. Configuration Mgmt (diagram, continued): the other orcmd daemons open the cfgi framework and update any missing config info, and the lowest-vpid daemon assumes “leader” duties

  23. Application Launch (diagram): a config change arrives through the cfgi framework; a launch msg giving the #procs and their location is sent over the predefined “System” multicast channel to all orcmd daemons

  24. Resilient Mapper • Fault groups: nodes with a common failure mode; a node can belong to multiple fault groups; defined in the system file • Map instances across fault groups to minimize the probability of cascading failures • One instance per fault group; pick the lightest-loaded node in the group; randomly map any extras (see the sketch below) • Next-generation algorithms: failure-mode probability => fault-group selection
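
A minimal sketch of the mapping rule on this slide: place at most one instance per fault group, choosing the lightest-loaded node within each group. The fault_group_t layout and the load metric (process count) are assumptions; random placement of extras and the next-generation probability-based selection are not shown.

```c
#include <stdio.h>

#define MAX_GROUPS 8
#define MAX_NODES  16

typedef struct {
    int num_nodes;
    int node_ids[MAX_NODES];
    int node_load[MAX_NODES];   /* processes already mapped to each node */
} fault_group_t;

/* Pick the lightest-loaded node within one fault group. */
static int lightest_node(const fault_group_t *fg) {
    int best = 0;
    for (int i = 1; i < fg->num_nodes; i++)
        if (fg->node_load[i] < fg->node_load[best]) best = i;
    return best;
}

/* Map at most one instance per fault group so that a single failure mode
 * cannot take out multiple replicas. */
void map_instances(fault_group_t groups[], int ngroups, int ninstances) {
    for (int g = 0; g < ngroups && g < ninstances; g++) {
        int idx = lightest_node(&groups[g]);
        groups[g].node_load[idx]++;
        printf("instance %d -> node %d\n", g, groups[g].node_ids[idx]);
    }
}
```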

  25. Multiple Replicas • Multiple copies of each executable, run on separate fault groups • Async, independent • Shared PNP channel • Input: received by all replicas • Output: broadcast to all; received by those who registered for that input • Leader determined by the receiver

  26. Leader Selection • Two forms of leader selection: internal to the ORCM DVM, and external-facing • Internal: handled by a framework • Methods: app-specific module, configuration-specified, lowest rank, first contact, or none (see the sketch below)
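
A sketch of pluggable leader-selection policies following this slide. The two policies shown (lowest rank, first contact) come from the slide; the replica_t type and function names are hypothetical, and the app-specific and configuration-specified variants would simply be further modules behind the same function-pointer type.

```c
#include <stddef.h>

typedef struct {
    int  rank;
    long first_heard;   /* timestamp of the first announcement we received */
} replica_t;

typedef const replica_t *(*leader_fn)(const replica_t *r, size_t n);

/* Policy 1: lowest rank wins. */
static const replica_t *leader_lowest_rank(const replica_t *r, size_t n) {
    const replica_t *best = &r[0];
    for (size_t i = 1; i < n; i++)
        if (r[i].rank < best->rank) best = &r[i];
    return best;
}

/* Policy 2: the first replica we heard from wins. */
static const replica_t *leader_first_contact(const replica_t *r, size_t n) {
    const replica_t *best = &r[0];
    for (size_t i = 1; i < n; i++)
        if (r[i].first_heard < best->first_heard) best = &r[i];
    return best;
}

/* The active module is chosen by configuration; "none" skips selection. */
static leader_fn active_leader_policy = leader_lowest_rank;

/* Called by the receiver whenever the replica set changes. */
const replica_t *current_leader(const replica_t *replicas, size_t n) {
    (void)leader_first_contact;   /* alternate policy, selectable by config */
    return n ? active_leader_policy(replicas, n) : NULL;
}
```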

  27. External Connections (orcm-connector) • Input: broadcast on the respective PNP channel • Output: determines the “leader” whose output is supplied to the rest of the world; can utilize any leader method in the framework

  28. Testing in Production (diagram): orcm-logger captures output through a logger framework with db, file, syslog, and console components

  29. Software Maintenance • On-the-fly module activation: the configuration manager can select new modules to load, reload, or activate, and can change the priorities of active modules • Full replacement (when more than a module needs updating): start the replacement version, have the configuration manager switch the “leader”, then stop the old version

  30. Detecting Failures • Application failures: detected by the local daemon, which monitors for self-induced problems (memory and cpu usage), orders termination if limits are exceeded or trending toward being exceeded, and detects unexpected failures via waitpid (see the sketch below) • Hardware failures: local hardware sensors continuously report status, read by the local daemon, which projects potential failure modes to pre-order relocation of processes or shutdown of the node • Daemon failures: detected by the DVM when a daemon misses heartbeats
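
To illustrate the waitpid-based detection mentioned on this slide, here is a sketch of a non-blocking reaping loop a local daemon might run from its event loop. The function name and the reporting step are assumptions; what the daemon then does (restart, relocate) is covered on the following slides.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Called periodically by the local daemon to reap any child that has exited. */
void check_children(void) {
    int status;
    pid_t pid;
    /* WNOHANG: poll without blocking the daemon's event loop. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
            printf("proc %d exited abnormally (code %d)\n",
                   (int)pid, WEXITSTATUS(status));
        } else if (WIFSIGNALED(status)) {
            printf("proc %d killed by signal %d\n", (int)pid, WTERMSIG(status));
        }
        /* Here the daemon would report a state update to the error manager,
         * which decides on local restart, relocation, or abort. */
    }
}
```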

  31. Application Failure • Local daemon: detects (or predicts) the failure; restarts locally up to the specified max #local-restarts; otherwise utilizes the resilient mapper to determine the new location and sends a launch message to all daemons • Replacement app: announces itself on the application’s public address channel, receives responses, registers its own inputs, and begins operation • Connected applications: select a new “leader” based on the current module

  32. Node Failure (diagram) • The next-higher orcmd becomes leader; it opens/inits the cfgi framework and updates any missing config info • The failed node is marked “down” • Application processes are relocated from the failed node • Connected apps fail over their leader per the active leader module • An attempt is made to restart the failed node

  33. Node Replacement/Addition • Auto-boot of the local daemon on power-up • The daemon announces itself to the DVM, and all DVM members add the node to the available resources • Reboot/restart: relocate the original procs back, up to some max number of times (a smarter algorithm is needed here) • Leadership remains unaffected to avoid “bounce” • Processes will map to the new resource as start/restart demands • Future: rebalance the existing load when a node becomes available

  34. Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions

  35. System Software Requirements (revisited, with how ORCM addresses each) • Turn on once with remote access thereafter → boot-level startup • Non-Stop == max 20 events/day lasting < 200 ms each → ~5 ms recovery • Hitless SW upgrades and downgrades across delta versions → start the new app triplet, kill the old one • Field patchable → start/stop triplets, leader selection • Beta test new features in situ → new app triplet, register for production input • Extensive trace facilities (Routes, Tunnels, Subscribers, …) ✓ • Configuration (clear APIs; minimize application awareness) ✓ • Extensive remote capabilities for fault management, software maintenance, and software installations ✓

  36. Still A Ways To Go • Security • Who can order ORCM to launch/stop apps? • Who can “log” output from which apps? • Network extent of communications? • Communications • Message size, fragmentation support • Speed of underlying transport • Truly reliable multicast • Asynchronous messaging

  37. Still A Ways To Go • Transfer of state • How does a restarted application replica regain the state of its prior existence? • How do we re-sync state across replicas so outputs track? • Deterministic outputs • Same output from replicas tracking same inputs • Assumes deterministic algorithms • Can we support non-deterministic algorithms? • Random channel selection to balance loads • Decisions based on instantaneous traffic sampling

  38. Still A Ways To Go • Enhanced algorithms • Mapping • Leader selection • Fault prediction • Implementation and algorithms • Expanded sensors • Replication vs rapid restart • If we can restart in a few milliseconds, do we really need replication?

  39. Concluding Remarks http://www.open-mpi.org http://www.open-mpi.org/projects/orcm
