Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc.
Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions
System Software Requirements • Turn on once, with remote access thereafter • Non-Stop == max 20 events/day, each lasting < 200 ms • Hitless SW upgrades and downgrades: upgrade/downgrade SW components across delta versions • Field patchable • Beta test new features in situ • Extensive trace facilities: routes, tunnels, subscribers, … • Configuration: clear APIs; minimize application awareness • Extensive remote capabilities for fault management, software maintenance, and software installations
Our Approach • Distributed redundancy • NO master • Multiple copies of everything • Running in tracking mode • Parallel, seeing identical input • Multiple ways of selecting leader • Utilize component architecture • Multiple ways to do something => framework! • Create an initial working base • Encourage experimentation
Methodology • Exploit open source software • Reduce development time • Encourage outside participation • Cross-fertilize with HPC community • Write new cluster manager (ORCM) • Exploit new capabilities • Potential dual-use for HPC clusters • Encourage outside contributions
Open Source ≠ Free • Pro: widespread exposure (ORTE on thousands of systems around the world; surface & address problems) • Pro: community support (others can help solve problems; expanded access to tools, e.g. debuggers) • Pro: energy (other ideas and methods) • Con: your timeline ≠ my timeline (no penalty for late contributions; academic contributors have other priorities) • Con: compromise is a required art (code must be designed to support multiple approaches; nobody wins all the time; adds time to implementation)
Outline • Overview • Key pieces • OpenRTE (3-day workshop) • uPNP • ORCM • Architecture • Fault behavior • Future directions
A Convergence of Ideas • Open MPI: a convergence of FT-MPI (U of TN), LA-MPI (LANL), LAM/MPI (IU), and PACX-MPI (HLRS) • OpenRTE and resilient computing systems: fault detection (LANL, industry), FDDP (semiconductor mfg. industry), robustness (CSU), autonomous computing (many), grid (many)
Program Objective *Cell = one or more computers sharing a common launch environment/point
Participants • Developers: DOE/NNSA* (Los Alamos Nat Lab, Sandia Nat Lab, Oak Ridge Nat Lab); universities (Indiana University, Univ of Tennessee, Univ of Houston, HLRS Stuttgart) • Support: industry (Cisco, Oracle, IBM, Microsoft*, Apple*, multiple interconnect vendors); open source teams (OFED, autotools, Mercurial) • *Providing funding
Reliance on Components • Formalized interfaces specify a "black box" implementation • Different implementations are available at run-time • Can compose different systems on the fly • (Diagram: a caller invoking interchangeable implementations behind Interface 1 / Interface 2 / Interface 3; a minimal sketch follows below)
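As an illustration of the pattern, here is a minimal sketch in C of a framework interface with a swappable component. The names (logger_module_t, console_module, and so on) are invented for the example and are not the actual ORTE/ORCM API; they simply show how a struct of function pointers lets the caller pick an implementation at run time.

```c
/* Illustrative component framework: one formalized interface, many implementations. */
#include <stdio.h>

/* The "black box" interface every logger component must implement
 * (names are invented; this is not the real ORTE/ORCM API). */
typedef struct {
    const char *name;
    int  (*init)(void);
    void (*log)(const char *msg);
    void (*finalize)(void);
} logger_module_t;

/* One implementation: log to the console. */
static int  console_init(void)           { return 0; }
static void console_log(const char *msg) { printf("console: %s\n", msg); }
static void console_fini(void)           {}

static logger_module_t console_module = {
    "console", console_init, console_log, console_fini
};

/* The caller sees only the interface, never the implementation. */
int main(void) {
    logger_module_t *active = &console_module;  /* selection made at run time */
    active->init();
    active->log("framework selected the console component");
    active->finalize();
    return 0;
}
```

Because the caller holds only a pointer to the interface, a file, syslog, or database component could be substituted without recompiling the caller, which is the property the frameworks described later rely on.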
OpenRTE and Components • Components are shared libraries • Central set of components in installation tree • Users can also have components under $HOME • Can add / remove components after install • No need to recompile / re-link apps • Download / install new components • Develop new components safely • Update “on-the-fly” • Add, update components while running • Frameworks “pause” during update
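Since components are shared libraries, a runtime can discover and load them from a central installation tree or from a per-user directory without relinking the application. The sketch below illustrates that idea with dlopen; the directory layout and the component_init entry symbol are assumptions for the example, not the real ORTE conventions.

```c
/* Illustrative sketch: load a component shared library at run time. */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char user_path[512];
    /* Hypothetical layout: a system install tree plus a per-user component dir. */
    snprintf(user_path, sizeof(user_path),
             "%s/.openrte/components/mca_sensor_heartbeat.so",
             getenv("HOME") ? getenv("HOME") : ".");
    const char *candidates[] = {
        "/opt/orcm/lib/openrte/mca_sensor_heartbeat.so",  /* central install tree */
        user_path,                                        /* user's own components */
    };

    void *handle = NULL;
    for (int i = 0; i < 2 && !handle; i++)
        handle = dlopen(candidates[i], RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "no component found: %s\n", dlerror());
        return 1;
    }

    /* Hypothetical init symbol the component is assumed to export. */
    int (*component_init)(void) = (int (*)(void)) dlsym(handle, "component_init");
    if (component_init)
        component_init();

    dlclose(handle);
    return 0;
}
```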
Component Benefits • Stable, production quality environment for 3rd party researchers • Can experiment inside the system without rebuilding everything else • Small learning curve (learn a few components, not the entire implementation) • Allow wide use, experience before exposing work • Vendors can quickly roll out support for new platforms • Write only the components you want/need to change • Protect intellectual property
ORTE: Resiliency* • Fault: an event that hinders the correct operation of a process; it may not be an outright failure of a component, but can cause system-level failure or degrade performance below the specified level; the effect may be immediate or deferred; faults are usually rare, so there may not be many data examples • Fault prediction: estimate the probability of an incipient fault within some future time period • Fault tolerance (reactive, static): the ability to recover from a fault • Robustness (metric): how much the system can absorb without catastrophic consequences • Resilience (proactive, dynamic): dynamically configure the system to minimize the impact of potential faults *standalone presentation
Key Frameworks • Sensor: monitors software and hardware state-of-health (sentinel file size, mod & access times; memory footprint; temperature; heartbeat; ECC errors); predicts incipient faults (trend and fingerprint methods, AI-based algorithms coming) • Error Manager (Errmgr): receives all process state updates (sensor, waitpid), including predictions; determines the response strategy (restart locally, restart globally, or abort); executes recovery; accounts for fault groups to avoid repeated failover • (A sentinel-file sensor sketch follows below)
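As one illustration of the sensor side, here is a sketch of a sentinel-file check; the thresholds and the report_to_errmgr hook are hypothetical. The sensor stats the file periodically and reports to the error manager when its size and modification time stop advancing.

```c
/* Illustrative sentinel-file sensor: flag a process whose sentinel file stops changing. */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical hook into the error manager. */
static void report_to_errmgr(const char *what) {
    fprintf(stderr, "errmgr: %s\n", what);
}

/* Return 1 if the sentinel file looks stale, 0 if healthy, -1 on error. */
int check_sentinel(const char *path, off_t *last_size, time_t max_idle_secs) {
    struct stat st;
    if (stat(path, &st) != 0) {
        report_to_errmgr("sentinel file missing");
        return -1;
    }
    time_t idle = time(NULL) - st.st_mtime;
    if (st.st_size == *last_size && idle > max_idle_secs) {
        report_to_errmgr("sentinel file stopped growing");
        return 1;
    }
    *last_size = st.st_size;
    return 0;
}
```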
Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions
Universal PNP • Widely adopted standard • ORCM uses only a part • PNP discovery via announcement on std multicast channel • Includes application id, contact info • All applications respond • Wireup “storm” limits scalability • Various algorithms for storm reduction • Each application assigned own “channel” • All output from members of that application • Input sent to that application given to all members
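A rough sketch of the announcement step follows; the multicast group, port, and message format are made up for illustration and do not reflect the actual channel assignments. On startup an application sends its id and contact info to a well-known multicast channel, and every listener replies, which is the source of the wireup "storm" noted above.

```c
/* Illustrative PNP-style announcement on a well-known multicast channel. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical group/port for the discovery channel. */
    const char *group = "239.255.0.1";
    const int   port  = 10000;

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, group, &addr.sin_addr);

    /* The announcement carries the application id and contact info. */
    const char *announce = "APP_ANNOUNCE app=router-mgr:1.2 contact=192.168.1.10:5555";
    if (sendto(sock, announce, strlen(announce), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");

    close(sock);
    return 0;
}
```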
Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions
ORCM DVM • One orcmd daemon per node, wired up over a predefined "System" multicast channel • Started at node boot or launched by a tool • Locally spawns and monitors processes and system-health sensors • Small footprint (≤1 MB) • Each daemon tracks the existence of the others (PNP wireup) and knows where all processes are located
Parallel DVMs • Allows concurrent development and testing in the production environment, and sharing of development resources • A unique identifier (ORTE jobid) maintains separation between orcmd's • Each application belongs to its respective DVM • No cross-DVM communication allowed
Configuration Mgmt • The orcmd with the lowest vpid opens the configuration (cfgi) framework, which provides confd, tool, and file components • Depending on the active component, it connects and subscribes to a confd daemon or receives a config file at orcm-start • It then updates any missing config info and assumes "leader" duties
Application Launch • The launch message (number of procs, their locations, and any config change) is broadcast on the predefined "System" multicast channel to all orcmd's
Resilient Mapper • Fault groups: nodes with a common failure mode; a node can belong to multiple fault groups; defined in the system file • Map instances across fault groups to minimize the probability of cascading failures • One instance per fault group; pick the lightest-loaded node in the group; randomly map any extras • Next-generation algorithms: failure-mode probability => fault group selection • (A mapping sketch follows below)
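The following is a sketch of the mapping policy described above; the data structures and node names are invented for illustration. It places one instance per fault group and chooses the lightest-loaded node within each group.

```c
/* Illustrative resilient mapping: one replica per fault group, lightest node first. */
#include <stdio.h>

#define MAX_NODES 8

typedef struct {
    const char *name;
    int load;            /* processes already mapped to this node */
} node_t;

typedef struct {
    node_t *nodes[MAX_NODES];
    int     n_nodes;
} fault_group_t;

/* Pick the lightest-loaded node in one fault group. */
static node_t *lightest(fault_group_t *fg) {
    node_t *best = fg->nodes[0];
    for (int i = 1; i < fg->n_nodes; i++)
        if (fg->nodes[i]->load < best->load) best = fg->nodes[i];
    return best;
}

/* Map one replica of an app onto each fault group. */
void map_replicas(const char *app, fault_group_t *groups, int n_groups) {
    for (int g = 0; g < n_groups; g++) {
        node_t *target = lightest(&groups[g]);
        target->load++;
        printf("%s replica %d -> %s\n", app, g, target->name);
    }
}

int main(void) {
    node_t a = {"node-a", 2}, b = {"node-b", 0}, c = {"node-c", 1};
    fault_group_t fg0 = { { &a, &b }, 2 };   /* two nodes sharing a failure mode */
    fault_group_t fg1 = { { &c }, 1 };
    fault_group_t groups[2];
    groups[0] = fg0;
    groups[1] = fg1;
    map_replicas("router-mgr", groups, 2);
    return 0;
}
```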
Multiple Replicas • Multiple copies of each executable • Run on separate fault groups • Async, independent • Shared pnp channel • Input: recvd by all • Output: broadcast to all, recvd by those who registered for input • Leader determined by recvr
Leader Selection • Two forms of leader selection • Internal to ORCM DVM • External facing • Internal - framework • App-specific module • Configuration specified • Lowest rank • First contact • None
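A sketch of receiver-side leader selection follows; the policy names mirror the bullets above, but the types and function are invented for the example. The receiver applies whichever policy its configuration specifies: lowest rank, first contact, or none.

```c
/* Illustrative receiver-side leader selection among replicas of one application. */
typedef enum { LEADER_LOWEST_RANK, LEADER_FIRST_CONTACT, LEADER_NONE } leader_policy_t;

typedef struct {
    int rank;            /* replica rank within the application */
    int contact_order;   /* order in which this replica announced itself */
} replica_t;

/* Return the index of the replica whose output this receiver should use,
 * or -1 when no leader is selected (all outputs accepted). */
int select_leader(const replica_t *reps, int n, leader_policy_t policy) {
    if (n == 0 || policy == LEADER_NONE) return -1;
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (policy == LEADER_LOWEST_RANK && reps[i].rank < reps[best].rank)
            best = i;
        else if (policy == LEADER_FIRST_CONTACT &&
                 reps[i].contact_order < reps[best].contact_order)
            best = i;
    }
    return best;
}
```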
External Connections (orcm-connector) • Input: broadcast on the application's respective PNP channel • Output: a "leader" is determined to supply output to the rest of the world, using any leader method in the framework
Testing in Production • orcm-logger: output is captured through the logger framework, with db, file, syslog, and console components
Software Maintenance • On-the-fly module activation • Configuration manager can select new modules to load, reload, activate • Change priorities of active modules • Full replacement • When more than a module needs updating • Start replacement version • Configuration manager switches “leader” • Stop old version
Detecting Failures • Application failures - detected by local daemon • Monitors for self-induced problems • Memory and cpu usage • Orders termination if limits exceeded or are trending to exceed • Detects unexpected failures via waitpid • Hardware failures • Local hardware sensors continuously report status • Read by local daemon • Projects potential failure modes to pre-order relocation of processes, shutdown node • Detected by DVM when daemon misses heartbeats
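A sketch of the heartbeat side of failure detection follows; the interval, miss count, and the declare_failed hook are assumptions for the example. Each daemon records the time of the last heartbeat from its peers, and a peer that misses several consecutive intervals is declared down.

```c
/* Illustrative heartbeat-miss detection across DVM daemons. */
#include <stdio.h>
#include <time.h>

#define MAX_DAEMONS     64
#define HEARTBEAT_SECS   2      /* expected heartbeat interval (assumed) */
#define MISSES_ALLOWED   3      /* consecutive misses before declaring failure */

static time_t last_beat[MAX_DAEMONS];

/* Called whenever a heartbeat message arrives from daemon 'vpid'. */
void record_heartbeat(int vpid) {
    last_beat[vpid] = time(NULL);
}

/* Hypothetical hook: tell the errmgr a daemon (and its node) is down. */
static void declare_failed(int vpid) {
    fprintf(stderr, "errmgr: daemon %d missed heartbeats, marking node down\n", vpid);
}

/* Periodic sweep run by each surviving daemon. */
void check_heartbeats(int n_daemons) {
    time_t now = time(NULL);
    for (int v = 0; v < n_daemons; v++) {
        if (last_beat[v] != 0 &&
            now - last_beat[v] > (time_t)(MISSES_ALLOWED * HEARTBEAT_SECS))
            declare_failed(v);
    }
}
```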
Application Failure • Local daemon • Detects (or predicts) failure • Locally restarts up to specified max #local-restarts • Utilizes resilient mapper to determine re-location • Sends launch message to all daemons • Replacement app • Announces itself on application public address channel • Receives responses - registers own inputs • Begins operation • Connected applications • Select new “leader” based on current module
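The restart cap and the relocate hook below are placeholders, but the sketch shows the shape of the local-restart policy: the daemon reaps the failed child with waitpid, restarts it locally until the maximum is hit, then hands off for relocation.

```c
/* Illustrative local-restart loop: restart a failed child up to a cap, then relocate. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_LOCAL_RESTARTS 3   /* assumed per-process cap */

/* Hypothetical hand-off to the resilient mapper on another node. */
static void relocate_elsewhere(const char *app) {
    fprintf(stderr, "errmgr: %s exceeded local restarts, relocating\n", app);
}

static pid_t launch(const char *app) {
    pid_t pid = fork();
    if (pid == 0) {
        execlp(app, app, (char *)NULL);
        _exit(127);                      /* exec failed */
    }
    return pid;
}

void supervise(const char *app) {
    int restarts = 0;
    pid_t child = launch(app);
    for (;;) {
        int status;
        waitpid(child, &status, 0);      /* detect unexpected exit */
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return;                      /* clean shutdown, nothing to do */
        if (++restarts > MAX_LOCAL_RESTARTS) {
            relocate_elsewhere(app);
            return;
        }
        child = launch(app);             /* restart locally */
    }
}
```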
Node Failure • The next-higher orcmd becomes leader • It opens/inits the cfgi framework (confd, tool, file components) and updates any missing config info • The failed node is marked "down" • Application processes from the failed node are relocated • Connected apps fail over their leader per the active leader module • An attempt is made to restart
Node Replacement/Addition • Auto-boot of local daemon on power up • Daemon announces to DVM • All DVM members add node to available resources • Reboot/restart • Relocate original procs back up to some max number of times (need smarter algo here) • Leadership remains unaffected to avoid “bounce” • Processes will map to new resource as start/restart demands • Future: rebalance existing load upon node availability
Outline • Overview • Key pieces • OpenRTE • uPNP • ORCM • Architecture • Fault behavior • Future directions
System Software Requirements: how ORCM addresses them • Turn on once, with remote access thereafter => boot-level startup • Non-Stop == max 20 events/day lasting < 200 ms each => ~5 ms recovery • Hitless SW upgrades/downgrades across delta versions => start a new app triplet, kill the old one • Field patchable => start/stop triplets, leader selection • Beta test new features in situ => new app triplet registers for production input • Also required: extensive trace facilities (routes, tunnels, subscribers, …), clear configuration APIs with minimal application awareness, and extensive remote capabilities for fault management, software maintenance, and software installations
Still A Ways To Go • Security • Who can order ORCM to launch/stop apps? • Who can “log” output from which apps? • Network extent of communications? • Communications • Message size, fragmentation support • Speed of underlying transport • Truly reliable multicast • Asynchronous messaging
Still A Ways To Go • Transfer of state • How does a restarted application replica regain the state of its prior existence? • How do we re-sync state across replicas so outputs track? • Deterministic outputs • Same output from replicas tracking same inputs • Assumes deterministic algorithms • Can we support non-deterministic algorithms? • Random channel selection to balance loads • Decisions based on instantaneous traffic sampling
Still A Ways To Go • Enhanced algorithms: mapping, leader selection, fault prediction (implementation and algorithms) • Expanded sensors • Replication vs. rapid restart: if we can restart in a few milliseconds, do we really need replication?
Concluding Remarks http://www.open-mpi.org http://www.open-mpi.org/projects/orcm