Malleability, Migration and Replication for Adaptive Distributed Computing Over Dynamic Environments
Boleslaw Szymanski, Travis Desell, Kaoutar El Maghraoui and Carlos Varela
Department of Computer Science, Rensselaer Polytechnic Institute
September 11, 2007
Talk Outline
• Middleware-driven Reconfiguration
• Application-level Functional Issues
• Reconfiguring MPI Applications
• Performance Evaluation
• Conclusion and Future Directions
• Questions?
Today's Computing Environments
Infrastructure characteristics:
• Large scale environments
• High fault rates
• Dynamic resource availability
• Dynamic resource demand
• Heterogeneous resources
Applications' view:
• What resources are available? Resource discovery
• Where should an application execute? Resource selection
• How is the application doing? Resource monitoring
• How can resources be utilized better? Migration
• What process granularity is best? Dynamic granularity
Challenges of Dynamic Reconfiguration
• An application needs to support:
  • scaling up to accommodate new resources,
  • shrinking to accommodate departing or slow resources.
• The middleware needs to provide transparent, non-intrusive performance monitoring and application adaptability.
  • Currently this burden falls on the programmer.
• The system needs to supervise the allocation and reallocation of resources.
Our Approach: Gap-bridging Software
Separation of concerns between the application and the middleware:
• Applications: focus on problem solving; support migration and/or dynamic granularity through high-level APIs and library support
• Middleware: defines when and where to reconfigure applications, and decides what types of reconfiguration to apply
Middleware: IOS, the Internet Operating System
The Internet Operating System (IOS) is a decentralized middleware framework that provides:
• Opportunistic load balancing capabilities
• Resource-level profiling
• Application-level profiling
• Generic interfaces to interoperate with various programming models
• A modular and customizable software architecture
Reference: Kaoutar El Maghraoui, Travis J. Desell, Boleslaw K. Szymanski, and Carlos A. Varela. The Internet Operating System: Middleware for Adaptive Distributed Computing. International Journal of High Performance Computing Applications (IJHPCA), Special Issue on Scheduling Techniques for Large-Scale Distributed Platforms, 20(4):467-480, 2006.
IOS Architecture
Diagram: each node runs an IOS agent composed of three modules. The profiling module collects application profiles through the IOS API of the reconfigurable application (which supports checkpointing, migration, and split and merge) and resource profiles from the CPU, memory, and network monitors. The decision module evaluates the expected gain of a potential reconfiguration and initiates a work steal request when there are enough local resources. The protocol module sends and receives work steal requests and exchanges performance profiles, communication profiles, and latency/bandwidth information with other agents. Reconfiguration requests (migrate/split/merge/replicate) are sent back to the application.
The Profiling Module
• Profiling API
  • Extensible
  • Resource usage queries
  • Collection formats for dynamic application performance
• Periodic application-level profiling
  • Communication patterns
  • Communication frequency
  • Application-specific metrics (e.g., iteration time)
• Periodic resource-level profiling
  • CPU, memory, disk, and network monitors
Diagram: the profiling module combines the application performance profile with the machine performance profile gathered by the monitors.
The Protocol Module
• Decentralized coordination of agents
• Resource discovery and dissemination
• Notifications to application entities about reconfiguration requests
• Choice of variations of work stealing
• Mappings to various virtual topologies
P2P-based Work Stealing Protocols
• Every node maintains a list of peers
• When a node becomes idle, it starts probing for work
• A depth-first search (DFS) technique is used, with the depth limit given by the packet's TTL
Diagram: work steal packets (TTL = 3, 4, 5) travel from under-loaded nodes toward moderately and heavily loaded nodes.
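The depth-limited probing described above can be sketched as follows. This is an illustrative model, not IOS code: the peer graph, the `steal` function, and the surplus-work threshold are all assumptions made for the example.

```python
import random

def steal(peers, work, start, ttl, rng=random.Random(42)):
    """Forward a work-steal packet from `start` through randomly chosen
    peers, decrementing the TTL at each hop, until a node with surplus
    work is found or the TTL is exhausted. Returns the donor or None."""
    node = start
    while ttl > 0:
        candidates = [p for p in peers[node] if p != start]
        if not candidates:
            return None
        node = rng.choice(candidates)
        if work.get(node, 0) > 1:   # donor has surplus work to hand over
            return node
        ttl -= 1                    # one hop consumed, keep probing
    return None

peers = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
work = {"a": 0, "b": 1, "c": 5}     # node "c" is heavily loaded
donor = steal(peers, work, "a", ttl=3)
```

In this tiny graph the idle node "a" always reaches the heavily loaded node "c" within the TTL; with no surplus work anywhere, the packet simply dies and the thief stays idle.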
Peer-to-Peer Middleware Agent Topology (P2P)
• List of peers, arranged in groups based on latency:
  • Local (0-10 ms)
  • Regional (11-100 ms)
  • National (101-250 ms)
  • Global (251+ ms)
• Work steal requests:
  • Propagated randomly within the closest group until the time to live is reached or work is found
  • Propagated to progressively farther groups if no work is found
• Peers respond to steal packets when the decision module decides to reconfigure the application based on the performance model
Diagram: peers span workstations, mobile resources, and cluster nodes.
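The latency-based grouping is a straightforward bucketing of measured round-trip times. A minimal sketch (the function name and data layout are illustrative, not from the IOS codebase):

```python
def classify_peers(latencies_ms):
    """Bucket peers into the local/regional/national/global groups
    described above, by measured round-trip latency in milliseconds."""
    groups = {"local": [], "regional": [], "national": [], "global": []}
    for peer, ms in latencies_ms.items():
        if ms <= 10:
            groups["local"].append(peer)
        elif ms <= 100:
            groups["regional"].append(peer)
        elif ms <= 250:
            groups["national"].append(peer)
        else:
            groups["global"].append(peer)
    return groups

g = classify_peers({"n1": 4, "n2": 75, "n3": 180, "n4": 320})
```

A thief would then walk these buckets in order, exhausting the local group before escalating to regional, national, and finally global peers.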
Cluster-to-Cluster Middleware Agent Topology (C2C)
• Hierarchical peer organization
  • Each cluster has a manager (super peer)
  • Each node in a cluster periodically reports profiled information to its manager
  • Managers perform intra-cluster load balancing
• Cluster managers form a dynamic peer-to-peer network
  • Managers may join or leave at any time
  • Leader election handles departing managers
  • Clusters can split and merge depending on network conditions
• Inter-cluster load balancing is based on work stealing, similar to the P2P protocol component
• Clusters are organized dynamically based on latency
IOS Load Balancing Strategies
Modularity enables customizable load balancing and profiling strategies:
• Random work-stealing (RS, based on Cilk's work stealing)
  • Lightly loaded nodes periodically send work steal packets to randomly picked peer nodes
  • A simple (no broadcasts required) and stable strategy
• Application topology-sensitive work-stealing (ATS)
  • An extension of RS that collocates processes that communicate frequently
• Network topology-sensitive work-stealing (NTS)
  • An extension of ATS that takes the network topology and performance into consideration
  • Adds periodic profiling of end-to-end network performance among peer nodes to measure latency and bandwidth
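The key idea behind ATS, collocating frequent communicators, can be sketched as a victim-selection heuristic: among a donor's entities, the thief prefers the one that exchanges the most messages with entities it already hosts. The function names and message-count layout below are assumptions for illustration, not the IOS implementation.

```python
def pick_entity_to_steal(thief_entities, donor_entities, msg_counts):
    """Application topology-sensitive choice: steal the donor entity
    with the highest message traffic toward entities already on the
    thief, so heavy communicators end up collocated."""
    def affinity(e):
        # count messages in both directions between e and the thief's entities
        return sum(msg_counts.get((e, t), 0) + msg_counts.get((t, e), 0)
                   for t in thief_entities)
    return max(donor_entities, key=affinity)

msgs = {("d1", "t1"): 120, ("d2", "t1"): 3, ("d2", "t2"): 7}
best = pick_entity_to_steal(["t1", "t2"], ["d1", "d2"], msgs)
```

Here "d1" wins (120 messages to "t1" versus 10 total for "d2"). NTS would further weight this affinity by the measured latency and bandwidth between the nodes involved.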
Resource-Sensitive Model
Reconfiguration decisions:
• When to migrate
• Where to migrate
• How many entities to migrate
• When to tune the entities' granularity by either:
  • increasing entity size by merging, or
  • decreasing entity size by splitting
A General Model for Weighted Resource-Sensitive Work-Stealing (WRS)
Given:
• A set of resources R
• A set of entities A that need those resources
• A potential migration from the current platform pc to a new platform pm
Compute the expected gain in overall performance G achieved by the migration, estimating:
• The change of execution speed Δs(A, pc, pm) > 0 of entities A on the new platform
• The life-time expectancy L(A) of entities A, set to the current execution time divided by one greater than the number of migrations already executed by entities A
• The cost M(A, pc, pm) of migrating entities A to the new platform
G = Δs(A, pc, pm) * L(A) / M(A, pc, pm)
If G > 1, the migration is profitable.
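The gain formula above is simple enough to sketch directly. The numeric values below are made up for illustration, and Δs is treated as a relative speed change:

```python
def life_expectancy(exec_time, migrations_so_far):
    """L(A): current execution time divided by one greater than the
    number of migrations entities A have already executed."""
    return exec_time / (migrations_so_far + 1)

def migration_gain(delta_speed, lifetime, cost):
    """G = Δs(A, pc, pm) * L(A) / M(A, pc, pm) from the WRS model;
    a migration is considered profitable when G > 1."""
    return delta_speed * lifetime / cost

# Entities have run 600 s and migrated twice, so they are expected
# to live another 600 / 3 = 200 s on the new platform.
L = life_expectancy(exec_time=600.0, migrations_so_far=2)
G = migration_gain(delta_speed=0.5, lifetime=L, cost=40.0)
profitable = G > 1
```

With a 0.5 relative speedup sustained over 200 s of expected remaining lifetime against a 40 s migration cost, G = 2.5, so the model would approve the migration.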
Migration and Malleability
• How can applications be reconfigured?
  • Migration: dynamic mapping to resources
  • Malleability: dynamic granularity
• How to accomplish dynamic granularity?
  • Allow components to split and merge.
Diagram (animated): components split into smaller components and merge back as the available resources change.
Split and Merge Operations
Diagram: the original N x N data space is block-partitioned among processes P0 ... Pn-1 (parallel decomposition); each process holds data cells plus ghost cells exchanged with its neighbors across boundary cells, updated with a 4-point stencil. A split operation redistributes the blocks over more processes, while a merge operation recombines them onto fewer processes.
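The data redistribution behind split and merge can be sketched as block partitioning of the row dimension. This is a simplified model: it tracks only block sizes, ignoring the ghost-cell exchange, and the function names are illustrative.

```python
def decompose(n_rows, n_procs):
    """Block-partition n_rows of the data space across n_procs,
    spreading any remainder over the first ranks."""
    base, extra = divmod(n_rows, n_procs)
    return [base + (1 if r < extra else 0) for r in range(n_procs)]

def split(sizes):
    """Binary split: double the process count by halving each block."""
    out = []
    for s in sizes:
        out.extend([s - s // 2, s // 2])
    return out

def merge(sizes):
    """Binary merge: halve the process count by pairing adjacent blocks."""
    return [sizes[i] + sizes[i + 1] for i in range(0, len(sizes), 2)]

sizes = decompose(100, 4)       # four blocks of 25 rows
after_split = split(sizes)      # eight blocks of 12 or 13 rows
restored = merge(after_split)   # merged back to four blocks of 25
```

No data is lost in either direction: the total row count is invariant, and only the ghost rows (omitted here) would need to be re-exchanged after a reconfiguration.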
Impact of Process Granularity
Experiments on a dual-processor node (SUN Blade 1000)
Malleability vs. Migration
Figure: throughput as the processor count varies (8, 16, 12, 15, 10, 8), with observed speedups of 13%, 6%, and 15% at the reconfiguration points.
When to Split and When to Merge?
• Merge is a local operation
  • Triggered when the OS context-switching frequency is high and the local environment is relatively stable
• Split is both a local and a remote operation
  • Locally triggered when the cache miss rate is high, most likely because of large application granularity
  • Remotely triggered when a work steal packet is received and N << NR (N is the number of processes and NR is the number of resources)
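The trigger rules above can be sketched as a small decision function. The thresholds are placeholders invented for the example, since the slide does not give concrete values:

```python
def reconfigure_action(ctx_switch_rate, cache_miss_rate, n_procs,
                       n_resources, steal_request=False,
                       ctx_high=1000.0, miss_high=0.05):
    """Heuristic trigger logic for split/merge decisions.
    ctx_high and miss_high are illustrative thresholds."""
    if ctx_switch_rate > ctx_high:
        return "merge"   # too many processes competing per processor
    if cache_miss_rate > miss_high:
        return "split"   # granularity too large for the cache
    if steal_request and n_procs < n_resources:
        return "split"   # a remote thief can host new processes (N << NR)
    return "none"

a = reconfigure_action(1500.0, 0.01, 4, 4)                      # merge
b = reconfigure_action(10.0, 0.20, 4, 4)                        # split (cache)
c = reconfigure_action(10.0, 0.01, 2, 8, steal_request=True)    # split (steal)
d = reconfigure_action(10.0, 0.01, 8, 8, steal_request=True)    # none
```

Note the ordering: merging takes priority because heavy context switching means the node is already oversubscribed, and splitting further would only make that worse.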
Functional Issues of Reconfiguration
• Applications need to be amenable to migration and/or malleability
• Migration and malleability support are programming-model-dependent
• Examples:
  • Simple Actor Language and Architecture (SALSA) applications
    • A language for actor-oriented applications
    • Implicit support for actor migration
    • No explicit support for malleability
  • Message Passing Interface (MPI) applications
    • No explicit support for either migration or malleability
Reconfiguring MPI Applications with IOS
Extending MPI with:
• Semi-transparent checkpointing
• Process migration support
• Process malleability support
• Integration with IOS
• Currently provided for iterative applications
• C/C++ bindings
Reference: Kaoutar El Maghraoui, Travis Desell, Boleslaw K. Szymanski, James D. Teresco, and Carlos A. Varela. Towards a Middleware Framework for Dynamically Reconfigurable Scientific Computing. In L. Grandinetti, editor, Grid Computing and New Frontiers of High Performance Processing, volume 14 of Advances in Parallel Computing, pages 275-301. Elsevier, 2005.
The MPI/IOS Runtime Architecture
• Uses instrumented MPI applications
• Process checkpointing, migration, and malleability are confined to the PCM library
• Wrappers are provided for most native MPI calls
• Includes the MPI library
• Includes the IOS runtime components
Migration Example
Diagram (animated): IOS issues a migrate request; MPI_SPAWN creates a new process in a newly created communicator, and the migrating process transfers its state to it. MPI_Intercomm_merge then merges the two communicators, so the spawned process joins MPI_COMM_WORLD and the ranks are renumbered.
Profiling MPI Applications
• The profiling library is based on the MPI profiling interface (PMPI)
• Transparent interception of all MPI calls
• Goal: profile MPI applications' communication patterns
Diagram: the MPI user program is instrumented with PCM code, then compiled and linked against the MPI, PMPI, and PCM libraries; the resulting executable communicates with the PCM daemon and the IOS reconfiguration middleware over TCP sockets.
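The PMPI mechanism lets a profiling library define its own MPI entry points and delegate to the real name-shifted calls. The same interception idea can be sketched in Python with a wrapper function; this is an analogy for the concept, not the actual PCM code:

```python
import collections
import functools

# per-destination message counts, the "communication pattern" profile
comm_profile = collections.Counter()

def intercept(fn):
    """Wrap a communication call so every invocation is recorded per
    destination before delegating to the real call, mimicking how a
    PMPI wrapper records traffic and then calls PMPI_Send."""
    @functools.wraps(fn)
    def wrapper(data, dest):
        comm_profile[dest] += 1   # record who talks to whom, how often
        return fn(data, dest)
    return wrapper

@intercept
def send(data, dest):
    return len(data)              # stand-in for the real data transfer

send(b"abc", dest=1)
send(b"de", dest=1)
send(b"xyz", dest=2)
```

After these calls the profile shows two messages to rank 1 and one to rank 2, exactly the kind of communication pattern ATS/NTS use to decide which processes to collocate.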
Empirical Evaluation
• Migration policies: single migration, group migration
• Dynamic networks: joining and leaving resources
• Dynamic granularity
• Middleware overhead
• Application communication topologies: unconnected, sparse, tree, hypercube
• Middleware agent topologies: peer-to-peer, cluster-to-cluster
• Network topologies: grid-like (set of homogeneous clusters), Internet-like (more heterogeneous)
Dynamic Networks
• Nodes were added and removed dynamically to test scalability.
• During the first half of the experiment, a node was added every 30 seconds.
• During the second half, a node was removed every 30 seconds.
• Throughput improves as the number of available nodes grows.
Physical Network Topologies
Grid-like topology:
• Relatively homogeneous processors
• Very high performance networking within clusters (e.g., Myrinet and Gigabit Ethernet)
• Networking between clusters uses dedicated, high bandwidth links (e.g., the Extensible Terascale Facility)
Internet-like topology:
• Wide range of processor architectures and operating systems
• Nodes are unreliable
• Networking between nodes ranges from low bandwidth, high latency connections to dedicated fiber-optic links
Results for applications with a high communication-to-computation ratio: tree application topology
Results for applications with a high communication-to-computation ratio (2): hypercube application topology
Adaptation Experiments with Migration
• Testbed: a 20-node heterogeneous cluster (4 dual-processor SUN Blade 1000 nodes and 16 single-processor SUN Ultra nodes)
• Application: a parallel simulation of iterative two-dimensional heat diffusion
• Artificial load was used to emulate a shared and dynamic environment
• Result: ~82% improvement in makespan
Adaptation with Split/Merge Features
Figure: a timeline of reconfiguration events, including several binary split operations, migrations of 2 processes to joining dual-processor nodes, a migration of 2 processes due to a node leaving the computation, and a merge operation.
Split/Merge Capabilities
• The application initially started with 8 processors. When 8 additional processors were made available at iteration 860, 8 processes were split and migrated to harness the newly available resources.
Application to Astro-Informatics
Co-PIs: M. Magdon-Ismail, H. Newberg, and C. Varela
Approach
• Develop a Generic Maximum Likelihood Evaluator (GMLE)
• Goals:
  • Plug-and-play scientific models, search methods, and distributed execution environments
  • Determine which applications and search methods work best on which execution environments
  • Develop new search methods that take advantage of highly concurrent and distributed environments
  • Enable future research into difficult scientific problems
• Separation of concerns:
  • Simple interfaces between the scientific computing, machine learning, and distributed computing components
GMLE Architecture
Diagram: scientific models (data initialization, integral function and composition, likelihood function and composition) supply the initial parameters and data; search routines (gradient descent, genetic search, simplex, ...) issue evaluation requests, receive results, and produce optimized parameters; the distributed evaluation framework (SALSA/Java or MPI/C) distributes parameters to a set of evaluators and combines their results.
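The plug-and-play separation of concerns can be sketched as three interchangeable pieces: a model-supplied likelihood, a search routine, and a distribution layer that farms evaluations out to workers and combines partial results. Everything below (names, the toy model, plain functions standing in for SALSA actors or MPI workers) is illustrative:

```python
def gmle_search(likelihood, search, evaluators, candidates):
    """Minimal plug-and-play skeleton: the scientific model supplies
    `likelihood`, the search method picks among scored candidates, and
    the distributed layer farms each evaluation out to `evaluators`
    and combines their partial results by summation."""
    def evaluate(params):
        partials = [ev(likelihood, params) for ev in evaluators]
        return sum(partials)                 # combine results
    scored = [(evaluate(p), p) for p in candidates]
    return search(scored)

def half_likelihood(likelihood, params):
    """Each worker computes an equal share of the full likelihood."""
    return likelihood(params) / 2.0

best = gmle_search(
    likelihood=lambda p: -(p - 3.0) ** 2,           # toy model, maximum at 3
    search=lambda scored: max(scored)[1],           # pick the best candidate
    evaluators=[half_likelihood, half_likelihood],  # two "workers"
    candidates=[1.0, 3.0, 5.0],
)
```

Swapping in a different model, search method, or evaluator pool changes only the arguments, which is the point of the simple interfaces between the three components.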
Preliminary Results
• GMLE implemented in SALSA/Java and MPI/C
• We executed a "small" test example on the RPI Grid and BlueGene/L
• Used 3 heterogeneous clusters on the RPI Grid:
  • 4 quad-processor PowerPCs (16 processors)
  • 4 quad-processor dual-core Opterons (32 processors)
  • 10 quad-processor Opterons (40 processors)
• Used two BlueGene/L partitions:
  • 256 nodes (256 processors, 512 in virtual node mode)
  • 512 nodes (512 processors, 1024 in virtual node mode)
GMLE on the RPI Grid and BlueGene/L
Figure: for the small test example, a single evaluation takes about 2 minutes, and the MLE requires 10,000+ evaluations, implying a 15+ day sequential runtime. The measured speedups of ~100x (a 1.5 day runtime) and ~230x (under 1 day) bring the computation down to practical time scales.
Conclusions
• Even the small test example is computationally expensive:
  • The calculation is done over only a single wedge of space for a single test model
• Higher accuracy is required:
  • Accuracy can be improved by a more detailed integral calculation, which increases computation time polynomially
  • Calculating the convolution for each point increases computation time by 30 times or more
• More computational power is very enabling:
  • Faster turn-around times mean models and data can be tested more quickly, streamlining the scientific cycle
  • It also allows for more detailed models and richer research
Future Work
• Evaluating the convergence rates of the different search methods on different architectures and evaluation frameworks
• Expanding the available search methods
• Continued collaboration with various scientific disciplines to examine how different types of scientific computation will scale and utilize these search methods
• Astronomy directions: identification of new or unstudied streams, multiple star types, multiple wedge-shaped volumes, multiple pieces of tidal debris, addition of kinematic data
Webpages:
http://wcl.cs.rpi.edu/gmle/
http://milkyway.cs.rpi.edu/
Overall Conclusions
• A framework for the dynamic reconfiguration of MPI applications in dynamic environments
  • Modular
  • Application-independent
• Generic interfaces for interoperation with various programming paradigms (e.g., actors and communicating processes)
• Decentralized reconfiguration policies
• Transparent and non-intrusive application monitoring and profiling
• Fine-grained application reconfiguration
• Empirical evaluation demonstrates the usefulness of this approach for iterative applications
Future Directions
• Interoperability with existing grid middleware (Globus, MPICH-G2, Condor, etc.)
• Reconfiguration policies for additional classes of applications (e.g., non-iterative applications and commercial enterprise applications)
• Deployment of the proposed middleware in larger environments and with larger applications
• Extending IOS to handle virtual machine (VM) migration and dynamic resizing of VMs
Related Work
• MPICH-G2
  • A grid-enabled implementation of MPI
  • http://www3.niu.edu/mpi/
• Adaptive MPI (AMPI)
  • An implementation of MPI on top of lightweight threads with support for process migration [Huang03]
• MPI Process Swapping
  • Enhances the performance of iterative applications using initial over-allocation of processors and selection of the best-executing nodes [Sievert04]
• Extensions to MPI with checkpointing and restart:
  • SRS library [Vadhiyar03]: application stop and restart
  • CoCheck [Stellner96] and StarFish [Agbaria99]: fault tolerance support