Explore the efficiency and impact of thread-based MPI execution on multiprogrammed shared memory machines, and discover techniques for faster synchronization and context switching that optimize MPI performance on SMMs.
Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/research/tmpi
MPI-Based Parallel Computation on Shared Memory Machines
• Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
• MPI is a portable, high-performance parallel programming model.
MPI on SMMs
• Threads are easier to program, but MPI is still used on SMMs for:
  • Better portability for running on other platforms (e.g., SMM clusters);
  • Good data locality due to data partitioning.
Scheduling for Parallel Jobs on Multiprogrammed SMMs
• Gang scheduling
  • Good for parallel programs that synchronize frequently;
  • Hurts resource utilization (processor fragmentation; not enough parallelism to use the allocated resources).
• Space/time sharing
  • Time sharing combined with dynamic partitioning;
  • High throughput. Popular in current OSes (e.g., IRIX 6.5).
• Impact on MPI program execution
  • Not all MPI nodes are scheduled simultaneously;
  • The number of processors available to each application may change dynamically.
• Optimization is needed for fast MPI execution on SMMs.
Techniques Studied
• Thread-based MPI execution [PPoPP'99]
  • Compile-time transformation for thread-safe MPI execution
  • Fast context switch and synchronization
  • Fast communication through address sharing
• Two-level thread management for multiprogrammed environments
  • Even faster context switch/synchronization
  • Use scheduling information to guide synchronization
• Our prototype system: TMPI
Impact of Synchronization on Coarse-Grain Parallel Programs
• Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3.
• Synchronization costs 43%-84% of total time.
• Execution time breakdown for TMPI and SGI MPI:
Related Work
MPI-related work
• MPICH, a portable MPI implementation [Gropp/Lusk et al.].
• SGI MPI, highly optimized on SGI platforms.
• MPI-2, multithreading within a single MPI node.
Scheduling and synchronization
• Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research.
• Scheduler-conscious Synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks.
• Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.
Outline
• Motivations & Related Work
• Adaptive Two-level Thread Management
• Scheduler-conscious Event Waiting
• Experimental Studies
Context Switch/Synchronization in Multiprogrammed Environments
In multiprogrammed environments, synchronization leads to more context switches, which has a large performance impact.
• Conventional MPI implementations map each MPI node to an OS process.
• Our earlier work maps each MPI node to a kernel thread.
• Two-level thread management maps each MPI node to a user-level thread.
  • Faster context switch and synchronization among user-level threads
  • Very few kernel-level context switches
System Architecture
• Targeted at multiprogrammed environments
• Two-level thread management
[Architecture diagram: multiple MPI applications, each running its MPI nodes as user-level threads on its own TMPI runtime, on top of a system-wide resource manager.]
Adaptive Two-level Thread Management
• System-wide resource manager (OS kernel or user-level central monitor)
  • collects information about active MPI applications;
  • partitions processors among them.
• Application-wide user-level thread management
  • maps each MPI node to a user-level thread;
  • schedules user-level threads on a pool of kernel threads;
  • keeps the number of active kernel threads close to the number of allocated processors.
• Big picture (in the whole system): #active kernel threads ≈ #processors ⇒ minimizes kernel-level context switches
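As a rough illustration of this division of labor, the sketch below shows one way a system-wide monitor could partition processors among active MPI applications and expose each application's share. The equal-share policy and the names NUM_PROCS, manager_register_app(), and manager_allocation() are assumptions for this example, not TMPI's actual interface.

```c
/* Hypothetical system-wide resource manager: divides the machine's
 * processors among active MPI applications; each application's runtime
 * later reads its own share. Illustrative sketch only. */
#include <pthread.h>

#define NUM_PROCS 32            /* processors in the machine (assumed) */
#define MAX_APPS  64

static int num_apps;            /* currently active MPI applications */
static int alloc[MAX_APPS];     /* processors allocated to each application */
static pthread_mutex_t mgr_lock = PTHREAD_MUTEX_INITIALIZER;

/* Split processors as evenly as possible among the active applications. */
static void repartition(void)
{
    int base  = num_apps ? NUM_PROCS / num_apps : 0;
    int extra = num_apps ? NUM_PROCS % num_apps : 0;
    for (int i = 0; i < num_apps; i++)
        alloc[i] = base + (i < extra ? 1 : 0);
}

/* Called when an MPI application starts; returns its slot id. */
int manager_register_app(void)
{
    pthread_mutex_lock(&mgr_lock);
    int id = num_apps++;
    repartition();
    pthread_mutex_unlock(&mgr_lock);
    return id;
}

/* Called by an application's runtime when it polls the manager. */
int manager_allocation(int app_id)
{
    pthread_mutex_lock(&mgr_lock);
    int n = alloc[app_id];
    pthread_mutex_unlock(&mgr_lock);
    return n;
}
```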
User-level Thread Scheduling
• Every kernel thread can be:
  • active: executing an MPI node (user-level thread);
  • suspended.
• Execution invariant for each application:
  • #active kernel threads ≈ #allocated processors (minimizes kernel-level context switches)
  • #kernel threads = #MPI nodes (avoids dynamic thread creation)
• Every active kernel thread polls the system resource manager, which leads to one of:
  • Deactivation: suspending itself
  • Activation: waking up some suspended kernel threads
  • No action
• When to poll?
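A minimal sketch of that polling decision, building on the hypothetical manager_allocation() above; the counters and semaphore are illustrative, and production code would update them atomically.

```c
/* Sketch of the decision an active kernel thread makes when it polls
 * the resource manager (names and bookkeeping are illustrative). */
#include <semaphore.h>

typedef enum { NO_ACTION, DEACTIVATE } action_t;

extern int   manager_allocation(int app_id);  /* from the sketch above */
extern int   my_app_id;                       /* assumed per-application id */
extern int   num_active;                      /* active kernel threads in this app */
extern sem_t reactivate;                      /* suspended kernel threads wait here */

action_t poll_resource_manager(void)
{
    int allocated = manager_allocation(my_app_id);

    if (num_active > allocated) {             /* too many runners: suspend myself */
        num_active--;
        return DEACTIVATE;
    }
    while (num_active < allocated) {          /* too few: wake up suspended peers */
        num_active++;
        sem_post(&reactivate);
    }
    return NO_ACTION;                         /* allocation matches: keep running */
}
```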
Polling in User-Level Context Switch
• A context switch is the result of synchronization (e.g., an MPI node waits for a message).
• The underlying kernel thread polls the system resource manager during the context switch:
  • Two stack switches if deactivating (suspend on a dummy stack)
  • One stack switch otherwise
• After optimization, about 2 μs on average on the SGI Power Challenge
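The sketch below illustrates the one-versus-two stack switches. POSIX ucontext is used purely for exposition (TMPI's own switch code is far lighter), and sched_ctx is assumed to have been prepared with makecontext() so that it runs scheduler_loop() on its own stack.

```c
/* Illustrative user-level context switch with polling (not TMPI's code). */
#include <semaphore.h>
#include <ucontext.h>

typedef enum { NO_ACTION, DEACTIVATE } action_t;   /* as in the sketch above */
extern action_t    poll_resource_manager(void);
extern ucontext_t *pick_next_ready(void);          /* next runnable MPI node (assumed) */
extern sem_t       reactivate;

static __thread ucontext_t sched_ctx;   /* per-kernel-thread dummy stack (set up
                                           elsewhere with makecontext) */

/* Runs on the dummy stack: sleep until reactivated, then resume some MPI node. */
static void scheduler_loop(void)
{
    for (;;) {
        sem_wait(&reactivate);
        swapcontext(&sched_ctx, pick_next_ready());
    }
}

/* Called when the current MPI node blocks (e.g., waits for a message). */
void user_level_switch(ucontext_t *cur)
{
    if (poll_resource_manager() == DEACTIVATE) {
        /* Two stack switches: leave this MPI node's stack first, so another
         * kernel thread may resume it, then suspend on the dummy stack. */
        swapcontext(cur, &sched_ctx);
    } else {
        /* One stack switch: jump directly to the next runnable MPI node. */
        swapcontext(cur, pick_next_ready());
    }
}
```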
Outline
• Motivations & Related Work
• Adaptive Two-level Thread Management
• Scheduler-conscious Event Waiting
• Experimental Studies
Event Waiting Synchronization
• All MPI synchronization is based on waitEvent:
  • the waiter calls waitEvent(*pflag == value) and waits;
  • the caller sets *pflag = value and wakes up the waiter.
• Waiting could be:
  • spinning
  • yielding/blocking
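In its simplest spin-then-block form, waitEvent might look like the sketch below; the fixed spin budget and block_current_thread() are assumptions for illustration, not TMPI's actual implementation.

```c
/* Minimal spin-then-block sketch of waitEvent. */
#define SPIN_COUNT 1000                   /* assumed spin budget */

extern void block_current_thread(void);  /* user-level context switch (assumed) */

void waitEvent(volatile int *pflag, int value)
{
    for (int i = 0; i < SPIN_COUNT; i++)  /* spin first ... */
        if (*pflag == value)
            return;
    while (*pflag != value)               /* ... then yield/block until the */
        block_current_thread();           /* caller sets *pflag and wakes us */
}
```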
Tradeoff Between Spin and Block
• Basic rules for waiting using spin-then-block:
  • Spinning wastes CPU cycles.
  • Blocking introduces context switch overhead; always blocking is not good for dedicated environments.
• Previous work focuses on choosing the best spin time.
• Our optimization focus and findings:
  • Fast context switch has a substantial performance impact;
  • Use scheduling information to guide the spin/block decision:
    • Spinning is futile when the caller is not currently scheduled;
    • Most of the blocking cost comes from the cache flushing penalty (the actual cost varies, up to several ms).
Scheduler-conscious Event Waiting
• The user-level scheduler provides:
  • scheduling info
  • affinity info
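One way that information could steer the spin/block decision, in the spirit of the findings on the previous slide; the helpers caller_is_scheduled() and lost_cache_affinity(), and the exact policy, are hypothetical rather than TMPI's actual algorithm.

```c
/* Scheduler-conscious waitEvent sketch: consult the user-level scheduler's
 * scheduling and affinity information before spinning or blocking. */
#define SPIN_COUNT 1000

extern int  caller_is_scheduled(volatile int *pflag); /* is the setter running now? */
extern int  lost_cache_affinity(void);                /* was this node migrated? */
extern void block_current_thread(void);               /* user-level context switch */

void waitEvent_sc(volatile int *pflag, int value)
{
    while (*pflag != value) {
        if (!caller_is_scheduled(pflag)) {
            /* The thread that will set the flag is not running, so
             * spinning is futile: block immediately. */
            block_current_thread();
            continue;
        }
        /* Caller is running: spin, and spin longer when blocking would be
         * expensive because our cache state here is still warm. */
        int budget = lost_cache_affinity() ? SPIN_COUNT : 4 * SPIN_COUNT;
        for (int i = 0; i < budget && *pflag != value; i++)
            ;                                          /* busy-wait */
        if (*pflag != value)
            block_current_thread();
    }
}
```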
Experimental Settings
• Machines:
  • SGI Origin 2000 with 32 195MHz MIPS R10000 processors and 2GB memory
  • SGI Power Challenge with 4 200MHz MIPS R4400 processors and 256MB memory
• Compare among:
  • TMPI-2: TMPI with two-level thread management
  • SGI MPI: SGI's native MPI implementation
  • TMPI: original TMPI without two-level thread management
Testing Benchmarks
• Synchronization frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
• The higher the multiprogramming degree, the more spin-blocks (context switches) during each synchronization.
• The sparse LU benchmarks synchronize much more frequently than the others.
Performance Evaluation on a Multiprogrammed Workload
• Workload: a sequence of six jobs launched at a fixed interval.
• Compare job turnaround time on the Power Challenge.
Workload with Certain Multiprogramming Degrees
• Goal: identify the performance impact of multiprogramming degrees.
• Experimental setting:
  • Each workload has one benchmark program.
  • Run n MPI nodes on p processors (n ≥ p).
  • The multiprogramming degree is n/p.
• Compare megaflop rates or speedups of the kernel part of each application.
Performance Impact of Multiprogramming Degree (SGI Power Challenge)
Performance Impact of Multiprogramming Degree (SGI Origin 2000)
• Performance ratios of TMPI-2 over TMPI
• Performance ratios of TMPI-2 over SGI MPI
Benefits of Scheduler-conscious Event Waiting
• Improvement over simple spin-block on the Power Challenge
• Improvement over simple spin-block on the Origin 2000
Conclusions
Contributions for optimizing MPI execution:
• Adaptive two-level thread management; scheduler-conscious event waiting;
• Large performance improvements: up to an order of magnitude, depending on the application and load;
• In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
Current and future work:
• Support threaded MPI on SMP clusters
http://www.cs.ucsb.edu/research/tmpi