
Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines

This presentation examines thread-based MPI execution on multiprogrammed shared memory machines (SMMs) and presents techniques for faster synchronization and context switching that improve MPI performance on SMMs.



Presentation Transcript


  1. Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines Kai Shen, Hong Tang, and Tao Yang http://www.cs.ucsb.edu/research/tmpi Department of Computer Science University of California, Santa Barbara

  2. MPI-Based Parallel Computation on Shared Memory Machines • Shared Memory Machines (SMMs) or SMM clusters have become popular for high-end computing. • MPI is a portable, high-performance parallel programming model → MPI on SMMs. • Threads are easy to program, but MPI is still used on SMMs: • Better portability for running on other platforms (e.g., SMM clusters); • Good data locality due to data partitioning. Shen, Tang, and Yang @ SuperComputing'99

  3. Scheduling for Parallel Jobs in Multiprogrammed SMMs • Gang scheduling • Good for parallel programs that synchronize frequently; • Hurts resource utilization (processor fragmentation; not enough parallelism to use the allocated processors). • Space/time sharing • Time sharing combined with dynamic partitioning; • High throughput; popular in current OSes (e.g., IRIX 6.5). • Impact on MPI program execution • Not all MPI nodes are scheduled simultaneously; • The number of available processors for each application may change dynamically. • Optimization is needed for fast MPI execution on SMMs. Shen, Tang, and Yang @ SuperComputing'99

  4. Techniques Studied • Thread-Based MPI execution [PPoPP’99] • Compile-time transformation for thread-safe MPI execution • Fast context switch and synchronization • Fast communication through address sharing • Two-level thread management for multiprogrammed environments • Even faster context switch/synchronization • Use scheduling information to guide synchronization • Our prototype system: TMPI Shen, Tang, and Yang @ SuperComputing'99

  5. Impact of synchronization on coarse-grain parallel programs • Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3. • Synchronization costs 43%-84% of total time. • [Chart: execution time breakdown for TMPI and SGI MPI] Shen, Tang, and Yang @ SuperComputing'99

  6. Related Work • MPI-related work • MPICH, a portable MPI implementation [Gropp/Lusk et al.]. • SGI MPI, highly optimized on SGI platforms. • MPI-2, multithreading within a single MPI node. • Scheduling and synchronization • Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research. • Scheduler-conscious Synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks. • Hood/Cilk threads [Arora et al.] and Loop-level Scheduling [Yue/Lilja]: focus on fine-grain parallelism. Shen, Tang, and Yang @ SuperComputing'99

  7. Outline • Motivations & Related Work • Adaptive Two-level Thread Management • Scheduler-conscious Event Waiting • Experimental Studies Shen, Tang, and Yang @ SuperComputing'99

  8. Context Switch/Synchronization in Multiprogrammed Environments In multiprogrammed environments, synchronization leads to more context switches → a large performance impact. • Conventional MPI implementations map each MPI node to an OS process. • Our earlier work maps each MPI node to a kernel thread. • Two-level thread management: maps each MPI node to a user-level thread. • Faster context switch and synchronization among user-level threads; • Very few kernel-level context switches. Shen, Tang, and Yang @ SuperComputing'99

  9. System Architecture • Targeted at multiprogrammed environments • Two-level thread management • [Architecture diagram: each MPI application runs on its own TMPI runtime, which schedules user-level threads; all applications sit on top of a system-wide resource manager] Shen, Tang, and Yang @ SuperComputing'99

  10. Adaptive Two-level Thread Management • System-wide resource manager (OS kernel or user-level central monitor) • collects information about active MPI applications; • partitions processors among them. • Application-wide user-level thread management • maps each MPI node to a user-level thread; • schedules user-level threads on a pool of kernel threads; • keeps the number of active kernel threads close to the number of allocated processors. • Big picture (in the whole system): #active kernel threads ≈ #processors → minimal kernel-level context switching. Shen, Tang, and Yang @ SuperComputing'99
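A minimal sketch of the division of labor on this slide, assuming a hypothetical even-split policy and illustrative names (app_t, partition_processors); the slide does not give the manager's real policy, and the manager itself may live in the OS kernel or in a user-level central monitor:

```c
/* Sketch of a system-wide resource manager partitioning processors among
 * active MPI applications.  The even-split policy is an assumption made
 * only for illustration. */
#include <stdio.h>

#define NPROCS   8          /* processors in the machine (assumed) */
#define MAX_APPS 4

typedef struct {
    int id;
    int nodes;              /* MPI nodes (user-level threads) in this application */
    int allocated;          /* processors granted by the manager */
} app_t;

/* Split processors evenly, never granting more than an application can use. */
static void partition_processors(app_t *apps, int napps)
{
    int share = napps ? NPROCS / napps : 0;
    for (int i = 0; i < napps; i++) {
        apps[i].allocated = apps[i].nodes < share ? apps[i].nodes : share;
        if (apps[i].allocated < 1)
            apps[i].allocated = 1;   /* keep every application runnable */
    }
}

int main(void)
{
    app_t apps[MAX_APPS] = { {0, 8, 0}, {1, 4, 0}, {2, 2, 0} };
    partition_processors(apps, 3);
    for (int i = 0; i < 3; i++)
        printf("app %d: %d MPI nodes -> %d processors\n",
               apps[i].id, apps[i].nodes, apps[i].allocated);
    return 0;
}
```

Each application then reads back its allocation and adjusts its number of active kernel threads accordingly, as the next slide describes.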

  11. User-level Thread Scheduling • Every kernel thread can be: • active: executing an MPI node (user-level thread); • suspended. • Execution invariant for each application: • #active kernel threads ≈ #allocated processors (minimizes kernel-level context switches); • #kernel threads = #MPI nodes (avoids dynamic thread creation). • Every active kernel thread polls the system resource manager; each poll leads to one of: • Deactivation: suspending itself; • Activation: waking up some suspended kernel threads; • No action. • When to poll? Shen, Tang, and Yang @ SuperComputing'99
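The per-application side of that invariant amounts to a three-way decision on each poll. The sketch below uses illustrative names (poll_decision, app_state_t), not TMPI's actual interface:

```c
/* Decision an active kernel thread makes whenever it polls the manager:
 * suspend itself, wake a suspended peer, or do nothing, so that
 * #active kernel threads tracks #allocated processors. */
#include <stdio.h>

typedef enum { DEACTIVATE, ACTIVATE, NO_ACTION } poll_action_t;

typedef struct {
    int active_kthreads;    /* kernel threads currently executing MPI nodes */
    int allocated_procs;    /* processors granted by the resource manager */
} app_state_t;

static poll_action_t poll_decision(const app_state_t *s)
{
    if (s->active_kthreads > s->allocated_procs)
        return DEACTIVATE;  /* too many runners: this kernel thread suspends itself */
    if (s->active_kthreads < s->allocated_procs)
        return ACTIVATE;    /* spare processors: wake a suspended kernel thread */
    return NO_ACTION;
}

int main(void)
{
    app_state_t s = { .active_kthreads = 4, .allocated_procs = 2 };
    printf("action = %d (0=deactivate, 1=activate, 2=no action)\n",
           poll_decision(&s));
    return 0;
}
```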

  12. Polling in User-Level Context Switch • A context switch is the result of synchronization (e.g., an MPI node waits for a message). • The underlying kernel thread polls the system resource manager during the context switch: • Two stack switches if deactivating → suspend on a dummy stack; • One stack switch otherwise. • After optimization, about 2 μs on average on the SGI Power Challenge. Shen, Tang, and Yang @ SuperComputing'99
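To make the "two stack switches" deactivation path concrete, here is a small self-contained toy using the POSIX ucontext API (getcontext/makecontext/swapcontext). It only illustrates the idea, not TMPI's context-switch code: a real runtime would suspend the kernel thread from the dummy stack rather than return to a scheduler context, and the common case needs just one switch to the next ready MPI node.

```c
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t sched_ctx, node_ctx, dummy_ctx;

/* Runs on the small dummy stack: here the kernel thread would suspend itself,
 * leaving the MPI node's own stack free for another kernel thread to resume. */
static void dummy_entry(void)
{
    printf("second switch: on the dummy stack, kernel thread would now suspend\n");
    swapcontext(&dummy_ctx, &sched_ctx);   /* stand-in for the real suspension */
}

/* Runs on the MPI node's stack: the node reaches a synchronization point and
 * the poll says "deactivate", so it performs the first of two stack switches. */
static void node_entry(void)
{
    printf("MPI node blocks at a synchronization point\n");
    swapcontext(&node_ctx, &dummy_ctx);    /* first switch: node stack -> dummy stack */
}

static void make_ctx(ucontext_t *ctx, void (*fn)(void))
{
    getcontext(ctx);
    ctx->uc_stack.ss_sp   = malloc(64 * 1024);
    ctx->uc_stack.ss_size = 64 * 1024;
    ctx->uc_link          = &sched_ctx;
    makecontext(ctx, fn, 0);
}

int main(void)
{
    make_ctx(&dummy_ctx, dummy_entry);
    make_ctx(&node_ctx, node_entry);
    swapcontext(&sched_ctx, &node_ctx);    /* start the MPI node's user-level thread */
    printf("deactivation cost: two stack switches (one in the common case)\n");
    return 0;
}
```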

  13. Outline • Motivations & Related Work • Adaptive Two-level Thread Management • Scheduler-conscious Event Waiting • Experimental Studies Shen, Tang, and Yang @ SuperComputing'99

  14. Event Waiting Synchronization • All MPI synchronization is based on waitEvent. • Waiter: waitEvent(*pflag == value); waiting could be: • spinning; • yielding/blocking. • Caller: *pflag = value; then wakes up the waiter. Shen, Tang, and Yang @ SuperComputing'99
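A blocking-only version of this primitive could be sketched on POSIX threads as below; the names (event_t, wait_event, set_event) are illustrative, and the spin and spin-then-block variants are the subject of the next two slides.

```c
/* Minimal blocking waitEvent sketch: the waiter sleeps until *pflag == value;
 * the caller stores the value and wakes the waiter. */
#include <pthread.h>

typedef struct {
    volatile int    *pflag;   /* flag word being waited on */
    pthread_mutex_t  lock;
    pthread_cond_t   cond;
} event_t;

void event_init(event_t *ev, volatile int *pflag)
{
    ev->pflag = pflag;
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->cond, NULL);
}

/* Waiter side: waitEvent(*pflag == value) */
void wait_event(event_t *ev, int value)
{
    pthread_mutex_lock(&ev->lock);
    while (*ev->pflag != value)
        pthread_cond_wait(&ev->cond, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}

/* Caller side: *pflag = value; wakeup */
void set_event(event_t *ev, int value)
{
    pthread_mutex_lock(&ev->lock);
    *ev->pflag = value;
    pthread_mutex_unlock(&ev->lock);
    pthread_cond_signal(&ev->cond);
}
```

Compile with -pthread; TMPI chooses the waiting policy adaptively rather than always blocking as in this sketch.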

  15. Tradeoff between spin and block • Basic observations on spin-then-block waiting: • Spinning wastes CPU cycles; • Blocking introduces context-switch overhead; always blocking is not good for dedicated environments. • Previous work focuses on choosing the best spin time. • Our optimization focus and findings: • Fast context switch has a substantial performance impact; • Use scheduling information to guide the spin/block decision: • Spinning is futile when the caller is not currently scheduled; • Most of the blocking cost comes from the cache-flushing penalty (actual cost varies, up to several ms). Shen, Tang, and Yang @ SuperComputing'99
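The baseline that prior work tunes is a fixed spin-then-block loop, roughly as sketched below; SPIN_LIMIT is an arbitrary illustrative bound, and a real runtime would block on an event queue instead of calling sched_yield.

```c
#include <sched.h>

#define SPIN_LIMIT 1000     /* prior work tunes this "best spin time" */

/* Plain spin-then-block waiter: spin for a bounded number of iterations,
 * then give up the processor until the flag is set. */
void spin_then_block_wait(volatile int *pflag, int value)
{
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (*pflag == value)
            return;          /* event arrived while spinning: no context switch */

    while (*pflag != value)
        sched_yield();       /* yield/block path: pays the context-switch and
                                cache-refill cost described on this slide */
}
```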

  16. Scheduler-conscious Event Waiting • The user-level scheduler provides: • scheduling info; • affinity info. • This information guides the spin/block decision during event waiting. Shen, Tang, and Yang @ SuperComputing'99
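A hedged sketch of how scheduling information might drive the spin/block choice; is_peer_scheduled and block_on_event are hypothetical hooks standing in for the scheduling/affinity information the user-level scheduler exports, not TMPI's actual API.

```c
#define SPIN_LIMIT 1000

/* Hypothetical scheduler hooks (assumed for illustration only). */
extern int  is_peer_scheduled(int peer_node);                 /* is the flag setter running now? */
extern void block_on_event(volatile int *pflag, int value);   /* user-level context switch */

/* Scheduler-conscious waiting: only spin when spinning can succeed. */
void scheduler_conscious_wait(volatile int *pflag, int value, int peer_node)
{
    if (!is_peer_scheduled(peer_node)) {
        /* The MPI node that will set the flag holds no processor, so spinning
         * is futile: block (user-level context switch) immediately. */
        block_on_event(pflag, value);
        return;
    }
    /* The setter is running: a short spin will likely succeed and avoids the
     * blocking cost, which is dominated by the cache-flushing penalty. */
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (*pflag == value)
            return;
    block_on_event(pflag, value);
}
```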

  17. Experimental Settings • Machines: • SGI Origin 2000 with 32 195 MHz MIPS R10000 processors and 2 GB memory; • SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB memory. • Compare among: • TMPI-2: TMPI with two-level thread management; • SGI MPI: SGI's native MPI implementation; • TMPI: original TMPI without two-level thread management. Shen, Tang, and Yang @ SuperComputing'99

  18. Testing Benchmarks • Synchronization frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge. • The higher the multiprogramming degree, the more spin-blocks (context switches) during each synchronization. • The sparse LU benchmarks have much more frequent synchronization than the others. Shen, Tang, and Yang @ SuperComputing'99

  19. Performance Evaluation on a Multiprogrammed Workload • The workload contains a sequence of six jobs launched at fixed intervals. • Compare job turnaround time on the Power Challenge. Shen, Tang, and Yang @ SuperComputing'99

  20. Workload with Certain Multiprogramming Degrees • Goal: identify the performance impact of multiprogramming degrees. • Experimental setting: • Each workload has one benchmark program. • Run n MPI nodes on p processors (n≥p). • Multiprogramming degree is n/p. • Compare megaflop rates or speedups of the kernel part of each application. Shen, Tang, and Yang @ SuperComputing'99

  21. Performance Impact of Multiprogramming Degree (SGI Power Challenge) Shen, Tang, and Yang @ SuperComputing'99

  22. Performance Impact of Multiprogramming Degree (SGI Origin 2000) • [Charts: performance ratios of TMPI-2 over TMPI; performance ratios of TMPI-2 over SGI MPI] Shen, Tang, and Yang @ SuperComputing'99

  23. Benefits of Scheduler-conscious Event Waiting • [Charts: improvement over simple spin-block on the Power Challenge and on the Origin 2000] Shen, Tang, and Yang @ SuperComputing'99

  24. Conclusions • Contributions for optimizing MPI execution: • Adaptive two-level thread management; scheduler-conscious event waiting; • Large performance improvements: up to an order of magnitude, depending on the application and load; • In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs. • Current and future work: • Support threaded MPI on SMP clusters. http://www.cs.ucsb.edu/research/tmpi Shen, Tang, and Yang @ SuperComputing'99
