Understanding Performance, Power and Energy Behavior in Asymmetric Processors
Nagesh B Lakshminarayana, Hyesoon Kim
School of Computer Science, Georgia Institute of Technology
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Heterogeneous Architectures • A particularly interesting class of parallel machines is heterogeneous architectures • Multiple types of Processing Elements (PEs) are available on the same machine • [Diagram: several PEs of type B and one PE of type A connected by an interconnect]
Heterogeneous Architectures • Heterogeneous architectures are becoming very common • [Diagram: a special accelerator such as the IBM Cell processor, and an asymmetric processor combining fast cores and slow cores] • Focus of this talk: Asymmetric Processors
Machine Configurations • M-I experiments use 8 threads, M-II experiments use 16 threads • Asymmetric processors (AMPs) are emulated using SpeedStep/PowerNow frequency scaling
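The talk emulates slow cores by lowering the clock frequency of some cores through SpeedStep/PowerNow. Purely as an illustration of the same idea (not the setup used in the talk), the sketch below caps the frequency of a few cores through the Linux cpufreq sysfs interface; the core indices and target frequency are arbitrary assumptions.

```c
/* Illustrative sketch only: emulate "slow" cores on Linux by capping their
 * maximum frequency via the cpufreq sysfs interface.
 * (The talk used SpeedStep/PowerNow; this is just the analogous mechanism.) */
#include <stdio.h>

static int cap_core_freq(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);

    FILE *f = fopen(path, "w");   /* requires root privileges */
    if (!f) {
        perror("fopen");
        return -1;
    }
    fprintf(f, "%ld\n", khz);     /* frequency is written in kHz */
    fclose(f);
    return 0;
}

int main(void)
{
    /* Assumed example: make cores 4-7 "slow" by capping them at 1.0 GHz. */
    for (int cpu = 4; cpu <= 7; cpu++)
        cap_core_freq(cpu, 1000000);
    return 0;
}
```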
Power Measurement • Total system power consumption measured using an Extech 380801 Power Analyzer • [Diagram: the experiment machine draws power from the socket through the analyzer, which reports measurements to a Windows logging machine over a serial cable]
PARSEC Benchmark Suite • Desktop-oriented multithreaded benchmark suite • Workloads from animation, data mining, and financial analysis • Parallelized with Pthreads and OpenMP
Performance of PARSEC Benchmarks • [Chart: normalized execution time, with benchmarks grouped as slow-limited, middle-perf, and unstable] • On average, the performance of the half-half configuration is between that of all-slow and all-fast
Classification of Benchmarks • [Diagram: thread execution between barriers for the three classes: (a) slow-limited, (b) middle-perf, (c) unstable]
Energy Consumption of PARSEC • [Chart: normalized energy consumption for the slow-limited and middle-perf groups] • In half-half/all-slow, total energy consumption is higher even though the average power consumed might be lower
Behavior of PARSEC Benchmarks • Observations – Different applications behave differently on AMPs – Usually an SMP with fast processors saves energy
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Thread Interactions • Sources of thread interactions: Critical Sections and Barriers
Critical Sections (CS) • Threads wait to enter CSs • [Diagram: two cases, (a) and (b), showing time spent in the critical section, doing useful work, and waiting]
Barriers • Threads wait at a barrier for the other threads to finish • [Diagram: threads arriving at a barrier at different times]
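For concreteness, here is a minimal Pthreads sketch of the two interaction sources above: threads serialize on a mutex-protected critical section and then wait for each other at a barrier. The work function, thread count, and shared counter are illustrative placeholders, not from the talk.

```c
/* Minimal sketch of the two thread-interaction sources: a mutex-protected
 * critical section and a barrier. NUM_THREADS and the work are placeholders.
 * Compile with: gcc -pthread ... */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8

static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;
static long shared_counter = 0;

static void do_useful_work(void) { /* private, fully parallel work */ }

static void *worker(void *arg)
{
    (void)arg;
    do_useful_work();

    /* Critical section: on an AMP, a thread on a slow core holding the
     * lock delays threads on fast cores that are waiting to enter. */
    pthread_mutex_lock(&cs_lock);
    shared_counter++;
    pthread_mutex_unlock(&cs_lock);

    /* Barrier: every thread waits until the slowest thread arrives,
     * so overall progress is limited by the slow cores. */
    pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&barrier);
    printf("counter = %ld\n", shared_counter);
    return 0;
}
```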
Effect of Critical Section Length • CS-limited application • [Chart: normalized power consumption vs. critical section length] • As critical section length increases, the average power consumed decreases
Effect of Critical Section Length • CS-limited application • [Chart: normalized execution time vs. critical section length] • Performance of AMPs is sensitive to CS length
Effect of Critical Section Length • CS-limited application • [Chart: normalized energy consumption vs. critical section length] • Energy consumption shows the same trend
Effect of Critical Section Frequency • Both the length and the frequency of CSs affect performance and energy consumption • As CS frequency increases, the performance difference between half-half and all-fast shrinks • If the majority of the execution time is spent waiting for locks, it is acceptable to have a few slow processors • Results are available in the paper
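As an illustration of the kind of CS-limited microbenchmark these experiments describe (the actual benchmark from the talk is not shown; the loop structure and the CS_LEN / CS_PERIOD knobs below are assumptions), each thread alternates private work with a lock-protected critical section whose length and frequency can be varied:

```c
/* Hypothetical CS-limited worker: each thread alternates private work with a
 * lock-protected critical section. CS_LEN and CS_PERIOD are assumed knobs
 * controlling critical-section length and frequency. */
#include <pthread.h>

#define ITERATIONS 100000L
#define CS_LEN     1000L   /* spin units inside the critical section (length) */
#define CS_PERIOD  10L     /* enter the CS every CS_PERIOD iterations (frequency) */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void spin(long units)
{
    volatile long x = 0;            /* volatile keeps the spin from being optimized away */
    for (long i = 0; i < units; i++)
        x += i;
}

void *cs_limited_worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        spin(100);                  /* private, parallel work */

        if (i % CS_PERIOD == 0) {   /* smaller CS_PERIOD => more frequent CSs */
            pthread_mutex_lock(&lock);
            spin(CS_LEN);           /* serialized work; larger CS_LEN => more waiting */
            pthread_mutex_unlock(&lock);
        }
    }
    return NULL;
}
```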
Effect of Barriers • With few barriers, half-half performs similarly to all-slow • With a large number of barriers, half-half performs similarly to all-fast • Results are available in the paper
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Dynamic Scheduling • Motivation: better run-time adaptivity • Each thread requests more work after completing its assigned work • Supported by OpenMP and Intel Threading Building Blocks
Dynamic Scheduling • Can help improve performance and reduce energy consumption on AMPs • Should be preferred to the static and guided policies • Evaluated with a parallel-for application
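A minimal OpenMP sketch of the dynamic policy the slides recommend: iterations are handed out in chunks on demand, so threads running on fast cores naturally claim more iterations than threads on slow cores. The dummy work function and the chunk size of 16 are illustrative assumptions.

```c
/* Sketch of OpenMP dynamic scheduling: idle threads grab the next chunk of
 * iterations on demand, so fast cores end up doing more of the work.
 * The work() body and chunk size are illustrative assumptions.
 * Compile with: gcc -fopenmp ... */

static void work(int i)
{
    /* Placeholder for one unit of loop work; volatile keeps it from being
     * optimized away, and the modulus makes the work per iteration uneven. */
    volatile double x = 0.0;
    for (int k = 0; k < 1000 * (i % 4 + 1); k++)
        x += k;
}

void run_parallel_for(int n)
{
    /* schedule(static) would split iterations evenly up front, leaving fast
     * cores idle while slow cores finish; schedule(dynamic, 16) lets each
     * thread request 16 more iterations whenever it runs out. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        work(i);
}
```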
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Scheduling in AMPs • Longest Job to a Fast Processor First (LJFPF) [Lakshminarayana'08] • [Diagram: jobs of different lengths assigned to fast and slow cores, meeting at a barrier]
How Does the Scheduler Know the Length of Work? • Current mechanism: the application sends task-length information to the scheduler • Ongoing work: a prediction mechanism
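A conceptual sketch of how an LJFPF-style assignment could use such length hints (this is not the authors' implementation; the task structure and assignment loop below are assumptions): sort tasks by their reported length and hand the longest ones to the fast cores first.

```c
/* Conceptual sketch of Longest Job to a Fast Processor First (LJFPF).
 * Not the authors' implementation: the task struct, length hints, and the
 * core-indexing convention are assumptions for illustration. */
#include <stdlib.h>

typedef struct {
    int    id;
    double length_hint;   /* task length reported by the application */
    int    assigned_core; /* core index chosen by the scheduler */
} task_t;

static int by_length_desc(const void *a, const void *b)
{
    double la = ((const task_t *)a)->length_hint;
    double lb = ((const task_t *)b)->length_hint;
    return (la < lb) - (la > lb);   /* longer tasks sort first */
}

/* Assumption: the fast cores have the lowest core indices. Assigning tasks
 * round-robin in decreasing length therefore gives the longest jobs to the
 * fast cores first, so all threads finish closer to the next barrier. */
void ljfpf_assign(task_t *tasks, int num_tasks, int num_cores)
{
    qsort(tasks, (size_t)num_tasks, sizeof(task_t), by_length_desc);
    for (int i = 0; i < num_tasks; i++)
        tasks[i].assigned_core = i % num_cores;
}
```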
LJFPF Evaluation • ITK: open-source medical image processing applications • MultiRegistration (registration method): a kernel with 50 iterations, divided among 8 threads • [Charts: normalized execution time and normalized energy consumption]
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Conclusion & Future Work • Conclusion – Evaluated the performance and energy consumption behavior of multithreaded applications on AMPs – For symmetric workloads with little thread interaction: an SMP with fast processors is preferable – For symmetric workloads with a lot of thread interaction: an AMP could be better – For asymmetric threads: an AMP could provide the lowest energy consumption • Future Work – Predict application characteristics and use the predicted information for thread scheduling on AMPs