Understanding Performance, Power and Energy Behavior in Asymmetric Processors
Nagesh B Lakshminarayana, Hyesoon Kim
School of Computer Science, Georgia Institute of Technology
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Heterogeneous Architectures • A particularly interesting class of parallel machines is heterogeneous architectures • Multiple types of Processing Elements (PEs) are available on the same machine • [Diagram: several PEs of type B and one PE of type A connected by an interconnect]
Heterogeneous Architectures • Heterogeneous architectures are becoming very common • [Diagram: a special accelerator such as the IBM Cell processor, and an asymmetric processor combining fast cores and slow cores] • Focus of this talk: Asymmetric Processors
Machine Configurations • M-I experiments use 8 threads, M-II experiments use 16 threads • Asymmetric processors (AMPs) are emulated using SpeedStep/PowerNow frequency scaling
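The talk emulates slow cores by lowering the clock frequency of some cores through SpeedStep/PowerNow. Purely as an illustration of the same idea (not the setup used in the talk), the sketch below caps the frequency of a few cores through the Linux cpufreq sysfs interface; the core indices and target frequency are arbitrary assumptions.

```c
/* Illustrative sketch only: emulate "slow" cores on Linux by capping their
 * maximum frequency via the cpufreq sysfs interface.
 * (The talk used SpeedStep/PowerNow; this is just the analogous mechanism.) */
#include <stdio.h>

static int cap_core_freq(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);

    FILE *f = fopen(path, "w");   /* requires root privileges */
    if (!f) {
        perror("fopen");
        return -1;
    }
    fprintf(f, "%ld\n", khz);     /* frequency is written in kHz */
    fclose(f);
    return 0;
}

int main(void)
{
    /* Assumed example: make cores 4-7 "slow" by capping them at 1.0 GHz. */
    for (int cpu = 4; cpu <= 7; cpu++)
        cap_core_freq(cpu, 1000000);
    return 0;
}
```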
Power Measurement • Total system power consumption measured using an Extech 380801 Power Analyzer • [Diagram: the experiment machine draws power from the socket through the analyzer, which reports measurements to a Windows logging machine over a serial cable]
PARSEC Benchmark Suite • Desktop-oriented multithreaded benchmark suite • Workloads from animation, data mining, and financial analysis • Parallelized with Pthreads and OpenMP
Performance of PARSEC Benchmarks • [Chart: normalized execution time, with benchmarks grouped as slow-limited, middle-perf, and unstable] • On average, the performance of the half-half configuration is between that of all-slow and all-fast
Classification of Benchmarks • [Diagram: thread execution between barriers for the three classes: (a) slow-limited, (b) middle-perf, (c) unstable]
Energy Consumption of PARSEC • [Chart: normalized energy consumption for the slow-limited and middle-perf groups] • In half-half/all-slow, total energy consumption is higher even though the average power consumed might be lower
Behavior of PARSEC Benchmarks • Observations – Different applications behave differently on AMPs – Usually an SMP with fast processors saves energy
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Thread Interactions • Sources of thread interactions: Critical Sections and Barriers
Critical Sections (CS) • Threads wait to enter CSs • [Diagram: two cases, (a) and (b), showing time spent in the critical section, doing useful work, and waiting]
Barriers • Threads wait at a barrier for the other threads to finish • [Diagram: threads arriving at a barrier at different times]
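For concreteness, here is a minimal Pthreads sketch of the two interaction sources above: threads serialize on a mutex-protected critical section and then wait for each other at a barrier. The work function, thread count, and shared counter are illustrative placeholders, not from the talk.

```c
/* Minimal sketch of the two thread-interaction sources: a mutex-protected
 * critical section and a barrier. NUM_THREADS and the work are placeholders.
 * Compile with: gcc -pthread ... */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8

static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;
static long shared_counter = 0;

static void do_useful_work(void) { /* private, fully parallel work */ }

static void *worker(void *arg)
{
    (void)arg;
    do_useful_work();

    /* Critical section: on an AMP, a thread on a slow core holding the
     * lock delays threads on fast cores that are waiting to enter. */
    pthread_mutex_lock(&cs_lock);
    shared_counter++;
    pthread_mutex_unlock(&cs_lock);

    /* Barrier: every thread waits until the slowest thread arrives,
     * so overall progress is limited by the slow cores. */
    pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&barrier);
    printf("counter = %ld\n", shared_counter);
    return 0;
}
```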
Effect of Critical Section Length • CS-limited application • [Chart: normalized power consumption vs. critical section length] • As critical section length increases, the average power consumed decreases
Effect of Critical Section Length • CS-limited application • [Chart: normalized execution time vs. critical section length] • Performance of AMPs is sensitive to CS length
Effect of Critical Section Length • CS-limited application • [Chart: normalized energy consumption vs. critical section length] • Energy consumption shows the same trend
Effect of Critical Section Frequency • Both the length and the frequency of CSs affect performance and energy consumption • As CS frequency increases, the performance difference between half-half and all-fast shrinks • If the majority of the execution time is spent waiting for locks, it is acceptable to have a few slow processors • Results are available in the paper
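As an illustration of the kind of CS-limited microbenchmark these experiments describe (the actual benchmark from the talk is not shown; the loop structure and the CS_LEN / CS_PERIOD knobs below are assumptions), each thread alternates private work with a lock-protected critical section whose length and frequency can be varied:

```c
/* Hypothetical CS-limited worker: each thread alternates private work with a
 * lock-protected critical section. CS_LEN and CS_PERIOD are assumed knobs
 * controlling critical-section length and frequency. */
#include <pthread.h>

#define ITERATIONS 100000L
#define CS_LEN     1000L   /* spin units inside the critical section (length) */
#define CS_PERIOD  10L     /* enter the CS every CS_PERIOD iterations (frequency) */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void spin(long units)
{
    volatile long x = 0;            /* volatile keeps the spin from being optimized away */
    for (long i = 0; i < units; i++)
        x += i;
}

void *cs_limited_worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        spin(100);                  /* private, parallel work */

        if (i % CS_PERIOD == 0) {   /* smaller CS_PERIOD => more frequent CSs */
            pthread_mutex_lock(&lock);
            spin(CS_LEN);           /* serialized work; larger CS_LEN => more waiting */
            pthread_mutex_unlock(&lock);
        }
    }
    return NULL;
}
```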
Effect of Barriers • With few barriers, half-half performs similarly to all-slow • With a large number of barriers, half-half performs similarly to all-fast • Results are available in the paper
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Dynamic Scheduling • Motivation: better run-time adaptivity • Each thread requests more work after completing its assigned work • Supported by OpenMP and Intel Threading Building Blocks
Dynamic Scheduling • Can help improve performance and reduce energy consumption on AMPs • Should be preferred to the static and guided policies • Evaluated with a parallel-for application
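A minimal OpenMP sketch of the dynamic policy the slides recommend: iterations are handed out in chunks on demand, so threads running on fast cores naturally claim more iterations than threads on slow cores. The dummy work function and the chunk size of 16 are illustrative assumptions.

```c
/* Sketch of OpenMP dynamic scheduling: idle threads grab the next chunk of
 * iterations on demand, so fast cores end up doing more of the work.
 * The work() body and chunk size are illustrative assumptions.
 * Compile with: gcc -fopenmp ... */

static void work(int i)
{
    /* Placeholder for one unit of loop work; volatile keeps it from being
     * optimized away, and the modulus makes the work per iteration uneven. */
    volatile double x = 0.0;
    for (int k = 0; k < 1000 * (i % 4 + 1); k++)
        x += k;
}

void run_parallel_for(int n)
{
    /* schedule(static) would split iterations evenly up front, leaving fast
     * cores idle while slow cores finish; schedule(dynamic, 16) lets each
     * thread request 16 more iterations whenever it runs out. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        work(i);
}
```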
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Scheduling in AMPs • Longest Job to a Fast Processor First (LJFPF) [Lakshminarayana'08] • [Diagram: jobs of different lengths assigned to fast and slow cores, meeting at a barrier]
How Does the Scheduler Know the Length of Work? • Current mechanism: the application sends task-length information to the scheduler • Ongoing work: a prediction mechanism
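A conceptual sketch of how an LJFPF-style assignment could use such length hints (this is not the authors' implementation; the task structure and assignment loop below are assumptions): sort tasks by their reported length and hand the longest ones to the fast cores first.

```c
/* Conceptual sketch of Longest Job to a Fast Processor First (LJFPF).
 * Not the authors' implementation: the task struct, length hints, and the
 * core-indexing convention are assumptions for illustration. */
#include <stdlib.h>

typedef struct {
    int    id;
    double length_hint;   /* task length reported by the application */
    int    assigned_core; /* core index chosen by the scheduler */
} task_t;

static int by_length_desc(const void *a, const void *b)
{
    double la = ((const task_t *)a)->length_hint;
    double lb = ((const task_t *)b)->length_hint;
    return (la < lb) - (la > lb);   /* longer tasks sort first */
}

/* Assumption: the fast cores have the lowest core indices. Assigning tasks
 * round-robin in decreasing length therefore gives the longest jobs to the
 * fast cores first, so all threads finish closer to the next barrier. */
void ljfpf_assign(task_t *tasks, int num_tasks, int num_cores)
{
    qsort(tasks, (size_t)num_tasks, sizeof(task_t), by_length_desc);
    for (int i = 0; i < num_tasks; i++)
        tasks[i].assigned_core = i % num_cores;
}
```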
LJFPF Evaluation • ITK: open-source medical image processing applications • MultiRegistration (registration method): a kernel with 50 iterations, divided among 8 threads • [Charts: normalized execution time and normalized energy consumption]
Outline • Background and Motivation • Thread Interactions • Dynamic Scheduling • Asymmetry Aware Scheduling • Conclusion and Future Work
Conclusion & Future Work • Conclusion – Evaluated the performance and energy consumption behavior of multithreaded applications on AMPs – For symmetric workloads with little thread interaction: an SMP with fast processors is preferable – For symmetric workloads with a lot of thread interaction: an AMP could be better – For asymmetric threads: an AMP could provide the lowest energy consumption • Future Work – Predict application characteristics and use the predicted information for thread scheduling on AMPs