When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency

When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency • Gaurav Chadha, Scott Mahlke, Satish Narayanasamy • University of Michigan

Motivation • Hardware trends • CMPs are ubiquitous. • More and more cores in a system • Mobile: Qualcomm Snapdragon, Samsung Exynos, NVIDIA Tegra 3. • Server: Tilera • Multi-threaded applications are pervasive. • But, do we always want to maximize the number of threads? NO

Run fewer threads: DVFS • Most multi-threaded applications stop scaling beyond a certain number of cores. • It becomes counter-productive to run more threads. • Maximum power budget is fixed for a system. • Fewer cores can “borrow” power from disabled cores. • Intel Turbo Boost: frequency increases in steps of 133 MHz.

Scalability: Problems • Too many threads • Increased contention for shared resources. • Increased synchronization costs. • Too few threads • Underutilization of resources.

Scalability: Fewer threads are better • 4 threads are best for streamcluster

Scalability: Fewer threads are as good • ferret, facesim, x264, dedup show poor scalability

Scalability: Opportunities • Run fewer threads • Disable some cores and increase frequency of the active ones.

Run fewer threads: DVFS • DVFS makes the case for fewer threads more compelling. • With fewer threads • increase frequency • reduce contention. • 5 out of 11 benchmarks • Who can decide the best number of threads? • [Figure: Frequency (GHz) per configuration. First series: 1.1, 1.1, 1.1, 1.1, 1.1, 1.1. Second series: 3.6, 2.8, 2.2, 1.8, 1.4, 1.1.]

DVFS in current systems • Programmer decides how many threads to run (e.g. 32 threads on 32 cores) • Inputs change • System resources change • Different hardware configurations • Program characteristics change • Turbo Boost increases frequency • [Figure: execution progress timeline. Stalled threads: 10, then 12, then 16. Frequency: 1.1 GHz, 1.1 GHz, 1.1 GHz, 1.4 GHz.]

Our system • Detection logic pro-actively disables more threads • Frequency is increased • [Figure: execution progress timeline with Turbo Boost. 10 threads stalled, then 12 threads stalled, then 16 threads stalled/disabled. Frequency: 1.1 GHz, 1.1 GHz, 1.4 GHz.]

Less Is MOre (LIMO) • Less Is MOre for efficiency • Observation: • Most programs do not scale after a certain limit • DVFS can help provide better performance • A runtime system • Monitors shared resource contention (shared cache, shared program variables) • Pro-actively disables threads • Employs DVFS

Outline

Roadblocks: Shared Cache • Roadblocks • Physical shared resources • Program level shared resources • Shared cache

Roadblocks: Shared Cache • Abstract representation of most multi-threaded programs • The peak performance point shifts depending on working set size and shared cache size • [Figure labels: Best performance; Working set fits in shared cache; Working set does not fit in shared cache; Working set too large]
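The shared-cache roadblock can be illustrated with a small sketch. It is not the detection logic from the talk; it assumes, purely for illustration, that each thread adds the same amount to the combined working set, and the names `max_threads_that_fit`, `per_thread_ws_kb`, and `shared_cache_kb` as well as the sizes used are hypothetical. The thread counts mirror the later LIMO slide, where the working set of 10 threads fits in cache and 6 threads are disabled.

```python
from typing import Optional


def max_threads_that_fit(active_threads: int,
                         per_thread_ws_kb: int,
                         shared_cache_kb: int) -> Optional[int]:
    """Return the largest thread count whose combined working set fits.

    Assumption (not from the slides): the working set grows linearly,
    per_thread_ws_kb for every active thread.
    Returns None when even one thread's working set is too large.
    """
    for n in range(active_threads, 0, -1):
        if n * per_thread_ws_kb <= shared_cache_kb:
            return n
    return None


# Hypothetical sizes chosen so that 10 of 16 threads fit.
active = 16
keep = max_threads_that_fit(active, per_thread_ws_kb=800, shared_cache_kb=8192)
if keep is not None:
    print(f"keep {keep} threads, disable {active - keep}")
```

Returning `None` leaves the "working set too large" case to the caller, since the slides do not say what the runtime does there.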

Roadblocks: Program Resources • Roadblocks • Physical shared resources • Program level shared resources • Shared cache • Synchronization stalls (locks)

Roadblocks: Program Resources • Increased parallelism gives more performance • Increased synchronization costs hurt performance • [Figure label: Best performance]
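The trade-off on this slide can be made concrete with a toy model. This is an assumption for illustration only, not a model taken from the talk: run time is taken to be the parallel work divided across threads plus a synchronization cost that grows linearly with the thread count. The function names and parameter values are hypothetical.

```python
def run_time(threads: int, parallel_work: float, sync_cost_per_thread: float) -> float:
    """Toy model: work shrinks with more threads, synchronization cost grows."""
    return parallel_work / threads + sync_cost_per_thread * threads


def best_thread_count(max_threads: int, parallel_work: float,
                      sync_cost_per_thread: float) -> int:
    """Thread count in 1..max_threads with the lowest modeled run time."""
    return min(range(1, max_threads + 1),
               key=lambda n: run_time(n, parallel_work, sync_cost_per_thread))


# Hypothetical parameters: 100 units of work, 1 unit of sync cost per thread.
best = best_thread_count(32, parallel_work=100.0, sync_cost_per_thread=1.0)
print(best, run_time(best, 100.0, 1.0), run_time(32, 100.0, 1.0))
```

Under these made-up parameters the modeled run time is lowest well below the 32 available threads, which is the shape of curve the slide describes.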

LIMO • After 100 million instructions, working set size estimate calculated • Working set of 10 threads fits in cache - 6 threads disabled • Pro-actively disables more threads • 20 threads at 1.1 GHz: 20 * 1.1 = 22 • 16 threads at 1.4 GHz: 16 * 1.4 = 22.4 • Pro-actively disables more threads • 10 threads at 1.4 GHz: 10 * 1.4 = 14 • 8 threads at 1.8 GHz: 8 * 1.8 = 14.4 • Frequency is increased • [Figure: execution progress timeline. 10 threads stalled, then 12 threads stalled, then 16 threads stalled/disabled, then 8 threads disabled. Frequency: 1.1 GHz, 1.1 GHz, 1.4 GHz, 1.8 GHz.]
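The arithmetic on this slide multiplies the number of running threads by the core frequency. A minimal sketch of that comparison follows; reading the product as the criterion for disabling further threads is an interpretation of the slide, and the names `throughput_proxy` and `worth_disabling` are hypothetical.

```python
def throughput_proxy(threads: int, freq_ghz: float) -> float:
    """Threads x frequency, the aggregate figure shown on the slide."""
    return threads * freq_ghz


def worth_disabling(current: tuple, proposed: tuple) -> bool:
    """True if the proposed (threads, GHz) pair has at least the same proxy.

    Assumption: a non-decreasing product is what justifies the switch.
    """
    return throughput_proxy(*proposed) >= throughput_proxy(*current)


# The two steps shown on the slide.
steps = [((20, 1.1), (16, 1.4)), ((10, 1.4), (8, 1.8))]
for current, proposed in steps:
    print(current, round(throughput_proxy(*current), 1),
          "->", proposed, round(throughput_proxy(*proposed), 1),
          worth_disabling(current, proposed))
```

In both steps the configuration with fewer threads has the slightly higher product (22.4 against 22, and 14.4 against 14).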

Methodology: Configuration • Modified timing simulator FeS2 which uses Simics. • Hardware configuration:

Methodology: Simulation • 9 evenly spaced checkpoints • Timing simulations starting from these checkpoints • 80 million useful instructions simulated/checkpoint • Statistics cleared after the first 20 million • Useful instructions: committed in user mode, excluding spin loops. • Benchmarks from the PARSEC benchmark suite, Apache web server (httpd), speech recognition benchmark (sphinx) from ALP.
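The measurement window implied by these numbers can be worked out in a few lines. This sketch assumes the first 20 million instructions are part of the 80 million simulated per checkpoint (the slide does not say so explicitly); the constant and function names are hypothetical.

```python
CHECKPOINTS = 9
SIMULATED_PER_CHECKPOINT = 80_000_000  # useful instructions per checkpoint
WARMUP_PER_CHECKPOINT = 20_000_000     # statistics cleared after these


def is_useful(user_mode: bool, in_spin_loop: bool) -> bool:
    """Useful instruction: committed in user mode and not part of a spin loop."""
    return user_mode and not in_spin_loop


# Assumption: the warm-up instructions count toward the 80 million.
measured_per_checkpoint = SIMULATED_PER_CHECKPOINT - WARMUP_PER_CHECKPOINT
total_measured = CHECKPOINTS * measured_per_checkpoint
print(measured_per_checkpoint, total_measured)
```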

Example perf. breakdown Ferret

[Figure: % Performance Improvement. Labels: Good scalability; Reduced synchronization stalls; Reduced thrashing in shared cache]

Conclusion • Scalability is difficult to achieve and predict. • Determining the best number of threads is hard. • Contention in shared hardware resources • Contention in program level shared objects • LIMO frees the programmer from this burden. • Monitors shared resource contention (shared cache, shared program variables) • Pro-actively disables threads • Employs DVFS • 14% average performance improvement over running all threads.

Thank you!
