When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
University of Michigan
Motivation
• Hardware trends
  • CMPs are ubiquitous.
  • More and more cores in a system
  • Mobile: Qualcomm Snapdragon, Samsung Exynos, NVIDIA Tegra 3.
  • Server: Tilera
• Multi-threaded applications are pervasive.
• But do we always want to maximize the number of threads? No.
Run fewer threads: DVFS
• Most multi-threaded applications stop scaling beyond a certain number of cores.
• Beyond that point, running more threads becomes counter-productive.
• The maximum power budget of a system is fixed.
• Active cores can "borrow" power from disabled cores.
• Intel Turbo Boost: frequency increases in steps of 133 MHz.
Scalability: Problems
• Too many threads
  • Increased contention for shared resources.
  • Increased synchronization costs.
• Too few threads
  • Underutilization of resources.
Scalability: Fewer threads are better
• 4 threads is best for streamcluster.
Scalability: Fewer threads are as good
• ferret, facesim, x264, and dedup show poor scalability.
Scalability: Opportunities
• Run fewer threads
  • Disable some cores and increase the frequency of the active ones.
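The trade between core count and frequency can be illustrated with a small sketch. It assumes the common textbook approximation that a core's dynamic power grows with the cube of its frequency (power is proportional to V²f, with voltage scaled roughly in proportion to frequency). The slides do not state this model, and the function name and parameters below are hypothetical.

```python
def boosted_frequency_ghz(total_cores, active_cores, base_ghz):
    """Frequency each active core could run at under a fixed power budget.

    Assumption (not from the slides): per-core power is proportional to f**3,
    and the budget equals total_cores cores running at base_ghz.
    """
    if not 0 < active_cores <= total_cores:
        raise ValueError("active_cores must be between 1 and total_cores")
    # Budget:  total_cores * base**3 == active_cores * f**3
    # Solve:   f = base * (total_cores / active_cores) ** (1/3)
    return base_ghz * (total_cores / active_cores) ** (1.0 / 3.0)


if __name__ == "__main__":
    for active in (32, 16, 8, 4):
        ghz = boosted_frequency_ghz(32, active, 1.1)
        print(f"{active:2d} active cores -> {ghz:.2f} GHz")
```

Under this assumption, a 32-core system at 1.1 GHz could run 16 cores at about 1.39 GHz or 8 cores at about 1.75 GHz within the same budget. Real parts move in discrete frequency steps and are also limited by thermals and leakage, which this sketch ignores.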
Run fewer threads: DVFS
• Figure rows: Frequency (GHz): 1.1 1.1 1.1 1.1 1.1 1.1 and Frequency (GHz): 3.6 2.8 2.2 1.8 1.4 1.1
Run fewer threads: DVFS
• DVFS makes the case for fewer threads more compelling.
• With fewer threads
  • increase frequency
  • reduce contention.
• 5 out of 11 benchmarks
• Who can decide the best number of threads?
DVFS in current systems
• Programmer decides how many threads to run (e.g. 32 threads on 32 cores).
• That choice may stop being the best one when:
  • Inputs change
  • System resources change
  • Hardware configurations differ
  • Program characteristics change
• Execution progress (timeline figure): 10 threads stalled, then 12, then 16; cores run at 1.1 GHz until Turbo Boost increases the frequency to 1.4 GHz.
Our system
• Detection logic pro-actively disables more threads.
• Frequency is increased.
• Execution progress (timeline figure): 10 threads stalled, then 12, then 16 stalled or disabled; Turbo Boost raises the frequency from 1.1 GHz to 1.4 GHz.
Less Is MOre (LIMO)
• Less Is MOre for efficiency
• Observation:
  • Most programs do not scale after a certain limit.
  • DVFS can help provide better performance.
• A runtime system that:
  • Monitors shared resource contention (shared cache, shared program variables)
  • Pro-actively disables threads
  • Employs DVFS
Roadblocks: Shared Cache
• Roadblocks
  • Physical shared resources
    • Shared cache
  • Program level shared resources
Roadblocks: Shared Cache
• Abstract representation of most multi-threaded programs
• The peak performance point shifts depending on working set size and shared cache size.
• Figure annotations: best performance; working set fits in shared cache; working set does not fit in shared cache; working set too large
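One way to picture the cache roadblock is to ask how many threads' working sets fit in the shared cache at once. The following is a minimal sketch, assuming every thread has the same working set size and that working sets simply add up. Both are simplifications introduced here, and the function name and sizes are hypothetical rather than taken from LIMO.

```python
def threads_that_fit(per_thread_ws_bytes, shared_cache_bytes, active_threads):
    """Largest thread count whose combined working set fits in the shared cache.

    Assumptions (illustrative only): identical, non-overlapping working sets.
    Always keeps at least one thread running.
    """
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    if per_thread_ws_bytes <= 0:
        return active_threads  # nothing to fit, so no reason to cut threads
    fit = shared_cache_bytes // per_thread_ws_bytes
    return max(1, min(active_threads, fit))


if __name__ == "__main__":
    KIB = 1024
    MIB = 1024 * KIB
    keep = threads_that_fit(800 * KIB, 8 * MIB, 16)
    print(f"keep {keep} threads, disable {16 - keep}")
```

With these made-up sizes (800 KiB per thread, 8 MiB shared cache, 16 active threads), the sketch keeps 10 threads and disables 6.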
Roadblocks: Program Resources
• Roadblocks
  • Physical shared resources
    • Shared cache
  • Program level shared resources
    • Synchronization stalls (locks)
Roadblocks: Program Resources
• Figure annotations: best performance; increased parallelism gives more performance; increased synchronization costs hurt performance
LIMO
• After 100 million instructions, the working set size estimate is calculated.
• Pro-actively disables more threads, and the frequency is increased:
  • 20 threads at 1.1 GHz: 20 * 1.1 = 22
  • 16 threads at 1.4 GHz: 16 * 1.4 = 22.4
• Working set of 10 threads fits in cache: 6 threads disabled.
• Pro-actively disables more threads:
  • 10 threads at 1.4 GHz: 10 * 1.4 = 14
  • 8 threads at 1.8 GHz: 8 * 1.8 = 14.4
• Execution progress (timeline figure): 10 threads stalled, 12 threads stalled, 16 threads stalled/disabled, 8 threads disabled; frequency steps 1.1 GHz, 1.1 GHz, 1.4 GHz, 1.8 GHz
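The arithmetic on this slide compares "threads × frequency" before and after disabling threads. A minimal sketch of that comparison follows. The 1.1, 1.4, and 1.8 GHz values for 20, 16, and 8 threads come from the slide; the rest of the frequency table, and its mapping to thread counts, is an assumption made for illustration, as are all function names.

```python
# Assumed mapping: "at most N active threads" -> core frequency in GHz.
# Only the 8-, 16-, and 20-thread cases are taken from the slide.
FREQ_STEPS = [(1, 3.6), (2, 2.8), (4, 2.2), (8, 1.8), (16, 1.4), (32, 1.1)]


def frequency_ghz(active_threads):
    """Frequency available to each core when this many threads are active."""
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    for max_active, ghz in FREQ_STEPS:
        if active_threads <= max_active:
            return ghz
    raise ValueError("more active threads than the table covers")


def throughput_proxy(active_threads):
    """Threads times frequency, the quantity computed on the slide."""
    return active_threads * frequency_ghz(active_threads)


def worth_disabling(current_threads, fewer_threads):
    """True if running fewer threads at a higher frequency raises the proxy."""
    return throughput_proxy(fewer_threads) > throughput_proxy(current_threads)


if __name__ == "__main__":
    print(worth_disabling(20, 16))  # compares 16 * 1.4 with 20 * 1.1
    print(worth_disabling(10, 8))   # compares 8 * 1.8 with 10 * 1.4
```

The proxy counts only raw cycles per second across running threads. It says nothing about contention or synchronization, which the slides treat as separate signals.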
Methodology: Configuration
• Modified the timing simulator FeS2, which uses Simics.
• Hardware configuration:
Methodology: Simulation
• 9 evenly spaced checkpoints
• Timing simulations start from these checkpoints.
• 80 million useful instructions simulated per checkpoint
• Statistics cleared after the first 20 million
• Useful instructions: those committed in user mode, excluding spin loops.
• Benchmarks: the PARSEC benchmark suite, the Apache web server (httpd), and the speech recognition benchmark (sphinx) from ALP.
Example performance breakdown: ferret
Performance improvement (%)
• Figure annotations: good scalability; reduced synchronization stalls; reduced thrashing in shared cache
Conclusion
• Scalability is difficult to achieve and predict.
• Determining the best number of threads is hard.
  • Contention in shared hardware resources
  • Contention in program level shared objects
• LIMO frees the programmer from this burden.
  • Monitors shared resource contention (shared cache, shared program variables)
  • Pro-actively disables threads
  • Employs DVFS
• 14% average performance improvement over running with all threads.