ECE 692 Topic Presentation • Thread Criticality for Power Efficiency in CMPs • Khairul Kabir • Nov. 3rd, 2009
Why Thread Criticality Prediction? • Critical thread: the one with the longest completion time in the parallel region • Problems: performance degradation, energy inefficiency • Sources of variability: algorithm, process variation, thermal emergencies, etc. • Purpose: load balancing for performance improvement; energy optimization using DVFS [Figure: instruction-execution timelines for threads T0–T3, with I-cache and D-cache misses delaying one thread while the others stall at the barrier]
Related Work • Instruction criticality [Fields et al. 2001, Tune et al. 2001, etc.]: identify critical instructions • Thrifty barrier [Li et al. 2005]: faster cores transition into a low-power mode based on a prediction of barrier stall time • DVFS for energy efficiency at barriers [Liu et al. 2005]: the faster core tracks its waiting time and predicts the DVFS setting for the next execution of the same parallel loop • Meeting points [Cai et al. 2008]: DVFS non-critical threads by tracking loop-iteration completion rates across cores (parallel loops only)
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors • Abhishek Bhattacharjee, Margaret Martonosi • Dept. of Electrical Engineering, Princeton University
What is This Paper About? • Thread criticality predictor (TCP) design • Methodology: identify architectural events impacting thread criticality; introduce basic TCP hardware • Thread criticality predictor uses: apply to Intel's Threading Building Blocks (TBB); apply to energy efficiency in barrier-based programs
Thread Criticality Prediction Goals • Design goals: 1. Accuracy 2. Low-overhead implementation: simple HW that allows SW policies to be built on top 3. One predictor, many uses • Design decisions: 1. Find a suitable architectural metric 2. History-based local approach versus thread-comparative approach 3. This paper: TBB and DVFS; other uses: shared last-level cache management, SMT and memory priority, …
Methodology • Evaluations on a range of architectures spanning the high-performance and embedded domains • GEMS simulator: evaluates performance on architectures representative of the high-performance domain • ARM simulator: evaluates the performance benefits of TCP-guided task stealing in Intel's TBB • FPGA-based emulator: assesses energy savings from TCP-guided DVFS
Architectural Metrics • History-based TCP: requires repetitive barrier behavior; uses information local to the core (no communication); problematic for in-order pipelines with varying IPC • Inter-core (thread-comparative) TCP metrics: instruction count, cache misses, control-flow changes, translation lookaside buffer (TLB) misses
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses
Basic TCP Hardware • TCP hardware components: per-core criticality counters and an interval bound register
Basic TCP Hardware • Per-core criticality counters track poorly cached, slow threads • The interval bound register periodically refreshes the criticality counters [Figure: four cores, each with L1 I- and D-caches, share an L2 cache; the TCP hardware sits at the L2 controller, incrementing a core's criticality counter on each of its L1 and L2 misses]
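A minimal sketch of this counter mechanism is below. The L1/L2 miss-penalty weights are illustrative assumptions, not the paper's calibrated values:

```cpp
#include <array>
#include <cstdint>

// One criticality counter per core, held at the shared L2 controller
// where miss information is already visible.
constexpr int kNumCores = 4;

// Hypothetical penalty weights: an L2 miss stalls a thread far longer
// than an L1 miss, so it raises criticality proportionally more.
constexpr uint32_t kL1MissWeight = 1;
constexpr uint32_t kL2MissWeight = 10;

struct CriticalityCounters {
    std::array<uint32_t, kNumCores> counters{};
    uint64_t instsSinceRefresh = 0;
    uint64_t intervalBound;  // refresh period, in retired instructions

    explicit CriticalityCounters(uint64_t bound) : intervalBound(bound) {}

    void onL1Miss(int core) { counters[core] += kL1MissWeight; }
    void onL2Miss(int core) { counters[core] += kL2MissWeight; }

    // The interval bound register periodically zeroes all counters so
    // stale history cannot mask a change in which thread is critical.
    void onInstructionsRetired(uint64_t n) {
        instsSinceRefresh += n;
        if (instsSinceRefresh >= intervalBound) {
            counters.fill(0);
            instsSinceRefresh = 0;
        }
    }

    // The most critical thread is the one with the largest counter,
    // i.e., the most poorly cached, slowest thread.
    int mostCriticalCore() const {
        int best = 0;
        for (int c = 1; c < kNumCores; ++c)
            if (counters[c] > counters[best]) best = c;
        return best;
    }
};
```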
TBB Task Stealing & Thread Criticality • TBB's dynamic scheduler distributes tasks: each thread maintains a software queue filled with tasks; on an empty queue, a thread "steals" a task from another thread's queue • Approach 1: default TBB uses random task stealing; more failed steals at higher core counts → poor performance • Approach 2: occupancy-based task stealing [Contreras, Martonosi, 2008]: steal based on the number of items in the SW queue; must track and compare maximum occupancy counts
TCP-Guided TBB Task Stealing • TCP initiates steals from the critical thread • Modest message overhead: one L2 access latency per steal • Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores [Figure: four cores with software task queues SW Q0–Q3; on Core 2's steal request, the TCP control logic scans the criticality counters for the maximum value and directs the steal toward Core 3, which is suffering L1 and L2 misses]
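A sketch of the victim-selection step, contrasted with default random stealing; the function names and data layout are illustrative, not TBB's actual internals:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Default TBB picks a victim uniformly at random; at high core counts
// many such steals fail because the chosen victim's queue is empty.
int randomVictim(int numCores, int self, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, numCores - 2);
    int v = pick(rng);
    return v >= self ? v + 1 : v;  // skip the stealing core itself
}

// TCP-guided stealing: steal from the most critical thread, the one
// most in need of shedding work. One scan over the per-core
// criticality counters held at the shared L2 suffices.
int tcpVictim(const std::vector<uint32_t>& criticalityCounters, int self) {
    int victim = -1;
    uint32_t maxVal = 0;
    for (int c = 0; c < static_cast<int>(criticalityCounters.size()); ++c) {
        if (c == self) continue;
        if (criticalityCounters[c] >= maxVal) {
            maxVal = criticalityCounters[c];
            victim = c;
        }
    }
    return victim;
}
```

A single scan keeps the steal decision linear in the core count and centralizes it where the counters already live, which is what keeps the message overhead at roughly one L2 access.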
TCP-Guided TBB Task Stealing • TBB with random task stealing • TBB with TCP-guided task stealing
TCP-Guided TBB Performance • % performance improvement versus random task stealing • Avg. improvement over Random (32 cores) = 21.6% • Avg. improvement over Occupancy (32 cores) = 13.8%
Adapting TCP for Energy Efficiency in Barrier-Based Programs • Approach: DVFS non-critical threads to eliminate barrier stall time • Challenges: relative criticalities, misprediction costs, DVFS overheads [Figure: instruction-execution timelines for T0–T3; T1 suffers an L2 D-cache miss and is critical, so T0, T2, and T3 are scaled down via DVFS]
Hardware and Algorithm for TCP-Guided DVFS • TCP hardware components: criticality counters, SST (Switching Suggestion Table), SCT (Suggestion Confidence Table), interval bound register • TCP-guided DVFS algorithm, two key steps: 1. Use the SST to translate criticality counter values into thread criticalities: once a criticality counter exceeds a pre-defined threshold T while the core runs at the nominal frequency, the counter value is matched against the SST entries, and a frequency switch is suggested if the matching SST entry differs from the current frequency 2. Feed the suggested target frequency from the SST into the SCT, which assesses confidence in the SST's DVFS suggestion
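A sketch of the two-step decision follows. The paper pre-calculates the SST contents (how is not specified, as noted in the critiques at the end), so the table layout, frequency levels, and confidence threshold here are all assumptions:

```cpp
#include <cstdint>
#include <vector>

enum class Freq { kNominal, kHalf, kQuarter };  // illustrative levels

struct SstEntry { uint32_t counterCeiling; Freq suggested; };

// Step 1: the SST translates a criticality counter value into a
// suggested frequency, via range-matching against its entries.
Freq sstLookup(const std::vector<SstEntry>& sst, uint32_t counter) {
    for (const auto& e : sst)
        if (counter <= e.counterCeiling) return e.suggested;
    return Freq::kNominal;
}

// Step 2: the SCT gates the suggestion with a saturating confidence
// counter, so one noisy interval cannot trigger a costly switch.
struct Sct {
    int confidence = 0;
    static constexpr int kThreshold = 2;  // assumed confidence bar

    bool confirm(Freq current, Freq suggested) {
        if (suggested == current) { confidence = 0; return false; }
        if (++confidence >= kThreshold) { confidence = 0; return true; }
        return false;
    }
};

// Per-interval policy: only consult the SST once the criticality
// counter exceeds threshold T and the core is at nominal frequency.
bool shouldSwitch(uint32_t counter, uint32_t thresholdT,
                  const std::vector<SstEntry>& sst,
                  Freq current, Sct& sct, Freq& target) {
    if (counter <= thresholdT || current != Freq::kNominal) return false;
    target = sstLookup(sst, counter);
    return sct.confirm(current, target);
}
```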
TCP-Guided DVFS: Effect of the Criticality Counter Threshold • Lowest bar: time in the pre-calculated (correct) DVFS state, averaged across all barrier instances • Central bar: learning time taken until the correct DVFS state is first reached • Upper bar: prediction noise, i.e., time spent in erroneous DVFS states after the correct one has been reached • A low T increases susceptibility to temporal noise: without good suggestion confidence, the result is too many frequency changes and performance overhead
TCP for DVFS: Results • Average 15% energy savings • Benchmarks with more load imbalance generally save more energy
Conclusions • Goal 1, accuracy: accurate TCPs built from simple cache statistics • Goal 2, low-overhead hardware: scalable per-core criticality counters, with the TCP placed at a central location where cache information is already available • Goal 3, versatility: TBB improved by 13.8% over the best known approach @ 32 cores; DVFS achieves 15% energy savings; two uses shown, many others possible…
Meeting Points: Using Thread Criticality to Adapt Multicore Hardware to Parallel Regions • Qiong Cai, José González, Ryan Rakvic, Grigorios Magklis, Pedro Chaparro, Antonio González
Introduction • Meeting point thread characterization: identifies the critical thread of a single multithreaded application and estimates the slack of the non-critical threads • Proposed applications: thread delaying for multi-core systems, which saves energy by scaling down the frequency and voltage of the cores running non-critical threads; and thread balancing for simultaneous multithreaded (SMT) cores, which improves overall performance by giving higher priority to the critical thread
Example: a parallelized loop from PageRank (lz77 method) • Observation: the code is already written to achieve workload balance, yet imbalance still exists; CPU1 is slower than CPU0 • Reasons for imbalance: (i) different cache misses, (ii) different control paths • How can critical threads be found dynamically?
Identification of Critical Threads • Insertion of meeting points: a meeting point is a place in a parallel region that is visited by all threads; insertion can be done by the hardware, the compiler, or the programmer • Identification technique: a thread-private counter is incremented each time the thread passes the meeting point; the most critical thread is the one with the smallest counter; a thread's slack is estimated as the difference between its counter and the counter of the slowest (critical) thread, as in the sketch below
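A minimal sketch of the identification technique, assuming the meeting point is the tail of a parallel loop iteration; the names and thread count are illustrative:

```cpp
#include <array>
#include <atomic>
#include <cstdint>

constexpr int kNumThreads = 4;

// One counter per thread, incremented each time that thread executes
// the meeting point (e.g., finishes a loop iteration).
std::array<std::atomic<uint32_t>, kNumThreads> mpCounters{};

void meetingPoint(int tid) { mpCounters[tid].fetch_add(1); }

// The critical thread is the one that has made the least progress,
// i.e., the one with the smallest counter.
int criticalThread() {
    int crit = 0;
    for (int t = 1; t < kNumThreads; ++t)
        if (mpCounters[t] < mpCounters[crit]) crit = t;
    return crit;
}

// A thread's slack is its lead over the slowest (critical) thread.
uint32_t slack(int tid) {
    return mpCounters[tid] - mpCounters[criticalThread()];
}
```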
Thread Delaying • The CPUs of non-critical threads can be put into deep sleep at the barrier, consuming almost zero energy, but this is not the most energy-efficient way to handle workload imbalance • Instead, run non-critical threads at a lower frequency/voltage level so that all threads arrive at the barrier at the same time
Thread Delaying • Proposal: since Energy = Activity × Capacitance × Voltage², reduce the voltage when executing parallel threads and delay the threads that would otherwise arrive early at the barrier [Figure: threads 1–4 (A–D) run toward the barrier at per-thread frequencies; the area under each curve is the energy needed to execute that thread's instructions]
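The quadratic dependence on voltage is what makes thread delaying pay off; a back-of-the-envelope illustration (the 0.8 scaling factor is an assumed example, not a figure from the paper):

```latex
E = A \cdot C \cdot V^2, \qquad
\frac{E_{\text{scaled}}}{E_{\text{nominal}}}
  = \frac{A \cdot C \cdot (0.8\,V)^2}{A \cdot C \cdot V^2} = 0.64
```

Running a non-critical thread at 80% of nominal voltage thus cuts its dynamic energy by roughly 36%, while the added latency is absorbed by slack the thread would otherwise spend stalled at the barrier.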
Thread Delaying [Figure: per-thread energy bars for threads A–D before and after thread delaying; the difference between the two shows the energy saved]
Implementation of Thread Delaying • MP-COUNTER-TABLE: contains as many entries as there are cores in the processor; each entry is a 32-bit counter; kept consistent across all cores • HISTORY-TABLE: an entry for each possible frequency level; each entry is a 2-bit up-down saturating counter • Operation: each core broadcasts its counter value on every 10th execution of the meeting point instruction, the thread delaying algorithm is invoked, and the history table is updated (see the sketch below)
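A sketch of how the two tables might interact. The broadcast interval of 10 meeting point executions and the 2-bit saturating counters come from the paper; the slack-to-frequency-level mapping and everything else here is illustrative:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

constexpr int kNumCores = 4;
constexpr int kNumFreqLevels = 4;
constexpr int kBroadcastInterval = 10;  // meeting point executions

// MP-COUNTER-TABLE: one 32-bit counter per core, kept consistent
// across cores by broadcasting local updates.
std::array<uint32_t, kNumCores> mpCounterTable{};

// HISTORY-TABLE: a 2-bit up-down saturating counter per frequency
// level, building confidence that a level suits this parallel loop.
std::array<uint8_t, kNumFreqLevels> historyTable{};

void bump(uint8_t& ctr, bool up) {
    if (up  && ctr < 3) ++ctr;   // saturate at 3 (2 bits)
    if (!up && ctr > 0) --ctr;
}

// Called by each core at its meeting point; every 10th execution it
// broadcasts its count and re-runs the thread delaying decision.
void onMeetingPoint(int core, uint32_t localCount) {
    if (localCount % kBroadcastInterval != 0) return;
    mpCounterTable[core] = localCount;  // broadcast, modeled as a shared write

    // Derive a frequency level from this core's slack over the slowest
    // core, then let the history table confirm or veto it over time.
    uint32_t slowest = *std::min_element(mpCounterTable.begin(),
                                         mpCounterTable.end());
    uint32_t slack = localCount - slowest;
    int level = std::min<int>(kNumFreqLevels - 1, slack / 8);  // assumed mapping
    bump(historyTable[level], /*up=*/true);
    // A core would apply `level` via per-core DVFS once confidence is high.
}
```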
Thread Balancing • Goal: speed up a parallel application when more than one of its threads runs on a two-way in-order SMT core with an issue bandwidth of two instructions per cycle • Base issue policy: if both threads have ready instructions, each issues one; if only one thread has ready instructions, it can issue up to two instructions per cycle • If the threads belong to the same parallel application, prioritize the critical thread • Thread balancing: identify the critical thread and give it more priority in the issue logic
Thread Balancing Logic • Targeted at 2-way SMT • Imbalance hardware logic: identifies the critical thread • Issue prioritization logic: if a thread is critical and has two ready instructions, it is allowed to issue both, regardless of how many ready instructions the non-critical thread has; otherwise, the base issue policy is applied
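A sketch of the per-cycle issue decision; the ready-instruction counts are assumed to come from the issue queue, and the structure is illustrative rather than the paper's RTL:

```cpp
#include <algorithm>

// Instructions issued this cycle by each thread of a 2-way in-order
// SMT core with an issue bandwidth of two.
struct IssueSlots { int thread0; int thread1; };

IssueSlots issuePolicy(int ready0, int ready1, int criticalThread) {
    // Prioritization: a critical thread with two ready instructions
    // takes the whole issue bandwidth, regardless of the other thread.
    if (criticalThread == 0 && ready0 >= 2) return {2, 0};
    if (criticalThread == 1 && ready1 >= 2) return {0, 2};

    // Base policy: if both threads have ready instructions, each
    // issues one; a lone ready thread may issue up to two.
    if (ready0 > 0 && ready1 > 0) return {1, 1};
    if (ready0 > 0) return {std::min(ready0, 2), 0};
    if (ready1 > 0) return {0, std::min(ready1, 2)};
    return {0, 0};
}
```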
Simulation Framework and Benchmarks • SoftSDV for Intel64/IA32 processors: simulates multithreaded primitives, including locks, synchronization operations, shared memory, and events • RMS (Recognition, Mining, and Synthesis) benchmarks: highly data-intensive and highly parallel (computer vision, data mining, etc.); parallelized with pthreads or OpenMP • 99% of total execution is parallel for all benchmarks except FIMI (28% coverage)
Performance Results for Thread Delaying • The baseline is aggressive: every core runs at full speed and stops when its work is complete; once a core stops, it consumes zero power • Energy savings of 4%–44% • The savings come from the large frequency decreases on non-critical threads
Performance Results for Thread Balancing • The baseline is aggressive: every core runs at full speed and stops when its work is complete • Performance benefit ranges from 1% to 20% • The performance benefit correlates with the level of imbalance
Conclusions • Meeting point thread characterization dynamically estimates the criticality of the threads in a parallel execution • Thread delaying combines per-core DVFS with meeting point thread characterization to reduce energy consumption on non-critical threads • Thread balancing gives the critical thread higher priority in the issue queue of an SMT core
Critiques • Paper 1: it does not explain how the SST values are calculated, yet the accuracy of barrier-based DVFS depends on those pre-calculated values • Paper 2: the technique assumes each thread visits the meeting point roughly the same number of times, so meeting point thread characterization cannot handle variable loop-iteration sizes; it works well only for parallel loops and fails for large parallel regions without them; and it may not always be feasible for hardware to detect a parallel loop and insert the meeting point