
Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs

This paper proposes a utility-based acceleration approach for identifying and accelerating performance-limiting bottlenecks or lagging threads in multithreaded applications on asymmetric CMPs. The approach combines speedup and criticality of code segments to prioritize acceleration decisions. A coordination mechanism is also introduced to ensure efficient execution on large cores.

Presentation Transcript


  1. Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs. José A. Joao*, M. Aater Suleman*, Onur Mutlu‡, Yale N. Patt* (*HPS Research Group, University of Texas at Austin; ‡Computer Architecture Laboratory, Carnegie Mellon University)

  2. Asymmetric CMP (ACMP) • One or a few large, out-of-order cores: fast • Many small, in-order cores: power-efficient • Critical code segments run on the large cores • The rest of the code runs on the small cores [Diagram: one large core among many small cores; which code to run on the large core is the open question]

  3. Bottlenecks [Diagram: threads T0–T3 executing between Barrier 1 and Barrier 2, with bottleneck code segments serializing execution] • Accelerating Critical Sections (ACS), Suleman et al., ASPLOS’09 • Bottleneck Identification and Scheduling (BIS), Joao et al., ASPLOS’12

  4. Lagging Threads • Lagging thread = potential future bottleneck [Diagram: threads T0–T3 with progress P0 = 10, P1 = 11, P2 = 6, P3 = 10 approaching Barrier 2; accelerating the lagging thread T2 reduces execution time] • Previous work on progress of multithreaded applications: • Meeting points, Cai et al., PACT’08 • Thread criticality predictors, Bhattacharjee and Martonosi, ISCA’09 • Age-based scheduling (AGETS), Lakshminarayana et al., SC’09

  5. Two problems • 1) Do we accelerate bottlenecks or lagging threads? • 2) With multiple applications: which application do we accelerate? [Diagram: Application 1 and Application 2, each with threads T0–T3] • Acceleration decisions need to consider both the criticality of code segments (lagging threads and bottlenecks) and how much speedup they get

  6. Utility-Based Acceleration (UBA) • Goal: identify performance-limiting bottlenecks or lagging threads from any running application and accelerate them on the large cores of an ACMP • Key insight: a Utility of Acceleration metric that combines speedup and criticality of each code segment • Utility of accelerating code segment c of length t on an application of length T: U = ΔT/T = (Δt/t) · (t/T) · (ΔT/Δt) = L · R · G
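The slide's factorization can be sketched in a few lines. This is an illustrative model, not the paper's hardware implementation; the function and parameter names are assumptions.

```python
# Hedged sketch of the Utility of Acceleration metric:
# U = dT/T = L * R * G, with L = dt/t, R = t/T, G = dT/dt.

def local_acceleration(speedup):
    """L = dt/t: fraction of the segment's own time saved on the large core.
    With speedup S, the segment shrinks from t to t/S, so dt/t = (S - 1) / S."""
    return (speedup - 1.0) / speedup

def utility(speedup, t_segment, T_app, G):
    """U = L * R * G: estimated fractional reduction in total execution time."""
    L = local_acceleration(speedup)
    R = t_segment / T_app   # relevance: the segment's share of the application
    return L * R * G

# Example: a segment with 2x speedup covering 30% of the run,
# fully on the critical path (G = 1): U = 0.5 * 0.3 * 1 = 0.15
u = utility(speedup=2.0, t_segment=30.0, T_app=100.0, G=1.0)
```

The point of the factorization is that L, R and G can each be estimated independently at run time, as the next slides describe.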

  7. L: Local acceleration of c • How much code segment c is accelerated: L = Δt / t • Estimate S, the speedup of c on a large core, while c is still running on a small core • Performance Impact Estimation (PIE, Van Craeynest et al., ISCA’12): considers both instruction-level parallelism (ILP) and memory-level parallelism (MLP) to estimate CPI [Diagram: c takes time t on a small core and time t − Δt on a large core]

  8. R: Relevance of code segment c • How relevant code segment c is for the application: R = t / T • Estimated each scheduling quantum Q from the time spent in c during the last quantum (tlastQ) [Diagram: application timeline of length T divided into scheduling quanta Q]

  9. G: Global effect of accelerating c • How much accelerating c reduces total execution time: G = ΔT / Δt • G = (acceleration of the application) / (acceleration of c), i.e. the criticality of c • Single thread: saving Δt on c saves the same amount on total time T, so G = 1 [Diagram: single-threaded timeline with c on a small core vs. a large core]

  10. G: Global effect of accelerating c • How much accelerating c reduces total execution time • Critical sections: classified as strongly-contended or weakly-contended, and G is estimated differently for each (details in the paper) [Diagram: threads T1–T3 before Barrier 2. An idle thread has G = 0; a lagging thread that takes 2 quanta of acceleration to gain the benefit of 1 has G = 1/2]

  11. Utility-Based Acceleration (UBA) [Diagram: Bottleneck Identification and Lagging Thread Identification produce the Set of Highest-Utility Bottlenecks and the Set of Highest-Utility Lagging Threads, which feed Acceleration Coordination and large core control]

  12. Lagging thread identification • Lagging threads are those making the least progress • How to define and measure progress? An application-specific problem • We borrow the progress metric (committed instructions) from Age-Based Scheduling (SC’09) • Assumption: same number of committed instructions between barriers • But any other progress metric could easily be used • Minimum progress = minP • Set of lagging threads = { any thread with progress < minP + ∆P } • Compute Utility for each lagging thread
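The selection rule above can be sketched directly. This is an illustrative model with assumed names; progress here is simply committed instructions per thread.

```python
# Illustrative sketch of lagging-thread selection: any thread whose progress
# falls below min(progress) + delta_p is flagged as lagging.

def find_lagging_threads(progress, delta_p):
    """progress: dict of thread id -> committed instructions.
    Return the ids of threads with progress < min(progress) + delta_p."""
    min_p = min(progress.values())
    return sorted(tid for tid, p in progress.items() if p < min_p + delta_p)

# Example from the earlier slide: P0 = 10, P1 = 11, P2 = 6, P3 = 10.
# With delta_p = 2, only thread 2 (6 < 6 + 2) is lagging.
lagging = find_lagging_threads({0: 10, 1: 11, 2: 6, 3: 10}, delta_p=2)
```

The ∆P slack keeps the set stable when several threads are nearly tied; Utility is then computed only for the threads in this set.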

  13. Utility-Based Acceleration (UBA) [Diagram: same overview as before; Acceleration Coordination drives large core control, 1 per large core]

  14. Bottleneck identification • Software (programmer, compiler or library): • Delimit potential bottlenecks with BottleneckCall and BottleneckReturn instructions • Replace code that waits with a BottleneckWait instruction • Hardware (Bottleneck Table): • Keep track of threads executing or waiting for bottlenecks • Compute Utility for each bottleneck • Determine the set of Highest-Utility Bottlenecks • Similar to our previous work BIS (ASPLOS’12), except that BIS uses thread waiting cycles instead of Utility
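A minimal software model of the Bottleneck Table can make the bookkeeping concrete: it records which threads are executing or waiting on each bottleneck, driven by the three events above. The class and method names are assumptions, not the paper's hardware interface.

```python
# Hedged sketch of Bottleneck Table state, updated on BottleneckCall /
# BottleneckWait / BottleneckReturn events.

class BottleneckTable:
    def __init__(self):
        self.executing = {}  # bottleneck id -> thread currently executing it
        self.waiters = {}    # bottleneck id -> set of threads waiting on it

    def bottleneck_call(self, bid, tid):
        self.executing[bid] = tid

    def bottleneck_wait(self, bid, tid):
        self.waiters.setdefault(bid, set()).add(tid)

    def bottleneck_return(self, bid):
        self.executing.pop(bid, None)
        self.waiters.pop(bid, None)

    def waiter_count(self, bid):
        """Number of threads currently blocked on this bottleneck."""
        return len(self.waiters.get(bid, set()))
```

The real mechanism would combine this per-bottleneck state with the L, R and G estimates to compute each bottleneck's Utility.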

  15. Utility-Based Acceleration (UBA) [Diagram: overview repeated, now highlighting Acceleration Coordination]

  16. Acceleration coordination • A lagging thread (LT) is assigned to each large core every quantum, in utility order (ULT1 > ULT2 > ULT3 > ULT4) • A bottleneck (B) whose utility exceeds the Bottleneck Acceleration Utility Threshold (BAUT) preempts the lagging thread on a large core: with UB1 > BAUT and UB2 > BAUT, bottleneck B1 preempts lagging thread LT3 • Additional qualifying bottlenecks are enqueued in the Scheduling Buffer: bottleneck B2 will be enqueued • When no more bottlenecks remain, the lagging thread returns to the large core: LT3 resumes • Preempted lagging threads run on small cores in the meantime
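The policy on one large core can be sketched as follows. This is a simplified, hedged model (single quantum, list-based API, assumed names), not the paper's scheduler.

```python
# Illustrative coordination policy for one large core: bottlenecks above BAUT
# preempt the assigned lagging thread, highest utility first; the lagging
# thread resumes once the scheduling buffer drains.

def large_core_run_order(lagging_threads, bottlenecks, baut):
    """lagging_threads and bottlenecks are (name, utility) pairs.
    Returns the order in which this large core runs them in one quantum."""
    # Bottlenecks above the threshold enter the scheduling buffer,
    # highest utility first.
    buf = sorted((b for b in bottlenecks if b[1] > baut), key=lambda b: -b[1])
    order = [name for name, _ in buf]
    if lagging_threads:
        # The highest-utility lagging thread resumes after the buffer drains.
        best_lt = max(lagging_threads, key=lambda t: t[1])[0]
        order.append(best_lt)
    return order

# Example mirroring the slide: B1 preempts LT3, B2 is enqueued behind it.
order = large_core_run_order([("LT3", 0.2)],
                             [("B2", 0.4), ("B1", 0.5)], baut=0.3)
```

The BAUT threshold is what keeps low-utility bottlenecks from evicting a lagging thread whose acceleration is worth more.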

  17. Methodology • Workloads • single-application: 9 multithreaded applications with different impact from bottlenecks • 2-application: all 55 combinations of (9 MT + 1 ST) • 4-application: 50 random combinations of (9 MT + 1 ST) • Processor configuration • x86 ISA • Area of large core = 4 x Area of small core • Large core: 4GHz, out-of-order, 128-entry ROB, 4-wide, 12-stage • Small core: 4GHz, in-order, 2-wide, 5-stage • Private 32KB L1, private 256KB L2, shared 8MB L3 • On-chip interconnect: Bi-directional ring, 2-cycle hop latency

  18. Comparison points • Single application • ACMP (Morad et al., Comp. Arch. Letters’06) • only accelerates Amdahl’s serial bottleneck • Age-based scheduling (AGETS, Lakshminarayana et al., SC’09) • only accelerates lagging threads • Bottleneck Identification and Scheduling (BIS, Joao et al., ASPLOS’12) • only accelerates bottlenecks • Multiple applications • AGETS+PIE: select most lagging thread with AGETS and use PIE across applications • only accelerates lagging threads • MA-BIS: BIS with shared large cores across applications • only accelerates bottlenecks

  19. Single application, 1 large core • Configuration: optimal number of threads, 28 small cores, 1 large core • UBA outperforms both AGETS and BIS by 8% • Applications fall into three groups: neither bottlenecks nor lagging threads; lagging threads (benefit from AGETS and UBA); limiting critical sections (benefit from BIS and UBA) • UBA’s benefit increases with area budget and number of large cores

  20. Multiple applications • 2-application workloads, 60 small cores, 1 large core • UBA improves Hspeedup over AGETS+PIE and MA-BIS by 2 to 9%

  21. Summary • To effectively use ACMPs: accelerate both fine-grained bottlenecks and lagging threads, for single and multiple applications • Utility-Based Acceleration (UBA) is a cooperative software-hardware solution to both problems • Our Utility of Acceleration metric combines a measure of acceleration and a measure of criticality to allow meaningful comparisons between code segments • Utility is implemented for an ACMP but is general enough to be extended to other acceleration mechanisms • UBA outperforms previous proposals for single applications, and their aggressive extensions for multiple-application workloads • UBA is a comprehensive fine-grained acceleration proposal for parallel applications, without programmer effort

  22. Thank You! Questions?

  23. Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs. José A. Joao*, M. Aater Suleman*, Onur Mutlu‡, Yale N. Patt* (*HPS Research Group, University of Texas at Austin; ‡Computer Architecture Laboratory, Carnegie Mellon University)
