
Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading


Presentation Transcript


  1. 36th International Symposium on Computer Architecture (ISCA 2009)
Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading
Carlos Madriles, Pedro López, Josep M. Codina, Enric Gibert, Fernando Latorre, Alejandro Martínez, Raúl Martínez, and Antonio González
Intel Barcelona Research Center, Intel Labs / Universitat Politècnica de Catalunya, Barcelona
Presenter: Ki Sup Hong

  2. Why Is Single-Thread Performance Important?
• Exploiting TLP is more power-efficient than exploiting ILP
• The trend is to increase the number of cores integrated into the same die
• Power and thermal constraints advocate for simpler cores
• However, single-thread performance still matters:
  • Sequential and lightly threaded apps
  • Sequential regions limit the scalability of parallelism (Amdahl's law; see the sketch below)
• Schemes to boost single-thread performance on CMPs are crucial
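As a quick illustration of the Amdahl's-law point above (a standard back-of-the-envelope calculation, not a figure from the talk), the sketch below shows how even a small serial fraction caps the benefit of adding cores; the 10% / 16-core numbers are purely illustrative.

```python
# Standard Amdahl's-law formula; the inputs below are illustrative only.
def amdahl_speedup(serial_fraction, cores):
    # Speedup = 1 / (serial + parallel / cores)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

print(amdahl_speedup(0.10, 16))   # ~6.4x: 10% serial code already limits 16 cores
```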

  3. Main Research Efforts
New techniques to exploit parallelism are needed.
• Main research efforts exploit both ILP and TLP for boosting single-thread performance on CMPs:
  • Adaptive and clustered microarchitectures
  • Combined or fused cores
    • Parallelism constrained by the instruction window
  • Speculative multithreading (SpMT) techniques
    • Traditional schemes constrain the exploitable parallelism

  4. Contribution: Anaphase
• Novel speculative multithreading technique to boost single-thread applications on CMPs
• Effectively exploits ILP, TLP, and MLP
• Thread decomposition performed at instruction granularity:
  • Adapts to the available parallelism in the application
  • Minimizes inter-thread dependences
  • Improves workload balance
  • Increases the amount of exploitable MLP
• HW/SW hybrid scheme:
  • Low hardware complexity and powerful thread decomposition
• Cost-effective hardware support on top of a conventional CMP:
  • A novel uncore module supports execution of Anaphase threads
  • Very few changes to the cores

  5. Outline
• The Anaphase Scheme
  • Threading Model
  • Decomposition Algorithm
• The Anaphase Hardware Support
  • Architecture Overview
  • Architectural State Management
• Evaluation
• Conclusions

  6. Anaphase Scheme Overview
• Main distinguishing features:
  • Threading model
  • Fine-grain thread decomposition
  • Hardware support

  7. Threading Model: Key Features

  8. Threading Model: Key Features
• A p-slice is a piece of code added at the beginning of each thread
• The p-slice is used for value prediction:
  • If the values computed by one thread and consumed by another can be predicted, the consumer thread can execute in parallel with the producer thread (see the sketch below)
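To make the p-slice idea concrete, here is a minimal sketch (hypothetical code, not the paper's implementation): the consumer thread prepends a small replicated slice that recomputes its live-in value, so it does not have to wait for the producer thread. The function names and the producer/consumer split are illustrative assumptions.

```python
# Hypothetical illustration of a p-slice: only the instructions that the
# consumer's live-in value depends on are replicated, so the consumer thread
# can start without waiting for the heavy producer region to finish.
import threading

def expensive_transform(x):
    return sum(i * x for i in range(10_000))   # stands in for heavy producer work

def producer_region(data, out):
    # Original first half of the program: heavy per-element work plus a cheap
    # counter. Only the counter feeds the second half.
    count = 0
    for x in data:
        out.append(expensive_transform(x))
        count += 1
    return count

def consumer_region(data):
    # p-slice: replicate just the backward slice of the live-in (`count`),
    # i.e. the counter update, not expensive_transform.
    count = 0
    for _ in data:
        count += 1
    # Body of the consumer region, now runnable in parallel with the producer.
    return [count * x for x in data]

data = list(range(64))
produced = []
t = threading.Thread(target=producer_region, args=(data, produced))
t.start()                          # producer thread runs its heavy region ...
consumed = consumer_region(data)   # ... while the consumer runs concurrently
t.join()
```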

  9. Threading Model: Key Features

  10. Threading Model: Key Features

  11. Decomposition Algorithm [1]
[1] G. Karypis and V. Kumar, "Analysis of Multilevel Graph Partitioning," in Proc. of the 7th Supercomputing Conference, 1995.
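The slides do not show the algorithm's details, but since the cited work is multilevel graph partitioning, a minimal sketch of the underlying idea might look like the following: model instructions as nodes of a dependence graph and search for a two-way split that balances instruction counts while minimizing inter-thread dependences. The toy cost function and greedy refinement below are illustrative assumptions, not Anaphase's actual cost model.

```python
# Illustrative two-way partitioning of an instruction dependence graph.
# Toy cost = inter-thread dependence edges + instruction-count imbalance.
def partition_instructions(num_instrs, dep_edges, passes=4):
    part = [i % 2 for i in range(num_instrs)]                 # naive initial split

    def cost(p):
        cut = sum(1 for a, b in dep_edges if p[a] != p[b])    # inter-thread deps
        imbalance = abs(p.count(0) - p.count(1))              # workload balance
        return cut + imbalance

    best = cost(part)
    for _ in range(passes):                                   # greedy refinement
        improved = False
        for i in range(num_instrs):
            part[i] ^= 1                                      # try moving instr i
            c = cost(part)
            if c < best:
                best, improved = c, True                      # keep the move
            else:
                part[i] ^= 1                                  # revert
        if not improved:
            break
    return part

# Tiny example: 6 instructions, 5 dependence edges.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 5)]
print(partition_instructions(6, edges))
```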

  12. Outline
• The Anaphase Scheme
  • Threading Model
  • Decomposition Algorithm
• The Anaphase Hardware Support
  • Architecture Overview
  • Architectural State Management
• Evaluation
• Conclusions

  13. Architecture Overview

  14. Anaphase Execution

  15. Anaphase Execution

  16. Anaphase Execution

  17. Anaphase Execution

  18. Anaphase Execution

  19. Outline
• The Anaphase Scheme
  • Threading Model
  • Decomposition Algorithm
• The Anaphase Hardware Support
  • Architecture Overview
  • Architectural State Management
• Evaluation
• Conclusions

  20. Experimental Framework
• Compiler:
  • Anaphase partitioning implemented on top of the Intel® Compiler (icc)
• Simulator:
  • Cycle-accurate x86 CMP (1 tile with 2 out-of-order cores modeled)
  • Two different core configurations evaluated:
    • Intel® Core™ 2-like (Medium core)
    • Half-sized main structures (Tiny core)
  • Core Fusion [1] proposal implemented for comparison
• Benchmarks: SPEC 2006
[1] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," in Proc. of the Int. Symp. on Computer Architecture, 2007.

  21. Performance Evaluation
• Average speedup: Anaphase = 41% (Tiny), 31% (Medium); Core Fusion = 28% (Tiny), 12% (Medium)
• MLP exploited by Anaphase

  22. Dynamic Instruction Breakdown
• High coverage → less than 10% serial code
• Low overhead → 22% replicated instructions and explicit communications
• Few explicit communications → only 3%
• High accuracy → less than 1% squashed instructions

  23. Conclusions
• Anaphase is a SW/HW technique that leverages multiple cores to boost single-thread code
• It exploits ILP, TLP, and MLP with high accuracy and low overheads
• Low hardware overhead (7%)
• Important benefits: 41% average speedup for SPEC 2006
