Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications



  1. Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications • Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke • Advanced Computer Architecture Laboratory, University of Michigan

  2. Multicore Architectures • Multicore becomes a trend • Intel Core Duo, 2005 • Intel Core Quad, 2006 • Sun T1, 8 cores, 2005 • 16 – 32 cores, near future • Need for simpler cores • Power density • Cooling costs • Multiple cores on a chip • High throughput • Good for multithreaded apps • [Figure: 8-core Sun T1 processor, with cores and L2 banks joined by an interconnect]

  3. How About Single Thread Applications? • [Figure: single-thread performance, Core Duo vs. Pentium M (same cache, same platform)] • Source: Mendelson et al., Intel Technology Journal, Vol. 10, Issue 02, 2006

  4. Objective of this Work • Automatically accelerate single-thread applications on multicore systems • Exploit irregular parallelism across cores • Instruction level parallelism (ILP) • Fine-grain thread level parallelism (TLP) • Loop level parallelism (LLP) • Adaptive architecture • Configure resources to exploit available parallelism • Dynamic adaptability • [Figure label: Hybrid parallelism]

  5. Approach • Voltron: Hardware/software approach • Architecture mechanisms • Dual mode execution (coupled, decoupled) • Flexible inter-core communication • Fast thread spawning • Efficient memory ordering • High rate-of-return speculation • Compiler techniques • Compiler controlled distributed branch • Fine-grain thread extraction • Speculative loop parallelization with recovery

  6. Parallelism Type 1: ILP • [Figure: dataflow graph of loads (L), stores (S), arithmetic, logic, shift, and compare operations, ending in a branch (br)]

  7. Parallelism Type 1: ILP • Emulate VLIW • Low latency communication • [Figure: the dataflow graph's operations shown with Core 0 through Core 3]

  8. Parallelism Type 1: ILP • Emulate VLIW • Low latency communication • Compiler controlled distributed branch • Lockstep execution • [Figure: the dataflow graph's operations shown with Core 0 through Core 3, now with several branch (br) operations]

  9. Voltron Architecture for ILP • [Figure: Core 0 through Core 3 with banked L2 cache, a stall bus, and a br bus; core detail shows links to the west and north, Comm and Mem units, FUs, register files (GPR, FPR, PR, BTR), instruction fetch/decode, L1 instruction cache, L1 data cache, and connections to/from the banked L2]

  10. Experimental Setup • Trimaran toolset • Simulator • Multiple cores, multiple instruction streams • Inter-core communication • MOESI coherence protocol • Configuration • 1 ALU, 1 memory unit, 1 communication unit per core • 1 cycle inter-core move latency per hop • 4KB L1 I-cache, 4KB L1 D-cache per core • 128KB shared L2 cache • Single core baseline • 25 benchmarks from SpecInt, SpecFP, and MediaBench

  11. ILP Speedup • [Figure: speedup per benchmark, grouped by MediaBench, SpecFP, and SpecInt] • Achieved > 80% of the performance of a wide VLIW with the same resources

  12. Parallelism Type 2: Fine-grain TLP • Fine-grain threads • Few instructions • Scalar communication • Shared stack frame • [Figure: code regions A, B, C, D, and E]

  13. Parallelism Type 2: Fine-grain TLP • Fine-grain threads • Few instructions • Scalar communication • Shared stack frame • [Figure: code regions A, B, C, D, and E with store (st) and load (ld) operations]

  14. Parallelism Type 2: Fine-grain TLP • Fine-grain threads • Few instructions • Scalar communication • Shared stack frame • Decoupled execution • Different control flow • Asynchronous communication • Fast thread spawning • Efficient memory ordering • Compiler algorithm • Memory dependences • Load balance • [Figure: code regions A, A', B, C, D, and E with store (st) and load (ld) operations, shown with Core 0 and Core 1]
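
The decoupled, asynchronous scalar communication described on this slide can be illustrated with a small software analogy. This is only a sketch under stated assumptions: two Python threads stand in for two cores, a `queue.Queue` stands in for a hardware message queue, and the names `run_decoupled`, `core0`, and `core1` are hypothetical, not part of Voltron.

```python
import queue
import threading


def run_decoupled(values):
    """Two 'cores' with independent control flow exchanging scalars.

    Assumption: queue.Queue is a software stand-in for a hardware
    message queue; None is a hypothetical end-of-stream marker.
    """
    channel = queue.Queue()
    results = []

    def core0():
        # Producer thread: computes scalars and sends them without
        # waiting for the consumer to be ready.
        for v in values:
            channel.put(v * 2)
        channel.put(None)

    def core1():
        # Consumer thread: blocks on receive only when no data is queued.
        total = 0
        while True:
            item = channel.get()
            if item is None:
                break
            total += item
        results.append(total)

    threads = [threading.Thread(target=core0), threading.Thread(target=core1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]


if __name__ == "__main__":
    print(run_decoupled([1, 2, 3]))
```

The point of the analogy is that neither side runs in lockstep with the other: the producer can run ahead, and the consumer stalls only when the queue is empty.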

  15. Voltron for Fine-grain TLP • [Figure: core detail with links to the west and north, Comm and Mem units, FUs, register files (GPR, FPR, PR, BTR), instruction fetch/decode, L1 instruction cache, L1 data cache, and connections to/from the banked L2]

  16. Dual Mode Network • Coupled mode • Direct bypass [Multiflow] • Coupled execution • 1 cycle min latency (num_hops cycles) • Decoupled mode • Message queues [RAW] • SEND / RECV • Decoupled execution • 3 cycle min latency (2 + num_hops cycles) • Fast fine-grain thread spawning • Enforce operation ordering
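
The two latency figures on this slide can be written as a tiny model. This is a sketch that only restates the slide's numbers (num_hops cycles in coupled mode, 2 + num_hops cycles in decoupled mode); the function name `operand_latency` and the mode strings are assumptions made for illustration.

```python
def operand_latency(mode: str, num_hops: int) -> int:
    """Cycles to move one scalar between cores, using the slide's figures.

    Assumption: num_hops >= 1, i.e. the value actually leaves its core.
    """
    if num_hops < 1:
        raise ValueError("num_hops must be at least 1")
    if mode == "coupled":
        # Direct bypass: one cycle per hop, so 1 cycle minimum.
        return num_hops
    if mode == "decoupled":
        # Message queues: 2 extra cycles on top of the hops, so 3 minimum.
        return 2 + num_hops
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for hops in (1, 2, 3):
        print(hops, operand_latency("coupled", hops),
              operand_latency("decoupled", hops))
```

Under this model the decoupled network always costs two cycles more than the coupled one for the same distance, which is the price paid for letting the cores run independently.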

  17. Fine-grain TLP Speedup • [Figure: speedup per benchmark, grouped by MediaBench, SpecInt, and SpecFP; nine benchmarks are marked with *] • Works better for memory intensive applications

  18. Parallelism Type 3: LLP • DOALL loops • No cross-iteration dependences • Iterations can execute in parallel • Memory dependences hard to prove

  19. Parallelism Type 3: LLP • DOALL loops • No cross-iteration dependences • Iterations can execute in parallel • Memory dependences hard to prove • Statistical DOALL • Profile memory dependences • Speculatively parallelize • Detect violation and rollback • [Figure: timeline for core 0 and core 1 with the labels init, iter 0-3, iter 4-7, unexpected dependence, reset, restart, iter 0-7, and finalize]
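
The speculate, detect, and roll back steps of statistical DOALL can be sketched in software. This is an illustrative model only, not Voltron's hardware mechanism: each chunk of iterations writes into a private buffer, read and write sets are compared after the fact, and on a violation the buffers are discarded and the loop is rerun serially. The names `speculative_doall`, `body`, `load`, and `store` are hypothetical.

```python
def speculative_doall(body, n_iters, memory, n_cores=2):
    """Run body(i, load, store) for i in range(n_iters) speculatively.

    Returns True if the speculation committed, False if it rolled back
    and the loop was re-executed serially. `memory` is a mutable list.
    """
    chunk = (n_iters + n_cores - 1) // n_cores
    logs = []
    for c in range(n_cores):
        reads, writes = set(), {}

        def load(addr, reads=reads, writes=writes):
            if addr in writes:          # forward from this chunk's own buffer
                return writes[addr]
            reads.add(addr)             # value comes from pre-loop memory
            return memory[addr]

        def store(addr, value, writes=writes):
            writes[addr] = value        # buffered, not yet visible

        for i in range(c * chunk, min((c + 1) * chunk, n_iters)):
            body(i, load, store)
        logs.append((reads, writes))

    # Detect: a later chunk read an address that an earlier chunk wrote.
    earlier_writes = set()
    violated = False
    for reads, writes in logs:
        if reads & earlier_writes:
            violated = True
            break
        earlier_writes |= writes.keys()

    if violated:
        # Roll back: drop all buffers (memory was never modified), rerun serially.
        def serial_load(addr):
            return memory[addr]

        def serial_store(addr, value):
            memory[addr] = value

        for i in range(n_iters):
            body(i, serial_load, serial_store)
        return False

    for _, writes in logs:              # commit buffers in iteration order
        for addr, value in writes.items():
            memory[addr] = value
    return True
```

A loop such as `store(i, i * i)` commits on the first attempt, while a loop in which iteration `i` reads what iteration `i - 1` wrote triggers the rollback path and still produces the serial result.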

  20. Voltron for LLP • Detect memory dependence violation • Roll back memory state • Compiler rolls back register state • [Figure: core detail with links to the west and north, Comm and Mem units, FUs, register files (GPR, FPR, PR, BTR), instruction fetch/decode, L1 instruction cache, and an L1 D-cache with transactional memory support whose cache lines hold tag, state, data, and a T field; connections to/from the banked L2]

  21. LLP Speedup • [Figure: speedup per benchmark, grouped by SpecInt, MediaBench, and SpecFP] • Accelerate non-provable DOALL and small loops

  22. Speedup for Hybrid Execution • [Figure: speedup per benchmark, grouped by SpecInt, MediaBench, and SpecFP] • 2 core average – ILP: 1.23, TLP: 1.16, LLP: 1.17, Hybrid: 1.46 • 4 core average – ILP: 1.33, TLP: 1.23, LLP: 1.37, Hybrid: 1.83
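
The averages on this slide can be put in context with a little arithmetic. The sketch below uses only the numbers reported above; the derived quantities (gain of hybrid over the best single technique, and speedup divided by core count) and the helper names are added here for illustration and are not figures from the talk.

```python
# Average speedups over the single-core baseline, copied from the slide.
SPEEDUPS = {
    2: {"ILP": 1.23, "TLP": 1.16, "LLP": 1.17, "Hybrid": 1.46},
    4: {"ILP": 1.33, "TLP": 1.23, "LLP": 1.37, "Hybrid": 1.83},
}


def hybrid_gain(cores: int) -> float:
    """Hybrid speedup relative to the best single parallelism type."""
    row = SPEEDUPS[cores]
    best_single = max(v for k, v in row.items() if k != "Hybrid")
    return row["Hybrid"] / best_single


def hybrid_efficiency(cores: int) -> float:
    """Hybrid speedup divided by the number of cores used."""
    return SPEEDUPS[cores]["Hybrid"] / cores


if __name__ == "__main__":
    for n in (2, 4):
        print(n, round(hybrid_gain(n), 2), round(hybrid_efficiency(n), 2))
```

By this calculation, hybrid execution is roughly 19% faster than the best single technique on 2 cores and roughly 34% faster on 4 cores.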

  23. Time Breakdown • [Figure: execution time breakdown per benchmark, grouped by SpecFP, SpecInt, and MediaBench] • Both coupled and decoupled modes are necessary

  24. Conclusions and Future Work • Voltron – Adaptive multicore system • Accelerate single thread applications • Exploit ILP, fine-grain TLP and statistical LLP • Coupled and decoupled execution • Dual-mode operand network • Compiler managed loop speculation • Hybrid parallelism combines the benefits • Future work • Fine-grain thread identification • Virtualization of resources

  25. Thank You • Questions? • For more information: http://cccp.eecs.umich.edu
