Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications
Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
Multicore Architectures
• Multicore is the emerging trend
  • Intel Core Duo, 2005
  • Intel Core Quad, 2006
  • Sun T1, 8 cores, 2005
  • 16-32 cores in the near future
• Need for simpler cores
  • Power density
  • Cooling costs
• Multiple cores on a chip
  • High throughput
  • Good for multithreaded applications
[Figure: 8-core Sun T1 processor, with the cores connected to banked L2 caches through an on-chip interconnect]
How About Single-Thread Applications?
[Figure: single-thread performance of the Core Duo vs. the Pentium M (same cache, same platform)]
Source: Mendelson et al., Intel Technology Journal, Vol. 10, Issue 02, 2006
Objective of this Work
• Automatically accelerate single-thread applications on multicore systems
• Exploit irregular, hybrid parallelism across the cores (see the C sketch below)
  • Instruction-level parallelism (ILP)
  • Fine-grain thread-level parallelism (TLP)
  • Loop-level parallelism (LLP)
• Adaptive architecture
  • Configure resources to exploit the available parallelism
  • Dynamic adaptability
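To make the three kinds of parallelism concrete, here is a small, purely illustrative C fragment (the function and variable names are my own, not from the talk), annotated with where each kind could be exploited:

```c
/* Hypothetical example: where ILP, fine-grain TLP, and LLP can show up
 * in ordinary single-thread code.  Names are made up for illustration. */
void update(float *a, float *b, float *c, int n)
{
    /* LLP: if a, b, and c can be shown (or profiled) not to overlap,
     * the iterations are independent (DOALL) and chunks of the loop
     * can run on different cores. */
    for (int i = 0; i < n; i++) {
        /* ILP: the two multiplies are independent and could issue on
         * different cores in the same VLIW-like cycle. */
        float x = a[i] * 2.0f;
        float y = b[i] * 3.0f;

        /* Fine-grain TLP: the loads above and the arithmetic below
         * could be split into two small communicating threads. */
        c[i] = x + y;
    }
}
```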
Approach
• Voltron: a hardware/software approach
• Architecture mechanisms
  • Dual-mode execution (coupled, decoupled)
  • Flexible inter-core communication
  • Fast thread spawning
  • Efficient memory ordering
  • High rate-of-return speculation
• Compiler techniques
  • Compiler-controlled distributed branches
  • Fine-grain thread extraction
  • Speculative loop parallelization with recovery
Parallelism Type 1: ILP
• Emulate a wide VLIW machine across the cores
• Low-latency inter-core communication
• Compiler-controlled distributed branches (sketched below)
• Lockstep execution
[Figure: dataflow graph of a block of arithmetic, load/store, and branch operations partitioned across Core 0 through Core 3]
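As a rough illustration of the coupled-mode idea, the following sketch (my own example, not taken from the talk) shows a basic block whose independent operations could be assigned to different cores, with the branch condition recomputed on every core so that no condition broadcast is needed. The core assignment in the comments is just one plausible partition.

```c
/* Sketch of partitioning one basic block's independent operations
 * across two cores in coupled (VLIW-like) mode.  The core labels in
 * the comments are hypothetical. */
int basic_block(const int *p, const int *q, int a, int b)
{
    /* Core 0 slice: loads and adds on one set of values. */
    int t0 = p[0] + a;
    int t1 = p[1] - b;

    /* Core 1 slice: independent operations scheduled in the same cycles. */
    int t2 = q[0] * a;
    int t3 = q[1] >> 2;

    /* Distributed branch: the condition is computed redundantly on each
     * core, so both cores take the same direction in lockstep. */
    int cond = (a < b);
    return cond ? (t0 + t2) : (t1 | t3);
}
```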
Voltron Architecture for ILP
[Figure: 4-core Voltron datapath. Each core contains instruction fetch/decode, function units, register files (GPR, FPR, PR, BTR), a communication unit, and L1 instruction and data caches backed by a banked L2 cache; cores link to their north and west neighbors, and a stall bus and branch bus couple the cores for lockstep execution.]
Experimental Setup
• Trimaran toolset
• Simulator
  • Multiple cores, multiple instruction streams
  • Inter-core communication
  • MOESI coherence protocol
• Configuration
  • 1 ALU, 1 memory unit, 1 communication unit per core
  • 1-cycle inter-core move latency per hop
  • 4KB L1 I-cache, 4KB L1 D-cache per core
  • 128KB shared L2 cache
• Single-core baseline
• 25 benchmarks from SpecInt, SpecFP, and MediaBench
ILP Speedup
[Figure: ILP speedup over a single core for MediaBench, SpecFP, and SpecInt benchmarks]
Voltron achieves more than 80% of the performance of a wide VLIW processor with the same total resources.
Parallelism Type 2: Fine-grain TLP
• Fine-grain threads
  • Few instructions each
  • Scalar communication (sketched below)
  • Shared stack frame
• Decoupled execution
  • Different control flow on each core
  • Asynchronous communication
  • Fast thread spawning
  • Efficient memory ordering
• Compiler algorithm considers
  • Memory dependences
  • Load balance
[Figure: control-flow graph A through E partitioned into two fine-grain threads on Core 0 and Core 1, with the loads and stores ordered across the cores]
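A hedged sketch of what fine-grain thread extraction could look like at the source level: one thread performs the memory-heavy work and streams scalar operands to a compute thread over the operand network. The voltron_send / voltron_recv intrinsics are hypothetical names used only for illustration; they are not an actual API from the paper.

```c
/* Illustrative split of a loop into two decoupled fine-grain threads.
 * The intrinsics below are hypothetical stand-ins for the scalar
 * operand network. */
extern void voltron_send(int queue, long value);   /* hypothetical */
extern long voltron_recv(int queue);               /* hypothetical */

/* Thread mapped to core 0: memory slice, feeds scalars to core 1. */
void slice_core0(const int *a, int n)
{
    for (int i = 0; i < n; i++)
        voltron_send(0, a[i]);        /* asynchronous send on queue 0 */
}

/* Thread mapped to core 1: compute slice, consumes scalars as they arrive. */
long slice_core1(int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += voltron_recv(0) * 3;   /* RECV blocks until data is ready */
    return sum;
}
```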
Voltron for Fine-grain TLP
[Figure: per-core Voltron datapath: instruction fetch/decode, function units, register files (GPR, FPR, PR, BTR), a communication unit, and L1 instruction and data caches backed by the banked L2, with links to the north and west neighbors]
Dual-Mode Network
• Coupled mode
  • Direct bypass [Multiflow]
  • Coupled execution
  • 1-cycle minimum latency; num_hops cycles per transfer (see the toy model below)
• Decoupled mode
  • Message queues [RAW]
  • SEND / RECV operations
  • Decoupled execution
  • 3-cycle minimum latency; 2 + num_hops cycles per transfer
  • Fast fine-grain thread spawning
  • Enforces operation ordering
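A toy model of the two latency formulas quoted on this slide, just to make the coupled vs. decoupled cost difference explicit (my reading of the slide, not code from the paper):

```c
/* Per-transfer latency model implied by the slide: coupled transfers
 * cost num_hops cycles (minimum 1), decoupled transfers cost
 * 2 + num_hops cycles (minimum 3) because values pass through the
 * SEND/RECV message queues. */
static inline int coupled_latency(int num_hops)
{
    return num_hops < 1 ? 1 : num_hops;        /* direct register bypass */
}

static inline int decoupled_latency(int num_hops)
{
    int cycles = 2 + num_hops;                 /* enqueue + hops + dequeue */
    return cycles < 3 ? 3 : cycles;
}
```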
Fine-grain TLP Speedup
[Figure: fine-grain TLP speedup over a single core for MediaBench, SpecInt, and SpecFP benchmarks]
Fine-grain TLP works best for memory-intensive applications.
Parallelism Type 3: LLP
• DOALL loops
  • No cross-iteration dependences
  • Iterations can execute in parallel
  • Memory dependences are hard to prove statically
• Statistical DOALL (see the sketch below)
  • Profile memory dependences
  • Speculatively parallelize likely-DOALL loops
  • Detect violations and roll back
[Figure: two-core timeline: after init, core 0 runs iterations 0-3 and core 1 runs iterations 4-7 speculatively; an unexpected dependence triggers a reset on both cores, iterations 0-7 are restarted, and then both cores finalize]
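A hedged sketch of the statistical-DOALL pattern: a chunk of iterations runs speculatively, and if the (statistically unlikely) cross-iteration dependence actually occurs, the chunk is re-executed non-speculatively. The spec_begin / spec_commit primitives are hypothetical stand-ins for the hardware support described on the next slide, not an actual API.

```c
/* Speculative execution of one chunk of a likely-DOALL loop.
 * spec_begin / spec_commit are hypothetical primitives: begin
 * checkpoints and starts buffering stores; commit returns 0 on
 * success and nonzero if a dependence violation forced a rollback. */
extern void spec_begin(void);    /* hypothetical */
extern int  spec_commit(void);   /* hypothetical */

void doall_chunk(float *a, const int *idx, int lo, int hi)
{
    spec_begin();
    for (int i = lo; i < hi; i++)
        a[idx[i]] += 1.0f;       /* indices may alias across iterations */

    if (spec_commit() != 0) {
        /* Unexpected cross-iteration dependence: the buffered stores
         * were discarded, so re-run this chunk non-speculatively
         * (ordering among chunks handled by the restart protocol). */
        for (int i = lo; i < hi; i++)
            a[idx[i]] += 1.0f;
    }
}
```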
Voltron for LLP
• Hardware detects memory dependence violations
• Hardware rolls back memory state
• Compiler rolls back register state
[Figure: core datapath whose L1 D-cache has transactional memory support: each cache line carries a transaction bit alongside the tag, state, and data fields]
LLP Speedup
[Figure: LLP speedup over a single core for SpecInt, MediaBench, and SpecFP benchmarks]
Speculation accelerates loops that are not provably DOALL, as well as small loops.
Speedup for Hybrid Execution
[Figure: speedup over a single core for SpecInt, MediaBench, and SpecFP benchmarks]
• 2-core average: ILP 1.23, TLP 1.16, LLP 1.17, Hybrid 1.46
• 4-core average: ILP 1.33, TLP 1.23, LLP 1.37, Hybrid 1.83
Time Breakdown
[Figure: breakdown of execution time by mode for SpecFP, SpecInt, and MediaBench benchmarks]
Both coupled and decoupled modes are necessary.
Conclusions and Future Work
• Voltron: an adaptive multicore system
  • Accelerates single-thread applications
  • Exploits ILP, fine-grain TLP, and statistical LLP
  • Coupled and decoupled execution
  • Dual-mode operand network
  • Compiler-managed loop speculation
  • Hybrid parallelism combines the benefits of all three
• Future work
  • Fine-grain thread identification
  • Virtualization of resources
Thank You
Questions?
For more information: http://cccp.eecs.umich.edu