Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors Carmelo Acosta¹, Francisco J. Cazorla², Alex Ramírez¹,², Mateo Valero¹,² (¹UPC-Barcelona, ²Barcelona Supercomputing Center)
Overview • Introduction • Simulation Methodology • Results • Conclusions
Introduction • As process technology advances, what to do with the growing transistor budget becomes the key question. • The current trend is to replicate cores: • Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad • AMD: Dual-Core Opteron, Quad-Core Opteron • IBM: POWER4, POWER5 • Sun Microsystems: Niagara T1, Niagara T2
Introduction [Die photos: POWER4 (CMP) and POWER5 (CMP+SMT)] • The memory subsystem (shown in green) spreads over more than half of the chip area.
Introduction • Each L1 is connected to each L2 bank with a bus-based interconnection network.
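To make the bus-based L1-to-L2-bank connectivity concrete, here is a minimal C++ sketch of a line-interleaved bank mapping; the line size and bank count are invented for illustration, since the slides do not specify the actual mapping:

    #include <cstdint>

    // Hypothetical parameters (not given in the slides): 4 L2 banks, 128-byte lines.
    constexpr uint64_t kLineBytes = 128;
    constexpr uint64_t kNumBanks  = 4;

    // Line-interleaved mapping: consecutive cache lines go to consecutive L2
    // banks, which is why every L1 needs a path to every bank over the bus.
    uint64_t l2Bank(uint64_t physAddr) {
        return (physAddr / kLineBytes) % kNumBanks;
    }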
Goal • Is prior research from the SMT field directly applicable in the new CMP+SMT scenario? • NO… we have to revisit well-known SMT ideas, such as the Instruction Fetch Policy.
[Diagram: fetch stage under the ICOUNT policy feeding the ROB]
[Diagram: ICOUNT upon an L2 miss; the missing thread's fetch is stalled] • The processor's resources stay balanced between the running threads. • All resources held by the blue (L2-missing) thread sit unused until the L2 miss is resolved.
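As a rough illustration of how ICOUNT balances resources, the following C++ sketch picks the fetch thread with the fewest in-flight instructions; the structure and field names are hypothetical, not taken from SMTsim:

    #include <cstddef>
    #include <vector>

    // Per-thread fetch bookkeeping; inFlight counts instructions between
    // fetch and commit (the quantity ICOUNT balances across threads).
    struct ThreadState {
        std::size_t inFlight;
        bool        fetchStalled;  // e.g., waiting on an I-cache miss
    };

    // ICOUNT picks the fetchable thread with the fewest in-flight
    // instructions, so no thread monopolizes the shared queues and ROB.
    int icountPick(const std::vector<ThreadState>& threads) {
        int best = -1;
        for (std::size_t t = 0; t < threads.size(); ++t) {
            if (threads[t].fetchStalled) continue;
            if (best < 0 ||
                threads[t].inFlight < threads[static_cast<std::size_t>(best)].inFlight)
                best = static_cast<int>(t);
        }
        return best;  // -1 if every thread is stalled this cycle
    }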
[Diagram: FLUSH upon an L2 miss; FLUSH is triggered] • All resources devoted to the pending instructions of the blue thread are freed.
[Diagram: FLUSH upon an L2 miss; the missing thread is stalled] • The freed resources allow the remaining threads to make additional forward progress. • Late L2 miss detection → L2 miss prediction.
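A minimal C++ sketch of the FLUSH mechanism described above; the hooks squashAfter, stallFetch, and resumeFetch are invented names standing in for real simulator state updates:

    #include <cstdint>

    // Thread model with invented hook names; a real simulator would tie
    // these to ROB/IQ occupancy and rename-map state.
    struct Thread {
        bool stalled = false;
        void squashAfter(uint64_t loadSeq) { (void)loadSeq; /* free entries younger than loadSeq */ }
        void stallFetch()  { stalled = true;  }
        void resumeFetch() { stalled = false; }
    };

    // Called when a load from this thread is detected (or predicted) to miss
    // in L2: squash its younger instructions and keep it out of fetch,
    // freeing shared resources for the other threads.
    void flushOnL2Miss(Thread& t, uint64_t loadSeq) {
        t.squashAfter(loadSeq);
        t.stallFetch();
    }

    // Called when the miss data returns: let the thread refetch from the load.
    void onL2MissResolved(Thread& t) {
        t.resumeFetch();
    }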
Single vs Multi Core [Diagram: a single-core chip vs. a multi-core chip; each core's I$ and D$ connect to L2 banks b0–b3] • More pressure on both: • the interconnection network • the shared L2 banks
Single vs Multi Core [Diagram: same single-core vs. multi-core comparison] • More unpredictable L2 access latency → BAD for FLUSH
Overview • Introduction • Simulation Methodology • Results • Conclusions
Simulation Methodology [Diagram: four cores, each with I$ and D$, connected to L2 banks b0–b3] • Trace-driven SMT simulator derived from SMTsim. • C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = number of threads per core). • [Table: core details (* per thread); contents not recovered]
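The CXTY naming can be captured in a small helper; a C++ sketch with an assumed CmpConfig struct (not part of the actual simulator):

    #include <string>

    // CXTY naming from the slides: X cores, Y hardware threads per core.
    struct CmpConfig {
        int cores;
        int threadsPerCore;
        std::string name() const {
            return "C" + std::to_string(cores) + "T" + std::to_string(threadsPerCore);
        }
    };

    // The three evaluated configurations: C2T2, C3T2, C4T2.
    const CmpConfig kConfigs[] = { {2, 2}, {3, 2}, {4, 2} };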
Simulation Methodology • Instruction Fetch Policies: • ICOUNT • FLUSH • Workloads classified by type: • ILP: all threads have good memory behavior. • MEM: all threads have bad memory behavior. • MIX: mixes both types of threads.
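A C++ sketch of how the workload type might follow from per-thread labels; the slides do not give the per-thread criterion, so the L2-miss-rate proxy mentioned in the comment is an assumption:

    #include <vector>

    enum class ThreadType { ILP, MEM };            // good vs. bad memory behavior
    enum class WorkloadType { ILP, MEM, MIX };

    // A thread would typically be labeled MEM when its L2 miss rate exceeds
    // some threshold (illustrative criterion, not stated in the slides).
    // The workload type then follows from the mix of thread types present.
    WorkloadType classify(const std::vector<ThreadType>& threads) {
        bool anyIlp = false, anyMem = false;
        for (ThreadType t : threads)
            (t == ThreadType::ILP ? anyIlp : anyMem) = true;
        if (anyIlp && anyMem) return WorkloadType::MIX;
        return anyMem ? WorkloadType::MEM : WorkloadType::ILP;
    }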
Overview • Introduction • Simulation Methodology • Results • Conclusions
Results: Single-Core (2 threads) • FLUSH yields a 22% average speedup over ICOUNT, mainly on MEM/MIX workloads.
Results: Multi-Core (2 threads/core) • More cores → less speedup. • FLUSH drops to a 9% average slowdown relative to ICOUNT on a four-core chip.
Results: L2 Hit Latency on Multi-Core [Plot: L2 hit latency in cycles] • More cores → higher latency and more dispersion.
Results: L2 Miss Prediction • In this four-core example, the best choice is predicting the L2 miss after 90 cycles.
Results: L2 Miss Prediction • But in this other four-core example, the best choice is not to predict the L2 miss at all.
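The trade-off in the two examples above can be expressed as a timeout-based predictor; a minimal C++ sketch with an invented PendingLoad record, using the 90-cycle threshold from the first example:

    #include <cstdint>

    // A pending load and a timeout-based miss predictor: if the load has been
    // outstanding longer than `threshold` cycles, assume it missed in L2 and
    // trigger FLUSH early instead of waiting for the late, definitive
    // detection. The second example above shows this is not always a win.
    struct PendingLoad {
        uint64_t issueCycle;
        bool     resolved;
    };

    bool predictL2Miss(const PendingLoad& ld, uint64_t nowCycle,
                       uint64_t threshold = 90) {
        return !ld.resolved && (nowCycle - ld.issueCycle) >= threshold;
    }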
Overview • Introduction • Simulation Methodology • Results • Conclusions
Conclusions • Future high-degree CMPs open challenging new research topics in CMP+SMT cooperation. • The characteristics of the CMP's outer cache level and interconnection network may heavily affect intra-core SMT performance. • For example, FLUSH relies on a predictable L2 hit latency, which is heavily disturbed in a CMP+SMT scenario. • FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from a single-core to a quad-core configuration.
Thank you Questions?