Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors Carmelo Acosta¹, Francisco J. Cazorla², Alex Ramírez¹,², Mateo Valero¹,² (¹UPC-Barcelona, ²Barcelona Supercomputing Center)
Overview • Introduction • Simulation Methodology • Results • Conclusions
Introduction • As process technology advances, what to do with the growing transistor budget becomes the key question. • The current trend is to replicate cores: • Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad • AMD: Dual-Core Opteron, Quad-Core Opteron • IBM: POWER4, POWER5 • Sun Microsystems: Niagara T1, Niagara T2
Introduction [Die photos: POWER4 (CMP) and POWER5 (CMP+SMT)] • The memory subsystem (shown in green) spreads over more than half of the chip area.
Introduction • Each L1 is connected to each L2 bank with a bus-based interconnection network.
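To make the bus-based L1-to-L2-bank connectivity concrete, here is a minimal C++ sketch of a line-interleaved bank mapping; the line size and bank count are invented for illustration, since the slides do not specify the actual mapping:

    #include <cstdint>

    // Hypothetical parameters (not given in the slides): 4 L2 banks, 128-byte lines.
    constexpr uint64_t kLineBytes = 128;
    constexpr uint64_t kNumBanks  = 4;

    // Line-interleaved mapping: consecutive cache lines go to consecutive L2
    // banks, which is why every L1 needs a path to every bank over the bus.
    uint64_t l2Bank(uint64_t physAddr) {
        return (physAddr / kLineBytes) % kNumBanks;
    }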
Goal • Is prior research from the SMT field directly applicable in the new CMP+SMT scenario? • NO… we have to revisit well-known SMT ideas, such as the Instruction Fetch Policy.
[Diagram: fetch stage under the ICOUNT policy feeding the ROB]
[Diagram: ICOUNT upon an L2 miss; the missing thread's fetch is stalled] • The processor's resources stay balanced between the running threads. • All resources held by the blue (L2-missing) thread sit unused until the L2 miss is resolved.
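As a rough illustration of how ICOUNT balances resources, the following C++ sketch picks the fetch thread with the fewest in-flight instructions; the structure and field names are hypothetical, not taken from SMTsim:

    #include <cstddef>
    #include <vector>

    // Per-thread fetch bookkeeping; inFlight counts instructions between
    // fetch and commit (the quantity ICOUNT balances across threads).
    struct ThreadState {
        std::size_t inFlight;
        bool        fetchStalled;  // e.g., waiting on an I-cache miss
    };

    // ICOUNT picks the fetchable thread with the fewest in-flight
    // instructions, so no thread monopolizes the shared queues and ROB.
    int icountPick(const std::vector<ThreadState>& threads) {
        int best = -1;
        for (std::size_t t = 0; t < threads.size(); ++t) {
            if (threads[t].fetchStalled) continue;
            if (best < 0 ||
                threads[t].inFlight < threads[static_cast<std::size_t>(best)].inFlight)
                best = static_cast<int>(t);
        }
        return best;  // -1 if every thread is stalled this cycle
    }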
[Diagram: FLUSH upon an L2 miss; FLUSH is triggered] • All resources devoted to the pending instructions of the blue thread are freed.
[Diagram: FLUSH upon an L2 miss; the missing thread is stalled] • The freed resources allow the remaining threads to make additional forward progress. • Late L2 miss detection → L2 miss prediction.
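A minimal C++ sketch of the FLUSH mechanism described above; the hooks squashAfter, stallFetch, and resumeFetch are invented names standing in for real simulator state updates:

    #include <cstdint>

    // Thread model with invented hook names; a real simulator would tie
    // these to ROB/IQ occupancy and rename-map state.
    struct Thread {
        bool stalled = false;
        void squashAfter(uint64_t loadSeq) { (void)loadSeq; /* free entries younger than loadSeq */ }
        void stallFetch()  { stalled = true;  }
        void resumeFetch() { stalled = false; }
    };

    // Called when a load from this thread is detected (or predicted) to miss
    // in L2: squash its younger instructions and keep it out of fetch,
    // freeing shared resources for the other threads.
    void flushOnL2Miss(Thread& t, uint64_t loadSeq) {
        t.squashAfter(loadSeq);
        t.stallFetch();
    }

    // Called when the miss data returns: let the thread refetch from the load.
    void onL2MissResolved(Thread& t) {
        t.resumeFetch();
    }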
Single vs Multi Core [Diagram: a single-core chip vs. a multi-core chip; each core's I$ and D$ connect to L2 banks b0–b3] • More pressure on both: • the interconnection network • the shared L2 banks
Single vs Multi Core [Diagram: same single-core vs. multi-core comparison] • More unpredictable L2 access latency → BAD for FLUSH
Overview • Introduction • Simulation Methodology • Results • Conclusions
Simulation Methodology [Diagram: four cores, each with I$ and D$, connected to L2 banks b0–b3] • Trace-driven SMT simulator derived from SMTsim. • C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = number of threads per core). • [Table: core details (* per thread); contents not recovered]
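The CXTY naming can be captured in a small helper; a C++ sketch with an assumed CmpConfig struct (not part of the actual simulator):

    #include <string>

    // CXTY naming from the slides: X cores, Y hardware threads per core.
    struct CmpConfig {
        int cores;
        int threadsPerCore;
        std::string name() const {
            return "C" + std::to_string(cores) + "T" + std::to_string(threadsPerCore);
        }
    };

    // The three evaluated configurations: C2T2, C3T2, C4T2.
    const CmpConfig kConfigs[] = { {2, 2}, {3, 2}, {4, 2} };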
Simulation Methodology • Instruction Fetch Policies: • ICOUNT • FLUSH • Workloads classified by type: • ILP: all threads have good memory behavior. • MEM: all threads have bad memory behavior. • MIX: mixes both types of threads.
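A C++ sketch of how the workload type might follow from per-thread labels; the slides do not give the per-thread criterion, so the L2-miss-rate proxy mentioned in the comment is an assumption:

    #include <vector>

    enum class ThreadType { ILP, MEM };            // good vs. bad memory behavior
    enum class WorkloadType { ILP, MEM, MIX };

    // A thread would typically be labeled MEM when its L2 miss rate exceeds
    // some threshold (illustrative criterion, not stated in the slides).
    // The workload type then follows from the mix of thread types present.
    WorkloadType classify(const std::vector<ThreadType>& threads) {
        bool anyIlp = false, anyMem = false;
        for (ThreadType t : threads)
            (t == ThreadType::ILP ? anyIlp : anyMem) = true;
        if (anyIlp && anyMem) return WorkloadType::MIX;
        return anyMem ? WorkloadType::MEM : WorkloadType::ILP;
    }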
Overview • Introduction • Simulation Methodology • Results • Conclusions
Results: Single-Core (2 threads) • FLUSH yields a 22% average speedup over ICOUNT, mainly on MEM/MIX workloads.
Results: Multi-Core (2 threads/core) • More cores → less speedup. • FLUSH drops to a 9% average slowdown relative to ICOUNT on a four-core chip.
Results: L2 Hit Latency on Multi-Core [Plot: L2 hit latency in cycles] • More cores → higher latency and more dispersion.
Results: L2 Miss Prediction • In this four-core example, the best choice is predicting the L2 miss after 90 cycles.
Results: L2 Miss Prediction • But in this other four-core example, the best choice is not to predict the L2 miss at all.
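The trade-off in the two examples above can be expressed as a timeout-based predictor; a minimal C++ sketch with an invented PendingLoad record, using the 90-cycle threshold from the first example:

    #include <cstdint>

    // A pending load and a timeout-based miss predictor: if the load has been
    // outstanding longer than `threshold` cycles, assume it missed in L2 and
    // trigger FLUSH early instead of waiting for the late, definitive
    // detection. The second example above shows this is not always a win.
    struct PendingLoad {
        uint64_t issueCycle;
        bool     resolved;
    };

    bool predictL2Miss(const PendingLoad& ld, uint64_t nowCycle,
                       uint64_t threshold = 90) {
        return !ld.resolved && (nowCycle - ld.issueCycle) >= threshold;
    }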
Overview • Introduction • Simulation Methodology • Results • Conclusions
Conclusions • Future high-degree CMPs open challenging new research topics in CMP+SMT cooperation. • The characteristics of the CMP's outer cache level and interconnection network may heavily affect intra-core SMT performance. • For example, FLUSH relies on a predictable L2 hit latency, which is heavily disturbed in a CMP+SMT scenario. • FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from a single-core to a quad-core configuration.
Thank you Questions?