Dynamically Heterogeneous Cores Through 3D Resource Pooling

Houman Homayoun, Vasileios Kontorinis, Amirali Shayan, Ta-Wei Lin, Dean M. Tullsen • Speaker: Houman Homayoun, National Science Foundation CI Fellow, University of California San Diego

Why Heterogeneity? • Existing general-purpose CMP designs use only homogeneous cores • A general-purpose, one-size-fits-all core is not necessarily the most efficient • One processor optimized for each application! [Figure: Core 1 and Core 2]

Static vs. Dynamic Heterogeneity • Prior work (e.g., Kumar 2003) proposes static heterogeneity. • Increases the chance of finding an appropriate core • Does not guarantee a perfect match • Others have proposed solutions for dynamic heterogeneity (Core Fusion, TFlex). • Because sharing resources at a fine granularity is difficult, they enable only coarse-grain sharing: big (combined) cores or small cores.

Outline • Resource Pooling • Why 3D? • Design Solutions • Adaptive Policies • Results • Conclusion

Application Resource Utilization

Application Resource Utilization [Figure: utilization of the LDSQ, ROB, RF, and IQ]

Application Resource Utilization • Dual-core machine [Figure: utilization of the LDSQ, ROB, RF, and IQ for Application 1 and Application 2; some resources are marked underutilized]

Dynamic Heterogeneity Through Resource Pooling • Dynamic vs. static heterogeneity [Figure: Core 1 and Core 2, each with a register file and a ROB]

Outline • Need for Heterogeneity • Why 3D? • Design Solutions • Adaptive Policies • Results • Conclusion

Why Not Share in 2D? • Long wire delay in 2D makes sharing inefficient [Figure labels: 5 nsec, 500 psec, "Demanding"]

Our Solution: 3D

Our Solution: 3D • Fast interconnection network built from through-silicon vias (TSVs) • Minimizes communication latency: as fast as a few ps (three orders of magnitude smaller than in 2D) [Figure labels: 5 psec, 5000 psec] • A principal advantage: no change to the fundamental pipeline design of 2D architectures, yet the design still exploits 3D to provide greater energy proportionality and core customization

Stackable Structures for Resource Pooling • Performance-bottleneck and power-hungry resources • Reorder Buffer and Register File (SRAM) • Instruction Queue and Load/Store Queue (CAM+SRAM) • Our goal: • Share units across multiple cores with minimal impact on the design spec (latency, number of ports, and power) • Use a previously proposed modular design • Each partition is a self-standing, independently usable unit • Effective in reducing power and access delay [Figure: register file divided into independent partitions, Part 1 to Part 4]

Example of Resource Sharing • Additional logic to decide whether a partition is empty • Additional logic to route the signal to the right partition [Figure: register files in Core 0 and Core 1 connected by TSVs, with a decoder, a MUX, and partitions marked free]
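The two pieces of added logic on this slide can be illustrated with a minimal Python sketch. It is not the hardware design from the talk: `PARTITION_SIZE`, `partition_is_empty`, `route`, and the `partition_map` layout are hypothetical names and values chosen only to show the idea of an emptiness check plus a decoder that steers an access to a local or a stacked partition.

```python
PARTITION_SIZE = 16  # entries per partition (assumed value, not from the slides)


def partition_is_empty(valid_bits):
    """Return True when no entry in the partition holds live state.

    Only an empty partition could safely be handed to another core.
    """
    return not any(valid_bits)


def route(entry_index, partition_map):
    """Decode an entry index into (layer, physical_partition, offset).

    partition_map[i] is the (layer, physical_partition) that backs the
    i-th logical partition of the requesting core. A layer different
    from the core's own layer would mean the access crosses a TSV.
    """
    logical_partition, offset = divmod(entry_index, PARTITION_SIZE)
    layer, physical_partition = partition_map[logical_partition]
    return layer, physical_partition, offset
```

For example, with `partition_map = [(0, 0), (0, 1), (1, 3)]`, entry 35 falls in the third logical partition, so it is routed to partition 3 on layer 1 at offset 3.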

Adaptive Policies for Resource Pooling • Several issues need to be considered • Ownership • Fast releasing • Fast reallocation • Cycle-by-cycle adaptation • Starvation prevention • A simple adaptive policy specification (MinMax policy) • Set limits on the size of each resource: how large it can grow (MAX) and how small it can shrink (MIN) • Use a free list • Use central arbitration
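The MinMax idea (a MIN floor, a MAX cap, a free list, and a central arbiter) can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the policy implementation from the talk: the class `MinMaxPool`, its method names, the one-grant-per-core-per-cycle rule, and the "fewest partitions first" priority are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class MinMaxPool:
    """Hypothetical model of one pooled resource split into partitions."""
    num_cores: int
    total_partitions: int
    min_parts: int  # MIN: partitions a core always keeps
    max_parts: int  # MAX: most partitions a core may hold
    owned: dict = field(init=False)
    free_list: list = field(init=False)

    def __post_init__(self):
        if self.num_cores * self.min_parts > self.total_partitions:
            raise ValueError("MIN reservations exceed the pool size")
        ids = list(range(self.total_partitions))
        # Each core starts with exactly its MIN share; the rest is free.
        self.owned = {core: [ids.pop() for _ in range(self.min_parts)]
                      for core in range(self.num_cores)}
        self.free_list = ids

    def arbitrate(self, requests):
        """Central arbitration for one cycle.

        Grants at most one free partition per requesting core, serving
        cores that hold the fewest partitions first (an assumed priority).
        """
        granted = {}
        for core in sorted(requests, key=lambda c: len(self.owned[c])):
            if self.free_list and len(self.owned[core]) < self.max_parts:
                part = self.free_list.pop()
                self.owned[core].append(part)
                granted[core] = part
        return granted

    def release(self, core):
        """Return one partition to the free list, never going below MIN."""
        if len(self.owned[core]) > self.min_parts:
            part = self.owned[core].pop()
            self.free_list.append(part)
            return part
        return None
```

In this sketch the MIN reservation is what prevents starvation: a core can never be squeezed below its floor, whatever the other cores request. A real design would also need the emptiness check before a partition is released.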

MinMax Policy Example [Figure: Applications 1 to 4 running on Cores 1 to 4 and sharing the register file through a free list and an arbitration unit; the MIN allocation is marked]

Baseline Architecture • Processor model • High-end architecture: four OoO cores with an issue width of 4 • Medium-end architecture: four OoO cores with an issue width of 2 • 3D floorplans (different performance, flexibility, and temperature tradeoffs) • (1) Conventional (thermal-optimized design) • (2) Proposed (performance-optimized design) [Figure: floorplans (1) and (2)]

Evaluation • Metrics: power, performance, temperature, energy-delay • Workloads: 1-thread, 2-thread, and 4-thread [Figure: Cores 1 to 4 marked as active or idle, with links between them]

Single-Thread Performance • Single benchmark (3 out of 4 cores are idle) • Standard SPEC2K and SPEC2006 benchmarks • Average speedup: 45% in medium-end, 26% in high-end [Figure: speedup per benchmark]

Multi-Thread Performance • 2Thr: 2 idle cores + underutilized resources in the active cores • 4Thr: no idle cores, only underutilized resources • Gains are dramatic when some cores are idle [Figure: normalized weighted speedup (%)]
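Weighted speedup is commonly defined as the sum, over all threads, of each thread's IPC in the shared run divided by its IPC when running alone. The sketch below computes that quantity in Python. The slides do not state how the result is normalized, so `normalized_gain_percent` (gain relative to a baseline configuration) is an assumption, and both function names are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-thread IPC ratios: IPC in the shared run over IPC alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one stand-alone IPC per thread")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def normalized_gain_percent(ws_pooled, ws_baseline):
    """Percentage gain of the pooled design over a baseline (assumed normalization)."""
    return 100.0 * (ws_pooled / ws_baseline - 1.0)
```

Because each thread is scaled by its own stand-alone IPC, a large speedup on a low-IPC thread counts as much as the same relative speedup on a high-IPC thread.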

Medium-end vs High-end • Resource pooling makes the medium core significantly more competitive with the high-end core. [Figure: normalized weighted speedup (%) with 0, 2, and 3 idle cores, in order of increasing resource sharing; labels: "Only 3%!", 28%, 14%]

Power • Pooling pays a small price in power because of the enhanced throughput • Speedups are large on low-IPC threads and high on average, but the increase in total instruction throughput is smaller, and thus so is the increase in power [Figure: power (Watt); labels: 4X, 3X]

Temperature • Interestingly, the temperature of the medium resource-pooling core is comparable to that of the high-end core [Figure: temperature (Celsius)]

Efficiency • Even at equal temperature, the more modest cores have a significant advantage in energy efficiency, measured in MIPS²/W (the inverse of the energy-delay product) [Figure: normalized MIPS²/W; label: 2X]
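The link between MIPS²/W and the energy-delay product can be checked with a short derivation. The symbols are introduced here for illustration and are not from the slides: N is the number of instructions executed, T the execution time in seconds, P the average power in watts, and E = P·T the energy.

```latex
\begin{align*}
\mathrm{EDP} &= E \cdot T = (P \cdot T) \cdot T = P\,T^{2} \\
\mathrm{MIPS} &= \frac{N}{10^{6}\,T} \\
\frac{\mathrm{MIPS}^{2}}{\mathrm{W}} &= \frac{N^{2}}{10^{12}\,T^{2}\,P}
  = \frac{N^{2}}{10^{12}} \cdot \frac{1}{\mathrm{EDP}}
\end{align*}
```

For a fixed instruction count N, MIPS²/W is therefore proportional to 1/EDP, so a higher value means a lower energy-delay product.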

Conclusions • Homogeneous cores are inherently inefficient for a diverse workload. • Cores are typically overprovisioned as a result • 3D stacking of cores enables fine-grain sharing (pooling) of resources not possible in 2D designs. • Our dynamically heterogeneous 3D architecture allows the processor to construct the right core for each application dynamically, maximizing energy efficiency. • Our 3D pooling architecture • Leverages our experience in 2D pipeline design, yet still gains significant benefit from 3D • Adapts to the specific demands of an application within a few cycles. • Reduces reliance on overprovisioned cores, instead grabbing larger resources only when needed.

