Class Presentation for Advanced VLSI Course • Instructor: Dr. S. M. Fakhraie • Presented by: Naser Sedaghati • Major Reference: "Design and Implementation of the POWER5™ Microprocessor", J. Clabes¹, J. Friedrich¹, M. Sweet¹, J. DiLullo¹, S. Chu¹, D. Plass², J. Dawson², P. Muench², L. Powell¹, M. Floyd¹, B. Sinharoy², M. Lee¹, M. Goulet¹, J. Wagoner¹, N. Schwarz¹, S. Runyon¹, G. Gorman¹, P. Restle³, R. Kalla¹, J. McGill¹, S. Dodson¹ • ¹IBM System Group, Austin, TX • ²IBM System Group, Poughkeepsie, NY • ³IBM Research, Yorktown Heights, NY • IEEE International Solid-State Circuits Conference 2004 • Winter 2004
Outline • Motivation • Background • Threading Fundamentals • Enhancement SMT Implementation in POWER5 • Memory Subsystem Enhancements • Power Efficiency • Additional SMT Considerations • Summary
Microprocessor Design Optimization Focus Areas • Motivation … • Memory latency • Increased processor speeds make memory appear further away • Longer stalls possible • Branch processing • Mispredict more costly as pipeline depth increases resulting in stalls and wasted power • Predication drives increased power and larger chip area • Execution Unit Utilization • Currently 20-25% execution unit utilization common • Simultaneous multi-threading (SMT) and POWER architecture address these areas
POWER4 --- Shipped in Systems December 2001 • Background … • Technology: 180 nm lithography, Cu, SOI • POWER4+ shipping in 130 nm today • 267 mm², 185M transistors • Dual processor core • 8-way superscalar • Out of Order execution • Load / Store units • 2 Fixed Point units • 2 Floating Point units • Logical operations on Condition Register • Branch Execution unit • > 200 instructions in flight • Hardware instruction and data prefetch
POWER5 --- The Next Step • Background … • Technology: 130 nm lithography, Cu, SOI • 389 mm², 276M transistors • Dual processor core • 8-way superscalar • Simultaneous multithreaded (SMT) core • Up to 2 virtual processors per real processor • Natural extension to POWER4 design
System-level view of POWER5 • Background …
Multi-threading Evolution • Threading …
Changes Going From ST to SMT Core • Enhancement … • SMT easily added to Superscalar Micro-architecture • Second Program Counter (PC) added to share I-fetch bandwidth • GPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread) • Completion logic replicated to track two threads • Thread bit added to most address/tag buses
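The rename-mapper change above can be illustrated with a small sketch. It assumes 32 architected registers per thread (as in the PowerPC GPR and FPR files) and a shared pool of 120 physical registers (the figure quoted on the resources slide); the names `mapper_index` and `RenameMapper` are hypothetical, and the allocation policy is deliberately simplified (no freeing of old mappings), so this is not a description of the actual POWER5 logic.

```python
ARCH_REGS_PER_THREAD = 32   # architected GPRs (or FPRs) per thread
THREAD_SHIFT = 5            # log2(32): the thread bit sits just above the register number


def mapper_index(thread_id: int, arch_reg: int) -> int:
    """Form the mapper address: the high-order bit selects the thread."""
    if thread_id not in (0, 1):
        raise ValueError("this sketch models two hardware threads")
    if not 0 <= arch_reg < ARCH_REGS_PER_THREAD:
        raise ValueError("architected register out of range")
    return (thread_id << THREAD_SHIFT) | arch_reg


class RenameMapper:
    """Toy mapper: both threads draw from one shared physical register pool."""

    def __init__(self, physical_regs: int = 120):  # 120 is taken from the slides
        self.free = list(range(physical_regs))
        self.table = {}

    def rename_dest(self, thread_id: int, arch_reg: int) -> int:
        """Allocate a new physical register for a destination operand."""
        phys = self.free.pop(0)
        self.table[mapper_index(thread_id, arch_reg)] = phys
        return phys

    def lookup(self, thread_id: int, arch_reg: int) -> int:
        """Return the physical register currently holding a source operand."""
        return self.table[mapper_index(thread_id, arch_reg)]


mapper = RenameMapper()
p0 = mapper.rename_dest(0, 3)   # thread 0 writes r3
p1 = mapper.rename_dest(1, 3)   # thread 1 writes its own r3
```

Because the thread bit is part of the mapper address, the same architected register number in the two threads resolves to different table entries and therefore to different physical registers.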
POWER5 Resources Size Enhancements • Enhancement … • Enhanced caches and translation resources • I-cache: 64 KB, 2-way set associative, LRU • D-cache: 32 KB, 4-way set associative, LRU • First level Data Translation: 128 entries, fully associative, LRU • L2 Cache: 1.92 MB, 10-way set associative, LRU • Larger resource pools • Rename registers: GPRs, FPRs increased to 120 each • L2 cache coherency engines: increased by 100% • Enhanced data stream prefetching • Memory controller moved on chip
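As a worked example of what the L1 cache parameters above imply, the number of sets follows from size / (ways × line size). The slide does not give the line size; the 128-byte value below is an assumption made for illustration, and `num_sets` is a hypothetical helper.

```python
def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size is not a whole number of sets")
    return sets


LINE_BYTES = 128  # assumption: line size is not stated on the slide

icache_sets = num_sets(64 * 1024, ways=2, line_bytes=LINE_BYTES)  # 64 KB, 2-way
dcache_sets = num_sets(32 * 1024, ways=4, line_bytes=LINE_BYTES)  # 32 KB, 4-way

print(icache_sets, dcache_sets)  # prints "256 64"
```

With a different assumed line size the set counts scale inversely; the sizes and associativities are the only values taken from the slide.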
Thread Priority • Enhancement … • Instances when unbalanced execution is desirable • No work for opposite thread • Thread waiting on lock • Software-determined non-uniform balance • Power management • … • Solution: Control instruction decode rate • Software/hardware controls 8 priority levels for each thread
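One way to picture "control instruction decode rate" is as a split of decode cycles that depends on the priority difference between the two threads. The slide does not state the actual rule; the power-of-two weighting below (the lower-priority thread gets 1 of every 2^(|X−Y|+1) decode cycles) is an assumption for illustration, `decode_shares` is a hypothetical name, and any special behaviour of the lowest priority levels is ignored.

```python
from fractions import Fraction

PRIORITY_LEVELS = 8  # from the slide: 8 priority levels per thread


def decode_shares(prio_a: int, prio_b: int) -> tuple:
    """Return the fraction of decode cycles given to thread A and thread B.

    Assumed rule (not from the slide): with window = 2 ** (|a - b| + 1),
    the lower-priority thread gets 1 cycle per window, the other the rest.
    """
    for prio in (prio_a, prio_b):
        if not 0 <= prio < PRIORITY_LEVELS:
            raise ValueError("priority must be in 0..7")
    window = 2 ** (abs(prio_a - prio_b) + 1)
    low = Fraction(1, window)
    high = 1 - low
    # Equal priorities give window == 2, so both shares are 1/2.
    return (high, low) if prio_a >= prio_b else (low, high)


balanced = decode_shares(4, 4)   # both threads share decode equally
skewed = decode_shares(6, 4)     # thread A is two levels above thread B
```

Under this assumed rule, equal priorities alternate decode cycles, and each additional level of difference halves the lower-priority thread's share.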
Modifications to POWER4 System Structure • Memory …
Power Efficient Design Implementation • Power … • DC power mitigation • Leverage triple Vt technology • Decrease low Vt usage by 90% • Increase high Vt usage by 30% • Leverage triple Tox technology • Thick Tox usage for decoupling capacitors • AC power mitigation • Minimal usage of dynamic circuits • Reduce loading on clock mesh • Incorporation of dynamic clock gating
Thermal control logic and sample thermal response. • Power …
16-way Building Block • Additional …
POWER5 Multi-Chip Module • Additional … • 95 mm × 95 mm • Four POWER5 chips • Four cache chips • 4,491 signal I/Os • 89 layers of metal
64-way SMP Interconnection • Additional … • Interconnection exploits enhanced distributed switch • All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchy • Additional …
POWER Server Roadmap • Additional …
Summary • Summary … • First dual core SMT microprocessor • Extended SMP to 64-way • Operating in laboratory • Power dynamically managed with no performance penalty • Implementation permits future technology scalability from circuit and power perspective • Innovative approach leveraging technology with system focus for high performance in a power efficient design
Other References • [1] R. Kalla, B. Sinharoy, J. Tendler, "IBM POWER5 Chip: A Dual-Core Multithreaded Processor", IEEE Micro, IEEE Computer Society, March-April 2004 • [2] R. Kalla, IBM System Group, "IBM's POWER5 Design and Methodology", IBM Corporation, 2003