The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K Shubu Mukherjee, Ph.D. Principal Hardware Engineer VSSAD Labs, Alpha Development Group Compaq Computer Corporation Shrewsbury, Massachusetts
Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)
Alpha Microprocessor Roadmap [Figure: performance versus year of first system ship, 1998 to 2003. Higher Performance line: 21264 EV6 (0.35 µm), 21364 EV7 (0.18 µm), 21464 EV8 (0.125 µm). Lower Cost line: 21264 EV67 (0.28 µm), 21264 EV68 (0.18 µm), 21364 EV78 (0.125 µm).]
Alpha 21264 Microprocessor • Architectural Features • First “Out-of-Order” Alpha • Four-wide superscalar • … • Performance • World’s Fastest Microprocessor (www.spec.org, 11/17/99) • 39 SPECint95, 68 SPECfp95 @ 700 MHz • Intel Pentium III @ 733 MHz delivers 36 SPECint95, 30 SPECfp95
Alpha 21364 Goals • Leadership single stream performance • Higher operating frequency • Integrated memory interface • Leadership multiprocessor performance • Integrated system / multiprocessor interface
Alpha 21364 Features • System-on-a-Chip • Alpha 21264 core with enhancements • Integrated L2 Cache • Integrated memory controller • Integrated network interface • Fault-Tolerance • Support for lock-step operation to enable high-availability systems.
21364 Chip Block Diagram [Figure: 21264 core with 64K Icache and 64K Dcache; 16 L1 miss buffers; 16 L1 victim buffers; 16 L2 victim buffers; L2 cache; memory controller attached to RAMBUS; network interface with N, S, E, W and I/O ports; Address In and Address Out paths.]
21364 Core [Figure: pipeline diagram, stages numbered 0 to 6: FETCH, MAP, QUEUE, REG, EXEC, DCACHE. Fetch: L1 instruction cache (64 KB, 2-set), branch predictors, next-line address, 4 instructions / cycle. Map: integer and FP register maps. Queue: integer issue queue (20), FP issue queue (15). Reg: two integer register files (80 each), FP register file (72). Exec: integer Exec and Addr units, FP ADD, Div/Sqrt, FP MUL. Dcache: L1 data cache (64 KB, 2-set), victim buffer, miss address, L2 cache (1.5 MB, 6-set). 80 in-flight instructions plus 32 loads and 32 stores.]
Integrated L2 Cache • 1.5 MB • 6-way set associative • 16 GB/s total read/write bandwidth • 16 Victim buffers for L1 -> L2 • 16 Victim buffers for L2 -> Memory • ECC SECDED code • 12 ns load-to-use latency
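As a rough illustration of the 6-way set-associative organization on this slide, the sketch below derives the number of sets and the tag/index/offset split for a 1.5 MB cache. The 64-byte line size is an assumption (the slide does not state it), and the helper names `cache_geometry` and `split_address` are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    # A plain bit-field split needs power-of-two sets and line size.
    assert sets & (sets - 1) == 0 and line_bytes & (line_bytes - 1) == 0
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


L2_SIZE = 3 * 512 * 1024   # 1.5 MB, from the slide
LINE_BYTES = 64            # assumption, not stated on the slide

sets, offset_bits, index_bits = cache_geometry(L2_SIZE, ways=6, line_bytes=LINE_BYTES)
```

Under the 64-byte assumption, six ways divide 1.5 MB into a power-of-two number of sets, which is one reason an associativity of six fits this capacity neatly.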
Integrated Memory Controller • Direct RAMbus • High data capacity per pin • 800 MHz operation • 30 ns CAS latency pin to pin • 6 GB/s read or write bandwidth • 100s of open pages • Directory based cache coherence • ECC SECDED
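The slide names directory-based cache coherence without describing the protocol. The sketch below is a generic three-state directory (uncached, shared, exclusive) kept at a block's home node, intended only to show the idea; the states, message names, and classes are all hypothetical and are not the 21364's actual protocol.

```python
class DirectoryEntry:
    """Per-memory-block state kept at the block's home node."""

    def __init__(self):
        self.state = "uncached"   # "uncached", "shared", or "exclusive"
        self.sharers = set()      # ids of nodes holding a copy


class Directory:
    def __init__(self):
        self.entries = {}

    def _entry(self, block):
        return self.entries.setdefault(block, DirectoryEntry())

    def read(self, block, node):
        """Handle a read miss; return the messages the home node sends."""
        e = self._entry(block)
        msgs = []
        if e.state == "exclusive":
            (owner,) = e.sharers
            if owner == node:
                return msgs                    # requester already owns the block
            msgs.append(("downgrade", owner))  # owner must give up exclusivity
        e.sharers.add(node)
        e.state = "shared"
        msgs.append(("data", node))
        return msgs

    def write(self, block, node):
        """Handle a write miss; every other copy is invalidated first."""
        e = self._entry(block)
        msgs = [("invalidate", n) for n in sorted(e.sharers) if n != node]
        e.sharers = {node}
        e.state = "exclusive"
        msgs.append(("data_exclusive", node))
        return msgs
```

The point of a directory is that invalidations go only to the recorded sharers rather than being broadcast to every processor, which is what lets the scheme scale to many nodes.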
Integrated Network Interface • Direct processor-to-processor interconnect • 10 GB/s per processor • 15 ns processor-to-processor latency • Out-of-order network with adaptive routing • Asynchronous clocking between processors • 3 GB/s I/O interface per processor
21364 System Block Diagram [Figure: twelve interconnected 21364 processors, each with its own memory (M) and I/O (IO) attached.]
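The N, S, E and W ports on the chip and the processor grid in the system diagram suggest a two-dimensional interconnect. The sketch below models it as a 2D torus and counts minimal hops between processors. The wraparound links, the 4 x 3 arrangement of the twelve processors, and the reading of the 15 ns figure as a per-hop latency are all assumptions made for illustration.

```python
def torus_neighbors(x, y, width, height):
    """Neighbors reached through the N, S, E, W ports, with wraparound."""
    return {
        "N": (x, (y - 1) % height),
        "S": (x, (y + 1) % height),
        "E": ((x + 1) % width, y),
        "W": ((x - 1) % width, y),
    }


def torus_hops(src, dst, width, height):
    """Minimal hop count between two nodes on a 2D torus."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


HOP_LATENCY_NS = 15  # assumption: the slide's 15 ns taken as one hop


def estimated_latency_ns(src, dst, width, height):
    """Unloaded latency estimate: hops times per-hop latency."""
    return torus_hops(src, dst, width, height) * HOP_LATENCY_NS


WIDTH, HEIGHT = 4, 3  # assumed layout for the twelve processors
corner_to_corner = estimated_latency_ns((0, 0), (3, 2), WIDTH, HEIGHT)
```

With wraparound, opposite corners of the assumed 4 x 3 grid are two hops apart rather than five, which is the main attraction of a torus over a plain mesh.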
Alpha 21364 Technology • 0.18 µm CMOS • 1000+ MHz • 100 Watts @ 1.5 volts • 3.5 cm² • 6 Layer Metal • 100 million transistors • 8 million logic • 92 million RAM
Alpha 21364 Status • 70 SPECint95 (estimated) • 120 SPECfp95 (estimated) • RTL model running • Tapeout: Summer 2000
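To put the estimates in context, the short calculation below compares them with the 21264 figures quoted on the earlier Alpha 21264 slide (39 SPECint95, 68 SPECfp95 at 700 MHz). The 1000 MHz clock used for normalization is an assumption taken from the "1000+ MHz" technology bullet, so the per-clock ratios are an upper-bound style estimate rather than a measured result.

```python
ev6 = {"SPECint95": 39, "SPECfp95": 68}    # 21264 @ 700 MHz, from the 21264 slide
ev7 = {"SPECint95": 70, "SPECfp95": 120}   # 21364 estimates from this slide

EV6_MHZ = 700
EV7_MHZ = 1000   # assumption: lower bound of the "1000+ MHz" target

# Overall estimated speedup of the 21364 over the 21264.
speedup = {name: ev7[name] / ev6[name] for name in ev6}

# Portion of the speedup not explained by clock frequency alone.
clock_ratio = EV7_MHZ / EV6_MHZ
per_clock_gain = {name: speedup[name] / clock_ratio for name in speedup}

for name in ev6:
    print(f"{name}: {speedup[name]:.2f}x overall, {per_clock_gain[name]:.2f}x per clock")
```

Since the 21364 reuses the 21264 core, any per-clock gain in this rough comparison would have to come from the integrated L2 cache and memory controller rather than from the pipeline itself.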
21364 Summary: System on a Chip • Integrated L2 cache and memory controller • outstanding single processor performance • Integrated network interface • high performance multi-processor systems • scales to large number of processors
Alpha 21464 Goals • Leadership single stream performance • Higher operating frequency / better technology • New microarchitecture • Integrated memory interface (like 21364) • Leadership multiprocessor performance • Simultaneous Multithreading (with minimal change/cost) • Integrated system / multiprocessor interface (like 21364)
Alpha 21464 Technology Overview • Leading edge process technology – 1.2-2.0 GHz • 0.125 µm CMOS • SOI-compatible • Cu interconnect • low-k dielectrics • Chip characteristics • ~1.2 V Vdd • ~250 Million transistors
Alpha 21464 Architecture Overview • Enhanced out-of-order execution • 8-wide superscalar • Large on-chip L2 cache • Direct RAMBUS interface • On-chip router for system interconnect • Glueless, directory-based, ccNUMA • for up to 512-way multiprocessing • 4-way simultaneous multithreading (SMT)
Instruction Issue [Figure: issue slots over time.] Reduced function unit utilization due to dependencies
Superscalar Issue [Figure: issue slots over time.] Superscalar leads to more performance, but lower utilization
Predicated Issue [Figure: issue slots over time.] Adds to function unit utilization, but results are thrown away
Chip Multiprocessor [Figure: issue slots over time.] Limited utilization when only running one thread
Fine Grained Multithreading [Figure: issue slots over time.] Intra-thread dependencies still limit performance
Simultaneous Multithreading [Figure: issue slots over time.] Maximum utilization of function units by independent operations
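The contrast drawn across the issue-slot slides above can be sketched with a toy model. Each thread has a number of independent, ready instructions in each cycle; the three policies differ only in which threads may fill a cycle's issue slots. The `READY` numbers are invented for illustration, the 4-wide issue width is chosen to keep the example small, and instructions that do not issue are simply dropped rather than carried to the next cycle.

```python
def utilization(issued_per_cycle, width):
    """Fraction of issue slots filled over the whole run."""
    return sum(issued_per_cycle) / (width * len(issued_per_cycle))


def superscalar(ready, width):
    """Single thread (thread 0) owns every slot in every cycle."""
    return [min(width, r) for r in ready[0]]


def fine_grained(ready, width):
    """One thread per cycle, round-robin, skipping threads with no ready work."""
    n, issued = len(ready), []
    for c in range(len(ready[0])):
        order = [(c + i) % n for i in range(n)]
        pick = next((t for t in order if ready[t][c] > 0), order[0])
        issued.append(min(width, ready[pick][c]))
    return issued


def smt(ready, width):
    """All threads compete for the same slots within a cycle."""
    return [min(width, sum(t[c] for t in ready)) for c in range(len(ready[0]))]


WIDTH = 4
# READY[thread][cycle] = independent instructions ready to issue (made-up data).
READY = [
    [2, 0, 0, 2],
    [1, 2, 1, 0],
    [0, 1, 2, 1],
    [2, 1, 0, 2],
]

results = {
    "superscalar": utilization(superscalar(READY, WIDTH), WIDTH),
    "fine_grained": utilization(fine_grained(READY, WIDTH), WIDTH),
    "smt": utilization(smt(READY, WIDTH), WIDTH),
}
```

In this toy data, fine-grained multithreading removes the fully idle cycles but still leaves slots empty within each cycle, while SMT fills those remaining slots with instructions from other threads.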
Basic Out-of-order Pipeline [Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures shown: PC, Icache, Register Map, Regs, Dcache, Regs. Label: Thread-blind.]
SMT Pipeline [Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures shown: PC, Icache, Register Map, Regs, Dcache, Regs.]
Changes for SMT • Basic pipeline – unchanged • Replicated resources • Program counters • Register maps • Shared resources • Register file (size increased) • Instruction queue • First and second level caches • Translation buffers • Branch predictor
Architectural Abstraction • 1 Processor with 4 Thread Processing Units (TPUs) • Shared hardware resources [Figure: TPU0, TPU1, TPU2 and TPU3 sharing the Icache, TLB, Dcache and Scache.]
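The split between replicated and shared resources can be sketched as a data structure: each TPU keeps its own program counter and register map, while the physical register file and instruction queue are shared. This is a minimal model under stated assumptions, not the 21464 design; the class names, the register-file size of 512, and the `enqueue` method are hypothetical.

```python
class ThreadContext:
    """State replicated per thread processing unit (TPU)."""

    def __init__(self):
        self.pc = 0
        self.register_map = {}   # architectural register -> physical register


class SMTCore:
    """One processor with several TPUs sharing the back-end resources."""

    def __init__(self, num_tpus=4, num_physical_regs=512):
        # num_physical_regs is a made-up size; the slides give no figure.
        self.tpus = [ThreadContext() for _ in range(num_tpus)]
        self.free_regs = list(range(num_physical_regs))  # shared register file
        self.instruction_queue = []                      # shared issue queue

    def rename_dest(self, tpu_id, arch_reg):
        """Map a thread's architectural register to a shared physical one."""
        phys = self.free_regs.pop(0)
        self.tpus[tpu_id].register_map[arch_reg] = phys
        return phys

    def enqueue(self, tpu_id, opcode, dest):
        """Rename the destination, queue the instruction, advance that thread's PC."""
        phys = self.rename_dest(tpu_id, dest)
        self.instruction_queue.append((tpu_id, opcode, phys))
        self.tpus[tpu_id].pc += 4   # Alpha instructions are 4 bytes wide


core = SMTCore()
core.enqueue(0, "addq", dest=3)
core.enqueue(1, "addq", dest=3)
```

Two threads writing the same architectural register end up with different physical registers, so once instructions are renamed the shared queue and function units need not know which thread an instruction came from.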
21464 System Block Diagram [Figure: nine interconnected EV8 processors, each with its own memory (M) and I/O (IO) attached; the labels 0, 1, 2, 3 also appear in the diagram.]
Alpha 21464 Summary • Leadership single stream performance • Higher operating frequency / better technology • New microarchitecture • Integrated memory interface (like 21364) • Leadership multiprocessor performance • Simultaneous Multithreading (with minimal changes/cost) • Integrated system / multiprocessor interface (like 21364)
Maintain Performance Lead Beyond Y2K • Alpha 21364 • Reuses 21264 microprocessor core • System on a chip • Alpha 21464 • New microarchitecture • System on a chip • Simultaneous Multithreading
My Current Research: Beyond 21464? • The Truth Project (w/ Joel Emer) • Examines different microarchitectural issues • The Multinet Project (w/ Rick Kessler) • Tightly-coupled multiprocessor networks • The Reliant Project (w/ Steve Reinhardt) • Self-Checking Microprocessors using SMT, ISCA submission • Asim (w/ VSSAD Labs) • Performance Model for Alphas beyond 21464