CS 7960-4 Lecture 23
Processor Case Studies: "The Microarchitecture of the Pentium 4 Processor", G. Hinton et al., Intel Technology Journal, Q1 2001
Quick Facts
• November 2000: Willamette, 0.18 µm, Al interconnect, 42M transistors, 217 mm², 55 W, 1.5 GHz
• February 2004: Prescott, 0.09 µm, Cu interconnect, 125M transistors, 112 mm², 103 W, 3.4 GHz
Clock Frequencies
• Aggressive clocks => little work per pipeline stage => deep pipelines => low IPC, large buffers, high power, high complexity, low efficiency
• A 50% increase in clock speed yields roughly a 30% increase in performance
• (Figure: pipeline diagrams contrasting mispredict latencies of 10 cycles vs. 20 cycles)
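To see roughly where the "50% more clock => 30% more performance" trade-off comes from, here is a small analytical sketch in C. The base IPC, branch frequency, and mispredict rate are made-up illustrative parameters, not numbers from the paper; only the 10- vs. 20-cycle mispredict latencies come from the slide.

```c
/* Sketch: rough performance model for deeper pipelines (illustrative
 * parameters only, not taken from the paper). Assumes a branch every
 * 5 instructions, a fixed predictor accuracy, and that the mispredict
 * penalty scales with pipeline depth. */
#include <stdio.h>

int main(void) {
    double base_ipc = 2.0;          /* issue-limited IPC, no mispredicts   */
    double branch_frac = 0.2;       /* fraction of instructions = branches */
    double mispredict_rate = 0.05;  /* 5% of branches mispredicted         */

    int depths[] = {10, 20};        /* mispredict latency in cycles        */
    double freqs[] = {1.0, 1.5};    /* relative clock frequency            */

    for (int i = 0; i < 2; i++) {
        /* Extra cycles per instruction lost to mispredict recovery. */
        double stall_cpi = branch_frac * mispredict_rate * depths[i];
        double ipc = 1.0 / (1.0 / base_ipc + stall_cpi);
        double perf = ipc * freqs[i];   /* instructions per unit time */
        printf("latency %2d cyc: IPC = %.2f, relative perf = %.2f\n",
               depths[i], ipc, perf);
    }
    return 0;
}
```

With these assumed parameters the deeper pipeline loses IPC (1.67 vs. 1.43) but, at 1.5x the clock, still comes out roughly 30% faster, which is the trade-off the slide summarizes.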
Variable Clocks
• The fastest clock is defined by the time for an ALU operation and bypass; it runs at twice the main processor clock
• Different parts of the chip operate at slower clocks to simplify the pipeline design (e.g. RAMs)
Front End
• ITLB, RAS, decoder; note: no I-cache
• Trace Cache: holds 12K µops (roughly equivalent to an 8KB-16KB I-cache), saves 3 pipe stages, reduces power
• Front-end BTB accessed on a trace cache miss, plus a smaller trace-cache BTB to predict the next trace line; no details given on the branch prediction algorithm
• Microcode ROM: implements µop translation for complex instructions
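A minimal sketch of what a trace-cache hit looks like to the front end, assuming a direct-mapped structure with hypothetical line and µop formats (not Intel's actual organization). The point is that on a hit, already-decoded µops are delivered and the decoder is bypassed, which is where the saved pipe stages and power come from.

```c
/* Sketch of a trace cache lookup: indexed by the fetch PC, each line
 * holds a short sequence of pre-decoded uops so that decode stages are
 * skipped on a hit. Sizes and fields are illustrative, not the P4's. */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define TC_LINES      256
#define UOPS_PER_LINE 6

typedef struct { uint8_t opcode, dst, src1, src2; } uop_t;

typedef struct {
    bool     valid;
    uint32_t start_pc;               /* address the trace begins at   */
    int      n_uops;
    uop_t    uops[UOPS_PER_LINE];    /* already-decoded micro-ops     */
    uint32_t next_pc;                /* predicted next trace address  */
} tc_line_t;

static tc_line_t trace_cache[TC_LINES];

/* Returns the number of uops delivered, or 0 on a miss (in which case
 * the front-end BTB and the decoder/microcode ROM take over). */
int trace_cache_fetch(uint32_t pc, uop_t *out, uint32_t *next_pc) {
    tc_line_t *line = &trace_cache[(pc / 4) % TC_LINES];
    if (!line->valid || line->start_pc != pc)
        return 0;                    /* trace cache miss              */
    memcpy(out, line->uops, line->n_uops * sizeof(uop_t));
    *next_pc = line->next_pc;
    return line->n_uops;
}
```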
Execution Engine
• Allocator: resource manager (regs, IQ, LSQ, ROB)
• Rename: 8 logical regs are renamed to 128 physical regs; the ROB (126 entries) stores only pointers (Pentium 4), not the actual reg values (unlike the P6), giving a simpler design and lower power (see the sketch below)
• Two queues (memory and non-memory) and multiple schedulers (select logic); can issue six instrs/cycle
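A minimal sketch of the rename step under the pointer-based scheme above: the ROB entry records only the old and new physical-register tags, never the data values. The 8/128/126 sizes match the slide; the field names, free-list handling, and initialization are illustrative assumptions.

```c
/* Sketch: renaming 8 logical registers onto a larger physical register
 * file, with the ROB recording only the mapping (old/new pointers)
 * rather than the register values themselves. Illustrative only. */
#include <stdint.h>
#include <assert.h>

#define NUM_LOGICAL   8
#define NUM_PHYSICAL  128
#define ROB_SIZE      126

typedef struct {
    uint8_t logical;      /* destination logical register          */
    uint8_t new_phys;     /* physical register allocated for it    */
    uint8_t old_phys;     /* previous mapping, freed at retirement */
} rob_entry_t;

static uint8_t     rat[NUM_LOGICAL];          /* register alias table    */
static uint8_t     free_list[NUM_PHYSICAL];   /* free physical registers */
static int         free_count;
static rob_entry_t rob[ROB_SIZE];
static int         rob_tail;

/* Initialize: logical reg i maps to physical reg i; the rest are free. */
void rename_init(void) {
    for (int i = 0; i < NUM_LOGICAL; i++) rat[i] = (uint8_t)i;
    free_count = 0;
    for (int i = NUM_PHYSICAL - 1; i >= NUM_LOGICAL; i--)
        free_list[free_count++] = (uint8_t)i;
    rob_tail = 0;
}

/* Rename one instruction's destination register; returns its ROB index. */
int rename_dest(uint8_t logical_reg) {
    assert(free_count > 0 && rob_tail < ROB_SIZE);  /* allocator stalls otherwise */
    uint8_t new_phys = free_list[--free_count];

    rob[rob_tail] = (rob_entry_t){
        .logical  = logical_reg,
        .new_phys = new_phys,
        .old_phys = rat[logical_reg],  /* remembered for recovery/freeing */
    };
    rat[logical_reg] = new_phys;       /* later readers see the new tag   */
    return rob_tail++;
}
```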
Schedulers • Register porting, multiple queues • 3GHz clock speed = time for a 16-bit add and bypass
NetBurst
• 3 GHz ALU clock = time for a 16-bit add and bypass to itself (area is kept to a minimum)
• Used by 60-70% of all µops in integer programs
• Staggered addition speeds up execution of dependent instrs; an add takes three fast cycles
• Early computation of the lower 16 bits => early initiation of cache access (sketched below)
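The staggered (double-pumped) add can be pictured as two half-width additions linked by a carry. The sketch below is a software analogue, not the hardware design, but it shows why the low 16 bits, which are enough to start forming a cache index, are available a fast cycle before the high half.

```c
/* Sketch of staggered addition: the low 16 bits of the sum (plus the
 * carry out) are produced in the first fast-clock phase, the high 16
 * bits in the next. A dependent op can consume the low half early. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint16_t lo; uint8_t carry; } half_sum_t;

/* Phase 1: low 16 bits, available one fast cycle early. */
half_sum_t add_low(uint32_t a, uint32_t b) {
    uint32_t s = (a & 0xFFFF) + (b & 0xFFFF);
    return (half_sum_t){ .lo = (uint16_t)s, .carry = (uint8_t)(s >> 16) };
}

/* Phase 2: high 16 bits, using the carry from phase 1. */
uint16_t add_high(uint32_t a, uint32_t b, uint8_t carry_in) {
    return (uint16_t)((a >> 16) + (b >> 16) + carry_in);
}

int main(void) {
    uint32_t a = 0x0001FFFF, b = 0x00000001;
    half_sum_t h = add_low(a, b);          /* cache access could start here */
    uint32_t sum = ((uint32_t)add_high(a, b, h.carry) << 16) | h.lo;
    printf("0x%08X + 0x%08X = 0x%08X\n", a, b, sum);   /* 0x00020000 */
    return 0;
}
```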
Data Cache
• 8KB 4-way cache; 2-cycle load-use latency for integer instrs and 6-cycle latency for FP instrs
• The distance between the load scheduler and execution is longer than the load latency
• Speculative issue of load-dependent instrs and selective replay (sketched below)
• Store buffer (24 entries) to forward results to loads (48-entry load buffer); no details on the load issue algorithm
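A rough sketch of the speculative-issue-plus-selective-replay idea, assuming a simple per-entry dependence flag; the paper does not describe the scheduler's replay mechanism in this detail, so the structures here are assumptions. Dependents of a load are issued as if the load hit; if it misses, only those dependents are re-queued rather than flushing the window.

```c
/* Sketch of speculative wakeup with selective replay: instructions
 * dependent on a load are issued assuming an L1 hit; if the load turns
 * out to miss, only those dependents are re-queued (replayed), not the
 * whole window. Structures and sizes are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define WINDOW 32

typedef struct {
    bool valid;
    bool depends_on_load;   /* woken up speculatively by a load */
    bool issued;
} instr_t;

static instr_t window[WINDOW];

/* Issue dependents of a load speculatively, before hit/miss is known. */
void speculative_issue(void) {
    for (int i = 0; i < WINDOW; i++)
        if (window[i].valid && window[i].depends_on_load)
            window[i].issued = true;
}

/* If the load actually missed, selectively replay only its dependents. */
void on_load_result(bool hit) {
    if (hit) return;
    for (int i = 0; i < WINDOW; i++)
        if (window[i].valid && window[i].depends_on_load && window[i].issued) {
            window[i].issued = false;          /* back into the queue */
            printf("replaying slot %d\n", i);
        }
}
```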
Cache Hierarchy
• 256KB 8-way L2; 7-cycle latency; a new operation can start every two cycles
• Stream prefetcher from memory to L2 tries to stay 256 bytes ahead (sketched below)
• 3.2 GB/s system bus: 64-bit wide bus at 400 MHz
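A minimal sketch of a stream prefetcher that tries to stay 256 bytes ahead of the demand-miss stream, as the slide states. The 64-byte line size, the single-stream assumption, and the triggering condition are simplifications I am assuming for illustration.

```c
/* Sketch of a simple stream prefetcher that tries to stay a fixed
 * distance ahead of the demand-miss stream (256 bytes here, as the
 * slide states for the P4). Issue mechanics are simplified. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE      64
#define PREFETCH_AHEAD 256   /* bytes to stay ahead of demand accesses */

static uint64_t next_prefetch_addr;   /* next line to fetch into L2 */

void on_demand_miss(uint64_t miss_addr) {
    /* (Re)start the stream if the pointer fell behind the miss. */
    if (next_prefetch_addr < miss_addr + LINE_SIZE)
        next_prefetch_addr = miss_addr + LINE_SIZE;
    /* Keep issuing prefetches until we are PREFETCH_AHEAD bytes ahead. */
    while (next_prefetch_addr < miss_addr + PREFETCH_AHEAD) {
        printf("prefetch line 0x%llx into L2\n",
               (unsigned long long)next_prefetch_addr);
        next_prefetch_addr += LINE_SIZE;
    }
}

int main(void) {
    on_demand_miss(0x1000);   /* fetches 0x1040, 0x1080, 0x10c0 */
    on_demand_miss(0x1040);   /* advances the stream by one line */
    return 0;
}
```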
Recent Advances
                   Willamette (2000)   Prescott (2004)
  L1 data cache    8KB                 16KB
  L2 cache         256KB               1MB
  Pipeline stages  20                  31
  Frequency        1.5GHz              3.4GHz
  Technology       0.18µm              0.09µm
Research → Real Processors
• Palacharla (clustering), optimal pipeline depths, trace cache, stream buffers, SMT, voltage scaling
• Questions: branch predictor, clustered organization, memory dependences, power optimizations
UltraSPARC IV
• CMP with two UltraSPARC III cores; speedups of 1.6 and 1.14 for swim and lucas (static parallelization)
• UltraSPARC III: 4-wide, 16 queue entries, 14 pipeline stages
• 4KB branch predictor; 95% accuracy, 7-cycle penalty
• 2KB prefetch buffer between L1 and L2
Alpha 21364
• Tournament predictor with local and global components and a chooser; 36Kb of state (sketched below)
• Issue queues (20-entry Int, 15-entry FP); 4-wide Int, 2-wide FP
• Two clusters, each with 2 FUs and a copy of the 80-entry register file
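A simplified sketch of a tournament predictor: local and global components with a chooser that learns, per branch, which component to trust. The real 21264/21364 local predictor is two-level (per-branch history indexing a counter table); here it is collapsed to a single table, and all sizes are illustrative rather than the actual 36Kb budget.

```c
/* Sketch of a tournament branch predictor: a local predictor, a global
 * (path-history) predictor, and a chooser table that learns which of
 * the two to trust. Sizes and indexing are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024

static uint8_t  local_ctr[TABLE_SIZE];    /* 2-bit saturating counters      */
static uint8_t  global_ctr[TABLE_SIZE];
static uint8_t  chooser[TABLE_SIZE];      /* 0-1: prefer local, 2-3: global */
static uint16_t global_history;           /* recent branch outcomes         */

static void update_ctr(uint8_t *c, bool taken) {
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
}

bool predict(uint32_t pc) {
    bool local_pred  = local_ctr[pc % TABLE_SIZE] >= 2;
    bool global_pred = global_ctr[global_history % TABLE_SIZE] >= 2;
    return (chooser[pc % TABLE_SIZE] >= 2) ? global_pred : local_pred;
}

void update(uint32_t pc, bool taken) {
    bool local_pred  = local_ctr[pc % TABLE_SIZE] >= 2;
    bool global_pred = global_ctr[global_history % TABLE_SIZE] >= 2;

    /* Train the chooser toward whichever component was right. */
    if (local_pred != global_pred)
        update_ctr(&chooser[pc % TABLE_SIZE], global_pred == taken);

    update_ctr(&local_ctr[pc % TABLE_SIZE], taken);
    update_ctr(&global_ctr[global_history % TABLE_SIZE], taken);
    global_history = (uint16_t)((global_history << 1) | taken);
}
```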
Next Class' Paper
• "Value Prediction"