CS 7960-4 Lecture 23: Processor Case Studies – "The Microarchitecture of the Pentium 4 Processor," G. Hinton et al., Intel Technology Journal, Q1 2001
Quick Facts • November 2000: Willamette, 0.18µm, Al interconnect, 42M transistors, 217mm², 55W, 1.5GHz • February 2004: Prescott, 0.09µm, Cu interconnect, 125M transistors, 112mm², 103W, 3.4GHz
Clock Frequencies • Aggressive clocks => little work per pipeline stage => deep pipelines => low IPC, large buffers, high power, high complexity, low efficiency • 50% increase in clock speed => ~30% increase in performance [Figure: pipeline timing comparison – mispredict latency = 10 cycles vs. 20 cycles]
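To make the clock-vs-IPC trade-off concrete, here is a back-of-envelope model (a sketch with assumed numbers: the 10- and 20-cycle mispredict latencies come from the figure above, while the base CPI and mispredict rate are illustrative):

```python
# Illustrative model: a 50% faster clock paired with a doubled mispredict
# penalty yields well under a 50% performance gain.
# Assumed numbers: base CPI of 1.0 and one mispredict per 100 instructions.

def perf(freq_ghz, base_cpi, mispredicts_per_instr, penalty_cycles):
    """Instructions per nanosecond under a simple CPI model."""
    cpi = base_cpi + mispredicts_per_instr * penalty_cycles
    return freq_ghz / cpi

shallow = perf(1.0, base_cpi=1.0, mispredicts_per_instr=0.01, penalty_cycles=10)
deep    = perf(1.5, base_cpi=1.0, mispredicts_per_instr=0.01, penalty_cycles=20)

print(f"speedup from a 50% higher clock: {deep / shallow:.2f}x")  # ~1.38x, not 1.5x
```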
Variable Clocks • The fastest clock is set by the time for an ALU operation and bypass; it runs at twice the main processor clock • Different parts of the chip operate at slower clocks to simplify the design (e.g. RAMs)
Front End • ITLB, RAS, decoder – note: no I-cache • Trace Cache: holds 12K µops (roughly equivalent to an 8-16KB I-cache), saves 3 pipe stages, reduces power • Front-end BTB accessed on a trace cache miss, plus a smaller trace-cache BTB to detect the next trace line – no details on the branch prediction algorithm • Microcode ROM: implements µop translation for complex instructions
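As a rough illustration of why the trace cache saves pipeline stages, here is a minimal sketch in Python: decoded µops are cached as traces keyed by the starting fetch address and the predicted branch path, so a hit skips x86 decode entirely. The class name, sizes, and eviction policy below are assumptions, not the Pentium 4's actual organization.

```python
# Minimal trace-cache sketch (illustrative only).
class TraceCache:
    def __init__(self, max_traces=2048):
        self.traces = {}            # (start_pc, branch_path) -> list of decoded uops
        self.max_traces = max_traces

    def lookup(self, start_pc, branch_path):
        # Hit: the uops are delivered without re-fetching or re-decoding x86 bytes.
        return self.traces.get((start_pc, branch_path))      # None => trace miss

    def fill(self, start_pc, branch_path, uops):
        # Miss: the front-end decoder builds the trace and installs it.
        if len(self.traces) >= self.max_traces:
            self.traces.pop(next(iter(self.traces)))          # crude eviction
        self.traces[(start_pc, branch_path)] = uops

tc = TraceCache()
tc.fill(0x1000, (True, False), ["uop_load", "uop_add", "uop_branch"])
assert tc.lookup(0x1000, (True, False)) is not None           # hit: decode skipped
assert tc.lookup(0x1000, (False, False)) is None               # different path: miss
```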
Execution Engine • Allocator: resource (regs, IQ, LSQ, ROB) manager • Rename: 8 logical regs are renamed to 128 physical regs; the ROB (126 entries) stores only pointers (Pentium 4) and not the actual register values (unlike P6) – simpler design, less power • Two queues (memory and non-memory) and multiple schedulers (select logic) – can issue six instrs/cycle
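A minimal sketch of pointer-based renaming, assuming a simple map table plus free list; as the slide notes, the ROB records only pointers to physical registers, while values live in the physical register file. The structures below are illustrative, not Intel's implementation.

```python
# Rename sketch: 8 logical registers mapped onto 128 physical registers.
NUM_LOGICAL, NUM_PHYSICAL = 8, 128

map_table = {r: r for r in range(NUM_LOGICAL)}        # logical -> physical
free_list = list(range(NUM_LOGICAL, NUM_PHYSICAL))    # unallocated physical regs
rob = []                                              # (dest_logical, old_phys, new_phys)

def rename(dest, srcs):
    """Rename one uop: read source mappings, allocate a fresh destination register."""
    phys_srcs = [map_table[s] for s in srcs]
    new_phys = free_list.pop(0)
    rob.append((dest, map_table[dest], new_phys))     # old mapping is freed at retire
    map_table[dest] = new_phys
    return phys_srcs, new_phys

# r3 <- r1 + r2, then r3 <- r3 + r4: the second uop gets a fresh physical
# register, so the write-after-write dependence on r3 disappears.
print(rename(3, [1, 2]))   # ([1, 2], 8)
print(rename(3, [3, 4]))   # ([8, 4], 9)
```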
Schedulers • Register porting, multiple queues • 3GHz ALU clock: one cycle = time for a 16-bit add and bypass
NetBurst • 3GHz ALU clock = time for a 16-bit add and bypass to itself (area is kept to a minimum) • Used by 60-70% of all µops in integer programs • Staggered addition – speeds up execution of dependent instrs – an add takes three cycles • Early computation of the lower 16 bits => early initiation of cache access
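A rough model of staggered addition: the 32-bit add is split into 16-bit halves so the low half, and anything that depends only on it (such as starting a cache access), is ready before the high half completes. This is a functional sketch; the real hardware staggers the halves across fast-clock ticks.

```python
# Staggered 32-bit addition as two 16-bit half-adds (illustrative).
def staggered_add(a, b):
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    carry = lo >> 16
    lo &= 0xFFFF                      # low half available one fast-clock tick early
    hi = ((a >> 16) + (b >> 16) + carry) & 0xFFFF
    return lo, hi                     # full result = (hi << 16) | lo

lo, hi = staggered_add(0x0001FFFF, 0x00000001)
assert (hi << 16) | lo == 0x00020000  # matches a plain 32-bit add
```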
Data Cache • 4-way 8KB cache; 2-cycle load-use latency for integer instrs and 6-cycle latency for FP instrs • Distance between the load scheduler and execution is longer than the load latency • Speculative issue of load-dependent instrs and selective replay • Store buffer (24 entries) forwards results to loads (48-entry load buffer) – no details on the load issue algorithm
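A sketch of store-to-load forwarding through the store buffer; the paper gives the buffer sizes but not the matching policy, so the exact-address match and scan order below are assumptions.

```python
# Store-buffer forwarding sketch (illustrative).
store_buffer = []        # (address, data) pairs, oldest first; up to 24 entries

def execute_store(addr, data):
    store_buffer.append((addr, data))

def execute_load(addr, cache):
    # Scan youngest-to-oldest so the most recent matching store wins.
    for st_addr, st_data in reversed(store_buffer):
        if st_addr == addr:
            return st_data                    # forwarded from the store buffer
    return cache.get(addr, 0)                 # otherwise read the data cache

cache = {0x100: 7}
execute_store(0x100, 42)
print(execute_load(0x100, cache))             # 42: forwarded, not the stale 7 in cache
print(execute_load(0x200, cache))             # 0: misses both store buffer and cache
```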
Cache Hierarchy • 256KB 8-way L2; 7-cycle latency; new operation every two cycles • Stream prefetcher from memory to L2 – stays 256 bytes ahead • 3.2GB/s system bus: 64-bit wide bus at 400MHz
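A sketch of the stream-prefetcher behavior: on each demand access the prefetcher issues lines until it is 256 bytes ahead of the demand address. The 64-byte line size and the bookkeeping are assumptions; only the 256-byte distance comes from the slide.

```python
# Stream-prefetcher sketch (illustrative).
LINE_SIZE = 64          # bytes per cache line (assumed)
AHEAD = 256             # prefetch distance in bytes (from the slide)

prefetched = set()

def on_demand_access(addr):
    """Issue prefetches for every line between addr and addr + AHEAD."""
    issued = []
    line = addr - addr % LINE_SIZE
    while line <= addr + AHEAD:
        if line not in prefetched:
            prefetched.add(line)
            issued.append(line)
        line += LINE_SIZE
    return issued

print([hex(a) for a in on_demand_access(0x4000)])   # lines 0x4000 .. 0x4100
print([hex(a) for a in on_demand_access(0x4040)])   # only the new line 0x4140
```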
Recent Advances • Willamette (2000) vs. Prescott (2004): L1 data cache 8KB → 16KB; L2 cache 256KB → 1MB; pipeline stages 20 → 31; frequency 1.5GHz → 3.4GHz; technology 0.18µm → 0.09µm
Research → Real Processors • Palacharla-style clustering, optimal pipeline depths, trace cache, stream buffers, SMT, voltage scaling • Questions: branch predictor, clustered organization, memory dependences, power optimizations
UltraSPARC IV • CMP with 2 UltraSPARC IIIs – speedups of 1.6 and 1.14 for swim and lucas (static parallelization) • UltraSPARC III: 4-wide, 16 queue entries, 14 pipeline stages • 4KB branch predictor – 95% accuracy, 7-cycle penalty • 2KB prefetch buffer between L1 and L2
Alpha 21364 • Tournament predictor – local and global; 36Kb • Issue queue (20-Int, 15-FP), 4-wide Int, 2-wide FP • Two clusters, each with 2 FUs and a copy of the 80-entry register file
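A simplified tournament predictor in the same spirit: local and global components plus a chooser that learns, per branch, which one to trust. Table sizes and indexing are stripped down and do not match the 21364's actual 36Kb arrangement.

```python
# Simplified tournament branch predictor (illustrative sizes and indexing).
class TournamentPredictor:
    def __init__(self, size=1024):
        self.size = size
        self.local = [1] * size      # 2-bit counters indexed by PC
        self.glob = [1] * size       # 2-bit counters indexed by global history
        self.chooser = [1] * size    # 2-bit counters: prefer global when >= 2
        self.ghist = 0

    def predict(self, pc):
        i, g = pc % self.size, self.ghist % self.size
        if self.chooser[i] >= 2:
            return self.glob[g] >= 2
        return self.local[i] >= 2

    def update(self, pc, taken):
        i, g = pc % self.size, self.ghist % self.size
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.glob[g] >= 2) == taken
        # Train the chooser toward whichever component was correct.
        if global_ok and not local_ok:
            self.chooser[i] = min(3, self.chooser[i] + 1)
        elif local_ok and not global_ok:
            self.chooser[i] = max(0, self.chooser[i] - 1)
        # Train both component counters and shift the global history.
        for table, idx in ((self.local, i), (self.glob, g)):
            table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
        self.ghist = ((self.ghist << 1) | int(taken)) & (self.size - 1)
```

The chooser is what lets the combined predictor beat either component alone: branches with stable per-branch behavior fall back to the local counters, while correlated branches migrate to the global one.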
Next Class’ Paper • “Value Prediction”