CS 7810 Lecture 22: Processor Case Studies
The Microarchitecture of the Pentium 4 Processor, G. Hinton et al., Intel Technology Journal, Q1 2001
Clock Frequencies
• Aggressive clocks => little work per pipeline stage => deep pipelines => low IPC, large buffers, high power, high complexity, low efficiency
• 50% increase in clock speed => 30% increase in performance
(Figure: pipeline comparison – mispredict latency = 10 cyc vs. mispredict latency = 20 cyc)
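As a rough illustration of the 50% clock => 30% performance point, the sketch below models performance as frequency × IPC with a simple mispredict-stall term; the pipeline latencies, branch frequency, and misprediction rate are assumed values chosen for illustration, not figures from the paper.

```python
# Illustrative model: performance = frequency * IPC, where IPC suffers from the
# longer mispredict latency of a deeper pipeline. All parameter values are
# assumptions for illustration, not figures from the paper.

def performance(freq_ghz, mispredict_cycles,
                base_cpi=1.0, branches_per_instr=0.2, mispredict_rate=0.10):
    """Instructions per nanosecond under a simple CPI + mispredict-stall model."""
    cpi = base_cpi + branches_per_instr * mispredict_rate * mispredict_cycles
    return freq_ghz / cpi  # instructions per ns

shallow = performance(freq_ghz=2.0, mispredict_cycles=10)
deep    = performance(freq_ghz=3.0, mispredict_cycles=20)  # 50% faster clock
print(f"speedup from 50% higher clock: {deep / shallow:.2f}x")
```

With these assumed parameters the model prints a speedup of about 1.29x, i.e. a 50% faster clock buys roughly 30% more performance once the doubled mispredict latency is charged against IPC.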
Variable Clocks
• The fastest clock is defined as the time for an ALU operation and bypass (twice the main processor clock)
• Different parts of the chip operate at slower clocks to simplify the pipeline design (e.g., RAMs)
Front End
• ITLB, RAS, decoder
• Trace Cache: contains 12K µops (~8K-16KB I-cache), saves 3 pipe stages, reduces power
• Front-end BTB accessed on a trace cache miss, and a smaller trace-cache BTB to detect the next trace line – no details on the branch prediction algorithm
• Microcode ROM: implements µop translation for complex instructions
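A minimal sketch of the trace-cache idea, assuming a simple dictionary keyed by the starting fetch address and the predicted branch directions (the real P4 line format, replacement policy, and next-trace selection are not described at this level in the paper):

```python
# Toy trace cache: maps (start_pc, branch_direction_bits) -> decoded uops.
# Purely illustrative; the real P4 trace cache holds ~12K uops and uses its
# own small BTB to pick the next trace line.

class TraceCache:
    def __init__(self, capacity_uops=12 * 1024):
        self.capacity = capacity_uops
        self.traces = {}          # (start_pc, dirs) -> tuple of uops
        self.stored_uops = 0

    def lookup(self, start_pc, predicted_dirs):
        """Hit: return decoded uops, skipping the decode pipeline stages."""
        return self.traces.get((start_pc, predicted_dirs))

    def fill(self, start_pc, predicted_dirs, uops):
        """On a miss, the front-end BTB/decoder builds the trace and installs it."""
        if self.stored_uops + len(uops) <= self.capacity:
            self.traces[(start_pc, predicted_dirs)] = tuple(uops)
            self.stored_uops += len(uops)

tc = TraceCache()
tc.fill(0x4000, predicted_dirs=0b101, uops=["load", "add", "cmp", "jcc"])
print(tc.lookup(0x4000, 0b101))   # hit -> decoded uops, no decode needed
```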
Execution Engine
• Allocator: resource (regs, IQ, LSQ, ROB) manager
• Rename: 8 logical regs are renamed to 128 phys regs; the ROB (126 entries) only stores pointers (Pentium 4) and not the actual reg values (unlike P6) – simpler design, less power
• Two queues (memory and non-memory) and multiple schedulers (select logic) – can issue six instrs/cycle
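A simplified sketch of the P4-style renaming scheme described above, assuming a free list of physical registers (the rename() helper below is hypothetical and for illustration only; the real allocator also manages IQ/LSQ/ROB entries):

```python
# Simplified P4-style rename: values live only in the physical register file;
# the ROB holds pointers (indices), unlike the P6 where the ROB holds values.
# Structure sizes match the slide; everything else is illustrative.

from collections import deque

NUM_LOGICAL, NUM_PHYSICAL, ROB_ENTRIES = 8, 128, 126

free_list = deque(range(NUM_LOGICAL, NUM_PHYSICAL))   # unallocated phys regs
rename_table = {r: r for r in range(NUM_LOGICAL)}     # logical -> physical
rob = deque(maxlen=ROB_ENTRIES)                       # entries store pointers only

def rename(dest_logical, src_logicals):
    srcs = [rename_table[r] for r in src_logicals]    # read current mappings
    new_phys = free_list.popleft()                    # allocate a fresh phys reg
    old_phys = rename_table[dest_logical]
    rename_table[dest_logical] = new_phys
    rob.append({"dest_phys": new_phys, "prev_phys": old_phys})  # pointers, no values
    return new_phys, srcs

# r3 <- r1 + r2 : the destination gets a fresh physical register
print(rename(dest_logical=3, src_logicals=[1, 2]))
```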
Schedulers
• At 3GHz, the clock period equals the time for a 16-bit add and bypass
NetBurst
• 3GHz ALU clock: the cycle time equals a 16-bit add and bypass back to itself (area is kept to a minimum)
• Used by 60-70% of all µops in integer programs
• Staggered addition – speeds up execution of dependent instrs – an add takes three cycles
• Early computation of the lower 16 bits => early initiation of cache access
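The staggered-add idea can be sketched as follows; the cycle labels are illustrative, but the point is that the low 16 bits are produced first and can be forwarded to a dependent add or used to start a cache access before the high half completes:

```python
# Staggered 32-bit addition, low half first: a sketch of the idea behind the
# double-pumped P4 ALUs. The cycle numbering here is illustrative.

MASK16 = 0xFFFF

def staggered_add(a, b):
    # Fast cycle 1: lower 16 bits (already usable by a dependent low-half add
    # or to start a cache access).
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    # Fast cycle 2: upper 16 bits plus the carry from the low half.
    high = (a >> 16) + (b >> 16) + carry
    # (The third cycle the slide counts covers flag generation.)
    return ((high & MASK16) << 16) | (low & MASK16)

assert staggered_add(0x0001FFFF, 0x00000001) == 0x00020000
print(hex(staggered_add(0x12345678, 0x0000FFFF)))   # 0x12355677
```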
Data Cache
• 4-way 8KB cache; 2-cycle load-use latency for integer instrs and 6-cycle latency for FP instrs
• Distance between the load scheduler and execution is longer than the load latency
• Speculative issue of load-dependent instrs and selective replay
• Store buffer (24 entries) to forward results to loads (48 entries) – no details on the load issue algorithm
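A rough sketch of speculative issue with selective replay, under assumed structures (the P4's actual scheduler and replay mechanism are not detailed in the paper): dependents of a load are issued assuming an L1 hit, and only the load's dependence chain is re-issued on a miss:

```python
# Speculatively wake up load dependents assuming an L1 hit (2-cycle load-use
# latency for integer ops per the slide); on a miss, selectively replay only
# the dependence chain. All structures here are illustrative.

L1_HIT_LATENCY = 2

def schedule_load(load, dependents, l1_hit):
    issued = [(load, 0)]
    # Dependents are issued early, timed for the hit latency.
    issued += [(d, L1_HIT_LATENCY) for d in dependents]
    if l1_hit:
        return issued, []               # speculation was correct
    # Miss: squash only the load's dependence chain and re-issue it later,
    # instead of flushing everything issued in that window.
    replay_queue = list(dependents)
    return [(load, 0)], replay_queue

done, replay = schedule_load("ld r1,[r2]", ["add r3,r1,r4", "sub r5,r1,r6"],
                             l1_hit=False)
print("completed:", done)
print("selectively replayed:", replay)
```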
Cache Hierarchy
• 256KB 8-way L2; 7-cycle latency; new operation every two cycles
• Stream prefetcher from memory to L2 – stays 256 bytes ahead
• 3.2GB/s system bus: 64-bit wide bus at 400MHz
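A hedged sketch of a stream prefetcher that tries to stay 256 bytes ahead of the demand stream; the 64-byte line size and the trigger policy are assumptions for illustration:

```python
# Toy stream prefetcher from memory into L2: on each demand access it issues
# prefetches until the prefetch frontier is 256 bytes ahead of the access.
# Line size and policy details are illustrative assumptions.

LINE_BYTES = 64
AHEAD_BYTES = 256

class StreamPrefetcher:
    def __init__(self):
        self.frontier = None              # next address to prefetch

    def on_demand_access(self, addr):
        line = addr - (addr % LINE_BYTES)
        if self.frontier is None or self.frontier < line:
            self.frontier = line + LINE_BYTES
        issued = []
        while self.frontier < line + AHEAD_BYTES:
            issued.append(self.frontier)  # prefetch this line into L2
            self.frontier += LINE_BYTES
        return issued

pf = StreamPrefetcher()
for a in (0x1000, 0x1040, 0x1080):
    print(hex(a), "->", [hex(x) for x in pf.on_demand_access(a)])
```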
Quick Facts
• November 2000: Willamette, 0.18µm, Al interconnect, 42M transistors, 217 mm², 55W, 1.5GHz
• February 2004: Prescott, 0.09µm, Cu interconnect, 125M transistors, 112 mm², 103W, 3.4GHz
Improvements
                   Willamette (2000)   Prescott (2004)
  L1 data cache    8KB                 16KB
  L2 cache         256KB               1MB
  Pipeline stages  20                  31
  Frequency        1.5GHz              3.4GHz
  Technology       0.18µm              0.09µm
Pentium M
• Based on the P6 microarchitecture
• Lower design complexity (some inefficiencies persist, such as copying register values from the ROB to the architected register file)
• Improves on the P4 branch predictor
PM Changes to P6, cont.
• Intel has not released the exact length of the pipeline; it is known to be somewhere between the P4's (20 stages) and the P3's (10 stages), and rumored to be 12 stages
• Trades off slightly lower clock frequencies (than the P4) for better performance per clock, fewer branch misprediction penalties, …
Banias
• 1st version
• 77 million transistors, 23 million more than the P4
• 1 MB on-die Level 2 cache
• 400 MHz FSB (quad-pumped 100 MHz)
• 130 nm process
• Frequencies between 1.3 and 1.7 GHz
• Thermal Design Power of 24.5 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm
Dothan
• Launched May 10, 2004
• 140 million transistors
• 2 MB Level 2 cache
• 400 or 533 MHz FSB
• Frequencies between 1.0 and 2.26 GHz
• Thermal Design Power of 21 watts (400 MHz FSB) to 27 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm
Branch Prediction
• Longer pipelines mean higher penalties for mispredicted branches
• Improvements result in added performance and hence less energy spent per instruction retired
Branch Prediction in Pentium M
• Enhanced version of the Pentium 4 predictor
• Two predictors added that run in tandem with the P4-style predictor: a loop detector and an indirect branch predictor
• 20% lower misprediction rate than the PIII, resulting in up to a 7% gain in real performance
Branch Prediction
(Figure: based on the diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php)
Loop Detector
• A predictor that always predicts a loop branch taken will mispredict on the last iteration
• The detector analyzes branches for loop behavior
• Benefits a wide variety of program types
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
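A toy loop detector along these lines (table organization and fields are assumptions; Intel does not publish the exact algorithm): the predictor learns a branch's trip count and then predicts not-taken exactly on the last iteration instead of always predicting taken:

```python
# Toy loop detector: learns the iteration count of a loop-closing branch and
# predicts not-taken on the final iteration. Fields/thresholds are assumptions.

class LoopDetector:
    def __init__(self):
        self.table = {}   # pc -> {"limit": learned trip count, "count": current}

    def predict(self, pc):
        e = self.table.get(pc)
        if e is None or e["limit"] is None:
            return True                      # default: predict taken
        return e["count"] + 1 < e["limit"]   # not-taken on the last iteration

    def update(self, pc, taken):
        e = self.table.setdefault(pc, {"limit": None, "count": 0})
        if taken:
            e["count"] += 1
        else:                                # loop exit: remember the trip count
            e["limit"] = e["count"] + 1
            e["count"] = 0

ld = LoopDetector()
for trip in range(3):                        # a loop that runs 8 iterations
    for i in range(8):
        taken = i < 7
        print(trip, i, "predict:", ld.predict(0x40), "actual:", taken)
        ld.update(0x40, taken)
```

After the first pass through the loop (which mispredicts the exit, as an always-taken predictor would), the sketch predicts the final not-taken iteration correctly on later passes.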
Indirect Branch Predictor
• Picks targets based on global control-flow history
• Benefits programs compiled to branch to calculated addresses
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
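A sketch of target prediction keyed by global history (the hash and table size are illustrative assumptions): the target cache is indexed by the branch PC combined with recent branch outcomes, so the same indirect branch can yield different computed targets in different control-flow contexts:

```python
# Toy indirect-branch target predictor: a target cache indexed by the branch
# address hashed with the global history register. Sizes and the hash are
# illustrative assumptions.

TABLE_SIZE = 256
HISTORY_BITS = 8

class IndirectPredictor:
    def __init__(self):
        self.targets = [None] * TABLE_SIZE
        self.ghr = 0                           # global history of recent outcomes

    def _index(self, pc):
        return (pc ^ self.ghr) % TABLE_SIZE

    def predict(self, pc):
        return self.targets[self._index(pc)]   # None means "no prediction yet"

    def update(self, pc, actual_target):
        self.targets[self._index(pc)] = actual_target

    def record_branch_outcome(self, taken):
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)

ip = IndirectPredictor()
ip.update(0x80, actual_target=0x2000)          # e.g. a switch/virtual-call target
print(hex(ip.predict(0x80) or 0))
ip.record_branch_outcome(True)                 # new context -> different table slot
print(ip.predict(0x80))
```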
UltraSPARC IV
• CMP with 2 UltraSPARC IIIs – speedups of 1.6 and 1.14 for swim and lucas (static parallelization)
• UltraSPARC III: 4-wide, 16 queue entries, 14 pipeline stages
• 4KB branch predictor – 95% accuracy, 7-cycle penalty
• 2KB prefetch buffer between L1 and L2
Alpha 21364
• Tournament predictor – local and global; 36Kb
• Issue queue (20-Int, 15-FP), 4-wide Int, 2-wide FP
• Two clusters, each with 2 FUs and a copy of the 80-entry register file
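A compact tournament-predictor sketch in this style (tiny illustrative table sizes, not the 21364's 36Kb budget): a chooser of 2-bit counters arbitrates between a local, per-branch-history predictor and a global-history predictor:

```python
# Toy tournament predictor: local (per-branch history) vs. global (GHR-indexed)
# predictors arbitrated by a chooser. Table sizes are tiny illustrative values.

LOCAL_ENTRIES = GLOBAL_ENTRIES = 64

local_hist = [0] * LOCAL_ENTRIES            # per-branch history bits
local_ctr  = [2] * LOCAL_ENTRIES            # 2-bit counters (>=2 means taken)
global_ctr = [2] * GLOBAL_ENTRIES
chooser    = [2] * GLOBAL_ENTRIES           # >=2 means "trust the global predictor"
ghr = 0

def predict(pc):
    li = (pc ^ local_hist[pc % LOCAL_ENTRIES]) % LOCAL_ENTRIES
    local_pred  = local_ctr[li] >= 2
    global_pred = global_ctr[ghr % GLOBAL_ENTRIES] >= 2
    use_global  = chooser[ghr % GLOBAL_ENTRIES] >= 2
    return global_pred if use_global else local_pred

def update(pc, taken):
    global ghr
    li = (pc ^ local_hist[pc % LOCAL_ENTRIES]) % LOCAL_ENTRIES
    gi = ghr % GLOBAL_ENTRIES
    local_correct  = (local_ctr[li]  >= 2) == taken
    global_correct = (global_ctr[gi] >= 2) == taken
    if global_correct != local_correct:     # train the chooser toward the winner
        chooser[gi] = min(3, chooser[gi] + 1) if global_correct else max(0, chooser[gi] - 1)
    for ctrs, idx in ((local_ctr, li), (global_ctr, gi)):
        ctrs[idx] = min(3, ctrs[idx] + 1) if taken else max(0, ctrs[idx] - 1)
    local_hist[pc % LOCAL_ENTRIES] = ((local_hist[pc % LOCAL_ENTRIES] << 1) | taken) % LOCAL_ENTRIES
    ghr = ((ghr << 1) | taken) % GLOBAL_ENTRIES

for outcome in (1, 1, 0, 1, 1, 0, 1, 1, 0):  # a branch with a repeating pattern
    print("predict:", predict(0x10), "actual:", bool(outcome))
    update(0x10, outcome)
```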