Intel Core 2 Duo

Intel Core 2 Duo CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009

Introduction
• Motivation
  - A multi-core on our desks
  - A new microarchitecture to replace NetBurst
• Intel Core 2 Duo
  - A dual-core CPU
  - ISA with SIMD extensions
  - Intel Core microarchitecture
  - Memory hierarchy system

Instruction Set Architecture
• Base: x86-64
• No VLIW (Itanium)
• SIMD Extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
  - Wolfdale, SSE4.1, Sep 2006
  - Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
  - Prescott, SSE3, 2004: DSP-oriented math, process management
  - Pentium 4, SSE2, 2001: double precision, 128-bit register support
  - Pentium III, SSE, 1999: 8 new registers, floating-point operations
  - Pentium MMX, 1996: 8 new registers, packed data type, integer operations

Streaming SIMD Extensions (SSE) 4.1 • Introduced with the 45 nm processors • 47 instructions that improve the performance of media data manipulation • e.g. fast and efficient bit-width conversions, such as converting single-byte values to word (16-bit) values

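SSE2 Code
• MOVQ XMM0, m64 ; load the 8 source bytes into the low half of XMM0
• PXOR XMM1, XMM1 ; zero XMM1
• PUNPCKLBW XMM0, XMM1 ; interleave with zeros, widening each byte to a word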

SSE4.1 Code
• PMOVZXBW XMM0, m64
  DEST[15:0] <-- ZeroExtend(SRC[7:0]);
  DEST[31:16] <-- ZeroExtend(SRC[15:8]);
  DEST[47:32] <-- ZeroExtend(SRC[23:16]);
  DEST[63:48] <-- ZeroExtend(SRC[31:24]);
  DEST[79:64] <-- ZeroExtend(SRC[39:32]);
  DEST[95:80] <-- ZeroExtend(SRC[47:40]);
  DEST[111:96] <-- ZeroExtend(SRC[55:48]);
  DEST[127:112] <-- ZeroExtend(SRC[63:56]);
• Benefits
  - Fewer instructions (3 → 1)
  - Better performance (~40% speedup per loop iteration)
  - Lower register pressure (2 → 1)
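The DEST/SRC pseudocode above can be checked with a small software model. The following is only an illustrative sketch, not how the hardware is implemented: registers are modeled as Python integers, and the function names `pmovzxbw`, `punpcklbw`, and `sse2_widen` are hypothetical helpers chosen for this example. It shows that the single SSE4.1 instruction produces the same result as the three-instruction SSE2 sequence.

```python
MASK64 = (1 << 64) - 1


def pmovzxbw(src64):
    """Model of PMOVZXBW: zero-extend 8 packed bytes to 8 packed 16-bit words."""
    dest = 0
    for i in range(8):
        byte = (src64 >> (8 * i)) & 0xFF   # SRC[8i+7 : 8i]
        dest |= byte << (16 * i)           # DEST[16i+15 : 16i], upper byte stays 0
    return dest


def punpcklbw(a, b):
    """Model of PUNPCKLBW: interleave the low 8 bytes of a (even lanes) and b (odd lanes)."""
    dest = 0
    for i in range(8):
        dest |= ((a >> (8 * i)) & 0xFF) << (16 * i)
        dest |= ((b >> (8 * i)) & 0xFF) << (16 * i + 8)
    return dest


def sse2_widen(src64):
    """The SSE2 sequence: load 8 bytes, zero a second register, then unpack."""
    xmm0 = src64 & MASK64   # load 8 bytes, upper half of the register is zero
    xmm1 = 0                # PXOR XMM1, XMM1
    return punpcklbw(xmm0, xmm1)


if __name__ == "__main__":
    src = 0x0807060504030201
    print(hex(pmovzxbw(src)))
    print(pmovzxbw(src) == sse2_widen(src))
```

Both paths widen every source byte into its own 16-bit lane; the SSE4.1 version simply needs no zeroed helper register.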

Microarchitecture • The Cores • Single die (107 mm²) • Two identical cores (L1 cache 64 KB × 2) • Shared 6 MB L2 cache • No Hyper-Threading, no L3 cache • Keeps the front-side bus • Larger L2 cache

Microarchitecture • 14-stage pipeline • 4-wide decode • 4-wide retire • Macro-fusion • Enhanced ALUs • Deeper buffers

Another View

Decode Hardware • 128-bit fetch bandwidth • 18-entry instruction queue (IQ) • Complex decode - produces 1-4 micro-ops • Microcode sequencer

Macro-fusion
• New micro-op
  - Represents an instruction pair as a single micro-op
• Enhanced ALUs
  - Execute the new compare-and-jump (CMPJCC) micro-op in one clock
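The pairing idea above can be sketched as a pass over a decoded instruction stream. This is a deliberately simplified illustration: it looks only at mnemonics, treats any `CMP` directly followed by a conditional jump as fusible, and ignores the operand and decoder restrictions that real hardware applies. The function names `fuse` and `is_conditional_jump` are hypothetical.

```python
def is_conditional_jump(mnemonic):
    """Treat every J* mnemonic except the unconditional JMP as a conditional jump."""
    return mnemonic.startswith("J") and mnemonic != "JMP"


def fuse(instructions):
    """Return the micro-op stream, merging each CMP + Jcc pair into one entry."""
    uops = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur == "CMP" and nxt is not None and is_conditional_jump(nxt):
            uops.append("CMP+" + nxt)   # one fused compare-and-jump micro-op
            i += 2
        else:
            uops.append(cur)            # modeled as a single micro-op for simplicity
            i += 1
    return uops


if __name__ == "__main__":
    stream = ["MOV", "CMP", "JNE", "ADD", "CMP", "JMP"]
    print(fuse(stream))
```

In this model a fused pair occupies one slot in the micro-op stream instead of two, which is the source of the decode, buffer, and retire bandwidth savings.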

Out-of-Order Execution • 96-entry ROB • 32-entry reservation station

Execution Units • 6 dispatch ports (1 load, 2 store, 3 universal) • 3 integer ALUs, 2 floating-point ALUs

Branch Predictor
• Loop detector
  - Tracks the number of loop iterations for future reference
• The branch prediction unit (BPU) selects among three predictors for every branch:
  - bimodal predictor
  - global predictor
  - loop detector
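To make the loop detector idea concrete, here is a minimal sketch of one way to track loop iteration counts. It is an assumption-laden toy, not the actual Core design: it learns the trip count from a single completed loop run, keeps unbounded per-branch state in dictionaries, and the names `LoopDetector` and `run` are hypothetical.

```python
class LoopDetector:
    """Toy loop predictor: remembers how many times a loop branch was taken
    before it fell through, and predicts the exit on that iteration next time."""

    def __init__(self):
        self.trip = {}    # branch IP -> learned number of taken outcomes per loop run
        self.count = {}   # branch IP -> taken outcomes seen in the current run

    def predict(self, ip):
        """Return True to predict taken, False to predict not taken."""
        learned = self.trip.get(ip)
        if learned is None:
            return True   # nothing learned yet: assume the loop continues
        return self.count.get(ip, 0) < learned

    def update(self, ip, taken):
        """Record the real outcome of the branch."""
        if taken:
            self.count[ip] = self.count.get(ip, 0) + 1
        else:
            self.trip[ip] = self.count.get(ip, 0)   # loop exited: remember trip count
            self.count[ip] = 0


def run(detector, ip, outcomes):
    """Feed a sequence of outcomes and return the number of mispredictions."""
    misses = 0
    for taken in outcomes:
        if detector.predict(ip) != taken:
            misses += 1
        detector.update(ip, taken)
    return misses


if __name__ == "__main__":
    d = LoopDetector()
    pattern = [True, True, True, False]   # taken three times, then the loop exits
    print(run(d, 0x40, pattern))          # first pass through the loop
    print(run(d, 0x40, pattern * 3))      # later passes, after the count is learned
```

A bimodal counter alone would keep mispredicting the loop exit; remembering the iteration count is what lets the exit be predicted.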

Cache Organization
• Private L1 DCache and ICache: 32 KB each per core, 8-way, 64 B line size, write-back (directory-based coherence)
• Shared L2 cache: 8-way, 64 B line size (E8xxx)
  - pros: could mean less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflicts between threads
• FSB: 1333 MHz (E8xxx)

Memory Disambiguation
• Aggressive memory dependence speculation based on a hash table indexed by the load's EIP (instruction address)
• Watchdog mechanism
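As a worked example for the L1 cache parameters listed under Cache Organization (32 KB, 8-way, 64 B lines), the sketch below derives the number of sets and splits an address into tag, index, and offset. It assumes power-of-two sizes and a plain modulo index, which is the textbook scheme rather than a confirmed detail of this processor; the function names are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.
    Assumes all three parameters are powers of two."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
    print(sets, offset_bits, index_bits)
    print(split_address(0x12345, offset_bits, index_bits))
```

With these parameters there are 32 KB / 64 B = 512 lines, and 512 / 8 ways = 64 sets, so 6 offset bits and 6 index bits are needed.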

Prediction Implementation
• History table indexed by instruction pointer
• Each entry in the history array holds a saturating counter
• Once the counter saturates, disambiguation is possible on this load (takes effect from the next iteration)
  - the load is allowed to go ahead even if it meets unknown store addresses
• When a particular load fails disambiguation: reset its counter
• Each time a particular load is correctly disambiguated: increment its counter
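The counter rules above can be sketched as a small table-driven predictor. The table size, the counter maximum, and the modulo hash are assumptions made for illustration only, and `DisambiguationPredictor` with its methods is a hypothetical name, not an Intel interface.

```python
class DisambiguationPredictor:
    """Toy memory disambiguation predictor with one saturating counter per entry."""

    def __init__(self, entries=256, max_count=15):
        self.entries = entries        # assumed table size
        self.max_count = max_count    # assumed saturation value
        self.table = [0] * entries

    def _index(self, load_ip):
        # Assumed hash: simple modulo of the load's instruction pointer.
        return load_ip % self.entries

    def may_bypass(self, load_ip):
        """A load may pass unknown store addresses only once its counter is saturated."""
        return self.table[self._index(load_ip)] == self.max_count

    def update(self, load_ip, conflicted):
        """Apply the slide's rules: reset on a failure, increment on a success."""
        i = self._index(load_ip)
        if conflicted:
            self.table[i] = 0
        else:
            self.table[i] = min(self.table[i] + 1, self.max_count)


if __name__ == "__main__":
    p = DisambiguationPredictor(entries=16, max_count=3)
    ip = 0x401000
    for _ in range(3):
        p.update(ip, conflicted=False)
    print(p.may_bypass(ip))    # counter saturated after three clean executions
    p.update(ip, conflicted=True)
    print(p.may_bypass(ip))    # a single conflict resets the counter
```

Resetting to zero on any conflict makes the predictor conservative: a load has to prove itself again over several executions before it is allowed to speculate.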

Predictor Lookup
• When a load is sent from the RS, set its disambiguation bit
• If it meets an older unknown store address, set the "update" bit
• If the prediction is "go", dispatch the load and set "done"
• Otherwise the load is blocked
• A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set
• When the load commits, update the history
(Diagram stage labels: Load Dispatch, Prediction Verification)

Execute Disable Bit Support • Equivalents: AMD Enhanced Virus Protection; ARM eXecute Never • Helps prevent buffer overflow attacks • No need for software patches against buffer overflow attacks • Segregates memory into regions that store either code or data • The processor disables code execution when malicious worms try to insert code into data buffers (requires OS support)

Instruction Pointer Based Prefetcher
• L1 DCache: 2 IP prefetchers per core
• L1 ICache: 1 traditional prefetcher
• L2 cache: 2 IP prefetchers
• Predicts which memory addresses will be used and delivers the data in time
• Records every load's history using its instruction pointer
• IP history array
• Prefetch traffic-control parameters are fine-tuned for different platforms
• Prefetch monitor
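One common way to use a per-load history array is stride detection: if the same load instruction advances by the same distance twice in a row, the next address is predicted. The sketch below assumes that scheme purely for illustration; the slide does not state the prediction rule, and the table size, the modulo index, and the name `IPStridePrefetcher` are all hypothetical.

```python
class IPStridePrefetcher:
    """Toy IP-indexed stride prefetcher."""

    def __init__(self, entries=256):
        self.entries = entries   # assumed history array size
        self.table = {}          # index -> (last address, last stride)

    def access(self, load_ip, addr):
        """Record a load and return a predicted prefetch address, or None."""
        idx = load_ip % self.entries
        prev = self.table.get(idx)
        prediction = None
        if prev is None:
            self.table[idx] = (addr, 0)    # first time this load is seen
            return prediction
        last_addr, last_stride = prev
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            prediction = addr + stride     # same stride twice: prefetch the next one
        self.table[idx] = (addr, stride)
        return prediction


if __name__ == "__main__":
    pf = IPStridePrefetcher()
    for a in (0x1000, 0x1040, 0x1080):
        print(pf.access(0x400123, a))
```

Requiring the stride to repeat before issuing a prefetch is a simple form of the traffic control mentioned above, since irregular loads then generate no prefetch requests.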


Questions?
