Intel Pentium 4 ENCM 515 - 2002 Jonathan Bienert Tyson Marchuk
Overview: • Product review • Specialized architectural features (NetBurst) • SIMD instruction capabilities (MMX, SSE2) • SHARC 2106x comparison
Intel Pentium 4 • Reworked micro-architecture for high-bandwidth applications • Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multimedia, and multitasking user environments • These are DSP-intensive applications! • What about uses outside the PC?
Hardware Features (NetBurst micro-architecture): • Hyper-pipelined technology • Advanced dynamic execution • Cache (L1 data, L1 execution trace, L2) • Rapid ALU execution engines • 400 MHz system bus • Out-of-order execution (OOE) • Microcode ROM
Hyper Pipeline • 20-stage pipeline! • Complex CISC instructions are broken down into simpler micro-operations • The micro-operations resemble RISC instructions • Shorter stages allow a higher clock rate, hence faster execution
Filling the pipeline... • Looks ahead at the next 126 instructions to be executed • Branch prediction • A misprediction means flushing the 20-stage pipeline! • Branch target buffer (BTB) • 4K branch history table (BHT) • Branch hints can be given in assembly
Cache • 8 KB data cache • L1 execution trace cache • stores about 12K previously decoded micro-instructions • avoids having to translate them again • L2 Advanced Transfer Cache • 256 KB for data • 256-bit transfer every cycle • allows about 77 GB/s at 2.4 GHz (32 bytes x 2.4 GHz = 76.8 GB/s)
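The L2 bandwidth figure follows from the transfer width and the clock rate. A minimal C sketch of that arithmetic, assuming one full-width transfer on every core clock and decimal gigabytes (the function name is illustrative only):

```c
#include <assert.h>

/* Peak cache bandwidth in GB/s (10^9 bytes/s), assuming one transfer
 * of `bus_bits` bits on every core clock cycle. */
double peak_bandwidth_gbs(unsigned bus_bits, double clock_ghz)
{
    assert(bus_bits % 8 == 0);                 /* whole bytes only */
    double bytes_per_cycle = bus_bits / 8.0;   /* 256 bits -> 32 bytes */
    return bytes_per_cycle * clock_ghz;        /* bytes/cycle x 10^9 cycles/s */
}
```

With the slide's numbers, `peak_bandwidth_gbs(256, 2.4)` comes to about 76.8, which the slide rounds to 77 GB/s. This is a peak rate, not a sustained one.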
Rapid ALU Execution Engines • 2 ALUs • allow parallel operations • The ALUs run at twice the core clock (2X), so many arithmetic operations take 1/2 cycle • each 2X ALU can complete 2 operations per cycle
Software Features: • Multimedia Extensions (MMX) • 8 MMX registers (64-bit) • Streaming SIMD Extensions 2 (SSE2) • 8 SSE/SSE2 registers (128-bit) • Standard x86 registers • EAX, EBX, ECX, EDX, ESI, etc. • Register renaming onto over 100 physical registers
MMX (Multimedia Extensions) • Accelerated performance through SIMD • multimedia, communication, internet applications • 64-bit packed INTEGER data • signed/unsigned
SSE2 (Streaming SIMD Extensions 2) • Accelerates a broad range of applications • video, speech, image and photo processing, encryption, financial, engineering, and scientific applications • 128-bit SIMD data formats • 4 single-precision FP values • 2 double-precision FP values • 16 byte values • 8 word values • 4 doubleword values • 2 quadword values • 1 128-bit integer value
SIMD Example (16-tap FIR filter - Real numbers) • Applications for real FIR filters • general-purpose filters in image processing, audio, and communication algorithms • Will use the SSE2 SIMD instruction set
Thinking about SIMD • SSE2 data format is 128 bits wide • 128-bit SSE2 registers • Many data formats! • What precision do we want? • Let's use 32-bit floating point for coefficients, input, and output • 4 values x 32 bits = 128 bits
Parallelizing • Requires many individual multiplications (coefficients x inputs), whose results are then added to form the output! • Multiplications first… • then the additions...
Using SSE2 format • Each 128-bit register can hold 4 elements of an array of 32-bit data • 4 single-precision floating-point operations per instruction (32-bit)
Additions... • Both registers now hold 4 32-bit results • First add the results into an accumulator register • 4 single-precision floating-point additions per instruction (32-bit)
Additions... • The register now holds 4 32-bit results • however, there is NO SSE2 instruction to add these 4 together! • But other instructions can be used • Interleave the elements (unpack/shuffle)… then add • This gives results for several output values at once!
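The multiply, accumulate, and final reduction steps can be sketched in C. This is a portable scalar model of the four-lane data flow, not actual SSE2 code and not the code from the Intel application note; the function names are hypothetical, and it computes a single output where the slides describe producing several at once.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* four 32-bit floats fill one 128-bit register */

/* One FIR output, y = sum over k of coeff[k] * x[k], computed the way
 * a 4-wide SIMD unit would: multiply and accumulate four lanes at a
 * time, then reduce the four partial sums at the end.
 * `x` points at the `taps` input samples that line up with coeff[]. */
float fir_output_4lane(const float *coeff, const float *x, size_t taps)
{
    assert(taps % LANES == 0);            /* e.g. 16 taps = 4 groups of 4 */
    float acc[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (size_t k = 0; k < taps; k += LANES) {
        /* Models one packed multiply followed by one packed add. */
        for (size_t lane = 0; lane < LANES; ++lane) {
            acc[lane] += coeff[k + lane] * x[k + lane];
        }
    }

    /* Horizontal reduction: no single SSE2 instruction does this, so
     * the lanes are paired up and added in two steps. */
    float pair0 = acc[0] + acc[2];
    float pair1 = acc[1] + acc[3];
    return pair0 + pair1;
}

/* Plain scalar reference, one multiply-add per tap. */
float fir_output_scalar(const float *coeff, const float *x, size_t taps)
{
    float sum = 0.0f;
    for (size_t k = 0; k < taps; ++k) {
        sum += coeff[k] * x[k];
    }
    return sum;
}
```

The two functions add the products in a different order, so for general floating-point data their results can differ in the last bits; for small integer-valued inputs they agree exactly.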
ADI SHARC 21k vs. P4: SHARC Disadvantages • Slower clock speed (40 MHz vs 2400 MHz) • Fewer opportunities for parallelism (5 vs 11) • Much less memory (cache and system) • Limited algorithm applicability • Limited applications • Older, with less support (e.g. compilers) • 1994 vs 2001
ADI SHARC 21k vs. P4: SHARC Advantages • Hardware loops • Easier to program for optimal speed • Cheaper • Lower power consumption • Runs cooler
FIR Performance • Hard to obtain P4 performance numbers • Can estimate from 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full • 2 * 2.4 GHz ≈ 4.8 billion multiplies per second • If ~4 multiplies per element & 44000 samples/s • FIR length > ~25k taps • SHARC => ~200 taps (Lab 4) • Factor of ~125x
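The tap-count estimate can be reproduced with a few lines of C. This sketch assumes that "~4 multiplies per element" means four multiplies charged to each tap and that one output is produced per input sample; the function name is illustrative only.

```c
#include <assert.h>

/* Longest FIR (in taps) that keeps up with the sample rate, given a
 * sustained multiply rate and a multiply cost charged to each tap. */
double max_fir_taps(double mults_per_sec, double mults_per_tap,
                    double sample_rate_hz)
{
    assert(mults_per_tap > 0.0 && sample_rate_hz > 0.0);
    return mults_per_sec / (mults_per_tap * sample_rate_hz);
}
```

With the slide's inputs, `max_fir_taps(2 * 2.4e9, 4, 44000)` evaluates to roughly 27,000 taps, consistent with "more than ~25k". Dividing 25,000 by the SHARC's ~200 taps gives the quoted factor of about 125.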
IIR Performance • Hard to obtain P4 performance numbers • No hardware circular buffers • Does have BTB, BHT, etc. • Prefetches ~256 bytes ahead of the current position in code.
FFT Performance • Hard to obtain P4 performance numbers • Prime95 uses FFTs to run the Lucas-Lehmer test for Mersenne primes • Involves FFT, squaring, iFFT, etc. • 256k points on a 2.3 GHz P4: ~10.517 ms • Compare to a 2048-point FFT on the SHARC: ~0.37 ms • If the SHARC could do 256k points, scaling linearly (125 x 0.37 ms) gives 46.25 ms (But…)
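The 46.25 ms SHARC figure corresponds to scaling the 2048-point time linearly by 256000 / 2048 = 125. Since the operation count of a radix-2 FFT grows as N log2 N rather than N, the sketch below compares both extrapolations. It is an illustration under stated assumptions (power-of-two sizes for the N log N case, no memory limits, hypothetical function names), not a measurement.

```c
#include <assert.h>

/* log2 of a power of two, by counting shifts. */
static unsigned ilog2_pow2(unsigned long n)
{
    unsigned r = 0;
    while (n > 1) { n >>= 1; ++r; }
    return r;
}

/* Extrapolate a run time assuming cost proportional to N. */
double scale_linear(double t_ms, unsigned long n_from, unsigned long n_to)
{
    assert(n_from > 0);
    return t_ms * ((double)n_to / (double)n_from);
}

/* Extrapolate a run time assuming cost proportional to N * log2(N).
 * Both sizes must be powers of two greater than 1. */
double scale_nlogn(double t_ms, unsigned long n_from, unsigned long n_to)
{
    assert(n_from > 1 && (n_from & (n_from - 1)) == 0);
    assert(n_to > 1 && (n_to & (n_to - 1)) == 0);
    double from_cost = (double)n_from * ilog2_pow2(n_from);
    double to_cost = (double)n_to * ilog2_pow2(n_to);
    return t_ms * (to_cost / from_cost);
}
```

`scale_linear(0.37, 2048, 256000)` reproduces the slide's 46.25 ms. Taking 256k as 262144 points, `scale_nlogn(0.37, 2048, 262144)` comes to roughly 77.5 ms, so the linear figure is on the optimistic side for the SHARC.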
Optimization Example • Hard to optimize Pentium 4 assembly • Example of multiplying by a constant, 10 • Taken mainly from: www.emulators.com/docs/pentium_1.htm
Multiplying by 10 • Slowest way: • IMUL EAX, 10 • Usually optimal way (Visual C++ 6.0): • LEA EAX, [EAX+EAX*4] • SHL EAX, 1 • Shift – Add – Shift • On most x86 processors this takes 2 cycles • On Pentium MMX and earlier it takes 3 cycles • On Pentium 4 it takes 6 cycles!
Multiplying by 10 • Optimal for Pentium 4: • LEA ECX, [EAX+EAX] • LEA EAX, [ECX+EAX*8] • On most x86 processors this still takes 2 cycles • On Pentium 4 it takes ~3 cycles (OOE - Ops) • But on Pentium MMX and earlier it now takes 4 cycles!
Multiplying by 10 • Best generic case: • LEA EAX, [EAX+EAX*4] • ADD EAX, EAX • On most x86 processors this still takes 2 cycles • On Pentium MMX and earlier it takes 3 cycles again • On Pentium 4 it takes 4 cycles • Clearly, it is really hard to optimize for every processor at once
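The three instruction sequences all compute the same value; only their timing differs between processors. The C sketch below writes out the arithmetic each one performs, using 32-bit unsigned values to mirror EAX wrapping on overflow. The function names are illustrative only, and a compiler is free to emit any instruction sequence for them, so this shows the identities rather than the cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* IMUL EAX, 10 */
uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* LEA EAX, [EAX+EAX*4] ; SHL EAX, 1   ->  (x + 4x) << 1 */
uint32_t mul10_lea_shl(uint32_t x)
{
    uint32_t t = x + (x << 2);
    return t << 1;
}

/* LEA ECX, [EAX+EAX] ; LEA EAX, [ECX+EAX*8]   ->  2x + 8x */
uint32_t mul10_lea_lea(uint32_t x)
{
    uint32_t c = x + x;
    return c + (x << 3);
}

/* LEA EAX, [EAX+EAX*4] ; ADD EAX, EAX   ->  5x + 5x */
uint32_t mul10_lea_add(uint32_t x)
{
    uint32_t t = x + (x << 2);
    return t + t;
}

/* Returns 1 if all four variants agree for x, 0 otherwise. */
int mul10_all_agree(uint32_t x)
{
    uint32_t ref = mul10_imul(x);
    return mul10_lea_shl(x) == ref
        && mul10_lea_lea(x) == ref
        && mul10_lea_add(x) == ref;
}
```

Because unsigned arithmetic wraps modulo 2^32, the variants agree for every input, including values where 10x overflows 32 bits.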
REFERENCES • Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions • graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html