
Intel Pentium 4

ENCM 515 - 2002. Jonathan Bienert, Tyson Marchuk.


Presentation Transcript


  1. Intel Pentium 4 ENCM 515 - 2002 Jonathan Bienert Tyson Marchuk

  2. Overview: • Product review • Specialized architectural features (NetBurst) • SIMD instruction capabilities (MMX, SSE2) • SHARC 2106x comparison

  3. Intel Pentium 4 • Reworked micro-architecture for high-bandwidth applications • Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multimedia, and multitasking user environments • These are DSP-intensive applications! • What about uses outside the PC?

  4. Hardware Features (NetBurst micro-architecture): • Hyper-pipelined technology • Advanced dynamic execution • Caches (L1 data, L1 execution trace, L2) • Rapid ALU execution engines • 400 MHz system bus • Out-of-order execution (OOE) • Microcode ROM

  5. Hyper Pipeline • 20-stage pipeline!!! • breaks down complex CISC instructions • sub-stages mimic RISC • faster execution

  6. Filling the pipeline... • Looks at a window of the next 126 instructions to be executed • Branch prediction • on a misprediction, the 20-stage pipeline must be flushed! • branch target buffer (BTB) • 4K branch history table (BHT) • branch hints in assembly instructions

  7. Cache • 8 KB L1 Data Cache • L1 Execution Trace Cache • stores 12K previously decoded micro-instructions • saves having to translate them again • L2 Advanced Transfer Cache • 256 KB for data • 256-bit transfer every cycle • allows ~77 GB/s data transfer at 2.4 GHz
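The ~77 GB/s figure on this slide follows from the two numbers next to it. A minimal sketch of the arithmetic, assuming one 256-bit transfer on every core clock and counting 1 GB as 10^9 bytes (the function and variable names are illustrative only):

```python
# Peak L2 transfer rate implied by the slide: 256 bits per clock at 2.4 GHz.
BITS_PER_TRANSFER = 256
CLOCK_HZ = 2.4e9


def peak_bandwidth_gb_per_s(bits_per_transfer, clock_hz):
    """Return bytes moved per second, in units of 10^9 bytes."""
    bytes_per_transfer = bits_per_transfer // 8  # 256 bits = 32 bytes
    return bytes_per_transfer * clock_hz / 1e9


l2_peak = peak_bandwidth_gb_per_s(BITS_PER_TRANSFER, CLOCK_HZ)
print(f"peak L2 bandwidth: {l2_peak:.1f} GB/s")
```

This is a peak rate; it says nothing about what a real workload sustains.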

  8. Rapid ALU Execution Engines • 2 ALUs • allow parallel operations • Many arithmetic operations take 1/2 cycle • each double-speed (2X) ALU can complete 2 operations per cycle

  9. Software Features: • Multimedia Extensions (MMX) • 8 MMX registers • Streaming SIMD Extensions 2 (SSE2) • 8 SSE/SSE2 registers • Standard x86 registers • EAX, EBX, ECX, EDX, ESI, etc. • Register renaming onto more than 100 physical registers

  10. MMX (Multimedia Extensions) • Accelerated performance through SIMD • multimedia, communication, internet applications • 64-bit packed INTEGER data • signed/unsigned

  11. SSE2 (Streaming SIMD Extensions 2) • Accelerates a broad range of applications • video, speech, image and photo processing, encryption, financial, engineering, and scientific applications • 128-bit SIMD data formats • 4 single-precision FP values • 2 double-precision FP values • 16 bytes • 8 words • 4 doublewords • 2 quadwords • 1 128-bit integer value
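The element counts listed on this slide are simply 128 divided by the element width. A small sketch that derives them, and packs four single-precision values into 16 bytes to show the memory layout (the dictionary and its labels are illustrative, not Intel terminology taken from the slides):

```python
import struct

REGISTER_BITS = 128

# Element widths, in bits, for the packed formats named on the slide.
ELEMENT_BITS = {
    "byte": 8,
    "word": 16,
    "doubleword": 32,
    "quadword": 64,
    "single-precision float": 32,
    "double-precision float": 64,
}

# Number of elements ("lanes") that fit in one 128-bit register.
lanes = {name: REGISTER_BITS // bits for name, bits in ELEMENT_BITS.items()}

# Four 32-bit floats occupy exactly 16 bytes, i.e. one 128-bit register.
packed = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

for name, count in lanes.items():
    print(f"{count:2d} x {name}")
print(len(packed) * 8, "bits in the packed float vector")
```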

  12. SIMD Example (16-tap FIR filter - Real numbers) • Applications for real FIR filters • general purpose filters in image processing, audio, and communication algorithms • Will utilize the SSE2 SIMD instruction set

  13. Thinking about SIMD • SSE2 instructions operate on 128-bit packed data • 128-bit SSE2 registers • Many data formats! • What precision do we want? • Let's use 32-bit floating point for coefficients, input, and output: 4 values x 32 bits = 128 bits

  14. Parallelizing • Requires many individual multiplications (coefficients x inputs), whose results are then added to form the output! • Multiplications… • then need to perform additions...

  15. Using SSE2 format • Can hold 4 elements of an array (of 32-bit data) in each 128-bit register • 4 single precision floating point ops per cycle (32-bit)

  16. Additions... • Both registers now hold four 32-bit results • First add the results into an accumulator register • 4 single-precision floating-point ops per cycle (32-bit)
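The multiply-then-accumulate pattern from the last three slides can be sketched without SIMD hardware by treating a 4-element Python list as one 128-bit register. This is only a model under stated assumptions: the names `fir_reference` and `fir_packed` are hypothetical, Python floats are double precision rather than the 32-bit values the slides use, and the tap count must be a multiple of 4 (16 here).

```python
LANES = 4  # four 32-bit values per 128-bit register


def fir_reference(coeffs, samples, n):
    """Scalar FIR: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    return sum(coeffs[k] * samples[n - k] for k in range(len(coeffs)))


def fir_packed(coeffs, samples, n):
    """Same output, computed four taps at a time into a 4-lane accumulator."""
    acc = [0.0] * LANES
    for k in range(0, len(coeffs), LANES):
        c = coeffs[k:k + LANES]                            # 4 coefficients
        x = [samples[n - (k + i)] for i in range(LANES)]   # 4 matching inputs
        prod = [ci * xi for ci, xi in zip(c, x)]           # one packed multiply
        acc = [a + p for a, p in zip(acc, prod)]           # one packed add
    # The four partial sums still have to be combined; a plain sum() stands in
    # for that final reduction here.
    return sum(acc)


coeffs = [float(i + 1) for i in range(16)]   # 16 taps
samples = [float(i) for i in range(32)]
y_ref = fir_reference(coeffs, samples, 20)
y_packed = fir_packed(coeffs, samples, 20)
print(y_ref, y_packed)
```

With 16 taps the loop body runs four times, so the 16 multiplies collapse into 4 packed multiplies and 4 packed adds.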

  17. Additions... • The accumulator register now holds four 32-bit results • however, there is NO SSE2 instruction to add these 4 together! • But other instructions can be used • Some interleaving (shuffle/unpack) of elements… then add • This will give results for several output values!
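One common way to sum the four lanes of a register using only element rearrangement and packed adds is sketched below. It models registers as 4-element lists; `shuffle`, `packed_add`, and `horizontal_sum` are hypothetical helper names, and this is a generic shuffle-and-add reduction, not necessarily the exact sequence used in the application note the slides cite.

```python
def shuffle(v, order):
    """Return a copy of v with lanes rearranged; order[i] picks the source lane."""
    return [v[i] for i in order]


def packed_add(a, b):
    """Lane-wise addition of two 4-element vectors."""
    return [x + y for x, y in zip(a, b)]


def horizontal_sum(v):
    """Sum the four lanes of v with two shuffle-and-add steps."""
    # Step 1: swap the two halves and add.
    # Lanes become [v0+v2, v1+v3, v2+v0, v3+v1].
    s = packed_add(v, shuffle(v, [2, 3, 0, 1]))
    # Step 2: swap neighbouring lanes and add.
    # Every lane now holds v0+v1+v2+v3.
    s = packed_add(s, shuffle(s, [1, 0, 3, 2]))
    return s[0]


total = horizontal_sum([1.0, 2.0, 3.0, 4.0])
print(total)
```

When four accumulators for four different outputs are available, the same rearrangements can be applied across registers so that one final packed add yields four output values at once, which is the point the last bullet makes.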

  18. ADI SHARC 21k vs. P4: Disadvantages • Slower clock speed (40 MHz vs 2400 MHz) • Fewer opportunities for parallelism (5 vs 11) • Much less memory (cache and system) • Limited algorithm applicability • Limited applications • Older (less support, e.g. compilers) • 1994 vs 2001

  19. ADI SHARC 21k vs. P4: Advantages • Hardware loops • Easier to program for optimal speed • Cheaper • Lower power consumption • Runs cooler

  20. FIR Performance • Hard to obtain P4 performance numbers • Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full • 2 * 2.4 GHz ~ 4.8 billion multiplies per second • If ~4 multiplies per element and 44000 samples/s • FIR length > ~25k taps • SHARC => ~200 taps (Lab 4) • Factor of ~125x
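The estimate on this slide can be reproduced step by step. The sketch below uses only the slide's own numbers and assumes that "4 multiplies per element" means roughly 4 multiplies per filter tap per output sample; the variable names are illustrative.

```python
FP_MULTIPLIES_PER_CLOCK = 2
CLOCK_HZ = 2.4e9
MULTIPLIES_PER_TAP = 4      # assumption: "per element" read as "per tap"
SAMPLE_RATE_HZ = 44_000

# Peak multiply throughput if the pipeline is always full.
multiplies_per_second = FP_MULTIPLIES_PER_CLOCK * CLOCK_HZ

# Longest real-time FIR filter this throughput could sustain.
max_taps = multiplies_per_second / (MULTIPLIES_PER_TAP * SAMPLE_RATE_HZ)

# Ratio of the slide's rounded figures: ~25k taps (P4) vs ~200 taps (SHARC).
speedup_vs_sharc = 25_000 / 200

print(f"{multiplies_per_second:.2e} multiplies/s")
print(f"about {max_taps:.0f} taps")
print(f"factor of about {speedup_vs_sharc:.0f}x")
```

The unrounded result is a little over 27,000 taps, consistent with the slide's "> ~25k".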

  21. IIR Performance • Hard to obtain P4 performance numbers • No hardware circular buffers • Does have BTB, BHT, etc. • Prefetches ~256 bytes ahead of the current position in code.

  22. FFT Performance • Hard to obtain P4 performance numbers • Prime95 uses FFTs to run the Lucas-Lehmer test for Mersenne primes • Involves FFT, squaring, inverse FFT, etc. • 256k points on a 2.3 GHz P4 ~ 10.517 ms • Compare to SHARC 2048-point FFT ~ 0.37 ms • If the SHARC could do 256k points, scaling linearly gives 46.25 ms (But…)
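The 46.25 ms extrapolation matches a purely linear scaling of the 2048-point time, taking 256k as 256,000 points (an assumption, chosen because 0.37 ms x 125 reproduces the slide's figure). The sketch below reproduces it and, for comparison, also scales by N log2 N, the usual operation-count growth of a radix-2 FFT. The second figure is an illustration of how optimistic linear scaling is, not a number from the slides.

```python
import math

SHARC_2048_MS = 0.37
N_SMALL = 2048
N_LARGE = 256_000   # assumption: "256k" read as 256,000 points

# Linear scaling: time grows in proportion to the number of points.
linear_ms = SHARC_2048_MS * N_LARGE / N_SMALL

# N log2 N scaling: the extra log factor accounts for the added FFT stages.
nlogn_ms = linear_ms * math.log2(N_LARGE) / math.log2(N_SMALL)

print(f"linear scaling:   {linear_ms:.2f} ms")
print(f"N log N scaling:  {nlogn_ms:.1f} ms")
```

Either way, the extrapolated SHARC time is several times the 10.517 ms quoted for the P4.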

  23. Optimization Example • Hard to optimize Pentium 4 assembly • Example of multiplying by a constant, 10 • Taken mainly from: www.emulators.com/docs/pentium_1.htm

  24. Multiplying by 10 • Slowest way: • IMUL EAX, 10 • Usually optimal way (Visual C++ 6.0): • LEA EAX, [EAX+EAX*4] • SHL EAX, 1 • Shift – Add – Shift • On most x86 processors takes 2 cycles • On Pentium MMX and earlier takes 3 cycles • On Pentium 4 takes 6 cycles!

  25. Multiplying by 10 • Optimal for Pentium 4: • LEA ECX, [EAX+EAX] • LEA EAX, [ECX+EAX*8] • On most x86 still takes 2 cycles • On Pentium 4 takes ~3 cycles (OOE - Ops) • But on older processors (Pentium MMX and earlier) this now takes 4 cycles!

  26. Multiplying by 10 • Best generic case: • LEA EAX, [EAX+EAX*4] • ADD EAX, EAX • On most x86 still takes 2 cycles • On older processors (Pentium MMX and earlier) this takes 3 cycles again • On Pentium 4 this takes 4 cycles • Clearly, really hard to optimize
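All three instruction sequences from the last three slides compute the same value; they differ only in timing. A minimal sketch that models each sequence with 32-bit wraparound and checks it against a plain multiply (the `lea` helper and the function names are hypothetical; cycle counts are not modeled):

```python
MASK32 = 0xFFFFFFFF  # keep results in a 32-bit register


def lea(base, index, scale):
    """Model LEA dst, [base + index*scale] on 32-bit registers."""
    return (base + index * scale) & MASK32


def times10_lea_shl(eax):
    eax = lea(eax, eax, 4)         # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax << 1) & MASK32     # SHL EAX, 1            -> 10x


def times10_two_lea(eax):
    ecx = lea(eax, eax, 1)         # LEA ECX, [EAX+EAX]    -> 2x
    return lea(ecx, eax, 8)        # LEA EAX, [ECX+EAX*8]  -> 2x + 8x


def times10_lea_add(eax):
    eax = lea(eax, eax, 4)         # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax + eax) & MASK32    # ADD EAX, EAX          -> 10x


for x in (0, 1, 7, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
    expected = (x * 10) & MASK32   # what IMUL EAX, 10 leaves in EAX
    assert times10_lea_shl(x) == expected
    assert times10_two_lea(x) == expected
    assert times10_lea_add(x) == expected
print("all three sequences agree with IMUL EAX, 10")
```

Note that the two-LEA variant needs a second register (ECX), while the other two work in place in EAX.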

  27. REFERENCES • Intel application note: AP-809 - Real and Complex Filter Using Streaming SIMD Extensions • graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html
