Intel Multimedia Extensions and Hyper-Threading
Michele Co
CS451
Outline
• Evolution of Intel multimedia extensions
  • x87 (386)
  • MMX (Pentium MMX, Pentium II)
  • SSE (Pentium III)
  • SSE2 (Pentium 4 – Willamette)
  • SSE3 (Pentium 4 – Prescott)
• Hyper-Threading
x87 FPU
• 8 80-bit data registers (double extended-precision floating point)
• Data registers treated as a stack
• Control register – FP precision, rounding, …
• Status register – FPU busy, top of stack (TOS), condition codes (CC), error, exception, …
• Tag register – 2 bits per data register: valid, zero, special, empty
• Last instruction pointer register
• Last data (operand) pointer register
• Opcode register
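The stack behaviour of the data registers can be sketched in a few lines of Python. This is a toy model only: the class name `X87Stack` and its methods are made up for illustration, values are ordinary Python floats rather than 80-bit extended precision, and errors are raised as Python exceptions instead of setting status-register flags.

```python
import math


class X87Stack:
    """Toy model of the x87 data-register stack (hypothetical helper class)."""

    def __init__(self):
        self.regs = [0.0] * 8        # physical data registers R0..R7
        self.tags = ["empty"] * 8    # tag word: one 2-bit tag per register
        self.top = 0                 # TOS field of the status register

    @staticmethod
    def _tag(value):
        # Real hardware also tags denormals and unsupported formats as special.
        if math.isnan(value) or math.isinf(value):
            return "special"
        return "zero" if value == 0 else "valid"

    def _phys(self, i):
        # ST(i) is the register i slots below the current top of stack.
        return (self.top + i) % 8

    def fld(self, value):
        """Push: decrement TOS (mod 8), then write the new ST(0)."""
        new_top = (self.top - 1) % 8
        if self.tags[new_top] != "empty":
            raise OverflowError("x87 stack overflow")
        self.top = new_top
        self.regs[new_top] = value
        self.tags[new_top] = self._tag(value)

    def st(self, i=0):
        """Read ST(i); reading an empty register is a stack underflow."""
        p = self._phys(i)
        if self.tags[p] == "empty":
            raise IndexError("x87 stack underflow")
        return self.regs[p]

    def faddp(self):
        """ST(1) <- ST(1) + ST(0), then pop the stack."""
        result = self.st(1) + self.st(0)
        p1 = self._phys(1)
        self.regs[p1] = result
        self.tags[p1] = self._tag(result)
        self.tags[self.top] = "empty"
        self.top = (self.top + 1) % 8


fpu = X87Stack()
fpu.fld(1.5)      # ST(0) = 1.5
fpu.fld(2.25)     # ST(0) = 2.25, ST(1) = 1.5
fpu.faddp()       # ST(0) = 3.75, one register in use again
print(fpu.st(0), fpu.top)   # 3.75 7
```

Operands are named relative to the top of stack, which is why a push changes which physical register every ST(i) refers to.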
x87 Instructions
• Data transfer (load, store, move)
• Basic arithmetic
• Comparison
• Transcendental (trigonometric, log, exp)
• Load constant
• x87 FPU control
MMX
• SIMD execution
• 8 64-bit data registers (MM0–MM7)
• Aliased to the low 64 bits of the x87 FPU data registers
• Randomly accessible (addressed by name, not as a stack)
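As a rough illustration of SIMD execution on a 64-bit register, the sketch below treats a Python integer as one MMX register holding eight unsigned bytes and adds two such registers lane by lane. The function names mirror the PADDB (wraparound) and PADDUSB (unsigned saturating) instructions, but the helpers themselves are hypothetical and only model the arithmetic.

```python
def split_lanes(reg, lane_bits):
    """Split a 64-bit register value into unsigned lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(reg >> shift) & mask for shift in range(0, 64, lane_bits)]


def join_lanes(lanes, lane_bits):
    """Pack lanes (lowest first) back into one 64-bit register value."""
    mask = (1 << lane_bits) - 1
    reg = 0
    for index, lane in enumerate(lanes):
        reg |= (lane & mask) << (index * lane_bits)
    return reg


def paddb(a, b):
    """Model of PADDB: eight byte adds, each wrapping around modulo 256."""
    pairs = zip(split_lanes(a, 8), split_lanes(b, 8))
    return join_lanes([(x + y) & 0xFF for x, y in pairs], 8)


def paddusb(a, b):
    """Model of PADDUSB: eight unsigned byte adds, each clamped at 255."""
    pairs = zip(split_lanes(a, 8), split_lanes(b, 8))
    return join_lanes([min(x + y, 0xFF) for x, y in pairs], 8)


a = 0x001020F0FF017F80
b = 0x0101012001FF0180
print(hex(paddb(a, b)))     # 0x111211000008000
print(hex(paddusb(a, b)))   # 0x11121ffffff80ff
```

No carry crosses a lane boundary in either version, which is the difference between one packed add and an ordinary 64-bit integer add.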
MMX Instructions
• Data transfer
• Arithmetic
• Comparison
• Conversion
• Unpacking
• Logical
• Shift
• Empty MMX state (EMMS)
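The unpacking class is the least self-explanatory entry in the list. The sketch below models PUNPCKLBW, which interleaves the four low-order bytes of the destination and source registers. With a zero source it zero-extends four bytes into four 16-bit words, a common first step before 16-bit arithmetic. The Python helper is illustrative only and represents a register as a plain integer.

```python
def punpcklbw(dst, src):
    """Model of PUNPCKLBW: interleave the low four bytes of dst and src.

    Result bytes, lowest first: d0, s0, d1, s1, d2, s2, d3, s3.
    """
    d = dst.to_bytes(8, "little")
    s = src.to_bytes(8, "little")
    out = bytearray()
    for i in range(4):
        out.append(d[i])
        out.append(s[i])
    return int.from_bytes(bytes(out), "little")


pixels = 0x1122334455667788
widened = punpcklbw(pixels, 0)   # low four bytes become four 16-bit words
print(hex(widened))              # 0x55006600770088
```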
SSE
• Pentium III
• 8 128-bit data registers (XMM)
• Independent of x87 FPU and MMX registers
• SSE instructions can be executed in parallel with MMX/x87
• MXCSR register – control and status for XMM registers (combines the roles of the x87 control and status registers)
• EFLAGS register – results of scalar compare ops
• 128-bit packed single-precision fp data type
• Prefetching, cacheability, store ordering control instructions
SSE Instructions
• Packed and scalar single-precision floating point
• Logical
• Conversion
• 64-bit SIMD integer
• MXCSR management
• State management
• Cacheability control, prefetch, memory ordering
  • SFENCE (store fence)
• FXSAVE, FXRSTOR
  • Extend the fast save and restore of x87 and MMX state to also cover the XMM and MXCSR registers
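The difference between packed and scalar single-precision operations can be shown with a small model. Here an XMM register is represented as a list of four Python floats, lowest lane first, and `f32` rounds a result to IEEE single precision. The names follow the ADDPS and ADDSS instructions, but the functions are only a sketch of their arithmetic.

```python
import struct


def f32(x):
    """Round a Python float (a double) to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(a, b):
    """Model of ADDPS: add all four single-precision lanes."""
    return [f32(x + y) for x, y in zip(a, b)]


def addss(a, b):
    """Model of ADDSS: add lane 0 only; lanes 1..3 are copied from a."""
    return [f32(a[0] + b[0])] + list(a[1:])


a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]
print(addps(a, b))   # [1.5, 2.5, 3.5, 4.5]
print(addss(a, b))   # [1.5, 2.0, 3.0, 4.0]
```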
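SSE2
• Pentium 4
• More data types
  • 128-bit packed double-precision floating point
  • 128-bit packed integers (8-, 16-, 32-, and 64-bit elements)
• More instructions to support new data types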
SSE2 Instructions
• Support for additional types
• CLFLUSH (cache line flush)
• LFENCE (load fence)
• MFENCE (load + store fence)
SSE3
• Pentium 4 (Prescott)
• Support for Hyper-Threading
• 13 new instructions
  • 10 SIMD support instructions
  • 1 x87 accelerating instruction (FISTTP, fp to int conversion)
  • 2 instructions for synchronization of threads
    • MONITOR (monitor write-back stores)
    • MWAIT (wait for write-back store)
• No new state
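The MONITOR/MWAIT pair lets one logical processor idle until another writes to a watched address, instead of spinning. The sketch below is only a software analogy built on a Python condition variable. It does not execute the instructions, and the class `MonitoredAddress` and its methods are invented for illustration. As with the real instructions, the waiter rechecks the value after waking rather than trusting the wake-up itself.

```python
import threading


class MonitoredAddress:
    """Analogy for a memory location armed with MONITOR (hypothetical class)."""

    def __init__(self, value=0):
        self._value = value
        self._cond = threading.Condition()

    def store(self, value):
        # A store to the monitored location wakes any waiter.
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def load(self):
        with self._cond:
            return self._value

    def monitor_and_wait(self, expected, timeout=5.0):
        """Sleep while the location still holds `expected`; return what is seen on wake-up."""
        with self._cond:
            self._cond.wait_for(lambda: self._value != expected, timeout=timeout)
            return self._value


flag = MonitoredAddress(0)
seen = []
waiter = threading.Thread(target=lambda: seen.append(flag.monitor_and_wait(0)))
waiter.start()
flag.store(1)      # the "other logical processor" writes the flag
waiter.join()
print(seen)        # [1]
```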
Terminology
• Process
  • Program associated with a context (state: registers, program counter, flags, etc.)
  • Consists of one or more threads
• Thread
  • “lightweight process” (less state)
Hyper-Threading
• Single physical processor appears as 2 logical processors
• Thread Level Parallelism (TLP)
  • Many applications have software threads that can be executed simultaneously
    • Online transaction processing
    • Web services
• Latency can leave execution units idle
  • Cache misses
  • Branch mispredictions
  • Waiting for loads/stores
Techniques for Minimizing Effect of Long Latency
• Chip multiprocessing (CMP)
  • 2 processors on single die
  • Larger die than a single-core chip; more expensive to manufacture
• Time-slice or switch-on-event multithreading
  • Switch threads after fixed time period or on long latency events like cache misses
  • Does not recover resources left idle by other inefficiencies (branch mispredictions, instruction dependencies, etc.)
• Simultaneous multithreading (SMT)
  • Multiple threads execute on single processor without switching
  • Hyper-Threading is Intel’s implementation
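A toy cycle-count model can make the latency-hiding argument concrete. Everything in it is an assumption chosen for illustration: each thread is a string of ops where "E" executes in one cycle and "M" is a load that misses and stalls its thread for `miss_latency` cycles, a thread issues at most one op per cycle, and the core has `width` issue slots. Contention for caches and execution units is ignored, so the model overstates the benefit a real processor would see.

```python
def cycles_to_finish(traces, smt, width=2, miss_latency=3):
    """Count cycles until every thread trace has issued all of its ops."""
    pos = [0] * len(traces)        # next op index per thread
    ready_at = [0] * len(traces)   # first cycle each thread may issue again
    cycle = 0
    while any(p < len(t) for p, t in zip(pos, traces)):
        unfinished = [i for i, t in enumerate(traces) if pos[i] < len(t)]
        # Without SMT only one context is resident: threads run back to back.
        candidates = unfinished if smt else unfinished[:1]
        slots = width
        for tid in candidates:
            if slots == 0 or ready_at[tid] > cycle:
                continue           # no slot left, or thread is waiting on a miss
            op = traces[tid][pos[tid]]
            pos[tid] += 1
            slots -= 1
            if op == "M":
                ready_at[tid] = cycle + 1 + miss_latency
        cycle += 1
    return cycle


threads = ["EEMEE", "EMEEE"]
print(cycles_to_finish(threads, smt=False))           # 16
print(cycles_to_finish(threads, smt=True, width=1))   # 11
print(cycles_to_finish(threads, smt=True, width=2))   # 8
```

With a single issue slot the gain comes only from overlapping one thread's miss with the other thread's work. With two slots both threads also issue in the same cycle, which is the case a thread-switching scheme cannot cover.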
Resource Requirements for HT
• Need to maintain 2 contexts
• Replicated
  • Register renaming logic (RAT)
  • Instruction pointer
  • ITLB
  • Return stack predictor
  • Various other architectural registers (GP, control, APIC, machine state)
• Partitioned
  • Re-order buffers (ROBs)
  • Load/store buffers
  • Various queues, like the scheduling queues, uop queue, etc.
• Shared
  • Caches: trace cache, L1, L2, L3
  • Microcode ROM
  • Microarchitectural registers
  • Execution units
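The point of partitioning a buffer, rather than sharing it freely, is that a stalled logical processor cannot fill every entry and block the other one. A minimal sketch, assuming an even split between two logical processors (the class `PartitionedQueue` and the entry count are made up for the example):

```python
from collections import deque


class PartitionedQueue:
    """Buffer whose entries are split evenly between two logical processors."""

    def __init__(self, total_entries):
        self.limit = total_entries // 2           # per-logical-processor cap
        self.entries = {0: deque(), 1: deque()}

    def try_allocate(self, lp, uop):
        """Return False when logical processor `lp` has used its whole share."""
        if len(self.entries[lp]) >= self.limit:
            return False                          # lp stalls; the other is unaffected
        self.entries[lp].append(uop)
        return True

    def retire(self, lp):
        """Free the oldest entry held by logical processor `lp`."""
        return self.entries[lp].popleft()


queue = PartitionedQueue(total_entries=8)
results = [queue.try_allocate(0, f"uop{i}") for i in range(6)]
print(results)                            # [True, True, True, True, False, False]
print(queue.try_allocate(1, "other"))     # True
```

Logical processor 0 is refused after four entries even though four are still free, and that reserved half is what lets logical processor 1 keep allocating.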
Hyper-Threading Goals
• Minimize the die-area cost of the implementation
• Ensure that when one logical processor stalls, the other continues to make forward progress
• Maintain single-threaded performance when only one thread is active
Frontend Changes
• 2 program counters (one per logical processor)
• Arbitration for shared resource access
  • Trace cache, microcode ROM, caches
  • One logical processor at a time per structure
  • Thread tags per trace cache entry
  • Microcode ROM – 2 microcode instruction pointers
• Wider pipeline latches to hold state for 2 contexts
• Branch prediction
  • Return address stack (RAS) and branch history buffer duplicated
  • Global history shared, but tagged with logical processor ID
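Arbitration for a structure that serves one logical processor at a time can be sketched as a simple alternating arbiter: when both logical processors request access in the same cycle they take turns, and when only one requests it that one gets every cycle. The function below is an illustrative model under that assumption, not a description of the actual arbitration circuit.

```python
def arbitrate(requests):
    """Grant one logical processor per cycle.

    requests: list of (lp0_wants, lp1_wants) booleans, one pair per cycle.
    Returns a list with 0, 1, or None (no request) for each cycle.
    """
    grants = []
    last = 1                       # so logical processor 0 wins the first tie
    for want0, want1 in requests:
        if want0 and want1:
            winner = 1 - last      # alternate when both are requesting
        elif want0:
            winner = 0
        elif want1:
            winner = 1
        else:
            winner = None
        if winner is not None:
            last = winner
        grants.append(winner)
    return grants


cycles = [(True, True), (True, True), (True, True),
          (False, True), (False, True), (True, True), (False, False)]
print(arbitrate(cycles))   # [0, 1, 0, 1, 1, 0, None]
```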
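Execution Modes
• Single-task (ST), Multi-task (MT)
  • ST0, ST1: only logical processor 0 or only logical processor 1 is active
• HALT: transitions from MT to ST0 or ST1, depending on which logical processor executes it (the other one stays active)
• An interrupt sent to the halted logical processor transitions back to MT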
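The mode changes can be written down as a small state machine. The MT, ST0, and ST1 transitions follow the bullets above. The `LOW_POWER` state, entered when the remaining active logical processor also halts, is an assumption added so the table is complete, and the names `TRANSITIONS` and `step` are hypothetical.

```python
# (current mode, event, logical processor) -> next mode
TRANSITIONS = {
    ("MT", "halt", 0): "ST1",             # LP0 halts, only LP1 stays active
    ("MT", "halt", 1): "ST0",             # LP1 halts, only LP0 stays active
    ("ST0", "interrupt", 1): "MT",        # halted LP1 is woken
    ("ST1", "interrupt", 0): "MT",        # halted LP0 is woken
    # Assumption: both logical processors halted.
    ("ST0", "halt", 0): "LOW_POWER",
    ("ST1", "halt", 1): "LOW_POWER",
    ("LOW_POWER", "interrupt", 0): "ST0",
    ("LOW_POWER", "interrupt", 1): "ST1",
}


def step(mode, event, lp):
    """Return the next execution mode; unlisted combinations leave it unchanged."""
    return TRANSITIONS.get((mode, event, lp), mode)


mode = "MT"
mode = step(mode, "halt", 0)        # ST1
mode = step(mode, "interrupt", 0)   # MT
print(mode)                         # MT
```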