Efficient Support for All Levels of Parallelism for Complex Media Applications
Ruchira Sasanka
Ph.D. Preliminary Exam
Thesis Advisor: Sarita Adve
Department of Computer Science, University of Illinois at Urbana-Champaign
Motivation
• Complex multimedia applications are critical workloads
  • Demand high performance
  • Demand high energy efficiency
• Media apps are important for general-purpose processors (GPPs)
  • Increasingly run on general-purpose processors
  • Multiple specifications/standards demand high programmability
• How can GPPs support multimedia apps efficiently?
Motivation
• Is support for Data Level Parallelism (DLP) sufficient?
  • VIRAM, Imagine
  • Assume large amounts of similar (regular) work
  • Look at media kernels
• Are full applications highly regular?
  • Similarity/amount limited by decisions (control)
  • Multiple specifications and intelligent algorithms
  • Varying amounts of DLP – sub-word, vector, stream
  • Instruction/Data/Thread-level parallelism (ILP/DLP/TLP)
• Need to support all levels of parallelism efficiently
Philosophy: Five Guidelines
• Supporting all levels of parallelism
  • ILP, TLP, sub-word SIMD (SIMDsw), vectors, streams
• Familiar programming model
  • Increases portability and facilitates widespread adoption
• Evolutionary hardware
• Adaptive, partitioned resources for efficiency
• Limited dedicated systems (seamless integration)
  • Degree/type of parallelism varies with application
Contributions of Thesis
• Analyze complex media apps
  • Make the case that media apps need all levels of parallelism
• ALP: Support for All Levels of Parallelism
  • Based on a contemporary CMP/SMT processor
  • Careful partitioning for energy efficiency
• Novel DLP support
  • Supports vectors and streams
  • Seamless integration with CMP/SMT core (no dedicated units)
  • Low hardware/programming overhead
• Speedups of 5X to 49X, EDP improvement of 5X to 361X
  • With thread + SIMDsw + novel DLP support, over a 4-wide OOO core
Other Contributions
• CMP/SMT/Hybrid energy comparison (ICS’04)
  • Evaluates the most energy-efficient TLP support for media apps on GPPs
  • Mathematical model to predict energy efficiency
  • Found a hybrid CMP/SMT to be an attractive solution
• Hardware adaptation for energy for media apps (ASPLOS’02)
  • Control algorithms for hardware adaptation
  • MS thesis, jointly done with Chris Hughes
• Not discussed here to focus on current work
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Supporting TLP
• 4-core CMP
  • 2 SMT threads for each core
  • Total 8 threads
  • Based on our TLP study
• Shared L2 cache
[Figure: four cores (Core 0–Core 3) sharing an L2 cache]
Supporting ILP (and SMT)
• Partitioned for energy/latency and SMT support
• Sub-word DLP support with SIMDsw units
[Figure: two-core pipeline diagram showing fetch/decode, rename (REN), branch predictors, integer and FP/SIMD issue queues and register files, load/store queues, ROBs, integer ALUs, 2 SIMD ALUs and 2 SIMD FPUs per partition, iTLB/dTLB, banked L1 I- and D-caches (2 banks x 4 ways each), and L2 sub-banks; key: Thread 0, Thread 1, Shared]
Supporting DLP
• 128-bit SIMDsw operations
  • Like SSE/AltiVec/VIS
• Novel DLP support
  • Supports vectors/streams of SIMDsw elements
  • 2-dimensional DLP
  • No vector execution unit!
  • No vector masks!!
  • No vector register file!!!
Novel DLP Support
• Vector data and computation decoupled
  • Vector data provides locality and regularity
  • Vector instructions encode multiple operations
• Vector registers morphed out of L1 cache lines
  • Indexed Vector Registers (IVR)
• Computation uses SIMDsw units
Indexed Vector Registers (IVR)
• IVR allocated in cache ways close to SIMDsw units
• Each IVR bank partitioned and allocated only on demand
[Figure: same two-core pipeline diagram, with the L1 D-cache ways nearest the SIMDsw units (Banks 0–1, Ways 0–1) reallocated as IVR storage; key: Thread 0, Thread 1, Shared]
Indexed Vector Registers (IVR)
• No logic in the critical path; in fact, faster than cache access
[Figure: per-register vector descriptors with fields Start Element, Current Element Pointer, Available Elements, and Length for V0 and V1, mapping vector elements onto 128b cache lines in Bank 0, Way 0 and Bank 0, Way 1]
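To make the descriptor idea concrete, here is a minimal C++ sketch of how a descriptor can locate an element with pure indexing, no associative lookup; all names, field widths, and the line-interleaving scheme are illustrative assumptions, not ALP's actual implementation.

  #include <cstdint>
  #include <cstddef>

  constexpr size_t kLineBytes = 32;   // 32B L1 line (per the parameters slide)
  constexpr size_t kElemBytes = 16;   // 128-bit SIMDsw element
  constexpr size_t kElemsPerLine = kLineBytes / kElemBytes;

  // One descriptor per architectural vector register.
  struct VectorDescriptor {
      uint16_t start_element;   // first element's slot in the IVR
      uint16_t cur_element;     // next element a SIMDsw op will touch
      uint16_t avail_elements;  // elements already loaded/produced
      uint16_t length;          // total elements in the vector
  };

  struct IvrSlot { size_t line; size_t offset; };

  // Map "element i of register v" to a (line, byte offset) pair inside
  // the IVR-allocated cache ways: a shift and a mask, which is why IVR
  // access can be faster than a normal tag-checked cache hit.
  IvrSlot locate(const VectorDescriptor& v, size_t i) {
      size_t slot = v.start_element + i;
      return { slot / kElemsPerLine, (slot % kElemsPerLine) * kElemBytes };
  }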
Novel DLP Support
• Vector data and computation decoupled
  • Vector registers morphed out of L1 cache lines: Indexed Vector Registers (IVR)
  • Computation uses SIMDsw units
• Vector instructions for memory
  • Vector memory instructions serviced at L2
• SIMDsw instructions for computation
  • SIMDsw instructions can access vector registers (IVR)
Vector vs. SIMDsw Loop
• V2 = k * (V0 + V1)

Vector:
  vector_load V0
  vector_load V1
  vector_add V0, V1 V3                      // V3 temporary vector register
  vector_mult V3, simd_reg1 V2              // k is in simd_reg1
  vector_store V2 …                         // V2 = k * (V0 + V1)

ALP:
  vector_load V0; vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0               // adds next elements of V0 & V1
    simd_mul simd_reg0, simd_reg1 V2        // write product into next elem of V2
    cur_elem_pointer_increment V0, V1, V2   // advance pointers to next element
  vector_store V2 …
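The contrast has a direct scalar analogue. The C++ sketch below (purely illustrative; the element type, N, and all names are assumptions, not ALP syntax) shows why the vector style needs a whole temporary vector while the SIMDsw-loop style keeps intermediate results in registers.

  #include <array>
  #include <cstddef>

  constexpr size_t N = 1024;
  using Vec = std::array<float, N>;

  // "Vector" style: whole-vector operations require a temporary V3.
  void vector_style(const Vec& v0, const Vec& v1, Vec& v2, float k) {
      Vec v3;                                    // temporary vector register
      for (size_t i = 0; i < N; ++i) v3[i] = v0[i] + v1[i];
      for (size_t i = 0; i < N; ++i) v2[i] = k * v3[i];
  }

  // ALP SIMDsw-loop style: each iteration consumes the next elements of
  // V0/V1 and produces the next element of V2; the add result stays in a
  // (SIMDsw) register, so no temporary vector is ever allocated.
  void alp_style(const Vec& v0, const Vec& v1, Vec& v2, float k) {
      for (size_t i = 0; i < N; ++i) {
          float r0 = v0[i] + v1[i];              // simd_add -> simd_reg0
          v2[i] = r0 * k;                        // simd_mul -> next elem of V2
      }                                          // cur_elem_pointer_increment
  }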
SIMDsw Loops in Action
  vector_load V0
  vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0
    simd_mul simd_reg0, simd_reg1 V2
    cur_elem_pointer_increment V0, V1, V2
  vector_store V2 …
[Figure: descriptor state (Start Element, Current Element Pointer, Available Elements, Length) for V0, V1, and V2, their contents in IVR Bank 0, Ways 0–1, and the rename stage (REN) expanding successive loop iterations into indexed operations (simd_add V0[0], V1[0] simd_reg0; simd_mul simd_reg0, simd_reg1 V2[8]; simd_add V0[1], V1[1] simd_reg0'; simd_mul simd_reg0', simd_reg1 V2[9]), which then wait in the issue queue (IQ)]
Why SIMDsw Loops?
• Lower peak memory BW and better memory latency tolerance
  • Computation interspersed with memory operand use

Vector:
  vector_load V0
  vector_load V1
  vector_add V0, V1 V3                      // V3 temporary vector register
  vector_mult V3, simd_reg1 V2              // k is in simd_reg1
  vector_store V2 …                         // V2 = k * (V0 + V1)

ALP:
  vector_load V0; vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0               // adds next elements of V0 & V1
    simd_mul simd_reg0, simd_reg1 V2        // write product into next elem of V2
    cur_elem_pointer_increment V0, V1, V2   // advance pointers to next element
  vector_store V2 …
Why SIMDsw Loops?
• Fewer vector register ports and no temporary vector regs
  • Lower power, fewer L1 invalidations (vector allocations)

Vector:
  vector_sub V0, V1 V2               // V2 = (V0 - V1)
  vector_mul V2, V2 V3               // V3 = (V0 - V1) * (V0 - V1)
  vector_add V3, reg0 => reg0        // sum += (V0 - V1) * (V0 - V1)

ALP:
  do for all elements in vector V0 and V1
    simd_sub V0, V1 simd_reg0
    simd_mul simd_reg0, simd_reg0 simd_reg1
    simd_add simd_reg1, simd_reg2 simd_reg2
Why SIMDsw Loops?
• Lower peak memory BW and better memory latency tolerance
  • Computation interspersed with memory operand use
• Fewer vector register ports and no temporary vector regs
  • Lower power, fewer L1 invalidations
• No vector mask registers
• Compatibility with streams
• Easy utilization of heterogeneous execution units
Drawbacks of SIMDsw Loops
• Use more dynamic instructions
  • Higher front-end energy
  • Higher front-end width
• Require static/dynamic unrolling for multi-cycle exec units (see the sketch below)
  • Vectors contain independent elements
  • Static unrolling increases code size
  • Dynamic unrolling limited by issue queue size
• How to retain the advantages and get rid of the drawbacks? => Future work
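To make the unrolling drawback concrete, here is a minimal C++ sketch (the 2x factor, names, and element type are assumptions for illustration): since consecutive vector elements are independent, the loop body can be replicated so a multi-cycle pipelined unit has several iterations in flight, but each static copy grows the code, and each dynamic copy occupies issue-queue entries.

  #include <cstddef>

  // Illustrative static 2x unrolling of a SIMDsw-style loop body so two
  // independent multiplies can overlap in a multi-cycle pipelined unit.
  void scale(const float* in, float* out, float k, size_t n) {
      size_t i = 0;
      for (; i + 2 <= n; i += 2) {   // unrolled: two independent iterations
          float a = in[i]     * k;   // starts in cycle t
          float b = in[i + 1] * k;   // starts in cycle t+1, overlaps with a
          out[i]     = a;
          out[i + 1] = b;
      }
      for (; i < n; ++i) out[i] = in[i] * k;   // remainder loop
  }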
Supporting DLP – Streams
• A stream is a long vector
• Processed using SIMDsw loops similar to vectors, except
  • Only a limited # of records allocated in IVR at a time
  • Stream load/allocate (analogous to vector load/alloc)
• When a cur_elem_pointer_increment retires
  • Head element is discarded from IVR (or written to memory)
  • New element appended to IVR
Streams – Example
  stream_load addr:stride V0
  stream_load addr:stride V1
  stream_alloc addr:stride V2        // just allocate some IVR entries
  do for all elements in stream
    simd_add V0, V1 V2
    cur_elem_pointer_inc V0, V1, V2  // optional: can assume auto increment
  stream_store V2                    // flushes the remainder of V2
[Figure: IVR windows for streams V0 and V1 holding elements N through N+7; as the loop retires, elements N and N+1 are discarded and N+8 and N+9 are appended]
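A minimal C++ sketch of the sliding-window semantics, modeled here as a ring buffer; the class, names, and sizes are illustrative assumptions, not ALP hardware state.

  #include <cstdint>
  #include <cstddef>
  #include <vector>

  // Model of a stream's IVR window: only `capacity` elements live in the
  // IVR at once; each retiring cur_elem_pointer_increment discards the
  // head element and appends the next one from memory.
  struct StreamWindow {
      std::vector<uint32_t> slots;  // IVR entries allocated to this stream
      size_t head = 0;              // oldest live element
      size_t next_addr = 0;         // next element index to fetch

      explicit StreamWindow(size_t capacity) : slots(capacity) {}

      void fill(const uint32_t* mem) {       // stream_load: prime the window
          for (auto& s : slots) s = mem[next_addr++];
      }
      uint32_t current() const { return slots[head]; }
      void advance(const uint32_t* mem) {    // on retire: discard head,
          slots[head] = mem[next_addr++];    // append the new element
          head = (head + 1) % slots.size();
      }
  };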
ALP’s Vector Support vs. SIMDsw (e.g., MMX)
• ALP’s vector support is different from conventional vector support
  • Uses SIMDsw loops and IVR
• How does ALP’s vector support differ from SIMDsw (MMX)?
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
• What do these really buy?
ALP’s Vector Support vs. SIMDsw (e.g., MMX)
• V2 = k * (V0 + V1)

ALP:
  vector_load V0
  vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0          // adds next W elements of V0 & V1
    simd_mul simd_reg0, simd_reg1 V2   // write product into next W elems of V2
  vector_store V2 …

SIMDsw (MMX):
  do for all elements
    simd_load … simd_reg2
    simd_load … simd_reg3
    simd_add simd_reg2, simd_reg3 simd_reg0
    simd_mul simd_reg0, simd_reg1 simd_reg4
    simd_store … simd_reg4
  …

• More instructions
• More SIMDsw register pressure
• Loads are evil (LdStQ/TLB/cache tag accesses; can miss)
• Indexed stores are really 2 instructions
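For reference, the SIMDsw (MMX-style) pattern looks like the following in C++ with SSE2 intrinsics. This is a sketch of the conventional model ALP is contrasted with, not ALP code; the 16-bit element type and function name are assumptions.

  #include <emmintrin.h>   // SSE2 intrinsics
  #include <cstdint>
  #include <cstddef>

  // Conventional SIMDsw loop for V2 = k * (V0 + V1) over 16-bit elements:
  // every iteration issues two explicit loads and a store, which is the
  // per-element overhead the ALP version avoids.
  void simd_sw(const int16_t* v0, const int16_t* v1, int16_t* v2,
               int16_t k, size_t n) {          // assumes n is a multiple of 8
      __m128i vk = _mm_set1_epi16(k);          // k broadcast (simd_reg1)
      for (size_t i = 0; i < n; i += 8) {
          __m128i a = _mm_loadu_si128((const __m128i*)(v0 + i)); // simd_load
          __m128i b = _mm_loadu_si128((const __m128i*)(v1 + i)); // simd_load
          __m128i s = _mm_add_epi16(a, b);                       // simd_add
          __m128i p = _mm_mullo_epi16(s, vk);                    // simd_mul
          _mm_storeu_si128((__m128i*)(v2 + i), p);               // simd_store
      }
  }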
ALP’s Vector Support vs. SIMDsw (e.g., MMX)
• Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block prefetching
  • No L1 or SIMDsw register file pollution
  • Low latency/energy IVR access and no alignment overhead
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
ALP Architecture Parameters (Freq: 500 MHz)
[Figure: the two-core pipeline/IVR diagram annotated with the parameters below; key: Thread 0, Thread 1, Shared]
• L1 D-cache: write-through, 2 banks, 16K per bank, 4 ways per bank, 2 ports per bank, 32B line size; Indexed Vector Registers (IVR) allocated in the ways closest to the SIMD units
• L1 I-cache: 2 banks, 8K per bank, 4 ways per bank, 1 port per bank, 32B line size
• L2 unified cache: write-back, 4 banks, 4 sub-banks per bank, 64K per sub-bank, 4 ways per sub-bank, 1 port per sub-bank, 64B line size
• SIMD FPUs/ALUs: 2 per partition, 128-bit
• Int units: 64-bit, 2 per partition
• Int register file: 2 partitions, 32 regs per partition, 4R/3W per partition
• FP/SIMD register file: 2 partitions, 16 regs per partition, 128-bit, 4R/3W per partition
• Int issue queues: 2 partitions, 2 banks per partition, 12 entries per bank, 2R/2W per bank, tag 4R/2W per bank, 2 issues per bank
• FP/SIMD issue queue: 2 partitions, 16 entries per partition, 2R/2W per partition, tag 4R/2W per partition, 2 issues per partition
• Load/store queue: 2 partitions, 16 entries per partition, 2R/2W per partition, 2 issues per partition
• Reorder buffer: 2 partitions, 2 banks per partition, 32 entries per bank, 2R/2W per bank, 2 retires per bank
• Branch predictor: G-select, 2 partitions, 2K per partition
• Rename table: 2 partitions, 4 wide per partition
Methodology
• Execution-driven simulator
  • Functional simulation based on RSIM
  • Entirely new timing simulator for ALP
• Applications compiled using Sun CC with full optimization
• SIMDsw/vector instructions in a separate assembly file
  • Hand-written assembly instructions (like MMX)
  • Used only in a small number of heavily used routines
  • Simulated through hooks placed in the application binary
• Dynamic power modeled using WATTCH
• Static power with HotLeakage coming soon …
Multi-threaded Full Applications
• MPEG 2 encoding
  • Each thread encodes part of a frame (see the sketch below)
  • DLP instructions in DCT, SAD, IDCT, QUANT, IQUANT
• MPEG 2 decoding
  • Each thread decodes part of a frame
  • DLP instructions in IDCT, CompPred
• Ray tracing (Tachyon)
  • Each thread processes an independent ray; no DLP
• Speech recognition (Sphinx 3)
  • Each thread evaluates a Gaussian scoring model
  • DLP instructions in Gaussian vector operations
• Face recognition (CSU)
  • Threaded matrix multiplication and distance evaluation
  • Streams used for matrix manipulations
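A hypothetical C++ sketch of the per-frame decomposition described for MPEG 2: each thread works on an independent slice of macroblock rows. All types and function names are illustrative assumptions, not the benchmark's actual code.

  #include <thread>
  #include <vector>
  #include <functional>
  #include <algorithm>
  #include <cstddef>

  struct Frame { int mb_rows; /* ... pixel data ... */ };

  void encode_rows(Frame& f, int first_row, int last_row) {
      // DCT, SAD-based motion estimation, quantization, etc. run here on
      // macroblock rows [first_row, last_row): the DLP-heavy kernels.
  }

  void encode_frame(Frame& f, int num_threads) {
      std::vector<std::thread> workers;
      int rows_per_thread = (f.mb_rows + num_threads - 1) / num_threads;
      for (int t = 0; t < num_threads; ++t) {
          int first = t * rows_per_thread;
          int last  = std::min(f.mb_rows, first + rows_per_thread);
          workers.emplace_back(encode_rows, std::ref(f), first, last);
      }
      for (auto& w : workers) w.join();   // frame done when all slices finish
  }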
Systems Simulated
• 1-thread superscalar: 1T, 1T+S, 1T+SV
  • 1T – the base system with one thread and no SIMDsw
  • 1T+S – base + SIMDsw instructions
  • 1T+SV – base + SIMDsw/vector instructions + IVR
• 4-thread CMP: 4T, 4T+S, 4T+SV
  • Analogous to the above, but running four threads on a 4-core CMP
• 8-thread CMP+SMT: 4x2T, 4x2T+S, 4x2T+SV
  • Analogous to the first three, but running 8 threads
  • 4-core CMP, with each core running 2 SMT threads
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Speedup
• SIMDsw support adds
  • 1.5X to 6.6X over base
• Vector support adds
  • 2X to 9.7X over base
  • 1.1X to 1.9X over SIMDsw
[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Speedup
• CMP support adds
  • 3.1X to 3.9X over base
  • 2.5X to 3.8X over 1T+SV
[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar labeled 35.9]
Speedup
• SMT support adds
  • 1.14X to 1.87X over CMP (+SV)
  • 1.03X to 1.29X over CMP (+S)
• ALP achieves
  • 5.0X to 48.8X over base
  • All forms of parallelism essential
[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars labeled 35.9 and 48.8]
Energy Consumption (No DVS)
• SIMDsw savings 1.4X to 4.8X over base
• +SV savings 1.1X to 1.4X over SIMDsw
[Chart: % energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Energy Consumption (No DVS)
• CMP savings 1.09X to 1.17X (for +SV)
• 1.08X to 1.16X over base
[Chart: % energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Energy Consumption (No DVS)
• SMT increases energy by 4% (+SV) and 14% (+S)
• ALP reduces energy by up to 7.4X
[Chart: % energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Energy Delay Product (EDP) Improvement
• SIMDsw support adds (excluding RayTrace)
  • 2.3X to 30.7X over base
• Vector support adds (excluding RayTrace)
  • 4.5X to 63X over base
  • 1.3X to 2.5X over SIMDsw
[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar labeled 63]
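For reference, EDP improvement combines the speedup and energy results multiplicatively (this is the standard definition; the symbols below are ours, not from the talk):

  \mathrm{EDP} = E \cdot T, \qquad
  \text{EDP improvement} = \frac{E_{1T} \, T_{1T}}{E \, T}
  = \underbrace{\frac{T_{1T}}{T}}_{\text{speedup}} \times
    \underbrace{\frac{E_{1T}}{E}}_{\text{energy reduction}}

For example, a configuration with a 10X speedup and 5X lower energy would improve EDP by 50X, which is why the EDP gains here are much larger than either metric alone.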
Energy Delay Product (EDP) Improvement
• CMP adds
  • 4.0X to 4.3X over base
  • 2.5X to 4.6X over 1T+SV
[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars labeled 63, 126, and 266]
Energy Delay Product (EDP) Improvement
• SMT support adds
  • 1.1X to 1.9X over CMP (+SV)
  • 0.9X to 1.2X over CMP (+S)
• ALP achieves
  • 5X to 361X over base
  • All forms of parallelism essential
[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; bars labeled 63, 126, 159, 266, and 361]
Analysis: Vector vs. SIMDsw (Recap)
• Performance due to 3 primary enhancements
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
• Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block prefetching
  • No L1 or SIMD register file pollution
  • Low latency/energy IVR access and no alignment overhead
Number of Retired Instructions/Operations
• Operations reduced by eliminating overhead
• Instructions reduced by less overhead and packing of operations
[Chart: % retired instructions and operations normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Vector vs. SIMDsw – Retirement Stall Distribution
• SIMD memory stalls replaced by fewer vector memory stalls
• Streaming in FaceRec eliminates most memory stalls
[Chart: retirement stall distribution for MPGenc, MPGdec, SpeechRec, FaceRec]
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Comparison with Other Architectures
• Several interesting architectures
  • Imagine, RAW, VIRAM, Tarantula, TRIPS …
• Most do not report performance for full media apps
  • Detailed modeling/programming difficult
• Imagine gives a frames-per-second number for the MPEG 2 encoder
  => Compare with Imagine
Comparison With Imagine
• MPEG 2 encoding on Imagine
  • 138 fps at 360x288 resolution at 200MHz
• Does not include
  • B frame encoding
    • At least 30% more work than a P frame, twice that of an I frame
    • 2/3 of all frames are B frames
  • Huffman VLC
    • Only 5% on a single thread
    • Up to 35% when other parts are parallelized/vectorized
  • Half-pixel motion estimation
    • Adds 30% to the execution time
• ALP achieves 79 fps with everything included @ the same frequency
• Hard to make a fair energy comparison
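A back-of-the-envelope illustration of why the raw fps numbers are not directly comparable (our arithmetic, using only the overheads quoted above): half-pixel motion estimation alone adds 30% to execution time, so

  \text{fps}_{\text{Imagine + half-pel ME}} \approx \frac{138}{1.30} \approx 106~\text{fps}

before accounting for B frame encoding and Huffman VLC, each of which would reduce the figure further toward ALP's all-inclusive 79 fps.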
Summary
• Complex media apps need all levels of parallelism
  • Supporting all levels of parallelism is essential
  • No single type of parallelism gives the best performance/energy
  • CMP/SMT processors with evolutionary DLP support are effective
• ALP supports DLP efficiently
  • Benefits of vectors and streams at low cost
  • Decoupling of vector data from instructions
  • Adaptable L1 cache as a vector register file
• Evolutionary hardware and familiar programming model
• Overall, speedups of 5X to 49X, EDP gains of 5X to 361X
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Future Work
• Eliminating the drawbacks of SIMDsw loops
• Benefits of ILP
• Memory system enhancements
• Scalability of ALP
• More applications
• Adaptation for energy
Eliminating the Drawbacks of SIMDsw Loops
• Drawbacks
  • Use more dynamic instructions
  • Require static/dynamic unrolling for multi-cycle exec units
• Solution
  • Loop repetition with in-order issue
  • No renaming
  • SIMDsw registers volatile across SIMDsw code blocks
  • Automatic cur_elem_pointer increment