Efficient Support for All Levels of Parallelism for Complex Media Applications
Ruchira Sasanka
Ph.D. Preliminary Exam
Thesis Advisor: Sarita Adve
Department of Computer Science, University of Illinois at Urbana-Champaign
Motivation
• Complex multimedia applications are critical workloads
  • Demand high performance
  • Demand high energy efficiency
• Media apps are important for general-purpose processors (GPPs)
  • Increasingly run on general-purpose processors
  • Multiple specifications/standards demand high programmability
• How can GPPs support multimedia apps efficiently?
Motivation
• Is support for Data Level Parallelism (DLP) sufficient?
  • VIRAM, Imagine
  • Assume large amounts of similar (regular) work
  • Look at media kernels
• Are full applications highly regular?
  • Similarity/amount limited by decisions (control)
  • Multiple specifications and intelligent algorithms
  • Varying amounts of DLP – sub-word, vector, stream
  • Instruction/Data/Thread-level parallelism (ILP/DLP/TLP)
• Need to support all levels of parallelism efficiently
Philosophy: Five Guidelines
• Supporting all levels of parallelism
  • ILP, TLP, sub-word SIMD (SIMDsw), vectors, streams
• Familiar programming model
  • Increases portability and facilitates widespread adoption
• Evolutionary hardware
• Adaptive, partitioned resources for efficiency
• Limited dedicated systems (seamless integration)
  • Degree/type of parallelism varies with application
Contributions of Thesis
• Analyze complex media apps
  • Make the case that media apps need all levels of parallelism
• ALP: Support for All Levels of Parallelism
  • Based on a contemporary CMP/SMT processor
  • Careful partitioning for energy efficiency
• Novel DLP support
  • Supports vectors and streams
  • Seamless integration with CMP/SMT core (no dedicated units)
  • Low hardware/programming overhead
• Speedups of 5X to 49X, EDP improvement of 5X to 361X
  • With thread + SIMDsw + novel DLP support, over a 4-wide OOO core
Other Contributions
• CMP/SMT/Hybrid energy comparison (ICS’04)
  • Evaluates the most energy-efficient TLP support for media apps on GPPs
  • Mathematical model to predict energy efficiency
  • Found a hybrid CMP/SMT to be an attractive solution
• Hardware adaptation for energy for media apps (ASPLOS’02)
  • Control algorithms for hardware adaptation
  • MS thesis, jointly done with Chris Hughes
• Not discussed here to focus on current work
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Supporting TLP
• 4-core CMP
  • 2 SMT threads for each core
  • Total 8 threads
  • Based on our TLP study
• Shared L2 cache
[Figure: four cores (Core 0–Core 3) sharing an L2 cache]
Supporting ILP (and SMT)
• Partitioned for energy/latency and SMT support
• Sub-word DLP support with SIMDsw units
[Figure: two-core pipeline diagram showing fetch/decode, rename (REN), branch predictors, integer and FP/SIMD issue queues and register files, load/store queues, ROBs, integer ALUs, 2 SIMD ALUs and 2 SIMD FPUs per partition, iTLB/dTLB, banked L1 I- and D-caches (2 banks x 4 ways each), and L2 sub-banks; key: Thread 0, Thread 1, Shared]
Supporting DLP
• 128-bit SIMDsw operations
  • Like SSE/AltiVec/VIS
• Novel DLP support
  • Supports vectors/streams of SIMDsw elements
  • 2-dimensional DLP
  • No vector execution unit!
  • No vector masks!!
  • No vector register file!!!
Novel DLP Support
• Vector data and computation decoupled
  • Vector data provides locality and regularity
  • Vector instructions encode multiple operations
• Vector registers morphed out of L1 cache lines
  • Indexed Vector Registers (IVR)
• Computation uses SIMDsw units
Indexed Vector Registers (IVR)
• IVR allocated in cache ways close to SIMDsw units
• Each IVR bank partitioned and allocated only on demand
[Figure: same two-core pipeline diagram, with the L1 D-cache ways nearest the SIMDsw units (Banks 0–1, Ways 0–1) reallocated as IVR storage; key: Thread 0, Thread 1, Shared]
Indexed Vector Registers (IVR)
• No logic in the critical path; in fact, faster than cache access
[Figure: per-register vector descriptors with fields Start Element, Current Element Pointer, Available Elements, and Length for V0 and V1, mapping vector elements onto 128b cache lines in Bank 0, Way 0 and Bank 0, Way 1]
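To make the descriptor idea concrete, here is a minimal C++ sketch of how a descriptor can locate an element with pure indexing, no associative lookup; all names, field widths, and the line-interleaving scheme are illustrative assumptions, not ALP's actual implementation.

  #include <cstdint>
  #include <cstddef>

  constexpr size_t kLineBytes = 32;   // 32B L1 line (per the parameters slide)
  constexpr size_t kElemBytes = 16;   // 128-bit SIMDsw element
  constexpr size_t kElemsPerLine = kLineBytes / kElemBytes;

  // One descriptor per architectural vector register.
  struct VectorDescriptor {
      uint16_t start_element;   // first element's slot in the IVR
      uint16_t cur_element;     // next element a SIMDsw op will touch
      uint16_t avail_elements;  // elements already loaded/produced
      uint16_t length;          // total elements in the vector
  };

  struct IvrSlot { size_t line; size_t offset; };

  // Map "element i of register v" to a (line, byte offset) pair inside
  // the IVR-allocated cache ways: a shift and a mask, which is why IVR
  // access can be faster than a normal tag-checked cache hit.
  IvrSlot locate(const VectorDescriptor& v, size_t i) {
      size_t slot = v.start_element + i;
      return { slot / kElemsPerLine, (slot % kElemsPerLine) * kElemBytes };
  }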
Novel DLP Support
• Vector data and computation decoupled
  • Vector registers morphed out of L1 cache lines: Indexed Vector Registers (IVR)
  • Computation uses SIMDsw units
• Vector instructions for memory
  • Vector memory instructions serviced at L2
• SIMDsw instructions for computation
  • SIMDsw instructions can access vector registers (IVR)
Vector vs. SIMDsw Loop
• V2 = k * (V0 + V1)

Vector:
  vector_load V0
  vector_load V1
  vector_add V0, V1 V3                      // V3 temporary vector register
  vector_mult V3, simd_reg1 V2              // k is in simd_reg1
  vector_store V2 …                         // V2 = k * (V0 + V1)

ALP:
  vector_load V0; vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0               // adds next elements of V0 & V1
    simd_mul simd_reg0, simd_reg1 V2        // write product into next elem of V2
    cur_elem_pointer_increment V0, V1, V2   // advance pointers to next element
  vector_store V2 …
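The contrast has a direct scalar analogue. The C++ sketch below (purely illustrative; the element type, N, and all names are assumptions, not ALP syntax) shows why the vector style needs a whole temporary vector while the SIMDsw-loop style keeps intermediate results in registers.

  #include <array>
  #include <cstddef>

  constexpr size_t N = 1024;
  using Vec = std::array<float, N>;

  // "Vector" style: whole-vector operations require a temporary V3.
  void vector_style(const Vec& v0, const Vec& v1, Vec& v2, float k) {
      Vec v3;                                    // temporary vector register
      for (size_t i = 0; i < N; ++i) v3[i] = v0[i] + v1[i];
      for (size_t i = 0; i < N; ++i) v2[i] = k * v3[i];
  }

  // ALP SIMDsw-loop style: each iteration consumes the next elements of
  // V0/V1 and produces the next element of V2; the add result stays in a
  // (SIMDsw) register, so no temporary vector is ever allocated.
  void alp_style(const Vec& v0, const Vec& v1, Vec& v2, float k) {
      for (size_t i = 0; i < N; ++i) {
          float r0 = v0[i] + v1[i];              // simd_add -> simd_reg0
          v2[i] = r0 * k;                        // simd_mul -> next elem of V2
      }                                          // cur_elem_pointer_increment
  }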
SIMDsw Loops in Action
  vector_load V0
  vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0
    simd_mul simd_reg0, simd_reg1 V2
    cur_elem_pointer_increment V0, V1, V2
  vector_store V2 …
[Figure: descriptor state (Start Element, Current Element Pointer, Available Elements, Length) for V0, V1, and V2, their contents in IVR Bank 0, Ways 0–1, and the rename stage (REN) expanding successive loop iterations into indexed operations (simd_add V0[0], V1[0] simd_reg0; simd_mul simd_reg0, simd_reg1 V2[8]; simd_add V0[1], V1[1] simd_reg0'; simd_mul simd_reg0', simd_reg1 V2[9]), which then wait in the issue queue (IQ)]
Why SIMDsw Loops?
• Lower peak memory BW and better memory latency tolerance
  • Computation interspersed with memory operand use

Vector:
  vector_load V0
  vector_load V1
  vector_add V0, V1 V3                      // V3 temporary vector register
  vector_mult V3, simd_reg1 V2              // k is in simd_reg1
  vector_store V2 …                         // V2 = k * (V0 + V1)

ALP:
  vector_load V0; vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0               // adds next elements of V0 & V1
    simd_mul simd_reg0, simd_reg1 V2        // write product into next elem of V2
    cur_elem_pointer_increment V0, V1, V2   // advance pointers to next element
  vector_store V2 …
Why SIMDsw Loops?
• Fewer vector register ports and no temporary vector regs
  • Lower power, fewer L1 invalidations (vector allocations)

Vector:
  vector_sub V0, V1 V2               // V2 = (V0 - V1)
  vector_mul V2, V2 V3               // V3 = (V0 - V1) * (V0 - V1)
  vector_add V3, reg0 => reg0        // sum += (V0 - V1) * (V0 - V1)

ALP:
  do for all elements in vector V0 and V1
    simd_sub V0, V1 simd_reg0
    simd_mul simd_reg0, simd_reg0 simd_reg1
    simd_add simd_reg1, simd_reg2 simd_reg2
Why SIMDsw Loops?
• Lower peak memory BW and better memory latency tolerance
  • Computation interspersed with memory operand use
• Fewer vector register ports and no temporary vector regs
  • Lower power, fewer L1 invalidations
• No vector mask registers
• Compatibility with streams
• Easy utilization of heterogeneous execution units
Drawbacks of SIMDsw Loops
• Use more dynamic instructions
  • Higher front-end energy
  • Higher front-end width
• Require static/dynamic unrolling for multi-cycle exec units (see the sketch below)
  • Vectors contain independent elements
  • Static unrolling increases code size
  • Dynamic unrolling limited by issue queue size
• How to retain the advantages and get rid of the drawbacks? => Future work
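To make the unrolling drawback concrete, here is a minimal C++ sketch (the 2x factor, names, and element type are assumptions for illustration): since consecutive vector elements are independent, the loop body can be replicated so a multi-cycle pipelined unit has several iterations in flight, but each static copy grows the code, and each dynamic copy occupies issue-queue entries.

  #include <cstddef>

  // Illustrative static 2x unrolling of a SIMDsw-style loop body so two
  // independent multiplies can overlap in a multi-cycle pipelined unit.
  void scale(const float* in, float* out, float k, size_t n) {
      size_t i = 0;
      for (; i + 2 <= n; i += 2) {   // unrolled: two independent iterations
          float a = in[i]     * k;   // starts in cycle t
          float b = in[i + 1] * k;   // starts in cycle t+1, overlaps with a
          out[i]     = a;
          out[i + 1] = b;
      }
      for (; i < n; ++i) out[i] = in[i] * k;   // remainder loop
  }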
Supporting DLP – Streams
• A stream is a long vector
• Processed using SIMDsw loops similar to vectors, except
  • Only a limited # of records allocated in IVR at a time
  • Stream load/allocate (analogous to vector load/alloc)
• When a cur_elem_pointer_increment retires
  • Head element is discarded from IVR (or written to memory)
  • New element appended to IVR
Streams – Example
  stream_load addr:stride V0
  stream_load addr:stride V1
  stream_alloc addr:stride V2        // just allocate some IVR entries
  do for all elements in stream
    simd_add V0, V1 V2
    cur_elem_pointer_inc V0, V1, V2  // optional: can assume auto increment
  stream_store V2                    // flushes the remainder of V2
[Figure: IVR windows for streams V0 and V1 holding elements N through N+7; as the loop retires, elements N and N+1 are discarded and N+8 and N+9 are appended]
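A minimal C++ sketch of the sliding-window semantics, modeled here as a ring buffer; the class, names, and sizes are illustrative assumptions, not ALP hardware state.

  #include <cstdint>
  #include <cstddef>
  #include <vector>

  // Model of a stream's IVR window: only `capacity` elements live in the
  // IVR at once; each retiring cur_elem_pointer_increment discards the
  // head element and appends the next one from memory.
  struct StreamWindow {
      std::vector<uint32_t> slots;  // IVR entries allocated to this stream
      size_t head = 0;              // oldest live element
      size_t next_addr = 0;         // next element index to fetch

      explicit StreamWindow(size_t capacity) : slots(capacity) {}

      void fill(const uint32_t* mem) {       // stream_load: prime the window
          for (auto& s : slots) s = mem[next_addr++];
      }
      uint32_t current() const { return slots[head]; }
      void advance(const uint32_t* mem) {    // on retire: discard head,
          slots[head] = mem[next_addr++];    // append the new element
          head = (head + 1) % slots.size();
      }
  };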
ALP’s Vector Support vs. SIMDsw (e.g., MMX)
• ALP’s vector support is different from conventional vector support
  • Uses SIMDsw loops and IVR
• How does ALP’s vector support differ from SIMDsw (MMX)?
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
• What do these really buy?
ALP’s Vector Support vs. SIMDsw (e.g., MMX)
• V2 = k * (V0 + V1)

ALP:
  vector_load V0
  vector_load V1
  vector_alloc V2
  do for all elements in vector V0 and V1
    simd_add V0, V1 simd_reg0          // adds next W elements of V0 & V1
    simd_mul simd_reg0, simd_reg1 V2   // write product into next W elems of V2
  vector_store V2 …

SIMDsw (MMX):
  do for all elements
    simd_load … simd_reg2
    simd_load … simd_reg3
    simd_add simd_reg2, simd_reg3 simd_reg0
    simd_mul simd_reg0, simd_reg1 simd_reg4
    simd_store … simd_reg4
  …

• More instructions
• More SIMDsw register pressure
• Loads are evil (LdStQ/TLB/cache tag accesses; can miss)
• Indexed stores are really 2 instructions
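For reference, the SIMDsw (MMX-style) pattern looks like the following in C++ with SSE2 intrinsics. This is a sketch of the conventional model ALP is contrasted with, not ALP code; the 16-bit element type and function name are assumptions.

  #include <emmintrin.h>   // SSE2 intrinsics
  #include <cstdint>
  #include <cstddef>

  // Conventional SIMDsw loop for V2 = k * (V0 + V1) over 16-bit elements:
  // every iteration issues two explicit loads and a store, which is the
  // per-element overhead the ALP version avoids.
  void simd_sw(const int16_t* v0, const int16_t* v1, int16_t* v2,
               int16_t k, size_t n) {          // assumes n is a multiple of 8
      __m128i vk = _mm_set1_epi16(k);          // k broadcast (simd_reg1)
      for (size_t i = 0; i < n; i += 8) {
          __m128i a = _mm_loadu_si128((const __m128i*)(v0 + i)); // simd_load
          __m128i b = _mm_loadu_si128((const __m128i*)(v1 + i)); // simd_load
          __m128i s = _mm_add_epi16(a, b);                       // simd_add
          __m128i p = _mm_mullo_epi16(s, vk);                    // simd_mul
          _mm_storeu_si128((__m128i*)(v2 + i), p);               // simd_store
      }
  }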
ALP’s Vector Support vs. SIMDsw (e.g., MMX)
• Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block prefetching
  • No L1 or SIMDsw register file pollution
  • Low latency/energy IVR access and no alignment overhead
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
ALP Architecture Parameters (Freq: 500 MHz)
[Figure: the two-core pipeline/IVR diagram annotated with the parameters below; key: Thread 0, Thread 1, Shared]
• L1 D-cache: write-through, 2 banks, 16K per bank, 4 ways per bank, 2 ports per bank, 32B line size; Indexed Vector Registers (IVR) allocated in the ways closest to the SIMD units
• L1 I-cache: 2 banks, 8K per bank, 4 ways per bank, 1 port per bank, 32B line size
• L2 unified cache: write-back, 4 banks, 4 sub-banks per bank, 64K per sub-bank, 4 ways per sub-bank, 1 port per sub-bank, 64B line size
• SIMD FPUs/ALUs: 2 per partition, 128-bit
• Int units: 64-bit, 2 per partition
• Int register file: 2 partitions, 32 regs per partition, 4R/3W per partition
• FP/SIMD register file: 2 partitions, 16 regs per partition, 128-bit, 4R/3W per partition
• Int issue queues: 2 partitions, 2 banks per partition, 12 entries per bank, 2R/2W per bank, tag 4R/2W per bank, 2 issues per bank
• FP/SIMD issue queue: 2 partitions, 16 entries per partition, 2R/2W per partition, tag 4R/2W per partition, 2 issues per partition
• Load/store queue: 2 partitions, 16 entries per partition, 2R/2W per partition, 2 issues per partition
• Reorder buffer: 2 partitions, 2 banks per partition, 32 entries per bank, 2R/2W per bank, 2 retires per bank
• Branch predictor: G-select, 2 partitions, 2K per partition
• Rename table: 2 partitions, 4 wide per partition
Methodology
• Execution-driven simulator
  • Functional simulation based on RSIM
  • Entirely new timing simulator for ALP
• Applications compiled using Sun CC with full optimization
• SIMDsw/vector instructions in a separate assembly file
  • Hand-written assembly instructions (like MMX)
  • Used only in a small number of heavily used routines
  • Simulated through hooks placed in the application binary
• Dynamic power modeled using WATTCH
• Static power with HotLeakage coming soon …
Multi-threaded Full Applications
• MPEG 2 encoding
  • Each thread encodes part of a frame (see the sketch below)
  • DLP instructions in DCT, SAD, IDCT, QUANT, IQUANT
• MPEG 2 decoding
  • Each thread decodes part of a frame
  • DLP instructions in IDCT, CompPred
• Ray tracing (Tachyon)
  • Each thread processes an independent ray; no DLP
• Speech recognition (Sphinx 3)
  • Each thread evaluates a Gaussian scoring model
  • DLP instructions in Gaussian vector operations
• Face recognition (CSU)
  • Threaded matrix multiplication and distance evaluation
  • Streams used for matrix manipulations
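A hypothetical C++ sketch of the per-frame decomposition described for MPEG 2: each thread works on an independent slice of macroblock rows. All types and function names are illustrative assumptions, not the benchmark's actual code.

  #include <thread>
  #include <vector>
  #include <functional>
  #include <algorithm>
  #include <cstddef>

  struct Frame { int mb_rows; /* ... pixel data ... */ };

  void encode_rows(Frame& f, int first_row, int last_row) {
      // DCT, SAD-based motion estimation, quantization, etc. run here on
      // macroblock rows [first_row, last_row): the DLP-heavy kernels.
  }

  void encode_frame(Frame& f, int num_threads) {
      std::vector<std::thread> workers;
      int rows_per_thread = (f.mb_rows + num_threads - 1) / num_threads;
      for (int t = 0; t < num_threads; ++t) {
          int first = t * rows_per_thread;
          int last  = std::min(f.mb_rows, first + rows_per_thread);
          workers.emplace_back(encode_rows, std::ref(f), first, last);
      }
      for (auto& w : workers) w.join();   // frame done when all slices finish
  }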
Systems Simulated
• 1-thread superscalar: 1T, 1T+S, 1T+SV
  • 1T – the base system with one thread and no SIMDsw
  • 1T+S – base + SIMDsw instructions
  • 1T+SV – base + SIMDsw/vector instructions + IVR
• 4-thread CMP: 4T, 4T+S, 4T+SV
  • Analogous to the above, but running four threads on a 4-core CMP
• 8-thread CMP+SMT: 4x2T, 4x2T+S, 4x2T+SV
  • Analogous to the first three, but running 8 threads
  • 4-core CMP, with each core running 2 SMT threads
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Speedup
• SIMDsw support adds
  • 1.5X to 6.6X over base
• Vector support adds
  • 2X to 9.7X over base
  • 1.1X to 1.9X over SIMDsw
[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Speedup
• CMP support adds
  • 3.1X to 3.9X over base
  • 2.5X to 3.8X over 1T+SV
[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar labeled 35.9]
Speedup
• SMT support adds
  • 1.14X to 1.87X over CMP (+SV)
  • 1.03X to 1.29X over CMP (+S)
• ALP achieves
  • 5.0X to 48.8X over base
  • All forms of parallelism essential
[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars labeled 35.9 and 48.8]
Energy Consumption (No DVS)
• SIMDsw savings 1.4X to 4.8X over base
• +SV savings 1.1X to 1.4X over SIMDsw
[Chart: % energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Energy Consumption (No DVS)
• CMP savings 1.09X to 1.17X (for +SV)
• 1.08X to 1.16X over base
[Chart: % energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Energy Consumption (No DVS)
• SMT increases energy by 4% (+SV) and 14% (+S)
• ALP reduces energy by up to 7.4X
[Chart: % energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Energy Delay Product (EDP) Improvement
• SIMDsw support adds (excluding RayTrace)
  • 2.3X to 30.7X over base
• Vector support adds (excluding RayTrace)
  • 4.5X to 63X over base
  • 1.3X to 2.5X over SIMDsw
[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar labeled 63]
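For reference, EDP improvement combines the speedup and energy results multiplicatively (this is the standard definition; the symbols below are ours, not from the talk):

  \mathrm{EDP} = E \cdot T, \qquad
  \text{EDP improvement} = \frac{E_{1T} \, T_{1T}}{E \, T}
  = \underbrace{\frac{T_{1T}}{T}}_{\text{speedup}} \times
    \underbrace{\frac{E_{1T}}{E}}_{\text{energy reduction}}

For example, a configuration with a 10X speedup and 5X lower energy would improve EDP by 50X, which is why the EDP gains here are much larger than either metric alone.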
Energy Delay Product (EDP) Improvement
• CMP adds
  • 4.0X to 4.3X over base
  • 2.5X to 4.6X over 1T+SV
[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars labeled 63, 126, and 266]
Energy Delay Product (EDP) Improvement
• SMT support adds
  • 1.1X to 1.9X over CMP (+SV)
  • 0.9X to 1.2X over CMP (+S)
• ALP achieves
  • 5X to 361X over base
  • All forms of parallelism essential
[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; bars labeled 63, 126, 159, 266, and 361]
Analysis: Vector vs. SIMDsw (Recap)
• Performance due to 3 primary enhancements
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
• Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block prefetching
  • No L1 or SIMD register file pollution
  • Low latency/energy IVR access and no alignment overhead
Number of Retired Instructions/Operations
• Operations reduced by eliminating overhead
• Instructions reduced by less overhead and packing of operations
[Chart: % retired instructions and operations normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Vector vs. SIMDsw – Retirement Stall Distribution
• SIMD memory stalls replaced by fewer vector memory stalls
• Streaming in FaceRec eliminates most memory stalls
[Chart: retirement stall distribution for MPGenc, MPGdec, SpeechRec, FaceRec]
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Comparison with Other Architectures
• Several interesting architectures
  • Imagine, RAW, VIRAM, Tarantula, TRIPS …
• Most do not report performance for full media apps
  • Detailed modeling/programming difficult
• Imagine gives a frames-per-second number for the MPEG 2 encoder
  => Compare with Imagine
Comparison With Imagine
• MPEG 2 encoding on Imagine
  • 138 fps at 360x288 resolution at 200MHz
• Does not include
  • B frame encoding
    • At least 30% more work than a P frame, twice that of an I frame
    • 2/3 of all frames are B frames
  • Huffman VLC
    • Only 5% on a single thread
    • Up to 35% when other parts are parallelized/vectorized
  • Half-pixel motion estimation
    • Adds 30% to the execution time
• ALP achieves 79 fps with everything included @ the same frequency
• Hard to make a fair energy comparison
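A back-of-the-envelope illustration of why the raw fps numbers are not directly comparable (our arithmetic, using only the overheads quoted above): half-pixel motion estimation alone adds 30% to execution time, so

  \text{fps}_{\text{Imagine + half-pel ME}} \approx \frac{138}{1.30} \approx 106~\text{fps}

before accounting for B frame encoding and Huffman VLC, each of which would reduce the figure further toward ALP's all-inclusive 79 fps.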
Summary
• Complex media apps need all levels of parallelism
  • Supporting all levels of parallelism is essential
  • No single type of parallelism gives the best performance/energy
  • CMP/SMT processors with evolutionary DLP support are effective
• ALP supports DLP efficiently
  • Benefits of vectors and streams at low cost
  • Decoupling of vector data from instructions
  • Adaptable L1 cache as a vector register file
• Evolutionary hardware and familiar programming model
• Overall, speedups of 5X to 49X, EDP gains of 5X to 361X
Outline • ALP: Efficient ILP, TLP and DLP for GPPs • Architecture • Support for TLP • Support for ILP • Support for DLP • Methodology • Results • Summary • Related work • Future work
Future Work
• Eliminating the drawbacks of SIMDsw loops
• Benefits of ILP
• Memory system enhancements
• Scalability of ALP
• More applications
• Adaptation for energy
Eliminating the Drawbacks of SIMDsw Loops
• Drawbacks
  • Use more dynamic instructions
  • Require static/dynamic unrolling for multi-cycle exec units
• Solution
  • Loop repetition with in-order issue
  • No renaming
  • SIMDsw registers volatile across SIMDsw code blocks
  • Automatic cur_elem_pointer increment