Enhancing Performance with SIMD: A Deep Dive into Vector Processing

Learn about SIMD processors, SSE technology, and how they boost multimedia applications. Explore vector processing, SIMD architecture, and C++ intrinsics. See real-world examples and the impact on performance.

Presentation Transcript


  1. Lecture 16: SSE, vector processing, SIMD, multimedia extensions

  2. Improving performance with SSE
     • We’ve seen how we can apply multithreading to speed up the cardiac simulator
     • But there is another kind of parallelism available to us: SSE
     Scott B. Baden / CSE 160 / Wi '16

  3. Hardware Control Mechanisms
     Flynn’s classification (1966): how do the processors issue instructions?
     • SIMD: Single Instruction, Multiple Data. Execute a global instruction stream in lock-step
     • MIMD: Multiple Instruction, Multiple Data. Clusters and servers; processors execute instruction streams independently
     [Figure: SIMD shown as one control unit driving several PEs over an interconnect; MIMD shown as several PE + CU pairs joined by an interconnect]

  4. SIMD (Single Instruction Multiple Data)
     • Operate on regular arrays of data
     • Two landmark SIMD designs
         • ILLIAC IV (1960s)
         • Connection Machine 1 and 2 (1980s)
     • Vector computer: Cray-1 (1976)
     • Intel and others support SIMD for multimedia and graphics
         • SSE (Streaming SIMD Extensions), AltiVec
         • Operations defined on vectors
     • GPUs, Cell Broadband Engine (Sony PlayStation)
     • Reduced performance on data dependent or irregular computations
     Regular computation (element-wise product):
         forall i = 0:N-1
             p[i] = a[i] * b[i]
     [Figure: [1 4 2 6] = [1 2 2 3] * [1 2 1 2], multiplied element by element]
     Irregular (indexed) access:
         forall i = 0:n-1
             x[i] = y[i] + z[K[i]]
         endforall
     Data dependent computation:
         forall i = 0:n-1
             if (x[i] < 0) then
                 y[i] = x[i]
             else
                 y[i] = x[i]
             endif
         endforall
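The indexed loop x[i] = y[i] + z[K[i]] shows why irregular access costs performance. SSE2 has no gather instruction, so the indexed operand has to be assembled from scalar loads before the vector add. The following is a minimal sketch of that idea, not code from the lecture. The function name add_indexed is hypothetical, and it assumes n is even and every K[i] is a valid index into z.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slides): x[i] = y[i] + z[K[i]].
// Assumes n is even and every K[i] is a valid index into z.
void add_indexed(double* x, const double* y, const double* z,
                 const int* K, std::size_t n) {
    assert(n % 2 == 0);
    for (std::size_t i = 0; i < n; i += 2) {
        // The regular operand is one contiguous (unaligned) vector load.
        __m128d vy = _mm_loadu_pd(&y[i]);
        // The indexed operand needs two scalar loads.
        // _mm_set_pd takes its arguments as (high lane, low lane).
        __m128d vz = _mm_set_pd(z[K[i + 1]], z[K[i]]);
        _mm_storeu_pd(&x[i], _mm_add_pd(vy, vz));
    }
}
```

Only the add is vectorized here; the two scalar loads per iteration are the overhead that the "reduced performance" bullet refers to.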

  5. Are SIMD processors general purpose?
     A. Yes
     B. No

  7. What kind of parallelism does multithreading provide?
     A. MIMD
     B. SIMD

  9. Streaming SIMD Extensions
     • SIMD instruction set on short vectors
     • SSE: SSE3 on Bang, but most will need only SSE2
     • See https://goo.gl/DIokKj and https://software.intel.com/sites/landingpage/IntrinsicsGuide
     • Bang: 8 x 128-bit vector registers (newer CPUs have 16)
     • A 128-bit register holds 2 doubles or 4 floats, ints, etc.
         for i = 0:N-1 { p[i] = a[i] * b[i]; }
     [Figure: p = a * b computed on four lanes at once: a = [1 2 2 3], b = [1 2 1 2], p = [1 4 2 6]]
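The element-wise product p[i] = a[i] * b[i] maps directly onto packed single-precision SSE instructions, four floats per 128-bit register. The sketch below is illustrative only. The function name multiply_ps is hypothetical, it assumes n is a multiple of 4, and it uses unaligned loads and stores so the arrays need no special alignment.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slides): p[i] = a[i] * b[i].
// Assumes n is a multiple of 4, the number of floats in a 128-bit register.
void multiply_ps(float* p, const float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);           // load 4 floats from a
        __m128 vb = _mm_loadu_ps(&b[i]);           // load 4 floats from b
        _mm_storeu_ps(&p[i], _mm_mul_ps(va, vb));  // 4 products in one instruction
    }
}
```

With a = {1, 2, 2, 3} and b = {1, 2, 1, 2}, the call fills p with {1, 4, 2, 6}.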

  10. SSE Architectural support
      • SSE2, SSE3, SSE4, AVX
      • Vector operations on short vectors: add, subtract, 128-bit load/store
      • SSE2+: 16 XMM registers (128 bits)
      • These are in addition to the conventional registers and are treated specially
      • Vector operations on short vectors: add, subtract, shuffling (handles conditionals)
      • Data transfer: load/store
      • See the Intel intrinsics guide: software.intel.com/sites/landingpage/IntrinsicsGuide
      • May need to invoke compiler options depending on level of optimization
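A conditional inside a loop body can be expressed without branches by computing both outcomes and selecting per lane. The sketch below uses a compare mask with bitwise and/andnot/or, which is one common SSE2 approach and is a swapped-in technique rather than the shuffle-based handling the slide mentions. The function name replace_negatives and the conditional it implements (y[i] = x[i] < 0 ? fill : x[i]) are hypothetical and chosen only for illustration; n is assumed even.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slides):
//   y[i] = (x[i] < 0) ? fill : x[i]
// Assumes n is even.
void replace_negatives(double* y, const double* x, double fill, std::size_t n) {
    assert(n % 2 == 0);
    const __m128d zero  = _mm_setzero_pd();
    const __m128d vfill = _mm_set1_pd(fill);
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(&x[i]);
        // Each lane of mask is all ones where x < 0 and all zeros elsewhere.
        __m128d mask = _mm_cmplt_pd(vx, zero);
        __m128d then_part = _mm_and_pd(mask, vfill);  // fill where x < 0
        __m128d else_part = _mm_andnot_pd(mask, vx);  // (~mask) & x elsewhere
        _mm_storeu_pd(&y[i], _mm_or_pd(then_part, else_part));
    }
}
```

Both branches are evaluated for every lane, so this only pays off when each branch is cheap.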

  11. C++ intrinsics
      • C++ functions and datatypes that map directly onto 1 or more machine instructions
      • Supported by all major compilers
      • The interface provides 128-bit data types and operations on those datatypes
          • __m128 (float)
          • __m128d (double)
      • Data movement and initialization
          • _mm_load_pd (aligned load)
          • _mm_store_pd
          • _mm_loadu_pd (unaligned load)
      • Data may need to be aligned
          __m128d vec1, vec2, vec3;
          for (i=0; i<N; i+=2) {
              vec1 = _mm_load_pd(&b[i]);
              vec2 = _mm_load_pd(&c[i]);
              vec3 = _mm_div_pd(vec1, vec2);
              vec3 = _mm_sqrt_pd(vec3);
              _mm_store_pd(&a[i], vec3);
          }
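The alignment requirement is concrete: _mm_load_pd and _mm_store_pd expect a 16-byte aligned address, while _mm_loadu_pd accepts any address. A minimal sketch of the distinction follows. The helper names is_aligned16 and sum_pair are hypothetical, and checking alignment at run time is shown only to make the rule visible; real code would usually guarantee alignment when allocating.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary,
// which is what _mm_load_pd and _mm_store_pd require.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: adds the two doubles starting at p,
// picking the aligned or unaligned load based on the address.
double sum_pair(const double* p) {
    __m128d v = is_aligned16(p) ? _mm_load_pd(p)    // aligned load
                                : _mm_loadu_pd(p);  // unaligned load
    double lanes[2];
    _mm_storeu_pd(lanes, v);
    return lanes[0] + lanes[1];
}
```

Declaring an array as alignas(16) double buf[4] makes &buf[0] and &buf[2] valid for the aligned forms, while &buf[1] is only 8-byte aligned and needs _mm_loadu_pd.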

  12. How do we vectorize?
      • Original code
          double a[N], b[N], c[N];
          for (i=0; i<N; i++) {
              a[i] = sqrt(b[i] / c[i]);
          }
      • Identify vector operations, reduce loop bound
          for (i = 0; i < N; i+=2)
              a[i:i+1] = vec_sqrt(b[i:i+1] / c[i:i+1]);
      • The vector instructions
          __m128d vec1, vec2, vec3;
          for (i=0; i<N; i+=2) {
              vec1 = _mm_load_pd(&b[i]);
              vec2 = _mm_load_pd(&c[i]);
              vec3 = _mm_div_pd(vec1, vec2);
              vec3 = _mm_sqrt_pd(vec3);
              _mm_store_pd(&a[i], vec3);
          }
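The vectorized loop steps by 2, so as written it relies on N being even. One way to handle any N, sketched here under assumptions, is to run the vector loop over full pairs and finish with a scalar cleanup loop. The function name sqrt_ratio is hypothetical, and unaligned loads and stores are used so the sketch makes no alignment assumption, unlike the slide's _mm_load_pd version.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the slides): a[i] = sqrt(b[i] / c[i])
// for any n, with a scalar loop for the leftover element when n is odd.
void sqrt_ratio(double* a, const double* b, const double* c, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    std::size_t i = 0;
    // Vector part: two doubles per iteration while a full pair remains.
    for (; i + 2 <= n; i += 2) {
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_loadu_pd(&c[i]);
        _mm_storeu_pd(&a[i], _mm_sqrt_pd(_mm_div_pd(vb, vc)));
    }
    // Scalar cleanup: at most one element.
    for (; i < n; ++i) {
        a[i] = std::sqrt(b[i] / c[i]);
    }
}
```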

  13. Performance
      • Without SSE vectorization: 0.777 sec.
      • With SSE vectorization: 0.454 sec.
      • Speedup due to vectorization: x1.7
      • $PUB/Examples/SSE/Vec
      Vectorized version:
          double *a, *b, *c;
          __m128d vec1, vec2, vec3;
          for (i=0; i<N; i+=2) {
              vec1 = _mm_load_pd(&b[i]);
              vec2 = _mm_load_pd(&c[i]);
              vec3 = _mm_div_pd(vec1, vec2);
              vec3 = _mm_sqrt_pd(vec3);
              _mm_store_pd(&a[i], vec3);
          }
      Scalar version:
          double *a, *b, *c;
          for (i=0; i<N; i++) {
              a[i] = sqrt(b[i] / c[i]);
          }

  14. The assembler code
          double *a, *b, *c;
          __m128d vec1, vec2, vec3;
          for (i=0; i<N; i+=2) {
              vec1 = _mm_load_pd(&b[i]);
              vec2 = _mm_load_pd(&c[i]);
              vec3 = _mm_div_pd(vec1, vec2);
              vec3 = _mm_sqrt_pd(vec3);
              _mm_store_pd(&a[i], vec3);
          }

          double *a, *b, *c;
          for (i=0; i<N; i++) {
              a[i] = sqrt(b[i] / c[i]);
          }

          .L12:
              movsd   xmm0, QWORD PTR [r12+rbx]
              divsd   xmm0, QWORD PTR [r13+0+rbx]
              sqrtsd  xmm1, xmm0
              ucomisd xmm1, xmm1          // checks for illegal sqrt
              jp      .L30
              movsd   QWORD PTR [rbp+0+rbx], xmm1
              add     rbx, 8              # ivtmp.135
              cmp     rbx, 16384
              jne     .L12

  15. What prevents vectorization
      • Interrupted flow out of the loop
          for (i=0; i<n; i++) {
              a[i] = b[i] + c[i];
              maxval = (a[i] > maxval ? a[i] : maxval);
              if (maxval > 1000.0) break;
          }
        Loop not vectorized/parallelized: multiple exits
      • This loop will vectorize
          for (i=0; i<n; i++) {
              a[i] = b[i] + c[i];
              maxval = (a[i] > maxval ? a[i] : maxval);
          }
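Once the early exit is gone, the loop body is a vector add plus a max reduction, and both have direct SSE2 counterparts. The sketch below is a hand-written illustration of what a vectorized form of the second loop could look like, not the compiler's actual output. The function name add_and_max is hypothetical and n is assumed even. Note that _mm_max_pd treats NaN inputs differently from the scalar ternary, so the two versions agree only on NaN-free data.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slides): a[i] = b[i] + c[i], and
// returns the larger of maxval and the largest a[i]. Assumes n is even.
double add_and_max(double* a, const double* b, const double* c,
                   std::size_t n, double maxval) {
    assert(n % 2 == 0);
    __m128d vmax = _mm_set1_pd(maxval);  // running max kept in both lanes
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_add_pd(_mm_loadu_pd(&b[i]), _mm_loadu_pd(&c[i]));
        _mm_storeu_pd(&a[i], va);
        vmax = _mm_max_pd(vmax, va);     // lane-wise max, no branch
    }
    // Horizontal step: combine the two lanes into one scalar.
    double lanes[2];
    _mm_storeu_pd(lanes, vmax);
    return lanes[0] > lanes[1] ? lanes[0] : lanes[1];
}
```

If the threshold test is still needed, it can be applied once to the returned value after the loop, at the cost of no longer stopping early.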

  16. SSE2 Cheat sheet (load and store)
      xmm: one operand is a 128-bit SSE2 register
      mem/xmm: other operand is in memory or an SSE2 register
      {SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
      {PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
      {SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
      {PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
      {A} the 128-bit operand is aligned in memory
      {U} the 128-bit operand is unaligned in memory
      {H} move the high half of the 128-bit operand
      {L} move the low half of the 128-bit operand
      Krste Asanovic & Randy H. Katz
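The suffixes in the cheat sheet show up in the intrinsic names as well as in the instruction mnemonics. The sketch below exercises a few of the double-precision load and store forms; the two function names are hypothetical and exist only to give the intrinsics something to do.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Hypothetical helper: builds a vector from two doubles with the {L} and {H}
// half moves, then writes the halves back in swapped order.
void swap_halves(double* dst, const double* src) {
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, &src[0]);  // {L}: MOVLPD, low half  <- src[0]
    v = _mm_loadh_pd(v, &src[1]);  // {H}: MOVHPD, high half <- src[1]
    _mm_storeh_pd(&dst[0], v);     // high half -> dst[0]
    _mm_storel_pd(&dst[1], v);     // low half  -> dst[1]
}

// Hypothetical helper: a {SD} scalar load brings in one double and zeroes
// the upper half, so summing both lanes returns just that double.
double load_scalar_sum(const double* p) {
    __m128d v = _mm_load_sd(p);    // {SD}: MOVSD, one 64-bit operand
    double lanes[2];
    _mm_storeu_pd(lanes, v);       // {U}{PD}: MOVUPD, unaligned packed store
    return lanes[0] + lanes[1];
}
```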
