Optimizing compiler. Vectorization.
The trend in microprocessor development: from sequential to parallel instruction execution.
• Different kinds of parallelism:
• Pipelining
• Superscalar execution
• Vector operations
• Multi-core and multiprocessor execution
• An optimizing compiler is a tool that translates source code into an executable module and transforms it for better performance. Parallelization is a transformation of a sequential program into a multi-threaded or vectorized (or both) form, utilizing multiple execution units simultaneously without breaking the correctness of the program.
Vectorization is an example of data parallelism (SIMD). Four scalar operations
C[1] = A[1]+B[1]
C[2] = A[2]+B[2]
C[3] = A[3]+B[3]
C[4] = A[4]+B[4]
become a single vector operation: (A[1..4]) + (B[1..4]) = (C[1..4]).
[Figure: four scalar additions vs. one packed vector addition]
The approximate scheme of loop vectorization for
for(i=0;i<100;i++) p[i]=b[i];
[Figure: 100 scalar operations become a peel loop for the non-aligned starting elements, vector operations on an 8-element vector register, and a scalar tail loop — about 16 operations in total.]
A typical vector instruction is an operation on two vectors in memory or in fixed-length registers. These vectors can be loaded from memory by a single operation or by several.
• Vectorization is a compiler optimization that replaces scalar instructions with vector instructions. This optimization "packs" the data into vectors; scalar operations are replaced by operations on these vectors (packets).
• Such an optimization can also be performed manually by the developer.
• The A(l:u:s) array-section notation in Fortran, e.g. A(1:10:3), is very convenient for representing the contents of a vector register.
Scalar loop:
for(i=0; i<U; i++) {
S1: lhs1[i] = rhs1[i];
…
Sn: lhsn[i] = rhsn[i];
}
Vectorized loop (vl is the vector length):
for(i=0; i<U; i+=vl) {
S1: lhs1[i:i+vl-1:1] = rhs1[i:i+vl-1:1];
…
Sn: lhsn[i:i+vl-1:1] = rhsn[i:i+vl-1:1];
}
MMX, SSE vector instruction sets
MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1996 for the P5-based Pentium line of microprocessors, marketed as "Pentium with MMX Technology". MMX (Multimedia Extensions) is a set of instructions that perform specific actions for streaming audio/video encoding and decoding.
MMX provides:
• MM0-MM7, eight 64-bit registers (aliased to the existing 80-bit x87 FP stack registers)
• the concept of packed data (each register can hold one 64-bit integer, two 32-bit, four 16-bit or eight 8-bit integers)
• 57 instructions, divided into groups: data movement, arithmetic, comparison, conversion, logical, shift, shuffle, unpack, cache control and prefetch, state management.
MMX provides only integer operations. Because of the register aliasing, simultaneous operation on floats and packed integers is not possible.
Streaming SIMD Extensions (SSE) is a SIMD instruction-set extension of the x86 architecture, designed by Intel and introduced with the Pentium III series of processors in 1999. SIMD instructions can greatly increase performance when exactly the same operation is performed on multiple data objects. Typical applications are digital signal processing and computer graphics.
SSE provides:
• 8 128-bit registers (xmm0 to xmm7)
• a set of instructions for operations on scalar and packed data types
• 70 new instructions, mainly for single-precision floating-point data.
SSE2, SSE3, SSSE3 and SSE4 are further extensions of SSE. SSE2 adds packed data types with double-precision floating point.
Advanced Vector Extensions (AVX) is an extension of the x86 instruction set architecture for Intel and AMD microprocessors, proposed by Intel in March 2008. It was first supported by the Intel Sandy Bridge processors in Q1 2011 and by the AMD Bulldozer processors in Q3 2011. AVX provides new features, new instructions and a new coding scheme. The size of the vector registers is increased from 128 to 256 bits (YMM0-YMM15); the existing 128-bit instructions use the lower half of the YMM registers.
Different data types can be packed in vector registers as follows. [Figure: packing of 8/16/32/64-bit integers, floats and doubles into a vector register.] Selecting the appropriate data type for calculations can significantly affect application performance.
Optimization with switches: SIMD — SSE, SSE2, SSE3, SSE4.2 support
[Figure: packed data types per instruction set. MMX*: packed integers in 64-bit registers; SSE: 4x floats; SSE2 and later (SSE3, SSE4.2): 2x doubles, 16x bytes, 8x words, 4x dwords, 2x qwords, 1x dqword in 128-bit registers.]
* MMX actually used the x87 floating-point registers; SSE, SSE2 and SSE3 use the new SSE registers.
Instruction groups
• Data movement instructions:
• Instruction Suffix Description
• movdqa move double quadword aligned
• movdqu move double quadword unaligned
• mova [ ps, pd ] move floating-point aligned
• movu [ ps, pd ] move floating-point unaligned
• movhl [ ps ] move packed floating-point high to low
• movlh [ ps ] move packed floating-point low to high
• movh [ ps, pd ] move high packed floating-point
• movl [ ps, pd ] move low packed floating-point
• mov [ d, q, ss, sd ] move scalar data
• lddqu load double quadword unaligned
• mov<d/sh/sl>dup move and duplicate
• pextr [ w ] extract word
• pinsr [ w ] insert word
• pmovmsk [ b ] move mask
• movmsk [ ps, pd ] move mask
• An aligned data movement instruction cannot be applied to a memory location that is not aligned by 16 bytes.
Integer arithmetic instructions:
• Instruction Suffix Description
• padd [ b, w, d, q ] packed addition (signed and unsigned)
• psub [ b, w, d, q ] packed subtraction (signed and unsigned)
• padds [ b, w ] packed addition with saturation (signed)
• paddus [ b, w ] packed addition with saturation (unsigned)
• psubs [ b, w ] packed subtraction with saturation (signed)
• psubus [ b, w ] packed subtraction with saturation (unsigned)
• pmins [ w ] packed minimum (signed)
• pminu [ b ] packed minimum (unsigned)
• pmaxs [ w ] packed maximum (signed)
• pmaxu [ b ] packed maximum (unsigned)
Floating-point arithmetic instructions:
• Instruction Suffix Description
• add [ ss, ps, sd, pd ] addition
• div [ ss, ps, sd, pd ] division
• min [ ss, ps, sd, pd ] minimum
• max [ ss, ps, sd, pd ] maximum
• mul [ ss, ps, sd, pd ] multiplication
• sqrt [ ss, ps, sd, pd ] square root
• sub [ ss, ps, sd, pd ] subtraction
• rcp [ ss, ps ] approximated reciprocal
• rsqrt [ ss, ps ] approximated reciprocal square root
• Idiomatic arithmetic instructions:
• Instruction Suffix Description
• pavg [ b, w ] packed average with rounding (unsigned)
• pmulh/pmulhu/pmull [ w ] packed multiplication
• psad [ bw ] packed sum of absolute differences (unsigned)
• pmadd [ wd ] packed multiplication and addition (signed)
• addsub [ ps, pd ] floating-point addition/subtraction
• hadd [ ps, pd ] floating-point horizontal addition
• hsub [ ps, pd ] floating-point horizontal subtraction
Logical instructions:
• Instruction Suffix Description
• pand bitwise logical AND
• pandn bitwise logical AND-NOT
• por bitwise logical OR
• pxor bitwise logical XOR
• and [ ps, pd ] bitwise logical AND
• andn [ ps, pd ] bitwise logical AND-NOT
• or [ ps, pd ] bitwise logical OR
• xor [ ps, pd ] bitwise logical XOR
• Comparison instructions:
• Instruction Suffix Description
• pcmp<cc> [ b, w, d ] packed compare
• cmp<cc> [ ss, ps, sd, pd ] floating-point compare
• <cc> defines the comparison operation: lt – less than, gt – greater than, eq – equal.
Conversion instructions:
• Instruction Suffix Description
• packss [ wb, dw ] pack with saturation (signed)
• packus [ wb ] pack with saturation (unsigned)
• cvt<s2d> conversion
• cvtt<s2d> conversion with truncation
• Shift instructions:
• Instruction Suffix Description
• psll [ w, d, q, dq ] shift left logical (zero in)
• psra [ w, d ] shift right arithmetic (sign in)
• psrl [ w, d, q, dq ] shift right logical (zero in)
• Shuffle instructions:
• Instruction Suffix Description
• pshuf [ w, d ] packed shuffle
• pshufh [ w ] packed shuffle high
• pshufl [ w ] packed shuffle low
• shuf [ ps, pd ] shuffle
Unpack instructions:
• Instruction Suffix Description
• punpckh [ bw, wd, dq, qdq ] unpack high
• punpckl [ bw, wd, dq, qdq ] unpack low
• unpckh [ ps, pd ] unpack high
• unpckl [ ps, pd ] unpack low
• Cacheability control and prefetch instructions:
• Instruction Suffix Description
• movnt [ ps, pd, q, dq ] move aligned non-temporal
• prefetch<hint> prefetch with hint
• State management instructions
• These instructions are commonly used by the operating system.
Three Sets of Switches to Enable Processor-specific Extensions • Switches –x<EXT> like –xSSE4_1 • Imply an Intel processor check • Run-time error message at program start when launched on processor without <EXT> • Switches –m<EXT> like –mSSE3 • No processor check • Illegal instruction fault when launched on processor without <EXT> • Switches –ax<EXT> like –axSSE4_2 • Automatic processor dispatch – multiple code paths • Processor check only available on Intel processors • Non-Intel processors take default path • Default path is –mSSE2 but can be modified again by another –x<EXT> switch
Simple estimation of vectorization profitability
• A typical vector instruction is an operation on two vectors in memory or in registers of fixed length. These vectors can be loaded from memory in a single operation or in several.
• A description of the foundations of SSE technology can be found in Chapter 10, "Programming with Streaming SIMD Extensions (SSE)", of the document «Intel 64 and IA-32 Intel Architecture Software Developer's Manual» (Volume 1).
• Microsoft Visual Studio supports a set of SSE intrinsics that allows you to use SSE instructions directly from C/C++ code. You need to include xmmintrin.h, which defines the vector type __m128 and the vector operations.
• For example, suppose we need to vectorize the following loop manually:
• for(i=0;i<N;i++)
• C[i]=A[i]*B[i];
• To do this:
• 1) organize and fill the vector variables
• 2) use the multiply intrinsic on the vector variables
• 3) write the results of the calculations back to memory
An example illustrating vectorization with SSE intrinsics:

#include <stdio.h>
#include <xmmintrin.h>
#define N 40
int main()
{
  float a[N][N][N], b[N][N][N], c[N][N][N];
  int i, j, k, rep;
  __m128 *xa, *xb, *xc;
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
      for(k=0;k<N;k++) {
        a[i][j][k]=1.0;
        b[i][j][k]=2.0;
      }
  for(rep=0;rep<10000;rep++) {
#ifdef PERF
    for(i=0;i<N;i++)
      for(j=0;j<N;j++)
        for(k=0;k<N;k+=4) {
          xa=(__m128*)&(a[i][j][k]);
          xb=(__m128*)&(b[i][j][k]);
          xc=(__m128*)&(c[i][j][k]);
          *xc=_mm_mul_ps(*xa,*xb);
        }
#else
    for(i=0;i<N;i++)
      for(j=0;j<N;j++)
        for(k=0;k<N;k++)
          c[i][j][k]=a[i][j][k]*b[i][j][k];
#endif
  }
  printf("%f\n", c[21][11][18]);
}

icl -Od test.c -Fetest1.exe
icl -Od test.c -DPERF -Fetest_opt.exe

time test1.exe
2.000000
CPU time for command: 'test1.exe'
real 3.406 sec user 3.391 sec system 0.000 sec

time test_opt.exe
2.000000
CPU time for command: 'test_opt.exe'
real 1.281 sec user 1.250 sec system 0.000 sec

Intel compiler 12.0 was used for this experiment.
The resulting speedup is about 2.7x.
• We used 16-byte-aligned memory access instructions in this example. The alignment happened to match by accident; in a real case you need to take care of it.
• The same test, compiled with the compiler's optimizations enabled, shows the following result:
• icl test.c -Qvec_report3 -Fetest_intel_opt.exe
• time test_intel_opt.exe
• 2.000000
• CPU time for command: 'test_intel_opt.exe'
• real 0.328 sec
• user 0.313 sec
• system 0.000 sec
Admissibility of vectorization
• Vectorization is a permutation optimization: the initial execution order is changed during vectorization.
• A permutation optimization is admissible if it preserves the order imposed by the dependencies. This gives a criterion for the admissibility of vectorization in terms of dependencies.
• The simplest case is when there are no dependencies inside the processed loop.
• In a more complicated case there are dependencies inside the vectorized loop, but their order is the same as in the initial scalar loop.
• Let's recall the definition of a dependency in a loop:
• There is a loop dependency between statements S1 and S2 in a set of nested loops if and only if 1) there are two loop-nest iteration vectors i and j such that i < j, or i = j and there is a path from S1 to S2 inside the loop; 2) statement S1 on iteration i and statement S2 on iteration j refer to the same memory area; 3) one of these statements writes to this memory.
Option for vectorization control • /Qvec-report[n] • control amount of vectorizer diagnostic information • n=0 no diagnostic information • n=1 indicate vectorized loops (DEFAULT) • n=2 indicate vectorized/non-vectorized loops • n=3 indicate vectorized/non-vectorized loops and prohibiting • data dependence information • n=4 indicate non-vectorized loops • n=5 indicate non-vectorized loops and prohibiting data • dependence information • Usage: icl -c -Qvec_report3 loop.c • Diagnostic examples: • C:\loops\loop1.c(5) (col. 1): remark: LOOP WAS VECTORIZED. • C:\loops\loop3.c(5) (col. 1): remark: loop was not vectorized: vectorization possible but seems inefficient. • C:\loops\loop6.c(5) (col. 1): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Simple criteria of vectorization admissibility
• Let's write the vectorization of loops using Fortran array sections.
• A good criterion for vectorization admissibility is that introducing array sections does not create a dependency.

Scalar:
DO I=1,N
A(I)=A(I)
END DO
Vectorized:
DO I=1,N/VL
A(I:I+VL)=A(I:I+VL)
END DO

Scalar:
DO I=1,N
A(I+1)=A(I)
END DO
Vectorized:
DO I=1,N/VL
A(I+1:I+1+VL)=A(I:I+VL)
END DO
There is a dependency, because A(I+1:I+1+VL) written on iteration I and A(I+VL:I+2*VL) read on iteration I+1 intersect.

Scalar:
DO I=1,N
A(I-1)=A(I)
END DO
Vectorized:
DO I=1,N/VL
A(I-1:I-1+VL)=A(I:I+VL)
END DO
There is no dependency, because A(I-1:I-1+VL) written on iteration I and A(I+VL:I+2*VL) read on iteration I+1 do not intersect.
Let's check an assumption: a loop can be vectorized if the dependence distance is greater than or equal to the number of array elements within the vector register. Check this with the compiler:

ifort test.F90 -o a.out -vec_report3
echo -------------------------------------
ifort test.F90 -DPERF -o b.out -vec_report3

./build.sh
test.F90(11): (col. 1) remark: loop was not vectorized: existence of vector dependence.
-------------------------------------
test.F90(11): (col. 1) remark: LOOP WAS VECTORIZED.

PROGRAM TEST_VEC
INTEGER,PARAMETER :: N=1000
#ifdef PERF
INTEGER,PARAMETER :: P=4
#else
INTEGER,PARAMETER :: P=3
#endif
INTEGER A(N)
DO I=1,N-P
A(I+P)=A(I)
END DO
PRINT *,A(50)
END
Dependency analysis and directives
There are two tasks the compiler must perform for dependency evaluation:
• alias analysis (pointers that can address the same memory must be detected)
• definition-use chain analysis
The compiler must prove that there are no aliased objects and precisely calculate the dependencies. This is a hard task, and sometimes the compiler is unable to solve it. There are methods of providing additional information to the compiler:
• option -ansi_alias (pointers can refer only to objects of the same or a compatible type)
• the restrict attribute for pointer arguments (C/C++)
• #pragma ivdep, which asserts that there are no dependencies in the following loop (C/C++)
• !DEC$ IVDEP, the Fortran analogue of #pragma ivdep
Some performance issues for vectorized code
Let's consider a simple test with an assignment that is suitable for vectorization, and obtain vectorized code with the Intel Fortran compiler for different values of the SHIFT macro. /fpp is the preprocessor option. The Intel compiler vectorizes at optimization level 2 or 3 (-O2 or -O3). Option -Ob0 is used to forbid inlining.

INTEGER :: A(1000),B(1000)
INTEGER I,K
INTEGER, PARAMETER :: REP = 500000
A = 2
DO K=1,REP
CALL ADD(A,B)
END DO
PRINT *,SHIFT,B(101)
CONTAINS
SUBROUTINE ADD(A,B)
INTEGER A(1000),B(1000)
INTEGER I
!DEC$ UNROLL(0)
DO I=1,1000-SHIFT
B(I) = A(I+SHIFT)+1
END DO
END SUBROUTINE
END
Experiment results

ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 -Fea.exe -Qvec_report >a.out 2>&1
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 -Feb.exe -Qvec_report >b.out 2>&1

time.exe a.exe
0 3
CPU time for command: 'a.exe'
real 0.125 sec user 0.094 sec system 0.000 sec

time.exe b.exe
1 3
CPU time for command: 'b.exe'
real 0.297 sec user 0.281 sec system 0.000 sec
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 /Fas -S -Fafast.s

fast.s:
.B2.5: ; Preds .B2.5 .B2.4
;;; B(I) = A(I+SHIFT)+1
movdqa xmm1, XMMWORD PTR [eax+ecx*4] ;17.11
paddd xmm1, xmm0 ;17.4
movdqa XMMWORD PTR [edx+ecx*4], xmm1 ;17.4
add ecx, 4 ;16.3
cmp ecx, 1000 ;16.3
jb .B2.5 ; Prob 99% ;16.3
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 /Fas -S -Faslow.s

slow.s:
.B2.5: ; Preds .B2.5 .B2.4
;;; B(I) = A(I+SHIFT)+1
movdqu xmm1, XMMWORD PTR [4+eax+ecx*4] ;17.11
paddd xmm1, xmm0 ;17.4
movdqa XMMWORD PTR [edx+ecx*4], xmm1 ;17.4
add ecx, 4 ;16.3
cmp ecx, 996 ;16.3
jb .B2.5 ; Prob 99% ;16.3

CONCLUSION:
MOVDQA — Move Aligned Double Quadword
MOVDQU — Move Unaligned Double Quadword
In the fast version aligned instructions are used and the vector registers are filled faster. Unaligned instructions are slower; on the latest architectures they show the same performance as aligned instructions when applied to aligned data.
The performance of a vectorized loop depends on the memory location of the objects it uses. An important aspect of program performance is the memory alignment of the data. Data structure alignment is the way data is placed in computer memory. This concept includes two distinct but related issues: data alignment and data structure padding. Data alignment specifies how data is located relative to memory boundaries; this property is usually associated with a data type. Data structure padding is the insertion of unnamed fields into a structure in order to preserve the relative alignment of its fields.
Data alignment
Information about alignment can be obtained with the __alignof__ intrinsic. The size and the default alignment of a variable of a given type may depend on the compiler and the target (ia32 or intel64).
printf("int: sizeof=%d align=%d\n", sizeof(a), __alignof__(a));
Alignment for the ia32 Intel C++ compiler:
bool sizeof = 1 alignof = 1
wchar_t sizeof = 2 alignof = 2
short int sizeof = 2 alignof = 2
int sizeof = 4 alignof = 4
long int sizeof = 4 alignof = 4
long long int sizeof = 8 alignof = 8
float sizeof = 4 alignof = 4
double sizeof = 8 alignof = 8
long double sizeof = 8 alignof = 8
void* sizeof = 4 alignof = 4
The same rules are used for array alignment. It is possible to force the compiler to align an object in a certain way:
__declspec(align(16)) float x[N];
Data Structure Alignment
Declared structure:
struct foo {
bool a;
short b;
long long c;
bool d;
};
With the padding the compiler inserts:
struct foo {
bool a;
char pad1[1];
short b;
char pad2[4];
long long c;
bool d;
char pad3[7];
};
The order of fields in the structure affects the size of an object of the derived type. To reduce the size of the object, structure fields should be sorted by descending size. You can use __declspec to align structure fields:
typedef struct aStruct {
__declspec(align(16)) float x[N];
__declspec(align(16)) float y[N];
__declspec(align(16)) float z[N];
} aStruct;
The approximate scheme of loop vectorization for
for(i=0;i<100;i++) p[i]=b[i];
[Figure: peel loop for the non-aligned starting elements, vector operations on an 8-element vector register, tail loop.]
Loop vectorization usually produces three loops: a loop for the non-aligned starting elements, the vectorized loop, and the tail. Vectorization of a loop with a small number of iterations can be unprofitable.
Additional vectorization example

main.c:
#include <stdio.h>
#define N 1000
extern void Calculate(float *, float *, float *, int);
int main()
{
float x[N], y[N], z[N];
int i, rep;
for(i=0;i<N;i++) { x[i] = 1; y[i] = 0; z[i] = 1; }
for(rep=0;rep<10000000;rep++) {
Calculate(&x[1], &y[0], &z[0], N-1);
}
printf("x[1]=%f\n", x[1]);
}

vector.c:
void Calculate(float *a, float *b, float *c, int n)
{
int i;
for(i=0;i<n;i++) {
a[i] = a[i]+b[i]+c[i];
}
return;
}

Note that the alignment of the first argument, &x[1], differs from the others.

icl main.c vec.c -O1 -FeA
time a.exe
12.6 s.
The compiler performs auto-vectorization at -O2 or -O3. Option -Qvec_report reports the vectorized loops.
1)
icl main.c vec.c -O2 -Qvec_report -Feb
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time b.exe
3.67 s.
Vectorization is possible because the compiler inserts a run-time check for the case when the pointers are not aliased. The application size is enlarged.
2)
void Calculate(float * restrict a, float * restrict b, float * restrict c, int n) {
To use the restrict attribute we need to add option -Qstd=c99:
icl main.c vec.c -Qstd=c99 -O2 -Qvec_report -Fec
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time c.exe
3.55 s.
A small improvement, because the run-time check is avoided.
Useful fact: on modern systems the performance of aligned and unaligned instructions is almost the same when applied to aligned objects.
3)
int main()
{
__declspec(align(16)) float x[N];
__declspec(align(16)) float y[N];
__declspec(align(16)) float z[N];
...
Calculate(&x[0], &y[0], &z[0], N-1);

void Calculate(float * restrict a, float * restrict b, float * restrict c, int n)
{
__assume_aligned(a,16);
__assume_aligned(b,16);
__assume_aligned(c,16);
...

icl main.c vec.c -Qstd=c99 -O2 -Qvec_report -Fed
vec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.
time d.exe
3.20 s.
This update demonstrates an improvement because of the better alignment of the vectorized objects. The arrays in main are aligned to 16; with this update all argument pointers are well aligned and the compiler is informed by the __assume_aligned directive. This allows it to remove the first scalar (peel) loop.
Data alignment
• Good array data alignment: for SSE, 16 bytes; for AVX, 32 bytes.
• Data alignment directives:
• C/C++ Windows: __declspec(align(16)) float X[N];
• Linux/MacOS: float X[N] __attribute__((aligned(16)));
• Fortran: !DIR$ ATTRIBUTES ALIGN: 16 :: A
• Aligned malloc:
• _aligned_malloc()
• _mm_malloc()
• Data alignment assertion (16-byte example):
• C/C++: __assume_aligned(p,16);
• Fortran: !DIR$ ASSUME_ALIGNED A(1):16
• Aligned loop assertion:
• C/C++: #pragma vector aligned
• Fortran: !DIR$ VECTOR ALIGNED
Non-unit stride and unit stride access
Well-aligned data is better for vectorization because in this case a vector register is filled by a single operation. With non-unit stride access to an array, filling the register is a more complicated task and vectorization is less profitable. The auto-vectorizer cooperates with loop optimizations to improve access to objects.
There are compiler directives that recommend vectorization in cases when the compiler skips it because it looks unprofitable:
C/C++: #pragma vector {aligned|unaligned|always}, #pragma novector
Fortran: !DEC$ VECTOR ALWAYS, !DEC$ NOVECTOR
Vectorization of the outer loop
Usually the auto-vectorizer processes the innermost loop of a nest. Vectorization of the outer loop can be requested with the "simd" directive.

#define N 200
#include <stdio.h>
int main()
{
int A[N][N], B[N][N], C[N][N];
int i, j, rep;
for(i=0;i<N;i++)
for(j=0;j<N;j++) {
A[i][j]=i+j;
B[i][j]=2*j-i;
C[i][j]=0;
}
for(rep=0;rep<10000000;rep++) {
#pragma simd
for(i=0;i<N;i++) {
j=0;
while(A[i][j]<=B[i][j] && j<N) {
C[i][j]=C[i][j]+B[j][i]-A[j][i];
j++;
}
}
}
printf("%d\n",C[0][2]);
}

icl vec.c -O3 -Qvec- -Fea (-Qvec- disables vectorization)
20.7 s
icl vec.c -O3 -Qvec_report -Feb
17 s
vec.c(17): (col. 3) remark: SIMD LOOP WAS VECTORIZED.