It’s all about performance: Using Visual C++ 2012 to maximize your hardware. Jim Radigan (Dev Lead/Architect, C++ Optimizer), Don McCrady (Dev Lead, C++ AMP). Session 3-013.
Mission: Go under the covers. Make folks aware of the massive hardware resources available, then show how C++ exploits them, by covering PPL, AMP, or just “doing nothing”!
Ivy Bridge 1.4 Billion Transistors
Going Native • You’ve been hearing about the native C++ renaissance • This is what it’s all about: exploiting the hardware
Ivy Bridge C++ PPL AMP
Agenda • 1. Hardware • 2. C++ auto vectorization + parallelization • 3. C++ PPL • 4. C++ AMP
Hardware – Forms of Parallelism • Super Scalar • Vector • Vector + Parallel • SPMD
Super Scalar – instruction-level parallelism • 20% of ILP resides within a basic block • 60% resides across two adjacent basic blocks • 20% is scattered through the rest of the code
Super Scalar – needs speculative execution
bar:
140: r0 = 42
141: r1 = r0 + r3
142: M[r1] = r1
143: zflag = r3 – r2
144: jz foo
145: …
…
foo:
190: r4 = 0
191: r5 = 0
192: r6 = 0
193: M[r1] = 0
Super Scalar – path of certain execution
foo:
190: r4 = 0
bar:
140: r0 = 42
141: r1 = r0 + r3
142: M[r1] = r1
143: zflag = r3 – r2
144: jz foo
191: r5 = 0
192: r6 = 0
193: M[r1] = 0
When there are NO branches between a micro-op and retiring to the visible architectural state, it is no longer speculative.
Super Scalar – enables C++ vectorization
void foo(int *A, int *B, int *C) {
  if ( (__isa_available == 2)      // SSE 4.2?
    && (&A[1000] < &B[0])          // pointer overlap check
    && (&A[1000] < &C[0]) ) {      // pointer overlap check
    … FAST VECTOR/PARALLEL LOOP …
  } else {
    … SEQUENTIAL LOOP …
  }
}
Vector: addps xmm1, xmm0 adds the four packed single-precision lanes of xmm0 into xmm1 (xmm1 = xmm1 + xmm0).
Vector on the CPU. Scalar (1 operation): add r3, r1, r2 computes r3 = r1 + r2. Vector (N operations): vadd v3, v1, v2 adds whole vectors v1 and v2, vector-length elements at a time.
Arrays of Parallel Threads – SPMD
threadID: 0 1 2 3 4 5 6 7
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
• All threads run the same code (SPMD)
• Each thread has an ID that it uses to compute memory addresses and make control decisions
Agenda • 1. Hardware • 2. C++ auto vec+par • 3. C++ PPL • 4. C++ AMP
C++ Vectorizer – VS2012 Compiler Super Scalar Vector Vector + Parallel
Simple vector add loop
for (i = 0; i < 1000; i++) A[i] = B[i] + C[i];
becomes
for (i = 0; i < 1000/4; i++) {
  movps xmm0, [ecx]
  movps xmm1, [eax]
  addps xmm0, xmm1
  movps [edx], xmm0
}
The compiler looks across loop iterations!
Compiler or “do it yourself” C++
void add(float* A, float* B, float* C, int size) {
  for (int i = 0; i < size/4; ++i) {
    __m128 p_v1 = _mm_loadu_ps(A + 4*i);
    __m128 p_v2 = _mm_loadu_ps(B + 4*i);
    __m128 res  = _mm_add_ps(p_v1, p_v2);
    _mm_storeu_ps(C + 4*i, res);
  }
}
C++ or Klingon?
Vector – all loads before all stores: addps xmm1, xmm0 reads both xmm0 and xmm1 in full before writing xmm1.
Legal to vectorize?
for (j = 2; j <= 5; j++) A(j) = A(j-1) + A(j+1)
is NOT equal to
A(2:5) = A(1:4) + A(3:6)
What value of A(2) does A(3) see?
Vector semantics • ALL loads before ALL stores
A(2:5) = A(1:4) + A(3:6)
VR1 = LOAD(A(1:4))
VR2 = LOAD(A(3:6))
VR3 = VR1 + VR2 // A(3) = F(A(2), A(4)), using the OLD A(2)
STORE(A(2:5)) = VR3
Scalar semantics • Instead: load, store, load, store …
for (j = 2; j <= 257; j++) A(j) = A(j-1) + A(j+1)
A(2) = A(1) + A(3)
A(3) = A(2) + A(4) // A(3) = F(A(1), A(2), A(3), A(4)), using the NEW A(2)
A(4) = A(3) + A(5)
A(5) = A(4) + A(6)
…
Doubled the optimizer: for every pair of accesses, dependence analysis must decide whether A(a1 * i + c1) can ever refer to the same element as A(a2 * i′ + c2).
Legal to vectorize and parallelize? Complex C++, not just arrays!
for (size_t j = 0; j < numBodies; j++) {
  D3DXVECTOR4 r;
  r.x = A[j].pos.x - pos.x;
  r.y = A[j].pos.y - pos.y;
  r.z = A[j].pos.z - pos.z;
  float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
  distSqr += softeningSquared;
  float invDist = 1.0f / sqrt(distSqr);
  float invDistCube = invDist * invDist * invDist;
  float s = fParticleMass * invDistCube;
  acc.x += r.x * s;
  acc.y += r.y * s;
  acc.z += r.z * s;
}
Hard! The compiler reports why it failed to vectorize or parallelize: cl /Qvec-report:2 /O2 t.cpp and cl /Qpar-report:2 /O2 t.cpp
Parallelism + vector
Source:
void foo() {
  #pragma loop(hint_parallel(4))
  for (int i = 0; i < 1000; i++)
    A[i] = B[i] + C[i];
}
Compiler generates:
void foo() {
  CompilerParForLib(0, 1000, 4, &foo$par1, A, B, C);
}
foo$par1(int T1, int T2, int *A, int *B, int *C) {
  for (int i = T1; i < T2; i += 4)
    movps xmm0, [ecx]
    movps xmm1, [eax]
    addps xmm0, xmm1
    movps [edx], xmm0
}
At runtime, vectorized and parallel:
• foo$par1(0, 249, A, B, C);    core 1
• foo$par1(250, 499, A, B, C);  core 2
• foo$par1(500, 749, A, B, C);  core 3
• foo$par1(750, 999, A, B, C);  core 4
The bigger picture (diagram): four cores, each pairing a scalar unit with a vector unit.
Vector + parallel demo: Dev10/Win7 fully optimized (novec_concrt.avi) vs. Dev11/Win8 fully optimized (vec_omp.avi)
Not your grandfather’s vectorizer
for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;
}
for (k = 1; k <= M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;
}
for (k = 1; k < M; k++) {
  ic[k] = mpp[k] + tpmi[k];
  if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
  ic[k] += is[k];
  if (ic[k] < -INFTY) ic[k] = -INFTY;
}
Vector control-flow algebra • Identify source code that looks like this (inside a loop): if (X > Y) { Y = X; } • The vectorizer can then create: Y = MAX(X, Y)
Vector control flow: pmax xmm1, xmm0 computes the per-lane maximum of xmm0 and xmm1 into xmm1.
Not your grandfather’s vectorizer: the transformed code
if (__isa_available > SSE2 && NO_ALIASING) {
  // Vector loop
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    mc[k] = MAX(ip[k-1] + tpim[k-1], mc[k]);
    mc[k] = MAX(dpp[k-1] + tpdm[k-1], mc[k]);
    mc[k] = MAX(xmb + bp[k], mc[k]);
    mc[k] += ms[k];
    mc[k] = MAX(mc[k], -INFTY);
  }
  // Scalar loop (loop-carried dependence on dc)
  for (k = 1; k <= M; k++) {
    dc[k] = dc[k-1] + tpdd[k-1];
    dc[k] = MAX(mc[k-1] + tpmd[k-1], dc[k]);
    dc[k] = MAX(dc[k], -INFTY);
  }
  // Vector loop
  for (k = 1; k <= M; k++) {
    ic[k] = mpp[k] + tpmi[k];
    ic[k] = MAX(ip[k] + tpii[k], ic[k]);
    ic[k] += is[k];
    ic[k] = MAX(ic[k], -INFTY);
  }
}
Vector Math Library: 15x faster
Vectorization – targeting the vector math library
for (i = 0; i < n; i++) {
  a[i] = a[i] + b[i];
  a[i] = sin(a[i]);
}
becomes (VL = hardware vector length):
for (i = 0; i < n; i += VL) {
  a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1);   // HW SIMD instruction
  a(i : i+VL-1) = _svml_Sin(a(i : i+VL-1));        // NEW run-time library call
}
Parallel and vector – on by default
for (int i = 0; i < _countof(a); ++i) {
  float dp = 0.0f;
  for (int j = 0; j < _countof(a); ++j) {
    float fj = (float)j;
    dp += sin(fj) * exp(fj);
  }
  a[i] = dp;
}
Pragma
void Foo(float *a, float *b, float *c) {
  #pragma loop(hint_parallel(N))
  for (auto i = 0; i < N; i++) {
    *c++ = (*a++) * bar(b++);
  }
}
Use simple directives. Pointers, and procedure calls through which pointers escape, prevent the analysis needed for auto-parallelization; the pragma asserts it is safe.
16x speedup – unmodified C++ • Scheduling • Static • Dynamic
…and the compiler selects the scheduling strategy
for (int l = top; l < bottom; l++) {
  for (int m = left; m < right; m++) {
    int y = *(blurredImage + (l*dimX) + m);
    ySourceRed   += (unsigned int)(y & 0x00FF0000) >> 16;
    ySourceGreen += (unsigned int)(y & 0x0000FF00) >> 8;
    ySourceBlue  += (unsigned int)(y & 0x000000FF);
    averageCount++;
  }
}
Software – “no magic bullet”
C++ vectorizer (CPU):
for (int i = 0; i < 1000; i++) A[i] = B[i] + C[i];
C++ PPL (CPU):
parallel_for(0, 1000, 1, [&](int i) { A[i] = B[i] + C[i]; });
C++ AMP (GPU):
parallel_for_each(e, [&](index<2> idx) restrict(amp) { c[idx] = b[idx] + a[idx]; });
copy(c, pC);
Built with C++ • Windows 8, SQL, Office • Mission-critical correctness and compile time
PPL for C++ • Parallel Patterns Library
3 PPL constructs – simple but huge value
parallel_invoke(
  [&]{ quicksort(a, left, i-1, comp); },
  [&]{ quicksort(a, i+1, right, comp); });
parallel_for(0, 100, 1, [&](int i) { /* … */ });
vector<int> vec;
parallel_for_each(vec.begin(), vec.end(), [&](int& i) { /* ... */ });