The end of programming
PetaQCD collaboration
End of Parallelism • The era when we were in control of what gets executed where and how is pretty much over • No modern compiler can generate efficient code for modern architectures • Because they are SPMD machines of varying widths • So companies have to do something else
Warps • To abstract and hide how programs execute, NVIDIA came up with the thread warp: a bunch of threads executed at the same time on one part of the hardware
ISPC • Intel noticed that people don’t quite like CUDA • Yet concluded that no compiler can figure out where and how to parallelize • So it abandoned the idea of having its own compiler • And conceived a syntax which would appear user-friendly • Wrote an LLVM (Low-Level Virtual Machine) frontend and a few backends • Just like NVIDIA did.
Gangs • So what does brave new code look like?

export void simple(uniform float vin[], uniform float vout[],
                   uniform int count) {
    foreach (index = 0 ... count) {
        float v = vin[index];   // varying: one value per program instance
        if (v < 3.)
            v = v * v;
        else
            v = sqrt(v);
        vout[index] = v;
    }
}

// The plain C/C++ caller:
float vin[16], vout[16];
for (int i = 0; i < 16; ++i)
    vin[i] = i;
simple(vin, vout, 16);
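As a usage sketch (our addition, not from the slides): the kernel above is compiled by the ispc compiler into an ordinary object file plus a C header, and linked like any other object; the exact --target name depends on the ispc version and machine.

ispc simple.ispc -o simple.o -h simple.h --target=avx
cc main.c simple.o -o simple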
Gangs • So instead of thread warps we have gangs • A bunch of program « instances » mapped onto the SIMD lanes • About 2-4x the width of the SIMD unit • And variables can be shared (uniform) or unique (varying) across the gang • Atomic operations should be used inside a gang • No threads, no context switching, just an army of marching ants
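A minimal sketch (our addition, not from the original slides) of how shared and unique variables mix inside one gang:

export void axpy(uniform float a, uniform float x[], uniform float y[],
                 uniform int count) {
    // 'a' and 'count' are uniform: a single copy shared by the whole gang.
    foreach (i = 0 ... count) {
        // 'i' and 'xi' are varying: each program instance holds its own value.
        float xi = x[i];
        y[i] = a * xi + y[i];
    }
}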
So what? • Intel’s ISPC is a poor man’s CUDA • Flow control is much simpler (things proceed uniformly) • Obviously at the cost of performance • And it is designed to deceive users into thinking it is simpler • While all it does is map code to program instances • Which then get mapped to low-level functions (see the sketch below) • So in principle ISPC can be used for any architecture
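To make that mapping concrete, here is our own conceptual illustration (not from the slides) of how varying statements execute on an 8-wide AVX target:

float v = vin[index];   // gang of 8 instances -> one 8-lane vector load
v = v * v;              // one source statement -> one packed multiply over all lanes
if (v < 3.) ...         // divergent branch -> both sides executed under a per-lane mask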
Do we care? • Yes we do. • First, Intel confirmed that classic compilers are dead • Second, it has shown us what a proper backend for AVX and SSE should look like (open-source code) • Note how the 16-wide gather below is assembled from four 4-wide AVX2 gathers:

define <16 x double> @__gather_base_offsets64_double(i8 * %ptr, i32 %scale,
        <16 x i64> %offsets, <16 x i32> %mask32) nounwind readonly alwaysinline {
  …
  %v1 = call <4 x double> @llvm.x86.avx2.gather.q.pd.256(<4 x double> undef,
        i8 * %ptr, <4 x i64> %offsets_1, <4 x double> %vecmask_1, i8 %scale8)
  assemble_4s(double, v, v1, v2, v3, v4)
  ret <16 x double> %v
}
So what? • Third, we have a glimpse at how Intel optimises. It doesn’t. • It just parses the commands and walks the AST (abstract syntax tree, the frontend’s representation of the program before LLVM IR):

ASTNode *
WalkAST(ASTNode *node, ASTPreCallBackFunc preFunc,
        ASTPostCallBackFunc postFunc, void *data) {
    if (node == NULL)
        return node;
    // Call the callback function
    if (preFunc != NULL) {
        if (preFunc(node, data) == false)
            // The function asked us to not continue recursively, so stop.
            return node;
    }
    // ... then recurse into the node's children and call postFunc (elided)
}
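As an illustration of that callback interface (our sketch, assuming the pre-callback signature implied above, bool (*)(ASTNode *, void *)), a walker that counts nodes could look like:

// Hypothetical pre-callback: count every node visited.
static bool CountPre(ASTNode *node, void *data) {
    ++*static_cast<int *>(data);   // data points at our counter
    return true;                   // true = keep recursing into children
}

static int CountNodes(ASTNode *root) {
    int count = 0;
    WalkAST(root, CountPre, /* postFunc */ NULL, &count);
    return count;
}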
And... • Generates IR • And maps it to the SIMD backends

void
AST::GenerateIR() {
    for (unsigned int i = 0; i < functions.size(); ++i)
        functions[i]->GenerateIR();
}
So we have … • An unstable, quickly developing utility • Which may or may not be actually useful beyond AVX (no MIC support yet) • Is about as obscure as CUDA (but with source!) • And is not GPL.
Short term • ISPC may be a valuable tool to investigate the efficiency of the LLVM-based compiler for Intel architectures • One may try to generate gang-compatible code • Because it will obviously be superior to icc • Which fails to vectorize properly
Medium term • And we finally have access to the L1 prefetcher! • NT means the data will be discarded after use (non-temporal)

uniform int32 array[...];
for (uniform int i = 0; i < count; ++i) {
    // do computation with array[i]
    prefetch_l1(&array[i + 32]);
}

void prefetch_{l1,l2,l3,nt}(void * uniform ptr)
void prefetch_{l1,l2,l3,nt}(void * varying ptr)
Long term • It is likely that moving from Qiral IR to ISPC to Intel IR we are losing information • And certainly adding overhead • So having our own WalkAST mapping to the Intel LLVM backends should be way better • And we can also modify them for IBM SIMD (remember, IBM also wants the same model, but with different images instead of instances within one image) • And maybe CUDA just as well
Conclusions • All modern architectures are damn GPUs • Divide variables into unique and uniform • Give up control of what is executed when and how, to a varying level • Have either implicit or explicit barriers/sync • Have gangs/thread warps as a sort of uniform threads • And LLVM seems to be the choice to deal with them.