Generating Truly Optimal Code Using a Metaprogramming Library
Don Clugston
First D Programming Conference, 24 August 2007

String mixins in D: undercooked, but very tasty

    char[] greet(char[] greeting) {
        return `writefln("` ~ greeting ~ `, world!");`;
    }
    void main() {
        mixin( greet("Hello") );
    }

• Compiles to:

    void main() {
        writefln("Hello, world!");
    }

• Vindicates built-in string operations

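The slide uses 2007-era D. As a minimal sketch, assuming a current D2 compiler (where `string` replaces `char[]` and `writeln` comes from `std.stdio`), the same idea might look like this. The `static assert` is an addition, not part of the slide; it only shows that the generated source exists as an ordinary compile-time string.

```d
import std.stdio : writeln;

// Builds D source code as a string. When used as the argument of mixin(),
// the function is evaluated at compile time (CTFE).
string greet(string greeting)
{
    return `writeln("` ~ greeting ~ `, world!");`;
}

// The generated source can be inspected at compile time like any other string.
static assert(greet("Hello") == `writeln("Hello, world!");`);

void main()
{
    // The string is pasted in as code: writeln("Hello, world!");
    mixin(greet("Hello"));
}
```

Because `greet` is ordinary code, the same function can also be called at run time, which makes generators like this easy to debug by printing their output.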
The Challenge
• Fortran: BLAS (a standard set of highly optimised routines). The crucial functions are coded in asm.

    y[] += a * x[]

    void DAXPY(double[] y, double[] x, double a) {
        for (int i = 0; i < y.length; ++i)
            y[i] += x[i] * a;
    }

• But BLAS is limited: it has nothing for simple things such as:

    x[] = y[] - z[]
    a[] = r[]*0.3 + g[]*0.5 + b[]*0.2;

Operator overloading
• Gives ideal syntax, always works
• Can't operate on built-in types
• Inefficient because:
    • Creates unnecessary temporaries.
    • Multiple loops, e.g. a[] = b[] + c[] + d[]:

    double[] temp1 = new double[b.length], temp2 = new double[b.length];
    for (int i = 0; i < b.length; ++i) temp1[i] = b[i] + c[i];
    for (int i = 0; i < temp1.length; ++i) temp2[i] = temp1[i] + d[i];
    a = temp2;

• Somehow, we need to get the expression inside the 'for' loop!

The Wizard Solution: Expression Templates (e.g. Blitz++)
• Overloaded operators don't do the calculation: instead, they record the operation as a proxy type, creating a syntax tree.
• Example: (a+b)/(c-d):

    DVExpr<DVBinExprOp<
        DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApAdd>>,
        DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApSubtract>>,
        DApDivide>>

• Need a good optimiser.
• Works in D as well as C++. BUT... we are fighting the compiler!

Representing the Syntax Tree in D
• In D, any expression can be represented in a single template.
• Represent types and values in a tuple. Represent the expression in a char[]. A..Z correspond to T[0]..T[25], e.g.:

    void vectorOperation(char[] expression, T...)(T values) { }
    vectorOperation!("A+=(B*C)/(A+D)")(x, y, z, u, v);

• Note that 'A' appears twice in the expression (operator overloading can't represent that).

Finding the vectors in a tuple
• It's a vector if you can index it.
• Imperfection: can't index a tuple in CTFE.
• Workaround: create an array of results.

    template isVector(T...) {
        static if (T.length == 0)
            const bool[] isVector = [];
        else static if (is(typeof(T[0][0])))
            const bool[] isVector = true ~ isVector!(T[1..$]);
        else
            const bool[] isVector = false ~ isVector!(T[1..$]);
    }

• Usage: if (isVector!(Tuple)[i]) { … }

Metaprogramming For Muggles

    char[] muggle(char[] expr, Values...)() {
        char[] code = "for (int i=0; i<values[0].length; ++i) {";
        foreach (c; expr)
            if (c >= 'A' && c <= 'Z') {
                // A-Z become tuple members.
                code ~= "values[" ~ itoa(c-'A') ~ "]";
                // add [i] if it was a vector
                if (isVector!(Values)[c-'A']) code ~= "[i]";
            } else
                code ~= c; // Everything else is unchanged
        return code ~ "; }";
    }

    template VEC(char[] expr) {
        void VEC(Values...)(Values values) {
            mixin( muggle!(expr, Values) );
        }
    }

USAGE:

    double[] firstvec, secondvec, thirdvec;
    VEC!("A+=B*(C+A*D)")(firstvec, secondvec, thirdvec, 25.7);

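The generator above can be tried on a current compiler with small changes. The following is a sketch under stated assumptions, not the talk's library: it targets D2, the names `vectorFlags` and `vec` are hypothetical, `std.conv.to` stands in for the slide's `itoa`, and the first argument is assumed to be a vector because its length drives the loop.

```d
import std.conv : to;

// One flag per argument type: true if the type can be indexed (a "vector").
// Hypothetical helper playing the role of the slide's isVector template.
bool[] vectorFlags(Values...)()
{
    bool[] flags;
    foreach (V; Values)               // unrolled over the type tuple
        flags ~= is(typeof(V.init[0]));
    return flags;
}

// Generates the source of a single fused loop for `expr`.
// Letters A..Z are replaced by values[0], values[1], ...
string muggle(string expr, Values...)()
{
    auto flags = vectorFlags!Values();
    string code = "foreach (i; 0 .. values[0].length) { ";
    foreach (c; expr)
    {
        if (c >= 'A' && c <= 'Z')
        {
            immutable index = c - 'A';
            code ~= "values[" ~ to!string(index) ~ "]";
            if (flags[index])
                code ~= "[i]";        // vectors are indexed, scalars are not
        }
        else
            code ~= c;                // everything else is copied unchanged
    }
    return code ~ "; }";
}

// The generated text for a DAXPY-style expression.
static assert(muggle!("A+=B*C", double[], double[], double)() ==
    "foreach (i; 0 .. values[0].length) { values[0][i]+=values[1][i]*values[2]; }");

// Hypothetical front door: expression first, operand types deduced.
void vec(string expr, Values...)(Values values)
{
    mixin(muggle!(expr, Values)());
}

void main()
{
    double[] a = [1.0, 2.0], b = [3.0, 4.0], c = [5.0, 6.0];
    vec!("A+=B*(C+A*D)")(a, b, c, 2.0);
    assert(a == [22.0, 42.0]);        // 1 + 3*(5 + 1*2) and 2 + 4*(6 + 2*2)
}
```

The whole expression ends up inside one loop body, with no temporaries, which is the property the operator-overloading approach could not provide.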
Trivial enhancements
• Ensure all vectors are the same length.
• Assert no aliasing (vectors don't overlap).
• Match hand-coded asm BLAS routines.

    // i indexes the slice [1..$], so it refers to values[i+1].
    foreach (int i, bool b; isVector!(Values)[1..$]) {
        if (b) code ~= "assert(values[" ~ itoa(i+1) ~ "].length == values[0].length);";
    }

    static if ( expr == "A+=B*C"
             && is( Values[0] == double[] )
             && is( Values[1] == double[] )
             && is( Values[2] : double ) ) {
        return "DAXPY(values[0].length, values[0].ptr, values[1].ptr, values[2]);";
    }

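The compile-time dispatch to a hand-tuned routine can be illustrated in isolation. This is a sketch assuming a current D2 compiler; `daxpy` here is a plain-D stand-in for an asm BLAS routine, `vec` is a hypothetical name, and the generic code-generation path is deliberately left out to keep the example short.

```d
// Stand-in for a hand-optimised BLAS routine (plain D, not asm).
void daxpy(double[] y, const(double)[] x, double a)
{
    assert(x.length == y.length, "vectors must have the same length");
    foreach (i; 0 .. y.length)
        y[i] += x[i] * a;
}

// Picks the specialised routine when the expression and operand types
// match exactly. The decision is made entirely at compile time.
void vec(string expr, Values...)(Values values)
{
    static if (expr == "A+=B*C"
            && is(Values[0] == double[])
            && is(Values[1] == double[])
            && is(Values[2] : double))
    {
        daxpy(values[0], values[1], values[2]);
    }
    else
    {
        // A full library would fall back to generated code here.
        static assert(false, "no generic fallback in this sketch: " ~ expr);
    }
}

void main()
{
    double[] y = [1.0, 2.0];
    double[] x = [3.0, 4.0];
    vec!"A+=B*C"(y, x, 2.0);
    assert(y == [7.0, 10.0]);
}
```

Since the `static if` is resolved during compilation, the dispatch adds no run-time cost: the call to `vec` compiles down to a direct call to `daxpy`.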
Asm code via perturbation
• It's hard to determine the optimal asm for an algorithm; it is much easier to modify existing code.
• Begin with Agner Fog's optimal asm code for DAXPY. Use the same loop design and register allocation strategy.
• Ignore difficult cases: fall back to D code.

X87 (stack-based)
• Convert the infix expression into postfix. Split += into + and =.
• Swap operands to avoid FMUL latency.

    A += B - C * D
    A = (A+B) - (C*D)
    C D * A B + - A =

• Avoid gaps in the instruction set
    • E.g. fewer instructions for 80-bit reals, so load them first whenever possible.

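The first step, converting infix to postfix and splitting `+=`, can be done by an ordinary function that the compiler evaluates at compile time. The sketch below assumes a current D2 compiler and uses the textbook shunting-yard algorithm; `toPostfix` and `assignToPostfix` are hypothetical names. It does not attempt the operand swapping or re-association shown on the slide, so its operand order differs from the slide's output.

```d
// Precedence of the binary operators handled by this sketch (0 = not an operator).
int prec(char op)
{
    return (op == '*' || op == '/') ? 2 : (op == '+' || op == '-') ? 1 : 0;
}

// Shunting-yard conversion of an infix expression with single-letter
// operands A..Z, the operators + - * / and parentheses.
string toPostfix(string infix)
{
    string output;
    char[] ops;                                   // operator stack
    foreach (c; infix)
    {
        if (c >= 'A' && c <= 'Z')
            output ~= c;
        else if (c == '(')
            ops ~= c;
        else if (c == ')')
        {
            while (ops.length && ops[$ - 1] != '(')
            {
                output ~= ops[$ - 1];
                ops = ops[0 .. $ - 1];
            }
            if (ops.length)
                ops = ops[0 .. $ - 1];            // discard the '('
        }
        else if (prec(c) > 0)
        {
            // Left-associative: pop operators of equal or higher precedence.
            while (ops.length && prec(ops[$ - 1]) >= prec(c))
            {
                output ~= ops[$ - 1];
                ops = ops[0 .. $ - 1];
            }
            ops ~= c;
        }
        // Any other character (such as a space) is ignored.
    }
    while (ops.length)
    {
        output ~= ops[$ - 1];
        ops = ops[0 .. $ - 1];
    }
    return output;
}

// Splits "X+=rhs" into "X+(rhs)" followed by a store to X.
// Assumes the target on the left is a single operand.
string assignToPostfix(string expr)
{
    foreach (i; 0 .. expr.length)
    {
        if (expr[i] == '+' && i + 1 < expr.length && expr[i + 1] == '=')
        {
            string target = expr[0 .. i];
            string rhs = expr[i + 2 .. $];
            return toPostfix(target ~ "+(" ~ rhs ~ ")") ~ target ~ "=";
        }
    }
    return toPostfix(expr);
}

// Both functions run at compile time, so the result can feed a string mixin.
static assert(toPostfix("A+(B-C*D)") == "ABCD*-+");
static assert(assignToPostfix("A+=B-C*D") == "ABCD*-+A=");

void main()
{
    assert(assignToPostfix("C+=B*(A+D)") == "CBAD+*+C=");
}
```

A code generator for a stack machine can then walk the postfix string left to right, emitting one load per operand and one arithmetic instruction per operator.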
X87 code generation
• Directly convert postfix to inline asm.

    VEC!("C+=B*(A+D)")(2213.3, vec1, floatvec, vec2);
    // Postfix: BAD+*C+C=

    L1: fld double ptr [EAX + 8*ESI];          // B
        fld double ptr [EAX + 8*ESI];          // A
        fadd double ptr [EDX + 8*ESI];         // D+
        fmulp ST(1), ST;                       // *
        fadd float ptr [ECX + 4*ESI];          // C+
        fxch ST(1), ST;
        fstp float ptr [ECX + 4*ESI - 4];      // C=
    L2: inc ESI;
        jnz L1;

SSE/SSE2 (register-based)
• Can't do mixed-precision operations.
• Unroll loop by 2 or 4, to take advantage of SIMD.
• Instruction scheduling is less critical, but register allocation is more complicated than for x87.

GPGPU
• Use the GPU in modern video cards to perform massively parallel calculations.
• Uses OpenGL or DirectX calls, instead of inline asm.
• Full of hacks (pretend your data is a texture!), but a rational API should emerge soon.
• This should NOT be built into a compiler!

Adding a front end
• Operator overloading
    • Same limitations as before
• Mixins, e.g. mixin(blade("firstvec+=secondvec*2.38"));
    • Clumsy syntax, BUT:
    • Can detect aliases
    • Allows better error messages
    • Can unroll small loops inline
    • Closer to proposed macro syntax

Front end using mixins
1. Lex: first += second * 2.38 becomes A+=B*C.
2. Determine types, resolve aliases, convert constants to literals.
3. Determine precedence and associativity.
4. Perform constant folding.
• We can do most of this using mixins.
• Compiler help is most required for step 4.
• __traits could help.

Determining types

    char[] getSymbolTable(char[][] symbols) {
        char[] result = "[";
        for (int i = 0; i < symbols.length; ++i) {
            if (i > 0) result ~= ",";
            result ~= "[typeof(" ~ symbols[i] ~ `).stringof, ` ~ symbols[i] ~ `.stringof]`;
        }
        result ~= "]";
        return result;
    }

• When mixed in, this creates an array of two-element arrays of string literals, one per symbol.
• [0] is the type, [1] is the value.

Determining precedence

    class AST(char[] expr) {
        alias expr text;
        AST!("(" ~ text ~ "+" ~ T.text ~ ")") opAdd(T)(T x) { return null; }
        AST!("(" ~ text ~ "*" ~ T.text ~ ")") opMul(T)(T x) { return null; }
        AST!( text ~ "([" ~ T.text ~ "])" ) opIndex(T)(T x) { return null; }
    }

    char[] getPrecedence(char[] expr) {
        char[] code = "typeof(";
        for (int i = 0; i < expr.length; ++i) {
            if (expr[i] >= 'A' && expr[i] <= 'Z')
                code ~= "(cast(AST!(`" ~ expr[i] ~ "`))(null))";
            else
                code ~= expr[i];
        }
        return code ~ ").text";
    }

    mixin(getPrecedence("A+B*C*D"))   // yields "(A+((B*C)*D))": opAdd also wraps its result in parentheses

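The same trick, letting the compiler's own parser reveal precedence and associativity, can be sketched for a current D2 compiler. This version is an adaptation, not the talk's code: it assumes a struct instead of a class, uses D2's `opBinary` in place of `opAdd`/`opMul` (so it covers every binary operator at once), and omits `opIndex`.

```d
// A type whose only job is to carry an expression string in its type.
struct AST(string expr)
{
    enum text = expr;

    // Every binary operator returns a new AST type whose text is the
    // fully parenthesised combination of both operands. The compiler's
    // own parser decides which operator is applied first.
    auto opBinary(string op, T)(T rhs)
    {
        return AST!("(" ~ text ~ op ~ T.text ~ ")")();
    }
}

// Rewrites each letter as a value of the matching AST type and asks for
// the text of the resulting expression's type.
string getPrecedence(string expr)
{
    string code = "typeof(";
    foreach (c; expr)
    {
        if (c >= 'A' && c <= 'Z')
            code ~= "AST!(`" ~ c ~ "`)()";
        else
            code ~= c;
    }
    return code ~ ").text";
}

// Multiplication binds tighter and associates to the left.
static assert(mixin(getPrecedence("A+B*C*D")) == "(A+((B*C)*D))");
static assert(mixin(getPrecedence("A-B-C")) == "((A-B)-C)");

void main()
{
    enum grouped = mixin(getPrecedence("A*B+C/D"));
    assert(grouped == "((A*B)+(C/D))");
}
```

No operator is ever evaluated: the expression only appears inside `typeof`, so the work happens purely in the type system.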
Conclusion
• Implementation and syntactic issues remain
• Syntax for runtime and compile-time reflection
• Macros and an extended __traits syntax should help.
• How to clean up mixin(), yet retain its power?
• Yet perfectly optimal code is already possible. Libraries can perform optimisations that previously required a compiler back-end.