A Language for the Compact Representation of Multiple Program Versions

A Language for the Compact Representation of Multiple Program Versions
Sébastien Donadio1,2, James Brodman3, Thomas Roeder4, Kamen Yotov4, Denis Barthou2, Albert Cohen5, María Jesús Garzarán3, David Padua3, and Keshav Pingali4
1 BULL S.A.
2 University of Versailles
3 University of Illinois at Urbana-Champaign
4 Cornell University
5 INRIA Futurs
International Workshop LCPC 2005

Outline
• Context in optimization for high performance
• Goals of this language
• Features of this language
• Examples (Daxpy & Dgemm)
• Conclusion

Context
• Complex architectures and fragile optimizations
• Unpredictable performance
• Architecture-specific and domain-specific optimizations
• Resort to empirical search
• Complement general-purpose optimizations with user-driven ones

Example: FFT performance
[Figure: FFT performance of the best available implementations (FFTW, Intel IPP, Spiral) compared with reasonable implementations (Numerical Recipes, GNU Scientific Library)]

Goals of X-Language
• Tool to help programmers generate and evaluate multiple versions of their programs:
    • Applying control and data structure transformations
    • Trying multiple transformation sequences and parameters
    • Evaluating the performance of each version and deciding which transformation variants to try

Goals of X-Language (cont.)
• The code must be portable across ISO C compilers:
    • Use #pragma annotations for the above tasks
    • Observable program semantics are not altered by the interpretation of these pragmas (assuming the transformations are legal)

Comparison with related work
[Figure: positioning chart. Systems shown: Compiler, Reflection, Spiral, X-Language, Tick C, Atlas, XLG. Labels: Transformation, Generation, General purpose, Domain specific, Black box, Manual]

Features of the language
• Elementary transformations (fission, strip mining, interchange, unrolling, …)
• Composition of transformations
• Conditional transformations (versioning)
• Procedural abstraction of transformations
• A mechanism to define new transformations
• No validity check is performed on the transformations

General schema of X-Language
[Figure: workflow diagram. Stages shown: code with pragmas, transformation descriptions, search, different versions, compile, execute and measure performance]

X-Language
• Naming loops or scopes
    #pragma xlang name loop1
    for(i=0;i<10;i++){a[i]=4;}
• Format of a transformation
    #pragma xlang stripmine loop1 4 ii
  Fields after "#pragma xlang", in order: transformation name (stripmine), loop name (loop1), parameters (4), names of additional loops generated by the transformation (ii)

Elementary transformations implemented in X-Language
• Full unrolling
• Partial unrolling
• Scalar promotion
• Interchange
• Loop fission
• Loop fusion
• Strip mining
• Lifting
• Software pipelining

Applying transformation
Before:
    #pragma xlang loop1
    for(i=min;i<4*max;i++)
      a[i]=b[i];
Transformation:
    #pragma xlang stripmine loop1 4 ii
After:
    #pragma xlang loop1
    for(i=min;i<4*max;i+=4) {
      int nl1;
      #pragma xlang ii
      for(nl1=0;nl1<4;nl1++)
        a[i+nl1]=b[i+nl1];
    }
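The effect of strip mining on this slide can be checked with a small C sketch. The function names `copy_plain` and `copy_stripmined4` are hypothetical, and the sketch assumes the trip count is a multiple of 4, which the slide's generated code also appears to assume.

```c
#include <stddef.h>

/* Reference loop: copy b[lo..hi) into a, one element per iteration. */
void copy_plain(double *a, const double *b, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++)
        a[i] = b[i];
}

/* Strip-mined by 4: the outer loop steps over strips, the inner loop
   (the generated "ii" loop) walks the 4 elements of one strip.
   Assumption: (hi - lo) is a multiple of 4. */
void copy_stripmined4(double *a, const double *b, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i += 4)
        for (size_t ii = 0; ii < 4; ii++)
            a[i + ii] = b[i + ii];
}
```

Both functions touch the same elements in the same order, so calling either on the same input should leave `a` in the same state.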

How to search for parameter values?
• Using multistage evaluation
• External script
    for(k=1;k<16;k=2*k)
    ‘{
    #pragma xlang loop1
    for(i=min;i<max;i++)
      a[i]=b[i];
    #pragma xlang stripmine loop1 ‘d(k) ii
    ‘}
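The search loop on this slide generates one program version per value of k. As a rough illustration only, the sketch below emulates that search at run time in plain C: the strip length is an ordinary function parameter rather than a staged constant, and `clock()` stands in for whatever measurement the real tool uses. The names `copy_strip` and `search_strip` are hypothetical.

```c
#include <stddef.h>
#include <time.h>

/* Strip-mined copy with the strip length as a run-time parameter.
   A cleanup loop handles a trailing partial strip, so any n works. */
void copy_strip(double *a, const double *b, size_t n, size_t strip)
{
    size_t i = 0;
    for (; i + strip <= n; i += strip)
        for (size_t ii = 0; ii < strip; ii++)
            a[i + ii] = b[i + ii];
    for (; i < n; i++)
        a[i] = b[i];
}

/* Try strip = 1, 2, 4, 8 (the same progression as the slide's k loop)
   and return the strip length with the smallest measured time. */
size_t search_strip(double *a, const double *b, size_t n, int repeats)
{
    size_t best = 1;
    double best_time = -1.0;
    for (size_t k = 1; k < 16; k = 2 * k) {
        clock_t start = clock();
        for (int r = 0; r < repeats; r++)
            copy_strip(a, b, n, k);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = k;
        }
    }
    return best;
}
```

The winner depends on the machine, the compiler, and timer resolution, so the returned value is not predictable; only the set of candidates is fixed.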

Composing transformations
Before:
    #pragma xlang loop1
    for(i=0;i<4;i++)
    #pragma xlang loop2
      for(j=min2;j<max2;j++)
        a[i]=b[j];
Transformations:
    #pragma xlang interchange loop1 loop2
    #pragma xlang fullunroll loop1
After:
    #pragma xlang loop2
    for(j=min2;j<max2;j++) {
      a[0]=b[j];
      a[1]=b[j];
      a[2]=b[j];
      a[3]=b[j];
    }
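A small C sketch can confirm that the interchanged and fully unrolled nest computes the same final values as the original nest on this slide. The function names are hypothetical; the bounds `lo` and `hi` play the role of `min2` and `max2`.

```c
#include <stddef.h>

/* Original nest: loop1 (i) outside, loop2 (j) inside. */
void nest_original(double a[4], const double *b, size_t lo, size_t hi)
{
    for (size_t i = 0; i < 4; i++)
        for (size_t j = lo; j < hi; j++)
            a[i] = b[j];
}

/* After interchange (j outside) and full unrolling of the i loop. */
void nest_transformed(double a[4], const double *b, size_t lo, size_t hi)
{
    for (size_t j = lo; j < hi; j++) {
        a[0] = b[j];
        a[1] = b[j];
        a[2] = b[j];
        a[3] = b[j];
    }
}
```

In both versions every `a[i]` ends up holding `b[hi - 1]` when the j range is non-empty, which is why the interchange is legal here. X-Language itself performs no such check.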

Analyses and Transformations
• Static analyses should also enable the design of smarter (higher-level) transformation primitives
• External tool to find information

Example with analysis
Original code:
    for(i=2;i<2*N;i+=2)
      {u[i]=u[i-1]+u[i-2]; u[i+1]=u[i]+u[i-1];}
Without interference graph:
    for(i=2;i<2*N;i+=2)
      {u_1=u[i-1]; u_2=u[i-2];
       u_0 = u_1 + u_2; u_1 = u_0 + u_1;
       u[i]=u_0; u[i+1]=u_1;}
With interference graph:
    u_0=u[0]; u_1=u[1];
    for(i=2;i<2*N;i+=2)
      {u_0 = u_1 + u_0; u_1 = u_0 + u_1;
       u[i]=u_0; u[i+1]=u_1;}
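The three variants of this recurrence can be compared in C. This is a sketch under assumptions: the function names are hypothetical, the element type is `long`, the array holds at least 2*n elements with n >= 1, and the loop-carried variant reuses `u_0` for the oldest value so that no loads remain inside the loop.

```c
/* Original loop: every operand is read from memory. */
void recur_original(long *u, int n)
{
    for (int i = 2; i < 2 * n; i += 2) {
        u[i] = u[i - 1] + u[i - 2];
        u[i + 1] = u[i] + u[i - 1];
    }
}

/* Scalars introduced, but reloaded from memory on every iteration. */
void recur_reload(long *u, int n)
{
    for (int i = 2; i < 2 * n; i += 2) {
        long u_1 = u[i - 1];
        long u_2 = u[i - 2];
        long u_0 = u_1 + u_2;
        u_1 = u_0 + u_1;
        u[i] = u_0;
        u[i + 1] = u_1;
    }
}

/* Scalars carried across iterations: loads are hoisted out of the loop.
   On entry to each iteration, u_0 == u[i-2] and u_1 == u[i-1]. */
void recur_carried(long *u, int n)
{
    long u_0 = u[0];
    long u_1 = u[1];
    for (int i = 2; i < 2 * n; i += 2) {
        u_0 = u_1 + u_0;   /* new u[i]   */
        u_1 = u_0 + u_1;   /* new u[i+1] */
        u[i] = u_0;
        u[i + 1] = u_1;
    }
}
```

Starting from u[0] = u[1] = 1, all three versions fill the array with the same Fibonacci-style sequence.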

Extending the X-Language
Rewriting rule:
Pattern before transformation:
    #pragma xlang name iloop
    for (i = 0; i < N; i++) {<body>}
    %
Pattern after transformation:
    #pragma xlang name iiloop1
    for (ii = 0; ii < (N/4)*4; ii += 4)
    #pragma xlang name iloop1
      for (i = ii; i < ii+4; i++) {<body>}
    #pragma xlang name iloop2
    for (i = (N/4)*4; i < N; i++) {<body>}
    %%
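To see what this rewriting rule produces, the sketch below instantiates both patterns in C with a concrete body (doubling each element). The body and the function names are illustrative assumptions; the loop bounds follow the rule, including the remainder loop for trip counts that are not multiples of 4.

```c
/* Pattern before: a single loop over [0, n). */
void scale_plain(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= 2.0;
}

/* Pattern after: strips of 4 (iiloop1/iloop1) plus a remainder loop (iloop2). */
void scale_rewritten(double *x, int n)
{
    for (int ii = 0; ii < (n / 4) * 4; ii += 4)
        for (int i = ii; i < ii + 4; i++)
            x[i] *= 2.0;
    for (int i = (n / 4) * 4; i < n; i++)
        x[i] *= 2.0;
}
```

With n = 10 the strip loops cover indices 0 to 7 and the remainder loop covers 8 and 9, so each element is processed exactly once.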

Daxpy Example
    #pragma xlang name loop1
    for(k=0;k<2000;k++)
      Y[k]=alpha*X[k]+Y[k];
We can modify the value of N.
    /** A few values tested for unrolling factor - Different generated version **/
    #pragma xlang transform stripmine loop1 k N;
    #pragma xlang transform scalarize-in X in loop1
    #pragma xlang transform lift l1.loads before loop1
    #pragma xlang transform scalarize-out Y in loop1
    #pragma xlang transform lift loop1.loads before loop1
    #pragma xlang transform lift loop1.stores after loop1
    #pragma xlang transform fullunroll loop1.loads
    #pragma xlang transform fullunroll loop1.stores
    #pragma xlang transform fullunroll loop1

Daxpy Example: different generated versions
Unrolling factor: 8
    for(k=0;k<2000;k=k+8){
      double x_0 = X[k+0]; double x_1 = X[k+1]; double x_2 = X[k+2]; …
      y_0=alpha*x_0+y_0; y_1=alpha*x_1+y_1; y_2=alpha*x_2+y_2; y_3=alpha*x_3+y_3; …
      Y[k+0] = y_0; Y[k+1] = y_1; Y[k+2] = y_2; Y[k+3] = y_3; … }
Unrolling factor: 4
    for(k=0;k<2000;k=k+4){
      double x_0 = X[k+0]; double x_1 = X[k+1]; double x_2 = X[k+2]; double x_3 = X[k+3];
      double y_0 = Y[k+0]; double y_1 = Y[k+1]; double y_2 = Y[k+2]; double y_3 = Y[k+3];
      y_0=alpha*x_0+y_0; y_1=alpha*x_1+y_1; y_2=alpha*x_2+y_2; y_3=alpha*x_3+y_3;
      Y[k+0] = y_0; Y[k+1] = y_1; Y[k+2] = y_2; Y[k+3] = y_3; }
Unrolling factor: 2
    for(k=0;k<2000;k=k+2){
      double x_0 = X[k+0]; double x_1 = X[k+1];
      double y_0 = Y[k+0]; double y_1 = Y[k+1];
      y_0=alpha*x_0+y_0; y_1=alpha*x_1+y_1;
      Y[k+0] = y_0; Y[k+1] = y_1; }
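The factor-4 version above can be wrapped in a function and compared against a plain daxpy loop. This is a minimal sketch: the function names and the `n` parameter are assumptions added for illustration, and `n` is assumed to be a multiple of 4 (as 2000 is on the slide).

```c
/* Reference daxpy: Y = alpha * X + Y. */
void daxpy_ref(int n, double alpha, const double *X, double *Y)
{
    for (int k = 0; k < n; k++)
        Y[k] = alpha * X[k] + Y[k];
}

/* Unrolled by 4 with loads lifted to the top and stores to the bottom
   of the loop body, following the generated version on the slide.
   Assumption: n is a multiple of 4. */
void daxpy_unroll4(int n, double alpha, const double *X, double *Y)
{
    for (int k = 0; k < n; k = k + 4) {
        double x_0 = X[k + 0], x_1 = X[k + 1], x_2 = X[k + 2], x_3 = X[k + 3];
        double y_0 = Y[k + 0], y_1 = Y[k + 1], y_2 = Y[k + 2], y_3 = Y[k + 3];
        y_0 = alpha * x_0 + y_0;
        y_1 = alpha * x_1 + y_1;
        y_2 = alpha * x_2 + y_2;
        y_3 = alpha * x_3 + y_3;
        Y[k + 0] = y_0;
        Y[k + 1] = y_1;
        Y[k + 2] = y_2;
        Y[k + 3] = y_3;
    }
}
```

Each element undergoes the same single multiply and add in both versions, so the results should match element for element.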

Matrix Multiply (Loop Declaration)
• The DGEMM example:
    • Matrix multiplication
• Problems:
    • Data locality
    • Scheduling
    #pragma xlang name iloop
    for (i = 0; i < NB; i++)
    #pragma xlang name jloop
      for (j = 0; j < NB; j++)
    #pragma xlang name kloop
        for (k = 0; k < NB; k++) {
          c[i][j]=c[i][j]+a[i][k]*b[k][j];
        }

Matrix Multiply (Transformation Declaration)
Sequence of transformations for Itanium:
    #pragma xlang transform stripmine iloop NU NUloop
    #pragma xlang transform stripmine jloop MU MUloop
    #pragma xlang transform interchange kloop MUloop
    #pragma xlang transform interchange jloop NUloop
    #pragma xlang transform interchange kloop NUloop
    #pragma xlang transform fullunroll NUloop
    #pragma xlang transform fullunroll MUloop
    #pragma xlang transform scalarize_in b in kloop
    #pragma xlang transform scalarize_in a in kloop
    #pragma xlang transform scalarize_in&out c in kloop
    #pragma xlang transform lift kloop.loads before kloop
    #pragma xlang transform lift kloop.stores after kloop

Matrix Multiply (Transformation Sequence)
    #pragma xlang name iloop
    for(i = 0; i < NB; i++){
    #pragma xlang name jloop
      for(j = 0; j < NB; j += 4){
    #pragma xlang name kloop.loads
        {c_0_0 = c[i+0][j+0]; c_0_1 = c[i+0][j+1];
         c_0_2 = c[i+0][j+2]; c_0_3 = c[i+0][j+3]; }
    #pragma xlang name kloop
        for(k = 0; k < NB; k++){
          {a_0 = a[i+0][k]; a_1 = a[i+0][k]; a_2 = a[i+0][k]; a_3 = a[i+0][k];}
          {b_0 = b[k][j+0]; b_1 = b[k][j+1]; b_2 = b[k][j+2]; b_3 = b[k][j+3];}
          {c_0_0=c_0_0+a_0*b_0; c_0_1=c_0_1+a_1*b_1;
           c_0_2=c_0_2+a_2*b_2; c_0_3=c_0_3+a_3*b_3;}
          ...
        }
    #pragma xlang name kloop.stores
        {c[i+0][j+0] = c_0_0; c[i+0][j+1] = c_0_1;
         c[i+0][j+2] = c_0_2; c[i+0][j+3] = c_0_3;}
      }}
    ... // Remainder code
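The transformed kernel can be compared against the original triple loop in C. The sketch below is a simplified instance, not the generated code itself: it unrolls only the j loop by 4 (matching the visible part of the slide), loads `a[i][k]` once instead of four times, fixes NB at a hypothetical value of 8, and omits remainder code by assuming NB is a multiple of 4.

```c
#define NB 8  /* block size; assumed here to be a multiple of 4 */

/* Original loop nest from the loop declaration slide. */
void mm_ref(double c[NB][NB], double a[NB][NB], double b[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* j unrolled by 4, c scalarized, loads lifted before and stores after kloop. */
void mm_unrolled_j4(double c[NB][NB], double a[NB][NB], double b[NB][NB])
{
    for (int i = 0; i < NB; i++) {
        for (int j = 0; j < NB; j += 4) {
            /* kloop.loads */
            double c_0 = c[i][j + 0], c_1 = c[i][j + 1];
            double c_2 = c[i][j + 2], c_3 = c[i][j + 3];
            for (int k = 0; k < NB; k++) {
                double a_0 = a[i][k];
                c_0 = c_0 + a_0 * b[k][j + 0];
                c_1 = c_1 + a_0 * b[k][j + 1];
                c_2 = c_2 + a_0 * b[k][j + 2];
                c_3 = c_3 + a_0 * b[k][j + 3];
            }
            /* kloop.stores */
            c[i][j + 0] = c_0;
            c[i][j + 1] = c_1;
            c[i][j + 2] = c_2;
            c[i][j + 3] = c_3;
        }
    }
}
```

Each `c[i][j]` accumulates its products in the same k order in both versions, so the two functions should agree on every entry.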

Block copies
• Block matrix multiplication: better performance if the matrix blocks are contiguous in memory (TLB)
• Poor performance of a copy written in C
• Resort to a tool that generates specific asm code
• The tool generates good code through search (XLG is an asm search tool)
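For reference, the copy in question can be written in C as below. This is only a sketch of what is being copied: the function name, the row-major layout, and the leading dimension parameter `ld` are assumptions, and the slides report that this kind of plain C copy performs poorly compared with searched asm code.

```c
#include <stddef.h>

/* Copy the nb x nb block whose top-left corner is (row0, col0) of a
   row-major matrix with leading dimension ld into a contiguous buffer. */
void copy_block(double *dst, const double *src, size_t ld,
                size_t row0, size_t col0, size_t nb)
{
    for (size_t i = 0; i < nb; i++)
        for (size_t j = 0; j < nb; j++)
            dst[i * nb + j] = src[(row0 + i) * ld + (col0 + j)];
}
```

After the copy, the block occupies nb * nb consecutive doubles, so the multiply kernel walks a compact region of memory instead of rows that are ld elements apart.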

Matrix Multiply (Results)

Conclusion
X-Language:
• Language designed to generate multiversion programs
• Multistage language with a flexible pattern-matching and rewriting language
• Experts can describe application-specific transformations and optimizations
• Transformations can be described with reuse, procedures, and conditionals

Future work
• Dependence analysis
• Going further in searching asm code transformations
• More transformations: vectorization, alignment, …
