
A Language for the Compact Representation of Multiple Program Versions



Presentation Transcript


  1. A Language for the Compact Representation of Multiple Program Versions
     Sébastien Donadio (1,2), James Brodman (3), Thomas Roeder (4), Kamen Yotov (4), Denis Barthou (2), Albert Cohen (5), María Jesús Garzarán (3), David Padua (3), and Keshav Pingali (4)
     (1) BULL S.A. (2) University of Versailles (3) University of Illinois at Urbana-Champaign (4) Cornell University (5) INRIA Futurs
     International Workshop LCPC 2005

  2. Outline
     • Context in optimization for high performance
     • Goals of this language
     • Features of this language
     • Examples (Daxpy & Dgemm)
     • Conclusion

  3. Context
     • Complex architecture and fragile optimizations
     • Unpredictable performance
     • Architecture, domain-specific optimizations
     • Resort to empirical search
     • Complement general-purpose optimizations with user-driven ones

  4. Example: FFT performance
     [Figure: best available implementations (FFTW, Intel IPP, Spiral) compared with reasonable implementations (Numerical Recipes, GNU Scientific Library)]

  5. Goals of X-Language
     • Tool to help programmers generate and evaluate multiple versions of their programs:
         • Applying control and data structure transformations
         • Trying multiple transformation sequences and parameters
         • Evaluating performance of each version and taking decisions about which transformation variants to try

  6. Goals of X-Language (cont.)
     • The code must be portable across ISO-C compilers:
         • Use #pragma annotations for the above tasks
         • Observable program semantics are not altered by the interpretation of these pragmas (assuming transformation legality)

  7. Comparison with related work
     [Diagram; labels: Compiler, Reflection, Spiral, Transformation, X-Language, Tick C, Atlas, XLG, General purpose, Generation, Domain specific, Black box, Manual]

  8. Features of the language
     • Elementary transformations (fission, stripmining, interchanging, unrolling,…)
     • Composition of transformations
     • Conditional transformations (versioning)
     • Procedural abstraction of transformations
     • A mechanism to define new transformations
     • No validity check is performed for the transformation

  9. General schema of X-Language
     [Diagram; stages: code with pragmas and transformation descriptions, search, different versions, compile, execute and measure performance]
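
The "execute and measure" stage of this schema can be sketched in plain C. This is only an illustration under stated assumptions: the two candidate versions, the `version_fn` type, and the helpers `time_version`, `pick_best`, and `search_demo` are hypothetical names, not part of X-Language, and the timing uses the standard `clock()` function rather than whatever measurement the authors' tool performs.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature shared by all candidate versions (hypothetical). */
typedef void (*version_fn)(double *a, const double *b, size_t n);

/* Version 0: the original loop. */
static void copy_plain(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i];
}

/* Version 1: strip-mined by 4; assumes n is a multiple of 4. */
static void copy_strip4(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i += 4)
        for (size_t ii = 0; ii < 4; ii++) a[i + ii] = b[i + ii];
}

/* Run one version `reps` times; return elapsed processor time in clock ticks. */
static clock_t time_version(version_fn f, double *a, const double *b,
                            size_t n, int reps) {
    clock_t start = clock();
    for (int r = 0; r < reps; r++) f(a, b, n);
    return clock() - start;
}

/* Measure every version and return the index of the fastest one. */
static size_t pick_best(const version_fn *versions, size_t count,
                        double *a, const double *b, size_t n) {
    assert(count > 0);
    size_t best = 0;
    clock_t best_time = time_version(versions[0], a, b, n, 100);
    for (size_t v = 1; v < count; v++) {
        clock_t t = time_version(versions[v], a, b, n, 100);
        if (t < best_time) { best_time = t; best = v; }
    }
    return best;
}

/* Example driver: search over the two versions above. */
static size_t search_demo(void) {
    static double a[1024], b[1024];
    for (size_t i = 0; i < 1024; i++) b[i] = (double)i;
    const version_fn versions[] = { copy_plain, copy_strip4 };
    return pick_best(versions, 2, a, b, 1024);
}
```

Which index `search_demo()` returns depends on the machine and compiler, which is the reason for measuring instead of predicting.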

  10. X-Language
     • Naming loops or scopes:
         #pragma xlang name loop1
         for(i=0;i<10;i++){a[i]=4;}
     • Format of a transformation:
         #pragma xlang stripmine loop1 4 ii
       Here "#pragma xlang" is the prefix, "stripmine" the transformation name, "loop1" the loop name, "4" the parameters, and "ii" the name of additional loops generated by the transformation

  11. Elementary transformations implemented in X-Language
     • Full unrolling
     • Partial unrolling
     • Scalar promote
     • Interchange
     • Loop fission
     • Loop fusion
     • Strip mining
     • Lifting
     • Software pipelining

  12. Applying transformation
     Before:
         #pragma xlang loop1
         for(i=min;i<4*max;i++) a[i]=b[i];
         #pragma xlang stripmine loop1 4 ii
     After:
         #pragma xlang loop1
         for(i=min;i<4*max;i+=4) {
           int nl1;
           #pragma xlang ii
           for(nl1=0;nl1<4;nl1++) a[i+nl1]=b[i+nl1];
         }

  13. How to search for parameter values?
     • Using multistage evaluation
     • External script
         for(k=1;k<16;k=2*k) ‘{
           #pragma xlang loop1
           for(i=min;i<max;i++) a[i]=b[i]
           #pragma xlang stripmine loop1 ‘d(k) ii
         ‘}
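
To make the staging concrete, the following C sketch prints the annotated code that the outer loop over k would produce, one version per strip-mine factor (k = 1, 2, 4, 8). It imitates the effect of the multistage construct with ordinary string formatting; it is not how the X-Language tool is implemented, and the helper names `emit_version` and `emit_all_versions` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the annotated loop with strip-mine factor k into buf.
   Returns what snprintf returns (number of characters needed). */
static int emit_version(char *buf, size_t size, int k) {
    assert(buf != NULL && size > 0);
    return snprintf(buf, size,
                    "#pragma xlang loop1\n"
                    "for(i=min;i<max;i++) a[i]=b[i];\n"
                    "#pragma xlang stripmine loop1 %d ii\n",
                    k);
}

/* Enumerate the same factors as the slide's outer loop and print
   one version per factor. Returns the number of versions emitted. */
static int emit_all_versions(FILE *out) {
    char buf[256];
    int count = 0;
    for (int k = 1; k < 16; k = 2 * k) {
        emit_version(buf, sizeof buf, k);
        fprintf(out, "/* version %d */\n%s\n", count, buf);
        count++;
    }
    return count;
}
```

Calling `emit_all_versions(stdout)` prints four annotated copies of the loop, each of which would then be transformed, compiled, and timed separately.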

  14. Composing transformations
     Before:
         #pragma xlang loop1
         for(i=0;i<4;i++)
           #pragma xlang loop2
           for(j=min2;j<max2;j++) a[i]=b[j]
         #pragma xlang interchange loop1 loop2
         #pragma xlang fullunroll loop1
     After:
         #pragma xlang loop2
         for(j=min2;j<max2;j++) { a[0]=b[j]; a[1]=b[j]; a[2]=b[j]; a[3]=b[j]; }
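
Since the language performs no validity check, it can help to confirm by hand that a composed sequence preserves results. The sketch below compares the loop nest before and after interchange plus full unrolling; the bounds `MIN2` and `MAX2` and the function names are hypothetical stand-ins chosen for the example.

```c
#include <assert.h>
#include <string.h>

enum { MIN2 = 0, MAX2 = 5 };  /* hypothetical values for min2 and max2 */

/* Loop nest as written before the transformations. */
static void original(double a[4], const double *b) {
    for (int i = 0; i < 4; i++)
        for (int j = MIN2; j < MAX2; j++)
            a[i] = b[j];
}

/* After interchange(loop1, loop2) and fullunroll(loop1). */
static void transformed(double a[4], const double *b) {
    for (int j = MIN2; j < MAX2; j++) {
        a[0] = b[j]; a[1] = b[j]; a[2] = b[j]; a[3] = b[j];
    }
}

/* Returns 1 when both versions leave the same contents in a. */
static int self_check(void) {
    double b[MAX2] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double a1[4] = {0}, a2[4] = {0};
    original(a1, b);
    transformed(a2, b);
    assert(a1[0] == b[MAX2 - 1]);  /* the last write of b[j] wins */
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Both versions end with every `a[i]` equal to the last `b[j]` written, so the interchange is legal for this particular nest.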

  15. Analyses and Transformations
     • Static analyses should also enable the design of smarter (higher level) transformation primitives
     • External tool to find information

  16. Example with analysis
     Original:
         for(i=2;i<2*N;i+=2) {u[i]=u[i-1]+u[i-2]; u[i+1]=u[i]+u[i-1];}
     Without interference graph:
         for(i=2;i<2*N;i+=2) {u_1=u[i-1]; u_2=u[i-2]; u_0 = u_1 + u_2; u_1 = u_0 + u_1; u[i]=u_0; u[i+1]=u_1;}
     With interference graph:
         u_0=u[0]; u_1=u[1];
         for(i=2;i<2*N;i+=2) {u_0 = u_1 + u_0; u_1 = u_0 + u_1; u[i]=u_0; u[i+1]=u_1;}
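
The three variants of this recurrence can be checked against each other with a small C program. This is a sketch under assumptions: `N` is a hypothetical size, and `with_graph` is one reading of the "with interference graph" variant, in which the scalars `u_0` and `u_1` stay live across iterations so that no loads from `u` remain inside the loop.

```c
#include <assert.h>
#include <string.h>

enum { N = 8 };  /* hypothetical size; each array holds 2*N elements */

/* Original loop: every operand is read from memory. */
static void original(long *u) {
    for (int i = 2; i < 2 * N; i += 2) {
        u[i] = u[i - 1] + u[i - 2];
        u[i + 1] = u[i] + u[i - 1];
    }
}

/* Scalarized without reuse: two loads per iteration remain. */
static void without_graph(long *u) {
    long u_0, u_1, u_2;
    for (int i = 2; i < 2 * N; i += 2) {
        u_1 = u[i - 1]; u_2 = u[i - 2];
        u_0 = u_1 + u_2;
        u_1 = u_0 + u_1;
        u[i] = u_0; u[i + 1] = u_1;
    }
}

/* Scalars carried across iterations: loads hoisted out of the loop. */
static void with_graph(long *u) {
    long u_0 = u[0], u_1 = u[1];
    for (int i = 2; i < 2 * N; i += 2) {
        u_0 = u_1 + u_0;  /* u_0 held u[i-2]; it now holds u[i] */
        u_1 = u_0 + u_1;  /* u_1 held u[i-1]; it now holds u[i+1] */
        u[i] = u_0; u[i + 1] = u_1;
    }
}

/* Returns 1 when all three variants compute the same array. */
static int self_check(void) {
    long a[2 * N] = {1, 1}, b[2 * N] = {1, 1}, c[2 * N] = {1, 1};
    original(a);
    without_graph(b);
    with_graph(c);
    assert(a[2] == 2 && a[3] == 3);  /* first step of the recurrence */
    return memcmp(a, b, sizeof a) == 0 && memcmp(a, c, sizeof a) == 0;
}
```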

  17. Extending the X-Language
     Rewriting rule (pattern before, then pattern after transformation):
         #pragma xlang name iloop
         for (i = 0; i < N; i++) { <body> }
         %
         #pragma xlang name iiloop1
         for (ii = 0; ii < (N/4)*4; ii += 4)
           #pragma xlang name iloop1
           for (i = ii; i < ii+4; i++) { <body> }
         #pragma xlang name iloop2
         for (i = (N/4)*4; i < N; i++) { <body> }
         %%
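
The rewriting rule above amounts to strip-mining by 4 with a remainder loop. A minimal C sketch, using a hypothetical loop body `a[i] += i` in place of `<body>`, shows that the rewritten form covers exactly the same iterations for any trip count, including ones not divisible by 4.

```c
#include <assert.h>
#include <string.h>

/* Pattern before: a single loop over [0, n). */
static void before(int *a, int n) {
    for (int i = 0; i < n; i++) a[i] += i;
}

/* Pattern after: blocks of 4 iterations, then a remainder loop. */
static void after(int *a, int n) {
    for (int ii = 0; ii < (n / 4) * 4; ii += 4)   /* iiloop1 */
        for (int i = ii; i < ii + 4; i++)         /* iloop1 */
            a[i] += i;
    for (int i = (n / 4) * 4; i < n; i++)         /* iloop2: leftover iterations */
        a[i] += i;
}

/* Returns 1 when both forms agree for every n from 0 to 16. */
static int self_check(void) {
    for (int n = 0; n <= 16; n++) {
        int x[16] = {0}, y[16] = {0};
        before(x, n);
        after(y, n);
        if (memcmp(x, y, sizeof x) != 0) return 0;
    }
    assert((7 / 4) * 4 == 4);  /* integer division rounds the bound down */
    return 1;
}
```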

  18. Daxpy Example
         #pragma xlang name loop1
         for(k=0;k<2000;k++) Y[k]=alpha*X[k]+Y[k];
     We can modify the values of N
         /** A few values tested for unrolling factor - Different generated version **/
         #pragma xlang transform stripmine loop1 k N;
         #pragma xlang transform scalarize-in X in loop1
         #pragma xlang transform lift l1.loads before loop1
         #pragma xlang transform scalarize-out Y in loop1
         #pragma xlang transform lift loop1.loads before loop1
         #pragma xlang transform lift loop1.stores after loop1
         #pragma xlang transform fullunroll loop1.loads
         #pragma xlang transform fullunroll loop1.stores
         #pragma xlang transform fullunroll loop1

  19. Daxpy Example: different generated versions
     Unrolling factor: 8
         for(k=0;k<2000;k=k+16){
           double x_0 = X[k+0]; double x_1 = X[k+1]; double x_2 = X[k+2]; …
           y_0=alpha*x_0+y_0; y_1=alpha*x_1+y_1; y_2=alpha*x_2+y_2; y_3=alpha*x_3+y_3; …
           Y[k+0] = y_0; Y[k+1] = y_1; Y[k+2] = y_2; Y[k+3] = y_3; …
         }
     Unrolling factor: 4
         for(k=0;k<2000;k=k+4){
           double x_0 = X[k+0]; double x_1 = X[k+1]; double x_2 = X[k+2]; double x_3 = X[k+3];
           double y_0 = Y[k+0]; double y_1 = Y[k+1]; double y_2 = Y[k+2]; double y_3 = Y[k+3];
           y_0=alpha*x_0+y_0; y_1=alpha*x_1+y_1; y_2=alpha*x_2+y_2; y_3=alpha*x_3+y_3;
           Y[k+0] = y_0; Y[k+1] = y_1; Y[k+2] = y_2; Y[k+3] = y_3;
         }
     Unrolling factor: 2
         for(k=0;k<2000;k=k+2){
           double x_0 = X[k+0]; double x_1 = X[k+1];
           double y_0 = Y[k+0]; double y_1 = Y[k+1];
           y_0=alpha*x_0+y_0; y_1=alpha*x_1+y_1;
           Y[k+0] = y_0; Y[k+1] = y_1;
         }
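
As a sanity check on the generated code, the factor-4 version can be compared with a reference daxpy loop. The sketch below assumes a length of 2000 as on the slide; the function names and the test data are hypothetical, and the inputs are small integers so that the floating-point comparison is exact.

```c
#include <assert.h>
#include <string.h>

enum { LEN = 2000 };  /* loop bound from the slide; a multiple of 4 */

/* Reference daxpy: Y = alpha*X + Y. */
static void daxpy_ref(double *Y, const double *X, double alpha) {
    for (int k = 0; k < LEN; k++) Y[k] = alpha * X[k] + Y[k];
}

/* Unrolled by 4 with loads lifted before and stores after the computation. */
static void daxpy_unroll4(double *Y, const double *X, double alpha) {
    for (int k = 0; k < LEN; k = k + 4) {
        double x_0 = X[k + 0], x_1 = X[k + 1], x_2 = X[k + 2], x_3 = X[k + 3];
        double y_0 = Y[k + 0], y_1 = Y[k + 1], y_2 = Y[k + 2], y_3 = Y[k + 3];
        y_0 = alpha * x_0 + y_0; y_1 = alpha * x_1 + y_1;
        y_2 = alpha * x_2 + y_2; y_3 = alpha * x_3 + y_3;
        Y[k + 0] = y_0; Y[k + 1] = y_1; Y[k + 2] = y_2; Y[k + 3] = y_3;
    }
}

/* Returns 1 when both versions produce identical Y. */
static int self_check(void) {
    static double X[LEN], Y1[LEN], Y2[LEN];
    for (int k = 0; k < LEN; k++) {
        X[k] = (double)k;               /* integer-valued data keeps results exact */
        Y1[k] = Y2[k] = (double)(2 * k);
    }
    daxpy_ref(Y1, X, 3.0);
    daxpy_unroll4(Y2, X, 3.0);
    assert(Y1[1] == 5.0);               /* 3*1 + 2 */
    return memcmp(Y1, Y2, sizeof Y1) == 0;
}
```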

  20. Matrix Multiply (Loop Declaration)
     • The DGEMM example:
         • Matrix Multiplication
     • Problems:
         • Data locality
         • Scheduling
         #pragma xlang name iloop
         for (i = 0; i < NB; i++)
           #pragma xlang name jloop
           for (j = 0; j < NB; j++)
             #pragma xlang name kloop
             for (k = 0; k < NB; k++) {
               c[i][j]=c[i][j]+a[i][k]*b[k][j];
             }

  21. Matrix Multiply (Transformation Declaration)
     Sequence of transformations for Itanium:
         #pragma xlang transform stripmine iloop NU NUloop
         #pragma xlang transform stripmine jloop MU MUloop
         #pragma xlang transform interchange kloop MUloop
         #pragma xlang transform interchange jloop NUloop
         #pragma xlang transform interchange kloop NUloop
         #pragma xlang transform fullunroll NUloop
         #pragma xlang transform fullunroll MUloop
         #pragma xlang transform scalarize_in b in kloop
         #pragma xlang transform scalarize_in a in kloop
         #pragma xlang transform scalarize_in&out c in kloop
         #pragma xlang transform lift kloop.loads before kloop
         #pragma xlang transform lift kloop.stores after kloop

  22. Matrix Multiply (Transformation Sequence)
         #pragma xlang name iloop
         for(i = 0; i < NB; i++){
           #pragma xlang name jloop
           for(j = 0; j < NB; j += 4){
             #pragma xlang name kloop.loads
             {c_0_0 = c[i+0][j+0]; c_0_1 = c[i+0][j+1]; c_0_2 = c[i+0][j+2]; c_0_3 = c[i+0][j+3];}
             #pragma xlang name kloop
             for(k = 0; k < NB; k++){
               {a_0 = a[i+0][k]; a_1 = a[i+0][k]; a_2 = a[i+0][k]; a_3 = a[i+0][k];}
               {b_0 = b[k][j+0]; b_1 = b[k][j+1]; b_2 = b[k][j+2]; b_3 = b[k][j+3];}
               {c_0_0=c_0_0+a_0*b_0; c_0_1=c_0_1+a_1*b_1; c_0_2=c_0_2+a_2*b_2; c_0_3=c_0_3+a_3*b_3;}
               ...
             }
             #pragma xlang name kloop.stores
             {c[i+0][j+0] = c_0_0; c[i+0][j+1] = c_0_1; c[i+0][j+2] = c_0_2; c[i+0][j+3] = c_0_3;}
           }
         }
         ... // Remainder code
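
The shape of the transformed kernel can be tried out in plain C. The sketch below is a simplified 1x4 register tile, not the full Itanium sequence: `NB` is a hypothetical block size that is a multiple of 4 (so no remainder code is needed), and a single scalar `a_0` stands in for the slide's `a_0` to `a_3`, which all hold the same element `a[i][k]`.

```c
#include <assert.h>
#include <string.h>

enum { NB = 8 };  /* hypothetical block size, a multiple of 4 */

/* Reference triple loop: c += a * b. */
static void mm_ref(double c[NB][NB], double a[NB][NB], double b[NB][NB]) {
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* jloop unrolled by 4, c scalarized, loads lifted before and stores after kloop. */
static void mm_tiled(double c[NB][NB], double a[NB][NB], double b[NB][NB]) {
    for (int i = 0; i < NB; i++) {
        for (int j = 0; j < NB; j += 4) {
            /* kloop.loads */
            double c_0 = c[i][j + 0], c_1 = c[i][j + 1];
            double c_2 = c[i][j + 2], c_3 = c[i][j + 3];
            for (int k = 0; k < NB; k++) {
                double a_0 = a[i][k];
                c_0 += a_0 * b[k][j + 0];
                c_1 += a_0 * b[k][j + 1];
                c_2 += a_0 * b[k][j + 2];
                c_3 += a_0 * b[k][j + 3];
            }
            /* kloop.stores */
            c[i][j + 0] = c_0; c[i][j + 1] = c_1;
            c[i][j + 2] = c_2; c[i][j + 3] = c_3;
        }
    }
}

/* Returns 1 when both kernels produce the same c. */
static int self_check(void) {
    static double a[NB][NB], b[NB][NB], c1[NB][NB], c2[NB][NB];
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            a[i][j] = (double)(i + j);   /* small integers keep sums exact */
            b[i][j] = (double)(i - j);
            c1[i][j] = c2[i][j] = 1.0;
        }
    mm_ref(c1, a, b);
    mm_tiled(c2, a, b);
    assert(NB % 4 == 0);
    return memcmp(c1, c2, sizeof c1) == 0;
}
```

Keeping the four `c` values in scalars across the whole `k` loop is what removes the repeated loads and stores of `c[i][j]` from the inner loop.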

  23. Block copies
     • Block Matrix Multiplication: better performance if matrices are contiguous in memory (TLB)
     • Poor performance of a copy written in C
     • Resort to a tool generating specific asm code
     • The tool generates good code through search (XLG searches over asm code)

  24. Matrix Multiply (Results)

  25. Conclusion
     X-Language:
     • Language designed to generate multiversion programs
     • Multistage language with a flexible pattern-matching and rewriting language
     • Transformations can be described with reuse, procedures, and conditionals
     • Experts can describe application-specific transformations and optimizations

  26. Future work
     • Dependence analysis
     • Going further in searching over asm code transformations
     • More transformations: vectorization, alignment,…
