
*partially supported by ST Microelectronics

A Retargetable Preprocessor for Multimedia Instructions* (work in progress). INRIA. F. Bodin, G. Pokam, J. Simonnet.


Presentation Transcript


  1. A Retargetable Preprocessor for Multimedia Instructions* (work in progress). INRIA. F. Bodin, G. Pokam, J. Simonnet. *Partially supported by ST Microelectronics.

  2. Multimedia Instructions
     • Instruction set extensions to achieve high performance
       • many different ones
       • crucial for embedded systems
       • difficult to use
     • Retargetability is an issue

  3. Multimedia Instructions
     • Exploit subword parallelism
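To make "subword parallelism" concrete, here is a minimal portable C sketch that adds four unsigned bytes packed in one 32-bit word using a single ordinary add plus masking. The helper names `pack4`, `lane`, and `add4x8` are hypothetical and do not come from the slides; a real multimedia instruction would do this in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four unsigned bytes into one 32-bit word, b0 in the lowest lane.
   (Hypothetical helper, not from the slides.) */
static uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Extract lane k (0..3) from a packed word. */
static uint8_t lane(uint32_t w, int k)
{
    assert(k >= 0 && k < 4);
    return (uint8_t)(w >> (8 * k));
}

/* Lane-wise add modulo 256: carries never cross a lane boundary. */
static uint32_t add4x8(uint32_t x, uint32_t y)
{
    const uint32_t H = 0x80808080u;       /* top bit of every lane */
    uint32_t low = (x & ~H) + (y & ~H);   /* add the low 7 bits of each lane */
    return low ^ ((x ^ y) & H);           /* fold in the top bits, dropping carry-out */
}
```

For instance, `add4x8(pack4(1, 2, 3, 4), pack4(10, 20, 30, 40))` equals `pack4(11, 22, 33, 44)`, and an overflow in lane 0 (255 + 1) wraps to 0 without disturbing lane 1.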

  4. An Example (Trimedia)

     Original loop:

         char *back, *forward, *idct, *destination;
         for (i = 0; i < 64; i += 1) {
             destination[i] = ((back[i] + forward[i] + 1) >> 1) + idct[i];
         }

     Rewritten with Trimedia operations (four bytes per iteration):

         int *i_back = (int *) back;
         int *i_forward = (int *) forward;
         int *i_idct = (int *) idct;
         int *i_dest = (int *) destination;
         for (i = 0; i < 16; i += 1) {
             temp = QUADAVG(i_back[i], i_forward[i]);
             i_dest[i] = DSPUQUADADDUI(temp, i_idct[i]);
         }
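The averaging half of the Trimedia example can be checked on any machine with a small emulation. The sketch below assumes that a quad-average operation computes the rounded average `(a + b + 1) >> 1` on each unsigned byte lane; `quad_avg_u8` and `avg_bytes` are hypothetical stand-ins, not the Trimedia intrinsics themselves, and the clipped add is left out because its exact semantics are not given on the slide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a quad-average operation: rounded average
   of each unsigned byte lane, (a + b + 1) >> 1. */
static uint32_t quad_avg_u8(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int k = 0; k < 4; k++) {
        uint32_t a = (x >> (8 * k)) & 0xFFu;
        uint32_t b = (y >> (8 * k)) & 0xFFu;
        r |= ((a + b + 1u) >> 1) << (8 * k);   /* result is at most 255 */
    }
    return r;
}

/* Average two byte arrays four elements at a time; n must be a multiple
   of 4. memcpy replaces the pointer casts of the slide so that the code
   does not depend on alignment or aliasing assumptions. */
static void avg_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t wa, wb, wr;
        memcpy(&wa, a + i, 4);
        memcpy(&wb, b + i, 4);
        wr = quad_avg_u8(wa, wb);
        memcpy(dst + i, &wr, 4);
    }
}
```

Because the operation is purely lane-wise, the result matches the scalar loop regardless of byte order. The slide's direct `int *` casts additionally require the arrays to be word-aligned, which is presumably why alignment is treated as its own phase.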

  5. MMI Automatic Exploitation
     [Figure: compilation flow. Labels: Src code; Preprocessing; Vectorization [Bik01] [Krall00] [...]; Alignment; Loop Unrolling; Idioms/MMI Recognition; Code Generation; [Larsen00] [Leupers2000]; machine independent; machine dependent.]

  6. The MMI Recognition Phase
     • Find the instructions available on the machine
       • after vectorization
       • after unrolling
     • User interaction
       • fast retargetability
       • not only for compiler writers
       • no compiler recompilation needed

  7. An MMI Example

         temp[i] = (back[i] + forward[i] + 1) >> 1;

     rather than

         t1[i] = (back[i] + forward[i]);
         t2[i] = t1[i] + 1;
         temp[i] = t2[i] >> 1;

  8. SWARecog: a Retargetable Engine for MMI
     • Front-end independent
       • CoSy
       • Sage++
     • Retargetable
       • configurable intermediate form
     • Uses a rewriting system based on U. Assmann's work [Assmann96]

  9. An Overview of SWARecog
     [Figure: overview diagram. Labels: CoSy; Sage++; IR description based.]

  10. The Intermediate Format
      • Identical for code and rules
      • Attributes declaration
      • Node declaration
      • Edge declaration

          [NODES] Operator:ENUM = {cast, mul, add, sright, assg,...};
          [NODES] VariableName:STRING = {};
          [EDGES] distance:INTEGER = {} DEFAULT 0;

          NODELabel Assign:
            NodeType = {operator}
            Operator = {assg}
            ValueType = {int}

          (flowdep ObjectAddr:14 Assign:8 1)
          (flowdep Plus:9 Assign:8 2)
          [negated] (same ObjectAddr:14 ObjectAddr:11 0)

  11. A Rule Description Example

      b = a+a  →  b = a<<1

          NODELabel v:
            NodeType = {scalar}
            Operator = {obj}
            Aliased = {0}
            ValueType = {int}
            VariableName = {*}
            LoopSector = {body}

      [Figure: pattern graphs for v + v and v << 1.]

          RULE [1] MulToShift:
            (flowdep v:1 Plus:6 0)
            (flowdep v:2 Plus:6 0)
            (flowdep Plus:6 Exp:0 *) 1
            (same v:1 v:2 0)
            (same v:2 v:1 0)

            (flowdep v:1 Shift:7 1)
            (flowdep IntConst1:8 Shift:7 2)
            (flowdep Shift:7 Exp:0 *) 1
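The effect of a rule like MulToShift can be sketched on a tiny expression tree in C. This is only an illustration under stated assumptions: the `Node` type, `same_var`, and `rewrite_add_to_shift` are hypothetical, the real engine works on an attributed graph with `flowdep` and `same` edges, and the rule's `Aliased` and `LoopSector` conditions are not modeled here.

```c
#include <assert.h>
#include <string.h>

typedef enum { OP_VAR, OP_CONST, OP_ADD, OP_SHL } Op;

typedef struct Node {
    Op op;
    const char *name;         /* OP_VAR only */
    int value;                /* OP_CONST: the constant; OP_VAR: current value */
    struct Node *lhs, *rhs;   /* operands of OP_ADD and OP_SHL */
} Node;

/* Plays the role of the rule's "same" edge: both leaves are the same variable. */
static int same_var(const Node *a, const Node *b)
{
    return a->op == OP_VAR && b->op == OP_VAR && strcmp(a->name, b->name) == 0;
}

/* Rewrite v + v into v << 1 in place. `one` is a caller-supplied constant
   node holding 1. Returns 1 if the rule fired, 0 otherwise. */
static int rewrite_add_to_shift(Node *n, Node *one)
{
    assert(one->op == OP_CONST && one->value == 1);
    if (n->op != OP_ADD || !same_var(n->lhs, n->rhs))
        return 0;
    n->op = OP_SHL;
    n->rhs = one;
    return 1;
}

/* Evaluate a tree; used only to check that the rewrite preserves the value.
   The shift is only equivalent for non-negative values that do not overflow. */
static int eval(const Node *n)
{
    switch (n->op) {
    case OP_ADD: return eval(n->lhs) + eval(n->rhs);
    case OP_SHL: return eval(n->lhs) << eval(n->rhs);
    default:     return n->value;   /* OP_VAR and OP_CONST */
    }
}
```

Applying the rewrite to `a + a` with `a = 21` turns the root into a shift node and leaves the value at 42, while `a + b` is left untouched because the two leaves are not the same variable.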

  12. Example-1

      Input (loop unrolled by 4):

          /*$pragma[VectorLoop("NoAlias")]*/
          for (i = xa; i < xb; i = i+4){
              sum = sum + (s[i] * om[i]);
              sum = sum + (s[i+1] * om[i+1]);
              sum = sum + (s[i+2] * om[i+2]);
              sum = sum + (s[i+3] * om[i+3]);
          }

      Output:

          for (i = xa ; i < xb ; i = i + 4){
              sum = sum + dualDotProd(packCont(s[i], s[i + 1]),
                                      packCont(om[i], om[i + 1]));
              sum = sum + dualDotProd(packCont(s[i + 2], s[i + 3]),
                                      packCont(om[i + 2], om[i + 3]));
          }
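The slide does not define `packCont` or `dualDotProd`, so the following C sketch assumes the simplest plausible semantics: `packCont(a, b)` places two signed 16-bit values in one 32-bit word (first argument in lane 0), and `dualDotProd(x, y)` returns the sum of the two lane-wise products. Under those assumptions the rewritten loop can be compared against the scalar one; `lane16`, `dot_scalar`, and `dot_packed` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: two signed 16-bit lanes per word, first argument in lane 0. */
static uint32_t packCont(int16_t a, int16_t b)
{
    return (uint32_t)(uint16_t)a | ((uint32_t)(uint16_t)b << 16);
}

/* Read lane k (0 or 1) back as a signed 16-bit value. */
static int16_t lane16(uint32_t w, int k)
{
    assert(k == 0 || k == 1);
    return (int16_t)(uint16_t)(w >> (16 * k));
}

/* Assumed semantics: x.lane0 * y.lane0 + x.lane1 * y.lane1. */
static int32_t dualDotProd(uint32_t x, uint32_t y)
{
    return (int32_t)lane16(x, 0) * lane16(y, 0)
         + (int32_t)lane16(x, 1) * lane16(y, 1);
}

/* Reference: plain scalar dot product over [xa, xb). */
static int32_t dot_scalar(const int16_t *s, const int16_t *om, int xa, int xb)
{
    int32_t sum = 0;
    for (int i = xa; i < xb; i++)
        sum += (int32_t)s[i] * om[i];
    return sum;
}

/* Same shape as the rewritten loop on the slide; the trip count must be a
   multiple of 4 because the loop was unrolled by 4. */
static int32_t dot_packed(const int16_t *s, const int16_t *om, int xa, int xb)
{
    int32_t sum = 0;
    assert((xb - xa) % 4 == 0);
    for (int i = xa; i < xb; i = i + 4) {
        sum = sum + dualDotProd(packCont(s[i], s[i + 1]),
                                packCont(om[i], om[i + 1]));
        sum = sum + dualDotProd(packCont(s[i + 2], s[i + 3]),
                                packCont(om[i + 2], om[i + 3]));
    }
    return sum;
}
```

With 16-bit inputs and no overflow in the accumulator, `dot_packed` and `dot_scalar` agree. The rewrite relies on integer addition being associative, which is why the same transformation would need extra care for floating-point sums.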

  13. Example-2

      Input (loop unrolled by 4):

          for (i = xa; i < xb; i = i+4) /*$pragma[VectorLoop("NoAlias")]*/{
              d[i] = j + 4 + (s[i] + om[i]);
              d[i+1] = j + 4 + (s[i+1] + om[i+1]);
              d[i+2] = j + 4 + (s[i+2] + om[i+2]);
              d[i+3] = j + 4 + (s[i+3] + om[i+3]);
          }

      Output (slide annotation on the generated names: instance number ($INSTANCE)):

          for (i = xa ; i < xb ; i = i + 4){
              NEWVAR_temp2_1 = dualAdd(packCont(s[i + 2], s[i + 3]),
                                       packCont(om[i + 2], om[i + 3]));
              NEWVAR_temp2_2 = dualAdd(packCont(s[i], s[i + 1]),
                                       packCont(om[i], om[i + 1]));
              NEWVAR_temp1_1 = unpackCont(NEWVAR_temp2_1, 0);
              d[i + 2] = j + 4 + (NEWVAR_temp1_1);
              NEWVAR_temp3_1 = unpackCont(NEWVAR_temp2_1, 1);
              d[i + 3] = j + 4 + (NEWVAR_temp3_1);
              NEWVAR_temp1_2 = unpackCont(NEWVAR_temp2_2, 0);
              d[i] = j + 4 + (NEWVAR_temp1_2);
              NEWVAR_temp3_2 = unpackCont(NEWVAR_temp2_2, 1);
              d[i + 1] = j + 4 + (NEWVAR_temp3_2);
          }
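In this example only the subexpression `s[k] + om[k]` is matched, and the packed result is unpacked again to feed the remaining scalar computation. The C sketch below emulates that pattern under assumed semantics (two signed 16-bit lanes per word, lane-wise add wrapping modulo 2^16); the slide defines neither the operations nor the element types, and `add_packed` is a hypothetical, simplified version of the generated loop that handles one pair per iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: two signed 16-bit lanes per word, first argument in lane 0. */
static uint32_t packCont(int16_t a, int16_t b)
{
    return (uint32_t)(uint16_t)a | ((uint32_t)(uint16_t)b << 16);
}

/* Assumed semantics: extract lane k (0 or 1) as a signed 16-bit value. */
static int16_t unpackCont(uint32_t w, int k)
{
    assert(k == 0 || k == 1);
    return (int16_t)(uint16_t)(w >> (16 * k));
}

/* Assumed semantics: lane-wise add, each lane wrapping modulo 2^16. */
static uint32_t dualAdd(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0xFFFFu;
    uint32_t hi = ((x >> 16) + (y >> 16)) & 0xFFFFu;
    return lo | (hi << 16);
}

/* Packed add of the matched subexpression, then scalar completion of the
   statement, mirroring the structure of the generated code. */
static void add_packed(int *d, const int16_t *s, const int16_t *om,
                       int j, int xa, int xb)
{
    assert((xb - xa) % 2 == 0);
    for (int i = xa; i < xb; i = i + 2) {
        uint32_t t = dualAdd(packCont(s[i], s[i + 1]),
                             packCont(om[i], om[i + 1]));
        d[i]     = j + 4 + unpackCont(t, 0);
        d[i + 1] = j + 4 + unpackCont(t, 1);
    }
}
```

One consequence of the assumed lane width: the packed add wraps at 16 bits, whereas the scalar C expression is evaluated in `int`, so the two forms only agree when every `s[k] + om[k]` fits in a signed 16-bit lane.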

  14. Combining the Rules
      • Strata or alternative based
      • normalization based
      [Figure: chains of rewriting engines, each driven by its own rule description. Labels: Rule Desc.; Rewriting Engine; C code; IR Form.]
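The slide does not spell out what "strata based" combination means. One common reading, assumed here, is that rule sets are applied in sequence and each set is run to a fixpoint before the next one starts. The C sketch below illustrates only that control structure on a toy IR (an array of integers); `Rule`, `Stratum`, `run_strata`, and the two toy rules are all hypothetical and unrelated to SWARecog's actual interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* A rule inspects the toy IR and returns nonzero if it changed something. */
typedef int (*Rule)(int *ir, size_t n);

typedef struct {
    const Rule *rules;
    size_t count;
} Stratum;

/* Toy rule: rewrite the first 1 into a 2. */
static int one_to_two(int *ir, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ir[i] == 1) { ir[i] = 2; return 1; }
    return 0;
}

/* Toy rule: rewrite the first 2 into a 3. */
static int two_to_three(int *ir, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ir[i] == 2) { ir[i] = 3; return 1; }
    return 0;
}

/* Run each stratum to a fixpoint, in order. Returns the number of rule
   applications. Termination is the rule writer's responsibility. */
static size_t run_strata(const Stratum *strata, size_t nstrata, int *ir, size_t n)
{
    size_t fired = 0;
    assert(strata != NULL || nstrata == 0);
    for (size_t s = 0; s < nstrata; s++) {
        int changed;
        do {
            changed = 0;
            for (size_t r = 0; r < strata[s].count; r++) {
                if (strata[s].rules[r](ir, n)) {
                    changed = 1;
                    fired++;
                }
            }
        } while (changed);
    }
    return fired;
}
```

With `one_to_two` in the first stratum and `two_to_three` in the second, the IR `{1, 1, 2}` first becomes `{2, 2, 2}` and then `{3, 3, 3}`. Ordering the strata this way lets later rules rely on the normal form produced by earlier ones.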

  15. Rule Generation
      • C rules description
      [Figure: rule generation flow. Labels: C code; Front-end; LHS Generator; RHS Generator; Rule description; SWARecog.]

  16. A Rule Description Example

      Slide annotations: "the engine generates same_address_+1"; "defines the properties of the leaf expressions to match."

      Left-hand side:

          for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[LHS()]*/ {
              ROOT_1(LEAF_3(tab1[i]) + LEAF_4(tab2[i]));
              ROOT_2(LEAF_5(tab1[i+1]) + LEAF_6(tab2[i+1]));
          }

      Right-hand side:

          for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[RHS()]*/ {
              NEWVAR_temp2 = dualAdd(packCont(LEAF_3(tab1[i]), LEAF_5(tab1[i+1])),
                                     packCont(LEAF_4(tab2[i]), LEAF_6(tab2[i+1])));
              NEWVAR_temp1 = unpackCont(NEWVAR_temp2, 0);
              NEWVAR_temp3 = unpackCont(NEWVAR_temp2, 1);
              ROOT_1(NEWVAR_temp1);
              ROOT_2(NEWVAR_temp3);
          }

  17. A Rule Description Example

      Left-hand side:

          for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[LHS()]*/ {
              ROOT_1(LEAF_7(sum) = LEAF_8(sum) + (LEAF_3(tab1[i]) * LEAF_4(tab2[i])));
              ROOT_2(LEAF_9(sum) = LEAF_10(sum) + (LEAF_5(tab1[i+1]) * LEAF_6(tab2[i+1])));
          }

      Right-hand side:

          for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[RHS()]*/ {
              ROOT_2(LEAF_9(sum) = LEAF_10(sum)
                     + dualDotProd(packCont(LEAF_3(tab1[i]), LEAF_5(tab1[i+1])),
                                   packCont(LEAF_4(tab2[i]), LEAF_6(tab2[i+1]))));
          }

  18. Conclusion and Perspectives
      • The prototype is running
      • Vectorization and alignment phases are under development
      • Next step: study the trade-off between unrolling and vectorization

  19. Bibliography
      • [Assmann96] Graph Rewrite Systems for Program Optimization, U. Assmann, Technical Report RR2955, INRIA Rocquencourt, 1996
      • [Bik01] Experiments with Automatic Vectorization for the Pentium 4 Processor, A. Bik, M. Girkar, P. Grey and X. Tian, CPC, 2001
      • [Cheong97] An Optimizer for Multimedia Instruction Sets, G. Cheong and M. Lam, Proceedings of the Second SUIF Compiler Workshop, 1997

  20. Bibliography (cont.)
      • [Krall00] Compilation Techniques for Multimedia Processors, A. Krall and S. Lelait, IJPP, vol. 28, no. 4, 2000
      • [Larsen00] Exploiting Superword Level Parallelism with Multimedia Instruction Sets, S. Larsen and S. Amarasinghe, PLDI 2000
      • [Leupers2000] Code Selection for Media Processors with SIMD Instructions, R. Leupers, DATE 2000
      • [Sreraman00] A Vectorizing Compiler for Multimedia Extensions, N. Sreraman and R. Govindarajan, IJPP, vol. 28, no. 4, 2000
