A Framework for Parallelizing Load/Stores on Embedded Processors

Xiaotong Zhuang, Santosh Pande, John S. Greenland Jr. College of Computing, Georgia Tech.

Presentation Transcript


  1. A Framework for Parallelizing Load/Stores on Embedded Processors
     Xiaotong Zhuang, Santosh Pande, John S. Greenland Jr.
     College of Computing, Georgia Tech

  2. Background and Motivation
     • The speed gap between memory and CPU remains
     • Multi-bank memory architectures: Motorola DSP56000 series, NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore SC140 processor core
     • Parallel instructions allow parallel access to memory banks: PLDXY r1, @a, r2, @b loads @a into r1 and @b into r2 at the same time
     • Objective:
        • Generate as many parallel Load/Store instructions (such as PLDXY) as possible through compiler optimizations
        • Controlled code and data segment growth
        • Reasonable compilation speed

  3. General approaches
     • Model as an ILP problem: Rainer Leupers, Daniel Kotte, “Variable partitioning for dual memory bank DSPs”, ICASSP, May’01
        • 0/1 variables Ni, one per LD/ST instruction, represent its memory bank assignment (X or Y)
        • 0/1 variables Eij represent whether two instructions can be merged
        • Enforce the remaining constraints and maximize the total weight of the selected edges
     • Model as a graph problem: A. Sudarsanam, S. Malik, “Simultaneous Reference Allocation in Code Generation for Dual Data Memory Bank ASIPs”, TODAES, Apr’00
        • Each Load/Store is a node
        • An edge between two nodes means they can be merged
        • Pick the maximum number of disjoint edges
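One way an ILP of this shape could be written is sketched below. This is an illustration built only from the bullets above, not the formulation from the cited paper: the edge weights $w_{ij}$, the candidate pair set $P$, and the same-web constraint are assumptions. $N_i = 0$ stands for bank X and $N_i = 1$ for bank Y; the two inequalities on $E_{ij}$ force a selected pair to sit in different banks, and the sum constraint lets each instruction take part in at most one merge.

```latex
\begin{align*}
\max\quad & \sum_{(i,j) \in P} w_{ij}\, E_{ij} \\
\text{s.t.}\quad
  & E_{ij} \le N_i + N_j            && \forall (i,j) \in P \\
  & E_{ij} \le 2 - N_i - N_j        && \forall (i,j) \in P \\
  & \sum_{j : (i,j) \in P} E_{ij} \le 1 && \forall i \\
  & N_i = N_k                       && \text{if } i \text{ and } k \text{ access the same location} \\
  & N_i,\ E_{ij} \in \{0, 1\}
\end{align*}
```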

  4. Major contributions
     • Keep the model simple and easy to solve mathematically
     • Identify the movable boundary problem, which impedes problem modeling and simplification
     • Propose the Motion Schedule Graph (MSG) and two approaches to solve it heuristically
     • Merge with instruction duplication and variable duplication
     • Cross basic block merges
     • Other improvements, such as local conflict elimination through rematerialization, and some global optimization issues
     • An iterative approach that systematically grows the code segment and then the data segment, each minimally

  5. Basic concepts (1)
     • Post-pass approach: assumes a good register allocator has been used (Appel & George’s register allocation algorithm)
     • Alias analysis
        • Disambiguates memory access instructions
        • Most aliases can be uniquely determined in our benchmark programs
     • Memory access instructions
        • ST [addr], r is the definition of a memory address
        • LD [addr], r is the use of a memory address
        • For base-offset Load/Store instructions, normally used for arrays, arrays are assumed inseparable and more register conflicts must be considered
     • Dependencies (alias analysis)
        • Address conflicts
        • Register conflicts

  6. Basic concepts (2)
     • Building webs
        • Web: a maximal union of du-chains. All defs/uses of a variable on the same web MUST be allocated to the same memory location
        • A variable that appears in separate webs can be placed in different memory locations
        • This achieves value separation
     • Motion range determination
        • Defined as the interval between program points within which a Load/Store can be legally moved, restricted by dependencies
        • Load/Store instructions with overlapping ranges MAY be merged
        • Note the movable boundary problem
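Web construction can be sketched with a union-find structure: du-chains that share a def or a use collapse into one web. The input format (a list of `(def_id, [use_ids])` pairs) and the function name `build_webs` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_webs(du_chains):
    """Group du-chains into webs (maximal unions of chains sharing a def/use).

    du_chains: list of (def_id, [use_ids]); hypothetical input format.
    Returns a sorted list of webs, each a sorted list of instruction ids.
    """
    parent = {}

    def find(x):
        # Find the representative of x, with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for d, uses in du_chains:
        find(d)  # register defs that have no uses
        for u in uses:
            union(d, u)

    webs = defaultdict(set)
    for x in list(parent):
        webs[find(x)].add(x)
    return sorted(sorted(w) for w in webs.values())


chains = [("st1", ["ld1", "ld2"]), ("st2", ["ld2"]), ("st3", ["ld3"])]
print(build_webs(chains))
```

Here `st1` and `st2` both reach `ld2`, so they land in one web and must share a memory location, while `st3`/`ld3` form a separate web that may be placed elsewhere.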

  7. Movable boundary problem
     • The motion boundary of a Load/Store instruction may itself be a Load/Store instruction, which can also move
     • Assuming a fixed boundary will cause incorrect merges

  8. Motion schedule graph
     • Pseudo fixed-boundary
        • For a Store: move as early as possible, assuming other instructions are fixed
        • For a Load: move as late as possible, assuming other instructions are fixed
     • Motion Schedule Graph
        • Nodes represent individual Load/Store instructions
        • An oval encloses the Load/Stores on the same web
        • Edges link nodes whose motion ranges overlap (with respect to pseudo fixed-boundaries)
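The edge rule of the MSG can be sketched as a simple interval-overlap test. The representation of a motion range as a closed interval `(earliest, latest)` of program points, and the name `build_msg_edges`, are assumptions for illustration; computing the ranges themselves (with pseudo fixed-boundaries) is not shown.

```python
from itertools import combinations


def build_msg_edges(ranges):
    """Return MSG edges: pairs of Load/Stores whose motion ranges overlap.

    ranges: {instr_id: (earliest, latest)}, closed intervals of program
    points (hypothetical representation).
    """
    edges = set()
    for a, b in combinations(sorted(ranges), 2):
        lo = max(ranges[a][0], ranges[b][0])
        hi = min(ranges[a][1], ranges[b][1])
        if lo <= hi:  # the two intervals share at least one program point
            edges.add((a, b))
    return edges


ranges = {"ld_a": (1, 4), "ld_b": (3, 6), "st_c": (7, 9)}
print(build_msg_edges(ranges))
```

In this toy input only `ld_a` and `ld_b` share program points, so they are the only merge candidates.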

  9. Conflict resolution

  10. Example

  11. Graph solving
     • The whole problem is provably NP-complete (see Appendix A)
     • It splits into two problems: Bank Assignment and Edge Picking
     • For predetermined bank assignments, the Edge Picking problem can be solved optimally in polynomial time
     • Heuristic algorithms
        • Brute-force search takes O(|V|^3 · 2^n) time; doable for small programs
        • Simulated annealing (SA) can approach the optimal solution but greatly increases compilation time
        • Use a heuristic to solve bank assignment, then get the optimal solution for Edge Picking
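The split into Bank Assignment and Edge Picking can be sketched as follows. The sketch assumes that two accesses can be merged only when they sit in different banks (X and Y) and that every access on a web shares the web's bank; under that assumption the usable edges form a bipartite graph, so a maximum set of disjoint edges can be found with augmenting paths, which is equivalent to a unit-capacity max flow. The names `pick_edges` and `brute_force` are hypothetical, and the exhaustive loop stands in for the brute-force search mentioned above, not for the heuristic the slides propose.

```python
from itertools import product


def pick_edges(edges, bank):
    """Optimal Edge Picking for a fixed bank assignment.

    edges: iterable of (a, b) MSG edges; bank: {instr_id: "X" or "Y"}.
    Returns a maximum set of disjoint edges as (x_node, y_node) pairs.
    """
    adj = {}
    for a, b in edges:
        if bank[a] != bank[b]:  # only cross-bank pairs are mergeable
            x, y = (a, b) if bank[a] == "X" else (b, a)
            adj.setdefault(x, []).append(y)
    match = {}  # Y-side node -> matched X-side node

    def augment(x, seen):
        # Try to match x, re-routing earlier matches along an augmenting path.
        for y in adj.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            if y not in match or augment(match[y], seen):
                match[y] = x
                return True
        return False

    for x in adj:
        augment(x, set())
    return sorted((x, y) for y, x in match.items())


def brute_force(edges, web_of):
    """Try all 2^n bank assignments over the n webs; keep the best picking."""
    webs = sorted(set(web_of.values()))
    best = []
    for banks in product("XY", repeat=len(webs)):
        web_bank = dict(zip(webs, banks))
        bank = {i: web_bank[w] for i, w in web_of.items()}
        picked = pick_edges(edges, bank)
        if len(picked) > len(best):
            best = picked
    return best


web_of = {"l1": "a", "l2": "b", "l3": "c", "l4": "a"}
edges = [("l1", "l2"), ("l2", "l3"), ("l3", "l4")]
print(brute_force(edges, web_of))
```

The outer loop is exponential in the number of webs while each inner call is polynomial, which is why replacing the enumeration with a bank assignment heuristic and keeping the exact edge picking is attractive.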

  12. Edge Picking as a max flow problem

  13. Bank assignment heuristic

  14. Post-pass phases

  15. Cross BB merge (instruction duplication)
     • Move a Load/Store to a predecessor/successor block to create new merge opportunities
     • To guarantee profitability:
        • Move only to where the reference is live
        • Move ST along the EBB (extended basic block)
        • Move LD along the reverse EBB
        • Make sure the instruction can be combined if pushed to at least one of the live predecessors/successors

  16. Variable duplication

  17. Local conflict elimination
     • Motivation
        • The register allocator may assign the same register to neighboring live ranges, which leads to register conflicts
        • ISA restrictions may require particular registers that are not available at the program point
     • Rematerialization frees a register for the merge and reconstructs its value afterwards

  18. Merge type and MSG properties

  19. Compilation time

  20. Runtime performance

  21. Code size comparison

  22. Conclusion
     • A framework to analyze and merge LD/STs
     • Our heuristic approach comes close to exhaustive search while taking less compilation time
     • Variable and instruction replication widen the range of motion of the instructions, so the generated code quality is superior to that of the exhaustive methods previously proposed
