A Framework for Parallelizing Load/Stores on Embedded Processors
Xiaotong Zhuang, Santosh Pande, John S. Greenland Jr.
College of Computing, Georgia Tech
Background and Motivation
• The speed gap between memory and CPU remains
• Multi-bank memory architectures: Motorola DSP56000 series, NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore SC140 processor core
• Parallel instructions allow parallel access to memory banks: PLDXY r1, @a, r2, @b loads @a into r1 and @b into r2 at the same time
• Objective:
  • Generate as many parallel Load/Store instructions (such as PLDXY) as possible through compiler optimizations
  • Controlled code and data segment growth
  • Reasonable compilation speed
General approaches
• Model as an ILP problem: Rainer Leupers, Daniel Kotte, “Variable partitioning for dual memory bank DSPs”, ICASSP, May ’01
  • Variables Ni with value 0/1, one for each LD/ST instruction, represent its memory bank assignment (X or Y)
  • Variables Eij with value 0/1 represent whether two instructions can be merged
  • Enforce the remaining constraints and maximize the total weight of the selected edges
• Model as a graph problem: A. Sudarsanam, S. Malik, “Simultaneous Reference Allocation in Code Generation for Dual Data Memory Bank ASIPs”, TODAES, Apr ’00
  • Each Load/Store is a node
  • An edge between two nodes means they can be merged
  • Pick the maximum number of disjoint edges
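The graph formulation above can be illustrated with a small sketch. It picks a largest set of node-disjoint edges by exhaustive search, which is only practical for tiny graphs. The function name `max_disjoint_edges` and the instruction labels are hypothetical, and bank constraints are deliberately ignored here; this is not the algorithm from the cited paper.

```python
def max_disjoint_edges(edges):
    """Return a largest set of pairwise node-disjoint edges.

    Exhaustive search over edge subsets (exponential time); meant only
    to illustrate the objective, not to be efficient.
    """
    best = []

    def search(i, used, chosen):
        nonlocal best
        if i == len(edges):
            if len(chosen) > len(best):
                best = list(chosen)
            return
        u, v = edges[i]
        # Option 1: take edge i if neither endpoint is already merged.
        if u not in used and v not in used:
            chosen.append((u, v))
            search(i + 1, used | {u, v}, chosen)
            chosen.pop()
        # Option 2: skip edge i.
        search(i + 1, used, chosen)

    search(0, frozenset(), [])
    return best


# Hypothetical mergeable pairs forming a chain ld1 - ld2 - ld3 - st4.
pairs = max_disjoint_edges([("ld1", "ld2"), ("ld2", "ld3"), ("ld3", "st4")])
print(pairs)
```

Each selected edge corresponds to one merged parallel instruction, so the size of the result is the number of Load/Store instructions saved.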
Major contributions
• Keep the model simple and easy to solve mathematically
• Identify the movable boundary problem, which impedes modeling and simplifying the problem
• Propose the Motion Schedule Graph (MSG) and two approaches to solve it heuristically
• Merge with instruction duplication and variable duplication
• Cross basic block merges
• Other improvements, such as local conflict elimination through rematerialization, and some global optimization issues
• An iterative approach that systematically grows the code segment first and then the data segment, each minimally
Basic concepts (1)
• Post-pass approach: assumes a good register allocator has been used (Appel & George’s register allocation algorithm)
• Alias analysis
  • Disambiguates memory access instructions
  • Most aliases can be uniquely determined in our benchmark programs
• Memory access instructions
  • ST [addr], r is the definition of a memory address
  • LD [addr], r is the use of a memory address
  • For base-offset Load/Store instructions, normally used for arrays, assume arrays are inseparable; more register conflicts must then be considered
• Dependencies
  • Alias analysis
  • Address conflicts
  • Register conflicts
Basic concepts (2)
• Building webs
  • Web: a maximal union of du-chains. All variable defs/uses on a web MUST be allocated to the same memory location
  • A variable that appears in separate webs can be placed in different memory locations
  • Achieves value separation
• Motion range determination
  • Defined as the interval between program points within which a Load/Store can be legally moved, restrained by dependencies
  • Load/Store instructions with overlapping ranges MAY be merged
  • Note the movable boundary problem
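As a rough illustration of web building, the sketch below merges du-chains that share a definition or use, using a union-find structure. The input format (a dict from each store to the loads it reaches) and all names are assumptions made for this example, not the authors' data structures.

```python
def build_webs(du_chains):
    """Group defs and uses into webs.

    du_chains maps a definition id (a store) to the set of use ids
    (loads) it reaches. Chains that share any def or use end up in
    the same web.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for d, uses in du_chains.items():
        find(d)  # register defs that have no uses
        for u in uses:
            union(d, u)

    webs = {}
    for x in list(parent):
        webs.setdefault(find(x), set()).add(x)
    # Sort for a deterministic result.
    return sorted(webs.values(), key=lambda w: sorted(w))


# Hypothetical example: st1 and st2 both reach ld3, so they share a web;
# st4 and ld5 form a second web for the same source variable.
print(build_webs({"st1": {"ld3"}, "st2": {"ld3"}, "st4": {"ld5"}}))
```

Because the two webs are independent, a later pass is free to place them in different memory locations, which is the value separation mentioned above.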
Movable boundary problem
• The motion boundary of one Load/Store instruction may itself be a Load/Store instruction, which can also move
• Assuming a fixed boundary causes incorrect merges
Motion schedule graph
• Pseudo fixed-boundary
  • For a Store: move as early as possible, assuming other instructions are fixed
  • For a Load: move as late as possible, assuming other instructions are fixed
• Motion Schedule Graph
  • Nodes represent individual Load/Store instructions
  • An oval encloses the Load/Stores on the same web
  • Edges link nodes whose motion ranges overlap (with respect to pseudo fixed-boundaries)
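A minimal sketch of the edge construction follows, assuming each instruction's motion range has already been computed as a closed interval of program points. The tuple layout, the function name `build_msg`, and the choice to skip pairs on the same web (they share a memory location and hence a bank) are assumptions of this example rather than details stated on the slides.

```python
from itertools import combinations


def build_msg(instrs):
    """Return MSG edges for instructions with overlapping motion ranges.

    instrs maps an instruction id to (web, lo, hi), where [lo, hi] is
    the motion range in program points, taken with respect to pseudo
    fixed boundaries.
    """
    edges = []
    for a, b in combinations(sorted(instrs), 2):
        web_a, lo_a, hi_a = instrs[a]
        web_b, lo_b, hi_b = instrs[b]
        overlap = max(lo_a, lo_b) <= min(hi_a, hi_b)
        # Assumption: two accesses on the same web cannot be paired.
        if overlap and web_a != web_b:
            edges.append((a, b))
    return edges


# Hypothetical instructions and ranges.
instrs = {
    "ld_a": ("w1", 0, 4),
    "ld_a2": ("w1", 2, 5),
    "ld_b": ("w2", 3, 6),
    "st_c": ("w3", 7, 9),
}
print(build_msg(instrs))
```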
Example
Graph solving
• The whole problem is provably NP-complete (see Appendix A)
• Two separate problems: Bank Assignment and Edge Picking
  • For predetermined bank assignments, the Edge Picking problem can be solved optimally in polynomial time
• Heuristic algorithms
  • Brute-force search takes $O(|V|^3 2^n)$ time; doable for small programs
  • Simulated annealing (SA) can approach the optimal solution but greatly increases compilation time
  • Use a heuristic to solve bank assignment, then obtain the optimal solution for Edge Picking
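The split into Bank Assignment and Edge Picking can be sketched as follows. The sketch assumes that a merge needs one access in bank X and one in bank Y, so for a fixed assignment the usable edges form a bipartite graph and a standard augmenting-path matching picks a maximum set of disjoint edges. It also assumes banks are assigned per web. Both assumptions, and all names, are illustrative and not taken from the authors' implementation.

```python
from itertools import product


def pick_edges(edges, bank):
    """Maximum set of disjoint edges joining an X access to a Y access."""
    adj = {}
    for u, v in edges:
        if bank[u] == bank[v]:
            continue  # same bank: cannot be issued in parallel
        x, y = (u, v) if bank[u] == "X" else (v, u)
        adj.setdefault(x, []).append(y)
    match = {}  # Y node -> X node it is paired with

    def augment(x, seen):
        for y in adj.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            if y not in match or augment(match[y], seen):
                match[y] = x
                return True
        return False

    for x in sorted(adj):
        augment(x, set())
    return sorted((x, y) for y, x in match.items())


def brute_force(webs, edges):
    """Try every X/Y assignment of webs; keep the one with most merges."""
    names = sorted(webs)
    best_assign, best_pairs = None, []
    for banks in product("XY", repeat=len(names)):
        bank = {i: b for w, b in zip(names, banks) for i in webs[w]}
        pairs = pick_edges(edges, bank)
        if best_assign is None or len(pairs) > len(best_pairs):
            best_assign, best_pairs = dict(zip(names, banks)), pairs
    return best_assign, best_pairs


# Hypothetical webs and MSG edges.
webs = {"w1": ["ld_a", "ld_a2"], "w2": ["ld_b"], "w3": ["st_c"]}
edges = [("ld_a", "ld_b"), ("ld_a2", "st_c"), ("ld_b", "st_c")]
print(brute_force(webs, edges))
```

The outer loop is exponential in the number of webs, which is why a heuristic for the bank assignment step, followed by exact edge picking, is the more practical combination.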
Cross basic block (BB) merge (instruction duplication)
• Move to a predecessor/successor to create new merge opportunities
• To guarantee profitability
  • Move to where the reference is live
  • Move ST along the extended basic block (EBB)
  • Move LD along the reverse EBB
  • Ensure the instruction can be combined when pushed to at least one of the live predecessors/successors
Local conflict elimination
• Motivation
  • The register allocator may assign the same register to neighboring ranges, which leads to register conflicts
  • ISA restrictions may require particular registers that are not available at the program point
• Rematerialization: free a register so that it is available for the merge, then reconstruct its value after the merge
Conclusion
• A framework to analyze and merge LD/STs
• Our heuristic approach comes close to exhaustive search while taking less compilation time
• Variable and instruction replication enlarges the instructions' range of motion, so the generated code is better than that of the previously proposed exhaustive methods