A Framework for Parallelizing Load/Stores on Embedded Processors

Xiaotong Zhuang, Santosh Pande, John S. Greenland Jr.
College of Computing, Georgia Tech

Background and Motivation
• The speed gap between memory and CPU remains.
• Multi-bank memory architectures: Motorola DSP56000 series, NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore SC140 processor core.
• Parallel instructions allow parallel access to memory banks: PLDXY r1, @a, r2, @b loads @a into r1 and @b into r2 at the same time.
• Objective:
  • Generate as many parallel Load/Store instructions (such as PLDXY) as possible through compiler optimizations.
  • Keep code and data segment growth under control.
  • Keep compilation time reasonable.
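To make the objective concrete, the following Python sketch fuses two neighbouring loads into a single PLDXY when their addresses sit in different banks. The tuple-based instruction format, the `bank` map, and the rule that the X-bank operand is listed first are assumptions made for this illustration, not details taken from the slides.

```python
# Hypothetical IR: ("LD", dest_register, address); other opcodes pass through untouched.
def merge_adjacent_loads(instrs, bank):
    """Fuse neighbouring LDs whose addresses live in different memory banks."""
    out, i = [], 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt is not None and cur[0] == "LD" and nxt[0] == "LD"
                and bank[cur[2]] != bank[nxt[2]]   # one access per bank
                and cur[1] != nxt[1]):             # distinct destination registers
            # Assumption: the X-bank operand pair is written first.
            x, y = (cur, nxt) if bank[cur[2]] == "X" else (nxt, cur)
            out.append(("PLDXY", x[1], x[2], y[1], y[2]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

code = [("LD", "r1", "a"), ("LD", "r2", "b"),
        ("ADD", "r3", "r1", "r2"), ("LD", "r4", "c")]
banks = {"a": "X", "b": "Y", "c": "X"}
merged = merge_adjacent_loads(code, banks)
```

Only loads that are already adjacent are fused here; moving loads and stores so that they become adjacent is the harder part that the rest of the talk addresses.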

General approaches
• Model as an ILP problem: Rainer Leupers, Daniel Kotte, “Variable partitioning for dual memory bank DSPs”, ICASSP, May ’01
  • A 0/1 variable Ni for each LD/ST instruction represents its memory bank assignment (X or Y).
  • A 0/1 variable Eij represents whether two instructions can be merged.
  • Enforce the remaining constraints and maximize the total weight of the selected edges.
• Model as a graph problem: A. Sudarsanam, S. Malik, “Simultaneous Reference Allocation in Code Generation for Dual Data Memory Bank ASIPs”, TODAES, Apr ’00
  • Each Load/Store is a node.
  • An edge between two nodes means they can be merged.
  • Pick the maximum number of disjoint edges.

Major contributions
• Keep the model simple and easy to solve mathematically.
• Identify the movable boundary problem, which impedes problem modeling and simplification.
• Propose the Motion Schedule Graph (MSG) and two approaches to solve it heuristically.
• Merge with instruction duplication and variable duplication.
• Merge across basic blocks.
• Other improvements, such as local conflict elimination through rematerialization, and some global optimization issues.
• An iterative approach that systematically grows the code segment, and then the data segment, minimally.

Basic concepts (1)
• Post-pass approach: assumes a good register allocator has been used (Appel & George’s register allocation algorithm).
• Alias analysis
  • Disambiguates memory access instructions.
  • Most aliases can be uniquely determined in our benchmark programs.
• Memory access instructions
  • ST[addr],r is the definition of a memory address.
  • LD[addr],r is the use of a memory address.
  • For base-offset Load/Store instructions, normally used for arrays, arrays are assumed inseparable and more register conflicts are considered.
• Dependencies
  • Alias analysis
  • Address conflicts
  • Register conflicts

Basic concepts (2)
• Building webs
  • A web is a maximal union of du-chains. All defs/uses of a variable on the same web MUST be allocated to the same memory location.
  • A variable that appears in separate webs can be placed in different memory locations.
  • This achieves value separation.
• Motion range determination
  • The motion range is the interval between program points within which a Load/Store can legally be moved, as restricted by dependencies.
  • Load/Store instructions with overlapping ranges MAY be merged.
  • Caveat: the movable boundary problem.
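Web construction can be sketched with a union-find structure: every du-chain ties a store (definition) to the loads (uses) it reaches, and chains that share an instruction collapse into one web. The input format and the instruction names below are hypothetical; this is a minimal sketch, not the authors' implementation.

```python
def build_webs(du_chains):
    """du_chains: list of (def_id, [use_id, ...]) pairs for one variable."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for d, uses in du_chains:
        find(d)                 # register the def even if it has no uses
        for u in uses:
            union(d, u)

    webs = {}
    for node in list(parent):
        webs.setdefault(find(node), set()).add(node)
    return sorted(sorted(w) for w in webs.values())

# st1 and st2 both reach ld2, so they end up on the same web; st3/ld3 form another.
chains = [("st1", ["ld1", "ld2"]), ("st2", ["ld2"]), ("st3", ["ld3"])]
webs = build_webs(chains)
```

In this example the two resulting webs could be placed in different memory locations even if they belong to the same source variable, which is the value separation mentioned above.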

Movable boundary problem
• The motion boundary of one Load/Store instruction may itself be a Load/Store instruction, which can move as well.
• Assuming that such a boundary is fixed can cause incorrect merges.

Motion schedule graph
• Pseudo fixed boundary
  • For a Store: move it as early as possible, assuming the other instructions are fixed.
  • For a Load: move it as late as possible, assuming the other instructions are fixed.
• Motion Schedule Graph
  • Nodes represent individual Load/Store instructions.
  • An oval encloses the Load/Stores on the same web.
  • Edges link nodes whose motion ranges overlap (with respect to the pseudo fixed boundaries).
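Once each instruction has a motion range relative to its pseudo fixed boundaries, the MSG edges follow from a pairwise interval-overlap test. The sketch below assumes ranges are given as inclusive (earliest, latest) program-point pairs; the node names and numbers are made up for illustration.

```python
from itertools import combinations

def build_msg_edges(ranges):
    """ranges: {node: (earliest, latest)} with inclusive program points."""
    edges = []
    for (u, (lo_u, hi_u)), (v, (lo_v, hi_v)) in combinations(sorted(ranges.items()), 2):
        # Two closed intervals overlap when the later start is not past the earlier end.
        if max(lo_u, lo_v) <= min(hi_u, hi_v):
            edges.append((u, v))
    return edges

ranges = {"ld_a": (0, 3), "ld_b": (2, 5), "ld_c": (6, 8), "ld_d": (5, 6)}
msg_edges = build_msg_edges(ranges)
```

An edge only says that two instructions MAY be merged; whether they actually can still depends on the bank assignment and on which other edges are picked.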

Conflict resolution

Example

Graph solving
• The whole problem is provably NP-complete (see Appendix A).
• It splits into two separate problems: Bank Assignment and Edge Picking.
• For predetermined bank assignments, the Edge Picking problem can be solved optimally in polynomial time.
• Heuristic algorithms
  • Brute-force search takes O(|V|^3 · 2^n) time, which is feasible for small programs.
  • Simulated annealing (SA) can approach the optimal solution but greatly increases compilation time.
  • Use a heuristic to solve bank assignment, then obtain the optimal solution for Edge Picking.

Edge Picking as max flow problem
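The figure for this slide is not preserved, so the following is only a sketch of the idea under stated assumptions. It assumes that two accesses can be merged only if they sit in different banks, which makes the graph bipartite once banks are fixed, and it uses augmenting-path bipartite matching (the unit-capacity special case of max flow) rather than whatever flow construction the authors used. The brute-force wrapper enumerates all bank assignments, assuming that n counts webs and that each web gets one bank. All names are hypothetical.

```python
from itertools import product

def pick_edges(edges, bank_of):
    """Maximum set of disjoint edges whose endpoints sit in different banks."""
    adj = {}
    for u, v in edges:
        if bank_of[u] == bank_of[v]:
            continue  # same bank: cannot be accessed in parallel
        x, y = (u, v) if bank_of[u] == "X" else (v, u)
        adj.setdefault(x, []).append(y)

    match_of_y = {}  # Y-bank node -> X-bank node it is merged with

    def augment(x, seen):
        for y in adj.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            # Take y if it is free, or if its current partner can be re-routed.
            if y not in match_of_y or augment(match_of_y[y], seen):
                match_of_y[y] = x
                return True
        return False

    for x in adj:
        augment(x, set())
    return sorted((x, y) for y, x in match_of_y.items())

def brute_force(edges, web_of):
    """Try every bank assignment of the webs and keep the best edge picking."""
    webs = sorted(set(web_of.values()))
    best = (None, [])
    for banks in product("XY", repeat=len(webs)):
        web_bank = dict(zip(webs, banks))
        bank_of = {node: web_bank[w] for node, w in web_of.items()}
        picked = pick_edges(edges, bank_of)
        if len(picked) > len(best[1]):
            best = (web_bank, picked)
    return best

edges = [("a", "b"), ("b", "c"), ("c", "d")]
web_of = {"a": "w1", "b": "w2", "c": "w3", "d": "w4"}
fixed = pick_edges(edges, {"a": "X", "b": "X", "c": "Y", "d": "Y"})
best_banks, best_picked = brute_force(edges, web_of)
```

With the fixed assignment only one edge crosses the banks, while the exhaustive search finds an assignment that allows two merges. This is why the quality of the bank assignment heuristic matters even though edge picking itself is solved optimally.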

Bank assignment heuristic

Post-pass phases

Cross BB merge (instruction duplication)
• Move a Load/Store to a predecessor/successor block to create new merge opportunities.
• To guarantee profitability:
  • Move only to where the reference is live.
  • Move STs along an extended basic block (EBB).
  • Move LDs along a reverse EBB.
  • Make sure the instruction can be combined when pushed to at least one of the live predecessors/successors.

Variable duplication

Local conflict elimination
• Motivation
  • The register allocator may assign the same register to neighboring ranges, which leads to register conflicts.
  • ISA restrictions may require particular registers that are not available at the program point.
• Use rematerialization to free a register, making it available for the merge, and reconstruct its value after the merge.

Merge type and MSG properties

Compilation time

Runtime performance

Code size comparison

Conclusion
• A framework to analyze and merge LD/STs.
• Our heuristic approach comes close to exhaustive search while taking less compilation time.
• Variable and instruction replication enlarge the instructions' range of motion, so the generated code quality is superior to that of the exhaustive methods previously proposed.
