Embedded Computer Architecture

Embedded Computer Architecture VLIW architectures: Generating VLIW code TU/e 5kk73 Henk Corporaal

VLIW lectures overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  • C6
  • TM
  • TTA
• Clustering and Reconfigurable components
• Code generation
  • compiler basics
  • mapping and scheduling
  • TTA code generation
• Design space exploration
• Hands-on
Embedded Computer Architecture H. Corporaal, and B. Mesman

Compiler basics
• Overview
  • Compiler trajectory / structure / passes
  • Control Flow Graph (CFG)
• Mapping and Scheduling
  • Basic block list scheduling
  • Extended scheduling scope
  • Loop scheduling
• Loop transformations
  • separate lecture

Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
• The Compiler emits Error messages
• Library code enters at the Loader/Linker

Compiler basics: structure / passes
Source code
• Lexical analyzer: token generation
• Parsing: check syntax, check semantic, parse tree generation
Intermediate code
• Code optimization: data flow analysis, local optimizations, global optimizations
• Code generation: code selection, peephole optimizations
• Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
Sequential code
• Scheduling and allocation: exploiting ILP
Object code

Compiler basics: structure
Simple example: from HLL to (Sequential) Assembly code
    position := initial + rate * 60
Lexical analyzer:
    id1 := id2 + id3 * 60
Syntax analyzer (parse tree):
    := ( id1, + ( id2, * ( id3, 60 ) ) )
Intermediate code generator:
    temp1 := intoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
Code generator:
    movf id3, r2
    mulf #60, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1

Compiler basics: Control flow graph (CFG)
CFG: shows the flow between basic blocks
C input code:
    if (a > b) { r = a % b; }
    else       { r = b % a; }
Basic blocks:
    1:  sub t1, a, b
        bgz t1, 2, 3
    2:  rem r, a, b
        goto 4
    3:  rem r, b, a
        goto 4
    4:  ...
A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.

Compiler basics: Basic optimizations
• Machine independent optimizations
• Machine dependent optimizations

Compiler basics: Basic optimizations
• Machine independent optimizations
  • Common subexpression elimination
  • Constant folding
  • Copy propagation
  • Dead-code elimination
  • Induction variable elimination
  • Strength reduction
  • Algebraic identities
    • Commutative expressions
    • Associativity: tree height reduction
      • Note: not always allowed (due to limited precision)
• For details check any good compiler book!

Compiler basics: Basic optimizations
• Machine dependent optimization example
  • What's the optimal implementation of a*34 ?
  • Use multiplier: mul Tb, Ta, 34
    • Pro: No thinking required
    • Con: May take many cycles
  • Alternative:
        SHL Tb, Ta, 1
        SHL Tc, Ta, 5
        ADD Tb, Tb, Tc
    • Pros: May take fewer cycles
    • Cons:
      • Uses more registers
      • Additional instructions (I-cache load / code size)
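The shift-and-add alternative works because 34 = 2 + 32, so a*34 = (a << 1) + (a << 5). A minimal Python sketch that mirrors the three instructions; the function name `mul34_shift_add` is made up for this illustration, and Python integers do not model register width or overflow:

```python
def mul34_shift_add(a: int) -> int:
    """Compute a * 34 as (a << 1) + (a << 5), mirroring SHL / SHL / ADD."""
    tb = a << 1       # SHL Tb, Ta, 1   -> 2 * a
    tc = a << 5       # SHL Tc, Ta, 5   -> 32 * a
    return tb + tc    # ADD Tb, Tb, Tc  -> 34 * a


if __name__ == "__main__":
    # Compare against the plain multiplication for a few sample values.
    for a in (0, 1, 7, -3, 1000):
        assert mul34_shift_add(a) == a * 34
    print(mul34_shift_add(7))
```

Whether this sequence is actually faster than `mul` depends on the multiplier latency and on how many shifters and adders the target offers.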

Compiler basics: Register allocation
• Register Organization
  • Conventions needed for parameter passing
  • and register usage across function calls
Register file layout (figure):
    r21 - r31   Callee saved registers
    r11 - r20   Caller saved registers: other temporaries
    r1  - r10   Function argument and result transfer
    r0          Hard-wired 0

Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
Some definitions:
• A variable is defined at a point in a program when a value is assigned to it.
• A variable is used at a point in a program when its value is referenced in an expression.
• The live range of a variable is the execution range between definitions and uses of a variable.

Register allocation using graph coloring
Program:
    a :=
    c :=
    b :=
      := b
    d :=
      := a
      := c
      := d
Live ranges (figure): one bar per variable a, b, c, d, running from its define to its last use.

Register allocation using graph coloring
Interference graph (figure): nodes a, b, c, d; an edge connects two variables whose live ranges overlap.
• Coloring:
  • a = red
  • b = green
  • c = blue
  • d = green
Graph needs 3 colors => program needs 3 registers
Question: map coloring requires (at most) 4 colors; what's the maximum number of colors (= registers) needed for register interference graph coloring?
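The path from program to live ranges to interference graph to coloring can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the allocator used in the course: the code is straight-line, every variable is defined once, and the coloring is a simple greedy pass in alphabetical order (which is not guaranteed to be optimal in general). All function names are made up here.

```python
# Each statement is (defined variable or None, list of used variables).
PROGRAM = [
    ("a", []), ("c", []), ("b", []), (None, ["b"]),
    ("d", []), (None, ["a"]), (None, ["c"]), (None, ["d"]),
]


def live_ranges(program):
    """Map each variable to (definition index, last-use index)."""
    ranges = {}
    for i, (dst, srcs) in enumerate(program):
        for v in srcs:
            start, _ = ranges[v]
            ranges[v] = (start, i)      # extend the range to this use
        if dst is not None:
            ranges[dst] = (i, i)        # a range starts at its definition
    return ranges


def interference_graph(ranges):
    """Two variables interfere when their live ranges overlap."""
    graph = {v: set() for v in ranges}
    names = sorted(ranges)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            (s1, e1), (s2, e2) = ranges[u], ranges[v]
            if s1 < e2 and s2 < e1:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def greedy_color(graph, colors=("red", "green", "blue")):
    """Give each node the first color not used by an already colored neighbor."""
    coloring = {}
    for v in sorted(graph):
        taken = {coloring[n] for n in graph[v] if n in coloring}
        coloring[v] = next(c for c in colors if c not in taken)
    return coloring


if __name__ == "__main__":
    graph = interference_graph(live_ranges(PROGRAM))
    print(greedy_color(graph))
```

For this program b and d never overlap, so they can share a color. If `colors` runs out, `next` raises `StopIteration`; that is the point where a real allocator would insert spill code.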

Register allocation using graph coloring: Spill / Reload code
Spill / Reload code is needed when there are not enough colors (registers) to color the interference graph.
Example: only two registers available!
Program:
    a :=
    c :=
    store c
    b :=
      := b
    d :=
      := a
    load c
      := c
      := d
Live ranges (figure): bars for a, b, c, d; the live range of c is split by the store and the load.

Register allocation for a monolithic RF
Scheme of the optimistic register allocator (figure): phases Renumber, Build, Spill costs, Simplify, Select; Spill code loops back to the start.
The Select phase selects a color (= machine register) for a variable that minimizes the heuristic h:
    h = fdep(col, var) + caller_callee(col, var)
where:
    fdep(col, var)          : a measure for the introduction of false dependencies
    caller_callee(col, var) : cost for mapping var on a caller or callee saved register

Compiler basics: Code selection
• CISC era (before 1985)
  • Code size important
  • Determine shortest sequence of code
    • Many options may exist
    • Pattern matching
  Example M68020:
    D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]  →  ADD ([10,A1], D2*16, 20) D1
• RISC era
  • Performance important
  • Only few possible code sequences
  • New implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  • C6
  • TM
  • TTA
• Clustering
• Code generation
  • Compiler basics
  • Mapping and Scheduling of Operations
    • What is scheduling
    • Basic Block Scheduling
    • Extended Basic Block Scheduling
    • Loop Scheduling
• Design Space Exploration: TTA framework

Mapping / Scheduling = placing operations in space and time
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
(Figure: Data Dependence Graph (DDG) of this code, with inputs a, b, 2, z, y, two * nodes, three + nodes and one - node.)

How to map these operations?
• Architecture constraints:
  • One Function Unit
  • All operations single cycle latency
Schedule shown next to the DDG:
    cycle   operation
    1       *
    2       *
    3       +
    4       +
    5       -
    6       +

How to map these operations?
• Architecture constraints:
  • One Add-sub and one Mul unit
  • All operations single cycle latency
Schedule shown next to the DDG:
    cycle   Mul   Add-sub
    1       *     +
    2       *     +
    3             +
    4             -
    5
    6

Pareto graph (solution space)
There are many mapping solutions.
(Figure: scatter plot of mapping solutions, execution time T versus cost.)
Point x is Pareto ⇔ there is no point y for which ∀i: y_i < x_i
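The Pareto condition translates directly into a filter over candidate design points. A minimal Python sketch follows; the function name and the (cost, execution time) values are invented for illustration, and the test follows the slide's wording (strictly smaller in every coordinate):

```python
def pareto_points(points):
    """Keep every point x for which no point y is strictly smaller in all coordinates."""
    def strictly_better(y, x):
        return all(yi < xi for yi, xi in zip(y, x))

    return [x for x in points if not any(strictly_better(y, x) for y in points)]


if __name__ == "__main__":
    # (cost, execution time) pairs; made-up values, not measurements.
    designs = [(1, 9), (2, 6), (3, 7), (4, 3), (5, 4), (6, 2)]
    print(pareto_points(designs))
```

In this sample, (3, 7) is beaten by (2, 6) and (5, 4) by (4, 3), so both drop out and the remaining four points form the Pareto front.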

Scheduling: Overview
Transforming a sequential program into a parallel program:
    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write out the parallel program

Basic Block Scheduling
• Basic block = piece of code which can only be entered from the top (first instruction) and left at the bottom (final instruction)
• Scheduling a basic block = assigning resources and a cycle to every operation
• List scheduling = heuristic scheduling approach, scheduling the operations one by one
  • Time_complexity = O(N), where N is #operations
• Optimal scheduling has Time_complexity = O(exp(N))
• Question: what is a good scheduling heuristic?

Basic Block Scheduling
• Make a Data Dependence Graph (DDG)
• Determine minimal length of the DDG (for the given architecture)
  • minimal number of cycles to schedule the graph (assuming sufficient resources)
• Determine:
  • ASAP (As Soon As Possible) cycle = earliest cycle in which the instruction can be scheduled
  • ALAP (As Late As Possible) cycle = latest cycle in which the instruction can be scheduled
  • Slack of each operation = ALAP - ASAP
  • Priority of operations = f (Slack, #descendants, #register impact, ...)
• Place each operation in the first cycle with sufficient resources
• Notes:
  • Basic block = a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  • Scheduling order is sequential
  • Scheduling priority is determined by the heuristic used, e.g. slack + other contributions

Basic Block Scheduling: determine ASAP and ALAP cycles
We assume all operations are single cycle!
(Figure: DDG with LD, ADD, SUB, NEG and MUL operations on inputs A, B, C and results X, y, z. Each operation is annotated with <ASAP cycle, ALAP cycle>; slack = ALAP - ASAP. Annotations shown: <1,1>, <2,2>, <3,3>, <4,4>, <1,3>, <2,3>, <2,4>, <1,4>.)
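ASAP and ALAP cycles follow from one forward and one backward pass over the DDG. The sketch below applies this to the earlier example code (d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y), since the DDG of the figure above cannot be read back from the text. It assumes single-cycle operations and cycles numbered from 1; the node name `t` for the product 2 * b and all function names are choices made for this sketch.

```python
# Edges (u, v) mean: v needs the result of u.
EDGES = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]
NODES = ["d", "t", "e", "f", "r", "x"]   # listed in a topological order


def asap_alap(nodes, edges):
    """Return (asap, alap, slack) dictionaries for single-cycle operations."""
    preds = {n: [u for u, v in edges if v == n] for n in nodes}
    succs = {n: [v for u, v in edges if u == n] for n in nodes}

    asap = {}
    for n in nodes:                       # forward pass: one cycle after the latest predecessor
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)

    length = max(asap.values())           # minimal schedule length (critical path)

    alap = {}
    for n in reversed(nodes):             # backward pass: one cycle before the earliest successor
        alap[n] = min((alap[s] for s in succs[n]), default=length + 1) - 1

    slack = {n: alap[n] - asap[n] for n in nodes}
    return asap, alap, slack


if __name__ == "__main__":
    asap, alap, slack = asap_alap(NODES, EDGES)
    for n in NODES:
        print(n, (asap[n], alap[n]), "slack", slack[n])
```

Only x = z + y has slack here; every other operation lies on a critical path of length 3.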

Cycle based list scheduling
    proc Schedule(DDG = (V,E))
    beginproc
        ready  = { v | ¬∃ (u,v) ∈ E }
        ready' = ready
        sched  = ∅
        current_cycle = 0
        while sched ≠ V do
            for each v ∈ ready' (select in priority order) do
                if ¬ResourceConfl(v, current_cycle, sched) then
                    cycle(v) = current_cycle
                    sched = sched ∪ {v}
                endif
            endfor
            current_cycle = current_cycle + 1
            ready  = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
            ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
        endwhile
    endproc
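A runnable Python sketch of the same idea, applied to the earlier example (d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y). It is a simplified variant, not the course's scheduler: all delays are one cycle, cycles are numbered from 1, resources are modelled as a count per unit class, and the priority is the slack of each operation (lower first). The names `list_schedule`, `t` (for 2 * b), `fu`, `mul` and `addsub` are choices made for this sketch.

```python
def list_schedule(ops, edges, units, priority, delay=1):
    """Cycle-based list scheduling.

    ops:      {operation: unit class it needs}
    edges:    [(u, v)] meaning v depends on u
    units:    {unit class: number of units}
    priority: {operation: number}, smaller values are scheduled first
    """
    preds = {v: [u for u, w in edges if w == v] for v in ops}
    cycle = {}
    current = 1
    while len(cycle) < len(ops):
        # Ready: unscheduled, and every predecessor has finished by now.
        ready = [v for v in ops if v not in cycle
                 and all(p in cycle and cycle[p] + delay <= current for p in preds[v])]
        free = dict(units)                    # units still free in this cycle
        for v in sorted(ready, key=lambda v: priority[v]):
            if free[ops[v]] > 0:              # no resource conflict
                free[ops[v]] -= 1
                cycle[v] = current
        current += 1
    return cycle


EDGES = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]
SLACK = {"d": 0, "t": 0, "e": 0, "f": 0, "r": 0, "x": 2}

if __name__ == "__main__":
    one_fu = {op: "fu" for op in SLACK}
    print(list_schedule(one_fu, EDGES, {"fu": 1}, SLACK))

    two_units = {"d": "mul", "t": "mul", "e": "addsub",
                 "f": "addsub", "r": "addsub", "x": "addsub"}
    print(list_schedule(two_units, EDGES, {"mul": 1, "addsub": 1}, SLACK))
```

With a single function unit the six operations take six cycles; with one Mul and one Add-sub unit the schedule shrinks to four cycles, in line with the two mapping slides.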

Extended Scheduling Scope: look at the CFG
Code:
    A;
    If cond Then B Else C;
    D;
    If cond Then E Else F;
    G;
CFG: Control Flow Graph (figure: A branches to B and C, which join in D; D branches to E and F, which join in G)
Q: Why enlarge the scheduling scope?

Extended basic block scheduling: Code Motion
    A:  a) add r3, r4, 4
        b) beq . . .
    B:  c) add r1, r1, r2
    C:  d) sub r3, r3, r2
    D:  e) mul r1, r1, r3
Q: Why move code?
• Downward code motions?
  • a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  • c → A, d → A, e → B, e → C, e → A

Possible Scheduling Scopes
• Trace
• Superblock
• Decision tree
• Hyperblock/region

Create and Enlarge Scheduling Scope
(Figure: the CFG with blocks A to G, a Trace selected in it, and the Superblock obtained by tail duplication, with duplicated blocks D', E' and G'.)

Create and Enlarge Scheduling Scope
(Figure: the CFG with blocks A to G, a Hyperblock/region, and the Decision Tree obtained by tail duplication, with duplicated blocks D', E', F', G' and G''.)

Comparing scheduling scopes
(Figure: CFG with blocks A to G.)

Code movement (upwards) within regions: what to check?
(Figure: an add operation moves from the source block up to the destination block, passing intermediate blocks marked I.)
Legend:
• Copy needed
• I: Intermediate block
• Check for off-liveness
• Code movement

Extended basic block scheduling: Code Motion
• A dominates B ⇒ A is always executed before B
  • Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇒ B is always executed after A
  • Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative
(Figure: CFG with blocks A to F.)
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
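Dominator sets can be computed with a simple fixed-point iteration. The sketch below is illustrative only: the CFG behind Q1 to Q4 is a figure that is not available in the text, so it uses the A to G graph of the earlier code example (A; If cond Then B Else C; D; If cond Then E Else F; G) instead. Post-dominators are obtained as dominators of the reversed graph. The function names are made up for this sketch.

```python
def dominators(succ, entry):
    """Iterate dom[n] = {n} | intersection of dom[p] over all predecessors p."""
    nodes = list(succ)
    preds = {n: [p for p in nodes if n in succ[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}      # start from "everything dominates n"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def reverse(succ):
    """Reverse every edge of the graph."""
    return {n: [p for p in succ if n in succ[p]] for n in succ}


CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "F"],
       "E": ["G"], "F": ["G"], "G": []}

if __name__ == "__main__":
    dom = dominators(CFG, "A")
    postdom = dominators(reverse(CFG), "G")
    print("dominators of D:", sorted(dom["D"]))
    print("post-dominators of B:", sorted(postdom["B"]))
```

In this graph B does not dominate D, so moving code from D up into B would also need a copy in C; E does not post-dominate D, so moving code from E up into D would be speculative.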

Scheduling: Loops
Loop Optimizations:
• Loop peeling
• Loop unrolling
(Figure: a loop with blocks A, B, C, D, shown next to its peeled and its unrolled version with copies C' and C''.)

Scheduling: Loops
• Problems with unrolling:
  • Exploits only parallelism within sets of n iterations
  • Iteration start-up latency
  • Code expansion
(Figure: resource utilization over time for basic block scheduling, basic block scheduling and unrolling, and software pipelining.)

Software pipelining
• Software pipelining a loop is:
  • Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  • Moving operations across the backedge
Example: y = a.x, with loop body LD, ML, ST
• Without overlap: 3 cycles/iteration
• Unrolling (3 times): 5/3 cycles/iteration
• Software pipelining: 1 cycle/iteration

Software pipelining (cont'd)
Basic loop scheduling techniques:
• Modulo scheduling (Rau, Lam): the algorithm most used in commercial compilers
  • list scheduling with modulo resource constraints
• Kernel recognition techniques
  • unroll the loop
  • schedule the iterations
  • identify a repeating pattern
  • Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  • fill first cycle of iteration
  • copy this instruction over the backedge

Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
(a) Example loop:
    for (i = 0; i < n; i++)
        A[i+6] = 3 * A[i] - 1;
(b) Code (without loop control):
    ld  r1, (r2)
    mul r3, r1, 3
    sub r4, r3, 1
    st  r4, (r5)
(c) Software pipeline (each column is one iteration, each row one cycle):
    Prologue:   ld
                mul  ld
                sub  mul  ld
    Kernel:     st   sub  mul  ld
    Epilogue:        st   sub  mul
                          st   sub
                               st
• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline

Software pipelining: determine II, the Initiation Interval
Cyclic data dependences for:
    for (i = 0; ...) A[i+6] = 3*A[i] - 1
Operations:
    ld  r1, (r2)
    mul r3, r1, 3
    sub r4, r3, 1
    st  r4, (r5)
Edges are labeled (delay, iteration distance): (1,0) and (0,1) between consecutive operations, and (1,6) from st back to ld.
Constraint for every dependence edge (u,v):
    cycle(v) ≥ cycle(u) + delay(u,v) − II · distance(u,v)
(Figure: timeline of ld_1 ... ld_7, spaced by the Initiation Interval, together with st_1; the st to ld dependence is annotated with −5.)

Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
    MII = max{ ResMinII, RecMinII }
(Slide equations: the resource bound under "Resources", and the cyclic-dependence bound under "Cycles", "Therefore" and "Or".)
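The two bounds can be sketched in Python. The definitions used here are the commonly used ones and are an assumption of this sketch, not a transcription of the slide's equations: ResMinII is, per unit class, the number of operations per iteration divided by the number of units, rounded up; RecMinII is, per dependence cycle, the total delay divided by the total iteration distance, rounded up. The second follows from summing cycle(v) ≥ cycle(u) + delay(u,v) − II · distance(u,v) around a cycle. The unit classes `mem`, `mul`, `alu` and their counts are hypothetical.

```python
from math import ceil


def res_min_ii(uses, units):
    """Resource bound: per unit class, ceil(operations per iteration / units)."""
    return max(ceil(uses[r] / units[r]) for r in uses)


def rec_min_ii(cycles):
    """Recurrence bound: per dependence cycle, ceil(total delay / total distance)."""
    return max(ceil(delay / distance) for delay, distance in cycles)


def min_ii(uses, units, cycles):
    return max(res_min_ii(uses, units), rec_min_ii(cycles))


# Loop body ld, mul, sub, st of A[i+6] = 3*A[i] - 1.
USES = {"mem": 2, "mul": 1, "alu": 1}
# Cycle ld -> mul -> sub -> st -> ld: delays 1+1+1+1 = 4, iteration distance 6.
CYCLES = [(4, 6)]

if __name__ == "__main__":
    print(min_ii(USES, {"mem": 2, "mul": 1, "alu": 1}, CYCLES))  # two memory units
    print(min_ii(USES, {"mem": 1, "mul": 1, "alu": 1}, CYCLES))  # one memory unit
```

With two memory units neither bound exceeds 1, so an II of 1 is not ruled out; with a single memory unit the ld and st of one iteration compete for it and the resource bound rises to 2.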

Let's go back to: The Role of the Compiler
9 steps required to translate an HLL program (see online book chapter):
    1. Front-end compilation
    2. Determine dependencies
    3. Graph partitioning: make multiple threads (or tasks)
    4. Bind partitions to compute nodes
    5. Bind operands to locations
    6. Bind operations to time slots: Scheduling
    7. Bind operations to functional units
    8. Bind transports to buses
    9. Execute operations and perform transports

Division of responsibilities between hardware and compiler
Steps from application to execution. The architecture named next to a step is the one for which the responsibility of the compiler ends after that step; the remaining steps are the responsibility of the hardware.
    (1) Frontend                  Superscalar
    (2) Determine dependencies    Dataflow
    (3) Binding of operands       Multi-threaded
    (4) Scheduling                Indep. Arch
    (5) Binding of operations     VLIW
    (6) Binding of transports     TTA
    (7) Execute

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  • C6
  • TM
  • TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework

Mapping applications to processors: MOVE framework
(Figure: through user interaction, an Optimizer passes architecture parameters to a Parametric compiler and a Hardware generator, and both return feedback. The compiler produces parallel object code and the hardware generator a chip; together they form a TTA based system. Inset: Pareto curve (solution space), exec. time versus cost.)

TTA (MOVE) organization
(Figure: two load/store units, two integer ALUs, a float ALU, integer RF, float RF, boolean RF, instruction unit and immediate unit, connected through sockets to the transport buses; the load/store units connect to Data Memory and the instruction unit to Instruction Memory.)

Code generation trajectory for TTAs
• Frontend: GCC or SUIF (adapted)
(Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. Sequential simulation of the sequential code, with Input/Output, yields Profiling data. The Compiler backend also takes the Architecture description and the Profiling data. The Parallel code runs in Parallel simulation with Input/Output.)

Exploration: TTA resource reduction

Exploration: TTA connectivity reduction
(Figure: execution time versus number of connections removed. Annotations along the curve: "Reducing bus delay", "FU stage constrains cycle time", "Critical connections disappear".)
