Scaling to the End of Silicon with EDGE Architectures D. Burger, S.W. Keckler, K.S. McKinley, M. Dahlin, L.K. John, C. Lin, C.R. Moore, J. Burrill, R.G. McDonald, W. Yoder and the TRIPS Team (presented by Khalid El-Arini)
Overview • Motivation • High-level architecture description • Compiling for TRIPS • Discussion
Why do we need a new ISA? • For the last 20 years, we have witnessed dramatic improvements in processor performance • Acceleration of clock rates (x86): • 1990: 33 MHz • 2004: 3.4 GHz • Aggressive pipelining responsible for approximately half of performance gain • However, all good things come to an end [Hrishikesh et al., ISCA '02]
Explicit Data Graph Execution • Direct instruction communication • Producer and consumer instructions interact directly • An instruction fires when its inputs are available • Dataflow explicitly represented in hardware • No rediscovery of data dependencies • Higher exposed concurrency • More power-efficient execution
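The firing rule above (an instruction executes as soon as all of its inputs have arrived, with producers sending results directly to consumers) can be sketched as a tiny dataflow interpreter. The instruction format and names here are illustrative, not the actual TRIPS encoding:

```python
# Minimal dataflow-firing sketch: an instruction fires when all of its
# operands have been delivered, and forwards its result directly to the
# consumers that name it (toy model, not the TRIPS ISA).

class Instr:
    def __init__(self, name, op, n_inputs, targets):
        self.name = name            # label for reporting results
        self.op = op                # function combining the operands
        self.operands = {}          # slot -> value, filled by producers
        self.n_inputs = n_inputs
        self.targets = targets      # list of (consumer, slot) pairs

def deliver(instr, slot, value, ready):
    instr.operands[slot] = value
    if len(instr.operands) == instr.n_inputs:
        ready.append(instr)         # all inputs available: instruction fires

def run(block, initial):
    ready, results = [], {}
    for instr, slot, value in initial:   # block inputs (e.g. register reads)
        deliver(instr, slot, value, ready)
    while ready:
        instr = ready.pop()
        out = instr.op(*(instr.operands[s] for s in sorted(instr.operands)))
        results[instr.name] = out
        for consumer, slot in instr.targets:  # direct producer->consumer send
            deliver(consumer, slot, out, ready)
    return results

# (a + b) * c as a producer/consumer graph: add names mul as its target
mul = Instr("mul", lambda x, y: x * y, 2, [])
add = Instr("add", lambda x, y: x + y, 2, [(mul, 0)])
print(run([add, mul], [(add, 0, 2), (add, 1, 3), (mul, 1, 4)]))
# {'add': 5, 'mul': 20}
```

Note that no shared register file or dependence-analysis hardware appears anywhere: the dependence graph is the program.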
TRIPS: An EDGE Architecture • Four goals • Increase in concurrency • Power-efficient high performance • Mitigation of communication delays • Increased flexibility
Block Atomic Execution • Compiler groups instructions into blocks • Called “hyperblocks,” and contain up to 128 instructions • Each block is fetched, executed, and committed atomically • (similar to conventional notion of transactions) • Sequential execution semantics at block level – each block is a megainstruction • Dataflow execution semantics within each block
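The "committed atomically" point can be made concrete with a toy model: writes made inside a block are buffered and become architecturally visible only if the whole block completes, so a fault mid-block leaves no partial state. (Illustrative only; real TRIPS also buffers stores and handles many more cases.)

```python
# Toy model of block-atomic commit: register writes are buffered in
# `writes` and applied to architectural state only when the entire
# block finishes without faulting.

def execute_block(regs, block):
    writes = {}                                  # buffered register writes
    view = lambda r: writes.get(r, regs[r])      # block-local value of a reg
    try:
        for op, dest, s1, s2 in block:
            if op == "add":
                writes[dest] = view(s1) + view(s2)
            elif op == "div":
                writes[dest] = view(s1) // view(s2)   # may fault
            else:
                raise ValueError(op)
    except ZeroDivisionError:
        return regs                              # block aborts: no partial state
    regs = dict(regs)
    regs.update(writes)                          # commit the whole block at once
    return regs

regs = {"R1": 6, "R2": 3, "R3": 0}
print(execute_block(regs, [("add", "R4", "R1", "R2"),    # block commits
                           ("div", "R5", "R1", "R2")]))
print(execute_block(regs, [("add", "R4", "R1", "R2"),    # faults on R3 == 0:
                           ("div", "R5", "R1", "R3")]))  # even R4 is discarded
```

This is the sense in which each block behaves like a single "megainstruction": from outside, it either happened entirely or not at all.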
Hyperblocks and Predication • 128 instructions?! • Predication allows us to hide branches within dataflow graph • Loop unrolling and function inlining also help
TRIPS Instructions • RISC add: • ADD R1, R2, R3 • TRIPS add: • T5: ADD T13, T17 • Compiler statically determines locations of instructions • Block mapping/execution model eliminates need to go through shared data structures (e.g., register file) while executing within a hyperblock
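The rewrite from register form to target form can be sketched mechanically: instead of naming a destination register, each producer names the instructions that consume its result. This is a toy three-address format, not real TRIPS assembly:

```python
# Sketch of register-form -> target-form conversion: each producer is
# rewritten to list the indices of its consumer instructions, so results
# flow directly instead of through a shared register file.

def to_target_form(code):
    """code: list of (op, dest_reg, src1, src2) tuples, in program order."""
    producers = {}                            # register -> producing instr index
    targets = {i: [] for i in range(len(code))}
    for i, (op, dest, s1, s2) in enumerate(code):
        for src in (s1, s2):
            if src in producers:
                targets[producers[src]].append(i)   # direct link, no register
        producers[dest] = i
    return [(op, targets[i]) for i, (op, _, _, _) in enumerate(code)]

risc = [("add", "R1", "R2", "R3"),   # R1 = R2 + R3
        ("mul", "R4", "R1", "R5"),   # R4 = R1 * R5
        ("sub", "R6", "R1", "R4")]   # R6 = R1 - R4
print(to_target_form(risc))
# [('add', [1, 2]), ('mul', [2]), ('sub', [])]
```

The `add` now targets instructions 1 and 2 directly, which is the shape of `T5: ADD T13, T17` on the slide; registers are only needed at block boundaries.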
Compiling for TRIPS • Two new responsibilities • Generating hyperblocks • Spatial scheduling of blocks
Predicated Execution • Naïve implementation: • Route a predicate to every instruction in a predicated basic block • Wide fan-out problem • Better implementations: • Predicate only the first instruction in a chain • Saves power if predicate is false • Predicate only the last instruction in a chain • Hide latency of predicate computation
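The trade-off between the two "better implementations" can be shown with a toy cost model: a dependent chain of n single-cycle instructions whose predicate arrives at cycle p. The numbers are illustrative, not TRIPS measurements:

```python
# Toy model of head- vs tail-predication on a dependent chain of n
# single-cycle instructions; the predicate value arrives at cycle p.

def chain_cost(n, p, strategy, taken):
    """Return (finish_cycle, wasted_instructions)."""
    if strategy == "head":
        # Predicate only the first instruction: nothing fires until the
        # predicate arrives, so no work is wasted (saves power), but the
        # predicate latency is fully exposed.
        return (p + n, 0) if taken else (p, 0)
    if strategy == "tail":
        # Predicate only the last instruction: the first n-1 run
        # speculatively, hiding the predicate latency, but they are
        # wasted work whenever the predicate turns out false.
        return (max(n - 1, p) + 1, 0) if taken else (p, n - 1)
    raise ValueError(strategy)

print(chain_cost(8, 5, "head", True))    # (13, 0)
print(chain_cost(8, 5, "tail", True))    # (8, 0)  predicate latency hidden
print(chain_cost(8, 5, "tail", False))   # (5, 7)  7 instructions squashed
```

Head predication optimizes for power, tail predication for latency; a compiler can choose per chain.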
Spatial Scheduling • Two competing goals • Place independent instructions on different ALUs to increase concurrency • Place instructions near one another to minimize routing distances and communication delays
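The tension between the two goals can be illustrated with a greedy placer: walk the instructions in dependence order and put each one on the free ALU slot with the smallest total Manhattan distance to its producers. This is a toy heuristic on a 4x4 grid, not the TRIPS scheduler:

```python
# Greedy spatial-scheduling sketch: place each instruction on a 4x4 ALU
# grid, minimizing Manhattan routing distance to its producers.

def schedule(deps, grid=4):
    """deps: {instr: [producer, ...]}, listed in topological order."""
    free = {(x, y) for x in range(grid) for y in range(grid)}
    placed = {}
    for instr, producers in deps.items():
        def cost(slot):
            return sum(abs(slot[0] - placed[p][0]) + abs(slot[1] - placed[p][1])
                       for p in producers)
        best = min(free, key=cost)   # independent instrs (no producers) tie at
        placed[instr] = best         # cost 0 and spread over arbitrary slots;
        free.remove(best)            # dependent instrs land near their inputs
    return placed

# c consumes a and b; d consumes c
layout = schedule({"a": [], "b": [], "c": ["a", "b"], "d": ["c"]})
```

A real scheduler must also balance the opposite pull: spreading independent chains across ALUs for concurrency, which a purely distance-minimizing placer ignores.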
Discussion • Now that intermediate results within a hyperblock’s dataflow are directly passed between instructions, how will register allocation be affected? • Compare EDGE compiler/hardware responsibilities with RISC and VLIW