Scaling to the End of Silicon with EDGE Architectures: The TRIPS Architecture
Doug Burger et al., University of Texas at Austin
Presented by Jason D. Bakos
EDGE Architecture
• The most popular clichés in computer architecture:
  • "[Microarchitectures | ISAs] are always driven by the [capabilities | challenges] of the available fabrication technology."
  • "We are reaching a fundamental limit in [transistor size | clock speed | pipeline depth | voltage reduction]!"
  • "Future architectures will be (on-chip) communication-bound."
  • "Future architectures must strike a balance between static, dynamic, and programmer-specified parallelism."
  • "Future architectures will be constrained by a power budget."
EDGE Architecture
• EDGE (Explicit Data Graph Execution) addresses weaknesses in modern architectures
• Traditional architectures rediscover ILP at run time using:
  • Dynamic out-of-order schedulers
  • Multiple-issue units
  • Branch predictors
  • Register renaming/bypass networks
  • Reorder buffers
  • Large multi-ported register files
• This is not efficient:
  • The compiler already does much of this work (AST, CFG, DFG) but has no way to convey it to the microarchitecture
  • Does re-discovering parallelism at run time actually expose more of it?
  • These structures are power hungry
  • Shared structures become scaling bottlenecks
EDGE Architecture
• Goals:
  • Develop an ISA that lets the compiler convey the dataflow graph (DFG) directly to the architecture
  • Use aggressive compiler techniques to map large DFGs onto the execution array
  • Achieving good utilization requires extracting large DFGs from the code
• Problem: DFGs model data dependence, not control
  • The CFG must be merged into the DFG
  • This involves flattening control structures (loop unrolling, inlining)
  • Predicated instructions from both paths of a branch can then execute in parallel (see the if-conversion sketch below)
• Use direct instruction-to-instruction communication for explicit parallelism, eliminating the need for a global register file to hold temporary values
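A minimal sketch of the if-conversion idea behind merging control flow into the DFG, using an assumed toy example: both arms of a branch become computations guarded by a predicate, so they can sit in the same block with no control-flow edge, and a select keeps the result whose predicate holds. Python is used purely for illustration; TRIPS expresses this with predicated dataflow instructions, not source-level code.

```python
# Toy illustration of if-conversion: the branch
#
#     if (a > b) z = a - b; else z = b - a;
#
# becomes two predicated computations plus a select, so both arms
# live in the same dataflow block and no control-flow edge remains.

def original(a, b):
    if a > b:             # control dependence: only one arm executes
        return a - b
    return b - a

def if_converted(a, b):
    p = a > b             # predicate computed as ordinary data
    t = a - b             # "then" arm, predicated on p
    f = b - a             # "else" arm, predicated on not p
    return t if p else f  # select keeps the result whose predicate holds

assert original(7, 3) == if_converted(7, 3) == 4
assert original(3, 7) == if_converted(3, 7) == 4
```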
TRIPS
• TRIPS: Tera-op Reliable Intelligently-adaptive Processing System
• Concurrency:
  • Scalable issue width and instruction window size
  • Array of concurrently executing ALUs
  • Temporary register values are not written back to the register file
  • Control structures are encapsulated/flattened within an instruction block (single entry, multiple exits)
• Power:
  • Amortizes the overheads of sequential instruction execution over blocks of up to 128 instructions
• Communication delays:
  • Compile-time placement of instructions onto ALUs
• Flexibility:
  • Reconfigurable memory banks
Compilation
• For compiling C code:
  • Block-atomic execution of blocks of up to 128 instructions
  • Each block maps onto the 4x4 ALU array (8 instruction slots per ALU)
• TRIPS instructions do not encode their source operands; they encode their consumers
  • RISC: ADD R1, R2, R3
  • TRIPS: ADD T1, T2 (T1 and T2 are the consuming instructions the result is forwarded to)
• An instruction fires as soon as its operands arrive (see the dataflow-firing sketch below)
• This eliminates trips through shared structures such as a scheduler or the register file (except for loads/stores)
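A minimal sketch, assuming a toy instruction format, of the two ideas on this slide: instructions name their consumers rather than their source registers, and each instruction fires as soon as all of its operands have arrived. The block layout, opcodes, and slot names are illustrative only and do not match the real TRIPS encoding.

```python
# Toy dataflow block: each instruction names the consumer slots its result
# feeds (target form) instead of naming source registers (operand form).
# An instruction fires once all of its operand slots have been filled.

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

# slot -> (opcode, number of operands, list of (consumer slot, operand index))
block = {
    "i0": ("add", 2, [("i2", 0)]),   # result feeds operand 0 of i2
    "i1": ("mul", 2, [("i2", 1)]),   # result feeds operand 1 of i2
    "i2": ("add", 2, []),            # block output (no consumers here)
}

def run_block(block, inputs):
    """inputs: {(slot, operand_index): value} injected from registers/loads."""
    pending = {slot: {} for slot in block}
    ready = list(inputs.items())
    results = {}
    while ready:
        (slot, idx), value = ready.pop()
        pending[slot][idx] = value
        op, arity, targets = block[slot]
        if len(pending[slot]) == arity:          # all operands arrived: fire
            result = OPS[op](pending[slot][0], pending[slot][1])
            results[slot] = result
            for tgt in targets:                  # forward directly to consumers
                ready.append((tgt, result))
    return results

# (2 + 3) + (4 * 5) == 25, with no shared register file for the temporaries
out = run_block(block, {("i0", 0): 2, ("i0", 1): 3, ("i1", 0): 4, ("i1", 1): 5})
assert out["i2"] == 25
```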
Block Compilation
• The compiler schedules instructions along the critical path onto the same ALUs
• Branches produce boolean (predicate) signals rather than changing control flow within the block
• The core is re-mapped to the next block when all of the current block's writes complete
Compiling for TRIPS
• Superscalar architectures spend too much power rediscovering parallelism at run time
• VLIW architectures rely too heavily on the compiler and cannot exploit additional parallelism discovered at run time:
  • They do not schedule across control instructions
  • A single stalled operation stalls the entire instruction word
Compiling for TRIPS
• The compiler forms large instruction blocks with no internal control flow
• Instructions within each block are placed to minimize inter-ALU communication latency (see the placement sketch below)
• Each block holds up to 128 instructions: 8 instruction slots at each of the 16 nodes
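A minimal sketch, under assumed names and a toy dependence graph, of placement that tries to minimize inter-ALU communication: a greedy pass walks instructions in dependence order and puts each one on the free slot of the 4x4 grid closest (in Manhattan distance, a stand-in for operand-network hops) to its producers. The real TRIPS scheduler is considerably more sophisticated; this only illustrates the objective, and note how a dependent chain naturally lands on one ALU, matching the critical-path placement described above.

```python
# Greedy placement of a small dependence graph onto a 4x4 grid of ALUs,
# each with 8 instruction slots, minimizing the Manhattan distance
# (a proxy for operand-network hops) from each instruction to its producers.

GRID, SLOTS = 4, 8

def place(instrs, deps):
    """instrs: list in dependence order; deps: instr -> list of producers."""
    placement = {}                      # instr -> (row, col)
    load = {(r, c): 0 for r in range(GRID) for c in range(GRID)}

    def cost(node, instr):
        producers = [placement[p] for p in deps.get(instr, []) if p in placement]
        return sum(abs(node[0] - r) + abs(node[1] - c) for r, c in producers)

    for instr in instrs:
        free = [n for n, used in load.items() if used < SLOTS]
        best = min(free, key=lambda n: cost(n, instr))
        placement[instr] = best
        load[best] += 1
    return placement

# A short dependence chain plus one independent instruction.
instrs = ["a", "b", "c", "d"]
deps = {"b": ["a"], "c": ["b"], "d": []}   # a -> b -> c, d independent
for instr, node in place(instrs, deps).items():
    print(instr, "->", node)
```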
Other Issues
• Utilization of ALUs? Placement of instructions in ALU buffers (FIFO)?
• Size limitations on forming merged CFG/DFGs?
• Thread-level parallelism with TRIPS?
• Dual-purpose memory banks ("scratchpads")?
• Dynamic code translation?