Mapping Dataflow Blocks to Distributed Hardware
Behnam Robatmili, Katherine E. Coons, Doug Burger, Kathryn S. McKinley
October 22, 2008
Motivation
• Improve single-threaded code
  • Current designs do not exhaust ILP
  • Single-thread performance matters
• Efficiency is critical moving forward
  • Most energy in high-ILP processors is not consumed by the ALUs
  • EDGE architectures reduce energy overheads, but not communication overheads
EDGE Architectures
• Block-atomic execution (Melvin & Patt 1988)
  • Instruction groups fetch, execute, and commit atomically
• Direct instruction communication (dataflow)
  • Explicitly encode the dataflow graph by specifying each instruction's targets
[Figure: the same code sequence compiled for RISC, where instructions communicate through the register file, and for EDGE, where an atomic unit of instructions sends operands directly to consumers]
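Direct instruction communication can be pictured as each instruction naming the in-block slots of its consumers instead of an output register. The C sketch below is illustrative only: the struct fields, widths, and opcode values are assumptions, not the actual TRIPS/TFlex encoding.

```c
#include <stdint.h>

/* Toy encoding of direct instruction communication: each instruction
 * names the issue-queue slots of its consumers, so results flow
 * producer-to-consumer without an architectural register. */
typedef struct {
    uint8_t opcode;
    uint8_t target0; /* consumer slot for the result (0xFF = none) */
    uint8_t target1; /* optional second consumer slot              */
} edge_inst;

/* Two-instruction dataflow fragment: slot 0's add sends its result
 * directly to slot 1's multiply. */
static const edge_inst block[] = {
    { 0x01 /* ADD */, 1, 0xFF },
    { 0x02 /* MUL */, 0xFF, 0xFF }, /* result leaves the block */
};
```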
Outline • Motivation • Background • Block mapping strategies • Core Selection • Results • Conclusions and future work
TFlex Processors
• Composable, Lightweight Processor (CLP)
• Operating system assigns resources to threads
• Core Fusion = x86-compatible approach with similar goals
[Figure: an array of cores (P) with distributed L2 banks; the OS composes varying numbers of cores into logical processors]
TFlex Cores
[Figure: a single TFlex core's components — instruction queue, register bank, and L1 cache — within the core/L2 array; 1-cycle latency between neighboring cores]
System Components
• Compile time: the compiler turns the application into atomic blocks and attaches mapping hints
• Run time: the operating system allocates cores and reports which cores are available; the hardware block mapper makes the mapping decisions
[Figure: dataflow among Application, Compiler, Operating System, Block Mapper, and Hardware across compile time and run time]
Outline • Motivation • Background • Block mapping strategies • Core selection • Results • Conclusions and future work
Hardware Block Mapper
• Map blocks to cores at runtime
  • Fixed strategies
  • Adaptive strategy
• Map instructions to cores
  • Compiler-generated IDs encode criticality/locality
  • Preserve locality information
• Balance concurrency and communication
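One way to picture the compiler's mapping hints is as a small summary in each block header that the runtime mapper reads before placing the block. This is a hypothetical sketch; the field names and widths are assumptions, not the actual TFlex header format.

```c
#include <stdint.h>

/* Hypothetical per-block summary the compiler could attach so the
 * runtime block mapper need not re-analyze the dataflow graph. */
typedef struct {
    uint16_t num_insts;        /* instructions in the atomic block     */
    uint16_t crit_path_cycles; /* estimated dataflow critical path     */
    uint8_t  ipc_hint;         /* compiler-predicted IPC (concurrency) */
} block_header;
```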
Block Mapper
[Figure: deep mapping places each block on a single core and spreads consecutive blocks across cores; flat mapping spreads each block's instructions across all cores]
Available Concurrency Differs
• High-ILP example block: critical path length 7 cycles, 65 total instructions, max possible IPC 9.3 insts/cycle
• Low-ILP example block: critical path length 54 cycles, 104 total instructions, max possible IPC 1.9 insts/cycle
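The "max possible IPC" figures follow directly from dividing each block's instruction count by its dataflow critical-path length, as this small check reproduces.

```c
#include <stdio.h>

/* Max possible IPC = total instructions / critical path length. */
static double max_ipc(int total_insts, int crit_path_cycles) {
    return (double)total_insts / (double)crit_path_cycles;
}

int main(void) {
    printf("high-ILP block: %.1f insts/cycle\n", max_ipc(65, 7));   /* 9.3 */
    printf("low-ILP block:  %.1f insts/cycle\n", max_ipc(104, 54)); /* 1.9 */
    return 0;
}
```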
Block Mapper
[Figure: deep, flat, and adaptive mapping of example blocks, pairing compiler IPC estimates (1, 1.3, and 2 IPC) with hardware core allocations (1 or 2 cores); adaptive mapping matches the core count to the estimated concurrency]
Adaptive Block Mapping
• Evaluate block concurrency at compile time
• Calculate number of cores at runtime
• Select C available cores
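A minimal sketch of the runtime step, assuming the `ipc_hint` field from the hypothetical block header earlier: request enough cores to cover the predicted IPC at the per-core issue width, capped by the cores the OS made available. The real TFlex heuristic may differ.

```c
/* Adaptive core-count calculation (illustrative policy). */
static int cores_for_block(int ipc_hint, int issue_width, int avail) {
    int want = (ipc_hint + issue_width - 1) / issue_width; /* ceiling */
    if (want < 1)     want = 1;
    if (want > avail) want = avail;
    return want;
}
/* e.g., cores_for_block(2, 1, 16) == 2; cores_for_block(1, 2, 16) == 1 */
```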
Outline • Motivation • Background • Block mapping strategies • Core selection • Results • Conclusions and future work
Block Mapper Improvements
• Instruction mapping (flat and adaptive strategies): to which cores will instructions be mapped?
• Core selection (deep and adaptive strategies): to which cores will blocks be mapped?
Block Mapper: Core Selection
• Blocks may use a subset of cores
• The block mapper must select among available cores for the deep and adaptive strategies
• Minimize inter-block communication
  • Register locality
  • Memory locality
Core Selection Algorithms
• Round-robin (RR)
• Inside-out (IO): blocks in the center have higher priority
• Preferred Location (PL): compiler-generated list of preferred cores
[Figure: priority maps (high to low) over the core array for RR, IO, and PL]
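A hedged sketch of these policies on a 4x4 array (cores 0-15, row major): each policy reduces to a priority order that is scanned for the first free core. The specific orderings below are assumptions for illustration; only the "center cores first" idea for IO is from the slide.

```c
#include <stdint.h>

/* `busy` is a bitmask of occupied cores; return the highest-priority
 * free core in `order`, or -1 if none is available. */
static int select_core(const int order[16], uint16_t busy) {
    for (int i = 0; i < 16; i++)
        if (!(busy & (1u << order[i])))
            return order[i];
    return -1;
}

/* Inside-out: the four center cores (5, 6, 9, 10) come first; the
 * exact ring ordering here is a guess. */
static const int io_order[16] =
    { 5, 6, 9, 10, 1, 2, 4, 7, 8, 11, 13, 14, 0, 3, 12, 15 };

/* Preferred location: a compiler-generated order; these contents are
 * made up. Round-robin would rebuild the order from a rotating start. */
static const int pl_order[16] =
    { 6, 5, 9, 10, 2, 4, 7, 8, 11, 13, 14, 1, 0, 3, 12, 15 };
```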
Hardware Complexity
• Flat and adaptive: additional distributed protocols
• Deep or adaptive with core selection: priority encoder, storage for core allocation status
• Adaptive instruction mapping: per-block reconfigurable mapping
Outline • Motivation • Background • Block mapping strategies • Instruction mapping and core selection • Results • Conclusions and future work
Methodology
• TFlex simulator: added support for different block mapping strategies
• TFlex instruction scheduler: added concurrency hints to block headers
• Evaluated benchmarks: EEMBC, SPEC2000
Results
[Figure: speedup over one dual-issue core for SPEC integer and floating-point benchmarks, for single-issue (SI) and dual-issue (DI) cores crossed with the Flat, Deep, and Adaptive mapping strategies]
Communication Overhead
[Figure: hop counts on 16 single-issue (SI) and dual-issue (DI) cores, as a percentage of total hops under flat mapping, for flat, deep, and adaptive mapping, with deep and adaptive also shown with preferred-location (PL) core selection]
Future Work • New concurrency metrics • Vary the optimization strategy • Group instructions differently • Other granularities of parallelism
Conclusions
• Adaptive mapping incurs less communication than flat mapping
• Adaptive mapping requires more hardware complexity than deep mapping
Backup Slides • Adaptive Block Mapping • Cross-core Communication Effect • Instruction Mapping • Communication Overhead • Encoding Locality • Preserving Locality • Reducing Inter-block Communication • Dual-issue Results • Single-issue Results
Adaptive Block Mapping
• Balance concurrency and communication
  • Exploit concurrency when available
  • Limit communication costs
• Combine hardware and software approaches
  • Software statically summarizes code
  • Hardware uses the static information to map graphs efficiently
Cross-core Communication Effect
[Figure: SPEC benchmarks on 16 dual-issue cores — geomean of speedup over flat mapping (0.9 to 1.6) for Baseline, Perfect Reg, Perfect Mem, Perfect Operand, and Perfect All communication]
Instruction Mapping
• Problem: the compiler determines placement, but the number of cores is unknown at compile time
• Solution: a hardware/software contract preserves locality information across configurations
Encoding Locality
• The compiler encodes 7-bit instruction IDs whose bits capture both locality (which core: row/column) and criticality (which slot in the issue queue)
• The hardware interprets the ID bits based on the number of participating cores
• Example (instruction d, ID 0110000): on 8 cores the top three bits (CRC = 011) select the core and the low four bits (FFFF = 0000) select the issue-queue slot; on 4 cores the top two bits (CR = 01) select the core and the low five bits (FFFFF = 10000) select the slot
[Figure: instruction IDs a=0000000, b=0010000, c=0100000, d=0110000, e=1010000, f=1000000, g=1110000, h=1100000 mapped onto the register files, issue queues, and L1 caches of the core array]
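The contract the slide suggests can be stated compactly: with 2^k participating cores, the hardware uses the top k bits of the fixed 7-bit ID as the core and the remaining 7-k bits as the issue-queue slot. This decoding is inferred from the slide's CRC/CR examples, not taken from a TFlex specification; the sketch below checks it against instruction d.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 7-bit instruction ID into (core, slot) for 2^log2_cores cores. */
static void decode_id(uint8_t id7, int log2_cores, int *core, int *slot) {
    *core = id7 >> (7 - log2_cores);                /* top k bits  */
    *slot = id7 & ((1u << (7 - log2_cores)) - 1);   /* low 7-k bits */
}

int main(void) {
    int core, slot;
    decode_id(0x30 /* d = 0110000 */, 3, &core, &slot); /* 8 cores */
    assert(core == 3 && slot == 0);   /* CRC = 011, FFFF = 0000   */
    decode_id(0x30, 2, &core, &slot);                   /* 4 cores */
    assert(core == 1 && slot == 16);  /* CR = 01, FFFFF = 10000   */
    return 0;
}
```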
Preserving Locality
[Figure: the same instruction IDs (a–h) interpreted under 8-core (CRC FFFF), 4-core (CR FFFFF), and 2-core (C FFFFFF) configurations — instructions that share high-order ID bits stay together as the core count shrinks, preserving locality]
Reducing Inter-block Communication
[Figure: SPEC benchmarks on 16 dual-issue cores]
Dual-Issue Results
[Figure: speedup for individual benchmarks, and the concurrency distribution — the percentage of blocks (0 to 100) run on 1, 2, 4, or more cores]
Single-Issue Results
[Figure: speedup for individual benchmarks, and the concurrency distribution — the percentage of blocks (0 to 100) run on 1, 2, 4, or more cores]
Motivation
• How should blocks be mapped?
  • Limit communication among instructions
  • Exploit concurrency
  • Allocate resources to extract ILP
[Figure: 4x4 array of composable cores]