Mapping Dataflow Blocks to Distributed Hardware
Behnam Robatmili, Katherine E. Coons, Doug Burger, Kathryn S. McKinley
October 22, 2008
Motivation
• Improve single-threaded code
  • Current designs do not exhaust ILP
  • Single-thread performance matters
• Efficiency is critical moving forward
  • Most energy in high-ILP processors is not consumed by the ALUs
  • EDGE architectures reduce energy overheads, but not communication overheads
EDGE Architectures
• Block-atomic execution (Melvin & Patt 1988)
  • Instruction groups fetch, execute, and commit atomically
• Direct instruction communication (dataflow)
  • Explicitly encode the dataflow graph by specifying each instruction's targets
[Figure: the same code sequence compiled for RISC, where instructions communicate through the register file, and for EDGE, where an atomic unit of instructions sends operands directly to consumers]
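Direct instruction communication can be pictured as each instruction naming the in-block slots of its consumers instead of an output register. The C sketch below is illustrative only: the struct fields, widths, and opcode values are assumptions, not the actual TRIPS/TFlex encoding.

```c
#include <stdint.h>

/* Toy encoding of direct instruction communication: each instruction
 * names the issue-queue slots of its consumers, so results flow
 * producer-to-consumer without an architectural register. */
typedef struct {
    uint8_t opcode;
    uint8_t target0; /* consumer slot for the result (0xFF = none) */
    uint8_t target1; /* optional second consumer slot              */
} edge_inst;

/* Two-instruction dataflow fragment: slot 0's add sends its result
 * directly to slot 1's multiply. */
static const edge_inst block[] = {
    { 0x01 /* ADD */, 1, 0xFF },
    { 0x02 /* MUL */, 0xFF, 0xFF }, /* result leaves the block */
};
```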
Outline • Motivation • Background • Block mapping strategies • Core Selection • Results • Conclusions and future work
TFlex Processors
• Composable, Lightweight Processor (CLP)
• Operating system assigns resources to threads
• Core Fusion = x86-compatible approach with similar goals
[Figure: an array of cores (P) with distributed L2 banks; the OS composes varying numbers of cores into logical processors]
TFlex Cores
[Figure: a single TFlex core's components — instruction queue, register bank, and L1 cache — within the core/L2 array; 1-cycle latency between neighboring cores]
System Components
• Compile time: the compiler turns the application into atomic blocks and attaches mapping hints
• Run time: the operating system allocates cores and reports which cores are available; the hardware block mapper makes the mapping decisions
[Figure: dataflow among Application, Compiler, Operating System, Block Mapper, and Hardware across compile time and run time]
Outline • Motivation • Background • Block mapping strategies • Core selection • Results • Conclusions and future work
Hardware Block Mapper
• Map blocks to cores at runtime
  • Fixed strategies
  • Adaptive strategy
• Map instructions to cores
  • Compiler-generated IDs encode criticality/locality
  • Preserve locality information
• Balance concurrency and communication
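One way to picture the compiler's mapping hints is as a small summary in each block header that the runtime mapper reads before placing the block. This is a hypothetical sketch; the field names and widths are assumptions, not the actual TFlex header format.

```c
#include <stdint.h>

/* Hypothetical per-block summary the compiler could attach so the
 * runtime block mapper need not re-analyze the dataflow graph. */
typedef struct {
    uint16_t num_insts;        /* instructions in the atomic block     */
    uint16_t crit_path_cycles; /* estimated dataflow critical path     */
    uint8_t  ipc_hint;         /* compiler-predicted IPC (concurrency) */
} block_header;
```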
Block Mapper
[Figure: deep mapping places each block on a single core and spreads consecutive blocks across cores; flat mapping spreads each block's instructions across all cores]
Available Concurrency Differs
• High-ILP example block: critical path length 7 cycles, 65 total instructions, max possible IPC 9.3 insts/cycle
• Low-ILP example block: critical path length 54 cycles, 104 total instructions, max possible IPC 1.9 insts/cycle
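The "max possible IPC" figures follow directly from dividing each block's instruction count by its dataflow critical-path length, as this small check reproduces.

```c
#include <stdio.h>

/* Max possible IPC = total instructions / critical path length. */
static double max_ipc(int total_insts, int crit_path_cycles) {
    return (double)total_insts / (double)crit_path_cycles;
}

int main(void) {
    printf("high-ILP block: %.1f insts/cycle\n", max_ipc(65, 7));   /* 9.3 */
    printf("low-ILP block:  %.1f insts/cycle\n", max_ipc(104, 54)); /* 1.9 */
    return 0;
}
```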
Block Mapper
[Figure: deep, flat, and adaptive mapping of example blocks, pairing compiler IPC estimates (1, 1.3, and 2 IPC) with hardware core allocations (1 or 2 cores); adaptive mapping matches the core count to the estimated concurrency]
Adaptive Block Mapping
• Evaluate block concurrency at compile time
• Calculate number of cores at runtime
• Select C available cores
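A minimal sketch of the runtime step, assuming the `ipc_hint` field from the hypothetical block header earlier: request enough cores to cover the predicted IPC at the per-core issue width, capped by the cores the OS made available. The real TFlex heuristic may differ.

```c
/* Adaptive core-count calculation (illustrative policy). */
static int cores_for_block(int ipc_hint, int issue_width, int avail) {
    int want = (ipc_hint + issue_width - 1) / issue_width; /* ceiling */
    if (want < 1)     want = 1;
    if (want > avail) want = avail;
    return want;
}
/* e.g., cores_for_block(2, 1, 16) == 2; cores_for_block(1, 2, 16) == 1 */
```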
Outline • Motivation • Background • Block mapping strategies • Core selection • Results • Conclusions and future work
Block Mapper Improvements
• Instruction mapping (flat and adaptive strategies): to which cores will instructions be mapped?
• Core selection (deep and adaptive strategies): to which cores will blocks be mapped?
Block Mapper: Core Selection
• Blocks may use a subset of cores
• The block mapper must select among available cores for the deep and adaptive strategies
• Minimize inter-block communication
  • Register locality
  • Memory locality
Core Selection Algorithms
• Round-robin (RR)
• Inside-out (IO): blocks in the center have higher priority
• Preferred Location (PL): compiler-generated list of preferred cores
[Figure: priority maps (high to low) over the core array for RR, IO, and PL]
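A hedged sketch of these policies on a 4x4 array (cores 0-15, row major): each policy reduces to a priority order that is scanned for the first free core. The specific orderings below are assumptions for illustration; only the "center cores first" idea for IO is from the slide.

```c
#include <stdint.h>

/* `busy` is a bitmask of occupied cores; return the highest-priority
 * free core in `order`, or -1 if none is available. */
static int select_core(const int order[16], uint16_t busy) {
    for (int i = 0; i < 16; i++)
        if (!(busy & (1u << order[i])))
            return order[i];
    return -1;
}

/* Inside-out: the four center cores (5, 6, 9, 10) come first; the
 * exact ring ordering here is a guess. */
static const int io_order[16] =
    { 5, 6, 9, 10, 1, 2, 4, 7, 8, 11, 13, 14, 0, 3, 12, 15 };

/* Preferred location: a compiler-generated order; these contents are
 * made up. Round-robin would rebuild the order from a rotating start. */
static const int pl_order[16] =
    { 6, 5, 9, 10, 2, 4, 7, 8, 11, 13, 14, 1, 0, 3, 12, 15 };
```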
Hardware Complexity
• Flat and adaptive: additional distributed protocols
• Deep or adaptive with core selection: priority encoder, storage for core allocation status
• Adaptive instruction mapping: per-block reconfigurable mapping
Outline • Motivation • Background • Block mapping strategies • Instruction mapping and core selection • Results • Conclusions and future work
Methodology
• TFlex simulator: added support for different block mapping strategies
• TFlex instruction scheduler: added concurrency hints to block headers
• Evaluated benchmarks: EEMBC, SPEC2000
Results
[Figure: speedup over one dual-issue core for SPEC integer and floating-point benchmarks, for single-issue (SI) and dual-issue (DI) cores crossed with the Flat, Deep, and Adaptive mapping strategies]
Communication Overhead
[Figure: hop counts on 16 single-issue (SI) and dual-issue (DI) cores, as a percentage of total hops under flat mapping, for flat, deep, and adaptive mapping, with deep and adaptive also shown with preferred-location (PL) core selection]
Future Work • New concurrency metrics • Vary the optimization strategy • Group instructions differently • Other granularities of parallelism
Conclusions
• Adaptive mapping incurs less communication than flat mapping
• Adaptive mapping requires more hardware complexity than deep mapping
Backup Slides • Adaptive Block Mapping • Cross-core Communication Effect • Instruction Mapping • Communication Overhead • Encoding Locality • Preserving Locality • Reducing Inter-block Communication • Dual-issue Results • Single-issue Results
Adaptive Block Mapping
• Balance concurrency and communication
  • Exploit concurrency when available
  • Limit communication costs
• Combine hardware and software approaches
  • Software statically summarizes code
  • Hardware uses the static information to map graphs efficiently
Cross-core Communication Effect
[Figure: SPEC benchmarks on 16 dual-issue cores — geomean of speedup over flat mapping (0.9 to 1.6) for Baseline, Perfect Reg, Perfect Mem, Perfect Operand, and Perfect All communication]
Instruction Mapping
• Problem: the compiler determines placement, but the number of cores is unknown at compile time
• Solution: a hardware/software contract preserves locality information across configurations
Encoding Locality
• The compiler encodes 7-bit instruction IDs whose bits capture both locality (which core: row/column) and criticality (which slot in the issue queue)
• The hardware interprets the ID bits based on the number of participating cores
• Example (instruction d, ID 0110000): on 8 cores the top three bits (CRC = 011) select the core and the low four bits (FFFF = 0000) select the issue-queue slot; on 4 cores the top two bits (CR = 01) select the core and the low five bits (FFFFF = 10000) select the slot
[Figure: instruction IDs a=0000000, b=0010000, c=0100000, d=0110000, e=1010000, f=1000000, g=1110000, h=1100000 mapped onto the register files, issue queues, and L1 caches of the core array]
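The contract the slide suggests can be stated compactly: with 2^k participating cores, the hardware uses the top k bits of the fixed 7-bit ID as the core and the remaining 7-k bits as the issue-queue slot. This decoding is inferred from the slide's CRC/CR examples, not taken from a TFlex specification; the sketch below checks it against instruction d.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 7-bit instruction ID into (core, slot) for 2^log2_cores cores. */
static void decode_id(uint8_t id7, int log2_cores, int *core, int *slot) {
    *core = id7 >> (7 - log2_cores);                /* top k bits  */
    *slot = id7 & ((1u << (7 - log2_cores)) - 1);   /* low 7-k bits */
}

int main(void) {
    int core, slot;
    decode_id(0x30 /* d = 0110000 */, 3, &core, &slot); /* 8 cores */
    assert(core == 3 && slot == 0);   /* CRC = 011, FFFF = 0000   */
    decode_id(0x30, 2, &core, &slot);                   /* 4 cores */
    assert(core == 1 && slot == 16);  /* CR = 01, FFFFF = 10000   */
    return 0;
}
```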
Preserving Locality
[Figure: the same instruction IDs (a–h) interpreted under 8-core (CRC FFFF), 4-core (CR FFFFF), and 2-core (C FFFFFF) configurations — instructions that share high-order ID bits stay together as the core count shrinks, preserving locality]
Reducing Inter-block Communication
[Figure: SPEC benchmarks on 16 dual-issue cores]
Dual-Issue Results
[Figure: speedup for individual benchmarks, and the concurrency distribution — the percentage of blocks (0 to 100) run on 1, 2, 4, or more cores]
Single-Issue Results
[Figure: speedup for individual benchmarks, and the concurrency distribution — the percentage of blocks (0 to 100) run on 1, 2, 4, or more cores]
Motivation
• How should blocks be mapped?
  • Limit communication among instructions
  • Exploit concurrency
  • Allocate resources to extract ILP
[Figure: 4x4 array of composable cores]