Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths

1. Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths Zhining Huang, Sharad Malik Department of Electrical Engineering Princeton University

2. Dynamically Reconfigurable Datapaths Speed up kernel loops using reconfigurable hardware

3. Outline Application specific programmable platforms Methodology overview and architectural model Datapath design for kernel loops Direct Mapping, Pipelining Reconfigurable datapath design Case studies GSM, MPEG II Conclusion

4. Why programmable platforms? Design cost, time to market Different programmable platforms Bit level: FPGA based Word level: specialized VLIW, coarse grained reconfigurable coprocessors Thread level: Multiple PEs with on-chip communication networks Application Specific Programmable Platforms

5. Application Specific Programmable Platforms (contd.) Goal: Approach the flexibility of GPPs with the efficiency of ASICs Part of the MESCAL project Modern Embedded Systems, Compilers, Architectures and Languages A disciplined effort for application specific programmable platform development

6. Related Research Various reconfigurable coprocessors Garp [Hauser+97], PipeRench [Goldstein+99], Pleiades [Wan+00] Chameleon Systems, Morphics Technology ... General reconfigurable fabrics + compiler Hardware resource, routing, compiler Our approach Design automation of the application specific reconfigurable fabrics Coarse grained dynamically reconfigurable logic

7. Architectural Model RISC + Coarse grained reconfigurable datapath Fixed function units Reconfigurable interconnections

8. Methodology Overview Designing the application specific reconfigurable datapath.

9. Mapping Kernel Loops from C to Hardware Generating a datapath for each kernel loop.

10. Direct Mapping Direct mapping from IR to hardware One instruction to one function unit
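As a rough illustration of the one-instruction-to-one-function-unit idea, the sketch below turns a tiny three-address IR into function units and wires. The `Instr` tuple format, the opcode names, and the `direct_map` helper are assumptions made for this example; the transcript does not describe the actual IR or tool.

```python
from collections import namedtuple

# Hypothetical IR instruction: dest = op(srcs...)
Instr = namedtuple("Instr", "dest op srcs")

def direct_map(instrs):
    """Create one function unit (FU) per IR instruction and wire producers to consumers."""
    fus = []          # (fu_id, op)
    wires = []        # (producer, consumer_fu_id); producer is an fu_id or a live-in name
    producer_of = {}  # value name -> fu_id that computes it
    for fu_id, ins in enumerate(instrs):
        fus.append((fu_id, ins.op))
        for src in ins.srcs:
            # Values not produced inside the loop body are live-ins and keep their name.
            wires.append((producer_of.get(src, src), fu_id))
        producer_of[ins.dest] = fu_id
    return fus, wires

# Illustrative loop body: t3 = ((a * b) + c) >> k
loop_body = [
    Instr("t1", "mul", ("a", "b")),
    Instr("t2", "add", ("t1", "c")),
    Instr("t3", "shr", ("t2", "k")),
]
fus, wires = direct_map(loop_body)
```

Here each of the three instructions gets its own FU, and the wire `(0, 1)` records that the multiplier feeds the adder.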

11. Direct Mapping (contd.) Branch condition transforms

12. Intra-iteration Scheduling Schedule FUs into different pipe stages
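The transcript does not say how FUs are assigned to pipe stages. One plausible approach, sketched here purely as an assumption, is an as-soon-as-possible pass that chains FUs within a stage until the clock period would be exceeded, then starts a new stage. The delays, the period, and the `schedule_stages` helper are all illustrative.

```python
def schedule_stages(fus, preds, period):
    """Assign each FU to a pipe stage.

    fus:    dict name -> combinational delay, listed in topological order
    preds:  dict name -> list of producer FU names
    period: clock period in the same units as the delays
    Assumes every individual FU delay fits within one period.
    """
    stage, finish = {}, {}
    for fu, delay in fus.items():
        ps = preds.get(fu, [])
        # Earliest legal stage is the latest stage among the producers.
        s = max((stage[p] for p in ps), default=0)
        # Producers in earlier stages are registered, so only same-stage ones add delay.
        start = max((finish[p] for p in ps if stage[p] == s), default=0)
        if start + delay > period:
            s, start = s + 1, 0  # does not fit: move to the next stage
        stage[fu], finish[fu] = s, start + delay
    return stage

stages = schedule_stages(
    fus={"mul": 6, "add": 3, "shr": 2},
    preds={"add": ["mul"], "shr": ["add"]},
    period=8,
)
```

With these made-up numbers the multiplier fills the first stage alone, while the adder and shifter chain together in the second.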

13. Inter-iteration Scheduling Pipelining the execution of loop iterations Determine the Initial Interval (II) of a loop datapath

14. Inter-iteration Scheduling (contd.) Data dependence from FU i to FU j across loop iterations Feedback connection II = PipeStage(i) - PipeStage(j) + FU_Delay(j), if II > 0
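A minimal sketch of the feedback-connection formula on this slide follows. Taking the maximum over all loop-carried dependences, using a floor of 1, and measuring FU delay in pipe stages are assumptions of this example; `feedback_ii` is a hypothetical helper name.

```python
def feedback_ii(pipe_stage, fu_delay, feedback_edges):
    """Initial interval implied by loop-carried register dependences.

    pipe_stage:     dict FU name -> pipe stage index
    fu_delay:       dict FU name -> delay in pipe stages (assumed unit)
    feedback_edges: list of (i, j), meaning FU i feeds FU j in a later iteration
    """
    ii = 1  # assumed floor: at best one new iteration per cycle
    for i, j in feedback_edges:
        candidate = pipe_stage[i] - pipe_stage[j] + fu_delay[j]
        if candidate > 0:  # the slide applies the formula only when II > 0
            ii = max(ii, candidate)
    return ii

pipe_stage = {"mul": 0, "acc": 2}
fu_delay = {"mul": 2, "acc": 1}
ii_self = feedback_ii(pipe_stage, fu_delay, [("acc", "acc")])
ii_long = feedback_ii(pipe_stage, fu_delay, [("acc", "acc"), ("acc", "mul")])
```

An accumulator feeding itself gives 2 - 2 + 1 = 1, while feeding its result back to the stage-0 multiplier gives 2 - 0 + 2 = 4, so the longer feedback path dominates.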

15. Inter-iteration Scheduling (contd.) Data dependence on memory access No feedback connections needed II = ⌈[ PipeStage(i) - PipeStage(j) + 1 ] / k⌉ k: distance of dependent iterations, from data dependence analysis
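The memory-dependence formula can be sketched as below, assuming the outer brackets on the slide denote rounding up and that FU i is the memory access that must complete before FU j, k iterations later. The floor of 1 and the name `memory_ii` are assumptions of this example.

```python
def memory_ii(stage_i, stage_j, k):
    """Initial interval implied by a memory dependence of distance k iterations."""
    if k < 1:
        raise ValueError("dependence distance k must be at least 1")
    span = stage_i - stage_j + 1
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-span // k))

# Illustrative case: a store in stage 4 feeding a load in stage 1.
ii_k1 = memory_ii(4, 1, 1)
ii_k2 = memory_ii(4, 1, 2)
ii_k3 = memory_ii(4, 1, 3)
```

A larger dependence distance relaxes the constraint: the same four-stage span yields an interval of 4 at distance 1 but only 2 at distance 2 or 3.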

16. Execution Time Estimation S: total number of pipeline stages in the datapath II: initial interval between the fetches of two consecutive iterations N: number of loop iterations O: configuration overhead W: system write back Example: T = 5 + 2 × (32 - 1) + 4 = 71
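The general formula did not survive in the transcript, but the example has three terms. The sketch below assumes the form T = S + II × (N - 1) + overhead, with O and W lumped into a single overhead value; that grouping is inferred from the example's numbers and is not stated on the slide.

```python
def execution_time(stages, ii, iterations, overhead):
    """Estimated cycles for one kernel loop on the pipelined datapath.

    stages:     S, total pipeline stages (cycles for the first iteration to drain)
    ii:         II, cycles between the starts of consecutive iterations
    iterations: N, number of loop iterations
    overhead:   assumed sum of configuration overhead O and write back W
    """
    return stages + ii * (iterations - 1) + overhead

# Reproduces the slide's example: 5 + 2 * (32 - 1) + 4
t = execution_time(stages=5, ii=2, iterations=32, overhead=4)
```

For long loops the II term dominates, which is why the later slides focus on lowering II rather than on the stage count.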

17. Reconfigurable Datapath Design Embed individual datapaths into a single datapath. Datapath graph Gi Vertices are hardware resources (memories, registers, function units) Edges are connections between them Construct a single graph G such that each Gi ⊆ G and G has the fewest edges and vertices Bipartite matching based algorithm [Huang+ 2001]
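To make the embedding idea concrete, here is a deliberately simplified sketch that merges two datapath graphs by greedily reusing vertices of the same resource type. This greedy pairing is a stand-in chosen for brevity, not the bipartite matching algorithm cited on the slide, and it does not try to minimize edges. The graph encoding and `merge_datapaths` are assumptions of this example.

```python
def merge_datapaths(g1, g2):
    """Merge g2 into g1. Each graph is (vertices, edges) with
    vertices: dict id -> resource type, edges: set of (src_id, dst_id).
    Returns the merged graph and the mapping from g2 ids to merged ids."""
    v1, e1 = g1
    v2, e2 = g2
    merged_v, merged_e = dict(v1), set(e1)
    free = {}  # resource type -> g1 vertex ids not yet reused by g2
    for vid, vtype in v1.items():
        free.setdefault(vtype, []).append(vid)
    mapping = {}
    for vid, vtype in v2.items():
        if free.get(vtype):
            mapping[vid] = free[vtype].pop(0)  # share an existing resource
        else:
            new_id = f"g2_{vid}"               # no match: add a new resource
            merged_v[new_id] = vtype
            mapping[vid] = new_id
    for src, dst in e2:
        merged_e.add((mapping[src], mapping[dst]))  # set removes shared edges
    return (merged_v, merged_e), mapping

g1 = ({"m1": "mul", "a1": "add"}, {("m1", "a1")})
g2 = ({"x": "mul", "y": "add", "z": "shift"}, {("x", "y"), ("y", "z")})
(merged_v, merged_e), mapping = merge_datapaths(g1, g2)
```

In this toy case the multiplier, the adder, and the wire between them are shared, so the merged graph needs only one extra shifter and one extra edge.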

18. Reconfigurable Datapath Merged graph G to reconfigurable datapath Vertices to function units Edges to reconfigurable interconnects By selecting a subset of the interconnections, any of the embedded datapaths can be generated and executed on the reconfigurable datapath Appropriate interconnects in the merged datapath are enabled using configuration bits

19. Routing Useful interconnections are selected Routing box to select between multiple connections Configuration contexts Configuration bits for routing box Control bits for some FU Static registers initialization
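One way to picture routing boxes and configuration contexts is as multiplexers with binary-encoded select fields. The sketch below counts the select bits and builds a per-loop context under that assumption; the port names, the binary encoding, and both helper functions are hypothetical, and FU control bits and register initialization are left out.

```python
def routing_box_bits(port_sources):
    """Total select bits, assuming one binary-encoded mux per FU input port.

    port_sources: dict port name -> list of candidate sources in the merged datapath
    """
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return sum((len(srcs) - 1).bit_length() for srcs in port_sources.values())

def make_context(port_sources, selected):
    """Map each used port to the mux select index for one kernel loop.

    selected: dict port name -> the source this loop's datapath uses
    """
    return {port: port_sources[port].index(src) for port, src in selected.items()}

port_sources = {
    "add.in0": ["mul", "reg_a"],
    "add.in1": ["reg_b", "shift", "mul"],
    "shift.in0": ["add"],  # single source: a plain wire, no routing box needed
}
bits = routing_box_bits(port_sources)
context = make_context(port_sources, {"add.in0": "reg_a", "add.in1": "mul"})
```

A port with a single source costs no bits, which is why heavy overlap between the merged datapaths keeps the context small.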

20. Reconfiguration Overhead Store configuration contexts of limited number of kernel loops in distributed RAMs Fast context switch for reconfigurable fabrics NEC OmniPath [Furuta+00], Chameleon systems Reconfiguration overhead read live-in register set write live-out register set

21. Critical Path and Clock Speed Critical path in the reconfigurable datapath Delay of FU Delay of routing box Delay of directly connected wires Critical path in general processor No longer in FU stage Branch control, decoding stage The clock speed of reconfigurable datapath should be no less than that for a general processor

22. Benchmark Studies MPEG Overall speedup: 3.57 10 kernel loops: 86% execution time Max possible speedup 7.14
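The maximum speedup quoted here appears consistent with Amdahl's law applied to the 86% kernel fraction, since 1 / (1 - 0.86) ≈ 7.14. The sketch below checks that figure and back-solves the average kernel speedup implied by the overall 3.57; the function names are illustrative, and the implied kernel speedup is a derived value rather than one reported on the slide.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: only the kernel fraction of execution time is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

def max_speedup(kernel_fraction):
    """Limit as the kernel speedup grows without bound."""
    return 1.0 / (1.0 - kernel_fraction)

def implied_kernel_speedup(kernel_fraction, overall):
    """Invert Amdahl's law to find the kernel speedup behind an overall speedup."""
    return kernel_fraction / (1.0 / overall - (1.0 - kernel_fraction))

limit = max_speedup(0.86)                      # upper bound for MPEG
kernel = implied_kernel_speedup(0.86, 3.57)    # derived, not reported
print(f"max {limit:.2f}, implied kernel speedup {kernel:.2f}")
```

The remaining 14% of execution time stays on the RISC core, so it caps the overall gain no matter how fast the kernel loops run.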

23. Datapath Mapping Results Significant overlap between datapaths is obtained. Configuration bits: MPEG < 500 bits, GSM < 1000 bits

24. Speed-up vs. Memory Bandwidth Make multiple copies of datapath Constraint: number of memory ports

25. Clustered VLIW machine? Application specific clustered VLIW processor with one instruction per kernel loop Reconfiguration contexts as instructions Interconnections as application specific bypassing networks

26. Reconfigurable Datapath (RD) vs. VLIW

27. Applicable Application Domain computation intensive applications localized operational parallelism a few areas account for most of the execution time

28. Conclusion A methodology for the design of a dynamically reconfigurable datapath coprocessor Kernel loop IR to datapath hardware Datapath hardware merged into reconfigurable hardware MPEG, GSM benchmark case studies Examined reconfigurable datapaths vs. VLIW processors Future research Increasing the datapath pipelining throughput through FU merging Fully automating the process
