1. Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths Zhining Huang, Sharad Malik
Department of Electrical Engineering
Princeton University
2. Dynamically Reconfigurable Datapaths Speed up kernel loops using reconfigurable hardware
3. Outline Application specific programmable platforms
Methodology overview and architectural model
Datapath design for kernel loops
Direct Mapping, Pipelining
Reconfigurable datapath design
Case studies
GSM, MPEG II
Conclusion
4. Application Specific Programmable Platforms Why programmable platforms?
Design cost, time to market
Different programmable platforms
Bit level: FPGA based
Word level: specialized VLIW, coarse grained reconfigurable coprocessors
Thread level: Multiple PEs with on-chip communication networks
5. Application Specific Programmable Platforms (contd.) Goal: Approach the flexibility of GPPs with the efficiency of ASICs
Part of the MESCAL project
Modern Embedded Systems, Compilers, Architectures and Languages
A disciplined effort for application specific programmable platform development
6. Related Research Various reconfigurable coprocessors
Garp [Hauser+97], PipeRench [Goldstein+99], Pleiades [Wan+00]
Chameleon Systems, Morphics Technology
General reconfigurable fabrics + compiler
Hardware resource, routing, compiler
Our approach
Design automation of the application specific reconfigurable fabrics
Coarse grained dynamically reconfigurable logic
7. Architectural Model RISC + Coarse grained reconfigurable datapath
Fixed function units
Reconfigurable interconnections
8. Methodology Overview Designing the application specific reconfigurable datapath.
9. Mapping Kernel Loops from C to Hardware Generating a datapath for each kernel loop.
10. Direct Mapping Direct mapping from IR to hardware
One instruction to one function unit
11. Direct Mapping (contd.) Branch condition transforms
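The direct mapping step can be illustrated with a minimal sketch. It assumes a hypothetical three-address IR given as (dest, op, src1, src2) tuples; the IR format and all names are illustrative and are not the authors' compiler IR. Each instruction becomes one function unit, and an edge is added from the producer of each source operand.

```python
# Hypothetical three-address IR for one loop body: (dest, op, src1, src2).
ir = [
    ("t1", "mul", "a", "b"),
    ("t2", "add", "t1", "c"),
    ("t3", "shr", "t2", "k"),
]

def direct_map(instructions):
    """Map each IR instruction to one function unit (FU) and wire producers to consumers."""
    fus = []        # one FU per instruction: (fu_id, op)
    edges = []      # (producer, consumer); producer is an FU id or a live-in name
    producer = {}   # value name -> id of the FU that computes it
    for fu_id, (dest, op, *srcs) in enumerate(instructions):
        fus.append((fu_id, op))
        for src in srcs:
            # Values not produced inside the loop body are treated as live-in inputs.
            edges.append((producer.get(src, src), fu_id))
        producer[dest] = fu_id
    return fus, edges

fus, edges = direct_map(ir)
```

Here the three instructions yield three FUs, with FU 0 feeding FU 1 and FU 1 feeding FU 2; the names a, b, c and k remain external inputs.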
12. Intra-iteration Scheduling Schedule FUs into different pipe stages
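One possible way to place FUs into pipe stages is an as-soon-as-possible (ASAP) pass that chains FUs within a stage while their combined delay fits in one clock period. This is a sketch under that assumption, not necessarily the scheduler used in this work; the delays, the clock period, and the function name are hypothetical.

```python
def asap_stages(delays, preds, clock_period):
    """Assign each FU to the earliest pipe stage, chaining FUs while they fit in one clock.

    delays: {fu: delay}, listed in topological order (hypothetical inputs).
    preds:  {fu: [producer FUs within the same iteration]}.
    Assumes every single FU delay is no larger than clock_period.
    """
    stage, finish = {}, {}
    for fu, delay in delays.items():
        s, t = 0, 0.0   # latest (stage, time within stage) at which an input is ready
        for p in preds.get(fu, []):
            s, t = max((s, t), (stage[p], finish[p]))
        if t + delay > clock_period:   # does not fit: start a new pipe stage
            s, t = s + 1, 0.0
        stage[fu], finish[fu] = s, t + delay
    return stage

stages = asap_stages({"mul": 6, "add": 3, "shr": 2},
                     {"add": ["mul"], "shr": ["add"]}, clock_period=8)
```

With these made-up delays, the multiplier fills most of stage 0, so the adder and shifter are chained together in stage 1.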
13. Inter-iteration Scheduling Pipelining the execution of loop iterations
Determine the Initial Interval (II) of a loop datapath
14. Inter-iteration Scheduling (contd.) Data dependence from FU i to FU j across loop iterations
Feedback connection
II = PipeStage(i) - PipeStage(j) + FU_Delay(j), if II > 0
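The feedback bound can be evaluated per dependence, as in the sketch below. Two points are assumptions rather than statements from the slides: that FU_Delay is measured in clock cycles, and that the II of the whole loop is the largest positive bound over all feedback connections (and at least 1). The function names are hypothetical.

```python
def feedback_ii(stage_i, stage_j, fu_delay_j):
    """II bound from one loop-carried dependence FU i -> FU j (feedback connection)."""
    return stage_i - stage_j + fu_delay_j

def loop_ii(dependences):
    """Assumption: the loop's II is the largest positive bound, and at least 1."""
    bounds = [feedback_ii(*d) for d in dependences]
    return max([1] + [b for b in bounds if b > 0])

# (PipeStage(i), PipeStage(j), FU_Delay(j)) for two hypothetical feedback edges
ii = loop_ii([(3, 1, 1), (2, 2, 1)])
```

The first edge gives a bound of 3 and the second a bound of 1, so this example loop would start a new iteration every 3 cycles.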
15. Inter-iteration Scheduling (contd.) Data dependence on memory access
No feedback connections needed
II = ⌈ [PipeStage(i) - PipeStage(j) + 1] / k ⌉
k: distance between the dependent iterations, from data dependence analysis
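The memory-dependence bound differs from the feedback case only in the division by the dependence distance. A minimal sketch, with a hypothetical function name and made-up stage numbers:

```python
def memory_ii(stage_i, stage_j, k):
    """II bound from a memory dependence FU i -> FU j spanning k iterations."""
    if k < 1:
        raise ValueError("dependence distance k must be at least 1")
    numerator = stage_i - stage_j + 1
    return -(-numerator // k)   # integer ceiling of numerator / k

ii = memory_ii(stage_i=5, stage_j=1, k=2)
```

In this example the numerator is 5 and the dependence spans 2 iterations, so the bound is the ceiling of 2.5, which is 3.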
16. Execution Time Estimation S: total # of pipeline stages of the datapath
II: initial interval between the fetch of 2 consecutive iterations
N: number of loop iterations
O: configuration overhead
W: system write back
Example: T = 5 + 2x(32-1) + 4 = 71
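The formula itself did not survive in this transcript. The sketch below assumes the form T = S + II x (N - 1) + overhead, which is consistent with the worked example; how the final 4 cycles split between O and W is not recoverable here, so they are lumped into one hypothetical `overhead` parameter.

```python
def estimate_cycles(stages, ii, iterations, overhead):
    """Assumed form: T = S + II * (N - 1) + overhead, with overhead lumping O and W."""
    return stages + ii * (iterations - 1) + overhead

t = estimate_cycles(stages=5, ii=2, iterations=32, overhead=4)
```

With S = 5, II = 2, N = 32 and 4 cycles of overhead, this reproduces the 71 cycles of the example.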
17. Reconfigurable Datapath Design Embed individual datapaths into a single datapath.
Datapath graph Gi
Vertices are hardware resources (memories, registers, function units)
Edges are connections between them
Construct a single graph G such that each Gi ⊆ G and G has the fewest edges and vertices
Bipartite matching based algorithm [Huang+01]
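The general idea of merging by matching can be sketched as follows. This is a simplified illustration, not the algorithm of the cited work: it matches vertices of the same resource type with a standard augmenting-path maximum bipartite matching, which shares as many vertices as possible but does not try to minimize edges. The graph encoding and all names are hypothetical.

```python
def max_matching(left, right, compatible):
    """Augmenting-path bipartite matching; returns {left_vertex: right_vertex}."""
    match_of_right = {}

    def try_assign(u, seen):
        for v in right:
            if compatible(u, v) and v not in seen:
                seen.add(v)
                if v not in match_of_right or try_assign(match_of_right[v], seen):
                    match_of_right[v] = u
                    return True
        return False

    for u in left:
        try_assign(u, set())
    return {u: v for v, u in match_of_right.items()}

def merge(g1, g2):
    """Merge two datapath graphs given as ({vertex: type}, {(src, dst), ...})."""
    (types1, edges1), (types2, edges2) = g1, g2
    mapping = max_matching(list(types2), list(types1),
                           lambda u, v: types2[u] == types1[v])
    merged_types = dict(types1)
    for u, kind in types2.items():
        if u not in mapping:          # unmatched vertex: add a new resource
            mapping[u] = ("g2", u)
            merged_types[mapping[u]] = kind
    merged_edges = set(edges1) | {(mapping[a], mapping[b]) for a, b in edges2}
    return merged_types, merged_edges

g1 = ({"m1": "mul", "a1": "add"}, {("m1", "a1")})
g2 = ({"x": "mul", "y": "add", "z": "shift"}, {("x", "y"), ("y", "z")})
types, edges = merge(g1, g2)
```

In this toy case the multiplier and adder are shared, the mul-to-add connection is reused, and only the shifter and its input connection are added to the merged graph.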
18. Reconfigurable Datapath Merged graph G to reconfigurable datapath
Vertices to function units
Edges to reconfigurable interconnects
By selecting a subset of the interconnections, any of the individual datapaths can be generated and executed on the reconfigurable datapath
Appropriate interconnects in merged datapath are enabled using configuration bits
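A minimal sketch of how a configuration context might select one datapath out of the merged one. It assumes one enable bit per interconnect with bit positions fixed by sorting the edges; this encoding is hypothetical and is not taken from the slides.

```python
def config_word(merged_edges, datapath_edges):
    """One enable bit per interconnect of the merged datapath (hypothetical encoding)."""
    order = sorted(merged_edges)            # fixed bit position for each interconnect
    missing = set(datapath_edges) - set(order)
    if missing:
        raise ValueError(f"edges not in merged datapath: {sorted(missing)}")
    return [1 if e in datapath_edges else 0 for e in order]

merged = {("mul", "add"), ("add", "shift"), ("mul", "shift")}
loop_a = {("mul", "add"), ("add", "shift")}
loop_b = {("mul", "shift")}
bits_a = config_word(merged, loop_a)
bits_b = config_word(merged, loop_b)
```

Switching between the two kernel loops then amounts to loading a different bit vector, while the function units themselves stay fixed.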
19. Routing Useful interconnections are selected
Routing box to select between multiple connections
Configuration contexts
Configuration bits for routing box
Control bits for some FUs
Static register initialization
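A rough way to count routing-box configuration bits is sketched below. It assumes each routing box behaves like a multiplexer placed in front of a destination that has more than one possible source, so that n sources need ceil(log2(n)) select bits. This model and the names are assumptions, and real routing boxes would be counted per input port.

```python
from collections import defaultdict

def routing_bits(merged_edges):
    """Select bits per context, assuming one multiplexer per destination with >1 source."""
    sources = defaultdict(set)
    for src, dst in merged_edges:
        sources[dst].add(src)
    # An n-input multiplexer needs ceil(log2(n)) select bits; (n - 1).bit_length() gives that.
    return {dst: (len(s) - 1).bit_length() for dst, s in sources.items() if len(s) > 1}

bits = routing_bits({("mul", "add"), ("reg", "add"), ("sub", "add"), ("add", "shift")})
```

Here the adder can be fed from three sources and needs 2 select bits, while the shifter has a single source and needs no routing box.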
20. Reconfiguration Overhead Store configuration contexts of a limited number of kernel loops in distributed RAMs
Fast context switch for reconfigurable fabrics
NEC OmniPath [Furuta+00], Chameleon systems
Reconfiguration overhead
read live-in register set
write live-out register set
21. Critical Path and Clock Speed Critical path in the reconfigurable datapath
Delay of FU
Delay of routing box
Delay of directly connected wires
Critical path in a general-purpose processor
No longer in FU stage
Branch control, decoding stage
The clock speed of the reconfigurable datapath should be no lower than that of a general-purpose processor
22. Benchmark Studies MPEG
Overall speedup: 3.57
10 kernel loops: 86% execution time
Max possible speedup 7.14
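The maximum possible speedup is consistent with Amdahl's law applied to the 86% kernel fraction: 1 / (1 - 0.86) is about 7.14. The sketch below reproduces that limit and, as a derived illustration only (not a figure from the slides), computes the kernel-loop speedup that would correspond to an overall speedup of 3.57. The function names are hypothetical.

```python
def max_speedup(kernel_fraction):
    """Amdahl's law limit when the kernel loops take zero time."""
    return 1.0 / (1.0 - kernel_fraction)

def overall_speedup(kernel_fraction, kernel_speedup):
    """Overall speedup when only the kernel loops are accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

def implied_kernel_speedup(kernel_fraction, overall):
    """Kernel-loop speedup that would yield the given overall speedup."""
    return kernel_fraction / (1.0 / overall - (1.0 - kernel_fraction))

limit = max_speedup(0.86)                       # about 7.14
kernel = implied_kernel_speedup(0.86, 3.57)     # about 6.14
```

Under these assumptions, an overall speedup of 3.57 corresponds to the kernel loops running roughly six times faster, which is half of the Amdahl limit for this benchmark.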
23. Datapath Mapping Results Significant overlap between datapaths is obtained.
Configuration bits: MPEG < 500 bits, GSM < 1000 bits
24. Speed-up vs. Memory Bandwidth Make multiple copies of datapath
Constraint: number of memory ports
25. Clustered VLIW machine? Application specific clustered VLIW processor with one instruction per kernel loop
Reconfiguration contexts as instructions
Interconnections as application specific bypassing networks
26. Reconfigurable Datapath (RD) vs. VLIW
27. Applicable Application Domain Computation-intensive applications
Localized operation level parallelism
A few regions of the code account for most of the execution time
28. Conclusion A methodology for the design of a dynamically reconfigurable datapath coprocessor
Kernel loop IR to datapath hardware
Datapath hardware merged into reconfigurable hardware
MPEG, GSM benchmark case studies
Examined reconfigurable datapaths vs. VLIW processors
Future research
Increasing the datapath pipelining throughput through FU merging
Fully automating the process