
Asia and South Pacific Design Automation Conference Taipei, Taiwan R.O.C. January 21, 2010

A High-Level Synthesis Flow for Custom Instruction Set Extensions for Application-Specific Processors. Nagaraju Pothineni Google, India. Philip Brisk UC Riverside. Paolo Ienne EPFL. Anshul Kumar IIT Delhi. Kolin Paul IIT Delhi. Asia and South Pacific Design Automation Conference


Presentation Transcript


  1. A High-Level Synthesis Flow for Custom Instruction Set Extensions for Application-Specific Processors. Nagaraju Pothineni (Google, India), Philip Brisk (UC Riverside), Paolo Ienne (EPFL), Anshul Kumar (IIT Delhi), Kolin Paul (IIT Delhi). Asia and South Pacific Design Automation Conference, Taipei, Taiwan R.O.C., January 21, 2010.

  2. Extensible Processors. [Figure: applications pass through a compiler that produces assembly code with ISEs (Instruction Set Extensions). Processor blocks shown: I$, RF, D$, ISE; pipeline stages: Fetch, Decode, Execute, Memory, Write-back.]

  3. Compilation Flow. [Figure: inputs are the application source code and an architecture description. Source-to-source compiler: high-level program optimizations, ISE identification, rewrite source code with ISEs. Its outputs: optimized source code rewritten with ISE calls, and a behavioral HDL description of n ISEs (G1 … Gn). Remaining tools: target-specific compiler, assembler, linker, ISE synthesizer, processor generator. Final outputs: machine code with ISE calls, and a structural HDL description of the processor and n ISEs.]

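  4. ISE Synthesis Flow. [Figure: inputs are the ISEs G1 … Gn, the RF I/O port constraints (R, W), and a clock period constraint. Steps: I/O-constrained Scheduling; Reschedule to Reduce Area; Decompose each ISE into 1-cycle Ops; Resource Allocation and Binding. A final check compares the clock period against the constraint (Yes/No); the flow either finishes (Done) or updates the constraint and repeats.]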

  5. I/O-constrained Scheduling. I/O ports are a resource constraint. Resource-constrained scheduling is NP-complete. Optimal algorithm: [Pozzi and Ienne, CASES ’05]. [Figure: operations A to F scheduled against the register file (RF) read ports RD1, RD2 and write port Wr1.]
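The idea that register-file ports act as a scheduling resource can be illustrated with a small greedy list scheduler. This is only a sketch under simplifying assumptions, not the optimal algorithm of [Pozzi and Ienne, CASES ’05]: every operation is assumed to take one cycle, and the function name, the operation graph, and the port counts in the example are hypothetical.

```python
from collections import defaultdict

def io_constrained_schedule(ops, deps, reads, writes, read_ports, write_ports):
    """Greedy list scheduling under register-file port limits (illustrative).

    ops    -- operation names in topological order
    deps   -- op -> list of predecessor ops
    reads  -- op -> number of RF operands the op reads
    writes -- op -> number of results the op writes back to the RF
    Returns op -> cycle (0-based). Assumes every op takes one cycle.
    """
    reads_used = defaultdict(int)
    writes_used = defaultdict(int)
    cycle_of = {}
    for op in ops:
        r, w = reads.get(op, 0), writes.get(op, 0)
        if r > read_ports or w > write_ports:
            raise ValueError(f"{op} can never fit the port budget")
        # Earliest cycle permitted by data dependencies.
        c = max((cycle_of[p] + 1 for p in deps.get(op, [])), default=0)
        # Delay the op until enough read and write ports are free.
        while reads_used[c] + r > read_ports or writes_used[c] + w > write_ports:
            c += 1
        reads_used[c] += r
        writes_used[c] += w
        cycle_of[op] = c
    return cycle_of

# Hypothetical graph: A, B, C each read one RF operand; F writes one result.
ops = ["A", "B", "C", "D", "E", "F"]
deps = {"D": ["A", "B"], "E": ["C"], "F": ["D", "E"]}
reads = {"A": 1, "B": 1, "C": 1}
writes = {"F": 1}

tight = io_constrained_schedule(ops, deps, reads, writes, read_ports=2, write_ports=1)
loose = io_constrained_schedule(ops, deps, reads, writes, read_ports=3, write_ports=1)
```

With two read ports, C has to wait one cycle for a free port, which pushes E and F back; with three read ports the same graph finishes one cycle earlier.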

  6. ISE Synthesis Flow. [Flow diagram repeated from slide 4.]

  7. Reschedule to Reduce Area. Goal: minimize area. Constraints: do not increase latency or clock period; I/O constraints. Implementation: simulated annealing (details in the paper). [Figure: two schedules of the same ISE, one requiring 2 adders and 1 multiplier, the other 1 adder and 1 multiplier.]
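The area effect of rescheduling can be seen by counting, for each operation type, the largest number of operations placed in any one cycle. The sketch below only evaluates that cost for two fixed schedules; it does not implement the simulated annealing search, and the operation names and schedules are hypothetical.

```python
from collections import Counter

def resources_needed(schedule, op_type):
    """Peak number of functional units of each type used in any single cycle.

    schedule -- op -> cycle
    op_type  -- op -> unit type, e.g. "add" or "mul"
    """
    per_cycle = Counter((cycle, op_type[op]) for op, cycle in schedule.items())
    peak = Counter()
    for (_cycle, kind), count in per_cycle.items():
        peak[kind] = max(peak[kind], count)
    return dict(peak)

# Hypothetical 2-cycle ISE with two independent additions and one multiply.
op_type = {"a1": "add", "a2": "add", "m": "mul"}
before = {"a1": 0, "a2": 0, "m": 1}   # both additions in cycle 0
after = {"a1": 0, "a2": 1, "m": 1}    # a2 moved to cycle 1, same latency

need_before = resources_needed(before, op_type)
need_after = resources_needed(after, op_type)
```

Both schedules finish in two cycles, but the second one lets a single adder serve both additions.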

  8. ISE Synthesis Flow. [Flow diagram repeated from slide 4.]

  9. Decompose each ISE into Single-Cycle Operations. [Figure: an ISE with operations A to E after scheduling, and the same ISE split into 1-cycle ops.]

  10. Decomposition Facilitates Resource Sharing within an ISE. [Figure: 1-cycle ops built from operations A to E.]
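The decomposition step can be sketched as slicing a scheduled dataflow graph by cycle. In this illustration, edges that stay inside one cycle remain part of that 1-cycle op, and edges that cross a cycle boundary are reported separately because their values would have to be held in registers. The graph and schedule are hypothetical, and the function name is not from the paper.

```python
def decompose(schedule, edges):
    """Split a scheduled ISE into per-cycle slices (illustrative).

    schedule -- op -> cycle
    edges    -- list of (producer, consumer) pairs
    Returns (slices, registered): one slice per cycle, plus the edges
    that cross a cycle boundary.
    """
    n_cycles = max(schedule.values()) + 1
    slices = [{"ops": [], "edges": []} for _ in range(n_cycles)]
    registered = []
    for op, cycle in sorted(schedule.items()):
        slices[cycle]["ops"].append(op)
    for u, v in edges:
        if schedule[u] == schedule[v]:
            slices[schedule[u]]["edges"].append((u, v))
        else:
            registered.append((u, v))
    return slices, registered

# Hypothetical ISE: A and B feed C in cycle 0; C and D feed E in cycle 1.
schedule = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
edges = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "E")]
slices, registered = decompose(schedule, edges)
```

Once an ISE is expressed as a list of 1-cycle slices, the slices of one ISE can be merged with each other in the same way as slices from different ISEs.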

  11. ISE Synthesis Flow. [Flow diagram repeated from slide 4.]

  12. Resource Allocation and Binding. Two 1-cycle ISEs can be merged in more than one way. Graph theory: the maximum-cost weighted common isomorphic subgraph problem (NP-complete) and the weighted minimum-cost common supergraph (WMCS) problem (NP-complete). VLSI: the minimum-cost common supergraph requires a multiplexer (2-input operation, multiplexers required); the higher-cost common supergraph needs no multiplexers. Which solution is better? It depends on the cost of the multiplexers compared to the merged operations!
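The trade-off between sharing an operation and paying for multiplexers comes down to a simple area comparison. The sketch below uses made-up area numbers in arbitrary units; the function names and values are assumptions chosen only to show both outcomes.

```python
def separate_area(op_area):
    """Area of two unshared copies of the same operation."""
    return 2 * op_area

def merged_area(op_area, mux2_area, muxed_ports):
    """Area of one shared unit plus a 2-input mux on each port whose sources differ."""
    return op_area + muxed_ports * mux2_area

def sharing_pays_off(op_area, mux2_area, muxed_ports=2):
    """True when the shared implementation is strictly smaller."""
    return merged_area(op_area, mux2_area, muxed_ports) < separate_area(op_area)

# Hypothetical areas (arbitrary units).
MUX2 = 40
big_op = sharing_pays_off(op_area=1000, mux2_area=MUX2)   # e.g. a multiplier
small_op = sharing_pays_off(op_area=30, mux2_area=MUX2)   # e.g. a bitwise AND
```

For the large operation the shared version costs 1080 against 2000, so merging wins; for the small one it costs 110 against 60, so keeping two copies is cheaper.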

  13. Datapath Merging (DPM). Old problem formulation: based on the WMCS problem (graph theory), NP-complete [Bunke et al., ICIP ’02]; share as many operations/interconnects as possible [Moreano et al., TCAD ’05; de Souza et al., JEA ’05]; optimize port assignment as a post-processing step, which is NP-complete [Pangrle, TCAD ’91], using a heuristic [Chen and Cong, ASPDAC ’04]. Contribution: a new problem formulation that accounts for multiplexer cost and port assignment.

  14. New DPM Algorithms. ILP formulation: see the paper for details. Reduction to the Max-Clique problem: extends [Moreano et al., TCAD ’05]; solve the Max-Clique problem optimally using “Cliquer”; identify isomorphic subgraphs up-front; merge isomorphic subgraphs rather than vertices/edges (details in the paper).

  15. Example. A new ISE fragment is merged into a partial merged datapath. Vertex mappings: (1) map v1 onto v4; (2) allocate a new resource r1 and map v1 onto r1. Edge mappings: e1 could map onto e3, (v4, r3), (r1, v5), or (r1, r3), and must be compatible with the vertex mappings! [Figure: vertices v1 to v6, edges e1 to e4, new resources r1 to r3.]

  16. Compatibility. Vertex/vertex compatibility (figure examples marked No and Yes). Vertex/edge compatibility (figure examples marked Yes and No).

  17. Why Edge Mappings? An edge mapping allocates an edge in the merged datapath, which may require a multiplexer.

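  18. Port Assignment. Deterministic for non-commutative operators. NP-complete for every commutative operator [Pangrle, TCAD ’91]. [Figure: a commutative operator with left (L) and right (R) input ports, shown with two port assignments; one is marked “No!” and the other “We want this!”.]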

  19. Edge Mappings = Port Assignment. For a commutative operator, edge e1 has two candidate mappings: e1: (v4, v6, L) and e1: (v4, v6, R). [Figure: vertices v1 to v6, edges e1 to e4.]
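For a single commutative operator, the effect of port assignment on multiplexer size can be sketched as follows. The function compares the straight and the swapped operand order and keeps whichever adds fewer new sources to the left and right input multiplexers. This is a local, greedy choice for one operator only; the slides note that the general problem is NP-complete, and the source names here are hypothetical.

```python
def assign_ports(left_sources, right_sources, a, b):
    """Pick the operand order that adds the fewest new mux inputs (illustrative).

    left_sources / right_sources -- sources already wired to the L and R ports
    a, b -- the two operands of the commutative operation being merged
    Returns ((left_operand, right_operand), number_of_new_mux_inputs).
    """
    def added(left, right):
        return (left not in left_sources) + (right not in right_sources)

    straight = added(a, b)
    swapped = added(b, a)
    if swapped < straight:
        return (b, a), swapped
    return (a, b), straight

# The shared operator already receives x on its L port and y on its R port.
reuse = assign_ports({"x"}, {"y"}, "y", "x")   # swapping reuses both wires
grow = assign_ports({"x"}, {"y"}, "x", "z")    # z is new on the R port
```

In the first call, swapping the operands means no multiplexer grows; in the second, one new input is unavoidable.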

  20. Compatibility Graph. Vertices correspond to mappings. Vertex mappings: weight is 0 for vertex → vertex; weight is the resource cost for vertex → new resource. Edge mappings, including port assignment: weight is 0 if the edge exists in the merged datapath; otherwise the weight is the estimated cost of increasing the mux size by +1. Place edges between compatible mappings. Each max-clique corresponds to a complete binding solution. The goal is to find the max-clique of minimum weight.
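A toy version of the compatibility graph can make the formulation concrete. The sketch below is not the authors' implementation: it replaces “Cliquer” with brute-force enumeration, which is only workable for tiny inputs. The area values `A_OP` and `A_MUX`, the single pre-existing edge, the rule that edges into a newly allocated resource cost nothing, and the exact compatibility rules are all assumptions made for illustration. Choosing one mapping per vertex and edge and requiring pairwise compatibility is equivalent to finding a clique that covers every item.

```python
from itertools import combinations, product

A_OP, A_MUX = 100, 10                      # assumed areas, arbitrary units
EDGES = {"e1": ("v1", "v3"), "e2": ("v2", "v3")}          # new ISE fragment
VERTEX_TARGETS = {"v1": ["v4", "r1"], "v2": ["r2"], "v3": ["v6", "r3"]}
EXISTING = {("v4", "v6", "L")}             # edge assumed already in the datapath

def is_new(target):
    return target.startswith("r")          # r* denotes a newly allocated resource

def build_mappings():
    """Graph vertices: (item, target, weight) for every candidate mapping."""
    maps = [(v, t, A_OP if is_new(t) else 0)
            for v, targets in VERTEX_TARGETS.items() for t in targets]
    for e, (u, v) in EDGES.items():
        for tu, tv, port in product(VERTEX_TARGETS[u], VERTEX_TARGETS[v], "LR"):
            free = (tu, tv, port) in EXISTING or is_new(tv)
            maps.append((e, (tu, tv, port), 0 if free else A_MUX))
    return maps

def compatible(m1, m2):
    """Graph edges: True when two mappings can coexist in one binding."""
    (i1, t1, _), (i2, t2, _) = m1, m2
    if i1 == i2:
        return False                       # one mapping per vertex or edge
    edge1, edge2 = i1 in EDGES, i2 in EDGES
    if not edge1 and not edge2:
        return t1 != t2                    # two vertices cannot share a resource
    if edge1 and edge2:                    # two inputs of one op need distinct ports
        return not (EDGES[i1][1] == EDGES[i2][1] and t1[2] == t2[2])
    if edge1:                              # reorder so (i1, t1) is the vertex mapping
        i1, t1, i2, t2 = i2, t2, i1, t1
    u, v = EDGES[i2]
    return not ((i1 == u and t1 != t2[0]) or (i1 == v and t1 != t2[1]))

def best_binding():
    """Brute-force search for the minimum-weight complete clique."""
    by_item = {}
    for m in build_mappings():
        by_item.setdefault(m[0], []).append(m)
    best = None
    for choice in product(*by_item.values()):
        if all(compatible(a, b) for a, b in combinations(choice, 2)):
            weight = sum(m[2] for m in choice)
            if best is None or weight < best[0]:
                best = (weight, {m[0]: m[1] for m in choice})
    return best

weight, binding = best_binding()
```

Under these assumed costs, the cheapest binding reuses v4 and v6, allocates r2 for v2, keeps e1 on the existing L-port edge, and adds one mux input for e2 on the R port.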

  21. Compatibility Graph. [Figure: compatibility graph for the example, built up over several animation steps. Vertex mappings: fV(v1) = v4, fV(v1) = r1, fV(v2) = r2, fV(v3) = v6, fV(v3) = r3. Edge mappings: fE(e1) = (v4, v6, L), (v4, v6, R), (r1, v6, L), (r1, v6, R), (v4, r3, L), (v4, r3, R), (r1, r3, L), (r1, r3, R); fE(e2) = (r2, v6, L), (r2, v6, R), (r2, r3, L), (r2, r3, R). Each mapping carries a weight w that is 0, Amux(2), or a resource area A.]

  22. Experimental Setup. Internally developed research compiler. 1-cycle ISEs [Atasu et al., DAC ’03]. RF has 2 read ports, 1 write port. Standard cell design flow, 0.18 µm technology node. Five DPM algorithms: Baseline (no resource sharing); ILP (optimal) [this paper]; Our heuristic* [this paper]; Moreano’s heuristic* [Moreano et al., TCAD ’05]; Brisk’s heuristic [Brisk et al., DAC ’04]. * Max-cliques found by “Cliquer”.

  23. 1-cycle ISE Area Savings. [Chart annotations:] Moreano’s heuristic is sometimes competitive! Brisk’s heuristic performed as well as ours for one benchmark! Moreano’s heuristic is not always competitive! Brisk’s heuristic was NOT competitive for three benchmarks! Brisk’s heuristic outperformed Moreano’s for three benchmarks!

  24. Critical Path Delay Increase. [Chart: critical path delay increase (%), our heuristic.]

  25. Runtimes. Baseline: 0. Optimal (ILP): 3-8 hours. Our heuristic: 2-10 minutes. Moreano’s heuristic: ~1 minute. Brisk’s heuristic: < 5-10 seconds.

  26. Experimental Setup. Internally developed research compiler. Multi-cycle ISEs [Pozzi and Ienne, CASES ’05]. RF has 5 read ports, 2 write ports. Standard cell design flow, 0.18 µm technology node. Four versions of our flow: Single-cycle ISE (clock constraint: none; 1-cycle); Full flow (200 MHz; multi-cycle); No rescheduling (200 MHz; multi-cycle); Baseline flow (200 MHz; multi-cycle; resource sharing and binding step disabled).

  27. Single vs. Multi-cycle ISEs. [Chart: area savings (%), with a 0% reference line, for Single-cycle ISE (no clock period constraint), Full flow (200 MHz), and Baseline flow (200 MHz). Annotations: resource sharing across multiple cycles of the same ISE; cost of extra registers.]

  28. Impact of Rescheduling. [Chart: normalized area for Single-cycle ISE (no clock period constraint), Full flow (200 MHz), and No rescheduling (200 MHz).]

  29. Conclusion. HLS flow for ISEs. RF I/O constraints: min-latency scheduling is NP-complete; requires two scheduling steps; rescheduling is important for area reduction. Resource allocation and binding: modeled as a datapath merging problem; new problem formulation (multiplexer cost, port assignment).
