Application Specific Instruction Generation for Configurable Processor Architectures. VLSI CAD Lab Computer Science Department, UCLA Led by Jason Cong Yiping Fan , Guoling Han , Zhiru Zhang. Supported by NSF. Outline. Motivation Related Works Problem Statement Proposed Solutions
Application Specific Instruction Generation for Configurable Processor Architectures VLSI CAD Lab Computer Science Department, UCLA Led by Jason Cong Yiping Fan, Guoling Han, Zhiru Zhang Supported by NSF
Outline • Motivation • Related Works • Problem Statement • Proposed Solutions • Experimental Results • Conclusions
Motivation (cont’d) • Flexibility is required to satisfy different requirements and to avoid potential design errors • Application Specific Instruction-set Processors (ASIPs) provide a solution to the tradeoff between efficiency and flexibility • A general-purpose processor + application-specific hardware resources • Base instruction set + customized instructions • The specific hardware resources implement the customized instructions • Either runtime reconfigurable or pre-synthesized • ASIPs have been gaining popularity recently • IFX Carmel 20xx, ARM, Tensilica Xtensa, STM Lx, ARC Cores
Application Specific Instruction-set Processor • Program with basic instruction set I • t1 = a * b; • t2 = b * 0xf0; • t3 = c * 0x12; • t4 = t1 + t2; • t5 = t2 + t3; • t6 = t5 + t4; • Latencies: *: 2 clock cycles, +: 1 clock cycle • Execution time: 9 clock cycles [Figure: data-flow graph over inputs a, b, c, 0xf0, 0x12 with three multiplications and three additions, next to a custom logic block]
Application Specific Instruction-set Processor (cont’d) • Program with extended instructions • t1 = extop1(a, b, 0xf0); • t2 = extop2(b, c, 0xf0, 0x12); • t3 = t1 + t2; • Extended instruction set: I ∪ {extop1, extop2} • Latencies: extops: 2 clock cycles, +: 1 clock cycle • Execution time: 5 clock cycles • Speedup: 9 / 5 = 1.8 [Figure: the same data-flow graph with two multiply-add clusters marked as extop1 and extop2]
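The cycle counts on these two slides can be checked with a short sketch. It assumes strictly sequential execution with no overlap between instructions, which is what the 9-cycle and 5-cycle totals imply. The names `LATENCY`, `basic_program`, and `extended_program` are illustrative and do not come from the slides.

```python
# Per-instruction latencies in clock cycles, as stated on the slides.
LATENCY = {"*": 2, "+": 1, "extop": 2}

# Programs as opcode sequences in program order.
basic_program = ["*", "*", "*", "+", "+", "+"]   # t1..t6
extended_program = ["extop", "extop", "+"]       # t1..t3


def execution_time(program):
    """Sum of latencies, assuming instructions run one after another."""
    return sum(LATENCY[op] for op in program)


t_basic = execution_time(basic_program)
t_extended = execution_time(extended_program)
speedup = t_basic / t_extended
print(t_basic, t_extended, speedup)
```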
Related Works • [Kastner et al, TODAES’02] Template generation + covering • Limitations: the minimum number of templates may not lead to the maximum speedup; architecture constraints are ignored • [Atasu et al, DAC’03] Branch and bound • Limitations: high complexity; instruction reuse is not considered • [Peymandoust et al, ICASAP’03] Instruction selection + instruction mapping • Limitation: minimizes the number of extended instructions
Preliminaries • Control data flow graph (CDFG) • Basic blocks (BBK): each BBK is a DAG, denoted by G(V, E) • Control edges • Cone • A subgraph consisting of node v and its predecessors such that any path connecting a node in the cone and v lies entirely in the cone • K-feasible cone: a cone with at most K inputs • Pattern: a single-output DAG • Trivial pattern • Nontrivial pattern • Associated with execution time, number of I/O, and area [Figure: example DFG with nodes n1 to n6 over inputs a, b, c, 0xf0, 0x12; a trivial pattern annotated with execution time and I/O (2-in, 1-out); a nontrivial pattern annotated with SW execution time, HW execution time, I/O (2-in, 1-out), and area 2; example cone input set {a, b, 0xf0}]
Problem Statement Given: • G(V, E) • The basic instruction set I • Pattern constraints: • (i) Number of inputs: |PI(p_i)| ≤ N_in, ∀i • (ii) Number of outputs: |PO(p_i)| = 1, ∀i • (iii) Total area of the selected patterns does not exceed the area constraint A Objective: • Generate a pattern library P • Map G to the extended instruction set I ∪ P so that the total execution time is minimized
Problem Decomposition • Sub-problem 1. Pattern Enumeration: Generate the set S of all patterns in G(V, E) that satisfy constraints (i) and (ii). • Sub-problem 2. Instruction Set Selection: Select a subset P of S to maximize the potential speedup while satisfying the area constraint. • Sub-problem 3. Application Mapping: Map G(V, E) to I ∪ P so that the total execution time of G is minimized.
Proposed ASIP Compilation Flow [Flow diagram. Inputs: C program, ASIP constraints. Stages: SUIF / CDFG generator, Pattern Generation, Pattern Selection, Application Mapping, Instruction Implementation / ASIP synthesis. Intermediate and final artifacts: CDFG, pattern library, mapped CDFG, implementation]
1. Pattern Enumeration • All possible application-specific instruction patterns should be enumerated • Each pattern is a k-feasible cone • Cut enumeration is used to enumerate all the k-feasible cones [Cong et al, FPGA’99] • Visit nodes in topological order; at each node, merge the cuts of its fan-ins and discard the merged cuts that are not k-feasible
1. Pattern Enumeration (cont’d) • 3-feasible cones, listed by their input sets: • n1: {a, b} • n2: {b, 0xf0} • n3: {c, 0x12} • n4: {n1, n2}, {n1, b, 0xf0}, {n2, a, b}, {a, b, 0xf0} • n5: {n2, n3}, {n2, c, 0x12}, {n3, b, 0xf0}; the merged cut {b, 0xf0, c, 0x12} has 4 inputs and is discarded • n6: {n4, n5}, {n4, n2, n3}, {n5, n1, n2}, {n1, n2, n3} [Figure: example DFG with nodes n1 to n6 over inputs a, b, c, 0xf0, 0x12]
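The merge-and-discard step can be sketched as follows on the example graph. This is a minimal illustration, not the implementation from [Cong et al, FPGA’99]: it assumes that constants count as inputs (as the slide's cut {a, b, 0xf0} suggests), and the names `DFG` and `enumerate_cuts` are made up for this sketch.

```python
from itertools import product

# Example DFG: node -> (operator, fan-ins), listed in topological order.
# Names that are not keys (a, b, c, 0xf0, 0x12) are primary inputs.
DFG = {
    "n1": ("*", ["a", "b"]),
    "n2": ("*", ["b", "0xf0"]),
    "n3": ("*", ["c", "0x12"]),
    "n4": ("+", ["n1", "n2"]),
    "n5": ("+", ["n2", "n3"]),
    "n6": ("+", ["n4", "n5"]),
}


def enumerate_cuts(dfg, k):
    """Return {node: set of frozenset cuts}, each cut having at most k inputs."""
    cuts = {}
    for node, (_, fanins) in dfg.items():
        # Each fan-in contributes either itself or one of its own cuts.
        choices = []
        for f in fanins:
            options = {frozenset([f])}
            options |= cuts.get(f, set())
            choices.append(options)
        merged = set()
        for combo in product(*choices):
            cut = frozenset().union(*combo)
            if len(cut) <= k:  # discard cuts that are not k-feasible
                merged.add(cut)
        cuts[node] = merged
    return cuts


cuts = enumerate_cuts(DFG, k=3)
for node, node_cuts in cuts.items():
    print(node, sorted(sorted(c) for c in node_cuts))
```

For n5 the merge of {b, 0xf0} and {c, 0x12} has four inputs and is dropped, so three cuts remain. For n6 the sketch yields four cuts, including {n1, n2, n3}.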
2. Pattern Selection (1) • Resource cost and execution time can be obtained using a high-level estimation tool • The extended instructions should satisfy the area constraint • If all the enumerated patterns are used: • Optimal code can be generated • But mapping becomes unaffordable • Therefore, heuristically select a set of patterns
2. Pattern Selection (2) • Basic idea: simultaneously consider speedup, occurrence frequency, and area • Speedup • Tsw(p) = |V(p)| • Thw(p) = length of the critical path of scheduled p • Speedup(p) = Tsw(p) / Thw(p) • Occurrence • Some pattern instances may be isomorphic • Graph isomorphism test [Nauty package] • The subgraphs are small, so the isomorphism test is very fast • Gain(p) = Speedup(p) × Occurrence(p) • Example: pattern *+ has Tsw = 3, Thw = 2, Speedup = 1.5 [Figure: example DFG with nodes n1 to n6 over inputs a, b, c, 0xf0, 0x12]
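A small sketch of the speedup and gain computation follows. It assumes the gain is the product of speedup and occurrence count, and it uses a hypothetical pattern (two multipliers feeding one adder) with an assumed hardware latency of one cycle per node. Neither the pattern encoding nor the latencies come from the slides.

```python
# A pattern as a DAG: node -> predecessors inside the pattern.
# Hypothetical example: two multipliers feeding one adder.
pattern = {"m1": [], "m2": [], "add": ["m1", "m2"]}

# Assumed hardware latency of each node in cycles (not from the slides).
hw_latency = {"m1": 1, "m2": 1, "add": 1}


def t_sw(p):
    """Software time as defined on the slide: the number of nodes."""
    return len(p)


def t_hw(p, latency):
    """Critical-path length: the longest latency-weighted path in the DAG."""
    finish = {}

    def finish_time(v):
        if v not in finish:
            start = max((finish_time(u) for u in p[v]), default=0)
            finish[v] = start + latency[v]
        return finish[v]

    return max(finish_time(v) for v in p)


def gain(p, latency, occurrence):
    """Speedup times the number of isomorphic instances of the pattern."""
    return t_sw(p) / t_hw(p, latency) * occurrence


print(t_sw(pattern), t_hw(pattern, hw_latency), gain(pattern, hw_latency, 2))
```

Under these assumptions the pattern has Tsw = 3 and Thw = 2, so its speedup is 1.5, and two occurrences give a gain of 3.0.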
2. Pattern Selection (3) • Selection under the area constraint • Can be formulated as a 0-1 knapsack problem • 0-1 knapsack problem: given n items (patterns) and a weight limit W (area constraint A), where the i-th item (pattern) has value (gain) v_i and weight (area) w_i, select a subset of items that maximizes the total value while the total weight does not exceed W • Optimally solvable by a dynamic programming algorithm
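The knapsack formulation can be sketched with the textbook dynamic program below. It assumes areas are non-negative integers (for example, counts of logic elements), and the gains and areas in the usage lines are made-up numbers, not measurements from the slides.

```python
def select_patterns(gains, areas, area_limit):
    """0-1 knapsack by dynamic programming over integer area budgets.

    Returns (best total gain, sorted indices of the selected patterns).
    """
    n = len(gains)
    # best[i][a] = maximum gain using the first i patterns within area a.
    best = [[0.0] * (area_limit + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g, w = gains[i - 1], areas[i - 1]
        for a in range(area_limit + 1):
            best[i][a] = best[i - 1][a]
            if w <= a and best[i - 1][a - w] + g > best[i][a]:
                best[i][a] = best[i - 1][a - w] + g
    # Trace back which patterns were taken.
    chosen, a = [], area_limit
    for i in range(n, 0, -1):
        if best[i][a] != best[i - 1][a]:
            chosen.append(i - 1)
            a -= areas[i - 1]
    return best[n][area_limit], sorted(chosen)


# Hypothetical gains and areas for four candidate patterns.
gains = [3.0, 4.5, 2.0, 6.0]
areas = [2, 3, 1, 5]
total, chosen = select_patterns(gains, areas, area_limit=6)
print(total, chosen)
```

The table has (n + 1) × (A + 1) entries, so the running time grows with the numeric value of the area limit, which is acceptable when areas are small integers.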
3. Application Mapping (1) • Application mapping covers each node in G(V, E) with the extended instruction set to minimize the execution time. • The execution time of a mapped DAG is defined as the sum of the execution time of the patterns covering the DAG.
3. Application Mapping (2) • Theorem: The application mapping problem is equivalent to the minimum-area technology mapping problem. • Execution time ↔ area • Total area = sum of the areas of the components • Total execution time = sum of the execution times of the patterns • Minimum-area mapping is NP-hard → application mapping is NP-hard • Many minimum-area technology mapping algorithms are available
Minimum-area technology mapping • [Keutzer, DAC’87] Tree decomposition + dynamic programming • [Rudell] [Liao, ICCAD’95] Min-cost binate covering Given: • a Boolean function f with variable set X • a cost function that maps each variable in X to a nonnegative integer Objective: • find an assignment to the variables such that the value of f is 1 and the total cost of the variables assigned 1 is minimized
Binate Covering (1) [Figure: example DFG with nodes n1 to n6 over inputs a, b, c, 0xf0, 0x12]
Binate Covering (2) • The fan-ins of the sink node must be covered by some pattern • Covering clause: p0
Binate Covering (3) • The nodes that generate inputs to p_i must be covered by some other pattern • Covering clause: p2 + p6 + p7 + p10
Binate Covering (4) • p2 → p4 and p2 → p5 • As clauses: (¬p2 + p4)(¬p2 + p5)
Binate Covering (4) • (¬p6 + p4) • (¬p7 + p5)
Binate Covering (5) • f = p0 (p2 + p6 + p7 + p10)(¬p2 + p4)(¬p2 + p5)(¬p6 + p4)(¬p7 + p5)(p1 + p8 + p9 + p11)(¬p1 + p3)(¬p1 + p4)(¬p8 + p3)(¬p9 + p4) • Min-cost cover: p0, p10, p11 with cost 1 + 2 + 2 = 5
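The min-cost cover for this formula can be checked by exhaustive search, which is feasible here because there are only 12 variables. This is a brute-force sketch, not one of the binate covering solvers cited earlier. The costs of p0, p10, and p11 are taken from the slide; the costs of p1 to p9 are assumptions (1 for p1 and p2, 2 for the others).

```python
from itertools import product

NUM_PATTERNS = 12  # p0..p11

# Clauses of f as lists of literals: (i, True) is p_i, (i, False) is ¬p_i.
CLAUSES = [
    [(0, True)],
    [(2, True), (6, True), (7, True), (10, True)],
    [(2, False), (4, True)],
    [(2, False), (5, True)],
    [(6, False), (4, True)],
    [(7, False), (5, True)],
    [(1, True), (8, True), (9, True), (11, True)],
    [(1, False), (3, True)],
    [(1, False), (4, True)],
    [(8, False), (3, True)],
    [(9, False), (4, True)],
]

# Pattern costs (execution times). Only p0, p10, p11 come from the slide;
# the costs of p1..p9 are assumed for illustration.
COST = [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]


def satisfied(assign):
    """True if every clause has at least one literal matching the assignment."""
    return all(any(assign[i] == val for i, val in clause) for clause in CLAUSES)


def min_cost_cover():
    """Try all 2^12 assignments and keep the cheapest satisfying one."""
    best_cost, best_set = None, None
    for assign in product([False, True], repeat=NUM_PATTERNS):
        if satisfied(assign):
            cost = sum(c for c, used in zip(COST, assign) if used)
            if best_cost is None or cost < best_cost:
                best_cost = cost
                best_set = [i for i, used in enumerate(assign) if used]
    return best_cost, best_set


cost, cover = min_cost_cover()
print(cost, ["p%d" % i for i in cover])
```

With the assumed costs, the search returns the cover p0, p10, p11 at cost 5, matching the slide.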
Experimental Results (1) • A commercial reconfigurable system, Nios from Altera, is used to implement the ASIPs • 5 extended instruction formats • Up to 2048 instructions for each format • Several DSP applications are used as benchmarks • Altera’s Quartus II 3.0 is used for synthesis and physical design of the extended instructions
Experimental Results (2) Pattern size vs. number of pattern instances (2-input patterns)
Experimental Results (3) Speedup under different input size constraints • Speedup = Tbasic / Textended • Ideal speedup • pipeline hazard • memory impact
Experimental Results (4) Speedup and resource overhead on Nios implementations
Conclusions • Proposed a set of algorithms for ASIP compilation • The actual performance metric is used as the optimization objective • Reduced the instruction mapping problem to an area-minimization logic covering problem • Operation duplication is considered implicitly • Experiments show encouraging speedup