Automatically Generating Custom Instruction Set Extensions
Nathan Clark, Wilkin Tang, Scott Mahlke
Workshop on Application Specific Processors
Problem Statement
• There is demand for high-performance, low-power special-purpose systems
  • E.g. cell phones, network routers, PDAs
• One way to achieve these goals is to augment a general-purpose processor with Custom Function Units (CFUs)
  • Each CFU combines several primitive operations
• We propose an automated method for CFU generation
Example
[Figure: dataflow graph with eight operations, numbered 1 to 8]
Potential CFUs, grown one operation at a time:
• Two operations: 1,3  2,4  2,6  3,4  4,5  5,8  6,7  7,8
• Three operations: 1,3,4  2,4,5  2,6,7  …
• Four operations: 1,3,4,5  2,4,5,8  2,6,7,8  …
• Five operations: 1,3,4,5,8
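The candidate lists on the Example slides are consistent with growing each pattern by one operation along a dataflow edge. The sketch below reproduces that growth under two assumptions that are not stated on the slides: the two-operation pairs are read as directed producer-to-consumer edges, and candidates are simple chains. The names `EDGES` and `enumerate_chains` are hypothetical.

```python
from collections import defaultdict

# Assumption: each two-operation pair on the slide is a directed
# producer -> consumer edge of the dataflow graph.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def enumerate_chains(edges, max_ops):
    """Grow candidate chains one operation at a time along dataflow edges."""
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)

    by_size = {2: [tuple(edge) for edge in edges]}
    for size in range(3, max_ops + 1):
        grown = []
        for chain in by_size[size - 1]:
            # Extend the chain with every consumer of its last operation.
            for nxt in succ[chain[-1]]:
                grown.append(chain + (nxt,))
        by_size[size] = grown
    return by_size


candidates = enumerate_chains(EDGES, max_ops=5)
for size, chains in candidates.items():
    print(size, chains)
```

With these edges the only five-operation chain is 1,3,4,5,8, which matches the last Example slide. A full discovery pass would likely also consider non-chain subgraphs; this sketch does not.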
Characterization
• Use the macro library to get information on each potential CFU
• Latency is the sum of the latencies of its primitives
• Area is the sum of the macrocell areas of its primitives
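A minimal sketch of this characterization step, assuming the macro library can be modeled as a table from opcode to latency and area. The table contents are placeholder values chosen for illustration, not figures from the actual library, and the names `Macro`, `MACRO_LIBRARY`, and `characterize` are hypothetical.

```python
from typing import NamedTuple


class Macro(NamedTuple):
    latency: float  # assumed unit: fraction of a clock cycle
    area: float     # assumed unit: adder equivalents


# Placeholder entries; a real macro library would supply these numbers.
MACRO_LIBRARY = {
    "AND": Macro(latency=0.1, area=0.05),
    "XOR": Macro(latency=0.1, area=0.05),
    "LSL": Macro(latency=0.1, area=0.2),
    "ADD": Macro(latency=0.6, area=1.0),
}


def characterize(ops):
    """Estimate a candidate CFU by summing over its primitive operations."""
    latency = sum(MACRO_LIBRARY[op].latency for op in ops)
    area = sum(MACRO_LIBRARY[op].area for op in ops)
    return Macro(latency, area)


estimate = characterize(["AND", "ADD", "XOR"])
print(estimate)
```

Summing latencies treats the primitives as one serial chain, which is what the slide describes; a pattern with parallel branches would be overestimated by this simple sum.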
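Issues We Consider
• Performance
  • Is the pattern on the critical path?
  • Cycles saved
• Cost
  • CFU area
  • Control logic (difficult to measure)
  • Decode logic (difficult to measure)
  • Register file area (can be amortized)
[Figure: example dataflow graph LD, AND, ADD, ASL, ADD, XOR, BR; AND, ASL, and XOR are each annotated 1 and 0.1, and each ADD is annotated 1 and 0.6]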
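The performance question can be made concrete by comparing the critical path of a dataflow graph before and after a pattern is collapsed into one CFU node. The sketch below is an illustration only: it reuses the eight-operation graph from the Example slides (edge directions assumed), gives every primitive a one-cycle latency, and uses made-up CFU latencies. The function names are hypothetical.

```python
from functools import lru_cache

# Assumption: edges of the Example graph, read as producer -> consumer.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]
UNIT = {node: 1 for node in range(1, 9)}  # assumption: one cycle per primitive


def critical_path(edges, latency):
    """Return the longest latency-weighted path length through a DAG."""
    succ = {node: [] for node in latency}
    for src, dst in edges:
        succ[src].append(dst)

    @lru_cache(maxsize=None)
    def longest_from(node):
        tail = max((longest_from(nxt) for nxt in succ[node]), default=0)
        return latency[node] + tail

    return max(longest_from(node) for node in latency)


def collapse(edges, latency, group, cfu_latency, cfu_name="cfu"):
    """Replace the operations in `group` with a single CFU node.

    The group is assumed to be convex, so collapsing it cannot create a cycle.
    """
    def rename(node):
        return cfu_name if node in group else node

    renamed = {(rename(src), rename(dst)) for src, dst in edges}
    new_edges = [(src, dst) for src, dst in renamed if src != dst]  # drop internal edges
    new_latency = {n: lat for n, lat in latency.items() if n not in group}
    new_latency[cfu_name] = cfu_latency
    return new_edges, new_latency


def cycles_saved(edges, latency, group, cfu_latency):
    before = critical_path(edges, latency)
    after = critical_path(*collapse(edges, latency, group, cfu_latency))
    return before - after


print(cycles_saved(EDGES, UNIT, {3, 4, 5}, cfu_latency=1))
print(cycles_saved(EDGES, UNIT, {1, 3, 4, 5, 8}, cfu_latency=2))
```

Under these assumptions the three-operation pattern saves one cycle, while the five-operation pattern saves none because the path 2, 6, 7 into the CFU becomes the new critical path.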
More Issues to Consider
• IO: number of input and output operands
• Usability: how well can the compiler use the pattern?
[Figure: example pattern with OR, LSL, AND, CMPP]
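One way to count the operands of a candidate pattern is to look at the dataflow edges that cross its boundary. The sketch below does this on the eight-operation Example graph, with edge directions assumed. It only sees values produced by other operations in the graph, so live-in registers and constants are not counted. The names `io_counts` and `fits_ports`, and the port limits used, are hypothetical.

```python
# Assumption: edges of the Example graph, read as producer -> consumer.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def io_counts(edges, group):
    """Count distinct values flowing into and out of a candidate pattern."""
    group = set(group)
    # Producers outside the pattern whose value is consumed inside it.
    inputs = {src for src, dst in edges if dst in group and src not in group}
    # Producers inside the pattern whose value is consumed outside it.
    outputs = {src for src, dst in edges if src in group and dst not in group}
    return len(inputs), len(outputs)


def fits_ports(edges, group, max_inputs, max_outputs):
    """Check a pattern against hypothetical register-file port limits."""
    n_in, n_out = io_counts(edges, group)
    return n_in <= max_inputs and n_out <= max_outputs


print(io_counts(EDGES, {3, 4, 5}))
print(fits_ports(EDGES, {3, 4, 5}, max_inputs=2, max_outputs=1))
```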
Selection
• Currently uses a greedy algorithm
• Picks the candidate with the best ratio of performance gain to area first
• Can yield bad selections
[Figure: example pattern with OR, LSL, AND, CMPP]
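A minimal sketch of ratio-based greedy selection. The slide does not say how the selection is bounded, so the area budget here is an assumption, and the candidate names and numbers are invented to show one way a greedy choice can go wrong.

```python
def greedy_select(candidates, area_budget):
    """Pick candidates in order of gain/area until the budget is exhausted."""
    ranked = sorted(candidates, key=lambda c: c["gain"] / c["area"], reverse=True)
    chosen, used = [], 0.0
    for cand in ranked:
        if used + cand["area"] <= area_budget:
            chosen.append(cand["name"])
            used += cand["area"]
    return chosen


# Invented candidates: gain in cycles saved, area in adder equivalents.
CANDIDATES = [
    {"name": "small", "gain": 3, "area": 2},   # ratio 1.5
    {"name": "big", "gain": 10, "area": 10},   # ratio 1.0
]

print(greedy_select(CANDIDATES, area_budget=10))
```

With a budget of 10, the greedy pass takes "small" first and then has no room for "big", ending with a gain of 3 where choosing "big" alone would have given 10.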
Case Study 1: Blowfish
• Speedup: 1.24
• 10 cycles can be compressed down to 2!
• Cost: ~6 adders
• 6 inputs, 2 outputs
• C code this DFG came from:
    r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff;
[Figure: dataflow graph with operations ADD, XOR, ADD, AND, XOR, LSR, AND, ADD, LSL, ADD; registers r65, r70, r76, r81, r891, r91; constants #-1, #16, #255, #256, #2]
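To make the chain of dependent operations easier to see, the C expression can be restated in Python with each table lookup named separately. This is a restatement for illustration, not code from the benchmark. It assumes `t` is a 32-bit value and `s` is a flat table of at least 1024 entries, as the index offsets suggest. The names `blowfish_update` and `MASK32` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def blowfish_update(r, t, s):
    """Python restatement of the C expression on the Blowfish slide."""
    a = s[t >> 24]                      # top byte of t indexes the first 256 entries
    b = s[0x0100 + ((t >> 16) & 0xFF)]  # LSR, AND, ADD feed the second lookup
    c = s[0x0200 + ((t >> 8) & 0xFF)]
    d = s[0x0300 + (t & 0xFF)]
    # Python integers do not wrap, so the final mask restores 32-bit behavior.
    return r ^ ((((a + b) ^ c) + d) & MASK32)


table = list(range(1024))  # dummy table, for demonstration only
print(blowfish_update(0, 0x01020304, table))
```

Each lookup index is itself a short shift, mask, and add chain, and the four results are then combined by ADD, XOR, ADD, AND, and XOR.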
Case Study 2: ADPCM Decode
• Speedup: 1.20
• 3 cycles can be compressed down to 1
• Cost: ~1.5 adders
• 2 inputs, 2 outputs
• C code this DFG came from:
    d = d & 7; if ( d & 4 ) { … }
[Figure: dataflow graph with operations AND, AND, CMPP; register r16; constants #7, #4, #0]
Experimental Setup
• CFU recognition is implemented in the Trimaran research infrastructure
• Speedup shown is with CFUs, relative to a baseline machine
  • Four-wide VLIW with predication
  • Can issue at most one integer, one floating-point, one memory, and one branch instruction per cycle
  • 300 MHz clock
• CFU latency is estimated using standard cells from Synopsys' design library
Varying the Number of CFUs
• More CFUs yield more performance
• A weakness in our selection algorithm causes plateaus
Varying the Number of Ops
• Bigger CFUs yield better performance
• If they're too big, they can't be used as often and they expose alternate critical paths
Related Work
• Many people have done this for code size
  • Bose et al., Liao et al.
• Typically done with traces
  • Arnold et al.
  • Previous paper used a more enumerative discovery algorithm
• We are unique because of:
  • A compiler-based approach
  • A novel analysis of CFUs
Conclusion and Future Work
• CFUs have the potential to offer a large performance gain at a small cost
• Recognize more complex subgraphs
  • Generalized acyclic/cyclic subgraphs
• Develop our system to automatically synthesize application-tailored coprocessors