
Automatically Generating Custom Instruction Set Extensions

Nathan Clark, Wilkin Tang, Scott Mahlke. Workshop on Application Specific Processors.


Presentation Transcript


  1. Automatically Generating Custom Instruction Set Extensions
     Nathan Clark, Wilkin Tang, Scott Mahlke
     Workshop on Application Specific Processors

  2. Problem Statement
     • There is demand for high-performance, low-power special-purpose systems
       • E.g. cell phones, network routers, PDAs
     • One way to achieve these goals is to augment a general-purpose processor with Custom Function Units (CFUs)
       • A CFU combines several primitive operations
     • We propose an automated method for CFU generation

  3. System Overview

  4. Example
     [Figure: dataflow graph with nodes 1 through 8]
     Potential CFUs:
     • 1,3  2,4  2,6  3,4  4,5  5,8  6,7  7,8

  5. Example
     [Figure: dataflow graph with nodes 1 through 8]
     Potential CFUs:
     • 1,3  2,4  2,6  …
     • 1,3,4  2,4,5  2,6,7  …

  6. Example
     [Figure: dataflow graph with nodes 1 through 8]
     Potential CFUs:
     • 1,3  2,4  2,6  …
     • 1,3,4,5  2,4,5,8  2,6,7,8  …
     • 1,3,4,5,8
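The way the candidate list grows across the three Example slides can be sketched as a seed-and-grow enumeration: every edge of the dataflow graph is a two-op candidate, and each candidate is extended by one adjacent node at a time. This is only an illustration, not the authors' discovery algorithm. The edge list is an assumption read off the two-op candidates on the first Example slide, and `enumerate_candidates` and `max_ops` are hypothetical names.

```python
from itertools import chain

# Edges assumed from the two-op candidates listed on the first Example slide;
# the direction (lower number feeds higher number) is a guess.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def neighbors(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def enumerate_candidates(edges, max_ops):
    """Return all connected node sets with 2..max_ops nodes, grouped by size."""
    adj = neighbors(edges)
    level = {frozenset(e) for e in edges}  # size-2 seeds: one per edge
    by_size = {2: level}
    for size in range(3, max_ops + 1):
        grown = set()
        for cand in level:
            # Adding any node adjacent to the candidate keeps it connected.
            frontier = set(chain.from_iterable(adj[n] for n in cand)) - cand
            for n in frontier:
                grown.add(cand | {n})
        by_size[size] = grown
        level = grown
    return by_size


candidates = enumerate_candidates(EDGES, max_ops=5)
for size, group in sorted(candidates.items()):
    print(size, "ops:", len(group), "candidates")
```

The sketch treats the graph as undirected and applies no pruning, so it may admit candidates (such as 2,3,4) that the real system would rank low or reject.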

  7. Characterization
     • Use the macro library to get information on each potential CFU
     • Latency is the sum of each primitive’s latency
     • Area is the sum of each primitive’s macrocell area
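A minimal sketch of this characterization step, assuming a macro library that maps each primitive to a latency and an area. The numbers in `MACRO_LIBRARY` are hypothetical (latency as a fraction of a clock cycle, area in adder-equivalents), and `characterize` is an illustrative name, not part of the authors' tool.

```python
# Hypothetical macro library; these values are for illustration only.
MACRO_LIBRARY = {
    "AND": {"latency": 0.1, "area": 0.1},
    "XOR": {"latency": 0.1, "area": 0.1},
    "ASL": {"latency": 0.1, "area": 0.2},
    "ADD": {"latency": 0.6, "area": 1.0},
}


def characterize(ops):
    """Sum per-primitive latency and area for the ops in a candidate CFU."""
    latency = sum(MACRO_LIBRARY[op]["latency"] for op in ops)
    area = sum(MACRO_LIBRARY[op]["area"] for op in ops)
    return latency, area


latency, area = characterize(["AND", "ADD", "ASL"])
print(f"latency = {latency:.2f} cycles, area = {area:.2f} adders")
```

Summing latencies follows the slide literally; it corresponds to primitives arranged in a chain, and a candidate with parallel branches would need the longest path instead.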

  8. Issues we consider
     • Performance
       • On critical path
       • Cycles saved
     • Cost
       • CFU area
       • Control logic: difficult to measure
       • Decode logic: difficult to measure
       • Register file area: can be amortized
     [Figure: dataflow graph LD, AND, ADD, ASL, ADD, XOR, BR; the AND, ASL, and XOR nodes are annotated with 1 and 0.1, the ADD nodes with 1 and 0.6]

  9. More Issues to Consider
     • IO: number of input and output operands
     • Usability: how well can the compiler use the pattern
     [Figure: dataflow graph with OR, LSL, AND, CMPP]
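The IO issue can be made concrete by counting the distinct values that cross the boundary of a candidate subgraph. The sketch below is an assumption about how such a count might be done, not the authors' implementation; the graph, the node names, and `count_io` are all hypothetical.

```python
# Hypothetical dataflow graph as (producer, consumer) pairs. Names starting
# with "r" stand for register values produced outside the graph.
EDGES = [
    ("r1", "and"), ("r2", "and"),
    ("and", "add"), ("r3", "add"),
    ("and", "xor"), ("r4", "xor"),
    ("add", "store"),
]


def count_io(edges, subgraph):
    """Return (inputs, outputs) for a set of nodes.

    An input is a distinct outside value consumed inside the set; an output
    is a distinct inside node whose result is consumed outside the set.
    """
    sub = set(subgraph)
    inputs = {src for src, dst in edges if dst in sub and src not in sub}
    outputs = {src for src, dst in edges if src in sub and dst not in sub}
    return len(inputs), len(outputs)


n_in, n_out = count_io(EDGES, {"and", "add"})
print(f"{n_in} inputs, {n_out} outputs")
```

A selector could compare these counts against whatever operand limit the target machine imposes; that limit is not stated in the slides.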

  10. Selection
      • Currently use a greedy algorithm
        • Pick the best performance gain / area first
        • Can yield bad selections
      [Figure: dataflow graph with OR, LSL, AND, CMPP]
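A minimal sketch of greedy selection by gain-to-area ratio, including a case where it selects badly. The area budget, the candidate records, and `greedy_select` are assumptions for illustration; the slides do not say what constraint bounds the selection, and the real system may re-evaluate gains after each pick.

```python
def greedy_select(candidates, area_budget):
    """Pick candidates in decreasing gain/area order while they still fit."""
    ranked = sorted(candidates, key=lambda c: c["gain"] / c["area"], reverse=True)
    chosen = []
    remaining = area_budget
    for cand in ranked:
        if cand["area"] <= remaining:
            chosen.append(cand["name"])
            remaining -= cand["area"]
    return chosen


# Hypothetical candidates: gain in cycles saved, area in adder-equivalents.
CANDIDATES = [
    {"name": "small", "gain": 3, "area": 1},   # ratio 3.0
    {"name": "big", "gain": 20, "area": 10},   # ratio 2.0
]

print(greedy_select(CANDIDATES, area_budget=10))
```

With a budget of 10, the greedy pass takes `small` first because of its higher ratio, after which `big` no longer fits, even though `big` alone would have saved far more cycles. This is one way a ratio-based greedy choice can yield a bad selection.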

  11. Case study 1: Blowfish
      • Speedup: 1.24
      • 10 cycles can be compressed down to 2!
      • Cost: ~6 adders
      • 6 inputs, 2 outputs
      • C code this DFG came from:
        r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff;
      [Figure: DFG for this code; node labels as extracted: r65 r70 ADD r76 XOR r81 ADD #-1 AND r891 XOR #16 LSR #255 AND #256 ADD #2 LSL r91 ADD]

  12. Case study 2: ADPCM Decode
      • Speedup: 1.20
      • 3 cycles can be compressed down to 1
      • Cost: ~1.5 adders
      • 2 inputs, 2 outputs
      • C code this DFG came from:
        d = d & 7; if ( d & 4 ) { … }
      [Figure: DFG for this code; node labels as extracted: r16 #7 AND #4 AND #0 CMPP]

  13. Experimental Setup
      • CFU recognition implemented in the Trimaran research infrastructure
      • Speedup shown is with CFUs relative to a baseline machine
        • Four-wide VLIW with predication
        • Can issue at most one integer, floating-point, memory, and branch instruction per cycle
        • 300 MHz clock
      • CFU latency is estimated using standard cells from Synopsys’ design library

  14. Varying the Number of CFUs
      • More CFUs yield more performance
      • Weakness in our selection algorithm causes plateaus

  15. Varying the Number of Ops
      • Bigger CFUs yield better performance
      • If they’re too big, they can’t be used as often and they expose alternate critical paths

  16. Related Work
      • Many people have done this for code size
        • Bose et al., Liao et al.
      • Typically done with traces
        • Arnold et al.
      • Previous paper used a more enumerative discovery algorithm
      • We are unique because:
        • Compiler-based approach
        • Novel analysis of CFUs

  17. Conclusion and Future Work
      • CFUs have the potential to offer big performance gains for small cost
      • Recognize more complex subgraphs
        • Generalized acyclic/cyclic subgraphs
      • Develop our system to automatically synthesize application-tailored coprocessors
