
Processor Acceleration Through Automated Instruction Set Customization

This presentation describes a technique for automated instruction set customization that accelerates a processor with application-specific instructions (CFUs). The system identifies candidate subgraphs in an application's data-flow graph, guided by criticality, latency, area, and input/output count, selects which candidates to implement as CFUs, and has the compiler replace matching subgraphs with them. It also analyzes how effectively CFUs built for one application apply to other applications in the same domain.


Presentation Transcript


  1. Processor Acceleration Through Automated Instruction Set Customization • Nathan Clark, Hongtao Zhong, Scott Mahlke • Advanced Computer Architecture Lab, University of Michigan, Ann Arbor • December 3, 2003

  2. Motivation • Cell phones, PDAs, digital cameras, etc. are everywhere • High performance yet low power design point • General core + ASIC solution • Limited post-programmability • General core + application-specific instructions (CFUs) • [Figure: block diagrams labeled CPU/ASIC and CPU/CFU]

  3. What is a CFU? • Combine multiple primitive operations • Smaller code size, fewer RF reads • Increases performance • [Figure: DFG of +, ^, &, <<, |, * operations with subgraphs grouped as CFU 1 and CFU 2]

  4. Automation is Key • This is ¼ of the DFG for a single basic block of blowfish • [Figure: DFG excerpt with nodes such as 159 XOR, 164 SHR, 173 AND]

  5. Related Work • Tensilica Xtensa • Commercial example • MIPS core + manually constructed CFU • Automatic instruction set synthesis is a mature field • See paper for comparison of techniques • Our contributions • Novel technique for automatic CFU creation • System to utilize CFUs in multiple applications • Analysis of how effectively CFUs for one application apply to other applications in the same domain

  6. System Overview • Synthesis: subgraph identification (discover candidates for CFUs; weed out what shouldn’t be picked) and selection (determine which candidates to use as CFUs) • Compilation: subgraph replacement (make use of the CFUs in a range of applications)

  7-9. Subgraph Identification • Grow subgraphs from seed nodes • All nodes are seeds • Most directions don’t make sense • How to decide where to grow? • Make decisions using factors similar to those an architect would weigh • Four factors are considered: criticality, latency, area, input/output • The sum of these factors determines the value of each direction • This step is NOT picking CFUs, only generating candidates • [Figure: DFG with +, *, &, ^, %, <<, | nodes and a growing list of CFU candidates]
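
The growth step on these slides can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `score` callback stands in for the four-factor sum, and the pruning `threshold`, the size cap `max_nodes`, and the toy graph are all assumptions introduced here.

```python
from typing import Callable, Dict, FrozenSet, List, Set

# node -> neighbouring nodes in the DFG (both producers and consumers)
Graph = Dict[str, List[str]]


def grow_candidates(
    graph: Graph,
    score: Callable[[FrozenSet[str], str], float],
    threshold: float,
    max_nodes: int,
) -> Set[FrozenSet[str]]:
    """Grow subgraphs from every seed node, following only the directions
    whose score reaches the threshold (hypothetical pruning rule)."""
    candidates: Set[FrozenSet[str]] = set()
    worklist = [frozenset([n]) for n in graph]  # all nodes are seeds
    while worklist:
        sub = worklist.pop()
        if sub in candidates:
            continue  # already explored this subgraph
        candidates.add(sub)
        if len(sub) >= max_nodes:  # stand-in for external constraints
            continue
        frontier = {m for n in sub for m in graph[n]} - sub
        for nxt in frontier:
            if score(sub, nxt) >= threshold:
                worklist.append(sub | {nxt})
    return candidates


# Toy chain and -> shl -> add; growing toward "add" is made unattractive.
toy_graph: Graph = {"and": ["shl"], "shl": ["and", "add"], "add": ["shl"]}
toy_score = lambda sub, nxt: 10.0 if nxt == "add" else 40.0
found = grow_candidates(toy_graph, toy_score, threshold=30.0, max_nodes=3)
```

Because low-scoring directions are never followed, the number of explored subgraphs stays far below the number of connected subgraphs in a large DFG.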

  10. Critical Path • Combining operations on the critical path will shrink the longer dependence chains • Maximize potential performance gain • Wt = 10 / (slack + 1) • Slack is the number of cycles off the longest dependence path • Examples: 10/(0+1) = 10; 10/(2+1) = 3.33
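
As a worked illustration, the slide's two examples fit a weight of 10 / (slack + 1). The sketch below computes slack on a small DAG and applies that weight. It assumes unit latency for every operation, and the function names and the toy graph are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List


def slack_per_node(succs: Dict[str, List[str]]) -> Dict[str, int]:
    """Slack = cycles a node sits off the longest dependence path,
    assuming every operation takes one cycle (an assumption)."""
    preds: Dict[str, List[str]] = {n: [] for n in succs}
    for n, outs in succs.items():
        for m in outs:
            preds[m].append(n)

    @lru_cache(maxsize=None)
    def depth(n: str) -> int:  # longest chain ending at n, in nodes
        return 1 + max((depth(p) for p in preds[n]), default=0)

    @lru_cache(maxsize=None)
    def height(n: str) -> int:  # longest chain starting at n, in nodes
        return 1 + max((height(s) for s in succs[n]), default=0)

    critical = max(depth(n) + height(n) - 1 for n in succs)
    return {n: critical - (depth(n) + height(n) - 1) for n in succs}


def criticality_weight(slack_cycles: int) -> float:
    """Weight read off the slide's examples: 10 / (slack + 1)."""
    return 10.0 / (slack_cycles + 1)


# a -> b -> c -> d is the critical path; e feeds d with two cycles of slack.
dag = {"a": ["b"], "b": ["c"], "c": ["d"], "e": ["d"], "d": []}
slack = slack_per_node(dag)
weights = {n: round(criticality_weight(s), 2) for n, s in slack.items()}
```

Nodes on the critical path get the full weight of 10, while `e`, two cycles off it, gets 3.33, matching the two values shown on the slide.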

  11. Latency • Growing toward low latency operations allows combination of more nodes in a cycle • Maximize DFG compression • Wt = [formula not captured in transcript] • Examples: 10*0.3/0.36 = 8.33; 10*0.3/0.6 = 5

  12. Area • Want the most benefit for the least area • Wt = [formula not captured in transcript] • Area is the sum of macrocell areas • Examples: 10*0.5/0.5 = 10; 10*0.5/1.5 = 3.33

  13. Input/Output • Want CFUs to use as few RF ports as possible • Smaller encoding • Allow growth of larger candidates • Wt = [formula not captured in transcript] • Examples: 10*2/(4+1) = 4; 10*2/(2+1) = 6.67
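
To make "RF ports" concrete, the following sketch counts the register-file inputs and outputs a candidate subgraph would need. It illustrates the quantity being minimized, not the slide's weight formula. The operand-map representation, the rule that a value with no consumers is live-out, and the example ops are assumptions.

```python
from typing import Dict, List, Set, Tuple


def port_counts(operands: Dict[str, List[str]], subgraph: Set[str]) -> Tuple[int, int]:
    """Return (inputs, outputs) for a candidate subgraph.

    `operands` maps each op to the values it reads: either another op's
    result or a register name such as "r1".
    """
    # Inputs: distinct values read by the subgraph but produced outside it.
    inputs = {src for n in subgraph for src in operands[n] if src not in subgraph}
    # Consumers of each op in the subgraph, anywhere in the block.
    users = {n: [u for u, srcs in operands.items() if n in srcs] for n in subgraph}
    # Outputs: results consumed outside the subgraph, or with no consumer
    # in the block at all (assumed live-out).
    outputs = {
        n for n in subgraph
        if not users[n] or any(u not in subgraph for u in users[n])
    }
    return len(inputs), len(outputs)


block = {
    "shr": ["r1", "r2"],
    "and": ["shr", "r3"],
    "add": ["and", "r4"],
    "xor": ["and", "r1"],
}
small = port_counts(block, {"shr", "and"})
large = port_counts(block, {"shr", "and", "add"})
```

Here growing the candidate from `{shr, and}` to `{shr, and, add}` raises the port requirement from 3 inputs and 1 output to 4 inputs and 2 outputs, which is the kind of growth this factor penalizes.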

  14-24. Example • [Figure sequence: step-by-step growth of a candidate over an example DFG of ^, &, >>, +, << nodes; the early steps show direction weights such as 28.5, 30.8, 35, and 37.5, then 40 and 33.5, then 36] • Finished – Met External Constraints

  25. Set of Candidates • [Figure: the resulting candidate subgraphs, built from ^, <<, &, and + operations]

  26. Avoids Exponential Explosion • [Figure: speedup chart, y-axis from 1.00 to 1.50]

  27. Greedy Selection Heuristic • Use estimates of performance improvement / cost
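
A greedy pass over improvement-per-cost ratios might look like the sketch below. It is a simplification under stated assumptions: cost is taken to be area, selection stops at a fixed `area_budget`, and estimates are not updated after each pick. None of these details are given on the slide, and the candidate names and numbers are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    est_cycles_saved: float  # estimated performance improvement
    area: float              # cost, in the same units as the budget (> 0)


def select_greedy(cands: List[Candidate], area_budget: float) -> List[Candidate]:
    """Pick candidates in decreasing order of improvement / cost,
    skipping any that no longer fit the remaining budget."""
    chosen: List[Candidate] = []
    remaining = area_budget
    ranked = sorted(cands, key=lambda c: c.est_cycles_saved / c.area, reverse=True)
    for c in ranked:
        if c.area <= remaining:
            chosen.append(c)
            remaining -= c.area
    return chosen


pool = [
    Candidate("A", est_cycles_saved=100.0, area=2.0),  # ratio 50
    Candidate("B", est_cycles_saved=90.0, area=1.0),   # ratio 90
    Candidate("C", est_cycles_saved=30.0, area=1.5),   # ratio 20
]
picked = [c.name for c in select_greedy(pool, area_budget=3.0)]
```

With a budget of 3.0 the pass takes B first, then A, and has no room left for C.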

  28. Compiler Replacement • Multiple applications can utilize CFUs • Vflib pattern matcher [Cor ’99] • [Figure: flow labeled Instruction Synthesis, CFU Description, Compiler; a numbered DFG with some nodes collapsed into a CFU]

  29. Experimental Setup • Implemented in the Trimaran toolset • Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem per cycle • CFUs use the Int issue slot • CFU latency/area generated as the sum of each individual macrocell • Pipeline latches were added if CFU latency > 1 clock cycle • 300 MHz clock assumed • No branch or memory instructions in CFUs • Four application domains tested • Audio, Encryption, Image, Network
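
The "sum of each individual macrocell" rule can be illustrated with a small estimator. This sketch handles only a serial chain of operations, and every number in the `MACROCELLS` table is a hypothetical placeholder rather than a value from the authors' library.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical macrocell table: op -> (delay as a fraction of one clock
# cycle, area in arbitrary units). Values are illustrative only.
MACROCELLS: Dict[str, Tuple[float, float]] = {
    "and": (0.06, 0.25),
    "xor": (0.06, 0.25),
    "shl": (0.3, 1.0),
    "add": (0.3, 1.0),
}


def estimate_cfu(chain: List[str]) -> Tuple[int, float]:
    """Return (latency in whole cycles, area) for a serial chain of ops.

    Delay and area are summed over the macrocells; a chain whose summed
    delay exceeds one cycle is treated as pipelined across several.
    """
    delay = sum(MACROCELLS[op][0] for op in chain)
    area = sum(MACROCELLS[op][1] for op in chain)
    return max(1, math.ceil(delay)), area


one_cycle = estimate_cfu(["shl", "and"])
two_cycle = estimate_cfu(["add", "shl", "add", "shl"])
```

Under these placeholder numbers a shift followed by an AND fits in one cycle, while four chained adds and shifts need two cycles and therefore a pipeline latch.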

  30. Native Encryption Results

  31. Encryption Cross Compile

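  32. Generalizing CFUs • Subsumed (Multiple Paths) • Wildcards (Multiple Nodes) • [Figure: a CFU built from >>, |, and + with inputs IN_1, IN_2 and constants 0x8, 0xF; the subsumed variant also allows 0x0 for each constant, and the wildcard variant allows |,& and +,- at the corresponding nodes]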
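
Both kinds of generalization can be mimicked in a few lines. The sketch reads the slide's diagram as ((IN_1 >> 0x8) | 0xF) + IN_2; that operation order, the function name, and the parameter names are assumptions made for illustration.

```python
import operator
from typing import Callable

BinOp = Callable[[int, int], int]


def cfu(
    in1: int,
    in2: int,
    shift: int = 0x8,
    mask: int = 0xF,
    logic: BinOp = operator.or_,   # wildcard node: | or &
    arith: BinOp = operator.add,   # wildcard node: + or -
) -> int:
    """Compute ((in1 >> shift) <logic> mask) <arith> in2."""
    return arith(logic(in1 >> shift, mask), in2)


# Original pattern: ((0x1234 >> 8) | 0xF) + 1
full = cfu(0x1234, 1)

# Subsumed subgraph: with both constants set to 0x0, the shift and the OR
# become identities, so the same hardware executes a plain add.
plain_add = cfu(5, 7, shift=0x0, mask=0x0)

# Wildcards: swapping | for & and + for - matches a differently shaped
# piece of code: ((0x1234 >> 8) & 0xF) - 1
variant = cfu(0x1234, 1, logic=operator.and_, arith=operator.sub)
```

The subsumed case shows why identity constants matter: a smaller subgraph in another application can still map onto the larger CFU.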

  33. Effects of Generalization • [Figure: bar chart of speedup (y-axis 1.0 to 2.0) comparing CFUs and Subsumed Subgraphs for sha, rijn-sha, sha-rijn, rijndael, blowfish, bfish-rijn, rijn-bfish, bfish-sha, sha-bfish]

  34. Conclusions • Developed a two-phase instruction set synthesis system • Guide function removes bad candidates • Greedy selection heuristic • Substantial speedups can be attained with very little die impact • Subsumed subgraphs and wildcarding increase cross-application effectiveness

  35. Questions? http://cccp.eecs.umich.edu

  36. Backup slides

  37. Individual Factors - Blowfish

  38. Individual Factors - Djpeg

  39. Selection • Uses estimates of performance improvement • Greedy heuristic used • [Figure: the example DFG of ^, &, >>, +, << nodes]
