This paper presents a novel technique for automated instruction set customization to accelerate processor performance. The system identifies candidate subgraphs, weighing criticality, latency, area, and input/output count, then selects which candidates become CFUs and uses them during compilation. It also analyzes how well CFUs built for one application apply to other applications in the same domain.
Processor Acceleration Through Automated Instruction Set Customization
Nathan Clark, Hongtao Zhong, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan, Ann Arbor
December 3, 2003
Motivation • Cell phones, PDAs, digital cameras, etc. are everywhere • High-performance yet low-power design point • General core + ASIC solution • Limited post-programmability • General core + application-specific instructions (CFUs) • [Figure: a CPU paired with an ASIC, and a CPU paired with a CFU]
What is a CFU? • Combine multiple primitive operations • Smaller code size, fewer RF reads • Increases performance • [Figure: a DFG of primitive operations (+, ^, &, <<, |, *) in which two subgraphs are replaced by CFU 1 and CFU 2]
Automation is Key • This is ¼ of the DFG for a single basic block of blowfish • [Figure: DFG excerpt; visible node labels include 159 XOR, 164 SHR, 173 AND]
Related Work • Tensilica Xtensa • Commercial example • MIPS core + manually constructed CFU • Automatic instruction set synthesis is a mature field • See paper for comparison of techniques • Our contributions • Novel technique for automatic CFU creation • System to utilize CFUs in multiple applications • Analysis of how effectively CFUs for one application apply to other applications in the same domain
System Overview • Synthesis • Subgraph identification • Discover candidates for CFUs • Weed out what shouldn’t be picked • Selection • Determine which candidates to use as CFUs • Compilation • Subgraph replacement • Make use of the CFUs in a range of applications
Subgraph Identification • Grow subgraphs from seed nodes • All nodes are seeds • Most directions don’t make sense • How to decide where to grow? • Make decisions using factors an architect would weigh • Take 4 factors into consideration • Criticality, Latency, Area, Input/Output • Sum of these factors determines the value of each direction • This step only builds the list of CFU candidates; it is NOT picking CFUs • [Figure: a seed node growing across a DFG of +, *, &, ^, %, << and | nodes, with grown subgraphs added to a CFU Candidates list]
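The growth step on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the DFG is a plain adjacency dict, `score_direction` is a hypothetical stand-in for the four-factor sum, and `threshold` and `max_nodes` are hypothetical stopping constraints.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[str, Set[str]]  # node -> neighbouring nodes (producers and consumers)


def grow_candidates(
    dfg: Graph,
    score_direction: Callable[[FrozenSet[str], str], float],
    threshold: float,
    max_nodes: int,
) -> Set[FrozenSet[str]]:
    """Grow subgraphs from every seed node, following only worthwhile directions."""
    candidates: Set[FrozenSet[str]] = set()
    # Every node is a seed (assumption: single-node seeds are also recorded).
    worklist: List[FrozenSet[str]] = [frozenset([n]) for n in dfg]
    while worklist:
        sub = worklist.pop()
        if sub in candidates:
            continue  # this subgraph was already explored
        candidates.add(sub)
        if len(sub) >= max_nodes:
            continue  # hypothetical external constraint on candidate size
        # Directions are the nodes adjacent to the subgraph but not yet in it.
        frontier = set().union(*(dfg[n] for n in sub)) - sub
        for node in frontier:
            if score_direction(sub, node) >= threshold:
                worklist.append(sub | {node})
    return candidates
```

Pruning low-scoring directions keeps the number of explored subgraphs small; without the threshold test, the loop would enumerate every connected subgraph up to `max_nodes`.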
Critical Path • Combining operations on the critical path will shrink the longer dependence chains • Maximize potential performance gain • Wt = 10 / (slack + 1) • Slack is the number of cycles off the longest dependence path • Worked examples from the figure: 10/(0+1) = 10 and 10/(2+1) = 3.33 • [Figure: example DFG of ^, &, >>, + and << nodes]
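The two worked numbers on this slide, 10/(0+1) = 10 and 10/(2+1) = 3.33, are consistent with a weight of 10/(slack + 1). A short sketch of that reading follows; the function name is hypothetical and the formula is inferred from the examples rather than quoted from the slide.

```python
def criticality_weight(slack: int) -> float:
    """Weight for growing toward a node that is `slack` cycles off the longest path."""
    if slack < 0:
        raise ValueError("slack cannot be negative")
    return 10 / (slack + 1)


print(round(criticality_weight(0), 2))  # 10.0 (node on the critical path)
print(round(criticality_weight(2), 2))  # 3.33 (node two cycles off it)
```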
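Latency • Growing toward low-latency operations allows combination of more nodes in a cycle • Maximize DFG compression • Wt = 10 * (subgraph latency before growth) / (subgraph latency after growth) • Worked examples from the figure: 10*0.3/0.36 = 8.33 and 10*0.3/0.6 = 5 • [Figure: example DFG of ^, &, >>, + and << nodes]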
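Area • Want the most benefit for the least area • Wt = 10 * (subgraph area before growth) / (subgraph area after growth) • Area is the sum of macrocell areas • Worked examples from the figure: 10*0.5/0.5 = 10 and 10*0.5/1.5 = 3.33 • [Figure: example DFG of ^, &, >>, + and << nodes]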
Input/Output • Want CFUs to use as few RF ports as possible • Smaller encoding • Allow growth of larger candidates • Wt = [formula missing from the transcript] • Worked examples from the figure: 10*2/(4+1) = 4 and 10*2/(2+1) = 6.67 • [Figure: example DFG of ^, &, >>, + and << nodes]
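The worked examples on the four factor slides can be added up in a few lines. The sketch below is a reading of those examples, not the authors' code: it assumes the latency and area weights are 10 times a before/after ratio, keeps the I/O weight in the literal form shown (10*2/(ports+1), where the meaning of the 2 is not stated), and uses hypothetical names throughout.

```python
def ratio_weight(before: float, after: float) -> float:
    """10 * before/after: drops as the grown subgraph gets slower or larger."""
    return 10 * before / after


def io_weight(ports_after: int, numerator: int = 2) -> float:
    """I/O weight in the form of the slide's examples: 10 * 2 / (ports + 1)."""
    return 10 * numerator / (ports_after + 1)


def guide_value(
    slack: int,
    lat_before: float,
    lat_after: float,
    area_before: float,
    area_after: float,
    ports_after: int,
) -> float:
    """Sum of the four factor weights for one growth direction."""
    return (
        10 / (slack + 1)                         # criticality
        + ratio_weight(lat_before, lat_after)    # latency
        + ratio_weight(area_before, area_after)  # area
        + io_weight(ports_after)                 # input/output
    )


# Combine the higher-scoring example from each factor slide.
best = guide_value(slack=0, lat_before=0.3, lat_after=0.36,
                   area_before=0.5, area_after=0.5, ports_after=2)
print(round(best, 2))  # 35.0
```

The total of 35 matches one of the direction values shown on the first Example slide, which supports reading those values as sums of the four weights.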
Example • [Figure sequence: a subgraph is grown step by step across the example DFG; the first frames label the candidate directions with their values] • Frame 1 values: 28.5, 35, 30.8, 37.5, 37.5, 28.5 • Frame 2 values: 28.5, 35, 30.8, 40, 28.5, 33.5 • Frame 3 values: 28.5, 35, 30.8, 28.5, 36, 36 • Finished – Met External Constraints
Set of Candidates • [Figure: the candidate subgraphs discovered, built from ^, <<, & and + nodes]
Avoids Exponential Explosion • [Chart: speedup, y-axis from 1.00 to 1.50]
Greedy Selection Heuristic • Use estimates of performance improvement / cost
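A greedy pass ordered by estimated improvement per unit cost could look like the sketch below. The `Candidate` fields and the area budget are hypothetical, and the slide does not say whether estimates are refreshed after each pick, so this version does not refresh them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    est_cycles_saved: float  # estimated performance improvement
    area_cost: float         # cost of implementing the CFU, assumed > 0


def select_greedy(cands: List[Candidate], area_budget: float) -> List[Candidate]:
    """Pick candidates in order of improvement/cost until the budget runs out."""
    chosen: List[Candidate] = []
    remaining = area_budget
    ranked = sorted(cands, key=lambda c: c.est_cycles_saved / c.area_cost, reverse=True)
    for cand in ranked:
        if cand.area_cost <= remaining:
            chosen.append(cand)
            remaining -= cand.area_cost
    return chosen
```

With candidates A (100 cycles, cost 2.0), B (90 cycles, cost 1.0) and C (30 cycles, cost 1.5) and a budget of 3.0, the pass picks B first, then A, and skips C for lack of remaining budget.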
Compiler Replacement • Multiple applications can utilize CFUs • Vflib pattern matcher [Cor ’99] • [Figure: Instruction Synthesis produces a CFU Description that feeds the Compiler; a numbered DFG is shown before and after a subgraph is replaced by a CFU]
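Subgraph replacement starts with finding where a CFU pattern occurs in an application's DFG. The brute-force matcher below is only a stand-in to show the idea on tiny graphs; it is not Vflib's API, and the data layout (opcode dicts plus edge sets) is an assumption made for the sketch.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Ops = Dict[int, str]          # node id -> opcode
Edges = Set[Tuple[int, int]]  # (producer, consumer)


def find_matches(dfg_ops: Ops, dfg_edges: Edges,
                 pat_ops: Ops, pat_edges: Edges) -> List[Dict[int, int]]:
    """Return every pattern-node -> DFG-node mapping that preserves opcodes and edges."""
    pat_nodes = list(pat_ops)
    matches: List[Dict[int, int]] = []
    # Try every ordered choice of distinct DFG nodes (exponential; tiny graphs only).
    for combo in permutations(dfg_ops, len(pat_nodes)):
        mapping = dict(zip(pat_nodes, combo))
        if any(pat_ops[p] != dfg_ops[d] for p, d in mapping.items()):
            continue  # an opcode differs
        if all((mapping[u], mapping[v]) in dfg_edges for u, v in pat_edges):
            matches.append(mapping)
    return matches
```

A compiler pass would need further legality checks before replacing a match with a CFU instruction; those are outside this sketch.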
Experimental Setup • Implemented in the Trimaran toolset • Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem/Cycle • CFUs use Int issue slot • CFU latency/area generated as sum of each individual macrocell • Pipeline latches were added if CFU latency > 1 clock cycle • 300 MHz clock assumed • No branch or memory instructions in CFUs • Four application domains tested • Audio, Encryption, Image, Network
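The latency and area bookkeeping described on this slide can be sketched as below. The sketch assumes the CFU is a single dependence chain and that latencies are expressed in fractions of a clock cycle; the macrocell numbers in the usage note are placeholders, not measured data.

```python
import math
from typing import Dict, List, Tuple

Macrocells = Dict[str, Tuple[float, float]]  # opcode -> (latency in cycles, area)


def cfu_cost(chain: List[str], macrocells: Macrocells) -> Tuple[float, float, int, int]:
    """Sum macrocell latency and area along a chain and count pipeline stages."""
    latency = sum(macrocells[op][0] for op in chain)
    area = sum(macrocells[op][1] for op in chain)
    # Round up to whole cycles; the epsilon guards against float noise at exact integers.
    cycles = max(1, math.ceil(latency - 1e-9))
    latch_stages = cycles - 1  # latches are needed only when latency exceeds one cycle
    return latency, area, cycles, latch_stages
```

For example, with placeholder macrocells `{"<<": (0.3, 0.5), "&": (0.06, 0.1), "+": (0.3, 1.0)}`, the chain `["<<", "&", "+"]` fits in one cycle with no latches, while four chained adds need two cycles and one latch stage.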
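Generalizing CFUs • Subsumed (Multiple Paths) • Wildcards (Multiple Nodes) • [Figure: a CFU built from IN_1, 0x8, >>, 0xF, |, IN_2 and +; in the subsumed version the constants become "0x8, 0x0" and "0xF, 0x0"; in the wildcard version the operators become "|,&" and "+,-"]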
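Both generalizations can be illustrated on the CFU from the figure, read here as ((IN_1 >> 0x8) | 0xF) + IN_2. That reading of the flattened figure is an assumption, as are all names below. A wildcard node accepts more than one opcode; a subsumed subgraph is executed by feeding an identity constant (0x0) so that one node passes its input through unchanged.

```python
from typing import List, Set

# Hypothetical wildcard CFU: a chain of three nodes, each accepting a set of opcodes.
WILDCARD_CFU: List[Set[str]] = [{">>"}, {"|", "&"}, {"+", "-"}]


def matches_wildcard(ops: List[str], cfu: List[Set[str]] = WILDCARD_CFU) -> bool:
    """True if the opcode chain fits the CFU node for node."""
    return len(ops) == len(cfu) and all(op in allowed for op, allowed in zip(ops, cfu))


def cfu_eval(in1: int, in2: int, shift: int = 0x8, mask: int = 0xF) -> int:
    """The CFU ((in1 >> shift) | mask) + in2 with its constants exposed as inputs."""
    return ((in1 >> shift) | mask) + in2


x, y = 0x1234, 1
# Shift by 0x0 is the identity, so the CFU also covers the smaller subgraph (x | 0xF) + y.
assert cfu_eval(x, y, shift=0x0) == (x | 0xF) + y
# OR with 0x0 is the identity, so the CFU also covers (x >> 0x8) + y.
assert cfu_eval(x, y, mask=0x0) == (x >> 0x8) + y
```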
Effects of Generalization • [Chart: speedup (y-axis 1.0 to 2.0) for two series, CFUs and Subsumed Subgraphs, across sha, rijn-sha, sha-rijn, rijndael, blowfish, bfish-rijn, rijn-bfish, bfish-sha, sha-bfish]
Conclusions • Developed a two-phase instruction set synthesis system • Guide function removes bad candidates • Greedy selection heuristic • Substantial speedups can be attained with very little die impact • Subsumed subgraphs and wildcarding increase cross-application effectiveness
Questions? http://cccp.eecs.umich.edu
Selection • Uses estimates of performance improvement • Greedy heuristic used • [Figure: example DFG of ^, &, >>, + and << nodes]