Processor Acceleration Through Automated Instruction Set Customization

Nathan Clark, Hongtao Zhong, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan, Ann Arbor
December 3, 2003

Motivation
• Cell phones, PDAs, digital cameras, etc. are everywhere
• High performance yet low power design point
• General core + ASIC solution
  • Limited post-programmability
• General core + application-specific instructions (custom function units, CFUs)
[Figure: CPU + ASIC versus CPU + CFU]

What is a CFU?
• Combine multiple primitive operations
• Smaller code size, fewer RF reads
• Increases performance
[Figure: a DFG of primitive operations (+, ^, &, <<, |, *) with subgraphs grouped into CFU 1 and CFU 2]

Automation is Key
• This is ¼ of the DFG for a single basic block of blowfish
[Figure: DFG excerpt; visible node labels include 159 XOR, 164 SHR, 173 AND]

Related Work
• Tensilica Xtensa
• Commercial example
• MIPS core + manually constructed CFU
• Automatic instruction set synthesis is a mature field
  • See paper for comparison of techniques
• Our contributions
  • Novel technique for automatic CFU creation
  • System to utilize CFUs in multiple applications
  • Analysis of how effectively CFUs for one application apply to other applications in the same domain

System Overview
• Synthesis
  • Subgraph identification
    • Discover candidates for CFUs
    • Weed out what shouldn’t be picked
  • Selection
    • Determine which candidates to use as CFUs
• Compilation
  • Subgraph replacement
    • Make use of the CFUs in a range of applications



Subgraph Identification
• Grow subgraphs from seed nodes
  • All nodes are seeds
• Most directions don’t make sense
  • How to decide where to grow?
• Make decisions using the kinds of factors an architect would weigh
  • Take 4 factors into consideration
  • Criticality, Latency, Area, Input/Output
  • Sum of these factors determines value of each direction
• NOT picking CFUs yet, only identifying candidates
[Figure: example DFG (+, *, &, ^, %, <<, |) with the CFU candidates grown so far]
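The growth step described on this slide can be sketched as a guided enumeration of connected subgraphs. This is a minimal sketch, not the authors' implementation: the `score` callback stands in for the guide function, and the `threshold` cutoff and `max_size` limit are assumptions introduced here to show how low-value directions might be skipped and how growth might stop.

```python
def grow_candidates(adj, score, threshold, max_size):
    """Enumerate connected subgraphs grown outward from every seed node.

    adj       -- dict: node -> set of neighbouring nodes (DFG edges, undirected)
    score     -- hypothetical guide function: (subgraph, new_node) -> float
    threshold -- directions scoring below this are not explored (assumed rule)
    max_size  -- assumed stand-in for the external constraints that stop growth
    """
    candidates = set()
    for seed in adj:                      # every node acts as a seed
        frontier = [frozenset([seed])]
        while frontier:
            sub = frontier.pop()
            if sub in candidates:
                continue                  # already reached from another path
            candidates.add(sub)
            if len(sub) >= max_size:
                continue
            # Nodes adjacent to the subgraph are the possible growth directions.
            directions = set().union(*(adj[n] for n in sub)) - sub
            for node in directions:
                if score(sub, node) >= threshold:
                    frontier.append(sub | {node})
    return candidates


# Tiny chain a - b - c. A constant high score keeps every direction alive,
# a constant low score prunes all growth and leaves only the seeds.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
all_dirs = grow_candidates(adj, lambda sub, n: 10.0, threshold=5.0, max_size=3)
pruned = grow_candidates(adj, lambda sub, n: 0.0, threshold=5.0, max_size=3)
```

Without any pruning the number of connected subgraphs grows very quickly with DFG size, which is why a score that rules out most directions matters.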

Critical Path
• Combining operations on the critical path shortens the longest dependence chains
• Maximize potential performance gain
• Wt = 10 / (slack + 1)
• Slack is the number of cycles a node is off the longest dependence path
• Worked values from the figure: 10/(0+1) = 10 and 10/(2+1) = 3.33
[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated with these weights]
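As an illustration of the slack term, the sketch below computes slack for each node of a small DAG and then applies the weight 10 / (slack + 1) that the slide's worked values (10/(0+1) and 10/(2+1)) point to. The unit latency per operation and the example graph are assumptions made only for this sketch.

```python
from functools import lru_cache


def slacks(succ):
    """Slack per node, assuming every operation takes one cycle.

    succ -- dict: node -> list of successor nodes (every node must be a key)
    """
    pred = {n: [] for n in succ}
    for n, outs in succ.items():
        for m in outs:
            pred[m].append(n)

    @lru_cache(maxsize=None)
    def down(n):          # longest chain that starts at n, counted in ops
        return 1 + max((down(m) for m in succ[n]), default=0)

    @lru_cache(maxsize=None)
    def up(n):            # longest chain that ends at n, counted in ops
        return 1 + max((up(p) for p in pred[n]), default=0)

    longest = max(down(n) for n in succ)
    # Slack = how far the longest path through n falls short of the longest path.
    return {n: longest - (up(n) + down(n) - 1) for n in succ}


def criticality_weight(slack):
    return 10 / (slack + 1)


# Hypothetical DFG: a -> b -> c -> d is the critical chain, e -> d hangs off it.
succ = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "e": ["d"]}
slack = slacks(succ)
weights = {n: criticality_weight(s) for n, s in slack.items()}
```

Nodes on the critical chain get the full weight of 10, while `e`, two cycles off the longest path, gets 10/3.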

Latency
• Growing toward low latency operations allows combination of more nodes in a cycle
• Maximize DFG compression
• Wt (worked values from the figure): 10*0.3 / 0.36 = 8.33 and 10*0.3 / 0.6 = 5
[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated with these weights]

Area
• Want the most benefit for the least area
• Wt (worked values from the figure): 10*0.5/0.5 = 10 and 10*0.5/1.5 = 3.33
• Area is the sum of macrocell areas
[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated with these weights]

Input/Output
• Want CFUs to use as few RF ports as possible
• Smaller encoding
• Allow growth of larger candidates
• Wt (worked values from the figure): 10*2/(4+1) = 4 and 10*2/(2+1) = 6.67
[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated with these weights]
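Since the value of a growth direction is the sum of the four factors, the arithmetic on the last four slides can be combined in a few lines. This is only a sketch: the criticality form follows the slack examples, but the `10 * before / after` shape for latency and area and the `10 * k / (ports + 1)` shape for I/O are read off the worked values, not taken from a stated formula, and what the constant `k` counts is not recoverable from the transcript. Grouping the example values into a "best" and a "worst" direction is likewise illustrative.

```python
def criticality(slack):
    # Matches the worked values 10/(0+1) and 10/(2+1).
    return 10 / (slack + 1)


def ratio(before, after):
    # Assumed shape, matching 10*0.3/0.36 (latency) and 10*0.5/1.5 (area).
    return 10 * before / after


def io(k, ports):
    # Assumed shape, matching 10*2/(4+1) and 10*2/(2+1); k is left opaque.
    return 10 * k / (ports + 1)


def direction_value(slack, latency, area, io_args):
    """Sum of the four factor weights for one growth direction."""
    return criticality(slack) + ratio(*latency) + ratio(*area) + io(*io_args)


# The higher and the lower example value of each factor, grouped for illustration.
best = direction_value(0, (0.3, 0.36), (0.5, 0.5), (2, 2))
worst = direction_value(2, (0.3, 0.6), (0.5, 1.5), (2, 4))
```

Because each factor is scaled to a maximum of about 10, no single factor dominates the sum by construction.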

Example
[Figure: the example DFG (^, &, >>, +, <<) shown over ten animation frames as a candidate subgraph grows. Edge scores in frame 1: 28.5, 35, 30.8, 37.5, 37.5, 28.5. Frame 2: 28.5, 35, 30.8, 40, 28.5, 33.5. Frame 3: 28.5, 35, 30.8, 28.5, 36, 36. The remaining frames carry no recoverable labels.]

Finished – Met External Constraints
[Figure: the example DFG at the end of growth]

Set of Candidates
[Figure: the candidate subgraphs discovered in the example DFG, built from ^, <<, & and + nodes]

Avoids Exponential Explosion
[Figure: chart with y-axis Speedup, ranging from 1.00 to 1.50]

Greedy Selection Heuristic
• Use estimates of performance improvement / cost
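A greedy pass over the candidates, ranked by estimated improvement per unit cost, might look like the sketch below. The candidate tuples, the use of area as the cost, and the fixed area budget are assumptions for illustration; the slide only states that the heuristic uses the improvement/cost ratio.

```python
def greedy_select(candidates, area_budget):
    """Pick candidates in decreasing gain/cost order until the budget is spent.

    candidates  -- iterable of (name, estimated_gain, area_cost); cost must be > 0
    area_budget -- total area available for CFUs (hypothetical constraint)
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, used = [], 0.0
    for name, gain, cost in ranked:
        if used + cost <= area_budget:    # skip anything that no longer fits
            chosen.append(name)
            used += cost
    return chosen


# Hypothetical candidates: (name, cycles saved, area).
cands = [("A", 100, 2.0), ("B", 90, 1.0), ("C", 10, 0.5)]
picked = greedy_select(cands, area_budget=1.5)
```

Here `B` wins on ratio, `A` no longer fits once `B` is taken, and `C` fills the remaining area. A single sorted pass like this ignores the fact that choosing one CFU can change the value of overlapping candidates.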

Compiler Replacement
• Multiple applications can utilize CFUs
• Vflib pattern matcher [Cor ’99]
[Figure: Instruction Synthesis produces a CFU Description that is passed to the Compiler; a DFG with nodes numbered 1 to 6 has a matching subgraph replaced by a CFU]
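To make the replacement step concrete, the sketch below checks whether a CFU pattern occurs at a given node of a DFG and returns the values that would become the CFU's inputs. It is a deliberately simple rooted tree matcher written for this illustration, not Vflib and not general subgraph isomorphism: it ignores commutativity and fan-out, and the DFG encoding is an assumption.

```python
def match(dfg, node, pattern):
    """Return the external inputs bound by `pattern` rooted at `node`, or None.

    dfg     -- dict: value name -> (opcode, [operand names]); names that are
               not keys of dfg are treated as live-in values
    pattern -- (opcode, [sub-patterns]); a sub-pattern of None accepts anything
    """
    if pattern is None:
        return [node]                     # this operand becomes a CFU input
    if node not in dfg:
        return None                       # pattern expects an op, found a live-in
    op, operands = dfg[node]
    p_op, p_args = pattern
    if op != p_op or len(operands) != len(p_args):
        return None
    inputs = []
    for operand, sub in zip(operands, p_args):
        bound = match(dfg, operand, sub)
        if bound is None:
            return None
        inputs.extend(bound)
    return inputs


# Hypothetical block: t3 = ((a & b) << c) + d
dfg = {
    "t1": ("&", ["a", "b"]),
    "t2": ("<<", ["t1", "c"]),
    "t3": ("+", ["t2", "d"]),
}
cfu_pattern = ("+", [("<<", [("&", [None, None]), None]), None])
hit = match(dfg, "t3", cfu_pattern)
miss = match(dfg, "t2", cfu_pattern)
```

On a hit, a compiler could replace the three primitive operations with one CFU instruction that reads the returned inputs.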

Experimental Setup
• Implemented in the Trimaran toolset
• Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem/Cycle
• CFUs use Int issue slot
• CFU latency/area generated as sum of each individual macrocell
• Pipeline latches were added if CFU latency >1 clock cycle
• 300 MHz clock assumed
• No branch or memory instructions in CFUs
• Four application domains tested
  • Audio, Encryption, Image, Network
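The latency and pipelining rules above reduce to a short calculation, sketched here. The 300 MHz clock comes from the slide; the macrocell delays in nanoseconds are made-up illustrative numbers, and counting one rank of latches per extra cycle is an assumption.

```python
import math

CLOCK_HZ = 300e6                          # clock rate stated on the slide


def cfu_timing(macrocell_delays_ns):
    """Return (total delay in ns, cycles, latch ranks) for one CFU.

    The CFU delay is taken as the sum of its macrocell delays, and one rank
    of pipeline latches is assumed for every cycle beyond the first.
    """
    period_ns = 1e9 / CLOCK_HZ            # about 3.33 ns at 300 MHz
    total = sum(macrocell_delays_ns)
    cycles = max(1, math.ceil(total / period_ns))
    return total, cycles, cycles - 1


# Hypothetical macrocell delays, in nanoseconds.
short = cfu_timing([1.0, 1.2, 0.5])       # fits within one clock period
long_ = cfu_timing([2.0, 2.0, 1.0])       # spills into a second cycle
```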

Native Encryption Results

Encryption Cross Compile

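Generalizing CFUs
• Subsumed (Multiple Paths)
• Wildcards (Multiple Nodes)
[Figure: a CFU drawn with IN_1, 0x8, >>, 0xF, |, IN_2 and +. In the subsumed variant the constants are labelled "0x8, 0x0" and "0xF, 0x0"; in the wildcard variant the operators are labelled "|,&" and "+,-".]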
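One way to read the two generalizations is as extra configurability on a single datapath, sketched below. The structure ((IN_1 >> 0x8) | 0xF) + IN_2 is this sketch's reading of the flattened figure and should be treated as an assumption, as should the idea that the alternatives are selected per instruction.

```python
import operator

LOGIC = {"|": operator.or_, "&": operator.and_}
ARITH = {"+": operator.add, "-": operator.sub}


def generalized_cfu(in1, in2, shift=0x8, mask=0xF, logic="|", arith="+"):
    """Model of one CFU datapath with configurable constants and operators.

    Defaults give the original CFU. Setting shift and mask to 0x0 turns the
    shift and the OR into pass-throughs, so a smaller subgraph is subsumed.
    Choosing a different entry from LOGIC or ARITH models a wildcard node.
    """
    shifted = in1 >> shift
    combined = LOGIC[logic](shifted, mask)
    return ARITH[arith](combined, in2)


original = generalized_cfu(0x1234, 1)                         # ((x >> 8) | 0xF) + y
subsumed = generalized_cfu(5, 7, shift=0x0, mask=0x0)         # behaves as x + y
wildcard = generalized_cfu(0x1234, 1, logic="&", arith="-")   # ((x >> 8) & 0xF) - y
```

The same hardware then serves subgraphs from other applications that differ only in a constant or in one operator.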

Effects of Generalization
[Figure: bar chart of Speedup (y-axis 1.0 to 2.0) comparing CFUs and Subsumed Subgraphs for sha, rijn-sha, sha-rijn, rijndael, blowfish, bfish-rijn, rijn-bfish, bfish-sha and sha-bfish]

Conclusions
• Developed a two-phase instruction set synthesis system
  • Guide function removes bad candidates
  • Greedy selection heuristic
• Substantial speedups can be attained with very little die impact
• Subsumed subgraphs and wildcarding increase cross-application effectiveness

Questions?
http://cccp.eecs.umich.edu

Backup slides

Individual Factors - Blowfish

Individual Factors - Djpeg

Selection
• Uses estimates of performance improvement
• Greedy Heuristic used
[Figure: the example DFG (^, &, >>, +, <<)]
