Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Koji Inoue††, Mehdi Sedighi† †Computer and IT Engineering Department, Amirkabir University of Technology {mehdipur,szamani,msedighi}@aut.ac.ir ††Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University noori@c.csce.kyushu-u.ac.jp, {murakami,inoue}@i.kyushu-u.ac.jp
Agenda • Introduction • Application-specific instruction set extension • Temporal Partitioning • Some Definitions • General overview of the architecture • RFU Architecture: A Quantitative Approach • Generating Custom Instructions • Mapping Custom Instructions • Integrating RFU with base processor • Integrated framework for generating and mapping custom instructions • Performance Evaluation • References
Introduction • An extensible processor with a reconfigurable functional unit (RFU) • can be an alternative to General Purpose Processors (GPPs), Application-Specific Integrated Circuits (ASICs) and Application-Specific Instruction set Processors (ASIPs) • to achieve enhanced performance in embedded systems • ASICs • not flexible • expensive and time-consuming design process • GPPs • very flexible • may not offer the necessary performance
Introduction • ASIPs • more flexible than ASICs • more potential to meet the high-performance demands of embedded applications, compared to GPPs • require generating a complete instruction set architecture for the targeted application • a full-custom solution is too expensive and has long design turnaround times
Application-specific instruction set extension • Another method for performance improvement • An extensible processor with a reconfigurable functional unit • favorable tradeoff between efficiency and flexibility • keeps design turnaround time much shorter • Critical portions of an application’s dataflow graph (DFG) are accelerated using custom functional units • The nodes of DFGs -> instructions of critical portions • Edges of DFGs -> dependencies between instructions
Temporal Partitioning • Partitioning a data flow graph into a number of partitions such that • each partition can fit into the target hardware and • dependencies among the graph nodes are not violated.
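The definition above can be sketched in code. This is a minimal illustration, not the authors' algorithm: the DFG is an edge list, the hardware constraint is reduced to a node-count budget, and nodes are assumed to form an acyclic graph.

```python
# Illustrative sketch of temporal partitioning: split a DFG into an
# ordered list of partitions so that each fits a node-count budget
# (a stand-in for the real RFU constraints) and no edge points from
# a later partition back to an earlier one.

def temporal_partition(nodes, edges, capacity):
    """nodes: ids in topological order; edges: (src, dst) pairs;
    capacity: max nodes per partition. DFG assumed acyclic."""
    preds = {n: set() for n in nodes}
    for s, d in edges:
        preds[d].add(s)
    placed, partitions = set(), []
    while len(placed) < len(preds):
        # Nodes whose predecessors are all in earlier partitions
        # are eligible, so dependencies are never violated.
        ready = [n for n in preds if n not in placed and preds[n] <= placed]
        part = ready[:capacity]          # respect the hardware budget
        partitions.append(part)
        placed.update(part)
    return partitions

parts = temporal_partition(
    nodes=[1, 2, 3, 4, 5],
    edges=[(1, 3), (2, 3), (3, 4), (3, 5)],
    capacity=2)
# -> [[1, 2], [3], [4, 5]]
```

Node 3 cannot join the first partition because its predecessors 1 and 2 are placed in the same step, which is exactly the dependency condition in the definition.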
Some definitions • Hot Basic Block (HBB) • A basic block whose execution frequency exceeds a threshold specified in the profiler • Custom Instructions (CIs) • Extensions to the Instruction Set Architecture (ISA) that are executed on the RFU • Reconfigurable Functional Unit (RFU) • Custom hardware for executing CIs
General overview of the architecture • Adaptive dynamic extensible processor • Base processor: an N-way in-order general RISC (Fetch, Decode, Execute, Memory, Write stages and a register file) • Augmented hardware: • Profiler: detects start addresses of Hot Basic Blocks (HBBs) • Sequencer: switches between the main processor and the RFU • RFU: executes custom instructions
Operation modes • Training mode • Binary-level profiling: the profiler monitors running applications and detects the start addresses of HBBs • Tools generate custom instructions, generate configuration data for the accelerator (ACC), initialize the sequencer table and perform binary rewriting • Normal mode • The sequencer monitors the PC and switches between the main processor and the ACC, which executes the CIs
Reconfigurable Functional Unit (RFU) • RFU is a matrix of Functional Units (FUs) • RFU has a two-level configuration memory • A multi-context memory (keeps two or four configurations) • A cache • FUs support only logical operations, add/subtract, shifts and compare • RFU updates the PC • RFU has a variable delay which depends on the size of the custom instruction
RFU Architecture: A Quantitative Approach • 22 programs of MiBench were chosen • The SimpleScalar toolset was utilized for simulation • RFU is a matrix of FUs • Number of inputs • Number of outputs • Number of FUs • Connections • Location of inputs & outputs • Some definitions: • Considering frequency and weight in measurement • CI execution frequency • Weight (to account for the number of executed instructions) • Average over all CIs = ΣFreq×Weight • Rejection: percentage of CIs that could not be mapped on the RFU • Coverage: percentage of CIs that could be mapped on the RFU • Basic Block: a sequence of instructions that terminates in a control instruction • Hot Basic Block: a basic block executed more than a threshold
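The frequency-weighted measurement above can be sketched as follows. The dict fields and the `mapped` flag are hypothetical stand-ins for the real profiling data; the point is only that rejection and coverage are weighted by execution frequency times CI size.

```python
# Illustrative sketch of the weighted metrics: each CI carries an
# execution frequency and a weight (its instruction count), and
# rejection is the weighted share of CIs that fail to map.

def weighted_rejection(cis):
    """cis: list of dicts with 'freq', 'weight' and 'mapped' keys."""
    total = sum(ci["freq"] * ci["weight"] for ci in cis)
    rejected = sum(ci["freq"] * ci["weight"]
                   for ci in cis if not ci["mapped"])
    return 100.0 * rejected / total

cis = [{"freq": 10, "weight": 4, "mapped": True},
       {"freq": 5,  "weight": 8, "mapped": False}]
rej = weighted_rejection(cis)   # 40 of 80 weighted units -> 50.0
```

Coverage is simply the complement, `100 - rejection`, as both slides define them over the same set of CIs.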
RFU Architecture • Distributing inputs in different rows • Row 1 = 7 • Row 2 = 2 • Row 3 = 2 • Row 4 = 2 • Row 5 = 1 • Connections with variable length • row1 → row3 = 1 • row1 → row4 = 1 • row1 → row5 = 1 • row2 → row4 = 1 • Synthesis results using Hitachi 0.18 μm • Area: 1.1534 mm2 • Delay: 9.66 ns
Integrating RFU with the Base Processor [Block diagram: the RFU sits alongside FU1–FU4 between the DEC/EXE and EXE/MEM pipeline registers, with access to the register file (Reg0–Reg31); the sequencer and decoder feed it configuration data from the configuration memory]
Generation of Custom Instructions • Custom instructions • Exclude floating point, multiply, divide and load instructions • Include at most one STORE, at most one BRANCH/JUMP and all other fixed-point instructions • Simple algorithm for generating custom instructions • HBBs usually include 10~40 instructions for MiBench • The custom instruction generator itself executes on the base processor (in online training mode)
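The admission rules above can be sketched as a simple scan over a hot basic block. The mnemonics and the greedy grow-until-blocked strategy are assumptions for illustration, not the authors' exact generator.

```python
# Illustrative sketch of CI generation: grow a custom instruction
# over a hot basic block, excluding FP/multiply/divide/load ops and
# capping STORE and BRANCH/JUMP at one occurrence each.

EXCLUDED = {"fadd", "fmul", "mul", "div", "lw"}   # assumed mnemonics

def grow_ci(block):
    ci, stores, branches = [], 0, 0
    for op in block:
        if op in EXCLUDED:
            break                      # the CI cannot include these ops
        if op == "sw":
            if stores == 1:
                break                  # at most one STORE
            stores += 1
        if op in ("beq", "j"):
            if branches == 1:
                break                  # at most one BRANCH/JUMP
            branches += 1
        ci.append(op)
    return ci

ci = grow_ci(["add", "sub", "sw", "beq", "mul", "or"])
# -> ["add", "sub", "sw", "beq"], stopping at the multiply
```

A real generator would operate on the DFG rather than a linear instruction list, but the constraint checks are the same.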
Mapping Custom Instructions • Mapping is essentially the well-known placement problem: • Determining appropriate positions for DFG nodes on the RFU. • CI instructions are assigned to FUs based on the priority of the nodes.
Mapping Custom Instructions • The slack of each node represents its criticality and also its priority for partitioning. • A slack equal to 0 means the node is on the critical path of the DFG and should be scheduled with the highest priority. • For nodes with the same criticality, their ASAP level determines the mapping order.
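The scheduling terms used above follow the standard definitions: ASAP levels propagate from predecessors, ALAP levels from successors, and slack is their difference. A minimal sketch, assuming nodes are given in topological order:

```python
# Standard ASAP/ALAP/slack computation on a DFG edge list;
# slack 0 marks nodes on the critical path.

def asap_alap_slack(nodes, edges):
    """nodes: ids in topological order; edges: (src, dst) pairs."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for s, d in edges:
        preds[d].append(s)
        succs[s].append(d)
    asap = {}
    for n in nodes:                      # earliest level, from predecessors
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
    depth = max(asap.values())           # length of the longest path
    alap = {}
    for n in reversed(nodes):            # latest level, from successors
        alap[n] = min((alap[s] - 1 for s in succs[n]), default=depth)
    return {n: alap[n] - asap[n] for n in nodes}

slack = asap_alap_slack([1, 2, 3, 5, 4],
                        [(1, 3), (2, 3), (3, 4), (5, 4)])
# node 5 can slip one level (slack 1); 1, 2, 3, 4 are critical (slack 0)
```

Unit node delays are assumed here; weighting by FU delay changes the arithmetic but not the structure.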
Mapping Algorithm (1/2) • First step: determining an appropriate row for the node • Row number = last row (if the selected node is on a critical path whose length is greater than or equal to the RFU depth) • Row number = ALAP − slack − 1 (to prevent nodes that do not belong to critical paths from occupying FUs in the lower RFU rows)
Mapping Algorithm (2/2) • Second step: determining an appropriate column • The column is chosen according to the minimum-connection-length criterion. • For each row, a maximum capacity is considered to prevent too many nodes from gathering in one row. • The capacity of the rows is determined with respect to the longest critical path and the number of critical paths in the DFG.
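The two steps above can be sketched together. The RFU dimensions, the averaging of predecessor columns as the "minimum connection length" target, and the occupancy set are all illustrative assumptions:

```python
# Illustrative sketch of the two-step mapping: pick a row from the
# node's ALAP level and slack, then pick the free column in that row
# nearest the node's already-placed predecessors.

DEPTH, WIDTH = 5, 4          # assumed RFU dimensions (rows x columns)

def choose_position(alap, slack, crit_len, pred_cols, occupied):
    # Step 1: row selection.
    if slack == 0 and crit_len >= DEPTH:
        row = DEPTH - 1                  # last row (0-indexed)
    else:
        row = alap - slack - 1           # keep non-critical nodes high
    # Step 2: column selection - free slot nearest the predecessors,
    # approximating the minimum-connection-length criterion.
    target = sum(pred_cols) / len(pred_cols) if pred_cols else 0
    free = [c for c in range(WIDTH) if (row, c) not in occupied]
    col = min(free, key=lambda c: abs(c - target))
    return row, col

pos = choose_position(alap=3, slack=1, crit_len=3,
                      pred_cols=[2], occupied={(1, 2)})
# row 3-1-1 = 1; column 2 is taken, so the nearest free column wins
```

The per-row capacity check from the slide would simply shrink the `free` list before the column is chosen.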
Generating Custom Instructions for the Target RFU • Our primary CI generator did not consider any constraints on the generated CIs and tried to generate CIs as large as possible. • Therefore, some of the generated CIs cannot be mapped on the proposed RFU due to its constraints.
Customizing the CI generator for the Target RFU – First Approach • Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool so that it generates mappable CIs. • In this approach the CI generator is unaware of the mapping results • Some CIs may still fail to map to the RFU due to the routing constraints
Customizing the CI generator for the Target RFU – Second Approach • Integrated framework • Performs an integrated temporal partitioning and mapping process • Takes rejected CIs as input • Partitions them into appropriately sized, mappable CIs • Adds nodes to the current partition while the architectural constraints are satisfied • The ASAP level of the nodes represents their execution order according to their dependencies • Advantages • Reduces the number of rejected CIs • Uses a mapping-aware temporal partitioning process
Integrated Framework – Temporal Partitioning Algorithms • HTTP • Traverses the DFG nodes horizontally according to their ASAP level • usually brings about more parallelism in instruction execution • may require large intermediate data • The size of the intermediate data affects the data transfer rate and the size of the configuration memory • VTTP • Traverses the DFG nodes vertically • Creates partitions with longer critical paths • Reduces the size of the intermediate data
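The contrast between the two traversals can be illustrated by the node order each produces. This sketch only models the visiting order (level-by-level versus path-by-path); the tie-breaking rules are assumptions:

```python
# Illustrative contrast of the two traversals: HTTP visits nodes
# level by level (by ASAP level), while VTTP follows each path
# downward before starting the next one.

def http_order(asap):
    """Horizontal: sort nodes by ASAP level (ties broken by id)."""
    return sorted(asap, key=lambda n: (asap[n], n))

def vttp_order(succs, roots):
    """Vertical: depth-first walk from each root (input) node."""
    order, seen = [], set()
    def walk(n):
        if n in seen:
            return
        seen.add(n)
        order.append(n)
        for s in sorted(succs.get(n, [])):
            walk(s)
    for r in roots:
        walk(r)
    return order

asap = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
succs = {1: [3], 2: [4], 3: [5], 4: [5]}
h = http_order(asap)                 # level order: 1, 2, 3, 4, 5
v = vttp_order(succs, roots=[1, 2])  # path order:  1, 3, 5, 2, 4
```

Cutting the horizontal order after a few nodes yields wide partitions with many live intermediate values; cutting the vertical order yields narrow partitions that keep each path together, matching the trade-off on the slide.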
Integrated Framework – Incremental Temporal Partitioning Algorithm • The incremental temporal partitioning process is performed iteratively • Each partition that does not satisfy the RFU constraints is modified • A new iteration starts • Two different partition modification strategies are used for HTTP and VTTP • The main difference lies in how the nodes to be moved to the next partition are selected
Integrated Framework – Incremental Temporal Partitioning Algorithm • Incremental HTTP • The node with the highest ASAP level is selected and moved to the subsequent partition. • Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.
Integrated Framework – Incremental Temporal Partitioning Algorithm • Incremental VTTP • A node with the highest ASAP level is selected and moved. • The other nodes are then selected, in ASAP-level order, from the path on which the previously moved node was located. • Node selection and moving order: 15, 14, 6, 13, 12, 5, 11, 10, 4 and 7.
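The incremental repair loop common to both variants can be sketched as follows, using the incremental-HTTP selection rule (highest ASAP level moves first). The `fits` predicate is a hypothetical stand-in for the full RFU constraint check:

```python
# Illustrative sketch of the incremental step: while a partition
# violates an RFU constraint, move its highest-ASAP node to the next
# partition and re-check, creating a new partition if needed.

def repair_partitions(partitions, asap, fits):
    """fits(partition) -> bool models the RFU constraint check."""
    i = 0
    while i < len(partitions):
        while not fits(partitions[i]):
            victim = max(partitions[i], key=lambda n: asap[n])
            partitions[i].remove(victim)
            if i + 1 == len(partitions):
                partitions.append([])   # grow a new partition at the end
            partitions[i + 1].append(victim)
        i += 1
    return partitions

asap = {1: 1, 2: 1, 3: 2}
parts = repair_partitions([[1, 2, 3]], asap,
                          fits=lambda p: len(p) <= 2)
# -> [[1, 2], [3]]
```

Incremental VTTP would differ only in the inner selection: after the first victim, it keeps pulling nodes from the same path in ASAP-level order rather than always taking the global maximum.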
Customizing the Mapping Tool • Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU
Performance Evaluation • SimpleScalar was configured to behave as a 4-issue in-order RISC processor. The base processor supports the MIPS instruction set. • 22 applications of MiBench
Delay of RFU according to CI length • Synopsys Tools + Hitachi 0.18μm
References • Arnold, M., Corporaal, H., Designing domain-specific processors, In Proceedings of the Design, Automation and Test in Europe Conference, 2001, pp. 61-66. • Atasu, K., Pozzi, L., Ienne, P., Automatic application-specific instruction-set extensions under microarchitectural constraints, 40th Design Automation Conference, 2003. • Bobda, C., Synthesis of dataflow graphs for reconfigurable systems using temporal partitioning and temporal placement, Ph.D. thesis, Faculty of Computer Science, Electrical Engineering and Mathematics, University of Paderborn, 2003. • Clark, N., Kudlur, M., Park, H., Mahlke, S., Flautner, K., Application-specific processing on a general-purpose core via transparent instruction set customization, In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, 2004. • Gajjala Purna, K.M., Bhatia, D., Temporal partitioning and scheduling data flow graphs for reconfigurable computers, IEEE Transactions on Computers, vol. 48, no. 6, 1999, pp. 579-590.
References • Kastner, R., Kaplan, A., Ogrenci Memik, S., Bozorgzadeh, E., Instruction generation for hybrid reconfigurable systems, ACM TODAES, vol. 7, no. 4, 2002, pp. 605-627. • Ouaiss, I., Govindarajan, S., Srinivasan, V., Kaul, M., Vemuri, R., An integrated partitioning and synthesis system for dynamically reconfigurable multi-FPGA architectures, In Proceedings of the Reconfigurable Architectures Workshop, 1998, pp. 31-36. • Spillane, J., Owen, H., Temporal partitioning for partially reconfigurable field programmable gate arrays, IPPS/SPDP Workshops, 1998, pp. 37-42. • Tanougast, C., Berviller, Y., Brunet, P., Weber, S., Rabah, H., Temporal partitioning methodology optimizing FPGA resources for dynamically reconfigurable embedded real-time system, Microprocessors and Microsystems, vol. 27, 2003, pp. 115-130. • Yu, P., Mitra, T., Characterizing embedded applications for instruction-set extensible processors, In Proceedings of the Design Automation Conference, 2004, pp. 723-728.