Aristotle University of Thessaloniki. Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support. Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis
Aristotle University of Thessaloniki Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece E-mail: nivas@physics.auth.gr
Outline • Introduction • Target Architecture Overview • Partial Predicated Execution Enhancement • Virtual Opcode Enhancement • Development Framework • Experimental Results • Conclusions
Introduction • Characteristics of modern embedded applications • Diversity of algorithms • Rapid evolution of standards • High performance demands • To amortize cost over high production volumes, embedded systems must: • Exhibit high levels of flexibility => fast Time-to-Market • Exhibit high levels of adaptability => increased reusability • An appealing option => couple reconfigurable hardware (RH) to a typical processor • Processor => bulk of the flexibility • RH => adaptation to the target application • Supported by a development framework that hides RH-related issues • Maintain flexibility • Continue to target a software-oriented group of users
Target Architecture • Reconfigurable Instruction Set Processor (RISP) • Core processor • 32-bit single issue RISC • 5 pipeline stages • Reconfigurable Functional Unit (RFU) • 1-D array of coarse-grain processing elements (PEs) • An interface that tightly couples the RFU to the core • Explicit communication
Target Architecture - ISA • Re=‘0’ => Standard Instruction Set • Flexibility to execute any program • Re=‘1’ => Reconfigurable Instruction Set Extensions • Offers the adaptation to the target application • Three types of Reconfigurable Instructions • Complex computational operations • Complex addressing modes • Complex control flow operations • [Figure: 32-bit instruction word format]
Target Architecture - RFU • 1-D Array of coarse-grain PEs • Executes Reconfigurable Instructions • Multiple-Input-Single-Output (MISO) clusters of primitive operations • Un-registered output • Chain of operations in the same clock cycle • Registered output • Chain of pipelined operations • Floating PEs => Can operate in both core pipeline stages on demand • Better utilization of the available hardware
Target Architecture – Configuration • Local configuration memory • Multi-context • No overhead to select a context • Array of coarse-grain PEs => • Small number of configuration bits per instruction
Target Architecture – Synthesis Results • A hardware model (VHDL) was designed • Synthesis results with STM 0.13 µm technology • Reasonable area overhead • No overhead to core critical path
Enhancement with Partial Predicated Execution • Predication • Eliminate branches from an instruction stream • Conditional execution of an instruction • Utilized to expose Instruction Level Parallelism • Our approach => partial predicated execution to eliminate the branch in an “if-then-else” statement • Large clusters of operations => increased performance
Support of Partial Predicated Execution • The available output network can be utilized • Extensions • Two configuration bits • Two multiplexers • Hardwired connections to PEs • Selection of the RFU output • Controlled by configuration bits => no predication • Controlled by comparison result => predicated execution • Comparison => implemented in a PE
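The effect of if-conversion with partial predication can be sketched in a few lines of Python. This is only an illustration, not the authors' hardware or compiler: both arms of an "if-then-else" are computed unconditionally, and a multiplexer driven by a comparison result picks the output. The function names and the absolute-difference example are assumptions chosen for illustration.

```python
def abs_diff_branching(a, b):
    """Original form: the control flow contains a branch."""
    if a > b:
        return a - b
    else:
        return b - a


def abs_diff_predicated(a, b):
    """If-converted form: both arms are computed, then one is selected."""
    then_val = a - b   # operation mapped to one PE ("then" arm)
    else_val = b - a   # operation mapped to another PE ("else" arm)
    pred = a > b       # comparison, also implemented in a PE
    # Models the output multiplexer controlled by the comparison result.
    return then_val if pred else else_val
```

Because no branch remains between the two arms, the whole statement becomes one cluster of operations that can be offered as a single candidate instruction.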
Enhancement with Virtual Opcode • Explicit communication between core and RFU • Opcode explosion problem • Proposed solution => “Virtual” opcode • Virtual opcode = Natural opcode + code region • Overhead => Configuration memory size • Coarse grain => Small configuration size => 136 bits per instruction • In general, a virtual opcode can be implemented by flushing and reloading the whole local memory • Large performance overhead • Applicable for different applications
Support of Virtual Opcode • Local configuration memory => extended with an extra level of contexts • First level = K contexts of locally available reconfigurable instructions • Second level = L copies of the first level for different code regions • For each code region only one of the L contexts is active • The same natural opcode in a different region context forms a virtual opcode • Region partitioning and issuing of context activations are performed by the compiler • One cycle overhead to activate a context • Configuration memory size = K × L × configuration bits per instruction
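A minimal software model of the two-level configuration memory may help make the idea concrete. It is a sketch under assumptions, not the actual hardware: the class and method names are hypothetical, configurations are stored as plain Python objects, and only the 136-bit figure and the one-cycle activation overhead come from the slides.

```python
CONF_BITS_PER_INSTR = 136  # per-instruction configuration size quoted on the slides


class TwoLevelConfigMemory:
    """Hypothetical model of a local configuration memory with region contexts."""

    def __init__(self, instrs_per_region, num_regions):
        self.instrs_per_region = instrs_per_region
        self.num_regions = num_regions
        # contexts[region][natural_opcode] holds one instruction configuration.
        self.contexts = [[None] * instrs_per_region for _ in range(num_regions)]
        self.active_region = 0

    def size_bits(self):
        """Memory size = instructions per region x regions x bits per instruction."""
        return self.instrs_per_region * self.num_regions * CONF_BITS_PER_INSTR

    def load(self, region, natural_opcode, config):
        self.contexts[region][natural_opcode] = config

    def activate(self, region):
        """Switch the active region context; returns the cycle overhead."""
        self.active_region = region
        return 1  # one cycle to activate a context, as stated on the slides

    def lookup(self, natural_opcode):
        """Resolve a natural opcode within the currently active region."""
        return self.contexts[self.active_region][natural_opcode]
```

With two regions loaded, the same natural opcode resolves to a different configuration depending on which region context is active, which is what turns it into a virtual opcode.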
Development Framework • Automated framework for the development of applications on the architecture • Transparent incorporation of the reconfigurable instruction set extensions • Based on the SUIF/MachSUIF compiler infrastructure
Dev. Framework – Front End / Profiling • Application source code translated into a CDFG (SUIFvm operations) • Perform machine-independent optimizations • If-conversion for partial predicated execution can be applied • CDFG instrumented with profiling annotations • Translated to equivalent C code • Compiled and executed on the host • Profiling information is collected • Region execution frequency
Dev. Framework – Instruction Generation • First step = Pattern Generation • In-house tool for the identification of MISO clusters of operations based on the MaxMISO algorithm • Second step = Mapping of MISOs in the RFU • Place the SUIFvm nodes in PEs / Route the 1-D array • Analyze paths and set the output of a PE (reg./unreg.) to minimize delay • Report candidate instruction semantics • [Figure: Candidate1 and Candidate2 mapped onto PE1, PE2, PE3] • Example report for Candidate2: src1: $vr1, src2: $vr1, src3: $vr3, dst: $vr4 { region: func1 – dfg1; PE1: sub, output: reg; PE2: neg, output: un-reg; …; edg1: in1-PE1, in2-PE1 …; latency: 1 cycle; type: comp; static gain: 2 }
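The pattern generation step can be illustrated with a small sketch of MaxMISO-style clustering. This is not the in-house tool: it is a simplified version under stated assumptions (a data-flow graph given as a successor map, no restriction on which operation types may join a cluster), and the function name is hypothetical. Starting from an output node, a producer is absorbed into the cluster only when every consumer of its result is already inside, so the cluster keeps a single output.

```python
def max_misos(succs):
    """Partition a DAG into maximal multiple-input-single-output clusters.

    succs maps each operation node to the set of nodes that consume its result.
    """
    preds = {n: set() for n in succs}
    for n, consumers in succs.items():
        for c in consumers:
            preds[c].add(n)

    # DFS post-order along consumer edges lists consumers before producers.
    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for c in sorted(succs[n]):
            visit(c)
        order.append(n)

    for n in sorted(succs):
        visit(n)

    assigned, clusters = set(), []
    for root in order:
        if root in assigned:
            continue
        cluster, worklist = {root}, [root]
        while worklist:
            n = worklist.pop()
            for p in preds[n]:
                # Absorb p only if all of its consumers are inside the cluster.
                if p not in assigned and p not in cluster and succs[p] <= cluster:
                    cluster.add(p)
                    worklist.append(p)
        assigned |= cluster
        clusters.append(cluster)
    return clusters
```

A node whose result feeds two separate clusters becomes the root of its own cluster, since absorbing it would create a second output.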
Dev. Framework – Instruction Selection (1/2) No virtual opcode • Consider the whole application space • Perform pair-wise graph isomorphism to identify identical candidate instructions • Calculate dynamic gain offered by each candidate • Dynamic gain = static gain × frequency • Rank candidate instructions based on dynamic gain • Select best L instructions • L defined by the number of supported instructions
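The ranking step can be sketched as follows. This is an illustrative sketch, not the framework's code: the `Candidate` record and `select_instructions` function are hypothetical names, and identical candidates are assumed to have already been merged by the isomorphism pass. The dynamic gain is the static gain multiplied by the profiled execution frequency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    static_gain: int  # cycles saved per execution of the candidate
    frequency: int    # execution count taken from profiling

    @property
    def dynamic_gain(self):
        return self.static_gain * self.frequency


def select_instructions(candidates, limit):
    """Rank candidates by dynamic gain and keep the best `limit` of them."""
    ranked = sorted(candidates, key=lambda c: c.dynamic_gain, reverse=True)
    return ranked[:limit]
```

A candidate with a small static gain can still outrank one with a large static gain if it sits in a much hotter region, which is why the profiling data matters.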
Dev. Framework – Instruction Selection (2/2) With virtual opcode enabled • Partition application code into regions • Currently supporting only procedures • Perform graph isomorphism per region • Calculate dynamic gain offered by each candidate for each region • Calculate the overhead of activating the region contexts • Rank regions and candidate instructions based on dynamic gain • Select best K regions and best L instructions from each region • L, K defined by the supported contexts and instructions per context
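One possible way to combine per-region gains with the activation overhead is sketched below. The cost model is an assumption, not taken from the slides: each region's net gain is the summed dynamic gain of its best instructions minus one cycle per context activation, and regions with no net gain are dropped. The function name and the input layout are hypothetical.

```python
ACTIVATION_CYCLES = 1  # one-cycle overhead to activate a context (from the slides)


def select_with_virtual_opcode(regions, num_regions, instrs_per_region):
    """Pick the best regions and, within each, the best instructions.

    regions maps a region name to a dict with:
      "activations": how many times the region context is activated (assumed
                     to be available from profiling),
      "candidates":  list of (name, static_gain, frequency) tuples.
    """
    scored = []
    for region, info in regions.items():
        ranked = sorted(info["candidates"], key=lambda c: c[1] * c[2], reverse=True)
        chosen = ranked[:instrs_per_region]
        gain = sum(static * freq for _, static, freq in chosen)
        overhead = info["activations"] * ACTIVATION_CYCLES
        scored.append((gain - overhead, region, [name for name, _, _ in chosen]))

    scored.sort(key=lambda item: item[0], reverse=True)
    # Keep only regions whose gain exceeds their activation overhead.
    return [(region, names) for net, region, names in scored[:num_regions] if net > 0]
```

Under this model a frequently entered region with little accelerated work is rejected, because the activation cycles outweigh the savings.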
Experimental Results • Demonstrate the performance improvements offered by the proposed architecture • Evaluate the efficiency of the enhancements • A complete MPEG-2 encoding application is used • Source code from the MediaBench benchmark suite • Input data => a video sequence consisting of 12 frames with a resolution of 144x176 pixels
Exp. Results – SpeedUp Analysis • Speedup analysis for the most time-consuming functions of the MPEG-2 encoder • Accelerating only critical regions => small overall speedup (Amdahl’s law) • Our approach accelerates the whole application space => overall speedup is preserved
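The Amdahl argument can be stated compactly. The symbols are introduced here for illustration and do not appear on the slides: f is the fraction of execution time spent in the accelerated code and s is the speedup achieved on that code.

```latex
S_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}},
\qquad
\lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - f}
```

However large s becomes, the overall speedup is bounded by 1/(1 - f), so accelerating only a few critical regions leaves the rest of the application as the limiting term; raising f by covering more of the application lifts that bound.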
Exp. Results – Evaluation of predication • Example of four instructions derived using if-conversion and partial predicated execution • These instructions implement the SAD function • Significant performance improvements are offered
Exp. Results – Evaluation of Virtual Opcode • Virtual opcode can be used to preserve speedups for architectures with limited opcode space • Reasonable overhead for the local configuration memory size • Finer partitioning of regions could yield better results
Conclusions • Two enhancements to a previously proposed RISP architecture were presented • Partial predicated execution => increases performance • Virtual opcode => relaxes opcode space pressure • An automated development framework has been presented • Hides the reconfigurable hardware from the user • Supports the two enhancements • The efficiency of the RISP and its enhancements has been demonstrated using an MPEG-2 encoding application • Future research • Support full predication for further performance improvements • Support finer partitioning of regions for better utilization of virtual opcode
Thank you! Questions?