A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT
Nikolaos Vassiliadis, N. Kavvadias, G. Theodoridis, S. Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
nivas@skiathos.physics.auth.gr
Algarve, Portugal, February 22-23, 2005
Outline
• Motivations
• Proposed Architecture
• Software Development Environment
• Demonstration
• Results
• Conclusions
Motivations
• Quest for Performance and Flexibility
  • A Large Portion of the Computational Complexity is Concentrated in Small Kernels Covering Small Parts of the Overall Code
    • Performance Improved by Accelerating these Kernels
  • Many Algorithms Exhibit Significant Instruction Level Parallelism (ILP)
    • Performance Improved by Parallel Execution
  • Traditional Processors have Computation Clock Slack
    • Performance Improved by Chaining of Operations (Spatial Computation)
Extending Embedded Processors With Application Specific Function Units
Reconfigurable Instruction Set Processors for Performance with Maximum Flexibility
Proposed Architecture
• Reconfigurable Instruction Set Processor (RISP)
• Core Processor
  • 32-bit load/store RISC architecture
  • 5 Pipeline Stages
  • Single Issue Elaboration
• Reconfigurable Logic Coupling
  • Reconfigurable Function Unit (RFU) approach => Low Communication Overhead
  • Tightly Coupled => RFU Fits in two RISC pipeline stages => Better Utilization of the Pipeline Stages
• RFU
  • 1-D Array of Coarse Grain Processing Elements (PEs)
  • PE Functionality Configurable at Design Time to meet Application requirements
  • Exploits Instruction Level Parallelism – Spatial & Temporal Computation
Proposed Architecture
• Core Processor
  • Commonly Used Function Units
  • Control Logic Properly Extended to Handle Reconfigurable Instructions
  • 4-Read-1-Write Register File
• Core / RFU Interface
  • Receives & Delivers Control and Data Signals
• Tightly Coupled RFU
  • Configuration-Processing-Interconnection Layers
  • Operates & Delivers Results in two Concurrent Pipeline Stages
Standard And Reconfigurable Instructions
32-Bit Instruction Word Format
• Re=‘0’ => Standard Instruction
  • Control Logic : Configure Core Datapath
  • Operands : Source1-2 & Destination
  • ReOpCode = “nop”
• Re=‘1’ => Reconfigurable Instruction
  • Control Logic : Configure Interface
  • Operands : Source1-4 & Destination
  • ReOpCode = “OpCode”
• Three Types of Reconfigurable Instructions
  • Complex Computational Operations
  • Complex Addressing Modes
  • Complex Control Flow Operations
• Each Instruction can be multicycle
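The Re-bit split described on this slide can be sketched as a small decoder. The slide does not give field widths or bit positions, so the layout below (Re in bit 31, a 6-bit standard opcode, a 5-bit ReOpCode, and five 4-bit register fields) is purely an assumption chosen to fill 32 bits; `encode`, `decode`, and `Decoded` are hypothetical names.

```python
from dataclasses import dataclass

# ASSUMED layout, not taken from the slides:
# [31] Re | [30:25] opcode | [24:20] ReOpCode | [19:16] dest | [15:0] src1..src4 (4 bits each)
RE_SHIFT = 31
OPCODE_SHIFT, OPCODE_MASK = 25, 0x3F
REOPCODE_SHIFT, REOPCODE_MASK = 20, 0x1F
DEST_SHIFT, REG_MASK = 16, 0xF
SRC_SHIFTS = (12, 8, 4, 0)
REOPCODE_NOP = 0  # assumed encoding of the "nop" ReOpCode


@dataclass
class Decoded:
    reconfigurable: bool
    opcode: int
    reopcode: int
    dest: int
    sources: tuple


def encode(re: int, opcode: int, reopcode: int, dest: int, sources) -> int:
    """Pack the fields into a 32-bit word using the assumed layout."""
    word = (re & 1) << RE_SHIFT
    word |= (opcode & OPCODE_MASK) << OPCODE_SHIFT
    word |= (reopcode & REOPCODE_MASK) << REOPCODE_SHIFT
    word |= (dest & REG_MASK) << DEST_SHIFT
    for shift, reg in zip(SRC_SHIFTS, sources):
        word |= (reg & REG_MASK) << shift
    return word


def decode(word: int) -> Decoded:
    """Re=0: standard instruction, two sources, ReOpCode forced to nop.
    Re=1: reconfigurable instruction, four sources, ReOpCode is meaningful."""
    word &= 0xFFFFFFFF
    re = (word >> RE_SHIFT) & 1
    opcode = (word >> OPCODE_SHIFT) & OPCODE_MASK
    reopcode = (word >> REOPCODE_SHIFT) & REOPCODE_MASK
    dest = (word >> DEST_SHIFT) & REG_MASK
    sources = tuple((word >> s) & REG_MASK for s in SRC_SHIFTS)
    if re == 0:
        return Decoded(False, opcode, REOPCODE_NOP, dest, sources[:2])
    return Decoded(True, opcode, reopcode, dest, sources)
```

With this layout, `decode(encode(1, 0, 3, 2, (4, 5, 6, 7)))` yields a reconfigurable instruction with four source registers, while the same word with Re cleared is read as a standard two-source instruction.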
Reconfigurable Function Unit (RFU)
• Embedded RFU for Dynamic Extension of the Instruction Set
• Executes Multiple-Input-Single-Output (MISO) Reconfigurable Instructions
• 1-D Array of Coarse Grain Reconfigurable Blocks
• Comprised of Three Layers
  • Processing Layer
  • Interconnection Layer
  • Configuration Layer
RFU-Processing Layer
PE Basic Structure
• Configurable PE functionality for the targeted application
• Unregistered Output => Spatial Computation
• Registered Output => Temporal Computation
• Floating PEs => Can operate in both core pipeline stages on demand
• Local Memory for Read Only Values
• Execute Long Chains of Operations in one processor cycle
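The difference between unregistered (spatial) and registered (temporal) PE outputs can be illustrated with a toy cycle count. This is a minimal sketch, assuming each PE applies a two-input operation to the running value and a local constant, and that every registered output ends a clock cycle; `run_chain` and its tuple format are hypothetical and not part of the presented design.

```python
import operator


def run_chain(pes, x):
    """Evaluate a chain of PEs on input x.

    pes: list of (op, constant, registered) tuples.
    Returns (result, cycles). Unregistered outputs feed the next PE
    combinationally in the same cycle; a registered output closes the cycle.
    """
    cycles = 0
    pending = False  # True while there is combinational work not yet latched
    for op, const, registered in pes:
        x = op(x, const)
        pending = True
        if registered:
            cycles += 1
            pending = False
    if pending:
        # The last unregistered result is captured at the end of the cycle
        # (assumed to be latched by the core pipeline register).
        cycles += 1
    return x, cycles


# Three chained operations, all unregistered: one cycle (spatial computation).
spatial = [(operator.mul, 3, False), (operator.add, 4, False), (operator.rshift, 1, False)]
# Same operations with the first output registered: two cycles (temporal computation).
temporal = [(operator.mul, 3, True), (operator.add, 4, False), (operator.rshift, 1, False)]
```

Both chains compute the same value for a given input; only the number of cycles changes, which is the trade-off the two output modes expose.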
RFU-Interconnection Layer
• 1-D Array of PEs
• Operands from Register File
• Constant Values from Local Memory
• Input Network
• Operand Select
• Output Network => Delivers Results to corresponding pipeline stages
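One way to picture the operand selection on this slide is a small functional model in which each PE picks its two inputs from the register-file operands, the local constant memory, or the output of an earlier PE in the array. The restriction to earlier PEs, the `("reg", i)` / `("const", i)` / `("pe", i)` selector format, and the function name `evaluate_rfu` are assumptions made for this sketch only.

```python
import operator


def evaluate_rfu(config, out_pe, regs, consts):
    """Evaluate a 1-D array of PEs.

    config: list of (op, sel_a, sel_b), one entry per PE, in array order.
    out_pe: index of the PE whose result the output network delivers.
    regs:   operands read from the register file (up to four).
    consts: read-only values held in local memory.
    """
    results = []

    def select(src):
        kind, idx = src
        if kind == "reg":
            return regs[idx]
        if kind == "const":
            return consts[idx]
        if kind == "pe":
            if idx >= len(results):
                raise ValueError("a PE may only read an earlier PE in this sketch")
            return results[idx]
        raise ValueError(f"unknown operand source: {kind}")

    for op, sel_a, sel_b in config:
        results.append(op(select(sel_a), select(sel_b)))
    return results[out_pe]


# (r0 + r1) * c0 - r3 as one multiple-input-single-output instruction.
config = [
    (operator.add, ("reg", 0), ("reg", 1)),
    (operator.mul, ("pe", 0), ("const", 0)),
    (operator.sub, ("pe", 1), ("reg", 3)),
]
```

Calling `evaluate_rfu(config, 2, [5, 7, 2, 9], [3])` reads three of the four register operands and one constant and delivers the single result of the last PE.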
RFU-Configuration Layer
• Configuration Bits Local Storage Structure
• Multi-Context Configuration Layer
• Coarse Grain => Small Number of Configuration Bits => Negligible Overhead to Download new Contexts
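A multi-context configuration store can be modelled as a handful of slots, where downloading a context fills a slot and activating one only changes which slot drives the array. The class below is a minimal sketch under that assumption; the slide does not specify the number of contexts, the width of a configuration word, or how a context is chosen, so `ConfigurationLayer` and its methods are hypothetical.

```python
class ConfigurationLayer:
    """Toy multi-context configuration storage (all names are illustrative)."""

    def __init__(self, num_contexts: int):
        self._slots = [None] * num_contexts  # one configuration word per context
        self._active = None                  # index of the context in use

    def download(self, slot: int, bits: int) -> None:
        """Store configuration bits in a slot without disturbing the active one."""
        self._slots[slot] = bits

    def activate(self, slot: int) -> None:
        """Switch the array to an already downloaded context."""
        if self._slots[slot] is None:
            raise ValueError(f"context slot {slot} is empty")
        self._active = slot

    def active_bits(self) -> int:
        """Return the configuration word currently driving the array."""
        if self._active is None:
            raise RuntimeError("no context has been activated")
        return self._slots[self._active]
```

Keeping download and activation separate mirrors the point of the slide: with few configuration bits per context, a new context can be loaded cheaply while another one is in use.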
Architecture Synthesis & Evaluation
• A Hardware Model (VHDL) was Designed for Evaluation Purposes
• The Model was Synthesized with STM 0.13 µm Process
• The RFU Area Overhead is 3.3x the Area of the Core Processor
• No Caches were taken into account
• No Overhead to Core Critical Path
Demonstration-RFU Elaboration
• Largest MaxMISO for a Quantization Kernel
  • Execution on the Core => six cycles
  • Execution on the Core+RFU => one cycle
• Performance Improvements
• Reduced Instruction Memory Accesses
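To relate the six-to-one cycle reduction to whole-program performance, Amdahl's law gives a rough estimate. The sketch below treats six-to-one as the local speed-up of the accelerated region; the fraction of execution time spent in that region is not given on the slide, so the 0.6 used in the example is a hypothetical value, and `overall_speedup` is an illustrative helper.

```python
def overall_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: speed-up of the whole program when only a fraction
    of its execution time is accelerated by local_speedup."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / local_speedup)


# Six cycles on the core versus one cycle on core+RFU gives a local speed-up of 6.
# HYPOTHETICAL: assume 60% of the run time is spent in the accelerated region.
estimate = overall_speedup(0.6, 6.0)
```

The estimate is bounded by the unaccelerated part of the code, which is why the kernels that dominate execution time are the ones worth mapping to the RFU.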
Results
• Speed-Ups for Several Kernels – Core Vs. Core+RFU
• Energy Consumption Dominated by Memory Accesses
Conclusions
• A RISC Processor Enhanced by a Run-Time Reconfigurable Function Unit
• 1-D Reconfigurable Array of Coarse Grain Processing Elements
• Multiple-Input-Single-Output Reconfigurable Instructions
• Specific Software Development Environment
• Low Cost Performance and Energy Consumption Improvements
Next Step => Expand to VLIW Elaboration to Boost Achieved Speed-Ups