230 likes | 330 Views
Design Methodology for Customizable Programmable Processors Berkeley – Finland Day, Oct. 18, 2002. Prof. Jarmo Takala Institute of Digital and Computer Systems Tampere University of Technology Tampere, Finland Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi. Outline. Motivation
E N D
Design Methodology for Customizable Programmable ProcessorsBerkeley – Finland Day, Oct. 18, 2002 Prof. Jarmo Takala Institute of Digital and Computer Systems Tampere University of Technology Tampere, Finland Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi
Outline • Motivation • Transport Triggered Architecture (TTA) • Design Methodology for TTAs • Research at TUT • Conclusions
Motivation • Programmable processors often used in products using digital signal processing (DSP) • Flexibility • Ease of verification • Traditionally DSP processor architectures have been developed based on average performance in several benchmark tasks (~100) • User applications often contain only subset of total benchmarks • Efficiency can be improved by customizing architecture according to given tasks
Motivation • DSP applications are often hard realtime constrained • execution should be deterministic • dynamic runtime behaviours should be avoided • Static scheduling lends itself to DSP • Current design complexities call for increase in designer productivity • High level languages should be used • DSP algorithms contain inherent parallelism • Instruction level parallelism (ILP) should be maximized
What is needed? • Application driven design process with easy design space exploration • Replace hardware complexity by software complexity • Compiler driven process • Use templated architecture • Flexible • heterogeneous function units • Modular • scalability • Orthogonal • compiler friendly
Choices for Architecture Template Application ILP Architectures Frontend sequential (superscalar) Determine Dependencies Determine Dependencies dependence(dataflow) Determine Independencies Determine Independencies independence(EPIC) Bind Function Units Bind Function Units independence (VLIW) Bind Datapaths & Execute Compilation time (Software) Run time (Hardware)
VLIW Gained Popularity in DSP FU-1 FU-2 Instruction Fetch Bypassing Network Instruction Decode Instruction Memory Data Memory Register File FU-3 FU-4 FU-5 CPU
Transport Triggered Architecture • VLIW drawbacks • Bypass complexity • Register file complexity • Register file design restricts FU flexibility • Operation encoding format restricts FU flexibility • Reverse programming paradigm [H. Corporaal, 94] • data transport operation • Instruction set contains only a single instruction: move
FU-1 FU-2 FU-3 Instruction Decode Instruction Fetch Bypassing Network FU-4 FU-5 RegisterFile From VLIW to TTA FU-1 FU-2 FU-3 Register File Instruction Memory Data Memory Instruction Decode Instruction Fetch Bypassing Network FU-4 FU-5 VLIW TTA
TTA Datapath Data Memory Load/StoreUnit Load/StoreUnit IntegerALU IntegerALU FloatALU Socket Integer RF Float RF Boolean RF Instruction Unit Immediate Unit Instruction Memory
Operands written to operand registers (O) Operation performed when last operand written to trigger register (T) Pipeline synchronized with control bits (C) Standard interface FU_ready Result_ready Global_lock Function Units Optional shadow register C T O logic C logic C logic C R optional
ILP Architectures Application Frontend sequential (superscalar) Determine Dependencies Determine Dependencies dependence(dataflow) Determine Independencies Determine Independencies independence(EPIC) Bind Function Units Bind Function Units independence (VLIW) Bind Datapaths Bind Datapaths independence (TTA) Execute Compilation time Run time
TTA Characteristics: HW • Modular • Can be constructed with standard building blocks • Very flexible and scalable • FU functionality can be arbitrary • Supports user defined Special Function Units (SFU) • Lower complexity • Reduction on # register ports • Reduced bypass complexity • Reduction in bypass connectivity • Reduced register pressure • Trivial decoding (implies long instructions)
TTA Characteristics: SW • Traditional operation-triggered instruction: • Transport-triggered instruction: • Reminds dataflow and time-stationary coding mul r1,r2,r3; r1mul.o; r2mul.t; mul.rr3; or r1mul.o, r2mul.t; mul.rr3;
TTA Design Tools • Design tools based on TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands • MOVE project lead by Prof. Henk Corporaal • Fully parametric C/C++ Compiler • buses, connections, function units, register files, etc. • Design space explorer • Processor generator
Code Generation Trajectory Application (C/C++) GCC or SUIF Compiler Frontend Architecture Description Sequential Simulator I/O Sequential Code Compiler Backend Profiling Data I/O Parallel Code Parallel Simulator (MOVE Project at DUT)
TTA Specific Optimizations • TTA allows extra scheduling optimizations • E.g., software bypassing • Bypassing can eliminate the need of RF access • However, more difficult to schedule ! Example: r1 → add.o, r2 → add.t; add.r →r3; r3→ sub.o, r4 → sub.t sub.r → r5; Translates to: r1 → add.o, r2 → add.t; add.r → sub.o, r4 → sub.t; sub.r → r5;
Design Space Exploration Application(C/C++) Resources(Mach) Frontend ResourceOptimization Map&Schedule Select Resources Simulator FU modelsCost Functions Design Points ConnectivityOptimization Map&Schedule Reduce Connections Simulator Design Point (MOVE Project at DUT)
ALU ALU LSU LSU LSU IRU IRU IU IU IU Exploration: Resourse Optimization (MOVE Project at DUT) Pareto curve represents the lowest bound of found architecture configurations Selected architecture for further optimization
ALU ALU LSU LSU LSU IRU IRU IU IU IU Exploration: Connectivity Optimization (MOVE Project at DUT) Reduced connections decrease bus delay Critical connections have been removed
Topics to be Investigated • Poor code density • good target for code compression techniques • apriori information of application, thus instruction propabilities known • Estimations • Power estimation • Fast estimations with sufficient accuracy • Flexibity, reuse • Applications may change, thus additional resources need to assigned although not needed by the original application • Tool-assisted special function unit generation • Analysis support • Model creation support • Characterization support • Parameterized processor generator • Interconnections, control, etc. maybe realized in several ways depending on the target • Low-power optimizations • Clustered TTAs • Interprocessor communication schemes • These topics considered in FlexDSP Project at TUT
TTA Processor New Design Environment Target of FlexDSP Project at TUT Functionality(C/C++) Frontend FU models(C, HDL)Cost Functions (area, power, speed) OperationAnalysis ResourceConstraints Design SpaceExploration SFU Generation Parametric Processor Generator Code Compression Parametric Compiler ParallelObject Code HDLCode
Conclusions • Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom • TTA is a promising candidate for architectural template for customized processors • In particular, support for custom function units allows powerful tailoring • Results of MOVE project at DUT have already proven the concept • Parameterized compiler allows tool-assisted design space exploration • Still more research needed on • Hardware implementations • Enhanced compiler strategies