Design Methodology for Customizable Programmable Processors Berkeley – Finland Day, Oct. 18, 2002

Design Methodology for Customizable Programmable Processors • Berkeley – Finland Day, Oct. 18, 2002 • Prof. Jarmo Takala • Institute of Digital and Computer Systems, Tampere University of Technology, Tampere, Finland • Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi

Outline • Motivation • Transport Triggered Architecture (TTA) • Design Methodology for TTAs • Research at TUT • Conclusions

Motivation • Programmable processors are often used in products that rely on digital signal processing (DSP) • Flexibility • Ease of verification • Traditionally, DSP processor architectures have been developed based on average performance over a set of benchmark tasks (~100) • User applications often contain only a subset of the benchmarks • Efficiency can be improved by customizing the architecture for the given tasks

Motivation • DSP applications often have hard real-time constraints • Execution should be deterministic • Dynamic run-time behaviour should be avoided • Static scheduling lends itself to DSP • Current design complexity calls for higher designer productivity • High-level languages should be used • DSP algorithms contain inherent parallelism • Instruction-level parallelism (ILP) should be maximized

What is needed? • Application-driven design process with easy design space exploration • Replace hardware complexity with software complexity • Compiler-driven process • Use a templated architecture • Flexible: heterogeneous function units • Modular: scalability • Orthogonal: compiler friendly

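Choices for Architecture Template [Figure: ILP architectures ordered by how much of the path from application to execution is handled at compilation time (software) rather than at run time (hardware). Stages: Frontend • Determine Dependencies • Determine Independencies • Bind Function Units • Bind Datapaths & Execute. Compile-time work ends after Frontend for sequential (superscalar), after Determine Dependencies for dependence (dataflow), after Determine Independencies for independence (EPIC), and after Bind Function Units for independence (VLIW).]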

VLIW Gained Popularity in DSP [Figure: block diagram of a VLIW CPU with instruction fetch, instruction decode, function units FU-1 to FU-5, a bypassing network and a register file, connected to instruction memory and data memory.]

Transport Triggered Architecture • VLIW drawbacks • Bypass complexity • Register file complexity • Register file design restricts FU flexibility • Operation encoding format restricts FU flexibility • Reverse programming paradigm [H. Corporaal, 94] • data transport → operation • Instruction set contains only a single instruction: move

From VLIW to TTA [Figure: two block diagrams side by side, labelled VLIW and TTA. Each shows instruction fetch, instruction decode, function units FU-1 to FU-5, a bypassing network and a register file, together with instruction memory and data memory.]

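TTA Datapath [Figure: TTA datapath in which two load/store units, two integer ALUs, a float ALU, the integer, float and boolean register files, the instruction unit and the immediate unit are attached through sockets; the load/store units connect to data memory and the instruction unit to instruction memory.]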

Function Units • Operands written to operand registers (O) • Operation performed when last operand written to trigger register (T) • Pipeline synchronized with control bits (C) • Standard interface • FU_ready • Result_ready • Global_lock • Optional shadow register [Figure: function unit pipeline with O and T input registers, logic stages each with control bits C, and an optional result register R.]
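The trigger semantics of a function unit can be illustrated with a small behavioural model. This is a minimal sketch, not the MOVE hardware: the class name `FunctionUnit` and its methods are hypothetical, the unit has a single operand register, and pipeline latency and the control signals (FU_ready, Result_ready, Global_lock) are ignored.

```python
class FunctionUnit:
    """Toy function unit with one operand (O), one trigger (T) and one result (R) register.

    Assumption: the result is available immediately after the trigger write;
    a real unit is pipelined and synchronized with control bits.
    """

    def __init__(self, name, operation):
        self.name = name
        self.operation = operation  # two-argument callable, e.g. addition
        self.operand = 0            # O register
        self.result = None          # R register

    def write_operand(self, value):
        # Writing O only latches the value; no operation starts.
        self.operand = value

    def write_trigger(self, value):
        # Writing T starts the operation on the latched operand and the new value.
        self.result = self.operation(self.operand, value)

    def read_result(self):
        return self.result


adder = FunctionUnit("add", lambda a, b: a + b)
adder.write_operand(3)       # r1 -> add.o
adder.write_trigger(4)       # r2 -> add.t, the addition happens here
print(adder.read_result())   # 7
```

Only the write to the trigger register has a side effect, which is why the order of the two input moves matters to the scheduler.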

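ILP Architectures [Figure: the same division of work between compilation time and run time, now including TTA. Stages: Frontend • Determine Dependencies • Determine Independencies • Bind Function Units • Bind Datapaths • Execute. Compile-time work ends after Frontend for sequential (superscalar), after Determine Dependencies for dependence (dataflow), after Determine Independencies for independence (EPIC), after Bind Function Units for independence (VLIW), and after Bind Datapaths for independence (TTA), which leaves only Execute to run time.]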

TTA Characteristics: HW • Modular • Can be constructed with standard building blocks • Very flexible and scalable • FU functionality can be arbitrary • Supports user-defined Special Function Units (SFU) • Lower complexity • Fewer register ports • Reduced bypass complexity • Reduced bypass connectivity • Reduced register pressure • Trivial decoding (implies long instructions)

TTA Characteristics: SW • Traditional operation-triggered instruction: mul r1, r2, r3; • Transport-triggered instruction: r1 → mul.o; r2 → mul.t; mul.r → r3; or r1 → mul.o, r2 → mul.t; mul.r → r3; • Resembles dataflow and time-stationary coding
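The mapping from an operation-triggered instruction to moves can be sketched in a few lines. The helper names `to_moves` and `format_moves` are hypothetical, and the port suffixes `.o`, `.t` and `.r` follow the notation of the slide; this illustrates the idea and is not the MOVE compiler's code generator.

```python
def to_moves(opcode, src1, src2, dst):
    """Expand `opcode src1, src2, dst` into three transport-triggered moves.

    Each move is a (source, destination) pair. The first source goes to the
    operand port, the second to the trigger port, and the result port is
    copied to the destination register.
    """
    return [
        (src1, f"{opcode}.o"),
        (src2, f"{opcode}.t"),
        (f"{opcode}.r", dst),
    ]


def format_moves(moves):
    # One move per instruction slot, written in the slide's arrow notation.
    return " ".join(f"{src} -> {dst};" for src, dst in moves)


print(format_moves(to_moves("mul", "r1", "r2", "r3")))
# r1 -> mul.o; r2 -> mul.t; mul.r -> r3;
```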

TTA Design Tools • Design tools based on the TTA template have been developed at Delft University of Technology (DUT), Delft, the Netherlands • MOVE project led by Prof. Henk Corporaal • Fully parametric C/C++ compiler • Parameters: buses, connections, function units, register files, etc. • Design space explorer • Processor generator

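Code Generation Trajectory (MOVE Project at DUT) [Figure: flow from Application (C/C++) through the GCC or SUIF compiler frontend to sequential code, which runs on the sequential simulator (with I/O) and yields profiling data; the compiler backend, driven by the architecture description, produces parallel code, which runs on the parallel simulator (with I/O).]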

TTA Specific Optimizations • TTA allows extra scheduling optimizations • E.g., software bypassing • Bypassing can eliminate the need for RF access • However, more difficult to schedule! • Example: r1 → add.o, r2 → add.t; add.r → r3; r3 → sub.o, r4 → sub.t; sub.r → r5; • Translates to (when r3 is not read afterwards): r1 → add.o, r2 → add.t; add.r → sub.o, r4 → sub.t; sub.r → r5;
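Software bypassing can be sketched as a pass over a straight-line list of moves. This is a simplified illustration under stated assumptions, not the MOVE scheduler: the function name `software_bypass` and the `live_out` parameter are hypothetical, registers are any names without a dot, a result port keeps its value until its unit is triggered again, and timing and bus availability are ignored.

```python
def is_register(name):
    # Assumption: FU ports contain a dot (add.o, add.t, add.r); registers do not.
    return "." not in name


def software_bypass(moves, live_out=()):
    """Forward FU results directly to consumers and drop dead register writes."""
    producer = {}  # register -> result port that currently holds the same value
    bypassed = []
    for src, dst in moves:
        # Read straight from the producing unit instead of the register file.
        src = producer.get(src, src)
        if dst.endswith(".t"):
            # Triggering a unit overwrites its result, so older forwards are stale.
            unit = dst.split(".")[0]
            producer = {r: p for r, p in producer.items() if p.split(".")[0] != unit}
        if is_register(dst):
            if src.endswith(".r"):
                producer[dst] = src
            else:
                producer.pop(dst, None)
        bypassed.append((src, dst))

    # Remove register writes that are never read again and are not live at exit.
    result = []
    for i, (src, dst) in enumerate(bypassed):
        if is_register(dst) and dst not in live_out:
            read_later = False
            for later_src, later_dst in bypassed[i + 1:]:
                if later_src == dst:
                    read_later = True
                    break
                if later_dst == dst:
                    break  # overwritten before any read
            if not read_later:
                continue
        result.append((src, dst))
    return result


code = [("r1", "add.o"), ("r2", "add.t"), ("add.r", "r3"),
        ("r3", "sub.o"), ("r4", "sub.t"), ("sub.r", "r5")]
for src, dst in software_bypass(code, live_out={"r5"}):
    print(f"{src} -> {dst}")
```

With `live_out={"r5"}` the write to r3 disappears and the sequence has five moves, matching the translated example. If r3 is also listed as live, the write is kept and only the read is bypassed.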

Design Space Exploration (MOVE Project at DUT) [Figure: exploration flow. Inputs: Application (C/C++) via the Frontend, Resources (Mach), FU models and cost functions. Resource Optimization loop: Map & Schedule, Select Resources, Simulator, producing design points. Connectivity Optimization loop: Map & Schedule, Reduce Connections, Simulator, producing the final design point.]

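Exploration: Resource Optimization (MOVE Project at DUT) • Pareto curve represents the lower bound of the architecture configurations found • One architecture is selected for further optimization [Figure: plot of explored design points with the Pareto curve; the selected architecture is drawn with function units labelled ALU, LSU, IRU and IU.]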
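The Pareto curve over explored design points can be computed with a simple sweep. This is a minimal sketch that assumes each design point is a hypothetical `(cost, cycles)` pair where lower is better in both dimensions; the numbers are made up for illustration and are not MOVE results.

```python
def pareto_front(points):
    """Return the points that no other point beats in both cost and cycles."""
    front = []
    best_cycles = float("inf")
    # Sorting by cost first means a point is on the front only if it
    # improves on the best cycle count of every cheaper point.
    for cost, cycles in sorted(set(points)):
        if cycles < best_cycles:
            front.append((cost, cycles))
            best_cycles = cycles
    return front


# Hypothetical design points: (area cost, execution cycles).
design_points = [(10, 900), (12, 700), (12, 800), (15, 700), (20, 400), (25, 450)]
print(pareto_front(design_points))
# [(10, 900), (12, 700), (20, 400)]
```

A designer would then pick one point on this front, trading cost against execution time, as the starting architecture for connectivity optimization.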

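Exploration: Connectivity Optimization (MOVE Project at DUT) • Reduced connections decrease bus delay • Critical connections have been removed [Figure: plot of design points during connectivity reduction, annotated with the two remarks above; the architecture is drawn with function units labelled ALU, LSU, IRU and IU.]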

Topics to be Investigated • Poor code density • Good target for code compression techniques • The application is known a priori, thus instruction probabilities are known • Estimations • Power estimation • Fast estimations with sufficient accuracy • Flexibility, reuse • Applications may change, thus additional resources may need to be assigned although not needed by the original application • Tool-assisted special function unit generation • Analysis support • Model creation support • Characterization support • Parameterized processor generator • Interconnections, control, etc. may be realized in several ways depending on the target • Low-power optimizations • Clustered TTAs • Interprocessor communication schemes • These topics are considered in the FlexDSP Project at TUT
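To illustrate why known instruction probabilities help code compression, the sketch below derives Huffman code lengths from a move profile. The slides do not name a compression technique, so Huffman coding is used here only as a familiar example; the function name `huffman_code_lengths` and the profile counts are hypothetical.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol below them.
        merged = {sym: length + 1 for sym, length in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical profile: how often each move occurs in the application.
profile = {"r1 -> add.o": 50, "r2 -> add.t": 25, "add.r -> r3": 15, "r3 -> sub.o": 10}
lengths = huffman_code_lengths(profile)
average_bits = sum(profile[m] * lengths[m] for m in profile) / sum(profile.values())
print(average_bits)  # 1.75
```

A fixed-width encoding of four distinct moves needs 2 bits each, so this skewed profile averages fewer bits per move once frequent moves get shorter codes.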

New Design Environment: Target of FlexDSP Project at TUT [Figure: tool flow from Functionality (C/C++) through the Frontend and Operation Analysis to Design Space Exploration, which uses FU models (C, HDL), cost functions (area, power, speed) and resource constraints. Further blocks: SFU Generation, Parametric Processor Generator, Code Compression and Parametric Compiler. Outputs: parallel object code and HDL code for the TTA processor.]

Conclusions • Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom • TTA is a promising candidate architectural template for customized processors • In particular, support for custom function units allows powerful tailoring • Results of the MOVE project at DUT have already proven the concept • Parameterized compiler allows tool-assisted design space exploration • More research is still needed on • Hardware implementations • Enhanced compiler strategies
