Overview September 2004

Implementing High Performance DSP Systems on Heterogeneous Programmable Platforms
Roger Woods and John McAllister
Programmable Systems Laboratory, Institute of Electronics, Communications and Information Technology (ECIT), Queen’s University Belfast

PSL@QUB
• Sonic Arts Research Centre
  • Involves Music, Computer Science, Electrical Engineering
  • FPGAs for synthesis of musical instruments
• Electronics, Communications and Information Technology (ECIT)
  • RF, SoC, software engineering, speech recognition, image processing
  • Strong application focus
• Programmable Systems Lab
  • Programmable IC Platforms for Programmable IP Networks (PIPPIN) – network solutions for streaming video
  • System level design for heterogeneous FPGA-centric embedded DSP systems – Abhainn

Contents
• Motivation
• Drivers
• Current design approaches
• Platform based design
• Function/architecture co-design
• Modelling languages
• Abhainn – System level FPGA-centric Embedded System Design
• Multi-dimensional array dataflow graph
• Relationship to hardware cores
• Normalised Lattice Filter
• Conclusions

Drivers
• Heterogeneous platforms
  • GPPs, DSPs, FPGAs
  • Programmable resource for complex DSP
• Clear need for a holistic, system level design flow
  • Increased abstraction gives a combined HW/SW view
  • Optimisation from a high level has a bigger impact
• Huge body of pre-defined IP
  • Wide range of existing functions, optimised for performance
  • Element of bottom-up design, e.g. pre-defined timing

Design approaches
• Platform based design
  • Defined system platform, because the unconstrained design space is too large
• Function/architecture co-design
  • Concurrent architecture derivation and algorithm refinement
• Formalised approaches, e.g. dataflow graphs (DFGs), are mature for multiprocessors, e.g. GRAPE-II, Ptolemy
  • Independent model of computation (MoC) based specification
  • Rapid system implementation from algorithm specification
  • Automated inter-processor communication (IPC) realisation
• Issues:
  • Optimisation whilst protecting core implementation
  • Need for increased core utilisation
  • Balance synthesis of dedicated/programmable FPGA resource

Abhainn Ethos

Abhainn Rapid Implementation
• Generic Input:
  • Algorithm modelling tool
  • Target technologies
  • Target device
• Technology Specific Mapping: mapped to specific processing and IPC technologies
• Target Specific Mapping: mapped to specific target devices

Gedae Multiprocessor Synthesis
• Gedae provides:
  • Rapid implementation for multiprocessors in a platform portable manner
  • Designer control of the implementation via standardised transformations

Dataflow Specification
[Figure labels: Actor; Arc: stream of tokens; Input port: consumes T tokens per firing; Output port: produces T tokens per firing]
• Actors fire granularity (G) times per iteration
• Designer control of:
  • T at each port
  • G of each actor
  • Dimensions (X) of tokens traversing arcs
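The firing rule on this slide can be sketched in a few lines of Python. This is only an illustrative model, not Gedae's API: the `Arc` and `Actor` classes, `run_iteration`, and the reading of T as a single per-port threshold shared by input and output are all assumptions made for the example.

```python
from collections import deque


class Arc:
    """FIFO stream of tokens between two actors (hypothetical model)."""

    def __init__(self):
        self.tokens = deque()


class Actor:
    """Consumes T tokens and produces T tokens on each firing."""

    def __init__(self, func, threshold, in_arc, out_arc):
        self.func = func            # per-token operation (assumed)
        self.threshold = threshold  # T: tokens per firing at each port
        self.in_arc = in_arc
        self.out_arc = out_arc
        self.firings = 0

    def can_fire(self):
        # The actor may only fire once T tokens are waiting on its input.
        return len(self.in_arc.tokens) >= self.threshold

    def fire(self):
        consumed = [self.in_arc.tokens.popleft() for _ in range(self.threshold)]
        self.out_arc.tokens.extend(self.func(t) for t in consumed)
        self.firings += 1


def run_iteration(actor, granularity):
    """Fire the actor up to G times in one iteration of the graph."""
    for _ in range(granularity):
        if not actor.can_fire():
            break
        actor.fire()


src, dst = Arc(), Arc()
src.tokens.extend(range(8))
doubler = Actor(lambda x: 2 * x, threshold=2, in_arc=src, out_arc=dst)
run_iteration(doubler, granularity=4)
print(doubler.firings, list(dst.tokens))
```

With T = 2 and G = 4 the actor drains all eight input tokens in one iteration; changing either value changes how the same work is divided into firings.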

Threshold Optimisation
• Run-time overheads:
  • Inter-processor communication
  • Dynamic scheduling
  • Actor firing overheads
• Solution: threshold multiplication
• Threshold multiplication
  • Enhanced run-time performance
  • Higher memory requirements
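The trade-off behind threshold multiplication can be illustrated with a simple count. The function name and the cost model below (one fixed overhead per firing, buffer size equal to the threshold) are assumptions made for this sketch, not figures from the talk.

```python
def threshold_tradeoff(total_tokens, base_threshold, multiplier):
    """Return (firings, buffer_tokens) for a multiplied threshold.

    Hypothetical model: each firing consumes `threshold` tokens, so a larger
    threshold means fewer firings (less per-firing overhead) but a larger
    buffer that must fill before the actor can fire.
    """
    threshold = base_threshold * multiplier
    firings = total_tokens // threshold
    buffer_tokens = threshold
    return firings, buffer_tokens


results = {k: threshold_tradeoff(1024, 4, k) for k in (1, 2, 4, 8)}
for k, (firings, buffer_tokens) in results.items():
    print(f"multiplier={k}: firings={firings}, buffer={buffer_tokens} tokens")
```

Each doubling of the multiplier halves the number of firings and doubles the buffer, which is the "enhanced run-time performance, higher memory requirements" exchange listed above.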

Granularity Optimisation
• Sub-scheduling
  • Break N firings in one execution into one firing in each of N executions
  • Granularity factorisation
• Granularity scaling
  • Execute in smaller memory
  • Higher run-time overheads
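Sub-scheduling moves in the opposite direction to threshold multiplication, and a small sketch makes the memory effect visible. The helper names and the assumption that peak buffer size is proportional to the firings in one execution are illustrative only.

```python
def sub_schedule(granularity, factor):
    """Split `granularity` firings into `factor` equal executions.

    factor == 1 keeps all firings in one execution; factor == granularity
    gives one firing in each of N executions, as described on the slide.
    """
    if granularity % factor != 0:
        raise ValueError("factor must divide the granularity")
    return [granularity // factor] * factor


def peak_buffer(schedule, tokens_per_firing):
    """Tokens that must be buffered for the largest single execution."""
    return max(schedule) * tokens_per_firing


whole = sub_schedule(8, 1)   # one execution of 8 firings
split = sub_schedule(8, 8)   # 8 executions of one firing each
print(whole, peak_buffer(whole, 4))
print(split, peak_buffer(split, 4))
```

The split schedule needs an eighth of the buffer but pays the per-execution overhead eight times.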

Hardware Design Ethos
• Parallelise on the p dimension of m2
• Number of processors vs. input matrix dimensions
  • Regular and parameterisable trade-off
  • Represents a powerful trade-off when enabled in the DFG
• Enabled from Gedae

Multidimensional Array SDF
• Complements MSDF with variable size actor families, as defined by the Processing Graph Method and used in Gedae
• Varying the y parameter trades off the number of m_mult operations against the token dimensions for each
• MASDF actor family → family of pipelined hardware components with variable token dimensions
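A rough way to picture the y trade-off is a matrix multiply whose second operand is cut into y column slices, each handled by its own m_mult. The sketch assumes that y counts the slices taken along the p dimension of m2; the slides do not define y precisely, and `split_columns` and `family_m_mult` are hypothetical helper names.

```python
def m_mult(m1, m2):
    """Plain matrix multiply on nested lists: (m x n) times (n x p)."""
    inner, cols = len(m2), len(m2[0])
    return [
        [sum(row[k] * m2[k][j] for k in range(inner)) for j in range(cols)]
        for row in m1
    ]


def split_columns(m2, y):
    """Cut m2 into y equal slices along its column (p) dimension."""
    p = len(m2[0])
    if p % y != 0:
        raise ValueError("y must divide the p dimension of m2")
    width = p // y
    return [[row[s * width:(s + 1) * width] for row in m2] for s in range(y)]


def family_m_mult(m1, m2, y):
    """Use y smaller m_mult operations, then rejoin the column slices."""
    partials = [m_mult(m1, block) for block in split_columns(m2, y)]
    return [sum((part[i] for part in partials), []) for i in range(len(m1))]


m1 = [[1, 2, 3], [4, 5, 6]]
m2 = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1]]
reference = m_mult(m1, m2)
print(reference)
print([family_m_mult(m1, m2, y) == reference for y in (1, 2, 4)])
```

Larger y means more m_mult instances, each working on narrower tokens; the result is the same for every valid y.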

Multi-iteration Token Processing
• T_a is composed of a family of base tokens T_b
• Each family child is consumed in one invocation of the actor
• Different behaviour of the actor over multiple firings
• Cyclic dataflow

MASDF Actor Sharing
• m_mult is now a cyclo-static operator
• On the ith firing:
  • T tokens are consumed from the ith input child port
  • T tokens are produced on the ith output child port
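The cyclo-static behaviour described here can be modelled as one shared operator that steps through its child ports in a fixed cycle. The `SharedActor` class and its wrap-around phase counter are assumptions for illustration, not the structure of the actual hardware or of Gedae.

```python
from collections import deque


class SharedActor:
    """One operator shared across a family of child ports (hypothetical)."""

    def __init__(self, func, num_children, threshold):
        self.func = func
        self.threshold = threshold  # T tokens per firing
        self.inputs = [deque() for _ in range(num_children)]
        self.outputs = [deque() for _ in range(num_children)]
        self.phase = 0              # index i of the next firing in the cycle

    def fire(self):
        """Serve child port i, then advance cyclically to the next child."""
        i = self.phase
        if len(self.inputs[i]) < self.threshold:
            return False            # not enough tokens on the ith input port
        for _ in range(self.threshold):
            self.outputs[i].append(self.func(self.inputs[i].popleft()))
        self.phase = (i + 1) % len(self.inputs)
        return True


actor = SharedActor(lambda x: x + 1, num_children=3, threshold=2)
for child in range(3):
    actor.inputs[child].extend([child * 10, child * 10 + 1])

fired = [actor.fire() for _ in range(3)]
print(fired, [list(out) for out in actor.outputs])
```

After three firings every child has been served once and the phase has wrapped back to child 0.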

SFO Structure
• Control and Communications Wrapper: implements the cyclic schedule, switching data into and out of the central WBC unit
• Parameter Bank: local storage for core parameters, e.g. tap weights
• White Box Component (WBC): flexible pipelined core, configurable for various token sizes

NLF Design Example
• 8-stage NLF, 8 element vector tokens
• Base token scalar
• Only manipulating the y graph parameter
• SFG architectural synthesis only capable of retiming
• Smallest supporting Virtex-II Pro family member

NLF SFG
• Primitive (lowest) level components, single stage pipelined:
  • Adders: programmable CLBs
  • Multipliers: embedded mult18x18s
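The slide lists only the primitive components, not the stage equations. As a rough orientation, the sketch below assumes the common four-multiplier, two-adder rotation form of a normalised lattice stage; the function name `nlf_stage`, the signal names and the coefficient k are assumptions for illustration and may differ from the SFG used in the talk.

```python
import math


def nlf_stage(f_in, g_delayed, k):
    """One normalised lattice stage (assumed rotation form).

    Uses four multiplications and two additions, with |k| < 1 and
    c = sqrt(1 - k^2), so the stage acts as a plane rotation.
    """
    c = math.sqrt(1.0 - k * k)
    f_out = c * f_in - k * g_delayed
    g_out = k * f_in + c * g_delayed
    return f_out, g_out


f_in, g_delayed, k = 3.0, 4.0, 0.6
f_out, g_out = nlf_stage(f_in, g_delayed, k)
energy_in = f_in ** 2 + g_delayed ** 2
energy_out = f_out ** 2 + g_out ** 2
print(energy_in, energy_out)
```

Because the stage is a rotation, signal energy is preserved from input to output, which is the property that makes the normalised form attractive for fixed-point hardware multipliers.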

WBC Inefficiency
• New input sample every 4 clock cycles

NLF Core Design
• Results
  • Virtex-II Pro target device
  • Factor 3.9 increase in SFO throughput for no extra hardware
  • Order of magnitude reduction in required device size
  • All enabled by altering one parameter (y) on the DFG

Conclusions
• FPGA is viewed as a hardware resource
  • Using existing functionality (IP cores) is a key aspect of the design process
  • Key is to represent this at the system level
• Restrictions
  • Streaming based
  • Fixed hardware target platform
• Reconfiguration
  • Specifically, more suitable “reconfigurable” hardware is needed
  • Clear need to emphasise reconfiguration in the design flow
  • Reconfiguration mux (Imperial College)

And finally…
• Acknowledgements: Ying Yi, J-P Heron, Richard Turner, Gaye Lightbody, David Trainor, Scott Fischaber, Eoin Malins, Tim Courtney, Lok Kee Ting, Sakir Sezer
• Thanks for the invitation – great fun!
