Model Based Design for DSP: Presentation to Stevens
Maryland DSPCAD Research Group (http://www.ece.umd.edu/DSPCAD/home/dspcad.htm)
Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies, University of Maryland, College Park
Will Plishker, Chung-Ching Shen, Nimish Sane, George Zaki, Soujanya Kedilaya, Shuvra S. Bhattacharyya
Outline
• Model Based Design
• Dataflow Interchange Format
• Multiprocessor Scheduling
• Preliminary Setup and Results with GPUs
• Future Directions
Introduction: The Implementation Gap
• In modern, complex systems we would like to:
  • Create an application description independent of the target
  • Interface with a diverse set of tools and teams
  • Achieve high performance
  • Arrive at an initial prototype quickly
• But algorithms are far removed from their final implementations:
  • Low-level programming environments
  • Diverse and changing platforms
  • Non-uniform functional verification
  • Entrenched design processes
  • Tool selection
[Figure: the gap between an abstract representation of an algorithm and a low-level, high-performance implementation, illustrated by a pattern-comparator datapath built from threshold modules, adders, and E/Gamma and fine-grain decision logic]
Model-Based Design for Embedded Systems
• High-level application subsystems are specified in terms of components that interact through formal models of computation
• C or other "platform-oriented" languages can be used to specify intra-component behavior
• A model-specific language can be used to specify inter-component behavior
• Object-oriented techniques can be used to maintain libraries of components
• Popular models for embedded systems:
  • Dataflow and KPNs (Kahn process networks)
  • Continuous time, discrete event
  • FSMs and related control formalisms
Dataflow-based Design: Related Trends
• Dataflow-based design (in our context) is a specific form of model-based design
• Dataflow-based design is complementary to:
  • Object-oriented design
  • DSP C compiler technology
  • Synthesis tools for hardware description languages (e.g., Verilog and VHDL)
Example: Dataflow-based Design for DSP
[Figure: example from the Agilent ADS tool]
Example: QAM Transmitter in National Instruments LabVIEW
[Figure: rate control → QAM encoder → transmit filters → passband signal; source: [Evans 2005]]
Crossing the Implementation Gap: Design Flow Using DIF
[Figure: DSP designs (signal processing, image/video, communication systems) are captured as dataflow models — static (SDF, HSDF, CSDF, MDSDF), dynamic (CFDF, BDF, PDF, BLDF), and meta-modeling — in The DIF Language (TDL). A DIF specification is loaded into The DIF Package (TDP), which holds the DIF representation and front-end algorithms. From there, exporters/importers (Ptolemy Ex/Im, DIF-AT Ex/Im, others) connect to dataflow-based DSP design tools (Ptolemy II, the Autocoding Toolset, other tools), while DIF-to-C and AIF/porting paths target DSP libraries (VSIPL, TI, others) and embedded processing platforms (C, Java, Ada, Java VM, DSPs, others).]
Dataflow with Software Defined Radio: DIF + GNU Radio
[Figure: design flow linking GNU Radio Companion (GRC) — XML flowgraph (.grc), Python flowgraph (.py) — with The DIF Package (TDP) and the GNU Radio engine (Python/C++) via DIF Lite]
1) Convert or generate a .dif file from the flowgraph (complete)
2) Execute static uniprocessor schedules from DIF (.dif, .sched) in the GNU Radio engine (complete)
3a) Perform online scheduling (proposed)
3b) Provide an architecture specification (.arch?) describing processors, memories, and interconnect (proposed)
4) Architecture-aware MP scheduling — assignment, ordering, invocation — targeting multiprocessors, GPUs, Cell, and FPGAs through a platform-retargetable library (proposed)
Background: Dataflow Graphs
• Vertices (actors) represent computation
• Edges represent FIFO buffers
• Edges may have delays, implemented as initial tokens
• Tokens are produced and consumed on edges
• Different models have different rules for production (SDF = fixed, CSDF = periodic, BDF = dynamic)
[Figure: actors X, Y, Z connected by edges e1 and e2, annotated with production rates p1, p2, consumption rates c1, c2, and an edge delay]
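The firing rules above can be sketched concretely. Below is a minimal Python sketch (the class and function names are ours for illustration, not from the DIF package): an edge is a FIFO that may start with delay tokens, and an SDF actor fires only when every input edge holds at least its fixed consumption rate.

```python
from collections import deque

class Edge:
    """FIFO buffer between two actors; 'delay' = number of initial tokens."""
    def __init__(self, delay=0):
        self.fifo = deque([0] * delay)  # initial token values are arbitrary here

    def produce(self, tokens):
        self.fifo.extend(tokens)

    def consume(self, n):
        return [self.fifo.popleft() for _ in range(n)]

    def tokens(self):
        return len(self.fifo)

def fire_sdf_actor(in_edges, in_rates, out_edges, out_rates, kernel):
    """Attempt one SDF firing: consume fixed token counts, compute, produce
    fixed counts. Returns False if the actor is not yet fireable."""
    if any(e.tokens() < r for e, r in zip(in_edges, in_rates)):
        return False
    inputs = [e.consume(r) for e, r in zip(in_edges, in_rates)]
    outputs = kernel(inputs)
    for e, r, out in zip(out_edges, out_rates, outputs):
        assert len(out) == r  # SDF: production rate is fixed per firing
        e.produce(out)
    return True

# Example: a source actor produces 2 tokens per firing onto edge e1.
e1 = Edge(delay=0)
fire_sdf_actor([], [], [e1], [2], lambda ins: [[1, 2]])
```

The same skeleton extends to CSDF (cycle through a periodic rate list) or BDF (pick rates from a control token) by varying how `in_rates`/`out_rates` are chosen per firing.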
Evolution of Dataflow Models of Computation for DSP: Examples
• Computation graphs and marked graphs [Karp 1966, Reiter 1968]
• Synchronous dataflow [Lee 1987]: static multirate behavior — SPW (Cadence), National Instruments LabVIEW, and others
• Well-behaved stream flow graphs [1992]: schemas for bounded dynamics
• Multidimensional synchronous dataflow [Lee 1992]: image and video processing
• Scalable synchronous dataflow [Ritz 1993]: block processing — COSSAP (Synopsys)
• Boolean/integer dataflow [Buck 1994]: Turing-complete models
• Bounded dynamic dataflow [Pankert 1994]: bounded dynamic data transfer
• Cyclo-static dataflow [Bilsen 1996]: phased behavior — Eonic Virtuoso Synchro, Synopsys El Greco and Cocentric, Angeles System Canvas
• The processing graph method [Stevens 1997]: reconfigurable dynamic dataflow — U.S. Naval Research Lab, MCCI Autocoding Toolset
• Stream-based functions [Kienhuis 2001]
• Parameterized dataflow [Bhattacharya 2001]: reconfigurable static dataflow; meta-modeling for more general dataflow graph reconfiguration
• CAL [Eker 2003]: actor-based dataflow language
• Reactive process networks [Geilen 2004]
• Blocked dataflow [Ko 2005]: image and video through parameterized processing
• Windowed synchronous dataflow [Keinert 2006]
• Enable-invoke dataflow [Plishker 2008]
• Parameterized stream-based functions [Nikolov 2008]
• Variable rate dataflow [Wiggers 2008]
Modeling Design Space
[Figure: dataflow models arranged along an axis of increasing expressive power and decreasing verification/synthesis power — SDF; CSDF and SSDF; MDSDF and WBDF; PSDF; PCSDF; C, BDF, and DDF]
Dataflow Interchange Format
• Describes dataflow graphs in text
• A simple DIF file:

dif graph1_1 {
  topology {
    nodes = n1, n2, n3, n4;
    edges = e1 (n1, n2), e2 (n2, n1),
            e3 (n1, n3), e4 (n1, n3),
            e5 (n4, n3), e6 (n4, n4);
  }
}
More Features of DIF
• Ports

interface {
  inputs = p1, p2:n2;
  outputs = p3:n3, p4:n4;
}

• Hierarchy

refinement {
  graph2 = n3;
  p1 : e3;
  p2 : e4;
  p3 : e5;
  p4 : p3;
}
More Features of DIF
• Production and consumption rates (e.g., 4096 tokens produced and consumed on e1; 1024 produced and 64 consumed on e10)

production {
  e1 = 4096; e10 = 1024; ...
}
consumption {
  e1 = 4096; e10 = 64; ...
}

• The computation keyword
• User-defined attributes
The DIF Language Syntax

dataflowModel graphID {
  basedon { graphID; }
  topology {
    nodes = nodeID, ...;
    edges = edgeID (srcNodeID, snkNodeID), ...;
  }
  interface {
    inputs = portID [:nodeID], ...;
    outputs = portID [:nodeID], ...;
  }
  parameter {
    paramID [:dataType];
    paramID [:dataType] = value;
    paramID [:dataType] : range;
  }
  refinement {
    subgraphID = supernodeID;
    subPortID : edgeID;
    subParamID = paramID;
  }
  builtInAttr {
    [elementID] = value;
    [elementID] = id;
    [elementID] = id1, id2, ...;
  }
  attribute usrDefAttr {
    [elementID] = value;
    [elementID] = id;
    [elementID] = id1, id2, ...;
  }
  actor nodeID {
    computation = stringValue;
    attrID [:attrType] [:dataType] = value;
    attrID [:attrType] [:dataType] = id;
    attrID [:attrType] [:dataType] = id1, ...;
  }
}
Uniprocessor Scheduling for Synchronous Dataflow
• An SDF graph G = (V, E) has a valid schedule if it is deadlock-free and sample-rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).
• Balance equations: ∀e ∈ E, prd(e) × q[src(e)] = cns(e) × q[snk(e)].
• The repetitions vector q is the minimum positive integer solution of the balance equations.
• A valid schedule is then a sequence of actor firings in which each actor v fires q[v] times (its repetition count) and the firing sequence obeys the precedence constraints imposed by the SDF graph.
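The balance equations can be solved mechanically. The following Python sketch (the function name is ours, not the DIF package's API) propagates rational firing ratios along the edges and then scales to the least positive integer solution:

```python
from fractions import Fraction
from functools import reduce
from math import lcm

def repetitions_vector(actors, edges):
    """Solve prd(e) * q[src(e)] == cns(e) * q[snk(e)] for a connected graph.
    edges: list of (src, prd, snk, cns) tuples."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prd, snk, cns in edges:
            if src in q and snk not in q:
                q[snk] = q[src] * prd / cns
                changed = True
            elif snk in q and src not in q:
                q[src] = q[snk] * cns / prd
                changed = True
            elif src in q and snk in q and q[src] * prd != q[snk] * cns:
                raise ValueError("graph is not sample-rate consistent")
    # Scale the rational solution to the smallest all-integer one.
    scale = reduce(lcm, (f.denominator for f in q.values()), 1)
    return {a: int(f * scale) for a, f in q.items()}
```

For example, an edge on which A produces 2 tokens and B consumes 3 yields q = {A: 3, B: 2}, the smallest firing counts that leave the buffer unchanged over one period.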
Example: Sample Rate Conversion
CD to DAT: 44.1 kHz to 48 kHz sampling-rate conversion through actors A–F.
• Flat strategy:
  • Topologically sort the graph and iterate each actor v q[v] times
  • Low context switching, but large buffer requirement and latency
• CD-to-DAT flat schedule: (147A)(147B)(98C)(56D)(40E)(160F)
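To see where the large buffer requirement comes from, this Python sketch replays the flat schedule and tracks each edge buffer's high-water mark. Note that the per-edge (production, consumption) rates below are inferred from the repetition vector (147, 147, 98, 56, 40, 160), not stated on the slide:

```python
# Chain A -> B -> C -> D -> E -> F; rates[i] = (prd, cns) of edge i.
rates = [(1, 1), (2, 3), (4, 7), (5, 7), (4, 1)]
reps = [147, 147, 98, 56, 40, 160]

tokens = [0] * 5  # current occupancy of each edge buffer
peak = [0] * 5    # high-water mark per edge

# Flat schedule: fire each actor all of its repetitions, in topological order.
for i, n in enumerate(reps):
    for _ in range(n):
        if i > 0:  # consume from the incoming edge
            tokens[i - 1] -= rates[i - 1][1]
        if i < 5:  # produce onto the outgoing edge
            tokens[i] += rates[i][0]
            peak[i] = max(peak[i], tokens[i])

print(peak, sum(peak))  # per-edge peaks and total buffer requirement
```

Every edge buffer fills completely before its consumer starts, so the total requirement is the sum of the per-edge peaks; nested looped schedules (next slide) interleave firings to shrink these peaks.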
Scheduling Algorithms
• Acyclic pairwise grouping of adjacent nodes (APGAN): an adaptable (to different cost functions), low-complexity heuristic that computes a nested looped schedule of an acyclic graph while preserving precedence constraints (a topological sort) throughout the scheduling process.
• Dynamic programming post optimization (DPPO): dynamic programming over a given actor ordering (any topological sort); variants include GDPPO, CDPPO, and SDPPO.
• Recursive procedure call (RPC) based MAS: generates MASs for a given R-schedule through recursive graph decomposition; the resulting schedule is bounded polynomially in the graph size.
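The nested looped schedules that APGAN and DPPO manipulate can be represented as simple nested terms. This Python sketch (our own throwaway representation, not the DIF package's) flattens a looped schedule into its firing sequence:

```python
def flatten(schedule):
    """Expand a nested looped schedule into a flat firing sequence.
    A schedule term is either an actor name (str) or a loop (count, [terms])."""
    seq = []
    for term in schedule:
        if isinstance(term, str):
            seq.append(term)
        else:
            count, body = term
            seq.extend(flatten(body) * count)
    return seq

# The looped schedule (3 A)(2 B) fires A three times, then B twice:
assert flatten([(3, ["A"]), (2, ["B"])]) == ["A", "A", "A", "B", "B"]
```

A schedule like (2 (3 A) B) interleaves A's and B's firings, which is how nested looping trades a smaller buffer peak against more loop (context-switch) overhead relative to the flat schedule.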
Representative Dataflow Analyses and Optimizations
• Bounded memory and deadlock detection: consistency
• Buffer minimization: minimize communication cost
• Multirate loop scheduling: optimize the code/data trade-off
• Parallel scheduling and pipeline configuration
• Heterogeneous task mapping and co-synthesis
• Quasi-static scheduling: minimize run-time overhead
• Probabilistic design: adapt system resources and exploit slack
• Data partitioning: exploit parallel data memories
• Vectorization: improve context switching and pipelining
• Synchronization optimization: self-timed implementation
• Clustering of actors into atomic scheduling units
Multiprocessor Scheduling
• The multiprocessor scheduling problem comprises:
  • Actor assignment (mapping)
  • Actor ordering
  • Actor invocation
• Approaches to each of these tend to be platform-specific
• Tools can be brought under a common formal umbrella
Multiprocessor Scheduling
[Figure: an application model G(V, E, t(v), C(e)) undergoing mapping/scheduling]
Multiprocessor Mapping
[Figure: an application model G(V, E, t(v), C(e)) mapped onto processors P1–P4]
Invocation Example: Self-Timed (ST) Scheduling
• Assignment and ordering are performed at compile time
• Invocation is performed at run time (via synchronization)
[Figure: an application graph of eight actors (execution times A, B, F: 3; C, H: 5; D: 6; E: 4; G: 2) assigned across processors 1–5, with the Gantt chart for the resulting self-timed schedule (TST = 9)]
Multicore Schedules
• Traditional multicore scheduling:
  • Convert the application DAG to homogeneous synchronous dataflow (HSDF)
  • Perform HSDF mapping
  • Problem: exponential graph explosion
• Our solution:
  • Represent the single-processor schedule (SPS) as a generalized schedule tree (GST)
  • Generate an equivalent multiprocessor schedule (MPS) represented as a forest of GSTs
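The GST idea can be sketched in a few lines of Python. In this hypothetical representation (ours, not the DIF package's data structure), internal nodes carry loop counts, leaves carry actor firings, and a multiprocessor schedule is a forest: one GST root per processor.

```python
class GSTNode:
    """Generalized schedule tree node: internal nodes carry a loop count
    over their children; leaves carry a single actor firing."""
    def __init__(self, count=1, children=None, actor=None):
        self.count = count
        self.children = children or []
        self.actor = actor

    def firings(self):
        """Unroll this subtree into its firing sequence."""
        if self.actor is not None:
            return [self.actor]
        body = [f for child in self.children for f in child.firings()]
        return body * self.count

# Single-processor schedule (2 (3 A) B) as a GST:
sps = GSTNode(count=2, children=[
    GSTNode(count=3, children=[GSTNode(actor="A")]),
    GSTNode(actor="B"),
])

# A multiprocessor schedule is then a forest, e.g. {"P1": gst1, "P2": gst2};
# each processor unrolls its own tree independently at run time.
```

Because loop counts stay symbolic in the tree, the schedule's size grows with the graph, not with the repetition counts, which is what avoids the HSDF expansion blow-up.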
Traditional Dataflow Multiprocessor Scheduling (MPS)
[Figure: a synchronous dataflow (SDF) representation of an application — a chain A → B → C with multirate edges — and its homogeneous SDF expansion, in which each actor is replicated once per repetition count and every edge carries unit rates]
GST Representation for MPS: Simple Example
[Figure: (a) an SDF graph; (b) its single-processor schedule (SPS) as a GST; (c) the multiprocessor schedule across processors P1–P3 represented as a forest of GSTs]
Demonstration on GPUs: Start with Parallel Actors
• Parallelism within an actor (FIR filter)
• Limitation (IIR filter)
Future Direction: Tackling the General MP Scheduling Problem with Dataflow Analysis
• Many dataflow analysis techniques become available once the problem is well defined in dataflow terms
• Maximize multicore utilization by replicating and fusing actors/blocks, considering:
  • Stateless vs. stateful actors
  • Computation-to-communication ratios
  • Firing rates/execution times relative to the number of blocks
• Once the application is mapped to blocks/processors, apply single-processor scheduling to minimize buffering
Focus First on MP Scheduling for GPUs
• Blocks
• Threads
• Memory
Refine to a Simpler Question: When to Off-load onto a GPU?
• Given:
  • An application graph
  • Actor timing characteristics for communication and computation
  • A target architecture with heterogeneous multiprocessing
• Find the optimal implementation for:
  • Latency
  • Throughput
[Figure: a nine-actor application graph partitioned between a CPU and a GPU]
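Given per-actor timing characteristics, a first-order off-load test can be written directly. This is a hypothetical cost model of our own for illustration (real decisions must also account for pipelining and overlap of transfers with compute):

```python
def offload_to_gpu(cpu_time, gpu_time, transfer_in, transfer_out):
    """Hypothetical first-order test: off-load an actor only if GPU compute
    time plus host<->device transfer time beats the CPU compute time."""
    return gpu_time + transfer_in + transfer_out < cpu_time

# A heavy data-parallel actor (e.g., an FIR filter over long blocks):
# the GPU speed-up amortizes the transfer cost.
fir_offload = offload_to_gpu(cpu_time=10.0, gpu_time=2.0,
                             transfer_in=1.0, transfer_out=1.0)

# A small stateful actor (e.g., a tight IIR recursion): the feedback
# dependence limits GPU speed-up and the transfers dominate.
iir_offload = offload_to_gpu(cpu_time=2.0, gpu_time=1.5,
                             transfer_in=1.0, transfer_out=1.0)
```

Optimizing latency versus throughput changes the objective, not the structure: for throughput, per-actor times would be replaced by pipeline-stage initiation intervals.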
Summary
• Model Based Design
• Dataflow Interchange Format
• Multiprocessor Scheduling
• Preliminary Setup and Results with GPUs
• Future Directions