SVC & PROTEUS CRASH COURSE
Ron Diamant (rond@tx.technion.ac.il)
* based on material from Beerel & Saifhashemi (USC)
Agenda • Introduction • SVC syntax • Simulation • Synthesis and P&R (Proteus) • Advanced example (TEA)
Motivation
[Figure: 60 IP blocks, 350 RAMs: communication bottleneck]
• Async design has several proven advantages
  • speed, power, modularity, etc.
  • Usually, not all can be achieved together…
• However, async design has never been widely adopted
  • lack of CAD tools
  • hard to translate sync to async
  • etc.
• Need for a mature, easy-to-use ASIC CAD flow:
  • Async HDL
  • Simulator + waveform viewer
  • Synthesizer + basic cell library
  • P&R
Commercialization Efforts
• Fulcrum Microsystems (www.fulcrummicro.com)
  • Fabless semiconductor company
  • High-performance computing and networking markets
  • Founded out of Caltech in 2000
  • Uses high-performance async design as secret sauce
• Achronix (www.achronix.com)
  • High-performance async FPGA core with synchronous interfaces
  • Founded out of Cornell research in 2006
• TimeLess Design Automation
  • ASIC flow for asynchronous design
  • First target: high-performance, GHz+ silicon in 65nm
  • Founded out of USC in 2008
  • Sold to Fulcrum Microsystems in 2010
• Tiempo (www.tiempo-ic.com)
  • IP cores and ASIC flow (power/performance tradeoff)
• Numerous failed start-ups
  • Handshake Solutions, Silistix, Inc, Elastix Corporation
Proteus
• Proteus
  • An asynchronous ASIC CAD flow for high-performance circuits
  • 2-3x faster than sync counterparts (test chips work @ 1.1GHz in semi-custom 65nm)
• Targets:
  • Leverage mature (and known) RTL simulators, synthesis tools & place-and-route tools
    • Cell library must be implemented from scratch (why?)
  • Allow changing the underlying async protocol w/o massive changes to the design
    • Eliminates the need for long research into the “best protocol for us”
Agenda • Introduction • SVC syntax • Simulation • Synthesis and P&R (Proteus) • Advanced example (TEA)
SystemVerilog
• SystemVerilog overview
  • Started in 2002 by Accellera
  • IEEE Standard 1800-2005
  • Merged with IEEE 1364: IEEE 1800-2009
  • Superset of Verilog
• Some new features
  • New data types
  • Structures & unions
  • Enumerated data types
  • Interface
Interfaces • Hierarchical Structure on Ports • Encapsulate Communication Signals • Encapsulate Communication Protocol • Abstraction of Communication
How Interfaces Work
Bundle signals:
  interface intf;
    logic req = 0, ack = 0;
    logic [7:0] data;
  endinterface

  module top;
    // instantiate the interface
    intf i1();
    // instantiate modA and modB, both connected to i1
    bit_gen modA (i1);
    bit_bucket modB (i1);
  endmodule
Abstracting Communication
[Figure: processes P1 and P2, a channel between them, and a time axis]
• CSP (Hoare, 78) & CHP (Martin, 86)
• Processes have one or more sequential threads of execution
• Each process has a set of input & output ports
• Two ports connected to each other form a channel
• Processes SEND and RECEIVE on their ports
• SEND/RECEIVE are blocking actions!
• Processes do not share variables
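The blocking behaviour of SEND and RECEIVE can be illustrated outside of any HDL. The following is a minimal Python sketch, not the SVC implementation: the `Channel` class, `producer`, `consumer`, and `run_demo` are hypothetical names introduced only for this example. A sender is held until the receiver has actually taken the value, which is the rendezvous behaviour CSP channels describe.

```python
import queue
import threading


class Channel:
    """Point-to-point channel with blocking (rendezvous) send and receive."""

    def __init__(self):
        # One slot only: the channel itself stores no history of values.
        self._slot = queue.Queue(maxsize=1)

    def send(self, value):
        self._slot.put(value)   # blocks while an earlier value is still pending
        self._slot.join()       # blocks until the receiver has taken this value

    def receive(self):
        value = self._slot.get()  # blocks until a sender offers a value
        self._slot.task_done()    # releases the blocked sender
        return value


def producer(out, values):
    for v in values:
        out.send(v)


def consumer(inp, count, sink):
    for _ in range(count):
        sink.append(inp.receive())


def run_demo():
    """Run one producer and one consumer; they share only the channel."""
    ch = Channel()
    received = []
    p = threading.Thread(target=producer, args=(ch, [1, 2, 3]))
    c = threading.Thread(target=consumer, args=(ch, 3, received))
    p.start()
    c.start()
    p.join()
    c.join()
    return received
```

The two threads never touch each other's variables directly, mirroring the "processes do not share variables" rule; all communication goes through `send` and `receive`.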
Abstract SystemVerilog Channels
[Figure: abstract communication]
• The basic idea
  • Use a SystemVerilog interface to abstract the channel wires as well as the Send/Receive tasks
Behind The Scenes: Channel Interface
• Channel details encapsulated within an “Interface”
• Implementation details (below) hidden from user
• Greatly simplifies debugging and evaluation of the design

  typedef enum {idle, r_pend, s_pend} ChannelStatus;
  typedef enum {P2PhaseBD, P4PhaseBD} ChannelProtocol;

  interface Channel;
    parameter WIDTH = 8;
    parameter ChannelProtocol hsProtocol = P2PhaseBD;
    ChannelStatus status = idle;  // Status of a channel
    logic req = 0, ack = 0;       // Handshaking signals
    logic hsPhase = 1;            // Used in two-phase handshaking
    logic [WIDTH-1:0] data;       // Data being communicated
  endinterface : Channel
Interface Send and Receive Tasks
• Arbitrary handshaking protocol
• Supports the most commonly used protocols
• Send/Receive tasks are analogous to CSP’s ! (output) and ? (input)
• Semantics are based on synchronization of concurrent processes using SystemVerilog’s notion of update and evaluation events
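The two protocol names in the channel interface, P2PhaseBD and P4PhaseBD, refer to two-phase and four-phase bundled-data handshaking. As a rough illustration of how they differ, the Python sketch below lists the (req, ack, data) states a channel passes through for each transfer. It is a textbook-style model, not the body of the SVC Send/Receive tasks, and the function names are made up for this example.

```python
def four_phase(values):
    """Return the (req, ack, data) trace of four-phase bundled-data transfers."""
    trace = []
    for v in values:
        trace.append((1, 0, v))  # sender: data valid, raise req
        trace.append((1, 1, v))  # receiver: take data, raise ack
        trace.append((0, 1, v))  # sender: lower req (return to zero)
        trace.append((0, 0, v))  # receiver: lower ack, channel idle again
    return trace


def two_phase(values):
    """Return the (req, ack, data) trace of two-phase bundled-data transfers.

    Every transition on req or ack is an event, so there is no
    return-to-zero phase and each transfer needs only two transitions.
    """
    trace = []
    req = ack = 0
    for v in values:
        req ^= 1
        trace.append((req, ack, v))  # sender toggles req with data valid
        ack ^= 1
        trace.append((req, ack, v))  # receiver toggles ack
    return trace
```

For two values, `two_phase` produces four states while `four_phase` produces eight, which is the usual trade-off: fewer transitions per transfer against simpler level-sensitive control.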
Viewing Channel Status
[Figure: status of channels as a waveform]
• Enumerated types make viewing channel status inherent to all standard SystemVerilog simulators
• The designer can monitor if and what processes are engaged in the communication over time
• Download ModelSim student edition: http://model.com/content/modelsim-pe-student-edition-hdl-simulation
Supports Mixed-Levels of Abstraction
[Figure: Block1, Block2, Block3]
High-level description of the buffer (before synthesis):
  module mp_fb_csp (interface L, interface R);
    logic data;
    always begin
      L.Receive(data);
      R.Send(data);
    end
  endmodule
• Completed blocks can be simulated with others still at behavioral level
Supports Mixed-Levels of Abstraction
[Figure: Block1, Block2, Block3]
Gate-level description of the buffer (after synthesis):
  module mp_fb_gate (interface L, interface R);
    celement ce (L.req, pd_bar, c);
    not inv (pd_bar, pd);
    cap_pass cp (c, L.ack, R.ack, pd, L.data, R.data);
  endmodule
• Completed blocks can be simulated with others still at behavioral level
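The gate-level buffer instantiates a `celement`. For readers who have not met one, the sketch below models the standard behaviour of a two-input Muller C-element in Python: the output follows the inputs when they agree and holds its previous value when they differ. This is a generic behavioural model, not the library cell used on the slide; the class and method names are hypothetical, and the slide does not show which `celement` ports are inputs and which is the output.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # Inputs agree: the output takes their common value.
        # Inputs differ: the output keeps its previous value.
        if a == b:
            self.out = a
        return self.out
```

A short usage example: starting from 0, `update(1, 0)` leaves the output at 0, `update(1, 1)` raises it to 1, `update(0, 1)` holds it at 1, and `update(0, 0)` returns it to 0.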
Agenda • Introduction • SVC syntax • Simulation • Synthesis and P&R (Proteus) • Advanced example (TEA)
Conditional Accumulator
[Figure: CondAccum block with ports C1, C2, C3, X1, X2 and S]
• CHP description:
  { s=0 ;
    *[ x1=0 , x2=0 ;
       C1?c1 , C2?c2 , C3?c3 ;
       [c1 → X1?x1 [] else → skip] , [c2 → X2?x2 [] else → skip] ;
       s = s + x1 + x2 ;
       [c3 → S!s [] else → skip]
    ] }
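To check one's reading of the CHP description, it can help to run the same loop as ordinary sequential code. The Python sketch below is a functional model only: channels are replaced by lists, concurrency and handshaking are ignored, and bit widths are not modelled. The function name `cond_accum` and its argument names are invented for this example.

```python
def cond_accum(c1s, c2s, c3s, x1s, x2s):
    """Functional model of the conditional accumulator.

    c1s, c2s, c3s: control tokens, one per loop iteration.
    x1s, x2s: data tokens, consumed only when the matching control is true.
    Returns the list of values sent on S.
    """
    x1_in, x2_in = iter(x1s), iter(x2s)
    s = 0
    outputs = []
    for c1, c2, c3 in zip(c1s, c2s, c3s):
        x1 = next(x1_in) if c1 else 0  # [c1 -> X1?x1 [] else -> skip]
        x2 = next(x2_in) if c2 else 0  # [c2 -> X2?x2 [] else -> skip]
        s = s + x1 + x2
        if c3:                         # [c3 -> S!s [] else -> skip]
            outputs.append(s)
    return outputs
```

For instance, with controls `c1s=[1, 0, 1]`, `c2s=[1, 1, 0]`, `c3s=[0, 1, 1]` and data `x1s=[10, 20]`, `x2s=[1, 2]`, the running sum is 11, 13, 33 and only the last two are sent, so the result is `[13, 33]`.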
Conditional Accumulator • SVC implementation of CondAccumulator
Conditional Accumulator
• Source files can be found at: luna70:~rond/svc_examples/CondAccumulator/sim
  • CondAccumulator_csp.sv (impl)
  • CondAccumulator_csp_tb.sv (tb)
• Notice the general structure:
Conditional Accumulator • Simulation using ModelSim
Agenda • Introduction • SVC syntax • Simulation • Synthesis and P&R (Proteus) • Advanced example (TEA)
The Proteus Flow
[Figure: the Proteus flow, from circuit description and design goals through Synthesis, ClockFree and Physical Design to the final layout, with a netlist and constraints passed between stages]
• TimeLess Library
  • Domino and QDI control gates
  • TSMC 65nm
• Synthesis
  • Off-the-shelf commercial tools (RC)
  • Driven by Proteus scripts
• ClockFree
  • Optimization & translation
• Physical Design
  • Off-the-shelf commercial tools (Encounter)
  • Driven by Proteus scripts
Preparations
• Connect to luna70.technion.ac.il
  • VLSI floor 7, ask Goel/Amir to open an account for you
  • ssh luna70.technion.ac.il
  • source /tools/proteus/proteus/scripts/proteus.csh to set the path (add this to your .cshrc file)
• snapshot [before start]:
SVC2RTL
• MANUALLY update CondAccumulator_csp.sv:
  • Comment out all forks & joins
  • Add the following line to the beginning of the file: `include "/tools/proteus/proteus/pdk/proteus/svc2rtl.sv"
  • Replace the interface keyword with e1ofx_y.dir
    • x: the value N, for a 1ofN protocol
    • y: bus width
    • dir: In/Out
  • For every type of interface (protocol & bus width), add the following macro call: `E1OFN_M(x, y)
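A 1ofN code represents each digit with N rails, exactly one of which is asserted when the digit is valid. The Python sketch below shows that encoding for a multi-digit value. The slides do not spell out how e1ofx_y arranges its rails, so the layout here (y digits, each one-hot over N rails, least-significant digit first) is an assumption made purely for illustration, and `encode_1ofn` / `decode_1ofn` are hypothetical helpers, not part of Proteus.

```python
def encode_1ofn(value, n, digits):
    """Encode a non-negative integer as `digits` one-hot groups of `n` rails.

    Group 0 holds the least-significant base-n digit (assumed layout).
    """
    rails = []
    for _ in range(digits):
        digit = value % n
        rails.append([1 if i == digit else 0 for i in range(n)])
        value //= n
    if value:
        raise ValueError("value does not fit in the given number of digits")
    return rails


def decode_1ofn(rails):
    """Decode the groups back to an integer, rejecting invalid code words."""
    value = 0
    for position, group in enumerate(rails):
        if sum(group) != 1:
            raise ValueError("exactly one rail per group must be asserted")
        value += group.index(1) * len(group) ** position
    return value
```

With `n=2` this is dual-rail encoding: `encode_1ofn(2, 2, 2)` gives `[[1, 0], [0, 1]]`, meaning bit 0 is 0 and bit 1 is 1.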
SVC2RTL • Updated CondAccumulator.sv:
SVC2RTL
• svc2rtl:
  • svc2rtl translates SVC syntax to synthesizable Verilog syntax
  • The design is separated into SEND/RECEIVE blocks, which are treated as hard macros, and an RTL body (which doesn’t include handshaking)
• Running svc2rtl:
  • svc2rtl CondAccumulator_csp.sv CondAccumulator.rtl.nf.sv
• snapshot [after svc2rtl]:
SVC2RTL • Output looks terrible…
SVC2RTL • Formatting svc2rtl’s output: • Using an open-source script for fixing indentation • Running the indentation-format script: • format.pl ./*.rtl.nf.sv • snapshot [after format.pl]:
SVC2RTL • Result looks better…
Synthesis
• Synthesis
  • An off-the-shelf synthesizer is used to synthesize the RTL body
  • Currently, Cadence’s RC is used
  • This stage might be slow…
• Running rc:
  • Need to have a config file: CondAccumulator.config
  • proteus-a --include=CondAccumulator.config --sv=1 --task=rc --force=1
• snapshot [after synthesis]:
[Figure: CondAccumulator.config; timing constraints can also be added here]
ClockFree
[Figure: block diagram with Synthesis, sync image library, image gate-level netlist, ClockFree, async real library, and real gate-level netlist]
ClockFree
[Figure: the same block diagram, with clustering, slack matching and template generation shown as ClockFree steps between the image gate-level netlist and the real gate-level netlist]
ClockFree
• Remove clk
• Replace sync library cells with async library cells
• Peephole optimizations
  • Remove back-to-back unconditional send & receive
  • Simplify unconditional send/receive blocks
• Clustering
  • Be careful: avoid deadlocks
• Fan-out fixing
  • Must not add delays to the critical path!
• Slack matching
ClockFree
• Clustering
  • Map async image gates to pipe stages (“a cluster”)
  • Each cluster gets its own async controller
• Goal
  • Create clusters as big as possible while meeting all constraints
  • Minimize handshaking overhead (area & fan-in)
ClockFree • Clustering • Must avoid deadlocks!
ClockFree
• Slack matching
  • Add pipeline buffers to remove performance bottlenecks
• Goal
  • Performance optimization (achieve target performance)
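A common source of the bottlenecks that slack matching removes is a fork whose branches reconverge after different numbers of pipeline stages: the short branch has too little buffering and stalls the fork. The Python sketch below is a deliberately simplified stand-in for the real optimization. It only equalizes stage depth along reconvergent paths in an acyclic pipeline graph, which is an assumption chosen to keep the example short and is not claimed to be what ClockFree does. The function name and graph format are hypothetical.

```python
def buffers_to_balance(edges):
    """Suggest buffer counts per edge so that all paths into a stage have equal depth.

    edges: list of (src, dst) connections between pipeline stages (a DAG).
    Returns {edge: number_of_buffers} for edges that need at least one buffer.
    """
    nodes = {n for edge in edges for n in edge}
    preds = {n: [] for n in nodes}
    for src, dst in edges:
        preds[dst].append(src)

    depth = {}

    def level(node):
        # Depth = length of the longest path from any source stage.
        if node not in depth:
            depth[node] = 1 + max((level(p) for p in preds[node]), default=-1)
        return depth[node]

    for node in nodes:
        level(node)

    return {
        (src, dst): depth[dst] - depth[src] - 1
        for src, dst in edges
        if depth[dst] - depth[src] > 1
    }
```

For a fork `F` that reaches join `J` both directly and through stages `A` and `B`, the direct edge is two stages shorter, so the sketch suggests two buffers on `("F", "J")`.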
ClockFree
• Running clock-free:
  • proteus-a --include=CondAccumulator.config --sv=1 --task=clockfree --force=1
• snapshot [after ClockFree]:
Place-and-Route
• Place-and-Route
  • An off-the-shelf place-and-route tool is used
  • Currently, Cadence’s Encounter is used
  • This stage might be slow… (and you probably don’t need it)
• Running encounter:
  • Need a config file (see example on next slide): CondAccumulator.config
  • proteus-a --include=CondAccumulator.config --sv=1 --task=encounter --force=1
Co-simulation
[Figure: high-level description, copy, merge & compare]
• Design validation
  • Implemented circuit vs. original circuit
  • Important to use the same testbench
  • Verifies correctness of the implementation
  • Usually done w/ formal methods in sync logic
Co-simulation
• Generating cosim wrapper:
  • cosim_wrapper.pl *.qdi/*.qdi.noclk.flat.v ./CondAccumulator.qdi.noclk.flat.cosim.sv
• snapshot [end of flow]:
Agenda • Introduction • SVC syntax • Simulation • Synthesis and P&R (Proteus) • Advanced example (TEA)
TEA (Tiny Encryption Algorithm)
TEA (Tiny Encryption Algorithm)
• Specification (from Wikipedia)
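For reference while reading the following slides, here is a Python transcription of the widely published TEA routines: 32 cycles over a 64-bit block split into two 32-bit halves, with a 128-bit key given as four 32-bit words. It is a software sketch for checking results, not the SVC implementation from the deck; Python integers are unbounded, so every update is masked to 32 bits explicitly.

```python
MASK = 0xFFFFFFFF    # keep results within 32 bits
DELTA = 0x9E3779B9   # TEA key-schedule constant


def tea_encrypt(v, k):
    """Encrypt block v = (v0, v1) with key k = (k0, k1, k2, k3)."""
    v0, v1 = v
    total = 0
    for _ in range(32):
        total = (total + DELTA) & MASK
        v0 = (v0 + (((v1 << 4) + k[0]) ^ (v1 + total) ^ ((v1 >> 5) + k[1]))) & MASK
        v1 = (v1 + (((v0 << 4) + k[2]) ^ (v0 + total) ^ ((v0 >> 5) + k[3]))) & MASK
    return v0, v1


def tea_decrypt(v, k):
    """Invert tea_encrypt by running the same cycles in reverse order."""
    v0, v1 = v
    total = (DELTA * 32) & MASK
    for _ in range(32):
        v1 = (v1 - (((v0 << 4) + k[2]) ^ (v0 + total) ^ ((v0 >> 5) + k[3]))) & MASK
        v0 = (v0 - (((v1 << 4) + k[0]) ^ (v1 + total) ^ ((v1 >> 5) + k[1]))) & MASK
        total = (total - DELTA) & MASK
    return v0, v1
```

Every cycle reuses the same handful of shifts, XORs and 32-bit additions, which is why a fully unrolled hardware version multiplies the adder count by the number of cycles.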
TEA (Tiny Encryption Algorithm)
• SVC naive implementation [1]
TEA (Tiny Encryption Algorithm)
• SVC naive implementation [2]
TEA (Tiny Encryption Algorithm)
• What will be implemented?
  • Design is huge! (32×7 adders)
  • Critical path is impossible! (32×2 adders)
  • rc will fail to finish execution...
• What can we do?
  • Iterative implementation
  • Pipelined implementation