ESL: System Level Design
Bluespec ESEPro: ESL Synthesis Extensions for SystemC
Rishiyur S. Nikhil, CTO, Bluespec, Inc. (www.bluespec.com)
6.375 Lecture 16, delivered by Arvind, March 16, 2007
(Only a subset of Nikhil’s slides are included)

The central ESL design problem
[Diagram: Early software and the first HW model(s) meet at the HW/SW interface (e.g., register read/write). Each is refined, exploring architectures for speed, area, and power, toward the final Software and HW Implementation.]
• First HW model(s): available early; very fast simulation; not HW-accurate (timing, area)
• HW implementation: not available early; slower simulation; HW-accurate
• Required: a uniform computational model (single paradigm), plus a higher level than RTL, even for implementation

Another ESL design problem
• Reuse (models and implementations)
[Diagram: SoC 1, SoC 2, …, SoC n]
• Required: powerful parameterization and powerful module interface semantics

Bluespec enables ESL
• Rules and Rule-based Interfaces provide a uniform computational model suitable both for high-level system modeling and for HW implementation
• Atomic Transaction semantics are very powerful for expressing complex concurrency
• Formal and informal reasoning about correctness
• Automatic synthesis of complex control logic to manage concurrency
• Map naturally to HW (“state machines”); synthesizable; no mental shifting of gears during refinement
• Can be mixed with regular SystemC, TLM, and C++, for mixed-model and whole-system modeling
• Enables Design-by-Refinement; Design-by-Contract
BSV: Bluespec SystemVerilog
ESEPro: Bluespec’s ESL Synthesis Extensions to SystemC

Rule Concurrent Semantics
• “Untimed” semantics:
    Forever: execute any enabled rule
• “Timed”, or “Clock Scheduled”, semantics (Bluespec scheduling technology):
    In each clock: execute a subset of enabled rules (in parallel, but consistent with untimed semantics)

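The two semantics above can be sketched in plain C++ (this is not ESEPro or BSV code). The names `Rule`, `run_untimed`, and `run_clock` are hypothetical, and the conflict test is a deliberately conservative assumption: two rules may share a clock only if they touch no common state element. Bluespec's real scheduler is more refined than this.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration only: a rule is a guard plus an atomic action.
struct Rule {
    std::string name;
    std::function<bool()> guard;    // is the rule enabled?
    std::function<void()> action;   // atomic state update
    std::set<std::string> touches;  // every state element read or written
};

// Untimed semantics: repeatedly execute any one enabled rule.
// Returns the number of firings (bounded by max_steps).
inline int run_untimed(const std::vector<Rule>& rules, int max_steps) {
    assert(max_steps >= 0);
    int fired = 0;
    while (fired < max_steps) {
        bool progress = false;
        for (const Rule& r : rules) {
            if (r.guard()) { r.action(); ++fired; progress = true; break; }
        }
        if (!progress) break;  // no rule is enabled
    }
    return fired;
}

// Timed semantics: in one clock, fire a subset of the enabled rules.
// Rules chosen here touch disjoint state, so firing them together gives
// the same result as firing them one at a time in any order.
inline std::vector<std::string> run_clock(const std::vector<Rule>& rules) {
    std::set<std::string> claimed;
    std::vector<const Rule*> chosen;
    for (const Rule& r : rules) {  // earlier rules have priority (assumed)
        if (!r.guard()) continue;
        bool conflict = false;
        for (const std::string& t : r.touches)
            if (claimed.count(t)) conflict = true;
        if (conflict) continue;
        claimed.insert(r.touches.begin(), r.touches.end());
        chosen.push_back(&r);
    }
    std::vector<std::string> fired;
    for (const Rule* r : chosen) { r->action(); fired.push_back(r->name); }
    return fired;
}
```

All guards are evaluated before any action runs, which mirrors the idea that the rules chosen for one clock see the state at the start of that clock.
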
Bluespec Tools Architecture
[Diagram. Inputs: ESEPro (SystemC*) and BSV (SystemVerilog*). Tools shown: ESE and ESEPro (systemc.h, esl.h, gcc, producing a simulation .exe); ESEComp and BSC (Parsing, Static Checking, Optimization, Common Synthesis Engine, Scheduling (Untimed & Timed), Power Optimization, RTL Generation, producing RTL for synthesis and for cycle-accurate simulation with a Verilog simulator); Bluesim (rapid, source-level simulation and interactive debug of BSV); Blueview (debug).]

Outline
• Limitations of SystemC in modeling SoCs
• ESEPro’s Rule-based Interfaces
• Model-to-implementation refinement with SystemC and ESEPro modules
• Seamless interoperation of SystemC TLM and ESEPro modules
• ESEPro-to-RTL synthesis
• An example

Example illustrating why modeling hardware-accurate complex concurrency is difficult in standard SystemC (threads and events)
A 2x2 switch, with stats
[Diagram: two input FIFOs, each feeding a DetermineQueue block; two output FIFOs; a +1 counter labelled “Count certain packets”]
Spec:
• Packets arrive on two input FIFOs, and must be switched to two output FIFOs
• Certain “interesting packets” must be counted

The first version of the SystemC code is easy

    void thread1 () {
        while (true) {
            Pkt x = in0->first(); in0->deq();   // blocks while in0 is empty
            if (x.dest == 0) out0->enq (x);     // blocks while out0 is full
            else             out1->enq (x);     // blocks while out1 is full
            if (count(x)) c++;                  // shared counter
        }
    }

    void thread2 () {
        while (true) {
            Pkt x = in1->first(); in1->deq();   // blocks while in1 is empty
            if (x.dest == 0) out0->enq (x);
            else             out1->enq (x);
            if (count(x)) c++;                  // same shared counter
        }
    }

• first() and deq() block if the input fifo is empty
• enq() blocks if the output fifo is full
• It all works fine because of “cooperative parallelism”

Cooperative parallelism model
• The two increments to the counter do not need to be protected with “locks” because SystemC defines parallelism as cooperative, i.e.,
    • Threads only switch at “wait()” statements
    • Threads do not interleave
• This code would have problems with preemptive parallelism
• But real hardware has real parallelism!
    • Gap between model and implementation
• Further, cooperative multithreading also makes it hard to simulate models in parallel (e.g., on a modern multi-core or SMP machine)

There could be some subtle mistakes
Cooperative parallelism vs. atomicity

    void thread1 () {
        while (true) {
            int tmp = c;                        // counter read early
            Pkt x = in0->first(); in0->deq();   // may block: thread switch
            if (x.dest == 0) out0->enq (x);     // may block: thread switch
            else             out1->enq (x);
            if (count(x)) c = tmp + 1;          // writes back a possibly stale value
        }
    }

    void thread2 () {
        while (true) {
            int tmp = c;
            Pkt x = in1->first(); in1->deq();
            if (x.dest == 0) out0->enq (x);
            else             out1->enq (x);
            if (count(x)) c = tmp + 1;
        }
    }

• If the threads interleave because first(), deq(), or enq() blocks, c will be updated incorrectly (non-atomically)

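The lost update described above can be replayed deterministically in plain C++, without SystemC. This is a minimal sketch: `CounterThread`, `interleaved_count`, and `atomic_count` are hypothetical names, and each thread is reduced to the two steps that sit on either side of a blocking call.

```cpp
#include <cassert>

// Hypothetical illustration: each "thread" is split into the two steps that
// surround a blocking call, so one interleaving can be replayed by hand.
struct CounterThread {
    int tmp = 0;
    void read_step(const int& c) { tmp = c; }   // int tmp = c;
    void write_step(int& c) { c = tmp + 1; }    // c = tmp + 1;
};

// Both threads read c, then block; each later writes back a stale value.
inline int interleaved_count() {
    int c = 0;
    CounterThread t1, t2;
    t1.read_step(c);   // t1 reads 0, then blocks in first()/deq()/enq()
    t2.read_step(c);   // t2 runs meanwhile and also reads 0
    t1.write_step(c);  // c becomes 1
    t2.write_step(c);  // c becomes 1 again: one increment is lost
    return c;
}

// The same work with read and write never separated.
inline int atomic_count() {
    int c = 0;
    CounterThread t1, t2;
    t1.read_step(c); t1.write_step(c);
    t2.read_step(c); t2.write_step(c);
    assert(c == 2);
    return c;
}
```

Two packets are counted in both functions, but the interleaved version ends with a count of 1.
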
Hardware has additional “resource contention” constraints
• Each output fifo can be enq’d by only one process at a time (in the same clock)
• Need arbitration if both processes want to enq() on the same fifo simultaneously
• SystemC’s cooperative multitasking makes it easy to ignore this, but much harder to model this accurately
• Accurately modeling this makes the code messier

Hardware has additional “resource contention” constraints
• The counter can be incremented by only one process at a time
• Need arbitration if both want to increment
• SystemC’s cooperative multitasking makes it easy to ignore this, but much harder to model this accurately
• Accurately modeling this makes the code messier

Hardware has additional “resource contention” constraints
• No intermediate buffering: a process should transfer a packet only when both its input fifo and its output fifo are ready, and it has priority on its output fifo and the counter
• SystemC’s blocking methods make it easy to ignore this, but much harder to model this accurately
• Accurately modeling this makes the code messier

Hardware typically has additional “resource contention” constraints
• These constraints must be modeled in order to model HW performance accurately (latencies, bandwidth)
• In SystemC, this exposes full/empty tests on fifos, adds locks/semaphores, polling of locks/semaphores, …
• The code becomes a mess
• If we want synthesizability, it more and more resembles writing RTL in SystemC notation

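To make the “mess” concrete, here is a minimal plain-C++ sketch of one clock of the 2x2 switch with every constraint written out by hand. It is not SystemC; `Fifo`, `Switch`, and `clock_step` are hypothetical names, and the fixed priority of process 0 over process 1 is an assumed arbitration policy.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

struct Pkt { int dest; bool interesting; };

// Hypothetical bounded FIFO with explicit full/empty tests.
struct Fifo {
    std::size_t capacity = 2;
    std::deque<Pkt> q;
    bool empty() const { return q.empty(); }
    bool full() const { return q.size() >= capacity; }
};

struct Switch {
    Fifo in[2], out[2];
    int count = 0;
};

// One clock of hand-written arbitration. Process 0 has priority on the
// output FIFOs and on the counter. Returns the number of packets moved.
inline int clock_step(Switch& s) {
    bool out_claimed[2] = {false, false};
    bool counter_claimed = false;
    int moved = 0;
    for (int p = 0; p < 2; ++p) {
        if (s.in[p].empty()) continue;                     // input not ready
        const Pkt x = s.in[p].q.front();
        const int d = (x.dest == 0) ? 0 : 1;
        if (s.out[d].full() || out_claimed[d]) continue;   // output busy
        if (x.interesting && counter_claimed) continue;    // counter busy
        // Every resource is available: transfer with no intermediate buffer.
        s.in[p].q.pop_front();
        s.out[d].q.push_back(x);
        assert(s.out[d].q.size() <= s.out[d].capacity);
        out_claimed[d] = true;
        if (x.interesting) { ++s.count; counter_claimed = true; }
        ++moved;
    }
    return moved;
}
```

The actual data movement is three lines; everything else is readiness testing and arbitration bookkeeping, which is the clutter the slides refer to.
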
Limitations of SystemC/C++
• Accurate SoC modeling involves lots of concurrency and dynamic, fine-grain resource sharing
    • Because these are the characteristics of HW
    • Most blocks in an SoC are HW; a few blocks (e.g., processor, DSP) involve software (typically C, C++)
• “Threads and Events” (SystemC’s concurrency model) are far too low-level for this
    • Require tedious, explicit management of concurrent access to shared state
    • Weak semantics for module composition
    • Does not scale to large systems
    • They are the source of the majority of bugs in RTL and SystemC (race conditions, inconsistent state, protocol errors, …)
• Instead, advanced SW systems (e.g., Operating Systems, Database Systems, Transaction Processing Systems) use Atomic Transactions to manage complex concurrency

Other issues with SystemC/C++
• No early feedback on HW implementability during modeling, because of the distance of SystemC semantics from HW
    • Threads, stacks, dynamic allocation, events, locks, global variables, undisciplined instantaneous access to global/remote data
    • Undisciplined access to shared resources
• No credible synthesis from a sequential, thread-based model of computation (except for loop-and-array computational kernels)
    • The design has to be re-implemented in RTL

Literature on problems with threads (and the advantages of atomicity)
• The Problem with Threads, Edward A. Lee, IEEE Computer, 39:5, pp. 33-42, May 2006
• Why threads are a bad idea (for most purposes), John K. Ousterhout, Invited Talk, USENIX Technical Conference, January 1996
• Composable memory transactions, Tim Harris, Simon Marlow, Simon Peyton Jones and Maurice Herlihy, in ACM Conf. on Principles and Practice of Parallel Programming (PPoPP), 2005
• Atomic Transactions, Nancy A. Lynch, Michael Merritt, William E. Weihl and Alan Fekete, Morgan Kaufmann, San Mateo, CA, 1994, 476 pp.
• … and more …

2x2 switch: the meat of the ESEPro code

    ESL_RULE (r0, true) {
        Pkt x = in0->first(); in0->deq();
        if (x.dest == 0) out0->enq(x);
        else             out1->enq(x);
        if (count(x)) c++;
    }

    ESL_RULE (r1, true) {
        Pkt x = in1->first(); in1->deq();
        if (x.dest == 0) out0->enq(x);
        else             out1->enq(x);
        if (count(x)) c++;
    }

• Atomicity of rules captures all the “resource contention” constraints of the hardware implementation; further, this code is synthesizable to RTL as written.

Managing change
• Specs always change. Imagine:
    • Some packets are multicast (go to both FIFOs)
    • Some packets are dropped (go to no FIFO)
    • More complex arbitration
        • FIFO collision: in favor of r1
        • Counter collision: in favor of r2
        • Fair scheduling
    • Several counters for several kinds of interesting packets
        • Non-exclusive counters (e.g., TCP IP)
    • M input FIFOs, N output FIFOs (parameterized)
• What if these changes are required 6 months after original coding?
With Rules these are easy, because the source code remains uncluttered by all the complex control and mux logic; atomicity ensures correctness

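As a rough illustration of why an all-or-nothing rule absorbs such changes, the plain-C++ sketch below handles multicast, dropped packets, and M inputs by N outputs with a single guarded transfer. It is not ESEPro code: `SwitchMxN`, `fire_rule`, and the bit-mask destination encoding are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Assumed encoding: bit j of dest_mask set means "deliver to output j".
// A zero mask is a dropped packet; several bits set is a multicast.
struct Pkt { unsigned dest_mask; int payload; };

struct SwitchMxN {
    std::vector<std::deque<Pkt> > in, out;
    std::size_t out_capacity;
    SwitchMxN(std::size_t m, std::size_t n, std::size_t cap)
        : in(m), out(n), out_capacity(cap) {}
};

// Try to fire "rule i" atomically: either the head packet of input i reaches
// every output it names, or nothing changes. Returns true if it fired.
inline bool fire_rule(SwitchMxN& s, std::size_t i) {
    assert(i < s.in.size());
    if (s.in[i].empty()) return false;
    const Pkt x = s.in[i].front();
    for (std::size_t j = 0; j < s.out.size(); ++j)
        if (((x.dest_mask >> j) & 1u) && s.out[j].size() >= s.out_capacity)
            return false;  // one destination is full: do nothing at all
    s.in[i].pop_front();
    for (std::size_t j = 0; j < s.out.size(); ++j)
        if ((x.dest_mask >> j) & 1u) s.out[j].push_back(x);
    return true;
}
```

The guard is checked for every destination before any state changes, so a multicast packet is never delivered to only some of its outputs.
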
Interfaces: raising the level of abstraction (while preserving Rule semantics)
• Interfaces can also contain other interfaces
    • We use this to build a hierarchy of interfaces
    • Get/Put, Client/Server, …
    • These capture common interface design patterns
    • There is no HW overhead to such abstraction
• Connections between standard interfaces can be packaged (and used, and reused)
    • “Connectable” interfaces
• All these are synthesizable

Get and Put Interfaces
• Provide simple methods for getting data from a module or putting data into it
• Easy to connect together

    template <typename T>
    ESL_INTERFACE ( Get ) {
        ESL_METHOD_ACTIONVALUE_INTERFACE ( get, T );    // returns a value and changes state
    };

    template <typename T>
    ESL_INTERFACE ( Put ) {
        ESL_METHOD_ACTION_INTERFACE ( put, T x );       // changes state only
    };

Get and Put Interfaces
• Get and Put are just interface specifications
• Many different kinds of modules can provide Get and Put interfaces
• E.g., a FIFO’s enq() can be viewed as a put() operation, and a FIFO’s first()/deq() can be viewed as a get() operation

Interface transformers/transactors
• Because of the abstractions of interfaces and modules, it is easy to write interface transformers/transactors
• This example is from the ESEPro library, transforming a FIFO interface into a Get interface

    ESL_MODULE_TEMPLATE ( fifoToGet, Get, T ) {
        FIFO<T> *f;

        ESL_METHOD_ACTIONVALUE (get, true, T) {
            T temp = f->first();
            f->deq();
            return temp;
        }

        ESL_CTOR ( fifoToGet, FIFO<T> *ff ) : f ( ff ) {
            ESL_END_CTOR;
        }
    };

Interface transformers/transactors
• Another example from the ESEPro library, transforming a FIFO interface into a Put interface:

    ESL_MODULE_TEMPLATE ( fifoToPut, Put, T ) {
        FIFO<T> *f;

        ESL_METHOD_ACTION (put, true, T x) {
            f->enq (x);
        }

        ESL_CTOR ( fifoToPut, FIFO<T> *ff ) : f ( ff ) {
            ESL_END_CTOR;
        }
    };

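For readers without the ESEPro headers, the same adapter pattern can be sketched in standard C++. This is an analogue, not the library's code: the class names `FifoToGet` and `FifoToPut`, the bounded `Fifo`, and the explicit `can_get()`/`can_put()` tests (standing in for the implicit conditions of ESEPro methods) are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Plain-C++ analogue of the Get and Put interface specifications.
template <typename T> struct Get {
    virtual ~Get() {}
    virtual bool can_get() const = 0;   // stands in for the implicit condition
    virtual T get() = 0;
};
template <typename T> struct Put {
    virtual ~Put() {}
    virtual bool can_put() const = 0;   // stands in for the implicit condition
    virtual void put(const T& x) = 0;
};

// Hypothetical bounded FIFO.
template <typename T> class Fifo {
public:
    explicit Fifo(std::size_t capacity) : cap_(capacity) {}
    bool not_empty() const { return !q_.empty(); }
    bool not_full() const { return q_.size() < cap_; }
    const T& first() const { assert(not_empty()); return q_.front(); }
    void deq() { assert(not_empty()); q_.pop(); }
    void enq(const T& x) { assert(not_full()); q_.push(x); }
private:
    std::size_t cap_;
    std::queue<T> q_;
};

// first()/deq() viewed as get().
template <typename T> class FifoToGet : public Get<T> {
public:
    explicit FifoToGet(Fifo<T>* f) : f_(f) {}
    bool can_get() const { return f_->not_empty(); }
    T get() { T temp = f_->first(); f_->deq(); return temp; }
private:
    Fifo<T>* f_;
};

// enq() viewed as put().
template <typename T> class FifoToPut : public Put<T> {
public:
    explicit FifoToPut(Fifo<T>* f) : f_(f) {}
    bool can_put() const { return f_->not_full(); }
    void put(const T& x) { f_->enq(x); }
private:
    Fifo<T>* f_;
};
```

Both adapters hold only a pointer to the FIFO and forward to it, which matches the point made on the slides that such interface wrappers add no state of their own.
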
Nested interfaces
• An interface can contain not only methods, but also nested interfaces

    template < typename Req_t, typename Resp_t >
    ESL_INTERFACE ( Server ) {
        ESL_SUBINTERFACE ( request,  Put <Req_t>  );
        ESL_SUBINTERFACE ( response, Get <Resp_t> );
    };

Sub-interfaces: using transformers
• The ESEPro library provides functions to convert FIFOs to Get/Put

    typedef Server<Req_t, Resp_t> CacheIfc;

    ESL_MODULE ( mkCache, CacheIfc ) {
        FIFO<Req_t>  *p2c;
        FIFO<Resp_t> *c2p;

        … rules expressing cache logic …

        ESL_CTOR ( mkCache, …) {
            request  = new fifoToPut<Req_t>  ("req", p2c);
            response = new fifoToGet<Resp_t> ("rsp", c2p);
            ESL_END_CTOR;
        }
    };

• Absolutely no difference in the HW!

Client/Server interfaces
• Get/Put pairs are very common, and duals of each other, so the library defines Client/Server interface types for this purpose

    ESL_INTERFACE ( Client<req_t, resp_t> ) {
        ESL_SUBINTERFACE ( request,  Get<req_t>  );
        ESL_SUBINTERFACE ( response, Put<resp_t> );
    };

    ESL_INTERFACE ( Server<req_t, resp_t> ) {
        ESL_SUBINTERFACE ( request,  Put<req_t>  );
        ESL_SUBINTERFACE ( response, Get<resp_t> );
    };

[Diagram: a client's get (req_t) feeds the server's put, and the server's get (resp_t) feeds the client's put]

Client/Server interfaces
[Diagram: mkProcessor (client) connects to mkCache (server); mkCache (client) connects to mkMem (server)]

    ESL_INTERFACE ( CacheIfc ) {
        ESL_SUBINTERFACE ( ipc, Server<Req_t, Resp_t> );   // processor side
        ESL_SUBINTERFACE ( icm, Client<Req_t, Resp_t> );   // memory side
    };

    ESL_MODULE ( mkCache, CacheIfc ) {
        // from / to processor
        FIFO<Req_t>  *p2c;
        FIFO<Resp_t> *c2p;
        // to / from memory
        FIFO<Req_t>  *c2m;
        FIFO<Resp_t> *m2c;

        … rules expressing cache logic …

        ESL_CTOR ( mkCache ) {
            …
            ipc = fifosToServer (p2c, c2p);
            icm = fifosToClient (c2m, m2c);
            ESL_END_CTOR;
        }
    };

Connecting Get and Put
• A module m1 providing a Get interface can be connected to a module m2 providing a Put interface with a simple rule

    ESL_MODULE ( mkTop, …) {
        Get<int> *m1;
        Put<int> *m2;

        ESL_RULE ( connect, true ) {
            int x = m1->get();
            m2->put (x);        // note implicit conditions
        }
    }

“Connectable” interface pairs
• There are many pairs of types that are duals of each other
    • Get/Put, Client/Server, YourTypeT1/YourTypeT2, …
• The ESEPro library defines an overloaded, templated module mkConnection which encapsulates the connections between such duals
• The ESEPro library predefines the implementation of mkConnection for Get/Put, Client/Server, etc.
• Because overloading in C++ is extensible, you can overload mkConnection to work on your own interface types T1 and T2

mkConnection
• Using these interface facilities, assembling systems becomes very easy
[Diagram: mkProcessor (client) connects to mkCache's server (ipc); mkCache's client (icm) connects to mkMem (server)]

    ESL_MODULE ( mkTopLevel, …) {
        // instantiate subsystems
        Client<Req_t, Resp_t> *p;
        CacheIfc              *c;
        Server<Req_t, Resp_t> *m;

        // instantiate connections
        new mkConnection< Client<Req_t, Resp_t>, Server<Req_t, Resp_t> > ("p2c", p, c->ipc);
        new mkConnection< Client<Req_t, Resp_t>, Server<Req_t, Resp_t> > ("c2m", c->icm, m);
    }

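The idea of one connection operation overloaded for each pair of dual interfaces can be illustrated in standard C++. This is only an analogue of mkConnection, not its implementation: `connect_step` is a hypothetical function that performs one firing of the connecting rule, and the interfaces are modelled as bundles of callables with `can_get`/`can_put` standing in for implicit conditions.

```cpp
#include <cassert>
#include <functional>

// Plain-C++ stand-ins (NOT the ESEPro types).
template <typename T> struct Get {
    std::function<bool()> can_get;
    std::function<T()> get;
};
template <typename T> struct Put {
    std::function<bool()> can_put;
    std::function<void(const T&)> put;
};
template <typename Req, typename Resp> struct Client {
    Get<Req> request;
    Put<Resp> response;
};
template <typename Req, typename Resp> struct Server {
    Put<Req> request;
    Get<Resp> response;
};

// Get/Put: move one item if both sides are ready. Returns true if it fired.
template <typename T>
bool connect_step(Get<T>& g, Put<T>& p) {
    assert(g.can_get && g.get && p.can_put && p.put);  // all callables set
    if (!g.can_get() || !p.can_put()) return false;
    p.put(g.get());
    return true;
}

// Client/Server: the overload is built from two Get/Put connections.
// Returns how many of the two connections fired.
template <typename Req, typename Resp>
int connect_step(Client<Req, Resp>& c, Server<Req, Resp>& s) {
    int fired = 0;
    if (connect_step(c.request, s.request)) ++fired;    // request path
    if (connect_step(s.response, c.response)) ++fired;  // response path
    return fired;
}
```

A further overload for a user-defined pair of dual types would follow the same shape, composing the overloads that already exist for its sub-interfaces.
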
Rules and Levels of abstraction
Abstraction levels, from highest to lowest:
• AL/FL (Algorithm/Function level)
• PV (Programmer’s View)
• PVT (PV with Timing)
• AV (Architect’s View)
• CA (Cycle-accurate)
• IM (Implementation)
Modeling styles shown alongside these levels: Rules, C, C++, Matlab, …; Untimed Rules (no clocks); Clocked Rules (scheduled)

Module structure
A system model can contain a mixture of SystemC modules and ESEPro modules
• Typical SystemC modules:
    • CPU ISS models
    • Behavioral models
    • C++ code targeted for behavioral synthesis
    • Existing SystemC IP
• Typical ESEPro modules:
    • Complex control
    • Requiring HW-realistic architectural exploration
[Diagram: SoC Model containing Processor (App/ISS), DSP (App/ISS), L2 cache, Interconnect, DMA, Mem Controller, Codec model, DRAM model. Legend: Rule-based SystemC; SystemC]

Simulation flow
[Diagram: the System Model (Processor (ISS/App), DSP (ISS/App), L2 cache, Interconnect, DMA, Mem Controller, Codec model, DRAM model) is built from core SystemC + TLM + ESL, using core SystemC class defs/libs, TLM class defs/libs, and ESL class defs/libs, and runs with standard SystemC tools (gcc, OSCI sim, gdb, …). Legend: Rule-based SystemC; SystemC]

Synthesis flow
• Synthesizable subset: ESEPro Rule-based modules
    • much higher level than RTL
    • already validated in BSV
[Diagram: the rule-based modules of the System Model (Processor (ISS), DSP (App), L2 cache, Interconnect, DMA, Mem Controller, Codec model, DRAM model) go through the Bluespec synthesis tool to RTL; the RTL goes to Verilog sim and to RTL synthesis, Physical design, and Tapeout. Legend: Rule-based SystemC; SystemC]

System refinement using ESEPro
• ESEPro modules can be introduced early, as they can be written at a very high level, can interface to TLM modules, and can themselves be refined
• System-level testbenches can be reused at all levels
• SystemC modules with standard TLM interfaces interoperate seamlessly with ESEPro modules
    • Behavioral models, Design IP, Verification IP, …
More information at: http://www.bluespec.com/products/ESLSynthesisExtensions.htm
The website also has a free distribution called “ESE”

Mixing models: all combinations
[Diagram: starting from TLM Master + TLM Slave, replace the slave (TLM Master + ESEPro Slave), replace the master (ESEPro Master + TLM Slave), or replace both (ESEPro Master + ESEPro Slave). Legend: Rule-based SystemC; SystemC]
• TLM Master and Slave are taken unmodified from OSCI TLM distribution examples

Structure of TLM modules in demo (from OSCI_TLM/examples/example_3_2)
[Diagram: the TLM master issues write (addr, data) and read (addr, data &) through a basic_initiator_port; the TLM slave, built on basic_slave_base, implements write (addr, data) and read (addr, data &); both sides are annotated “20”. The two are linked by RSP = transport (REQ).]
• transport () is a basic TLM interface call

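TLM master and ESEPro slave
[Diagram: the TLM master issues write (addr, data) and read (addr, data &) through a basic_initiator_port, which calls transport (); an mkConnection (channel) links it to the ESEPro slave's Server <REQ, RSP> interface, behind which the slave implements write (addr, data) and read (addr, data &); both sides are annotated “20”.]
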
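One way to picture the bridge between a blocking transport() call and a rule-based slave with a Server-style request/response pair is the plain-C++ sketch below. It uses neither the OSCI TLM headers nor ESEPro, and it is not how mkConnection is implemented: `RuleSlave`, its single `step()` rule, the 16-word memory, and the free function `transport` are all hypothetical.

```cpp
#include <cassert>

struct Req { bool is_write; int addr; int data; };
struct Rsp { bool ok; int data; };

// Hypothetical rule-style slave: a request slot, a response slot, and one
// rule that turns a pending request into a response.
class RuleSlave {
public:
    bool can_put() const { return !has_req_; }
    void put(const Req& r) { assert(can_put()); req_ = r; has_req_ = true; }
    bool can_get() const { return has_rsp_; }
    Rsp get() { assert(can_get()); has_rsp_ = false; return rsp_; }

    // The slave's rule: enabled when a request is pending and the response
    // slot is free. Returns true if it fired.
    bool step() {
        if (!has_req_ || has_rsp_) return false;
        int& cell = mem_[index(req_.addr)];
        if (req_.is_write) { cell = req_.data; rsp_ = Rsp{true, 0}; }
        else               { rsp_ = Rsp{true, cell}; }
        has_req_ = false;
        has_rsp_ = true;
        return true;
    }

private:
    static int index(int addr) { return ((addr % 16) + 16) % 16; }
    int mem_[16] = {};
    Req req_ = Req{false, 0, 0};
    Rsp rsp_ = Rsp{false, 0};
    bool has_req_ = false;
    bool has_rsp_ = false;
};

// Blocking, transport()-style call layered on the put/get pair: put the
// request, let the slave's rule run, then get the response.
inline Rsp transport(RuleSlave& s, const Req& req) {
    assert(s.can_put());
    s.put(req);
    while (!s.can_get()) s.step();
    return s.get();
}
```

A master written against a transport-style call can then issue `transport(slave, Req{true, 3, 42})` followed by `transport(slave, Req{false, 3, 0})` and read back 42, without knowing that the slave is rule-based.
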
Example: ESEPro SoC model for synthesis (from ST/GreenSocs “TAC” model)
[Diagram: Initiator 0 and Initiator 1 connect through a Router to Target 0, Target 1, and Timer. Initiator 1 sets the timer and responds to the timer interrupt. M = Master interface; S = Slave interface.]
(< 1000 lines of source code)

SoC Model: Behavior
• Initiators repeatedly do read/write transactions to Targets, via the Router
• At startup, Initiator 1 writes to the Timer registers via the Router, starting the timer
• When the Timer’s time period expires, it generates an interrupt to Initiator 1

Synthesis example: SoC Model in ESEPro (from ST/GreenSocs “TAC” model)
[Diagram, two flows from the same source:
 Simulation (ESEPro™): core SystemC class defs/libs + ESL class defs/libs, with standard SystemC tools (gcc, OSCI sim, gdb, …).
 Synthesis (ESEComp™): Bluespec synthesis tool, producing RTL for Verilog sim and Magma synthesis.
 The two simulations are cycle-accurate with respect to each other.]
• This capability is unique to ESEComp

Side-by-side simulation comparison

SystemC simulation:

    Cycle 12
    Target[1]: got request from initiator[1], addr is 1001
    Target[1]: sending response, data 1011
    Target[0]: got request from initiator[0], addr is 4
    Target[0]: sending response, data 14
    Initiator_with_intr_in[1]: forwarding req, addr = 1003
    Initiator[0]: got response addr 2, data 12
    Initiator[0]: sending req, addr = 6
    ----------------
    Cycle 13
    Timer: generating interrupt
    Initiator[1]: sending req, addr = 4
    ----------------
    Cycle 14
    Target[1]: got request from initiator[0], addr is 1005
    Target[1]: sending response, data 1015
    Target[0]: got request from initiator[1], addr is 2
    Target[0]: sending response, data 12
    Initiator_with_intr_in[1]: forwarding req, addr = 4
    Initiator[1]: got response addr 0, data 10
    Initiator[0]: got response addr 1003, data 1013
    Initiator[0]: sending req, addr = 1007
    ----------------
    Cycle 15
    Initiator_with_intr_in[1] received interrupt
    Initiator[1]: sending req, addr = 1005

Verilog (generated) simulation:

    Cycle 12
    Initiator[0]: sending req, addr = 6
    Initiator[0]: got response addr 2, data 12
    Target[0]: got request from initiator[0], addr is 4
    Target[0]: sending response, data 14
    Target[1]: got request from initiator[1], addr is 1001
    Target[1]: sending response, data 1011
    Initiator_with_intr_in[1]: forwarding req, addr = 1003
    ----------------
    Cycle 13
    Initiator[1]: sending req, addr = 4
    Timer: generating interrupt
    ----------------
    Cycle 14
    Initiator[0]: sending req, addr = 1007
    Initiator[0]: got response addr 1003, data 1013
    Initiator[1]: got response addr 0, data 10
    Target[0]: got request from initiator[1], addr is 2
    Target[0]: sending response, data 12
    Target[1]: got request from initiator[0], addr is 1005
    Target[1]: sending response, data 1015
    Initiator_with_intr_in[1]: forwarding req, addr = 4
    ----------------
    Cycle 15
    Initiator[1]: sending req, addr = 1005
    Initiator_with_intr_in[1] received interrupt

Cycle-accurate (the order of messages within each cycle varies, but that’s OK: they come from parallel actions)

SoC Router: Magma Synthesis Results
• ESEComp’s Verilog output run through Magma’s synthesis tools
• TSMC 0.18 µm libraries
• Design easily meets 400 MHz

Thanks