
Transforming an implementation into a cycle-accurate simulator using BDN

Transforming an implementation into a cycle-accurate simulator using BDN. Murali Vijayaraghavan and Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. RAMP Workshop, Austin, TX June 25, 2009. IBM/MIT Collaboration Sept 2007 –.


Presentation Transcript


  1. Transforming an implementation into a cycle-accurate simulator using BDN
     Murali Vijayaraghavan and Arvind
     Computer Science and Artificial Intelligence Laboratory, M.I.T.
     RAMP Workshop, Austin, TX, June 25, 2009
     http://csg.csail.mit.edu

  2. IBM/MIT Collaboration, Sept 2007 –
     • Motivation: Create an ecosystem to foster and promote the use of the Power architecture in system research
     • Initial goal: Create a flexible and synthesizable multithreaded, multicore PowerPC model that facilitates rapid architectural exploration
       • parameterized for the number of threads and for the number and functionality of pipeline stages
     • Current goals:
       • Cycle-accurate modeling
       • Open source distribution on widely available FPGAs by summer 2010

  3. The Team
     • Architecture and Bluespec Coding
       • K. Ekanadham, Jessica Tseng
       • MIT: Asif Khan, Murali Vijayaraghavan
     • Linux OS Bring-up Team
       • Hubertus Franke, Jimi Xenidis
     • FPGA Prototyping Team
       • Richard Kaufman, Kai Schleupen
     • Managers
       • Nancy Greco, Pratap Pattnaik
       • MIT: Arvind

  4. Results
     • A 64-bit embedded PowerPC was created from scratch in Bluespec SystemVerilog (BSV)
     • Implemented on an IBM internal FPGA platform that uses a Xilinx Virtex-5 LX330 chip
     • Linux was booted on it by Nov 2008
     • Jessica has ported this design onto the Xilinx XUPV5
       • Takes up 92% of the area
       • Running at 20 MHz, but can probably be raised to 40 MHz

  5. Issues in “Prototype” RTL to FPGA mapping
     • Some structures consume a disproportionate amount of FPGA resources
       • multiported register file
       • CAM
       • multiply, divide
     • Prototype RTL implementations on FPGAs need to compensate for external memory timing
     • Lack of tools for mapping onto multiple FPGAs
     Can be implemented in multiple cycles to save resources ⇒ Cycle-accurate modeling

  6. Bounded Data Flow Networks (BDNs) as a theoretical framework for cycle-accurate modeling of synchronous sequential machines (SSMs)
     Murali Vijayaraghavan & Arvind [MEMOCODE 2009]

  7. Implementing RTL on FPGAs
     [Figure: a 3-read, 2-write register file in the target RTL (ASIC); on the FPGA it is simulated on BRAMs in multiple cycles]
     In general, functional correctness requires cycle accuracy

  8. BDN as a refinement of an SSM
     [Figure: SSM S and BDN R, each with inputs I1 … In and outputs O1 … Om]
     • There is a bijective mapping between the inputs (outputs) of S and R
     • Cycle accuracy: for all n > 0, I(k) matches for S and R (1 ≤ k ≤ n) ⇒ O(j) matches for S and R (1 ≤ j ≤ n)
     • For a BDN, I(k) refers to the kth enqueue in each input FIFO
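The cycle-accuracy condition above can be checked on finite recorded traces. The following is a minimal sketch in Python, assuming each trace is a list whose kth entry holds the kth input (or output) values; the names `matches_up_to` and `refines` are hypothetical and do not come from the talk.

```python
def matches_up_to(trace_s, trace_r, n):
    """True if both traces have at least n entries and agree on the first n."""
    return len(trace_s) >= n and len(trace_r) >= n and trace_s[:n] == trace_r[:n]


def refines(inputs_s, outputs_s, inputs_r, outputs_r):
    """Check, for every n covered by the traces, that
    matching inputs I(1..n) imply matching outputs O(1..n)."""
    horizon = min(len(inputs_s), len(outputs_s), len(inputs_r), len(outputs_r))
    for n in range(1, horizon + 1):
        inputs_match = matches_up_to(inputs_s, inputs_r, n)
        outputs_match = matches_up_to(outputs_s, outputs_r, n)
        if inputs_match and not outputs_match:
            return False  # same inputs, different outputs: not cycle-accurate
    return True
```

A finite trace can only refute the property, never prove it for all n, so this is a testing aid rather than a proof.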

  9. Patient SSMs: SSMs with a “start” signal to update registers
     [Figure: two combinational logic blocks with registers; the patient SSM adds an “enable” input]

  10. SSM to BDN and refinements
      [Figure: an SSM is cut into S1, S2 (big) and S3 (big); the patient SSMs are turned into BDNs; BDN refinements then give a BDN made of R1, R2’ (small) and R3’ (small)]

  11. SSM to BDN
      • The translation has to be done such that the generated BDN is latency-insensitive, i.e., the input-output behavior of the BDN does not change if we change the latency of one of its component BDNs or the size of the FIFOs connecting the components

  12. Implementing an SSM as a BDN
      [Figure: an SSM with inputs a, b and outputs c, d, where f drives c and b drives d; the same logic with FIFOs on a, b, c and d]
      rule O when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬d.full)
        ⇒ c.enq(f(a.first, b.first)); d.enq(b.first); a.deq; b.deq
      • This description can be easily translated into logic that serves as a wrapper for the original logic
      • The SSM and BDN have the same input-output behavior
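To make the single-rule wrapper concrete, here is a minimal software sketch in Python. It assumes bounded FIFOs modeled with `collections.deque`; the `Fifo` class and the function name `rule_O` are illustrative stand-ins for the FIFO interface and rule named on the slide, not the authors' Bluespec code.

```python
from collections import deque


class Fifo:
    """Bounded FIFO exposing the empty/full/first/enq/deq names used on the slide."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    @property
    def empty(self):
        return len(self.items) == 0

    @property
    def full(self):
        return len(self.items) >= self.capacity

    @property
    def first(self):
        return self.items[0]

    def enq(self, value):
        assert not self.full, "enq on a full FIFO"
        self.items.append(value)

    def deq(self):
        return self.items.popleft()


def rule_O(a, b, c, d, f):
    """Fire rule O once if its guard holds; return whether it fired."""
    if a.empty or b.empty or c.full or d.full:
        return False
    c.enq(f(a.first, b.first))  # output c depends on both inputs
    d.enq(b.first)              # output d depends only on input b
    a.deq()
    b.deq()
    return True


# One modeled cycle: both inputs present, both outputs have space.
a, b, c, d = Fifo(2), Fifo(2), Fifo(1), Fifo(1)
a.enq(3)
b.enq(4)
fired = rule_O(a, b, c, d, lambda x, y: x + y)
```

Each firing of the rule corresponds to one clock cycle of the original SSM. With the one-entry output FIFOs chosen here, a second firing is blocked until a consumer drains both c and d.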

  13. Deadlocks
      [Figure: the same SSM and its FIFO-wrapped version as on the previous slide]
      rule O when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬d.full)
        ⇒ c.enq(f(a.first, b.first)); d.enq(b.first); a.deq; b.deq
      Extraneous dependencies: d unnecessarily depends upon a and c

  14. Another behavior for the same BDN
      [Figure: the FIFO-wrapped SSM with added registers cDone and dDone]
      • rule O1 when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬cDone)
          ⇒ c.enq(f(a.first, b.first)); cDone <= True
      • rule O2 when (¬b.empty ∧ ¬d.full ∧ ¬dDone)
          ⇒ d.enq(b.first); dDone <= True
      • rule In when (cDone ∧ dDone)
          ⇒ a.deq; b.deq; cDone <= False; dDone <= False
      No extraneous dependencies – no deadlock
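The effect of splitting the rule can be seen in a small software model. The sketch below is in Python and assumes the same illustrative `Fifo` model of a bounded queue; the class name `SplitBDN` and its method names are hypothetical. It shows output d being produced while input a is still empty, which the single-rule version cannot do.

```python
from collections import deque


class Fifo:
    """Bounded FIFO exposing the empty/full/first/enq/deq names used on the slide."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    @property
    def empty(self):
        return len(self.items) == 0

    @property
    def full(self):
        return len(self.items) >= self.capacity

    @property
    def first(self):
        return self.items[0]

    def enq(self, value):
        assert not self.full, "enq on a full FIFO"
        self.items.append(value)

    def deq(self):
        return self.items.popleft()


class SplitBDN:
    """Rules O1, O2 and In with one done flag per output."""

    def __init__(self, a, b, c, d, f):
        self.a, self.b, self.c, self.d, self.f = a, b, c, d, f
        self.c_done = False
        self.d_done = False

    def rule_O1(self):
        if self.a.empty or self.b.empty or self.c.full or self.c_done:
            return False
        self.c.enq(self.f(self.a.first, self.b.first))
        self.c_done = True
        return True

    def rule_O2(self):
        # d depends only on b, so neither a nor c appears in this guard.
        if self.b.empty or self.d.full or self.d_done:
            return False
        self.d.enq(self.b.first)
        self.d_done = True
        return True

    def rule_In(self):
        # Consume the inputs only after both outputs of this cycle are produced.
        if not (self.c_done and self.d_done):
            return False
        self.a.deq()
        self.b.deq()
        self.c_done = False
        self.d_done = False
        return True
```

The done flags keep each output from being enqueued twice for the same modeled cycle, and rule In ends the cycle by dequeuing both inputs together.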

  15. Latency-Insensitive BDNs
      • No extraneous dependency property: if output Oi is not enqueued n times, assuming it is not full and all the inputs are enqueued n-1 times, then it must be that one of the inputs in Depends-on(Oi) is not enqueued n times
      • Self-cleaning property: if all outputs are enqueued n times then all inputs must be dequeued n times
      BDNs with these properties are latency-insensitive and do not deadlock

  16. Writing an LI-BDN wrapper for an SSM
      Given the SSM:
        oj(t) = fj(ij1(t), ... , ijIj(t), s(t))   // ij1, ij2, ... ijIj are in Depends-on(oj)
        s(t+1) = g(i1(t), i2(t), ... , s(t))
      LI-BDN:
        rule Oj when (¬donej)
          ⇒ donej <= True; oj.enq(fj(ij1.first, ... , ijIj.first, s))
        rule Finish when (done1 ∧ done2 ∧ ...)
          ⇒ done1 <= False; done2 <= False; ...
            s <= g(i1.first, i2.first, ... , s)
            i1.deq; i2.deq; ...
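The wrapper recipe above can be sketched generically in Python. The sketch makes the FIFO guards explicit (in Bluespec they are implicit in `first`, `enq` and `deq`); the class `LIBDNWrapper`, its constructor arguments and the `Fifo` model are assumptions made for illustration and are not taken from the talk.

```python
from collections import deque


class Fifo:
    """Bounded FIFO exposing the empty/full/first/enq/deq names used on the slide."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    @property
    def empty(self):
        return len(self.items) == 0

    @property
    def full(self):
        return len(self.items) >= self.capacity

    @property
    def first(self):
        return self.items[0]

    def enq(self, value):
        assert not self.full, "enq on a full FIFO"
        self.items.append(value)

    def deq(self):
        return self.items.popleft()


class LIBDNWrapper:
    """Wraps an SSM given by output functions f_j and a next-state function g."""

    def __init__(self, inputs, outputs, out_fns, depends_on, next_state, state):
        self.inputs = inputs          # dict: input name -> Fifo
        self.outputs = outputs        # dict: output name -> Fifo
        self.out_fns = out_fns        # dict: output name -> f_j(dep values..., s)
        self.depends_on = depends_on  # dict: output name -> list of input names
        self.next_state = next_state  # g(all input values in dict order..., s)
        self.state = state
        self.done = {name: False for name in outputs}

    def rule_O(self, j):
        """Enqueue output j once per modeled cycle, using only Depends-on(o_j)."""
        deps = [self.inputs[i] for i in self.depends_on[j]]
        if self.done[j] or self.outputs[j].full or any(q.empty for q in deps):
            return False
        self.outputs[j].enq(self.out_fns[j](*[q.first for q in deps], self.state))
        self.done[j] = True
        return True

    def rule_finish(self):
        """Once every output is done, update the state and dequeue every input."""
        ins = list(self.inputs.values())
        if not all(self.done.values()) or any(q.empty for q in ins):
            return False
        self.state = self.next_state(*[q.first for q in ins], self.state)
        for q in ins:
            q.deq()
        for j in self.done:
            self.done[j] = False
        return True
```

As a usage example, an accumulator with o(t) = s(t) and s(t+1) = s(t) + i(t) has an empty Depends-on set for its output, so the wrapper can emit the output for a cycle before that cycle's input arrives.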

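  17. The Wrapper Circuit
      [Figure: a patient SSM with its wrapper. Labels on the input side: Ii first value, not-empty, deq (all input deqs). Labels on the output side: Oj enq, not-full, the not-empty signals of Depends-on(Oj), a donei register fed by a 1/0 mux, and “all dones” driving the SSM's enable]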

  18. PPC In-order Pipeline
      [Figure: pipeline with blocks labeled PC, Fetch, BrPred, Decode, Crack, RegRd, ALU, AddrCalc, BrRes, Mem1, Mem2, Excep and RegWr; I$/ITlb1, I$/ITlb2, D$/DTlb1, D$/DTlb2 and Mem; and stall, bypass and epochs signals]
      • The designer specifies the FSM for each stage
      • The FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of FIFOs or the number of stages

  19. The steps in cycle-accurate implementation on FPGAs (can be mechanized)
      • The specs are turned into Bluespec code to give a target SSM
        • Once the size of the FIFOs is fixed, the whole design has a precise timing specification
      • If the FPGA implementation requires refining some stages, then cuts are made in the design to isolate the stages (SSMs) to be refined
      • Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of the model FIFOs of the SSM
        • This converts the nth time cycle of the SSM into the nth enqueue into the input FIFOs and the nth dequeue from the output FIFOs
      • Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced
        • This also ensures deadlock-free operation

  20. Preliminary results
      • Cycle-accurate refinements onto Xilinx XUPV5 (Asif & Murali)
        • Slice logic utilization:
          • Number of Slice Registers: 15448 out of 69120 (22%)
          • Number of Slice LUTs: 16702 out of 69120 (24%)
        • Specific feature utilization:
          • Number of Block RAM/FIFO: 1 out of 148 (0%; only 1 BRAM, for the register file)
          • Number of DSP48Es: 12 out of 64 (18%; these are used for the divider)
        • Minimum period: 7.988 ns (maximum frequency: 125.188 MHz)
        • Partially verified by running a 50-instruction program
      • Compared to Jessica's port onto Xilinx XUPV5
        • Takes up 92% of the area
        • 20 MHz → 40 MHz
      • No numbers yet for actual work done
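The reported figures can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers on the slide; the helper name `percent_truncated` is hypothetical, and truncating rather than rounding is an assumption made because 12 out of 64 is 18.75% while the slide reports 18%.

```python
def percent_truncated(used, available):
    """Utilization as a whole-number percentage, truncated toward zero."""
    return int(100 * used / available)


slice_registers = percent_truncated(15448, 69120)
slice_luts = percent_truncated(16702, 69120)
block_rams = percent_truncated(1, 148)
dsp48es = percent_truncated(12, 64)

# Maximum frequency in MHz is 1000 divided by the minimum period in ns.
max_frequency_mhz = round(1000 / 7.988, 3)
```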

  21. Conclusion
      • Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators
      • BDNs offer a way to refine RTL without losing cycle accuracy
      • Bluespec makes quick RTL generation feasible
      • The generation of BDNs can be automated
      • We plan to release our Bluespec designs under open source licensing to strengthen the PowerPC ecosystem

  22. Related work
      • Luca Carloni et al. for latency-insensitive refinements
      • HAsim: Joel Emer, Michael Pellauer, et al. at Intel/MIT
        • Cycle-accurate modeling using the A-ports abstraction
      • UTFast: Derek Chiou and students at UT Austin
        • speculative functional model, corrected by timing model when necessary
      • Protoflex: James Hoe, Eric Chung et al. at CMU
      • RAMP Gold: Krste Asanovic et al. at Berkeley

  23. Thanks!

  24. BDN Input/Output notation
      [Figure: BDN R with inputs I1 … In (collectively I) and outputs O1 … Om (collectively O)]
      • Ii(n) represents the nth value enqueued in input buffer Ii; I(n) represents the nth values enqueued in all input buffers
      • Oj(n) represents the nth value dequeued from output buffer Oj; O(n) represents the nth values dequeued from all output buffers

  25. Examples of primitive BDNs: Register
      [Figure: register r with input FIFO a, output FIFO b and flag bDone]
      A register whose reads and writes must match
      Behavior
        rule RO when (¬b.full ∧ ¬bDone) ⇒ b.enq(r); bDone <= True
        rule RI when (¬a.empty ∧ bDone) ⇒ r <= a.first; a.deq; bDone <= False
      Initial values: bDone = False, r = r0
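The register BDN can be exercised in software to confirm the one-cycle delay between input and output. This is a minimal Python sketch under the same assumptions as a bounded-queue model of the FIFOs; `Fifo` and `RegisterBDN` are illustrative names, not part of the talk.

```python
from collections import deque


class Fifo:
    """Bounded FIFO exposing the empty/full/first/enq/deq names used on the slide."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    @property
    def empty(self):
        return len(self.items) == 0

    @property
    def full(self):
        return len(self.items) >= self.capacity

    @property
    def first(self):
        return self.items[0]

    def enq(self, value):
        assert not self.full, "enq on a full FIFO"
        self.items.append(value)

    def deq(self):
        return self.items.popleft()


class RegisterBDN:
    """Register r with input FIFO a, output FIFO b and the bDone flag."""

    def __init__(self, a, b, r0):
        self.a, self.b = a, b
        self.r = r0
        self.b_done = False

    def rule_RO(self):
        # Read: emit the current register value once per modeled cycle.
        if self.b.full or self.b_done:
            return False
        self.b.enq(self.r)
        self.b_done = True
        return True

    def rule_RI(self):
        # Write: only after this cycle's read, so reads and writes stay matched.
        if self.a.empty or not self.b_done:
            return False
        self.r = self.a.first
        self.a.deq()
        self.b_done = False
        return True


a, b = Fifo(4), Fifo(4)
reg = RegisterBDN(a, b, 0)
for value in (1, 2, 3):
    a.enq(value)
# Fire rules until none is enabled.
while reg.rule_RO() or reg.rule_RI():
    pass
outputs = list(b.items)
```

The nth output is the initial value for n = 1 and the (n-1)th input afterwards, which is the behavior of a clocked register.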

  26. Examples of primitive BDNs: Mux
      [Figure: mux with select input p, data inputs a and b, output c, and counters aCnt and bCnt]
      A mux that accepts an input value on each input port but passes only the appropriate value to the output
      Behavior
        rule MuxO when (¬c.full ∧ ¬p.empty) ⇒
          if (p.first ∧ ¬a.empty) then c.enq(a.first); a.deq; bCnt <= bCnt+1
          else if (!(p.first) ∧ ¬b.empty) then c.enq(b.first); b.deq; aCnt <= aCnt+1
        rule MuxI1 when (aCnt > 0 ∧ ¬a.empty) ⇒ a.deq; aCnt <= aCnt-1
        rule MuxI2 when (bCnt > 0 ∧ ¬b.empty) ⇒ b.deq; bCnt <= bCnt-1
      Initial values: aCnt = 0, bCnt = 0

  27. Composition of BDNs
      [Figure: parallel composition R of R1 and R2; iterative composition R of R1, with output Oj connected to input Ii (Ii = Oj)]
      • If R1 and R2 are BDNs then so is the parallel composition R of R1 and R2
      • If R1 is a BDN then so is the (Ii, Oj) iterative composition R of R1, provided Ii ∉ Depends-on(Oj)*
      * No direct combinational path

  28. Deadlock-free BDN
      [Figure: BDN R with inputs I1 … In (collectively I) and outputs O1 … Om (collectively O)]
      • Assuming an infinite sink, a BDN is deadlock-free if for all n > 0, if n values are enqueued into I then eventually n values will be dequeued from both O and I
      • We need a stronger property for deadlock-freeness to be preserved under composition
