Bounded Dataflow Networks and Latency Insensitive Circuits Cont…
Arvind
Computer Science and Artificial Intelligence Laboratory, MIT
Based on the work of Murali Vijayaraghavan and Arvind [MEMOCODE 2009]
http://csg.csail.mit.edu/korea

Modular transformation
[Figure: an SSM composed of SSM1, SSM2 and SSM3 is transformed into a BDN composed of BDN1, BDN2 and BDN3]
Is this transformation correct?
Yes: provided each BDNi implements SSMi and is latency-insensitive, the resulting BDN implements SSM and is latency-insensitive.

BDN Implementing an SSM
[Figure: an SSM and the corresponding BDN]
A BDN is said to implement an SSM iff
• There is a bijective mapping between the inputs (outputs) of the SSM and those of the BDN
• The output histories of the SSM and the BDN match whenever the input histories match
• The BDN is deadlock-free

Latency-Insensitive BDN (LI-BDN)
A BDN implementing an SSM is an LI-BDN iff it has
• The No-Extraneous-Dependencies property
• The Self-Cleaning property
Theorem: A BDN in which all the nodes are LI-BDNs will not deadlock

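No-Extraneous Dependency (NED) property
[Figure: in the SSM, some inputs are combinationally connected to the output out; in the BDN, the corresponding input FIFOs feed outQ]
Production of outQ waits only for the input FIFOs that correspond to the SSM inputs combinationally connected to out
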
Self-Cleaning (SC) property
If the BDN has enqueued all its outputs, it will dequeue all its inputs

Modular refinement - revisited
[Figure: SSM1 is the module to be refined and SSM2 is the rest of the design. LI-BDN2 is generated automatically from SSM2. LI-BDN1, implementing SSM1, is refined manually.]

Writing an LI-BDN wrapper for an SSM
Given the SSM:
    oj(t) = fj(ij1(t), ... , ijIj(t), s(t))    // ij1, ij2, ... ijIj are combinationally connected to oj
    s(t+1) = g(i1(t), i2(t), ... , s(t))
LI-BDN:
• Introduce a done flag and a rule for each output
    rule Oj when (¬donej)
        donej ← True;
        oj.enq( fj(ij1.first, ... , ijIj.first, s) )
• Introduce the Finish rule
    rule Finish when (done1 ∧ done2 ∧ ...)
        done1 ← False; done2 ← False; ...;
        s ← g(i1.first, i2.first, ... , s);
        i1.deq; i2.deq; ...

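The wrapper rules can be exercised with a small software model. The following is only a sketch under stated assumptions: the class `LIBDNWrapper`, the helper `reference_ssm`, the example SSM (o0 = i0 + s, o1 = s, next state s + i1) and the output FIFO depth of 8 are all hypothetical and do not come from the slides. It illustrates that each output waits only for the inputs it depends on, and that the output history matches the cycle-by-cycle SSM.

```python
from collections import deque


class LIBDNWrapper:
    """Software model of an LI-BDN wrapper: one done flag and rule per output, plus Finish."""

    def __init__(self, out_fns, deps, next_state, init_state, n_inputs, out_depth=8):
        self.out_fns = out_fns        # out_fns[j](dep_values, s) plays the role of fj
        self.deps = deps              # deps[j]: indices of inputs combinationally connected to oj
        self.next_state = next_state  # next_state(all_input_values, s) plays the role of g
        self.s = init_state
        self.inq = [deque() for _ in range(n_inputs)]
        self.outq = [deque() for _ in out_fns]
        self.done = [False] * len(out_fns)
        self.out_depth = out_depth    # assumed finite output FIFO depth

    def fire_output(self, j):
        """Rule Oj: needs oj not done, room in oj, and only the inputs oj depends on."""
        if self.done[j] or len(self.outq[j]) >= self.out_depth:
            return False
        if any(not self.inq[i] for i in self.deps[j]):
            return False
        values = [self.inq[i][0] for i in self.deps[j]]
        self.outq[j].append(self.out_fns[j](values, self.s))
        self.done[j] = True
        return True

    def fire_finish(self):
        """Rule Finish: needs every output done and every input present."""
        if not all(self.done) or any(not q for q in self.inq):
            return False
        self.s = self.next_state([q[0] for q in self.inq], self.s)
        for q in self.inq:
            q.popleft()
        self.done = [False] * len(self.done)
        return True

    def run(self):
        """Keep firing rules until none is enabled."""
        progressed = True
        while progressed:
            fired = [self.fire_output(j) for j in range(len(self.out_fns))]
            progressed = any(fired)
            progressed = self.fire_finish() or progressed


def reference_ssm(i0_seq, i1_seq, s=0):
    """Cycle-by-cycle reference for the example SSM: o0 = i0 + s, o1 = s, s' = s + i1."""
    o0, o1 = [], []
    for a, b in zip(i0_seq, i1_seq):
        o0.append(a + s)
        o1.append(s)
        s = s + b
    return o0, o1


def demo():
    w = LIBDNWrapper(
        out_fns=[lambda v, s: v[0] + s, lambda v, s: s],
        deps=[[0], []],               # o0 depends on i0 only; o1 depends on no input
        next_state=lambda ins, s: s + ins[1],
        init_state=0,
        n_inputs=2,
    )
    w.run()                           # no inputs yet: only o1 can be produced
    early = (list(w.outq[0]), list(w.outq[1]))
    for a in [1, 2, 3]:
        w.inq[0].append(a)            # i0 arrives before i1
    w.run()                           # o0 for cycle 1 appears without waiting for i1
    mid = (list(w.outq[0]), list(w.outq[1]))
    for b in [10, 20, 30]:
        w.inq[1].append(b)
    w.run()
    return early, mid, (list(w.outq[0]), list(w.outq[1]))
```

In this model `demo()` ends with one extra element in the o1 queue: because o1 depends on no input, its value for the fourth cycle is produced as soon as the third Finish fires. The first three elements of each output queue are the ones to compare against `reference_ssm`.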
Wrapper circuit
[Figure: wrapper circuit around a patient SSM. Signals shown: first, deq and not-empty on each input FIFO Ii; enq and not-full on each output FIFO Oj; value and enable on the patient SSM; donej, Depends-on(Oj), All dones, All input deq.]

Patient SSM
[Figure: an SSM (combinational logic with inputs and outputs) next to its patient version, which has an additional Enable input]

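As a rough illustration of the Enable input, here is a minimal sketch that assumes a patient SSM computes its outputs combinationally in every cycle but updates its state only when Enable is asserted. The class `PatientAccumulator` and its behavior (output = input + state) are hypothetical and chosen only for the example.

```python
class PatientAccumulator:
    """Hypothetical patient SSM: output = input + state; state accumulates the inputs."""

    def __init__(self):
        self.s = 0

    def cycle(self, i, enable):
        out = i + self.s      # combinational output, computed whether or not enabled
        if enable:            # assumption: state advances only when Enable is asserted
            self.s = self.s + i
        return out


def patient_demo():
    p = PatientAccumulator()
    outs = [p.cycle(5, False), p.cycle(5, True), p.cycle(1, False)]
    return outs, p.s
```

With Enable held low the same input can be presented for any number of cycles without disturbing the state, which is what lets a wrapper hold the SSM until all of its outputs have been enqueued.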
Example: 3-port and 1-port Register Files
[Figure: 3-port rf with ports ra0, ra1, wen, wa, wd, rd0, rd1; 1-port rf with ports en, R/W, a, d, out]
interface RegisterFile3Ports
    method Value rd0(Addr a);
    method Value rd1(Addr a);
    method Action wr(Addr a, Value x);
endinterface
interface RegisterFile1port
    method ActionValue#(Value) access(Req r);
endinterface
// Response to write access is unconstrained
typedef union tagged {
    W struct{a:Addr, v:Value};
    R struct{a:Addr};
} Req;

LI-BDN for a 3-port register file
[Figure: rf with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done]
rule RD0 when (¬rd0Done)
    rd0.enq(rf.rd0(ra0.first)); rd0Done ← True;
rule RD1 when (¬rd1Done)
    rd1.enq(rf.rd1(ra1.first)); rd1Done ← True;
rule finish when (rd0Done ∧ rd1Done)
    ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
    if (wen.first) rf.wr(wa.first, wd.first);
    rd0Done ← False; rd1Done ← False;

Refinement into a one-ported register file LI-BDN
[Figure: 1-port rf (en, R/W, a, d, out) inside the wrapper, with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done]
rule RD0 when (¬rd0Done)
    let x ← rf.access(R {a:ra0.first}); rd0.enq(x); rd0Done ← True;
rule RD1 when (¬rd1Done)
    let x ← rf.access(R {a:ra1.first}); rd1.enq(x); rd1Done ← True;
rule finish when (rd0Done ∧ rd1Done)
    ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
    if (wen.first) rf.access(W {a:wa.first, v:wd.first});
    rd0Done ← False; rd1Done ← False;
This uses 1 port

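The refinement can be checked against the 3-port behavior with a small software model. This is a sketch under assumptions that are not in the slides: at most one rule fires per clock (a conservative way to model the single port), the register file has 4 entries initialized to 0, the FIFOs are unbounded, and the names `rf3_reference`, `OnePortLIBDN` and `run_one_port` are hypothetical.

```python
from collections import deque

INPUTS = ("ra0", "ra1", "wen", "wa", "wd")


def rf3_reference(requests, size=4):
    """3-port register file SSM: two combinational reads and one optional write per cycle."""
    rf = [0] * size
    out = []
    for ra0, ra1, wen, wa, wd in requests:
        out.append((rf[ra0], rf[ra1]))
        if wen:
            rf[wa] = wd
    return out


class OnePortLIBDN:
    """LI-BDN that serves the 3-port interface using a register file with one port."""

    def __init__(self, size=4):
        self.rf = [0] * size
        self.q = {name: deque() for name in INPUTS + ("rd0", "rd1")}
        self.rd0_done = False
        self.rd1_done = False

    def tick(self):
        """One clock: assume at most one rule fires, so the port is used at most once."""
        q = self.q
        if not self.rd0_done and q["ra0"]:                # rule RD0
            q["rd0"].append(self.rf[q["ra0"][0]])
            self.rd0_done = True
        elif not self.rd1_done and q["ra1"]:              # rule RD1
            q["rd1"].append(self.rf[q["ra1"][0]])
            self.rd1_done = True
        elif self.rd0_done and self.rd1_done and all(q[n] for n in INPUTS):  # rule finish
            if q["wen"][0]:
                self.rf[q["wa"][0]] = q["wd"][0]
            for n in INPUTS:
                q[n].popleft()
            self.rd0_done = False
            self.rd1_done = False
        else:
            return False                                  # no rule enabled
        return True


def run_one_port(requests):
    """Enqueue all requests, then clock the LI-BDN until no rule can fire."""
    bdn = OnePortLIBDN()
    for request in requests:
        for name, value in zip(INPUTS, request):
            bdn.q[name].append(value)
    clocks = 0
    while bdn.tick():
        clocks += 1
    return list(zip(bdn.q["rd0"], bdn.q["rd1"])), clocks
```

Under the one-rule-per-clock assumption each model cycle takes three clocks in this sketch, yet the sequence of values on rd0 and rd1 is the same as for the 3-port register file. The timing has changed while the output history has not.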
Pipelining combinational circuits
[Figure: a combinational circuit with blocks f1, f2, f3 and wires a, b, c, d, e, shown before and after pipelining; the pipelined version adds elements labeled S1/R1, S2/R2, S3/R3]
Can potentially reduce the critical path of the entire circuit

Optimizing an LI-BDN
[Figure: a mux with ports a, b, c, d, shown before and after optimization]
• Does not wait for don’t-care inputs
• Counters are used to keep track of how many inputs to drop
• Can potentially increase the throughput

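The counter idea can be sketched in software. The model below assumes that c is the select input, a and b are the data inputs and d is the output (the port roles are an assumption, since only the labels survive from the figure), and that FIFOs are unbounded. The class `OptimizedMux` and its fields are hypothetical names.

```python
from collections import deque


class OptimizedMux:
    """Mux LI-BDN that emits d as soon as c and the selected input are present."""

    def __init__(self):
        self.c, self.a, self.b, self.d = deque(), deque(), deque(), deque()
        self.drop_a = 0   # number of not-yet-arrived or unconsumed a values to discard
        self.drop_b = 0   # same for b

    def step(self):
        fired = False
        # Discard don't-care inputs left over from cycles that were already answered.
        while self.drop_a and self.a:
            self.a.popleft()
            self.drop_a -= 1
            fired = True
        while self.drop_b and self.b:
            self.b.popleft()
            self.drop_b -= 1
            fired = True
        if self.c:
            if self.c[0] and self.drop_a == 0 and self.a:
                self.d.append(self.a.popleft())
                self.c.popleft()
                self.drop_b += 1          # the matching b value is a don't-care
                fired = True
            elif not self.c[0] and self.drop_b == 0 and self.b:
                self.d.append(self.b.popleft())
                self.c.popleft()
                self.drop_a += 1          # the matching a value is a don't-care
                fired = True
        return fired

    def run(self):
        while self.step():
            pass


def mux_demo():
    m = OptimizedMux()
    m.c.extend([True, True, False])
    m.a.extend([1, 2, 3])
    m.run()                               # b has not arrived yet
    before_b = list(m.d)
    m.b.extend([10, 20, 30])
    m.run()
    return before_b, list(m.d), len(m.a), len(m.b)
```

In `mux_demo` the first two outputs are produced before any b value arrives. When b finally shows up, the two stale values are dropped and the third is used, and the unused a value is discarded afterwards, so the input FIFOs end up empty as the self-cleaning property requires.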
Summary
Latency-insensitive BDNs allow true modular refinement of a system: even the timing contract of a module can be changed without affecting the rest of the system

A Design Flow issue
[Figure: processor pipeline. Labels shown: Fetch1, Fetch2, Branch Pred, Crack, Decode, Reg File, Addr Calc/Branch Resolve, ALU/Exception Handler, Mem1, Mem2, Register Write; feedback paths Exception, Branch Resolution, Branch Prediction; annotations: Pipelined Multiplier, Multicycle divider, Register file implemented as a BRAM]
• We can apply the technique discussed to refine this design
• But where does this design come from in the first place? Verilog? Verilog Compiler Output? Bluespec?

Design Flow Issues
• Generation of appropriate RTL is the major problem
• RTL / specifications should be written in such a way that they are amenable to refinements
Latency Insensitive Design Methodology

The PowerPC Project
Cycle-accurate modeling of PowerPC on FPGAs

PPC In-order Pipeline
[Figure labels: PC, Fetch, BrPred, Crack, Decode, RegRd, AddrCalc, BrRes, ALU, Excep, Mem1, Mem2, RegWr; I$/ITlb1, I$/ITlb2, D$/DTlb1, D$/DTlb2, Mem; stall, bypass, epochs]
• The designer specifies the FSM for each stage
• The FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of the FIFOs or the number of stages

The steps in Cycle-accurate implementation on FPGAs
• The specs are turned into Bluespec code to give a target SSM
• Once the size of the FIFOs is fixed, the whole design has a precise timing specification
• If the FPGA implementation requires refining some stages, then cuts are made in the design to isolate the stages (SSMs) to be refined
• Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of the model FIFOs of the SSM
• This converts the nth time cycle of the SSM into the nth enqueue into the input FIFOs and the nth dequeue from the output FIFOs
• Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced
• This also ensures deadlock-free operation
[Callout: Can be mechanized]

Initial results using XUPV5 FPGA

Detailed Preliminary Results
Asif Khan & Murali Vijayaraghavan (June 2009)
• Cycle-accurate refinements onto Xilinx XUPV5
  • Slice Logic Utilization:
    • Number of Slice Registers: 15448 out of 69120 (22%)
    • Number of Slice LUTs: 16702 out of 69120 (24%)
  • Specific Feature Utilization:
    • Number of Block RAM/FIFO: 1 out of 148 (0%) (only 1 BRAM for the register file)
    • Number of DSP48Es: 12 out of 64 (18%) (these are used for the divider)
  • Minimum period: 7.988 ns (Maximum Frequency: 125.188 MHz)
  • Partially verified by running a 50-instruction program
• Compared to Jessica has port onto Xilinx XUPV5
  • Takes up 92% of the area
  • 20 MHz 40 MHz
No numbers yet for actual work done

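The reported percentages and the maximum frequency follow from the raw counts and the minimum period. The short check below is only a sketch: it assumes the tool truncates percentages to whole numbers (an inference from 12 out of 64 being reported as 18% rather than 18.75%), and the function names are hypothetical.

```python
def utilization_percent(used, available):
    """Resource utilization as a whole-number percentage, truncated (assumed convention)."""
    return int(100 * used / available)


def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum period in nanoseconds."""
    return 1000.0 / min_period_ns


def summarize():
    return {
        "slice_registers": utilization_percent(15448, 69120),
        "slice_luts": utilization_percent(16702, 69120),
        "block_ram": utilization_percent(1, 148),
        "dsp48e": utilization_percent(12, 64),
        "fmax_mhz": round(max_frequency_mhz(7.988), 3),
    }
```

The same two formulas can be reused for any other synthesis report that lists used and available resources and a minimum period.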
Conclusion
• Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators
• BDNs offer a way to refine RTL without losing cycle-accuracy
• Bluespec makes quick RTL generation feasible
• The generation of BDNs can be automated
• We plan to release our Bluespec designs under open source licensing to strengthen the PowerPC ecosystem