Bluespec
Lectures 3 & 4, with some slides from Nikhil Rishiyur at Bluespec and Simon Moore at the University of Cambridge
Course Resources
• http://cas.ee.ic.ac.uk/~ssingh
• Lecture notes (PowerPoint, PDF)
• Example Bluespec programs used in lectures
• Complete Photoshop system (Bluespec)
• Links to Bluespec code samples
• User guide, reference guide: doc sub-directory of the Bluespec installation
• More information at http://bluespec.com
Rules, not clock edges
• Rules are atomic
• They execute within one clock cycle
• Structure:
    rule name (explicit condition);
        statements;
    endrule
• Conditions:
  • explicit: a Boolean expression supplied in the rule header
  • implicit: conditions that must be met before the statements can fire, e.g. a rule that calls fifo.enq can fire only if the FIFO is not full
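As a minimal sketch of explicit versus implicit conditions, the module below uses a FIFO from the standard FIFO package. The module name mkImplicitDemo, the rule names and the bound of 10 are assumptions made for illustration; they do not come from the lecture.

```bsv
import FIFO::*;

(* synthesize *)
module mkImplicitDemo (Empty);      // hypothetical module name
   FIFO#(int) fifo  <- mkFIFO;      // small FIFO from the standard library
   Reg#(int)  count <- mkReg (0);

   // Explicit condition: count < 10.
   // Implicit condition: fifo.enq is ready only when the FIFO is not full.
   rule produce (count < 10);
      fifo.enq (count);
      count <= count + 1;
   endrule

   // No explicit condition, but the implicit conditions of
   // fifo.first and fifo.deq (FIFO not empty) still guard the rule.
   rule consume;
      $display ("got %0d", fifo.first);
      fifo.deq;
   endrule
endmodule
```

Neither rule tests the FIFO's full or empty state itself; that check comes from the methods it calls.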
Rules: a powerful alternative to always blocks
• Rules for state updates instead of always blocks
• Simple concept: think if…then…
• A rule can execute (or “fire”) only when its conditions are TRUE
• Every rule is atomic with respect to other rules
• Powerful ramifications:
  • Executable specification: design around operations as described in specs
  • Atomicity of rules dramatically reduces concurrency bugs
  • Automates management of shared resources, which avoids many complex errors

    rule ruleName (<boolean cond>);
        <state update(s)>
    endrule
Bits, Bools and conversion
• Bit#(width): vector of bits
• Bool: single bit for Booleans (True, False)
• pack(): function to convert (pack) most things into a bit representation
• unpack(): opposite of pack()
• extend(): extend an integer (signed, unsigned, bits)
• truncate(): truncate an integer
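The conversions listed above can be tried out in a short test module. This is only a sketch: the module name mkConvDemo and the value 200 are chosen for illustration and are not part of the course material.

```bsv
(* synthesize *)
module mkConvDemo (Empty);               // hypothetical module name
   rule show;
      UInt#(8)  u      = 200;
      Bit#(8)   b      = pack (u);       // UInt#(8) -> Bit#(8)
      Int#(8)   s      = unpack (b);     // same bits, reinterpreted as signed
      UInt#(16) uWide  = extend (u);     // zero-extends because u is unsigned
      Int#(16)  sWide  = extend (s);     // sign-extends because s is signed
      Bit#(4)   low    = truncate (b);   // keeps the 4 least significant bits
      Bool      msbSet = unpack (b[7]);  // a single bit converted to Bool

      $display ("u=%0d s=%0d uWide=%0d sWide=%0d low=%h", u, s, uWide, sWide, low);
      $display ("msbSet=", fshow (msbSet));
      $finish (0);
   endrule
endmodule
```

The result types drive the conversions: extend and truncate take their target width from the declared type on the left-hand side.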
Reg and Bit/UInt/Int types
• Registers (initialised and uninitialised versions):
    Reg#(type) name0 <- mkReg(initial_value);
    Reg#(type) name1 <- mkRegU;
• Some types (unsigned and signed integer, and bits): UInt#(width), Int#(width), Bit#(width)
• Example:
    Reg#(UInt#(8)) counter <- mkReg(0);
    rule count_up;
        counter <= counter + 1;
    endrule
• Reading the declaration:
  • Reg: interface type
  • #(type): type parameter (e.g. UInt#(8)), since Reg is generic
  • mkReg, mkRegU: name of the module to “make” (i.e. instantiate)
• N.B. modules are typically prefixed “mk”
Registers
    interface Reg#(type a);
        method Action _write (a x1);
        method a _read ();
    endinterface: Reg
• Polymorphic
• Just library elements
• In one cycle, register reads must execute before register writes
• x <= y + 1 is syntactic sugar for x._write (y._read + 1)
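To make the syntactic sugar concrete, the sketch below performs the same update twice, once with <= and once by calling the Reg interface methods directly. The module and register names are hypothetical, chosen only for this illustration.

```bsv
(* synthesize *)
module mkSugarDemo (Empty);        // hypothetical module name
   Reg#(int) y <- mkReg (5);
   Reg#(int) a <- mkReg (0);
   Reg#(int) b <- mkReg (0);

   // Sugared form: implicit _read on the right, _write on the left.
   rule sugared;
      a <= y + 1;
   endrule

   // The same update with the interface methods spelled out.
   rule desugared;
      b._write (y._read () + 1);
   endrule

   // a and b should always hold the same value.
   rule show;
      $display ("a = %0d, b = %0d", a, b);
   endrule
endmodule
```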
Scheduling Annotations for a Register
• Two read methods are conflict-free (CF): multiple methods can read the same register in the same rule, sequenced in any order.
• A write is sequenced after (SA) a read.
• A read is sequenced before (SB) a write.
• If you have two write methods, one must be sequenced before the other, and they cannot be in the same rule, as indicated by the annotation SBR.
Updating Registers
    Reg#(int) x <- mkReg (0);
    rule countup (x < 30);
        int y = x + 1;
        x <= x + 1;
        $display ("x = %0d, y = %0d", x, y);
    endrule
Rules of Rules (The Three Basics)
• Rules are atomic
• Each rule either fires or does not, at most once per cycle
• Rules don’t conflict with other rules
    rule r1;
        x <= y + 1;
    endrule
    rule r2;
        y <= x + 1;
    endrule
[Figure: two D flip-flops, x and y, each with a +1 adder, sharing clk]
    (* synthesize *)
    module rules4 (Empty);
        Reg#(int) x <- mkReg (10);
        Reg#(int) y <- mkReg (100);
        rule r1; x <= y + 1; endrule
        rule r2; y <= x + 1; endrule
        rule monitor;
            $display ("x, y = %0d, %0d ", x, y);
        endrule
    endmodule
[Figure: two D flip-flops, x2 and y2, each with a +1 adder, sharing clk]
    $ ./rules4 -m 5
    x, y = 10, 100
    x, y = 10, 11
    x, y = 10, 11
    x, y = 10, 11
    (* synthesize *)
    module rules5 (Empty);
        Reg#(int) x <- mkReg (10);
        Reg#(int) y <- mkReg (100);
        rule r;
            x <= y + 1;
            y <= x + 1;
        endrule
        rule monitor;
            $display ("x, y = %0d, %0d ", x, y);
        endrule
    endmodule
[Figure: two D flip-flops, x and y, each with a +1 adder, sharing clk]
    $ ./rules5 -m 5
    x, y = 10, 100
    x, y = 101, 11
    x, y = 12, 102
    x, y = 103, 13
    (* synthesize *)
    module rules6 (Empty);
        Reg#(int) x <- mkReg (10);
        Reg#(int) y <- mkReg (100);
        rule r1; x <= y + 1; endrule
        rule r2; y <= x + 1; endrule
        (* descending_urgency = "r1, r2" *)
        rule monitor;
            $display ("x, y = %0d, %0d ", x, y);
        endrule
    endmodule
[Figure: two D flip-flops, x2 and y2, each with a +1 adder, sharing clk]
    $ ./rules6 -m 5
    x, y = 10, 100
    x, y = 101, 100
    x, y = 101, 100
    x, y = 101, 100
    interface Rules7_Interface;
        method int readValue;
        method Action setValue (int newXvalue);
        method ActionValue#(int) increment;
    endinterface

    (* synthesize *)
    module rules7 (Rules7_Interface);
        Reg#(int) x <- mkReg (0);

        method readValue;
            return x;
        endmethod

        method Action setValue (int newXvalue);
            x <= newXvalue;
        endmethod

        method ActionValue#(int) increment;
            x <= x + 1;
            return x;
        endmethod
    endmodule
    interface Rules7_Interface;
        (* always_ready *)
        method int readResult;
        (* always_enabled *)
        method Action setValues (int newX, int newY, int newZ);
    endinterface

    (* synthesize *)
    module rules7 (Rules7_Interface);
        Reg#(int)  x      <- mkReg (0);
        Reg#(int)  y      <- mkReg (0);
        Reg#(int)  z      <- mkReg (0);
        Reg#(int)  result <- mkRegU;
        Reg#(Bool) b      <- mkReg (False);

        rule toggle;  b <= !b;          endrule
        rule r1 (b);  result <= x * y;  endrule
        rule r2 (!b); result <= x * z;  endrule

        method readResult = result;

        method Action setValues (int newX, int newY, int newZ);
            x <= newX;
            y <= newY;
            z <= newZ;
        endmethod
    endmodule

Generated Verilog (excerpt):
    // remaining internal signals
    assign x_MUL_y___d8 = x * y ;
    assign x_MUL_z___d5 = x * z ;
    interface Rules8_Interface;
        (* always_ready *)
        method int readResult;
        (* always_enabled *)
        method Action setValues (int newX, int newY, int newZ);
    endinterface

    (* synthesize *)
    module rules8 (Rules8_Interface);
        Reg#(int)  x      <- mkReg (0);
        Reg#(int)  y      <- mkReg (0);
        Reg#(int)  z      <- mkReg (0);
        Wire#(int) t      <- mkWire;
        Reg#(int)  result <- mkRegU;
        Reg#(Bool) b      <- mkReg (False);

        rule toggle; b <= !b; endrule

        rule computeT;
            if (b)
                t <= y;
            else
                t <= z;
        endrule

        rule r1 (b); result <= x * t; endrule

        method readResult = result;

        method Action setValues (int newX, int newY, int newZ);
            x <= newX;
            y <= newY;
            z <= newZ;
        endmethod
    endmodule

Generated Verilog (excerpt):
    // inlined wires
    assign t$wget = b ? y : z ;
    …
    // remaining internal signals
    assign x_MUL_t_wget___d6 = x * t$wget ;
High Level Synthesis
• Most work on high level synthesis focuses on automating scheduling and allocation to achieve resource sharing.
• Perspective: high level synthesis in general applies to many aspects of converting high level descriptions into efficient circuits, but there has been an undue level of effort on resource sharing in an ASIC context.
• Bluespec automates many aspects of scheduling (it makes scheduling composable), but resource usage is under the explicit control of the designer.
• For FPGA-based design this is often a better fit as a programming model.
Simple example with concurrency and shared resources
[Figure: processes 0, 1 and 2, guarded by cond0, cond1 and cond2, acting on registers x and y with +1 and -1 updates]
• Process 0: increments register x when cond0
• Process 1: transfers a unit from register x to register y when cond1
• Process 2: decrements register y when cond2
• Each register can only be updated by one process on each clock. Process priority: 2 > 1 > 0
• Just like real applications, e.g.:
  • Bank account: 0 = deposit to checking, 1 = transfer from checking to savings, 2 = withdraw from savings
Resource-access scheduling logic, i.e., control logic
[Figure: processes 0, 1 and 2, guarded by cond0, cond1 and cond2, acting on registers x and y] Process priority: 2 > 1 > 0

First attempt:
    always @(posedge CLK) begin
        if (cond2)
            y <= y - 1;
        else if (cond1) begin
            y <= y + 1;
            x <= x - 1;
        end
        if (cond0 && !cond1)
            x <= x + 1;
    end

Better scheduling:
    always @(posedge CLK) begin
        if (cond2)
            y <= y - 1;
        else if (cond1) begin
            y <= y + 1;
            x <= x - 1;
        end
        if (cond0 && (!cond1 || cond2))
            x <= x + 1;
    end

* There are other ways to write this RTL, but all suffer from the same analysis.
Fundamentally, we are scheduling three potentially concurrent atomic transactions that share resources.
What if the priorities changed: cond1 > cond2 > cond0?
What if the processes are in different modules?
With Bluespec, the design is direct
[Figure: processes 0, 1 and 2, guarded by cond0, cond1 and cond2, acting on registers x and y] Process priority: 2 > 1 > 0

    (* descending_urgency = "proc2, proc1, proc0" *)
    rule proc0 (cond0);
        x <= x + 1;
    endrule
    rule proc1 (cond1);
        y <= y + 1;
        x <= x - 1;
    endrule
    rule proc2 (cond2);
        y <= y - 1;
    endrule

Hand-written RTL:
• Explicit scheduling
• Complex clutter, unmaintainable
BSV:
• Functional correctness follows directly from rule semantics (atomicity)
• Executable spec (operation-centric)
• Automatic handling of shared resource control logic
• Same hardware as the RTL
Now, let’s make a small change: add a new process and insert its priority
[Figure: processes 0, 1, 2 and 3, guarded by cond0 to cond3, acting on registers x and y; the new process 3 applies +2 and -2 updates] Process priority: 2 > 3 > 1 > 0
Changing the Bluespec design
[Figure: processes 0, 1, 2 and 3, guarded by cond0 to cond3, acting on registers x and y] Process priority: 2 > 3 > 1 > 0

Pre-change:
    (* descending_urgency = "proc2, proc1, proc0" *)
    rule proc0 (cond0);
        x <= x + 1;
    endrule
    rule proc1 (cond1);
        y <= y + 1;
        x <= x - 1;
    endrule
    rule proc2 (cond2);
        y <= y - 1;
    endrule

Post-change:
    (* descending_urgency = "proc2, proc3, proc1, proc0" *)
    rule proc0 (cond0);
        x <= x + 1;
    endrule
    rule proc1 (cond1);
        y <= y + 1;
        x <= x - 1;
    endrule
    rule proc2 (cond2);
        y <= y - 1;
        x <= x + 1;
    endrule
    rule proc3 (cond3);
        y <= y - 2;
        x <= x + 2;
    endrule
Changing the Verilog design
[Figure: processes 0, 1, 2 and 3, guarded by cond0 to cond3, acting on registers x and y] Process priority: 2 > 3 > 1 > 0

Pre-change:
    always @(posedge CLK) begin
        if (!cond2 && cond1)
            x <= x - 1;
        else if (cond0)
            x <= x + 1;
        if (cond2)
            y <= y - 1;
        else if (cond1)
            y <= y + 1;
    end

Post-change:
    always @(posedge CLK) begin
        if ((cond2 && cond0) || (cond0 && !cond1 && !cond3))
            x <= x + 1;
        else if (cond3 && !cond2)
            x <= x + 2;
        else if (cond1 && !cond2)
            x <= x - 1;
        if (cond2)
            y <= y - 1;
        else if (cond3)
            y <= y - 2;
        else if (cond1)
            y <= y + 1;
    end
Alternate RTL style (more common)
[Figure: processes 0, 1 and 2, guarded by cond0, cond1 and cond2, acting on registers x and y] Process priority: 2 > 1 > 0
• Combinatorial explosion
• Case 3'b111 is subtle
• Many repetitions of update actions (cut-and-paste errors)
  • cf. “WTO Principle” (Write Things Once, Gerard Berry)
• Difficult to maintain/extend
• Difficult to modularize

    always @ (posedge clk)
        case ({cond0, cond1, cond2})
            3'b000: begin // nothing happens
                x <= x;
                y <= y;
            end
            3'b001: begin // proc2 fires
                y <= y - 1;
            end
            3'b010: begin // proc1
                x <= x - 1;
                y <= y + 1;
            end
            3'b011: begin // proc2 fires (2>1)
                y <= y - 1;
            end
            3'b100: begin // proc0
                x <= x + 1;
            end
            3'b101: begin // proc2 + proc0
                x <= x + 1;
                y <= y - 1;
            end
            3'b110: begin // proc1 (1>0)
                x <= x - 1;
                y <= y + 1;
            end
            3'b111: begin // proc2 + proc0
                x <= x + 1; // NOTE: subtle!
                y <= y - 1;
            end
        endcase
Late Specifications
• Late specification changes and feature enhancements are challenging to deal with.
• Micro-architectural changes for timing/area/performance, e.g.:
  • Adding a pipeline stage to an existing pipeline
  • Adding a pipeline stage where pipelining was not anticipated
  • Spreading a calculation over more clocks (longer iteration)
  • Moving logic across a register stage (rebalancing)
  • Restructuring combinational clouds for shallower logic
• Fixing bugs
• Bluespec makes it easier to try out multiple macro/micro-architectures earlier in the design cycle
Why rule atomicity improves correctness
• Correctness is often couched (formally or informally) as an invariant
  • E.g., “# ingress packets - # egress packets == packet-count register value”
• Rule atomicity simplifies thinking about (and formally proving) invariants, because invariants can be verified one rule at a time
• In contrast, in RTL and thread models, one must think of all possible interleavings
  • cf. The Problem With Threads, Edward A. Lee, IEEE Computer 39(5), May 2006, pp. 33-42
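A hedged sketch of how such an invariant can be checked one atomic action at a time: the buffer below keeps a count register next to a FIFO. The interface PacketBuffer, the module mkPacketBuffer and the depth of 8 are assumptions made for illustration; the lecture does not define them.

```bsv
import FIFO::*;

interface PacketBuffer;             // hypothetical interface
   method Action put (int pkt);     // ingress
   method ActionValue#(int) get;    // egress
   method int occupancy;            // the packet-count register value
endinterface

(* synthesize *)
module mkPacketBuffer (PacketBuffer);
   FIFO#(int) q     <- mkSizedFIFO (8);
   Reg#(int)  count <- mkReg (0);

   // Atomic: a packet is enqueued and counted together, or not at all.
   method Action put (int pkt);
      q.enq (pkt);
      count <= count + 1;
   endmethod

   // Atomic: a packet is dequeued and uncounted together, or not at all.
   method ActionValue#(int) get;
      q.deq;
      count <= count - 1;
      return q.first;
   endmethod

   method int occupancy = count;
endmodule
```

To argue that the number of puts minus the number of gets always equals count, it is enough to check that put preserves the invariant and that get preserves it. Because each method body is atomic, no interleaving of the two needs to be considered. In this sketch put and get both read and write count, so the compiler should not schedule them in the same cycle.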
Bank Account: Key Benefits
• Executable specifications
• Rapid changes
• But, with fine-grained control of RTL:
  • Define the optimal architecture/micro-architecture
  • Debug at the source OR RTL level: the designer understands both
  • The Quality of Results (QoR) of RTL!
A more complex example, from CPU design
• Speculative, out-of-order
• Many, many concurrent activities
[Figure: CPU block diagram with Fetch, Decode, Register File, Re-Order Buffer (ROB), ALU Unit, MEM Unit, Branch, Instruction Memory and Data Memory, connected by FIFOs]
Many concurrent actions on common state: nightmare to manage explicitly
[Figure: Re-Order Buffer with Head and Tail pointers. Each entry has the fields State, Instruction, Operand 1, Operand 2 and Result; entry states are Empty (E) and Waiting (W). The Register File, Decode Unit, ALU Unit(s) and MEM Unit(s) surround the buffer.]
Concurrent actions on the buffer:
• Put an instr into ROB
• Get operands for instr
• Get a ready ALU instr
• Get a ready MEM instr
• Put ALU instr results in ROB
• Put MEM instr results in ROB
• Writeback results
• Resolve branches
• Insert Instr in ROB
  • Put instruction in first available slot
  • Increment tail pointer
  • Get source operands: RF <or> prev instr
• Dispatch Instr
  • Mark instruction dispatched
  • Forward to appropriate unit
• Write Back Results to ROB
  • Write back results to instr result
  • Write back to all waiting tags
  • Set to done
• Branch Resolution
  • …
  • …
  • …
• Commit Instr
  • Write results to register file (or allow memory write for store)
  • Set to Empty
  • Increment head pointer
In Bluespec…
• …you can code each operation in isolation, as a rule
• …the tool guarantees that operations are INTERLOCKED (i.e. each runs to completion without external interference)
Some Verilog solutions
[Figure: processes 0, 1 and 2, guarded by cond0, cond1 and cond2, acting on registers x and y] Process priority: 2 > 1 > 0

Candidate A:
    always @(posedge CLK) begin
        if (!cond2 || cond1)
            x <= x - 1;
        else if (cond0)
            x <= x + 1;
        if (cond2)
            y <= y - 1;
        else if (cond1)
            y <= y + 1;
    end

Candidate B:
    always @(posedge CLK) begin
        if (!cond2 && cond1)
            x <= x - 1;
        else if (cond0)
            x <= x + 1;
        if (cond2)
            y <= y - 1;
        else if (cond1)
            y <= y + 1;
    end

Which one is correct?
Functional code and scheduling code are deeply (inextricably) intertwined.
What’s required to verify that they’re correct?
What if the priorities changed: cond1 > cond2 > cond0?
What if the processes are in different modules?
Finite State Machines in Bluespec
For making composable, parallel, nested, suspendable/abortable FSMs:
• sequencing
• if-then-else
• sequential loops
• parallel FSMs (fork-join)
• hierarchy (with suspend and abort)
[Figure: a hierarchy of nested fsm blocks]
Enables exponentially smaller descriptions compared to flat FSMs
• Features:
  • FSMs automatically synthesized
  • Complex FSMs expressed succinctly
  • FSM actions have the same atomic semantics as BSV rule bodies
  • Well-behaved on shared resources, with no surprises
  • With standard BSV interfaces and BSV’s higher-order functions, you can write your own FSM generators
This powerful capability is enabled by higher-order functions, polymorphic types, advanced parameterization and atomic transactions
FSM example (from testbench stimulus section)
    Stmt s =
        seq
            action
                rand_packets0.init;
                rand_packets1.init;
            endaction
            par
                for (j0 <= 0; j0 < n; j0 <= j0 + 1)
                    action
                        let pkt0 <- rand_packets0.next;
                        switch.ports[0].put (pkt0);
                    endaction
                for (j1 <= 0; j1 < n; j1 <= j1 + 1)
                    action
                        let pkt1 <- rand_packets1.next;
                        switch.ports[1].put (pkt1);
                    endaction
            endpar
            drain_switch;
        endseq;

    FSM fsm <- mkFSM (s);

    rule go;
        fsm.start;
    endrule

Basic FSM statements are “Actions”, just like rule bodies, and have exactly the same atomic semantics. Thus, BSV FSMs are well-behaved with respect to concurrent resource contention and flow control.
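The testbench fragment relies on definitions that are not shown (rand_packets0, switch, drain_switch, n). As a self-contained sketch of the same seq/par/for structure, the module below uses the StmtFSM library with mkAutoFSM, which starts the FSM after reset and finishes the simulation when the sequence completes. The module name and loop bounds are made up for illustration.

```bsv
import StmtFSM::*;

(* synthesize *)
module mkFsmDemo (Empty);           // hypothetical module name
   Reg#(int) i <- mkReg (0);
   Reg#(int) j <- mkReg (0);

   Stmt s =
      seq
         $display ("start");
         par                        // the two loops run as parallel FSMs
            for (i <= 0; i < 3; i <= i + 1)
               $display ("i = %0d", i);
            for (j <= 0; j < 2; j <= j + 1)
               $display ("j = %0d", j);
         endpar                     // join: wait for both loops
         $display ("done");
      endseq;

   // Runs s once after reset, then calls $finish.
   mkAutoFSM (s);
endmodule
```

Using mkFSM with an explicit rule that calls fsm.start, as in the testbench fragment, gives control over when the sequence begins; mkAutoFSM is used here only to keep the sketch short.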
Strong support for multiple clock and reset domains
• Rich and mature support for MCD (multiple clock domains and resets)
• Clock is a first-class data type
  • Cannot accidentally mix clocks and ordinary signals
• Strong static checking ensures that it is impossible to accidentally cross clock domain boundaries (i.e., without a synchronizer)
  • No need for linting tools to check domain discipline
• Clock manipulation
  • Clocks can be passed in and out of module interfaces
  • Library of clock dividers and other transformations
  • Module instantiation can specify an alternative clock (instead of inheriting the parent’s default clock)
• (Similarly: Reset and reset domains)
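Synthesis of Atomic Actions
[Figure: rule synthesis datapath. The state is read by the predicate logic and the update functions; a scheduler sits between the predicates (p1, p2, p3) and the enabled rules (f1, f2, f3); a selector chooses among the potential updates (d1, d2, d3) and updates the state]
• Compute predicates for each rule: predicates (p1, p2, p3) are computed for each rule with a combinational circuit
• Scheduler: select a maximal subset of applicable rules, giving the enabled rules (f1, f2, f3)
• Compute next state for each rule: potential update functions (d1, d2, d3)
• Selector: muxes and priority encoders update the state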
Key Issue: How to select the maximal subset of rules for firing?
• Two rules R1 and R2 can execute simultaneously if they are “conflict free”, i.e.
  • R1 and R2 do not update the same state; and
  • Neither R1 nor R2 reads state that the other updates (“sequentially composable” rules)
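A small sketch of these two criteria, with hypothetical rule and register names:

```bsv
(* synthesize *)
module mkConflictFreeDemo (Empty);  // hypothetical module name
   Reg#(int) a <- mkReg (0);
   Reg#(int) b <- mkReg (0);
   Reg#(int) c <- mkReg (0);

   // incA and incB touch disjoint state, so they are conflict free
   // and can fire in the same cycle in either order.
   rule incA;
      a <= a + 1;
   endrule

   rule incB;
      b <= b + 2;
   endrule

   // copyA reads a, which incA updates, so the pair is not conflict
   // free in the sense above. The two rules can still share a cycle
   // if copyA is ordered before incA (register reads before writes).
   rule copyA;
      c <= a;
   endrule
endmodule
```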
Rules of Rules (The Details 1-5/10)
1. Rules are atomic: rules fire completely or not at all, and you can imagine that nothing else happens during their execution.
2. Explicit and implicit conditions may prevent rules from firing.
3. Every rule fires exactly 0 or 1 times every cycle (at this point in our product's history anyway ;)
4. Rules that conflict in some way may fire together in the same cycle, but only if the compiler can schedule them in a valid order to do so, that is, where the overall effect is as if they had happened one at a time as in (1) above.
5. Rules determine if they are going to fire or not before they actually do so. They are considered in their order of "urgency" (by a "greedy algorithm"): they "will fire" if they "can fire" and are not prevented by a conflict with a rule which has been selected already. It's OK to think of this phase as being completed (except for wires) before any rules are actually executed. This is what "urgency" is about.
Rules of Rules (The Details 6-10/10)
6. After determining which rules are going to fire, the simulator can then schedule their execution. (In hardware it's all done by combinational logic which has the same effect.) Rules do not need to execute in the same order as they were considered for deciding whether they "will fire". For example, rule1 can have a higher urgency than rule2, but it is possible that rule2 executes its logic before rule1. Urgency is used to determine which rules "will fire". Earliness defines the order they fire in.
7. All reads from a register must be scheduled before any writes to the same register: any rule which reads from a register must be scheduled "earlier" than any other rule which writes to it.
8. Constants may be "read" at any time; a register *might* have a write but no read.
9. The compiler creates a sequence of steps, where each step is essentially a rule firing. Its inputs are valid at the beginning of the cycle, its outputs are valid at the end of the cycle. Data is not allowed to be driven "backwards" in the schedule: that is, no action may influence any action that happened "earlier" in the cycle. This would go against causality, and constitutes a "feedback" path that the compiler will not allow.
10. If the compiler is not told otherwise, methods have higher urgency than rules, and will execute earlier than rules, unless there's some reason to the contrary. There is a compiler switch to flip this around and make rules have higher urgency.
The Swap Conundrum
    (* synthesize *)
    module rules9 (Empty);
        Reg#(int) x <- mkReg (12);
        Reg#(int) y <- mkReg (17);
        rule r1; x <= y; endrule
        rule r2; y <= x; endrule
        rule monitor;
            $display ("x, y = %0d, %0d", x, y);
        endrule
    endmodule

    $ ./rules9 -m 5
    x, y = 12, 17
    x, y = 12, 12
    x, y = 12, 12
    x, y = 12, 12
The Swap Conundrum (continued)
PROBLEM: within a cycle, a register must be read before it is written. In rules9, r1 reads y and writes x, so r1 would have to execute before r2 writes y; r2 reads x and writes y, so r2 would have to execute before r1 writes x. No order satisfies both constraints, so r1 and r2 cannot fire in the same cycle. In the run shown, x never changes: only r2 fires.
    (* synthesize *)
    module rules10 (Empty);
        Reg#(int) x <- mkReg (12);
        Reg#(int) y <- mkReg (17);
        rule r;
            x <= y;
            y <= x;
        endrule
        rule monitor;
            $display ("x, y = %0d, %0d", x, y);
        endrule
    endmodule

    $ ./rules10 -m 5
    x, y = 12, 17
    x, y = 17, 12
    x, y = 12, 17
    x, y = 17, 12

Schedule-wise, step 1 reads x and y at the beginning and writes x and y at the end.
Wires
• In Bluespec, from a scheduling perspective, registers and wires are dual concepts.
• In one cycle, all register reads must execute before register writes.
• In one cycle, a wire must be written to (at most once) before it is read (any number of times).
Rules of Wires
• Wires truly become wires in hardware: they do not save “state” between cycles (compare to signal in VHDL).
• A wire’s schedule requires that it be written before it is read (as opposed to a register, which is read before it is written).
• A wire cannot be written more than once in a cycle.
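As a sketch of these rules, the module below drives two wires on alternate cycles: one created with mkWire, whose readers fire only in cycles where it was written, and one created with mkDWire, which supplies a default value when it was not. The module name, the default of -1 and the stopping point are assumptions made for illustration.

```bsv
(* synthesize *)
module mkWireDemo (Empty);          // hypothetical module name
   Reg#(int)  n    <- mkReg (0);
   Reg#(Bool) even <- mkReg (True);
   Wire#(int) w    <- mkWire;       // reading requires a write in this cycle
   Wire#(int) dw   <- mkDWire (-1); // reads as -1 when not written

   rule tick;
      n    <= n + 1;
      even <= !even;
   endrule

   // Writes each wire once, on every other cycle.
   rule drive (even);
      w  <= n;
      dw <= n;
   endrule

   // Implicit condition: fires only in cycles where drive wrote w.
   rule readW;
      $display ("w  = %0d", w);
   endrule

   // Always ready: sees the written value or the default.
   rule readDW;
      $display ("dw = %0d", dw);
   endrule

   rule stop (n == 5);
      $finish (0);
   endrule
endmodule
```

Neither wire holds its value into the next cycle. A value that must survive a clock edge belongs in a register.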
    (* synthesize *)
    module rules11 (Empty);
        Reg#(int)  x     <- mkReg (12);
        Reg#(int)  y     <- mkReg (17);
        Wire#(int) xwire <- mkWire;
        rule r1;     x <= y;      endrule
        rule r2;     y <= xwire;  endrule
        rule driveX; xwire <= x;  endrule
        rule monitor;
            $display ("x, y = %0d, %0d", x, y);
        endrule
    endmodule

    $ ./rules11 -m 5
    x, y = 12, 17
    x, y = 17, 12
    x, y = 12, 17
    x, y = 17, 12
Generated schedule for rules11 (same source as on the previous slide):
    $ cat rules11.sched
    === Generated schedule for rules11 ===

    Rule schedule
    -------------
    Rule: monitor
    Predicate: True
    Blocking rules: (none)

    Rule: driveX
    Predicate: True
    Blocking rules: (none)

    Rule: r2
    Predicate: xwire.whas
    Blocking rules: (none)

    Rule: r1
    Predicate: True
    Blocking rules: (none)

    Logical execution order: monitor, driveX, r1, r2
    =======================================

Question: is monitor, driveX, r2, r1 a valid schedule?