350 likes | 493 Views
A Synthesis Algorithm for Modular Design of Pipelined Circuits. Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology. Overall Goal. Efficient, Synchronous, Parallel Implementation in Synthesizable Verilog. Modular,
E N D
A Synthesis Algorithm for Modular Design of Pipelined Circuits Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology
Overall Goal Efficient, Synchronous, Parallel Implementation in Synthesizable Verilog Modular, Asynchronous, Sequential Specification
... ... Example: Specification PC inc inc RF wen IM br RESET RESET Register Operand Fetch Compute and Writeback Instruction Fetch
Specification Properties • Basic Concepts • State • Registers and Memories • Conceptually Infinite Queues • Modules (state transformers) • Queues Provide Modularity • Decouple Modules • Enable Independent Development • Promote Reusable Modular Designs
Example: Implementation PC inc inc RF wen IM br RESET RESET Register Operand Fetch Compute and Writeback Instruction Fetch
Implementation Issues • Synthesizing Efficient Combinational Logic • Queue Finitization • Synchronous Global Scheduling
Type Declarations type reg = int(3), val = int(8), loc = int(8); type ins = <INC reg> | <JZ reg loc>; type irf = <INC reg val> | <JZ val loc>; State Declarations var PC: loc, IM: ins[N], RF: val[8]; var IQ: queue(ins), RQ: queue(irf);
Modules • Each module is set of update rules • Each Update Rule Consists of • Precondition • Action (set of updates) • Rule is enabled (and can execute) if precondition is true in current state • When rule executes, atomically applies updates in action to produce new state
Update Rules And Modules • Instruction Fetch Module TRUE iq = append(iq,im[pc]), pc = pc + 1; • Register Operand Fetch Module <INC r> = head(iq) and notin(rq, <INC r _>) iq = tail(iq), rq = append(rq, <INC r rf[r]>); <JZ r l> = head(iq) and notin(rq, <INC r _>) iq = tail(iq), rq = append(rq, <JZ rf[r] l>);
Update Rules And Modules • Compute and Writeback Module <INC r v> = head(rq) rf = rf[r = v+1], rq = tail(rq); <JZ v l> = head(rq) and (v == 0) pc = l, iq = nil, rq = nil; <JZ v l> = head(rq) and (v !=0) rq = tail(rq);
Abstract Model of Execution • Conceptually, system execution is a sequence of rule executions • while TRUE choose an enabled rule execute rule obtain new state • Concepts in Abstract Execution Model • Rules execute atomically • Rules execute asynchronously • Rules execute sequentially
Synthesis Algorithm • Starting Point • Asynchronous, sequential abstract execution • Conceptually infinite queues • Goal: efficient synchronous global schedule • In each clock cycle, multiple rules execute • Synchronously and Concurrently • (pipeline stages move together) • Implement each queue with a finite hardware buffer • Can read and write buffer in same cycle
Basic Idea • At Each Clock Cycle • Check each rule to see if enabled • If so, atomically update state to reflect execution • For each variable, generate expression that specifies new value at end of cycle • Challenge: sequential, atomic semantics for rules • Solution: symbolic rule execution
Final Result for PC if (<JZ v location> = rq and v == 0) new pc = location else if ((iq == nil) or (<INC r> = iq and <INC r _> != rq) or (<JZ r l> = iq and <INC r _> != rq)) new pc = pc+1 else new pc = pc
Algorithm Outline • Rule Numbering: for symbolic execution • Relaxation: shorten critical path by testing intial state • Queue Finitization: ensure rules execute only if will be room for the result in output queues • Symbolic Execution • Optimizations • Synthesizable Verilog Generation: from optimized expressions
Rule Numbering • Goal: • Resolve Conflicts Between Parallel Rule Executions • Approach: • For each state variable, number versions according to order • Feed results of previous rule into next rule
TRUE iq1 = append(iq0, im[pc0]), pc1 = pc0+1; • <INC r> = head(iq1) and notin(rq1,<INC r _>) iq2 = tail(iq1), rq2 = append(rq1, <INC r rf1[r]>); • <JZ r l> = head(iq2) and notin(rq2, <INC r _>) iq3 = tail(iq2), rq3 = append(rq2, <JZ rf2[r] l>); • <INC r v> = head(rq3) rf4 = rf3[r v+1], rq4 = tail(rq3); • <JZ v l> = head(rq4) and v != 0 rq5 = tail(rq4); • <JZ v l> = head(rq5) and v == 0 pc6 = l, rq6 = nil, iq6 = nil;
Relaxation • Issue: rule numbering may produce long clock cycle • Solution:for each rule Ri with precondition Pi for each variable instance vi in precondition Pi replace vi with its earliest safe version ... Rk-1: Pk-1 -> vk = ... ... Ri : Pi(vi,...) -> ... ... • vk safe for viif either • Pi[vk/vi] implies Pi • (Pi,Pk-1) mutually exclusive 0 1 2 => 0 1 2 3 3
Relaxation Result • Queues separate pipeline stages • Items traverse one stage per clock cycle • Safety: If a rule executes in new system • Then it also executes in old system • And it generates same result • Liveness: After relaxation, all rules test initial state • If rule enabled in old system but not in new system, then • Some rule executes in new system
TRUE iq1 = append(iq0, im[pc0]), pc1 = pc0+1; • <INC r> = head(iq0) and notin(rq0,<INC r _>) iq2 = tail(iq1), rq2 = append(rq1, <INC r rf1[r]>); • <JZ r l> = head(iq0) and notin(rq0, <INC r _>) iq3 = tail(iq2), rq3 = append(rq2, <JZ rf2[r] l>); • <INC r v> = head(rq0) rf4 = rf3[r v+1], rq4 = tail(rq3); • <JZ v l> = head(rq0) and v != 0 rq5 = tail(rq4); • <JZ v l> = head(rq0) and v == 0 pc6 = l, rq6 = nil, iq6 = nil;
Queue Finitization • Issue: • Conceptually unbounded queues • Finite hardware buffers • Assumption: queues start within length at beginning of cycle • Goal: generate circuit that makes queues remainwithin length at end of cycle • Basic Approach: • Before enabled rule executes • Be sure will be room for result in output queues at end of clock cycle
2 4 5 3 6 Queue Finitization Algorithm • Build Producer-Consumer Graph • Nodes are rules • Edge between rules if first inserts into queue and second removes from queue • In Example: 1
Acyclic Graphs • Process Rules in Topological Sort Order • Augment execution precondition • If rule inserts into a queue, require that either • there is room in queue when rule executes or • future rules will execute and remove items to make room in queue • Each queue has counter of number of elements in queue at start of cycle • Combinational logic tracks queue insertions and deletions
Example Instruction Fetch Rule After Queue Finitization Empty(iq0) or (<JZ v l> = head(rq0) and v == 0) or <INC r> = head(iq0) and notin(rq0,<INC r _>) or <JZ r l> = head(iq0) and notin(rq0, <INC r _>) iq1 = append(iq0, im[pc0]); pc1 = pc0+1;
Pipeline Implications • Counter becomes presence bit for single element queues • Additional preconditions can be viewed as pipeline stall logic • Design can be written to generate pipeline forwarding/bypassing instead of stall
Cyclic Graphs • Cyclic Graphs lead to Cyclic Dependences • Rule 1 depends on rule 2 to remove an item from a queue • But rule 2 depends on rule 1 to remove an item from another queue • Algorithm from acyclic case would generate recursive preconditions rule 1 rule 2
Solution to Cyclic Dependence Problem • Groups of rules must execute together • Use depth-first search on producer-consumer graph to find cyclic groups • Augment preconditions to allow all rules in cycle to execute together • Extensions include paths into and out of cyclic group
Symbolic Execution • Substitute out all intermediate versions of variables • Obtain expression for last version of each variable • Each expression defines new value of corresponding variable
Optimizations • Optimize expressions from symbolic execution • CSE: avoid unnecessary replication of HW • Mutual Exclusion Testing: • Eliminate computation of values that never occur in practice as result of mutually exclusive preconditions
Symbolic Execution with Optimization Final result for rq, assuming single item queues if <JZ v l> = head(rq) and v==0 new rq = nil else if <INC r>=head(iq) and notin(rq,<INC r _>) new rq = <INC r rf[r]> else if <JZ r l>=head(iq) and notin(rq,<INC r _>) new rq = <JZ rf[r] l> else new rq = nil
Verilog Generation • Synthesize HW directly from expressions: • Each queue as one or more registers • Each memory variable as library block • Each state variable as one or more registers, depending on type • Each expression as combinational logic that feeds back into corresponding registers
Experimental Results • We have implemented synthesis system • Used system to generate synthesizable Verilog for several specifications (map effort medium, area effort low, constraints 10ns) Benchmark Cycle Time Area (cells) Bubblesort 9.34ns ~370 Butterfly 9.57ns ~412 Processor 11.28ns ~387 Filter 9.51ns ~252
Conclusion • Starting Point: (Good for Designer) Modular, Asynchronous, Sequential Specification with Conceptually Infinite Queues • Ending Point: (Good for Implementation) Efficient, Synchronous, Globally Scheduled, Parallel Implementation with Finite Queues in Synthesizable Verilog • Variety of Techniques: • Symbolic Execution • Queue Finitization