Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining
Alexander Smirnov, Alexander Taubin, Mark Karpovsky, Leonid Rozenblyum
Presentation goals • Present an overview of the synthesis framework • Demonstrate a high-level pipeline model • Demonstrate the synthesis correctness • Illustrate how the correctness is guaranteed • Present experimental results • Conclusions • Future work
Objective • Industrial-quality EDA flow for automated synthesis of fine-grain pipelined robust circuits from high-level specifications • Industrial quality • Easy to integrate into an RTL-oriented environment • Capable of handling very large designs – scalability • Automated fine-grain pipelining • To achieve high performance (throughput) • Automated to reduce design time
Choice of paradigm • Synchronous RTL • 8 logic levels per stage is the limit • Due to register, clock skew, and jitter overhead • Timing closure • No pipelining automation available – stage balancing is difficult • Performance limitations • To guarantee correctness under process variation, etc. • Asynchronous GTL • Lower design time • Automated pipelining possible from an RTL specification • Higher performance • Gate-level (finest possible) pipelining achievable • Controllable power consumption • Smoothly slows down when the supply voltage is reduced • Improved yield • Correct operation regardless of variations
Easy integration & scalability: Weaver flow architecture • RTL tools reuse • Creates the impression that nothing has changed • Saves development effort • Substitution based transformations • Linear complexity • Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single rail: virtual) libraries
Easy integration & scalability: Weaver flow architecture • Synthesis flow • Interfacing with host synthesis engine • Transforming Synchronous RTL to Asynchronous GTL – Weaving • Dedicated library(ies) • Dual-rail encoded data logic • Cells comprising entire stages • Internal delay assumptions only
Automated fine-grain pipelining: Gate Transfer Level (GTL) • Gate-level pipeline: replace the combinational-logic-plus-register (REG) stage structure with gates that communicate asynchronously and independently • Many pipeline styles can be used • Templates already exist
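To make the GTL idea concrete, here is a minimal sketch (not the Weaver tool's actual code) of the dual-rail (1-of-2) encoding that QDI gate-level pipelines typically use: each logical bit travels on two rails, and an all-zero NULL spacer separates successive data tokens. The `dr_and` function models a dual-rail AND at the wavefront level only, ignoring gate-level completion-detection timing.

```python
# Illustrative dual-rail (1-of-2) encoding for QDI gate-level pipelines.
# (0, 0) is the NULL spacer that separates successive data wavefronts.
NULL = (0, 0)

def encode(bit):
    """Map a logical bit to its dual-rail codeword."""
    return (0, 1) if bit else (1, 0)

def decode(rails):
    """Map a dual-rail codeword back to a bit, or None for NULL."""
    if rails == NULL:
        return None
    return {(1, 0): 0, (0, 1): 1}[rails]

def dr_and(a, b):
    """Dual-rail AND, modeled at the wavefront level: data out only when
    both inputs carry data; NULL out when either input is still NULL."""
    if a == NULL or b == NULL:
        return NULL
    return encode(decode(a) & decode(b))

# Each data wavefront must be separated from the next by a NULL wavefront:
stream = [encode(1), NULL, encode(0), NULL]
```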
Weaving • Critical transformations • Mapping combinational gates (basic weaving) • Mapping sequential gates • Initialization preserving liveness and safeness • Optimizations • Performance optimization • Fine-grain pipelining (natural) • Slack matching • Area optimization • Optimizing out identity function stages
Basic Weaving • De Morgan transformation • Dual-rail expansion • Gate substitution • Generating req/ack signals • Merge insertion • Fork insertion • Reset routing
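A hypothetical sketch (illustrative names and data structures, not the Weaver tool's actual code) of the gate-substitution step: every single-rail (SR) gate is swapped for a functionally equivalent dual-rail (DR) pipeline-stage cell from a matched library. After the De Morgan transformation pushes inverters to the wires, an inverter needs no cell at all, because in dual-rail encoding NOT is just a swap of the two rails.

```python
# Hypothetical SR-to-DR cell mapping (names are illustrative).
SR_TO_DR = {"AND2": "DR_AND2", "OR2": "DR_OR2", "XOR2": "DR_XOR2"}

def weave(netlist):
    """Substitute DR pipeline-stage cells for SR gates.
    netlist: list of (gate_type, inputs, output) tuples."""
    woven = []
    for gtype, ins, out in netlist:
        if gtype == "INV":
            # A dual-rail inverter is just a rail swap -- no cell needed,
            # which is why inverters are pushed to the wires first.
            woven.append(("RAIL_SWAP", ins, out))
        else:
            woven.append((SR_TO_DR[gtype], ins, out))
    return woven
```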
Linear pipeline • Pipeline PN model with global synchronization • Pipeline PN (PPN) model with local handshake • The PPN models asynchronous full-buffer pipelines
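A toy simulation (illustrative only, not the paper's formal PPN semantics) of a linear full-buffer pipeline with local handshakes: a stage hands its token to the next stage whenever that stage is empty, with no global clock. In a full-buffer pipeline every stage can hold a token at once, so adjacent occupied stages can all advance in the same round.

```python
# Linear full-buffer pipeline with local handshakes (sketch).
EMPTY = None

def step(stages):
    """Fire every enabled local handshake once, scanning from the output
    end so that a freed stage can accept its predecessor's token."""
    stages = list(stages)
    for i in range(len(stages) - 1, 0, -1):
        if stages[i] is EMPTY and stages[i - 1] is not EMPTY:
            stages[i] = stages[i - 1]
            stages[i - 1] = EMPTY
    return stages

pipe = ["t1", EMPTY, EMPTY]
pipe = step(pipe)  # the token advances by local handshakes only
```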
Linear pipeline • RTL implementation • GTL implementation
Correctness • Safeness • Guarantees that the number of data portions (tokens) stays the same over time • Liveness • Guarantees that the system operates continuously • Flow equivalence • In both RTL and GTL implementations, corresponding sequential elements hold the same data values • On the same iterations (order-wise) • For the same input stream
Non-linear pipelines • Deterministic token flow • Broadcasting tokens to all channels at forks • Synchronizing at merges • Data-dependent token flow • Ctrl is also a dual-rail channel • To guarantee liveness, MUXes need to match deMUXes – computationally hard
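The deterministic-flow discipline above can be sketched as two small primitives (illustrative, with `None` standing for "no token yet" on a channel): forks broadcast a copy of the token to every output channel, and merges (joins) synchronize, firing only once every input channel carries a token.

```python
# Deterministic token flow at forks and merges (sketch).
def fork(token, n_channels):
    """Broadcast: every output channel receives a copy of the token."""
    return [token] * n_channels

def merge(channels):
    """Synchronizing join: fires only when all inputs carry tokens."""
    if any(c is None for c in channels):
        return None         # wait: not every token has arrived yet
    return tuple(channels)  # consume one token from every channel
```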
Non-linear pipeline liveness • Currently guaranteed only for deterministic token flow, by construction (weaving) • A marking of a marked graph is live if each directed PN circuit has a marker • Linear closed pipelines can be considered instead
Closed linear PPN • Every PPN “stage” is a circuit and has a marker by definition • Each implementation loop forms two directed circuits • Forward – has at least one token, inferred for a DFF • Feedback – has at least one NULL, inferred from CL or added explicitly
Closed linear PPN pipeline is live iff (for full-buffer pipelines) • Every loop has at least 2 stages • Token capacity for any loop: 1 ≤ C ≤ N − 1 • Assumption made: every loop in a synchronous circuit has a DFF • A loop with no CL is meaningless • Liveness conditions hold by construction (weaving)
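The liveness condition above restates directly as a predicate (a sketch; the per-loop stage and token counts are assumed to have been extracted from the netlist already):

```python
def loop_is_live(n_stages, n_tokens):
    """A closed full-buffer loop of N stages holding C tokens is live iff
    N >= 2 and 1 <= C <= N - 1: at least one token keeps the forward
    circuit marked, and at least one NULL (empty stage) keeps the
    feedback circuit marked."""
    return n_stages >= 2 and 1 <= n_tokens <= n_stages - 1
```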
Initialization: FSM example (figure: a ring of half-buffer (HB) stages)
Flow equivalence • GTL data flow structure is equivalent to the source RTL by weaving • No data dependencies are removed • No additional dependencies introduced • In deterministic flow architecture • There are no token races (tokens cannot pass each other) • All forks are broadcast and all joins are synchronizers • Flow equivalence preserved by construction
Flow equivalence • GTL initialization is the same as in RTL • Token propagation, however, is independent (animation: numbered tokens, separated by NULL (N) spacers in GTL, advance through both pipelines) • In GTL “3” hits the first top register output • In GTL “3” hits the first bottom register output • In GTL “2” hits the second register output • In RTL “3” and “2” moved one stage ahead • The timing is independent, but the order is unchanged
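What flow equivalence asserts can be sketched as a trace check (illustrative; it assumes, as in the slides' animation, that successive tokens carry distinct sequence numbers): the sequence of values observed at each register is identical in RTL and GTL, only the timing differs.

```python
# Flow-equivalence check on one register's trace (sketch).
def observed_sequence(trace):
    """Collapse a timed trace into its value order: drop NULL spacers
    (None) and consecutive repeats of a value that is merely being held.
    Assumes successive tokens are distinguishable (e.g. numbered)."""
    seq = []
    for v in trace:
        if v is not None and (not seq or seq[-1] != v):
            seq.append(v)
    return seq

rtl_trace = [1, 2, 3]                       # one new value per clock
gtl_trace = [None, 1, 1, None, 2, None, 3]  # NULL-separated, timing varies
assert observed_sequence(rtl_trace) == observed_sequence(gtl_trace)
```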
Optimizations • Area • Optimizing out identity function stages • Performance • Fine-grain pipelining (natural) • Slack matching
Optimizing out identity function stages • Identity function stages (buffers) are inferred for clocked DFFs and D-latches • Implement no functionality • Can be removed as long as • The token capacity is not decreased below the RTL level • The resulting circuit can still be properly initialized
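The two removal conditions above can be sketched as a bound (illustrative; the per-loop stage, token, and capacity counts are assumed precomputed): buffers may be dropped from a full-buffer loop only while its token capacity (N − 1) stays at or above the RTL's requirement and the loop can still hold its initial tokens plus at least one NULL.

```python
def removable_buffers(n_stages, n_tokens, rtl_capacity):
    """How many identity (buffer) stages can be removed from a closed
    full-buffer loop of N stages holding C initial tokens, keeping
    liveness (N >= 2 and C <= N - 1) and capacity (N - 1 >= rtl_capacity)."""
    min_stages = max(2, n_tokens + 1, rtl_capacity + 1)
    return max(0, n_stages - min_stages)
```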
Optimizing out identity function stages: example • Final implementation is the same as if the RTL had not been pipelined (except for initialization) • Saves pipelining effort (figure: DFF/CL pipeline before weaving and the HB-stage pipeline after buffer removal)
Slack matching implementation • Adjusting the pipeline slack to optimize its throughput • Implementation • Leveling gates according to their shortest paths from primary inputs (outputs) • Inserting buffer stages to break long dependencies • Buffer stages initialized to NULL • Currently performed only for circuits with no loops • Complexity: O(|X|·|C|²) • |X| – the number of primary inputs • |C| – the number of connection points in the netlist
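A sketch of the buffer-insertion idea (data structures are illustrative, and for brevity this uses simple longest-path leveling by relaxation, whereas the slides describe leveling by shortest paths from primary inputs/outputs): assign each gate a depth, then pad every edge that skips levels with NULL-initialized buffer stages so reconvergent paths carry equal slack.

```python
def slack_match(edges, inputs):
    """edges: list of (src, dst) wires in an acyclic netlist;
    inputs: primary input names (level 0).
    Returns gate levels and the buffer stages to insert."""
    level = {x: 0 for x in inputs}
    # Longest-path leveling by relaxation (terminates on a DAG).
    changed = True
    while changed:
        changed = False
        for s, d in edges:
            if s in level and level.get(d, 0) < level[s] + 1:
                level[d] = level[s] + 1
                changed = True
    buffers = []
    for s, d in edges:
        # An edge spanning k > 1 levels gets k - 1 buffer stages, each
        # initialized to NULL so no extra tokens are introduced.
        buffers.extend((s, d) for _ in range(level[d] - level[s] - 1))
    return level, buffers
```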
Slack matching correctness • Increases the token capacity • Potentially increases performance • Does not affect the number of initial tokens • Liveness is not affected • Does not affect the system structure • The flow equivalence is not affected
Experimental results: MCNC • On average ~4x better performance • RTL implementation: not pipelined • GTL implementation: naturally fine-grain pipelined, slack matching performed • Both implementations obtained automatically from the same VHDL behavioral specification
Experimental results: AES • ~36x better performance • ~12x larger
Base line • Demonstrated automatic synthesis of • QDI (robust to variations) • automatically gate-level pipelined • implementations from large behavioral specifications • Synthesis run time comparable with RTL synthesis (~2.5x slower) – design time could be reduced • Resulting circuits feature • increased performance (depth-dependent, ~4x for MCNC) • area overhead • Practical solution – first prerelease at http://async.bu.edu/weaver/ • Demonstrated correctness of transformations (weaving)