Heuristic-based throughput analysis and optimization of asynchronous pipelines. Alexander Smirnov, Alexander Taubin
Goals and assumptions • Goals: determine • the max throughput • the causes of the throughput limit • the max achievable throughput • the cost of achieving a given throughput level • Assumptions: • data-independent token flow • no early evaluation • DEMUXes send data all ways • cells across the library/design implement the same handshaking protocol
Outline • Previous work • Cell characterization • Protocol characterization • Throughput of asynchronous pipelines (reminder) • Throughput analysis • Throughput optimization
Previous work • Early works on the throughput of async. pipelines: M. Greenstreet, K. Steiglitz; T. Williams; A. Lines • Time separation of events (TSE) based approaches to throughput analysis: T. Amon, H. Hulgaard, S. Burns, G. Borriello; S. Chakraborty, D. Dill; P. McGee, S. Nowick • Simulation-based approaches: C. Brej; K. Fazel • Slack matching (throughput optimization) approaches: P. Prakash, A. Martin; P. Beerel, M. Davies, A. Lines, N. Kim
Cell characterization • A cell (in ASIC) is a physical implementation of a gate • Characterization abstracts away the details and specifies the parameters needed at the higher level of the hierarchy • Cell characterization • abstracts away cell implementation details • specifies functionality, timing, area, power consumption, etc. • is necessary and sufficient for efficient synthesis, optimization and simulation • De facto standard: Synopsys “Liberty” • [Figure: cell characterization example (in Liberty)]
Conventional cell vs. async. cell • Conventional gate: implements a function of input wires; special signals: clock, set, clear, etc. • Asynchronous stage: implements a function of input channels; special signals: request, acknowledge, data0, data1, reset
Asynchronous cell characterization • Reuse Synopsys Liberty whenever possible • Use attributes to specify roles of pins in handshaking, channel, etc • Specify functionality in terms of channels (abstract out control functionality) • Use Data → Data timing arcs to specify channel → channel attributes: slack, number of tokens at initialization * PCHB stage example
Protocol abstraction • Abstract channel: forward/backward control and forward data propagation • Assumption: handshake protocol is the same across the library/design L - Left/Right F - Forward/Backward C - Control/Data E - Evaluation/Reset
Protocol abstraction (PCHB) • Abstract channel: forward/backward control and forward data propagation • Assumption: handshake protocol is the same across the library/design • Use cell characterization to infer handshake protocol • Abstraction and characterization allow identifying protocol loops in every stage for every pair of channels L - Left/Right F - Forward/Backward C - Control/Data E - Evaluation/Reset
Protocol characterization • Goal: enumerate all handshake cycles • handshake cycles are the same across the design (assumption) • for practical protocols a handshake cycle covers 3 stages • enumerate all possible cycles in the full timing graph of a 4-stage FIFO, normalize the cycles and remove duplicates • Complexity is negligible * PCHB stage example
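The enumerate, normalize and deduplicate step can be illustrated with a minimal Python sketch. The graph below is a hypothetical toy handshake (a forward "d" event and a backward "a" event per stage), not the real PCHB timing graph, and the normalization rule (shift stage indices so the cycle starts at stage 0, then rotate to a canonical start) is an assumption about what "normalize" means here.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[int, str]            # (stage index, event name); names are hypothetical
Graph = Dict[Node, List[Node]]


def toy_fifo_graph(stages: int = 4) -> Graph:
    """Toy timing graph: data moves forward, acknowledge moves backward."""
    g: Graph = {}
    for i in range(stages - 1):
        g.setdefault((i, "d"), []).append((i + 1, "d"))   # forward data
        g.setdefault((i + 1, "d"), []).append((i, "a"))   # acknowledge to stage i
        g.setdefault((i, "a"), []).append((i, "d"))       # stage i may fire again
    return g


def simple_cycles(graph: Graph) -> List[Tuple[Node, ...]]:
    """Enumerate simple cycles by DFS; each cycle is found once per start node."""
    found: List[Tuple[Node, ...]] = []

    def dfs(start: Node, node: Node, path: List[Node], on_path: Set[Node]) -> None:
        for nxt in graph.get(node, []):
            if nxt == start:
                found.append(tuple(path))
            elif nxt not in on_path:
                on_path.add(nxt)
                path.append(nxt)
                dfs(start, nxt, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in graph:
        dfs(start, start, [start], {start})
    return found


def normalize(cycle: Tuple[Node, ...]) -> Tuple[Node, ...]:
    """Shift stage indices to start at 0, then rotate to the smallest node."""
    base = min(stage for stage, _ in cycle)
    shifted = tuple((stage - base, event) for stage, event in cycle)
    k = shifted.index(min(shifted))
    return shifted[k:] + shifted[:k]


def unique_cycles(graph: Graph) -> List[Tuple[Node, ...]]:
    """Normalized cycles with duplicates removed."""
    return sorted({normalize(c) for c in simple_cycles(graph)})


print(unique_cycles(toy_fifo_graph(4)))
```

In this toy graph the three per-stage cycles collapse into a single normalized cycle, which matches the idea that the protocol's cycles only need to be characterized once.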
Throughput analysis • Asynchronous pipeline throughput is determined by loops • Handshaking • Algorithmic (rings) and congestion • Pipeline throughput is known for basic pipeline compositions • Bottleneck based – pipeline compositions are bottleneck candidates
Background: pipeline throughput • T. Williams (1990), A. Lines (1995): throughput $T$ is a function of • $x$ – token count • $s$ – slack • $d$ – dynamic slack • $c$ – cycle time • $x$ is invariant for a ring in a pipeline with deterministic (data-independent) token flow
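As an illustration of how throughput depends on the token count, here is a minimal Python sketch of the commonly cited piecewise-linear ("canopy") ring model. The exact formula and the parameter names `l_f` (total forward latency around the ring) and `l_r` (total backward, or hole, latency) are assumptions of this sketch, not taken from the slides.

```python
def ring_throughput(x: float, s: float, l_f: float, l_r: float, c: float) -> float:
    """Throughput of a ring holding x tokens (assumed canopy model).

    x   - token count
    s   - slack (token capacity of the ring)
    l_f - total forward latency around the ring (assumed parameter)
    l_r - total backward (hole) latency around the ring (assumed parameter)
    c   - local cycle time of the slowest stage
    """
    if not 0 <= x <= s:
        raise ValueError("token count must lie between 0 and the slack")
    token_limited = x / l_f          # too few tokens: data-limited
    hole_limited = (s - x) / l_r     # too few holes: congestion
    return min(token_limited, 1.0 / c, hole_limited)


def dynamic_slack_range(s: float, l_f: float, l_r: float, c: float):
    """Token counts (d_min, d_max) between which the peak 1/c is observed."""
    return l_f / c, s - l_r / c


if __name__ == "__main__":
    for tokens in (2, 6, 9):
        print(tokens, ring_throughput(tokens, s=10, l_f=16, l_r=8, c=4))
    print(dynamic_slack_range(s=10, l_f=16, l_r=8, c=4))
```

Under this model the peak throughput 1/c is reached only while the token count stays inside the dynamic slack range; outside it the ring is either token-limited or hole-limited.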
Background: serial composition • For a serial composition of pipelines with throughputs $T_1$, $T_2$, the resulting throughput is $T_{\text{resulting}} = \min\{T_1, T_2\}$ • $T_{\text{resulting}}$ is observed at $d_{\min} \le x \le d_{\max}$
Background: parallel composition T1 T2 for parallel composition of pipelines with throughputs T1, T2 the resulting throughput Tresulting is observed at
Bottleneck candidates (BCs) • The peak throughput of a pipeline is limited by its slowest component • To determine the throughput of a pipeline it is sufficient to find the slowest combination of stages: the throughput bottleneck • Bottleneck candidates (BCs): • Handshake (h/s) cycle • Re-converging paths • Algorithmic cycle (ring) • A BC is characterized by its cycle time range
Bottleneck: h/s cycle • The length of each h/s cycle in the protocol is computed for each window of $2 \le m \le 3$ (HB) stages • Handshake cycles are known from protocol analysis • The minimum and maximum lengths of each cycle are computed “in place” • Heuristic: cycles involving multiple branches are not considered • The $v_i$ are the primary outputs of a stage (environment reaction times) * PCHB stages example
Bottleneck • Theorem: if a BC is a bottleneck, the reaction times on its borders never exceed those used to compute its cycle time • It follows from the theorem that a BC can be analyzed in isolation to determine its cycle time • BCs are sorted with respect to cycle time • The BC with the highest cycle time is a bottleneck: it defines the throughput of the design
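The ranking step could be sketched as follows, assuming each BC carries a cycle time range and that candidates are ordered by the upper bound of that range. The class name, field names, ordering key and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BottleneckCandidate:
    name: str
    kind: str          # "handshake", "reconverging" or "ring"
    cycle_min: float   # lower bound of the cycle time range
    cycle_max: float   # upper bound of the cycle time range


def rank(bcs: List[BottleneckCandidate]) -> List[BottleneckCandidate]:
    """Slowest first: sort by the upper cycle time bound (an assumed key)."""
    return sorted(bcs, key=lambda bc: bc.cycle_max, reverse=True)


def bottleneck(bcs: List[BottleneckCandidate]) -> BottleneckCandidate:
    """The candidate with the highest cycle time limits the design."""
    return rank(bcs)[0]


def throughput_range(bc: BottleneckCandidate) -> Tuple[float, float]:
    """Throughput is the reciprocal of cycle time, so the bounds swap."""
    return 1.0 / bc.cycle_max, 1.0 / bc.cycle_min


candidates = [
    BottleneckCandidate("hs0", "handshake", 18.0, 22.0),
    BottleneckCandidate("ring0", "ring", 30.0, 36.0),
    BottleneckCandidate("fork0", "reconverging", 24.0, 40.0),
]
worst = bottleneck(candidates)
print(worst.name, throughput_range(worst))
```

Keeping the full sorted list rather than only the worst candidate is useful for optimization, since alleviating one bottleneck exposes the next one in the ranking.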
Bottleneck: re-converging paths • Requires the results of handshake cycle analysis • Identify pairs of re-converging paths and compute their cycle times • Reduce the number of pairs of re-converging paths: • one pair of re-converging paths is identified per fork-join • the pipeline is assumed to have deterministic (data-independent) token flow, so the number of initial tokens in any two re-converging paths is the same • The number of BCs can be reduced if optimization is not needed
Bottleneck: ring • Heuristics for identifying rings and re-converging paths include: • for any set of rings with common arc(s), consider only two of them (the longest and the shortest)
Re-converging paths: corner case • The throughput of rings and re-converging path pairs is computed using the equations from T. Williams and A. Lines, BUT • If a handshake cycle covers the re-converging paths (i.e. the length of the shorter branch is 0-2 half-buffer stages), the equations from T. Williams and A. Lines do not apply • The throughput of such a bottleneck candidate is determined by the handshake cycles
Analysis algorithm • Identify handshake bottlenecks (sliding window) • Optimize handshake bottlenecks (if necessary) • Identify BCs due to algorithmic loops and dynamic slack imbalance • CPM, modified to handle loops • Trade memory for time: store arrival times and significant predecessors • Eliminate unnecessary graph exploration
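A minimal Python sketch of the arrival-time bookkeeping, reading CPM as the critical path method (an assumption). It covers only the plain acyclic case and stores, per node, the arrival time and the predecessor that determined it; the loop-handling modification named above is not shown, and the node names and delays are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def critical_path(delay: Dict[str, float],
                  edges: List[Tuple[str, str]]) -> Tuple[float, List[str]]:
    """Longest (critical) path through an acyclic graph of delayed nodes."""
    succ: Dict[str, List[str]] = {n: [] for n in delay}
    indeg: Dict[str, int] = {n: 0 for n in delay}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    arrival: Dict[str, float] = dict(delay)              # stored arrival times
    pred: Dict[str, Optional[str]] = {n: None for n in delay}  # significant predecessors

    ready = deque(n for n in delay if indeg[n] == 0)
    visited = 0
    while ready:                                         # topological sweep
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            if arrival[u] + delay[v] > arrival[v]:
                arrival[v] = arrival[u] + delay[v]
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(delay):
        raise ValueError("graph has a loop; plain CPM does not apply")

    end = max(arrival, key=arrival.get)
    path: List[str] = []
    node: Optional[str] = end
    while node is not None:                              # walk predecessors back
        path.append(node)
        node = pred[node]
    return arrival[end], path[::-1]


print(critical_path({"A": 2, "B": 3, "C": 1, "D": 4},
                    [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]))
```

Storing the predecessor alongside each arrival time is what lets the critical path be read back without re-exploring the graph.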
Predicted throughput variation • Predicted throughput variation range (% of the actual simulated throughput) • The predicted throughput variation depends on: • asymmetry in library cells, which makes the throughput vary with the data (actual throughput variation) • uncertainty introduced by heuristics (currently, incomplete synchronization trees introduce height uncertainty)
Predicted throughput precision • Throughput estimation is heuristic-based, i.e. error is possible • Shown is the % difference between the actual throughput and the bound of the predicted variation range, weighted by the actual throughput • In 92.5% of test cases the measured throughput is within the predicted variation range; the maximum error observed is 27%
Throughput optimization • Alleviate bottlenecks with throughput less than the goal by • Handshake pipelining • Ring padding, slack matching • Iteratively • insert stages • update all BCs
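The iterative loop (insert a stage, re-evaluate, repeat) can be sketched for a single ring. This is only an illustration under stated assumptions: the ring follows the piecewise-linear canopy model, every stage holds one token and has the same forward latency `l_f`, backward latency `l_r` and cycle time `c`, and only one BC is tracked. All names and numbers are hypothetical.

```python
from typing import Tuple


def ring_throughput(n_stages: int, tokens: int,
                    l_f: float, l_r: float, c: float) -> float:
    """Assumed canopy model for a ring of identical one-token stages."""
    token_limited = tokens / (n_stages * l_f)
    hole_limited = (n_stages - tokens) / (n_stages * l_r)
    return min(token_limited, 1.0 / c, hole_limited)


def pad_ring(n_stages: int, tokens: int, l_f: float, l_r: float, c: float,
             goal: float, max_inserts: int = 1000) -> Tuple[int, float, bool]:
    """Insert stages one at a time until the goal is met or padding stops helping."""
    best = ring_throughput(n_stages, tokens, l_f, l_r, c)
    for _ in range(max_inserts):
        if best >= goal:
            break
        candidate = ring_throughput(n_stages + 1, tokens, l_f, l_r, c)
        if candidate <= best:        # further padding makes the ring token-limited
            break
        n_stages += 1                # insert a stage, then update the estimate
        best = candidate
    return n_stages, best, best >= goal


print(pad_ring(4, 3, l_f=2, l_r=6, c=8, goal=0.1))   # goal reachable by padding
print(pad_ring(4, 3, l_f=2, l_r=6, c=8, goal=0.2))   # goal above the 1/c limit
```

The second call stops without reaching the goal: once the ring becomes token-limited or hits the 1/c ceiling set by the cells, inserting more stages no longer helps.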
Actual and achievable throughput • The approach allows automatic optimization of the throughput up to the level limited by: • library cells • data-deficient (long non-pipelined) rings • Fully optimized throughput is higher (cycle time is smaller) for • FIFOs • circuits without synchronization trees (fan-out 1)
Summary • Developed an asynchronous cell/stage characterization based on Synopsys Liberty, used for synthesis and throughput analysis/optimization • Protocol characterization is automatically inferred from cell characterization • Support for hierarchical designs (with possible loss of precision) • All bottlenecks are identified • All bottlenecks except for data-deficient rings are automatically alleviated • Optimization was tested with stage insertion, but other optimizations can be used • Analysis results are easily adjusted to reflect non-structural changes
Why heuristic? • Handshake cycles involving branches are currently not considered • Unless merges/forks are properly characterized, analysis in hierarchical designs is imprecise • Synchronization trees are currently assumed to be balanced; for incomplete trees one sync cell delay is added to the variation range