Heuristic based throughput analysis and optimization of asynchronous pipelines

Alexander Smirnov, Alexander Taubin



  1. Heuristic based throughput analysis and optimization of asynchronous pipelines. Alexander Smirnov, Alexander Taubin

  2. Goals and assumptions
     • Determine
       • max throughput
       • causes of throughput limit
       • max achievable throughput
       • cost of achieving a given throughput level
     • Data independent token flow
     • No early evaluation
     • DEMUXes send data all ways
     • Cells across library/design implement the same handshaking protocol

  3. Outline
     • Previous work
     • Cell characterization
     • Protocol characterization
     • Throughput of asynchronous pipelines (reminder)
     • Throughput analysis
     • Throughput optimization

  4. Previous work
     • Early works on the throughput of async. pipelines: M. Greenstreet, K. Steiglitz; T. Williams; A. Lines
     • Time separation of events (TSE) based approaches to throughput analysis: T. Amon, H. Hulgaard, S. Burns, G. Borriello; S. Chakraborty, D. Dill; P. McGee, S. Nowick
     • Simulation based approaches: C. Brej; K. Fazel
     • Slack matching (throughput optimization) approaches: P. Prakash, A. Martin; P. Beerel, M. Davies, A. Lines, N. Kim

  5. Cell characterization
     • A cell (in ASIC) is a physical implementation of a gate
     • Characterization is a way of abstracting away the details and specifying the parameters needed at the higher level of the hierarchy
     • Cell characterization
       • abstracts away cell implementation details
       • specifies functionality, timing, area, power consumption, etc.
       • is necessary and sufficient for efficient synthesis, optimization and simulation
     • De facto standard – Synopsys “Liberty”
     [Figure: cell characterization example (in Liberty)]

  6. Conventional cell vs. async. cell
     • Conventional gate:
       • Implements function of input wires
       • Special signals: clock, set, clear, etc.
     • Asynchronous stage:
       • Implements function of input channels
       • Special signals: request, acknowledge, data0, data1, reset

  7. Asynchronous cell characterization
     • Reuse Synopsys Liberty whenever possible
     • Use attributes to specify roles of pins in handshaking, channel, etc.
     • Specify functionality in terms of channels (abstract out control functionality)
     • Use Data → Data timing arcs to specify channel → channel attributes: slack, number of tokens at initialization
     * PCHB stage example

  8. Protocol abstraction
     • Abstract channel: forward/backward control and forward data propagation
     • Assumption: handshake protocol is the same across the library/design
     Legend: L - Left/Right; F - Forward/Backward; C - Control/Data; E - Evaluation/Reset

  9. Protocol abstraction (PCHB)
     • Abstract channel: forward/backward control and forward data propagation
     • Assumption: handshake protocol is the same across the library/design
     • Use cell characterization to infer handshake protocol
     • Abstraction and characterization allow identifying protocol loops in every stage for every pair of channels
     Legend: L - Left/Right; F - Forward/Backward; C - Control/Data; E - Evaluation/Reset

  10. Protocol characterization
     • Goal: enumerate all handshake cycles
       • handshake cycles are the same across the design (assumption)
       • for practical protocols a handshake cycle covers 3 stages
       • enumerate all possible cycles in the full timing graph of a 4-stage FIFO, normalize the cycles and remove duplicates
     • Complexity is negligible
     * PCHB stage example
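The enumerate, normalize and de-duplicate step on this slide can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy `req`/`ack` timing graph built by `fifo_graph` is not the PCHB timing graph from the talk, and `normalize` (shift stage indices to start at 0, then rotate) is only one plausible reading of "normalize cycles".

```python
from typing import Dict, List, Tuple

Node = Tuple[int, str]  # (stage index, event name); both are hypothetical labels
Graph = Dict[Node, List[Node]]


def fifo_graph(n_stages: int) -> Graph:
    """Toy timing graph: req ripples forward, ack re-enables the previous stage."""
    g: Graph = {}
    for i in range(n_stages - 1):
        g.setdefault((i, "req"), []).append((i + 1, "req"))
        g.setdefault((i + 1, "req"), []).append((i + 1, "ack"))
        g.setdefault((i + 1, "ack"), []).append((i, "req"))
    return g


def simple_cycles(graph: Graph) -> List[List[Node]]:
    """Enumerate every simple cycle exactly once, rooted at its smallest node."""
    cycles: List[List[Node]] = []

    def dfs(start: Node, node: Node, path: List[Node]) -> None:
        for nxt in graph.get(node, ()):
            if nxt == start:
                cycles.append(list(path))
            elif nxt > start and nxt not in path:
                path.append(nxt)
                dfs(start, nxt, path)
                path.pop()

    for start in sorted(graph):
        dfs(start, start, [start])
    return cycles


def normalize(cycle: List[Node]) -> Tuple[Node, ...]:
    """Shift stage indices so the lowest is 0, then rotate to the smallest node."""
    base = min(stage for stage, _ in cycle)
    shifted = [(stage - base, event) for stage, event in cycle]
    k = shifted.index(min(shifted))
    return tuple(shifted[k:] + shifted[:k])


raw = simple_cycles(fifo_graph(4))      # cycles found at every position in the FIFO
unique = {normalize(c) for c in raw}    # position-independent handshake cycles
```

In this toy graph the same cycle shape appears at several positions along the FIFO and collapses to a single normalized cycle. Because the graph covers only a handful of stages, brute-force enumeration stays cheap, which is in line with the slide's remark on complexity.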

  11. Throughput analysis
     • Asynchronous pipeline throughput is determined by loops
       • Handshaking
       • Algorithmic (rings) and congestion
     • Pipeline throughput is known for basic pipeline compositions
     • Bottleneck-based approach: pipeline compositions are bottleneck candidates

  12. Background: pipeline throughput
     • T. Williams (1990), A. Lines (1995): throughput T [equation not preserved in the transcript], where
       • x – token count
       • s – slack
       • d – dynamic slack
       • c – cycle time
     • x is invariant for a ring in a pipeline with deterministic (data independent) token flow
     [Figure not preserved; residual labels: "ci", "lif"]
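The slide's equation did not survive in the transcript. As a reminder of the general shape of such results, here is a sketch of the commonly cited "canopy" form for a homogeneous ring. It is not taken from the slide and may differ from the authors' notation. Assumptions: the ring has `n` identical stages, each holding at most one token (so the slack equals `n`), with hypothetical per-stage forward latency `l_f`, backward (hole) latency `l_r` and local cycle time `c`.

```python
from typing import Tuple


def ring_throughput(x: int, n: int, l_f: float, l_r: float, c: float) -> float:
    """Throughput of a homogeneous n-stage ring holding x tokens (canopy form)."""
    if not 0 <= x <= n:
        raise ValueError("token count must satisfy 0 <= x <= n")
    data_limited = x / (n * l_f)        # too few tokens: tokens chase around the ring
    hole_limited = (n - x) / (n * l_r)  # too many tokens: holes chase backwards
    return min(data_limited, 1.0 / c, hole_limited)


def dynamic_slack_range(n: int, l_f: float, l_r: float, c: float) -> Tuple[float, float]:
    """Token counts (d_min, d_max) between which the ring runs at 1/c."""
    return n * l_f / c, n * (c - l_r) / c


# Hypothetical numbers: 10 stages, l_f = 2, l_r = 4, c = 10 time units.
few, balanced, many = (ring_throughput(x, 10, 2.0, 4.0, 10.0) for x in (1, 5, 9))
d_min, d_max = dynamic_slack_range(10, 2.0, 4.0, 10.0)
```

Under these assumptions the throughput rises with the token count, stays flat at `1/c` between `d_min` and `d_max`, and falls again once holes become scarce.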

  13. Background: serial composition
     • For serial composition of pipelines with throughputs $T_1$, $T_2$, the resulting throughput is $T_{\text{resulting}} = \min\{T_1, T_2\}$
     • $T_{\text{resulting}}$ is observed at $d_{\min} \le x \le d_{\max}$
     [Figure: pipelines labeled T1, T2, Ti, Tj]

  14. Background: parallel composition
     • For parallel composition of pipelines with throughputs $T_1$, $T_2$, the resulting throughput $T_{\text{resulting}}$ is observed at [expression not preserved in the transcript]
     [Figure: pipelines labeled T1, T2]

  15. Bottleneck candidates (BCs)
     • Peak throughput of a pipeline is limited by its slowest component ⇒ to determine the throughput of a pipeline it is sufficient to find the slowest combination of stages, the throughput bottleneck
     • Bottleneck candidates (BCs):
       • Handshake (h/s) cycle
       • Re-converging paths
       • Algorithmic cycle (ring)
     • A BC is characterized by its cycle time range

  16. Bottleneck: h/s cycle
     • The length of each h/s cycle in the protocol is computed for each window of length $2 \le m \le 3$ (HB stages)
     • Handshake cycles are known from protocol analysis
     • The minimum and maximum length of each cycle i are computed “in place” [remainder of the sentence and its equation not preserved in the transcript]
     • Heuristic: cycles involving multiple branches are not considered ⇒ complexity [expression not preserved], where $v_i$ are primary outputs of a stage
     [Figure label: environment reaction times]
     * PCHB stages example

  17. Bottleneck
     • Theorem: if a BC is a bottleneck, reaction times on its borders never exceed those used to compute its cycle time
     • It follows from the theorem that a BC can be analyzed in isolation to determine its cycle time
     • BCs are sorted with respect to cycle time
     • The BC with the highest cycle time is a bottleneck; it defines the throughput of the design
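Ranking bottleneck candidates can be sketched as a sort over per-candidate cycle times. This is an illustration under stated assumptions, not the authors' implementation: the `BottleneckCandidate` record, its field names, the choice to rank by the upper bound of the cycle time range, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BottleneckCandidate:
    name: str         # hypothetical label
    kind: str         # "handshake", "reconverging" or "ring"
    cycle_min: float  # lower bound of the cycle time range
    cycle_max: float  # upper bound of the cycle time range


def rank(bcs: List[BottleneckCandidate]) -> List[BottleneckCandidate]:
    """Sort candidates from slowest to fastest (assumed key: worst-case cycle time)."""
    return sorted(bcs, key=lambda bc: bc.cycle_max, reverse=True)


def throughput_range(bc: BottleneckCandidate) -> Tuple[float, float]:
    """Throughput is the reciprocal of cycle time, so the bounds swap."""
    return 1.0 / bc.cycle_max, 1.0 / bc.cycle_min


candidates = [
    BottleneckCandidate("hs_window_3", "handshake", 8.0, 10.0),
    BottleneckCandidate("fork_join_A", "reconverging", 9.0, 12.5),
    BottleneckCandidate("ring_0", "ring", 11.0, 16.0),
]
ranked = rank(candidates)
bottleneck = ranked[0]                   # slowest candidate limits the design
t_low, t_high = throughput_range(bottleneck)
```

Because each candidate is evaluated in isolation, the ranking needs only one number range per candidate, so updating a single candidate after an optimization step does not require re-analyzing the others.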

  18. Bottleneck: re-converging paths
     • Requires results of handshake cycle analysis
     • Identify pairs of re-converging paths, compute [quantity not preserved in the transcript]
     • Reduce the number of pairs of re-converging paths:
       • one pair of re-converging paths is identified per fork-join
       • the pipeline is assumed to have deterministic (data independent) token flow ⇒ the number of initial tokens in any two re-converging paths is the same
     • The number of BCs can be reduced if optimization is not needed

  19. Bottleneck: ring
     • Heuristics for identifying rings and re-converging paths include:
       • from any set of rings with common arc(s), consider only two (the longest and the shortest)

  20. Re-converging paths: corner case
     • Throughput of rings and re-converging path pairs is computed using the equations from T. Williams and A. Lines, BUT
     • If a handshake cycle covers re-converging paths (i.e. the length of the shorter branch is 0-2 half-buffer stages), the equations from T. Williams and A. Lines do not apply
     • Throughput of such a bottleneck candidate is determined by the handshake cycles

  21. Analysis algorithm
     • Identify handshake bottlenecks (sliding window)
     • Optimize handshake bottlenecks (if necessary)
     • Identify BCs due to algorithmic loops and dynamic slack imbalance
       • CPM, modified to handle loops
       • Trade memory for time: store arrival times and significant predecessors
       • Eliminate unnecessary graph exploration
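For readers unfamiliar with CPM (the critical path method), the following sketch shows the plain acyclic version that stores an arrival time and a "significant predecessor" per node. The loop-handling modification mentioned on the slide is not described in the transcript and is not reproduced here; this version simply rejects cyclic graphs. Node names and delays are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Arcs = Dict[Tuple[str, str], float]  # (from, to) -> non-negative delay


def cpm(arcs: Arcs) -> Tuple[Dict[str, float], Dict[str, Optional[str]]]:
    """Latest arrival time and latest-arriving predecessor for every node of a DAG."""
    succ: Dict[str, List[Tuple[str, float]]] = {}
    indeg: Dict[str, int] = {}
    for (u, v), delay in arcs.items():
        succ.setdefault(u, []).append((v, delay))
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)

    # Sources start at time 0; every other node waits for its first update.
    arrival = {n: (0.0 if k == 0 else float("-inf")) for n, k in indeg.items()}
    pred: Dict[str, Optional[str]] = {n: None for n in indeg}
    ready = deque(n for n, k in indeg.items() if k == 0)
    done = 0
    while ready:  # Kahn-style topological sweep
        u = ready.popleft()
        done += 1
        for v, delay in succ.get(u, []):
            if arrival[u] + delay > arrival[v]:
                arrival[v] = arrival[u] + delay
                pred[v] = u  # remember the significant predecessor
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if done != len(indeg):
        raise ValueError("graph contains a cycle; plain CPM needs a DAG")
    return arrival, pred


def critical_path(pred: Dict[str, Optional[str]], sink: str) -> List[str]:
    """Walk the stored predecessors back from the sink to a source."""
    path = [sink]
    while pred[path[-1]] is not None:
        path.append(pred[path[-1]])
    return path[::-1]


arcs = {("a", "b"): 2.0, ("a", "c"): 5.0, ("b", "d"): 4.0, ("c", "d"): 2.0, ("d", "e"): 1.0}
arrival, pred = cpm(arcs)
path = critical_path(pred, "e")
```

Storing the predecessor alongside each arrival time is the memory-for-time trade the slide alludes to: the critical path is read back without re-exploring the graph.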

  22. Predicted throughput variation
     • Predicted throughput variation range (% of the actual simulated throughput)
     • Predicted throughput variation depends on:
       • Actual throughput variation: due to asymmetry in library cells, throughput varies depending on the data
       • Uncertainty introduced by heuristics (currently, incomplete synchronization trees introduce height uncertainty)

  23. Predicted throughput precision
     • Throughput estimation is heuristic based, i.e. error is possible
     • Shown is the % difference between the actual throughput and the bound of the predicted variation range, weighted by the actual throughput
     • In 92.5% of test cases the measured throughput is within the predicted variation range; the maximum error observed is 27%
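One plausible reading of this error metric is sketched below: the error is zero when the measured throughput falls inside the predicted range, and otherwise it is the distance to the nearest bound as a percentage of the measured throughput. The function names are hypothetical, and the sample values are made up for illustration; they are not the authors' measurements.

```python
from typing import Iterable, Tuple

Case = Tuple[float, float, float]  # (actual, predicted_low, predicted_high)


def prediction_error_pct(actual: float, low: float, high: float) -> float:
    """Percent distance from the actual throughput to the nearest range bound."""
    if low <= actual <= high:
        return 0.0
    nearest = low if actual < low else high
    return abs(actual - nearest) / actual * 100.0


def fraction_within_range(cases: Iterable[Case]) -> float:
    """Share of cases whose actual throughput lies inside the predicted range."""
    cases = list(cases)
    hits = sum(1 for actual, low, high in cases if low <= actual <= high)
    return hits / len(cases)


# Made-up sample: two predictions contain the measurement, two miss it.
sample = [(100.0, 90.0, 110.0), (100.0, 110.0, 120.0), (80.0, 60.0, 70.0), (50.0, 50.0, 55.0)]
errors = [prediction_error_pct(*case) for case in sample]
coverage = fraction_within_range(sample)
```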

  24. Throughput optimization
     • Alleviate bottlenecks with throughput less than the goal by
       • Handshake pipelining
       • Ring padding, slack matching
     • Iteratively
       • insert stages
       • update all BCs
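The insert-and-update loop can be illustrated on a single ring. This is a toy sketch, not the authors' optimizer: it pads one homogeneous ring using the commonly cited canopy throughput model, whereas the talk updates all BCs after each insertion. The parameters (`l_f`, `l_r`, `c`, `goal`) and the stopping rule are assumptions.

```python
from typing import Optional, Tuple


def ring_throughput(x: int, n: int, l_f: float, l_r: float, c: float) -> float:
    """Canopy-form throughput of a homogeneous n-stage ring with x tokens."""
    if not 0 <= x <= n:
        raise ValueError("token count must satisfy 0 <= x <= n")
    return min(x / (n * l_f), 1.0 / c, (n - x) / (n * l_r))


def pad_ring(x: int, n: int, l_f: float, l_r: float, c: float,
             goal: float, max_insert: int = 64) -> Optional[Tuple[int, float]]:
    """Insert stages one at a time until the goal is met or padding stops helping."""
    stages = n
    current = ring_throughput(x, stages, l_f, l_r, c)
    for _ in range(max_insert):
        if current >= goal:
            return stages, current
        candidate = ring_throughput(x, stages + 1, l_f, l_r, c)
        if candidate <= current:
            return None  # more stages no longer help (e.g. too few tokens)
        stages, current = stages + 1, candidate
    return (stages, current) if current >= goal else None


# Hole-limited ring: 4 tokens in 5 stages; padding adds the missing holes.
padded = pad_ring(x=4, n=5, l_f=2.0, l_r=4.0, c=10.0, goal=0.09)
# Token-starved ring: 1 token in 10 stages; padding only makes it slower.
starved = pad_ring(x=1, n=10, l_f=2.0, l_r=4.0, c=10.0, goal=0.09)
```

In this model the second case cannot be fixed by stage insertion, which mirrors the talk's later remark that data deficient rings are not alleviated automatically.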

  26. Actual and achievable throughput
     • The approach allows the throughput to be optimized automatically up to the level limited by:
       • library cells
       • data deficient (long non-pipelined) rings
     • Fully optimized throughput is higher (cycle time is smaller) for
       • FIFOs
       • circuits without synchronization trees (fan-out 1)

  27. Summary
     • Developed an asynchronous cell/stage characterization, based on Synopsys Liberty, used for synthesis and throughput analysis/optimization
     • Protocol characterization is inferred automatically from cell characterization
     • Support for hierarchical designs (with possible loss of precision)
     • All bottlenecks are identified
     • All bottlenecks except for data deficient rings are automatically alleviated
     • Optimization was tested with stage insertion, but other optimizations can be used
     • Analysis results are easily adjusted to reflect non-structural changes

  28. Why heuristic?
     • Handshake cycles involving branches are currently not considered
     • Unless merges/forks are properly characterized, analysis in hierarchical designs is imprecise
     • Synchronization trees are currently assumed to be balanced; for incomplete trees, one sync cell delay is added to the variation range
