
Algorithmic Transformations




  1. Algorithmic Transformations

  2. Goals • The goal: get the DSP algorithm into a form amenable to implementation before synthesizing the design on the selected platform (FPGA or PDSP). • The algorithms themselves do not change, only the way they are prepared for implementation. • This requires understanding aspects of timing, pipelining, and parallelism. (C)2002-2004 Yu Hen Hu

  3. Overview • Algorithm Representations and Iteration Bound • Parallelism and Pipelining • Retiming • Unfolding • Folding


  7. Data Flow Graph • Node: a computation, associated with a computing time. • Directed edge: a data path, possibly carrying delays. • Delay: an iteration count, labeled on an edge with D or a positive integer. • Example: y(n) = a*y(n-1) + b*u(n). The delay of 1 unit indicates that computing y(n+1) in the next iteration depends on the result y(n) of the present iteration.

  8. DFG • Intra-iteration dependency: a directed edge without any delay. • Inter-iteration dependency: a directed edge with 1 or more delays. • Node computing delays are labeled in parentheses. • Critical path: the longest path between registers, that is, the longest path containing no delay. Example: critical path delay = 4+2+2 = 8 t.u. • Recursive DFG: contains loops. Every loop must contain at least one delay element; otherwise the algorithm is NON-computable! [Figure: DFG with input x(n), two delays D, multipliers M0, M1, M2 (4 t.u. each), adders A0, A1 (2 t.u. each), and output y(n).]

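  9. Loop bound and Iteration bound • Loop bound: total node computing time around a loop divided by the number of delays in that loop. For the loop A (2) - B (4) - A with 2D: T{A-B-A} = (2+4)/2 = 3 t.u. • Iteration bound: the maximum loop bound over all loops. With a second loop through A, B, and C (total computing time 2+4+5, one delay D): T = max{(2+4)/2, (2+4+5)/1} = max{3, 11} = 11 t.u. [Figure: the one-loop and two-loop DFGs.]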
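
The two bounds on this slide can be reproduced with a few lines of Python. This is only a sketch: each loop is described by hand as a pair (node computing times, number of delays), and the function names `loop_bound` and `iteration_bound` are illustrative, not from the slides.

```python
from fractions import Fraction

def loop_bound(node_times, delays):
    """Loop bound = total computing time in the loop / number of delays in the loop."""
    if delays == 0:
        # A loop without any delay is non-computable.
        raise ValueError("loop contains no delay element")
    return Fraction(sum(node_times), delays)

def iteration_bound(loops):
    """Iteration bound = maximum loop bound over all loops of the DFG."""
    return max(loop_bound(times, delays) for times, delays in loops)

# Loop A-B-A: computing times 2 and 4, two delays.
one_loop = [((2, 4), 2)]
# Adding the loop through A, B, C: computing times 2, 4, 5, one delay.
two_loops = [((2, 4), 2), ((2, 4, 5), 1)]

print(iteration_bound(one_loop))   # 3
print(iteration_bound(two_loops))  # 11
```

Enumerating the loops by hand is fine for small graphs like these; larger DFGs need a systematic method for finding the loop with the largest bound.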


  12. Solution • To achieve high speed, the critical path can be shortened by pipelining and parallel processing.

  13. Overview • Algorithm Representations and Iteration Bound • Parallelism and Pipelining • Retiming • Unfolding • Folding

  14. Basic Ideas • Parallel processing: less inter-processor communication, more complicated processor hardware. • Pipelined processing: more inter-processor communication, simpler processor hardware. [Figure: schedules of processors P1-P4 over time under each scheme. Colors: different types of operations performed. a, b, c, d: different data streams processed.]

  15. Data Dependence • Parallel processing requires NO data dependence between processors. • Pipelined processing involves inter-processor communication. [Figure: processors P1-P4 versus time for each scheme.]

  16. Usage of Pipelined Processing • Inserting latches or registers between combinational logic circuits shortens the critical path. • Consequence: shorter clock cycle time, higher clock frequency. • Suitable for DSP applications that have (infinitely) long data streams. • Method to incorporate pipelining: cut-set retiming. • Cut set: a set of edges of a graph such that removing these edges from the original graph leaves two separate graphs. • Retiming: the timing of an algorithm is re-adjusted while keeping the partial ordering of execution unchanged, so that the results remain correct.

  17. Pipelining

  18. Pipelining of FIR filters

  19. Pipelining

  20. Fine-grain pipelining • Pipeline inside the multiplier (stages with delays TM1 and TM2) to further reduce TM. • Critical path = max{TM1, TM2, TA}

  21. Graph Transpose Theorem • The transfer function of a signal flow graph remains unchanged if • the direction of each arc is reversed, and • the input and output labels are switched. [Figure: a 3-tap FIR signal flow graph with input x[n], delays z^-1, coefficients h[0], h[1], h[2], and output y[n], shown next to its transpose.]

  22. Data broadcast structure • An algorithm transform may lead to a pipelined structure without adding delays. • Given an FIR filter SFG with critical path TM+2TA, • apply the graph transposition theorem: reverse all arcs and swap input and output. • The result is the data broadcast structure, whose critical path is TM+TA (one multiplier followed by one adder) regardless of the number of taps. • No additional delay is added!
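
As a quick check that transposition leaves the filter unchanged, the sketch below simulates a direct-form FIR filter and its transposed (data broadcast) form sample by sample. The function names and the test coefficients are illustrative assumptions, not taken from the slides.

```python
def fir_direct(h, x):
    """Direct form: a tapped delay line on the input feeds the multipliers."""
    taps = [0] * len(h)                # taps[k] holds x[n-k]
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]    # shift the delay line
        out.append(sum(hk * tk for hk, tk in zip(h, taps)))
    return out

def fir_transposed(h, x):
    """Transposed form: the input is broadcast to every multiplier and the
    delays sit between the adders."""
    state = [0] * (len(h) - 1)         # registers between adders
    out = []
    for sample in x:
        products = [hk * sample for hk in h]
        padded = state + [0]           # nothing feeds the last adder from beyond
        out.append(products[0] + padded[0])
        # Each register stores one product plus the next register's old value.
        state = [products[i + 1] + padded[i + 1] for i in range(len(state))]
    return out

h = [1, 2, 3]
x = [1, 0, 0, 2, -1]
print(fir_direct(h, x))
print(fir_transposed(h, x))
```

Both functions use the same number of registers, which matches the observation that transposition adds no delay. In the transposed loop, every register input is one product plus one register output, so only one multiply and one add lie between registers.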

  23. Block Processing • One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense). • Block vector: [x(3k) x(3k+1) x(3k+2)] • Clock cycle: can be 3 times longer. • Steps: start from the original FIR filter equation, rewrite 3 equations at a time, define the block vector, and obtain the block formulation. [Equations omitted.]
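
A minimal sketch of block processing with block size 3 follows. The filter equations are missing from this slide, so the code assumes a 3-tap FIR filter y(n) = a*x(n) + b*x(n-1) + c*x(n-2) purely for illustration; the function names are likewise hypothetical.

```python
def fir_serial(coeffs, x):
    """Reference: one output per iteration, y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    a, b, c = coeffs
    padded = [0, 0] + list(x)          # zero initial conditions
    return [a * padded[n + 2] + b * padded[n + 1] + c * padded[n]
            for n in range(len(x))]

def fir_block3(coeffs, x):
    """Block form: each iteration k consumes [x(3k), x(3k+1), x(3k+2)] and
    produces [y(3k), y(3k+1), y(3k+2)]."""
    a, b, c = coeffs
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")
    y = []
    prev1 = prev2 = 0                  # x(3k-1) and x(3k-2) from the previous block
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k: 3 * k + 3]
        y.append(a * x0 + b * prev1 + c * prev2)   # y(3k)
        y.append(a * x1 + b * x0 + c * prev1)      # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)         # y(3k+2)
        prev1, prev2 = x2, x1
    return y

coeffs = (2, -1, 3)
samples = [1, 4, 0, -2, 5, 7]
print(fir_block3(coeffs, samples) == fir_serial(coeffs, samples))  # True
```

The three output equations inside the loop have no dependence on each other, so in hardware they can be evaluated side by side in one block-clock cycle.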

  24. Block Processing

  25. General approach for block processing

  27. Timing Comparison • Pipelining • Block processing [Figure: timing diagrams of the input samples x(n), the MAC, Mul, and Add operations, and the output samples y(n) for the original, pipelined, and block-processing implementations.]

  28. Overview • Algorithm Representations and Iteration Bound • Parallelism and Pipelining • Retiming • Unfolding • Folding

  29. Definitions • Retiming: a mapping from a given DFG G to a retimed DFG Gr such that the transfer functions of G and Gr differ by a pure delay z^-L. • Purposes • To facilitate pipelining, which reduces clock cycle time • To reduce the number of registers needed.

  30. Cut Set Retiming

  31. Cut set delay transfer

  32. Cut-set delay transfer failure

  33. Cut-set Retiming • Feed-forward cut-set: adding an arbitrary but equal non-negative number of delays to each edge of a feed-forward cut-set of a DFG will not alter its output, except that the output timing is delayed. • Feed-back cut-set (delay transfer theorem): transferring the same number of delays from the edges in one direction across a feed-back cut-set of a DFG to all edges in the opposite direction across the same cut-set will not alter the output, only its timing.

  34. Feed-forward Cut-Set Retiming • Consider the FIR digital filter y(n) = b0 x(n) + b1 x(n-1) and its DFG. Critical path length = TM+TA. • Select a cut set and insert one delay on each edge in the cut set. • Retiming: ynew(n) = b0 x(n-1) + b1 x(n-2), so ynew(n) = y(n-1). • Critical path = max(TM, TA). [Figure: the DFG before retiming and after a delay D is added on each multiplier output.]
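
The claim ynew(n) = y(n-1) can be checked with a short simulation. The sketch assumes the cut set consists of the two multiplier outputs, so each product is registered before it reaches the adder; function names and test values are illustrative.

```python
def fir2(b0, b1, x):
    """Original filter: y(n) = b0*x(n) + b1*x(n-1)."""
    prev = 0                           # x(n-1)
    out = []
    for s in x:
        out.append(b0 * s + b1 * prev)
        prev = s
    return out

def fir2_retimed(b0, b1, x):
    """Same filter with one register on each multiplier output (the cut set)."""
    prev = 0                           # x(n-1)
    reg0 = reg1 = 0                    # registered products
    out = []
    for s in x:
        out.append(reg0 + reg1)        # the adder only sees last cycle's products
        reg0, reg1 = b0 * s, b1 * prev
        prev = s
    return out

x = [3, 1, 4, 1, 5]
y = fir2(2, 7, x)
y_new = fir2_retimed(2, 7, x)
print(y)
print(y_new)
```

In the retimed loop the multiplications and the addition never happen in the same register-to-register path, which is why the clock period drops from TM+TA to max(TM, TA) at the cost of one cycle of latency.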

  35. Feed-back Cut Set Retiming • Consider an IIR digital filter y(n) = a·y(n-2) + x(n): loop bound = (TM+TA)/2, clock cycle = TM+TA. • Shift 1 delay to the other edge across a feed-back cut set. The filter remains unchanged. • After retiming: loop bound = (TM+TA)/2, clock cycle = max(TM, TA). [Figure: the filter with 2D on one loop edge, and the retimed filter with one D on each side of the multiplier.]
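
The sketch below simulates both versions of this loop and shows that moving one delay across the multiplier leaves the output unchanged. It assumes zero initial register contents and that the original design keeps both delays on the adder output; the names are illustrative.

```python
def iir_original(a, x):
    """y(n) = a*y(n-2) + x(n), both delays in front of the multiplier."""
    d1 = d2 = 0                        # d1 = y(n-1), d2 = y(n-2)
    out = []
    for s in x:
        y = a * d2 + s                 # multiply and add in the same clock cycle
        d1, d2 = y, d1
        out.append(y)
    return out

def iir_retimed(a, x):
    """Same loop with one delay before and one delay after the multiplier."""
    d = 0                              # register before the multiplier: y(n-1)
    m = 0                              # register after the multiplier: a*y(n-2)
    out = []
    for s in x:
        y = m + s                      # the adder path contains no multiplier
        m, d = a * d, y                # the multiplier path contains no adder
        out.append(y)
    return out

x = [1, 2, 3, 4, 5, 6]
print(iir_original(2, x))
print(iir_retimed(2, x))
```

The loop still contains two delays in total, so the loop bound (TM+TA)/2 is the same in both versions; only the longest register-to-register path changes.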

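  36. Feed-back Cut Set Retiming • Consider an IIR digital filter y(n) = a·y(n-1) + x(n): loop bound = TM+TA, throughput = 1/(TM+TA). • Slow down by 2: replace D with 2D and interleave zeros into the input, so that the slowed-down input x(m) satisfies x(2k-1) = x(k) of the original input and x(2k) = 0. • Clock period = TM+TA, throughput = 1/[2(TM+TA)]. [Figure: the original filter with delay D and the slowed-down filter with delay 2D, input x(m), and output y(m).]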
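
A small simulation of the slow-down step is sketched below. It uses 0-based indexing, so the original samples land on even positions of the interleaved stream rather than on positions 2k-1 as on the slide; that shift, the zero initial state, and the function names are assumptions of the sketch.

```python
def iir(a, x, delay=1):
    """y(n) = a*y(n - delay) + x(n), with zero initial state."""
    y = []
    for n, s in enumerate(x):
        prev = y[n - delay] if n >= delay else 0
        y.append(a * prev + s)
    return y

def interleave_zeros(x):
    """Insert one zero after every input sample (slow-down by 2)."""
    out = []
    for s in x:
        out += [s, 0]
    return out

x = [1, 2, 3]
y = iir(3, x, delay=1)                         # original filter
y_slow = iir(3, interleave_zeros(x), delay=2)  # D replaced by 2D, zeros interleaved
print(y)       # [1, 5, 18]
print(y_slow)  # [1, 0, 5, 0, 18, 0]
```

Every second output of the slowed-down filter is a useful sample, which is the factor of 2 lost in throughput. The extra delay in the loop is what later allows retiming to shorten the clock period.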

  37. Time scaling

  38. Slowing down the input rate

  39. Loss of Efficiency

  40. Slowdown + Retiming • Start with y(n) = a y(n-1) + x(n), slowed down and retimed: clock cycle = max(TM, TA), throughput = 1/[2 max(TM, TA)]. • Start with y(n) = a y(n-2) + x(n), retimed: loop bound = (TM+TA)/2, clock cycle = max(TM, TA), throughput = 1/max(TM, TA). [Figure: both retimed filters, each with one D on either side of the multiplier; inputs x(m) and x(n), outputs y(m) and y(n).]

  41. Slow Down for Cut-Set Retiming

  42. Example of retiming • Node delay = 1 t.u. • Before retiming: critical path a3 → a4 → a5 → a6, clock cycle time = 4, 2 delay units. • After cut-set retiming: critical paths a3 → a5 and a4 → a6, clock cycle time = 2, 6 delay units. • After additional retiming: critical path: none, clock cycle time = 1, 11 delay units. [Figure: the DFG with nodes a1-a6 at each of the three stages.]

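  43. Node Retiming • Transfer delays through a node in the DFG: • r(v) = # of delays transferred from out-going edges to incoming edges of node v • w(e) = # of delays on edge e • wr(e) = # of delays on edge e after retiming • Retiming equation for an edge e from node u to node v: wr(e) = w(e) + r(v) - r(u), subject to wr(e) ≥ 0. • Let p be a path from v0 to vk; then wr(p) = w(p) + r(vk) - r(v0), since the retiming values of the intermediate nodes cancel. [Figure: an edge e from u to v, a path p from v0 to vk, and an example of node retiming with r(v) = 2.]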

  44. Invariant Properties • Retiming does NOT change the total number of delays in each cycle. • Retiming does not change the loop bounds or the iteration bound of the DFG. • If a constant integer j is added to the retiming value of every node v in a DFG G, the retimed graph Gr is not affected. That is, the weights (# of delays) of the retimed graph remain the same.

  45. Node Retiming Examples • r(2) = 1

  46. DFG Illustration of the Example • In one DFG: T = max{(1+2+1)/2, (1+2+1)/3} = 2, critical path delay = max{2, 2, 1+1} = 2 t.u. • In the other DFG: T = max{(1+2+1)/2, (1+2+1)/3} = 2, critical path delay = 2+1 = 3 t.u. • The iteration bound T is the same in both; only the critical path differs.

  47. Retiming for Minimizing Clock Period • Note that retiming will NOT alter the iteration bound T. • The iteration bound is the theoretical minimum clock period to execute the algorithm. • Let edge e connect node u to node v. If the node computing times satisfy t(u) + t(v) > T and e carries no delay, then the clock period exceeds T. For such an edge, we require that wr(e) = w(e) + r(v) - r(u) ≥ 1. • To generalize, for any path p from v0 to vk whose computing time exceeds T, we require wr(p) = w(p) + r(vk) - r(v0) ≥ 1. • In other words, any possible critical path in the DFG that is longer than T must contain at least one delay after retiming.

  48. Retiming Example Revisited • Constraints on the retimed edge weights, with T = 2: • wr(e21) ≥ 0, since t(2)+t(1) = 2 = T. • wr(e13) ≥ 1, since t(1)+t(3) = 3 > T. • wr(e14) ≥ 1, since t(1)+t(4) = 3 > T. • wr(e32) ≥ 1, since t(3)+t(2) = 3 > T. • wr(e42) ≥ 1, since t(4)+t(2) = 3 > T. • Using the equation wr(euv) = w(euv) + r(v) - r(u): • w(e21) + r(1) - r(2) = 1 + r(1) - r(2) ≥ 0 • w(e13) + r(3) - r(1) = 1 + r(3) - r(1) ≥ 1 • w(e14) + r(4) - r(1) = 2 + r(4) - r(1) ≥ 1 • w(e32) + r(2) - r(3) = 0 + r(2) - r(3) ≥ 1 • w(e42) + r(2) - r(4) = 0 + r(2) - r(4) ≥ 1

  49. Solution continues • The retimed graph Gr remains the same if the same constant is added to all node retiming values, so we can set r(1) = 0. The inequalities become: • 1 - r(2) ≥ 0, or r(2) ≤ 1 • 1 + r(3) ≥ 1, or r(3) ≥ 0 • 2 + r(4) ≥ 1, or r(4) ≥ -1 • r(2) - r(3) ≥ 1, or r(3) ≤ r(2) - 1 • r(2) - r(4) ≥ 1, or r(2) ≥ r(4) + 1 • Since 1 ≥ r(2) ≥ r(3) + 1 ≥ 1, one must have r(2) = +1. This implies r(3) ≤ 0. But we also have r(3) ≥ 0, hence r(3) = 0. This leaves -1 ≤ r(4) ≤ 0. • Hence the two solutions are: r(3) = 0, r(2) = +1, and r(4) = 0 or -1.
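
Both solutions can be checked numerically. In the sketch below the edge delays come from the inequalities of the example, and the node computing times t(1) = t(2) = 1, t(3) = t(4) = 2 are inferred from the sums t(u)+t(v) quoted there. The helper names `retime`, `critical_path`, and `loop_delays` are hypothetical.

```python
# Node computing times and edge delays of the four-node example.
t = {1: 1, 2: 1, 3: 2, 4: 2}
w = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}

def retime(w, r):
    """Apply wr(e) = w(e) + r(v) - r(u) to every edge e = (u, v)."""
    wr = {(u, v): d + r[v] - r[u] for (u, v), d in w.items()}
    if min(wr.values()) < 0:
        raise ValueError("retiming gives a negative edge weight")
    return wr

def critical_path(t, w):
    """Longest computing time along any path made only of zero-delay edges.
    Assumes the zero-delay edges form no cycle (the DFG is computable)."""
    zero = {}
    for (u, v), d in w.items():
        if d == 0:
            zero.setdefault(u, []).append(v)
    memo = {}
    def longest_from(u):
        if u not in memo:
            memo[u] = t[u] + max((longest_from(v) for v in zero.get(u, [])), default=0)
        return memo[u]
    return max(longest_from(u) for u in t)

def loop_delays(w, loop):
    """Total number of delays around a loop given as a node sequence."""
    return sum(w[(u, v)] for u, v in zip(loop, loop[1:] + loop[:1]))

r_a = {1: 0, 2: 1, 3: 0, 4: 0}
r_b = {1: 0, 2: 1, 3: 0, 4: -1}
wr_a, wr_b = retime(w, r_a), retime(w, r_b)
print(critical_path(t, w), critical_path(t, wr_a), critical_path(t, wr_b))  # 3 2 2
```

The critical path drops from 3 t.u. to the iteration bound of 2 t.u. under either solution, and `loop_delays` can be used to confirm that each loop keeps its total delay count.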

  50. Systematic Solutions • Given a system of inequalities r(i) - r(j) ≤ k, 1 ≤ i, j ≤ N, construct a constraint graph: • Map each r(i) to node i. • Add a node N+1. • For each inequality r(i) - r(j) ≤ k, draw an edge eji from node j to node i with weight w(eji) = k. • Draw N edges eN+1,i from node N+1 to each node i, each with weight 0. • The system of inequalities has a solution if and only if the constraint graph contains no negative cycles. • If a solution exists, one solution is r(i) = the length of the shortest path from node N+1 to node i. • Shortest path algorithms: Bellman-Ford algorithm, Floyd-Warshall algorithm.
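
The procedure above can be sketched with a plain Bellman-Ford pass over the constraint graph. The function name and the (i, j, k) tuple encoding of "r(i) - r(j) ≤ k" are assumptions of this sketch. The sample constraints are the five retiming inequalities of the earlier four-node example, rewritten in ≤ form.

```python
def solve_difference_constraints(n, constraints):
    """Solve r(i) - r(j) <= k for nodes 1..n.
    constraints: list of (i, j, k) tuples.
    Returns {i: r(i)} or None if the constraint graph has a negative cycle."""
    source = n + 1
    # Inequality r(i) - r(j) <= k becomes an edge j -> i with weight k.
    edges = [(j, i, k) for (i, j, k) in constraints]
    # Zero-weight edges from the extra node N+1 to every node.
    edges += [(source, i, 0) for i in range(1, n + 1)]

    dist = {v: float("inf") for v in range(1, n + 2)}
    dist[source] = 0
    for _ in range(n):                 # n+1 nodes need at most n relaxation passes
        for u, v, weight in edges:
            if dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
    for u, v, weight in edges:         # any further improvement means a negative cycle
        if dist[u] + weight < dist[v]:
            return None
    return {i: dist[i] for i in range(1, n + 1)}

constraints = [
    (2, 1, 1),    # 1 + r(1) - r(2) >= 0  ->  r(2) - r(1) <= 1
    (1, 3, 0),    # 1 + r(3) - r(1) >= 1  ->  r(1) - r(3) <= 0
    (1, 4, 1),    # 2 + r(4) - r(1) >= 1  ->  r(1) - r(4) <= 1
    (3, 2, -1),   # r(2) - r(3) >= 1      ->  r(3) - r(2) <= -1
    (4, 2, -1),   # r(2) - r(4) >= 1      ->  r(4) - r(2) <= -1
]
r = solve_difference_constraints(4, constraints)
print(r)  # {1: -1, 2: 0, 3: -1, 4: -1}
```

Adding 1 to every value gives r(1) = 0, r(2) = 1, r(3) = 0, r(4) = 0, which is one of the two solutions found by hand; the shift is allowed because adding a constant to all retiming values does not change the retimed graph.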
