VLSI Signal Processing




  1. VLSI Signal Processing Lecture 2 Unfolding Transformation ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)

  2. Multiple-Data Processing • Create a program that executes more than one iteration at a time, e.g. by unrolling the loop J times • Example: loop unrolling + software pipelining [Figure: operation vs. clock cycle schedules, before and after unrolling]

  3. Basic Ideas • Parallel processing vs. pipelined processing [Figure: task schedules over time on processors P1 to P4. Parallel processing, one row per time step: (a1 a2 a3 a4), (b1 b2 b3 b4), (c1 c2 c3 c4), (d1 d2 d3 d4). Pipelined processing, one row per time step: (a1 b1 c1 d1), (a2 b2 c2 d2), (a3 b3 c3 d3), (a4 b4 c4 d4)]

  4. Data Dependence • Parallel processing requires NO data dependence between processors • Pipelined processing involves inter-processor communication [Figure: processors P1 to P4 vs. time for each case]

  5. Parallel Processing • In a J-unfolded system, each delay is J-slow. That is, if the input to a delay element is x(kJ+m), then its output is x((k-1)J+m) = x(kJ+m-J)

  6. Parallel Processing • Block processing: the number of inputs processed in a clock cycle is referred to as the block size • Example with block size 3: at the k-th clock cycle, the three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2)

  7. I/O Conversion • Serial to parallel converter • Parallel to serial converter
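A minimal sketch of the two converters, assuming a block size of 3 and plain Python lists standing in for sample streams. The function names `serial_to_parallel` and `parallel_to_serial` are illustrative and do not come from the lecture.

```python
def serial_to_parallel(samples, block_size):
    """Group a serial stream x(0), x(1), ... into blocks [x(Jk), ..., x(Jk+J-1)]."""
    if len(samples) % block_size != 0:
        raise ValueError("stream length must be a multiple of the block size")
    return [samples[k:k + block_size] for k in range(0, len(samples), block_size)]


def parallel_to_serial(blocks):
    """Flatten blocks back into a serial stream, preserving sample order."""
    return [sample for block in blocks for sample in block]


x = list(range(6))                 # stands in for x(0) ... x(5)
blocks = serial_to_parallel(x, 3)  # one block per clock cycle k
```

Each inner list is what a block processor would consume in one clock cycle; `parallel_to_serial` restores the original sample order at the output.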

  8. General approach for block processing

  9. Mathematical Formulation • e.g. y(n) = ay(n-9) + x(n) • 2-parallel: y(2k) = ay(2k-9) + x(2k), y(2k+1) = ay(2k-8) + x(2k+1) • In the 2-parallel SDFG, each active clock edge processes two samples: y(2k) = ay(2(k-5)+1) + x(2k), y(2k+1) = ay(2(k-4)+0) + x(2k+1) • A dependency that spans fewer sample delays than the level of parallelism can be implemented with internal routing
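The 2-parallel rewrite can be checked numerically. The sketch below assumes zero initial conditions (y(n) = 0 for n < 0) and an even number of input samples; the function names are illustrative.

```python
def serial_iir(x, a, delay=9):
    """Reference form: y(n) = a*y(n-delay) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n, xn in enumerate(x):
        past = y[n - delay] if n >= delay else 0
        y.append(a * past + xn)
    return y


def two_parallel_iir(x, a):
    """2-parallel form: iteration k produces both y(2k) and y(2k+1)."""
    assert len(x) % 2 == 0
    even, odd = [], []  # even[k] = y(2k), odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        # y(2k) = a*y(2(k-5)+1) + x(2k): odd output from 5 iterations earlier
        y_even = a * (odd[k - 5] if k >= 5 else 0) + x[2 * k]
        # y(2k+1) = a*y(2(k-4)+0) + x(2k+1): even output from 4 iterations earlier
        y_odd = a * (even[k - 4] if k >= 4 else 0) + x[2 * k + 1]
        even.append(y_even)
        odd.append(y_odd)
    # Interleave the two output streams back into sample order
    return [v for pair in zip(even, odd) for v in pair]


x = list(range(20))
```

The 9-sample delay of the serial form becomes a 5-iteration delay on one branch and a 4-iteration delay on the other, which is why 9 delays are still needed in total.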

  10. Unfolding the DFG • Original DFG: T = Ts • J-unfolded DFG: T = J Ts • Not trivial, even for a simple graph

  11. Block Processing for FIR Filter • One form of vectorized parallel processing of DSP algorithms. (Not the parallel processing in most general sense) • Block vector: [x(3k) x(3k+1) x(3k+2)] • Clock cycle: can be 3 times longer • Original (FIR filter): • Rewrite 3 equations at a time:

  12. Block Processing

  13. Block Processing for IIR Digital Filter • Original formulation: • Rewrite: • Vector formulation: • n: sample period, k: processor period, Tsample ≠ Tclk

  14. Block IIR Filter [Figure: x(n) passes through an S/P converter into x(2k) and x(2k+1); two adders produce y(2k) and y(2k+1), each fed back through one delay D as y(2(k-1)) and y(2(k-1)+1); a P/S converter produces y(n)] • The clock period is not equal to the sampling period

  15. Timing Comparison • Pipelining • Block processing [Figure: timing diagrams showing the inputs x(n), the schedules of the MAC, Add, and Mul units, and the outputs y(n) for each scheme]

  16. Definitions • Unfolding is the process of unrolling a loop so that several iterations of the original loop are executed in a single iteration of the unfolded loop • Also known as: loop unrolling (in compilers for parallel programs), block processing • Applications: reducing the sampling period to achieve the iteration bound (desired throughput rate) T∞; parallel (block) processing to execute several iterations concurrently; digit-serial or bit-serial processing

  17. Unfolding the DFG • y(n) = ay(n-9) + x(n) • Rewrite the algorithm formulation: y(2k) = ay(2k-9) + x(2k), y(2k+1) = ay(2k-8) + x(2k+1), i.e. y(2k) = ay(2(k-5)+1) + x(2k), y(2k+1) = ay(2(k-4)) + x(2k+1) • After J-fold unfolding, the clock period is T = J Ts, where Ts is the data sampling period

  18. Timing Diagram [Figure: output sequence y(0), y(1), …, y(13) at T = Ts with the 9-sample dependency; 2-unfolded sequences y(0), y(2), …, y(12) and y(1), y(3), …, y(13) at T = 2Ts with the 4T and 5T dependencies] • The timing diagram assumes that the sampling period Ts remains unchanged, so the clock period T is increased J-fold • Since 9/2 is not an integer, the outputs (y(0), y(1)) are needed by two different future iterations, 4T and 5T later

  19. Another DFG Unfolding Example (J = 2) [Figure: original DFG with nodes Q, R, S, T, delays 2D and 3D, T=3; unfolded nodes Q0, R0, S0, T0, Q1, R1, S1, T1] • Step 1. Make J copies of each node

  20. Another DFG Unfolding Example (J = 2) [Figure: original DFG with nodes Q, R, S, T, delays 2D and 3D, T=3; unfolded nodes Q0, R0, S0, T0, Q1, R1, S1, T1] • Step 2. Add all edges that carry 0 delays

  21. Another DFG Unfolding Example (J = 2) [Figure: original DFG with nodes Q, R, S, T, delays 2D and 3D, T=3; unfolded DFG with nodes Q0, R0, S0, T0, Q1, R1, S1, T1, delays D, D, D, and 2D, T=6] • Step 3. Use the table on the left to determine the edges with delays

  22. Unfolding Transformation • For each node U in the original DFG, draw J nodes U0, U1, …, U(J-1) • For each edge U→V with w delays in the original DFG, draw the J edges Ui→V((i+w)%J) with floor[(i+w)/J] delays, for i = 0, 1, …, J-1 • For w < J, unfolding an edge with w delays in the original DFG produces J-w edges with no delays and w edges with 1 delay in the J-unfolded DFG • Unfolding preserves the precedence constraints of a DSP algorithm
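The edge rule above can be sketched in a few lines. The representation of a DFG as a list of `(u, v, w)` tuples and the single edge U→V with 37 delays are assumptions chosen for illustration only.

```python
def unfold(edges, J):
    """Unfold a DFG by a factor J.

    edges: list of (u, v, w) tuples, meaning an edge u -> v with w delays.
    Returns the edges of the J-unfolded DFG as ((u, i), (v, j), delays),
    where (u, i) denotes the i-th copy of node u.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            # U_i -> V_((i+w) % J) with floor((i+w)/J) delays
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded


# Hypothetical example: one edge U -> V with 37 delays, unfolded by J = 4
result = unfold([("U", "V", 37)], 4)
```

For this hypothetical edge, three of the unfolded edges carry 9 delays and one carries 10, so the total is still 37.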

  23. Precedence Preservation

  24. Delay Preservation • Unfolding preserves the number of delays in a DFG • The J edges obtained from an edge with w delays carry floor[(i+w)/J] delays for i = 0, 1, …, J-1, and floor[w/J] + floor[(w+1)/J] + … + floor[(w+J-1)/J] = w
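One way to see why the unfolded edges carry w delays in total is the short derivation below. The decomposition w = qJ + r is a standard device and is an assumption here, since the slide's own equation did not survive in the transcript.

```latex
\text{Write } w = qJ + r \text{ with } 0 \le r < J. \text{ For } i = 0, 1, \dots, J-1:
\qquad
\left\lfloor \frac{i + w}{J} \right\rfloor
  = q + \left\lfloor \frac{i + r}{J} \right\rfloor
  = \begin{cases} q, & i < J - r \\ q + 1, & i \ge J - r \end{cases}

\sum_{i=0}^{J-1} \left\lfloor \frac{i + w}{J} \right\rfloor
  = (J - r)\,q + r\,(q + 1) = qJ + r = w
```

The case split also shows that for w < J (so q = 0 and r = w) exactly J-w edges have no delay and w edges have one delay.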

  25. Example • Unfold the following DFG using unfolding factors 2 and 5

  26. Properties of Unfolding • Unfolding preserves the number of registers (delays) in a DFG • Unfolding a loop with w delays by a factor J produces gcd(w, J) loops in the unfolded DFG; each of these loops contains w/gcd(w, J) delays and J/gcd(w, J) copies of each node that appears in the original loop • Unfolding a DFG with iteration bound T∞ results in a J-unfolded DFG with iteration bound J·T∞ • A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG • Any clock period that can be achieved by retiming a J-unfolded DFG can also be achieved by retiming the original DFG and then J-unfolding it

  27. When a Loop is Unfolded • Consider a loop ℓ with w delays in a DFG, passing through node A • Traversing the loop A ~> A p times gives a path that is also a loop, with pw delays • In the J-unfolded DFG, consider the path Ai ~> A((i+pw)%J). It is a loop if i = (i+pw)%J, which implies that J | pw • The smallest such p is J/gcd(J, w). That is, each loop in the J-unfolded DFG corresponds to traversing the original loop A ~> A J/gcd(J, w) times • Recall that there are J copies of node A in total. Hence, there are J/(J/gcd(J, w)) = gcd(J, w) loops, and each loop contains w/gcd(J, w) delays • Each unfolded loop has computation time (J/gcd(J, w))·t_ℓ, where t_ℓ is the computation time of ℓ, so its loop bound is J·t_ℓ/w and the iteration bound of the J-unfolded DFG is J·T∞
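The loop count can be checked by tracing the unfolded copies of a single node. This sketch assumes the simplest possible loop, a self-loop A→A with w delays, and the values w = 6, J = 4 are hypothetical.

```python
from math import gcd


def unfolded_loops(w, J):
    """Trace the loops created by J-unfolding a self-loop A -> A with w delays.

    Returns one (node_copies, total_delays) pair per loop in the unfolded DFG.
    """
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        i, copies, delays = start, [], 0
        while i not in seen:
            seen.add(i)
            copies.append(i)
            delays += (i + w) // J  # delays on the edge A_i -> A_((i+w) % J)
            i = (i + w) % J
        loops.append((copies, delays))
    return loops


loops = unfolded_loops(6, 4)  # hypothetical values: w = 6, J = 4
```

With these values the trace finds gcd(6, 4) = 2 loops, each visiting 2 copies of A and carrying 3 delays, in line with the counts stated above.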

  28. When a Path is Unfolded • If w < J, then a path containing w delays in a DFG leads to (J-w) paths with no delays and w paths with 1 delay in the J-unfolded DFG • If w ≥ J, then the path leads to J paths with one or more delays in the J-unfolded DFG. This implies that these paths are not critical • Assume that the critical path of the J-unfolded DFG is c. If D(U,V) ≥ c, then Wr(U→V) = W(U→V) + r(V) - r(U) ≥ J • Any feasible clock period that can be obtained by retiming the J-unfolded DFG can also be achieved by retiming the original DFG directly and then J-unfolding it
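The two cases w < J and w ≥ J can be tabulated directly from the unfolding rule. The helper name `path_delay_counts` and the sample values are illustrative assumptions.

```python
def path_delay_counts(w, J):
    """Map delay count -> number of unfolded paths, for a path with w delays."""
    delays = [(i + w) // J for i in range(J)]
    return {d: delays.count(d) for d in sorted(set(delays))}


short_path = path_delay_counts(2, 5)  # w < J
long_path = path_delay_counts(7, 3)   # w >= J
```

In the first case J-w = 3 paths have no delay and w = 2 paths have one delay; in the second case every unfolded path keeps at least one delay.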

  29. When a Path is Unfolded • Suppose r’ is a legal retiming for the J-unfolded DFG, GJ, which leads to critical path c • Let r(U) = Σi r’(Ui), where the sum runs over 0 ≤ i ≤ J-1 • Then r is a feasible retiming for the original DFG, G • Retiming G with r and then J-unfolding leads to a critical path of c

  30. Sample Period Reduction • Case 1: A node in the DFG has a computation time greater than T∞ • Case 2: The iteration bound T∞ is not an integer • Case 3: The longest node computation time is larger than the iteration bound T∞, and T∞ is not an integer
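All three cases can be handled by one commonly used rule: pick the smallest J for which J·T∞ is an integer and is at least the longest node computation time. The sketch below assumes that rule, integer node times, and hypothetical loop data given as (loop computation time, loop delays) pairs; the function name is illustrative.

```python
from fractions import Fraction


def min_unfolding_factor(loops, node_times):
    """Smallest J with J*T_inf an integer and J*T_inf >= longest node time.

    loops: list of (loop_computation_time, loop_delays) pairs.
    node_times: computation time of every node, in integer time units.
    Returns (J, T_inf) with T_inf as an exact fraction.
    """
    t_inf = max(Fraction(t, w) for t, w in loops)  # iteration bound
    longest = max(node_times)
    J = 1
    while (J * t_inf).denominator != 1 or J * t_inf < longest:
        J += 1
    return J, t_inf


case1 = min_unfolding_factor([(6, 2)], [1, 1, 4])  # T_inf = 3, one node takes 4
case2 = min_unfolding_factor([(4, 3)], [2, 2])     # T_inf = 4/3
case3 = min_unfolding_factor([(8, 6)], [6, 2])     # T_inf = 4/3, one node takes 6
```

After unfolding by the returned J and retiming, the unfolded DFG can run at a clock period of J·T∞, which corresponds to a sample period of T∞.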

  31. Case 1 • The critical path dominates, since a node computation time is greater than the iteration bound • Retiming cannot be used to reduce the sample period

  32. Sample Period Reduction • Rule of Thumb: T∞=6, Tcritical=6

  33. Case 2 • The iteration period cannot achieve the iteration bound when the bound is not an integer

  34. Sample Period Reduction

  35. Case 3

  36. Parallel Processing • Parallel processing can be performed by unfolding

  37. Bit-Level Parallel Processing

  39. Bit-Serial Adder

  40. Unfolding of Switches

  41. Example

  42. Example

  43. Example

  44. Example

  45. Switches with Delays

  46. Switch with Delays

  47. If Wordlength is not a Multiple of J
