  1. CS137: Electronic Design Automation Day 9: January 30, 2006 Parallel Prefix

  2. Today • Bit-Level Addition • LUT Cascades • For Sums • Applications • FSMs • SATADD • Data Forwarding • Pointer Jumping • Applications

  3. Introduction / Reminder: Addition in Log Time

  4. Ripple Carry Addition • Simple “definition” of addition • Serially resolve carry at each bit
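As a baseline, a minimal Python sketch of ripple-carry addition over LSB-first bit lists; the name `ripple_add` and the list encoding are illustrative, not from the slides. The carry is resolved serially, which is exactly the O(N)-depth behavior the following slides improve on.

```python
def ripple_add(a, b, c=0):
    """Ripple-carry addition of two equal-length bit lists (LSB first).

    The carry is resolved serially, one bit position at a time,
    so the depth grows linearly with the number of bits.
    """
    s = []
    for ai, bi in zip(a, b):
        s.append(ai ^ bi ^ c)               # sum bit at this position
        c = (ai & bi) | (c & (ai ^ bi))     # carry out of this position
    return s, c

# 5 + 3 = 8: bits are LSB first
assert ripple_add([1, 0, 1, 0], [1, 1, 0, 0]) == ([0, 0, 0, 1], 0)
```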

  5. CLA • Think about each adder bit as computing a function on the carry in: c[i]=g(c[i-1]) • The particular function g depends on a[i], b[i]: g=f(a,b)

  6. Functions • What functions can g(c[i-1]) be? • g(x)=1 • a[i]=b[i]=1 • g(x)=x • a[i] xor b[i]=1 • g(x)=0 • a[i]=b[i]=0

  7. Functions • What functions can g(c[i-1]) be? • g(x)=1 Generate • a[i]=b[i]=1 • g(x)=x Propagate • a[i] xor b[i]=1 • g(x)=0 Squash • a[i]=b[i]=0

  8. Combining • Want to combine functions • Compute c[i]=g[i](g[i-1](c[i-2])) • i.e., compute the composition of two functions • What functions can the composition of two of these functions be? • Same as before: propagate, generate, squash

  9. Compose Rules (LSB → MSB)

  10. Compose Rules (LSB → MSB)
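A small Python sketch of the compose rules, under the standard carry-lookahead reading of generate/propagate/squash (the 'G'/'P'/'S' encoding and helper names here are mine): when a less-significant span is followed by a more-significant one, the more-significant span decides unless it propagates.

```python
def classify(a_bit, b_bit):
    """Per-bit carry function g determined by a[i], b[i]."""
    if a_bit and b_bit:
        return 'G'          # generate: carry out is 1 regardless of carry in
    if a_bit or b_bit:      # a[i] xor b[i] = 1
        return 'P'          # propagate: carry out equals carry in
    return 'S'              # squash: carry out is 0 regardless of carry in

def compose(lo, hi):
    """Compose a less-significant span (lo) followed by a more-significant
    span (hi): if hi propagates, the result is whatever lo does;
    otherwise hi decides."""
    return lo if hi == 'P' else hi

# The nine compose rules, LSB span first, MSB span second:
for lo in 'GPS':
    for hi in 'GPS':
        print(f"{lo}{hi} -> {compose(lo, hi)}")
```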

  11. Combining • Do it again… • Combine g[i-3,i-2] and g[i-1,i] • What do we get?

  12. Reduce Tree

  13. Associative Reduce → Prefix • Shows us how to compute the Nth value in O(log(N)) time • Can actually produce all intermediate values in this time • w/ only a constant factor more hardware

  14. Prefix Tree

  15. Parallel Prefix • Important Pattern • Applicable any time operation is associative • Function Composition is always associative
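To make the pattern concrete, a minimal Python sketch of a recursive-doubling scan; it accepts any associative `op`, which is exactly the requirement stated above. The serial loop over the shift `d` stands in for what would be log(N) hardware stages; the names are illustrative.

```python
import operator

def parallel_prefix(xs, op):
    """Inclusive scan by recursive doubling.

    After the pass with shift d, position i holds the combination of
    elements max(0, i-2d+1) .. i, so log2(N) passes suffice.  The left
    operand of op is always the earlier span, preserving order for
    non-commutative (but associative) operators such as composition.
    """
    out = list(xs)
    d = 1
    while d < len(out):
        out = [out[i] if i < d else op(out[i - d], out[i])
               for i in range(len(out))]
        d *= 2
    return out

assert parallel_prefix([1, 2, 3, 4, 5], operator.add) == [1, 3, 6, 10, 15]
```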

  16. Generalizing LUT Cascade

  17. Cascaded LUT Delay Model • Tcascade = T(3LUT) + T(mux) • Don't pay: • general interconnect • full 4-LUT delay

  18. Parallel Prefix LUT Cascade? • Can we do better than N×Tmux? • Can we compute LUT cascade in O(log(N)) time? • Can we compute mux cascade using parallel prefix? • Can we make mux cascade associative?

  19. Parallel Prefix Mux cascade • How can the mux transform S → mux-out? • A=0, B=0 → mux-out=0 • A=1, B=1 → mux-out=1 • A=0, B=1 → mux-out=S • A=1, B=0 → mux-out=/S

  20. Parallel Prefix Mux cascade • How can the mux transform S → mux-out? • A=0, B=0 → mux-out=0: Stop = S • A=1, B=1 → mux-out=1: Generate = G • A=0, B=1 → mux-out=S: Buffer = B • A=1, B=0 → mux-out=/S: Invert = I

  21. Parallel Prefix Mux cascade • How can 2 muxes transform input? • Can I compute 2-mux transforms from 1 mux transforms?

  22. SSS SGG SBS SIG Two-mux transforms • GSS • GGG • GBG • GIS • BSS • BGG • BBB • BII • ISS • IGG • IBI • IIB

  23. Generalizing mux-cascade • How can N muxes transform the input? • Is mux transform composition associative?

  24. Associative Reduce Mux-Cascade • Can be hardwired, no general interconnect

  25. For Sums

  26. Prefix Sum • Common operation: • Want B[x] such that B[x]=A[0]+A[1]+…+A[x] • For i=0 to x: B[i]=B[i-1]+A[i]

  27. Prefix Sum • Compute in tree fashion • A[i]+A[i+1] • A[i]+A[i+1]+A[i+2]+A[i+3] • … • Combine partial sums back down tree • S(0:7)+S(8:9)+S(10)=S(0:10)
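A sketch of the tree-fashion computation in Python, assuming the up-sweep pairs adjacent elements and the down-sweep combines partial sums back down, as described above; the function and variable names are mine.

```python
def tree_prefix_sum(a):
    """Inclusive prefix sum computed in tree fashion: pair up elements
    going up the tree, then push partial sums back down so every
    position sees the sum of everything up to and including itself."""
    n = len(a)
    if n == 1:
        return list(a)
    # Up-sweep: sum adjacent pairs (A[i] + A[i+1], ...)
    pairs = [a[i] + a[i + 1] for i in range(0, n - 1, 2)]
    if n % 2:
        pairs.append(a[-1])
    # Recurse on the half-length problem
    partial = tree_prefix_sum(pairs)
    # Down-sweep: combine partial sums back down the tree
    out = []
    for i in range(n):
        if i % 2:                           # odd index: prefix of pairs covers 0..i
            out.append(partial[i // 2])
        else:                               # even index: previous pair prefix + own value
            out.append((partial[i // 2 - 1] if i else 0) + a[i])
    return out

assert tree_prefix_sum([1, 2, 3, 4, 5]) == [1, 3, 6, 10, 15]
```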

  28. Other simple operators • Prefix-OR • Prefix-AND • Prefix-MAX • Prefix-MIN

  29. Find-First One • Useful for arbitration • Finds first (highest-priority) requester • Also magnitude finding in numbers • How: • Prefix-OR • Locally compute X[i-1]^X[i] • Flags the first one
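A minimal sketch of find-first-one; the prefix-OR is written as a serial loop here, but in hardware it would be the log-depth parallel prefix discussed above. Names are illustrative.

```python
def find_first_one(x):
    """One-hot flag of the first (highest-priority) 1 in x.

    prefix_or[i] is 1 once any bit at or before i is 1; XOR with the
    previous prefix flags exactly the position where it first turned on.
    """
    prefix_or, acc = [], 0
    for bit in x:
        acc |= bit
        prefix_or.append(acc)
    return [prefix_or[i] ^ (prefix_or[i - 1] if i else 0)
            for i in range(len(x))]

assert find_first_one([0, 0, 1, 0, 1, 1]) == [0, 0, 1, 0, 0, 0]
```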

  30. Arbitration • Often want to find first M requesters • E.g., assign unique memory ports to first M processors requesting • Prefix-sum across all potential requesters • Counts requesters, giving a unique number to each • Know if one of first M • Perhaps which resource assigned
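A sequential sketch of the arbitration idea, assuming a running count stands in for the parallel prefix sum; granted requesters also learn which resource they get from their own count. The function name and encoding are mine.

```python
def grant_first_m(requests, m):
    """Grant at most m resources, in priority order, using a prefix sum.

    count is the number of requesters at or before this position; a
    requester is granted iff its count is <= m, and (count - 1) names
    which resource it receives.
    """
    grants, count = [], 0
    for r in requests:
        count += r
        grants.append(count - 1 if r and count <= m else None)
    return grants

# Requesters 0, 2 and 5 ask; only the first 2 get ports 0 and 1.
assert grant_first_m([1, 0, 1, 0, 0, 1], 2) == [0, None, 1, None, None, None]
```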

  31. Partitioning • Use something to order • E.g. spectral linear ordering • …or 1D cellular swap to produce linear order • Parallel prefix on area of units • If not all same area • Know where the midpoint is

  32. Channel Width • Prefix sum on delta wires at each node • To compute net channel widths at all points along channel • E.g. 1D ordered • Maybe use with cellular placement scheme
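A small sketch of the channel-width computation, assuming each node contributes a signed delta (wires starting minus wires ending); the prefix sum of the deltas is the channel width at each point along the 1D-ordered channel. The encoding is mine.

```python
def channel_widths(deltas):
    """Running channel width along a 1-D ordered channel.

    deltas[i] = (wires that start at node i) - (wires that end at node i);
    the prefix sum over the deltas gives the number of nets crossing
    each point.  Written serially here; parallel prefix in hardware.
    """
    widths, acc = [], 0
    for d in deltas:
        acc += d
        widths.append(acc)
    return widths

# Two nets start at node 0, one more at node 1, one ends at node 2, ...
assert channel_widths([2, 1, -1, 0, -2]) == [2, 3, 2, 2, 0]
```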

  33. Rank Finding • Looking for I’th ordered element • Do a prefix-sum on high-bit only • Know m=number of things > 01111111… • High-low search on result • I.e. if number > I, recurse on half with leading zero • If number < I, search for (I-m)’th element in half with high-bit true • Find median in log²(N) time
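A sequential sketch of rank finding by examining bits MSB-first; the split sizes computed here with list comprehensions are what the prefix sum on the current bit would deliver in parallel. This is my own rendering of the idea, not code from the slides.

```python
def select_kth(values, k, bit):
    """k-th smallest (0-indexed) element, examining bits MSB-first.

    At each level, split on the current bit and recurse into the half
    that contains the k-th element; the split size is exactly the count
    a parallel prefix sum on that bit would provide.
    """
    if bit < 0:
        return values[0]
    low  = [v for v in values if not (v >> bit) & 1]
    high = [v for v in values if (v >> bit) & 1]
    if k < len(low):
        return select_kth(low, k, bit - 1)
    return select_kth(high, k - len(low), bit - 1)

vals = [9, 3, 14, 7, 2, 11, 5, 8]
assert select_kth(vals, len(vals) // 2, bit=3) == sorted(vals)[len(vals) // 2]
```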

  34. FA/FSM Evaluation (regular expression recognition)

  35. Finite Automata • Machine has finite state: S • On each cycle • Input I • Compute output and new state • Based on inputs and current state • O(i), S(i+1) = f(S(i), I(i)) • Intuitively, a sequential process • Must know previous state to compute next • Must know state to compute output

  36. Function Specialization • But, this is just functions • …and function composition is associative • Given that we know the input sequence: • I(0), I(1), I(2), … • Can compute specialized functions: • f(i)(s)=f(s,I(i)) • What is f(i)(s)? • Worst-case, a translation table: • S=0 → NS0, S=1 → NS1, …

  37. Function Composition • Now: O(i+m), S(i+m+1) = f(i+m)(f(i+m-1)(f(i+m-2)(…f(i)(S(i))))) • Can we compute the function composition? • f(i+1,i)(s)=f(i+1)(f(i)(s)) • What is f(i+1,i)(s)? • A translation table just like f(i)(s) and f(i+1)(s) • Table of size |S|; can be filled in in O(|S|) time

  38. Recursive Function Composition • Now: O(i+m), S(i+m+1) = f(i+m)(f(i+m-1)(f(i+m-2)(…f(i)(S(i))))) • We can compute the composition • f(i+1,i)(s)=f(i+1)(f(i)(s)) • Repeat to compute • f(i+3,i)(s)=f(i+3,i+2)(f(i+1,i)(s)) • Etc. until we have computed f(i+m,i)(s) in O(log(m)) steps
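A tiny sketch of FSM evaluation by table composition, using a hypothetical 2-state parity machine: `specialize` builds the per-cycle translation table f(i) and `compose` fills the |S|-entry table for a composed span. The names and the example FSM are mine.

```python
from functools import reduce

# Hypothetical 2-state FSM: is the parity of the bit stream seen so far odd?
STATES = (0, 1)
def step(s, i):            # next-state function f(S(i), I(i))
    return s ^ i

def specialize(i):
    """Translation table for one cycle: state -> next state, input i fixed."""
    return {s: step(s, i) for s in STATES}

def compose(t1, t2):
    """Table for 'apply t1, then t2'; only |S| entries to fill in."""
    return {s: t2[t1[s]] for s in STATES}

inputs = [1, 0, 1, 1, 0, 1]
tables = [specialize(i) for i in inputs]
# reduce() stands in for the log-depth parallel-prefix tree over the tables;
# any bracketing gives the same result because composition is associative.
final = reduce(compose, tables)
assert final[0] == 0   # four 1s in the stream, so parity is even
```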

  39. Implications • If can get input stream, • Any FA can be evaluated in O(log(N)) time • Regular Expression recognition in O(log(N)) • Any streaming operator with finite state • Where the input stream is independent of the output stream • Can be run arbitrarily fast by using parallel-prefix on FSM evaluation

  40. Saturated Addition • S(i+1)=max(min(I(i)+S(i), maxval), minval) • Could model as FSM with: • |S|=maxval-minval • So, in theory, FSM result applies • …but |S| might be 2^16, 2^24

  41. SATADD Composition • Can compute composition efficiently [Papadantonakis et al. FPT2005]
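The paper's exact encoding is not reproduced here, but one way to see why the composition is cheap: a saturating add is a clamped offset, and clamped offsets are closed under composition, so a prefix node can carry just three numbers (offset, low clamp, high clamp) instead of a 2^16-entry table. A hedged sketch under that assumption:

```python
from functools import reduce

MINVAL, MAXVAL = 0, 255            # assumed 8-bit saturation range

def clamp(x, lo, hi):
    return min(max(x, lo), hi)

def satadd(offset):
    """One saturating-add step, represented as an (offset, lo, hi) triple."""
    return (offset, MINVAL, MAXVAL)

def compose(f, g):
    """Triple equivalent to applying f first, then g (still a clamped offset)."""
    (a, l1, h1), (b, l2, h2) = f, g
    return (a + b, clamp(l1 + b, l2, h2), clamp(h1 + b, l2, h2))

def apply_fn(f, s):
    a, lo, hi = f
    return clamp(s + a, lo, hi)

# Composing the whole chain matches step-by-step saturating evaluation.
steps = [satadd(d) for d in (100, 90, -30, 120, -400, 50)]
s_serial = 0
for f in steps:
    s_serial = apply_fn(f, s_serial)
assert apply_fn(reduce(compose, steps), 0) == s_serial
```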

  42. SATADD Composition

  43. SATADD Reduce Tree

  44. Data Forwarding UltraScalar From Henry, Kuszmaul, et al. ARVLSI’99, SPAA’99, ISCA’00

  45. Consider Machine • Each FU has a full RF • FU=Functional Unit • RF=Register File • Build network between FUs • use network to connect producers/consumers • use register names to configure interconnect • Signal data ready along network

  46. Ultrascalar: concept model

  47. Ultrascalar Concept • Linear delay • O(1) register cost / FU • Complete renaming at each FU • different set of registers • so when we say complete RF at each FU, that’s only the logical registers

  48. Ultrascalar: cyclic prefix

  49. Parallel Prefix • Basic idea is one we saw with adders • An FU will either • produce a register (generate) • or transmit a register (propagate) • can do tree combining • pair of FUs will either both propagate or will generate • compute function by pair in one stage • recurse to next stage • get log-depth tree network connecting producer and consumer
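A flattened sketch of the per-register generate/propagate idea: each FU's effect on the register-file view is a small table of the registers it writes, and composing two adjacent spans just lets later writes shadow earlier ones. The dict encoding and FU/register names are hypothetical; the real design evaluates this with a log-depth tree rather than the serial loop shown.

```python
def fu_transform(fu_index, writes):
    """One FU's effect on the register-file view: every register it
    writes is 'generated' (tagged with the producing FU); every other
    register is 'propagated' unchanged."""
    return {reg: fu_index for reg in writes}

def compose(earlier, later):
    """Combine two adjacent spans of FUs: later writes shadow earlier ones."""
    merged = dict(earlier)
    merged.update(later)
    return merged

# Hypothetical 4-FU window: FU0 writes r1, FU1 writes r2, FU2 writes r1, FU3 writes r3.
window = [fu_transform(0, ['r1']), fu_transform(1, ['r2']),
          fu_transform(2, ['r1']), fu_transform(3, ['r3'])]

# Prefix over the window: which FU should forward each register to FU3.
view = {}
for t in window[:3]:          # producers strictly before FU3
    view = compose(view, t)
assert view == {'r1': 2, 'r2': 1}
```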

  50. Ultrascalar: cyclic prefix
