CS137: Electronic Design Automation Day 9: January 30, 2006 Parallel Prefix
Today • Bit-Level Addition • LUT Cascades • For Sums • Applications • FSMs • SATADD • Data Forwarding • Pointer Jumping • Applications
Introduction / Reminder: Addition in Log Time
Ripple Carry Addition • Simple “definition” of addition • Serially resolve carry at each bit
CLA • Think about each adder bit as computing a function on the carry in: c[i]=g(c[i-1]) • The particular function g will depend on a[i], b[i]: g=f(a[i],b[i])
Functions • What functions can g(c[i-1]) be? • g(x)=1 Generate • a[i]=b[i]=1 • g(x)=x Propagate • a[i] xor b[i]=1 • g(x)=0 Squash • a[i]=b[i]=0
Combining • Want to combine functions • Compute c[i]=g[i](g[i-1](c[i-2])) • Compute the composition of two functions • What functions will the composition of two of these functions be? • Same as before: propagate, generate, squash
Combining • Do it again… • Combine g[i-3,i-2] and g[i-1,i] • What do we get?
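The composition works out to a small closed table. Below is a minimal Python sketch (the encoding and names such as `compose` are my own, not from the slides) that derives the table by treating each bit's carry function as generate, propagate, or squash and checking that composing any two of them stays in that set.

```python
# Each adder bit's carry function g(c_in) is one of:
#   'G' generate    (carry out = 1),
#   'P' propagate   (carry out = carry in),
#   'K' squash/kill (carry out = 0).
FUNCS = {
    'G': lambda c: 1,
    'P': lambda c: c,
    'K': lambda c: 0,
}

def classify(f):
    """Name a carry function by its truth table on c_in in {0, 1}."""
    return {(1, 1): 'G', (0, 1): 'P', (0, 0): 'K'}[(f(0), f(1))]

def compose(hi, lo):
    """Combined effect of two adjacent bits: g_hi(g_lo(c))."""
    return classify(lambda c: FUNCS[hi](FUNCS[lo](c)))

if __name__ == "__main__":
    for hi in 'GPK':
        for lo in 'GPK':
            print(f"{hi} o {lo} = {compose(hi, lo)}")
    # Every result is again G, P, or K: the set is closed under composition,
    # so pairs (and pairs of pairs, ...) can be combined in a prefix tree.
```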
Associative Reduce Prefix • Shows us how to compute the Nth value in O(log(N)) time • Can actually produce all intermediate values in this time • w/ only a constant factor more hardware
Prefix Tree
Parallel Prefix • Important Pattern • Applicable any time operation is associative • Function Composition is always associative
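To make the pattern concrete, here is a small Python sketch of a log-depth prefix computation over an arbitrary associative operator (the recursive structure and the name `parallel_prefix` are my own illustration, not from the slides): each level's combines are independent of one another, so with enough hardware the depth is O(log(N)).

```python
from operator import add

def parallel_prefix(xs, op=add):
    """Inclusive prefix of xs under associative op, structured so that each
    'level' of the recursion could run in parallel (log-depth combining)."""
    n = len(xs)
    if n == 1:
        return list(xs)
    # Pair up neighbors: these combines are independent (one tree level).
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    # Recurse on the half-length problem.
    sub = parallel_prefix(pairs, op)
    # Fan the partial results back out (another parallel level).
    out = []
    for i in range(n):
        if i == 0:
            out.append(xs[0])
        elif i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], xs[i]))
    return out

if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    assert parallel_prefix(data) == [3, 4, 8, 9, 14, 23, 25, 31]
    print(parallel_prefix(data))
```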
Generalizing LUT Cascade
Cascaded LUT Delay Model • Tcascade = T(3LUT) + T(mux) • Don’t pay • General interconnect • Full 4-LUT delay
Parallel Prefix LUT Cascade? • Can we do better than N×Tmux? • Can we compute LUT cascade in O(log(N)) time? • Can we compute mux cascade using parallel prefix? • Can we make mux cascade associative?
Parallel Prefix Mux cascade • How can the mux transform S into mux-out? • A=0, B=0: mux-out=0 (Stop = S) • A=1, B=1: mux-out=1 (Generate = G) • A=0, B=1: mux-out=S (Buffer = B) • A=1, B=0: mux-out=/S (Invert = I)
Parallel Prefix Mux cascade • How can 2 muxes transform input? • Can I compute 2-mux transforms from 1 mux transforms?
Two-mux transforms (first mux, second mux → combined) • SS→S • SG→G • SB→S • SI→G • GS→S • GG→G • GB→G • GI→S • BS→S • BG→G • BB→B • BI→I • IS→S • IG→G • IB→I • II→B
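As a sanity check, the table above can be derived mechanically. A minimal Python sketch (my own encoding, not from the slides): each one-mux transform is a function of the incoming signal, and composing any two of them lands back in {Stop, Generate, Buffer, Invert}, which is what makes the cascade associative and prefix-friendly.

```python
# Each mux stage maps the incoming signal s (0 or 1) to its output:
#   'S' stop (always 0), 'G' generate (always 1),
#   'B' buffer (s),      'I' invert (1 - s).
TRANSFORMS = {
    'S': lambda s: 0,
    'G': lambda s: 1,
    'B': lambda s: s,
    'I': lambda s: 1 - s,
}

def classify(f):
    """Name a transform by its truth table on s in {0, 1}."""
    return {(0, 0): 'S', (1, 1): 'G', (0, 1): 'B', (1, 0): 'I'}[(f(0), f(1))]

def compose(first, second):
    """Effect of the signal passing through `first`, then `second`."""
    return classify(lambda s: TRANSFORMS[second](TRANSFORMS[first](s)))

if __name__ == "__main__":
    for a in 'SGBI':
        print(" ".join(f"{a}{b}->{compose(a, b)}" for b in 'SGBI'))
    # Expected: SS->S SG->G SB->S SI->G / GS->S GG->G GB->G GI->S
    #           BS->S BG->G BB->B BI->I / IS->S IG->G IB->I II->B
```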
Generalizing mux-cascade • How can N muxes transform the input? • Is mux transform composition associative?
Associative Reduce Mux-Cascade • Can be hardwired, no general interconnect
Prefix Sum • Common Operation: • Want B[x] such that B[x]=A[0]+A[1]+…+A[x] • Serially: for x=0 to N-1 • B[x]=B[x-1]+A[x] (with B[-1]=0)
Prefix Sum • Compute in tree fashion • A[I]+A[I+1] • A[I]+A[I+1]+A[I+2]+A[I+3] • … • Combine partial sums back down tree • S(0:7)+S(8:9)+S(10)=S(0:10)
Other simple operators • Prefix-OR • Prefix-AND • Prefix-MAX • Prefix-MIN
Find-First One • Useful for arbitration • Finds first (highest-priority) requestor • Also magnitude finding in numbers • How: • Prefix-OR the request bits, giving X • Locally compute X[I-1] xor X[I] • Flags the first one
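A minimal sketch of the recipe above (function names are mine; the prefix-OR is written as a serial loop standing in for the log-depth scan):

```python
def prefix_or(bits):
    """Inclusive prefix-OR (serial here; in hardware this is the scan that
    would run in log depth as a parallel prefix)."""
    out, acc = [], 0
    for b in bits:
        acc |= b
        out.append(acc)
    return out

def find_first_one(bits):
    """One-hot vector marking the first (highest-priority) request."""
    x = prefix_or(bits)
    # x[i-1] xor x[i] is 1 exactly where the prefix-OR first turns on.
    return [x[i] ^ (x[i - 1] if i > 0 else 0) for i in range(len(bits))]

if __name__ == "__main__":
    print(find_first_one([0, 0, 1, 0, 1, 1]))  # [0, 0, 1, 0, 0, 0]
```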
Arbitration • Often want to find first M requestors • E.g. Assign unique memory ports to first M processors requesting • Prefix-sum across all potential requesters • Counts requesters, giving unique number to each • Know if one of first M • Perhaps which resource assigned
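A toy sketch of this arbitration scheme (names and the serial loop are my own; the running count is exactly the prefix sum that would be computed in log depth):

```python
def grant_first_m(requests, m):
    """Give each of the first m requestors a unique resource number (0..m-1);
    everyone else gets None. The running count of earlier requestors is a
    prefix sum, so in hardware it can be computed in log depth."""
    grants, count = [], 0
    for r in requests:
        if r and count < m:
            grants.append(count)   # unique number = # of requestors before me
            count += 1
        else:
            grants.append(None)
    return grants

if __name__ == "__main__":
    print(grant_first_m([1, 0, 1, 1, 0, 1], m=2))
    # [0, None, 1, None, None, None]
```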
Partitioning • Use something to order • E.g. spectral linear ordering • …or 1D cellular swap to produce linear order • Parallel prefix on area of units • If not all same area • Know where the midpoint is
Channel Width • Prefix sum on delta wires at each node • To compute net channel widths at all points along channel • E.g. 1D ordered • Maybe use with cellular placement scheme
Rank Finding • Looking for the I’th ordered (smallest) element • Do a prefix-sum on the high bit only • Know m=number of things with leading zero (≤ 01111111…) • High/low search on result • I.e. if m > I, recurse on the half with leading zero • If m ≤ I, search for the (I-m)’th element in the half with the high bit set • Find the median in log²(N) time
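A small Python sketch of this bit-at-a-time search (my own reconstruction under the "I'th smallest" reading above; the per-round counts stand in for the prefix sums, and with N processors each round is O(log N)):

```python
def radix_select(values, k, width):
    """Return the k'th smallest of `values` (k counts from 0), examining one
    bit per round from the MSB down."""
    candidates = list(values)
    for bit in reversed(range(width)):
        zeros = [v for v in candidates if not (v >> bit) & 1]  # leading-zero half
        ones = [v for v in candidates if (v >> bit) & 1]       # high-bit half
        if k < len(zeros):      # target has this bit clear
            candidates = zeros
        else:                   # target has this bit set
            k -= len(zeros)
            candidates = ones
    return candidates[0]

if __name__ == "__main__":
    data = [13, 2, 7, 7, 11, 4, 0, 9]
    for k in range(len(data)):
        assert radix_select(data, k, width=4) == sorted(data)[k]
    print(radix_select(data, len(data) // 2, width=4))  # a median element
```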
FA/FSM Evaluation (regular expression recognition)
Finite Automata • Machine has finite state: S • On each cycle • Input I • Compute output and new state • Based on inputs and current state • Oi,S(i+1)=f(Si,Ii) • Intuitively, a sequential process • Must know previous state to compute next • Must know state to compute output
Function Specialization • But, this is just functions • …and function composition is associative • Given that we know the input sequence: • I0, I1, I2, … • Can compute specialized functions: • fi(s)=f(s,Ii) • What is fi(s)? • Worst-case, a translation table: • S=0 → NS0, S=1 → NS1, …
Function Composition • Now: O(i+m),S(i+m+1)= f(i+m)(f(i+m-1)(f(i+m-2)(…fi(Si)))) • Can we compute the function composition? • f(i+1,i)(s)=f(i+1)(fi(s)) • What is f(i+1,i)(s)? • A translation table just like fi(s) and f(i+1)(s) • Table of size |S|, can be filled in in O(|S|) time
Recursive Function Composition • Now: O(i+m),S(i+m+1)= f(i+m)(f(i+m-1)(f(i+m-2)(…fi(Si)))) • We can compute the composition • f(i+1,i)(s)=f(i+1)(fi(s)) • Repeat to compute • f(i+3,i)(s)=f(i+3,i+2)(f(i+1,i)(s)) • Etc. until have computed: f(i+m,i)(s) in O(log(m)) steps
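A minimal sketch of the idea (my own illustration, not from the slides), representing each specialized step as a translation table and composing the tables in a balanced tree:

```python
def specialize(delta, inp):
    """One step's translation table: new_state = delta[state][inp]."""
    return [delta[s][inp] for s in range(len(delta))]

def compose(later, earlier):
    """Table for later(earlier(s)) -- O(|S|) work, and associative."""
    return [later[earlier[s]] for s in range(len(earlier))]

def evaluate_by_composition(delta, inputs, s0):
    """Final state after consuming `inputs`, computed by composing the
    per-step tables in a balanced tree; every level's compositions are
    independent, giving O(log N) depth with enough hardware."""
    tables = [specialize(delta, i) for i in inputs]

    def reduce_tree(ts):
        if len(ts) == 1:
            return ts[0]
        mid = len(ts) // 2
        # Left half is applied first, right half afterwards.
        return compose(reduce_tree(ts[mid:]), reduce_tree(ts[:mid]))

    return reduce_tree(tables)[s0]

if __name__ == "__main__":
    # Toy 2-state FSM over inputs {0, 1}: delta[state][input] -> next state.
    # It tracks the parity of 1s seen (state 1 = odd so far).
    delta = [[0, 1], [1, 0]]
    inputs = [1, 0, 1, 1, 0, 1]
    s = 0
    for i in inputs:              # reference: plain sequential evaluation
        s = delta[s][i]
    assert evaluate_by_composition(delta, inputs, 0) == s
    print(s)
```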
Implications • If can get input stream, • Any FA can be evaluated in O(log(N)) time • Regular Expression recognition in O(log(N)) • Any streaming operator with finite state • Where the input stream is independent of the output stream • Can be run arbitrarily fast by using parallel-prefix on FSM evaluation
Saturated Addition • S(i+1)=max(min(Ii+Si,maxval),minval) • Could model as FSM with: • |S|=maxval-minval • So, in theory, FSM result applies • …but |S| might be 2^16, 2^24
SATADD Composition • Can compute composition efficiently [Papadantonakis et al. FPT2005]
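The cited paper gives the actual construction; what follows is only my own illustration of why a compact composition exists. If each specialized step is viewed as a clamp function s → min(max(s + a, lo), hi), then the composition of two clamps is again a clamp, so a composed function needs just three numbers instead of a |S|-entry translation table.

```python
MINVAL, MAXVAL = 0, 255   # assumed saturation bounds for this sketch

def clamp(x, lo, hi):
    return min(max(x, lo), hi)

def step(a):
    """Specialized function for one input a: s -> clamp(s + a).
    Represented as a triple (offset, lo, hi) rather than a big table."""
    return (a, MINVAL, MAXVAL)

def apply_fn(fn, s):
    a, lo, hi = fn
    return clamp(s + a, lo, hi)

def compose(later, earlier):
    """Triple for later(earlier(s)); closed form, O(1) work, associative."""
    a1, lo1, hi1 = earlier
    a2, lo2, hi2 = later
    return (a1 + a2,
            clamp(lo1 + a2, lo2, hi2),
            clamp(hi1 + a2, lo2, hi2))

if __name__ == "__main__":
    import random
    random.seed(0)
    for _ in range(1000):
        fs = [step(random.randint(-100, 100)) for _ in range(8)]
        s = random.randint(MINVAL, MAXVAL)
        ref = s
        for f in fs:              # reference: apply steps one at a time
            ref = apply_fn(f, ref)
        comp = fs[0]
        for f in fs[1:]:          # fold the triples (could be a prefix tree)
            comp = compose(f, comp)
        assert apply_fn(comp, s) == ref
    print("composition matches sequential evaluation")
```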
Data Forwarding • UltraScalar • From Henry, Kuszmaul, et al. ARVLSI’99, SPAA’99, ISCA’00
Consider Machine • Each FU has a full RF • FU=Functional Unit • RF=Register File • Build network between FUs • use network to connect producers/consumers • use register names to configure interconnect • Signal data ready along network
Ultrascalar Concept • Linear delay • O(1) register cost / FU • Complete renaming at each FU • different set of registers • so when we say complete RF at each FU, that’s only the logical registers
Parallel Prefix • Basic idea is one we saw with adders • An FU will either • produce a register (generate) • or transmit a register (propagate) • can do tree combining • pair of FUs will either both propagate or will generate • compute function by pair in one stage • recurse to next stage • get log-depth tree network connecting producer and consumer
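A toy software sketch of the produce/transmit idea (my own simplification, not the Ultrascalar hardware): for each register, every FU either generates a new value or propagates the value from its predecessor, and resolving "nearest earlier writer" for all FUs is a prefix computation per register.

```python
def visible_register_files(initial, writes):
    """What each FU slot sees at its inputs: for every register, the value
    from the nearest earlier writer, or the initial value if no earlier FU
    wrote it. `writes[i]` is a dict {reg: value} produced by FU i. Written
    serially here; per register this is a generate/propagate chain that a
    prefix tree could resolve in log depth."""
    seen = []
    rf = dict(initial)            # architectural registers before FU 0
    for w in writes:
        seen.append(dict(rf))     # this FU's view (propagate from the left)
        rf.update(w)              # generate: this FU's writes override
    return seen

if __name__ == "__main__":
    initial = {"r1": 10, "r2": 20}
    writes = [{"r1": 11}, {}, {"r2": 22}, {"r1": 13}]
    for i, rf in enumerate(visible_register_files(initial, writes)):
        print(f"FU{i} sees {rf}")
    # FU0 sees the initial values; FU3 sees r1 from FU0 and r2 from FU2.
```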