CS137: Electronic Design Automation Day 13: May 20, 2002 Page Generation (Area and IO Constraints) [working problem with Eylon Caspi]
Today • Cover/clustering • Minimize Weight • W/ area and IO constraints • Motivation: SCORE Page generation • Also energy minimization • Techniques • Current Results • FPGA/hardware implementation?
Abstract Problem • Given: Graph (V,E) with a single weight (area) on each node and two weights (IO, cost) on the edges. • Cluster nodes into subsets Vi, such that • Σ Cost(Vi) is minimized • IO(Vi) < IO limit • A(Vi) < Area limit • Cost(Vi) = Σ (cost(e) | e ∈ E s.t. e1 ∈ Vi and e2 ∉ Vi)
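The objective and the two constraints can be checked for any candidate clustering with a few lines of code. The following is a minimal sketch, not the course's implementation: the function name, the dictionary layout, and the choice to treat edges as directed (e1 is the source, e2 the sink, so each cut edge is charged once) are assumptions made for illustration.

```python
from collections import defaultdict

def evaluate_clustering(node_area, edges, cluster_of, area_limit, io_limit):
    """Return (total_cost, feasible) for a candidate clustering.

    node_area:  {node: area}
    edges:      list of (src, dst, io, cost) tuples (directed: an assumption)
    cluster_of: {node: cluster id}
    """
    area = defaultdict(float)
    io = defaultdict(float)
    cost = defaultdict(float)
    for node, a in node_area.items():
        area[cluster_of[node]] += a
    for src, dst, e_io, e_cost in edges:
        cs, cd = cluster_of[src], cluster_of[dst]
        if cs != cd:
            # A cut edge uses IO on both clusters it touches.
            io[cs] += e_io
            io[cd] += e_io
            # Cost(Vi): e1 in Vi and e2 not in Vi, charged to the source cluster.
            cost[cs] += e_cost
    clusters = set(cluster_of.values())
    feasible = all(area[c] < area_limit and io[c] < io_limit for c in clusters)
    return sum(cost.values()), feasible

# Hypothetical three-node example: a and b share a cluster, c is alone.
node_area = {"a": 2, "b": 2, "c": 3}
edges = [("a", "b", 1, 0.9), ("b", "c", 1, 0.1), ("c", "a", 1, 0.1)]
cluster_of = {"a": 0, "b": 0, "c": 1}
total, ok = evaluate_clustering(node_area, edges, cluster_of,
                                area_limit=5, io_limit=3)
```

Keeping the high-cost edge a→b inside one cluster is what keeps the total low here; only the two low-cost edges are cut.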
SCORE Compilation • Programming Model: graph of TDF FSMD operators • unlimited size, # IOs • no timing constraints • Execution Model: graph of page configs • fixed size, # IOs • timed, single-cycle firing • [Figure: TDF operators, streams, and memory segments compiled into compute pages, streams, and memory segments]
How Big is an Operator? • [Chart: operator sizes for JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), MPEG Encode, Wavelet Encode, Wavelet Decode, IIR]
Clustering is Critical • Inter-page comm. latency may be long • Inter-page feedback loops are slow • Cluster to: • Fit feedback loops within page • Fit feedback loops on device
Pipeline Extraction • Hoist uncontrolled feed-forward (FF) data-flow out of FSMD • Benefits: • Shrink FSM cyclic core • Extracted pipeline has more freedom for scheduling and partitioning • Example: before extraction, state foo(i): acc=acc+2*i • After extraction, a pipeline computes two_i = i*2 and the FSMD keeps state foo(two_i): acc=acc+two_i
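The slide's example can be mimicked in software to show that the transformation preserves behavior. This is only an illustrative sketch: the function names are hypothetical, and Python generators stand in for SCORE streams.

```python
def fsmd_before(inputs):
    """Original FSMD: the multiply sits inside the state's feedback loop."""
    acc = 0
    for i in inputs:
        acc = acc + 2 * i          # state foo(i): acc = acc + 2*i
    return acc

def extracted_pipeline(inputs):
    """Feed-forward stage hoisted out of the FSMD: two_i = 2*i."""
    for i in inputs:
        yield 2 * i

def fsmd_after(two_i_stream):
    """Residual FSMD: only the accumulate remains in the cyclic core."""
    acc = 0
    for two_i in two_i_stream:
        acc = acc + two_i          # state foo(two_i): acc = acc + two_i
    return acc

samples = [1, 2, 3]
same = fsmd_before(samples) == fsmd_after(extracted_pipeline(samples))
```

The multiply no longer depends on `acc`, so it can be pipelined or placed on another page without lengthening the feedback loop.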
Pipeline Extraction – Extractable Area • [Chart: extractable area for JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
Page Generation • Pipeline extraction • removes dataflow that can be freely extracted from FSMD control • Still have to partition potentially large FSMs • approach: turn into a clustering problem
State Clustering • Start: consider each state to be a unit • Cluster states into page-size sub-FSMDs • Inter-page transitions become streams • Possible clustering goals: • Minimize delay (inter-page latency) • Minimize IO (inter-page BW) • Minimize area (fragmentation)
State Clustering to Minimize Inter-Page State Transfer • Inter-page state transfer is slow • Cluster to: • Contain feedback loops • Minimize frequency of inter-page state transfer • Previously used in: • VLIW trace scheduling [Fisher ‘81] • FSM decomposition for low power [Benini/DeMicheli ISCAS ‘98] • VM/cache code placement • GarpCC code selection [Callahan ‘00]
Clustering Problem • SCORE Page • Fixed area (# of LUTs) • Fixed IO • Cost on an edge is the probability of taking that state transition • Clustering goal is to minimize page-to-page transitions • Maximize expected transitions within same page • Find page-count/page-transition tradeoff curve
Abstract Problem [here: clusters = pages, cost = inter-page communication frequency] • Given: Graph (V,E) with a single weight (area) on each node and two weights (IO, cost) on the edges. • Cluster nodes into subsets Vi, such that • Σ Cost(Vi) is minimized • IO(Vi) < IO limit • A(Vi) < Area limit • Cost(Vi) = Σ (cost(e) | e ∈ E s.t. e1 ∈ Vi and e2 ∉ Vi)
DSM • Possibly relevant for minimizing delay in deep submicron (DSM) • Previously discussed: • Larger area → longer wires, slower • Want to cluster logic locally • Maybe: • Cluster common computations together • Make distant computation transfer uncommon
Island Packing for Energy • Modern FPGAs pack a cluster of LUTs into an endpoint • e.g. Altera LAB • Local wiring costs less energy than long wiring • Covering for energy: • minimize exposed activity factor • same covering problem
Abstract Problem [here: clusters = islands, cost = switching activity] • Given: Graph (V,E) with a single weight (area) on each node and two weights (IO, cost) on the edges. • Cluster nodes into subsets Vi, such that • Σ Cost(Vi) is minimized • IO(Vi) < IO limit • A(Vi) < Area limit • Cost(Vi) = Σ (cost(e) | e ∈ E s.t. e1 ∈ Vi and e2 ∉ Vi)
First Try • Use FBB (flow cut) [Wong/cs137a:day7] • Pick seed element • Compute mincut • On mix of IO, cost edge weights? • If too small: cluster node in and repeat • Else: cluster node out and repeat
Mincut lessons • Couldn’t consistently control IO • Non-monotonic results adjusting weight • Not clear what to cluster in
Idea #2 • If we had an ordering of nodes • (wishful thinking) • Then easy to know how to include more • Just pick the next node • Order: 1D list of nodes • Cluster: a contiguous sequence of nodes in list • Specify start, finish
From Sequence to Clusters • Easy to know if a contiguous subsequence • Meets area constraints • Meets IO constraints • Cover • Set of (non-overlapping) subsequences • Includes all nodes
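Once a linear order is fixed, both checks reduce to simple scans over an interval of the list. A minimal sketch follows, assuming half-open intervals `order[start:end]` and undirected `(u, v, io)` edge tuples; the function names and data layout are hypothetical.

```python
def interval_area_io(order, node_area, edges, start, end):
    """Area and IO of the cluster order[start:end] (end exclusive).

    edges: list of (u, v, io) tuples; an edge counts toward IO when
    exactly one of its endpoints lies inside the interval.
    """
    inside = set(order[start:end])
    area = sum(node_area[n] for n in inside)
    io = sum(w for u, v, w in edges if (u in inside) != (v in inside))
    return area, io

def is_valid_cover(order, intervals):
    """True if the (start, end) intervals are disjoint and include every node."""
    pos = 0
    for start, end in sorted(intervals):
        if start != pos or end <= start:
            return False
        pos = end
    return pos == len(order)

order = ["a", "b", "c", "d"]
node_area = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = [("a", "b", 2), ("b", "c", 1), ("c", "d", 3)]
first_half = interval_area_io(order, node_area, edges, 0, 2)
```

Here the interval holding a and b has area 2 and exposes only the b–c edge, so its IO is 1.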
Covering • Not clear when to put more or less stuff in a cluster versus leaving it for the next cluster • Can’t build clusters greedily • Like the associative/parenthesization problem we saw earlier [day 5]
Day 5 Parenthesis Matching • Similar • But compute from all breaks across a diagonal • Not just nearest neighbor • Hence extra O(N)
Dynamic Programming • For each subsequence (start, end) • Either the area and IO fit in a single cluster • OR want to find a breakpoint between cluster sets • Cluster sets start→midpoint, midpoint→end may each be either single or multiple clusters • Different splits may • Minimize number of clusters • Minimize cost • Keep dominator set [day11]
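The table-filling scheme described here can be sketched as follows. This is an illustration under assumptions, not the course code: `fits` and `cut_cost` are hypothetical callbacks supplied by the caller, and the "dominator set" is interpreted as the set of non-dominated (number of clusters, total cost) pairs.

```python
def pareto(points):
    """Keep only non-dominated (num_clusters, cost) pairs."""
    front = []
    for count, cost in sorted(set(points)):
        if not front or cost < front[-1][1]:
            front.append((count, cost))
    return front

def cover(n, fits, cut_cost):
    """Pareto set of (num_clusters, total_cost) covering positions 0..n.

    fits(s, e):     True if order[s:e] meets the area and IO limits
    cut_cost(s, e): Cost(Vi) of order[s:e] taken as a single cluster
    """
    table = {}
    for length in range(1, n + 1):             # shortest subsequences first
        for s in range(0, n - length + 1):
            e = s + length
            options = [(1, cut_cost(s, e))] if fits(s, e) else []
            for m in range(s + 1, e):          # try every breakpoint
                for n1, c1 in table[(s, m)]:
                    for n2, c2 in table[(m, e)]:
                        options.append((n1 + n2, c1 + c2))
            table[(s, e)] = pareto(options)
    return table[(0, n)]

# Hypothetical chain of 4 nodes; w[k] is the cost of the edge k -> k+1,
# and at most 2 nodes fit in a cluster.
w = [1, 10, 1]
result = cover(4,
               fits=lambda s, e: e - s <= 2,
               cut_cost=lambda s, e: w[e - 1] if e < 4 else 0)
```

In this example two clusters must cut the expensive middle edge, while three clusters can avoid it, so both points survive on the tradeoff curve.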
Algorithm • Compute Linear Order • Compute IO, Area on each subsequence • Think NxN table (but sparse) • Use Dynamic Programming to cover
Compute Order? • Could experiment with various techniques • Considering: Spectral Ordering • [Hall/cs137a:day7] • How to weight edges? • IO, cost, mix? • Try linear mix…vary mix weighting
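One way to try the linear mix is to build a Laplacian from weights alpha*IO + (1-alpha)*cost and sort nodes by the eigenvector of its second-smallest eigenvalue, which is the usual form of Hall's spectral ordering. The sketch below is an assumption-laden illustration: it uses plain power iteration on a shifted Laplacian instead of a library eigensolver, and the function name and the `alpha` and `iters` parameters are hypothetical.

```python
def spectral_order(n, edges, alpha=0.5, iters=1000):
    """Order nodes 0..n-1 by an approximate second Laplacian eigenvector.

    edges: list of (u, v, io, cost); each edge weight is the linear mix
    alpha*io + (1-alpha)*cost, with alpha the knob to vary.
    """
    if n < 2:
        return list(range(n))
    w = [[0.0] * n for _ in range(n)]
    for u, v, io, cost in edges:
        mix = alpha * io + (1.0 - alpha) * cost
        w[u][v] += mix
        w[v][u] += mix
    deg = [sum(row) for row in w]
    shift = 2.0 * max(deg) + 1e-9        # bound on Laplacian eigenvalues
    x = [float(i) for i in range(n)]     # deterministic start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]      # project out the constant eigenvector
        # y = (shift*I - L) x with L = D - W
        y = [(shift - deg[i]) * x[i] + sum(w[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(yi * yi for yi in y) ** 0.5
        if norm == 0.0:
            break
        x = [yi / norm for yi in y]
    return sorted(range(n), key=lambda i: x[i])

# Hypothetical chain 2 - 0 - 3 - 1 with unit IO and cost on every edge.
chain = [(2, 0, 1, 1), (0, 3, 1, 1), (3, 1, 1, 1)]
order = spectral_order(4, chain)
```

For a simple chain the ordering recovers the chain itself, in one direction or the other. Sweeping `alpha` from 0 to 1 gives one candidate order per mix weighting.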
Weight Mix • Why unclear? • IO weight is good for clustering connectivity • If IOs are the limit, allows us to use fewer clusters • Pack more stuff into a page → fewer cases where we need to transition • Cost weight is what we’re minimizing • Cluster high cost edges together • Hide in page • But, cost ordering may get less stuff in a page if poorly IO clustered…
spp results • [see HTML]
Discussion • Promising Results • New capability, so not clear what to compare to • Maybe LUT clustering to validate algorithm • Absolutes look promising • Weighting • Not clear how to search for best • Maybe should try other ways of weighting? • [Michael suggests trying log(trans)]
Spatial/Hdw Implementation? • Compute Linear Order • Use 1D FDSA? • Compute IO, Area on each subsequence • Parallel prefix sum scan • One for each start point? • Use Dynamic Programming to cover • Like parenthesis • Maybe 1D and combine with area/io scan?
Promising Ideas • Compute good ordering • Easy to vary inclusion when know what’s next to include/exclude • Mix weights • Cluster to minimize exposed (cut) costs