ESLT: The next generation of Design Automation Tools
Agenda • Goal of ESL tools • History • Motivation for ESLT in these times • The USU-ESLT • On-going research • Conclusions
Goal of ESL tools • To automate the generation of SoC solutions from high-level programming languages (HLPL) such as C, C++, and Java • To reduce the design time of digital circuits from months to weeks • Initial VHDL generation should complete in minutes • Functional verification/testing may take a few weeks
History of Electronic System Level tools • Tools • Cones (1988) • HardwareC (Stanford) • Transmogrifier C • SystemC • C2Verilog (1998) • Handel-C • Bach C • SpecC • Trident C (LANL) • SPARK (UCI) • CASH (CMU) • Mitrion-C • Impulse C (2004) • Catapult C (2006, Mentor Graphics) • Challenges • C is a sequential programming language • What does a pointer or dynamic memory allocation mean in hardware? • Recursion • Floating-point arithmetic • How is I/O represented? • How are hardware design parameters introduced? • Solutions • Support only a subset of C • User-specified parallelism • User-specified I/O • Extensive use of macros to guide circuit generation
Motivation for Renewed Research in ESL tools after 20 years of failure? • 2 Primary Trends • Panic in the Microprocessor Industry • Next-generation chips from Intel, AMD, and Apple are all multi-core with integrated heterogeneous components • Hennessy/Patterson guidelines are no longer good enough • Renewed rigor in computer architecture research • Systems on a Chip are far too complicated to explore architecture options at RTL • Emergence of FPGAs as a viable computing entity • Industry-accepted platforms for architecture prototyping and research • Extremely complicated to explore VLSI architecture options at RTL
Our Approach: It’s a workbench • Restrict the ESL tool to a small set of algorithms that need acceleration beyond what microprocessors can provide • Take advantage of user expertise in describing a template for the architecture • Let the tool explore low level architecture optimization • Take advantage of gcc optimizations • Ability to integrate 3rd party IP cores
Example

gcc intermediate representation:

;; Function anneal (anneal)

anneal (current)
{
  int next[10];
  int next_val;
  int current_val;
  float temperature;
  double D.3292;
  float D.3291;
  int D.3290;

  # BLOCK 0
  # PRED: ENTRY (fallthru)
  temperature = 1.0e+4;
  current_val = 2147483647;
  goto <bb 2> (<L1>);
  # SUCC: 2 (fallthru)

  # BLOCK 1
  # PRED: 2 (true)
<L0>:;
  copy (current, &next);
  alter (&next);
  D.3290 = evaluate (&next);
  next_val = D.3290;
  accept (&current_val, next_val, current, &next, temperature);
  D.3291 = adjustTemperature ();
  temperature = D.3291;
  # SUCC: 2 (fallthru)

  # BLOCK 2
  # PRED: 0 (fallthru) 1 (fallthru)
<L1>:;
  D.3292 = (double) temperature;
  if (D.3292 > 1.00000000000000004792173602385929598312941379845e-4) goto <L0>; else goto <L2>;
  # SUCC: 1 (true) 3 (false)

  # BLOCK 3
  # PRED: 2 (false)
<L2>:;
  return;
  # SUCC: EXIT
}

C source:

void anneal(int *current)
{
    float temperature;
    int current_val, next_val;
    int next[MAX_EVENTS];

    current_val = RAND_MAX;
    while (temperature > STOP_THRESHOLD) {
        copy(current, next);
        alter(next);
        next_val = evaluate(next);
        accept(&current_val, next_val, current, next, temperature);
        temperature = adjustTemperature();
    }
}

• Problem: Given a circuit specification consisting of a set of components (adders / multipliers / etc.), estimate the FPGA resources (slices / BRAMs / DSP48s) used
• Solution: Create a fifth-order equation for each (component, resource type) pair, representing usage as a function of data width
• Done using discrete values and the Matlab curve-matching feature
• A fifth-order equation is necessary for adequate estimation
• y = C_5 n^5 + C_4 n^4 + C_3 n^3 + C_2 n^2 + C_1 n + C_0
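A minimal sketch of how such a fifth-order model could be evaluated once the coefficients are known. The coefficient values, the `models` table, and the function names below are hypothetical placeholders for illustration; they are not the fitted values from the Matlab curve matching.

```python
def estimate_usage(coeffs, width):
    """Evaluate y = C5*n^5 + C4*n^4 + ... + C1*n + C0 using Horner's rule.

    coeffs is ordered from C5 down to C0; width is the data width n.
    """
    result = 0.0
    for c in coeffs:
        result = result * width + c
    return result


def estimate_circuit(components, models):
    """Sum the per-component estimates for one resource type.

    components is a list of (component kind, data width) pairs.
    """
    return sum(estimate_usage(models[kind], width) for kind, width in components)


# Hypothetical coefficients (C5..C0) for slice usage; assumed for illustration.
models = {
    "adder": [0.0, 0.0, 0.0, 0.0, 0.5, 1.0],
    "multiplier": [0.0, 0.0, 0.0, 0.25, 0.0, 2.0],
}

total_slices = estimate_circuit([("adder", 32), ("adder", 16)], models)
print(total_slices)
```

One model of this shape would be kept per (component, resource type) pair, so a full estimate repeats the sum for slices, BRAMs, and DSP48s.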
List Scheduling • Also known as “Critical Path Scheduling” • Given a set of resources, determines the time needed to complete a set of operations represented as a dependency graph • Assigns a static priority to each node in the graph • Schedules the nodes according to priority • Static priorities are assigned by measuring the “distance” between the node in question and a sink node
List Scheduling Example Schedule DFG on one multiplier, one adder, and one divider Multiplication and division take two cycles each (non-pipelined), addition takes one
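A minimal sketch of list scheduling under the resource mix stated above (one multiplier, one adder, one divider; multiplication and division take two cycles, addition one, none pipelined). The data flow graph in `ops` and `edges` is a hypothetical stand-in, since the DFG from the slide figure is not reproduced here, and the function names are assumptions. The priority of a node is taken as the longest latency-weighted path from that node to a sink.

```python
def static_priorities(succs, latency):
    """Priority of a node = longest latency-weighted path from it to a sink."""
    memo = {}

    def dist(n):
        if n not in memo:
            memo[n] = latency[n] + max(
                (dist(s) for s in succs.get(n, [])), default=0)
        return memo[n]

    return {n: dist(n) for n in latency}


def list_schedule(ops, edges, unit_latency, units):
    """ops: node -> unit type; edges: (producer, consumer) pairs of a DAG;
    unit_latency: unit type -> cycles; units: unit type -> number of units."""
    latency = {n: unit_latency[t] for n, t in ops.items()}
    succs, preds = {}, {n: [] for n in ops}
    for a, b in edges:
        succs.setdefault(a, []).append(b)
        preds[b].append(a)
    prio = static_priorities(succs, latency)
    start, finish = {}, {}
    # Each unit is non-pipelined: it is busy until its current op finishes.
    busy_until = {t: [0] * k for t, k in units.items()}
    time = 0
    while len(start) < len(ops):
        ready = [n for n in ops if n not in start
                 and all(p in finish and finish[p] <= time for p in preds[n])]
        # Highest priority first; the node name breaks ties deterministically.
        for n in sorted(ready, key=lambda n: (-prio[n], n)):
            slots = busy_until[ops[n]]
            for i, free_at in enumerate(slots):
                if free_at <= time:
                    start[n] = time
                    finish[n] = time + latency[n]
                    slots[i] = finish[n]
                    break
        time += 1
    return start, max(finish.values())


# Hypothetical DFG: (m1 * ..., m2 * ...) -> a1 -> d1 -> a2, plus a free add a0.
ops = {"m1": "mul", "m2": "mul", "a0": "add",
       "a1": "add", "d1": "div", "a2": "add"}
edges = [("m1", "a1"), ("m2", "a1"), ("a1", "d1"), ("d1", "a2")]
start, total_cycles = list_schedule(
    ops, edges,
    unit_latency={"mul": 2, "add": 1, "div": 2},
    units={"mul": 1, "add": 1, "div": 1})
print(start, total_cycles)
```

With a single multiplier, the two multiplications are serialized, which is what stretches this hypothetical schedule to 8 cycles.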
List Scheduling • Heuristic method: does not guarantee an optimal schedule • Computational complexity of only O(Tn), where T is the number of time slots and n is the number of nodes to be scheduled • Improvement methods: Modified Critical Path, Earliest Time First, Dynamic Critical Path, Critical Node Parent Trees, Cone-Based Clustering, Partial Critical Path scheduling • All are O(n^2) to O(n^3): too expensive for use inside a simulated annealing loop
Solution – Ripple-List Scheduling
assign static priority to each node in graph
initialize time to 0
Loop while unscheduled nodes exist
    Loop until no nodes can be scheduled on this time step
        update list of ready nodes
        schedule highest priority node possible
        adjust priority of remaining nodes
    EndLoop
    increment time
EndLoop
Ripple Factor (Rf) • The degree of a vertex is the number of edges (both incoming and outgoing in the case of a directed graph) incident to it • DG = the largest vertex degree in the entire graph • d = distance between two vertices • Rf = 1 / DG^d
Ripple Factor • Example with DG = 3: the priorities of nodes that are one step away get updated by a ripple factor of 1/3^1, those that are two steps away get updated by 1/3^2, etc. • Priorities are adjusted dynamically, but never jump to another priority band • A maximum ripple distance is applied to cut off updates and save computation (<< O(n^2))
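A minimal sketch of computing the ripple factors around a node that has just been scheduled. It assumes that distance is counted in hops with edge direction ignored, and that the graph has no duplicate edges; the function name and the example graph are hypothetical. How the factor is then applied to each priority is not shown here.

```python
from collections import deque
from fractions import Fraction


def ripple_factors(edges, source, max_distance):
    """Return {node: 1 / DG**d} for nodes within max_distance hops of source."""
    neighbours, degree = {}, {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
        # Degree counts incoming and outgoing edges alike.
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    dg = max(degree.values())  # largest vertex degree in the entire graph

    # Breadth-first search, cut off at the maximum ripple distance.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_distance:
            continue
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n: Fraction(1, dg ** d) for n, d in dist.items() if d > 0}


# Hypothetical graph whose largest vertex degree is 3 (node "c").
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
factors = ripple_factors(edges, source="a", max_distance=2)
print(factors)
```

Because the search stops at `max_distance`, node "e" (three hops from "a") is never visited, which is the cut-off that keeps the update cost well below a full pass over the graph.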
Balancing Latency across Pipeline Stages through ILP extraction Goal: Maximize pipelined architecture performance within specified resource constraints A pipeline can only run as fast as the latency of the slowest stage An efficient pipeline will balance the latency of each stage as much as possible Some stages can be redesigned to support additional parallelism, others are fixed
Algorithm for Pipelined Processor DSE
• Generate the minimal set of ALUs needed for each stage in the pipeline
• Compute the latency of all stages (generate the architecture)
• Loop
    • Mark the stage with the “worst latency”
    • Reduce the latency of this stage through exploitation of parallelism
        1. until the “worst latency” can be passed to another stage
        2. if 1 is not possible, reduce latency as much as possible
    • Methods: intertwined simulated annealing (SA) and ripple-list scheduling (RLS) algorithms, or data-port width extension where applicable
• End Loop when the “worst latency” cannot be passed to another stage
Example • Generate minimal architecture for all stages • Copy: 101 cycles, 300 slices • Alter: 21 cycles, 390 slices • Evaluate: 233 cycles, 317 slices • Accept: 54 cycles, 1408 slices • Mark stage with “worst latency” (Evaluate, 233 cycles)
Example • Reduce the latency of this stage through exploitation of parallelism until “worst latency” can be passed to another stage • Evaluate: 233 cycles, 317 slices → 95 cycles, 777 slices • Achieved by allocation of additional resources
Example • New numbers for all stages • Copy: 101 cycles, 300 slices • Alter: 21 cycles, 390 slices • Evaluate: 95 cycles, 777 slices • Accept: 54 cycles, 1408 slices • Mark stage with “worst latency” (Copy, 101 cycles)
Example • Reduce the latency of this stage through exploitation of parallelism until “worst latency” can be passed to another stage • Copy: 101 cycles, 300 slices → 51 cycles, 600 slices • Achieved by widening memory ports to allow for 2-word transfers
Example • New numbers for all stages • Copy: 51 cycles, 600 slices • Alter: 21 cycles, 390 slices • Evaluate: 95 cycles, 777 slices • Accept: 54 cycles, 1408 slices • Repeat the process until FPGA resources are exhausted or no more parallelism can be extracted from the worst-performing stage
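A minimal sketch of the balancing loop, replayed on the cycle and slice numbers from the example. It simplifies the real flow in two ways that are assumptions of this sketch: the faster design points of each stage are given as a precomputed list rather than discovered by the scheduler, and the `slice_budget` value of 4000 is a hypothetical limit, not the capacity of the target device.

```python
def balance_pipeline(stages, upgrades, slice_budget):
    """stages: name -> (cycles, slices) for the minimal architecture.
    upgrades: name -> list of faster (cycles, slices) design points,
    in the order they should be tried."""
    stages = dict(stages)
    pending = {name: list(points) for name, points in upgrades.items()}
    while True:
        # Mark the stage with the worst latency.
        worst = max(stages, key=lambda s: stages[s][0])
        if not pending.get(worst):
            break  # no more parallelism can be extracted from this stage
        _, new_slices = pending[worst][0]
        used = sum(slices for _, slices in stages.values())
        if used - stages[worst][1] + new_slices > slice_budget:
            break  # FPGA resources exhausted
        stages[worst] = pending[worst].pop(0)
    return stages


minimal = {"Copy": (101, 300), "Alter": (21, 390),
           "Evaluate": (233, 317), "Accept": (54, 1408)}
# Design points taken from the example: more ALUs for Evaluate,
# wider memory ports for Copy.
upgrades = {"Evaluate": [(95, 777)], "Copy": [(51, 600)]}

balanced = balance_pipeline(minimal, upgrades, slice_budget=4000)
print(balanced)
```

The loop stops here because Evaluate becomes the worst stage again at 95 cycles and has no further design point, matching the second termination condition in the example.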
DSE Summary • Stage performance can be improved by allocating additional computational resources to a stage (adders, multipliers, etc.) • Stage performance can be improved by widening memory ports to accelerate block data transfers • Some stages cannot be improved, for example if the task does not have any ILP
Performance Xilinx V4-SX35
On-going Research: FLEX VLSI architecture • The FLEX (flexible processor) can perform either DFG 1 or DFG 2 computations • Designed by taking the union of DFG 1 and DFG 2 data flow graphs • The FLEX processor can switch modes dynamically, depending on computational needs • Branch probabilities from gcov can guide the FLEX design – DFGs executed more frequently should be more optimized • Considerably superior to Partial Dynamic Reconfiguration using Xilinx EAPR
On-going Technology Enhancement (1): FLEX Processor: Code Profiling using gcov
function main called 4 returned 100% blocks executed 100%
-: 5:{
-: 5-block 0
call 0 returned 100%
-: 5-block 1
branch 1 taken 86% (fallthrough)
branch 2 taken 14%
-: 5-block 2
-: 5-block 3
-: 5-block 4
branch 3 taken 86% (fallthrough)
branch 4 taken 14%
-: 5-block 5
-: 5-block 6
branch 5 taken 75% (fallthrough)
branch 6 taken 25%
-: 5-block 7
-: 5-block 8
branch 7 taken 86% (fallthrough)
branch 8 taken 14%
-: 5-block 9
-: 5-block 10
-: 5-block 11
…
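A minimal sketch of pulling the branch probabilities out of gcov text like the listing above, so they can be used as weights when deciding which DFG deserves more optimization. The function name and regular expression are assumptions written for this output format; lines that do not match the "branch N taken P%" pattern are simply skipped.

```python
import re

# Matches lines such as "branch 1 taken 86% (fallthrough)".
BRANCH_RE = re.compile(r"branch\s+(\d+)\s+taken\s+(\d+)%")


def branch_probabilities(gcov_text):
    """Map branch number -> probability that the branch was taken."""
    return {int(num): int(pct) / 100
            for num, pct in BRANCH_RE.findall(gcov_text)}


sample = """\
-: 5-block 1
branch 1 taken 86% (fallthrough)
branch 2 taken 14%
-: 5-block 6
branch 5 taken 75% (fallthrough)
branch 6 taken 25%
"""

probabilities = branch_probabilities(sample)
print(probabilities)
```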
Challenges: Hardware Verification • VHDL code can be compared with architecture description • Third-party Design Automation software used for synthesis, placement, debugging, verification, etc. • ChipScope Pro (Xilinx) • Timing closure • Improved metadata • Stringent constraint imposition on DSE