Thesis idea evaluation - Automatic configuration of ASIP cores
by Shobana Padmanabhan, June 23, 2004

Introduction
• ASIP (application-specific instruction-set processor): a parameterized, embedded soft core
  • In between custom and general-purpose designs
  • E.g. ArcCores, HP, Tensilica, LEON
• Advantages
  • Better application performance than a generic processor
  • Reuse of existing components
  • Lower cost than custom processors
• Goal: minimize application runtime

Methodology considerations
• Customize per application domain, not per application
• Base architecture + customizations
• Customizations
  • Increased number of functional units, registers, parallel memory accesses, pipeline depth, possibly new instructions, …
• Avoid exhaustive simulation
  • The number of configurations is exponential in the number of parameters
  • Simulating large data sets would be prohibitively time consuming…
• Constraints
  • FPGA: limited area (cost, power constraints)
• Architectural parameters are not independent

Methodology considerations
• Evaluation of the proposed methodology
  • Compare the resulting configuration and runtime with the hand-optimized configuration of benchmarks

Approach 1 - Compiler directed
• Compiler-directed customization of ASIP cores
  • by Gupta - UMD, Ko - Cornell, Barua - UMD
  • for the methodology
• Processor evaluation in an embedded systems design environment
  • by Gupta, Sharma, Balakrishna - IIT Delhi, Malik - Princeton
  • for details of the processor description language and architectural parameters
• Predicting performance potential of modern DSPs; Retargetable estimation scheme for DSP architecture selection
  • by Ghazal, Newton, Rabaey - UC Berkeley
  • use more advanced processor features and compiler optimizations

Methodology – basic idea
• Start with the base architecture
• Estimate application performance
• Vary the architecture (total area <= chip area) and find the best runtime
• To avoid (exhaustive) simulation
  • Estimate runtime for a given configuration
    • Use a profiler
    • When the configuration changes, re-compile but do not re-run
  • Change configuration, check area and infer the new runtime
    • By using statistical data on the inter-dependence of parameters

Approach
[Block diagram. Components: App; Profiler; Retargetable performance estimator; Base arch + space of proposed parameters; Area estimates & budget; Architecture exploration engine; Optimal architectural parameters]

Performance estimator
• runtime = sum over basic blocks of (profile-collected frequency of the block) * (scheduler-predicted runtime of that block)
• Basic blocks (straight-line instruction sequences, within which instructions can be scheduled in parallel) are identified by
  • converting to an internal format (SUIF, the Stanford University Intermediate Format, which provides libraries to extract such info)
• Execution frequencies of each basic block by
  • A compiler-inserted instruction increments a global variable for each basic block
• Number of clock cycles
  • A scheduler schedules each basic block to derive its execution time on the processor (taking into account all parameters)
  • A processor description is needed for this, and a language was developed (context-free grammar)
  • Scheduler combines this time with the frequencies of basic blocks to estimate overall runtime
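The estimate on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block names, frequencies and cycle counts are hypothetical, and `estimate_runtime` is not a function from the cited papers.

```python
from typing import Dict


def estimate_runtime(block_freq: Dict[str, int], block_cycles: Dict[str, int]) -> int:
    """Total cycles = sum over basic blocks of (execution count) * (scheduled cycles)."""
    return sum(freq * block_cycles[name] for name, freq in block_freq.items())


# Hypothetical profile: basic block -> number of times it executed.
profile = {"entry": 1, "loop_body": 1000, "exit": 1}

# Hypothetical scheduler output: basic block -> cycles, for two configurations.
cycles_base = {"entry": 4, "loop_body": 12, "exit": 2}
cycles_mac = {"entry": 4, "loop_body": 9, "exit": 2}  # e.g. with a MAC unit added

base_runtime = estimate_runtime(profile, cycles_base)
mac_runtime = estimate_runtime(profile, cycles_mac)
```

The profile is collected once; only the per-block cycle counts change when the configuration changes, which is why a re-compile (re-schedule) is enough and a re-run is not needed.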

Performance estimation, more formally
• Derive a runtime vs. parameter curve for each parameter (just recompile for every parameter value)
• Runtime = sum over basic blocks of (profile-collected frequency of the block) * (scheduler-predicted runtime of that block)
• Runtime_function(pi) = (runtime for pi) / (base runtime)
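Written out as equations, the two definitions above could look as follows. The symbols are assumptions introduced here for illustration: B is the set of basic blocks, f_b the profiled frequency of block b, t_b(c) the scheduler-predicted cycles of block b under configuration c, and c_base[p_i] the base configuration with only parameter i set to the value p_i.

```latex
\begin{align}
  \mathrm{Runtime}(c) &= \sum_{b \in B} f_b \, t_b(c) \\
  \mathrm{Runtime\_function}(p_i) &=
    \frac{\mathrm{Runtime}\bigl(c_{\mathrm{base}}[p_i]\bigr)}{\mathrm{Runtime}(c_{\mathrm{base}})}
\end{align}
```

Under this reading, Runtime_function equals 1 at the base value of a parameter and drops below 1 when the parameter value speeds up the application.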

Area estimation, formally
• Obtain an area vs. parameter curve for every parameter
• Area_function(pi) = additional gate area for pi

Retargetable performance estimator
• Profiler
  • Computes execution frequencies of each basic block
  • A compiler-inserted instruction increments a global variable for this
• Data flow graph builder, for scheduling
  • Directed acyclic graph (DAG) for a basic block: captures all dependencies (blocks run in sequence; operations within a block may run in parallel)
  • Priority of an operation is based on the height of that operation in the dependency graph
• Fine-grain scheduler estimates the number of clock cycles by taking into account the different architecture parameters
  • Schedules each basic block to derive its execution time on the processor
  • Combines this with frequencies to estimate overall runtime
  • List scheduling is a greedy method that picks the next instruction in the DAG in order of priority (operations on longer critical paths have higher priority)
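A minimal sketch of height-priority list scheduling for one basic block is given below. It is a simplification under stated assumptions, not the scheduler from the cited work: the machine is modelled only by a single `issue_width` (operations issued per cycle) and per-operation latencies, and the example DAG and its latencies are hypothetical.

```python
from typing import Dict, List


def heights(succs: Dict[str, List[str]], latency: Dict[str, int]) -> Dict[str, int]:
    """Height = longest latency-weighted path from an operation to the end of the DAG."""
    memo: Dict[str, int] = {}

    def h(op: str) -> int:
        if op not in memo:
            memo[op] = latency[op] + max((h(s) for s in succs.get(op, [])), default=0)
        return memo[op]

    for op in latency:
        h(op)
    return memo


def list_schedule(succs: Dict[str, List[str]], latency: Dict[str, int], issue_width: int) -> int:
    """Greedy list scheduling; returns the number of cycles the block takes."""
    prio = heights(succs, latency)
    preds_left = {op: 0 for op in latency}
    for targets in succs.values():
        for t in targets:
            preds_left[t] += 1
    earliest = {op: 0 for op in latency}  # first cycle at which all operands are ready
    ready = [op for op, n in preds_left.items() if n == 0]
    cycle, finish = 0, 0
    while ready:
        # Operations whose operands are available now, highest priority first.
        issuable = sorted((op for op in ready if earliest[op] <= cycle),
                          key=lambda op: -prio[op])[:issue_width]
        for op in issuable:
            ready.remove(op)
            done = cycle + latency[op]
            finish = max(finish, done)
            for s in succs.get(op, []):
                earliest[s] = max(earliest[s], done)
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)
        cycle += 1
    return finish


# Hypothetical block: add = (ld_a * ld_b) + ld_c
succs = {"ld_a": ["mul"], "ld_b": ["mul"], "ld_c": ["add"], "mul": ["add"]}
latency = {"ld_a": 2, "ld_b": 2, "ld_c": 2, "mul": 3, "add": 1}

cycles_single_issue = list_schedule(succs, latency, issue_width=1)
cycles_dual_issue = list_schedule(succs, latency, issue_width=2)
```

Re-running `list_schedule` with a different `issue_width` or latency table stands in for "re-compile for the new configuration": the DAG stays the same and only the machine model changes.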

Retargetable performance estimator
• Assumptions
  • All operations operate on operands in registers
  • Address computation for an array access is carried out by inserting explicit address-computation instructions

The processor description language
• Can express most embedded VLIW processors
  • Functional units in the data path, with their operations and corresponding latencies and delays
  • Constraints in terms of operation slots & slot restrictions
  • Number of registers, write buses, ports in memory
  • Delay of branch operations
  • Concurrent load/store operations
• Final operation delay = (delay of functional unit) * (delay of operation)
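The kind of information such a description carries can be sketched as a small data structure. This is not the grammar of the actual description language, which the slides do not show; every class name, field name and number below is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FunctionalUnit:
    name: str
    unit_delay: int            # delay of the functional unit itself
    op_delay: Dict[str, int]   # operation name -> delay of that operation

    def final_delay(self, op: str) -> int:
        """Final operation delay = (delay of functional unit) * (delay of operation)."""
        return self.unit_delay * self.op_delay[op]


@dataclass
class ProcessorDescription:
    units: List[FunctionalUnit]
    issue_slots: int           # operation slots per instruction word
    num_registers: int
    write_buses: int
    memory_ports: int          # limits concurrent load/store operations
    branch_delay: int


# Hypothetical VLIW-style configuration, for illustration only.
alu = FunctionalUnit("alu", unit_delay=1, op_delay={"add": 1, "sub": 1})
mul = FunctionalUnit("mul", unit_delay=2, op_delay={"mul": 1, "mac": 2})
base = ProcessorDescription(units=[alu, mul], issue_slots=5, num_registers=128,
                            write_buses=5, memory_ports=1, branch_delay=3)

mac_delay = mul.final_delay("mac")
```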

Architecture exploration engine
• Chooses optimal parameter values: a constrained optimization problem
  • Sum of all area_functions <= area_budget
• If the parameters were independent, pred_runtime (relative to the base runtime) = product of runtime_function over all parameters
• Since they are not, pred_runtime = (product of runtime_function over all parameters) / dependence_constant(p1, …, pn), where dependence_constant is …
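The constrained optimization above could be sketched as a plain enumeration over a small discrete parameter space. This is an assumption about how the search might be done, not the engine from the cited paper; all parameter names, areas, runtime ratios and the dependence table are hypothetical. Note that this enumerates the cheap analytical model, not simulations.

```python
from itertools import product
from math import prod


def explore(options, area_fn, runtime_fn, dependence_constant, area_budget):
    """Return (pred_runtime, area, config) with the lowest predicted runtime within budget."""
    names = list(options)
    best = None
    for values in product(*(options[n] for n in names)):
        config = dict(zip(names, values))
        area = sum(area_fn[n][v] for n, v in config.items())
        if area > area_budget:
            continue  # violates: sum of all area_functions <= area_budget
        independent = prod(runtime_fn[n][v] for n, v in config.items())
        predicted = independent / dependence_constant(config)
        if best is None or predicted < best[0]:
            best = (predicted, area, config)
    return best


# Hypothetical numbers, for illustration only.
options = {"mac": [False, True], "fpu": [False, True], "dual_port": [False, True]}
area_fn = {"mac": {False: 0, True: 3000},
           "fpu": {False: 0, True: 8000},
           "dual_port": {False: 0, True: 2000}}
runtime_fn = {"mac": {False: 1.0, True: 0.8},
              "fpu": {False: 1.0, True: 0.5},
              "dual_port": {False: 1.0, True: 0.9}}


def dependence_constant(config):
    # Assumed lookup: 1.0 unless this particular pair is enabled together.
    if config["mac"] and config["dual_port"]:
        return 0.95
    return 1.0


best = explore(options, area_fn, runtime_fn, dependence_constant, area_budget=10000)
```

With these made-up numbers the search settles on the FPU plus dual-ported memory, which exactly uses the area budget; adding the MAC as well would exceed it.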

Interdependence of parameters
• dependence_constant is a heuristic correction, one per combination of parameter values, that adjusts the predicted gain for that combination
  • Obtained by one-time, exhaustive simulation of standard benchmarks for each combination of parameters
• dependence_constant(p1, …, pn)
  • = 1 if pi = basei for all i
  • = 1 if only one parameter differs from its base value (pj != basej, and pi = basei for all i != j)
  • = (product of all runtime_function) / (actual_runtime for that combination) otherwise
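A small worked example of the last case follows. It assumes that actual_runtime is normalized to the base runtime in the same way as runtime_function, which the slide does not state explicitly; the numbers are invented.

```python
from math import prod


def dependence_constant(runtime_functions, actual_runtime_ratio):
    """(product of all runtime_function) / (actual runtime for the combination).

    Assumption: actual_runtime_ratio is the simulated runtime of the combined
    configuration divided by the base runtime.
    """
    return prod(runtime_functions) / actual_runtime_ratio


# Hypothetical: MAC alone gives 0.8x runtime, dual-ported memory alone gives 0.9x,
# but simulating both together gives 0.75x rather than 0.8 * 0.9 = 0.72x.
individual = [0.8, 0.9]
measured_combined = 0.75

k = dependence_constant(individual, measured_combined)
predicted = prod(individual) / k  # pred_runtime for this combination
```

Here `k` comes out slightly below 1, so dividing by it raises the independent-parameter prediction back to the simulated value for the benchmark it was calibrated on. The bet is that the same constant transfers to other applications in the domain.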

Evaluated parameters
• On the Philips TriMedia VLIW processor
  • Presence or absence of MAC
  • HW/SW floating point
  • Single- or dual-ported memory for parallel memory operations
  • Pipelined or non-pipelined memory unit

Other customizable parameters
• Register file size
• Number of architectural clusters
• Number and nature of functional units
• Presence of an address generation unit
• Optimized special operations
  • Multi-operation patterns
• Memory data packing/unpacking support
• Memory addressing support
• Control-flow support
• Loop-level optimizations
  • Loop-level optimized patterns
  • Loop vectorization
• Architecture-independent optimization

For DSP applications
• Functional unit composition
• Ignore: cache misses, branch mis-predictions, separation of register files (or functional unit banks), register allocation conflicts
• Register casting, if data-dependency interlocks exist in the architecture

Performance gain from INDIVIDUAL parameters
• Runtime_function for each benchmark application, for each of the chosen parameters: MAC, FPU, dual-ported memory, pipelined memory
[Figure from Gupta et al.]

Performance gain from COMBINED parameters
• Runtime_function for each benchmark application, for selected combinations of the chosen parameters
[Figure from Gupta et al.]

Dependence constants for the combinations
[Figure from Gupta et al.]

(DSP) FFT benchmark
[Figure from Gupta et al.]

Results
• Performance estimation error: 2.5%
• Recommended configuration is the same as the hand-optimized one

Profile the application & use app parameters to eliminate candidate processors or processor configurations

App parameters & relevant processor features
• Average block size
  • (Acceptable) branch penalty
• # of multiply-accumulate operations
  • MAC
• Ratio of address computation instructions to data computation instructions
  • Separate address generation ALU
• Ratio of I/O instructions to total instructions
  • Memory bandwidth requirements
• Average arc length in the data flow graph
  • Total # of registers
• Unconstrained ASAP scheduler results
  • Operation concurrency and lower bound on performance
• Assumptions
  • In the average block size module, instructions associated with condition code evaluation of conditional structures and loops are ignored
  • Each array instruction contributes to the total by twice the # of dimensions
  • Array accesses are assumed to point to data in memory…
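As an illustration of the "unconstrained ASAP scheduler" entry, the sketch below schedules every operation of one basic block as soon as its operands are ready, with no resource limits. The resulting length is a lower bound on the block's cycle count. The concurrency measure used here (operations divided by ASAP length) is an assumption, as are the example DAG and latencies.

```python
from typing import Dict, List


def asap_start_times(preds: Dict[str, List[str]], latency: Dict[str, int]) -> Dict[str, int]:
    """Start each operation as early as its predecessors allow (no resource limits)."""
    start: Dict[str, int] = {}

    def s(op: str) -> int:
        if op not in start:
            start[op] = max((s(p) + latency[p] for p in preds.get(op, [])), default=0)
        return start[op]

    for op in latency:
        s(op)
    return start


# Hypothetical block: add = (ld_a * ld_b) + ld_c
preds = {"mul": ["ld_a", "ld_b"], "add": ["mul", "ld_c"]}
latency = {"ld_a": 2, "ld_b": 2, "ld_c": 2, "mul": 3, "add": 1}

start = asap_start_times(preds, latency)
asap_length = max(start[op] + latency[op] for op in latency)  # lower bound on cycles
avg_concurrency = len(latency) / asap_length                  # assumed definition
```

No amount of added functional units can schedule this block in fewer than `asap_length` cycles, so the bound helps rule out configurations whose extra hardware could not pay off.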

Related work
• Related work evaluates exhaustively or in isolation; no cost-area analysis
• Commercial soft cores
  • User optimizes instruction set, addressing modes & sizes of internal memory banks; tool estimates area
• Gong et al.
  • Performance analyzer evaluates machine parallelism, number of buses & connectivity, memory ports; does not account for dependency
• Ghazal et al.
  • Predict runtime for advanced processor features & compiler optimizations such as optimized special operations, memory addressing support, control-flow support & loop-level optimization support
• Gupta et al.
  • Analyze application to select processor; no quantification of features; performance estimation through exhaustive simulation
• Kuulusa et al., Herbert et al., Shackleford et al.
  • Tools for architecture exploration by exhaustive search; evaluate instruction extensions
• Custom fit processors
  • Also exhaustive search, but targets a VLIW architecture: changeable memory sizes, register sizes, kinds and latencies of functional units, and clustered machines; speedup/cost graphs are derived for all combinations, yielding Pareto points

Other related papers
• Kuulusa et al., Herbert et al., Shackleford et al. evaluate extensions to the instruction set
• Managing multi-configuration hardware via dynamic working set analysis
  • By Dhodapkar, Smith, Wisc
• Reconfigurable custom computing as a supercomputer replacement
  • By Milne, University of South Australia

Discussion
