Automatic Application-Specific Customization of Soft Processor Microarchitecture. Shobana Padmanabhan, Roger D. Chamberlain, Ron K. Cytron, John D. Lockwood. Washington University. Funded by NSF under grant 03-13203. http://www.arl.wustl.edu/~sp3 Apr 26, 2006.
Outline • Motivation • Automatic optimization technique • a novel application of a standard optimization technique • Evaluation & Results
Constrained embedded applications • Embedded applications • Very restrictive FPGA and power constraints • Demanding application performance requirements • Requirement-constraint trade-offs • Soft processors • For application performance improvement • As prototype for custom hardware design
Soft processors • Parameterized general-purpose processors • Customization is a performance-cost trade-off • More “knobs” mean more options for customization • [Figure labels: FPGA resources, power, app performance; number of registers, set size, set associativity]
Soft processor customization • LEON: 10 reconfigurable subsystems • Instruction cache • Parameters: sets, set size, line size, replacement policy • 4 * 7 * 2 * 3 = 168 configurations (4 parameters; 16 values) • Data cache • Parameters: sets, set size, line size, replacement, fast read, fast write, local RAM, local RAM size • 168 * 2 * 2 * 2 * 7 = 9,408 configurations (8 parameters; 29 values) • Integer unit • Parameters: multiplier, registers, fast jump, fast decode, ICC, load delay, FPU enable, co-processor enable, hardware watchpoints • 119,040 configurations (10 parameters; 56 values) • Plus floating-point unit, memory controller, peripherals, … • In total: 190 parameter values; 5 × 10^24 configurations
Existing approaches • Scaling problems • Runtime measurement problems • Estimation is quick but inaccurate • Simulators are extremely slow
Highlights of our optimization technique • Customize “all” parameters • Parameter independence assumption • Linear with number of parameters • Build only hundreds of configurations instead of 5 × 10^24 • Search space still includes all 5 × 10^24 configurations • Feasible and scalable • Formulate as binary integer nonlinear optimization program • A novel application of a standard technique • Use “actual” costs, to be accurate
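To make the scaling claim concrete, here is a minimal C sketch that contrasts the size of the full design space (a product) with the number of builds needed when values are perturbed one at a time (a sum). The function names are hypothetical, and the build count assumes every parameter has exactly one base value that is already covered by the base configuration.

```c
#include <stddef.h>

/* Size of the full design space: the product of the number of values
 * each parameter can take. Returned as double because the product
 * overflows 64-bit integers for a whole processor. */
double count_configurations(const int *values_per_param, size_t n_params)
{
    double total = 1.0;
    for (size_t p = 0; p < n_params; p++)
        total *= (double)values_per_param[p];
    return total;
}

/* Builds needed when perturbing one value at a time: the base
 * configuration plus one build per non-base value of each parameter
 * (assumption: exactly one base value per parameter). */
long count_perturbation_builds(const int *values_per_param, size_t n_params)
{
    long builds = 1; /* the base configuration itself */
    for (size_t p = 0; p < n_params; p++)
        builds += values_per_param[p] - 1;
    return builds;
}
```

For the instruction cache value counts quoted in the deck (4, 7, 2, 3), the first function gives 168 and the second gives 13 under the stated assumption.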
Cost measurement • Application runtime • From direct execution • Hardware-based profiler • Non-intrusive, cycle-accurate, in “real-time” • Part of Liquid architecture platform • Runtime cost is application-specific • FPGA resources • In terms of LUTs and BRAM, from actual build • Takes >30 minutes • Harder than traditional optimization problems • Resource cost is processor-specific • Power (energy): future work
Outline • Motivation • Automatic optimization technique • a novel application of a standard optimization technique • Evaluation & Results
Our optimization technique • Start from the out-of-box soft processor as the base configuration • Assume parameter independence • Perturb parameter values one by one, build each configuration, and track its resource cost • Run the application on each configuration and track its runtime cost • Formulate the costs as a Binary Integer Nonlinear Program • Near-optimal in practice • Solve using TOMLAB/MATLAB
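The "perturb one by one" step can be sketched in C as follows. This is only an illustration: the function name, the encoding of a configuration as an array of value indices, and the ordering of perturbations are all assumptions, not the authors' tooling.

```c
#include <stddef.h>

/* Writes the k-th single-parameter perturbation of base[] into out[].
 * base[p] is the index of the base value of parameter p, and
 * num_values[p] is how many values that parameter can take.
 * Returns 1 on success, 0 when k is past the last perturbation. */
int nth_perturbation(const int *base, const int *num_values,
                     size_t n_params, size_t k, int *out)
{
    size_t seen = 0;
    for (size_t p = 0; p < n_params; p++) {
        for (int v = 0; v < num_values[p]; v++) {
            if (v == base[p])
                continue;            /* skip the base value itself */
            if (seen == k) {
                for (size_t q = 0; q < n_params; q++)
                    out[q] = base[q];
                out[p] = v;          /* change exactly one parameter */
                return 1;
            }
            seen++;
        }
    }
    return 0;
}
```

Calling it with k = 0, 1, 2, … until it returns 0 yields every configuration that would be built and measured, each differing from the base in a single parameter.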
Processor ICache reconfiguration • x_i = 0 or 1 (off or on) • No constraint needed
FPGA resource constraints • LUTs • BRAM • x_i = 0 or 1 (off or on) • r_i, l_i, b_i: delta costs from base configuration • n is the number of configurations
Optimization • Optimize application runtime • Also optimize resource utilization • x_i = 0 or 1 (off or on) • r_i, l_i, b_i: delta costs from base configuration • n is the number of configurations
Problem formulation recapped • Minimize • Subject to … … • Parameter validity constraints • FPGA resource constraints • Binary variables constraint • x_i = 0 or 1 (off or on) • r_i, l_i, b_i: delta costs from base configuration • n is the number of configurations
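The equations on these slides did not survive extraction. As a rough sketch only, a binary integer program consistent with the legend could take the shape below. Several things here are assumptions rather than the authors' exact formulation: reading r_i, l_i and b_i as the runtime, LUT and BRAM deltas; the bounds L_max and B_max (the LUT and BRAM headroom left over the base configuration); and the groups P_k (the x_i that are alternative values of the same parameter k).

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{i=1}^{n} r_i x_i \\
\text{subject to}\quad & \sum_{i \in P_k} x_i \le 1 \quad \text{for each parameter } k \\
                       & \sum_{i=1}^{n} l_i x_i \le L_{\max} \\
                       & \sum_{i=1}^{n} b_i x_i \le B_{\max} \\
                       & x_i \in \{0, 1\}, \quad i = 1, \dots, n
\end{align*}
```

This sketch is purely linear. The deck describes the actual program as nonlinear, so the real resource constraints contain terms that are not captured here.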
Evaluation & Results • Evaluate the impact of parameter independence assumption • Compare against exhaustive runs of a small subsystem • Dcache parameters of sets and setsize
Evaluation • Our technique selects the same configuration as the exhaustive runs • Despite the parameter independence assumption, the selected configuration is near-optimal
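The comparison can be illustrated with a small C sketch for a two-parameter subsystem such as dcache sets and set size. The function names and data layout are hypothetical, and only runtime is considered (no resource constraints). Under the independence assumption each parameter is chosen from its own one-at-a-time measurements, whereas the exhaustive baseline takes the minimum over the full table of measured runtimes.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the smallest element in v[0..n-1]. */
static size_t argmin(const double *v, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (v[i] < v[best])
            best = i;
    return best;
}

/* Independence-based choice: each parameter is picked from runtimes
 * measured while the other parameter stays at its base value. */
void pick_independent(const double *rt_by_sets, size_t n_sets,
                      const double *rt_by_size, size_t n_sizes,
                      size_t *sets_idx, size_t *size_idx)
{
    *sets_idx = argmin(rt_by_sets, n_sets);
    *size_idx = argmin(rt_by_size, n_sizes);
}

/* Exhaustive choice: rt is a row-major n_sets x n_sizes table with one
 * measured runtime per (sets, set size) combination. */
void pick_exhaustive(const double *rt, size_t n_sets, size_t n_sizes,
                     size_t *sets_idx, size_t *size_idx)
{
    size_t best = argmin(rt, n_sets * n_sizes);
    *sets_idx = best / n_sizes;
    *size_idx = best % n_sizes;
}
```

If both functions return the same pair of indices for the measured data, the independence assumption did not change the outcome for that subsystem.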
Highlights of results • 6.2 – 19.4% improvement in application performance • 2 - 3% savings in resources • Solutions customized simultaneously along many parameters • Customization is indeed application-specific
Conclusion • Our optimization technique • Linear with number of parameter values • Assuming parameter independence • Feasible, scalable • Near-optimal results in practice • Actual costs formulated as Binary Integer Nonlinear Program • A novel application of the technique • Only hours for configuration generation; seconds for optimization • Without any knowledge of architecture • Without any changes to application http://www.arl.wustl.edu/~sp3
Additional LEON-imposed constraints • LRR replacement with only 2-sets • LRU replacement with 2, 3, or 4-sets • x_i = 0 or 1 (off or on) • r_i, l_i, b_i: delta costs from base configuration • n is the number of configurations
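These two rules amount to a validity check on (replacement policy, number of sets) pairs. A minimal C sketch follows, with hypothetical names. The LRR and LRU cases come from the slide. The treatment of random replacement (allowed for 1 to 4 sets) is an assumption that the slide does not state.

```c
/* Replacement policies; the enum names are illustrative. */
enum replacement { REPL_RANDOM, REPL_LRR, REPL_LRU };

/* Returns 1 if the policy may be combined with the given number of
 * cache sets, 0 otherwise. */
int replacement_is_valid(enum replacement policy, int sets)
{
    switch (policy) {
    case REPL_LRR:
        return sets == 2;               /* LRR only with 2 sets */
    case REPL_LRU:
        return sets >= 2 && sets <= 4;  /* LRU with 2, 3, or 4 sets */
    case REPL_RANDOM:
        return sets >= 1 && sets <= 4;  /* assumption, not from the slide */
    }
    return 0;
}
```

In the binary program the same rules would appear as constraints that forbid switching on an invalid pair of x_i together.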
Additional FPGA resource constraints • Cache size = (#sets) * (set size) • Non-linear • icache setsize • dcache setsize • x_i = 0 or 1 (off or on) • r_i, l_i, b_i: delta costs from base configuration • n is the number of configurations
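To see why this constraint is nonlinear, here is a sketch in notation that is assumed rather than taken from the slides: let s_j be the candidate set counts with selector x_j, and z_k the candidate set sizes with selector y_k, where exactly one x_j and one y_k equal 1.

```latex
\begin{align*}
\text{sets} &= \sum_{j} s_j x_j, \qquad \text{set size} = \sum_{k} z_k y_k \\
\text{cache size} &= \Big(\sum_{j} s_j x_j\Big)\Big(\sum_{k} z_k y_k\Big)
                   = \sum_{j}\sum_{k} s_j z_k \, x_j y_k
\end{align*}
```

The products x_j y_k of two binary variables are what take the BRAM constraint out of the linear class, which fits the deck's choice of a binary integer nonlinear program.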
Optimization results • Customization is indeed application-specific.
Cost approximations of the solutions We overestimate runtime decrease, underestimate resource increase
Cost approximations of the solutions We overestimate chip resource decrease, underestimate runtime decrease (except Arith, where we match)
Result: cost approximation range, summary 0 to 19.75% -2 to 3% Nonlinear for LUTs is slightly worse Linear for BRAM is worse
LEON processor reconfiguration • Icache (Instruction cache) • Dcache (Data cache) • IU (Integer Unit)
Processor DCache reconfiguration • x_i = 0 or 1 (off or on)
Processor IU reconfiguration • x_i = 0 or 1 (off or on) • Valid parameters; next, fit on chip…
Optimization • Optimize application runtime • Also optimize resource utilization • Similarly, optimize power consumption • x_i = 0 or 1 (off or on) • r_i, l_i, b_i: delta costs from base configuration • n is the number of configurations
Future work • Further analysis • Improve cost approximations • To be optimal for all xi • To match actual costs closer • Extensions • Power optimization • Energy optimization • For applications with long runtimes, sampling technique • Run applications on an operating system • ISA reconfiguration • “Give back” • Integrate this with LEON… • Evaluate technique on other configuration/ feature management problems
Existing approaches • Compiler-directed customization of ASIP cores by Gupta et al. (2002) • Considers only 4 functional units; only DSP benchmarks • Tuning caches to applications for low-energy embedded systems by Ross et al. (2004) • Analytical (hierarchical) searching of parameters in their own dimensions, with some full parameter exploration to avoid local minima • Efficient architecture/compiler co-exploration for ASIPs by Fischer et al. (2002) • Considers only 3 architectural parameters, 4 compiler optimizations • Estimates chip costs • Towards automatic synthesis of a class of application-specific sensor networks by Bakshi et al. (2002) • Analytical model, followed by simulation-based refinement, but no optimization • Automatic generation of application specific processors by Goodwin et al. (2003) • Execution profiles to include/exclude new “instructions” • Shortcomings • Scaling problems • Runtime measurement problems • Estimating application performance through models is quick but inaccurate • Simulators are slow; hence scale down the application or limit to single execution
Applications • BLASTN • computation, memory intensive • Commbench DRR (Deficit Round Robin) • computation, memory intensive • Commbench FRAG • computation, memory intensive • BYTE Arith • computation intensive
main () {
    int index = 0, counter = 0, found = 0, matches = 0, *ans;
    unsigned int currentString = 348432612, base = 0, random = 0;
    // currentString above is used as a seed also
    ans = (int*)0x40000004; // memory location where the number of matches is stored
    for (index = 0; index < SIZE; index++) {
        hashTable[index] = 4194304;
    }
    fillQuery(NUM_QUERY); // populates the hash table
    // the loop below generates random bases for the database
    for (counter = 0; counter < NUM_DATABASE; counter++) {
        random = Rnd(&random);
        if (random <= MINT / 4) {
            base = 0;
        } else if (random <= MINT / 2) {
            base = 1;
        } else if (random <= ((MINT / 2) + (MINT / 4))) {
            base = 2;
        } else {
            base = 3;
        }
        found = findMatch(base, &currentString);
        if (found == 1) {
            matches++;
        }
    }
    //printf ("Total number of matches found = %d\n", matches);
    ans[0] = matches;
}

void fillQuery(int qNum) {
    int success, index;
    unsigned int currentString = 473246;
    unsigned int random = 782333;
    unsigned int base = 0;
    for (index = 0; index < qNum; index++) {
        random = Rnd(&random);
        if (random <= MINT / 4) {
            base = 0;
        } else if (random <= MINT / 2) {
            base = 1;
        } else if (random <= ((MINT / 2) + (MINT / 4))) {
            base = 2;
        } else {
            base = 3;
        }
        success = addQuery(base, &currentString);
        if (success) {
            success = 0;
        } else {
        }
    }
}

// uses open addressing, double hashing
unsigned int findMatch(unsigned int base1, unsigned int *currentString) {
    unsigned int base, step, last, current;
    *currentString = computeKey(base1, *currentString);
    base = computeBase(*currentString);
    step = computeStep(*currentString);
    last = (base + (SIZE - 1) * step) % SIZE;
    if (coreLoop(base, step, last, currentString)) {
        return 1;
    } else {
        return 0;
    }
}

hash_leon_coreLoop_32K_HT 2388B