1 / 66

Automatic Application-Specific Customization of Soft Processor Microarchitecture

Automatic Application-Specific Customization of Soft Processor Microarchitecture. Shobana Padmanabhan Roger D. Chamberlain, Ron K. Cytron, John D. Lockwood Washington University Funded by NSF under grant 03-13203 http://www.arl.wustl.edu/~sp3 Apr 26, 2006. Outline. Motivation

Download Presentation

Automatic Application-Specific Customization of Soft Processor Microarchitecture

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automatic Application-Specific Customization of Soft Processor Microarchitecture Shobana Padmanabhan Roger D. Chamberlain, Ron K. Cytron, John D. Lockwood Washington University Funded by NSF under grant 03-13203 http://www.arl.wustl.edu/~sp3 Apr 26, 2006

  2. Outline • Motivation • Automatic optimization technique • a novel application of a standard optimization technique • Evaluation & Results

  3. Constrained embedded applications • Embedded applications • Very restrictive FPGA and power constraints • Demanding application performance requirements • Requirement-constraint trade-offs • Soft processors • For application performance improvement • As prototype for custom hardware design

  4. FPGA resources Power App performance Number of registers Set size Set Associativity Soft processors • Parameterized general purpose processors • Customization is performance-cost tradeoff • More “knobs” more options for customization

  5. Soft processor customization • LEON: 10 reconfigurable subsystems • Instruction cache • Parameters: sets, set size, line size, replacement policy • 4 * 7 * 2 * 3 = 168 configurations (4 parameters; 16 values) • Data cache • sets, set size, line size, replacement, fast read, fast write, local RAM, local RAM size • 168 * 2 * 2 * 2 * 7 = 9,408 configns (8 params; 29 values) • Integer unit • multiplier, registers, fast jump, fast decode, ICC, load delay, FPU enable, co-processor enable, hardware watchpoints • 119,040 configurations (10 parameters; 56 values) • & Floating-point unit, memory controller, peripherals,… • 190 parameter values; 5*(1024) configurations!!

  6. Existing approaches • Scaling problems • Runtime measurement problems • Estimation is quick but inaccurate • Simulators are extremely slow

  7. Highlights of our optimization technique • Customize “all” parameters • Parameter independence assumption • Linear with number of parameters • Build only 100’s instead of 5*(1024) of configurations • Search space still includes all 5*(1024) configurations • Feasible and scalable • Formulate as binary integer nonlinear optimization program • A novel application of a standard technique • Use “actual” costs, to be accurate

  8. Cost measurement • Application runtime • From direct execution • Hardware-based profiler • Non-intrusive, cycle-accurate, in “real-time” • Part of Liquid architecture platform • Runtime cost is application-specific • FPGA resources • In terms of LUTs and BRAM, from actual build • Takes >30 minutes • Harder than traditional optimization problems • Resource cost is processor-specific • Power (energy): future work

  9. Outline • Motivation • Automatic optimization technique • a novel application of a standard optimization technique • Evaluation & Results

  10. Our optimization technique Out-of-box soft processor; base configuration Assumes parameter independence Perturb parameter values one by one, build configuration, track resource cost Run application on each configuration, trackruntime cost Formulate costs as Binary Integer NonlinearProgram Near-optimal in practice Solve using TOMLAB/MatLab

  11. Our optimization technique Out-of-box soft processor; base configuration Perturb parameter values one by one, build configuration, track resource cost Run application on each configuration, trackruntime cost Formulate costs as Binary Integer NonlinearProgram Solve using TOMLAB/MatLab

  12. Processor ICache reconfiguration

  13. Processor ICache reconfiguration xi = 0 or 1 (off or on)

  14. Processor ICache reconfiguration xi = 0 or 1 (off or on)

  15. Processor ICache reconfiguration xi = 0 or 1 (off or on) No constraint needed

  16. FPGA resource constraints • LUTs • BRAM xi = 0 or 1 (off or on) ri,li,bi: delta costs from base configuration n is number of configurations

  17. Optimization • Optimize application runtime • Optimize resource utilization also xi = 0 or 1 (off or on) ri,li,bi: delta costs from base configuration n is number of configurations

  18. Problem formulation recapped • Minimize • Subject to … … Parameter validity constraints FPGA resource constraints Binary variables constraint xi = 0 or 1 (off or on) ri,li,bi: delta costs from base configuration n is number of configurations

  19. Evaluation & Results • Evaluate the impact of parameter independence assumption • Compare against exhaustive runs of a small subsystem • Dcache parameters of sets and setsize

  20. Evaluation Our technique selects the same configuration Despite parameter independence assumption, near-optimal configuration

  21. Distribution of generated configurations

  22. Illustration of all configurations being searched

  23. Highlights of results • 6.2 – 19.4% improvement in application performance • 2 - 3% savings in resources • Solutions customized simultaneously along many parameters • Customization is indeed application-specific

  24. Conclusion • Our optimization technique • Linear with number of parameter values • Assuming parameter independence • Feasible, scalable • Near-optimal results in practice • Actual costs formulated as Binary Integer Nonlinear Program • A novel application of the technique • Only hours for configuration generation; seconds for optimization • Without any knowledge of architecture • Without any changes to application http://www.arl.wustl.edu/~sp3

  25. Backup

  26. Additional LEON-imposed constraints • LRR replacement with only 2-sets • LRU replacement with 2, 3, or 4-sets xi = 0 or 1 (off or on) ri,li,bi: delta costs from base configuration n is number of configurations

  27. Additional FPGA resource constraints • Cache size = (#sets) * (set size) Non-linear icache setsize dcache setsize xi = 0 or 1 (off or on) ri,li,bi: delta costs from base configuration n is number of configurations

  28. Evaluation

  29. Cost approximations during eval - BLASTN

  30. Cost approximations during eval - Commbench DRR

  31. Cost approximations during eval – Commbench FRAG

  32. Optimization results • Customization is indeed application-specific.

  33. Cost approximations of the solutions We overestimate runtime decrease, underestimate resource increase

  34. Cost approximations of the solutions We overestimate chip resource decrease, underestimate runtime decrease (except Arith, where we match)

  35. BLASTN results

  36. CommBench DRR results

  37. CommBench FRAG results

  38. BYTE Arith results

  39. Evaluation: cost approximation range, summary

  40. Result: cost approximation range, summary 0 to 19.75% -2 to 3% Nonlinear for LUTs is slightly worse Linear for BRAM is worse

  41. Clock cycles

  42. LEON processor reconfiguration • Icache (Instruction cache) • Dcache (Data cache) • IU (Integer Unit)

  43. Processor DCache reconfiguration Xi = 0 or 1 (off or on)

  44. Processor IU reconfiguration Xi = 0 or 1 (off or on) Valid parameters; next, fit on chip…

  45. Optimization • Optimize application runtime • Optimize resource utilization also • Similarly, optimize power consumption xi = 0 or 1 (off or on) ri,li,bi: delta costs from base configuration n is number of configurations

  46. Future work • Further analysis • Improve cost approximations • To be optimal for all xi • To match actual costs closer • Extensions • Power optimization • Energy optimization • For applications with long runtimes, sampling technique • Run applications on an operating system • ISA reconfiguration • “Give back” • Integrate this with LEON… • Evaluate technique on other configuration/ feature management problems

  47. Details Backup

  48. Existing approaches • Compiler-directed customization of ASIP cores by Gupta et al. (2002) • Considers only 4 functional units; only DSP benchmarks • Tuning caches to applications for low-energy embedded systems by Ross et al. (2004) • Analytical (hierarchical) searching of parameters in their own dimensions, with some full parameter exploration to avoid local minimal • Efficient architecture/compiler co-exploration for ASIPs by Fischer et al. (2002) • Considers only 3 architectural parameters, 4 compiler optimizations • Estimates chip costs • Towards automatic synthesis of a class of application-specific sensor networks by Bakshi et al. (2002) • Analytical model, followed by simulation-based refinement, but no optimization • Automatic generation of application specific processors by Goodwin et al. (2003) • Execution profiles to include/ exclude new “instructions” • Shortcomings • Scaling problems • Runtime measurement problems • Estimating application performance through models is quick but inaccurate • Simulators are slow; hence scale down the application or limit to single execution

  49. Applications • BLASTN • computation, memory intensive • Commbench DRR (Deficit Round Robin) • computation, memory intensive • Commbench FRAG • computation, memory intensive • BYTE Arith • computation intensive

  50. main () { int index = 0, counter = 0, found = 0, matches = 0, *ans; unsigned int currentString = 348432612, base = 0, random = 0; //currentString above is used as a seed also ans = (int*)0x40000004; //memlocation where the # matches are stored for (index = 0; index < SIZE; index++) { hashTable[index] = 4194304; } fillQuery(NUM_QUERY); //populates the hashtable // the loop below generates random bases for the database for (counter = 0; counter < NUM_DATABASE; counter++) { random = Rnd(&random); if (random <= MINT / 4) { base = 0; } else if (random <= MINT / 2) { base = 1; } else if (random <= ((MINT / 2) + (MINT / 4))){ base = 2; } else { base = 3; } found = findMatch(base, &currentString); if (found == 1) { matches++; } } //printf ("Total number of matches found = %d\n", matches); ans[0] = matches; } void fillQuery(int qNum) { int success, index; unsigned int currentString = 473246; unsigned int random = 782333; unsigned int base = 0; for (index = 0; index < qNum; index++) { random = Rnd(&random); if (random <= MINT/ 4) { base = 0; } else if (random <= MINT / 2) { base = 1; } else if (random <= ((MINT / 2) + (MINT / 4))){ base = 2; } else { base = 3; } success = addQuery(base, &currentString); if (success) { success = 0; } else { } } } //uses open address, double hashing unsigned int findMatch(unsigned int base1, unsigned int *currentString) { unsigned int base, step, last, current; *currentString = computeKey(base1, *currentString); base = computeBase(*currentString); step = computeStep(*currentString); last = (base + (SIZE - 1) * step) % SIZE; if (coreLoop(base, step, last, currentString)) { return 1; } else { return 0; } } hash_leon_coreLoop_32K_HT 2388B

More Related