System-Level Exploration for Pareto-Optimal Configurations in Parameterized Systems-on-a-Chip Architectures

This paper discusses system-level exploration techniques for finding optimal configurations in parameterized Systems-on-a-Chip architectures, considering power and performance tradeoffs. It explores the design space and presents experiments and conclusions.

Presentation Transcript


  1. System-level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip Architectures Tony Givargis (Frank Vahid, Jörg Henkel) Center for Embedded Computer Systems University of California Irvine, CA 92697 givargis@ics.uci.edu

  2. Overview • Given: • Parameterized SOC architecture • Fixed application • Automatically explore the design space • Find optimal points with respect to power and performance [Figure: an application, void main(){ while(1){ Receive(); Decode(); Display(); } }, mapped onto an SOC containing CPU, Memory, BRIDGE, I$-D$, JPEG CODEC, Math/FPU and UART; the Explore step varies Size = {1K, 4K, 8K}, Line = {4, 8, 16}, Assoc = {1, 2, 4}]
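As a small illustration of what "explore the design space" means for the three cache parameters listed on the overview slide, the sketch below enumerates every combination of Size, Line and Assoc. The dictionary keys and the function name `enumerate_configs` are assumptions made for this example and do not come from the presentation.

```python
from itertools import product

# Parameter ranges copied from the overview slide (cache size, line size, associativity).
CACHE_PARAMS = {
    "size": ["1K", "4K", "8K"],
    "line": [4, 8, 16],
    "assoc": [1, 2, 4],
}

def enumerate_configs(params):
    """Return every assignment of one value to each parameter."""
    names = list(params)
    return [dict(zip(names, values)) for values in product(*params.values())]

configs = enumerate_configs(CACHE_PARAMS)
print(len(configs))  # 3 * 3 * 3 = 27 configurations
```

Even this tiny example has 27 points; the count multiplies with every parameter added, which is why exhaustive enumeration stops scaling quickly.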

  3. Motivation • Design trends: • Growing demand for portable devices • Growing demand for low power design • Increased application complexity • Shrinking time-to-market windows • Technology trends: • Increased chip capacity • Increased I/O pins • Improved on-chip integration techniques (storage, digital, analog, digital, …) • SOC era: need for greater designer productivity!

  4. Motivation • One approach: reuse of existing IP • IP selection? • IP integration? • SOC verification? • Multi-source IP licensing • More… [Figure: candidate IP cores (JPEG CODEC1, JPEG CODEC2, USB, AMBA BRIDGE, ISA BRIDGE, UART, ARM, MIPS, SRAM, RAM, DRAM, Math/FPU) shown next to an SOC outline (CPU, Memory, MMX, BRIDGE, JPEG CODEC, Math/FPU, UART) with open choices marked "?"]

  5. Motivation • Alternate approach: reuse of SOC • Designed, integrated, tested • Domain specific • Parameterized • Designed by firms specializing in SOC • User: map application, then “configure-and-execute” (successors to microcontrollers!) [Figure: parameterized SOC with CPU, Memory, BRIDGE, MMX, JPEG CODEC, Math/FPU and UART]

  6. Motivation • Composed of 100s of cores • Cores are “configurable” • Configurations impact power/performance • Large number of total configurations! • Architecture is otherwise fixed! [Figure: parameterized SOC with CPU, Memory, BRIDGE, MMX, JPEG CODEC, Math/FPU and UART]

  7. Motivation • ATI Technologies – XILLEON™ 220 SOC for Digital Set-top Box Market • Tensilica – Xtensa™ 1040 configurable processor cores • Philips Semiconductors – Velocity RSP9™ SOC platforms • Adelante Technologies – offers complete SOC customizable platforms for DSP domains • More…

  8. Outline • Previous work • Target architecture • Power/performance estimation • Parameter space exploration • Experiments • Conclusion

  10. Previous Work • Parameterized SOC design • [Malik00], [Veidenbaum99], [Vahid99], [Stan95] • Power/performance evaluation • [Brandolese00], [Simunic99], [Li98], [Tiwari94] • Design space exploration (manual) • [Givargis99], [Lieverse99] • Design space exploration (automatic) • Focus of this work…

  11. Previous Work • Y-chart [Lieverse99] [Figure: Y-chart in which Applications and an Architecture feed Mapping, followed by Auto Analysis, producing Numbers]

  12. Outline • Previous work • Target architecture • Power/performance estimation • Parameter space exploration • Experiments • Conclusion

  13. Target Architecture [Figure: block diagram with MIPS, I-Cache, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA and DCT CODEC]

  14. Target Architecture • MIPS: voltage scale • Caches: size, line, associativity • Buses: width, encoding (gray, invert) • UART: tx/rx buffer size • DCT CODEC: resolution [Figure: block diagram with MIPS, I-Cache, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA and DCT CODEC]

  19. Target Architecture • 26 parameters • 10^14 configurations • What are the optimal configurations (given a fixed application)? [Figure: block diagram with MIPS, I-Cache, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA and DCT CODEC]

  20. Problem Summary • What are the possible power/performance tradeoffs? (100 trillion) • Need to efficiently evaluate power/performance (1/sec → 150,000 years) • Need to explore the configuration space [Figure: parameterized SOC with CPU, Memory, BRIDGE, MMX, JPEG CODEC, Math/FPU and UART]

  21. Outline • Previous work • Target architecture • Power/performance estimation • Parameter space exploration • Experiments • Conclusion

  22. Power Evaluation • Exploration works with: • Chip instrumentation (real-time) • System-level simulation • RTL simulation • Gate-level simulation • Circuit-level simulation • Relative accuracy required! [Chart: digital camera application mapped on our SOC, capturing 1 image; plotted values 180000, 28800, 5400, 440 and 1; axis labels not recoverable from the transcript]

  24. Power Evaluation – Processor • [Tiwari94/00]’s instruction-level power model • Measure watt/inst • Account for stalls + dependency • Apply traces [Figure: target architecture block diagram]
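The processor bullets (per-instruction cost, stalls, dependency, traces) can be sketched roughly as follows. This is only an illustration of the general idea, not the model used in the presentation: every cost value is made up, the sketch accumulates energy per instruction rather than the watt/inst figure named on the slide, and "dependency" is approximated here as a fixed overhead whenever the opcode changes between consecutive instructions.

```python
# Hypothetical per-instruction energy costs in nanojoules (illustrative values only).
BASE_COST_NJ = {"add": 1.0, "load": 2.5, "mul": 3.0}
STALL_COST_NJ = 0.5        # assumed cost of one stall cycle
DEPENDENCY_COST_NJ = 0.2   # assumed overhead when the opcode changes

def trace_energy_nj(trace):
    """Sum energy over a trace of (opcode, stall_cycles) tuples in execution order."""
    total = 0.0
    prev = None
    for opcode, stalls in trace:
        total += BASE_COST_NJ[opcode] + stalls * STALL_COST_NJ
        if prev is not None and prev != opcode:
            total += DEPENDENCY_COST_NJ   # inter-instruction overhead
        prev = opcode
    return total

trace = [("load", 2), ("add", 0), ("add", 0), ("mul", 1)]
energy = trace_energy_nj(trace)
print(round(energy, 2))
```

Because the cost table is looked up per trace entry, the evaluation time grows only with trace length, which is what makes this style of estimate usable inside an exploration loop.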

  25. Power Evaluation – Cache/Mem. • [Evans95] • Capacitance model of sub-components • Switching obtained via simulation (parameter dependent) [Figure: target architecture block diagram]

  26. Power Evaluation – Buses • [Chern92] • Model bus capacitance • Switching derived from I/O traffic (parameter dependent) [Figure: target architecture block diagram]
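To make "switching derived from I/O traffic" concrete, the sketch below counts bit transitions on a bus trace and compares plain binary against Gray encoding, one of the bus encodings listed for the target architecture. The energy helper uses the standard dynamic-energy relation of 0.5 * C * Vdd^2 per wire transition; the function names and the sequential address trace are assumptions for illustration, not data from the presentation.

```python
def to_gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)

def count_toggles(words):
    """Total number of bit transitions between consecutive words on the bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def switching_energy(words, capacitance_f, vdd):
    """Dynamic energy estimate: 0.5 * C * Vdd^2 for every wire transition."""
    return 0.5 * capacitance_f * vdd ** 2 * count_toggles(words)

addresses = list(range(16))  # hypothetical sequential address trace
binary_toggles = count_toggles(addresses)
gray_toggles = count_toggles([to_gray(a) for a in addresses])
print(binary_toggles, gray_toggles)  # 26 15
```

For sequential traffic Gray code flips exactly one wire per transfer, so the toggle count, and with it the estimated bus energy, depends on both the encoding parameter and the traffic pattern.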

  27. Power Evaluation – Peripherals • Observation: cores execute instructions! • Apply a technique similar to that used for processors! [Figure: target architecture block diagram]

  28. Power Evaluation – Summary • ~50-100K instructions/second! (Platune) [Figure: target architecture block diagram annotated MIPS (10%), I-Cache (8%), D-Cache (8%), Memory (8%), Bridge (5%), Peripheral Bus, UART (5%), DMA (5%), DCT CODEC (5%)]

  29. Outline • Previous work • Target architecture • Power/performance estimation • Parameter space exploration • Experiments • Conclusion

  30. Exploration – Problem formulation • Parameters P1, P2, …, Pn • A configuration (point) is an assignment of values to all parameters • How to efficiently generate all Pareto-optimal configurations?
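The problem formulation relies on Pareto optimality without spelling it out. A minimal sketch of the usual definition follows, assuming each configuration is summarized by an (execution_time, power) pair where lower is better for both; the function names and sample points are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in both metrics and strictly better in one.

    Points are (execution_time, power) tuples; lower is better for both.
    """
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(points):
    """Keep the points that no other point dominates (quadratic-time reference)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

points = [(10, 5), (8, 7), (12, 4), (9, 9), (10, 6)]
front = pareto_front(points)
print(front)  # [(10, 5), (8, 7), (12, 4)]
```

Here (9, 9) is dominated by (8, 7) and (10, 6) by (10, 5); the remaining three points each represent a distinct time/power tradeoff.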

  31. Exploration – Algorithm Idea • Parameters A (10 values), B (32 values), C (32 values) • Exhaustive: A * B * C = 10 * 32 * 32 = 10240 points • A and B interdependent: A * B = 320 points • A and C are independent: A + C = 42 points • C and B are independent: C + B = 64 points • With knowledge about dependency we prune 98.6%: 138 points • Directed graph
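The point counts on this slide can be checked with a few lines of arithmetic. The sketch assumes the parenthesized numbers are the value counts of parameters A, B and C, and that 138 is the number of points still evaluated after pruning; the slide does not show how 138 is composed, so it is taken as given.

```python
values = {"A": 10, "B": 32, "C": 32}  # value counts per parameter, from the slide

exhaustive = values["A"] * values["B"] * values["C"]  # all combinations
interdependent_ab = values["A"] * values["B"]          # must be explored jointly
independent_ac = values["A"] + values["C"]             # can be explored separately
independent_cb = values["C"] + values["B"]

visited = 138  # figure quoted on the slide, taken as given
pruned_fraction = 1 - visited / exhaustive

print(exhaustive, interdependent_ab, independent_ac, independent_cb)
print(f"{pruned_fraction:.4f}")  # 0.9865, which the slide reports as 98.6%
```

The contrast between the product (320) and the sums (42, 64) is the whole idea: independent parameters add to the search cost, while interdependent ones multiply it.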

  32. Exploration • A → B: Pareto-optimal configurations of B calculated after Pareto-optimal configurations of nodes along the path A → B • A → B → A (cycle): Pareto-optimal configurations of all the parameters on the cycle calculated simultaneously • A: Pareto-optimal configurations calculated in isolation
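One way to turn these three rules into an evaluation order is to group nodes that lie on a common cycle and then order the groups along the edges. The sketch below does this with Tarjan's strongly-connected-components algorithm; that choice, the function name `ordered_clusters` and the toy graph are assumptions for illustration, since the presentation does not say how its clusters are computed.

```python
def ordered_clusters(graph):
    """Group nodes on a common cycle and order groups along the edges.

    graph maps each node to its successors (an edge A -> B means B is
    calculated after A). Returns a list of sets in evaluation order.
    """
    index, low, on_stack = {}, {}, set()
    stack, clusters = [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                low[node] = min(low[node], low[succ])
            elif succ in on_stack:
                low[node] = min(low[node], index[succ])
        if low[node] == index[node]:  # node is the root of a component
            cluster = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                cluster.add(member)
                if member == node:
                    break
            clusters.append(cluster)

    for node in graph:
        if node not in index:
            visit(node)
    return clusters[::-1]  # Tarjan emits sink groups first; reverse the list

# Toy graph: A and B form a cycle, C depends on B, D is isolated.
toy = {"A": ["B"], "B": ["A", "C"], "C": [], "D": []}
order = ordered_clusters(toy)
print(order)
```

In the toy graph A and B end up in one group (explored simultaneously), C follows that group, and D stands alone and can be explored in isolation.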

  33. Exploration – Dependency Graph
      Node | Core | Parameter
      A | MIPS | Voltage scale
      B | I$ | Total size
      C | I$ | Line size
      D | I$ | Associativity
      E | D$ | Total size
      F | D$ | Line size
      G | D$ | Associativity
      H | CPU I$ bus | Data bus width
      I | CPU I$ bus | Data bus code
      J | CPU I$ bus | Addr bus width
      K | CPU I$ bus | Addr bus code
      L | CPU D$ bus | Data bus width
      M | CPU D$ bus | Data bus code
      N | CPU D$ bus | Addr bus width
      O | CPU D$ bus | Addr bus code
      P | I/D$ Mem bus | Data bus width
      Q | I/D$ Mem bus | Data bus code
      R | I/D$ Mem bus | Addr bus width
      S | I/D$ Mem bus | Addr bus code
      T | Peripheral bus | Data bus width
      U | Peripheral bus | Data bus code
      V | Peripheral bus | Addr bus width
      W | Peripheral bus | Addr bus code
      X | UART | Tx buffer size
      Y | UART | Rx buffer size
      Z | DCT CODEC | Pixel resolution
      [Figure: directed dependency graph over nodes A to Z]

  34. Exploration – Dependency graph • Based on designer knowledge • Computed by simulating all pairs of nodes (quadratic time complexity, approx.) • One time effort [Figure: directed dependency graph over nodes A to Z]

  35. Exploration – Algorithm • Step 1: Clustering followed by simulation [Figure: dependency graph over nodes A to Z grouped into clusters]

  36. Exploration – Algorithm • Step 2: Pair-wise merge followed by simulation
      Initial clusters: {J,K,T,U} {B,C,D,E,F,G} {Z} {A,H,I} {L,M,P,Q} {X,Y,R,S} {N,O,V,W}
      After first merge: {X,Y,R,S} {A,H,I,B,C,D,E,F,G} {J,K,T,U,Z} {L,M,P,Q,N,O,V,W}
      After second merge: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z} {L,M,P,Q,N,O,V,W,X,Y,R,S}
      After final merge: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z,L,M,P,Q,N,O,V,W,X,Y,R,S}
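A single pair-wise merge step might look like the sketch below: the Pareto-optimal partial configurations of two clusters are combined, each combination is evaluated, and only the non-dominated results are kept. The `evaluate` callback stands in for the simulator, and the parameter names and cost formulas in the example are invented for illustration; none of them come from the presentation.

```python
from itertools import product

def dominates(a, b):
    """a and b are (execution_time, power) tuples; lower is better for both."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def merge_clusters(front_a, front_b, evaluate):
    """Combine two clusters' Pareto-optimal partial configurations.

    front_a, front_b: lists of dicts mapping parameter name to value.
    evaluate: hypothetical callback returning (execution_time, power).
    """
    candidates = []
    for cfg_a, cfg_b in product(front_a, front_b):
        merged = {**cfg_a, **cfg_b}
        candidates.append((evaluate(merged), merged))
    return [cfg for metrics, cfg in candidates
            if not any(dominates(other, metrics) for other, _ in candidates)]

def toy_evaluate(cfg):
    """Made-up cost model, only here to make the example runnable."""
    time = 10 / cfg["voltage"] + 8 / cfg["cache_kb"]
    power = 2 * cfg["voltage"] ** 2 + 0.25 * cfg["cache_kb"]
    return (time, power)

front_a = [{"voltage": 1.0}, {"voltage": 2.0}]
front_b = [{"cache_kb": 1}, {"cache_kb": 8}]
merged_front = merge_clusters(front_a, front_b, toy_evaluate)
print(merged_front)
```

Only len(front_a) * len(front_b) combinations are simulated per merge, rather than the full product of the underlying parameter ranges, which is where the pruning comes from.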

  37. Exploration Exhaustive solution • Evaluate all points • Sort by decreasing execution time • Walk through the space, eliminate points with power > minimum seen so far! • Substitute heuristics (only works for 1-4 parameters!)
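The sort-and-walk procedure can be sketched as below. Note one deliberate difference: the slide sorts by decreasing execution time, while this sketch sorts in ascending order so that the running minimum power always covers every point at least as fast as the current one, which corresponds to walking the slide's ordering from its far end. The function name and sample data are hypothetical.

```python
def pareto_sweep(points):
    """Return the Pareto-optimal (execution_time, power) points.

    Lower is better for both metrics. Runs in O(n log n) because of the sort.
    """
    front = []
    min_power = float("inf")
    # Ascending time; ties are ordered by ascending power.
    for time, power in sorted(points):
        if power < min_power:  # beats every faster-or-equal point on power
            front.append((time, power))
            min_power = power
    return front

points = [(10, 5), (8, 7), (12, 4), (9, 9), (10, 6)]
front = pareto_sweep(points)
print(front)  # [(8, 7), (10, 5), (12, 4)]
```

The sweep itself is cheap; the expensive part is evaluating all points beforehand, which is why the exhaustive approach is only practical for a handful of parameters.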

  38. Exploration • Complexity: O((K + log(K)) * 2^(N/K)) • K is the number of clusters • N is the number of parameters • 2^(N/K) bounds the exhaustive comp. • (K + log(K)) bounds the number of iterations • Worst case K=1, best case K=N • 2^(N/K) decreases rapidly as K increases (e.g., 2^(26/2) + 2^(26/2) is much smaller than 2^26!)
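Plugging numbers into the bound shows how quickly it shrinks with the number of clusters. The sketch assumes a base-2 logarithm, which the slide does not specify, and follows the slide in treating each parameter as binary (hence the factor 2^(N/K)); the function name is hypothetical.

```python
import math

def exploration_bound(n_params, n_clusters):
    """Evaluate (K + log2(K)) * 2**(N / K); the log base is an assumption."""
    return (n_clusters + math.log2(n_clusters)) * 2 ** (n_params / n_clusters)

N = 26  # number of parameters in the target architecture
for K in (1, 2, 13, 26):
    print(K, round(exploration_bound(N, K)))

# The slide's own example: two clusters of 13 parameters each.
two_halves = 2 ** 13 + 2 ** 13
print(two_halves, 2 ** 26)  # 16384 vs 67108864
```

With K = 1 the bound equals the full 2^26 space; splitting into just two clusters already drops it by more than three orders of magnitude.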

  39. Outline • Previous work • Target architecture • Power/performance estimation • Parameter space exploration • Experiments • Conclusion

  40. Exploration – Results JPEG • Exploration time: 29.1 min • Config. visited: 12352 (141) • 5.10x exe. time • 7.51x power • 2.73x energy • Pruning ratio > 0.99997

  41. Exploration – Results CKEY • Exploration time: 108 min • Config. visited: 15890 (223) • 8.31x exe. time • 6.08x power • 2.57x energy • Pruning ratio > 0.99993

  42. Exploration – Results IMAGE • Exploration time: 50.2 min • Config. visited: 10135 (80) • 8.29x exe. time • 8.57x power • 1.81x energy • Pruning ratio > 0.99998

  43. Exploration – Results MATRIX • Exploration time: 73.6 min • Config. visited: 12623 (84) • 10.7x exe. time • 8.16x power • 3.18x energy • Pruning ratio > 0.99997

  44. Exploration – Results JPEG [Figure: three plots labeled JPEG; contents not recoverable from the transcript]

  45. Conclusion • Gave a system-level algorithm for exploring the solution space of an application mapped to a parameterized SOC architecture • Given a dependency graph, we extensively prune the solution space • Pruning ratio > 0.99997 in experiments • Future work: • Automatically compute the dependency model • Replace the exhaustive sub-algorithm with a heuristic (e.g., gradient search, GA)
