System-level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip Architectures. Tony Givargis (Frank Vahid, Jörg Henkel), Center for Embedded Computer Systems, University of California, Irvine, CA 92697, givargis@ics.uci.edu
Overview
• Given:
  • Parameterized SOC architecture
  • Fixed application
• Automatically explore the design space
• Find optimal points with respect to power and performance
[Figure: SOC block diagram (CPU, Memory, Bridge, I$-D$, JPEG CODEC, Math/FPU, UART) with example cache parameters Size = {1K, 4K, 8K}, Line = {4, 8, 16}, Assoc = {1, 2, 4}; the application is void main(){ while(1){ Receive(); Decode(); Display(); } }; an "Explore" step connects the two]
Motivation
• Design trends:
  • Growing demand for portable devices
  • Growing demand for low-power design
  • Increased application complexity
  • Shrinking time-to-market windows
• Technology trends:
  • Increased chip capacity
  • Increased I/O pins
  • Improved on-chip integration techniques (storage, digital, analog, …)
• SOC era
• Need for greater designer productivity!
Motivation
• One approach: reuse of existing IP
  • IP selection?
  • IP integration?
  • SOC verification?
  • Multi-source IP licensing
  • More…
[Figure: pool of IP cores (ARM, MIPS, MMX, JPEG CODEC1, JPEG CODEC2, USB, UART, Math/FPU, AMBA bridge, ISA bridge, SRAM, RAM, DRAM) next to an SOC block diagram (CPU, Memory, Bridge, JPEG CODEC, Math/FPU, UART) with open "?" slots]
Motivation
• Alternate approach: reuse of SOC
  • Designed, integrated, tested
  • Domain specific
  • Parameterized
  • Designed by firms specializing in SOC
• User: map application, then "configure-and-execute" (successors to microcontrollers!)
[Figure: parameterized SOC (CPU, Memory, Bridge, MMX, JPEG CODEC, Math/FPU, UART)]
Motivation
• Composed of 100s of cores
• Cores are "configurable"
• Configurations impact power/performance
• Large number of total configurations!
• Architecture is otherwise fixed!
[Figure: parameterized SOC (CPU, Memory, Bridge, MMX, JPEG CODEC, Math/FPU, UART)]
Motivation
• ATI Technologies: XILLEON™ 220 SOC for the digital set-top box market
• Tensilica: Xtensa™ 1040 configurable processor cores
• Philips Semiconductors: Velocity RSP9™ SOC platforms
• Adelante Technologies: complete customizable SOC platforms for DSP domains
• More…
Outline
• Previous work
• Target architecture
• Power/performance estimation
• Parameter space exploration
• Experiments
• Conclusion
Previous Work
• Parameterized SOC design
  • [Malik00], [Veidenbaum99], [Vahid99], [Stan95]
• Power/performance evaluation
  • [Brandolese00], [Simunic99], [Li98], [Tiwari94]
• Design space exploration (manual)
  • [Givargis99], [Lieverse99]
• Design space exploration (automatic)
  • Focus of this work…
Previous Work
• Y-chart [Lieverse99]
[Figure: Y-chart with Applications and Architecture feeding Mapping, then Analysis, then Numbers]
Target Architecture
[Figure: block diagram with MIPS, I-Cache, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, DCT CODEC]
Target Architecture
• MIPS: voltage scale
• I-Cache, D-Cache: size, line, associativity
• Buses: width, encoding (gray, invert)
• UART: tx/rx buffer size
• DCT CODEC: resolution
Target Architecture
• 26 parameters
• $10^{14}$ configurations
• Which configurations are optimal (given a fixed application)?
Problem Summary
• What are the possible power/performance tradeoffs? (100 trillion)
• Need to efficiently evaluate power/performance (1/sec → 150,000 years)
• Need to explore the configuration space
Power Evaluation
• Exploration works with:
  • Chip instrumentation (real-time)
  • System-level simulation
  • RTL simulation
  • Gate-level simulation
  • Circuit-level simulation
• Relative accuracy required!
[Figure: chart with values 1, 440, 5400, 28800 and 180000. Digital camera application mapped on our SOC, capturing 1 image.]
Power Evaluation – Processor
• [Tiwari94/00]'s instruction-level approach
• Measure watts per instruction
• Account for stalls and dependencies
• Apply traces
Power Evaluation – Cache/Memory
• [Evans95]
• Capacitance model of sub-components
• Switching obtained via simulation (parameter dependent)
Power Evaluation – Buses
• [Chern92]
• Model bus capacitance
• Switching derived from I/O traffic (parameter dependent)
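As a rough illustration of deriving switching from bus traffic, the sketch below counts bit toggles between successive words of a trace and converts them to energy with the standard ½·C·V² cost per transition. The trace, the capacitance and the supply voltage are hypothetical values chosen for the example, not figures from the talk, and this is not claimed to be the [Chern92] model itself. It also compares plain binary with Gray encoding, one of the bus encodings listed for the target architecture.

```python
def to_gray(value: int) -> int:
    """Convert a binary-coded integer to its reflected Gray code."""
    return value ^ (value >> 1)


def count_toggles(words):
    """Total number of bus lines that change state across a word trace."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def switching_energy(toggles, capacitance_farads, vdd_volts):
    """Dynamic energy in joules: each line transition costs 0.5 * C * V^2."""
    return 0.5 * capacitance_farads * vdd_volts ** 2 * toggles


# Hypothetical trace: 16 sequential addresses on a 4-bit address bus.
addresses = list(range(16))
binary_toggles = count_toggles(addresses)
gray_toggles = count_toggles([to_gray(a) for a in addresses])

# Hypothetical electrical parameters (assumptions, not from the slides).
LINE_CAPACITANCE = 10e-12  # farads per bus line
VDD = 3.3                  # volts

binary_energy = switching_energy(binary_toggles, LINE_CAPACITANCE, VDD)
gray_energy = switching_energy(gray_toggles, LINE_CAPACITANCE, VDD)
print(binary_toggles, gray_toggles)
```

For a purely sequential trace, Gray encoding toggles exactly one line per transfer, which is why the encoding is worth exposing as a parameter.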
Power Evaluation – Peripherals
• Observation: cores execute instructions!
• Apply a technique similar to that used for processors!
Power Evaluation – Summary
• ~50-100K instructions/second! (Platune)
[Figure: block diagram annotated with percentages: MIPS (10%), I-Cache (8%), D-Cache (8%), Memory (8%), Bridge (5%), UART (5%), DMA (5%), DCT CODEC (5%); Peripheral Bus is not annotated]
Exploration
• Problem formulation
  • Parameters P1, P2, …, Pn
  • A configuration (point) is an assignment of values to all parameters
  • How to efficiently generate all Pareto-optimal configurations?
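To make "a configuration is an assignment of values to all parameters" concrete, here is a minimal sketch that enumerates every configuration of three cache parameters, using the example ranges from the overview slide (Size, Line, Assoc). The dictionary layout and the function name are illustrative assumptions.

```python
from itertools import product

# Example parameter ranges taken from the overview slide.
parameters = {
    "size": ["1K", "4K", "8K"],
    "line": [4, 8, 16],
    "assoc": [1, 2, 4],
}


def enumerate_configurations(params):
    """Return every assignment of one value to each parameter."""
    names = list(params)
    value_lists = [params[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


configs = enumerate_configurations(parameters)
print(len(configs))  # 3 * 3 * 3 = 27
```

The size of this list is the product of the range sizes, which is why the full 26-parameter space cannot be enumerated directly.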
Exploration – Algorithm Idea
• A (10) and B (32) are interdependent: A * B = 320 points
• A (10) and C (32) are independent: A + C = 42 points
• C (32) and B (32) are independent: C + B = 64 points
• Exhaustive: A * B * C = 10240 points
• With knowledge about dependency we prune 98.6%: 138 points
• Directed graph
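The arithmetic on this slide can be checked directly. The sketch below recomputes the point counts for parameters with 10, 32 and 32 values and the fraction pruned when only the 138 points quoted on the slide are visited; the variable names are illustrative.

```python
# Number of values each parameter can take (from the slide).
a_values, b_values, c_values = 10, 32, 32

# Interdependent parameters must be explored jointly (product);
# independent parameters can be explored one after the other (sum).
exhaustive_points = a_values * b_values * c_values   # 10240
interdependent_ab = a_values * b_values              # 320
independent_ac = a_values + c_values                 # 42
independent_cb = c_values + b_values                 # 64

visited_points = 138  # figure quoted on the slide
pruned_fraction = 1 - visited_points / exhaustive_points
print(f"pruned: {pruned_fraction:.2%}")
```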
Exploration
• A → B: Pareto-optimal configurations of B are calculated after the Pareto-optimal configurations of the nodes along the path from A to B
• A → B → A (cycle): Pareto-optimal configurations of all the parameters on the cycle are calculated simultaneously
• A (isolated): Pareto-optimal configurations are calculated in isolation
Exploration – Dependency Graph

Node  Core            Parameter
A     MIPS            Voltage scale
B     I$              Total size
C     I$              Line size
D     I$              Associativity
E     D$              Total size
F     D$              Line size
G     D$              Associativity
H     CPU I$ bus      Data bus width
I     CPU I$ bus      Data bus code
J     CPU I$ bus      Addr bus width
K     CPU I$ bus      Addr bus code
L     CPU D$ bus      Data bus width
M     CPU D$ bus      Data bus code
N     CPU D$ bus      Addr bus width
O     CPU D$ bus      Addr bus code
P     I/D$ Mem bus    Data bus width
Q     I/D$ Mem bus    Data bus code
R     I/D$ Mem bus    Addr bus width
S     I/D$ Mem bus    Addr bus code
T     Peripheral bus  Data bus width
U     Peripheral bus  Data bus code
V     Peripheral bus  Addr bus width
W     Peripheral bus  Addr bus code
X     UART            Tx buffer size
Y     UART            Rx buffer size
Z     DCT CODEC       Pixel resolution

[Figure: dependency graph over nodes A to Z]
Exploration – Dependency graph
• Based on designer knowledge
• Computed by simulating all pairs of nodes (quadratic time complexity, approx.)
• One-time effort
[Figure: dependency graph over nodes A to Z]
Exploration – Algorithm
Step 1: Clustering followed by simulation
[Figure: dependency graph with nodes A to Z grouped into clusters]
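One way to read the clustering step is that parameters lying on a common cycle of the dependency graph end up in the same cluster, which corresponds to computing strongly connected components. The sketch below does that with Kosaraju's algorithm on a small hypothetical graph; the edges are invented for illustration (the real edges are only in the figure), and treating clusters as strongly connected components is an assumption, not something the slide states.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to a list of successors."""
    nodes = set(graph) | {dst for succs in graph.values() for dst in succs}
    order, seen = [], set()

    def visit(node):
        # First pass: record nodes in order of DFS completion.
        seen.add(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in sorted(nodes):
        if node not in seen:
            visit(node)

    # Build the reversed graph for the second pass.
    reverse = {node: [] for node in nodes}
    for src, succs in graph.items():
        for dst in succs:
            reverse[dst].append(src)

    clusters, assigned = [], set()
    for node in reversed(order):
        if node in assigned:
            continue
        stack, component = [node], []
        assigned.add(node)
        while stack:
            current = stack.pop()
            component.append(current)
            for prev in reverse[current]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        clusters.append(sorted(component))
    return clusters


# Hypothetical graph: A and B form a cycle, C depends on B, Z is isolated.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": [], "Z": []}
clusters = strongly_connected_components(toy_graph)
print(sorted(clusters))
```

Each resulting cluster would then be explored exhaustively by simulation, which is the part this sketch leaves out.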
Exploration – Algorithm
Step 2: Pair-wise merge followed by simulation
• Initial clusters: {J,K,T,U}, {B,C,D,E,F,G}, {Z}, {A,H,I}, {L,M,P,Q}, {X,Y,R,S}, {N,O,V,W}
• After the first merge: {X,Y,R,S}, {A,H,I,B,C,D,E,F,G}, {J,K,T,U,Z}, {L,M,P,Q,N,O,V,W}
• After the second merge: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z}, {L,M,P,Q,N,O,V,W,X,Y,R,S}
• After the third merge: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z,L,M,P,Q,N,O,V,W,X,Y,R,S}
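The merge schedule on this slide can be reproduced with a short sketch that repeatedly joins neighbouring clusters until one remains. The input order below is chosen so that adjacent pairing matches the merges shown on the slide; that ordering, and the function name, are assumptions. The simulation that follows each merge is deliberately omitted.

```python
def pairwise_merge_levels(clusters):
    """Merge neighbouring clusters level by level until one cluster is left."""
    levels = [clusters]
    current = clusters
    while len(current) > 1:
        merged = []
        for i in range(0, len(current) - 1, 2):
            merged.append(current[i] + current[i + 1])
        if len(current) % 2 == 1:
            merged.append(current[-1])  # odd cluster out waits for the next level
        levels.append(merged)
        current = merged
    return levels


# The seven initial clusters from the slide, ordered to match its pairing.
initial_clusters = [
    list("AHI"), list("BCDEFG"),
    list("JKTU"), list("Z"),
    list("LMPQ"), list("NOVW"),
    list("XYRS"),
]

levels = pairwise_merge_levels(initial_clusters)
for level in levels:
    print(["".join(cluster) for cluster in level])
```

Seven clusters need three merge levels (7, 4, 2, 1), so the number of merge rounds grows only logarithmically with the number of clusters.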
Exploration – Exhaustive solution
• Evaluate all points
• Sort by increasing execution time
• Walk through the space, eliminate points with power > minimum seen so far!
• Substitute heuristics (only works for 1-4 parameters!)
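The sweep described here can be sketched as follows, assuming each evaluated point is a hypothetical (execution_time, power) pair where lower is better for both. The sketch sorts by increasing execution time, so every point already visited is at least as fast as the current one; under that order, a point whose power is not below the minimum seen so far is dominated and can be dropped. The sample data is invented for illustration.

```python
def pareto_front(points):
    """Return the Pareto-optimal (execution_time, power) pairs.

    Both objectives are minimized. Points are visited in order of
    increasing execution time (ties broken by lower power first).
    """
    front = []
    best_power = float("inf")
    for exec_time, power in sorted(points):
        if power < best_power:
            front.append((exec_time, power))
            best_power = power
    return front


# Hypothetical evaluation results: (execution time, power).
evaluated = [(10, 5.0), (12, 4.0), (11, 6.0), (15, 4.5), (20, 1.0)]
front = pareto_front(evaluated)
print(front)
```

The cost is dominated by the sort, so the sweep itself is cheap; the expensive part is evaluating every point in the first place.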
Exploration
• Complexity: $O((K + \log K) \cdot 2^{N/K})$
  • K is the number of clusters
  • N is the number of parameters
  • $2^{N/K}$ bounds the exhaustive computation
  • $(K + \log K)$ bounds the number of iterations
• Worst case K=1, best case K=N
• $2^{N/K}$ decreases rapidly as K increases (e.g., $2^{26/2} + 2^{26/2}$ is much smaller than $2^{26}$!)
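A quick numeric check of the example on this slide, reading the exponent as N/K: splitting 26 parameters into two clusters of 13 costs 2^13 + 2^13 evaluations instead of 2^26. The helper below also evaluates the stated bound for a given K; the logarithm base is not given on the slide, so base 2 is an assumption.

```python
import math


def exploration_bound(n_params, k_clusters):
    """Evaluate (K + log K) * 2^(N/K); log base 2 is an assumption."""
    return (k_clusters + math.log2(k_clusters)) * 2 ** (n_params / k_clusters)


one_cluster = 2 ** 26                 # exhaustive over all 26 parameters
two_clusters = 2 ** 13 + 2 ** 13      # two clusters of 13 parameters each
print(one_cluster, two_clusters)

for k in (1, 2, 13, 26):
    print(k, exploration_bound(26, k))
```

Even a single split shrinks the exhaustive part by more than three orders of magnitude, which is where the pruning comes from.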
Exploration – Results: JPEG
• Exploration time: 29.1 min
• Configurations visited: 12352 (141)
• 5.10x execution time
• 7.51x power
• 2.73x energy
• Pruning ratio > 0.99997
Exploration – Results: CKEY
• Exploration time: 108 min
• Configurations visited: 15890 (223)
• 8.31x execution time
• 6.08x power
• 2.57x energy
• Pruning ratio > 0.99993
Exploration – Results: IMAGE
• Exploration time: 50.2 min
• Configurations visited: 10135 (80)
• 8.29x execution time
• 8.57x power
• 1.81x energy
• Pruning ratio > 0.99998
Exploration – Results: MATRIX
• Exploration time: 73.6 min
• Configurations visited: 12623 (84)
• 10.7x execution time
• 8.16x power
• 3.18x energy
• Pruning ratio > 0.99997
Exploration – Results
[Figure: three plots for the JPEG benchmark]
Conclusion
• Gave a system-level algorithm for exploring the solution space of an application mapped to a parameterized SOC architecture
• Given a dependency graph, we extensively prune the solution space
• Pruning ratio > 0.99997 in experiments
• Future work:
  • Automatically compute the dependency model
  • Replace the exhaustive sub-algorithm with a heuristic (e.g., gradient search, GA)