From Adaptive to Self-Tuning Systems
Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos
School of Electrical and Computer Engineering

Architectural Challenges
Single-thread performance is bounded by four walls:
• Power Wall
  • Leakage current increases 7.5X with each generation [3]
  • Figure: power vs. ILP as the pipeline moves from in-order to OOO to aggressive OOO
  • Negative returns with power
  • Increasing inefficiencies due to
    • speculation
    • control flow
• Frequency Wall
  • Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays) [4]
• Memory Wall
  • Image source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
  • Cache area: 80% of transistor budget, 50% of total area [1]
  • Defects in cache affect processor yield
  • Caches are significant power consumers (e.g. > 40% of total power in StrongARM) [2]
  • On-chip-DRAM gap continues to grow
• Economic Wall
  • Costs of developing next-generation processors
  • Design & manufacturing costs
  • Extreme device variability
References
[1] P. Ranganathan, S. Adve, N. Jouppi. "Reconfigurable Caches and their Application to Media Processing." ISCA 2000
[2] M. Zhang, K. Asanovic. "Fine-Grain CAM-Tag Cache Resizing Using Miss Tags." ISLPED 2002
[3] S. Borkar. "Design Challenges of Technology Scaling." Micro 1999
[4] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, D. Burger. "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures." ISCA 2000

System View
Figure: a large-scale, many-core, heterogeneous system drawn as an array of processors (P) and memories (M)
1. Capture and adapt to intrinsic application behavior
  • Dynamic, on-line, evolutionary behaviors
  • Static, off-line characterizations
2. Device-level variations reduce architecture yield
Solution: systems are self-tuning

The Space of Solutions
Figure: a design space whose axis labels are "Structured Workloads" to "Ill-Structured Workloads" and "Rigid HW/SW Boundaries" to "Evolutionary or Self-Tuning Systems", with small processor (P) and memory (M) diagrams at each point
• Traditional architectures (fixed)
• Ability to customize architectures before application deployment
• Architectures change at SW-determined points of execution
• Architectures continuously and autonomously evolve and adapt
• Region label: State of the Practice

From Adaptive to Self-Tuning
• Where do we make future investments in transistors and software?
• Hardware/software co-design for continuous monitoring and/or tuning
• Expose and (dynamically) eliminate design redundancies
• Two examples
  • Cache memory hierarchy
  • On-chip networks

Generational Behavior of Caches
Figure: memory lines plotted against time; a miss starts a new generation, hits follow, and an idle interval precedes the miss that starts the next generation
References
1. S. Kaxiras, Z. Hu, M. Martonosi. "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power." ISCA 2001
2. J. Abella, A. González, X. Vera, M. F. P. O'Boyle. "IATAC: A Smart Predictor to Turn-Off L2 Cache Lines." TACO 2005

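The idle interval in the figure is what decay-style schemes exploit: a line that has not been touched for long enough is assumed to be at the end of its generation and can be switched off. The sketch below is only an illustration of that idea, not the mechanism of either cited paper. The function name, the `decay_interval` parameter and the unbounded-capacity model are all assumptions made to keep the example small.

```python
def simulate_decay(trace, decay_interval):
    """Replay (time, line_id) accesses and count hits and misses.

    Assumption: a line is switched off once it has been idle for more than
    `decay_interval` time units, so the next access to it is a miss that
    starts a new generation. Capacity and conflict misses are not modeled.
    """
    last_access = {}          # line_id -> time of the most recent access
    hits = misses = decay_misses = 0
    for time, line_id in trace:
        previous = last_access.get(line_id)
        if previous is None:
            misses += 1       # first touch: the first generation begins
        elif time - previous > decay_interval:
            misses += 1       # the line was off during its idle interval
            decay_misses += 1
        else:
            hits += 1         # still inside the current generation
        last_access[line_id] = time
    return hits, misses, decay_misses


# Hypothetical trace: line A is reused quickly, then left idle for a long time.
trace = [(0, "A"), (1, "A"), (2, "A"), (100, "A"), (101, "B")]
short_interval = simulate_decay(trace, decay_interval=10)
long_interval = simulate_decay(trace, decay_interval=1000)
```

A short interval saves more leakage but turns the late reuse of A into an extra miss; a long interval keeps the hit but leaves the line powered through the idle period.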
Cache Tuning: Conceptual Model
• Remap memory into the cache: shape the cache
• Match the program footprint: resize the cache

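The two knobs above can be pictured as changes to the function that maps a memory line to a cache set. The following is a minimal sketch under stated assumptions: the class `TunableIndex`, its methods and the dictionary-based remap table are hypothetical names for illustration and are not taken from the talk or the cited papers.

```python
class TunableIndex:
    """Maps a memory line address to a cache set, with two tuning knobs."""

    def __init__(self, num_sets):
        if num_sets <= 0 or num_sets & (num_sets - 1):
            raise ValueError("num_sets must be a positive power of two")
        self.num_sets = num_sets
        self.active_sets = num_sets   # resize knob: how many sets are in use
        self.remap = {}               # remap knob: default set -> chosen set

    def resize(self, active_sets):
        """Match the program footprint by using only `active_sets` sets."""
        if not 0 < active_sets <= self.num_sets or active_sets & (active_sets - 1):
            raise ValueError("active_sets must be a power of two <= num_sets")
        self.active_sets = active_sets
        # Drop remap entries that refer to sets that are no longer active.
        self.remap = {s: t for s, t in self.remap.items()
                      if s < active_sets and t < active_sets}

    def place(self, src_set, dst_set):
        """Shape the cache by redirecting lines of src_set into dst_set."""
        if not (0 <= src_set < self.active_sets and 0 <= dst_set < self.active_sets):
            raise ValueError("both sets must be active")
        self.remap[src_set] = dst_set

    def set_index(self, line_addr):
        default = line_addr % self.active_sets   # conventional modulo placement
        return self.remap.get(default, default)


index = TunableIndex(num_sets=8)
before = index.set_index(13)      # modulo placement over 8 sets
index.resize(4)                   # shrink to 4 active sets
after_resize = index.set_index(13)
index.place(1, 3)                 # redirect set 1 into set 3
after_remap = index.set_index(13)
```

With no remap entries and all sets active, the mapping degenerates to plain modulo placement, which makes it a convenient baseline for comparison.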
Cache Tuning: System Model & Opportunities
Figure: annotated program code next to a system with a processor (P), L1 and L2 caches, AT logic with a LUT, and memory (M); axes labeled x, y, z
• Static analysis or programmer supplied: a Placement( B[][], param ) remapping directive for structured accesses
• Profile-based insertion: Placement( B[][], param ) inserted inside a loop in Region A
• Run-time tuning: Thread 1, Thread 2
• Alternative implementations: AT logic, LUT

Static Tuning: Scientific Applications
• Targeted to programs with predictable access patterns
• Compiler can both resize and remap
• Advanced compiler optimizations made possible

Dynamic Tuning: Folding Heuristics
• Find and utilize redundancies in the design
• Miss folding: fold misses via re-mapping memory lines into the same cache set
• Comparisons shown for a 256 KB L2 cache
Reference: S. Ramaswamy, S. Yalamanchili. "Improving Cache Efficiency via Resizing + Remapping." ICCD 2007

Tuning for Yield: Decreasing Defect Sensitivity*
• Performance yield: yield at a given performance (e.g. AMAT) for 1000 units
• Up to four times greater than modulo placement
• Exploiting redundancies: application to power management
• Recovering design inefficiencies
* S. Ramaswamy, S. Yalamanchili. "Customizable Fault Tolerant Caches for Embedded Processors." ICCD 2006

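For readers unfamiliar with the metric, the sketch below shows one way performance yield could be computed. AMAT uses the standard textbook definition (hit time plus miss rate times miss penalty). The pass/fail rule, the function names and the sample numbers are assumptions for illustration and are not results from the cited paper.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty


def performance_yield(unit_amats, target_amat):
    """Fraction of manufactured units whose AMAT meets the target.

    Assumption: a unit counts as good when its AMAT is at or below the target,
    even if some of its cache sets are defective and have been remapped.
    """
    if not unit_amats:
        raise ValueError("need at least one unit")
    passing = sum(1 for value in unit_amats if value <= target_amat)
    return passing / len(unit_amats)


# Hypothetical numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
example_amat = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)
# Hypothetical per-unit AMATs for four units; the slide uses 1000 units.
example_yield = performance_yield([5.0, 6.0, 7.0, 8.0], target_amat=6.5)
```

Under this definition a defective cache does not automatically make a unit fail; it only fails if the defects push its AMAT past the target, which is why placement that steers lines away from bad sets can raise yield.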
Opportunities
• Voltage scaling
  • Combine voltage scaling and remapping for program-phase-dependent power management
• Compiler-directed hardware optimizations
  • For example, concurrent data layout + cache placement
• Application to multi-threaded and multi-core domains
  • Cache sharing across threads
  • Challenge: coherency traffic

The On-Chip Network
• The network is in the critical path (performance)
  • Operand networks
  • Cache hierarchy
  • System on Chip
• Increasing impact of wire (channel) delays
  • Wire delays must be actively managed
• On-demand resource management
  • Initial studies: link tuning
• Reference: research at EPFL & Stanford on robust link design

A System for Tuning and Actively Reconfiguring SoC Links (STARS)
Figure: timing diagram of three latches (Latch 1, Latch 2, Latch 3), each capturing Value 1 or Value 2 over time, for clocks that are too fast, well tuned, and too slow
• Variable delays and cascaded registers measure link delay
• Digital PLL tunes the clock to match the link delay

FPGA Tests
• Monitoring
  • Find start of link transition
  • Find end of link transition
• Tuning
  • Determine slack in the link
  • Adjust clock frequency
• Low-speed tests to validate the control strategy

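The monitoring and tuning steps above can be read as a simple control loop. The sketch below is one possible interpretation, not the STARS implementation. The `probe` callback, the sweep over sampling delays, the `margin` parameter and the rule that sets the new period to the end of the transition plus the margin are all assumptions. The period limits in the usage example reuse the DCO range quoted on the prototyping slide.

```python
def find_transition_window(probe, delays):
    """Sweep the sampling delay and locate the link transition.

    `probe(delay)` is a hypothetical hook that returns a non-empty list of
    booleans, one per repeated sample, True when the new value was latched
    `delay` picoseconds after the launching clock edge.
    Returns (start, end): the first delay at which any sample sees the new
    value, and the first delay at which all samples see it.
    """
    start = end = None
    for delay in sorted(delays):
        samples = probe(delay)
        if start is None and any(samples):
            start = delay
        if all(samples):
            end = delay
            break
    return start, end


def tune_period(period, window_end, margin, min_period, max_period):
    """Determine slack in the link, then pick a new clock period.

    Positive slack means the clock is slower than the link needs;
    negative slack means the clock is too fast for the link.
    """
    slack = period - (window_end + margin)
    new_period = min(max(period - slack, min_period), max_period)
    return slack, new_period


def fake_probe(delay):
    # Hypothetical link whose arrival time jitters between 300 and 340 ps.
    return [delay >= arrival for arrival in (300, 320, 340)]


window = find_transition_window(fake_probe, range(0, 1000, 20))
slack, new_period = tune_period(period=500, window_end=window[1], margin=20,
                                min_period=240, max_period=2970)
```

Separating measurement from adjustment mirrors the slide's split between monitoring and tuning, and lets the same measurement drive other actions later.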
Prototyping: 180 nm
• Variable Delay Elements (VDE)
  • Variable delay from 118 ps to 1.47 ns
  • 10 bits of resolution
  • 502 transistors
• Digitally Controlled Oscillator (DCO)
  • Clock period from 240 ps to 2.97 ns
  • 10 bits of resolution
  • 528 transistors
• Digital Clock Divider (DCD)
  • Min input clock period 480 ps
  • 8 bits of resolution
  • 1127 transistors
• Allows tuning links up to 2.083 GHz
  • From reference clock of 8.13 MHz

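The headline figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that the 2.083 GHz limit is the reciprocal of the divider's 480 ps minimum input period, that the 8.13 MHz reference corresponds to dividing that frequency by 2^8, and that the VDE delay steps are uniform. These are inferences from the numbers, not statements made on the slide.

```python
def period_ps_to_ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


# Maximum link clock, assuming it is set by the DCD minimum input period.
max_link_ghz = period_ps_to_ghz(480)                 # about 2.083 GHz

# Reference clock, assuming the full 8-bit divider ratio (2**8 = 256).
reference_mhz = max_link_ghz * 1000.0 / 2 ** 8       # about 8.138 MHz

# Fastest DCO clock from its 240 ps minimum period.
dco_max_ghz = period_ps_to_ghz(240)                  # about 4.167 GHz

# Average VDE delay step, assuming 2**10 uniformly spaced settings.
vde_step_ps = (1470 - 118) / (2 ** 10 - 1)           # about 1.32 ps
```

Under these assumptions the computed reference is about 8.138 MHz, which agrees with the slide's 8.13 MHz if that figure was truncated rather than rounded.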
Extensions
• Modulate link widths
• Modulate buffer organizations
  • Channels/depth
• Feedback between local congestion detection and link and buffer resources

Summary
• Application demands will be time-varying
• Technology will introduce time-varying hardware characteristics
• Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns
• Need the support of abstractions for tuning
• Influence of prior applications to datapaths (Razor-UMich), communication systems (Vizor-GT), and reliable links (Stanford/EPFL)
• Build on existing research in cache performance & power management