Visualization Enables the Programmer to Reduce Cache Misses Kristof Beyls, Erik H. D’Hollander, Yijun Yu Ghent University PDCS - November 2002
Overview • Introduction • Reuse Distance Metric • Data Locality Visualization • Case Study: MCF • Conclusion
Introduction • Anti-law of Moore [Figure: relative speed compared to 1980 (logarithmic axis, 1 to 1000) by year, 1980 to 2000, with curves labeled PROCESSOR and MEMORY and the difference between them labeled as the speed gap.]
Cache capacity misses dominate • 3 cache miss types: Cold, Conflict, Capacity.
Optimization at different levels • Cache optimization at 3 levels: • Hardware: only resolves conflict misses. • Compiler: only resolves a tiny portion of the capacity misses. • Algorithm: performed by the programmer, who should try to eliminate the capacity misses that hardware and compiler cannot handle well. • Problem: cache behavior is not obvious in source code.
Objectives for cache visualization • Cache behavior should be visualized in a program-centric way. • Cache behavior should be described accurately and concisely. • The description should be independent of specific cache parameters. • The reuse distance metric meets the above objectives.
Reuse Distance: Definition • Reuse pair • Reuse distance of reuse pair • Backward reuse distance • Example access trace: A B C D D A F (annotated reuse distances: 3 and 0) • Backward reuse distance > cache size ⇒ capacity miss
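The slide names the three terms without spelling them out. A minimal sketch, assuming the usual definitions: a reuse pair is two consecutive accesses to the same address, its reuse distance is the number of distinct other addresses accessed in between, and the backward reuse distance of an access is the distance of the pair it closes. The function name, the `NO_REUSE` marker and the one-character-per-address trace encoding are illustrative choices, not taken from the slides, and the quadratic scan stands in for whatever faster method a real tool would use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker for a first access (no earlier access to the address). */
#define NO_REUSE (-1L)

/* Backward reuse distance of trace[i]: the number of distinct addresses
 * accessed strictly between trace[i] and the most recent earlier access
 * to the same address. Each character of the trace stands for one address.
 * Naive quadratic scan, meant only to illustrate the definition. */
long backward_reuse_distance(const char *trace, size_t n, size_t i)
{
    assert(i < n);

    /* Find the most recent earlier access to the same address. */
    size_t prev = i;
    int found = 0;
    while (prev > 0) {
        prev--;
        if (trace[prev] == trace[i]) {
            found = 1;
            break;
        }
    }
    if (!found)
        return NO_REUSE;

    /* Count the distinct addresses strictly between prev and i. */
    long distinct = 0;
    for (size_t j = prev + 1; j < i; j++) {
        int seen_before = 0;
        for (size_t k = prev + 1; k < j; k++) {
            if (trace[k] == trace[j]) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before)
            distinct++;
    }
    return distinct;
}
```

On the trace A B C D D A F, the second D has distance 0 and the second A has distance 3 (B, C and D lie in between), which matches the two values annotated on the slide.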
Visualization: Overview 1. Instrumentation 2. Simulation 3. Filtering 4. Visualization 5. Program Optimization
1. Instrumentation • For every load, store and prefetch instruction, a profiling call is inserted: profile_memaccess(instr_id, address) • Implemented in the Open Research Compiler.
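To make the effect of the instrumentation concrete, here is a hand-written sketch of what an instrumented loop could look like. Only the name `profile_memaccess` and its two argument names come from the slide; the argument types, the stub body that merely counts calls, and the function `instrumented_sum` are assumptions for illustration. In the described system the compiler inserts these calls automatically.

```c
#include <assert.h>
#include <stddef.h>

/* Stub state, hypothetical: a real profiling library would feed each
 * access into the reuse distance simulation instead of counting it. */
static size_t access_count = 0;
static int last_instr_id = -1;
static const void *last_address = NULL;

/* Assumed signature: the slides give only the argument names. */
void profile_memaccess(int instr_id, const void *address)
{
    access_count++;
    last_instr_id = instr_id;
    last_address = address;
}

/* A summing loop, instrumented by hand the way the compiler pass would
 * do it: one profiling call in front of every load. */
long instrumented_sum(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        profile_memaccess(1, &data[i]); /* id 1 stands for the load of data[i] */
        sum += data[i];
    }
    return sum;
}
```

Summing a four-element array this way triggers four profiling calls, one per load, each carrying the identifier of the originating instruction and the address touched.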
2. Simulation • A library implements profile_memaccess • It uses hash tables and binary treaps to quickly compute • the reuse distance per reuse pair • the reuse distance distribution of all reuse pairs between any two instructions. • Only the distribution is stored to disk, as XML.
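The stored XML records a log2distance value and a count, which suggests the distribution is kept as a histogram over power-of-two distance buckets. A minimal sketch of such a histogram follows, assuming buckets of floor(log2(distance)) and one histogram per pair of instructions; the exact bucketing rule, the type name and the function names are assumptions, and the hash tables and treaps mentioned on the slide are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG2_BUCKETS 64 /* enough for any 64-bit distance */

/* floor(log2(distance)) for distance >= 1.
 * Distance 0 is mapped to bucket 0 as well (an assumption). */
unsigned log2_bucket(unsigned long long distance)
{
    unsigned bucket = 0;
    while (distance > 1) {
        distance >>= 1;
        bucket++;
    }
    return bucket;
}

/* Hypothetical per-instruction-pair histogram of reuse distances. */
typedef struct {
    unsigned long long count[MAX_LOG2_BUCKETS];
} reuse_histogram;

/* Record one reuse pair with the given reuse distance. */
void histogram_add(reuse_histogram *h, unsigned long long distance)
{
    assert(h != NULL);
    h->count[log2_bucket(distance)]++;
}
```

Keeping only bucket counts means the output size depends on the number of instruction pairs rather than on the length of the trace, which is consistent with storing only the distribution to disk.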
3. Filtering • Only reuse distances larger than the cache size generate capacity misses. • An XSLT filter selects those. E.g.:
<reference id="pbeampp.c/primal_bea_mpp:21">
  <reuse>
    <log2distance>15</log2distance>
    <fromid>pbeampp.c/primal_bea_mpp:21</fromid>
    <count>24628629</count>
  </reuse>
</reference>
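The same selection can be sketched in C instead of XSLT. The record type below mirrors the log2distance and count fields of the XML example; the type name, the function name and the threshold parameter are hypothetical, and the sketch assumes that the cache size is a power of two expressed in the same unit as the reuse distance, which the slides do not state.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors one <reuse> element: a log2 distance bucket and its count. */
typedef struct {
    unsigned log2distance;
    unsigned long long count;
} reuse_record;

/* Sum the counts of all records whose distance bucket reaches the cache
 * size, i.e. the reuses expected to cause capacity misses.
 * log2_cache_size is an assumed parameter: log2 of the cache size. */
unsigned long long long_distance_reuses(const reuse_record *records,
                                        size_t n,
                                        unsigned log2_cache_size)
{
    assert(records != NULL || n == 0);
    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (records[i].log2distance >= log2_cache_size)
            total += records[i].count;
    }
    return total;
}
```

Because the threshold is a parameter, the same stored distribution can be filtered again for a different cache size without rerunning the simulation.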
4. Visualization • In our prototype, an XSLT script generates the input for the VCG visualizer. [Figure: visualization output; two labels read 22.12% and 48.09%.]
5. Optimization • The reuses labeled 22.12% and 48.09% together account for 70% of capacity misses:
for( ; arc < stop_arcs; arc += nr_group ) {
    if( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        /* keep arcs whose reduced cost has the sign that matches their bound */
        if( (red_cost<0 && arc->ident == AT_LOWER)
            || (red_cost>0 && arc->ident == AT_UPPER) ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}
5. Optimization • The same loop with a prefetch inserted:
#define PREFETCH_DISTANCE 8
for( ; arc < stop_arcs; arc += nr_group ) {
    /* request the arc that will be visited PREFETCH_DISTANCE iterations from now */
    PREFETCH(arc+nr_group*PREFETCH_DISTANCE);
    if( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        if( (red_cost<0 && arc->ident == AT_LOWER)
            || (red_cost>0 && arc->ident == AT_UPPER) ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}
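The slide does not show how PREFETCH is defined. One possible definition, sketched below on a self-contained strided loop, maps it to the `__builtin_prefetch` builtin of GCC and Clang and to a no-op elsewhere. The `item_t` type and the `strided_sum` function are hypothetical stand-ins for the MCF arc structure and loop, and the bounds check before the prefetch is an addition that keeps the address computation inside the array; the slide's loop has no such check.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 8

/* Assumed definition: the slides use PREFETCH without defining it. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)0)
#endif

/* Hypothetical element type standing in for the MCF arc structure. */
typedef struct {
    long cost;
    int ident;
} item_t;

/* Visit every stride-th element and prefetch the element that will be
 * visited PREFETCH_DISTANCE iterations later. */
long strided_sum(const item_t *items, size_t n, size_t stride)
{
    assert(stride > 0);
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + stride * PREFETCH_DISTANCE;
        if (ahead < n)
            PREFETCH(&items[ahead]); /* a hint only; results do not change */
        sum += items[i].cost;
    }
    return sum;
}
```

A prefetch is only a hint, so the function returns the same value with or without it. The distance of 8 iterations is the value from the slide; a suitable value on other hardware would have to be measured.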
Conclusion • Programmer-driven optimizations complement hardware and compiler techniques. • Reuse distance indicates cache bottlenecks for a wide range of cache configurations. • MCF: speedup between 24% and 48% on CISC, RISC and EPIC processors. • Reuse distance visualization enables portable and platform-independent cache optimizations.