Benefits of sampling in tracefiles
Harald Servat, Program Development for Extreme-Scale Computing, May 3rd, 2010
Outline
• Instrumentation and sampling
• Folding
• Summarized traces
• Some results
• Current work
Instrumentation
• Performance tools are commonly based on instrumentation
• Granularity of the results depends on the application structure
• Data gathered includes: performance counters, callstack, message sizes, …
Sampling
• Sampling reaches any application point, at a given interval
• Easily tunable frequency
• Gathers performance counters and the callstack
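As a toy illustration of timer-driven sampling (not the actual tool, which drives samples from hardware-counter overflows via PAPI), here is a minimal Python sketch using the POSIX profiling timer; all names and the workload are illustrative:

```python
import signal
import time

def sampled_run(workload, interval=0.01):
    """Run `workload` while taking a sample every `interval` seconds of
    CPU time; each sample records a timestamp and the current call stack."""
    samples = []

    def handler(signum, frame):
        # Walk the interrupted frame chain to recover the call stack.
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        samples.append((time.process_time(), stack))

    old = signal.signal(signal.SIGPROF, handler)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        workload()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling
        signal.signal(signal.SIGPROF, old)
    return samples

def busy(n=2_000_000):
    # CPU-bound loop standing in for a computation phase.
    total = 0
    for i in range(n):
        total += i * i
    return total

samples = sampled_run(busy)
```

Changing `interval` is all it takes to tune the frequency, which is the property the folding technique later exploits.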
Main objective
• Combine both mechanisms for deeper performance details
• Implemented using PAPI_overflow(..)
• … what about the frequency trade-off?
• Not too high, so the performance data is not disrupted
• Not too low, so the samples still carry useful information
Work done: Folding
• Harald Servat, Germán Llort, Judit Giménez, Jesús Labarta: Detailed performance analysis using coarse grain sampling. PROPER, 2009.
• Objective: obtain detailed metrics from few samples
• Benefits from both high and low frequencies!
• Takes advantage of the stationary behavior of scientific applications
• Builds a synthetic region from scattered samples
• Reintroduces them into the tracefile at a chosen ratio
Folding: Moving samples
• Main idea: move each sample to the target iteration, preserving its relative time within its original iteration.
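The moving step can be sketched as follows; this is a minimal Python illustration, and `fold`, the boundary list, and the sample format are assumptions for the sketch, not the tool's actual interface:

```python
import bisect

def fold(samples, boundaries):
    """Map each (absolute_time, value) sample onto one synthetic iteration
    of unit length, preserving its relative position inside the
    instrumented region that contains it.

    `boundaries` is a sorted list of region entry times; region i spans
    [boundaries[i], boundaries[i+1]).
    """
    folded = []
    for t, value in samples:
        i = bisect.bisect_right(boundaries, t) - 1
        if i < 0 or i >= len(boundaries) - 1:
            continue  # sample falls outside the delimited regions
        start, end = boundaries[i], boundaries[i + 1]
        folded.append(((t - start) / (end - start), value))
    folded.sort()  # order by relative time within the synthetic iteration
    return folded

# Three iterations of length 10, 12 and 8, one sample inside each:
boundaries = [0.0, 10.0, 22.0, 30.0]
samples = [(2.5, "a"), (16.0, "b"), (28.0, "c")]
print(fold(samples, boundaries))  # [(0.25, 'a'), (0.5, 'b'), (0.75, 'c')]
```

Three samples scattered over three iterations become three points inside a single synthetic iteration, which is why even a low sampling frequency eventually yields a densely covered region.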
Folding: Interpolation
• Evolution of completed instructions for routine copy_faces of NAS MPI BT class B
• No instrumentation points within the routine, yet we obtain the details
• Red crosses are the folded samples: instructions completed since the start of the routine
• Green line is the curve fit of the folded samples, used to reintroduce values into the tracefile
• Blue line is the derivative of the curve fit
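A stand-in for the fitting step: the actual work uses a proper curve-fitting algorithm, but piecewise-linear interpolation plus a finite-difference derivative shows the idea (the "green" and "blue" lines of the plot); names and data are illustrative:

```python
def interpolate(points, x):
    """Piecewise-linear interpolation through sorted (phase, count) points
    -- a simple stand-in for the curve fitting of the folded samples."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the fitted range")

def derivative(points, x, h=1e-4):
    """Finite-difference slope of the fitted curve: the instruction rate."""
    return (interpolate(points, x + h) - interpolate(points, x - h)) / (2 * h)

# Folded samples: (relative time, completed instructions in millions).
folded = [(0.0, 0.0), (0.25, 10.0), (0.5, 30.0), (0.75, 45.0), (1.0, 50.0)]
interpolate(folded, 0.375)  # 20.0
derivative(folded, 0.375)   # ≈ 80
```

The fitted curve supplies a counter value at any relative time, so synthetic events can be reintroduced into the tracefile at points where no instrumentation ever ran.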
Folding areas
• Folding is applied to delimited regions
• Previously instrumented
• User function
• Iteration
• Automatically obtained from the gathered results
• Clusters of computation bursts
• Juan González, Judit Giménez, Jesús Labarta: Automatic detection of parallel applications computation phases. IPDPS, 2009.
• Delimited time regions
• Marc Casas, Rosa M. Badia, Jesús Labarta: Automatic Structure Extraction from MPI Applications Tracefiles. Euro-Par, 2007.
Impact of the sampling frequency
• The more samples are folded, the more detailed the results
• Longer executions
• Increased frequency
• Is stability reached?
• Example:
• NAS BT class B, copy_faces
• showing from 10 to 200 iterations
• 20 samples per second @ SGI Altix
Impact of the sampling frequency
• Choosing a sampling frequency is important
• The sampling frequency can couple with the application frequency, so samples keep landing on the same points of the iteration
• Choose frequencies based on prime factors
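The coupling effect is easy to reproduce: if the sampling period divides the iteration period, every sample lands on the same few phases, while a period coprime with the iteration length sweeps the whole iteration. A small Python sketch with made-up periods:

```python
def sample_phases(iteration_period, sampling_period, n_samples):
    """Distinct phases (positions within the iteration) where samples land."""
    return sorted({(k * sampling_period) % iteration_period
                   for k in range(n_samples)})

# Iteration takes 100 ms.  Sampling every 50 ms couples with it:
coupled = sample_phases(100, 50, 200)    # only phases 0 and 50 ever observed
# Sampling every 7 ms (7 and 100 share no prime factor) sweeps it all:
decoupled = sample_phases(100, 7, 200)
print(len(coupled), len(decoupled))  # 2 100
```

Hence the advice above: pick sampling and iteration periods without common prime factors, so the folded samples cover the region uniformly instead of piling up on a few phases.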
Outline
• Instrumentation and sampling
• Folding
• Summarized traces
• Some results
• Current work
Dealing with large scale traces
• Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools. PARCO, 2005.
• The application's behavior can be divided into:
• Communication phases
• Intensive computation phases
• An instrumentation library identifies the relevant computation phases
Dealing with large scale traces
• Information is emitted at each phase change
• Punctual (callstack)
• Aggregated
• Hardware counters
• Software counters
• Number of point-to-point and collective operations
• Number of bytes transferred
• Time in MPI
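A minimal sketch of the aggregation idea, assuming hypothetical callback names (`on_mpi_call`, `on_phase_change`); the real library's interface differs:

```python
class PhaseSummarizer:
    """Accumulate software counters during a phase and emit one
    aggregated record when the phase changes, instead of one event
    per MPI call (illustrative sketch)."""

    def __init__(self):
        self.records = []
        self._reset()

    def _reset(self):
        self.p2p = 0          # point-to-point operations
        self.collectives = 0  # collective operations
        self.nbytes = 0       # bytes transferred
        self.mpi_time = 0.0   # time spent inside MPI

    def on_mpi_call(self, kind, nbytes, elapsed):
        if kind == "p2p":
            self.p2p += 1
        else:
            self.collectives += 1
        self.nbytes += nbytes
        self.mpi_time += elapsed

    def on_phase_change(self, phase_name, callstack):
        self.records.append({
            "phase": phase_name,
            "callstack": callstack,  # the "punctual" information
            "p2p": self.p2p,
            "collectives": self.collectives,
            "bytes": self.nbytes,
            "mpi_time": self.mpi_time,
        })
        self._reset()

s = PhaseSummarizer()
s.on_mpi_call("p2p", 1024, 0.002)
s.on_mpi_call("p2p", 2048, 0.003)
s.on_mpi_call("collective", 4096, 0.010)
s.on_phase_change("exchange", ["main", "solver", "exchange_halo"])
print(s.records[0]["p2p"], s.records[0]["bytes"])  # 2 7168
```

One record per phase, rather than per call, is what makes the size reductions on the next slide possible.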
Example
• PEPC, 16384 tasks on Jaguar
(Figures: duration of the computation bursts; number of MPI collective operations)
Benefits of summarized tracefiles
• Important trace size reduction
• Gadget2 (128 tasks): 10 GBytes down to 428 MBytes
• PEPC (16k tasks): 19 GBytes down to 400 MBytes
• PFLOTRAN (16k tasks): over 250 GBytes down to 6 GBytes
• Whole-execution analysis becomes feasible
Working with large traces?
• We are dealing with large scale executions
• Maintain the scalability of tracing + sampling
• By adding more data?
• Use folding to reduce the data
• Example (Gadget2 with 128 tasks):
• 100 iterations, 5 samples/s during 90 minutes ≈ 236 MB
• Folding onto 1 iteration @ 200 samples/s ≈ 64 MB
Outline
• Instrumentation and sampling
• Folding
• Summarized traces
• Combining mechanisms
• Some results
• Current work
Gadget2 analysis, 128 tasks
• Dominant computation regions (shares observed: 32%, 16%, 13%, 8%):
• force_tree.c +75 - gravity_tree.c +167
• predict.c +92 - pm_periodic.c +385
• gravity_tree.c +528 - density.c +167
• force_tree.c +1701 - hydra.c +246
PEPC analysis, 32 tasks
• Dominant computation regions (shares observed: 45%, 37%, 5%, 3%):
• tree_aswalk.f90 +162 - tree_aswalk.f90 +380
• tree_aswalk.f90 +380 - tree_aswalk.f90 +162
• tree_domains.f90 +548 - tree_branches.f90 +155
• tree_branches.f90 +548 - tree_properties.f90 +328
Current directions
• We are working on:
• Is there an optimal sampling frequency?
• Quantifying correctness and validating the results
• Callstack analysis
Thank you!