Performance Monitoring Update
Daniele Francesco Kruse
April 2010

Summary
• Monitoring CMSSW on Nehalem
• SDL substitute candidates
• Monitoring Geant4

A performance model for Nehalem: Overview
Total_cycles: CPU_CLK_UNHALTED:THREAD_P
Useless_uops: (UOPS_EXECUTED:PORT015 + UOPS_EXECUTED:PORT234_CORE) - UOPS_RETIRED:ANY
PORT015_uop_execution_rate: UOPS_EXECUTED:PORT015 / UOPS_EXECUTED:PORT015(CMASK=1)
PORT234_CORE_uop_execution_rate: UOPS_EXECUTED:PORT234_CORE / UOPS_EXECUTED:PORT234_CORE(CMASK=1)
Uop_execution_rate: PORT015_uop_execution_rate + PORT234_CORE_uop_execution_rate
Useless_cycles: Useless_uops / Uop_execution_rate
Useful_cycles: UOPS_RETIRED:ANY / Uop_execution_rate
Active_cycles: Useless_cycles + Useful_cycles
Stalled_cycles: Total_cycles - Active_cycles
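
The model above can be sketched as a small function over raw counter values. This is only an illustration: the dictionary keys reuse the event names from the slide as plain labels (they are not meant as pfmon command-line syntax), and the sample numbers are invented to make the arithmetic easy to follow.

```python
def cycle_accounting(counters):
    """Derive the cycle accounting quantities from raw counter values.

    `counters` maps the event names used on the slide to their counts.
    """
    total_cycles = counters["CPU_CLK_UNHALTED:THREAD_P"]
    port015 = counters["UOPS_EXECUTED:PORT015"]
    port234 = counters["UOPS_EXECUTED:PORT234_CORE"]
    retired = counters["UOPS_RETIRED:ANY"]

    # Uops that were executed but never retired.
    useless_uops = (port015 + port234) - retired

    # Uops per cycle, counting only cycles in which the ports were busy.
    port015_rate = port015 / counters["UOPS_EXECUTED:PORT015(CMASK=1)"]
    port234_rate = port234 / counters["UOPS_EXECUTED:PORT234_CORE(CMASK=1)"]
    uop_execution_rate = port015_rate + port234_rate

    useless_cycles = useless_uops / uop_execution_rate
    useful_cycles = retired / uop_execution_rate
    active_cycles = useless_cycles + useful_cycles
    return {
        "total_cycles": total_cycles,
        "useless_cycles": useless_cycles,
        "useful_cycles": useful_cycles,
        "active_cycles": active_cycles,
        "stalled_cycles": total_cycles - active_cycles,
    }


# Invented sample values, for illustration only.
sample = {
    "CPU_CLK_UNHALTED:THREAD_P": 1000,
    "UOPS_EXECUTED:PORT015": 600,
    "UOPS_EXECUTED:PORT015(CMASK=1)": 400,
    "UOPS_EXECUTED:PORT234_CORE": 300,
    "UOPS_EXECUTED:PORT234_CORE(CMASK=1)": 600,
    "UOPS_RETIRED:ANY": 800,
}
result = cycle_accounting(sample)
print(result)
```

With these sample values the execution rate is 2 uops per cycle, so 800 retired uops account for 400 useful cycles and the 100 non-retired uops for 50 useless cycles.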

Cycle Accounting Analysis for Nehalem (Intel Core i7)
Total cycles (application total execution time): CPU_CLK_UNHALTED:THREAD_P
• Issuing μops: Active_cycles = Useless_cycles + Useful_cycles
  - Retiring μops (useful work): UOPS_RETIRED:ANY / Uop_execution_rate
  - Not retiring μops (useless work): Useless_uops / Uop_execution_rate
• Not issuing μops: Total_cycles - Active_cycles
  - Stalled (no work): Total_cycles - Active_cycles

Nehalem: Overview of memory and cache stalls
• Memory and cache related stalls:
  - MEM_LOAD_RETIRED:DTLB_MISS // ~10 cycles
  - MEM_LOAD_RETIRED:L1D_HIT // too small: penalty hidden
  - MEM_LOAD_RETIRED:L2_HIT // ~14.5 cycles
  - MEM_LOAD_RETIRED:L3_MISS // ~180 cycles (arch. dependent)
  - MEM_LOAD_RETIRED:L3_UNSHARED_HIT // ~42 cycles
  - MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM // ~74 cycles
  - ITLB_MISS_RETIRED // too small: penalty hidden
• Other stalls:
  - ILD_STALL:ANY // BROKEN?!?!
  - RAT_STALLS
  - RESOURCE_STALLS
  - SEG_RENAME_STALLS
  - SQ_FULL_STALL_CYCLES
  - STORE_BLOCKS
• And finally, what happened to store-forward stalls?
  - Loads spanning cache lines cause almost no stalls anymore
  - Loads blocked by stores to unknown addresses, and loads blocked because they are not completely contained in a preceding store, still cause stalls
  - Unfortunately, there is no direct event to count these situations
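
As a rough illustration of how the penalties listed above could be used, the sketch below multiplies each event count by its approximate cost in cycles. It assumes the penalties simply add up, which ignores any overlap between outstanding loads, so the result is better read as a rough estimate than a measurement. The event counts are invented.

```python
# Approximate penalties in cycles, taken from the slide.
# L1D_HIT and ITLB_MISS_RETIRED are left out: their penalty is hidden.
PENALTY_CYCLES = {
    "MEM_LOAD_RETIRED:DTLB_MISS": 10,
    "MEM_LOAD_RETIRED:L2_HIT": 14.5,
    "MEM_LOAD_RETIRED:L3_MISS": 180,  # architecture dependent
    "MEM_LOAD_RETIRED:L3_UNSHARED_HIT": 42,
    "MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM": 74,
}


def estimate_memory_stall_cycles(counts):
    """Return (per-event estimated cycles, total) for the given event counts.

    Events missing from `counts` are treated as zero.
    """
    per_event = {
        event: counts.get(event, 0) * penalty
        for event, penalty in PENALTY_CYCLES.items()
    }
    return per_event, sum(per_event.values())


# Invented counts, for illustration only.
sample_counts = {
    "MEM_LOAD_RETIRED:DTLB_MISS": 100,
    "MEM_LOAD_RETIRED:L2_HIT": 200,
    "MEM_LOAD_RETIRED:L3_MISS": 10,
    "MEM_LOAD_RETIRED:L3_UNSHARED_HIT": 50,
}
per_event, total = estimate_memory_stall_cycles(sample_counts)
print(per_event)
print("estimated memory stall cycles:", total)
```

Comparing such an estimate with Stalled_cycles from the performance model is one possible way to judge how much of the stall time the listed events cover.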

Monitoring CMSSW on Nehalem
• First Nehalem results with CMSSW 3.6.0 pre2 (here)
  - (compare with Core results here)
• Tool discovers architecture at runtime (CPUID)
• First performance considerations on Nehalem
  - Faster (generally 3–15% better than the Core cycle count)
  - Lower CPI (the number and type of instructions stay the same, obviously)
  - Stalled cycles cut down to around 30% of Core values (but we need to verify coverage accurately)
  - Same percentage of mispredicted branches
  - More useful cycles seem to be required to do the same job
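
The runtime architecture discovery mentioned above could look roughly like the sketch below. It is not the tool's actual code: instead of issuing CPUID directly, it parses text in the format of Linux's /proc/cpuinfo, and the family 6 model numbers it uses to tell Nehalem from Core are an assumption that should be checked against Intel's documentation. The function name and the sample text are hypothetical.

```python
# Assumed family 6 model numbers; verify against Intel's documentation.
NEHALEM_MODELS = {26, 30, 46}
CORE_MODELS = {15, 23, 29}


def detect_microarchitecture(cpuinfo_text):
    """Classify the first processor described in /proc/cpuinfo-style text.

    Returns "nehalem", "core" or "unknown".
    """
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            # setdefault keeps the values of the first processor listed.
            fields.setdefault(key.strip(), value.strip())

    if fields.get("vendor_id") != "GenuineIntel":
        return "unknown"
    try:
        family = int(fields["cpu family"])
        model = int(fields["model"])
    except (KeyError, ValueError):
        return "unknown"

    if family != 6:
        return "unknown"
    if model in NEHALEM_MODELS:
        return "nehalem"
    if model in CORE_MODELS:
        return "core"
    return "unknown"


# Hypothetical excerpt in /proc/cpuinfo format.
SAMPLE_CPUINFO = (
    "vendor_id\t: GenuineIntel\n"
    "cpu family\t: 6\n"
    "model\t\t: 26\n"
    "model name\t: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz\n"
)
print(detect_microarchitecture(SAMPLE_CPUINFO))
```

On a Linux machine the same function can be fed the contents of /proc/cpuinfo; the result would then select which set of events to program.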

Structure and libraries
[Diagram: tool workflow from Start to End. Stages shown: Analysis Configuration, Program Run, Performance Data Taking, Performance Data Output, Performance Data Analysis, Browsable HTML results (version with graphs). Libraries shown: libpfm, zlib, libpng, libSDL, libSDL_ttf.]

SDL substitute candidates
• libSDL_ttf is not part of the standard SLC5 installation
• SDL substitute candidates (both successfully tested):
  - HTML5’s <canvas> tag:
    - Supported by Firefox, Opera, Safari & Chrome
    - Text drawing supported only by Firefox (Gecko), Safari & Chrome
    - Internet Explorer also supports it through Mozilla’s plugin
  - ROOT:
    - A little heavier and more difficult to adapt
    - Works the same way as the current SDL implementation (png output)
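
To make the <canvas> option more concrete, here is a minimal sketch of how a results page with a bar chart could be generated. It is an assumption about what such an implementation might look like, not the tool's code: the function name, page layout and sample fractions are invented. The generated script relies only on the standard canvas calls getContext("2d"), fillRect and fillText (the last one being the text drawing mentioned above).

```python
import json

# HTML page with placeholders that are filled in by canvas_bar_chart().
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<body>
<canvas id="chart" width="__WIDTH__" height="__HEIGHT__"></canvas>
<script>
var data = __DATA__;  // list of [label, fraction] pairs
var ctx = document.getElementById("chart").getContext("2d");
var barWidth = __WIDTH__ / data.length;
ctx.fillText(__TITLE__, 4, 12);
for (var i = 0; i < data.length; i++) {
  var barHeight = data[i][1] * (__HEIGHT__ - 40);
  ctx.fillRect(i * barWidth + 4, __HEIGHT__ - 20 - barHeight,
               barWidth - 8, barHeight);
  ctx.fillText(data[i][0], i * barWidth + 4, __HEIGHT__ - 6);
}
</script>
</body>
</html>
"""


def canvas_bar_chart(title, fractions, width=480, height=240):
    """Return an HTML page drawing `fractions` (label -> value in 0..1) as bars."""
    data = json.dumps([[label, value] for label, value in fractions.items()])
    page = PAGE_TEMPLATE.replace("__WIDTH__", str(width))
    page = page.replace("__HEIGHT__", str(height))
    page = page.replace("__TITLE__", json.dumps(title))
    return page.replace("__DATA__", data)


# Invented fractions, for illustration only.
page = canvas_bar_chart(
    "Cycle accounting", {"Useful": 0.4, "Useless": 0.05, "Stalled": 0.55}
)
print(page)
```

Writing the returned string to an .html file gives a page that needs no SDL libraries and no image generation on the monitoring side; the browser does the drawing.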

Monitoring Geant4
• Overall and symbol analysis already possible
  - pfmon command line tool & FullCMS simulation example
• Modular analysis through User Actions
  - Probably RunAction and EventAction combined
  - Type of particle, direction and energy determine complexity and type of event
  - This triple may be used to describe the “module” of the analysis
• Proposal: event-level granularity
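
The idea of using the (particle, direction, energy) triple as the "module" can be illustrated with a small aggregation sketch. In Geant4 the per-event measurements would be collected from user actions written in C++; the Python below only shows the grouping step on already collected data. The record layout, the direction labels and the decade-based energy binning are all assumptions made for this example.

```python
import math
from collections import defaultdict


def module_key(particle, direction, energy_gev):
    """Build the 'module' triple; energy is binned by power of ten (assumed)."""
    energy_decade = math.floor(math.log10(energy_gev))
    return (particle, direction, energy_decade)


def mean_cycles_per_module(events):
    """Group per-event cycle counts by module and return the mean per module."""
    cycles_by_module = defaultdict(list)
    for event in events:
        key = module_key(event["particle"], event["direction"], event["energy_gev"])
        cycles_by_module[key].append(event["cycles"])
    return {
        key: sum(values) / len(values)
        for key, values in cycles_by_module.items()
    }


# Invented per-event records, for illustration only.
events = [
    {"particle": "e-", "direction": "barrel", "energy_gev": 50, "cycles": 1000},
    {"particle": "e-", "direction": "barrel", "energy_gev": 20, "cycles": 3000},
    {"particle": "pi-", "direction": "endcap", "energy_gev": 500, "cycles": 8000},
]
means = mean_cycles_per_module(events)
print(means)
```

Here the two electron events fall into the same module because their energies share a decade, which is the kind of grouping event-level granularity would make possible.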

Conclusions
• CMSSW 3.6.0 has been successfully monitored on a Nehalem machine
• Two proposed substitutes for results graphics display have been successfully tested for suitability: ROOT & <canvas>
• A study to apply modular monitoring of Geant4 is underway

What’s next?
• Study stall impacts on Nehalem further and validate the Cycle Accounting Analysis (possibly with David Levinthal in May)
• Implement graphics display without SDL dependency, using ROOT or HTML5’s <canvas> tag
• Run Geant4 monitoring exercises, first with simple examples and later with the FullCMS application