Performance Monitoring Update

Daniele Francesco Kruse. April 2010.

Presentation Transcript


  1. Performance Monitoring Update Daniele Francesco Kruse April 2010

  2. Summary
     • Monitoring CMSSW on Nehalem
     • SDL substitute candidates
     • Monitoring Geant4

  3. A performance model for Nehalem: Overview
     • Total_cycles: CPU_CLK_UNHALTED:THREAD_P
     • Useless_uops: (UOPS_EXECUTED:PORT015 + UOPS_EXECUTED:PORT234_CORE) - UOPS_RETIRED:ANY
     • PORT015_uop_execution_rate: UOPS_EXECUTED:PORT015 / UOPS_EXECUTED:PORT015(CMASK=1)
     • PORT234_CORE_uop_execution_rate: UOPS_EXECUTED:PORT234_CORE / UOPS_EXECUTED:PORT234_CORE(CMASK=1)
     • Uop_execution_rate: PORT015_uop_execution_rate + PORT234_CORE_uop_execution_rate
     • Useless_cycles: Useless_uops / Uop_execution_rate
     • Useful_cycles: UOPS_RETIRED:ANY / Uop_execution_rate
     • Active_cycles: Useless_cycles + Useful_cycles
     • Stalled_cycles: Total_cycles - Active_cycles

  4. Cycle Accounting Analysis for Nehalem (Intel Core i7)
     • Total cycles (application total execution time): CPU_CLK_UNHALTED:THREAD_P
       - Issuing μops: Active_cycles = Useless_cycles + Useful_cycles
         - Retiring μops (useful work): UOPS_RETIRED:ANY / Uop_execution_rate
         - Not retiring μops (useless work): Useless_uops / Uop_execution_rate
       - Not issuing μops: Total_cycles - Active_cycles
         - Stalled (no work): Total_cycles - Active_cycles
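
The cycle accounting on slides 3 and 4 is plain arithmetic on six raw counts, so it can be sketched in a few lines of Python. This is only an illustration of the formulas as written on the slides: the class name, the field names, and the sample counts are made up here, and the sketch assumes both CMASK=1 counts are non-zero.

```python
from dataclasses import dataclass


@dataclass
class NehalemCounters:
    """Raw event counts; each field mirrors an event named on the slides."""
    cpu_clk_unhalted_thread_p: float          # CPU_CLK_UNHALTED:THREAD_P
    uops_executed_port015: float              # UOPS_EXECUTED:PORT015
    uops_executed_port015_cmask1: float       # UOPS_EXECUTED:PORT015(CMASK=1)
    uops_executed_port234_core: float         # UOPS_EXECUTED:PORT234_CORE
    uops_executed_port234_core_cmask1: float  # UOPS_EXECUTED:PORT234_CORE(CMASK=1)
    uops_retired_any: float                   # UOPS_RETIRED:ANY


def cycle_accounting(c: NehalemCounters) -> dict:
    """Apply the slide formulas in order and return every derived quantity."""
    total_cycles = c.cpu_clk_unhalted_thread_p
    useless_uops = (c.uops_executed_port015
                    + c.uops_executed_port234_core) - c.uops_retired_any
    # Average uops executed per cycle in which the port group was busy.
    port015_rate = c.uops_executed_port015 / c.uops_executed_port015_cmask1
    port234_rate = (c.uops_executed_port234_core
                    / c.uops_executed_port234_core_cmask1)
    uop_execution_rate = port015_rate + port234_rate
    useless_cycles = useless_uops / uop_execution_rate
    useful_cycles = c.uops_retired_any / uop_execution_rate
    active_cycles = useless_cycles + useful_cycles
    stalled_cycles = total_cycles - active_cycles
    return {
        "total_cycles": total_cycles,
        "useless_uops": useless_uops,
        "uop_execution_rate": uop_execution_rate,
        "useless_cycles": useless_cycles,
        "useful_cycles": useful_cycles,
        "active_cycles": active_cycles,
        "stalled_cycles": stalled_cycles,
    }


# Made-up counts, chosen only to make the arithmetic easy to follow.
sample = NehalemCounters(
    cpu_clk_unhalted_thread_p=1000,
    uops_executed_port015=1200,
    uops_executed_port015_cmask1=600,
    uops_executed_port234_core=600,
    uops_executed_port234_core_cmask1=600,
    uops_retired_any=1500,
)
breakdown = cycle_accounting(sample)
for name, value in breakdown.items():
    print(f"{name}: {value}")
```

With real measurements the six counts would come from the monitoring tool rather than from hand-written values.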

  5. Nehalem: Overview of memory and cache stalls
     • Memory and cache related stalls:
       - MEM_LOAD_RETIRED:DTLB_MISS // ~10 cycles
       - MEM_LOAD_RETIRED:L1D_HIT // too small: penalty hidden
       - MEM_LOAD_RETIRED:L2_HIT // ~14.5 cycles
       - MEM_LOAD_RETIRED:L3_MISS // ~180 cycles (arch. dependent)
       - MEM_LOAD_RETIRED:L3_UNSHARED_HIT // ~42 cycles
       - MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM // ~74 cycles
       - ITLB_MISS_RETIRED // too small: penalty hidden
     • Other stalls:
       - ILD_STALL:ANY // BROKEN?!?!
       - RAT_STALLS
       - RESOURCE_STALLS
       - SEG_RENAME_STALLS
       - SQ_FULL_STALL_CYCLES
       - STORE_BLOCKS
     • And finally, what happened to store-forward stalls?
       - Loads spanning across cache lines cause almost no stalls anymore
       - Loads blocked by unknown-address stores, and loads blocked because they are not completely contained in a preceding store, still cause stalls
       - Unfortunately there is no direct event to count these situations
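
One possible use of the per-event penalties on slide 5 is a rough upper estimate of memory-related stall cycles: multiply each event count by its approximate penalty and sum the results. The slides list the penalties but do not spell out this calculation, so the sketch below is an assumption. The function name and the sample counts are invented, and events whose penalty the slide calls "hidden" are simply left out.

```python
# Approximate penalties in cycles, copied from slide 5. L1D_HIT and
# ITLB_MISS_RETIRED are omitted because the slide marks their penalty as hidden.
APPROX_PENALTY_CYCLES = {
    "MEM_LOAD_RETIRED:DTLB_MISS": 10.0,
    "MEM_LOAD_RETIRED:L2_HIT": 14.5,
    "MEM_LOAD_RETIRED:L3_MISS": 180.0,  # architecture dependent
    "MEM_LOAD_RETIRED:L3_UNSHARED_HIT": 42.0,
    "MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM": 74.0,
}


def estimate_memory_stall_cycles(event_counts):
    """Return (per-event estimate, total) as count * approximate penalty.

    Events without a listed penalty are ignored; missing events count as 0.
    Overlapping misses are not modelled, so the total is only a rough bound.
    """
    per_event = {
        event: event_counts.get(event, 0) * penalty
        for event, penalty in APPROX_PENALTY_CYCLES.items()
    }
    return per_event, sum(per_event.values())


# Hypothetical counts for illustration only.
sample_counts = {
    "MEM_LOAD_RETIRED:L2_HIT": 100,
    "MEM_LOAD_RETIRED:L3_MISS": 10,
    "MEM_LOAD_RETIRED:L1D_HIT": 5000,  # no listed penalty, so it is ignored
}
per_event, total_stall_estimate = estimate_memory_stall_cycles(sample_counts)
print(per_event)
print(total_stall_estimate)
```

Comparing this estimate with Stalled_cycles from the cycle accounting model is one way to see how much of the stall time the listed events cover.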

  6. Monitoring CMSSW on Nehalem
     • First Nehalem results with CMSSW 3.6.0 pre2 (here) (compare with Core results here)
     • Tool discovers architecture at runtime (CPUID)
     • First performance considerations on Nehalem:
       - Faster (generally 3–15% over Core cycle count)
       - Lower CPI (the number and type of instructions obviously stay the same)
       - Stalled cycles cut down to around 30% of Core values (but we need to verify coverage accurately)
       - Same percentage of mispredicted branches
       - More useful cycles seem to be required to do the same job
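
Slide 6 says the tool discovers the architecture at runtime through CPUID but gives no detail. The sketch below shows one way such a check could look, assuming the family and model numbers are read from Linux's /proc/cpuinfo instead of issuing the CPUID instruction directly. The function names are hypothetical, the model lists are deliberately not exhaustive, and this is not the tool's actual code.

```python
# Family 6 model numbers for a few well-known parts (not an exhaustive list).
NEHALEM_MODELS = {0x1A, 0x1E, 0x2E}
CORE_MODELS = {0x0F, 0x17}


def classify_microarchitecture(family: int, model: int) -> str:
    """Map a CPUID family/model pair to a coarse microarchitecture label."""
    if family != 6:
        return "unknown"
    if model in NEHALEM_MODELS:
        return "Nehalem"
    if model in CORE_MODELS:
        return "Core"
    return "unknown"


def parse_family_model(cpuinfo_text: str):
    """Extract (family, model) of the first CPU from /proc/cpuinfo-style text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # "model name" is a different key and is skipped by the exact match.
        if key in ("cpu family", "model") and key not in fields:
            fields[key] = int(value.strip())
    return fields["cpu family"], fields["model"]


# Sample text in the /proc/cpuinfo layout; on a real machine one would read
# the file itself, e.g. open("/proc/cpuinfo").read().
SAMPLE_CPUINFO = (
    "processor\t: 0\n"
    "vendor_id\t: GenuineIntel\n"
    "cpu family\t: 6\n"
    "model\t\t: 26\n"
    "model name\t: Intel(R) Xeon(R) CPU\n"
)
family, model = parse_family_model(SAMPLE_CPUINFO)
detected = classify_microarchitecture(family, model)
print(family, model, detected)
```

The label can then select which event list and which cycle accounting formulas to apply.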

  7. Structure and libraries
     [Diagram: tool workflow from Start to End. Stages shown: Configuration, Program Run, Performance Data Taking, Performance Data Output, Performance Data Analysis, Version with graphs Analysis, Browsable HTML results. Libraries shown: libpfm, zlib, libpng, libSDL, libSDL_ttf.]

  8. SDL substitute candidates
     • libSDL_ttf is not part of the standard SLC5 installation
     • SDL substitute candidates (both successfully tested):
       - HTML5’s <canvas> tag:
         - Supported by Firefox, Opera, Safari & Chrome
         - Text drawing supported only by Firefox (Gecko), Safari & Chrome
         - Internet Explorer also supports it through Mozilla’s plugin
       - ROOT:
         - A little heavier and more difficult to adapt
         - Works the same way as the current SDL implementation (png output)

  9. Monitoring Geant4
     • Overall and symbol analysis already possible
       - pfmon command line tool & FullCMS simulation example
     • Modular analysis through User Actions
       - Probably RunAction and EventAction combined
     • Type of particle, direction and energy determine complexity and type of event
       - This triple may be used to describe the “module” of the analysis
     • Proposal: event-level granularity
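
To make the "triple as module" idea on slide 9 concrete, the sketch below groups per-event cycle counts by (particle, direction, energy bin). Everything in it is an assumption made for illustration: the record layout, the "barrel"/"endcap" direction labels, the binning of energy by decade, and the sample values. The slides do not define any of these, and no Geant4 API is used.

```python
import math
from collections import defaultdict


def module_key(particle: str, direction: str, energy_gev: float):
    """Build the (particle, direction, energy bin) triple for one event.

    Energy is binned by decade here (1-10 GeV -> 0, 10-100 GeV -> 1, ...);
    the binning is a placeholder choice, not something the slides specify.
    """
    energy_decade = math.floor(math.log10(energy_gev))
    return (particle, direction, energy_decade)


def aggregate_by_module(events):
    """Sum event counts and cycles for every module triple."""
    totals = defaultdict(lambda: {"events": 0, "cycles": 0})
    for ev in events:
        key = module_key(ev["particle"], ev["direction"], ev["energy_gev"])
        totals[key]["events"] += 1
        totals[key]["cycles"] += ev["cycles"]
    return dict(totals)


# Hypothetical per-event records, as an event-level hook might produce them.
sample_events = [
    {"particle": "e-", "direction": "barrel", "energy_gev": 5.0, "cycles": 1000},
    {"particle": "e-", "direction": "barrel", "energy_gev": 7.0, "cycles": 1200},
    {"particle": "pi-", "direction": "endcap", "energy_gev": 50.0, "cycles": 9000},
]
module_totals = aggregate_by_module(sample_events)
for key, stats in sorted(module_totals.items()):
    print(key, stats)
```

In a real setup the counter would be read at the beginning and end of each event, which is what event-level granularity would allow.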

  10. Conclusions
     • CMSSW 3.6.0 has been successfully monitored on a Nehalem machine
     • Two proposed substitutes for results graphics display have been successfully tested for suitability: ROOT & <canvas>
     • A study to apply modular monitoring of Geant4 is underway

  11. What’s next?
     • Further study stall impacts on Nehalem and validate the Cycle Accounting Analysis (possibly with David Levinthal in May)
     • Implement graphics display without the SDL dependency, using ROOT or HTML5’s <canvas> tag
     • Run Geant4 monitoring exercises with simple examples first and later with the FullCMS application

  12. Thank you. Questions?
