FNAL Geant4 Performance Group Issues and Progress. Daniel Elvira, for M. Fischler, J. Kowalkowski, M. Paterno

G4 Performance Work @FNAL
• The LHC experiments are the main G4 customers within HEP, and FNAL (including the FNAL/G4 team) is involved in CMS.
• It therefore makes sense for the FNAL/G4 performance group to use primarily the CMS simulation application to profile Geant4 code.
• For the work presented here: G4.8.1p01, QGSP/QGSP_Bertini (CMSSW180), Z’(dijets).
• The aims are to identify places for improvement, design and implement improvements, and feed those improvements back to Geant4.
• We use a tool of our own design, SimpleProfiler, to collect detailed call stack samples. We also use our own tools to manage, analyze, and present the data. Work is in progress to make these tools available to the public.

CHIPS performance analysis
• CMS reported that a large number of small memory allocations was causing memory fragmentation, which was a concern for them.
• An average t-tbar event in the CMS simulation created and destroyed about 1.6 million G4QHadron objects.
• G4QHadron constructors took >1% of program time, which is odd for a constructor.
• Constructors of the derived class G4QNucleus took ~2% of program time.

CHIPS code modification & result
• The problem was one data member:
    std::vector<G4double>* Tb;
• There was no need for this vector to be on the heap, so the data member was changed to:
    std::vector<G4double> Tb;
• The result is a 1.5% speed improvement in the whole CMS simulation.
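As a rough illustration of the change described above, the following sketch contrasts a heap-allocated vector member with a direct member. The class names `HadronHeapMember` and `HadronValueMember`, the `add` and `size` helpers, and the `G4double` alias are assumptions made for this example; this is not the actual G4QHadron code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real G4QHadron.
using G4double = double;  // assumption: mirrors Geant4's typedef

// Before: the vector object itself lives on the heap, so every construction
// and destruction pays for an extra new/delete pair, even if Tb stays empty.
class HadronHeapMember {
public:
    HadronHeapMember() : Tb(new std::vector<G4double>()) {}
    HadronHeapMember(const HadronHeapMember& other)
        : Tb(new std::vector<G4double>(*other.Tb)) {}
    HadronHeapMember& operator=(const HadronHeapMember& other) {
        if (this != &other) *Tb = *other.Tb;
        return *this;
    }
    ~HadronHeapMember() { delete Tb; }

    void add(G4double v) { Tb->push_back(v); }
    std::size_t size() const { return Tb->size(); }

private:
    std::vector<G4double>* Tb;
};

// After: the vector is a direct member. A default-constructed vector
// typically allocates nothing, and the compiler-generated copy and
// destructor are already correct.
class HadronValueMember {
public:
    void add(G4double v) { Tb.push_back(v); }
    std::size_t size() const { return Tb.size(); }

private:
    std::vector<G4double> Tb;
};
```

With millions of objects created and destroyed per event, removing one allocation per object is the kind of saving that can show up at the whole-program level, and it also reduces the number of small heap blocks that contribute to fragmentation.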

Bertini Performance Analysis
2% (median) of program time in the G4ElementaryParticleCollider constructor.
Partial profiling result for 100 jobs of 100 events each. Overall percentage of time taken in functions (no children).
These large spreads will be discussed later.

Bertini code modification
• We reorganized G4ElementaryParticleCollider.
• Removed 20 of 21 data members that did not need to be part of its state (they were equivalent for all instances of the class).
• Much of the redundant code (across the 20 data members) was replaced by a few templates.
• The result was a reduction in source code bulk of about 40 printed pages.
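The idea of moving instance-independent data out of an object's state can be sketched as follows. Everything here is hypothetical: the class names, the table size, and the table contents are placeholders, not taken from G4ElementaryParticleCollider, and the sketch uses C++11 features that postdate the Geant4 version discussed.

```cpp
#include <array>
#include <cstddef>

// Hypothetical illustration only; names and values are not from Geant4.

// Before: each instance rebuilds and stores a table whose contents are
// identical for every instance of the class.
class ColliderPerInstance {
public:
    ColliderPerInstance() {
        for (std::size_t i = 0; i < table.size(); ++i) {
            table[i] = 0.5 * static_cast<double>(i);
        }
    }
    double lookup(std::size_t i) const { return table[i]; }

private:
    std::array<double, 8> table;
};

// After: a single function template replaces near-duplicate table code,
// and each table is built once and shared by all instances.
template <std::size_t N>
const std::array<double, N>& sharedTable() {
    static const std::array<double, N> table = [] {
        std::array<double, N> t{};
        for (std::size_t i = 0; i < N; ++i) {
            t[i] = 0.5 * static_cast<double>(i);
        }
        return t;
    }();
    return table;
}

class ColliderShared {
public:
    double lookup(std::size_t i) const { return sharedTable<8>()[i]; }
};
```

In this sketch the constructor of `ColliderShared` does no work at all, which is the effect one would want if the constructor of a frequently created class dominates a profile.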

Bertini performance improvement
• Performance increase of ~4%.
• Note that the increase in speed is greater than would be expected from the initial profiling.
• The reduction in object allocation and deletion also benefits other classes.
• Note: we have a large enough data sample to characterize the differences accurately.

Irregularity in random number use
• Recall that the Bertini performance analysis showed an unusually large spread in function speed across runs of the same job. We took this to indicate that there might be a reproducibility problem, and decided to investigate further.
• It looks as though there are two distinct groups of measurements.

First observation of irreproducibility
• 91 runs, each represented by a line on the plot.
• Each line shows the time taken per event.
• We expect all lines to form one “cable”.
• We observed that after event 38 the jobs separate into two branches, each of which appears to process the events differently.
• What is the cause of this?
• The effect was highly reproducible for a long period of time. Unfortunately, it then disappeared.
• Apparently, there were changes to the cluster.

Architecture dependence in output
• Excerpt from a 4 GB Geant4 log file (at line 1,093,802) showing a subtle difference in physics output on different platforms.
• Discovered while trying to understand the irreproducibility problem.

AMD:
507 -598 -482 -2.75e+03 0 0.0962 0.282 2.92e+03 TECBackDisk KaonPlusInelastic
:----- List of 2ndaries - #SpawnInStep= 10
: -598 -482 -2.75e+03 4.39e+03 pi+
: -598 -482 -2.75e+03 362 pi-
: -598 -482 -2.75e+03 93.9 pi+
: -598 -482 -2.75e+03 1.43e+03 pi-
: -598 -482 -2.75e+03 654 proton
: -598 -482 -2.75e+03 3.78e+03 pi-
: -598 -482 -2.75e+03 3.14e+03 kaon+
: -598 -482 -2.75e+03 5.68 gamma
: -598 -482 -2.75e+03 2 gamma
: -598 -482 -2.75e+03 0.835 C11[0.0]

INTEL:
507 -598 -482 -2.75e+03 0 0.0962 0.282 2.92e+03 TECBackDisk KaonPlusInelastic
:----- List of 2ndaries - #SpawnInStep= 9
: -598 -482 -2.75e+03 4.39e+03 pi+
: -598 -482 -2.75e+03 362 pi-
: -598 -482 -2.75e+03 93.9 pi+
: -598 -482 -2.75e+03 1.43e+03 pi-
: -598 -482 -2.75e+03 654 proton
: -598 -482 -2.75e+03 3.78e+03 pi-
: -598 -482 -2.75e+03 3.14e+03 kaon+
: -598 -482 -2.75e+03 7.68 gamma
: -598 -482 -2.75e+03 0.854 C11[0.0]

Random number usage issue
• We started looking at the random number generators as a possible cause of the physics differences.
• The random number streams are identical on AMD and Intel architectures.
• A different number of random numbers is drawn from the generator depending on the architecture; this is already observable on the first event.
• This result is completely reproducible:
    AMD = 59,230,872 random numbers
    Intel = 59,511,723 random numbers (difference of 280,851)
• We are still investigating the cause.
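One way such draw counts could be collected is to wrap the generator in a thin counting layer. The sketch below is a minimal example under stated assumptions: `CountingEngine`, `flat`, `count`, and `sampleBelow` are hypothetical names, and the standard library's `std::mt19937_64` stands in for the engine that Geant4 actually uses, whose API is not reproduced here. The accept/reject loop only shows why a draw count can vary at all; it is not a claim about the cause of the difference reported above.

```cpp
#include <cstdint>
#include <random>

// Hypothetical counting wrapper for illustration; not a Geant4 or CLHEP API.
class CountingEngine {
public:
    explicit CountingEngine(std::uint64_t seed) : engine(seed), draws(0) {}

    // Return a uniform double in [0, 1) built from the top 53 bits of one
    // 64-bit engine output, and count the draw.
    double flat() {
        ++draws;
        return static_cast<double>(engine() >> 11) * (1.0 / 9007199254740992.0);
    }

    // Number of values drawn so far.
    std::uint64_t count() const { return draws; }

private:
    std::mt19937_64 engine;
    std::uint64_t draws;
};

// Toy accept/reject loop: the number of draws depends on the outcome of a
// floating-point comparison, so two runs that evaluate the comparison
// differently would consume different amounts of the same stream.
double sampleBelow(CountingEngine& rng, double threshold) {
    double x = rng.flat();
    while (x >= threshold) {
        x = rng.flat();
    }
    return x;
}
```

Comparing `count()` after the first event on two platforms, with the same seed, would reproduce the kind of measurement quoted above and help locate the first point at which the two runs diverge.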

Plans for the future
What we had in mind for the short term (rest of 2008):
1- Perform a major design review of the CHIPS library.
2- Investigate the Intel vs. AMD dependence of the simulation output.
3- Return to profiling, resuming with the newest version of Geant4.
4- Continue work with two interns from Northern Illinois University, who are helping to improve our data collation, analysis, and display tools; the tools will be made public once they are sufficiently robust.
But… Fermilab management has temporarily pulled J. Kowalkowski and M. Paterno from G4 efforts (for at least 2-3 months, effective ~Sep 15th). M. Fischler + 1 FTE (computer scientist) will undertake (1) shortly, with minimal guidance from M. Paterno.

Plans for the future
In the longer term (2009):
• M. Fischler, J. Kowalkowski, and M. Paterno will resume (2) and (3).
• The FNAL/G4 team is interested in efforts to support multi-core/multi-threaded programming: making code thread safe and creating code that scales well with multiple CPUs. Manpower is an issue.
During this workshop we should discuss a wish list for 2009 with Gabriele/John. A well-defined long-term program would help allocate FNAL resources to G4.
