Hercules Development Status

Presentation Transcript


  1. Hercules Development Status John Urbanic ASTA Presentation July 16th, 2009

  2. A Brief Overview Of Our Experiences and Expectations
  • What Science are we trying to accomplish?
  • What does our code do?
  • How well does it do it?
  • Scalability Issues
  • Debugging Issues
  • What are we shooting for next?
  • Credits and Partners

  3. Our Special Thing: Wavelength-Adaptive Tetrahedral Mesh
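The sizing idea behind a wavelength-adaptive mesh, stated here as the generic rule of thumb rather than the exact Hercules criterion, is that the local element edge length h is tied to the local shear-wave velocity V_s, the highest resolved frequency f_max, and a chosen number of points per wavelength p:

      % Generic wavelength-adaptive sizing rule; the constant p and the exact
      % form used inside Hercules may differ.
      h \;\le\; \frac{V_s}{p \, f_{\max}}

Soft, slow sediments therefore get many small elements while stiff, fast rock gets few large ones, which is the source of the element-size disparities mentioned on the next slide.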

  4. FE Approach
  • Allows us to scale well despite disparities in element sizes
  • Forces us to deal with complex domain decompositions
  • If done right, we end up with one of the best ingredients for scalability: almost completely nearest-neighbor communication
  • Processing each element requires some pointer chasing: we cannot lay out data contiguously in memory (see the sketch below)
  • Makes cache use complicated
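To make the pointer-chasing point concrete, here is a minimal, hypothetical element loop over an unstructured tetrahedral mesh. It is not the actual Hercules data layout; the names (elem_t, accumulate_forces) are invented for illustration. The indirection through the element-to-node connectivity is what prevents contiguous memory access and complicates cache use.

      /* Minimal sketch only -- not the Hercules data structures.  The
       * indirect access through elems[e].node[i] is the "pointer chasing"
       * mentioned above: node data is gathered in mesh-dependent order,
       * so accesses are scattered rather than contiguous. */
      #include <stddef.h>
      #include <stdint.h>

      typedef struct {
          int32_t node[4];          /* global indices of the 4 tet vertices */
      } elem_t;

      void accumulate_forces(size_t n_elems, const elem_t *elems,
                             const double *disp, double *force)
      {
          for (size_t e = 0; e < n_elems; e++) {
              for (int i = 0; i < 4; i++) {
                  int32_t n = elems[e].node[i];   /* scattered, indirect access */
                  force[n] += disp[n];            /* stand-in for the real element update */
              }
          }
      }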

  5. ShakeOut Results and Verification

  6. Scaling Hotspots
  • The solver portion is almost perfectly scalable due to nearest-neighbor communication
  • The exception is the IO code in that loop; no machine has completely scalable IO hardware
  • Stations: easy, a few dozen points
  • Planes: could be one small area, or a bunch stacked to make volumes; resolution is determined by user requirements, not generally by the simulation resolution
  • Volumes: no complete volume dumps for now

  7. Scaling IO
  • To make IO as scalable as possible on any existing system, you must accommodate the limitations and configuration of the output devices. We have done this by (see the sketch below):
  • Spreading writes out from many app nodes
  • MPI'ing that data to many disk-writing nodes
  • Doing that in variable packet sizes
  • Limiting the number of outstanding packets
  • All of these are easily configurable.
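As an illustration of the aggregation idea described above (not the actual Hercules IO code; PACKET_BYTES, MAX_OUTSTANDING, and send_output are invented names standing in for the configurable parameters mentioned on the slide), a compute rank might ship its output to a writer rank in bounded, fixed-size packets like this:

      /* Sketch of the write-aggregation scheme described above -- NOT the
       * actual Hercules IO code.  A compute rank ships output to a writer
       * rank in configurable packet sizes, with a cap on how many
       * non-blocking sends may be in flight at once. */
      #include <mpi.h>
      #include <stddef.h>

      #define PACKET_BYTES    (1 << 20)   /* assumed tunable packet size      */
      #define MAX_OUTSTANDING 4           /* assumed cap on in-flight packets */

      void send_output(const char *buf, size_t nbytes,
                       int writer_rank, MPI_Comm comm)
      {
          MPI_Request req[MAX_OUTSTANDING];
          int         active = 0;

          for (size_t off = 0; off < nbytes; off += PACKET_BYTES) {
              int len = (int)((nbytes - off < PACKET_BYTES) ? nbytes - off
                                                            : PACKET_BYTES);
              if (active == MAX_OUTSTANDING) {        /* throttle: wait for one */
                  int idx;
                  MPI_Waitany(active, req, &idx, MPI_STATUS_IGNORE);
                  req[idx] = req[--active];           /* keep active requests packed */
              }
              MPI_Isend(buf + off, len, MPI_BYTE, writer_rank,
                        /* tag = */ 0, comm, &req[active++]);
          }
          MPI_Waitall(active, req, MPI_STATUSES_IGNORE);
      }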

  8. Scaling Issues
  • IO system: reliability issues have caused us to scale back on IO for these runs
  • Not doing volumes
  • Limited number of planes (we would like thousands)
  • We are also not going to force through any epic (8-hour, 64K-PE) runs just yet
  • Network: looks great thus far, but is seemingly affected by Lustre accesses from other jobs

  9. Hercules Scaling [figure: weak-scaling and strong-scaling plots, seconds vs. processors] Data gathered using Kraken at NICS for a ShakeOut-type problem

  10. Debugging Issues
  • Lustre misdirection. If you see this:
      [2048] MPICH has run out of unexpected buffer space.
      Try increasing the value of env var MPICH_UNEX_BUFFER_SIZE (cur value is 62914560),
      and/or reducing the size of MPICH_MAX_SHORT_MSG_SIZE (cur value is 50000).
      aborting job:
      out of unexpected buffer space
      [NID 4867]Apid 949492: initiated application termination
      Application 949492 exit codes: 255
      Application 949492 exit signals: Killed
  • What would you try next?

  11. Debugging Issues
  • Wrong answer: start playing with MPICH_UNEX_BUFFER_SIZE or MPICH_MAX_SHORT_MSG_SIZE. These will get you nowhere, but will confuse you greatly when they seem to do something due to job load variations.
  • Answer: rerun when the machine is less loaded (job count seems to be the metric here) and the error will likely disappear.

  12. Debugging Issues
  • To be fair, you can easily (and likely will) generate legitimate MPI buffer issues as you scale up. A few helpful hints (and one general mitigation sketched below):
  • Do not take Cray "suggestion" messages too seriously, as per the above.
  • The primary documentation on the growing list of effective environment variables is "man mpi".
  • That documentation is not entirely correct or self-consistent, so if you encounter seemingly contradictory behavior, just note it and move on.
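One general way to reduce pressure on MPICH's unexpected-message buffer, sketched here as a generic technique rather than anything Hercules specifically does, is to pre-post receives before the matching sends are issued, so incoming messages land directly in user buffers instead of the unexpected queue:

      /* Generic technique, not Hercules-specific: pre-posting receives lets
       * incoming messages match immediately instead of piling up in the
       * unexpected-message buffer that the error above complains about. */
      #include <mpi.h>
      #include <stdlib.h>

      void exchange_with_neighbors(double *recvbuf, const double *sendbuf, int count,
                                   const int *neighbors, int n_neighbors, MPI_Comm comm)
      {
          MPI_Request *req = malloc(2 * n_neighbors * sizeof *req);

          for (int i = 0; i < n_neighbors; i++)      /* post all receives first */
              MPI_Irecv(recvbuf + (size_t)i * count, count, MPI_DOUBLE,
                        neighbors[i], 0, comm, &req[i]);

          for (int i = 0; i < n_neighbors; i++)      /* then issue the sends */
              MPI_Isend(sendbuf + (size_t)i * count, count, MPI_DOUBLE,
                        neighbors[i], 0, comm, &req[n_neighbors + i]);

          MPI_Waitall(2 * n_neighbors, req, MPI_STATUSES_IGNORE);
          free(req);
      }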

  13. Totalview
  • Has been very useful to us for the past several years
  • Used at scales of 4-256 cores
  • We have had an outstanding offer to try debugging with 16K cores that we have not yet required, but may very soon
  • This requires a unique license on Kraken, but the Totalview people are willing to work around that for us, and probably for you

  14. Next Steps
  • Fix large source issues
  • Scale to 64K and 128K cores on Kraken and Intrepid
  • Implement non-linear soil response

  15. Regional Nonlinear Soil Response. Study case: the Euroseistest in the Volvi area in Thessaloniki, Greece. von Mises and Drucker-Prager material models incorporated in Hercules with an explicit solution method.
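For readers unfamiliar with the two material models named here, their yield criteria in standard textbook form are shown below (the exact parameterization used inside Hercules may differ); I_1 is the first stress invariant, J_2 the second deviatoric stress invariant, and sigma_y, alpha, k are material parameters.

      % Standard textbook forms, not necessarily the exact Hercules parameterization.
      f_{\mathrm{von\,Mises}}(\boldsymbol{\sigma}) = \sqrt{3 J_2} - \sigma_y \le 0
      \qquad
      f_{\mathrm{Drucker\text{-}Prager}}(\boldsymbol{\sigma}) = \sqrt{J_2} + \alpha I_1 - k \le 0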

  16. Results: Elastic vs. Elastoplastic Synthetics; Stress-Strain Relationships in Time

  17. Non-Linear Implementation
  • The nonlinear terms appear as conditional exceptions in the solver kernel (see the sketch below)
  • This would be a very expensive change if our elements were well vectorized
  • They are not, due to the irregular nature of the mesh, so we don't pay much of a performance price
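A minimal, hypothetical illustration of the "conditional exception" structure, reduced here to a 1D elastic-perfectly-plastic update rather than the actual von Mises / Drucker-Prager return used in Hercules; the point is that the plastic branch is only taken for the subset of elements whose trial stress exceeds yield:

      /* Illustration only -- not the Hercules kernel, and simplified to 1D
       * perfect plasticity.  The conditional branch is the "exception":
       * most elements stay elastic and skip the extra work. */
      static void update_element_stress(double *stress, double strain_inc,
                                        double shear_mod, double yield_stress)
      {
          double trial = *stress + shear_mod * strain_inc;   /* elastic predictor */

          if (trial > yield_stress)                          /* conditional exception */
              trial = yield_stress;                          /* return to yield surface */
          else if (trial < -yield_stress)
              trial = -yield_stress;

          *stress = trial;
      }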

  18. Credits. The whole SCEC group has been valuable on many fronts, but these immediate results are attributable to Jacobo Bielak, Julio Lopez, Leonardo Ramirez Guzman, Haydar Karaoglu, and doubly so to Ricardo Taborda (for his pretty graphics as well as his expertise).
