Computational Needs Going Forward Quentin F. Stout
Some Differences
• We have changed
  • Initially most computation was for code development
  • Models: 1D, 2D, 3D
  • Did not need all resources provided
  • Now in transition to production UQ runs
  • Will have extensive, continual usage
• Within PSAAP, we are somewhat unique
  • Appear to be more involved with large development runs (other than DATs), thus encounter more problems:
    • scheduling
    • cluster performance
    • bandwidth
    • I/O
    • …
Access and Performance
• Scheduling is overly optimized for small jobs
• DATs are suitable for production use of large jobs, but poor for code development
• Other than DATs, on Lobo (LANL) large jobs (1000 cores, 16 hours) can only be iterated twice per week
  • ≈ 75-hour wait in queue
• System errors are more likely to appear on large jobs
  • Stressed the I/O system and encountered serious problems
  • Uncovered node performance problems on Hera (LLNL)
  • Overall, Lobo has many more problems
• This situation forced us to use local resources to produce the 3D results we have presented
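The twice-per-week iteration rate follows directly from queue wait plus run time; a quick sanity check using the slide's numbers (the 168-hour calendar week is the only assumption):

```python
# Turnaround for one large job on Lobo: queue wait plus run time.
queue_wait_hr = 75   # observed wait in queue (from the slide)
run_hr = 16          # job length (from the slide)
turnaround_hr = queue_wait_hr + run_hr   # 91 hours per iteration

hours_per_week = 7 * 24                  # 168 calendar hours
iterations_per_week = hours_per_week / turnaround_hr
print(round(iterations_per_week, 2))     # ≈ 1.85, i.e. roughly twice per week
```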
Application Mean Time to Interrupt
John T. Daly, "Performance Challenges for Extreme Scale Computing," 2007
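The MTTI matters because it sets the checkpoint schedule. A first-order rule associated with Young and the cited Daly work puts the optimal checkpoint interval at √(2δM), where δ is the checkpoint write time and M is the application MTTI; the sketch below uses purely illustrative numbers, not figures from this talk:

```python
# First-order Young/Daly rule: tau_opt = sqrt(2 * delta * M), where
# delta = time to write one checkpoint and M = application MTTI.
def optimal_checkpoint_interval(delta_hr, mtti_hr):
    return (2 * delta_hr * mtti_hr) ** 0.5

# Illustrative only: a 0.1-hour checkpoint against a 24-hour MTTI.
tau = optimal_checkpoint_interval(0.1, 24.0)
print(round(tau, 2))  # ≈ 2.19 hours between checkpoints
```

As MTTI shrinks on larger machines, the interval shrinks only as its square root, so checkpoint overhead grows quickly.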
Data Pathway Difficulties
• A major impediment
• ≈ 1 Mb/sec from Hera (LLNL)
  • timeout: 12 hr
• ≈ 40 Mb/sec from Lobo (LANL)
  • timeout: 4 hr
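Combined with the timeouts, these rates cap how much data one session can move; a small check in the slide's own "Mb" units (whether that is megabits or megabytes, the relative gap between the two machines is the same):

```python
# Upper bound on data moved in one session before the timeout hits.
def max_transfer_mb(rate_mb_per_s, timeout_hr):
    return rate_mb_per_s * timeout_hr * 3600  # seconds per hour

hera = max_transfer_mb(1, 12)   # Hera: 1 Mb/s for at most 12 hours
lobo = max_transfer_mb(40, 4)   # Lobo: 40 Mb/s for at most 4 hours
print(hera, lobo)               # 43200 Mb vs 576000 Mb per session
```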
Production CRASH UQ Runs
• UQ will guide needs; current estimates:
  • 2D multigroup: > 1000 runs / year
    • each: 256 cores × 16 hr
  • 3D gray: > 100 runs / year
    • each: 1000 cores × 24 hr
• Additional runs, such as sensitivity studies
• 1000 × 2D-mg + 100 × 3D-g ≈ Hera + Lobo replacement
  • If the Lobo replacement allocation is as expected
  • If everything works, and we get timely access
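The annual core-hour budget implied by these estimates is easy to total up from the slide's figures:

```python
# Annual CRASH UQ core-hour budget: runs * cores * hours per run.
mg2d   = 1000 * 256 * 16    # 2D multigroup: 4,096,000 core-hours
gray3d = 100 * 1000 * 24    # 3D gray:       2,400,000 core-hours
print(mg2d, gray3d, mg2d + gray3d)  # total ≈ 6.5M core-hours / year
```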
Production PDT UQ Runs
• Based on UQ needs and scaling studies
  • 2D: 15 weekend DATs / year
    • each: 2048 cores × 16 hr
  • 3D: 10 weekend DATs / year
    • each: 8192 cores × 60 hr
• 15 × 2D-PDT + 10 × 3D-PDT ≈ Hera + Lobo replacement
  • If everything works …
• In addition to the production runs, UM + TAMU need ≈ 1 M core-hours for code development, scaling, etc.
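The same core-hour accounting for PDT, including the development allowance stated on the slide:

```python
# Annual PDT core-hour budget: DATs * cores * hours per DAT.
pdt2d = 15 * 2048 * 16    # 2D weekend DATs:   491,520 core-hours
pdt3d = 10 * 8192 * 60    # 3D weekend DATs: 4,915,200 core-hours
dev   = 1_000_000         # UM + TAMU development, scaling, etc.
print(pdt2d, pdt3d, pdt2d + pdt3d + dev)  # total ≈ 6.4M core-hours / year
```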
3D Multigroup
• Well-resolved 3D multigroup would be quite useful
  • However, ≈ 1 month on 1000 cores
• Feasible on BlueGene?
  • Perhaps ≈ 2 days on 32K cores
• Scaling is a serious concern
  • Strong scaling on Hera is poor
  • Reasonable weak scaling on Pleiades
  • Requires tuning for BG
  • Have run MHD on small BG
• Initially investigate with modest effort
  • If it appears feasible, we will then decide how to proceed
  • Not on the critical path
3D PDT
• Moderately resolved 3D PDT is even more daunting
  • 3D at 512 × 512 × 1024 ≈ 2.5M core-hours
• Only hope is BG, or a worthy successor
  • 2.5M ≈ 32K cores × 80 hr
• Again, scaling is a concern
  • BG scaling studies start next month
• If it scales as hoped, this will be the largest Alliance use of BG
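A quick check that the stated job shape matches the 2.5M core-hour estimate:

```python
# One 3D PDT run on BG: 32K cores for 80 hours.
core_hours = 32 * 1024 * 80
print(core_hours)  # 2,621,440, i.e. ≈ 2.5M as stated
```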
Computational Challenges of the Coming Year
• Continue transition to production UQ; continue code development, efficiency improvements, …
• Scaling CRASH and PDT to BG (if it will remain available)
• Challenges of:
  • Obtaining sufficient allocation
  • Improved end-to-end performance
  • Scheduling
  • Bandwidth
  • I/O
  • …