Computational Needs Going Forward Quentin F. Stout
Some Differences
• We have changed
  • Initially most computation was for code development
  • Models: 1D, 2D, 3D
  • Did not need all resources provided
  • Now in transition to production UQ runs
  • Will have extensive, continual usage
• Within PSAAP, we are somewhat unique
  • Appear to be more involved with large development runs (other than DATs), thus encounter more problems:
    • scheduling
    • cluster performance
    • bandwidth
    • I/O
    • …
Access and Performance
• Scheduling is overly optimized for small jobs
• DATs are suitable for production use of large jobs, but poor for code development
• Other than DATs, on Lobo (LANL) large jobs (1000 cores, 16 hours) can only be iterated twice per week
  • ≈ 75-hour wait in queue
• System errors are more likely to appear on large jobs
  • Stressed the I/O system and encountered serious problems
  • Uncovered node performance problems on Hera (LLNL)
  • Overall, Lobo has many more problems
• This situation forced us to use local resources to produce the 3D results we have presented
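The twice-per-week iteration rate follows directly from queue wait plus run time; a quick sanity check using the slide's numbers (the 168-hour calendar week is the only assumption):

```python
# Turnaround for one large job on Lobo: queue wait plus run time.
queue_wait_hr = 75   # observed wait in queue (from the slide)
run_hr = 16          # job length (from the slide)
turnaround_hr = queue_wait_hr + run_hr   # 91 hours per iteration

hours_per_week = 7 * 24                  # 168 calendar hours
iterations_per_week = hours_per_week / turnaround_hr
print(round(iterations_per_week, 2))     # ≈ 1.85, i.e. roughly twice per week
```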
Application Mean Time to Interrupt
John T. Daly, "Performance Challenges for Extreme Scale Computing," 2007
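The MTTI matters because it sets the checkpoint schedule. A first-order rule associated with Young and the cited Daly work puts the optimal checkpoint interval at √(2δM), where δ is the checkpoint write time and M is the application MTTI; the sketch below uses purely illustrative numbers, not figures from this talk:

```python
# First-order Young/Daly rule: tau_opt = sqrt(2 * delta * M), where
# delta = time to write one checkpoint and M = application MTTI.
def optimal_checkpoint_interval(delta_hr, mtti_hr):
    return (2 * delta_hr * mtti_hr) ** 0.5

# Illustrative only: a 0.1-hour checkpoint against a 24-hour MTTI.
tau = optimal_checkpoint_interval(0.1, 24.0)
print(round(tau, 2))  # ≈ 2.19 hours between checkpoints
```

As MTTI shrinks on larger machines, the interval shrinks only as its square root, so checkpoint overhead grows quickly.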
Data Pathway Difficulties
• A major impediment
• ≈ 1 Mb/sec from Hera (LLNL)
  • timeout: 12 hr
• ≈ 40 Mb/sec from Lobo (LANL)
  • timeout: 4 hr
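Combined with the timeouts, these rates cap how much data one session can move; a small check in the slide's own "Mb" units (whether that is megabits or megabytes, the relative gap between the two machines is the same):

```python
# Upper bound on data moved in one session before the timeout hits.
def max_transfer_mb(rate_mb_per_s, timeout_hr):
    return rate_mb_per_s * timeout_hr * 3600  # seconds per hour

hera = max_transfer_mb(1, 12)   # Hera: 1 Mb/s for at most 12 hours
lobo = max_transfer_mb(40, 4)   # Lobo: 40 Mb/s for at most 4 hours
print(hera, lobo)               # 43200 Mb vs 576000 Mb per session
```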
Production CRASH UQ Runs
• UQ will guide needs; current estimates:
  • 2D multigroup: > 1000 runs / year
    • each: 256 cores × 16 hr
  • 3D gray: > 100 runs / year
    • each: 1000 cores × 24 hr
• Additional runs, such as sensitivity studies
• 1000 × 2D-mg + 100 × 3D-g ≈ Hera + Lobo replacement
  • If the Lobo replacement allocation is as expected
  • If everything works, and we get timely access
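The annual core-hour budget implied by these estimates is easy to total up from the slide's figures:

```python
# Annual CRASH UQ core-hour budget: runs * cores * hours per run.
mg2d   = 1000 * 256 * 16    # 2D multigroup: 4,096,000 core-hours
gray3d = 100 * 1000 * 24    # 3D gray:       2,400,000 core-hours
print(mg2d, gray3d, mg2d + gray3d)  # total ≈ 6.5M core-hours / year
```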
Production PDT UQ Runs
• Based on UQ needs and scaling studies
  • 2D: 15 weekend DATs / year
    • each: 2048 cores × 16 hr
  • 3D: 10 weekend DATs / year
    • each: 8192 cores × 60 hr
• 15 × 2D-PDT + 10 × 3D-PDT ≈ Hera + Lobo replacement
  • If everything works …
• In addition to the production runs, UM + TAMU need ≈ 1 M core-hours for code development, scaling, etc.
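The same core-hour accounting for PDT, including the development allowance stated on the slide:

```python
# Annual PDT core-hour budget: DATs * cores * hours per DAT.
pdt2d = 15 * 2048 * 16    # 2D weekend DATs:   491,520 core-hours
pdt3d = 10 * 8192 * 60    # 3D weekend DATs: 4,915,200 core-hours
dev   = 1_000_000         # UM + TAMU development, scaling, etc.
print(pdt2d, pdt3d, pdt2d + pdt3d + dev)  # total ≈ 6.4M core-hours / year
```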
3D Multigroup
• Well-resolved 3D multigroup would be quite useful
  • However, ≈ 1 month on 1000 cores
• Feasible on BlueGene?
  • Perhaps ≈ 2 days on 32K cores
• Scaling is a serious concern
  • Strong scaling on Hera is poor
  • Reasonable weak scaling on Pleiades
  • Requires tuning for BG
  • Have run MHD on small BG
• Initially investigate with modest effort
  • If it appears feasible, we will then decide how to proceed
  • Not on the critical path
3D PDT
• Moderately resolved 3D PDT is even more daunting
  • 3D at 512 × 512 × 1024 ≈ 2.5M core-hours
• Only hope is BG, or a worthy successor
  • 2.5M ≈ 32K cores × 80 hr
• Again, scaling is a concern
  • BG scaling studies start next month
• If it scales as hoped, this will be the largest Alliance use of BG
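A quick check that the stated job shape matches the 2.5M core-hour estimate:

```python
# One 3D PDT run on BG: 32K cores for 80 hours.
core_hours = 32 * 1024 * 80
print(core_hours)  # 2,621,440, i.e. ≈ 2.5M as stated
```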
Computational Challenges of the Coming Year
• Continue transition to production UQ; continue code development, efficiency improvements, …
• Scaling CRASH and PDT to BG (if it will remain available)
• Challenges of:
  • Obtaining sufficient allocation
  • Improved end-to-end performance
  • Scheduling
  • Bandwidth
  • I/O
  • …