On the Path to Petascale: Top Challenges to Scientific Discovery
Scott A. Klasky, NCCS Scientific Computing End-to-End Task Lead
1. Code Performance • From 2004 to 2008, computing power for codes like GTC will go up 3 orders of magnitude! • Two paths to petascale computing for most simulations: more physics and larger problems, or code coupling. • My personal definition of leadership-class computing: "A simulation runs on >50% of the cores for >10 hours." • One 'small' simulation will cost $38,000 on a Pflop computer. • Science scales with processors. • XGC and GTC fusion simulations will run on 80% of the cores for 80 hours ($400,000/simulation).
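As a rough back-of-the-envelope sketch of how those cost figures scale, the snippet below computes run cost from machine fraction and wall-clock hours. The core count and the $/core-hour rate are illustrative assumptions, not figures from the talk; they land in the same ballpark as the $38,000 and $400,000 quoted above, not exactly on them.

```python
# Back-of-the-envelope cost scaling for leadership-class runs.
# ASSUMPTIONS (not from the talk): a ~250,000-core petaflop machine
# charged at roughly $0.03 per core-hour.
CORES_TOTAL = 250_000
COST_PER_CORE_HOUR = 0.03  # USD, illustrative only

def run_cost(core_fraction, hours):
    """Cost of using `core_fraction` of the machine for `hours`."""
    return CORES_TOTAL * core_fraction * hours * COST_PER_CORE_HOUR

# A "small" leadership run: >50% of cores for >10 hours.
print(f"small run: ${run_cost(0.5, 10):,.0f}")   # same scale as the ~$38K quoted
# An XGC/GTC-class run: 80% of cores for 80 hours.
print(f"large run: ${run_cost(0.8, 80):,.0f}")   # same scale as the ~$400K quoted
```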
Data Generated • MTTF (mean time to failure) will be ~2 days. • Restarts contain critical information to replay the simulation at different times. • Typical restart = 1/10 of memory, dumped every hour (the big 3 apps support this claim). • Analysis files dump every physical timestep, typically every 5 minutes of simulation. • Analysis files vary; we estimate that for ITER-size simulations data output will be roughly 1 GB / 5 minutes. • DEMAND I/O < 5% of the calculation. • A total simulation will potentially produce ~1280 TB + 960 GB. • Need > (16*1024 + 12)/(3600 * 0.05) = 91 GB/sec. • Asynchronous I/O is needed!!! (The big 3 apps (combustion, fusion, astro) allow buffers.) • Reduces the I/O rate to (16*1024 + 12)/3600 = 4.5 GB/sec (with lower overhead). • Get the data off the HPC system and over to another system! • Produce HDF5 files on another system (too expensive on the HPC system).
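The two bandwidth numbers on this slide follow directly from its own figures: apparently a 16 TB restart dump plus 12 GB of analysis output per wall-clock hour, with I/O held to 5% of the run. A minimal sketch of that arithmetic:

```python
# Required I/O bandwidth so that output stays under 5% of the calculation.
# Slide numbers: a 16 TB restart dump every hour plus 12 GB/hour of
# analysis output (1 GB every 5 minutes).
GB_PER_HOUR = 16 * 1024 + 12      # GB written per wall-clock hour
SECONDS_PER_HOUR = 3600
IO_BUDGET = 0.05                  # I/O may take at most 5% of the run time

# Synchronous I/O: each dump must complete inside the 5% budget.
sync_rate = GB_PER_HOUR / (SECONDS_PER_HOUR * IO_BUDGET)
print(f"synchronous : {sync_rate:.0f} GB/s")     # ~91 GB/s

# Asynchronous (buffered) I/O: writes can drain over the full hour.
async_rate = GB_PER_HOUR / SECONDS_PER_HOUR
print(f"asynchronous: {async_rate:.2f} GB/s")    # ~4.5 GB/s
```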
Workflow Automation is desperately needed (with high-speed data-in-transit techniques). • Need to integrate autonomics into workflows. • Need to make it easy for the scientists. • Need to make it fault tolerant/robust.
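A minimal sketch of what "fault tolerant" means for even one workflow step: a transfer that retries with backoff instead of silently failing the pipeline. The bbcp invocation and paths below are hypothetical placeholders, not anything from the talk; a real workflow engine would do far more (checksums, checkpointing, notification).

```python
import subprocess
import time

def transfer_with_retries(cmd, max_attempts=5, backoff_s=60):
    """Run a transfer command, retrying with backoff instead of dying.

    A real workflow engine would also checkpoint progress, verify file
    sizes/checksums, and notify the scientist before giving up.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
        time.sleep(backoff_s * attempt)      # simple linear backoff
    return False

# Hypothetical usage: move a restart file off the HPC system.
transfer_with_retries(
    ["bbcp", "jaguar:/scratch/run42/restart.0100", "cluster:/data/run42/"])
```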
A few days in the life of a Sim Scientist. Day 1, morning. • 8:00 AM Get coffee, check to see if the job is running. • ssh into jaguar.ccs.ornl.gov (job 1). • ssh into seaborg.nersc.gov (job 2) (this one is running, yay!). • Run gnuplot to see if the run is going OK on seaborg. This looks OK. • 9:00 AM Look at data from an old run for post-processing. • Legacy code (IDL, Matlab) to analyze most data. • Visualize some of the data to see if there is anything interesting. • Is my job running on jaguar? I submitted this 4K-processor job 2 days ago! • 10:00 AM scp some files from seaborg to my local cluster. • Luckily I only have 10 files (which are only 1 GB/file). • 10:30 AM The first file appears on my local machine for analysis. • Visualize data with Matlab. Seems to be OK. • 11:30 AM See that the second file had trouble coming over. • scp the files over again… Doh!
Day 1, evening. • 1:00 PM Look at the output from the second file. • Oops, I had a mistake in my input parameters. • ssh into seaborg, kill the job. Emacs the input, submit the job. • ssh into jaguar, check status. Cool, it's running. • bbcp 2 files over to my local machine (8 GB/file). • Gnuplot the data. This looks OK too, but I still need to see more information. • 1:30 PM Files are on my cluster. • Run Matlab on the HDF5 output files. Looks good. • Write down some information in my notebook about the run. • Visualize some of the data. All looks good. • Go to meetings. • 4:00 PM Return from meetings. • ssh into jaguar. Run gnuplot. Still looks good. • ssh into seaborg. My job still isn't running… • 8:00 PM Are my jobs running? • ssh into jaguar. Run gnuplot. Still looks good. • ssh into seaborg. Cool, my job is running. Run gnuplot. Looks good this time!
And Later • 4:00 AM Yawn… is my job on jaguar done? • ssh into jaguar. Cool, the job is finished. Start bbcp'ing files over to my work machine (2 TB of data). • 8:00 AM @@!#!@. bbcp is having trouble. Resubmit some of my bbcp transfers from jaguar to my local cluster. • 8:00 AM (next day) Oops, I still need to get the rest of my 200 GB of data over to my machine. • 3:00 PM My data is finally here! • Run Matlab. Run EnSight. Oops… something's wrong!!! Where did that instability come from? • 6:00 PM Finish screaming!
Typical Monitoring • Look at volume-averaged quantities: at 4 key times this quantity looks good, yet the code had one error that never showed up in the typical ASCII output used to generate this graph. • Typically users run gnuplot/grace to monitor output. • Need metadata integrated into the high-performance I/O, and integrated into simulation monitoring. • More advanced monitoring: • In 5 seconds, move 600 MB and process the data. • Really need to use FFTs for 3D data, and then process field data + particles. • In 50 seconds (10 time steps), move & process 8 GB, i.e. 1/100 of the 30 billion particles. • Demand low overhead (<5%)!
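A minimal sketch of the sustained data rates those monitoring targets imply, using only the slide's numbers. The rates are modest next to the 4.5 GB/s restart stream above, but the point of the <5% overhead demand is that the movement has to overlap the computation rather than stall it.

```python
# Sustained bandwidth implied by the "more advanced monitoring" targets:
# 600 MB moved and processed within 5 seconds, and 8 GB (1/100 of the
# 30 billion particles) within 50 seconds (10 time steps).
targets = [
    ("field data", 600 / 1024, 5),    # size in GB, window in seconds
    ("particles ", 8.0,        50),
]
for name, size_gb, window_s in targets:
    rate_mb_s = size_gb * 1024 / window_s
    print(f"{name}: {size_gb:5.2f} GB in {window_s:2d} s "
          f"-> {rate_mb_s:6.1f} MB/s sustained")
```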
Parallel Data Analysis • Most applications use scalar (serial) data analysis: IDL, Matlab, NCAR Graphics. • Need techniques such as PCA. • Need help, since data analysis code is written quickly and changed often… no hardened versions… maybe…
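A minimal sketch of the kind of PCA-style reduction meant here, run on a toy array standing in for simulation output. scikit-learn and the array shape are illustrative assumptions; a production version would have to work in parallel on data too large for one node.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for analysis output: 2000 snapshots of a 10,000-point
# field that are really combinations of 5 coherent modes plus noise.
rng = np.random.default_rng(0)
modes = rng.standard_normal((5, 10_000))
weights = rng.standard_normal((2000, 5))
data = weights @ modes + 0.1 * rng.standard_normal((2000, 10_000))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(data)

print("original shape:", data.shape)       # (2000, 10000)
print("reduced shape :", reduced.shape)    # (2000, ~5)
print("variance kept :", f"{pca.explained_variance_ratio_.sum():.3f}")
```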
New Visualization Challenges • Finding the needle in the haystack. • Feature identification/tracking! • Analysis of 5D+time phase space (with ~1×10^12 particles)! • Real-time visualization of codes during execution. • Debugging visualization.
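One hedged sketch of what "feature identification" could look like on a single 3D field: threshold the field and label connected blobs. The scipy tools and the 2-sigma threshold are illustrative choices, not the speaker's method; tracking would then match labels between consecutive snapshots, which is where it gets hard at petascale.

```python
import numpy as np
from scipy import ndimage

# Toy 3D field: smoothed noise standing in for one simulation snapshot.
rng = np.random.default_rng(1)
field = ndimage.gaussian_filter(rng.standard_normal((64, 64, 64)), sigma=3)

# "Features" = connected regions where the field exceeds a threshold.
mask = field > field.mean() + 2 * field.std()
labels, n_features = ndimage.label(mask)
sizes = ndimage.sum_labels(mask, labels, index=np.arange(1, n_features + 1))

print(f"{n_features} features found; largest spans {int(sizes.max())} cells")
```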
Where is my data? • ORNL, NERSC, HPSS (NERSC, ORNL), local cluster, laptop? • We need to keep track of multiple copies. • We need to query the data: query-based visualization methods. • Don't want to distinguish between different disks/tapes.
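A minimal sketch of the replica/metadata catalog this slide implies: record every copy of every file with its site, and let the scientist query by run, variable, and timestep rather than by disk or tape path. The schema, run name, and paths are hypothetical.

```python
import sqlite3

# In-memory replica catalog: one row per copy of each file, at each site.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE replicas (
    run TEXT, variable TEXT, step INTEGER, site TEXT, path TEXT)""")

# Hypothetical entries: the same analysis file lives in three places.
con.executemany(
    "INSERT INTO replicas VALUES (?, ?, ?, ?, ?)",
    [("gtc_run42", "potential", 100, "ORNL",       "/scratch/run42/phi.0100.h5"),
     ("gtc_run42", "potential", 100, "NERSC_HPSS", "/hpss/run42/phi.0100.h5"),
     ("gtc_run42", "potential", 100, "cluster",    "/data/run42/phi.0100.h5")])

# Query by what the scientist cares about, not by which disk/tape holds it.
for site, path in con.execute(
        "SELECT site, path FROM replicas WHERE run=? AND variable=? AND step=?",
        ("gtc_run42", "potential", 100)):
    print(site, path)
```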