DOE Simulation-Driven Applications: Astrophysics, Biology, Climate, Combustion, Fusion, HEP, Nanoscience
A simulation scientist's view from a DOE National Laboratory
Workflows
• Critical need: enable (and automate) scientific workflows (a minimal driver sketch follows below):
  • Data Generation
  • Data Storage
  • Data Transfer
  • Data Analysis
  • Visualization
• An order of magnitude more effort can be spent manually managing these workflows than on performing the simulation itself.
• Workflows are not static.
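To make the "enable and automate" point concrete, here is a minimal sketch of a driver script that chains the five stages above. It is an illustration only: all commands, paths, and host names are hypothetical placeholders, and a real site would substitute its own batch, archive, and transfer tools.

```python
#!/usr/bin/env python
"""Minimal sketch of chaining the five workflow stages: generate -> store -> transfer ->
analyze -> visualize. All paths, hosts, and script names are hypothetical placeholders."""
import subprocess
from pathlib import Path

RUN_DIR = Path("/scratch/gtc_run_001")          # hypothetical simulation output directory
ARCHIVE_TAR = "/archive/user/gtc_run_001.tar"   # hypothetical HPSS target
ANALYSIS_HOST = "analysis.example.gov"          # hypothetical analysis cluster

def stage(cmd):
    """Run one workflow stage and stop the pipeline if it fails."""
    print("->", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Data generation: submit the batch simulation and block until it finishes (Slurm).
stage(["sbatch", "--wait", "submit_gtc.sh"])

# 2. Data storage: archive the raw output to tape (site-specific HPSS tooling).
stage(["htar", "-cvf", ARCHIVE_TAR, str(RUN_DIR)])

# 3. Data transfer: move a reduced subset to the analysis cluster.
stage(["scp", "-r", str(RUN_DIR / "profiles"), f"{ANALYSIS_HOST}:/data/gtc_run_001/"])

# 4. Data analysis and 5. visualization, run remotely on the analysis host.
stage(["ssh", ANALYSIS_HOST, "python analyze_profiles.py /data/gtc_run_001/profiles"])
stage(["ssh", ANALYSIS_HOST, "python make_movies.py /data/gtc_run_001/profiles"])
```

Even this toy version shows why the manual workflow is painful: failures, restarts, and changes to the analysis chain all require hand-editing and babysitting scripts like this one.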
Simulations
• Simulations run in batch mode.
• The remaining workflow is interactive or "on demand."
• Simulation and analyses are performed by distributed teams of research scientists.
  • Need to access remote and distributed data and resources.
  • Need for distributed collaborative environments.
• We will not present solutions in this talk!
  • Some solutions will be problem dependent.
  • Example: remote viz. vs. local viz., Parallel HDF5 vs. Parallel netCDF, …
How do we do simulation science (I)
• Let's suppose that we have a verified HPC code.
  • I will use the Gyrokinetic Toroidal Code (GTC) as the example.
• We also suppose that we have a suite of analysis and visualization programs.
• We eventually want to compare the code's output to theoretical, experimental, and/or other simulation results.
A fast peek at the workflow
• Stages: Data Generation (HPC) → Data Storage → Data Transfer → Data Analysis → Data Visualization, with thought and interpretation feeding back into new runs and, eventually, a paper.
• Analysis steps include: compute 1D and 2D radial and velocity profiles; compute volume averages; compute tracer-particle energy, position, and momentum; feature tracking of the heat potential; compute correlation functions; global analysis tools; visualization at each step.
• Outputs: terabytes of raw data, movies, viz features, metadata.
• Requirements: now ~1 TB/simulation, ~10 TB/year, ~58 Mb/s; in 5 years ~100 TB/simulation, ~0.5 PB/year, ~1.6 Gb/s.
• Let's go through the scientific process.
Stage 1: Initial Question + Thought
• The scientist thinks of a problem to answer a physical question.
• Example: What saturates transport driven by the ion temperature gradient?
• Requirements:
  • Possible changes in the code.
  • New visualization routines to examine particles.
  • New modifications in analysis tools.
• Collaborate with O(5) people: face to face, phone.
Stage 2: Change code, add analysis
• If the code is mature, go to Stage 4.
• Else, scientists modify the HPC code to add routines for new physics and new capabilities.
  • Scientists change the code to answer the question.
  • If necessary, analysis/viz routines are added or modified.
• Where do the inputs come from? Experiments, other simulations, theory.
• O(5) people modify the code over a period of weeks.
• I/O: total output ≈ 1 TB per full run over 40 hours ≈ 58 Mb/s now; in 5 years, ≈ 0.1 PB per hero run over 150 hours ≈ 1.6 Gb/s (see the back-of-envelope check below).
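As a quick sanity check (not part of the original slide), the quoted sustained rates follow directly from output volume divided by run time; a few lines of Python reproduce them, assuming decimal units (1 TB = 10^12 bytes).

```python
def sustained_rate_mbps(bytes_total: float, hours: float) -> float:
    """Average output bandwidth in megabits per second."""
    return bytes_total * 8 / (hours * 3600) / 1e6

# Today: 1 TB written over a 40-hour full run.
print(sustained_rate_mbps(1e12, 40))           # ~55.6 Mb/s, i.e. the quoted ~58 Mb/s
# In 5 years: 0.1 PB written over a 150-hour hero run.
print(sustained_rate_mbps(1e14, 150) / 1000)   # ~1.48 Gb/s, i.e. the quoted ~1.6 Gb/s
```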
Stage 3: Debugging Stage
• Scientists modify the HPC code to put in new routines for new physics.
• Scientists generally run a parameter survey to answer the question(s).
• Scientists change the code to answer the question.
• 1 to 2 people debug the code over a period of weeks.
• Verify the code again; run regression tests.
• Typical loop: run, compute volume averages, visualize, continue the run sequence.
• Output is small (≈ 0.1 Mb/s), and results from debugging runs are thrown away.
Stage 4: Run production code
• Now the scientist has confidence in the modifications.
• Scientists generally run a parameter survey and/or sensitivity analysis to answer the question(s).
• Scientists need good analysis and visualization routines.
• O(3) people look at raw data and run analysis programs.
  • Filter data.
  • Look for features for the larger group.
• O(10) people look at the end visualization and interpret the results.
• Typical run: 1000 time steps; scalar output ≈ 60 Mb/s; particle output ≈ 50 Mb/s on 0.5% of time steps; analysis outputs (1D and 2D radial and velocity profiles, volume averages, tracer-particle energy/position/momentum) ≈ 0.01 Mb/s.
• Data can flow from RAM to RAM/disk/WAN/LAN (a reduction sketch follows below).
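The kind of reduction run in this stage can be illustrated with a short sketch: reduce per-particle data to a volume average and a 1D radial profile before anything crosses the network. Array names and layout are hypothetical; GTC's actual gyrokinetic diagnostics are more involved.

```python
import numpy as np

def radial_profile(r, values, nbins=64, r_max=1.0):
    """Bin a per-particle quantity into a 1D radial profile (mean per radial bin)."""
    edges = np.linspace(0.0, r_max, nbins + 1)
    idx = np.clip(np.digitize(r, edges) - 1, 0, nbins - 1)
    sums = np.bincount(idx, weights=values, minlength=nbins)
    counts = np.bincount(idx, minlength=nbins)
    return edges, sums / np.maximum(counts, 1)

# Hypothetical per-particle arrays: radial coordinate and kinetic energy.
rng = np.random.default_rng(0)
r = rng.random(1_000_000)
energy = rng.exponential(scale=1.0, size=r.size)

volume_average = energy.mean()              # scalar: cheap enough to save every step
edges, profile = radial_profile(r, energy)  # 1D profile: also cheap to save
print(volume_average, profile[:5])
```

Reductions of this kind are what turn the ~50 Mb/s particle stream quoted on the slide into the ~0.01 Mb/s analysis stream that scientists actually look at.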
Stage 4a: Data Management Observations
• We must understand:
  • Data generation from simulation and analysis routines.
  • Size of the data being generated.
  • Latency issues for access patterns.
  • Can we develop good compression techniques?
  • Bandwidth and disk-speed issues.
  • Do we need non-volatile storage? RAM–RAM, RAM–disk–tape.
• "Plug and play" analysis routines need a common data model.
  • It is non-trivial to transfer data from N processors to M processors!
• Bottleneck: analysis is too slow.
• Data disposition: save scalar data for more post-processing; save viz data; toss particle data. (A hedged HDF5 sketch follows below.)
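To make the "common data model plus compression" point concrete, here is a minimal sketch, not the project's actual schema, that uses h5py to write the scalar diagnostics the slide says are worth keeping, with gzip compression and self-describing attributes. Dataset and attribute names are assumptions.

```python
import numpy as np
import h5py

# Hypothetical scalar time series produced by the run (kept); particle data is tossed.
nsteps = 1000
time = np.arange(nsteps) * 0.5
volume_avg_energy = np.random.default_rng(1).normal(1.0, 0.05, nsteps)

with h5py.File("gtc_scalars.h5", "w") as f:
    grp = f.create_group("diagnostics")
    for name, data, units in [("time", time, "a.u."),
                              ("volume_avg_energy", volume_avg_energy, "a.u.")]:
        dset = grp.create_dataset(name, data=data,
                                  compression="gzip", compression_opts=4)
        dset.attrs["units"] = units        # part of the common data model
    grp.attrs["code"] = "GTC"              # which code produced the data
    grp.attrs["timesteps"] = nsteps
```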
Stage 5: Feedback Stage
• After the production run we interpret the results.
• We then ask a series of questions:
  • Do I have adequate analysis routines?
  • Was the original hypothesis correct?
  • Should the model equations change? Do we need to modify them?
  • If everything is OK, should we continue the parameter survey?
• This includes comparison to other data: theory, other simulations, experiments.
• The workflow is changing!
Stage 5: Observations
• To expedite this process we need standard data model(s).
  • Can we build analysis routines that can be used for multiple codes and/or multiple disciplines?
  • The data model must allow flexibility: we commonly add and remove variables used in the simulations and analysis routines.
• Need for metadata, annotation, and provenance (see the sketch below):
  • Nature of metadata: code versions, compiler information, machine configuration; simulation parameters, model parameters; information on simulation inputs.
  • Need for tools to record provenance in databases.
  • Additional provenance (beyond the metadata above) is needed to describe: reliability of the data; how the data arrived in the form in which it was accessed; data ownership.
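A minimal provenance-capture sketch, assuming a Git-managed source tree and hypothetical parameter names, shows the kind of metadata the slide lists being recorded automatically next to the run output rather than in a lab notebook.

```python
"""Record code version, build/machine info, and run parameters alongside the output.
Paths, the build line, and parameter names are hypothetical placeholders."""
import json, platform, subprocess, datetime

def git_version(repo="."):
    try:
        return subprocess.check_output(
            ["git", "-C", repo, "describe", "--always", "--dirty"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

provenance = {
    "code": "GTC",
    "code_version": git_version("/path/to/gtc/source"),       # hypothetical source path
    "compiler": "ftn -O3",                                     # hypothetical build line
    "machine": platform.node(),
    "run_date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "simulation_parameters": {"mstep": 1000, "micell": 100},   # hypothetical inputs
    "input_files": ["gtc.in"],
}

with open("run_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

A record like this covers the "metadata" bullets; the additional provenance the slide asks for (reliability, derivation history, ownership) would be layered on top, e.g. in a database keyed by this record.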
Stage 5: Observations
• Data analysis routines can include:
  • Data transformation: format transformation, reduction, coordinate transformation, unit transformation, creation of derived data, …
  • Feature detection, extraction, and tracking: define metadata; find regions of interest; perform level-set analyses in spacetime; perform born analyses; inverse feature tracking.
  • Statistical analysis: PCA, comparative component analyses, data fitting, correlations (a small correlation example is sketched below).
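As one tiny example of the "correlations" item, and not the project's actual analysis code, the following computes a normalized autocorrelation function and a crude correlation time for a fluctuating signal; the signal itself is synthetic.

```python
import numpy as np

def autocorrelation(x):
    """Normalized autocorrelation of a 1D signal as a function of lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    full = np.correlate(x, x, mode="full")     # lags from -(N-1) to +(N-1)
    acf = full[full.size // 2:]                # keep non-negative lags
    return acf / acf[0]

# Hypothetical fluctuating trace, e.g. a volume-averaged potential vs. time step.
rng = np.random.default_rng(2)
t = np.arange(2000)
signal = np.sin(0.05 * t) + 0.5 * rng.standard_normal(t.size)

acf = autocorrelation(signal)
corr_time = np.argmax(acf < 1 / np.e)          # first lag below 1/e: crude estimate
print("correlation time (steps):", corr_time)
```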
Stage 5: Observations
• Visualization needs:
  • Local, remote, interactive, collaborative, quantitative, comparative.
  • Platforms.
  • Fusion of different data types: experimental, theoretical, computational, …
  • New representations.
Stage 6: Complete parameter survey
• Complete all of the runs for the parameter survey to answer the question.
• 1–3 people look at the results (including feature tracking) during the parameter survey.
Stage 7: Run a "large" hero run
• Now we can run a high-resolution case, which will run for a very long time.
• O(10) people are looking at the results.
Stage 8: Assimilate the results
• Did I answer the question?
  • Yes:
    • Publish a paper; O(10+) people look at the results.
    • Compare to experiment (details here).
    • What do we need stored? Short-term storage vs. long-term storage.
    • Results (movies, TBs of data, viz features, metadata) go into a data repository served by global analysis and data-mining tools.
  • No: go back to Stage 1 (the question).
movies TB’s viz features metadata Global Analysis tools Data repository VIZ Stage 9: Other scientist use information • Now other scientist can look at this information and use it for their analysis, or input for their simulation. • What is the data access patterns • Global Interactive VIZ: GB’s of data/time slice, TB’s in the future. • Bulk data is accessed numerous times. • Look at derived quantities. MB’s to GB’s of data. • How long do we keep the data? • Generally less than 5 years. … Interpret results Time Chicago Meeting DOE Data Management
Let Thought be the bottleneck
• Simulation scientists generally have scripts to semi-automate parts of the workflow.
• To expedite this process they need to:
  • Automate the workflow as much as possible (e.g., trigger analysis as soon as new output appears; see the sketch below).
  • Remove the bottlenecks:
    • Better visualization and better data analysis routines will allow users to decrease interpretation time.
    • Better routines to "find the needle in the haystack" (feature detection/tracking) will reduce the time spent in the thought process.
    • Faster turnaround for simulations will decrease code runtimes: better and more scalable numerical algorithms; faster processors, networking, and I/O; more HPC systems and more end stations.
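One very simple form of "automate the workflow" is to launch analysis as soon as the simulation drops a new output file instead of having a scientist babysit the run. The sketch below polls a directory; directory names, file patterns, and the analysis command are hypothetical, and a production system would use batch-system hooks or event services instead of polling.

```python
"""Poll the simulation output directory and analyze each new time slice as it appears."""
import subprocess, time
from pathlib import Path

OUTPUT_DIR = Path("/scratch/gtc_run_001/output")   # hypothetical output directory
seen = set()

while True:
    for path in sorted(OUTPUT_DIR.glob("snapshot_*.h5")):
        if path not in seen:
            seen.add(path)
            # Fire-and-forget analysis job for the new time slice.
            subprocess.Popen(["python", "analyze_snapshot.py", str(path)])
    time.sleep(30)   # poll every 30 seconds
```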
Summary:
• Biggest bottleneck: interpretation of results. This is the biggest bottleneck because of:
  • Babysitting: scientists spend their "real time" babysitting computational experiments (trying to interpret results, move data, and orchestrate the computational pipeline), and deciding whether the analysis routines are working properly with the "new" data.
  • Non-scalable data analysis routines.
  • Looking for the "needle in the haystack."
• Better analysis routines could mean less time in the thought process and in the interpretation of the results.
• The entire scientific process cannot be fully automated.
Workflows
• No changes in these workflows.
Section 3: Astrophysical Simulation Workflow Cycle
• Application layer: Start new simulation? → Run simulation batch job on a capability system → Continue simulation? → Simulation generates checkpoint files → Archive checkpoint files to HPSS → Migrate a subset of checkpoint files to a local cluster → Vis & analysis on a local Beowulf cluster.
• Parallel I/O layer: Parallel HDF5.
• Storage layer: PVFS or Lustre; HPSS; GPFS; MSS, disks, & OS.
(A hedged Parallel HDF5 checkpoint sketch follows below.)
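The "Parallel HDF5" layer can be illustrated with the standard mpi4py + h5py pattern: every rank writes its own slab of a checkpoint dataset into one shared file. This is a generic sketch, assuming an h5py build with MPI support; the dataset name and sizes are made up, and it is not the astrophysics code's actual checkpoint format.

```python
from mpi4py import MPI
import numpy as np
import h5py   # must be built with parallel (MPI) support for driver="mpio"

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size
local_n = 1000   # hypothetical number of cells owned by this rank

# Collective open of one shared checkpoint file; dataset creation is collective too.
with h5py.File("checkpoint_0001.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("density", (nprocs * local_n,), dtype="f8")
    local = np.full(local_n, float(rank))             # this rank's slab of the field
    dset[rank * local_n:(rank + 1) * local_n] = local
```

Run with something like `mpiexec -n 4 python checkpoint_demo.py` (hypothetical script name); all ranks end up writing into a single HDF5 checkpoint file that the archive and migration steps above then handle as one object.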
Biomolecular Simulation
• Workflow: design molecular system → parameterization / molecular system construction → computer simulation → molecular trajectories → analysis & visualization (statistical analysis, visualization) → archive trajectories → review/curation.
• Storage management, data movement, and access: trajectory database server (e.g., BioSimGrid); structure database (e.g., PDB); large-scale temporary storage for raw data.
• Underpinned by hardware, OS, math libraries, and MSS (HPSS).
Combustion Workflow
GTC Workflow
• Main particle-in-cell loop:
  • Deposit the charge of every particle on the grid.
  • Solve the Poisson equation to get the potential on the grid.
  • Calculate the electric field.
  • Gather the forces from the grid to the particles and push them.
  • Migrate particles that have moved out of their current domain to the owning process.
• Analysis (each step followed by visualization):
  • Compute volume-averaged quantities.
  • Compute tracer-particle energy, position, and momentum.
  • Compute 1D and 2D radial and velocity profiles.
  • Compute correlation functions.
(A toy charge-deposition sketch follows below.)
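The first step of the loop, charge deposition, is easy to illustrate with a toy 1D cloud-in-cell scheme. This is a stand-in only, not the actual GTC algorithm, which is gyrokinetic and toroidal.

```python
"""Toy 1D cloud-in-cell charge deposition on a periodic grid."""
import numpy as np

def deposit_charge(positions, charges, nx, length):
    """Linearly weight each particle's charge onto its two neighboring grid points."""
    dx = length / nx
    rho = np.zeros(nx)
    cell = np.floor(positions / dx).astype(int) % nx     # left grid index (periodic)
    frac = positions / dx - np.floor(positions / dx)      # fractional distance in cell
    np.add.at(rho, cell, charges * (1.0 - frac))          # weight to left point
    np.add.at(rho, (cell + 1) % nx, charges * frac)       # weight to right point
    return rho / dx                                       # charge density

rng = np.random.default_rng(3)
pos = rng.random(100_000)                 # particle positions in a unit periodic box
q = np.full(pos.size, 1.0 / pos.size)     # equal charges summing to 1
rho = deposit_charge(pos, q, nx=64, length=1.0)
print(rho.sum() / 64)                     # total charge is conserved (~1.0)
```

In the real code this density feeds the Poisson solve, whose potential gradient gives the electric field that pushes the particles in the next step of the loop above.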
NIMROD Workflow
• Input files fluxgrid.in and nimrod.in feed nimset, which produces the initial dump.00000.
• NIMROD runs from the dump/restart files and its run-time configuration, producing dump.* (~100 files), a restart file, screen output, and nimhist output (discharge, energy) with data for every time step.
• Post-processing tools (nimhdf, nimfl, nimplot, …), driven by their own inputs (nimhdf.in, nimfl.in, …), produce files such as Phi.h5 and nimfl.bin.
• Visualization with Xdraw, AVS/Express, SCIRun, and OpenDX yields images and animations.
M3D Simulation Studies, 2009 (rough estimate)
• Equilibrium inputs from VMEC, JSOLVER, EFIT, etc.
• Run M3D at NERSC on 10,000 processors for 20 hours per segment: initial run, then Restart 1, Restart 2, …, Restart N until done.
• Output goes to HPSS (NERSC): 1 TB files, transfer time 10 min, if parallel?
• A subset is moved to PPPL local project disks.
• Post-process locally on the upgraded PPPL cluster: requires 10 min per time slice to analyze; typically analyze 20 time slices.
A Simplified VORPAL Workflow
• Initial parameters and input data drive VORPAL, which writes time slices D1, D2, D3, …, Dn.
• A data filtering/extraction step reduces the time slices.
• An image generator (Xdraw), driven by run-time configurations, produces png1, png2, png3, …, pngn, which are assembled into per-simulation animations (Sim1, Sim2, …, SimX).
• Currently the workflow is handled by a set of scripts; data movement is handled either by scripts or manually. (A sketch of such a driver script follows below.)
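A minimal sketch of the kind of driver script mentioned above: turn each time-slice file into a PNG and stitch the PNGs into an animation. The file layout, dataset name, and output names are hypothetical, and this uses generic tools (matplotlib, ffmpeg) rather than Xdraw.

```python
"""Render each time slice to a PNG, then assemble the frames into a movie."""
import subprocess
from pathlib import Path
import h5py
import matplotlib
matplotlib.use("Agg")                  # no display needed on a batch node
import matplotlib.pyplot as plt

slices = sorted(Path("run_001").glob("D*.h5"))   # time slices D1 ... Dn (hypothetical)
for i, f in enumerate(slices):
    with h5py.File(f, "r") as h:
        field = h["electric_field"][:]           # hypothetical 2D dataset
    plt.imshow(field, origin="lower")
    plt.colorbar()
    plt.title(f.name)
    plt.savefig(f"frame_{i:04d}.png", dpi=100)
    plt.close()

# Assemble the frames into a movie (requires ffmpeg on the path).
subprocess.run(["ffmpeg", "-y", "-framerate", "10", "-i", "frame_%04d.png",
                "-pix_fmt", "yuv420p", "sim1_animation.mp4"], check=True)
```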
TRANSP Workflow
• Experiments (C-Mod, DIII-D, JET, MAST, NSTX) → diagnostic hardware → preliminary data analysis and preparation (largely automated).
• Input: 20–50 signals {f(t), f(x,t)}: plasma position, shape, temperatures, densities, field, current, RF and beam injected powers.
• TRANSP analysis* (experiment simulation): current diffusion, MHD equilibrium, fast ions, thermal plasma heating; power, particle, and momentum balance.
• Output database: ~1000–2000 signals {f(t), f(x,t)} → visualization; load relational databases; detailed (3D) time-slice physics simulations (GS2, ORBIT, M3D, …).
• *Pre- and post-processing at the experimental site.
D. McCune, 23 Apr 2004
Workflow for Pellet Injection Simulations
• Preliminary analysis (deciding run parameters) → input files.
• Run the 1D pellet code, producing a table of the energy sink term as a function of flux surface and time.
• Run the AMR production code (the majority of the time), producing HDF5 data files and HDF5 files of plotting variables.
• Run the post-processing code to compute visualization variables and other diagnostic quantities (e.g., total energy) for plotting; this yields ASCII files of diagnostic variables, from which diagnostic plots are created.
• Visualize field quantities in computational space using ChomboVis.
• Interpolate the solution on the finest mesh and create binary data files for plotting field quantities in a torus; visualize the torus using AVS or EnSight.
Degas2 Workflow
High-Energy Physics Workflow (typical of a major collaboration)
• Data volumes: 100s of terabytes today; 10s of petabytes in 2010.
• Data acquisition — users: DAQ team; at: 1 site.
• Reconstruction (feature extraction) — users: reconstruction team; at: a few sites.
• Skimming/filtering — users: skim team; at: a few sites.
• Analysis — users: all physicists; at: 100+ sites.
• Simulation — users: simulation team; at: 10s of sites.
• Databases: < 1 terabyte (conditions, metadata, and workflow).
Nuclear Physics Workflow (typical of a major collaboration)
• Same structure and scale as the high-energy physics workflow above: 100s of terabytes today, 10s of petabytes in 2010.
• Data acquisition (DAQ team, 1 site) → reconstruction/feature extraction (reconstruction team, a few sites) → skimming/filtering (skim team, a few sites) → analysis (all physicists, 100+ sites); simulation by the simulation team at 10s of sites.
• Databases: < 1 terabyte (conditions, metadata, and workflow).
Comments from others