Evolving Scientific Data Workflow CAS 2011 Pamela Gillman pjg@ucar.edu
Overview • Traditional Data Workflow • Evolving Scientific Data Workflow • Design Technical Challenges • GLobally Accessible Data Environment • New Workflow Example • NWSC Steps Forward
Traditional Workflow • Process-Centric Data Model
Traditional Data Workflow Challenges • Common data movement issues • Time-consuming to move data between systems • Bandwidth to the archive system is insufficient • Lack of sufficient disk space • Need to evolve data management techniques • Workflow management systems • Standardize metadata information (a sketch follows) • User education • Effective methods for understanding workflows • Effective methods for streamlining workflows
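The bullets above mention standardizing metadata as part of evolving data management. Below is a minimal, hypothetical sketch of what that could look like in practice, using the netCDF4 Python library to stamp every output file with a consistent set of CF-style global attributes. The function name, attribute values, and file names are illustrative, not NCAR's actual tooling.

```python
# A minimal sketch (not NCAR's actual tooling) of standardizing metadata at
# write time so downstream workflow tools can discover data sets consistently.
# The attribute set loosely follows CF-style global attributes.
from netCDF4 import Dataset

def write_with_standard_metadata(path, title, experiment):
    """Create a NetCDF file carrying a consistent set of global attributes."""
    ds = Dataset(path, "w", format="NETCDF4")
    ds.Conventions = "CF-1.6"          # community metadata convention
    ds.title = title
    ds.institution = "NCAR"
    ds.source = experiment             # e.g. model name / run identifier
    ds.history = "created by write_with_standard_metadata"
    # ... dimensions and variables would be defined here ...
    ds.close()

write_with_standard_metadata("example_output.nc", "Test run", "hypothetical-experiment-01")
```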
Evolving Scientific Workflow • Information-Centric Data Model
Design Technical Challenges • Determining actual workflow behaviors • a chicken-and-egg problem: the current environment potentially shapes behavior • if we change the environment, does behavior change? • Storage cost curves are steeper than compute cost curves • finding the right balance (a toy cost sketch follows) • The archive cost curve is unsustainable • need a better balance between disk and archive use
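To make the disk-versus-archive balance concrete, here is a toy calculation. All of the cost numbers are invented assumptions for illustration only, not NCAR figures; the point is simply that data recalled often enough can be cheaper to keep on disk, while cold data belongs in the archive.

```python
# A toy illustration (numbers are assumptions, not NCAR figures) of the
# disk-versus-archive balance: disk costs more per TB-month, but every recall
# from the archive also has a cost, so frequently reused data may be cheaper
# to leave on disk.
DISK_COST_PER_TB_MONTH    = 5.0    # assumed dollars, purely illustrative
ARCHIVE_COST_PER_TB_MONTH = 1.0    # assumed dollars, purely illustrative
RECALL_COST_PER_TB        = 8.0    # assumed dollars per retrieval, illustrative

def cheaper_location(recalls_per_year):
    """Return where 1 TB held for a year is cheaper, given expected recalls."""
    disk_total    = DISK_COST_PER_TB_MONTH * 12
    archive_total = ARCHIVE_COST_PER_TB_MONTH * 12 + RECALL_COST_PER_TB * recalls_per_year
    return "disk" if disk_total < archive_total else "archive"

for recalls in (0, 2, 6, 12):
    print(recalls, "recalls/year ->", cheaper_location(recalls))
```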
GLADE: GLobally Accessible Data Environment • Unified and consistent data environment for NCAR HPC • Supercomputers, DAV, and storage • Shared transfer interface and support for projects • Support for analysis of IPCC AR5 data • Service gateways for ESG & RDA data sets • Data is available at high bandwidth to any server or supercomputer within the GLADE environment • Resources outside the environment can manipulate data using common interfaces (a transfer sketch follows) • Choice of interfaces supports current projects; the platform is flexible enough to support future projects
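As an example of the common-interfaces bullet, the sketch below pulls a file onto GLADE with GridFTP by shelling out to globus-url-copy from Python. The endpoint host name, directory paths, and project name are hypothetical placeholders, and the options shown are simply commonly used ones; real transfers would use the site's actual endpoints and tuning.

```python
# A hedged sketch of pulling a data set onto GLADE through the data-transfer
# gateway using GridFTP. The host names, paths, and project directory below
# are hypothetical placeholders, not actual NCAR endpoints.
import subprocess

SOURCE = "gsiftp://remote.example.edu/archive/run42/output.tar"   # hypothetical remote endpoint
DEST   = "file:///glade/project/example_proj/run42/output.tar"    # hypothetical GLADE project space path

# -vb prints transfer performance, -p requests parallel streams; both are
# common globus-url-copy options, tuned here arbitrarily.
subprocess.run(["globus-url-copy", "-vb", "-p", "4", SOURCE, DEST], check=True)
```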
GLADE Data Workflow Solutions • Information centric • Data can stay in place through the entire workflow • Access from supercomputing, data post-processing, analysis, and visualization resources • Direct access to NCAR data collections • Availability of persistent, longer-term storage • Allows completion of the entire workflow prior to final storage of results, either at NCAR or offsite • Provides high-bandwidth data transfer services between NCAR and peer institutions
New Workflow Example • Diagram: RDA/ESG science gateways and remote sites move data through the GLADE data transfer gateway (GridFTP, scp/sftp, bbcp); within GLADE, the scratch, project space, and data collection areas are shared directly by the supercomputers and the data analysis/visualization systems; final results flow to the HPSS archive via hsi/htar or GridFTP. A sketch of this workflow follows.
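The sketch below compresses the diagram into a hypothetical script: data already resident on GLADE is post-processed and visualized in place, and only the finished products are written to the HPSS archive with htar. The directory layout, the postprocess.py and make_plots.py scripts, and the HPSS path are all illustrative assumptions, not actual NCAR conventions.

```python
# A compressed, hypothetical sketch of the workflow in the diagram above:
# data lands on GLADE once, analysis reads it in place, and only the final
# products are archived. Paths, scripts, and the archive target are
# illustrative placeholders, not NCAR's actual directory layout or tooling.
import subprocess

SCRATCH  = "/glade/scratch/username/run42"        # hypothetical scratch space
PROJECT  = "/glade/project/example_proj/run42"    # hypothetical project space
HPSS_TAR = "/home/username/run42_results.tar"     # hypothetical HPSS path

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Model output is already on GLADE scratch; post-process it in place
#    (no copy to a separate analysis machine is needed).
run(["python", "postprocess.py", SCRATCH, "--out", PROJECT])   # hypothetical script

# 2. Analysis/visualization jobs read the same project space directly.
run(["python", "make_plots.py", PROJECT])                      # hypothetical script

# 3. Only the finished products go to the HPSS archive, via htar.
run(["htar", "-cvf", HPSS_TAR, PROJECT])
```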
Scale of Data Environment Changing • Current NCAR Data Scale • HPC Scratch and DAV Space: 1 PB • Data Collection Space: 1 PB • Archive Size: 14 PB • HPC System: 77 Teraflops • NWSC Scale Projections • Global File System: 10-15 PB • ~80 GB/s burst I/O rate • Archive Size: 20 PB initially, growing to >170 PB by 2016 • HPC System: ~1.5 Petaflops (back-of-the-envelope arithmetic follows)
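Some back-of-the-envelope arithmetic on the numbers above (decimal units, so 1 PB = 10^6 GB): growing the archive from roughly 14 PB in 2011 to more than 170 PB by 2016 implies about 65% growth per year, and sweeping a 15 PB file system at the quoted 80 GB/s burst rate takes a bit over two days. This is only a rough illustration of why the archive growth curve is a concern.

```python
# Back-of-the-envelope arithmetic using the numbers on this slide (2011 talk).
archive_2011_pb = 14.0
archive_2016_pb = 170.0
years = 5
annual_growth = (archive_2016_pb / archive_2011_pb) ** (1.0 / years) - 1.0
print(f"implied archive growth: {annual_growth:.0%} per year")   # roughly 65%/yr

# Time to read or write the full NWSC file system at the quoted burst rate.
fs_capacity_pb = 15.0
burst_rate_gb_s = 80.0
seconds = fs_capacity_pb * 1e6 / burst_rate_gb_s     # 1 PB = 1e6 GB (decimal units)
print(f"full sweep of {fs_capacity_pb} PB at {burst_rate_gb_s} GB/s: "
      f"{seconds / 86400:.1f} days")                 # roughly 2.2 days
```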
NWSC Conceptual Data Architecture • Diagram: remote visualization, partner sites, TeraGrid sites, and science gateways (RDA, ESG) connect to data transfer services over 10Gb/40Gb/100Gb Ethernet; a central storage cluster (15 PB, 80 GB/s burst) hosts data collections, project spaces, and scratch; a high-bandwidth InfiniBand I/O network links the storage cluster to the data analysis, visualization, and computational clusters; an archive interface over 10Gb/40Gb Ethernet connects to HPSS (170 PB).
Summary • Exciting times for data-intensive science! • Many unknowns at this scale, but we're working to prepare as much as possible • Risk mitigation is at the forefront • mid-course corrections based on current efforts • tools for observing changes in workflow behaviors • phased procurement options • Preparing users between now and NWSC deployment • Allocation and charging enhancements • New workflow strategies
pjg@ucar.edu QUESTIONS?