
Evolving Scientific Data Workflow

CAS 2011. Pamela Gillman, pjg@ucar.edu


Presentation Transcript


  1. Evolving Scientific Data Workflow CAS 2011 Pamela Gillman pjg@ucar.edu

  2. Overview
     • Traditional Data Workflow
     • Evolving Scientific Data Workflow
     • Design Technical Challenges
     • GLobally Accessible Data Environment
     • New Workflow Example
     • NWSC Steps Forward

  3. Traditional Workflow: Process Centric Data Model

  4. Traditional Data Workflow Challenges
     • Common data movement issues
          • Time consuming to move data between systems
          • Bandwidth to archive system is insufficient
          • Lack of sufficient disk space
     • Need to evolve data management techniques
          • Workflow management systems
          • Standardize metadata information
     • User Education
          • Effective methods for understanding workflow
          • Effective methods for streamlining workflow

  5. Evolving Scientific Workflow: Information Centric Data Model

  6. Design Technical Challenges
     • Determining actual workflow behaviors
          • chicken and the egg problem
          • current environment potentially shapes behavior
          • change the environment, does behavior change?
     • Storage cost curves are steeper than compute cost curves
          • Finding the right balance
     • Archive cost curve is unsustainable
          • Need a better balance between disk and archive use

  7. GLADE: GLobally Accessible Data Environment
     • Unified and consistent data environment for NCAR HPC
     • Supercomputers, DAV, and storage
     • Shared transfer interface and support for projects
     • Support for analysis of IPCC AR5 data
     • Service Gateways for ESG & RDA data sets
     • Data is available at high bandwidth to any server or supercomputer within the GLADE environment
     • Resources outside the environment can manipulate data using common interfaces
     • Choice of interfaces supports current projects; platform is flexible to support future projects

  8. GLADE Data Workflow Solutions
     • Information centric
     • Data can stay in place through entire workflow
     • Access from supercomputing, data post-processing, analysis and visualization resources
     • Direct access to NCAR data collections
     • Availability of persistent longer-term storage
     • Allows completion of entire workflow prior to final storage of results either at NCAR or offsite
     • Provides high-bandwidth data transfer services between NCAR and peer institutions

  9. New Workflow Example (diagram). Labels: RDA/ESG; GridFTP, scp / sftp, bbcp; GLADE; Data Transfer Gateway; Science Gateways; Supercomputers; Data Analysis; Visualization; scratch; Project Space; Data Collection; hsi, htar; GridFTP; HPSS

  10. Scale of Data Environment Changing
     • Current NCAR Data Scale
          • HPC Scratch and DAV Space: 1 PB
          • Data Collection Space: 1 PB
          • Archive Size: 14 PB
          • HPC System: 77 Teraflops
     • NWSC Scale Projections
          • Global File System: 10-15 PB
          • ~80 GB/s burst I/O rate
          • Archive Size: 20 PB initial, growing to >170 PB by 2016
          • HPC System: ~1.5 Petaflops
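To put the figures on this slide in proportion, here is a minimal arithmetic sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes), treats the two current 1 PB spaces as the total disk capacity, and uses 170 PB and 15 PB as the 2016 archive and file system sizes. The helper names are hypothetical, and the burst rate is used as if it were sustained, which a real system would not deliver.

```python
# Rough arithmetic on the slide's numbers; helper names are illustrative only.
PB = 10**15  # assumption: decimal petabyte (the slide does not say PB vs PiB)
GB = 10**9   # assumption: decimal gigabyte


def hours_to_stream(capacity_bytes, rate_bytes_per_s):
    """Hours needed to read or write a capacity once at a constant rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


def archive_to_disk_ratio(archive_pb, disk_pb):
    """How many petabytes of archive exist per petabyte of disk."""
    return archive_pb / disk_pb


# Current scale: 14 PB archive vs. 1 PB scratch/DAV + 1 PB data collections.
current_ratio = archive_to_disk_ratio(14, 1 + 1)

# NWSC projection for 2016: >170 PB archive vs. the 15 PB upper bound on disk.
nwsc_ratio = archive_to_disk_ratio(170, 15)

# Time to stream the whole global file system once at the ~80 GB/s burst rate.
low_hours = hours_to_stream(10 * PB, 80 * GB)
high_hours = hours_to_stream(15 * PB, 80 * GB)

print(f"archive:disk now  = {current_ratio:.1f}:1")
print(f"archive:disk NWSC = {nwsc_ratio:.1f}:1")
print(f"full file system pass = {low_hours:.1f} to {high_hours:.1f} hours")
```

Under these assumptions the archive-to-disk ratio moves from 7:1 to roughly 11:1, and a single pass over the file system at burst rate takes on the order of one and a half to two days.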

  11. NWSC Conceptual Data Architecture (diagram). Labels: Remote Vis; Partner Sites; TeraGrid Sites; Science Gateways (RDA, ESG); Data Transfer Services; 10Gb/40Gb/100Gb Ethernet; HPSS (170 PB); High Bandwidth I/O Network (InfiniBand); 10Gb/40Gb Ethernet; Data Collections; Project Spaces; Scratch; Archive Interface; Storage Cluster (15 PB, 80 GB/s burst); Data Analysis, Visualization and Computational Clusters
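The Ethernet speeds named in this architecture give a feel for why data movement dominates the workflow. The sketch below estimates how long it would take to move 1 PB (the size of the current Data Collection Space) over a single 10, 40, or 100 Gb/s link. It assumes decimal units and an ideal link running at full line rate with no protocol overhead, so real transfers would be slower. The function name is hypothetical.

```python
# Idealized transfer-time estimate; ignores protocol overhead and contention.
PB = 10**15  # assumption: decimal petabyte in bytes


def transfer_hours(size_bytes, link_gbit_per_s):
    """Hours to move size_bytes over one link at full line rate."""
    bits = size_bytes * 8
    seconds = bits / (link_gbit_per_s * 10**9)
    return seconds / 3600


# Link speeds taken from the labels on the architecture slide.
estimates = {gbit: transfer_hours(1 * PB, gbit) for gbit in (10, 40, 100)}

for gbit, hours in estimates.items():
    print(f"1 PB over {gbit:>3} Gb/s: {hours:6.1f} hours ({hours / 24:.1f} days)")
```

Even in this best case, 1 PB occupies a 10 Gb/s link for more than nine days, which is consistent with the emphasis on leaving data in place through the workflow.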

  12. Summary
     • Exciting times for Data-intensive Science!
     • Many unknowns at this scale, but
          • We’re working to prepare as much as possible
     • Risk Mitigation is in the forefront
          • mid-course corrections based on current efforts
          • tools for observing changes in workflow behaviors
          • phased procurement options
     • Preparing users between now and NWSC deployment
          • Allocation, charging enhancements
          • New workflow strategies

  13. QUESTIONS? pjg@ucar.edu
