The Earth System Grid (ESG) & The Community Data Portal (CDP) (NCAR's Data & Grid Efforts) for the COMMISSION FOR BASIC SYSTEMS, INFORMATION SYSTEMS and SERVICES, INTERPROGRAMME TASK TEAM ON THE FUTURE WMO INFORMATION SYSTEM, KUALA LUMPUR, 20-24 OCTOBER 2003. Courtesy: Don Middleton, NCAR Scientific Computing Division
“Atkins Report” • “A new age has dawned…” • “The Panel’s overarching recommendation is that the National Science Foundation should establish and lead a large-scale, interagency, and internationally coordinated Advanced Cyberinfrastructure Program (ACP) to create, deploy, and apply cyberinfrastructure in ways that radically empower all scientific and engineering research and allied education. We estimate that sustained new NSF funding of $1 billion per year is needed to achieve critical mass and to leverage the coordinated co-investment from other federal agencies, universities, industry, and international sources necessary to empower a revolution. The cost of not acting quickly or at a subcritical level could be high, both in opportunities lost and in increased fragmentation and balkanization of the research.” • Atkins Report, Executive Summary
The Earth System Grid http://www.earthsystemgrid.org • U.S. DOE SciDAC funded R&D effort - a “Collaboratory Pilot Project” • Build an “Earth System Grid” that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data • Build upon Globus Toolkit and DataGrid technologies and deploy (Rubber on the road) • Potential broad application to other areas
ESG Team • ANL: Ian Foster (PI), Veronika Nefedova, (John Bresnahan), (Bill Allcock) • LBNL: Arie Shoshani, Alex Sim • ORNL: David Bernholdt, Kasidit Chanchio, Line Pouchard • LLNL/PCMDI: Bob Drach, Dean Williams (PI) • USC/ISI: Anne Chervenak, Carl Kesselman, (Laura Pearlman) • NCAR: David Brown, Luca Cinquini, Peter Fox, Jose Garcia, Don Middleton (PI), Gary Strand
Baseline Numbers • T42 CCSM (current, 280km): 7.5 GB/yr, 100 years -> 0.75 TB • T85 CCSM (140km): 29 GB/yr, 100 years -> 2.9 TB • T170 CCSM (70km): 110 GB/yr, 100 years -> 11 TB
Capacity-related Improvements • Increased turnaround, model development, ensembles of runs • Increase by a factor of 10, linear data • Current T42 CCSM: 7.5 GB/yr, 100 years -> 0.75 TB × 10 = 7.5 TB
Capability-related Improvements • Spatial resolution: T42 -> T85 -> T170; increase by a factor of ~10-20, linear data • Temporal resolution: study of the diurnal cycle, 3-hour data; increase by a factor of ~4, linear data • [Image: CCM3 at T170 (70km)]
Capability-related Improvements (cont.) • Quality: improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice; increase by another factor of 2-3, data flat • Scope: atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…; increase by another factor of 10+, linear data
Model Improvement Wishlist • Grand total: increase compute by a factor of O(1000-10000) (see the sketch below)
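A back-of-the-envelope sketch of these numbers (Python; only the factors quoted on the preceding slides are used, and how they compound is a rough reading of the deck):

```python
# Data-volume arithmetic from the preceding slides: baselines in GB/yr
# for 100-year CCSM runs, then the quoted capability factors compounded.
baseline_gb_per_yr = {"T42 (280km)": 7.5, "T85 (140km)": 29, "T170 (70km)": 110}
years = 100

for res, gb in baseline_gb_per_yr.items():
    print(f"{res}: {gb * years / 1000:.2f} TB per run")   # 0.75, 2.90, 11.00

# Compound the quoted factors (low/high ends); the quality factor is
# compute-heavy but data-flat, so it is left out of the data estimate.
capacity, spatial, temporal, scope = 10, (10, 20), 4, 10
data_low  = capacity * spatial[0] * temporal * scope   # 4000x data
data_high = capacity * spatial[1] * temporal * scope   # 8000x data
print(f"data growth: ~{data_low}x to ~{data_high}x")
# At 0.75 TB per baseline run this lands in the petabyte range,
# consistent with the ~3 PB projection for 2007 on the next slide.
```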
ESG Scenario • End 2002: 1.2 million files comprising ~75TB of data at NCAR, ORNL, LANL, NERSC, and PCMDI • End 2007: As much as 3 PB (3,000 TB) of data (!) • Current practice is already broken – the future will be even worse if something isn’t done…
ESG Scenario (cont.) • Data: different formats are converted to netCDF; netCDF is not standardized to the CF model; different sites require knowledge of different methods of access • Metadata: most is kept in online files separate from the data, unsearchable unless one is "in the know"; some is kept in people's brains • Access control: manual, not formalized • Data requests: beginnings of a formal process (e.g., the PCMDI model) and of web portals; far too much done by hand; logging nearly non-existent
ESG: Challenges • Enabling the simulation and data management team • Enabling the core research community in analyzing and visualizing results • Enabling broad multidisciplinary communities to access simulation results We need integrated scientific work environments that enable smooth WORKFLOW for knowledge development: computation, collaboration & collaboratories, data management, access, distribution, analysis, and visualization.
ESG: Strategies • Move data a minimal amount, keep it close to computational point of origin when possible • Data access protocols, distributed analysis • When we must move data, do it fast and with a minimum amount of human intervention • Storage Resource Management, fast networks • Keep track of what we have, particularly what’s on deep storage • Metadata and Replica Catalogs • Harness a federation of sites, web portals • Globus Toolkit -> The Earth System Grid -> The UltraDataGrid
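A conceptual sketch of the "keep track of what we have" strategy (every name below is a hypothetical illustration, not an ESG or Globus API): discovery metadata maps a query to logical file names, a replica catalog maps each logical name to physical locations, and the mover fetches from the cheapest one.

```python
# Hypothetical illustration of the metadata -> replica-catalog -> transfer
# flow. None of these functions are real ESG APIs; they only name the steps.

def find_datasets(metadata_catalog, query):
    """Search discovery metadata, returning logical file names."""
    return [lfn for lfn, meta in metadata_catalog.items() if query(meta)]

def best_replica(replica_catalog, lfn):
    """Pick a physical replica (e.g., prefer disk cache over deep storage)."""
    return min(replica_catalog[lfn], key=lambda r: r["cost"])

# Toy catalogs standing in for the metadata and replica services:
metadata = {"ccsm/T85/run1/tas.nc": {"model": "CCSM", "res": "T85"}}
replicas = {"ccsm/T85/run1/tas.nc": [
    {"url": "gsiftp://ncar.example.edu/cache/tas.nc", "cost": 1},   # disk
    {"url": "hpss://ornl.example.gov/deep/tas.nc",    "cost": 10},  # tape
]}

for lfn in find_datasets(metadata, lambda m: m["res"] == "T85"):
    print(lfn, "->", best_replica(replicas, lfn)["url"])
```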
Storage/Data Management • [Diagram] A client handles selection, control, and monitoring of HRM servers fronting tera/peta-scale archives: tools for reliable staging, transport, and replication.
HRM aka “DataMover” • Running well across DOE/HPSS systems • New component built that abstracts NCAR Mass Storage System • Defining next generation of requirements with climate production group • First “real” usage “The bottom line is that it now works fine and is over 100 times faster than what I was doing before. As important as two orders of magnitude increase in throughput is, more importantly I can see a path that will essentially reduce my own time spent on file transfers to zero in the development of the climate model database” – Mike Wehner, LBNL
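For scale, the machinery underneath is GridFTP; a rough illustration of a parallel-stream copy with the Globus Toolkit's globus-url-copy CLI (hosts and paths below are fictitious, and this is the bare toolkit command rather than DataMover itself):

```python
# Illustration only: a GridFTP third-party transfer via the Globus
# Toolkit's globus-url-copy CLI. Hosts and paths are fictitious.
import subprocess

src = "gsiftp://hpss-gateway.example.gov/climate/run1/tas_0100.nc"
dst = "gsiftp://mss-gateway.example.edu/CCSM/run1/tas_0100.nc"

# -p 8 requests 8 parallel TCP streams, the main lever behind the
# throughput gains quoted in this slide's testimonial.
subprocess.run(["globus-url-copy", "-p", "8", src, dst], check=True)
```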
OPeNDAP An Open Source Project for a Network Data Access Protocol (originally DODS, the Distributed Oceanographic Data System)
OPeNDAP-g • Transparency • Performance • Security • Authorization • (Processing) • [Diagram: Distributed Data Access Services] A typical application links its netCDF library against an OPeNDAP client and reaches local data through an OPeNDAP server (OPeNDAP via http); a distributed application uses an ESG client and ESG servers (OPeNDAP via Grid, ESG + DODS) to reach remote data and big data spanning multiple remote sites.
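The transparency goal is that one file-oriented API serves local and remote data alike; a minimal sketch (netCDF4-python is a modern stand-in postdating this talk, and the URL is a placeholder):

```python
# The same netCDF API opens a local file or a remote OPeNDAP URL;
# subsetting happens server-side, so only the requested slice moves.
from netCDF4 import Dataset

ds = Dataset("http://dataportal.example.edu/dods/ccsm/run1/tas.nc")
tas = ds.variables["tas"][0, :, :]   # one time slice, fetched on demand
print(tas.shape)
ds.close()
```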
ESG: NcML Core Schema • For XML encoding of the metadata (and data) of any generic netCDF file • Objects: netCDF, dimension, variable, attribute • Beta-version reference implementation as a Java library (http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm) • [Schema diagram] nc:netCDFType contains nc:dimension, nc:variable (of nc:VariableType, holding nc:values and nc:attribute), and nc:attribute
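To make the object model concrete, a sketch that walks a netCDF header and emits NcML-style elements; netCDF4-python stands in for the Java reference library, and the exact NcML attribute spellings are assumptions:

```python
# Emit NcML-style XML (netcdf / dimension / variable / attribute elements)
# from a netCDF file's header. Element names follow the slide; attribute
# names and the lack of a namespace are simplifying assumptions.
import xml.etree.ElementTree as ET
from netCDF4 import Dataset

def to_ncml(path):
    ds = Dataset(path)
    root = ET.Element("netcdf", location=path)
    for name, dim in ds.dimensions.items():
        ET.SubElement(root, "dimension", name=name, length=str(len(dim)))
    for name, var in ds.variables.items():
        v = ET.SubElement(root, "variable", name=name,
                          type=str(var.dtype), shape=" ".join(var.dimensions))
        for att in var.ncattrs():
            ET.SubElement(v, "attribute", name=att, value=str(var.getncattr(att)))
    ds.close()
    return ET.tostring(root, encoding="unicode")

print(to_ncml("tas.nc"))  # placeholder filename
```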
[UML metadata schema diagram] Legend: isA (inheritance) and association links; every Object carries an id. Person (firstName, lastName, contact) worksFor Institution (name, type, contact), and both can appear as a participant (with a role) in an Activity (name, description, rights, dates, notes, participants, references). Project (topics, funding), Campaign, Investigation, and Ensemble specialize Activity; Experiment, Analysis, Observation, and Simulation (simulationInput, simulationHardware) specialize Investigation, linked by hasParent/hasChild/hasSibling. Dataset (type, conventions, dates, formats with URIs, timeCoverage, spaceCoverage) isPartOf larger Datasets and is generatedBy an Activity; Service (name, description) is referenced via serviceId.
ESG Metadata Progress • Co-developed NcML with Unidata • CF conventions in progress, almost done • Developed & evaluated a prototype metadata system • Finalized an initial schema for PCM/CCSM • Addressing interoperability with federal standards and NASA/GCMD via generation of DIF/FGDC/ISO • Addressing interoperability with digital libraries via creation of Dublin Core • Testing relational and native-XML databases, and OGSA-DAI • Exploratory work toward a first-generation ontology • Authoring of discovery metadata in progress
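For the digital-library interoperability item, a sketch of emitting a simple Dublin Core record (the dc namespace is the standard one; the element choice and values are invented):

```python
# Sketch: wrap a dataset's discovery metadata as simple Dublin Core,
# the digital-library interchange format named on this slide.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("metadata")
for elem, value in [("title",   "CCSM b20.007 monthly surface temperature"),
                    ("creator", "NCAR Community Climate System Model"),
                    ("date",    "2003-10"),
                    ("format",  "netCDF")]:
    ET.SubElement(record, f"{{{DC}}}{elem}").text = value

print(ET.tostring(record, encoding="unicode"))
```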
ESG Topology • [Deployment diagram] Sites: ANL (CAS), LBNL and ORNL (gridFTP servers and HRMs fronting HPSS), NCAR (gridFTP server, HRM, disk cache, MSS, an LAS server for visualization, and a GRAM gatekeeper for execution), LLNL (disk, gridFTP, HRM). RLS replica catalogs at multiple sites cross-update one another. The ESG web portal (Tomcat/Struts) handles query, submit, and authenticate (MyProxy at ISI), with metadata queries served by OGSA-DAI over a MySQL RDBMS.
Collaborations & Relationships • CCSM Data Management Group • The Globus Project • Other SciDAC Projects: Climate, Security & Policy for Group Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit • OPeNDAP/DODS (multi-agency) • NSF National Science Digital Libraries Program (UCAR & Unidata THREDDS Project) • U.K. e-Science and British Atmospheric Data Center • NOAA NOMADS and CEOS-grid • Earth Science Portal group (multi-agency, intnl.)
Immediate Directions • Broaden usage of DataMover and refine • Continue building metadata catalogs • Revisit overall security model and consider simplified approaches • Redesign and implement user interface • Alpha version of OPeNDAP-g • Test and evaluate with client applications • Develop automation for data publishing (GT3) • Deploy for IPCC runs
The Community Data Portal (CDP) “The dataportal has changed my life…” Ben Kirtman, COLA • Provide a common portal to NCAR, UCAR, and university data • Provide a sustainable cyberinfrastructure that dramatically lowers the cost of sharing data (there is HUGE interest in this) • Directly couple to simulation systems and DataMonster • Begin capturing rich metadata and catalog our scientific experiments for the world • MSS -> A Petascale Mass Knowledge System • Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM, GEON, etc.)
Foster Revolutionary Change • Mass Storage System (1.5 PB) -> Petascale Knowledge Repository • Establish a new paradigm for managing and accessing scientific data based on semantic organization.
Community Data Portal • Purpose: • Build an infrastructure offering multiple methods for data exploration and delivery • Web-based retrieval and interactive analysis for MSS collections • Data sharing for multi-institution cooperative studies • Browse, select, compare, and download data sets, and specify data subsets via graphical selection, text entry, and a choice of output formats • Components: • User interface: Live Access Server (LAS) • Middleware: Ferret, NCL, GrADS • File service: local or DODS • Status: • Pilot working (2 years), more middleware testing
[Architecture diagram] A Live Access Client talks to the Live Access Server, which performs data access through engines (Ferret, NCL, others) and DODS against massive simulation and retrospective data collections: CSM, PCM, DSS, MM5, WRF, MICOM, CMIWG.
Community Data Portal architecture • [Layered diagram] User interface: Struts-based UIs, GDS, the DODS aggregation server, and LAS, each running under Tomcat. Middleware: catalog parsing & metadata ingestion, catalog browsing, data access (OPeNDAP, FTP, HTTP), data visualization (NCL, Ferret), MSS data retrieval, and data search & discovery core services. Hardware: dataportal.ucar.edu, RAID disks, and the MSS.
Community Data Portal Metadata Software • [Dataflow diagram] A THREDDS catalog parser application parses THREDDS catalogs, which reference ESG, Dublin Core, NcML, and other metadata. It stores each full XML document in a native XML database (Xindice) and shreds the XML into tables in a relational database (MySQL). Web applications build on these stores: an XML viewer displays documents via schema-specific stylesheets; simple queries (SQL) today, and advanced queries (XPath, XQuery) in future, return lists of triplets (dataset id, metadata schema, metadata URL); a THREDDS catalogs browser links to the Search & Discovery web application.
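A sketch of the "simple query (SQL)" path over the shredded tables (all table and column names are invented, and SQLite stands in for MySQL):

```python
# Hypothetical sketch of the shredded relational store and the "simple
# query" returning (dataset id, metadata schema, metadata URL) triplets.
# Table and column names are invented; SQLite stands in for MySQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (dataset_id TEXT, md_schema TEXT, md_url TEXT)")
db.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
    ("ccsm.b20.007", "NcML",        "http://dataportal.ucar.edu/md/1.xml"),
    ("ccsm.b20.007", "Dublin Core", "http://dataportal.ucar.edu/md/2.xml"),
])

# All metadata records for datasets matching a pattern:
for triplet in db.execute(
        "SELECT dataset_id, md_schema, md_url FROM metadata "
        "WHERE dataset_id LIKE ?", ("ccsm.%",)):
    print(triplet)
```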
CDP Data/Catalog Contributors • ACD: MOZART v2.1 standard run (Louisa Emmons) • ATD: Radar almost ready for today! • CGD: CAS satellite data example (Lesley Smith) • CGD: CDAS and VEMAP data (Steve Aulenbach, Nan Rosenbloom, Dave Schimel) • CGD: CCSM 1000-year run (Lawrence Buja) • CGD: PCM 16 top datasets (Gary Strand) • SCD: DSS full data holdings (Bob Dattore, Steve Worley) • SCD: VETS example visualization catalog (Markus Stobbs, Luca Cinquini) • COLA: Jennifer Adams, Jim Kinter, Brian Doty
Next Steps • Recruiting (!) • One student for data ingest • One software engineer • Systems • Expanding storage by 20TB (SCD cosponsor) • Ongoing publication of datasets • Publishing documents on plans, design, how to partner, standard services, and management procedures • Building partnerships, DMWG meeting August
Closing Thoughts • Building a sustainable infrastructure for the long-term • Difficult, expensive, and time-consuming • Requires longer-term projects • Team-building is a critical process • Collaboration technologies really help • Managing all the collaborations is a challenge • But extremely valuable • Good progress, first real usage
Links • Earth System Grid • www.earthsystemgrid.org • Community Data Portal • dataportal.ucar.edu
We Will Examine Practically Every Aspect of the Earth System from Space in This Decade • Longer-term missions (observation of key Earth-system interactions): Aqua, Terra, Landsat 7, Aura, ICESat, Jason-1, QuikSCAT • Exploratory (explore specific Earth-system processes and parameters, and demonstrate technologies): Triana, GRACE, SRTM, VCL, CloudSat, EO-1, PICASSO • Courtesy of Tim Killeen, NCAR
Characteristics of Infrastructure • Essential: so important that it becomes ubiquitous • Reliable: e.g., the built environment of the Roman Empire • Expensive: nothing succeeds like excess (e.g., the Interstate system) • Inherently one-off (often, few economies of scale) • Clear factorization between research and practice • Generally deploy what provably works
CDP Interactions & Opportunities • COLA • CGD/VEMAP • ACD,HAO/WACCM • CGD/CCSM, CAM • CGD/CAS • MMM/WRF • UCAR/JOSS • UCAR/Unidata • CGD,SCD,CU/GridBGC • NOAA/NOMADS • GODAE • HAO/TIEGCM, MLSO • ATD/Radar, HIAPER • ACD/Mozart, BVOC, Aqua proposal • BioGeo/CDAS • SCD/DSS • DOE/Earth System Grid • DLESE • GIS Initiative