Cyberinfrastructure Challenges for Environmental Observatories Barbara Minsker Director, Environmental Engineering, Science, & Hydrology Group, National Center for Supercomputing Applications; Professor, Dept of Civil & Environ. Engineering; University of Illinois, Urbana, IL, USA January 9, 2007 National Center for Supercomputing Applications
Background • NSF Office of Cyberinfrastructure is funding NCSA and SDSC to: • Work with leading-edge communities to develop cyberinfrastructure to support science and engineering • Incorporate successful prototypes into a persistent cyberinfrastructure • NCSA runs the CLEANER Project Office, which is leading planning for the WATERS Network, one of three NSF-proposed environmental observatories • Co-Directors: Barbara Minsker, Jerald Schnoor (U of Iowa), Chuck Haas (Drexel U) • To support WATERS planning, NCSA's Environmental CyberInfrastructure Demonstrator (ECID) project is creating a prototype CI • Driven by requirements gathering and close community collaborations
WATERS Network: WATer and Environmental Research Systems Network • Joint collaboration between the CLEANER Project Office and CUAHSI, Inc., sponsored by the ENG & GEO Directorates at the National Science Foundation (NSF) • CLEANER = Collaborative Large Scale Engineering Analysis Network for Environmental Research • CUAHSI = Consortium of Universities for the Advancement of Hydrologic Science • Planning underway to build a nationwide environmental observatory network using NSF's Major Research Equipment and Facility Construction (MREFC) funding • Target construction date: 2011 • Target operation date: 2015
WATERS DRAFT VISION The WATERS Network will transform our understanding of the Earth’s water and related biogeochemical cycles across multiple spatial and temporal scales to enable forecasting and management of critical water processes affected by human activities.
WATERS DRAFT GRAND CHALLENGES • To detect the interactions of human activities and natural perturbations with the quantity, distribution and quality of water in real time. • To predict the patterns and variability of processes affecting the quantity and quality of water at scales from local to continental. • To achieve optimal management of water resources through the use of institutional and economic instruments.
Network Design Principles • Enable multi-scale, dynamic predictive modeling for water, sediment, and water quality (flux, flow paths, rates), including: near-real-time assimilation of data; feedback for observatory design; point- to national-scale prediction • Network provides data sets and a framework to test: sufficiency of the data; alternative model conceptualizations • Master design variables: scale; climate (arid vs. humid); coastal vs. inland; land use, land cover, population density; energy and materials/industry; landform and geology • Nested (where appropriate) observatories over a range of scales: point; plot (100 m2); subcatchment (2 km2); catchment (10 km2, single land use); watershed (100–10,000 km2, mixed use); basin (10,000–100,000 km2); continental • Environmental Field Facilities (EFFs) at each observatory scale
CI Requirements Gathering • Interviews at conferences and meetings (Tom Finholt and staff, U. of Michigan) • Usability studies (NCSA, Wentling group) • Community survey (Finholt group) • AEESP and CUAHSI surveyed in 2006 as proxies for environmental engineering and hydrology communities • 313 responses out of 600 surveys mailed (52.2% response rate) • Key findings are driving ECID cyberenvironment development
What is the single most important obstacle to using data from different sources? • Nonstandard/inconsistent units and formats • Metadata problems • Other obstacles • 55% concerned about insufficient credit for shared data • N = 278
What three software packages do you use most frequently in your work? • Other responses included: MS Word, MS PowerPoint, statistics applications (e.g., Stata, R, S-Plus), SigmaPlot, PHREEQC, MathCAD, FORTRAN compiler, Mathematica, GRASS GIS, groundwater models, Modflow • The majority are not using high-end computational tools.
Factors influencing technology adoption • Ease of use, good support, and new capabilities are essential.
What are the three most compelling factors that would lead you to collaborate with another person in your field? • The community seeks collaborations to gain different expertise.
WATERS CI Challenges • Clearly, the first requirement for observatory CI is that the community must gain access to observatory data • However, simply delivering the data through a Web portal will not allow the observatories to reach their full potential or meet the community's requirements
WATERS CI Challenges, Cont'd. • Understanding data quality and getting credit for data sharing requires an integrated provenance system to track what has been done with the data • Enabling users who do not have strong computational skills to work with the flood of environmental data requires: • Easy-to-use tools for manipulating large data sets, analyzing them, and assimilating them into models • Workflow integrators that allow users to integrate their tools and models with real-time streaming environmental data • The vast community of observatory users and the resources they generate create a need for knowledge networking tools to help them find collaborators, data, workflows, publications, etc. • To address these requirements, cyberenvironments are needed
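A provenance system of the kind described above can be pictured as an append-only log of every operation applied to a dataset, from acquisition to publication. The following is a minimal in-memory sketch; the class, method names, and dataset identifier are hypothetical, not ECID's actual API:

```python
import datetime

class ProvenanceLog:
    """Records each operation applied to a dataset so its full
    processing history can be inspected, audited, or cited later."""

    def __init__(self, dataset_id):
        self.dataset_id = dataset_id
        self.records = []

    def record(self, operation, actor, details=""):
        # Each entry captures who did what to the data, and when.
        self.records.append({
            "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "operation": operation,
            "actor": actor,
            "details": details,
        })

    def history(self):
        # Human-readable lineage, oldest entry first.
        return [f"{r['operation']} by {r['actor']}: {r['details']}"
                for r in self.records]

# Hypothetical lifecycle of one sensor dataset.
log = ProvenanceLog("ccbay-salinity")
log.record("acquire", "sensor-net", "raw telemetry from TCOON station")
log.record("qaqc", "anomaly-detector", "flagged 3 spikes as sensor faults")
log.record("publish", "researcher", "cited in hypoxia study")
print(log.history())
```

Because every QA/QC step and every reuse is logged against the dataset, credit for shared data and quality assessments can be derived automatically from the log rather than reconstructed by hand.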
Environmental CI Architecture (diagram): the research process (create hypothesis; obtain data; analyze data and/or assimilate into model(s); link and/or run analyses and/or model(s); discuss results; publish) is supported by an integrated CI of research services: data services (HIS project focus), workflow and model services, meta-workflows, knowledge services, collaboration services, and a digital library, built on supporting technology. The ECID project focuses on cyberenvironments spanning these services.
Cyberenvironments • Couple traditional desktop computing environments with the resources and capabilities of a national cyberinfrastructure • Provide unprecedented ability to access, integrate, automate, and manage complex, collaborative projects across disciplinary and geographical boundaries • ECID is demonstrating how cyberenvironments can: • Support observatory sensor and event management, workflow and scientific analyses, and knowledge networking, including provenance information to track data from creation to publication • Provide collaborative environments where scientists, educators, and practitioners can acquire, share, and discuss data and information • The cyberenvironments are designed with a flexible, service-oriented architecture, so that different components can be substituted with ease
ECID Cyberenvironment Components • CyberCollaboratory: collaborative portal • CI-KNOW: network browser/recommender • CyberIntegrator: exploratory workflow integration • CUAHSI HIS data services • Tupelo metadata services • Community event management/processing • Single sign-on (SSO) security (coming)
CyberIntegrator • Studying complex environmental systems requires: • Coupling analyses and models • Real-time, automated updating of analyses and modeling with diverse tools • CyberIntegrator is a prototype workflow executor technology to support exploratory modeling and analysis of complex systems. It integrates the following tools to date: • Excel • IM2Learn image processing and mining tools, including ArcGIS image loading • D2K data mining • Java codes, including event management tools • Matlab & Fortran codes to be added soon. Additional tools will be included based on high-priority needs of beta users.
CyberIntegrator Architecture • Example of CyberIntegrator use: Carrie Gibson created a fecal coliform prediction model in ArcGIS using Model Builder that predicts annual average concentrations. Ernest To rewrote the model as a macro in Excel to perform Monte Carlo simulation to predict median and 90th percentile values. CyberIntegrator's goal: reduce the manual labor in linking these tools, visualizing the results, and updating in real time.
Real-Time Simulation of Copano Bay TMDL with CyberIntegrator (diagram): USGS daily streamflows (via web services) and shapefiles for Copano Bay feed a four-step CyberIntegrator workflow: (1) streamflows to distributions (Excel executor); (2) fecal coliform concentrations model (Excel executor); (3) load shapefiles (Im2Learn executor); (4) geo-reference and visualize results (Im2Learn executor).
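The four-step pipeline can be sketched as chained functions. Everything below is a toy stand-in: synthetic flows replace the USGS web service, a one-line scaling rule (with a made-up `loading_coeff`) replaces the actual Excel fecal coliform model, and the shapefile/visualization steps are reduced to a text report:

```python
import statistics

def fetch_streamflows():
    # Placeholder for the USGS daily-streamflow web service:
    # just returns synthetic daily flows.
    return [120.0, 135.0, 98.0, 210.0, 180.0, 150.0, 165.0]

def flows_to_distribution(flows):
    # Step 1: summarize streamflows as a simple distribution.
    return {"mean": statistics.mean(flows),
            "stdev": statistics.stdev(flows)}

def coliform_model(dist, loading_coeff=0.8):
    # Step 2: hypothetical stand-in for the Excel fecal coliform
    # model; concentration scales with mean flow.
    return loading_coeff * dist["mean"]

def visualize(concentration):
    # Steps 3-4 (shapefile loading, geo-referenced visualization)
    # are reduced to a text report in this sketch.
    return f"Predicted mean fecal coliform: {concentration:.1f} cfu/100mL"

flows = fetch_streamflows()
report = visualize(coliform_model(flows_to_distribution(flows)))
print(report)
```

The point of the sketch is the shape of the workflow: each executor consumes the previous step's output, so swapping the synthetic fetch for a live web-service call is what turns this into a real-time simulation.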
Sensor Anomaly Detection Scenario: An anomaly detector workflow, shared to the server via CyberIntegrator, listens for sensor data events and creates an event when an anomaly is discovered. The user subscribes to anomaly detector workflows; the Dashboard alerts the user to anomaly detections along with other events (logged-in users, new documents, etc.). The CCBay sensor map shows nearby related sensors so the user can check the data. If the anomaly detector is faulty, CI-KNOW recommends an alternate anomaly detector from the Chesapeake Bay observatory; CyberIntegrator loads the recommended workflow, and the user adjusts its parameters to the CCBay sensor.
CyberDashboard architecture (diagram): the desktop CyberDashboard application exchanges raw-data and anomaly subscriptions and publications over JMS with a broker (ActiveMQ 4.0.1). The CyberCollaboratory sensor page is referenced by URL; the CyberIntegrator workflow service and the CI-KNOW recommender network expose SOAP Web services for workflow publication and retrieval. ECID-managed data and metadata (Tupelo and an RDBMS) hold provenance, user subscriptions, workflow templates, semantic content, and event topics.
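The publish/subscribe pattern at the heart of this architecture can be sketched in-process. The topic names and anomaly threshold below are hypothetical, and a real deployment would use a JMS broker such as ActiveMQ rather than in-memory callbacks:

```python
from collections import defaultdict

class EventManager:
    """Minimal in-process stand-in for a JMS broker: components
    subscribe to topics and receive every message published there."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = EventManager()
alerts = []

# Anomaly detector: subscribes to raw data, republishes anomalies.
def anomaly_detector(reading):
    if abs(reading["value"]) > 10:        # toy threshold
        broker.publish("anomalies", reading)

# Dashboard: subscribes to anomalies and alerts the user.
broker.subscribe("raw-data", anomaly_detector)
broker.subscribe("anomalies", lambda reading: alerts.append(reading))

for value in [1.2, 0.8, 42.0, 2.1]:
    broker.publish("raw-data", {"sensor": "ccbay-01", "value": value})

print(alerts)   # only the 42.0 reading triggered an anomaly event
```

The decoupling is the design point: the sensor feed, the detector, and the dashboard never reference one another directly, so any component can be swapped (e.g., a recommended detector from another observatory) without touching the rest.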
ECID & Corpus Christi Bay (CCBay) WATERS Observatory Testbed • CCBay WATERS Observatory Testbed is one of 10 observatory testbeds recently funded by NSF • Collaboration of environmental engineering, hydrology, biology, and information technology researchers • Goal of the testbed: • Integrate ECID and HIS technology to create an end-to-end environmental information system • Use the technology to study hypoxia in CCBay • Use real-time data streams from diverse monitoring systems to predict hypoxia one day ahead • Mobilize manual sampling crews when conditions are right
Sensors in Corpus Christi Bay (map): national datasets (national HIS): USGS gages and an NCDC station; regional datasets (workgroup HIS): TCOON, TCEQ, and SERF stations and Dr. Paul Montagna's stations; hypoxic regions are also shown.
CCBay Environmental Information System (diagram): CCBay sensors trigger event-driven workflow execution (an anomaly detector and a hypoxia predictor), producing Dashboard alerts, event-driven research, and storage for later research. CyberIntegrator generates forecasts; the CyberCollaboratory is used to contact collaborators.
CCBay Near-Real-Time Hypoxia Prediction (diagram): sensor net data (C++ code) flows through anomaly detection and error replacement/removal (IM2Learn workflows) into a data archive. A hypoxia model integrator couples the hydrodynamic and water quality models (Fortran numerical models, with boundary condition model updates) with hypoxia machine learning models (D2K workflows), and supports visualization of hydrodynamics and hypoxia risk.
CCBay CI Challenges • Automating QA/QC in a real-time network • David Hill is creating sensor anomaly detectors using statistical models (autoregressive models using naïve, clustering, perceptron, and artificial neural network approaches; and multi-sensor models using dynamic Bayesian networks) • While statistical models can identify anomalies, it is sometimes difficult to differentiate sensor errors from unusual environmental phenomena • Getting access to the data, which are collected by different groups and stored in multiple formats in different locations • The project is defining a common data dictionary and units and will build Web services to translate
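The simplest of the approaches named above, a naïve autoregressive detector, might look like the following sketch (not the project's actual implementation): fit an AR(1) coefficient by least squares, then flag points whose one-step prediction error deviates far from the typical error.

```python
import statistics

def ar1_anomalies(series, k=3.0):
    """Naive autoregressive anomaly detector: flag points whose
    one-step AR(1) prediction error is more than k standard
    deviations from the mean error."""
    # Fit phi by least squares on lag-1 pairs: x[t] ~ phi * x[t-1].
    pairs = list(zip(series[:-1], series[1:]))
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    phi = num / den
    # Prediction errors; residuals are centered because the fit
    # has no intercept term.
    residuals = [y - phi * x for x, y in pairs]
    mu = statistics.mean(residuals)
    sigma = statistics.stdev(residuals)
    return [i + 1 for i, r in enumerate(residuals)
            if abs(r - mu) > k * sigma]

# Smooth synthetic series with one injected spike at index 10.
series = [20 + 0.1 * t for t in range(20)]
series[10] = 80.0
print(ar1_anomalies(series))   # -> [10], the injected spike
```

As the slide notes, such a detector cannot by itself distinguish a sensor fault from a genuine extreme event; that judgment still needs corroborating sensors or manual review.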
CCBay CI Challenges, Cont'd. • Integrating data into diverse models • Calibration uses historical data and is typically done by hand • Near-real-time updating needs automated approaches • Models are complex, and derivative-based calibration approaches would be difficult to implement • Model integration • Grids change from one type of model to another – a common coarse grid is being defined, with finer grids overlaid where needed • Data transformers must be built between models
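When derivatives are unavailable or unreliable, calibration can fall back on derivative-free search. The sketch below uses a simple random search over a toy one-parameter model; the model, parameter bounds, and observations are all illustrative, not the CCBay models:

```python
import random

def model(decay, inflow=100.0):
    # Toy one-parameter water quality model: steady-state
    # concentration under first-order decay.
    return inflow / (1.0 + decay)

def calibrate(observed, trials=2000, seed=42):
    """Derivative-free calibration: random search over the decay
    parameter, keeping whichever value best fits the observations."""
    rng = random.Random(seed)
    best_decay, best_err = None, float("inf")
    for _ in range(trials):
        decay = rng.uniform(0.0, 5.0)
        err = sum((model(decay) - obs) ** 2 for obs in observed)
        if err < best_err:
            best_decay, best_err = decay, err
    return best_decay

# Synthetic "historical" observations generated with decay = 1.5.
observed = [model(1.5) + noise for noise in (-0.2, 0.1, 0.05)]
decay_hat = calibrate(observed)
print(round(decay_hat, 2))
```

Random search is crude but needs only forward model runs, which is why derivative-free methods suit complex coupled models; in an automated near-real-time setting the same loop would simply rerun as new data arrive.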
Conclusions • Creating CI for environmental data is challenging, but the benefits in enabling larger-scale, near-real-time research will be enormous • The ECID Cyberenvironment demonstrates the benefits of end-to-end integration of cyberinfrastructure and desktop tools, including: HIS-type data services; workflow; event management; provenance and knowledge management; and collaboration for supporting environmental researchers, educators, and outreach partners • This creates a powerful system for linking observatory operations with flexible, investigator-driven research in a community framework (i.e., the national network) • Workflow and knowledge management support testing hypotheses across observatories • Provenance supports QA/QC and rewards for community contributions in an automated fashion
Acknowledgments • Contributors: • NCSA ECID team (Peter Bajcsy, Noshir Contractor, Steve Downey, Joe Futrelle, Hank Green, Rob Kooper, Yong Liu, Luigi Marini, Jim Myers, Mary Pietrowicz, Tim Wentling, York Yao, Inna Zharnitsky) • Corpus Christi Bay Testbed team (PIs: Jim Bonner, Ben Hodges, David Maidment, Barbara Minsker, Paul Montagna) • Funding sources: • NSF grants BES-0414259, BES-0533513, and SCI-0525308 • Office of Naval Research grant N00014-04-1-0437