Exercises:
• Visit sites:
  • GCMD – http://gcmd.nasa.gov/
  • ORNL DAAC – http://www-eosdis.ornl.gov/
  • NBII – http://nbii.gov/
  • ESA – http://esapubs.org/archive/
• Video
  • Hans Rosling
Introduction to SEEK & CI
William Michener
LTER Network Office, University of New Mexico
January 2007
Cyberinfrastructure for Environmental Biology
• Environmental sciences increasingly focus on collaboration and synthesis
• Cyberinfrastructure supports science by:
  • Supporting data access and discovery
  • Facilitating the integration of heterogeneous data
  • Enabling complex analysis, modeling and forecasting
Outline
• Cyberinfrastructure challenges
• Overview of the Science Environment for Ecological Knowledge (SEEK) architecture
  • EcoGrid, Kepler, Semantic Mediation System
• Scientific Workflows and Kepler
• Semantics in integration, analysis and modeling
Access, integration, and analysis
• Synthesis projects have distinctive needs
  • Need to access large numbers of data sets
    • Little of the data are 'theirs', so they know few details about them
    • Studies are 'data limited'
    • Must engage the broader community to really solve the access issue
  • Need to integrate those data in a meaningful way
    • Studies not designed to be used together, so many pitfalls
    • Integrated product differs for every synthesis project
  • Need to analyze and model the data efficiently and collaboratively
    • Leverage dynamic data loading to increase efficiency
    • Use scientific workflow systems to work collaboratively
Dilemma: no unified model
• No single database suffices
  • Numerous data warehouses exist, but they are not extensible to all data
    • VegBank, ClimDB, GenBank, PDB, etc.
  • Data warehouses use federated schemas
    • Any data that do not fit are not captured
    • This is a form of data integration for one purpose
• Custom development for thousands of databases is not feasible
Discovery, Access, and Archive
• Effectively store and archive data
• Effectively locate and access data from dispersed collections
• Approaches
  • Repositories for some data exist
    • KNB Metacat, SRB, EcoGrid
    • Professional Society registries
  • Structured metadata + ontologies
    • Ecological Metadata Language
    • FGDC metadata not sufficient
  • Smart search
  • Replication
• Challenge areas
  • Better recall and precision
  • Exploiting semantics
  • Building effective ontologies
  • Need to target individual scientists + students
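Recall and precision, listed above as a challenge area, have standard definitions in information retrieval: precision is the fraction of retrieved data sets that are relevant, and recall is the fraction of relevant data sets that were retrieved. A minimal Python sketch follows; the data set identifiers are made up for illustration and do not come from any real catalog.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one search result."""
    hits = len(retrieved & relevant)  # relevant items that were actually found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: four data sets returned, three truly relevant overall.
retrieved = {"ds1", "ds2", "ds3", "ds4"}
relevant = {"ds2", "ds4", "ds5"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here two of the four returned data sets are relevant (precision 0.50), and two of the three relevant data sets were found (recall about 0.67).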
Data sharing is increasing
• KNB Metacat
• ESA Data Registry
• NCEAS
• LTER (LNO, SBC)
• PISCO
• OBFS
• UC Natural Reserves
• Pelagic Fisheries DB
• LITS DB (UK)
• Kruger Nat. Park (SA)
• OSU, FIU
[Figure: Data Packages in the KNB. Cumulative count (axis 0 to 12,000) by year, 2002 to 2006]
Metadata
Loosely coupled data repositories accommodate heterogeneity
• Metadata: an absolute necessity
  • Ecological Metadata Language
  • FGDC/NBII metadata (good but insufficient)
Data heterogeneity
• Data are heterogeneous
  • Differing formats, logical organization, and interpretation
  • Syntax
    • Format of the data (e.g., csv, NetCDF, Excel, etc.)
  • Schema
    • Logical model of the data (e.g., relational models, hierarchical models, etc.)
  • Semantics
    • Meaning of the data (e.g., conceptual links, formalized methods, interpretation)
• Broad array of relevant data sources
  • Ecological (population survey, community survey, behavioral, etc.)
  • Physical (hydrology, meteorology, chemistry, etc.)
  • Social (demographic data, land use patterns, policy information, etc.)
  • Economic (economic valuations, demographic data, etc.)
Data Integration
• Combining heterogeneous data is necessary for synthesis
• Approaches
  • Manual
  • Semi-automated integration that leverages domain knowledge
• Challenges
  • Integration constrained by intended analyses as well as data input
    • Not the traditional data warehouse approach
  • Difficult to build consistent knowledge base
  • Automated reasoning tested on small data sources
  • Little semantics support in software tools for domain science
Integration needs to be:
• Ad-hoc (no global view)
• Fast (hours instead of months)
Analysis and Modeling
• Current practices are ad-hoc and non-repeatable
• Model the steps used by researchers during analysis
  • Graphical model of flow of data among processing steps
  • Each step often occurs in different software
• Refer to these graphs as ‘Scientific Workflows’
[Diagram: Data, Clean, Analyze, Graph]
Science Environment for Ecological Knowledge
• SEEK extends informatics approaches to improve analysis and modeling to support broad scale synthesis
  • Expose ecological, biodiversity, environmental data through a common architecture
  • Create a framework for executing, preserving and communicating complex quantitative analytical processes
  • Address myriad challenges associated with integrating heterogeneous data for use in analysis
EcoGrid
• Data access to diverse data systems
  • Lightweight web service interfaces
  • Common query syntax
• Common mechanism to access
  • ecological data (hundreds of field stations)
  • museum specimen data (hundreds of museums)
  • environmental data (data in SRB at SDSC)
  • geological data (GEON portal)
Kepler: Analysis and Modeling
• Scientific workflow paradigm
  • Models data flow among modular components
  • Improved user interface for complex processes
• Benefits
  • Improves documentation
  • Simplifies sharing of custom models with colleagues
  • Promotes modular components
  • Hierarchical models can hide complexity
  • Direct access to data via EcoGrid
  • Access common analysis tools (e.g., R, Matlab) from a single framework
Semantic Mediation
• Mediation layer for Kepler and EcoGrid
• Addresses data heterogeneity and integration issues
• Uses a formal reasoning approach for
  • Smart data discovery
  • Semi-automated data integration
  • Workflow design
  • Workflow validation
• Relies upon a good knowledge model
  • Developed by the Knowledge Representation group
Knowledge Representation
• Ecologists and computer scientists together capture critical knowledge about ecological data
• Extensible Observation Ontology (OBOE) captures the semantics of scientific data
  • Semantics of observations and measurements
  • Unit types
  • Observation context
  • Sampling hierarchies
• Used by the Semantic Mediation system
Evolving collaborations
• Ecological Metadata Language – started in 1997
• KNB/Morpho/Metacat – KDI 1999
• Lifemapper – KDI 1998
• Kepler – SEEK ITR 2002
• Production work – Mellon Foundation 2002
• Collaboration… organic and evolving
Kepler Collaboration
• Open-source
• Builds on Ptolemy II from UC Berkeley
• Collaborators
  • SEEK Project
  • SciDAC SDM Center
  • Ptolemy Project
  • GEON Project
  • ROADNet Project
  • Resurgence Project
• Goals
  • Create powerful analytical tools that are useful across disciplines
    • Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, …
Broader Cyberinfrastructure Landscape
• I. Data and metadata systems
  • Physical (NDBC buoy data)
  • Molecular bio (GenBank)
  • Biological Collections (DiGIR)
  • Oceanography (OPeNDAP)
• II. Domain Applications/Algorithms
  • Sequence processing (BLAST)
  • Ecological Niche Modeling (GARP)
  • Site selection (Marxan)
• III. Analysis and modeling frameworks
  • Grid Systems (Globus)
  • Workflow systems (Triana)
Analysis and Modeling
• Need framework for designing, executing, preserving, and sharing analyses and models
• Approaches
  • Scientific workflows – modular, re-usable components, archive-friendly
• Challenges
  • Incorporating semantics
  • Enabling effective model design
  • Effective access to grid computing
Scientific workflows
• Features of scientific workflows
  • Graphical model of data flow among processing steps
  • Inputs and outputs of components are precisely defined
  • Components are modular and reusable
  • Flow of data controlled by a separate execution model
  • Support for hierarchical models
[Diagram: workflow graph with nodes A through F, labeled Source (e.g., data), Processor (e.g., regression), and Sink (e.g., display)]
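The features listed for scientific workflows can be illustrated with a very small dataflow sketch in Python. This is not Kepler or Ptolemy II code: the `Actor` class and `run_workflow` function are hypothetical names, and the scheduling rule (fire a step once all of its inputs exist) is only one simple choice of execution model.

```python
from typing import Callable, Dict, List

class Actor:
    """A modular processing step with declared inputs and a single output."""
    def __init__(self, name: str, func: Callable[..., object], inputs: List[str]):
        self.name = name
        self.func = func
        self.inputs = inputs  # names of the upstream actors this step reads from

def run_workflow(actors: List[Actor]) -> Dict[str, object]:
    """Execution model kept separate from the actors: fire each actor
    as soon as every one of its inputs has produced a result."""
    results: Dict[str, object] = {}
    pending = list(actors)
    while pending:
        ready = [a for a in pending if all(i in results for i in a.inputs)]
        if not ready:
            raise ValueError("workflow has a cycle or a missing input")
        for actor in ready:
            results[actor.name] = actor.func(*(results[i] for i in actor.inputs))
            pending.remove(actor)
    return results

# Source -> processor -> sink, declared in arbitrary order.
workflow = [
    Actor("display", lambda m: f"mean = {m:.1f}", ["mean"]),
    Actor("source", lambda: [2.0, 4.0, 6.0], []),
    Actor("mean", lambda xs: sum(xs) / len(xs), ["source"]),
]
results = run_workflow(workflow)
print(results["display"])
```

Because the actors only declare what they consume, the same `mean` step could be reused in another workflow without changes, which is the modularity point made above.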
Kepler: dynamic data loading
• Kepler supports dynamic data loading:
  • Data sources are discovered via metadata queries
  • EML metadata allows arbitrary schemas to be loaded into an embedded database
  • Data queries can be performed before data flows downstream
[Diagram: a data source from EcoGrid (metadata-driven ingestion) feeding an R processing script]
R processing script:
    res <- lm(BARO ~ T_AIR)   # fit a linear model of BARO on T_AIR
    res                       # print the fitted model
    plot(T_AIR, BARO)         # scatter plot of the two variables
    abline(res)               # overlay the fitted regression line
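The idea of metadata-driven loading can be sketched with Python's built-in `sqlite3` module. This is an illustration only, not Kepler's implementation: the `metadata` dictionary is a hypothetical, heavily simplified stand-in for the attribute list of an EML record, and the table name, column names, and values are invented.

```python
import sqlite3

# Hypothetical stand-in for an EML attribute list (names and types are assumptions).
metadata = {
    "table": "met_station",
    "attributes": [("site", "TEXT"), ("T_AIR", "REAL"), ("BARO", "REAL")],
}
rows = [("A", 10.0, 1012.0), ("A", 15.0, 1009.0), ("B", 20.0, 1005.0)]

def load(conn: sqlite3.Connection, metadata: dict, rows: list) -> None:
    """Create a table whose schema comes from the metadata, then insert the rows."""
    cols = ", ".join(f'"{name}" {sqltype}' for name, sqltype in metadata["attributes"])
    conn.execute(f'CREATE TABLE "{metadata["table"]}" ({cols})')
    marks = ", ".join("?" for _ in metadata["attributes"])
    conn.executemany(f'INSERT INTO "{metadata["table"]}" VALUES ({marks})', rows)

conn = sqlite3.connect(":memory:")  # embedded, in-memory database
load(conn, metadata, rows)

# Query before anything flows downstream: keep only site A.
subset = conn.execute(
    "SELECT T_AIR, BARO FROM met_station WHERE site = ? ORDER BY T_AIR", ("A",)
).fetchall()
print(subset)
```

Nothing in `load` is specific to this table, so a different metadata record would produce a different schema with the same code.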
Kepler Component Library
The local library
Fast access to local components that are developed by the Kepler team and ship with Kepler
• Statistics (R and Matlab, etc.)
• Logic and math functions
• Graphics and visualization
• Geospatial data processing
• Molecular data processing
• Domain specific models
• Web services
• Grid services
• Data sources and sinks
  • Ecology data
  • Geology data
  • Taxonomic data
• And much, much more…
The remote repository
Publish and share custom analyses, models, and components with colleagues.
• Components contributed by scientists
  • ‘Upload to repository’ function in Kepler
  • Saved in repository, explicit versioning
  • Can be shared with colleagues
  • Can be referenced in published papers
• Components can be downloaded and executed
• Downloaded components can be customized
• Promotes replication of analyses and models
[Diagram: Publish and Import arrows between the local library and the remote repository]
Active work in Kepler
• Real-time data in Kepler
  • Scientist’s view: real-time data accessed like archived data
  • Engineer’s view: drill-down to manage sensor network resources
• Semantics
  • Semantic annotation connects models to knowledge
  • Smart Search (data and components)
  • Smart Data Integration
  • Smart Workflow Linking
• Improved user interfaces for Grid computing
* By “smart” we mean these services are informed by metadata and ontology information
Kepler and Sensor Networks
• Collaborators: NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), OPeNDAP
• Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System (CEO:P)
• Startup: October 2006
• Major foci:
  • Sensor network management – standardized services model
  • Analysis of data from sensors and archives
  • Public web view of sensor data
  • OPeNDAP and EcoGrid compatibility
Semantics in scientific workflows
• Components and their ports typically have:
  • Explicit ‘structural type’
    • e.g., int, float, string, {double}
  • Implicit semantic type
    • The structural type alone does not say whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values
[Diagram: components A and B with ports labeled by structural type (int, string) and semantic type (bodysize, rainfall)]
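The gap between structural and semantic types can be made concrete with a short Python sketch. The `Port` class and `check_link` function are hypothetical and are not part of Kepler; the semantic labels are plain strings here, whereas a real system would refer to ontology terms.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Port:
    name: str
    structural_type: str  # e.g. "int", "float", "string"
    semantic_type: str    # e.g. "rainfall", "bodysize" (illustrative labels)

def check_link(source: Port, target: Port) -> List[str]:
    """Return a list of problems found when connecting source to target."""
    problems = []
    if source.structural_type != target.structural_type:
        problems.append(f"structural: {source.structural_type} -> {target.structural_type}")
    if source.semantic_type != target.semantic_type:
        problems.append(f"semantic: {source.semantic_type} -> {target.semantic_type}")
    return problems

rain_out = Port("A.out", "int", "rainfall")
size_in = Port("B.in", "int", "bodysize")

problems = check_link(rain_out, size_in)
print(problems)
```

Both ports carry `int`, so a purely structural check would accept the link; only the semantic label reveals that rainfall values are being fed into a body-size input.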
Semantic Annotation
• Label data with semantic types
• Label inputs and outputs of analytical components with semantic types
• Grounded at level of measurement and data, avoiding some pitfalls of upper ontologies
[Diagram: Data, Ontology, Workflow Components]
Observation ontology
Goal: generically describe the structure of scientific observation and measurement as found in a data set
• Observations can provide context for other observations.
• Entities represent real-world objects or concepts that can be measured.
• Observations are made about particular entities.
• Every measurement has a characteristic, which defines the property of the entity being measured.
• Entities, through observations, can be associated with one or more measured characteristics.
• Every measurement relates a characteristic to a standard or unit.
• Measurements assign values and units to characteristics of observed entities.
• Measurements have precision.
• A value is typically a cell in a data set.
• Extension points: provide extension points for loading specialized domain ontologies
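The relationships among observations, entities, and measurements described above can be sketched as Python dataclasses. This is a loose illustration, not the OBOE ontology itself: the class and field names are assumptions chosen to mirror the wording of the slide, and the tree and plot values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Measurement:
    characteristic: str  # property of the entity being measured, e.g. "height"
    value: float         # typically one cell in a data set
    standard: str        # unit or other measurement standard
    precision: float

@dataclass
class Observation:
    entity: str                                   # real-world object or concept
    measurements: List[Measurement] = field(default_factory=list)
    context: Optional["Observation"] = None       # another observation giving context

def describe(obs: Observation) -> str:
    """Render an observation, followed by its chain of context observations."""
    parts = [f"{obs.entity}.{m.characteristic} = {m.value} {m.standard}"
             for m in obs.measurements]
    text = "; ".join(parts)
    if obs.context is not None:
        text += f" (within {describe(obs.context)})"
    return text

# Invented example: a tree height observed within a plot of known area.
plot = Observation("Plot", [Measurement("area", 100.0, "squareMeter", 1.0)])
tree = Observation("Tree", [Measurement("height", 12.5, "meter", 0.1)], context=plot)

summary = describe(tree)
print(summary)
```

The `context` link is what lets a height value be interpreted as belonging to a tree in a particular plot rather than as an isolated number.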
Semantic annotation
• Mapping between data and the ontology via semantic annotation
[Diagram: Observation Ontology and Data set]
Smart (Data) Integration: Merge
• Discover data of interest
• … connect to merge actor
• … “compute merge”
  • align attributes via annotations
  • open dialog for user refinement
  • store merge mapping in MOML
• … enjoy!
• … your merged dataset
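The "align attributes via annotations" step can be illustrated with a small Python sketch. It is not the Kepler merge actor: the annotation dictionaries, column names, and rows are invented, and the sketch ignores unit conversion and user refinement, both of which a real merge would need.

```python
from typing import Dict, List, Tuple

# Each data set maps its local column names to a shared semantic label (assumed labels).
annotations_a = {"spp": "taxon", "ht_m": "height"}
annotations_b = {"species": "taxon", "height": "height", "observer": "observer"}
rows_a = [{"spp": "Pinus edulis", "ht_m": 4.2}]
rows_b = [{"species": "Juniperus monosperma", "height": 3.1, "observer": "JS"}]

def merge(datasets: List[Tuple[Dict[str, str], List[dict]]]) -> Tuple[List[str], List[dict]]:
    """Keep only the semantic labels every data set shares, then stack the rows."""
    shared = set.intersection(*(set(ann.values()) for ann, _ in datasets))
    columns = sorted(shared)
    merged = []
    for ann, rows in datasets:
        label_to_col = {label: col for col, label in ann.items()}
        for row in rows:
            merged.append({label: row[label_to_col[label]] for label in columns})
    return columns, merged

columns, merged = merge([(annotations_a, rows_a), (annotations_b, rows_b)])
print(columns)
print(merged)
```

The columns `spp` and `species` line up because both are annotated as `taxon`, not because their names match; `observer` is dropped because only one data set has it.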
Smart Linking (Workflow Design)
• Statically perform semantic and structural type checking
• Navigate errors and warnings within the workflow
• Search for and insert “adapters” to fix (structural and semantic) errors …
Semantic capabilities
• Answer semantic data queries:
  • Find sites in California where current abundance of molluscs is < 10% of historical abundance in 1900
• Validate semantic correctness of workflows
• Workflow design tools that exploit semantic context
• Data integration that dynamically matches data sources to target schema needed for analysis
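Once data have been integrated into a common shape, the example query above reduces to a simple comparison. The Python sketch below assumes that integration has already happened; the records, site names, and the choice of 2006 as the "current" year are invented for illustration, and the hard part (finding and aligning the data semantically) is not shown.

```python
from typing import Dict, List

# Invented, already-integrated mollusc abundance records.
records = [
    {"site": "S1", "state": "California", "year": 1900, "abundance": 200.0},
    {"site": "S1", "state": "California", "year": 2006, "abundance": 15.0},
    {"site": "S2", "state": "California", "year": 1900, "abundance": 100.0},
    {"site": "S2", "state": "California", "year": 2006, "abundance": 40.0},
    {"site": "S3", "state": "Oregon", "year": 1900, "abundance": 100.0},
    {"site": "S3", "state": "Oregon", "year": 2006, "abundance": 5.0},
]

def declined_sites(records: List[dict], state: str, baseline_year: int,
                   current_year: int, fraction: float) -> List[str]:
    """Sites in `state` whose current abundance is below `fraction` of the baseline."""
    by_site: Dict[str, Dict[int, float]] = {}
    for r in records:
        if r["state"] == state:
            by_site.setdefault(r["site"], {})[r["year"]] = r["abundance"]
    return sorted(
        site for site, years in by_site.items()
        if baseline_year in years and current_year in years
        and years[current_year] < fraction * years[baseline_year]
    )

matches = declined_sites(records, "California", 1900, 2006, 0.10)
print(matches)
```

Only S1 qualifies: its current abundance (15) is below 10% of its 1900 value (200), S2 has declined less, and S3 is outside California.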
In summary…
• Typical analytical models are complex and difficult to comprehend and maintain
• Scientific workflows provide
  • An intuitive visual model
  • Structure and efficiency in modeling and analysis
  • Abstractions to help deal with complexity
  • Direct access to data
  • Means to publish and share models
• Kepler is an evolving but effective tool for scientists
  • Looking for ways to transition from research prototype to a production software tool
• Scalable data integration is our main challenge
For more information, see: • http://kepler-project.org/
Acknowledgements
• The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
• Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
• The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
• The Andrew W. Mellon Foundation.
• Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, ROADNet, EOL, Resurgence
Roadmap ahead:
• Monday, Research Design: January 8, 2007
  • 1:15 – 4:30 Scientific workflows – Pennington
• Tuesday, Data Grids: January 9, 2007
  • 8:30 – 10:30 Grid technologies and activity – Servilla and Pennington
  • 10:30 – 2:15 EML/Metadata best practices/Morpho – Tyburczy
  • 2:30 – 3:00 QA/QC – Vanderbilt
  • 3:00 – 4:30 Good Practices on storing data – Vanderbilt/White
• Wednesday, Workflows I: Using pre-built workflows in Kepler, January 10, 2007
  • 8:30 – 12:00 Introduction to Kepler with demos – Pennington/Romanello
  • 1:15 – 2:00 Using Desktop data in Kepler – Higgins
  • 2:00 – ??? Bosque, dinner at the Socorro Brew Pub
• Thursday, Workflows II: Tools in Kepler, January 11, 2007
  • 8:30 – 10:30 Using R in Kepler (demo + exercise) – Higgins
  • 10:30 – 12:00 Visualization in Kepler (demo + exercise) – Higgins/Pennington
  • 1:15 – 2:15 Biodiversity example in Kepler (demo + exercise) – “/”
  • 2:30 – 4:30 Taxonomic resolution in Kepler – Stewart
• Friday, Workflows III: Semantic approaches in Kepler, January 12, 2007
  • 8:30 – 12:00 Knowledge representation and semantic mediation – Bowers
  • 1:15 – 2:00 Recapping the week
  • 2:00 – 3:00 Preparing to use ecoinformatics in the classroom – Katz
  • 3:00 – 4:00 Roundtable Discussion – Katz