Exercises:

Exercises: Visit sites: GCMD – http://gcmd.nasa.gov/; ORNL DAAC – http://www-eosdis.ornl.gov/; NBII – http://nbii.gov/; ESA – http://esapubs.org/archive/. Video: Hans Rosling. Introduction to SEEK & CI. William Michener, LTER Network Office, University of New Mexico, January 2007.



Presentation Transcript


  1. Exercises: • Visit sites: • GCMD – http://gcmd.nasa.gov/ • ORNL DAAC – http://www-eosdis.ornl.gov/ • NBII – http://nbii.gov/ • ESA – http://esapubs.org/archive/ • Video • Hans Rosling

  2. Introduction to SEEK & CI William Michener LTER Network Office, University of New Mexico January 2007

  3. Cyberinfrastructure for Environmental Biology • Environmental sciences increasingly focus on collaboration and synthesis • Cyberinfrastructure supports science by: • Supporting data access and discovery • Facilitating the integration of heterogeneous data • Enabling complex analysis, modeling and forecasting

  4. Outline • Cyberinfrastructure challenges • Overview of the Science Environment for Ecological Knowledge (SEEK) architecture • EcoGrid, Kepler, Semantic Mediation System • Scientific Workflows and Kepler • Semantics in integration, analysis and modeling

  5. Outline • Cyberinfrastructure challenges • Overview of the Science Environment for Ecological Knowledge (SEEK) architecture • EcoGrid, Kepler, Semantic Mediation System • Scientific Workflows and Kepler • Semantics in integration, analysis and modeling

  6. Access, integration, and analysis • Synthesis projects have distinctive needs • Need to access large numbers of data sets • Little of the data are 'theirs', so they know few details about them • Studies are 'data limited' • Must engage the broader community to really solve the access issue • Need to integrate those data in a meaningful way • Studies not designed to be used together, so many pitfalls • Integrated product differs for every synthesis project • Need to analyze and model the data efficiently and collaboratively • Leverage dynamic data loading to increase efficiency • Use scientific workflow systems to work collaboratively

  7. Dilemma: no unified model • No single database suffices • Numerous data warehouses exist, but not extensible for all data • VegBank, ClimDB, GenBank, PDB, etc. • Data warehouses use federated schemas • Any data that does not fit is not captured • This is a form of data integration for one purpose • Custom development for 1000’s of databases is not feasible

  8. Discovery, Access, and Archive • Effectively store and archive data • Effectively locate and access data from dispersed collections • Approaches • Repositories for some data exist • KNB Metacat, SRB, EcoGrid • Professional Society registries • Structured metadata + ontologies • Ecological Metadata Language • FGDC metadata not sufficient • Smart search • Replication • Challenge areas • Better recall and precision • Exploiting semantics • Building effective ontologies • Need to target individual scientists + students
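The "better recall and precision" challenge on this slide can be made concrete with a small calculation. The sketch below uses the standard definitions (precision is the share of retrieved items that are relevant; recall is the share of relevant items that were retrieved). The package identifiers are hypothetical and do not refer to real KNB records.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data package identifiers, for illustration only.
relevant = {"pkg.1", "pkg.2", "pkg.3", "pkg.4"}
retrieved = ["pkg.1", "pkg.2", "pkg.9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three returned packages are relevant, and two of the four relevant packages were found, so a search can score reasonably on one measure while doing poorly on the other.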

  9. Data sharing is increasing • KNB Metacat • ESA Data Registry • NCEAS • LTER (LNO, SBC) • PISCO • OBFS • UC Natural Reserves • Pelagic Fisheries DB • LITS DB (UK) • Kruger Nat. Park (SA) • OSU, FIU [Chart: Data Packages in the KNB; cumulative count (axis 0 to 12,000) by year, 2002 to 2006]

  10. Metadata • Loosely coupled data repositories accommodate heterogeneity • Metadata • Metadata • Metadata • An absolute necessity • Ecological Metadata Language • FGDC/NBII metadata (good but insufficient)

  11. Data heterogeneity • Data are heterogeneous • Differing formats, logical organization, and interpretation • Syntax • Format of the data (e.g., csv, NetCDF, Excel, etc.) • Schema • Logical model of the data (e.g., relational models, hierarchical models, etc.) • Semantics • Meaning of the data (e.g., conceptual links, formalized methods, interpretation) • Broad array of relevant data sources • Ecological (population survey, community survey, behavioral, etc.) • Physical (hydrology, meteorology, chemistry, etc.) • Social (demographic data, land use patterns, policy information, etc.) • Economic (economic valuations, demographic data, etc.)
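A small sketch may help separate the three layers named on this slide. Two invented sources report the same air temperature: they differ in syntax (CSV versus JSON), in schema (flat rows versus a nested document), and in semantics (Celsius versus Fahrenheit). All field names and values are assumptions made up for the example.

```python
import csv
import io
import json

# Source 1: flat CSV, temperature in degrees Celsius (hypothetical data).
csv_text = "site,date,T_AIR\nA,2006-07-01,25.0\n"

# Source 2: nested JSON, temperature in degrees Fahrenheit (hypothetical data).
json_text = '{"station": "B", "obs": [{"day": "2006-07-01", "temp_f": 77.0}]}'


def from_csv(text):
    # Syntax: parse CSV. Schema: one row per observation. Semantics: already Celsius.
    return [
        {"site": row["site"], "date": row["date"], "air_temp_c": float(row["T_AIR"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_json(text):
    # Syntax: parse JSON. Schema: observations nested under a station.
    # Semantics: convert Fahrenheit to Celsius before combining.
    doc = json.loads(text)
    return [
        {
            "site": doc["station"],
            "date": obs["day"],
            "air_temp_c": round((obs["temp_f"] - 32) * 5 / 9, 2),
        }
        for obs in doc["obs"]
    ]


records = from_csv(csv_text) + from_json(json_text)
for rec in records:
    print(rec)
```

Only after all three layers are reconciled do the two records become directly comparable.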

  12. Data Integration • Combining heterogeneous data is necessary for synthesis • Approaches • Manual • Semi-automated integration that leverages domain knowledge • Challenges • Integration constrained by intended analyses as well as data input • Not the traditional data warehouse approach • Difficult to build consistent knowledge base • Automated reasoning tested on small data sources • Little semantics support in software tools for domain science Integration needs to be: Ad-hoc (No global view) Fast (hours instead of months)

  13. Analysis and Modeling [Diagram labels: Data, Clean, Analyze, Graph] • Current practices are ad-hoc and non-repeatable • Model the steps used by researchers during analysis • Graphical model of flow of data among processing steps • Each step often occurs in different software • Refer to these graphs as ‘Scientific Workflows’

  14. Outline • Cyberinfrastructure challenges • Overview of the Science Environment for Ecological Knowledge (SEEK) architecture • EcoGrid, Kepler, Semantic Mediation System • Scientific Workflows and Kepler • Semantics in integration, analysis and modeling

  15. Science Environment for Ecological Knowledge • SEEK extends informatics approaches to improve analysis and modeling to support broad scale synthesis • Expose ecological, biodiversity, environmental data through a common architecture • Create a framework for executing, preserving and communicating complex quantitative analytical processes • Address myriad challenges associated with integrating heterogeneous data for use in analysis

  16. EcoGrid • Data access to diverse data systems • Lightweight web service interfaces • Common query syntax • Common mechanism to access • ecological data (100’s of field stations) • museum specimen data (100’s of museums) • environmental data (data in SRB at SDSC) • geological data (GEON portal)

  17. Kepler: Analysis and Modeling • Scientific workflow paradigm • Models data flow among modular components • Improved user interface for complex processes • Benefits • Improves documentation • Simplifies sharing of custom models with colleagues • Promotes modular components • Hierarchical models can hide complexity • Direct access to data via EcoGrid • Access common analysis tools (e.g., R, Matlab) from a single framework

  18. Kepler: Analysis and Modeling

  19. Semantic Mediation • Mediation layer for Kepler and EcoGrid • Addresses data heterogeneity and integration issues • Uses a formal reasoning approach for • Smart data discovery • Semi-automated data integration • Workflow design • Workflow validation • Relies upon good knowledge model • Developed by Knowledge Representation group

  20. Semantic Mediation

  21. Knowledge Representation • Ecologists and computer scientists together • capture critical knowledge about ecological data • Extensible Observation Ontology (OBOE) captures the semantics of scientific data • Semantics of observations and measurements • Unit types • Observation context • Sampling hierarchies • Used by the Semantic Mediation system

  22. Knowledge Representation

  23. Taxonomic Nomenclature

  24. Evolving collaborations • Ecological Metadata Language – started in 1997 • KNB/Morpho/Metacat – KDI 1999 • Lifemapper – KDI 1998 • Kepler – SEEK ITR 2002 • Production work – Mellon Foundation 2002 • Collaboration…organic and evolving

  25. Kepler Collaboration • Open-source • Builds on Ptolemy II from UC Berkeley • Collaborators • SEEK Project • SciDAC SDM Center • Ptolemy Project • GEON Project • ROADNet Project • Resurgence Project • Goals • Create powerful analytical tools that are useful across disciplines • Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, … Ptolemy II

  26. Broader Cyberinfrastructure Landscape • I. Data and metadata systems • Physical (NDBC buoy data) • Molecular bio (GenBank) • Biological Collections (DiGIR) • Oceanography (OpenDAP) • II. Domain Applications/Algorithms • Sequence processing (BLAST) • Ecological Niche Modeling (GARP) • Site selection (Marxan) • III. Analysis and modeling frameworks • Grid Systems (Globus) • Workflow systems (Triana)

  27. Outline • Cyberinfrastructure challenges • Overview of the Science Environment for Ecological Knowledge (SEEK) architecture • EcoGrid, Kepler, Semantic Mediation System • Scientific Workflows and Kepler • Semantics in integration, analysis and modeling

  28. Analysis and Modeling • Need framework for designing, executing, preserving, and sharing analyses and models • Approaches • Scientific workflows – modular, re-usable components, archive-friendly • Challenges • Incorporating semantics • Enabling effective model design • Effective access to grid computing

  29. Scientific workflows [Diagram: workflow graph of components A, B, C, A’, D, E, F; labels: Source (e.g., data), Processor (e.g., regression), Sink (e.g., display)] • Features of scientific workflows • Graphical model of data flow among processing steps • Inputs and Outputs of components are precisely defined • Components are modular and reusable • Flow of data controlled by a separate execution model • Support for hierarchical models
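The features listed on this slide can be illustrated with a toy dataflow graph. This is a minimal sketch, not Kepler's or Ptolemy II's actual API: the `Actor` class and the `run` function are hypothetical names. It shows modular components with declared inputs, and a separate execution model that decides when each component fires.

```python
from collections import deque


class Actor:
    """A modular component: a name, a function, and the actors it reads from."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = list(inputs)


def run(actors):
    """Tiny execution model: fire each actor once, after all of its inputs."""
    by_name = {a.name: a for a in actors}
    pending = {a.name: set(a.inputs) for a in actors}
    ready = deque(name for name, deps in pending.items() if not deps)
    results = {}
    while ready:
        name = ready.popleft()
        actor = by_name[name]
        results[name] = actor.func(*(results[i] for i in actor.inputs))
        # Release any downstream actor whose last missing input just arrived.
        for other, deps in pending.items():
            if name in deps:
                deps.discard(name)
                if not deps:
                    ready.append(other)
    return results


displayed = []  # stands in for a display sink

workflow = [
    Actor("source", lambda: [3.0, None, 5.0, 4.0]),                        # source
    Actor("clean", lambda xs: [x for x in xs if x is not None], ["source"]),
    Actor("mean", lambda xs: sum(xs) / len(xs), ["clean"]),                 # processor
    Actor("sink", lambda m: displayed.append(m), ["mean"]),                 # sink
]

results = run(workflow)
print(displayed)
```

Because the actors know nothing about scheduling, the same components could in principle be run under a different execution model without being rewritten.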

  30. Kepler: dynamic data loading • Kepler supports dynamic data loading: • Data sources are discovered via metadata queries • EML metadata allows arbitrary schemas to be loaded into an embedded database • Data queries can be performed before data flows downstream [Diagram: data source from EcoGrid (metadata-driven ingestion) feeding an R processing script: res <- lm(BARO ~ T_AIR); res; plot(T_AIR, BARO); abline(res)]
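The idea of metadata-driven ingestion can be sketched as follows. The `metadata` dictionary is a hypothetical, heavily simplified stand-in for EML attribute metadata, and SQLite stands in for the embedded database; neither reflects Kepler's actual implementation. The column names `T_AIR` and `BARO` come from the slide's R script, while the values are invented.

```python
import csv
import io
import sqlite3

# Hypothetical, simplified stand-in for EML attribute metadata.
metadata = {
    "table": "weather",
    "attributes": [
        {"name": "T_AIR", "type": "REAL"},
        {"name": "BARO", "type": "REAL"},
        {"name": "STATION", "type": "TEXT"},
    ],
}

# Invented sample data matching the metadata above.
data_text = "T_AIR,BARO,STATION\n10.0,1010.0,A\n15.0,1008.0,A\n20.0,1006.0,B\n"


def load(conn, metadata, text):
    """Create a table from the metadata alone, then load the data into it."""
    cols = metadata["attributes"]
    col_defs = ", ".join(f'"{c["name"]}" {c["type"]}' for c in cols)
    conn.execute(f'CREATE TABLE "{metadata["table"]}" ({col_defs})')
    placeholders = ", ".join("?" for _ in cols)
    rows = [
        tuple(row[c["name"]] for c in cols)
        for row in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany(
        f'INSERT INTO "{metadata["table"]}" VALUES ({placeholders})', rows
    )


conn = sqlite3.connect(":memory:")
load(conn, metadata, data_text)

# Query the data before it flows to a downstream analysis step.
subset = conn.execute(
    "SELECT T_AIR, BARO FROM weather WHERE STATION = ? ORDER BY T_AIR", ("A",)
).fetchall()
print(subset)
```

The loader contains no knowledge of this particular data set, so a different schema only requires different metadata.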

  31. Kepler Component Library [Diagram labels: Publish, Import] • The local library: fast access to local components that are developed by the Kepler team and ship with Kepler • Statistics (R and Matlab, etc.) • Logic and math functions • Graphics and visualization • Geospatial data processing • Molecular data processing • Domain specific models • Web services • Grid services • Data sources and sinks • Ecology data • Geology data • Taxonomic data • And much, much more… • The remote repository: publish and share custom analyses, models, and components with colleagues • Components contributed by scientists • ‘Upload to repository’ function in Kepler • Saved in repository, explicit versioning • Can be shared with colleagues • Can be referenced in published papers • Components can be downloaded and executed • Downloaded components can be customized • Promotes replication of analyses and models

  32. Active work in Kepler • Real-time data in Kepler • Scientist’s view: real-time data accessed like archived data • Engineer’s view: drill-down to manage sensor network resources • Semantics • Semantic annotation connects models to knowledge • Smart Search (data and components) • Smart Data Integration • Smart Workflow Linking * by “smart” we mean these services are informed by metadata and ontology information • Improved user interfaces for Grid computing

  33. ORB

  34. Kepler and Sensor Networks • Collaborators: NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), Opendap • Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System (CEO:P) • Startup October, 2006 • Major foci: • Sensor network management – standardized services model • Analysis of data from sensors and archives • Public web view of sensor data • Opendap and EcoGrid compatibility

  35. Outline • Cyberinfrastructure challenges • Overview of the Science Environment for Ecological Knowledge (SEEK) architecture • EcoGrid, Kepler, Semantic Mediation System • Scientific Workflows and Kepler • Semantics in integration, analysis and modeling

  36. Semantics in scientific workflows [Diagram: components A and B whose ports carry structural types (int, string) and semantic labels (bodysize, rainfall)] • Components and their ports typically have: • Explicit ‘structural type’ • e.g., int, float, string, {double} • Implicit semantic type • Not sure whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values
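The gap between structural and semantic types can be shown with a short check. In this sketch the `Port` class and `check_link` function are hypothetical and are not part of Kepler; the point is only that two ports can agree on `int` while disagreeing on what the integers mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    structural_type: str  # e.g. "int", "float", "string"
    semantic_type: str    # e.g. "rainfall", "bodysize"


def check_link(out_port, in_port):
    """Return a list of problems found when connecting out_port to in_port."""
    problems = []
    if out_port.structural_type != in_port.structural_type:
        problems.append(
            f"structural mismatch: {out_port.structural_type} -> {in_port.structural_type}"
        )
    if out_port.semantic_type != in_port.semantic_type:
        problems.append(
            f"semantic mismatch: {out_port.semantic_type} -> {in_port.semantic_type}"
        )
    return problems


a_out = Port("A.out", "int", "rainfall")
b_in = Port("B.in", "int", "bodysize")

# Structurally valid (int -> int), yet semantically wrong.
problems = check_link(a_out, b_in)
print(problems)
```

A purely structural type checker would accept this link, which is why the semantic type needs to be made explicit.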

  37. Semantic Annotation [Diagram labels: Data, Ontology, Workflow Components] • Label data with semantic types • Label inputs and outputs of analytical components with semantic types • Grounded at level of measurement and data, avoiding some pitfalls of upper ontologies

  38. Observation ontology • Goal: generically describe the structure of scientific observation and measurement as found in a data set • Extension points: provide extension points for loading specialized domain ontologies • Observations can provide context for other observations • Entities represent real-world objects or concepts that can be measured • Observations are made about particular entities • Every measurement has a characteristic, which defines the property of the entity being measured • Entities, through observations, can be associated with one or more measured characteristics • Every measurement relates a characteristic to a standard or unit • Measurements assign values and units to characteristics of observed entities • Measurements have precision • A value is typically a cell in a data set
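The relationships described on this slide can be sketched as plain data classes. This is an illustrative model only: the class and field names are assumptions and are not the actual OBOE vocabulary or any published API, and the values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    characteristic: str  # property of the entity being measured
    standard: str        # unit or other measurement standard
    value: float         # typically one cell in a data set
    precision: float


@dataclass
class Observation:
    entity: str  # real-world object or concept being observed
    measurements: List[Measurement] = field(default_factory=list)
    context: Optional["Observation"] = None  # another observation giving context


# Hypothetical example: a tree observed within a plot.
plot = Observation("Plot", [Measurement("area", "square meter", 100.0, 1.0)])
tree = Observation("Tree", [Measurement("height", "meter", 12.5, 0.1)], context=plot)


def describe(obs):
    """Render an observation and its chain of context observations."""
    text = f"{obs.entity}: " + ", ".join(
        f"{m.characteristic}={m.value} {m.standard}" for m in obs.measurements
    )
    if obs.context is not None:
        text += " within " + describe(obs.context)
    return text


print(describe(tree))
```

Context is modeled as a link between observations rather than between entities, matching the slide's statement that observations provide context for other observations.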

  39. Semantic annotation • Mapping between data and the ontology via semantic annotation [Diagram labels: Observation Ontology, Data set]

  40. Smart (Data) Integration: Merge • Discover data of interest • … connect to merge actor • … “compute merge” • align attributes via annotations • open dialog for user refinement • store merge mapping in MOML • … enjoy! • … your merged dataset
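The "align attributes via annotations" step can be sketched as below. This is a simplified illustration under stated assumptions, not the merge actor's real algorithm: the function names, column names, and concept labels are all hypothetical. Columns from two data sets are matched when they carry the same semantic annotation, regardless of what the columns are called.

```python
def compute_merge(annotations_a, annotations_b):
    """Map each column of A to the column of B annotated with the same concept."""
    by_concept = {concept: col for col, concept in annotations_b.items()}
    return {
        col: by_concept[concept]
        for col, concept in annotations_a.items()
        if concept in by_concept
    }


def merge(rows_a, rows_b, mapping):
    """Stack both data sets, keeping only the aligned columns, under A's names."""
    merged = [{a_col: row[a_col] for a_col in mapping} for row in rows_a]
    merged += [{a_col: row[b_col] for a_col, b_col in mapping.items()} for row in rows_b]
    return merged


# Hypothetical annotations: column name -> ontology concept.
annotations_a = {"sp": "TaxonName", "cnt": "Abundance", "obs": "Observer"}
annotations_b = {"species": "TaxonName", "n": "Abundance"}

rows_a = [{"sp": "Quercus alba", "cnt": 4, "obs": "jd"}]
rows_b = [{"species": "Acer rubrum", "n": 7}]

mapping = compute_merge(annotations_a, annotations_b)
merged = merge(rows_a, rows_b, mapping)
print(mapping)
print(merged)
```

The computed `mapping` is the piece a user would review and refine in a dialog before it is stored with the workflow.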

  41. Smart Linking (Workflow Design) • Statically perform semantic and structural type checking • Navigate errors and warnings within the workflow • Search for and insert “adapters” to fix (structural and semantic) errors
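Adapter search can be sketched as a lookup keyed by the source and target types of a link. The registry, the type labels, and the `link` function are hypothetical and only illustrate the idea: a mismatch is repaired when a known conversion exists and reported as an error otherwise.

```python
# Hypothetical adapter registry: (source type, target type) -> conversion.
# Each type is a (structural type, semantic type) pair.
ADAPTERS = {
    (("int", "bodysize_mm"), ("float", "bodysize_m")): lambda v: v / 1000.0,
    (("float", "rainfall_in"), ("float", "rainfall_mm")): lambda v: v * 25.4,
}


def link(source_type, target_type):
    """Return a function carrying values across the link, inserting an adapter if needed."""
    if source_type == target_type:
        return lambda v: v  # types already agree; no adapter required
    adapter = ADAPTERS.get((source_type, target_type))
    if adapter is None:
        raise TypeError(f"no adapter from {source_type} to {target_type}")
    return adapter


# A repairable mismatch: millimeters as int to meters as float.
convert = link(("int", "bodysize_mm"), ("float", "bodysize_m"))
print(convert(1500))

# An unrepairable mismatch: rainfall cannot be adapted into body size.
try:
    link(("float", "rainfall_mm"), ("float", "bodysize_m"))
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

Because the check runs on declared types alone, it can be performed statically, before any data flows through the workflow.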

  42. Semantic capabilities • Answer semantic data queries: • Find sites in California where current abundance of molluscs is < 10% of historical abundance in 1900 • Validate semantic correctness of workflows • Workflow design tools that exploit semantic context • Data integration that dynamically matches data sources to target schema needed for analysis
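The example query on this slide needs knowledge that is not in the data itself, namely which recorded taxa count as molluscs. The sketch below uses a tiny hand-written is-a hierarchy in place of an ontology. The records are invented, and treating 2006 as the "current" year is an assumption made for the example.

```python
# Minimal taxonomic is-a hierarchy standing in for an ontology.
IS_A = {
    "Mytilus californianus": "Bivalvia",
    "Bivalvia": "Mollusca",
    "Haliotis rufescens": "Gastropoda",
    "Gastropoda": "Mollusca",
    "Pisaster ochraceus": "Asteroidea",
    "Asteroidea": "Echinodermata",
}


def is_a(taxon, ancestor):
    """Walk up the hierarchy to test whether taxon falls under ancestor."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = IS_A.get(taxon)
    return False


# Invented survey records, for illustration only.
records = [
    {"site": "S1", "state": "CA", "taxon": "Mytilus californianus", "year": 1900, "abundance": 200},
    {"site": "S1", "state": "CA", "taxon": "Mytilus californianus", "year": 2006, "abundance": 10},
    {"site": "S1", "state": "CA", "taxon": "Pisaster ochraceus", "year": 2006, "abundance": 500},
    {"site": "S2", "state": "CA", "taxon": "Haliotis rufescens", "year": 1900, "abundance": 100},
    {"site": "S2", "state": "CA", "taxon": "Haliotis rufescens", "year": 2006, "abundance": 50},
    {"site": "S3", "state": "OR", "taxon": "Mytilus californianus", "year": 1900, "abundance": 300},
    {"site": "S3", "state": "OR", "taxon": "Mytilus californianus", "year": 2006, "abundance": 5},
]


def mollusc_total(records, site, year):
    return sum(
        r["abundance"]
        for r in records
        if r["site"] == site and r["year"] == year and is_a(r["taxon"], "Mollusca")
    )


def declined_sites(records, state, then, now, fraction):
    """Sites in a state whose mollusc abundance fell below fraction of the past value."""
    sites = sorted({r["site"] for r in records if r["state"] == state})
    result = []
    for site in sites:
        past = mollusc_total(records, site, then)
        if past > 0 and mollusc_total(records, site, now) < fraction * past:
            result.append(site)
    return result


answer = declined_sites(records, "CA", then=1900, now=2006, fraction=0.10)
print(answer)
```

No record mentions "mollusc" directly, so a plain keyword search over the data would miss every matching row; the hierarchy is what connects the query term to the recorded species.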

  43. In summary… • Typical analytical models are complex and difficult to comprehend and maintain • Scientific workflows provide • An intuitive visual model • Structure and efficiency in modeling and analysis • Abstractions to help deal with complexity • Direct access to data • Means to publish and share models • Kepler is an evolving but effective tool for scientists • Looking for ways to transition from research prototype to a production software tool • Scalable data integration is our main challenge

  44. Prototype NEON Portal – “myNEON”

  45. For more information, see: • http://kepler-project.org/

  46. Acknowledgements The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676. Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus. The Andrew W. Mellon Foundation. Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence

  47. Roadmap ahead: • Monday, Research Design: January 8, 2007 • 1:15 – 4:30 Scientific workflows – Pennington • Tuesday, Data Grids: January 9, 2007 • 8:30 – 10:30 Grid technologies and activity – Servilla and Pennington • 10:30 – 2:15 EML/Metadata best practices/Morpho – Tyburczy • 2:30 – 3:00 QA/QC – Vanderbilt • 3:00 – 4:30 Good Practices on storing data – Vanderbilt/White • Wednesday, Workflows I: Using pre-built workflows in Kepler: January 10, 2007 • 8:30 – 12:00 Introduction to Kepler with demos – Pennington/Romanello • 1:15 – 2:00 Using Desktop data in Kepler – Higgins • 2:00 – ??? Bosque, dinner at the Socorro Brew Pub • Thursday, Workflows II: Tools in Kepler: January 11, 2007 • 8:30 – 10:30 Using R in Kepler (demo + exercise) – Higgins • 10:30 – 12:00 Visualization in Kepler (demo + exercise) – Higgins/Pennington • 1:15 – 2:15 Biodiversity example in Kepler (demo + exercise) – “/” • 2:30 – 4:30 Taxonomic resolution in Kepler – Stewart • Friday, Workflows III: Semantic approaches in Kepler: January 12, 2007 • 8:30 – 12:00 Knowledge representation and semantic mediation – Bowers • 1:15 – 2:00 Recapping the week • 2:00 – 3:00 Preparing to use ecoinformatics in the classroom – Katz • 3:00 – 4:00 Roundtable Discussion – Katz
