Semantic Mediation in SEEK/Kepler: Exploiting Semantic Annotation for Discovery, Analysis, and Integration of Scientific Data and Workflows
Bertram Ludäscher, Dept. of Computer Science, UC Davis; UC Davis Genome Center, ludaesch@ucdavis.edu
Shawn Bowers, UC Davis Genome Center, sbowers@ucdavis.edu
seek.ecoinformatics.org | kepler-project.org | www.sdsc.edu | dbis.ucdavis.edu | genomics.ucdavis.edu

Science Environment for Ecological Knowledge
SEEK is an NSF-funded, multidisciplinary research project to facilitate …
• Access to distributed ecological, environmental, and biodiversity data
  • Enable data sharing & reuse
  • Enhance data discovery at global scales
• Scalable analysis and synthesis
  • Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues
  • Enable communication and collaboration for analysis
  • Enable reuse of analytical components
  • Support scientific workflow design and modeling

SEEK data access, analysis, mediation
• Data Access (EcoGrid)
  • Distributed data network for environmental, ecological, and systematics data
  • Interoperate diverse environmental data systems
• Workflow Tools (Kepler)
  • Problem-solving environment for scientific data analysis and visualization (“scientific workflows”)
• Semantic Mediation (SMS)
  • Leverage ontologies for “smart” data/component discovery and integration

Managing Data Heterogeneity
• Data comes from heterogeneous sources
  • Real-world observations
  • Spatial-temporal contexts
  • Collection/measurement protocols and procedures
  • Many representations for the same information (count, area, density)
  • Data, syntax, schema, and semantic heterogeneity
• Discovery and “synthesis” (integration) are performed manually
  • Discovery is often based on an intuitive notion of “what is out there”
  • Synthesis of data is very time consuming, which limits its use

A simple Kepler workflow
[Figure callout: Composite Component (Sub-workflow)]
• Loops are often used in scientific workflows (SWFs), e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)
(T. McPhillips)

A simple Kepler workflow
[Figure callouts on the workflow's components:]
• Lists Nexus files to process (project)
• Reads text files
• Parses Nexus format
• Draws phylogenetic trees
• PhylipPars infers trees from discrete, multi-state characters.
• Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees.
• UniqueTrees discards redundant trees in each collection.
(T. McPhillips)

A simple Kepler workflow An example workflow run, executed as a Dataflow Process Network
SMS motivation
• Scientific Workflow Life-cycle
  • Resource Discovery
    • discover relevant datasets
    • discover relevant actors or workflow templates
  • Workflow Design and Configuration
    • data-actor (data binding)
    • data-data (data integration / merging / interlinking)
    • actor-actor (actor / workflow composition)
• Challenge: do all this in the presence of …
  • 100’s of workflows and templates
  • 1000’s of actors (e.g. actors for web services, data analytics, …)
  • 10,000’s of datasets
  • 1,000,000’s of data items
  • … highly complex, heterogeneous data
– price to pay for these resources: $$$ (lots)
– scientist’s time wasted: priceless!

Approach & SMS capabilities
[Diagram: Ontologies, Semantic Annotation, Resource Discovery, Workflow Validation, Resource Integration, Workflow Elaboration; labeled “Iterative Development”]

Approach & SMS capabilities
• SEEK KR group is developing OWL-DL ontologies:
  • Various workflow-component ontologies (for categorizing by function, project, scientific discipline, …)
  • Scientific observation ontology (OBOE), an upper ontology for defining and relating observations, measurements, and units
  • Domain-specific ontologies that extend OBOE (standard and derived units, ecology and biodiversity concepts, …)

Approach & SMS capabilities
• Annotations “connect” resources to ontologies
  • Conceptually describe a resource and/or its “data schema”
  • Annotations provide the means for ontology-based discovery, integration, …

Semantic + Structural Typing
• Structural types: given a structural type language S
  • Datasets, inputs, and outputs can be assigned structural types
  • Example: S : SpeciesData(site, day, spp, occ)
• Semantic types: given an ontology language O (e.g., OWL-DL)
  • Datasets, inputs, and outputs can be assigned ontology types
  • Example: O : $\mathit{Observation} \sqcap \exists\, \mathit{obsProperty}.\mathit{SpeciesOccurrence}$
• [Diagram: actor A1 (output types O_out, S_out) connected to actor A2 (input types O_in, S_in); the connection is semantically compatible but structurally incompatible]
• “Hybrid” types: semantic and structural types can be combined using logic constraints, e.g.
  $$\forall\, site, day, sp, occ:\ \mathit{SpeciesData}(site, day, sp, occ) \rightarrow \exists y:\ \mathit{Observation}(y) \wedge \mathit{obsProp}(y, occ) \wedge \mathit{SpeciesOccurrence}(occ)$$

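The difference between semantic and structural compatibility can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy class hierarchy, the class name `SpeciesOccurrenceObservation`, the `Port` record, and the `ObservationList` structure are hypothetical and are not taken from the SEEK ontologies or the Kepler code base.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: child class -> parent class.
SUBCLASS_OF = {
    "SpeciesOccurrenceObservation": "Observation",
    "Observation": None,
}


def is_subclass(sub, sup):
    """Return True if `sub` equals `sup` or is a descendant of it."""
    current = sub
    while current is not None:
        if current == sup:
            return True
        current = SUBCLASS_OF.get(current)
    return False


@dataclass(frozen=True)
class Port:
    semantic_type: str      # ontology class name
    structural_type: tuple  # (relation name, attribute names)


def check_connection(out_port, in_port):
    """Check a connection on both type levels independently."""
    return {
        "semantic_ok": is_subclass(out_port.semantic_type, in_port.semantic_type),
        "structural_ok": out_port.structural_type == in_port.structural_type,
    }


# A1 produces species data; A2 accepts any observation, but in another layout.
a1_out = Port("SpeciesOccurrenceObservation",
              ("SpeciesData", ("site", "day", "spp", "occ")))
a2_in = Port("Observation", ("ObservationList", ("id", "value")))

result = check_connection(a1_out, a2_in)
print(result)
```

In this sketch the connection passes the semantic check and fails the structural one, which is the "semantically compatible but structurally incompatible" case from the slide.
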
Semantic Type Annotation in Kepler
• Component input and output port annotation
  • Each port can be annotated with multiple classes from multiple ontologies
  • Annotations are stored within the component metadata

Component Annotation and Indexing
• Component Annotations
  • New components can be annotated and indexed into the component library (e.g., specializing generic actors)
  • Existing components can also be revised, annotated, and indexed (hiding previous versions)

Approach & SMS capabilities
• Ontology-based “smart” search
  • Find components by semantic types
  • Find components by input/output semantic types
• Ontology-based query rewriting for discovery/integration
  • Joint work with GEON project (see SSDBM-04, SWDB-04)

Smart Search
[Figure callouts: Browse for Components; Search for Component Name; Search for Category / Keyword]
• Find a component (here: an actor) in different locations (“categories”)
• … based on the semantic annotation of the component (or its ports)

Searching in context
• Search for components with compatible input/output semantic types
  • … searches over actor library
  • … applies subsumption checking on port annotations

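A search of this kind could be sketched roughly as follows. The class hierarchy, the actor names, and the dictionary layout of the actor library are all hypothetical; the sketch only assumes that each port carries a set of ontology classes and that a port accepts data whose class is subsumed by the port's annotation.

```python
# Hypothetical ontology: class -> list of direct superclasses.
PARENTS = {
    "Biomass": ["Measurement"],
    "SpeciesAbundance": ["Measurement"],
    "Measurement": ["Observation"],
    "Observation": [],
}


def ancestors(cls):
    """Return `cls` together with all of its transitive superclasses."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def subsumes(general, specific):
    """True if `general` is `specific` or one of its superclasses."""
    return general in ancestors(specific)


# Hypothetical actor library: actor name -> port annotations.
ACTOR_LIBRARY = {
    "PlotMeasurements": {"in": {"Measurement"}, "out": {"Plot"}},
    "BiomassRegression": {"in": {"Biomass"}, "out": {"Model"}},
    "AbundanceSummary": {"in": {"SpeciesAbundance"}, "out": {"Measurement"}},
}


def find_downstream(output_classes):
    """Find actors whose input annotations all subsume some upstream output class."""
    matches = []
    for name, ports in ACTOR_LIBRARY.items():
        if all(any(subsumes(required, produced) for produced in output_classes)
               for required in ports["in"]):
            matches.append(name)
    return sorted(matches)


compatible = find_downstream({"Biomass"})
print(compatible)
```

Here an upstream port annotated with `Biomass` matches both the actor that asks for `Biomass` and the more generic one that asks for `Measurement`, while the actor that requires `SpeciesAbundance` is filtered out.
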
Approach & SMS capabilities
• Workflow validation and analysis
  • Check that workflows are semantically & structurally well-typed
  • Infer semantic type annotations of derived data (i.e., type inference)
  • An initial approach and prototype based on mapping composition (see QLQP-05)
• User-oriented provenance
  • Collect & query data lineage of WF runs (see IPAW-06)

Workflow validation in Kepler
• Navigate errors and warnings within the workflow
• Search for and insert “adapters” to fix (structural and semantic) errors …
• Statically perform semantic and structural type checking

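As a rough sketch of what a static validation pass with adapter suggestions might look like, the code below walks the connections of a small workflow before anything is executed. The actor names, the adapter `TableToList`, the structural type labels, and the toy ontology are hypothetical and do not describe Kepler's actual implementation.

```python
# Hypothetical ontology: class -> parent class.
SUBCLASS_OF = {"Biomass": "Measurement", "Measurement": None}


def is_a(sub, sup):
    """True if `sub` equals `sup` or is a descendant of it."""
    while sub is not None:
        if sub == sup:
            return True
        sub = SUBCLASS_OF.get(sub)
    return False


# Hypothetical actors: each port is (semantic type, structural type).
ACTORS = {
    "ReadPlots": {"out": ("Biomass", "table")},
    "Regress": {"in": ("Measurement", "list"), "out": ("Measurement", "list")},
    "DrawTree": {"in": ("Phylogeny", "list")},
}

# Hypothetical adapters: name -> (input structure, output structure).
ADAPTERS = {"TableToList": ("table", "list")}

CONNECTIONS = [("ReadPlots", "Regress"), ("Regress", "DrawTree")]


def validate(connections):
    """Statically check every connection; no actor is executed."""
    problems = []
    for src, dst in connections:
        out_sem, out_struct = ACTORS[src]["out"]
        in_sem, in_struct = ACTORS[dst]["in"]
        if not is_a(out_sem, in_sem):
            problems.append((src, dst, "semantic error", []))
        elif out_struct != in_struct:
            # Suggest adapters that convert exactly between the two structures.
            fixes = [name for name, shape in ADAPTERS.items()
                     if shape == (out_struct, in_struct)]
            problems.append((src, dst, "structural error", fixes))
    return problems


problems = validate(CONNECTIONS)
for problem in problems:
    print(problem)
```

The first connection is reported as a structural error with a suggested adapter; the second is a semantic error for which this sketch has no fix to offer.
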
Approach & SMS capabilities
• Integrating and transforming data
  • Merge (“smart union”) datasets
  • Find mappings between data schemas for transformation
  • data binding, component connections (see DILS-04)

Smart (Data) Integration: Merge
• Discover data of interest
• … connect to merge actor
• … “compute merge”
  • align attributes via annotations
  • open dialog for user refinement
  • store merge mapping in MOML
• … enjoy!
• … your merged dataset
  • almost; it can be much more complicated

Under the hood of “Smart Merge” …
• Exploits semantic type annotations and ontology definitions to find mappings between sources
• Executing the merge actor results in an integrated data product (via “outer union”)
[Figure: two source tables with attributes a1–a4 and a5–a8, annotated with the semantic types Site and Biomass. The annotations align a1 with a8 and a3 with a6. Merge result:]
a1   a3    a4
a    5.0   10
b    6.0   11
a    0.1
c    0.2
d    0.3

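The outer-union idea can be sketched in a few lines, using the attribute names and values that survive from the slide's figure. Which attribute carries which annotation (a1 and a8 as Site, a3 and a6 as Biomass) is an assumption read off the figure, and the function `smart_merge` is a hypothetical illustration rather than the merge actor's actual code.

```python
def smart_merge(sources):
    """Outer union of annotated sources.

    Each source is (rows, annotations), where `annotations` maps an
    attribute name to a semantic type. Attributes that share a semantic
    type are aligned into one column; unannotated attributes keep their
    own name. Values a source does not provide become None.
    """
    columns = []
    aligned = []
    for rows, annotations in sources:
        for row in rows:
            out = {annotations.get(attr, attr): value for attr, value in row.items()}
            aligned.append(out)
            for col in out:
                if col not in columns:
                    columns.append(col)
    merged = [{col: row.get(col) for col in columns} for row in aligned]
    return columns, merged


source1 = (
    [{"a1": "a", "a3": 5.0, "a4": 10}, {"a1": "b", "a3": 6.0, "a4": 11}],
    {"a1": "Site", "a3": "Biomass"},
)
source2 = (
    [{"a6": 0.1, "a8": "a"}, {"a6": 0.2, "a8": "c"}, {"a6": 0.3, "a8": "d"}],
    {"a6": "Biomass", "a8": "Site"},
)

columns, merged = smart_merge([source1, source2])
print(columns)
for row in merged:
    print(row)
```

Rows from the second source have no value for `a4`, so that column is left empty for them, which is what distinguishes an outer union from a plain union of identically shaped tables.
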
Approach & SMS capabilities
• Workflow design support
  • (Semi-) automatically combine resource discovery, integration, and validation
  • Abstract → Executable WF
  • … ongoing work!
[Diagram label added on this slide: Automated SWF Refinement]

Summary
• Outlook:
  • Ontologies and semantic annotations for WF design & reuse
  • Put ontologies to actual use in Kepler
  • Continue to develop Kepler tools for annotation (KR observation ontology), discovery, integration, design, …
• Issues & Challenges:
  • Tools/approaches for ontology (OWL) management, organization, reasoning
  • Open source (distributed) ontology (OWL) storage and reasoning
  • Tools and techniques for robust ontology versioning and extension
• Acknowledgements
  • Timothy McPhillips, Dave Thau (UC Davis)
  • Mark Schildhauer, Josh Madin, Matt Jones (UCSB)
  • Deana Pennington (UNM)
  • Rich Williams (Microsoft Research)
  • Ferdinando Villa, Sergey Krivov (UVM)