Scientific Workflows
Deana Pennington, PhD
University of New Mexico
LTER Network Office, Sevilleta LTER
PI, CI-Team: Advancing CI-Based Science through Education, Training, and Mentoring of Science Communities
CoPI, Science Environment for Ecological Knowledge (SEEK) project
July 10, 2007
Informatics and the Research Cycle
[Figure: the research cycle, annotated with the informatics that support each part, with Scientific Workflows as a label. Labels recoverable from the slide:]
• Research cycle: Hypothesis Generation, Research Design, Collect Data, Conduct Analyses, Results
• Theory
• Conceptual Model: Assumptions, Idealizations, Simplification
• Deductive, Prescriptive, Mechanistic
• Inductive, Descriptive Statistics
• Knowledge-intensive: Human cognition, Ontologies, Semantic query
• Data-intensive Analyses: Data mining, High-performance computing, Bio-inspired algorithms, Scientific visualization
• Data Management: Query, Data models, Metadata, Storage
• Web Dissemination: E-journals, Dynamic websites, Information visualization
Workflows: Process Support
[Figure labels: Scientific Workflow Systems (Data, Analytical Component); Business Workflow Systems (Files)]
Scientific Workflow Systems
SEEK: Kepler Workflow System
[Figure labels: Input Data at Sites 1-4; Native functionality; Integration => Transformations; Analytical Components; Data; Derived Data]
Goals:
• Visual modeling of end-to-end analytical process
• Discovery of distributed data and analytical components
• Easy incorporation of distributed data/components
• Automated transformation between heterogeneous data/components
Goals:
• Visual modeling of end-to-end analytical process
• Discovery of distributed data and analytical components
• Easy incorporation of distributed data/components
• Automated transformation between heterogeneous data/components
End-to-end analytical processes:
• Are not linear
• Involve multiple data sets
• Involve multiple analytical steps
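A non-linear process with several data sets and several analytical steps can be pictured as a dependency graph rather than a script that runs top to bottom. The following is a minimal sketch of that idea in plain Python; it is not Kepler's execution model, and the step names, data values, and the `run_workflow` helper are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, dependencies, inputs):
    """Run every step after the steps (or data sets) it depends on.

    steps:        step name -> function taking a dict of upstream results
    dependencies: step name -> set of upstream names (steps or data sets)
    inputs:       data set name -> raw data
    """
    results = dict(inputs)
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:  # raw data sets are already available
            continue
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](upstream)
    return results


# Hypothetical example: two data sets feeding a branching, then merging, analysis.
inputs = {"temperature": [10.0, 12.0, 14.0], "soil": [0.2, 0.4, 0.6]}
steps = {
    "merge": lambda up: list(zip(up["temperature"], up["soil"])),
    "mean_temperature": lambda up: sum(up["temperature"]) / len(up["temperature"]),
    "summary": lambda up: {
        "rows": len(up["merge"]),
        "mean_temperature": up["mean_temperature"],
    },
}
dependencies = {
    "merge": {"temperature", "soil"},
    "mean_temperature": {"temperature"},
    "summary": {"merge", "mean_temperature"},
}

results = run_workflow(steps, dependencies, inputs)
print(results["summary"])
```

Because the order is derived from the dependencies, a step can be added or rewired without rewriting the rest of the analysis.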
Nested workflows
[Figure labels: workflow SW0 containing analysis steps (ASx, ASy, ASz, ASr) and transformation steps (TS1, TS2); Image Processing Pipeline; Signal Processing Pipeline; Integrated Field Data; inputs: Ground Sensors, Imagery; Search for relevant data and analyses (Query)]
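Nesting means a whole pipeline can be dropped into a larger workflow as if it were a single step. A minimal sketch, assuming each step is a plain function; the `pipeline` helper and the toy image and signal steps are hypothetical stand-ins, not components from the slide.

```python
def pipeline(*steps):
    """Compose steps into one callable so a pipeline can act as a single step."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run


# Hypothetical stand-ins for image-processing steps.
def rescale(pixels):
    return [p / 255 for p in pixels]


def threshold(pixels):
    return [1 if p > 0.5 else 0 for p in pixels]


# Hypothetical stand-ins for signal-processing steps.
def smooth(signal):
    return [(a + b) / 2 for a, b in zip(signal, signal[1:])]


def peak(signal):
    return max(signal)


image_pipeline = pipeline(rescale, threshold)  # nested workflow 1
signal_pipeline = pipeline(smooth, peak)       # nested workflow 2


def integrate_field_data(imagery, sensor_readings):
    """Top-level workflow: each nested pipeline is used like one step."""
    return {
        "image_mask": image_pipeline(imagery),
        "signal_peak": signal_pipeline(sensor_readings),
    }


integrated = integrate_field_data([0, 128, 255], [1.0, 3.0, 5.0])
print(integrated)
```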
Goals:
• Visual modeling of end-to-end analytical process
• Discovery of distributed data and analytical components
• Easy incorporation of distributed data/components
• Automated transformation between heterogeneous data/components
Comparison of approaches:
• Scripts: single platform
• Visual modeling: single environment
• Workflows: cross-platform, cross-environment, distributed data and analyses
Scientific Workflows
SEEK: EcoGrid => Kepler: EarthGrid
[Figure labels: Workflow archive; Compute grid; Data grid; Shared Data Registry; Metadata; Data Site 1; Data Site 2; Service Broker (UDDI); Web Service (WSDL); Algorithm; Simulation Model]
Interactions shown:
• Query Data Grid to find data; Return URL; Get Data
• Query Service broker to find services; Return URL & call functions; Get Component
• Archive output data to Grid
• Archive workflow
Goals:
• Visual modeling of end-to-end analytical process
• Discovery of distributed data and analytical components
• Easy incorporation of distributed data/components
• Automated transformation between heterogeneous data/components
Generally speaking, an ontology specifies a conceptual model by defining and relating generic concepts representing features of the real or abstract world (within a domain of interest).
Ontologies
[Figure labels: Ontology: river; Informal Conceptual Models: stream, tributary (use concepts from the ontology, explicitly or implicitly); Design Artifacts: Schema: STR, Schema: STRM, Schema: TRB, Schema: ABC; Metadata; Data]
An ontology can then be used as a standard that supports exchange and integration of heterogeneous data sources and applications.
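One way to read the slide is that differently named schemas become comparable once each is tied to a concept in a shared ontology. The sketch below illustrates that reading only. The schema-to-concept assignments and the "broader concept" links are assumptions made for the example (the slide does not state which schema maps to which concept), and `shared_concept` is a hypothetical helper.

```python
# Assumed annotations: which concept each local schema name refers to.
SCHEMA_TO_CONCEPT = {"STR": "stream", "STRM": "stream", "TRB": "tributary"}

# Assumed ontology links: each concept points to a broader concept.
BROADER = {"stream": "river", "tributary": "river"}


def concept_chain(schema):
    """Return the schema's concept followed by its broader concepts."""
    chain = []
    concept = SCHEMA_TO_CONCEPT.get(schema)
    while concept is not None:
        chain.append(concept)
        concept = BROADER.get(concept)
    return chain


def shared_concept(schema_a, schema_b):
    """Most specific concept both schemas fall under, or None if there is none."""
    chain_b = set(concept_chain(schema_b))
    for concept in concept_chain(schema_a):
        if concept in chain_b:
            return concept
    return None


print(shared_concept("STR", "STRM"))  # same concept, different names
print(shared_concept("STR", "TRB"))   # related through a broader concept
print(shared_concept("STR", "ABC"))   # ABC has no annotation in this sketch
```

A schema with no annotation, such as ABC here, cannot be matched automatically, which is why annotating data against the ontology matters for integration.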
SEEK’s Observation Ontology (OBOE)
[Figure labels: Observation, Entity, Measurement, Characteristic, Standard, Value]
Ontologies: Entity, Characteristic, and Standards
Limited functionality in Kepler currently (more coming!)
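The terms on this slide suggest a simple structure: an observation is made of an entity, and it carries measurements, each recording a value for a characteristic according to a standard. The sketch below models that reading with Python dataclasses. It is an illustration only, not the OBOE specification or a Kepler API, and the class layout and example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    characteristic: str  # what was measured, e.g. "height"
    standard: str        # unit or classification used, e.g. "meter"
    value: float         # the recorded value


@dataclass
class Observation:
    entity: str  # the thing observed, e.g. "Tree"
    measurements: List[Measurement] = field(default_factory=list)

    def value_of(self, characteristic: str) -> Optional[Tuple[float, str]]:
        """Return (value, standard) for a characteristic, or None if absent."""
        for m in self.measurements:
            if m.characteristic == characteristic:
                return (m.value, m.standard)
        return None


# Hypothetical example data.
tree = Observation(
    entity="Tree",
    measurements=[
        Measurement("height", "meter", 12.5),
        Measurement("diameter", "centimeter", 30.0),
    ],
)
print(tree.value_of("height"))
```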
Scientists design their research at the conceptual workflow level
• Often done on the fly over the period of time the research is being conducted
• For automated approaches, this must be well thought out from the beginning
• However, because of the automation, it is easy to modify the analysis and rerun it many times, so you are not locked into the original design
Productivity Example
[Figure: one analysis shown at successive levels of detail. Labels recoverable from the slide:]
• Mental Model: Biomass = f(Temp, Soil, et al.)
• Conceptual Workflow: Climate, Temp, Soil, Merge, Model, Predict, Biomass
• Abstract Workflow: Data Steps (DS), Transformation Steps (TS), Analysis Steps (AS), Dissemination
• Executable Workflow: "View1": Excel, GIS, SAS, GIS; "View2": VBScript, R Script, GA, R
Ecological niche modeling conceptual workflow
[Figure labels: +A1, +A2, +A3; Species presence and absence points; Environmental layers; EcoGrid Query; EcoGrid DataBase; Layer Integration; Scaling; Integrated layers; Sample Data; Training sample; Test sample; Data Calculation; Transformation; GARP rule set; Validation; Model quality parameters; Map Generation; Native range prediction map; Selected prediction maps; User; Generate Metadata; Archive to EcoGrid]
A second version of this slide adds the annotations: Spatial location, Temporal extent.
Generic Workflow
[Figure: same diagram as the ecological niche modeling conceptual workflow (Transformation is labeled PhysicalTransformation), with these labels generalized:]
• Occurrence data: binary, categorical or numeric
• Environmental layers
• Prediction map
Temperature Interpolation Workflow
[Figure: same diagram, with these labels:]
• Occurrence data: weather station temperature data
• Environmental layers: elevation, aspect, land cover
• Prediction map: interpolated temperature grid
Sinkhole Interpolation Workflow
[Figure: same diagram, with these labels:]
• Occurrence data: sinkhole occurrence
• Environmental layers: groundwater level, chemistry, etc.
• Prediction map: sinkhole distribution
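The three slides above keep the workflow fixed and change only the occurrence data, the environmental layers, and the meaning of the prediction map. A minimal sketch of that reuse follows. The nearest-neighbour model is a deliberately simple stand-in, not GARP, and every function name and data value here is hypothetical.

```python
def fit_nearest(training):
    """Stand-in model: predict the value of the most similar training row."""
    def predict(features):
        def distance(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], features))
        return min(training, key=distance)[1]
    return predict


def generic_workflow(occurrences, layers, fit, test_fraction=0.25):
    """Integrate layers, split samples, fit, validate, and map predictions.

    occurrences: list of (cell_id, observed_value)
    layers:      cell_id -> tuple of environmental values
    fit:         function(training_rows) -> predict(features)
    """
    rows = [(layers[cell], value) for cell, value in occurrences]
    n_test = max(1, int(len(rows) * test_fraction))
    test, training = rows[:n_test], rows[n_test:]  # fixed split, for the sketch
    predict = fit(training)
    errors = [abs(predict(features) - value) for features, value in test]
    quality = sum(errors) / len(errors)  # mean absolute error on the test sample
    prediction_map = {cell: predict(f) for cell, f in layers.items()}
    return prediction_map, quality


# Temperature interpolation: numeric occurrences, elevation as the only layer.
elevation = {"a": (100,), "b": (200,), "c": (300,), "d": (400,), "e": (500,)}
stations = [("a", 10.0), ("b", 12.0), ("c", 14.0), ("d", 16.0)]
temp_map, temp_quality = generic_workflow(stations, elevation, fit_nearest)

# Sinkhole prediction: binary occurrences, groundwater level as the only layer.
groundwater = {"a": (4.5,), "b": (1.0,), "c": (4.0,), "d": (6.0,), "e": (1.5,)}
sinkholes = [("a", 1), ("b", 0), ("c", 1), ("d", 1)]
sink_map, sink_quality = generic_workflow(sinkholes, groundwater, fit_nearest)

print(temp_map, temp_quality)
print(sink_map, sink_quality)
```

Only the arguments differ between the two calls, which is the sense in which a workflow, once built, can be reused across problems.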
Current Benefits
• Reusable analysis steps, pipelines, and workflows
• Formal documentation of methods
• Reproducibility of methods
• Visual creation and communication of methods
• Versioning