Requirements for caBIG Infrastructure to Support Semantic Workflows
Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science
University of Southern California
gil@isi.edu
http://www.isi.edu/~gil
Outline
• Brief background on semantic workflows
• Semantic workflow representations in Wings
• Five uses of semantic workflows to assist users and their resulting requirements
  • Reproducibility
  • Validation
  • Metadata generation
  • Data discovery
  • Workflow discovery
• Requirements for architecture components
  • Ontology repositories and services
  • Data/metadata catalogs and services
  • Component/service catalogs and services
  • Workflow catalogs and services
Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
• Automation of workflow execution
• Managing distributed computation
• Managing large data sets
• Security and access control
• Provenance recording
• Low-cost, high-fidelity reproducibility
Semantics and reasoning:
• Workflow retrieval and discovery
• Automation of workflow generation
• Systematic exploration of design space
• Validation of workflows
• Automated generation of metadata
• Guarantees of data pedigree
• “Conceptual” reproducibility
Semantic Workflows in Wings [Kim et al CCPEJ 08; Gil et al IEEE eScience 09; Gil et al K-CAP 09; Kim et al IUI 06; Gil et al IEEE IS 2010]
• Workflows augmented with semantic constraints
  • Each workflow constituent has a variable associated with it
    • Nodes, links, workflow components, datasets
    • Workflow variables can represent collections of data as well as classes of software components
  • Constraints are used to restrict variables, and include:
    • Metadata properties of datasets
    • Constraints across workflow variables
  • Constraints incorporate the function of workflow components: how data is transformed
• Reasoning about semantic constraints in a workflow
  • Algorithms for semantic enrichment of workflow templates
  • Algorithms for matching queries against workflow catalogs
  • Algorithms for generating workflows from high-level user requests
  • Algorithms for generating metadata of new data products
  • Algorithms for assisting users with the creation of valid workflow templates
Semantic Workflows in WINGS
• Workflow templates
  • Dataflow diagram
  • Each constituent (node, link, component, dataset) has a corresponding variable
• Semantic properties
  • Constraints on workflow variables:
    (TestData dcdom:isDiscrete false)
    (TrainingData dcdom:isDiscrete false)
Semantic Constraints as Metadata Properties
• Constraints on reusable template (shown below)
• Constraints on current user request (shown above)

[modelerInput_not_equal_to_classifierInput:
    (:modelerInput wflow:hasDataBinding ?ds1)
    (:classifierInput wflow:hasDataBinding ?ds2)
    equal(?ds1, ?ds2)
    (?t rdf:type wflow:WorkflowTemplate)
    -> (?t wflow:isInvalid "true"^^xsd:boolean)]
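The rule above marks a template as invalid when the same dataset is bound to both the modeler input and the classifier input. A minimal Python sketch of that check follows. It assumes a plain dictionary of bindings, and the function name `template_is_invalid` is hypothetical. Wings expresses this as a rule over RDF triples, not as Python.

```python
# Hypothetical in-memory stand-in for the RDF bindings used by the rule.
def template_is_invalid(bindings):
    """Return True when modelerInput and classifierInput share a dataset.

    `bindings` maps a workflow variable name to the dataset bound to it,
    mirroring the wflow:hasDataBinding triples in the rule body.
    """
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # The rule only fires when both variables are bound and the bindings match.
    return ds1 is not None and ds1 == ds2
```

For example, binding both variables to the same training set makes the template invalid, while binding them to two different datasets does not.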
Uses of Semantic Workflows: 1) Easily Replicate Previously Published Results
• A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses
  • Many implementations of the same algorithm, some proprietary
  • Same implementation but new versions and bug fixes
• With such a catalog, the effort involved in reproducing results is greatly reduced
• Semantics needed to help users apply workflows correctly
Resulting Requirements (1)
• Semantic representations of workflows need to abstract from software implementation
  • Representing abstract classes of software components
    • Instances are the implemented codes
    • Workflow steps refer to component classes
  • Representing abstract kinds of data (e.g., excluding format)
• Semantic reasoning needed to specialize workflows
  • To map the abstract workflow into an execution-ready workflow
  • To insert lower-level steps (e.g., data transformations)
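The specialization step described above can be sketched in Python under simplifying assumptions: a linear workflow, one implementation per component class, and a lookup table of format converters. All names here (`Implementation`, `CATALOG`, `CONVERTERS`, `specialize`, and the component and format names) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    name: str             # an implemented code (an instance)
    component_class: str  # the abstract class that workflow steps refer to
    input_format: str
    output_format: str


# Hypothetical component catalog and converter table.
CATALOG = [
    Implementation("csv_normalizer_v2", "Normalizer", "csv", "csv"),
    Implementation("arff_tree_modeler", "DecisionTreeModeler", "arff", "model"),
]
CONVERTERS = {("csv", "arff"): "csv_to_arff"}


def specialize(abstract_steps, input_format):
    """Map component classes to implementations, inserting converters."""
    plan = []
    current = input_format
    for cls in abstract_steps:
        impl = next((i for i in CATALOG if i.component_class == cls), None)
        if impl is None:
            raise LookupError(f"no implementation for component class {cls!r}")
        if impl.input_format != current:
            # Insert a lower-level data transformation step.
            converter = CONVERTERS.get((current, impl.input_format))
            if converter is None:
                raise LookupError(
                    f"no converter from {current!r} to {impl.input_format!r}")
            plan.append(converter)
        plan.append(impl.name)
        current = impl.output_format
    return plan
```

With these assumptions, `specialize(["Normalizer", "DecisionTreeModeler"], "csv")` yields an execution-ready plan in which a `csv_to_arff` conversion is inserted between the two implemented steps.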
Uses of Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods
• Analytic software and methods are well documented, but only as text (papers, manuals, etc.)
  • Time-consuming to read, hard to spot interdependencies, no validation
• Semantics needed to guide users to set up workflows correctly and customize them to their datasets and goals
Requirements (2)
• Semantic workflows can check constraints and guide users
  • Representing requirements of software components
    • Constraints on input data
    • Constraints on parameter settings given properties of input data
  • Representing metadata properties of datasets
• Semantic reasoning needed:
  • To check constraints of each workflow step
  • To propagate constraints across the workflow
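One way to picture constraint propagation is to walk a linear workflow backward and collect what the initial input must satisfy. The sketch below assumes each step declares the metadata it requires on its input and the metadata it guarantees on its output. The step names, property names, and functions are hypothetical, and parameter constraints are left out for brevity.

```python
# Hypothetical step descriptions: "requires" lists metadata the step's input
# must have; "sets" lists metadata the step guarantees on its output.
DISCRETIZE = {"name": "Discretize",
              "requires": {"hasMissingValues": False},
              "sets": {"isDiscrete": True}}
ID3_MODELER = {"name": "ID3Modeler",
               "requires": {"isDiscrete": True, "hasMissingValues": False},
               "sets": {}}


def propagate_backward(steps):
    """Return the constraints the workflow's initial input must satisfy."""
    required = {}
    for step in reversed(steps):
        for prop, value in step["sets"].items():
            if prop in required and required[prop] != value:
                raise ValueError(f"{step['name']} produces {prop}={value!r}, "
                                 f"but a later step needs {required[prop]!r}")
            required.pop(prop, None)  # satisfied by this step's output
        for prop, value in step["requires"].items():
            if required.get(prop, value) != value:
                raise ValueError(f"conflicting requirements on {prop!r}")
            required[prop] = value
    return required


def violations(input_metadata, steps):
    """List the propagated constraints that the user's dataset fails."""
    needed = propagate_backward(steps)
    return [f"{prop} must be {value!r}"
            for prop, value in sorted(needed.items())
            if input_metadata.get(prop) != value]
```

Here the modeler's need for discrete data is met by the discretization step, so only the missing-values constraint reaches the user's dataset and can be reported before anything is executed.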
Uses of Semantic Workflows: 3) Automatic Generation of Metadata
• Metadata annotations are tedious and involved
  • Often not done, an obstacle to sharing and to reuse
• Semantic workflows can automate the generation of metadata for analysis data products
Requirements (3)
• Semantic representations needed:
  • Representing expected characteristics of output dataset for each software component given the input metadata
  • Representing metadata properties of input datasets
• Semantic reasoning needed:
  • To propagate metadata for each workflow step
  • To propagate metadata across the workflow
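Forward metadata propagation can be sketched as a chain of per-component rules, each deriving output metadata from input metadata. The rule functions and property names below are hypothetical examples, not the actual Wings component descriptions.

```python
def normalize_rule(md):
    """Normalization keeps all properties and marks the data as normalized."""
    return {**md, "isNormalized": True}


def sample_rule(md):
    """Hypothetical 50% sampling step: halves the sample count."""
    return {**md, "numSamples": md["numSamples"] // 2, "isSampled": True}


def propagate(input_md, rules):
    """Return the metadata of every data product in a linear workflow.

    The first entry describes the input dataset; each later entry is the
    automatically generated metadata of one step's output.
    """
    trace = [dict(input_md)]
    for rule in rules:
        trace.append(rule(trace[-1]))
    return trace
```

Calling `propagate` with the metadata of a user's dataset and the rules `[normalize_rule, sample_rule]` gives annotations for the intermediate and final data products without any manual annotation effort.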
Uses of Semantic Workflows: 4) Discovery of Relevant Data
• Workflows reused from a catalog may require additional data besides what is provided by the user
• Semantic workflows can help identify characteristics of required datasets and query data catalogs to find them for the user
• Example user question: “Need a dataset of updated common (known) loci to annotate findings, where can I find one?”
Requirements (4)
• Semantic representations needed:
  • Metadata properties of any additional input datasets in the workflow, including:
    • Default properties for the given workflow
    • Augmented properties that result from the specific input data provided by the user
• Semantic reasoning needed:
  • Propagation of semantic constraints through the workflow
  • Formulation of queries to data catalogs based on semantic properties required of datasets in the workflow
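As an illustration of query formulation, the sketch below merges default and augmented properties and renders them as a SPARQL query. It assumes the data catalog is queryable with SPARQL, which the slides do not state. The namespace URI is a placeholder, and the function and property names are hypothetical.

```python
def required_properties(defaults, augmented):
    """Combine workflow defaults with properties derived from the user's data.

    Augmented properties win when both sources mention the same property.
    """
    return {**defaults, **augmented}


def literal(value):
    """Render a Python value as a SPARQL literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_sparql(props, prefix="dcdom", namespace="http://example.org/dcdom#"):
    """Build a query for datasets with all required properties.

    The namespace is a placeholder, not a real ontology URI.
    """
    lines = [f"PREFIX {prefix}: <{namespace}>", "SELECT ?dataset WHERE {"]
    for name in sorted(props):
        lines.append(f"  ?dataset {prefix}:{name} {literal(props[name])} .")
    lines.append("}")
    return "\n".join(lines)
```

A workflow default such as `{"isDiscrete": False}` combined with a property derived from the user's data, such as `{"domain": "weather"}`, produces one triple pattern per required property.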
Uses of Semantic Workflows: 5) Retrieval of Workflows
• Hard to find workflows for the type of analysis a user wants
  • Semantic information is not provided when creating the workflow
  • However, retrieval queries are often based on metadata properties of data
    • e.g., “Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [<- constraint on intermediate data products]”
• Semantic workflows needed to augment user-provided workflows with semantic constraints from metadata catalogs and component catalogs
Requirements (5)
• Semantic representations are needed:
  • For workflow constituents
    • Metadata properties of input, intermediate and final data products
    • Metadata properties of workflow and component function
  • For user queries
    • Express workflow sketches containing partial data descriptions (constraints)
• Reasoning capabilities
  • Automatic creation of metadata for expected workflow data products
  • Workflow matching to queries (exact and partial)
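Exact and partial matching can be sketched as scoring how many of a query's constraints a workflow's annotations satisfy. This is a deliberately simple stand-in for the semantic matching the slides call for. The slot names, property names, and functions are hypothetical.

```python
def match(query, workflow):
    """Return the fraction of query constraints the workflow satisfies.

    Both arguments map a data slot (e.g. "input", "intermediate") to a
    dictionary of metadata properties. A score of 1.0 is an exact match.
    """
    total = hits = 0
    for slot, constraints in query.items():
        annotated = workflow.get(slot, {})
        for prop, value in constraints.items():
            total += 1
            hits += annotated.get(prop) == value
    return hits / total if total else 1.0


def rank(query, catalog):
    """Return (score, name) pairs for matching workflows, best first."""
    scored = [(match(query, wf), name) for name, wf in catalog.items()]
    return sorted(((s, n) for s, n in scored if s > 0),
                  key=lambda pair: (-pair[0], pair[1]))
```

For the decision tree query above, a workflow annotated with continuous inputs that have missing values and a decision tree model as an intermediate product scores 1.0, while one that produces a different model type still appears as a partial match with a lower score.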
Requirements on Core Ontology Repositories and Services
• Component/service ontologies
  • Extend with semantic representations that support reasoning, not just their execution
• Workflow ontologies
  • Develop workflow ontologies that enable shared workflow repositories
  • Develop semantic layer for the workflow ontologies
  • Workflow steps must be able to represent component classes
• Support reasoning about workflows in all architecture components
Requirements on Data/Metadata Catalogs and Services
• Representing abstract kinds of data (e.g., excluding format)
• Representing metadata properties that are relevant to data analysis
  • E.g., the organization that contributed the data may be less relevant than the instrument used to collect it, its calibration, its quality and accuracy, etc.
Requirements on Component/Service Catalogs and Services
• Represent abstract classes of software components
  • Instances correspond to implemented codes/services
• Represent constraints on input data
  • Metadata properties that make the component appropriate for a given input dataset
• Represent constraints on output data
  • Metadata properties of expected input datasets given the required outcome of the execution of the component
• Represent constraints on parameter values
  • Constraints on parameter settings given properties of input or output data
• Represent how metadata properties of inputs are related to metadata of outputs
  • Metadata properties of output datasets given the properties of the input datasets
Requirements on Workflow Catalogs and Services
• Semantic reasoning to specialize workflows
  • Given user requirements and a high-level workflow, automatically generate valid execution-ready workflows
  • Automatically insert lower-level steps when needed (e.g., data format conversions)
• Semantic reasoning to propagate constraints of each workflow step
  • Check constraints of each workflow step and propagate them throughout the workflow
  • Incorporate constraints coming from the user’s requirements with constraints from the individual steps of the workflow
• Formulation of data catalog queries based on the metadata properties of a given dataset in the workflow
• Workflow discovery and matching for a given user query
  • Need a language to express user queries as workflow sketches containing partial data descriptions (constraints) and partial dataflow patterns
  • Need semantic reasoning for matching such queries, both exact and partial matching