Data Workflow Management, Data Preservation and Stewardship Peter Fox Data Science – ITEC/CSCI/ERTH-6961 Week 10, November 6, 2012
Contents • Scientific Data Workflows • Data Stewardship • Summary • Next class(es)
Scientific Data Workflow • What it is • Why you would use it • Some more detail in the context of Kepler • www.kepler-project.org • Some pointers to other workflow systems
What is a workflow? • General definition: series of tasks performed to produce a final outcome • Scientific workflow – “data analysis pipeline” • Automate tedious jobs that scientists traditionally performed by hand for each dataset • Process large volumes of data faster than scientists could do by hand
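To make the "pipeline" notion concrete, here is a minimal Python sketch, not tied to any workflow system; the step names are hypothetical. Each step consumes the previous step's output, which is exactly the structure a workflow engine automates:

```python
# Minimal sketch of a "data analysis pipeline": each step consumes the
# previous step's output. Function names and steps are hypothetical.
def read_dataset(path):
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def clean(values):
    return [v for v in values if v >= 0]          # drop invalid readings

def summarize(values):
    return {"n": len(values), "mean": sum(values) / len(values)}

def pipeline(path):
    # the workflow: read -> clean -> summarize
    return summarize(clean(read_dataset(path)))
```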
Background: Business Workflows • Example: planning a trip • Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc. • Each task may depend on outcome of previous task • Days you reserve the hotel depend on days of the flight • If hotel has shuttle service, may not need to rent a car • E.g. tripit.com?
What about scientific workflows? • Perform a set of transformations/ operations on a scientific dataset • Examples • Generating images from raw data • Identifying areas of interest in a large dataset • Classifying set of objects • Querying a web service for more information on a set of objects • Many others…
More on Scientific Workflows • Formal models of the flow of data among processing components • May be simple and linear or more complex • Can process many data types: • Archived data • Streaming sensor data • Images (e.g., medical or satellite) • Simulation output • Observational data
Challenges • Questions: • What are some challenges for scientists implementing scientific workflows? • What are some challenges to executing these workflows? • What are limitations of writing a program?
Challenges • Mastering a programming language • Visualizing workflow • Sharing/exchanging workflow • Formatting issues • Locating datasets, services, or functions
Kepler Scientific Workflow Management System • Graphical interface for developing and executing scientific workflows • Scientists can create workflows by dragging and dropping • Automates low-level data processing tasks • Provides access to data repositories, compute resources, workflow libraries
Benefits of Scientific Workflows • Documentation of aspects of analysis • Visual communication of analytical steps • Ease of testing/debugging • Reproducibility • Reuse of part or all of workflow in a different project
Additional Benefits • Integration of multiple computing environments • Automated access to distributed resources via web services and Grid technologies • System functionality to assist with integration of heterogeneous components
Why not just use a script? • Script does not specify low-level task scheduling and communication • May be platform-dependent • Can’t be easily reused • May not have sufficient documentation to be adapted for another purpose
Why is a GUI useful? • No need to learn a programming language • Visual representation of what workflow does • Allows you to monitor workflow execution • Enables user interaction • Facilitates sharing of workflows
The Kepler Project • Goals • Produce an open-source scientific workflow system • enable scientists to design scientific workflows and execute them • Support scientists in a variety of disciplines • e.g., biology, ecology, astronomy • Important features • access to scientific data • flexible means for executing complex analyses • enable use of Grid-based approaches to distributed computation • semantic models of scientific tasks • effective UI for workflow design
Usage statistics • Projects using Kepler: • SEEK (ecology) • SciDAC (molecular biology, ...) • CPES (plasma simulation) • GEON (geosciences) • CiPRes (phylogenetics) • CalIT2 • ROADnet (real-time data) • LOOKING (oceanography) • CAMERA (metagenomics) • Resurgence (computational chemistry) • NORIA (ocean observing CI) • NEON (ecology observing CI) • ChIP-chip (genomics) • COMET (environmental science) • Cheshire Digital Library (archival) • Digital preservation (DIGARCH) • Cell Biology (Scripps) • DART (X-ray crystallography) • Ocean Life • Assembling the Tree of Life project • Processing Phylodata (pPOD) • FermiLab (particle physics) • Source code access • 154 people accessed source code • 30 members have write permission • Kepler downloads: total = 9204, beta = 6675 [Chart: downloads by platform, Windows vs. Macintosh]
Distributed execution • Opportunities for parallel execution • Fine-grained parallelism • Coarse-grained parallelism • Few or no cycles • Limited dependencies among components • ‘Trivially parallel’ • Many science problems fit this mold • parameter sweep, iteration of stochastic models • Current ‘plumbing’ approaches to distributed execution • workflow acts as a controller • stages data resources • writes job description files • controls execution of jobs on nodes • requires expert understanding of the Grid system • Scientists need to focus on just the computations • try to avoid plumbing as much as possible
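As a hedged illustration of the "trivially parallel" case above, here is a parameter sweep fanned out with Python's multiprocessing, standing in for the Grid plumbing the slide describes; run_model is a hypothetical stand-in for a stochastic model:

```python
from multiprocessing import Pool

def run_model(growth_rate):
    """Hypothetical model stand-in: one independent run per parameter."""
    population = 100.0
    for _ in range(50):
        population *= growth_rate
    return growth_rate, population

if __name__ == "__main__":
    sweep = [0.95, 1.00, 1.01, 1.02, 1.05]        # the parameter sweep
    with Pool() as pool:                           # runs are independent: no cycles,
        results = pool.map(run_model, sweep)       # no inter-task communication
    for rate, pop in results:
        print(f"rate={rate}: final population {pop:.1f}")
```

The scientist writes only run_model; the fan-out and collection of results is the part a workflow system should hide.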
Distributed Kepler • Higher-order component for executing a model on one or more remote nodes • Master and slave controllers handle setup and communication among nodes, and establish data channels • Extremely easy for scientists to use • requires no knowledge of grid computing systems [Diagram: IN → master controller ↔ slave controller → OUT]
Data Management • Need for integrated management of external data • EarthGrid access is partial, needs refactoring • Include other data sources, such as JDBC, OPeNDAP, etc. • Data needs to be a first-class object in Kepler, not just represented as an actor • Need support for data versioning to support provenance • e.g., need to pass data by reference • workflows contain large data tokens (100s of megabytes) • intelligent handling of unique identifiers (e.g., LSID) [Diagram: actor A passes a token to actor B, either by value ({1,5,2}) or by reference (ref-276)]
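A rough sketch of the by-value vs. by-reference distinction in the diagram; the token classes and the data store are hypothetical, not Kepler's actual API:

```python
# Hypothetical tokens: a ValueToken copies the data between actors; a
# RefToken carries only an identifier into a shared store, so a large
# dataset is not duplicated at every connection.
class ValueToken:
    def __init__(self, data):
        self.data = data                  # payload travels with the token

DATA_STORE = {}                           # stand-in for a shared data repository

class RefToken:
    def __init__(self, ref_id):
        self.ref_id = ref_id              # only the identifier travels

    def resolve(self):
        return DATA_STORE[self.ref_id]    # dereference on demand

DATA_STORE["ref-276"] = [1, 5, 2]         # the example token from the slide
assert RefToken("ref-276").resolve() == ValueToken([1, 5, 2]).data
```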
Science Environment for Ecological Knowledge SEEK is an NSF-funded, multidisciplinary research project to facilitate … Access to distributed ecological, environmental, and biodiversity data • Enable data sharing & reuse • Enhance data discovery at global scales Scalable analysis and synthesis • Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues • Enable communication and collaboration for analysis • Enable reuse of analytical components • Support scientific workflow design and modeling
SEEK data access, analysis, mediation • Data Access (EcoGrid) • Distributed data network for environmental, ecological, and systematics data • Interoperate diverse environmental data systems • Workflow Tools (Kepler) • Problem-solving environment for scientific data analysis and visualization: "scientific workflows" • Semantic Mediation (SMS) • Leverage ontologies for "smart" data/component discovery and integration
Managing Data Heterogeneity • Data comes from heterogeneous sources • Real-world observations • Spatial-temporal contexts • Collection/measurement protocols and procedures • Many representations for the same information (count, area, density) • Data, syntax, schema, and semantic heterogeneity • Discovery and "synthesis" (integration) performed manually • Discovery often based on an intuitive notion of "what is out there" • Synthesis of data is very time-consuming, which limits use
A simple Kepler workflow [Screenshot: workflow containing a composite component (sub-workflow)] • Loops are often used in scientific workflows, e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...) (T. McPhillips)
A simple Kepler workflow [Screenshot: actors that list the Nexus files to process (project), read text files, parse the Nexus format, and draw phylogenetic trees] • PhylipPars infers trees from discrete, multi-state characters • The workflow runs PhylipPars iteratively to discover all of the most parsimonious trees • UniqueTrees discards redundant trees in each collection (T. McPhillips)
A simple Kepler workflow An example workflow run, executed as a Dataflow Process Network
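A hedged sketch of the dataflow-process-network idea behind such a run: each actor is an independent worker that consumes tokens from an input queue and emits tokens on an output queue. This is illustrative Python, not Kepler's Java implementation, and the actor functions are stand-ins:

```python
import queue, threading

def actor(inbox, outbox, transform):
    """Generic dataflow actor: consume tokens, emit transformed tokens."""
    while True:
        token = inbox.get()
        if token is None:            # end-of-stream marker
            outbox.put(None)
            return
        outbox.put(transform(token))

# Queues are the channels of the process network.
files_q, text_q, trees_q = queue.Queue(), queue.Queue(), queue.Queue()

# Hypothetical stand-ins for the actors on the slide.
read_file = lambda name: f"contents of {name}"
infer_tree = lambda text: f"tree inferred from {text}"

threading.Thread(target=actor, args=(files_q, text_q, read_file)).start()
threading.Thread(target=actor, args=(text_q, trees_q, infer_tree)).start()

for name in ["a.nex", "b.nex"]:      # "lists Nexus files to process"
    files_q.put(name)
files_q.put(None)

while (tree := trees_q.get()) is not None:
    print(tree)
```

Because actors only block on their queues, downstream actors can start working as soon as the first token arrives; this pipelined concurrency is what the Dataflow Process Network model provides.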
Semantic Mediation Services (SMS) motivation • Scientific workflow life-cycle • Resource discovery • discover relevant datasets • discover relevant actors or workflow templates • Workflow design and configuration • data ↔ actor (data binding) • data ↔ data (data integration / merging / interlinking) • actor ↔ actor (actor / workflow composition) • Challenge: do all this in the presence of … • 100s of workflows and templates • 1000s of actors (e.g., actors for web services, data analytics, …) • 10,000s of datasets • 1,000,000s of data items • … highly complex, heterogeneous data – price to pay for these resources: $$$ (lots) – scientist's time wasted: priceless!
Approach & SMS capabilities • Annotations "connect" resources to ontologies • Conceptually describe a resource and/or its "data schema" • Annotations provide the means for ontology-based discovery, integration, … [Diagram: iterative development cycle over ontologies – semantic annotation, resource discovery, workflow validation, resource integration, workflow elaboration]
“Hybrid” types … Semantic + Structural Typing • Structural types: given a structural type language S, datasets, inputs, and outputs can be assigned structural types, e.g. S = SpeciesData(site, day, spp, occ) • Semantic types: given an ontology language O (e.g., OWL-DL), datasets, inputs, and outputs can be assigned ontology types, e.g. O = Observation ⊓ ∃obsProperty.SpeciesOccurrence • Two actors A1, A2 may be semantically compatible (Oout ⊑ Oin) but structurally incompatible (Sout ≠ Sin) • Semantic & structural types can be combined using logic constraints, e.g.: ∀(site, day, sp, occ): SpeciesData(site, day, sp, occ) ⇒ ∃y: Observation(y) ∧ obsProp(y, occ) ∧ SpeciesOccurrence(occ)
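A loose sketch of the hybrid-typing idea in code: two ports can agree semantically while their structural types still differ, which is exactly the adapter case. The ontology, class names, and port layouts below are illustrative:

```python
# Semantic types as a toy subsumption hierarchy (child -> parent).
ONTOLOGY = {"SpeciesOccurrence": "Observation", "Observation": "Thing"}

def subsumed_by(sub, sup):
    """True if `sub` is `sup` or a descendant of it in the toy ontology."""
    while sub is not None:
        if sub == sup:
            return True
        sub = ONTOLOGY.get(sub)
    return False

# Ports carry both a semantic type and a structural type (a column tuple).
a1_out = {"semantic": "SpeciesOccurrence", "structural": ("site", "day", "spp", "occ")}
a2_in  = {"semantic": "Observation",       "structural": ("site", "date", "occurrence")}

semantically_ok = subsumed_by(a1_out["semantic"], a2_in["semantic"])   # True
structurally_ok = a1_out["structural"] == a2_in["structural"]          # False
print(semantically_ok, structurally_ok)  # compatible in meaning, needs an adapter
```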
Semantic Type Annotation in Kepler • Component input and output port annotation • Each port can be annotated with multiple classes from multiple ontologies • Annotations are stored within the component metadata
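A minimal sketch of what port annotation might look like as data; this is a hypothetical structure, not Kepler's actual component metadata format:

```python
# Hypothetical component metadata: each port maps to one or more
# ontology classes, possibly drawn from different ontologies.
component = {
    "name": "SpeciesOccurrencePlotter",
    "ports": {
        "in":  {"annotations": ["obs:SpeciesOccurrence", "geo:SpatialPoint"]},
        "out": {"annotations": ["viz:Map"]},
    },
}

# Annotations travel with the component, so the library can index them.
for port, meta in component["ports"].items():
    print(port, "->", meta["annotations"])
```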
Component Annotation and Indexing • Component Annotations • New components can be annotated and indexed into the component library (e.g., specializing generic actors) • Existing components can also be revised, annotated, and indexed (hiding previous versions)
Approach & SMS capabilities • Ontology-based "smart" search • Find components by semantic types • Find components by input/output semantic types • Ontology-based query rewriting for discovery/integration • Joint work with GEON project (see SSDBM-04, SWDB-04) [Diagram: the same iterative development cycle, here at the resource discovery step]
Smart Search [Screenshot: browse for components; search by component name; search by category / keyword] • Find a component (here: an actor) in different locations ("categories") • … based on the semantic annotation of the component (or its ports)
Searching in context • Search for components with compatible input/output semantic types • … searches over actor library • … applies subsumption checking on port annotations
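Continuing the toy ontology from the earlier hybrid-typing sketch, a hedged illustration of context search: scan an actor library for components whose input annotation subsumes the type a given output port produces. All actor names are hypothetical:

```python
ONTOLOGY = {"SpeciesOccurrence": "Observation", "Observation": "Thing"}

def subsumed_by(sub, sup):
    while sub is not None:
        if sub == sup:
            return True
        sub = ONTOLOGY.get(sub)
    return False

# Toy actor library: each actor advertises its input port's semantic type.
LIBRARY = [
    {"name": "ObservationFilter", "input": "Observation"},
    {"name": "OccurrenceMapper",  "input": "SpeciesOccurrence"},
    {"name": "ImageCropper",      "input": "Image"},
]

def compatible_actors(output_type):
    """Actors whose input port accepts `output_type` (subsumption check)."""
    return [a["name"] for a in LIBRARY if subsumed_by(output_type, a["input"])]

print(compatible_actors("SpeciesOccurrence"))
# -> ['ObservationFilter', 'OccurrenceMapper']; ImageCropper is filtered out
```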
Approach & SMS capabilities • Workflow validation and analysis • Check that workflows are semantically & structurally well-typed • Infer semantic type annotations of derived data (i.e., type inference) • An initial approach and prototype based on mapping composition (see QLQP-05) • User-oriented provenance • Collect & query data lineage of workflow runs (see IPAW-06) [Diagram: the same iterative development cycle, here at the workflow validation step]
Workflow validation in Kepler [Screenshot] • Statically perform semantic and structural type checking • Navigate errors and warnings within the workflow • Search for and insert "adapters" to fix (structural and semantic) errors …
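A rough sketch of static validation over one connection, reusing the toy types above: report an error when semantic types clash, and suggest an adapter when the semantic types line up but the structures differ. Names and record layouts are hypothetical:

```python
ONTOLOGY = {"SpeciesOccurrence": "Observation"}

def subsumed_by(sub, sup):
    while sub is not None:
        if sub == sup:
            return True
        sub = ONTOLOGY.get(sub)
    return False

def validate_connection(out_port, in_port):
    """Static check of one connection: semantic and structural typing."""
    if not subsumed_by(out_port["semantic"], in_port["semantic"]):
        return "error: semantically incompatible"
    if out_port["structural"] != in_port["structural"]:
        return "warning: structurally incompatible - insert an adapter"
    return "ok"

out_port = {"semantic": "SpeciesOccurrence", "structural": ("site", "day", "occ")}
in_port  = {"semantic": "Observation",       "structural": ("site", "date", "occ")}
print(validate_connection(out_port, in_port))
# -> warning: structurally incompatible - insert an adapter
```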
Approach & SMS capabilities • Integrating and transforming data • Merge ("smart union") datasets • Find mappings between data schemas for transformation • data binding, component connections (see DILS-04) [Diagram: the same iterative development cycle, here at the resource integration step]
Smart (Data) Integration: Merge • Discover data of interest • … connect it to the merge actor • … "compute merge" • align attributes via annotations • open a dialog for user refinement • store the merge mapping in MoML • … enjoy! • … your merged dataset • (almost – real cases can be much more complicated)
Under the hood of "Smart Merge" … • Exploits semantic type annotations and ontology definitions to find mappings between sources (e.g., aligning the Site and Biomass attributes of two differently structured tables) • Executing the merge actor results in an integrated data product (via "outer union") [Figure: two source tables with different attribute layouts are aligned by annotation and combined into a single merge result]
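A hedged sketch of the "outer union" step using pandas; the column names and the hand-written attribute alignment stand in for what SMS would derive from the semantic annotations:

```python
import pandas as pd

# Two hypothetical sources describing the same information differently.
src1 = pd.DataFrame({"site": ["a", "b"], "biomass": [5.0, 6.0], "count": [10, 11]})
src2 = pd.DataFrame({"location": ["a", "c", "d"], "bio_g": [0.1, 0.2, 0.3]})

# Attribute alignment, written by hand here; SMS derives this mapping
# from the semantic annotations of each column.
src2 = src2.rename(columns={"location": "site", "bio_g": "biomass"})

# "Outer union": keep every row and every column; missing cells become NaN.
merged = pd.concat([src1, src2], ignore_index=True)
print(merged)
```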
Approach & SMS capabilities • Workflow design support • (Semi-)automatically combine resource discovery, integration, and validation • Abstract workflow → executable workflow • … ongoing work! [Diagram: the same iterative development cycle, extended with automated SWF refinement at the workflow elaboration step]
Initial Work on Provenance Framework • Provenance • Track origin and derivation information about scientific workflows, their runs, and derived information (datasets, metadata, …) • Need for provenance • Association of process and results • reproduce results • "explain & debug" results (via lineage tracing, parameter settings, …) • optimize: "smart re-runs" • Types of provenance information: • Data provenance • Intermediate and end results, including files and DB references • Process (= workflow instance) provenance • Keep the workflow definition with the data and parameters used in the run • Error and execution logs • Workflow design provenance (quite different) • Workflow design is a (little-supported) process (art, magic, …) • for free via CVS: edit history • need more "structure" (e.g., templates) for individual & collaborative workflow design
Kepler Provenance Recording Utility • Parametric and customizable • Different report formats • Variable levels of detail • Verbose-all, verbose-some, medium, on error • Multiple cache destinations • Saves information on • User name, Date, Run, etc…
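As a hedged illustration of what such a recorder might capture per run, here is a made-up record layout; it is not the utility's actual format, just the kinds of fields the slide lists:

```python
import getpass, json
from datetime import datetime, timezone

def record_run(workflow, parameters, outputs, level="medium"):
    """Hypothetical provenance record: who ran what, when, and with what."""
    return {
        "user": getpass.getuser(),
        "date": datetime.now(timezone.utc).isoformat(),
        "workflow": workflow,
        "parameters": parameters,       # enough to reproduce the run
        "outputs": outputs,             # references to derived data
        "detail_level": level,          # e.g. verbose-all ... on-error
    }

rec = record_run("tree_inference", {"iterations": 50}, ["trees/run42.nex"])
print(json.dumps(rec, indent=2))        # one possible report format (JSON)
```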
Provenance: Possible Next Steps • Provenance Meeting: Last week at SDSC • Deciding on terms and definitions • .kar file generation, registration and search for provenance information • Possible data/metadata formats • Automatic report generation from accumulated data • A GUI to keep track of the changes • Adding provenance repositories • A relational schema for the provenance info in addition to the existing XML
Some other workflow systems • SCIRun • Sciflo • Triana • Taverna • Pegasus • Some commercial tools: • Windows Workflow Foundation • Mac OS X Automator • http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf • http://www.isi.edu/~gil/AAAI08TutorialSlides/ • See reading for this week
Data Stewardship • Putting a number of data life-cycle and management aspects together • Keep the ideas in mind as you complete your assignments • Why it is important • Some examples
Why it is important • 1976 NASA Viking mission to Mars (A. Hesseldahl, "Saving Dying Data," Forbes, Sep. 12, 2002. [Online]. Available: http://www.forbes.com/2002/09/12/0912data_print.html) • 1986 BBC Digital Domesday (A. Jesdanun, "Digital memory threatened as file formats evolve," Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675) • R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, "The international polar year: Making data available for the long-term," in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt
At the heart of it • Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc. • Inability to know the inter-relations, assumptions and missing information • We’ll look at a (data) use case for this shortly • But first we will look at what, how and who in terms of the full life cycle
What to collect? • Documentation • Metadata • Provenance • Ancillary Information • Knowledge
Who does this? • Roles: • Data creator • Data analyst • Data manager • Data curator
How it is done • Opening and examining Archive Information Packages • Reviewing data management plans and documentation • Talking (!) to the people: • Data creator • Data analyst • Data manager • Data curator • Sometimes, reading the data and code