Data Workflow Management, Data Preservation and Stewardship

Data Workflow Management, Data Preservation and Stewardship Peter Fox Data Science – ITEC/CSCI/ERTH-4350/6350 Week 10, November 5, 2013

Contents • Scientific Data Workflows • Data Stewardship • Summary • Next class(es) • Projects

Scientific Data Workflow • What it is • Why you would use it • Some more detail in the context of Kepler • www.kepler-project.org • Some pointer to other workflow systems

What is a workflow? • General definition: series of tasks performed to produce a final outcome • Scientific workflow – “data analysis pipeline” • Automate tedious jobs that scientists traditionally performed by hand for each dataset • Process large volumes of data faster than scientists could do by hand

Background: Business Workflows • Example: planning a trip • Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc. • Each task may depend on outcome of previous task • Days you reserve the hotel depend on days of the flight • If hotel has shuttle service, may not need to rent a car • E.g. tripit.com?

What about scientific workflows? • Perform a set of transformations/ operations on a scientific dataset • Examples • Generating images from raw data • Identifying areas of interest in a large dataset • Classifying set of objects • Querying a web service for more information on a set of objects • Many others…

More on Scientific Workflows • Formal models of the flow of data among processing components • May be simple and linear or more complex • Can process many data types: • Archived data • Streaming sensor data • Images (e.g., medical or satellite) • Simulation output • Observational data

Challenges • Questions: • What are some challenges for scientists implementing scientific workflows? • What are some challenges to executing these workflows? • What are limitations of writing a program?

Challenges • Mastering a programming language • Visualizing workflow • Sharing/exchanging workflow • Formatting issues • Locating datasets, services, or functions

Kepler Scientific Workflow Management System • Graphical interface for developing and executing scientific workflows • Scientists can create workflows by dragging and dropping • Automates low-level data processing tasks • Provides access to data repositories, compute resources, workflow libraries

Benefits of Scientific Workflows • Documentation of aspects of analysis • Visual communication of analytical steps • Ease of testing/debugging • Reproducibility • Reuse of part or all of workflow in a different project

Additional Benefits • Integration of multiple computing environments • Automated access to distributed resources via web services and Grid technologies • System functionality to assist with integration of heterogeneous components

Why not just use a script? • Script does not specify low-level task scheduling and communication • May be platform-dependent • Can’t be easily reused • May not have sufficient documentation to be adapted for another purpose

Why is a GUI useful? • No need to learn a programming language • Visual representation of what workflow does • Allows you to monitor workflow execution • Enables user interaction • Facilitates sharing of workflows

The Kepler Project • Goals • Produce an open-source scientific workflow system • enable scientists to design scientific workflows and execute them • Support scientists in a variety of disciplines • e.g., biology, ecology, astronomy • Important features • access to scientific data • flexible means for executing complex analyses • enable use of Grid-based approaches to distributed computation • semantic models of scientific tasks • effective UI for workflow design

Distributed execution • Opportunities for parallel execution • Fine-grained parallelism • Coarse-grained parallelism • Few or no cycles • Limited dependencies among components • ‘Trivially parallel’ • Many science problems fit this mold • parameter sweep, iteration of stochastic models • Current ‘plumbing’ approaches to distributed execution • workflow acts as a controller • stages data resources • writes job description files • controls execution of jobs on nodes • requires expert understanding of the Grid system • Scientists need to focus on just the computations • try to avoid plumbing as much as possible

Distributed Kepler • Higher-order component for executing a model on one or more remote nodes • Master and slave controllers handle setup and communication among nodes, and establish data channels • Extremely easy for scientist to utilize • requires no knowledge of grid computing systems IN OUT Controller Controller Master Slave

Data Management Token Token {1,5,2} ref-276 {1,5,2} • Need for integrated management of external data • EarthGrid access is partial, need refactoring • Include other data sources, such as JDBC, OpeNDAP, etc. • Data needs to be a first class object in Kepler, not just represented as an actor • Need support for data versioning to support provenance • e.g., Need to pass data by reference • workflows contain large data tokens (100’s of megabytes) • intelligent handling of unique identifiers (e.g., LSID) A B

Managing Data Heterogeneity • Data comes from heterogeneous sources • Real-world observations • Spatial-temporal contexts • Collection/measurement protocols and procedures • Many representations for thesame information (count, area, density) • Data, Syntax, Schema, Semantic heterogeneity • Discovery and “synthesis” (integration) performed manually • Discovery often based on intuitive notion of “what is out there” • Synthesis of data is very time consuming, and limits use

Scientific workflow systems support data analysis KEPLER

A simple Kepler workflow Composite Component (Sub-workflow) Loops often used in SWFs; e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...) (T. McPhillips)

A simple Kepler workflow Lists Nexus filesto process (project) Reads text files Parses Nexus format Draws phylogenetic trees PhylipPars infers trees from discrete, multi-state characters. Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees. UniqueTrees discards redundant trees in each collection. (T. McPhillips)

A simple Kepler workflow An example workflow run, executed as a Dataflow Process Network

Smart Mediation Services (SMS) motivation • Scientific Workflow Life-cycle • Resource Discovery • discover relevant datasets • discover relevant actors or workflow templates • Workflow Design and Configuration • data  actor (data binding) • data  data (data integration / merging / interlinking) • actor  actor (actor / workflow composition) • Challenge: do all this in the presence of … • 100’s of workflows and templates • 1000’s of actors (e.g. actors for web services, data analytics, …) • 10,000’s of datasets • 1,000,000’s of data items • … highly complex, heterogeneous data – price to pay for these resources: $$$ (lots) – scientist’s time wasted: priceless!

Navigate errors and warnings within the workflow Search for and insert “adapters” to fix (structural and semantic) errors … Statically perform semantic and structural type checking Workflow validation in Kepler

+Provenance Framework • Provenance • Track origin and derivation information about scientific workflows, their runs and derived information (datasets, metadata…) • Types of Provenance Information: • Data provenance • Intermediate and end results including files and db references • Process (=workflow instance) provenance • Keep the wf definition with data and parameters used in the run • Error and execution logs • Workflow design provenance

Kepler Provenance Recording Utility • Parametric and customizable • Different report formats • Variable levels of detail • Verbose-all, verbose-some, medium, on error • Multiple cache destinations • Saves information on • User name, Date, Run, etc…

Some other workflow systems • SCIRun • Sciflo • Triana • Taverna • Pegasus • Some commercial tools: • Windows Workflow Foundation • Mac OS X Automator • http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf • http://www.isi.edu/~gil/AAAI08TutorialSlides/ • See reading for this week

Data Stewardship • Putting a number of data life cycle, management aspects together • Keep the ideas in mind as you complete your assignments • Why it is important • Some examples

Why it is important • 1976 NASA Viking mission to Mars (A. Hesseldahl, Saving Dying Data, Sep. 12, 2002, Forbes. [Online]. Available: http://www.forbes.com2002/09/12/0912data_print.html ) • 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as ﬁle formats evolve,” Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675 ) • R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt

At the heart of it • Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc. • Inability to know the inter-relations, assumptions and missing information • We’ll look at a (data) use case for this shortly • But first we will look at what, how and who in terms of the full life cycle

What to collect? • Documentation • Metadata • Provenance • Ancillary Information • Knowledge

Who does this? • Roles: • Data creator • Data analyst • Data manager • Data curator

How it is done • Opening and examining Archive Information Packages • Reviewing data management plans and documentation • Talking (!) to the people: • Data creator • Data analyst • Data manager • Data curator • Sometimes, reading the data and code

Data-Information-Knowledge Ecosystem Producers Consumers Experience Data Information Knowledge Creation Gathering Presentation Organization Integration Conversation Context

Acquisition • Learn / read what you can about the developer of the means of acquisition • Documents may not be easy to find • Remember bias!!! • Document things as you go • Have a checklist (see the Data Management list) and review it often

20080602 Fox VSTO et al.

Curation (partial) • Consider the organization and presentation of the data • Document what has been (and has not been) done • Consider and address the provenance of the data to date, you are now THE next person • Be as technology-neutral as possible • Look to add information and metainformation

Preservation • Usually refers to the full life cycle • Archiving is a component • Stewardship is one act of preservation • Intent is that ‘you can open it any time in the future’ and that ‘it will be there’ • This involves steps that may not be conventionally thought of • Think 10, 20, 50, 200 years…. looking historically gives some guide to future considerations

Some examples and experience • NASA, NOAA • http://wiki.esipfed.org/index.php/Preservation_and_Stewardship • Library community • Note: • Mostly in relation to publications, books, etc but some for data • Note that knowledge is in publications but the structure form is meant for humans not computers, despite advances in text analysis • Very little for the type of knowledge we are considering: in machine accessible form

Back in the day... NASA SEEDS Working Group on Data Lifecycle • Second Workshop Report • http://esdswg.gsfc.nasa.gov/pdf/W2_lcbo_bothwell.pdf • Many LTA recommendations • Earth Sciences Data Lifecycle Report • http://esdswg.gsfc.nasa.gov/pdf/lta_prelim_rprt2.pdf • Many lessons learned from USGS experience, plus some recommendations • SEEDS Final Report (2003) - Section 4 • http://esdswg.gsfc.nasa.gov/pdf/FinRec.pdf • Final recommendations vis a vis data lifecycle MODIS Pilot Project • GES DISC, MODAPS, NOAA/CLASS, ESDIS effort • Transferred some MODIS Level 0 data to CLASS

Mostly Technical Issues • Data Preservation • Bit-level integrity • Data readability • Documentation • Metadata • Semantics • Persistent Identifiers • Virtual Data Products • Lineage Persistence • Required ancillary data • Applicable standards

Mostly Non-Technical Issues • Policy (constrained by money…) • Front end of the lifecycle • Long-term planning, data formats, documentation... • Governance and policy • Legal requirements • Archive to archive transitions • Money (intertwined with policy) • Cost-benefit trades • Long-term needs of NASA Science Programs • User input • Identifying likely users • Levels of service • Funding source and mechanism

Use case: a real live one; deals mostly with structure and (some) content HDF4 Format "Maps"for Long Term Readability C. Lynnes, GES DISC R. Duerr and J. Crider, NSIDC M. Yang and P. Cao, The HDF Group HDF=Hierarchical Data Format NSIDC=National Snow and Ice Data Center GES=Goddard Earth Science DISC=Data and Information Service Center

In the year 2025... A user of HDF-4 data will run into the following likely hurdles: • The HDF-4 API and utilities are no longer supported... • ...now that we are at HDF-7 • The archived API binary does not work on today's OS's • ...like Android 9.1 • The source does not compile on the current OS • ...or is it the compiler version, gcc v. 12.x? • The HDF spec is too complex to write a simple read program... • ...without re-creating much of the API What to do?

HDF Mapping Files Concept: create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now) • XML • Stored separately from, but close to the data files • Includes • internal metadata • variable info • chunk-level info • byte offsets and length • linked blocks • compression information Task funded by ESDIS project • The HDF Group, NSIDC and GES DISC

Map sample (extract) <hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields" objID="xid-DFTAG_NDG-5"> <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer"> 0 0 </hdf4:Attribute> <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" /> <hdf4:Dataspace ndims="2"> 180 360 </hdf4:Dataspace> <hdf4:Datablock nblocks="1"> <hdf4:Block offset="27266625" nbytes="20582" compression="coder_type=DEFLATE" /> </hdf4:Datablock> </hdf4:SDS>

Status and Future Status • Map creation utility (part of HDF) • Prototype read programs • C • Perl • Paper in TGRS special issue • Inventory of HDF-4 data products within EOSDIS Possible Future Steps • Revise XML schema • Revise map utility and add to HDF baseline • Implement map creation and storage operationally • e.g., add to ECS or S4PA metadata files

NASA/ MODIS Contextual Info 7th Joint ESDSWG meeting, October 22, Philadelphia, PA Data Lifecycle Workshop sponsored by the Technology Infusion Working Group

Instrument/sensor characteristics Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008 Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign 50 7th Joint ESDSWG meeting, October 22, Philadelphia, PA Data Lifecycle Workshop sponsored by the Technology Infusion Working Group

Data Workflow Management, Data Preservation and Stewardship