Data Workflow Management, Data Stewardship Peter Fox Data Science – ITEC/CSCI/ERTH-6961-01 Week 11, November 15, 2011
Contents • Scientific Data Workflows • Data Stewardship • Summary • Next class(es)
Scientific Data Workflow • What it is • Why you would use it • Some more detail in the context of Kepler • www.kepler-project.org • Some pointer to other workflow systems
What is a workflow? • General definition: series of tasks performed to produce a final outcome • Scientific workflow – “data analysis pipeline” • Automate tedious jobs that scientists traditionally performed by hand for each dataset • Process large volumes of data faster than scientists could do by hand
Background: Business Workflows • Example: planning a trip • Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc. • Each task may depend on outcome of previous task • Days you reserve the hotel depend on days of the flight • If hotel has shuttle service, may not need to rent a car
What about scientific workflows? • Perform a set of transformations/ operations on a scientific dataset • Examples • Generating images from raw data • Identifying areas of interest in a large dataset • Classifying set of objects • Querying a web service for more information on a set of objects • Many others…
More on Scientific Workflows • Formal models of the flow of data among processing components • May be simple and linear or more complex • Can process many data types: • Archived data • Streaming sensor data • Images (e.g., medical or satellite) • Simulation output • Observational data
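The "formal model of the flow of data among processing components" above can be sketched in a few lines. This is a minimal, hypothetical linear pipeline, not code from Kepler or any other system; every function name and the sample data are invented for illustration.

```python
# Minimal sketch of a linear "data analysis pipeline": each stage is a
# function, and the workflow is their composition over a dataset.
# All names and data here are illustrative.

def acquire():
    # stand-in for reading archived, streaming, or observational data
    return [3.0, -1.0, 4.0, 1.5, -9.0]

def clean(records):
    # drop values flagged as invalid (here: negatives)
    return [r for r in records if r >= 0]

def transform(records):
    # stand-in for a calibration or unit conversion step
    return [r * 2.0 for r in records]

def summarize(records):
    return {"n": len(records), "mean": sum(records) / len(records)}

def run_workflow():
    # the workflow is just the declared order of the stages
    return summarize(transform(clean(acquire())))

print(run_workflow())
```

A workflow system generalizes this composition: stages become reusable components, and the connections among them are declared graphically rather than hard-coded.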
Challenges • Questions: • What are some challenges for scientists implementing scientific workflows? • What are some challenges to executing these workflows? • What are limitations of writing a program?
Challenges • Mastering a programming language • Visualizing workflow • Sharing/exchanging workflow • Formatting issues • Locating datasets, services, or functions
Kepler Scientific Workflow Management System • Graphical interface for developing and executing scientific workflows • Scientists can create workflows by dragging and dropping • Automates low-level data processing tasks • Provides access to data repositories, compute resources, workflow libraries
Benefits of Scientific Workflows • Documentation of aspects of analysis • Visual communication of analytical steps • Ease of testing/debugging • Reproducibility • Reuse of part or all of workflow in a different project
Additional Benefits • Integration of multiple computing environments • Automated access to distributed resources via web services and Grid technologies • System functionality to assist with integration of heterogeneous components
Why not just use a script? • Script does not specify low-level task scheduling and communication • May be platform-dependent • Can’t be easily reused • May not have sufficient documentation to be adapted for another purpose
Why is a GUI useful? • No need to learn a programming language • Visual representation of what workflow does • Allows you to monitor workflow execution • Enables user interaction • Facilitates sharing of workflows
The Kepler Project • Goals • Produce an open-source scientific workflow system • enable scientists to design scientific workflows and execute them • Support scientists in a variety of disciplines • e.g., biology, ecology, astronomy • Important features • access to scientific data • flexible means for executing complex analyses • enable use of Grid-based approaches to distributed computation • semantic models of scientific tasks • effective UI for workflow design
Usage statistics • Projects using Kepler: • SEEK (ecology) • SciDAC (molecular bio, ...) • CPES (plasma simulation) • GEON (geosciences) • CiPRes (phylogenetics) • CalIT2 • ROADnet (real-time data) • LOOKING (oceanography) • CAMERA (metagenomics) • Resurgence (computational chemistry) • NORIA (ocean observing CI) • NEON (ecology observing CI) • ChIP-chip (genomics) • COMET (environmental science) • Cheshire Digital Library (archival) • Digital preservation (DIGARCH) • Cell Biology (Scripps) • DART (X-ray crystallography) • Ocean Life • Assembling the Tree of Life project • Processing Phylodata (pPOD) • FermiLab (particle physics) • Source code access • 154 people accessed source code • 30 members have write permission • Kepler downloads: total = 9204, beta = 6675 (chart legend: red = Windows, blue = Macintosh)
Distributed execution • Opportunities for parallel execution • Fine-grained parallelism • Coarse-grained parallelism • Few or no cycles • Limited dependencies among components • ‘Trivially parallel’ • Many science problems fit this mold • parameter sweep, iteration of stochastic models • Current ‘plumbing’ approaches to distributed execution • workflow acts as a controller • stages data resources • writes job description files • controls execution of jobs on nodes • requires expert understanding of the Grid system • Scientists need to focus on just the computations • try to avoid plumbing as much as possible
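The "trivially parallel" case above (a parameter sweep with no dependencies among runs) can be sketched without any Grid plumbing. This is an illustrative stand-in, not Kepler code: the model function is invented, and a real CPU-bound sweep would typically use `ProcessPoolExecutor` rather than threads.

```python
# Sketch of a 'trivially parallel' parameter sweep: one independent
# model run per parameter value, no communication among runs.
from concurrent.futures import ThreadPoolExecutor

def model(alpha):
    # pretend this is an expensive stochastic or numerical model run
    return alpha * alpha + 1.0

def sweep(params):
    # each parameter maps to one independent run; results come back
    # in input order
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(model, params))

print(sweep([0.0, 1.0, 2.0, 3.0]))
```

The point of the slide is that a workflow system should generate this kind of scheduling for the scientist, who only supplies `model` and the parameter list.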
Distributed Kepler • Higher-order component for executing a model on one or more remote nodes • Master and slave controllers handle setup and communication among nodes, and establish data channels • Extremely easy for the scientist to utilize • requires no knowledge of grid computing systems (Diagram: master controller connected to slave controllers via IN/OUT data channels)
Data Management • Need for integrated management of external data • EarthGrid access is partial, needs refactoring • Include other data sources, such as JDBC, OPeNDAP, etc. • Data needs to be a first-class object in Kepler, not just represented as an actor • Need support for data versioning to support provenance • e.g., need to pass data by reference • workflows contain large data tokens (100s of megabytes) • intelligent handling of unique identifiers (e.g., LSID) (Diagram: actor A passes a reference token, e.g. ref-276 standing for {1,5,2}, to actor B)
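The pass-by-reference idea above can be sketched as a small token registry: actors exchange a short identifier while the large payload stays in a shared store. The store, the `ref-` prefix, and the API are all invented for illustration; the slide mentions LSIDs as a real candidate for such identifiers.

```python
# Sketch of pass-by-reference data tokens: instead of streaming a large
# array between workflow actors, actors exchange a small identifier and
# resolve it against a shared data store. All names here are illustrative.
import uuid

class DataStore:
    def __init__(self):
        self._objects = {}

    def put(self, data):
        ref = f"ref-{uuid.uuid4().hex[:8]}"   # unique identifier token
        self._objects[ref] = data
        return ref                            # only the token travels

    def get(self, ref):
        return self._objects[ref]

store = DataStore()
token = store.put([1, 5, 2])   # producer actor registers the data
payload = store.get(token)     # consumer actor dereferences the token
print(token, payload)
```

Versioning and provenance would layer on top of this: a new version gets a new identifier, and the workflow records which identifiers each step consumed and produced.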
Science Environment for Ecological Knowledge SEEK is an NSF-funded, multidisciplinary research project to facilitate … Access to distributed ecological, environmental, and biodiversity data • Enable data sharing & reuse • Enhance data discovery at global scales Scalable analysis and synthesis • Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues • Enable communication and collaboration for analysis • Enable reuse of analytical components • Support scientific workflow design and modeling
SEEK data access, analysis, mediation • Data Access (EcoGrid) • Distributed data network for environmental, ecological, and systematics data • Interoperate diverse environmental data systems • Workflow Tools (Kepler) • Problem-solving environment for scientific data analysis and visualization (“scientific workflows”) • Semantic Mediation (SMS) • Leverage ontologies for “smart” data/component discovery and integration
Managing Data Heterogeneity • Data comes from heterogeneous sources • Real-world observations • Spatial-temporal contexts • Collection/measurement protocols and procedures • Many representations for the same information (count, area, density) • Data, syntax, schema, semantic heterogeneity • Discovery and “synthesis” (integration) performed manually • Discovery often based on an intuitive notion of “what is out there” • Synthesis of data is very time-consuming, which limits use
A simple Kepler workflow Composite Component (Sub-workflow) Loops often used in SWFs; e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...) (T. McPhillips)
A simple Kepler workflow Lists Nexus files to process (project) Reads text files Parses Nexus format Draws phylogenetic trees PhylipPars infers trees from discrete, multi-state characters. Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees. UniqueTrees discards redundant trees in each collection. (T. McPhillips)
A simple Kepler workflow An example workflow run, executed as a Dataflow Process Network
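A dataflow process network like the one above can be imitated with chained generators: each actor consumes an upstream stream of tokens and emits a downstream one, and the network runs by pulling on the final stage. This is an illustrative analogy to Kepler's process-network execution, not its actual implementation; the stage names and file names are invented.

```python
# Sketch of a dataflow process network: each "actor" is a generator
# that transforms a stream of tokens; composition defines the network.
def source(paths):
    for p in paths:
        yield p                       # emit one filename token at a time

def read(paths):
    for p in paths:
        yield f"contents-of-{p}"      # stand-in for reading the file

def parse(texts):
    for t in texts:
        yield t.upper()               # stand-in for format parsing

# wire the actors together and pull results through the network
pipeline = parse(read(source(["a.nex", "b.nex"])))
print(list(pipeline))
```

Because each stage yields tokens as they arrive, stages can in principle operate concurrently on different tokens, which is the essence of process-network execution.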
SMS motivation • Scientific Workflow Life-cycle • Resource Discovery • discover relevant datasets • discover relevant actors or workflow templates • Workflow Design and Configuration • data ↔ actor (data binding) • data ↔ data (data integration / merging / interlinking) • actor ↔ actor (actor / workflow composition) • Challenge: do all this in the presence of … • 100’s of workflows and templates • 1000’s of actors (e.g. actors for web services, data analytics, …) • 10,000’s of datasets • 1,000,000’s of data items • … highly complex, heterogeneous data – price to pay for these resources: $$$ (lots) – scientist’s time wasted: priceless!
Some other workflow systems • SCIRun • Sciflo • Triana • Taverna • Pegasus • Some commercial tools: • Windows Workflow Foundation • Mac OS X Automator • http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf • http://www.isi.edu/~gil/AAAI08TutorialSlides/ • See reading for this week
Data Stewardship • Putting a number of data life cycle, management aspects together • Keep the ideas in mind as you complete your assignments • Why it is important • Some examples
Why it is important • 1976 NASA Viking mission to Mars (A. Hesseldahl, Saving Dying Data, Sep. 12, 2002, Forbes. [Online]. Available: http://www.forbes.com/2002/09/12/0912data_print.html ) • 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675 ) • R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt
At the heart of it • Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc. • Inability to know the inter-relations, assumptions and missing information • We’ll look at a (data) use case for this shortly • But first we will look at what, how and who in terms of the full life cycle
What to collect? • Documentation • Metadata • Provenance • Ancillary Information • Knowledge
Who does this? • Roles: • Data creator • Data analyst • Data manager • Data curator
How it is done • Opening and examining Archive Information Packages • Reviewing data management plans and documentation • Talking (!) to the people: • Data creator • Data analyst • Data manager • Data curator • Sometimes, reading the data and code
Data-Information-Knowledge Ecosystem (Diagram: producers and consumers exchange data → information → knowledge → experience through creation, gathering, presentation, organization, integration, conversation, and context)
Acquisition • Learn / read what you can about the developer of the means of acquisition • Documents may not be easy to find • Remember bias!!! • Document things as you go • Have a checklist (see the Data Management list) and review it often
Curation (partial) • Consider the organization and presentation of the data • Document what has been (and has not been) done • Consider and address the provenance of the data to date, you are now THE next person • Be as technology-neutral as possible • Look to add information and metainformation
Preservation • Usually refers to the full life cycle • Archiving is a component • Stewardship is the act of preservation • Intent is that ‘you can open it any time in the future’ and that ‘it will be there’ • This involves steps that may not be conventionally thought of • Think 10, 20, 50, 200 years…. looking historically gives some guide to future considerations
Some examples and experience • NASA, NOAA • http://wiki.esipfed.org/index.php/Preservation_and_Stewardship • Library community • Note: • Mostly in relation to publications, books, etc but some for data • Note that knowledge is in publications but the structure form is meant for humans not computers, despite advances in text analysis • Very little for the type of knowledge we are considering: in machine accessible form
Back in the day... NASA SEEDS Working Group on Data Lifecycle • Second Workshop Report • http://esdswg.gsfc.nasa.gov/pdf/W2_lcbo_bothwell.pdf • Many LTA recommendations • Earth Sciences Data Lifecycle Report • http://esdswg.gsfc.nasa.gov/pdf/lta_prelim_rprt2.pdf • Many lessons learned from USGS experience, plus some recommendations • SEEDS Final Report (2003) - Section 4 • http://esdswg.gsfc.nasa.gov/pdf/FinRec.pdf • Final recommendations vis-à-vis data lifecycle MODIS Pilot Project • GES DISC, MODAPS, NOAA/CLASS, ESDIS effort • Transferred some MODIS Level 0 data to CLASS
Mostly Technical Issues • Data Preservation • Bit-level integrity • Data readability • Documentation • Metadata • Semantics • Persistent Identifiers • Virtual Data Products • Lineage Persistence • Required ancillary data • Applicable standards
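The "bit-level integrity" item above is commonly addressed with fixity checks: record a cryptographic digest when the data is archived and re-verify it later. A minimal sketch with the standard library; the function names and the idea of a separately recorded digest are illustrative, not a specific archive's procedure.

```python
# Sketch of a bit-level integrity ('fixity') check: compute a digest at
# archive time, store it alongside the data, and re-verify on access.
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    # stream the file in chunks so arbitrarily large files fit in memory
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_fixity(path, recorded_digest):
    # True iff the bits on disk still match the digest recorded earlier
    return file_digest(path) == recorded_digest
```

Periodic re-verification of stored digests is how an archive detects silent corruption long before anyone tries to read the data scientifically.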
Mostly Non-Technical Issues • Policy (constrained by money…) • Front end of the lifecycle • Long-term planning, data formats, documentation... • Governance and policy • Legal requirements • Archive to archive transitions • Money (intertwined with policy) • Cost-benefit trades • Long-term needs of NASA Science Programs • User input • Identifying likely users • Levels of service • Funding source and mechanism
Use case: a real live one; deals mostly with structure and (some) content HDF4 Format "Maps" for Long Term Readability C. Lynnes, GES DISC R. Duerr and J. Crider, NSIDC M. Yang and P. Cao, The HDF Group HDF=Hierarchical Data Format NSIDC=National Snow and Ice Data Center GES=Goddard Earth Science DISC=Data and Information Service Center
In the year 2025... A user of HDF-4 data will run into the following likely hurdles: • The HDF-4 API and utilities are no longer supported... • ...now that we are at HDF-7 • The archived API binary does not work on today's OSes • ...like Android 3.1 • The source does not compile on the current OS • ...or is it the compiler version, gcc v. 7.x? • The HDF spec is too complex to write a simple read program... • ...without re-creating much of the API What to do?
HDF Mapping Files Concept: create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now) • XML • Stored separately from, but close to the data files • Includes • internal metadata • variable info • chunk-level info • byte offsets and length • linked blocks • compression information Task funded by ESDIS project • The HDF Group, NSIDC and GES DISC
Map sample (extract)
<hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields" objID="xid-DFTAG_NDG-5">
  <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer"> 0 0 </hdf4:Attribute>
  <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
  <hdf4:Dataspace ndims="2"> 180 360 </hdf4:Dataspace>
  <hdf4:Datablock nblocks="1">
    <hdf4:Block offset="27266625" nbytes="20582" compression="coder_type=DEFLATE" />
  </hdf4:Datablock>
</hdf4:SDS>
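A future reader could use such a map to recover a variable without any HDF-4 library: seek to the block's byte offset, read `nbytes`, inflate the DEFLATE stream, and unpack big-endian 16-bit integers per the `Datatype` element. This is a hedged sketch under those assumptions (in particular, that the compressed block is a zlib-wrapped DEFLATE stream); the file path and helper name are invented.

```python
# Sketch of reading one SDS block using only a map's metadata:
# offset/nbytes from <Block>, byte order and size from <Datatype>,
# and dimensions from <Dataspace>. No HDF-4 API required.
import struct
import zlib

def read_mapped_block(path, offset, nbytes, shape):
    with open(path, "rb") as f:
        f.seek(offset)                 # <Block offset="...">
        compressed = f.read(nbytes)    # <Block nbytes="...">
    raw = zlib.decompress(compressed)  # compression="coder_type=DEFLATE"
    count = shape[0] * shape[1]
    # ">h" = big-endian signed 16-bit int (dtypeSize="2", byteOrder="BE")
    values = struct.unpack(f">{count}h", raw)
    # reshape the flat values into rows of the 2-D dataspace
    return [list(values[i * shape[1]:(i + 1) * shape[1]])
            for i in range(shape[0])]
```

For the sample above one would call `read_mapped_block(path, 27266625, 20582, (180, 360))` against the archived granule.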
Status and Future Status • Map creation utility (part of HDF) • Prototype read programs • C • Perl • Paper in TGRS special issue • Inventory of HDF-4 data products within EOSDIS Possible Future Steps • Revise XML schema • Revise map utility and add to HDF baseline • Implement map creation and storage operationally • e.g., add to ECS or S4PA metadata files
NASA/MODIS Contextual Info 7th Joint ESDSWG meeting, October 22, Philadelphia, PA Data Lifecycle Workshop sponsored by the Technology Infusion Working Group
Instrument/sensor characteristics Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign