
Annual Review 1 May 2012

UAHuntsville, The University of Alabama in Huntsville. Instant Karma: Applying a Proven Provenance Tool to NASA’s AMSR-E Data Production Stream (ACCESS-09-0006). Annual Review, 1 May 2012.


Presentation Transcript


  1. UAHuntsville, The University of Alabama in Huntsville. Instant Karma: Applying a Proven Provenance Tool to NASA’s AMSR-E Data Production Stream. ACCESS-09-0006. Annual Review, 1 May 2012

  2. Instant Karma: Applying a Proven Provenance Tool to NASA’s AMSR-E Data Production Stream
     PI: Michael Goodman, NASA / MSFC
     Objectives
     • Improve the collection, preservation, utility and dissemination of provenance information within the NASA Earth Science community
     • Customize and integrate Karma, a proven provenance tool, into NASA data production
     • Collect and disseminate provenance of AMSR-E (Advanced Microwave Scanning Radiometer – Earth Observing System) standard data products, initially focusing on Sea Ice
     • Engage the Sea Ice science team and user community
     • Adhere to the Open Provenance Model (OPM)
     Approach
     • Apply Karma to Sea Ice data production workflows
     • Customize Karma’s provenance dissemination user interface
     • Evaluate usefulness of provenance collected
       - Measure traffic to Karma Provenance Browser
       - Collect user feedback
     • Expand use of Karma to other AMSR-E data production streams
     Key Milestones
     • Evaluate current AMSR-E SIPS product generation: 06/10
     • Extend Karma provenance collection tools for SIPS: 09/10
     • Enhance Karma Provenance Browser interface: 10/10
     • Instrument AMSR-E Sea Ice production in Testbed: 12/10
     • Evaluate with Sea Ice science team: 03/11
     • Introduce Provenance Browser to NSIDC DAAC: 06/11
     • Instrument AMSR-E Sea Ice production in Ops: 09/11
     • Evaluate with AMSR-E Sea Ice user community: 02/12
     • Instrument other AMSR-E data streams: 02/12
     Co-Is/Partners: Thorsten Markus, NASA GSFC; Beth Plale, Indiana University; Helen Conover, Rahul Ramachandran, UAHuntsville
     TRL in = 7; TRL current = 7.9
     ACCESS (Advancing Collaborative Connections for Earth System Science), 16 April 2012

  3. Instant Karma: Applying a Proven Provenance Tool to NASA’s AMSR-E Data Production Stream
     PI: Michael Goodman, NASA / MSFC
     Objectives
     • Improve the collection, preservation, utility and dissemination of provenance information within the NASA Earth Science community
     • Customize and integrate Karma, a proven provenance tool, into NASA data production
     • Collect and disseminate provenance of AMSR-E (Advanced Microwave Scanning Radiometer – Earth Observing System) standard data products, initially focusing on Sea Ice
     • Engage the Sea Ice science team and user community
     • Adhere to the Open Provenance Model (OPM)
     [Figure labels: Processing Testbed; Pass Script (L2A Brightness Temps, L2B Ocean, L2B Land, L2B Rain); Daily Script (Ocean, Land, Snow, Sea Ice); Pentad, Weekly and Monthly Scripts; Provenance Collection; Provenance Repository; Query API; Provenance Browser]
     Accomplishments
     • Applied current provenance research results to a legacy NASA science processing system
     • Instrumented all processing streams for all AMSR-E standard products, using a software library developed at UAH and provenance logging specifications provided by IU
     • Captured high-level, science-relevant provenance and context information in collaboration with the Science Team
     • Developed a provenance browser customized for AMSR-E science users
     • Engaged user community via presentations to AMSR-E Science Team, beta-testing with AMSR-E Sea Ice Team, and outreach to Sea Ice community at GSFC
     • Developed plan to transition provenance collection and display tools into production at the AMSR-E SIPS
     • Moving toward ISO rather than OPM, in line with NASA’s current metadata approach
     • Enhanced Karma visualization based on Sea Ice rolled into visualization tools for other Karma users
     Co-Is/Partners: Thorsten Markus, NASA GSFC; Beth Plale, Indiana University; Helen Conover, Rahul Ramachandran, UAHuntsville
     TRL in = 7; TRL out = 7.9
     ACCESS (Advancing Collaborative Connections for Earth System Science), 16 April 2012

  4. Presentation Outline • Introduction • Provenance for AMSR-E SIPS • Advanced Uses of Provenance • ESDSWG / ESIP / Outreach • Accomplishments / Lessons Learned • Contact Info / Acronyms • Back-up Slides Instant Karma Year 2 Annual Review ACCESS-09-0006

  5. Instant Karma Overview
     • Collaboration among
       - AMSR-E SIPS (MSFC Earth Science Office and UAHuntsville ITSC)
       - Indiana University, Data to Insight Center
       - AMSR-E Sea Ice science team (GSFC)
     • Primary goal is to improve the collection, preservation, utility and dissemination of provenance and context information within the NASA Earth Science community, building on
       - Karma provenance tools
       - AMSR-E standard product generation, with initial focus on Sea Ice
     [Figure labels: Data Center Operations; Provenance Research; Earth Science; Instant Karma Project]

  6. Team Leads

  7. Team Members over the two-year project

  8. Project Approach
     • Focus on capturing information for contextual understanding and long-term data preservation rather than full data reproducibility
       - SIPS processing flow is very controlled, with less need for detailed provenance
       - Files in a given version of a dataset will differ only by list of input files, processing location and environment, and processing date and time
     • Provenance as centerpiece around which other information is aggregated
     [Figure labels: Research Track: Karma Project; Provenance research; Collection, storage, display methods; Research results; Requirements; Science user needs; Customized provenance database and browser; AMSR-E Sea Ice Team; Provenance @ SIPS; NASA data preservation spec; Operational environment; Operations Track: AMSR-E SIPS]

  9. Science-Relevant Provenance and Context Information
     • Data lineage (data inputs, software and hardware) plus additional contextual knowledge about science algorithms, instrument variations, etc.
     • Lots of information already available, but scattered across multiple locations
       - Processing system configuration
       - Data collection and file-level metadata
       - Processing history information
       - Quality assurance information
       - Software documentation (e.g., algorithm theoretical basis documents, release notes)
       - Data documentation (e.g., guide documents, README files)
     • Project goal was to collate and organize information from multiple sources and make it available through the AMSR-E Provenance Browser

  10. AMSR-E Provenance Use Cases
     • Browse provenance graphs: convey rich information about final data granule details [Use case 1]
       - Spatial location, time of observation, algorithms employed, input data and ancillary files
       - Provenance bundle to include pointers to relevant documentation
     • Answer the “Something isn’t right” question [Use case 1 variant]
       - E.g., did not receive data for several days, so snow melt mask may be inaccurate
     • Compare two data granules [Use case 2]
       - Query system to compare provenance graph structure and provenance details (e.g., versions of software, number and versions of input files)
     • General provenance graph for a given science process, e.g., Sea Ice processing [Use case 3]
       - Current algorithms and versions, nominal number and versions of input files, pointers to relevant documentation
     • Embed provenance information as annotations in data files [Use case 4]
       - ISO “Lineage” model (evolving NASA ES conventions)

  11. Instant Karma Major Milestones

  12. Technology/Software Readiness Levels
     [Chart: key distinguishes deployed from planned levels; target deployments are Testbed, AMSR-E SIPS, and SIPS / DAAC]

  14. AMSR-E Product Suite

  15. SIPS-GHRC Processing Architecture
     • Algorithm Packages from the Science Team and Team Lead of the Science Computing Facility
     • Processing automation controlled by SIPS scripts
     • Pass processing is data driven
     • Level-3 product generation is scheduled after nominal availability of input products
     [Figure labels: Control Script (provenance logging at this level); Delivered Algorithm Package; Input Data and Metadata File(s); Ancillary File(s); Science Data; Metadata; Processing History; Quality Assurance; Browse imagery; Custom subsets]

  16. Provenance Information Architecture
     [Diagram labels: Processing Testbed; Pass Script (L2A Brightness Temps, L2B Ocean, L2B Land, L2B Rain); Daily Script (Ocean, Land, Snow, Sea Ice); Pentad, Weekly and Monthly Scripts; Product / Algorithm Metadata Form; Provenance Repository; Query API; Provenance Browser; flows marked P (Provenance Logs) and M (Granule Metadata)]

  17. Earth science Library for Processing History (ELPH)
     • Perl library reference implementation can be used to instrument script-driven processing
     • Based on IU work in software instrumentation and the Karma provenance log specification
     • Logs “consumed”, “invoked” and “produced” events
     • Assigns unique URNs to all artifacts (e.g., data files or software processes), of the form urn:host-id/environment:datetime:name
       - host-id indicates the computer system on which processing is performed
       - environment indicates the processing environment (e.g., dev, test or ops for AMSR-E)
       - datetime (yyyymmddhhmmss) indicates production date/time for science data files, last modification date/time for other files, and system date/time for processes or workflows
       - name indicates the name of the file or process
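
The URN convention and the three event types on slide 17 can be illustrated with a short sketch. This is Python rather than the Perl of the ELPH reference implementation, and the function names, record layout, host name and file name are all hypothetical; only the URN form and the event names come from the slide.

```python
from datetime import datetime

# Event kinds named on the slide.
EVENT_KINDS = {"consumed", "invoked", "produced"}


def make_urn(host_id, environment, when, name):
    """Build an artifact URN of the form urn:host-id/environment:datetime:name."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # yyyymmddhhmmss
    return f"urn:{host_id}/{environment}:{stamp}:{name}"


def log_event(kind, process_urn, artifact_urn):
    """Return one provenance event record (hypothetical layout, not the Karma log schema)."""
    if kind not in EVENT_KINDS:
        raise ValueError(f"unknown event kind: {kind}")
    return {"event": kind, "process": process_urn, "artifact": artifact_urn}


# Hypothetical host and file names, for illustration only.
run_time = datetime(2011, 10, 13, 0, 44, 32)
file_urn = make_urn("sips01", "test", run_time, "seaice_12km.hdf")
proc_urn = make_urn("sips01", "test", run_time, "daily_seaice")
record = log_event("produced", proc_urn, file_urn)
```

Because the URN carries host, environment and timestamp, two runs that write a file with the same name still yield distinct identifiers.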

  18. Provenance and Context Metadata
     • Harvesting granule information from ECS metadata
       - Also recording processing location associated with each data granule
     • Providing Context Summary information for algorithms and data products, in consultation with AMSR-E Science Computing Facility and Sea Ice team
       - Algorithm versions and descriptions
       - Parameters and data fields in science products
       - Ancillary files used in processing
       - Flag values and explanations
       - Pointers to full documentation
     • Aligning with NASA Preservation and Context Specification
     • Developing ISO Lineage compliant processing descriptions, in consultation with Ted Habermann
       - High-level description of product generation processes, to be stored at NSIDC DAAC with collection-level metadata
       - Granule-specific information, which references the high-level description, to be embedded in data files

  19. Context Summary Metadata Schema
     • Basic Product Information
     • Delivered Algorithm Package
     • File Information (e.g., grid type)
     • Documentation Links
     • Science Algorithm Information
     • Geophysical Parameters / Variables and Associated Information
     • Data fields
     • Ancillary files
     • Flag values

  20. AMSR-E Provenance Browser
     [Screenshot labels: Explore Provenance; Browse by Product Type; Search by File Name; Browse by Product Images]

  21. AMSR-E Provenance Browser
     [Screenshot labels: Granule Metadata; Processing Graph; Parameters and Algorithms; Algorithm Information]

  22. AMSR-E Provenance Browser
     [Screenshot labels: Parameter Selected; Algorithm Used]

  23. AMSR-E Provenance Browser: Modes of Exploration
     • Browse list by product type, file name
       - Mid-winter Sea Ice data was incorrect after reprocessing. Look at January data to determine what was different about this processing run.
     • Browse images by product type, parameter/variable
       - Scan series of images to look for possible problems, e.g., incomplete data for 2011-10-04
     • Search by file name
       - Question about a particular data file that doesn’t look right. Search by file name; may discover that this was a problem file later replaced, and get the new file.
     • View general processing information
       - For information about how files from a particular product are generated, view the general processing graph and get pointers to more detailed documentation.

  25. Advanced Uses of Provenance • Provenance over months and years • Forward provenance • Comparing provenance graphs • Provenance quality analysis

  26. I. Month-view of Product Activity
     • Daily processing for Sea Ice products over 1 month is a linked graph of provenance graphs, one per day.
     • View shows links between provenance graphs, e.g., the ice mask is updated each day and feeds into the next day.
     • This view can be used to detect problems in processing over extended time periods.

  27. Zoom in to Review Dependencies
     • Zoom to review complex dependencies between runs responsible for producing data products
     • Shows dependencies for Sea Ice Drift product, built from five daily Sea Ice products
     • Can see here that input data for fifth day is incomplete (instrument offline)
     Zoom in and out to see entire provenance history or specific details

  28. II. Forward Provenance
     • Standard provenance view is historical; it shows actions that went into creating a data product. Shown here is the 12 km Sea Ice product.
     • Clicking the Data Forward option switches to displaying later-in-time (forward) provenance dependencies

  29. Forward Provenance Graph: 12 km Sea Ice Product
     • Forward provenance can be helpful if a problem is discovered in a data product. The progeny of a bad data product can be tracked down visually by following the forward provenance chain.
     • High-level view provides a quick snapshot of downstream dependencies on a data product
     • Useful to quickly see downstream consequences of reprocessing
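
The forward-provenance idea on slides 28 and 29 amounts to walking derivation edges in the opposite direction from the one in which they are recorded. The sketch below is a generic breadth-first traversal under that assumption, not Karma's implementation; the product names and the dictionary layout are hypothetical.

```python
from collections import deque


def forward_provenance(derived_from, start):
    """Return every product downstream of `start`, in breadth-first order.

    `derived_from` maps each product to the list of products it was derived
    from, i.e. the backward (historical) direction.
    """
    # Invert the backward edges so the graph can be walked forward in time.
    children = {}
    for product, inputs in derived_from.items():
        for source in inputs:
            children.setdefault(source, []).append(product)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical product names: each daily Sea Ice product feeds the next day
# and a multi-day drift product.
graph = {
    "seaice_12km_day1": ["l2a_day1"],
    "seaice_12km_day2": ["l2a_day2", "seaice_12km_day1"],
    "seaice_drift": ["seaice_12km_day1", "seaice_12km_day2"],
}
affected = forward_provenance(graph, "l2a_day1")
```

If `l2a_day1` turned out to be bad, `affected` lists the products that would need reprocessing.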

  30. III. Comparing Two Provenance Graphs
     • Compare two workflows side-by-side with node-to-node mapping that uses Direct Classification of node Attendance (DCA)*
     * DePiero, F., Trivedi, M. and Serbin, S., Graph matching using a direct classification of node attendance, Pattern Recognition, 29(6), pp. 1031-1048.

  31. Provenance Graph Comparison
     • Matched sub-graph: red nodes are matched to corresponding nodes in the comparison graph
     • Left-hand graph has 19 data files that cannot be paired with the graph on the right
     • Zooming in shows that the additional nodes are input files. This abnormal case occurred because the instrument went off-line, so data was not available.
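
The kind of difference reported on slide 31 (extra input files on one side) can be illustrated with a much simpler check than the DCA graph matching the slides describe. The sketch below is not DCA and ignores graph structure entirely: it only compares multisets of input types, and the input labels and counts are made up for illustration.

```python
from collections import Counter


def unmatched_inputs(left_inputs, right_inputs):
    """Count inputs of each type on the left that have no partner on the right.

    This is a plain multiset difference, a stand-in for real graph matching.
    """
    surplus = Counter(left_inputs) - Counter(right_inputs)  # keeps positive counts only
    return dict(surplus)


# Hypothetical runs: the second run received fewer L2A files.
complete_run = ["L2A"] * 5 + ["ice_mask"]
short_run = ["L2A"] * 3 + ["ice_mask"]
extra = unmatched_inputs(complete_run, short_run)
```

A count-based check like this flags that something is missing; locating which nodes differ still needs the node-to-node mapping shown in the browser.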

  32. Comparison of Metadata
     • Clicking on a data product or process in one graph (e.g., on the left) automatically highlights the corresponding product or process in the comparison graph
     • Displaying metadata for a data product or process in one graph automatically displays metadata for the corresponding node in the comparison graph

  33. IV. Analysis of Provenance Quality
     • Methodology
       - Duplicate detection: annotations are mined for duplicates and conflicts.
       - Structural analysis: structures of provenance graphs of the same type are compared and mined for differences or similarities.
       - Anomaly detection: detection of outliers in a provenance graph.
       - Validation of attributes: fields such as time ranges are checked for validity.
     • Applied to 1 month of data processing

  34. Analysis of Quality of Provenance
     Structural analysis of 1 month of provenance detected an incorrect timestamp range:
     • Process_6250<-Process_6251: time range is reversed
       noEarlierThan: 2011-10-13T00:44:32.0000-04:00
       noLaterThan: 2011-10-13T00:44:31.0000-04:00
     This is an artifact of the data “produced” and “consumed” entries in the log file being reversed, which swaps the begin and end times of the range.
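
A time-range check of the kind that caught the reversed range on slide 34 can be sketched in a few lines. The timestamps are the ones shown on the slide; the function name and the format string are assumptions chosen to parse them, not part of Karma.

```python
from datetime import datetime

# Matches timestamps such as 2011-10-13T00:44:32.0000-04:00.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def time_range_is_valid(no_earlier_than, no_later_than):
    """Return True when the start of the range does not fall after its end."""
    start = datetime.strptime(no_earlier_than, TIMESTAMP_FORMAT)
    end = datetime.strptime(no_later_than, TIMESTAMP_FORMAT)
    return start <= end


# The reversed range reported for Process_6250<-Process_6251.
reported_ok = time_range_is_valid(
    "2011-10-13T00:44:32.0000-04:00",
    "2011-10-13T00:44:31.0000-04:00",
)
```

Parsing with the UTC offset means ranges are compared correctly even when the two timestamps carry different offsets.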

  35. Plot of number of annotations for each node and edge in monthly ocean workflow
     [Plot series: annotations for used edges; annotations for wasDerivedFrom edges; Data Object annotations]
     • Anomaly detection: exposes duplicate annotations arising from duplicate entries in the log files.

  36. Annotations; zoom in I
     4 wasDerivedFrom edges have fewer annotations than usual.
     [Plot series: annotations for used edges; annotations for wasDerivedFrom edges; Data Object annotations]
     • Anomaly detection: exposes edges that have fewer annotations than usual.

  37. Annotations; zoom in II
     A wasTriggeredBy edge has annotations with the same URI (key) but different values. Note: the image is a close-up of the highlighted part of the previous slide.
     • Anomaly detection: exposes annotations that share the same key but have different values. This may or may not be a problem, but Karma is able to warn the user about it.
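
Two of the annotation checks described on slides 35 to 37 (exact duplicates, and one key carrying different values) can be sketched as follows. This is an illustrative stand-in, not Karma's analysis code; the function name, the (key, value) tuple layout and the example URIs are hypothetical.

```python
from collections import Counter, defaultdict


def audit_annotations(annotations):
    """Inspect the (key, value) annotations attached to one node or edge.

    Returns (duplicates, conflicts): pairs that appear more than once, and
    keys that appear with more than one distinct value.
    """
    counts = Counter(annotations)
    duplicates = sorted(pair for pair, n in counts.items() if n > 1)

    values_by_key = defaultdict(set)
    for key, value in annotations:
        values_by_key[key].add(value)
    conflicts = {k: sorted(v) for k, v in values_by_key.items() if len(v) > 1}
    return duplicates, conflicts


# Hypothetical annotations on a single edge.
edge_annotations = [
    ("urn:example:host", "sips01"),
    ("urn:example:host", "sips01"),   # duplicate log entry
    ("urn:example:time", "00:44:31"),
    ("urn:example:time", "00:44:32"),  # same key, different value
]
duplicates, conflicts = audit_annotations(edge_annotations)
```

As the slide notes, a conflict is a warning rather than an error, so a caller would normally report it and leave the decision to the user.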

  39. ESDSWG and ESIP Activities
     • NASA Earth Science Data Systems Technology Infusion Working Group
       - UAH team members participate in ESDSWG Tech Infusion, specifically the Data Stewardship Subgroup. Key subgroup activities: NASA Earth Science Data Preservation Content Specification; Preservation / Provenance Ontology
       - Co-I Conover attended the ESDSWG meeting in October 2010 (session on science-relevant provenance)
     • NASA Earth Science Data Systems Standards Process Group
       - Co-I Conover presented the provenance browser and metadata work at the ESDSWG meeting in November 2011. This led to the current work with the ISO Lineage spec.
     • ESIP Federation Information Technology and Interoperability Committee and the Preservation and Stewardship Cluster
       - Co-I Plale gave a presentation to IT&I in September 2010 on Instant Karma, provenance collection, semantics, and interoperability. This group determined that the best place to engage in provenance efforts is through the ESIP Federation Preservation and Stewardship Cluster’s provenance working group.
       - Co-I Conover attended the ESIP Federation meeting in July 2010 (Data Identifiers presentation of special interest to this project)
       - Co-I Plale attended the January 2011 ESIP meeting (sessions on stewardship and science-relevant provenance, among others)
       - Co-I Conover attended the July 2011 ESIP meeting (sessions on ISO metadata, NASA Earth Science Data Preservation Content Specification, provenance ontology, and others)

  40. Outreach to Community: Posters and Presentations
     • “Instant Karma: Collecting Provenance for AMSR-E,” Beth Plale and Helen Conover. Joint AMSR-E Science Team Meeting, Huntsville, AL, June 2, 2010.
     • “Instant Karma: Provenance Collection at the AMSR-E SIPS,” Helen Conover, Beth Plale, Sunil Movva, Prajakta Purohit, Kathryn Regner, Yiming Sun. Poster presented at the 9th Earth Science Data Systems Working Group Meeting, New Orleans, October 20-22, 2010.
     • “Instant Karma: Provenance Collection at the AMSR-E SIPS,” Helen Conover, Beth Plale, Sunil Movva, Prajakta Purohit, Kathryn Regner, Yiming Sun. Poster presented at the International Symposium on the A-Train Satellite Constellation 2010, New Orleans, October 25-28, 2010.
     • “Karma Provenance: Why and How? Provenance collection of unmanaged workflows,” Mehmet Aktas. Presentation at the Super Computing 2010 Conference, New Orleans, LA, November 14-20, 2010.
     • “Karma Provenance: Why and How? Provenance collection of unmanaged workflows,” M. Aktas. Presentation at the CS Department of University of West Florida, Pensacola, FL, November 20, 2010.
     • “Metadata and Provenance Collection and Representation: Antecedent to Scientific Data Preservation,” B. Plale. Open Data Seminar, University of Michigan, November 2010.
     • “Applying the Karma Provenance Tool to NASA’s AMSR-E Data Production Stream,” R. Ramachandran, H. Conover, K. Regner, S. Movva, Michael Goodman, B. Plale, P. Purohit, Y. Sun. American Geophysical Union Fall Meeting, December 13-17, 2010.

  41. Outreach to Community: Posters and Presentations
     • “Data Provenance for Preservation of Digital Geospatial Data,” B. Plale, B. Cao, C. Herath, and Y. Sun. Geological Society of America special volume on Transforming Data To Knowledge for Geosciences, v. 482, p. 125-137, 2011.
     • “Instant Karma: Accessing Provenance Information for AMSR-E Science Data Products,” H. Conover, K. Regner. Presented at AMSR-E Science Team Meeting, Asheville, NC, June 2011.
     • “Instrumenting Earth Science Applications for OPM-Driven Provenance,” Mehmet Aktas, Beth Plale, Helen Conover, Prajakta Purohit. White paper distributed at ESIP Federation Meeting, Santa Fe, July 2011.
     • “Instant Karma Status Update: Provenance at the AMSR-E SIPS,” Conover, Plale, Aktas, B. Beaumont, D. Conway, S. Graves, S. Jensen, H. Joshi, A. Kulkarni, Y. Luo, R. Ping, P. Purohit, R. Ramachandran, K. Regner, C. Stein. Poster presented by Conover at ESIP Federation Meeting, Santa Fe, July 2011.
     • “Provenance Collection and Display Tools for the AMSR-E SIPS,” H. Conover, B. Beaumont, A. Kulkarni, R. Ramachandran, K. Regner, S. Graves, D. Conway. Presentation and poster at NASA Earth Science Data Systems Working Groups (ESDSWG) Meeting, Newport News, VA, November 2011.
     • “Key Provenance of Earth Science Observational Data Products,” H. Conover, B. Plale, M. Aktas, R. Ramachandran, P. Purohit, S. Jensen, and S. Graves. Presented at American Geophysical Union Fall Meeting, session IN22 Issues in Scientific Data Preservation and Stewardship, December 2011.

  42. Outreach to Community: Posters and Presentations
     • “Provenance Collection and Display for the AMSR-E SIPS,” H. Conover, B. Beaumont, A. Kulkarni, R. Ramachandran, K. Regner, S. Graves, D. Conway. Poster presented at ESIP Federation Meeting, Washington, DC, January 2012.
     • “Metadata and Provenance Capture: Fins in a Sea of Data,” B. Plale. Invited talk, Purdue University, March 2012.
     • “Metadata and Provenance: Dimensions of Use in Science,” B. Plale. Invited talk, Biodiversity Workshop, Pacific Rim Applications and Grid Middleware Assembly (PRAGMA) 22, Melbourne, AU, March 2012.
     • “Metadata and Provenance Capture and Use,” B. Plale. Invited talk, IBM TJ Watson, April 2012.
     In progress:
     • Paper in work, targeting IEEE Transactions on Geoscience and Remote Sensing, special issue on Geoscience Data Provenance.

  44. Summary of Accomplishments
     • Event system support implemented for capturing provenance from legacy systems
     • Tools for harvesting provenance from logs, specifically logs of AMSR-E reprocessed data
     • Enhancements to provenance plug-in for Cytoscape, driven by AMSR-E data needs:
       - Graph comparison: based on graph structure and on metadata
       - Provenance chain visualization: over long durations (months, years)
       - Data forward provenance: provenance into the future of derived products across science products
     • Enhancements based on AMSR-E data products in Instant Karma rolled into the Karma provenance system and made available to other Karma users, including the NSF-funded GENI project

  45. Summary of Accomplishments
     • Applied current provenance research results to a legacy NASA science processing system
     • Instrumented all processing streams for all AMSR-E standard products, using a software library developed at UAH and provenance logging specifications provided by IU
     • Captured high-level, science-relevant provenance and context information in collaboration with the AMSR-E Science Team
     • Developed a provenance browser customized for AMSR-E science users
     • Engaged user community via
       - Presentations to AMSR-E Science Team
       - Collaboration and beta-testing with AMSR-E Sea Ice Team
       - Outreach to Sea Ice community at GSFC
     • Developed plan to transition provenance collection and display tools into production at the AMSR-E SIPS
     • Coordinating with NSIDC DAAC on delivery of provenance information

  46. Transition Plans
     • Currently able to collect provenance information for all AMSR-E standard products generated at SIPS-GHRC
       - Level-2B and Level-3 products
       - Implemented in Provenance Testbed, ready for test at the AMSR-E SIPS
     • Plan to implement provenance collection in AMSR-E SIPS before reprocessing begins in summer 2012
     • Working with NSIDC DAAC on delivery of ISO Lineage compliant provenance information with data products
       - High-level processing description, to be stored with collection-level metadata at NSIDC DAAC
       - Granule-specific information, referencing the high-level description, to be embedded in data files
     • Provenance Browser to be available from AMSR-E SIPS / GHRC through mission close-out, then transitioned to NSIDC DAAC

  47. Project Results
     Products for potential reuse resulting from this research:
     • Provenance collection tools
       - Software library with reference implementation in Perl
       - Schema for provenance logs
       - InstantKarmaAdaptor to ingest provenance logs into Karma
     • Conventions for provenance and context metadata
       - ISO Lineage for product generation processes: high-level description, to be stored at NSIDC DAAC
       - ISO Lineage for data granules: granule-specific information, to be embedded in data files, which references the high-level description
       - Additional high-level data product and algorithm context information
     • Provenance storage and display
       - Drupal profile with provenance database and two modules: provenance browser; data product and algorithm context form
       - Provenance-custom plugins to Cytoscape for graph comparison and provenance viewing over time

  48. Lessons Learned: adapting cutting-edge research to a legacy production system
     • Expect to encounter similar issues in retrofitting legacy science processing systems for provenance capture, such as heterogeneous system components and lack of direct control over science software
     • Efficient instrumenting of a science processing system requires joint effort by experts with the processing system and with provenance collection
     • Need to consider provenance collection at beginning of system design
     • Science algorithms should be self-describing
     • Need a way to track intermediate files, which may be overwritten each time
     • Evolving standards and conventions make project planning more difficult (but allow the project to contribute to this evolution)
     • Provenance metadata requirements
     • Unique identifiers for data collections and files
     • NASA Earth science conventions for ISO metadata
     • NASA Earth science data preservation requirements

  49. Lessons Learned: strict reproducibility vs. full contextual information
     • Contextual information
       - Need to identify what is important and how to present it concisely
       - Presenting key details of the science algorithm may be most important to scientists
     • Reproducibility
       - Full reproducibility requires logging everything. Metadata may get bigger than the science data.
       - Need to filter the most valuable information out of the entire processing log.

  50. Recommendations
     • Design provenance collection into the entire data lifecycle, from Level 0 processing onward
     • Agree on standard protocols/schemas for provenance and related information (ISO Lineage?)
     • Develop a semantic mapping between ISO Lineage, OPM and other provenance schemas
     • Allow for instrumentation of science algorithms (not black box)
     • Possible follow-on work
       - Monitor provenance browser usage to determine most important features
       - Develop provenance libraries for common science programming and scripting languages (C, FORTRAN, Perl, Python, etc.) that incorporate provenance logging into common commands (e.g., read, invoke, write, error)
       - Document provenance collection and dissemination methodology for broader use in NASA Earth science product generation systems
       - Apply these tools and techniques to future missions (e.g., GPM)
