Handling Datasets
Michelle Gierach, PO.DAAC Project Scientist
Eric Tauer, PO.DAAC Project System Engineer
Introduction
• We saw a theme spanning several 2011 UWG recommendations (6, 14, 19, 20, 23).
• The theme spoke to a fundamental need/goal: approach and handle datasets with consistency, and accept and/or deploy them only because it makes sense to do so.
• This is worth solving! We want to provide the right datasets, and we want users to be able to easily connect with the right datasets.
• We enthusiastically agree with the UWG recommendations!
Therefore, our intent is to capture the lifecycle policy (including how we accept, handle, and characterize datasets) to ensure:
• Consistency in our approach
• Soundness in our decisions
• Availability of descriptive measures to our users
The next two discussions…
In the next two discussions, we will address those 5 UWG Recommendations (6, 14, 19, 20, 23) via the following talks:
• The proposed end-to-end lifecycle phases (enabling consistency) and assessment criteria (describing our data to our users) (Eric)
• The results of the Gap Analysis and the corresponding Dataset Assessment (Michelle)
Recommendations Covered

Recommendation 6. Carry out the dataset gap analyses and create a reporting structure that categorizes what is available, what could be created, the potential costs involved, estimates of user needs, and other data management factors. This compilation should enable prioritization of efforts that will fill the most significant data voids.

Recommendation 14. There needs to be a clear path for all datasets generated outside of PO.DAAC to be accepted and hosted by the PO.DAAC. The PSTs have a role in determining whether a dataset is valuable and of good quality. The processes and procedures should be published and readily available to potential dataset developers. All datasets should go through the same data acceptance path. A metric exclusively based on the number of peer-reviewed papers using the dataset is NOT recommended.

Recommendation 19. The UWG has previously recommended that the PO.DAAC work on providing climatologies, anomalies, indices, and various dataset statistics for selected datasets. This does not include developing CDRs as part of the core PO.DAAC mission. This recommendation is repeated, because it could be partially complementary to the IPCC/CMIP5 efforts, e.g., these climatologists prefer to work with global monthly mean data fields. Contributions of CDR datasets to PO.DAAC from outside research should be considered.

Recommendation 20. Better up front planning is required if NASA research program results are to be directed toward PO.DAAC. Datasets must meet format and metadata standards, and contribute to body of core data types. The Dataset Lifecycle Management plans are a framework for these decisions. Software must be designed to integrate with and beneficially augment the PO.DAAC systems. PO.DAAC should not accept orphan datasets or software projects.

Recommendation 23. Guiding user to data:
• Explain and use Dataset Lifecycle Management vocabulary with appropriate linkages
• Clarify what Sort by 'Popularity...' means
Related 2011 UWG Recommendations
The specification and documentation of the Dataset Lifecycle Policy stems from UWG Recommendations 14, 20, 23:
• “There needs to be a clear path for all datasets generated outside of PO.DAAC to be accepted and hosted by the PO.DAAC”
• “All datasets should go through the same data acceptance path”
• “Better up front planning is required if NASA research program results are to be directed toward PO.DAAC”
• “Dataset Lifecycle Management plans are a framework for these decisions”
• “Explain and use Dataset Lifecycle Management vocabulary with appropriate linkages”
Why a Lifecycle Policy?
Major Goal: Better describe our data to better map it to our users.
• Consistency in our approach
• Match users to data
Existing Work…
Dataset Lifecycle work is underway both internal and external to PO.DAAC.
Internal:
• Significant research and work performed by Chris Finch (UWG 2010 Presentation)
• Work within PO.DAAC to streamline process; mature teams with a very solid understanding of their roles
• Existing exit-criteria checklist for product release
External:
• Quite a bit of reference material available via industry efforts and progress
• Models can be leveraged from implementations at other DAACs, Big Data, Data One
Question: Any specific recommendations regarding lifecycle models appropriate to PO.DAAC?
Proposed PO.DAAC Lifecycle Phases
*Additionally, we include “Retire the Dataset”, but these are the primary operational phases.
Lifecycle Policy: A Means to an End
[Diagram labels: ESDIS Goals; User Goals; PO.DAAC Dataset Goals]
Lifecycle Policy Controls…
• How we do business: Procedures, Consistent Approach
• How we Describe our Data: MATURITY
Better-Described Data
Assessment and Characterization? Maturity?
• We want to quantitatively evaluate our datasets
• We don’t want to claim datasets are “good” or “bad”
• NASA and NOAA call their evaluation: “Maturity.”
Question: (Rhetorical, at this point) What does “maturity” mean to you? Do you prefer it to “Assessment and Characterization”?
Constant Collection
• Over the lifecycle, various data points are collected:
  • Decisional (e.g., Uniqueness: Rare or hard-to-find data)
  • Descriptive (e.g., Spatial Resolution)
• Those data points might control decisions or flow (exit criteria) and/or might be used to describe the “maturity” to the user.
• We think “maturity” means: a quantified characterization of dataset features. A higher number means more “mature”.
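As a minimal sketch of the idea above, collected data points could be recorded with a tag saying whether they are decisional or descriptive. The class names, field names, and the phase each example is attached to are hypothetical and chosen only for illustration; the slides do not define a data model, and a real data point might serve both purposes.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    DECISIONAL = "decisional"    # may control decisions or flow (exit criteria)
    DESCRIPTIVE = "descriptive"  # may be used to describe maturity to the user


@dataclass(frozen=True)
class DataPoint:
    phase: str   # lifecycle phase in which the point was collected (assumed)
    name: str    # e.g. "Uniqueness" or "Spatial Resolution"
    kind: Kind
    value: str   # free-text value; a real system might use typed values


def by_kind(points, kind):
    """Return the collected data points of one kind, in collection order."""
    return [p for p in points if p.kind is kind]


# Illustrative records only; the phase assignments are assumptions.
collected = [
    DataPoint("Identify a Dataset of Interest", "Uniqueness",
              Kind.DECISIONAL, "rare or hard-to-find data"),
    DataPoint("Ingest the Dataset", "Spatial Resolution",
              Kind.DESCRIPTIVE, "recorded at ingest"),
]

decisional_names = [p.name for p in by_kind(collected, Kind.DECISIONAL)]
```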
Dataset Lifecycle Phases
[Diagram annotations: constant collection; Increasing Knowledge of Maturity]
• Identify a Dataset of Interest
• Green-Light the Dataset
• Tailor the Dataset Policy
• Ingest the Dataset
• Archive the Dataset
• Register/Catalog the Dataset
• Distribute the Dataset
• Verify the Dataset
• Rollout the Dataset
• Maintain the Dataset
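The phase list above can be sketched as an ordered enumeration in which a dataset advances only when the current phase's exit criteria are met. This is an illustration under assumptions: the slides list the phases in this order but do not state that they are strictly sequential, and the `next_phase` helper and its boolean `exit_criteria_met` argument are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Primary operational phases, numbered in the order the slide lists them."""
    IDENTIFY = 1
    GREEN_LIGHT = 2
    TAILOR_POLICY = 3
    INGEST = 4
    ARCHIVE = 5
    REGISTER_CATALOG = 6
    DISTRIBUTE = 7
    VERIFY = 8
    ROLLOUT = 9
    MAINTAIN = 10


def next_phase(current: Phase, exit_criteria_met: bool) -> Phase:
    """Advance one phase if the exit criteria are met; otherwise stay put.

    MAINTAIN is treated as the last operational phase, so it never advances.
    """
    if not exit_criteria_met or current is Phase.MAINTAIN:
        return current
    return Phase(current + 1)
```

For example, `next_phase(Phase.INGEST, False)` leaves the dataset in `Phase.INGEST`, while `next_phase(Phase.INGEST, True)` moves it to `Phase.ARCHIVE`.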
Related 2011 UWG Recommendations
The creation of a PO.DAAC Dataset Maturity Model stems from UWG Recommendations 6, 14, 20, 23:
• [Identify the] “potential costs involved, estimates of user needs, and other data management factors”
• “The PSTs have a role in determining whether a dataset is valuable and of good quality. The processes and procedures should be published and readily available to potential dataset developers”
• “A metric exclusively based on the number of peer-reviewed papers using the dataset is NOT recommended.”
• “Datasets must meet format and metadata standards”
• “PO.DAAC should not accept orphan datasets”
• “Clarify what Sort by 'Popularity...' means”
Dataset Maturity
We adhere to the lifecycle for consistency, but a key outcome of the lifecycle must be maturity measures.
Ref: NASA Data Maturity Levels
• Beta
  Products intended to enable users to gain familiarity with the parameters and the data formats.
• Provisional
  Product was defined to facilitate data exploration and process studies that do not require rigorous validation. These data are partially validated and improvements are continuing; quality may not be optimal since validation and quality assurance are ongoing.
• Validated
  Products are high quality data that have been fully validated and quality checked, and that are deemed suitable for systematic studies such as climate change, as well as for shorter term, process studies. These are publication quality data with well-defined uncertainties, but they are also subject to continuing validation, quality assurance, and further improvements in subsequent versions. Users are expected to be familiar with quality summaries of all data before publication of results; when in doubt, contact the appropriate instrument team.
  • Stage 1 Validation: Product accuracy is estimated using a small number of independent measurements obtained from selected locations and time periods and ground-truth/field program efforts.
  • Stage 2 Validation: Product accuracy is estimated over a significant set of locations and time periods by comparison with reference in situ or other suitable reference data. Spatial and temporal consistency of the product and with similar products has been evaluated over globally representative locations and time periods. Results are published in the peer-reviewed literature.
  • Stage 3 Validation: Product accuracy has been assessed. Uncertainties in the product and its associated structure are well quantified from comparison with reference in situ or other suitable reference data. Uncertainties are characterized in a statistically robust way over multiple locations and time periods representing global conditions. Spatial and temporal consistency of the product and with similar products has been evaluated over globally representative locations and periods. Results are published in the peer-reviewed literature.
  • Stage 4 Validation: Validation results for stage 3 are systematically updated when new product versions are released and as the time-series expands.
Ref: NOAA Maturity Model
See: ftp://ftp.ncdc.noaa.gov/pub/data/sds/ms-privette-P1.3.conf.header.pdf
Laundry List of Criteria
Access:
• Readily available?
• Foreign repository?
• Behind firewalls or open FTP?
Toolkits:
• Data visualization routine?
• Data reader?
• Verified reader/subroutine?
Relationships:
• Sibling/child datasets identified?
• Motivation/justification identified?
Rarity:
• Hard-to-find data?
• Atypical sensor/resolution/etc.?
Specification:
• Resolution (spatial / temporal)
• Spatial coverage
• Start time
• End time
• Data format?
• Exotic structure?
• Sizing / volume expectation?
Community Assessment:
• Papers written / number of citations
• # of Likes
• # of downloads/views
Technical Quality:
• QQC+Latency / Gappiness
• Accuracy
• Sampling issues?
• Caveats/known issues identified?
Processing:
• Has it been manipulated?
• Cal/Val state?
• Verification state?
Provenance:
• Maturity of platform/instrument/sensor
• Maturity of Program
• Parent datasets identified (if applicable)
• Is the sensor fully described?
• Is the context of the reading(s) fully described?
• State-of-the-Art technology?
Documentation:
• What is the state of the documentation?
• Is the documentation captured (archived)?
Adherence to Process Guidelines:
• Did it get fast-tracked?
• Tons of waivers?
• Were all exit criteria met satisfactorily?
• Consistent use of units?
Used for Maturity Index
[Slide repeats the Laundry List of Criteria above verbatim.]
Proposed Presentation
Question: What does “maturity” mean to you? Do you prefer it to “Assessment and Characterization”? Does this provide better-described datasets and better mapping of data to our users?
Users would be presented with layers of information:
• Scores derived from the various criteria categories
• An ultimate maturity index (simple mathematical average) from the combined values:
  • Ultimately could allow weighting
  • At this point, it seems weighting would overcomplicate
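To make the proposed index concrete, the sketch below computes a maturity index as the simple average of per-category scores, with optional weights for the "ultimately could allow weighting" case. The function name, the dictionary layout, and the example scores are hypothetical; the slides do not specify a scoring scale or an implementation.

```python
def maturity_index(category_scores, weights=None):
    """Combine per-category scores into a single maturity index.

    category_scores: mapping of category name -> numeric score.
    weights: optional mapping of category name -> weight; when omitted,
             every category counts equally (simple mathematical average).
    """
    if not category_scores:
        raise ValueError("at least one category score is required")
    if weights is None:
        return sum(category_scores.values()) / len(category_scores)
    total_weight = sum(weights[name] for name in category_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(score * weights[name]
                       for name, score in category_scores.items())
    return weighted_sum / total_weight


# Illustrative scores only; the scale and values are assumptions.
scores = {"Access": 4, "Toolkits": 2, "Documentation": 3}

simple = maturity_index(scores)
weighted = maturity_index(
    scores, weights={"Access": 2, "Toolkits": 1, "Documentation": 1})
```

With equal weights the weighted form reduces to the simple average, so starting with the unweighted index does not rule out adding weights later.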
Wrap Up
The lifecycle document, while capturing process, becomes a means to an even greater end. The driving current is consistency, and as our goals hinge on matching users to datasets, the lifecycle becomes the means of ensuring fully characterized datasets.
We hope the approach is reasonable, and that we are accurate in our assessment that the policy aspects of the Dataset Lifecycle can and will help ensure conformity to process and consistent availability of maturity data across all PO.DAAC holdings.
Next steps:
• We ultimately need to identify (and, if necessary, implement) the infrastructure needed to guide us through this lifecycle
• We still need to resolve some key questions, such as: How does the lifecycle morph with respect to different types of datasets? (Remote Datasets? Self-generated Datasets?)
Dataset Lifecycle Phases
• Identify a Dataset of Interest (Michelle’s discussion starts here…)
• Green-Light the Dataset
• Tailor the Dataset Policy
• Ingest the Dataset
• Archive the Dataset
• Register/Catalog the Dataset
• Distribute the Dataset
• Verify the Dataset
• Rollout the Dataset
• Maintain the Dataset