
Working Group: Data Foundations and Terminology ( Practical Policy Considerations ) Reagan Moore


Presentation Transcript


  1. Working Group: Data Foundations and Terminology (Practical Policy Considerations), Reagan Moore

  2. Mapping Terminology to Use Cases
  • Consider a hydrologist who needs to:
    • Acquire the data sets needed for research
    • Execute an analysis
    • Save the research results
    • Enable another hydrologist to re-execute the analysis
  • Embed the goal of data discovery, access, analysis, and management in the larger context of reproducible, data-driven research:
    • Where did the data come from?
    • How was the data created?
    • How was the data managed?

  3. Concepts
  • There is a duality between:
    • Procedures that generate data objects
    • Data objects generated by a procedure
  • Terminology is needed that describes:
    • Operations executed by a researcher to create data objects
    • Operations executed by a repository to manage data objects

  4. Eco-Hydrology: RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and the full initial system state.
  [Workflow diagram: choose a gauge or outlet (HIS) and extract the drainage area (NHDPlus); a Digital Elevation Model (DEM) yields slope, aspect, and the stream network; streams (NHD), roads (DOT), land use (NLCD/EPA), Leaf Area Index (Landsat TM), phenology (MODIS), and soil data (USDA) feed the soil and vegetation parameter files and the nested watershed structure (basin, hillslope, patch, strata); the outputs are the flowtable and worldfile consumed by RHESSys.]

  5. Researcher Operations vs. Repository Operations
  • Researcher operations:
    • Pick the location of a stream gauge and a date
    • Access USGS data sets to determine the watershed that surrounds the stream gauge
    • Access USDA for soils data for the watershed
    • Access NASA for Landsat data
    • Access NOAA for precipitation data
    • Access USDOT for roads and dams
    • Project each data set to the region of interest
    • Generate the appropriate environment variables
    • Conduct the watershed analysis
    • Store the workflow, the input files, and the results
  • Data repository management operations:
    • Authenticate the user
    • Authorize the deposition
    • Add a retention period
    • Extract descriptive metadata
    • Record provenance information
    • Log the event
    • Create derived data products (image thumbnails)
    • Add access controls (collection sticky bits)
    • Verify the checksum
    • Version
    • Replicate
    • Index
    • Choose a storage location
    • Choose the physical path name
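The repository-side operations above can be sketched as a deposition pipeline. This is a minimal illustration, not the actual repository software: the `MemoryRepository` class and all of its method names are hypothetical stand-ins for whatever the real data management system provides.

```python
import hashlib
import logging

class MemoryRepository:
    """Hypothetical in-memory stand-in for a data repository."""
    def __init__(self):
        self.records = {}
    def authenticate(self, user):
        return bool(user)
    def authorize(self, user, action):
        return True
    def store(self, name, bits):
        # Choose a storage location / physical path (here: a dict key)
        record = {"name": name, "bits": bits}
        self.records[name] = record
        return record
    def replicate(self, record):
        record["replicas"] = record.get("replicas", 1) + 1
    def index(self, record):
        record["indexed"] = True

def deposit(repository, user, name, bits, claimed_checksum):
    """Run the management operations listed on the slide, in order."""
    if not repository.authenticate(user):          # authenticate the user
        raise PermissionError("authentication failed")
    if not repository.authorize(user, "deposit"):  # authorize the deposition
        raise PermissionError("deposition not authorized")
    checksum = hashlib.sha256(bits).hexdigest()    # verify checksum
    if checksum != claimed_checksum:
        raise ValueError("checksum mismatch")
    record = repository.store(name, bits)
    record["checksum"] = checksum                  # persistent state information
    record["retention_days"] = 3650                # add a retention period
    repository.replicate(record)                   # replicate
    repository.index(record)                       # index
    logging.info("deposited %s", name)             # log the event
    return record
```

A real repository would also extract descriptive metadata, record provenance, and create derived products at the same points in this sequence.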

  6. Concepts Needed for Reproducible Research
  • Data: bits (0s and 1s)
  • Digital object: named bits
  • Data object: named bits plus a representation object
  • Representation object: context containing provenance, descriptive, structural, and administrative information
  • Operation: data manipulation function
  • Workflow: set of chained operations
  • Workflow object: text file listing the chained operations
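The relationships among these concepts can be made concrete with type definitions. A minimal sketch, assuming nothing beyond the definitions on the slide; the field names inside the representation object are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class RepresentationObject:
    """Context for a data object: provenance, descriptive,
    structural, and administrative information."""
    provenance: dict = field(default_factory=dict)
    description: dict = field(default_factory=dict)
    structure: dict = field(default_factory=dict)
    administration: dict = field(default_factory=dict)

@dataclass
class DigitalObject:
    """Named bits."""
    name: str
    bits: bytes

@dataclass
class DataObject(DigitalObject):
    """Named bits plus a representation object."""
    representation: RepresentationObject = field(
        default_factory=RepresentationObject)

@dataclass
class WorkflowObject(DigitalObject):
    """Text file listing the chained operations."""
    operations: list = field(default_factory=list)
```

Note how the duality from slide 3 shows up here: a `WorkflowObject` is itself a data object, so the procedure that generated a data set can be preserved and re-executed alongside it.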

  7. Definition of Operation
  • X.1255: an operation on a digital entity involves the following elements:
    • EntityID: the identifier of the digital entity requesting invocation of the operation
    • TargetEntityID: the identifier of the digital entity to be operated upon
    • OperationID: the identifier that specifies the operation to be performed
    • Input: a sequence of bits containing the input to the operation, including any parameters, content, or other information
    • Output: a sequence of bits containing the output of the operation, including any content or other information
  • The challenge is how to characterize the response of the data management system to a requested operation. The repository may authenticate and authorize, modify state information, log information, add retention, …
    • Pre-process workflow that controls the input (access control, error checking, logging)
    • Operation
    • Post-process workflow that controls the output (changes to state information, audits)
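The pre-process / operation / post-process pattern can be sketched as a dispatcher over the X.1255 elements. This is an illustration only; the hook and registry structure is an assumption, not part of X.1255 itself:

```python
import hashlib

def invoke(entity_id, target_entity_id, operation_id, input_bits,
           operations, pre_hooks=(), post_hooks=()):
    """Invoke an operation identified by operation_id on a target entity.
    pre_hooks model the repository's pre-process workflow (access control,
    error checking, logging); post_hooks model the post-process workflow
    (state changes, audits)."""
    for hook in pre_hooks:
        hook(entity_id, target_entity_id, operation_id, input_bits)
    output = operations[operation_id](target_entity_id, input_bits)
    for hook in post_hooks:
        hook(entity_id, target_entity_id, operation_id, output)
    return output

# One registered operation: compute a checksum of the input bits
operations = {
    "checksum": lambda target, bits: hashlib.sha256(bits).hexdigest().encode(),
}

# A post-process hook that records an audit trail
audit_log = []
def audit(entity_id, target_id, op_id, output):
    audit_log.append((entity_id, target_id, op_id))
```

The point of the pattern is that the repository's response (authorization, state updates, retention, audits) is factored out of the operation itself, so the same operation can run under different management policies.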

  8. Data Access Steps
  • Access a known repository. The researcher has an explicit repository in mind for each data set.
  • Query the repository for data sets that satisfy spatial/temporal relationships.
  • Either:
    • get a list of identifiers, retrieve the data sets, and apply a data subsetting algorithm locally, or
    • apply the data subsetting algorithm at the remote repository.
  • Name the local data subset for processing within the research workflow. This can be:
    • a local collection name, or
    • a global persistent identifier.
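The two subsetting options differ only in where the subsetting algorithm runs. A minimal sketch, assuming a hypothetical repository interface with `query`, `retrieve`, and `subset` methods:

```python
class StubRepository:
    """Hypothetical stand-in: data sets keyed by identifier."""
    def __init__(self, datasets):
        self.datasets = datasets
    def query(self, query):
        # Return identifiers of data sets matching spatial/temporal query
        return list(self.datasets)
    def retrieve(self, identifier):
        return self.datasets[identifier]
    def subset(self, query, algorithm):
        # Option 2: apply the subsetting algorithm at the repository
        return [algorithm(d) for d in self.datasets.values()]

def acquire(repository, query, algorithm, subset_remotely=False):
    """Acquire a spatial/temporal subset from a known repository."""
    if subset_remotely:
        return repository.subset(query, algorithm)
    # Option 1: get identifiers, retrieve full data sets, subset locally
    identifiers = repository.query(query)
    return [algorithm(repository.retrieve(i)) for i in identifiers]
```

Remote subsetting moves less data over the network; local subsetting lets the researcher keep the full data sets for later reuse and re-execution.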

  9. Interactions with Collections: Remote Metadata Catalog and Remote Data Repository
  [Diagram showing two interaction models between a user with a local data repository and remote collections:]
  • DataONE model: the user queries a remote metadata catalog using spatial/temporal parameters; the catalog returns identifiers and metadata for the files that satisfy the spatial/temporal requirements; the user then retrieves the files from the remote data repository using those identifiers.
  • OPeNDAP model: the user queries the remote data repository directly, using spatial/temporal parameters for the desired physical variables; the remote data repository generates the desired data sets and returns them to the user.

  10. Policy-Based Data Management
  • Purpose: reason a collection is assembled
  • Properties: attributes needed to ensure the purpose
  • Policies: enforce and maintain collection properties
  • Procedures: functions that implement the policies
  • Persistent state information: results of applying procedures
  • Property assessment criteria: validation that state information conforms to the desired purpose
  • Federation: controlled sharing of logical name spaces
  • Policy: assertion or assurance that is enforced about a collection or a data set
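A replication policy makes the chain purpose → policy → procedure → persistent state → assessment concrete. A minimal sketch; the class shape is illustrative, while the `DATA_REPL_NUM` attribute name is taken from the policy concept graph on the next slide:

```python
class ReplicationPolicy:
    """Assertion enforced about a collection: every data object
    has at least `min_replicas` replicas."""
    def __init__(self, min_replicas):
        self.min_replicas = min_replicas

    def enforce(self, state, make_replica):
        # Procedure implementing the policy: create replicas until
        # the assertion holds, updating persistent state as it goes.
        while state.get("DATA_REPL_NUM", 1) < self.min_replicas:
            make_replica()
            state["DATA_REPL_NUM"] = state.get("DATA_REPL_NUM", 1) + 1

    def assess(self, state):
        # Property assessment criterion: validate that the persistent
        # state information conforms to the desired purpose.
        return state.get("DATA_REPL_NUM", 1) >= self.min_replicas
```

Separating `enforce` from `assess` matters: enforcement runs when objects change, while assessment can be run periodically over the whole collection to verify that its properties still hold.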

  11. Policy Concept Graph
  [Concept graph relating the elements of policy-based data management: a Collection has a Purpose, which Properties (integrity, authenticity, and access control are subtypes) define; Policies (replication, checksum, quota, data type, and periodic assessment criteria policies are subtypes) enforce the properties, with assessment criteria having the features completeness, correctness, consensus, and consistency; Policies define Procedures, which are workflows that chain functions (GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj) and are invoked at policy enforcement points by operations and client actions; Procedures update the Persistent State Information attributes of a digital object (DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM).]
