Working Group: Data Foundations and Terminology (Practical Policy Considerations)
Reagan Moore
Mapping Terminology to Use Cases
• Consider a hydrologist who needs to:
  • Acquire data sets needed for research
  • Execute an analysis
  • Save the research results
  • Enable another hydrologist to re-execute the analysis
• Embed the goal of data discovery, access, analysis, and management in the larger context of reproducible data-driven research:
  • Where did the data come from?
  • How was the data created?
  • How was the data managed?
Concepts
• There is a duality between:
  • Procedures that generate data objects
  • Data objects generated by a procedure
• Terminology is needed that describes:
  • Operations executed by a researcher to create data objects
  • Operations executed by a repository to manage data objects
Eco-Hydrology
[Figure: RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and the full initial system state. Steps and sources shown: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM); streams (NHD); roads (DOT); land use, NLCD (EPA); leaf area index, Landsat TM; phenology, MODIS; soil data, USDA. Derived products shown: slope, aspect, nested watershed structure (basin, hillslope, patch, strata), stream network, soil and vegetation parameter files, flowtable, and worldfile, which feed RHESSys.]
Researcher operations vs. repository operations
• Researcher operations
  • Pick the location of a stream gauge and a date
  • Access USGS data sets to determine the watershed that surrounds the stream gauge
  • Access USDA for soils data for the watershed
  • Access NASA for Landsat data
  • Access NOAA for precipitation data
  • Access USDOT for roads and dams
  • Project each data set to the region of interest
  • Generate the appropriate environment variables
  • Conduct the watershed analysis
  • Store the workflow, the input files, and the results
• Data repository management operations
  • Authenticate the user
  • Authorize the deposition
  • Add a retention period
  • Extract descriptive metadata
  • Record provenance information
  • Log the event
  • Create derived data products (image thumbnails)
  • Add access controls (collection sticky bits)
  • Verify checksum
  • Version
  • Replicate
  • Index
  • Choose a storage location
  • Choose the physical path name
Concepts needed for reproducible research
• Data: bits (0s and 1s)
• Digital object: named bits
• Data object: named bits plus a representation object
• Representation object: context containing provenance, description, structural, and administrative information
• Operation: data manipulation function
• Workflow: set of chained operations
• Workflow object: text file listing the chained operations
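The definitions above nest inside one another, which a small data model can make concrete. The following is a minimal sketch, assuming Python dataclasses. The class and field names are illustrative choices that mirror the slide's terms, not part of any published schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DigitalObject:
    """Named bits: a name bound to a sequence of bits."""
    name: str
    bits: bytes


@dataclass
class RepresentationObject:
    """Context for the bits, split into the four kinds of information listed above."""
    provenance: Dict[str, str] = field(default_factory=dict)
    description: Dict[str, str] = field(default_factory=dict)
    structural: Dict[str, str] = field(default_factory=dict)
    administrative: Dict[str, str] = field(default_factory=dict)


@dataclass
class DataObject:
    """Named bits plus a representation object."""
    digital_object: DigitalObject
    representation: RepresentationObject


# An operation is a data manipulation function; here it maps bits to bits.
Operation = Callable[[bytes], bytes]


@dataclass
class Workflow:
    """A set of chained operations, applied in list order."""
    operations: List[Operation]

    def run(self, bits: bytes) -> bytes:
        for op in self.operations:
            bits = op(bits)
        return bits


def workflow_object(operation_names: List[str]) -> str:
    """A workflow object: text listing the chained operations, one per line."""
    return "\n".join(operation_names)


# Usage: chain two built-in byte operations and record their names as text.
wf = Workflow([bytes.upper, bytes.strip])
result = wf.run(b"  gauge data ")
listing = workflow_object(["upper", "strip"])
```

Keeping the workflow object as plain text separate from the executable `Workflow` reflects the slide's distinction between the chained operations and the file that lists them.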
Definition of operation
• X.1255: An operation on a digital entity involves the following elements:
  • EntityID: the identifier of the digital entity requesting invocation of the operation;
  • TargetEntityID: the identifier of the digital entity to be operated upon;
  • OperationID: the identifier that specifies the operation to be performed;
  • Input: a sequence of bits containing the input to the operation, including any parameters, content or other information; and
  • Output: a sequence of bits containing the output of the operation, including any content or other information.
• The challenge is how to characterize the response of the data management system to a requested operation. The repository may authenticate and authorize, modify state information, log information, add retention, …
  • Pre-process workflow that controls the input (access control, error checking, logging)
  • Operation
  • Post-process workflow that controls the output (changes to state information, audits)
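The five elements and the three-stage response can be sketched together. This is a minimal illustration, not an X.1255 implementation: the `Repository` class, its fields, and the log message format are hypothetical, and the field names are Python-style renderings of the element names above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class OperationRequest:
    """The five elements of an operation on a digital entity."""
    entity_id: str           # EntityID: entity requesting the invocation
    target_entity_id: str    # TargetEntityID: entity to be operated upon
    operation_id: str        # OperationID: which operation to perform
    input_bits: bytes        # Input: bits passed to the operation
    output_bits: bytes = b""  # Output: bits produced by the operation


class Repository:
    """Hypothetical repository that wraps each operation in pre- and post-processing."""

    def __init__(self, authorized: Set[str],
                 operations: Dict[str, Callable[[bytes], bytes]]):
        self.authorized = authorized   # entity IDs allowed to invoke operations
        self.operations = operations   # operation ID -> function on bits
        self.log: List[str] = []       # audit trail
        self.state: Dict[str, dict] = {}  # state information per target entity

    def pre_process(self, req: OperationRequest) -> None:
        """Control the input: access control, error checking, logging."""
        if req.entity_id not in self.authorized:
            self.log.append(f"denied {req.entity_id} {req.operation_id}")
            raise PermissionError(req.entity_id)
        if req.operation_id not in self.operations:
            self.log.append(f"unknown {req.operation_id}")
            raise KeyError(req.operation_id)
        self.log.append(f"accepted {req.entity_id} {req.operation_id}")

    def post_process(self, req: OperationRequest) -> None:
        """Control the output: update state information and audit."""
        self.state[req.target_entity_id] = {
            "last_operation": req.operation_id,
            "output_size": len(req.output_bits),
        }
        self.log.append(f"completed {req.operation_id} on {req.target_entity_id}")

    def invoke(self, req: OperationRequest) -> OperationRequest:
        self.pre_process(req)
        req.output_bits = self.operations[req.operation_id](req.input_bits)
        self.post_process(req)
        return req


# Usage: one authorized entity and one registered operation.
repo = Repository({"alice"}, {"reverse": lambda bits: bits[::-1]})
done = repo.invoke(OperationRequest("alice", "obj-1", "reverse", b"abc"))
```

The requested operation itself stays a plain function on bits; everything the repository adds on its own initiative lives in the two wrapper stages.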
Data Access Steps
• Access a known repository. The researcher has an explicit repository in mind for each data set.
• Query the repository for data sets that satisfy spatial/temporal relationships
• Either
  • get a list of identifiers, retrieve the data sets, and apply a data subsetting algorithm locally
  • or apply the data subsetting algorithm at the remote repository
• Name the local data subset for processing within the research workflow. This can be
  • a local collection name
  • or a global persistent identifier.
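The two subsetting options differ only in where the filter runs, which a short sketch can show. It assumes an in-memory stand-in for the remote repository; `RemoteRepository`, `Record`, the bounding-box convention, and the naming scheme in `name_subsets` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)
Years = Tuple[int, int]                   # inclusive (first_year, last_year)


@dataclass
class Record:
    lon: float
    lat: float
    year: int
    value: float


def in_region(r: Record, bbox: BBox, years: Years) -> bool:
    """The data subsetting test: inside the box and inside the time window."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return (min_lon <= r.lon <= max_lon and min_lat <= r.lat <= max_lat
            and years[0] <= r.year <= years[1])


class RemoteRepository:
    """Hypothetical in-memory stand-in for a known remote repository."""

    def __init__(self, data_sets: Dict[str, List[Record]]):
        self.data_sets = data_sets

    def query(self, bbox: BBox, years: Years) -> List[str]:
        """Return identifiers of data sets with at least one matching record."""
        return [ident for ident, recs in self.data_sets.items()
                if any(in_region(r, bbox, years) for r in recs)]

    def retrieve(self, ident: str) -> List[Record]:
        """Return the whole data set."""
        return list(self.data_sets[ident])

    def subset(self, ident: str, bbox: BBox, years: Years) -> List[Record]:
        """Apply the subsetting algorithm at the repository."""
        return [r for r in self.data_sets[ident] if in_region(r, bbox, years)]


def acquire_local(repo: RemoteRepository, bbox: BBox, years: Years):
    """Option 1: list identifiers, retrieve full data sets, subset locally."""
    return {i: [r for r in repo.retrieve(i) if in_region(r, bbox, years)]
            for i in repo.query(bbox, years)}


def acquire_remote(repo: RemoteRepository, bbox: BBox, years: Years):
    """Option 2: ask the repository to subset before returning data."""
    return {i: repo.subset(i, bbox, years) for i in repo.query(bbox, years)}


def name_subsets(subsets: Dict[str, List[Record]], collection: str):
    """Name each subset under a local collection name (illustrative scheme)."""
    return {f"{collection}/{ident}": recs for ident, recs in subsets.items()}
```

Both options yield the same subset; the trade-off is that the first moves whole data sets across the network while the second requires the repository to support the subsetting algorithm.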
Interactions with collections: remote metadata catalog and remote data repository
• DataONE model
  • User queries the remote metadata (MD) catalog using spatial/temporal parameters
  • Repository sends identifiers and metadata for files that satisfy the spatial/temporal requirements
  • User retrieves files using the identifiers
• OPeNDAP model
  • User queries the remote data repository using spatial/temporal parameters for desired physical variables
  • Desired data sets are generated by the remote data repository and returned to the user
[Diagram: for each model, a user, a local data repository, and a remote data repository, each holding data collections and data sets; the DataONE model adds a remote MD catalog holding related metadata for the data sets.]
Policy-Based Data Management
• Policy: assertion or assurance that is enforced about a collection or a data set
• Purpose: reason a collection is assembled
• Properties: attributes needed to ensure the purpose
• Policies: enforce and maintain collection properties
• Procedures: functions that implement the policies
• Persistent state information: results of applying procedures
• Property assessment criteria: validation that state information conforms to the desired purpose
• Federation: controlled sharing of logical name spaces
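The chain from purpose to assessment can be traced with one property. This is a minimal sketch, assuming an integrity property enforced by a replication-plus-checksum policy; the `Collection` class, the function names, and the state keys are hypothetical and do not reflect any particular data grid's API.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Collection:
    purpose: str            # reason the collection is assembled
    required_replicas: int  # property needed to ensure the purpose
    objects: Dict[str, bytes] = field(default_factory=dict)
    replicas: Dict[str, List[bytes]] = field(default_factory=dict)
    state: Dict[str, dict] = field(default_factory=dict)  # persistent state information


def checksum(bits: bytes) -> str:
    return hashlib.sha256(bits).hexdigest()


def replicate_procedure(c: Collection, name: str) -> None:
    """Procedure: make copies until the required number exists, then record state."""
    copies = c.replicas.setdefault(name, [])
    while len(copies) < c.required_replicas:
        copies.append(bytes(c.objects[name]))
    c.state[name] = {
        "replica_count": len(copies),
        "checksum": checksum(c.objects[name]),
    }


def deposit_policy(c: Collection, name: str, bits: bytes) -> None:
    """Policy: every deposited object is replicated and checksummed."""
    c.objects[name] = bits
    replicate_procedure(c, name)


def assess(c: Collection) -> List[str]:
    """Assessment: names whose state or replicas no longer satisfy the property."""
    failures = []
    for name in c.objects:
        st = c.state.get(name, {})
        enough = st.get("replica_count", 0) >= c.required_replicas
        intact = all(checksum(copy) == st.get("checksum")
                     for copy in c.replicas.get(name, []))
        if not (enough and intact):
            failures.append(name)
    return failures


# Usage: deposit one object, then assess the collection.
coll = Collection(purpose="reproducible hydrology research", required_replicas=2)
deposit_policy(coll, "worldfile", b"basin 1")
clean = assess(coll)
```

The assessment reads only the recorded state and the stored copies, so it can be re-run periodically without knowing which procedure produced the state.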
Policy Concept Graph
[Figure: concept graph for policy-based data management. Node labels include Purpose, Collection, Property, Policy, Procedure, Persistent State Information, Periodic Assessment Criteria, Digital Object, Attribute, Workflow, Function, Operation, Client Action, and Policy Enforcement Point; Integrity, Authenticity, Access control, Completeness, Correctness, Consensus, and Consistency; Replication Policy, Checksum Policy, Quota Policy, and Data Type Policy; GetUserACL, SetDataType, SetQuota, DataObjRepl, and SysChksumDataObj; DATA_ID, DATA_REPL_NUM, and DATA_CHECKSUM. Edge labels include Defines, Has, Isa, SubType, HasFeature, Updates, Controls, Chains, and Invokes.]