1. Data Integrity Issues: How to Proceed?
Engineering Node
Elizabeth Rye
2. 3 August 2006
Data Integrity Issues
2 of 19
PDS Requirements for Data Integrity
The PDS has made a commitment to ensure the integrity of its data archives. This commitment is primarily spelled out in the Level 3 requirement 4.1.2:
“PDS will develop and implement procedures for periodically ensuring the integrity of the data.”
Several other Level 3 requirements suggest additional implications for data integrity assurance.
3. PDS Requirements for Data Integrity
The PDS is responsible for assisting data providers in determining how to validate the data they provide:
“PDS will provide criteria for validating archival products” (1.3.3)
The PDS is responsible for ascertaining that the data we deliver to the NSSDC is valid:
“PDS will meet U.S. federal regulations for the preservation and management of data.” (2.8.3)
“PDS will meet U.S. federal regulations for preservation and management of the data through its Memorandum of Understanding (MOU) with the National Space Science Data Center (NSSDC)” (4.1.5)
4. PDS Requirements for Data Integrity
The PDS is responsible for enabling our users to verify the integrity of the data they receive from us:
“PDS will develop and maintain online mechanisms allowing users to download portions of the archive” (3.2.1)
“PDS will develop and maintain a mechanism for offline delivery of portions of the archive to users” (3.2.2)
“PDS will provide mechanisms to ensure that data have been transferred intact” (3.2.3)
The PDS needs to ensure the maintenance of data integrity through the media refreshing process:
“PDS will develop and implement procedures for periodically refreshing the data by updating the underlying storage technology” (4.1.3)
5. PDS Requirements for Data Integrity
The PDS has a stated goal of utilizing standardized procedures in areas that affect inter-node data transfers:
“PDS will provide standard protocols for accessing data, metadata and computing resources across the distributed archive” (2.7.3)
6. PDS Requirements for Data Integrity
From the above requirements, we can derive several areas of concern for data integrity:
Verifying the integrity of data stored on physical media
Detecting errors introduced during transfer of data to newer media
Detecting errors that occur during transmission of data:
From data providers to the PDS
Between PDS nodes
From the PDS to the NSSDC
From the PDS to end users
7. PDS Requirements for Data Integrity
Two additional areas involve data integrity issues but cannot be derived from existing PDS requirements:
The re-delivery of non-archived data during the operations phase of a mission
The potential updating of data to newer formats long after it has been archived
8. Mitch Gordon Survey
For each numbered item, do you think that it is an important issue for us to address?
Section A - It is critical that the PDS be able to ascertain the integrity of its archive. This includes (but is not limited to):
detecting errors that occur during the transmission of data from providers to the PDS,
detecting errors that occur during the transmission of data between PDS nodes,
detecting errors that occur during the transmission of data from the PDS to end users,
detecting errors that occur during the transmission of data from the PDS to the NSSDC,
verifying the integrity of data stored on various types of external physical media (all of which have finite life spans),
detecting errors introduced during transfer of data to newer media.
9. Mitch Gordon Survey
10. Possible Solutions to the Problem
Checksums are widely accepted in the broader community as a means for ensuring data integrity
MD5 checksums, in particular, are well suited to this purpose
No node has suggested any mechanism besides checksums as a means for detecting changes in data
There is no consensus within the PDS as to whether we should limit ourselves to the MD5 checksum algorithm
There is little consensus within the PDS as to whether we should use a standardized approach to utilizing checksums to verify data integrity
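As an illustration of the checksum approach discussed on this slide, the following is a minimal sketch of computing an MD5 checksum for a single file using Python's standard `hashlib` module. The function name `md5_of_file` and the chunk size are assumptions made for this example; they are not part of any PDS tool or standard.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in fixed-size chunks so that large data products
    do not have to fit in memory. `chunk_size` is an arbitrary choice.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digest computed after a transfer or a media refresh differs from the digest recorded earlier, the file's contents have changed.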
11. Mitch Gordon Survey
Section B - Identify a tool that can help (not necessarily be sufficient) with any, or hopefully all, of the above.
Use a single tool, MD5, for generating and validating checksums
Section C - Establish policies for the use of the tool in a variety of situations.
12. Mitch Gordon Survey
13. Issues to be Addressed
14. Standardization Issue
Should we have a standardized approach across the PDS for storing and accessing checksums, or should each node be permitted to use whatever mechanism it chooses?
Some flexibility is needed to deal with the variety of ways in which data providers deliver data to the PDS
Standardization permits the development of tools for generating, accessing, and periodically validating against checksums
Standardization permits the addition of checksum tools to existing interfaces (such as PDS-D and the NSSDC delivery mechanism) to utilize and validate against checksums
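To make the point about validation tools concrete, here is a hedged sketch of a periodic validation pass over a stored checksum list. It assumes a hypothetical manifest format of one `<md5 digest> <relative path>` entry per line (similar to the output of common `md5sum` utilities); the slides do not specify a format, and the names `file_md5` and `verify_manifest` are invented for this example.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest_path):
    """Check every file listed in the manifest against its recorded digest.

    Assumed (hypothetical) manifest format: "<digest> <relative path>" per line.
    Returns a list of (relative_path, problem) tuples; an empty list means
    every listed file is present and matches its recorded checksum.
    """
    problems = []
    with open(manifest_path, "r", encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            expected, rel_path = line.split(None, 1)
            full_path = os.path.join(root, rel_path)
            if not os.path.isfile(full_path):
                problems.append((rel_path, "missing"))
            elif file_md5(full_path) != expected.lower():
                problems.append((rel_path, "checksum mismatch"))
    return problems
```

A tool like this can only be shared across nodes if every node stores its checksums in the same format and location, which is the trade-off this slide raises.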
15. Urgency
Volume of data returned from missions is increasing exponentially every couple of years
Going back and calculating checksums for every file already in the PDS holdings is currently feasible, but will become a significantly more difficult task with each passing year
16. Policy Questions to be Answered
At what level of detail should checksums be required?
For what parts of the archiving process should checksums be required?
To what degree should standardization among nodes be insisted upon?
When should we begin requiring checksums?
17. Current Proposal (SCR 3-1034, V9)
Mandates generation of file checksums for every file on every archive volume
Mandates standardized format and location for storage of checksums
Is insufficient to solve all data integrity problems, but is a necessary part of the solution
Required for all missions archiving to v3.8 or higher of the Standards Reference (roughly, missions starting the process late this year)
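The idea of generating a checksum for every file on an archive volume can be sketched as follows. This is an illustration only: the format and storage location mandated by the SCR are not given in these slides, so the `<digest> <relative path>` line layout and the names `file_md5` and `build_manifest` are assumptions.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(volume_root):
    """Return one "<digest>  <relative path>" line per file under volume_root.

    The line layout is a hypothetical choice for this sketch. Paths are
    written with forward slashes and visited in sorted order so the
    output is reproducible across platforms and runs.
    """
    lines = []
    for dirpath, dirnames, filenames in os.walk(volume_root):
        dirnames.sort()  # sort in place so os.walk descends deterministically
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, volume_root)
            rel_path = rel_path.replace(os.sep, "/")
            lines.append(f"{file_md5(full_path)}  {rel_path}")
    return lines
```

The returned lines could then be written to whatever checksum file the standard designates for the volume.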
18. Most Recent Votes on Checksum SCR
19. Options for Next Step
Proceed with MC vote on version 9 of SCR
Form new working group to come up with a new proposal
MC draft policy on data integrity to provide further guidance to Tech group
Drop the issue (fails to meet our requirements)
Other?