Analysing the Impact of File Formats on Data Integrity Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008
Overview
• Introduction
• File format data and information loss
• What happens if data is corrupted in files?
• Categories of file format data
• Measuring Information Loss
• Robustness Indicators
• Study results for different file formats
Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland
Background
• EU-funded project “Planets”: characterisation of file format content (www.planets-project.eu)
• University of Cologne, Computer Science for the Humanities (Historisch-Kulturwissenschaftliche Informationsverarbeitung, HKI), Planets partner (www.hki.uni-koeln.de)
Context
• Long-term preservation of digital information: which file format to choose?
• Criteria, e.g.:
  - Open standard
  - Spread of usage
  - Hardware/software dependencies
  - Authenticity
  - …
  - Robustness
Robustness ::= Error resilience of file formats against bit-stream corruption
Issues / Research topics
• Is there any correlation between file format and data integrity?
• If so, are there any differences among file formats concerning the degree of robustness?
• Which file-format-based factors are responsible for varying degrees of robustness?
• How can we improve the robustness of file formats?
Benefits
• Digital preservation: decision support for choosing a file format for long-term preservation
• Contribution to file format research
• Improvement of existing file formats
• Design of future file formats
File Format Data and Information Loss
What is a “file format” in our context?
• Set of rules constituting the logical organisation of data
• Set of rules indicating how to interpret data
• Set of rules: file format specification
• File format data ::= binary data, formatted according to the rules of a file format
What happens if data is corrupted in files?
Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel
[Figure: first 224 bytes of the test file; label: FF]
Plain information loss: 1 byte of data = 1 pixel
What happens if data is corrupted in files? (continued)
Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel
[Figure: part of the TIF Image File Directory, tag Photometric Interpretation; label: 00]
Conditional information loss: 1 bit changed = 100% of the information changed
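The two cases above can be illustrated with a minimal Python sketch. It does not parse a real TIF file: `render` and `changed_fraction` are hypothetical helpers, the pixel pattern is made up, and the direction of the tag change (1 to 0) is an assumption. The only fact taken from the TIFF baseline is that a Photometric Interpretation value of 0 means WhiteIsZero and 1 means BlackIsZero.

```python
WIDTH = HEIGHT = 32  # same dimensions as the test image on the slides


def render(pixels, photometric):
    """Return the displayed brightness (0 = black, 255 = white) of 8-bit samples.

    photometric follows the TIFF baseline values: 0 = WhiteIsZero, 1 = BlackIsZero.
    """
    if photometric == 1:
        return list(pixels)
    return [255 - value for value in pixels]


def changed_fraction(before, after):
    """Fraction of pixels whose displayed value differs."""
    return sum(a != b for a, b in zip(before, after)) / len(before)


# Made-up payload: 1024 greyscale samples (assumption, not the real test image).
pixels = bytes((i * 8) % 256 for i in range(WIDTH * HEIGHT))
reference = render(pixels, photometric=1)

# Plain information loss: one corrupted payload byte damages exactly one pixel.
damaged = bytearray(pixels)
damaged[100] ^= 0xFF
plain_loss = changed_fraction(reference, render(bytes(damaged), photometric=1))

# Conditional information loss: one flipped bit in the tag value (1 -> 0)
# changes how every pixel is interpreted.
conditional_loss = changed_fraction(reference, render(pixels, photometric=0))

print(f"plain: {plain_loss:.4%}, conditional: {conditional_loss:.0%}")
```

Every pixel changes in the second case because an 8-bit value v can never equal 255 - v.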
Categories of File Format Data
• Technical data (data for processing), e.g.:
  - Image width: 277
  - Image length: 339
  - Compression: uncompressed
• “Payload” data (basic data of usage), e.g.: pixel data, starting from byte #0x008
Robustness Indicators
(1) RB = Δ(b0, b1) / m
where
• b0 is the basic data of usage (payload data) before being corrupted,
• b1 is the basic data of usage after being corrupted,
• Δ(b0, b1) is the number of payload bytes that differ between b0 and b1, summed over all corruption procedures,
• m is the number of corruption procedures.
RB indicates the average information loss per corruption procedure.
Example
A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times, and the number of bytes changed in each corruption procedure is:
1. Δ(b0, b1) = 200 bytes
2. Δ(b0, b1) = 150 bytes
3. Δ(b0, b1) = 250 bytes
The average information loss for file X over 3 corruption procedures is then RB = 600 / 3 = 200 bytes.
RB related to the total amount of payload data:
(2) RBt = RB / n
where n is the total amount of basic data of usage (payload data), in bytes.
(3) RBt = RB / n * 100 (RBt expressed as a percentage)
Interpretation: RBt = 0 %: maximum robustness (minimum information loss)
Example (continued)
• RBt = 200 / 2000 = 0.1
• RBt = 200 / 2000 * 100 = 10 (%)
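The worked example can be reproduced with a few lines of Python. This is only a sketch of formulas (1) to (3): the function names `rb`, `rbt` and `rbt_percent` are hypothetical, and the per-procedure byte counts are the ones from the example.

```python
def rb(deltas):
    """Formula (1): average number of payload bytes changed per corruption procedure."""
    return sum(deltas) / len(deltas)


def rbt(deltas, n):
    """Formula (2): RB relative to the total payload size n (in bytes)."""
    return rb(deltas) / n


def rbt_percent(deltas, n):
    """Formula (3): RBt expressed as a percentage."""
    return rb(deltas) * 100 / n


# File X from the example: 2000 bytes of payload, corrupted 3 times.
deltas = [200, 150, 250]
payload_size = 2000

print(rb(deltas))                         # average loss in bytes
print(rbt(deltas, payload_size))          # share of the payload
print(rbt_percent(deltas, payload_size))  # share of the payload in percent
```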
Study on Robustness for various file formats: Example Results
File formats studied:
• TIF
  - uncompressed
  - LZW
  - JPEG (2 different compression levels)
  - ZIP
• PNG (filtered, unfiltered)
• JPEG2000 (lossless, lossy)
• BMP (uncompressed)
Study on Robustness for various file formats: Example Results
Method
- simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)
- applying 3-5 different corruption ratios:
  • less than 0.01%
  • 0.01%
  • 0.1%
  • 1.0%
  • more than 1.0%
Method (continued)
- compressed payload data is decompressed
- original and corrupted payload data are compared
- Robustness Indicator values are computed
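The method can be sketched as a small simulation. This is not the tool used in the study: it works on a bare payload instead of real image files, uses Python's `zlib` as a stand-in for a compressed format, and assumes that a failed decompression counts as total loss of the payload. All function names (`corrupt`, `bytes_changed`, `decode_or_empty`, `rbt_percent`) are hypothetical.

```python
import random
import zlib


def corrupt(data, ratio, rng):
    """Overwrite roughly ratio * len(data) bytes (at least one) with different values."""
    buf = bytearray(data)
    count = max(1, round(len(buf) * ratio))
    for pos in rng.sample(range(len(buf)), count):
        buf[pos] ^= rng.randrange(1, 256)  # XOR with a non-zero value always changes the byte
    return bytes(buf)


def bytes_changed(original, damaged):
    """Delta(b0, b1): differing positions plus any difference in length."""
    diff = sum(a != b for a, b in zip(original, damaged))
    return diff + abs(len(original) - len(damaged))


def decode_or_empty(stored):
    """Decompress; assumption: an undecodable stream yields no payload at all."""
    try:
        return zlib.decompress(stored)
    except zlib.error:
        return b""


def rbt_percent(payload, encode, decode, ratio, m, seed=0):
    """Corrupt the stored form m times and return the average payload loss in percent."""
    rng = random.Random(seed)
    stored = encode(payload)
    total = 0
    for _ in range(m):
        recovered = decode(corrupt(stored, ratio, rng))
        total += min(bytes_changed(payload, recovered), len(payload))
    return total / m / len(payload) * 100


payload = bytes(range(256)) * 4  # made-up 1024-byte payload
ratio = 1 / 1024                 # about 0.1 %, i.e. one byte of the raw payload

raw = rbt_percent(payload, lambda b: b, lambda b: b, ratio, m=50)
packed = rbt_percent(payload, zlib.compress, decode_or_empty, ratio, m=50)
print(f"uncompressed: {raw:.3f} %, zlib-compressed: {packed:.3f} %")
```

Treating a decoding error as total loss is a deliberately pessimistic choice; a decoder that salvages partial output would report lower values for the compressed case.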
Example: JP2-formatted image, corruption of 1 byte, “bad case”
Example: JP2-formatted image, corruption of 1 byte, “good case”
Example: JP2-formatted image, corruption of 1 byte, “good case” with visualised differences in pixel data
Thank you very much! Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008