Analysing the Impact of File Formats on Data Integrity

Volker Heydegger, University of Cologne. Archiving 2008, Bern, 23rd – 27th June 2008.

Presentation Transcript


  1. Analysing the Impact of File Formats on Data Integrity. Volker Heydegger, University of Cologne. Archiving 2008, Bern, 23rd – 27th June 2008

  2. Overview • Introduction • File format data and information loss • What happens if data is corrupted in files? • Categories of file format data • Measuring Information Loss • Robustness Indicators • Study results for different file formats Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland

  3. Overview • Introduction • File format data and information loss • What happens if data is corrupted in files? • Categories of file format data • Measuring Information Loss • Robustness Indicators • Study results for different file formats

  4. Background • EU-funded project “Planets” → characterisation of file format content (www.planets-project.eu) • University of Cologne, Computer Science for the Humanities (Historisch-Kulturwissenschaftliche Informationsverarbeitung, HKI) → Planets partner (www.hki.uni-koeln.de)

  5. Context • Long-term preservation of digital information • Which file format to choose? • Criteria, e.g.: open standard, spread of usage, hardware/software dependencies, authenticity, …, robustness

  6. Robustness ::= Error resilience of file formats against bit-stream corruption

  7. Issues / Research topics • Is there any correlation between file format and data integrity? • If so, are there any differences among file formats concerning the degree of robustness? • Which file-format-based factors are responsible for varying degrees of robustness? • How can we improve the robustness of file formats?

  8. Benefits • Digital preservation: decision support for choosing a file format for long-term preservation • Contribution to file format research • Improvement of existing file formats • Design of future file formats

  9. Overview • Introduction • File format data and information loss • What happens if data is corrupted in files? • Categories of file format data • Measuring Information Loss • Robustness Indicators • Study results for different file formats

  10. File Format Data and Information Loss. What is “File Format” in our context? • Set of rules constituting the logical organisation of data • Set of rules indicating how to interpret data • Set of rules → file format specification • File Format Data ::= Binary data, formatted according to the rules of a file format

  11. What happens if data is corrupted in files? Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel

  12. First 224 bytes of the test file [figure; label: FF]

  13. Plain information loss: 1 byte of data = 1 pixel
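The plain case on slides 11 to 13 can be sketched as follows. This is a minimal illustration, not the author's tooling: the payload contents, the corrupted offset and the helper name `changed_bytes` are assumptions made for the example. It shows that overwriting one payload byte of an uncompressed 8-bit greyscale image affects exactly one pixel.

```python
def changed_bytes(original: bytes, corrupted: bytes) -> int:
    """Count the positions at which two equal-length byte strings differ."""
    if len(original) != len(corrupted):
        raise ValueError("inputs must have the same length")
    return sum(a != b for a, b in zip(original, corrupted))


# Hypothetical payload: 32x32 greyscale pixels, 8 bits (one byte) per pixel.
pixels = bytes([128] * (32 * 32))

# Overwrite a single payload byte with 0xFF (the offset is arbitrary).
corrupted = bytearray(pixels)
corrupted[100] = 0xFF

lost_pixels = changed_bytes(pixels, bytes(corrupted))
print(lost_pixels)
```

Because one byte maps to one pixel here, the damage stays local to the corrupted position.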

  14. What happens if data is corrupted in files? Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel

  15. Part of the TIF Image File Directory, tag: Photometric Interpretation [figure; label: 00]

  16. Conditional information loss: 1 bit changed = 100% of the information changed
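The conditional case on slides 14 to 16 can be sketched in the same way. In TIFF, a Photometric Interpretation value of 0 means WhiteIsZero and 1 means BlackIsZero, so flipping the lowest bit of that tag value inverts how every sample is displayed. The `render` helper and the sample data below are assumptions made for illustration; no real TIFF file is parsed.

```python
from typing import List


def render(samples: bytes, photometric: int) -> List[int]:
    """Map stored 8-bit samples to displayed intensity (0 = black, 255 = white)."""
    if photometric == 0:  # WhiteIsZero: stored 0 is displayed as white
        return [255 - s for s in samples]
    if photometric == 1:  # BlackIsZero: stored 0 is displayed as black
        return list(samples)
    raise ValueError("unsupported Photometric Interpretation value")


# Hypothetical payload of 32x32 = 1024 samples; the payload bytes stay untouched.
samples = bytes(range(256)) * 4

before = render(samples, 1)  # intact tag value
after = render(samples, 0)   # same tag value with its lowest bit flipped

changed = sum(a != b for a, b in zip(before, after))
share_changed = changed / len(samples)
print(changed, share_changed)
```

No 8-bit sample equals its own inverse, so every displayed pixel differs although only a single bit of technical data was altered.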

  17. Categories of File Format Data • Technical data (data for processing): image width: 277, image length: 339, compression: uncompressed

  18. “Payload” data (basic data of usage): pixel data, starting from byte #0x008

  19. Overview • Introduction • File format data and information loss • What happens if data is corrupted in files? • Categories of file format data • Measuring information loss • Robustness Indicators • Study results for different file formats

  20. Robustness Indicators (1) RB = Δ(b0, b1) / m, where • b0 is the basic data of usage (payload data) before being corrupted, • b1 is the basic data of usage after being corrupted, • Δ(b0, b1) is the amount of payload data that differs between b0 and b1, summed over all corruption procedures, • m is the number of corruption procedures. RB indicates an average information loss.

  21. Example: A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of bytes changed in each corruption procedure is: 1. Δ(b0, b1) = 200 bytes 2. Δ(b0, b1) = 150 bytes 3. Δ(b0, b1) = 250 bytes. The average information loss for file X based on 3 corruption procedures is then RB = 600 / 3 = 200
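The arithmetic of this example can be checked with a few lines of Python. The function name `rb` is an assumption for illustration; it reads formula (1) as the sum of the per-procedure Δ values divided by the number of procedures m, which is how the example applies it.

```python
def rb(deltas):
    """Average information loss: total changed payload bytes divided by m."""
    if not deltas:
        raise ValueError("at least one corruption procedure is required")
    return sum(deltas) / len(deltas)


# Bytes changed in each of the 3 corruption procedures from the example.
deltas = [200, 150, 250]

rb_value = rb(deltas)
print(rb_value)  # 600 / 3 = 200.0
```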

  22. RB related to the total amount of payload data: (2) RBt = RB / n, where n is the total amount of basic data of usage (payload data). (3) RBt = RB / n * 100 (RBt expressed as a percentage). Interpretation: RBt = 0 %: max. robustness (min. information loss)

  23. Example (continued) • RBt = 200 / 2000 = 0.1 • RBt = 200 / 2000 * 100 = 10 (%)
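The normalised indicator can be sketched the same way. The function names are assumptions for illustration; the values are those of the running example (RB = 200, n = 2000 bytes of payload data).

```python
def rbt(rb_value: float, n: int) -> float:
    """Formula (2): average information loss as a fraction of the payload size n."""
    if n <= 0:
        raise ValueError("payload size n must be positive")
    return rb_value / n


def rbt_percent(rb_value: float, n: int) -> float:
    """Formula (3): the same ratio expressed as a percentage."""
    if n <= 0:
        raise ValueError("payload size n must be positive")
    return 100 * rb_value / n


fraction = rbt(200, 2000)
percent = rbt_percent(200, 2000)
print(fraction, percent)
```

A value of 0 % means no payload data was affected, which is the maximum robustness in the interpretation given on slide 22.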

  24. Study on Robustness for various file formats: Example Results • TIF (uncompressed, LZW, JPEG at 2 different compression levels, ZIP) • PNG (filtered, unfiltered) • JPEG2000 (lossless, lossy) • BMP (uncompressed)

  25. Study on Robustness for various file formats: Example Results. Method: • simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures) • applying 3-5 different corruption ratios: less than 0.01%, 0.01%, 0.1%, 1.0%, more than 1.0%
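One corruption procedure at a given corruption ratio might look like the sketch below. The slides do not say how positions are chosen or whether the ratio counts bytes or bits, so the byte-level random overwrite, the function name `corrupt` and the fixed seed are all assumptions made for illustration.

```python
import random


def corrupt(data: bytes, ratio: float, rng: random.Random) -> bytes:
    """Return a copy of data with about ratio * len(data) bytes altered (at least one)."""
    if not data:
        raise ValueError("data must not be empty")
    count = min(len(data), max(1, round(len(data) * ratio)))
    out = bytearray(data)
    for pos in rng.sample(range(len(out)), count):
        # XOR with a non-zero value guarantees that the byte really changes.
        out[pos] ^= rng.randrange(1, 256)
    return bytes(out)


rng = random.Random(2008)      # fixed seed so the run is repeatable
data = bytes(10_000)           # hypothetical file of 10,000 zero bytes
damaged = corrupt(data, 0.001, rng)  # corruption ratio of 0.1%

n_changed = sum(a != b for a, b in zip(data, damaged))
print(n_changed)
```

Repeating the call many times per file (up to 3000 in the study) and averaging the resulting payload differences yields the RB value.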

  26. Method: • compressed payload data is decompressed • original and corrupted payload data are compared • Robustness Indicator values are computed
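The decompress-and-compare step can be illustrated with a toy codec. The run-length scheme below is a stand-in chosen for brevity, not one of the compression methods in the study (LZW, JPEG, ZIP), and all function names are hypothetical. It shows why damage has to be measured on the decompressed payload: one corrupted byte in the stored stream changes several payload bytes.

```python
def rle_encode(data: bytes) -> bytes:
    """Toy run-length encoding: a sequence of (run length, value) byte pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode; a trailing unpaired byte is ignored."""
    out = bytearray()
    for k in range(0, len(encoded) - 1, 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


def payload_loss(original: bytes, recovered: bytes) -> int:
    """Differing bytes plus any length difference between the two payloads."""
    differing = sum(a != b for a, b in zip(original, recovered))
    return differing + abs(len(original) - len(recovered))


payload = bytes([10] * 4 + [20] * 4 + [30] * 4)  # hypothetical 12-byte payload
stored = bytearray(rle_encode(payload))          # 6 bytes: three (run, value) pairs
stored[1] = 99                                   # corrupt one byte of the stored stream

recovered = rle_decode(bytes(stored))
loss = payload_loss(payload, recovered)
print(loss)
```

Here the corrupted byte is the value of a run of four, so four payload bytes differ after decoding; corrupting a run length instead would also shift every byte that follows.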

  28. Example: JP2-formatted image, corruption of 1 byte, “bad case”

  29. Example: JP2-formatted image, corruption of 1 byte, “good case”

  30. Example: JP2-formatted image, corruption of 1 byte, “good case” with visualized differences in pixel data

  31. Thank you very much! Volker Heydegger, University of Cologne. Archiving 2008, Bern, 23rd – 27th June 2008
