1 / 124

Assessing File Format Robustness

Assessing File Format Robustness. Manfred Thaller Universität zu* Köln June 23 rd , 2009 * University at not of Cologne. PART I – “What is in a format?” (Formats and Registries) EXERCISE I – Evaluate some

Download Presentation

Assessing File Format Robustness

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Assessing File Format Robustness Manfred Thaller Universität zu* Köln June 23rd, 2009 *University at not of Cologne

  2. PART I – “What is in a format?” (Formats and Registries) EXERCISE I – Evaluate some PART II – “What is not in a format?” (Formats in PLANETS – and beyond) EXERCISE II – A bit of modelling

  3. I – What is (in) a format?

  4. An image

  5. An image 6 rows 5 columns

  6. 5 rows 6 columns

  7. An image 1 == ochre 0 == red

  8. An image 1 == blue 0 == yellow

  9. An image Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1

  10. An image Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1 Uncompressed

  11. An image Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1 (Compressed)Run Length Encoded

  12. An image Store: SetSize: 5 by 6 SetBackgroundColor: Ochre SetForegroundColor: Red SetLetterHeight: 4 MoveTo: 3,5 DrawLetter: T Vector Format

  13. An image 6 rows 5 columns 1 == ochre 0 == red Uncompressed

  14. An image dimensions 1 == ochre 0 == red Uncompressed

  15. An image dimensions photogrammetric interpretation Uncompressed

  16. An image dimensions photogrammetric interpretation compression

  17. An image <basic information> <rendering information> <storage information>

  18. An image <basicinformation> (implicit / explicit) <renderinginformation> (implicit / explicit) <storageinformation> (implicit / explicit) … andthedata?

  19. An image Data either as data stream 1,1,1,1,1,1, 0,0,0,1,1,1, 0,1,1,1,1,0, 1,1,1,1,0,1, 1,1,1,1,1,1

  20. An image Data either as data stream or as processing instructions SetSize: 5 by 6 SetBackgroundColor: Ochre SetForegroundColor: Red SetLetterHeight: 4 MoveTo: 3,5 DrawLetter: T

  21. File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data> What to deploy?

  22. File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data> What to deploy?

  23. File format <basic information> Mandatory <rendering information> Useful <storage information> Historical <data> Mandatory

  24. File format A deterministic specification how the properties of a digital object can reversibly be converted into a linear bytestream (bitstream). *******

  25. File format: TIFF

  26. File format: PDF 1 0 obj << /Type /Page /Parent 281 0 R /Resources 2 0 R /Contents 3 0 R /StructParents 2 /MediaBox [ 0 0 612 792 ] /CropBox [ 0 0 612 792 ] /Rotate 0 >> endobj

  27. File format: PDF 2 0 obj << /ProcSet [ /PDF /Text ] /Font << /TT2 292 0 R /TT4 288 0 R >> /ExtGState << /GS1 300 0 R >> /ColorSpace << /Cs6 289 0 R >> >> endobj

  28. File format: PDF 3 0 obj << /Length 4605 /Filter /FlateDecode >> stream H‰„WÛŽÛÈ}×Wô#Œ4jR”¨`±Àø ™Í" ¶(²5j›"¹lräý‘|oêÖ-j —‹udTÙÂ…fPnˆ¿ìþ>Ó›Ež²ÝÕ˽âä”uª2i*<<v ú[Óžk9Q‰¼‡x»XTP{ ‹±/[i²½Ö)}ÔÏö&ªÙH;<Cµ … and about 4000 bytes more ŠøL"È÷ےƐ¬JYØÂm]j¥Ýqõ¥ÏººÕ™·²ôÒ·Ûº¤–÷.u-kP0 4“øTxM<é識9uôøˆòLi¦ØoTÖ m–;ǯ÷¤ÿlÕºvéU—˱¤Lm°gŸˆu1Åëu5l3¯’¢O %òËTîü7?ìNdh endstream endobj

  29. File format: XML (here SVG) <?xml version="1.0" encoding="UTF-16"?> <svg:svg width="800" height="1000" xmlns:svg="http://www.w3.org ... <svg:rect x="0" y="0" width="800" height="1000" fill="white" /> <svg:g transform="translate(-140,0)"> <svg:line x1="600" y1="20" x2="500" y2="20" stroke="black" … <svg:text x="600" y="28.8" font-size="6" fill="black" … </svg:g> <svg:g transform="translate(-140,0)"> <svg:text x="500" y="24.4"> <svg:tspan font-size="4" fill="black">Leiste</svg:tspan> </svg:text> </svg:g> <svg:defs> <svg:g id="halbeSaeuleLeiste0">

  30. File format: XML (here SVG)

  31. File format: XML (ETH: “column XML”) <?xml version="1.0" encoding="UTF-8"?> <Autor name="Vitruv"> <Ordnung name="Ionisch" THz="" THn="" MH="" TBz="" TBn="" … <Element name="Gebaelk" original="" THz="" THn="" MH="" … <Element name="Gesims" original="corona" THz="" THn="" MH="" … <Element name="Leiste" original="" THz="" THn="" MH="0.03" … <Element name="Kyma" original="sima" THz="" THn="" … <Element name="Leiste" original="" THz="" THn="" MH="0.017" … <Element name="Kyma_reversa" original="cymatium" THz="" … <Element name="Platte" original="corona" THz="" THn="" … <Element name="Leiste" original="" THz="" THn="" MH="0.017" … <Element name="Kyma_reversa" original="cymatium" THz="" … <hElement name="Band" typ="1" dx="0.048" r="0.019"/> <hElement name="Band" typ="1" dx="0.048" r="0.019"/> </Element> *******

  32. II – Why would we want to know?

  33. Bit rot An Image file before ….

  34. Bit rot ... and after one byte is changed.

  35. Bit rot Undetectable by software. ... and after one byte is changed.

  36. Bit rot Processing dictionary Payload

  37. Bit rot One byte is damaged, one byte cannot be displayed correctly.

  38. Bit rot One byte is damaged, ten bytes cannot be displayed correctly.

  39. Demo

  40. But 1 … Why should I care? Can I not just pay a technician to keep some system of checksums? Counter-but 1 … Do you rather buy a brand of car with a reputation of an excellent network of maintenance shops, or one with a reputation for needing little maintenance?

  41. But 2 … But is bit rot really that important? I have read, that files most of the time get either unreadable completely, or stay completely undamaged? Counter-but 2a … In disaster recovery: yes! With files on degrading storage systems / devices: no!

  42. But 2 … But is bit rot really that important? Counter-but 2b … Bit rot is, indeed, just one problem.! We do this is just to show, that there are differences between the technical fitness for preservation between formats. Others go beyond 15 minutes. *******

  43. IBM “Digital Library”, ca. 1997Kn A 269_1802.idx Kn A 269_1802.log TDI00001.jpg TDI00002.jpg TDI00005.jpg TDI00006.jpg TDI00007.jpg TDI00008.jpg TDI00010.jpg TDI00011.jpg TDI00013.jpg TDI00015.jpg TDI00017.jpg TDI00020.jpg TDI00023.jpg TDI00024.jpg TDI00028.jpg TDI00032.jpg TDI00036.jpg TDI00037.jpg TDI00039.jpg TDI00040.jpg TDI00042.jpg TDI00047.jpg TDI00048.jpg TDI00050.jpg TDI00053.jpg TDI00054.jpg TDI00056.jpg TDI00057.jpg TDI00060.jpg TDI00062.jpg TDI00063.jpg TDI00064.jpg

  44. IBM “Digital Library”, ca. 1997#*Shelfmark=Kn A 269/1802*Scanner-Op=Omniscan 6000*Count=C2*Timestamp=4.1.2000xdo=TDI00005.jpgend=end #*Shelfmark=Kn A 269/1802 *Scanner-Op=Omniscan 6000*Count=F3 *Timestamp=4.1.2000xdo=TDI00006.jpg end=end

  45. IBM “Digital Library” / savingKnA115_646+A1r.jpg KnA115_646+A2.jpg KnA115_646+A3.jpg KnA115_646+A4.jpg KnA115_646+B1.jpg KnA115_646+B2.jpg KnA115_646+B3.jpg KnA115_646+B4.jpg KnA115_646+C1.jpg KnA115_646+C2.jpg KnA115_646+C3.jpg KnA115_646+C4.jpg KnA115_646+D1.jpg KnA115_646+D2.jpg KnA115_646+D3.jpg KnA115_646+D4.jpg KnA115_646+E1.jpg KnA115_646+E2.jpg KnA115_646+E3.jpg KnA115_646+E4.jpg KnA115_646+F1.jpg KnA115_646+F2.jpg KnA115_646+G4v.jpgKnA115_646.xml

  46. IBM “Digital Library” / saving<body> <div><page>TB Lsp <imageseqno="KnA115_646+A1r.jpg" nativeno="A1r" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+A2.jpg" nativeno="A2" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+A3.jpg" nativeno="A3" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+A4.jpg" nativeno="A4" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+B1.jpg" nativeno="B1" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+B2.jpg" nativeno="B2" timestamp="14.02.1997"/></page> ...

  47. SNIA – Storage Networking Industry Associationhttp://www.snia.org/homeXAM – eXtensible Access Methodehttp://www.snia.org/forums/xam/

  48. Obsolescence Software able to read does not exist any more. Format specification lost. Implied algorithm lost. Required object lost.

  49. But 3 … But is there not a simple list for this type of problems, which I can consult easily? Counter-but 3 … No.

  50. III – Which format to choose?

More Related