1.26k likes | 1.45k Views
Assessing File Format Robustness. Manfred Thaller Universität zu* Köln June 23 rd , 2009 * University at not of Cologne. PART I – “What is in a format?” (Formats and Registries) EXERCISE I – Evaluate some
E N D
Assessing File Format Robustness Manfred Thaller Universität zu* Köln June 23rd, 2009 *University at not of Cologne
PART I – “What is in a format?” (Formats and Registries) EXERCISE I – Evaluate some PART II – “What is not in a format?” (Formats in PLANETS – and beyond) EXERCISE II – A bit of modelling
An image 6 rows 5 columns
5 rows 6 columns
An image 1 == ochre 0 == red
An image 1 == blue 0 == yellow
An image Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
An image Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1 Uncompressed
An image Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1 (Compressed)Run Length Encoded
An image Store: SetSize: 5 by 6 SetBackgroundColor: Ochre SetForegroundColor: Red SetLetterHeight: 4 MoveTo: 3,5 DrawLetter: T Vector Format
An image 6 rows 5 columns 1 == ochre 0 == red Uncompressed
An image dimensions 1 == ochre 0 == red Uncompressed
An image dimensions photogrammetric interpretation Uncompressed
An image dimensions photogrammetric interpretation compression
An image <basic information> <rendering information> <storage information>
An image <basicinformation> (implicit / explicit) <renderinginformation> (implicit / explicit) <storageinformation> (implicit / explicit) … andthedata?
An image Data either as data stream 1,1,1,1,1,1, 0,0,0,1,1,1, 0,1,1,1,1,0, 1,1,1,1,0,1, 1,1,1,1,1,1
An image Data either as data stream or as processing instructions SetSize: 5 by 6 SetBackgroundColor: Ochre SetForegroundColor: Red SetLetterHeight: 4 MoveTo: 3,5 DrawLetter: T
File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data> What to deploy?
File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data> What to deploy?
File format <basic information> Mandatory <rendering information> Useful <storage information> Historical <data> Mandatory
File format A deterministic specification how the properties of a digital object can reversibly be converted into a linear bytestream (bitstream). *******
File format: PDF 1 0 obj << /Type /Page /Parent 281 0 R /Resources 2 0 R /Contents 3 0 R /StructParents 2 /MediaBox [ 0 0 612 792 ] /CropBox [ 0 0 612 792 ] /Rotate 0 >> endobj
File format: PDF 2 0 obj << /ProcSet [ /PDF /Text ] /Font << /TT2 292 0 R /TT4 288 0 R >> /ExtGState << /GS1 300 0 R >> /ColorSpace << /Cs6 289 0 R >> >> endobj
File format: PDF 3 0 obj << /Length 4605 /Filter /FlateDecode >> stream H‰„WÛŽÛÈ}×Wô#Œ4jR”¨`±Àø ™Í" ¶(²5j›"¹lräý‘|oêÖ-j —‹udTÙÂ…fPnˆ¿ìþ>Ó›Ež²ÝÕ˽âä”uª2i*<<v ú[Óžk9Q‰¼‡x»XTP{ ‹±/[i²½Ö)}ÔÏö&ªÙH;<Cµ … and about 4000 bytes more ŠøL"È÷ےƬJYØÂm]j¥Ýqõ¥ÏººÕ™·²ôÒ·Ûº¤–÷.u-kP0 4“øTxM<é識9uôøˆòLi¦ØoTÖ m–;ǯ÷¤ÿlÕºvéU—˱¤Lm°gŸˆu1Åëu5l3¯’¢O %òËTîü7?ìNdh endstream endobj
File format: XML (here SVG) <?xml version="1.0" encoding="UTF-16"?> <svg:svg width="800" height="1000" xmlns:svg="http://www.w3.org ... <svg:rect x="0" y="0" width="800" height="1000" fill="white" /> <svg:g transform="translate(-140,0)"> <svg:line x1="600" y1="20" x2="500" y2="20" stroke="black" … <svg:text x="600" y="28.8" font-size="6" fill="black" … </svg:g> <svg:g transform="translate(-140,0)"> <svg:text x="500" y="24.4"> <svg:tspan font-size="4" fill="black">Leiste</svg:tspan> </svg:text> </svg:g> <svg:defs> <svg:g id="halbeSaeuleLeiste0">
File format: XML (ETH: “column XML”) <?xml version="1.0" encoding="UTF-8"?> <Autor name="Vitruv"> <Ordnung name="Ionisch" THz="" THn="" MH="" TBz="" TBn="" … <Element name="Gebaelk" original="" THz="" THn="" MH="" … <Element name="Gesims" original="corona" THz="" THn="" MH="" … <Element name="Leiste" original="" THz="" THn="" MH="0.03" … <Element name="Kyma" original="sima" THz="" THn="" … <Element name="Leiste" original="" THz="" THn="" MH="0.017" … <Element name="Kyma_reversa" original="cymatium" THz="" … <Element name="Platte" original="corona" THz="" THn="" … <Element name="Leiste" original="" THz="" THn="" MH="0.017" … <Element name="Kyma_reversa" original="cymatium" THz="" … <hElement name="Band" typ="1" dx="0.048" r="0.019"/> <hElement name="Band" typ="1" dx="0.048" r="0.019"/> </Element> *******
Bit rot An Image file before ….
Bit rot ... and after one byte is changed.
Bit rot Undetectable by software. ... and after one byte is changed.
Bit rot Processing dictionary Payload
Bit rot One byte is damaged, one byte cannot be displayed correctly.
Bit rot One byte is damaged, ten bytes cannot be displayed correctly.
But 1 … Why should I care? Can I not just pay a technician to keep some system of checksums? Counter-but 1 … Do you rather buy a brand of car with a reputation of an excellent network of maintenance shops, or one with a reputation for needing little maintenance?
But 2 … But is bit rot really that important? I have read, that files most of the time get either unreadable completely, or stay completely undamaged? Counter-but 2a … In disaster recovery: yes! With files on degrading storage systems / devices: no!
But 2 … But is bit rot really that important? Counter-but 2b … Bit rot is, indeed, just one problem.! We do this is just to show, that there are differences between the technical fitness for preservation between formats. Others go beyond 15 minutes. *******
IBM “Digital Library”, ca. 1997Kn A 269_1802.idx Kn A 269_1802.log TDI00001.jpg TDI00002.jpg TDI00005.jpg TDI00006.jpg TDI00007.jpg TDI00008.jpg TDI00010.jpg TDI00011.jpg TDI00013.jpg TDI00015.jpg TDI00017.jpg TDI00020.jpg TDI00023.jpg TDI00024.jpg TDI00028.jpg TDI00032.jpg TDI00036.jpg TDI00037.jpg TDI00039.jpg TDI00040.jpg TDI00042.jpg TDI00047.jpg TDI00048.jpg TDI00050.jpg TDI00053.jpg TDI00054.jpg TDI00056.jpg TDI00057.jpg TDI00060.jpg TDI00062.jpg TDI00063.jpg TDI00064.jpg
IBM “Digital Library”, ca. 1997#*Shelfmark=Kn A 269/1802*Scanner-Op=Omniscan 6000*Count=C2*Timestamp=4.1.2000xdo=TDI00005.jpgend=end #*Shelfmark=Kn A 269/1802 *Scanner-Op=Omniscan 6000*Count=F3 *Timestamp=4.1.2000xdo=TDI00006.jpg end=end
IBM “Digital Library” / savingKnA115_646+A1r.jpg KnA115_646+A2.jpg KnA115_646+A3.jpg KnA115_646+A4.jpg KnA115_646+B1.jpg KnA115_646+B2.jpg KnA115_646+B3.jpg KnA115_646+B4.jpg KnA115_646+C1.jpg KnA115_646+C2.jpg KnA115_646+C3.jpg KnA115_646+C4.jpg KnA115_646+D1.jpg KnA115_646+D2.jpg KnA115_646+D3.jpg KnA115_646+D4.jpg KnA115_646+E1.jpg KnA115_646+E2.jpg KnA115_646+E3.jpg KnA115_646+E4.jpg KnA115_646+F1.jpg KnA115_646+F2.jpg KnA115_646+G4v.jpgKnA115_646.xml
IBM “Digital Library” / saving<body> <div><page>TB Lsp <imageseqno="KnA115_646+A1r.jpg" nativeno="A1r" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+A2.jpg" nativeno="A2" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+A3.jpg" nativeno="A3" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+A4.jpg" nativeno="A4" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+B1.jpg" nativeno="B1" timestamp="14.02.1997"/></page> <page><imageseqno="KnA115_646+B2.jpg" nativeno="B2" timestamp="14.02.1997"/></page> ...
SNIA – Storage Networking Industry Associationhttp://www.snia.org/homeXAM – eXtensible Access Methodehttp://www.snia.org/forums/xam/
Obsolescence Software able to read does not exist any more. Format specification lost. Implied algorithm lost. Required object lost.
But 3 … But is there not a simple list for this type of problems, which I can consult easily? Counter-but 3 … No.