This presentation from an EMANI project meeting discusses the file formats used to archive different types of data, including raw binary data, texts, images, and multimedia. It focuses on mathematics content, which consists mostly of text, formulas, diagrams, and some images, with an emphasis on preserving the text and structure of text files. It concludes that mark-up formats like XML or TeX are better suited for archiving than structured formats like MS Word or PDF.
File Formats in the Context of Archiving
Dr. Thomas Fischer
EMANI – Project Meeting, February 14th - 16th, 2002, Springer-Verlag Heidelberg
Göttingen State and University Library (SUB)
emani@mail.sub.uni-goettingen.de
Archives Store Different Kinds of Data ...
• archives have to deal with different kinds of data
  • raw binary data
  • texts
  • images
  • multimedia
  • ...
EMANI Project Meeting SUB Göttingen
... in Different File Formats
• binary data: stream of bytes
• text: ASCII, other encodings of simple text, formatted text
• images: vector or pixel oriented graphics
• multimedia: a plethora of different file types for different purposes
Focus on ...
• mathematics consists mostly of text, formulas, diagrams, and some images
• further content might be (compiled) programs, interactive simulations etc.
• for learned journals the content is overwhelmingly text with few images
Text!
Text files usually contain two kinds of information:
• textual data providing the contents (words) of the file
• structural data containing the information for the presentation of the text
Two Kinds of Problems
If problems occur with the media or the program that reads the file, some information may be lost:
• loss of structure leads to loss of formatting
• loss of text leads to loss of meaning
The latter is usually considered more serious.
Two Types of Text File Formats
• structured format (e.g. Microsoft Word, PDF): file consists of text (more or less uninterrupted) and tables (usually at the beginning or the end of the file) that provide additional information, formatting etc.
• mark-up format (e.g. HTML, XML, RTF, TeX): file consists of a stream of text with formatting information interspersed
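The two layouts can be sketched with a toy model. This is only an illustration: the `TEXT` string, the `STYLE_TABLE` entries, and the `to_markup` helper are hypothetical and do not reflect the real internal layout of Word or PDF files. The structured form keeps the text uninterrupted and the formatting in a separate table of character ranges; the mark-up form intersperses the same formatting into the text stream.

```python
from typing import List, Tuple

# Toy "structured" document: uninterrupted text plus a separate table of
# (start, end, tag) entries. Hypothetical layout, not real Word or PDF.
TEXT = "A bold claim and an italic remark"
STYLE_TABLE: List[Tuple[int, int, str]] = [(2, 6, "b"), (20, 26, "i")]


def to_markup(text: str, table: List[Tuple[int, int, str]]) -> str:
    """Intersperse the formatting from the table into the text stream.

    Assumes the ranges in the table do not overlap.
    """
    pieces = []
    position = 0
    for start, end, tag in sorted(table):
        pieces.append(text[position:start])                 # unformatted run
        pieces.append(f"<{tag}>{text[start:end]}</{tag}>")  # formatted run
        position = end
    pieces.append(text[position:])                          # trailing text
    return "".join(pieces)


markup = to_markup(TEXT, STYLE_TABLE)
print(markup)  # A <b>bold</b> claim and an <i>italic</i> remark

# If the table is lost, the words survive but the formatting does not.
print(to_markup(TEXT, []))  # A bold claim and an italic remark
```

In this model the structured form only shows its formatting if a reader knows how to interpret the table, while in the mark-up form the formatting sits next to the words it applies to.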
For Archiving Purposes
• the file format chosen should be readable without the use of specialized programs
• the file format should be robust against damage of media and loss of data
Types of Text Format
• mark-up languages like XML or TeX store text and formatting together. The text can be reconstructed using any text editor, and the formatting can probably be regained.
• structured formats like MS Word or PDF need their dedicated program for proper representation. Whether the contained text can be extracted depends on the particular situation, which is usually not visible to the user.
Consequence: mark-up formats are better suited for archiving.
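A minimal sketch of the robustness argument, under stated assumptions: the sample `MARKUP` document and the helpers `recover_text` and `truncate` are hypothetical, and a zlib-compressed byte stream is used only as a rough stand-in for an opaque binary container. Real Word or PDF files behave differently in detail. The sketch cuts off the end of each file to simulate media damage and checks what can still be read.

```python
import re
import zlib

# Hypothetical sample document in an XML-like mark-up format.
MARKUP = (
    "<article><title>On Prime Numbers</title>"
    "<p>Every integer greater than one has a <em>prime</em> factor.</p>"
    "</article>"
)


def recover_text(damaged: str) -> str:
    """Drop anything that looks like a tag, including a tag that was
    cut off at the end of the file, and return the remaining words."""
    without_tags = re.sub(r"<[^>]*>?", " ", damaged)
    return " ".join(without_tags.split())


def truncate(data, keep_fraction=0.6):
    """Simulate media damage by keeping only the first part of the data."""
    return data[: int(len(data) * keep_fraction)]


# Mark-up file: the surviving part is still readable as plain text.
recovered = recover_text(truncate(MARKUP))
print(recovered)

# Stand-in for a binary container (an assumption of this sketch): the same
# content compressed with zlib. A truncated stream cannot be decompressed.
packed = zlib.compress(MARKUP.encode("utf-8"))
try:
    zlib.decompress(truncate(packed))
    binary_recovered = True
except zlib.error:
    binary_recovered = False
print(binary_recovered)
```

The mark-up file loses its closing tags and the end of the text, but the words before the damage remain readable without any special program, which is the property the slide asks of an archival format.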