Digital preservation. Principles and potential role of XML. Giovanni Michetti, Urbino, 9th October 2002
Documents: form vs. content? Traditional environment: Form / Content
Documents: form vs. content? Digital environment: Content / Form
Documents: structure • Structure is unavoidably inside documents • As complexity grows, structure grows • Structure is (part of the) message • We deal with structure not only in the digital environment
Documents: structure and digital environment • Moving information onto new media • Need for functionalities to manage the explosive growth of information • Need to make structure explicit
Markup • The proper description of an information resource requires: • identifying its logical components • making its structure explicit • Hence: markup
Markup • Markup: any means of making an interpretation of a document explicit
From a record ...
University of Urbino
Faculty of Arts
Rome, 1st August 2002
Dr. Giovanni Michetti
Protocol n. 1234/AB
Subject: Teaching appointment
We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.
The Dean
Prof. Giorgio Cerboni Baiardi
Faculty of Arts Piano S. Lucia 6 - 61029 Urbino Tel: 0722.320125 Fax: 0722.322553 Email: preslet@lettere.uniurb.it
… to a marked record ...
<?xml version="1.0"?>
<letter>
  <sender>University of Urbino Faculty of Arts</sender>
  <date>Rome, 1st August 2002</date>
  <addressee>Dr. Giovanni Michetti</addressee>
  <protocolnumb>Protocol n. 1234/AB</protocolnumb>
  <subject>Subject: Teaching appointment</subject>
  <text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.</text>
  <author>The Dean Prof. Giorgio Cerboni Baiardi</author>
  <heading>Faculty of Arts Piano S. Lucia 6 - 61029 Urbino Tel: 0722.320125 Fax: 0722.322553 Email: preslet@lettere.uniurb.it</heading>
</letter>
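Once the record is marked up, software can address its logical components by name. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the letter text is abridged, and the `<XML>` wrapper is left out because element names beginning with "xml" are reserved. The helper name `letter_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the marked record from the slide (body text shortened).
LETTER = """<letter>
  <sender>University of Urbino Faculty of Arts</sender>
  <date>Rome, 1st August 2002</date>
  <addressee>Dr. Giovanni Michetti</addressee>
  <protocolnumb>Protocol n. 1234/AB</protocolnumb>
  <subject>Subject: Teaching appointment</subject>
  <text>We inform you that you have been offered the teaching of Analysis and treatment of digital records.</text>
  <author>The Dean Prof. Giorgio Cerboni Baiardi</author>
  <heading>Faculty of Arts Piano S. Lucia 6 - 61029 Urbino</heading>
</letter>"""


def letter_fields(xml_text):
    """Return the children of <letter> as a tag -> text dictionary, in document order."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


fields = letter_fields(LETTER)
print(fields["addressee"])
print(fields["protocolnumb"])
```

The plain-text record would need guesswork (line positions, keywords) to answer the same questions; the markup makes the structure explicit.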
… to a DTD ...
<!ELEMENT letter (sender, date, addressee, protocolnumb, subject, text, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
… to a more precise DTD
<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
Let’s refine the markup ...
<?xml version="1.0"?>
<letter>
  <sender><body>University of Urbino</body> <bureau>Faculty of Arts</bureau></sender>
  <date><place>Rome,</place><time>1st August 2002</time></date>
  <addressee>Dr. Giovanni Michetti</addressee>
  <protocolnumb>Protocol n. 1234/AB</protocolnumb>
  <subject>Subject: Teaching appointment</subject>
  <text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.</text>
  <author><role>The Dean</role> <name>Prof. Giorgio Cerboni Baiardi</name></author>
  <heading>Faculty of Arts Piano S. Lucia 6 - 61029 Urbino Tel: 0722.320125 Fax: 0722.322553 Email: preslet@lettere.uniurb.it</heading>
</letter>
... keeping on refining ...
<?xml version="1.0"?>
<letter>
  <sender><body>University of Urbino</body> <bureau>Faculty of Arts</bureau></sender>
  <date><place>Rome,</place><time>1st August 2002</time></date>
  <addressee>Dr. Giovanni Michetti</addressee>
  <!-- protocolnumb, subject and text unchanged -->
  <author>
    <role>The Dean</role>
    <name><title>Prof.</title><propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name>
  </author>
  <heading>
    <bureau>Faculty of Arts</bureau>
    <address>Piano S. Lucia 6 - 61029 Urbino</address>
    <tel>Tel: 0722.320125</tel><fax>Fax: 0722.322553</fax><email>Email: preslet@lettere.uniurb.it</email>
  </heading>
</letter>
… and let’s refine the DTD
<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading)>
<!ELEMENT sender (body, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title, propername, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau, address, tel, fax, email)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
The final DTD
<!ELEMENT letter (sender, date, addressee+, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading?)>
<!ELEMENT sender (body?, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role?, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title?, propername?, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau?, address?, tel?, fax?, email?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
XML declaration • Every XML document should start with an XML declaration, like <?xml version="1.0"?> • The declaration must be right at the start of the document: nothing may precede it (comments, processing instructions, white space, ...)
XML declaration • A parser uses the first 5 characters, <?xml, to detect which character encoding the document uses • In XML 1.0 the version attribute must have the value 1.0
XML declaration • It is possible to specify the character encoding using the optional encoding attribute. • Example: <?xml version="1.0" encoding="ISO-8859-1"?>
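The rules for the declaration can be tried out with a parser. This is a small sketch, assuming Python's standard `xml.etree.ElementTree` (built on the non-validating expat parser); the `<place>` element and its content are made up for illustration, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_text(data):
    """Parse XML bytes and return the text of the root element."""
    return ET.fromstring(data).text


def is_accepted(data):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Bytes encoded as ISO-8859-1: the encoding attribute tells the parser how to decode them.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><place>Forl\u00ec</place>'.encode("iso-8859-1")

print(parse_text(latin1_doc))
print(is_accepted(b'<?xml version="1.0"?><a/>'))    # declaration at the very start
print(is_accepted(b' <?xml version="1.0"?><a/>'))   # white space before the declaration
print(is_accepted(b'<?XML version="1.0"?><a/>'))    # upper-case XML is not a declaration
```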
Elements • Elements are the most important components of XML documents: they are the logical components through which you can identify the structure of documents. Example: <author>Giovanni Michetti</author> • Here <author> is the start-tag, </author> is the end-tag, author is the tag name, < and > are the delimiters, Giovanni Michetti is the content, and the whole is the element
Elements • Each start-tag must have a corresponding end-tag (starting with a forward slash) • Empty elements (like <img>, <br>, <hr> in HTML) are represented by a single tag with a forward slash before the closing bracket. Example: <image/>
Attributes • Attributes are expressed as name-value pairs associated with elements and appearing only in start-tags • Names are separated from their values by an equals sign (=). Values are wrapped in single or double quotes • Attributes must be associated with elements • The order of the attributes inside a start-tag is not significant
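A short sketch of these rules, assuming Python's standard `xml.etree.ElementTree`; the attributes `id` and `lang` are hypothetical additions to the `<author>` element used earlier.

```python
import xml.etree.ElementTree as ET

# Same element written twice: different quote styles, different attribute order.
a = ET.fromstring('<author id="a1" lang="it">Giovanni Michetti</author>')
b = ET.fromstring("<author lang='it' id='a1'>Giovanni Michetti</author>")

print(a.attrib)               # attributes as a name -> value mapping
print(a.get("lang"))          # look up a single attribute by name
print(a.attrib == b.attrib)   # order and quote style make no difference
```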
XML tree • An XML document is a hierarchical tree. It starts from a root (the root or document element) and develops from it into child elements, which can be siblings of one another
XML tree • Each element has one and only one parent (except the root) • Each element other than the root is completely nested inside another element
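The tree can be walked programmatically. The sketch below assumes Python's standard `xml.etree.ElementTree` and a shortened version of the refined letter; `parent_of` and `path_to_root` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

DOC = """<letter>
  <sender><body>University of Urbino</body><bureau>Faculty of Arts</bureau></sender>
  <author><role>The Dean</role><name>Prof. Giorgio Cerboni Baiardi</name></author>
</letter>"""

root = ET.fromstring(DOC)

# ElementTree elements do not store a link to their parent, so build the mapping.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_to_root(element):
    """Return the tag names from the given element up to the root."""
    tags = [element.tag]
    while element in parent_of:
        element = parent_of[element]
        tags.append(element.tag)
    return tags


bureau = root.find("sender/bureau")
print(path_to_root(bureau))
print([sibling.tag for sibling in parent_of[bureau]])  # bureau and its sibling
print(root in parent_of)                               # the root has no parent
```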
Entities • Example: <author>Giovanni Michetti</author> • The string Giovanni Michetti (the element content) is also called character data. Character data can appear anywhere inside elements, or as values of attributes
Entities • There are special characters that are not allowed in text blocks: what if we want to use the less-than symbol < in a mathematical formula (a < b)? • Stratagem 1: CDATA sections • Stratagem 2: entity references
Entities 1. CDATA sections: they start with the CDATA start marker <![CDATA[ and end with the CDATA end marker ]]>
Entities 2. Entity references: Example: &lt; • The parser recognizes the entity reference &lt; and substitutes it with the proper value <
Entities • A parser is a piece of software able to read and interpret an XML document. A parser reads the XML document as plain text • Some parsers (validating parsers) are able to check the conformance of an XML document with a DTD
Entities • Standard (i.e. predefined) entities: &lt; (<), &gt; (>), &amp; (&), &apos; ('), &quot; (") • Any XML parser recognizes these entities and substitutes them with the proper values
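Both stratagems give the parser the same character data. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils.escape`; the `<formula>` element name is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Stratagem 1: CDATA section. Stratagem 2: entity reference.
with_cdata = ET.fromstring("<formula><![CDATA[a < b]]></formula>")
with_reference = ET.fromstring("<formula>a &lt; b</formula>")

print(with_cdata.text)
print(with_reference.text)

# All five predefined entities are resolved by the parser.
predefined = ET.fromstring("<t>&lt;&gt;&amp;&apos;&quot;</t>").text
print(predefined)

# Going the other way: escape() replaces &, < and > before text is written into XML.
print(escape("a < b & c"))
```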
Well-formed documents • Any XML document must be well formed: it has to comply with some constraints, some of which are: • Each start-tag has a corresponding end-tag • Elements can’t overlap • There must be one and only one root element • Attribute values must be quoted • An element can’t have two attributes with the same name
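Each of these constraints can be checked by handing a document to a parser. This is a sketch only, assuming Python's standard `xml.etree.ElementTree`; the sample documents and the attribute `n` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


CASES = {
    "well formed": "<letter><date>1st August 2002</date></letter>",
    "missing end-tag": "<letter><date>1st August 2002</letter>",
    "overlapping elements": "<letter><b><i>text</b></i></letter>",
    "two root elements": "<letter/><letter/>",
    "unquoted attribute value": "<letter n=1234/>",
    "duplicate attribute": '<letter n="1" n="2"/>',
}

results = {label: is_well_formed(text) for label, text in CASES.items()}
for label, ok in results.items():
    print(label, ok)
```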
Document Type Definition (DTD) • Once we have created a set of tags and attributes, we need to share it with other users so that everyone adopts the same syntax • We need a Document Type Definition (DTD)
Document Type Definition (DTD) • A DTD defines what markup can be used in a document that is supposed to conform to a specific structure, whose components are identified by tags
Document Type Definition (DTD) • For example, a DTD defines what elements a document can contain, their occurrences, their order, and so on • A DTD can set out which attributes an element can take and whether they are required. It is also possible to define a set of permitted values for an attribute, a default value, and so on
Internal and external DTD • A DTD can be an external file or it can be included as part of the XML document. If it is an external file, the XML document must contain an explicit reference inside the Document Type Declaration: <!DOCTYPE MyXMLDoc SYSTEM "file.dtd">
Internal and external DTD • A DTD can also be written inside the document type declaration. In this case we have an internal DTD, like: <!DOCTYPE MyXMLDoc [ <!ELEMENT MyXMLDoc (#PCDATA)> ]> • In this case, all the constraints on the structure of the document are provided as declarations inside the square brackets
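A parser reports what the document type declaration contains. The sketch below assumes Python's standard `xml.parsers.expat` module (a non-validating parser, which by default does not fetch the external file); `read_doctype` is a hypothetical helper.

```python
from xml.parsers import expat


def read_doctype(xml_text):
    """Collect the doctype name, system identifier and declared element names."""
    info = {"name": None, "system_id": None, "internal": False, "declared": []}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["name"] = name
        info["system_id"] = system_id
        info["internal"] = bool(has_internal_subset)

    def on_element_decl(name, model):
        info["declared"].append(name)

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return info


internal = read_doctype(
    "<!DOCTYPE MyXMLDoc [ <!ELEMENT MyXMLDoc (#PCDATA)> ]><MyXMLDoc>text</MyXMLDoc>"
)
external = read_doctype('<!DOCTYPE MyXMLDoc SYSTEM "file.dtd"><MyXMLDoc>text</MyXMLDoc>')

print(internal)
print(external)
```

In both cases the name after `<!DOCTYPE` is the name of the root element.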
Element declarations • A DTD is a set of declarations, the most important of which is the element declaration. Any DTD must have at least one element declaration (the one for the root element) • The syntax for a declaration is: <!ELEMENT elementname (contentmodel)>
Element declarations • Example:
<!ELEMENT anthology (poem+)>
<!ELEMENT poem (title?, (stanza+|line+))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>
Cardinality suffixes • Cardinality suffixes are symbols used to specify how many times an element can occur at a certain point of the structure. Symbols used are: • ? zero or one (0-1) • + one or more (1-n) • * zero or more (0-n) • (no suffix) exactly one (1)
Connectors • Connectors are symbols used to specify order and relationships between components of a model • Symbols used are: • , (comma): the components must appear in the given order • | (vertical line): exactly one of the components must appear
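Suffixes and connectors behave much like a regular expression over the names of the child elements. The following is only an illustrative sketch in Python, not a validating parser: it handles element-only content models (no #PCDATA, no attributes), and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(content_model):
    """Translate an element-only DTD content model into a regular expression.

    Each child is represented as its tag name followed by one space, so
    ',' becomes concatenation while '|', '?', '+', '*' and parentheses
    keep the meaning they already have in regular expressions.
    """
    pattern = re.sub(r"\s+", "", content_model)
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        pattern,
    )
    return re.compile(pattern.replace(",", ""))


def children_match(element, content_model):
    """Check the sequence of direct children of an element against a model."""
    children = "".join(child.tag + " " for child in element)
    return model_to_regex(content_model).fullmatch(children) is not None


POEM_MODEL = "(title?, (stanza+ | line+))"

with_title = ET.fromstring("<poem><title>t</title><line>a</line><line>b</line></poem>")
mixed = ET.fromstring("<poem><stanza/><line>a</line></poem>")
wrong_order = ET.fromstring("<poem><line>a</line><title>t</title></poem>")
empty = ET.fromstring("<poem/>")

print(children_match(with_title, POEM_MODEL))
print(children_match(mixed, POEM_MODEL))
print(children_match(wrong_order, POEM_MODEL))
print(children_match(empty, POEM_MODEL))
```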
Attribute declarations • An attribute declaration defines the attributes associated with a given element • The syntax for a declaration is: <!ATTLIST element_name attribute_definition*> where an attribute definition is like: attribute_name attribute_type default_declaration
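As an illustration of the three parts of an attribute definition, the sketch below declares two hypothetical attributes (`lang` with a default value, `status` with an enumerated type and no default) in an internal DTD. It assumes Python's standard `xml.etree.ElementTree`, whose underlying expat parser does not validate but does supply default values declared in the internal subset.

```python
import xml.etree.ElementTree as ET

# attribute_name  attribute_type   default_declaration
DOC = """<!DOCTYPE letter [
  <!ELEMENT letter (#PCDATA)>
  <!ATTLIST letter
            lang    CDATA          "it"
            status  (draft|final)  #IMPLIED>
]>
<letter>Teaching appointment</letter>"""

letter = ET.fromstring(DOC)

print(letter.get("lang"))     # default taken from the ATTLIST declaration
print(letter.get("status"))   # optional attribute without default: absent
```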
Valid documents • Well-formed documents: XML documents conforming to the rules laid down in the XML 1.0 specification • Valid documents: well-formed documents that also conform to the rules laid down in a DTD
Stylesheets • So far the structure. But how can we render documents in the proper way? Stylesheets
Stylesheets • Since content is separated from style, we no longer need to rewrite the whole document each time we want to change the layout: we simply need to change the “instructions” that control rendering. In other words, we can modify representation without modifying content • XSL (eXtensible Stylesheet Language) is a style language based upon DSSSL (Document Style Semantics and Specification Language)
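The idea of changing the rendering without touching the content can be sketched in a few lines of Python. This is a toy stand-in, not XSL: each "stylesheet" here is just a hypothetical mapping from element names to formatting templates, applied to an abridged letter.

```python
import xml.etree.ElementTree as ET

LETTER = ET.fromstring(
    "<letter><subject>Teaching appointment</subject>"
    "<author>Prof. Giorgio Cerboni Baiardi</author></letter>"
)

# Two interchangeable sets of rendering "instructions" for the same content.
PLAIN_STYLE = {"subject": "Subject: {}", "author": "Signed: {}"}
HTML_STYLE = {"subject": "<h1>{}</h1>", "author": "<p><em>{}</em></p>"}


def render(document, style):
    """Render each child element with the template the style gives for its tag."""
    return "\n".join(
        style[child.tag].format(child.text)
        for child in document
        if child.tag in style
    )


print(render(LETTER, PLAIN_STYLE))
print(render(LETTER, HTML_STYLE))
```

Switching from one style to the other leaves the XML document itself untouched.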
So far the document … • … but a document is (generally) part of a file, which is in turn part of a series or a more complex archival collection • Archival bond
Archives: a complex system of relationships • Archive • Series • File • Document
Preserving, of course; but what? • Original data? • Hardware? • Context allowing data to be interpreted?