
Unit no. 4 Mark-up

Unit no. 4 Mark-up. Adolf Knoll National Library of the Czech Republic adolf.knoll@nkp.cz. Learning objectives. After the completion of this unit the learner will be able to: Understand what to do with the digital output for further use


Presentation Transcript


  1. Unit no. 4 Mark-up Adolf Knoll National Library of the Czech Republic adolf.knoll@nkp.cz

  2. Learning objectives After the completion of this unit the learner will be able to: • Understand what to do with the digital output for further use • Understand the basics of the mark-up languages, especially XML • Have a basic orientation in their application to be able to make correct decisions for building a digitization project

  3. Production of a digital document (diagram; labels: Original document, Digitization, Data, Description, Metadata, Digital document)

  4. What do we produce? • Data: the direct product of digitization (digital images, full text, video & audio files), usually a set of files that represent the original document • Metadata: added value through textual information; they express identification with the original, structure and links to data files, technical information about data, accessibility, administrative matters, etc.

  5. Mark-up Created because of a need to store additional (hidden) information in text in order to: • better format it when displayed and/or printed = prescriptive mark-up • classify parts of it as objects relevant to various rules of description such as cataloguing rules, rules of providing technical parameters, various good practices, rules of associating them with their visual representation, etc. = descriptive mark-up

  6. Mark-up • For example, in MS Word the end of a paragraph is marked with a ¶ sign: Paragraph¶ • In HTML code the paragraph is marked with <p>paragraph</p> • In HTML, bold text and a line break are marked as follows: This is an HTML <b>document</b>, which consists of<br>elements. • All this is procedural (prescriptive) mark-up. Mind the use of <> brackets: a marked-up element starts with <start> and ends with </start>. The line break <br> is an exception, because it is an empty element and has no end tag.
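
The tag pairs described on this slide can be made visible with a few lines of Python. This is only a minimal sketch using the standard-library `html.parser` module; the class name `TagLogger` and the variable names are illustrative assumptions, not part of the original presentation.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records every start tag, end tag and text run in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


# The sentence from the slide, wrapped in a paragraph element.
fragment = "<p>This is an HTML <b>document</b>, which consists of<br>elements.</p>"

logger = TagLogger()
logger.feed(fragment)
logger.close()

# Keep only the mark-up, dropping the text between the tags.
tags_only = [event for event in logger.events if event[0] != "text"]
print(tags_only)
```

The list shows that `p` and `b` each have a start and an end event, while `br` appears only once, as a start tag.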

  7. Objects • The markup marks: • OBJECTS • Which objects? • THOSE WHICH WE DEFINE AS OBJECTS • On which basis do we define them? • On the basis of CERTAIN RULES • How are the rules established? • On the basis of an agreement; they are usually a written (even published) document specifying the objects that should be followed and described. Examples: AACR2 Cataloguing Rules in libraries, ISBD rules, CDWA or AMICO description rules for museum objects, Data Dictionary for Still Digital Images, etc. • The description rules do not define how the objects are marked up; this is done via a formal mark-up language • The most sophisticated mark-up approach is SGML

  8. General markup language SGML • Standard Generalized Markup Language (ISO standard from 1986) is the base for other derived approaches that may be called mark-up languages of the 2nd generation: • HTML (prescriptive) • TEI • … • XML (descriptive) The markup language marks the object without assigning any kind of behaviour to it. Its behaviour is prescribed by an independent rule.

  9. How does it work? • the main construction unit of an SGML-based mark-up approach is called ELEMENT • each element must be defined by an external content descriptive rule; e.g. a cataloguing rule (AACR2 or another one) defines the element Title; it may also define sub-elements such as Main Title, Parallel Title, or Sub-Title, etc. • as a result, there may be hierarchical relationships between elements (parents with children)
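
The parent and child relationship between elements can be sketched with Python's standard `xml.etree.ElementTree` module. The element names follow the Title example on this slide; the text values are hypothetical placeholders, not data from the presentation.

```python
import xml.etree.ElementTree as ET

# Parent element.
title = ET.Element("Title")

# Child elements, attached to the parent in document order.
main_title = ET.SubElement(title, "MainTitle")
main_title.text = "Example main title"      # placeholder value
sub_title = ET.SubElement(title, "SubTitle")
sub_title.text = "Example subtitle"          # placeholder value

# The children of a parent can be listed by iterating over it.
children = [child.tag for child in title]

# Serialising the tree yields nested start and end tags.
serialized = ET.tostring(title, encoding="unicode")
print(children)
print(serialized)
```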

  10. How to define the metadata standard? • We need formal rules to express the content descriptive standards • In the SGML environment, this is done in the Document Type Definition (DTD) • A DTD can, among other things, do the following: • List all the elements and set up their properties (mandatory, non-mandatory, repeatable etc.) • Define relations between elements • Refine their attributes, e.g. through a list of permitted values • Point from them to external entities, i.e. other definitions or binary data, e.g. digital images

  11. If we take as an example that we need a description element author, then: • The content definition of the element author is given by description rules, e.g. AACR2 • The formal definition of the element author is given by rules for formal definition, e.g. a DTD • The formal rule for display of the element author is given by rules of transformation for display, e.g. XSLT for XML • This is the way we work in XML

  12. XML: eXtensible Markup Language • DTD (*.dtd) • XML file (*.xml): it contains the reference to the DTD that controls it, and it can contain the reference to the transformation rule that formats it for display, e.g. an XSLT file (*.xslt) • A DTD for XML is still written in SGML syntax; therefore, the W3C Schema has been introduced to replace it. A document can thus be controlled either by a DTD (*.dtd) or by a Schema (*.xsd). Example of the beginning of an XML file: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE Monograph SYSTEM "http://digit.nkp.cz/Monographs/DTD/1.0/Monograph.dtd"> <?xml-stylesheet type="text/xsl" href="http://digit.nkp.cz/Monographs/DTD/1.0/mon.xslt"?>
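
To see how software reads these two references, here is a small sketch with Python's standard `xml.dom.minidom`. The prolog is copied from the slide; the empty root element `<Monograph/>` is an assumption added only so that the text is a complete document. The parser does not download the DTD or the stylesheet; it only records where they are.

```python
from xml.dom import minidom

# Prolog from the slide plus a placeholder root element (assumption).
document = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Monograph SYSTEM "http://digit.nkp.cz/Monographs/DTD/1.0/Monograph.dtd">
<?xml-stylesheet type="text/xsl" href="http://digit.nkp.cz/Monographs/DTD/1.0/mon.xslt"?>
<Monograph/>
"""

dom = minidom.parseString(document)

# The DOCTYPE declaration names the root element and the DTD location.
doctype = dom.doctype
print(doctype.name, doctype.systemId)

# The stylesheet reference is a processing instruction at document level.
stylesheets = [
    node.data
    for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    and node.target == "xml-stylesheet"
]
print(stylesheets)
```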

  13. DTD = Document Type Definition • The basic construction piece is ELEMENT • ELEMENT can have a content or it can be EMPTY • ELEMENTS can consist of other elements

  14. Here the element Title consists of a group of three elements (MainTitle, SubTitle, and ParallelTitle); from them only the MainTitle is mandatory, SubTitle and ParallelTitle are not, while ParallelTitle can be repeatable. In a DTD it is written like this: <!ELEMENT Title (MainTitle, SubTitle?, ParallelTitle*)> <!ELEMENT MainTitle (#PCDATA)> <!ELEMENT SubTitle (#PCDATA)> <!ELEMENT ParallelTitle (#PCDATA)>

  15. The element PageRepresentation makes it possible to link a concrete page with the image or full text that represents it. <!ELEMENT MonographPage (PageNumber+, Notes?, PageRepresentation+)> <!ATTLIST MonographPage Type (Advertisement | BackCover | BackEndSheet | Blank | FlyLeaf | FrontCover | FrontEndSheet | Index | ListOfIllustrations | ListOfMaps | ListOfTables | NormalPage | Spine | Table | TableOfContents | TitlePage) "NormalPage"> <!ELEMENT PageNumber (#PCDATA)> <!ELEMENT PageRepresentation ((PageImage | PageText), TechnicalDescription?)> <!ELEMENT PageImage EMPTY> <!ATTLIST PageImage href CDATA #REQUIRED> <!ELEMENT PageText EMPTY> <!ATTLIST PageText href CDATA #REQUIRED> Note: we can also set up a list of attributes; here these are the Type of the MonographPage and href, i.e. the reference to an external data entity.

  16. <!ELEMENT MonographPage (PageNumber+, Notes?, PageRepresentation+)> <!ATTLIST MonographPage Type (Advertisement | BackCover | BackEndSheet | Blank | FlyLeaf | FrontCover | FrontEndSheet | Index | ListOfIllustrations | ListOfMaps | ListOfTables | NormalPage | Spine | Table | TableOfContents | TitlePage) "NormalPage"> The above part of a DTD means this: The element MonographPage consists of the elements PageNumber, Notes and PageRepresentation, in this order. We classify the MonographPage according to its content into Types such as Advertisement, BackCover, …, TableOfContents, and TitlePage. We have set up the default value as NormalPage, because we expect this will be the most frequent choice. The meaning of the qualifying signs is as follows: • Element (no sign) = the element is mandatory and occurs exactly once • Element+ (the sign +) = the element is mandatory and occurs at least once • Element? (the sign ?) = the element is not mandatory and can occur at most once • Element* (the sign *) = the element is not mandatory and can occur any number of times (zero or more)
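
The qualifying signs ?, + and * have the same meaning as in regular expressions, so a flat content model can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `model_to_regex` and `children_match` are assumptions, and the code handles comma-separated element names only, not choices (|) or nested groups. It is not a replacement for a validating parser.

```python
import re


def model_to_regex(model: str) -> re.Pattern:
    """Turn a flat DTD sequence such as 'A+, B?, C*' into a regex.

    Each element name is followed by one space so that names cannot
    run into each other; the occurrence sign is applied to that unit.
    """
    parts = []
    for item in model.split(","):
        item = item.strip()
        indicator = item[-1] if item[-1] in "?+*" else ""
        name = item.rstrip("?+*")
        parts.append(f"(?:{re.escape(name)} ){indicator}")
    return re.compile("".join(parts))


def children_match(model: str, children: list) -> bool:
    """Return True if the list of child names satisfies the model."""
    sequence = "".join(f"{name} " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None


PAGE_MODEL = "PageNumber+, Notes?, PageRepresentation+"
TITLE_MODEL = "MainTitle, SubTitle?, ParallelTitle*"

# Notes is optional, so it may be left out.
print(children_match(PAGE_MODEL, ["PageNumber", "PageRepresentation"]))
# PageNumber is mandatory, so this sequence is rejected.
print(children_match(PAGE_MODEL, ["Notes", "PageRepresentation"]))
# ParallelTitle* may be absent or repeated.
print(children_match(TITLE_MODEL, ["MainTitle"]))
print(children_match(TITLE_MODEL, ["MainTitle", "ParallelTitle", "ParallelTitle"]))
```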

  17. <!ELEMENT PageNumber (#PCDATA)> <!ELEMENT PageRepresentation ((PageImage | PageText), TechnicalDescription?)> <!ELEMENT PageImage EMPTY> <!ATTLIST PageImage href CDATA #REQUIRED> <!ELEMENT PageText EMPTY> <!ATTLIST PageText href CDATA #REQUIRED> Each element that does not consist of any further elements must be defined, too. The expression (#PCDATA), i.e. parsed character data, announces that in the XML files written on the basis of this DTD a text string analyzed by the parser is expected, here, for example, a page number like this: <PageNumber>221</PageNumber> The sign | in (PageImage | PageText) indicates that only one of the two elements is applied for the concrete PageRepresentation. The philosophy of this DTD is that when a page is represented both by an image and by text, each of them is attached to its own PageRepresentation. The ATTLIST (list of attributes) sets up href as a required attribute whose value is a plain character string (CDATA); it serves as a reference/navigation link to external data that the parser does not analyze. The elements PageImage and PageText are empty as they serve only to link the page to the image or full text files. <PageRepresentation> <PageImage href="http://digit.nkp.cz/Data/Image7.jpg"/> </PageRepresentation>

  18. <MonographPage Type="FlyLeaf"> <PageNumber>2</PageNumber> <Notes>List of publications of U. Eco at Bompiani</Notes> <PageRepresentation> <PageImage href="Data/Image4.gif"/> </PageRepresentation> </MonographPage> • This is a concrete section from an XML file, where we can see that the reference is made to an image in GIF format located in the Data subdirectory. We can also see that it is page no. 2, of the Type FlyLeaf. • For better understanding, we will now make a simple project whose aim is to write a DTD for a document we may need in a project of digitization of old postcards. • The steps are: analysis of the document, establishment of the needed elements and their relationships, setup of the element linking to digitized images, writing the DTD, writing an XML file based on the DTD, and its display. • The aim is to show how it is done, not to teach everything, as that requires a more thorough XML training course.
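
The MonographPage section above can be read by a program as follows. This is a minimal sketch with Python's standard `xml.etree.ElementTree`; the variable names are illustrative. One assumption is worth stressing: this parser does not read the DTD, so the default value "NormalPage" is supplied by hand in the code rather than taken from the ATTLIST declaration.

```python
import xml.etree.ElementTree as ET

fragment = """<MonographPage Type="FlyLeaf">
  <PageNumber>2</PageNumber>
  <Notes>List of publications of U. Eco at Bompiani</Notes>
  <PageRepresentation>
    <PageImage href="Data/Image4.gif"/>
  </PageRepresentation>
</MonographPage>"""

page = ET.fromstring(fragment)

# Attribute of the page element; the fallback mimics the DTD default.
page_type = page.get("Type", "NormalPage")

# Text content of a child element.
page_number = page.findtext("PageNumber")

# Every PageImage below this page, wherever it is nested.
image_links = [image.get("href") for image in page.iter("PageImage")]

print(page_type, page_number, image_links)
```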

  19. How to write a simple DTD? • Analyze carefully the object you wish to describe and represent • Establish the elements necessary for the description and their basic properties (mandatory yes/no, repeatable yes/no) • Decide whether these elements will consist of other elements • Establish from which elements the image files will be referenced

  20. Digitized postcard • Root element: PostcardDescription • Elements of the 2nd level: • author (consists of surname and name elements) • title • theme • publisher (consists of PlaceOfPublication, NameOfPublisher, DateOfPublication) • PhysicalDescription (consists of Size and Technique elements) • TypeOfDocument • VisualRepresentation (consists of ImageOfRectoPart and ImageOfVersoPart elements) • language • annotation The necessary elements and hierarchies for a DTD of a Digitized Postcard

  21. They can be represented by this graph

  22. Postcard.dtd: <?xml version="1.0" encoding="UTF-8"?> <!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com) by Adolf Knoll (National Library) --> <!ELEMENT PostcardDescription (author*, title, theme+, publisher+, PhysicalDescription, TypeOfDocument, VisualRepresentation?, language, annotation)> <!ELEMENT author (surname, name*)> <!--If the author has a name that cannot be split into parts, this name is always written in the field marked as surname.--> <!ELEMENT surname (#PCDATA)> <!ELEMENT name (#PCDATA)> <!--The title must be always entered; if missing, an artificial title will be created.--> <!ELEMENT title (#PCDATA)> <!ELEMENT theme (#PCDATA)> <!ELEMENT publisher (PlaceOfPublication?, NameOfPublisher?, DateOfPublication)> <!ELEMENT PlaceOfPublication (#PCDATA)> <!ELEMENT NameOfPublisher (#PCDATA)> <!ELEMENT DateOfPublication (#PCDATA)> <!ELEMENT PhysicalDescription (Size, Technique)> <!ELEMENT Size (#PCDATA)> <!ELEMENT Technique (#PCDATA)> <!ELEMENT TypeOfDocument (#PCDATA)> <!--Here will be links to computer graphic files representing the postcard.--> <!ELEMENT VisualRepresentation (ImageOfRectoPart*, ImageOfVersoPart*)> <!ELEMENT ImageOfRectoPart EMPTY> <!ATTLIST ImageOfRectoPart quality (preview | normal | excellent) #REQUIRED href CDATA #REQUIRED> <!ELEMENT ImageOfVersoPart EMPTY> <!ATTLIST ImageOfVersoPart quality (preview | normal | excellent) #REQUIRED href CDATA #REQUIRED> <!ELEMENT language (#PCDATA)> <!ELEMENT annotation (#PCDATA)>
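
To get an overview of which elements a DTD declares, the element declarations can be pulled out with a regular expression. The following is a rough sketch, not a DTD parser: it assumes one well-formed `<!ELEMENT …>` declaration at a time and would be confused by declarations quoted inside comments. The helper name `element_declarations` and the shortened DTD excerpt are illustrative assumptions.

```python
import re

# A few declarations taken from Postcard.dtd.
dtd_text = """
<!ELEMENT author (surname, name*)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT VisualRepresentation (ImageOfRectoPart*, ImageOfVersoPart*)>
<!ELEMENT ImageOfRectoPart EMPTY>
"""

# Element name, then everything up to the closing bracket as the model.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)\s*>", re.DOTALL)


def element_declarations(text: str) -> dict:
    """Map each declared element name to its content model string."""
    return {name: model for name, model in ELEMENT_DECL.findall(text)}


declarations = element_declarations(dtd_text)

# Elements with no child elements: text-only or empty.
leaves = sorted(
    name for name, model in declarations.items()
    if model in ("(#PCDATA)", "EMPTY")
)

print(declarations["author"])
print(leaves)
```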

  23. Postcard.xml: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE PostcardDescription SYSTEM "Postcard.dtd"> <?xml-stylesheet type="text/xsl" href="Postcard.xslt"?> <PostcardDescription> <author> <surname>Lyer</surname> <name>Antonín</name> </author> <title>Hronov</title> <theme>views of streets</theme> <theme>Nádražní ulice</theme> <theme>Dvorská ulice</theme> <theme>Jiráskova ulice</theme> <theme>Náměstí</theme> <publisher> <PlaceOfPublication>Hronov</PlaceOfPublication> <NameOfPublisher>Karel Šefelín</NameOfPublisher> <DateOfPublication>[1910]</DateOfPublication> </publisher> <PhysicalDescription> <Size>9x13 cm</Size> <Technique>colour printing</Technique> </PhysicalDescription> <TypeOfDocument>postcard</TypeOfDocument> <VisualRepresentation> <ImageOfRectoPart quality="normal" href="vzorky/pohled-b.jpg"/> <ImageOfRectoPart quality="excellent" href="vzorky/pohled-b.png"/> <ImageOfVersoPart quality="excellent" href="vzorky/pohled-b-2.png"/> </VisualRepresentation> <language>cz</language> <annotation>The postcard was sent by my great grand-mother to her husband, who was in military service in first years of the World War I.</annotation> </PostcardDescription> (Slide callouts: the xml-stylesheet line is the reference to a formatting stylesheet; the href attributes of ImageOfRectoPart and ImageOfVersoPart are the references to image files.)

  24. How does it work in a web browser? • When we click on the xml file: • The browser will look for the formatting file (stylesheet – the *.xslt file) and will call it • It will display the file following the prescribed rules • We can click on the links leading to images that represent the postcard visually and we will be navigated to them • So, let’s try it and click on the file Postcard.xml
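
The Postcard.xslt stylesheet itself is not reproduced in this transcript. As a rough stand-in for the kind of result a stylesheet produces, the sketch below turns the image references of the postcard into clickable HTML links with plain Python. It is not XSLT and not the author's stylesheet; the function name `render_links` and the HTML layout are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Excerpt of Postcard.xml containing the image references.
postcard = """<PostcardDescription>
  <title>Hronov</title>
  <VisualRepresentation>
    <ImageOfRectoPart quality="normal" href="vzorky/pohled-b.jpg"/>
    <ImageOfRectoPart quality="excellent" href="vzorky/pohled-b.png"/>
    <ImageOfVersoPart quality="excellent" href="vzorky/pohled-b-2.png"/>
  </VisualRepresentation>
</PostcardDescription>"""


def render_links(xml_text: str) -> str:
    """Build an HTML list with one link per element that has an href."""
    root = ET.fromstring(xml_text)
    items = []
    for element in root.iter():
        href = element.get("href")
        if href is None:
            continue  # not a linking element
        label = f"{element.tag} ({element.get('quality')})"
        items.append(
            f'<li><a href="{escape(href, quote=True)}">{escape(label)}</a></li>'
        )
    return "<ul>" + "".join(items) + "</ul>"


html_output = render_links(postcard)
print(html_output)
```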

  25. XML Conclusions • The language makes it possible to define and control any type of description • It can relate descriptions to external data • It makes the structure of the digitized documents clear and readable for the long term • It makes it possible to check that the output of our work (XML files and digitized documents) corresponds to what we defined we wished to produce • This means that, for example, our Digital Library can be fed with correct and standardized documents, which also supports, among other things, their long-term digital preservation

  26. Work with XML • From the user perspective, a good digitization project develops XML editors that: • make the work easy (filling in forms) • check validity against the applied DTD • output only correct XML structures • If you wish to test your skills, download the free M-TOOL from the Manuscriptorium Digital Library free tools at http://manuscriptorium.com/Site/ENG/mtool_eng.asp and try to work with it
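
The most basic check such an editor performs is whether the output is well-formed XML at all. A minimal sketch with Python's standard library follows; the function name `is_well_formed` is an assumption. Note that this tests well-formedness only: the standard-library parser does not validate against a DTD, so checking validity would need a validating parser in addition.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Matching start and end tags: accepted.
print(is_well_formed("<title>Hronov</title>"))
# Misspelled end tag: rejected.
print(is_well_formed("<title>Hronov</titel>"))
# Overlapping elements: rejected.
print(is_well_formed("<a><b></a></b>"))
```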

  27. Where to find more? General • http://www.w3.org/XML/ (XML Home) • http://www.xml.com/pub/a/98/10/guide0.html (Technical Introduction to XML) • http://www.altova.com/ (XMLSpy editor) Applied • http://digit.nkp.cz/techstandards.html (several DTDs implemented in functioning digital libraries) • http://www.loc.gov/standards/mets/ (METS format for containerization of XML-based digital documents) • http://www.tei-c.org/ (TEI – Text Encoding Initiative)
